Анализ данных с помощью pandas. Часть 2: Выбор данных и нахождение наиболее частых жалоб
В этой части мы будем использовать новый набор данных, чтобы показать, как быть с большими объёмами данных. Это данные о 311 сервисных запросов (или жалоб) жителей, предоставленные NYC Open Data (скачать данные).
In [1]:%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)
complaints = pd.read_csv('data/311-service-requests.csv')
/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
В зависимости от вашей версии pandas, вы можете увидеть предупреждение "DtypeWarning: Columns (8) have mixed types". Это означает, что pandas столкнулся с проблемой чтения в наших данных. В нашем случае это почти наверняка означает, что данные имеют столбцы, где некоторые записи являются строками, а некоторые представляют собой целые числа.
На данный момент мы будем игнорировать это, надеясь на то, что пронесёт, но в долгосрочной перспективе мы должны исследовать это предупреждение.
А что это такое? (сводка)
При печати большого dataframe, будут показаны только первые несколько строк.
Для получения первых 5 строк dataframe, мы можем использовать срез: df[:5]
.
complaints[:5]
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |
5 rows × 52 columns
Выбор строк и столбцов
Чтобы выбрать столбец, мы берём имя столбца в качестве индекса (как в словаре), например так:
In [4]:complaints['Complaint Type'][:5]
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
неважно, в каком порядке мы это сделаем (взять первые 5 строк у столбца или столбец у первых 5 строк)
In [5]:complaints[:5]['Complaint Type']
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
Выбор нескольких столбцов
Что если мы хотим знать только тип жалобы и район, а остальное нам неинтересно? Pandas позволяет легко выбрать подмножество столбцов: просто проиндексируйте списком столбцов.
In [6]:complaints[['Complaint Type', 'Borough']]
Complaint Type | Borough | |
---|---|---|
0 | Noise - Street/Sidewalk | QUEENS |
1 | Illegal Parking | QUEENS |
2 | Noise - Commercial | MANHATTAN |
3 | Noise - Vehicle | MANHATTAN |
4 | Rodent | MANHATTAN |
5 | Noise - Commercial | QUEENS |
6 | Blocked Driveway | QUEENS |
7 | Noise - Commercial | QUEENS |
8 | Noise - Commercial | MANHATTAN |
9 | Noise - Commercial | BROOKLYN |
10 | Noise - House of Worship | BROOKLYN |
11 | Noise - Commercial | MANHATTAN |
12 | Illegal Parking | MANHATTAN |
13 | Noise - Vehicle | BRONX |
14 | Rodent | BROOKLYN |
15 | Noise - House of Worship | MANHATTAN |
16 | Noise - Street/Sidewalk | STATEN ISLAND |
17 | Illegal Parking | BROOKLYN |
18 | Street Light Condition | BROOKLYN |
19 | Noise - Commercial | MANHATTAN |
20 | Noise - House of Worship | BROOKLYN |
21 | Noise - Commercial | MANHATTAN |
22 | Noise - Vehicle | QUEENS |
23 | Noise - Commercial | BROOKLYN |
24 | Blocked Driveway | STATEN ISLAND |
25 | Noise - Street/Sidewalk | STATEN ISLAND |
26 | Street Light Condition | BROOKLYN |
27 | Harboring Bees/Wasps | MANHATTAN |
28 | Noise - Street/Sidewalk | MANHATTAN |
29 | Street Light Condition | STATEN ISLAND |
... | ... | ... |
111039 | Noise - Commercial | MANHATTAN |
111040 | Noise - Commercial | MANHATTAN |
111041 | Noise | BROOKLYN |
111042 | Noise - Street/Sidewalk | MANHATTAN |
111043 | Noise - Commercial | BROOKLYN |
111044 | Noise - Street/Sidewalk | MANHATTAN |
111045 | Water System | MANHATTAN |
111046 | Noise | BROOKLYN |
111047 | Illegal Parking | QUEENS |
111048 | Noise - Street/Sidewalk | MANHATTAN |
111049 | Noise - Commercial | BROOKLYN |
111050 | Noise | MANHATTAN |
111051 | Noise - Commercial | MANHATTAN |
111052 | Water System | MANHATTAN |
111053 | Derelict Vehicles | QUEENS |
111054 | Noise - Street/Sidewalk | BROOKLYN |
111055 | Noise - Commercial | BROOKLYN |
111056 | Street Sign - Missing | QUEENS |
111057 | Noise | MANHATTAN |
111058 | Noise - Commercial | BROOKLYN |
111059 | Noise - Street/Sidewalk | MANHATTAN |
111060 | Noise | MANHATTAN |
111061 | Noise - Commercial | QUEENS |
111062 | Water System | MANHATTAN |
111063 | Water System | MANHATTAN |
111064 | Maintenance or Facility | BROOKLYN |
111065 | Illegal Parking | QUEENS |
111066 | Noise - Street/Sidewalk | MANHATTAN |
111067 | Noise - Commercial | BROOKLYN |
111068 | Blocked Driveway | BROOKLYN |
111069 rows × 2 columns
Посмотрим первые 10 строк:
In [7]:complaints[['Complaint Type', 'Borough']][:10]
Complaint Type | Borough | |
---|---|---|
0 | Noise - Street/Sidewalk | QUEENS |
1 | Illegal Parking | QUEENS |
2 | Noise - Commercial | MANHATTAN |
3 | Noise - Vehicle | MANHATTAN |
4 | Rodent | MANHATTAN |
5 | Noise - Commercial | QUEENS |
6 | Blocked Driveway | QUEENS |
7 | Noise - Commercial | QUEENS |
8 | Noise - Commercial | MANHATTAN |
9 | Noise - Commercial | BROOKLYN |
Какой самый частый тип жалобы?
Это очень простой вопрос для pandas! Используем метод .value_counts()
:
complaints['Complaint Type'].value_counts()
HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 PAINT - PLASTER 5149 Blocked Driveway 4590 NONCONST 3998 Street Condition 3473 Illegal Parking 3343 Noise 3321 Traffic Signal Condition 3145 Dirty Conditions 2653 Water System 2636 Noise - Commercial 2578 ELECTRIC 2350 Broken Muni Meter 2070 Noise - Street/Sidewalk 1928 Sanitation Condition 1824 Rodent 1632 Sewer 1627 Taxi Complaint 1227 Consumer Complaint 1227 Damaged Tree 1180 Overgrown Tree/Branches 1083 Missed Collection (All Materials) 973 Graffiti 973 Building/Use 942 Root/Sewer/Sidewalk Condition 836 Derelict Vehicle 803 ... Fire Alarm - Modification 5 Posting Advertisement 5 Internal Code 5 Poison Ivy 5 Miscellaneous Categories 5 Ferry Complaint 4 Transportation Provider Complaint 4 Special Natural Area District (SNAD) 4 Illegal Animal Sold 4 Illegal Fireworks 3 Invitation 3 Fire Alarm - Replacement 3 Adopt-A-Basket 3 Misc. Comments 2 Legal Services Provider Complaint 2 DFTA Literature Request 2 Opinion for the Mayor 2 Public Assembly 2 Window Guard 2 Tunnel Condition 1 Snow 1 X-Ray Machine/Equipment 1 Stalled Sites 1 Municipal Parking Facility 1 Open Flame Permit 1 Trans Fat 1 Ferry Permit 1 DWD 1 Highway Sign - Damaged 1 DHS Income Savings Requirement 1 Name: Complaint Type, dtype: int64
10 наиболее частых типов:
In [9]:complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 PAINT - PLASTER 5149 Blocked Driveway 4590 NONCONST 3998 Street Condition 3473 Illegal Parking 3343 Name: Complaint Type, dtype: int64
Теперь мы можем построить график!
In [10]:complaint_counts[:10].plot(kind='bar')