Анализ данных с помощью pandas. Часть 2: Выбор данных и нахождение наиболее частых жалоб
В этой части мы будем использовать новый набор данных, чтобы показать, как быть с большими объёмами данных. Это данные о 311 сервисных запросов (или жалоб) жителей, предоставленные NYC Open Data (скачать данные).
In [1]:%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)complaints = pd.read_csv('data/311-service-requests.csv')/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
В зависимости от вашей версии pandas, вы можете увидеть предупреждение "DtypeWarning: Columns (8) have mixed types". Это означает, что pandas столкнулся с проблемой чтения в наших данных. В нашем случае это почти наверняка означает, что данные имеют столбцы, где некоторые записи являются строками, а некоторые представляют собой целые числа.
На данный момент мы будем игнорировать это, надеясь на то, что пронесёт, но в долгосрочной перспективе мы должны исследовать это предупреждение.
А что это такое? (сводка)
При печати большого dataframe, будут показаны только первые несколько строк.
Для получения первых 5 строк dataframe, мы можем использовать срез: df[:5].
complaints[:5]| Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
| 1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
| 2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
| 3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
| 4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |
5 rows × 52 columns
Выбор строк и столбцов
Чтобы выбрать столбец, мы берём имя столбца в качестве индекса (как в словаре), например так:
In [4]:complaints['Complaint Type'][:5]0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
неважно, в каком порядке мы это сделаем (взять первые 5 строк у столбца или столбец у первых 5 строк)
In [5]:complaints[:5]['Complaint Type']0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
Выбор нескольких столбцов
Что если мы хотим знать только тип жалобы и район, а остальное нам неинтересно? Pandas позволяет легко выбрать подмножество столбцов: просто проиндексируйте списком столбцов.
In [6]:complaints[['Complaint Type', 'Borough']]| Complaint Type | Borough | |
|---|---|---|
| 0 | Noise - Street/Sidewalk | QUEENS |
| 1 | Illegal Parking | QUEENS |
| 2 | Noise - Commercial | MANHATTAN |
| 3 | Noise - Vehicle | MANHATTAN |
| 4 | Rodent | MANHATTAN |
| 5 | Noise - Commercial | QUEENS |
| 6 | Blocked Driveway | QUEENS |
| 7 | Noise - Commercial | QUEENS |
| 8 | Noise - Commercial | MANHATTAN |
| 9 | Noise - Commercial | BROOKLYN |
| 10 | Noise - House of Worship | BROOKLYN |
| 11 | Noise - Commercial | MANHATTAN |
| 12 | Illegal Parking | MANHATTAN |
| 13 | Noise - Vehicle | BRONX |
| 14 | Rodent | BROOKLYN |
| 15 | Noise - House of Worship | MANHATTAN |
| 16 | Noise - Street/Sidewalk | STATEN ISLAND |
| 17 | Illegal Parking | BROOKLYN |
| 18 | Street Light Condition | BROOKLYN |
| 19 | Noise - Commercial | MANHATTAN |
| 20 | Noise - House of Worship | BROOKLYN |
| 21 | Noise - Commercial | MANHATTAN |
| 22 | Noise - Vehicle | QUEENS |
| 23 | Noise - Commercial | BROOKLYN |
| 24 | Blocked Driveway | STATEN ISLAND |
| 25 | Noise - Street/Sidewalk | STATEN ISLAND |
| 26 | Street Light Condition | BROOKLYN |
| 27 | Harboring Bees/Wasps | MANHATTAN |
| 28 | Noise - Street/Sidewalk | MANHATTAN |
| 29 | Street Light Condition | STATEN ISLAND |
| ... | ... | ... |
| 111039 | Noise - Commercial | MANHATTAN |
| 111040 | Noise - Commercial | MANHATTAN |
| 111041 | Noise | BROOKLYN |
| 111042 | Noise - Street/Sidewalk | MANHATTAN |
| 111043 | Noise - Commercial | BROOKLYN |
| 111044 | Noise - Street/Sidewalk | MANHATTAN |
| 111045 | Water System | MANHATTAN |
| 111046 | Noise | BROOKLYN |
| 111047 | Illegal Parking | QUEENS |
| 111048 | Noise - Street/Sidewalk | MANHATTAN |
| 111049 | Noise - Commercial | BROOKLYN |
| 111050 | Noise | MANHATTAN |
| 111051 | Noise - Commercial | MANHATTAN |
| 111052 | Water System | MANHATTAN |
| 111053 | Derelict Vehicles | QUEENS |
| 111054 | Noise - Street/Sidewalk | BROOKLYN |
| 111055 | Noise - Commercial | BROOKLYN |
| 111056 | Street Sign - Missing | QUEENS |
| 111057 | Noise | MANHATTAN |
| 111058 | Noise - Commercial | BROOKLYN |
| 111059 | Noise - Street/Sidewalk | MANHATTAN |
| 111060 | Noise | MANHATTAN |
| 111061 | Noise - Commercial | QUEENS |
| 111062 | Water System | MANHATTAN |
| 111063 | Water System | MANHATTAN |
| 111064 | Maintenance or Facility | BROOKLYN |
| 111065 | Illegal Parking | QUEENS |
| 111066 | Noise - Street/Sidewalk | MANHATTAN |
| 111067 | Noise - Commercial | BROOKLYN |
| 111068 | Blocked Driveway | BROOKLYN |
111069 rows × 2 columns
Посмотрим первые 10 строк:
In [7]:complaints[['Complaint Type', 'Borough']][:10]| Complaint Type | Borough | |
|---|---|---|
| 0 | Noise - Street/Sidewalk | QUEENS |
| 1 | Illegal Parking | QUEENS |
| 2 | Noise - Commercial | MANHATTAN |
| 3 | Noise - Vehicle | MANHATTAN |
| 4 | Rodent | MANHATTAN |
| 5 | Noise - Commercial | QUEENS |
| 6 | Blocked Driveway | QUEENS |
| 7 | Noise - Commercial | QUEENS |
| 8 | Noise - Commercial | MANHATTAN |
| 9 | Noise - Commercial | BROOKLYN |
Какой самый частый тип жалобы?
Это очень простой вопрос для pandas! Используем метод .value_counts():
complaints['Complaint Type'].value_counts()HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
Noise 3321
Traffic Signal Condition 3145
Dirty Conditions 2653
Water System 2636
Noise - Commercial 2578
ELECTRIC 2350
Broken Muni Meter 2070
Noise - Street/Sidewalk 1928
Sanitation Condition 1824
Rodent 1632
Sewer 1627
Taxi Complaint 1227
Consumer Complaint 1227
Damaged Tree 1180
Overgrown Tree/Branches 1083
Missed Collection (All Materials) 973
Graffiti 973
Building/Use 942
Root/Sewer/Sidewalk Condition 836
Derelict Vehicle 803
...
Fire Alarm - Modification 5
Posting Advertisement 5
Internal Code 5
Poison Ivy 5
Miscellaneous Categories 5
Ferry Complaint 4
Transportation Provider Complaint 4
Special Natural Area District (SNAD) 4
Illegal Animal Sold 4
Illegal Fireworks 3
Invitation 3
Fire Alarm - Replacement 3
Adopt-A-Basket 3
Misc. Comments 2
Legal Services Provider Complaint 2
DFTA Literature Request 2
Opinion for the Mayor 2
Public Assembly 2
Window Guard 2
Tunnel Condition 1
Snow 1
X-Ray Machine/Equipment 1
Stalled Sites 1
Municipal Parking Facility 1
Open Flame Permit 1
Trans Fat 1
Ferry Permit 1
DWD 1
Highway Sign - Damaged 1
DHS Income Savings Requirement 1
Name: Complaint Type, dtype: int6410 наиболее частых типов:
In [9]:complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 PAINT - PLASTER 5149 Blocked Driveway 4590 NONCONST 3998 Street Condition 3473 Illegal Parking 3343 Name: Complaint Type, dtype: int64
Теперь мы можем построить график!
In [10]:complaint_counts[:10].plot(kind='bar')

LETPY.COM