Nyc Parking 2017 Final
Exploratory Data Analysis on Traffic Violations Issued in NYC
EDA Project for Jovian course - "Zero to Data Scientist"
By: Samantha Roberts
General Introduction to Open Government Data
I am a fan of government having an "Open Data Protocol." This has become important at the city, state, and country levels in the US as well as around the world.
In December of 2007, 30 open government advocates from academia, industry, and government convened in Sebastopol, California, to discuss why "open data" was essential to democracy. The participants at this meeting developed a set of 8 Open Government Data Principles. Namely, government data shall be considered open if it is:
- Complete (All data is made available without limitation)
- Primary (collected at the source with high granularity)
- Timely (Made available as quickly as possible)
- Accessible (Available to a wide range of users)
- Machine Processable (Reasonably structured for automation)
- Non-discriminatory (Available to all, with no requirement for registration)
- Non-proprietary (Data is in a format over which no entity has exclusive control)
- License-free (Data not subject to copyright, trademark or regulation)
In 2012, under Mayor Bloomberg, the NYC Open Data repository was launched. This is an amazing compilation of data sets that is continuously updated and supported. It supplies public data about all facets of the city, including police activity, budgets, business, city government, education, environment, and health information (recently including extensive COVID data).
NYC Parking Violation Data
One of the data sets in the NYC Open Data repository comes from the Department of Finance and covers all of the parking violations issued each year. I found this data on Kaggle before I explored the NYC Open Data site.
I found this an interesting data set for analysis for the following reasons:
- It is large. Currently there are over 10 million parking tickets issued each year throughout the five boroughs of Brooklyn, Bronx, Queens, Staten Island, and Manhattan. [Kaggle](https://www.kaggle.com/new-york-city/ny-parking-violations-issued) has packaged data from 2014-2017 inclusive (actually, in two locations), but the same data is available directly through NYC Open Data. Direct links to the CSV files up to 2021 are provided at the end of this notebook.
- There are at least 43 features for each ticket, which include information about the vehicle, the registration, the type of violation, the location, the borough and precinct, the street, and the rank and division of the ticket issuer. This seemed especially rich for investigation.
- Because of the size, we needed to employ methods to reduce the data size and improve the processing time (such as Google Colab and Dask) to undertake this analysis in an efficient way.
- The data includes date and time information which lends itself to correlation with external events in the city, and gives experience working with time series.
- The geographic information lends itself to the creation of maps incorporating the various data (along with working with GeoJSON files of NYC police precincts).
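The size reduction mentioned in the list above can be illustrated with plain pandas. This is only a minimal sketch, not the Dask-based workflow the notebook refers to: it loads just the columns needed and stores a repetitive text column as a categorical. The tiny inline CSV and the column names other than `Violation Precinct` are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real multi-gigabyte CSV (hypothetical rows).
# Only "Violation Precinct" is named in the text; the other column
# names are assumptions for this sketch.
SAMPLE_CSV = """Summons Number,Issue Date,Violation Precinct,Registration State
1,06/14/2017,19,NY
2,06/15/2017,19,NJ
3,06/15/2017,14,NY
"""


def load_reduced(source, columns, category_columns):
    """Read only the requested columns, storing repetitive text as categories."""
    dtypes = {col: "category" for col in category_columns}
    return pd.read_csv(source, usecols=columns, dtype=dtypes)


reduced = load_reduced(
    io.StringIO(SAMPLE_CSV),
    columns=["Issue Date", "Violation Precinct", "Registration State"],
    category_columns=["Registration State"],
)

print(reduced.dtypes)
print(reduced.memory_usage(deep=True))
```

On the full file, skipping unused columns and using categorical dtypes for abbreviation-style columns tends to cut memory use considerably, though the actual savings depend on the data.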
This data is difficult for the following reasons:
- Though the columns are well labeled with descriptive names, several columns hold categorical information in the form of abbreviations that are not well explained. However, this lends itself to pulling in external files and scraping data to include in the analysis and further describe the data.
- It is also difficult because it is large. Not so much from a computational standpoint, but with respect to choosing and limiting the relevant data that I can analyze in the time available.
The Methodology
This particular dataset lends itself to a perfect EDA project as it has the following elements:
- The data needs to be cleaned, and decisions need to be made: many parking tickets are not written correctly, and therefore not all fields are filled out.
- Another challenge is that there is no table that explains the values in each column, and not all columns are filled out. For instance, three columns relate to the precinct the ticket was issued in (shown with their number of unique values):
Violation Location : 78
Violation Precinct : 79
Issuer Precinct : 111
- There is a date and time at which each ticket is issued, and working with time series is something that is fun to demonstrate.
- There is location information, which allows exploration in geopandas and folium.
- There are many questions that can be asked which would link to other data sets being imported.
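As a rough illustration of the cleaning checks described in the list above, the sketch below counts unique and missing values in the three precinct columns and parses the issue date for a simple time-based grouping. The miniature table is made up, and both the `Issue Date` column name and its `MM/DD/YYYY` format are assumptions here, not details confirmed by the text.

```python
import pandas as pd

# Hypothetical miniature of the ticket table; the real data has 43+ columns.
tickets = pd.DataFrame(
    {
        "Violation Location": [19.0, 19.0, None, 14.0],
        "Violation Precinct": [19, 19, 0, 14],
        "Issuer Precinct": [19, 401, 0, 14],
        "Issue Date": ["06/14/2017", "06/15/2017", "06/15/2017", "07/01/2017"],
    }
)

precinct_cols = ["Violation Location", "Violation Precinct", "Issuer Precinct"]

# Number of distinct values per column (missing values are not counted).
unique_counts = tickets[precinct_cols].nunique()

# Number of missing entries per column, to see which fields were left blank.
missing_counts = tickets[precinct_cols].isna().sum()

# Parse dates; the format is an assumption, and unparseable entries become NaT.
tickets["Issue Date"] = pd.to_datetime(
    tickets["Issue Date"], format="%m/%d/%Y", errors="coerce"
)

# Simple time-series style summary: tickets issued per calendar month.
tickets_per_month = tickets.groupby(tickets["Issue Date"].dt.month).size()

print(unique_counts)
print(missing_counts)
print(tickets_per_month)
```

Comparing the unique and missing counts across the three columns is one way to decide which precinct column is reliable enough to use for mapping.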
Must Have Items
- We will take a random sample of 10,000 tickets from the data of one year of parking tickets, namely 2017
- We will write the code to do the analysis for this sample and then try to apply it to the entire year of data
- We will incorporate other data sets, such as fine amounts
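One way to draw the random sample described above without loading the whole year into memory is reservoir sampling over the CSV rows. This is a sketch of that technique under stated assumptions, not necessarily how the sample in this notebook was taken (a pandas `DataFrame.sample` call on a loaded frame would also work). The function names and the demo data are hypothetical.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R), so the input is read once and
    only k items are held in memory at any time.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(row)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def sample_csv(handle, k, seed=None):
    """Sample k data rows (as dicts) from an open CSV file handle."""
    return reservoir_sample(csv.DictReader(handle), k, seed)


# Demo on a small in-memory CSV; the column names are illustrative only.
demo_csv = io.StringIO(
    "Summons Number,Violation Precinct\n"
    + "".join(f"{n},{n % 5}\n" for n in range(1000))
)
sample = sample_csv(demo_csv, k=10, seed=0)
print(len(sample))
```

For the real file, the same call could be made with `open(path, newline="")` and `k=10_000`; fixing `seed` keeps the sample reproducible between runs.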
Nice-to-Have Items for Later
- We will then see about applying this analysis to the other years of data that are available directly from the NYC Open Data site
- This may include scraping the data
- We will apply some compression to handle the large scale of the data (I have no clue how to do that right now)
- We will further improve the analysis, convert addresses to latitude and longitude, and see if we can incorporate folium maps with precinct boundaries
import jovian

# Save a snapshot of this notebook to the Jovian project
project_name = 'nyc-parking-2017-final'
jovian.commit(project=project_name)
[jovian] Detected Colab notebook...
[jovian] Please enter your API key ( from https://jovian.ai/ ):
API KEY: ··········
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/samantha-roberts/nyc-parking-2017-final