US Used Cars Exploratory Data Analysis
In this project, we will analyze the US Used Cars Dataset from Kaggle. The dataset has over 3 million rows and 66 columns; we'll use 19 of those columns for our exercise.
The main objective of this project is to get a better understanding of the used car market in the US by applying data analysis and visualization skills to a real-world dataset.
Here is an outline of the steps we'll follow:
- Downloading a dataset from an online source
- Preparing and cleaning the data
- Open-ended exploratory analysis and visualization
- Asking and answering interesting questions
- Summarizing inferences and conclusions
Before we dive into the exercise, let's look at the columns we are going to analyze.
- vin: Vehicle Identification Number is a unique encoded string for every vehicle.
- body_type: Body type of the vehicle, such as Convertible, Hatchback, or Sedan.
- city: City where the car is listed, e.g. Houston or San Antonio.
- daysonmarket: Days since the vehicle was first listed on the website.
- franchise_dealer: Whether the dealer is a franchise dealer.
- mileage: Mileage of the car when it was advertised.
- is_new: If True, the vehicle was launched less than 2 years ago.
- latitude: Latitude from the geolocation of the dealership.
- listed_date: The date the vehicle was listed on the website. This does not make daysonmarket obsolete: the price shown is the price daysonmarket days after the listed date.
- listing_color: Dominant color group from the exterior color.
- listing_id: Listing id from the website.
- longitude: Longitude from the geolocation of the dealership.
- make_name: Make of the car.
- maximum_seating: Seating capacity of the car.
- engine_type: The engine configuration, e.g. I4 or V6.
- price: Price of the vehicle.
- seller_rating: Rating of the seller who advertised the vehicle.
- year: Car manufacturing year.
- fuel_type: Primary type of fuel used by the vehicle.
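As a rough illustration of working with only these 19 columns, the sketch below reads a CSV with pandas and keeps just the selected columns. It is a minimal sketch under a few assumptions: the helper name `load_selected_columns` is hypothetical, the listing date column is assumed to be spelled `listed_date` in the file, and the sample row is made up purely so the snippet runs without the real dataset. It also shows how the listing date and daysonmarket combine to give the date the price refers to.

```python
import io

import pandas as pd

# The 19 columns described in the list above.
# Assumption: the listing date column is spelled "listed_date" in the CSV.
SELECTED_COLUMNS = [
    "vin", "body_type", "city", "daysonmarket", "franchise_dealer",
    "mileage", "is_new", "latitude", "listed_date", "listing_color",
    "listing_id", "longitude", "make_name", "maximum_seating",
    "engine_type", "price", "seller_rating", "year", "fuel_type",
]


def load_selected_columns(source, nrows=None):
    """Hypothetical helper: read only the selected columns from a CSV.

    `source` can be a file path or a file-like object. Passing `usecols`
    avoids loading the other columns of a very large file into memory.
    """
    return pd.read_csv(
        source,
        usecols=SELECTED_COLUMNS,
        parse_dates=["listed_date"],
        nrows=nrows,
    )


# A made-up single-row sample (not real data) with one extra column,
# written to an in-memory CSV so the example is self-contained.
sample_row = {
    "vin": "HYPOTHETICALVIN01", "body_type": "Sedan", "city": "Houston",
    "daysonmarket": 30, "franchise_dealer": True, "mileage": 42000.0,
    "is_new": False, "latitude": 29.76, "listed_date": "2020-08-01",
    "listing_color": "BLACK", "listing_id": 1, "longitude": -95.37,
    "make_name": "Toyota", "maximum_seating": "5 seats",
    "engine_type": "I4", "price": 15000.0, "seller_rating": 4.5,
    "year": 2016, "fuel_type": "Gasoline", "unused_column": "dropped",
}
buffer = io.StringIO()
pd.DataFrame([sample_row]).to_csv(buffer, index=False)
buffer.seek(0)

cars = load_selected_columns(buffer)

# The price refers to the date daysonmarket days after the listing date.
cars["price_date"] = cars["listed_date"] + pd.to_timedelta(
    cars["daysonmarket"], unit="D"
)
print(cars[["listed_date", "daysonmarket", "price_date"]])
```

With the real file, the same helper could be called with the path to the downloaded CSV, and `nrows` can limit how many rows are read while experimenting.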
What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
!pip install jovian --upgrade --quiet