EDA Company Datasets
Exploratory Data Analysis on Company Datasets
What is Exploratory Data Analysis?
Exploratory data analysis (EDA for short) is what data analysts do with large sets of data: they look for patterns and summarize the dataset's main characteristics, beyond what they learn from modeling and hypothesis testing. EDA is a philosophy that lets data analysts approach a dataset without assumptions. When a data analyst employs EDA, it's like they're asking the data to tell them what they don't know.
It is an approach to data analysis that uses a variety of techniques to:
- Maximize insights into a dataset.
- Uncover underlying structures.
- Extract important variables.
- Detect outliers and anomalies.
- Test underlying assumptions.
- Determine optimal factor settings.
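As a small illustration of two of these goals (summarizing a variable and detecting outliers), here is a minimal sketch using only Python's standard library. The `employee_counts` values are made up for illustration and are not taken from the dataset, and the 1.5 × IQR rule is just one common convention for flagging outliers, not necessarily the method used later in this notebook.

```python
import statistics

# Hypothetical employee counts for a handful of companies (illustrative only,
# not taken from the Kaggle dataset).
employee_counts = [12, 15, 18, 20, 22, 25, 30, 5000]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# A quick summary of the variable's main characteristics.
summary = {
    "count": len(employee_counts),
    "mean": statistics.mean(employee_counts),
    "median": statistics.median(employee_counts),
    "min": min(employee_counts),
    "max": max(employee_counts),
}
outliers = iqr_outliers(employee_counts)

print(summary)
print("Outliers:", outliers)
```

Comparing the mean with the median in the summary is a quick way to see how strongly a single extreme value can skew a variable.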
Outline of Project
- Select and download real-world dataset
- Install and import all the libraries
- Perform data preparation & cleaning
- Ask & answer questions about the data
- Perform exploratory analysis & visualization
- Summarize your inferences & write a conclusion
Select and download real-world dataset
This dataset is available on Kaggle. It contains information about 7 million companies around the world, including the year each company was established, its employee information, and the countries and cities where the companies are located. We will analyze this dataset and draw some conclusions.
Downloading the Dataset
Let's download the data into the Jupyter notebook. We'll use the opendatasets library from Jovian. Let's install and import it, and use the download method.
Use the "Run" button to execute the code.
!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="7-million-company-datasets")
[jovian] Detected Colab notebook...
[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian, then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian. Also, you can also delete this cell, it's no longer necessary.
# Install the opendatasets library to download the data from Kaggle using the dataset's URL
!pip install opendatasets --upgrade --quiet
import opendatasets as od

# Kaggle dataset URL
datasets_url = 'https://www.kaggle.com/peopledatalabssf/free-7-million-company-dataset'

# Download the dataset
od.download(datasets_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: pankajthakur3999
Your Kaggle Key: ··········
Downloading free-7-million-company-dataset.zip to ./free-7-million-company-dataset
100%|██████████| 278M/278M [00:04<00:00, 59.4MB/s]
# Path to the downloaded CSV file
datasets_url_to_csv = '/content/free-7-million-company-dataset/companies_sorted.csv'
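Before loading a file of this size, it can help to confirm that the download landed at the expected path and to peek at the first few rows. The sketch below uses only the standard library. `preview_csv` is a hypothetical helper introduced here for illustration, not part of opendatasets or the original notebook, and it assumes the file is UTF-8 encoded with a header row.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) of a CSV file.

    Hypothetical helper for a quick sanity check; assumes UTF-8 text
    and that the first row is a header.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No CSV found at {path}; check the download step")
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])           # first row holds the column names
        rows = list(islice(reader, n_rows))  # read only the first few data rows
    return header, rows


# Example usage in the notebook, with the path defined above:
# header, rows = preview_csv(datasets_url_to_csv)
# print(header)
# for row in rows:
#     print(row)
```

Reading only the first few rows keeps the check fast even though the full file is several hundred megabytes.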