Exploratory Data Analysis with Python

Today’s world is moving fast toward using AI and Machine Learning in every field. The most important key to this is DATA. Data is the key to everything. If, as Machine Learning Engineers, we are able to understand and restructure the data to fit our needs, we have already completed half the task.

Let us learn how to perform EDA (Exploratory Data Analysis) on data.

What we will learn in this tutorial:

  1. Collecting data for our application.
  2. Structuring the data to our needs.
  3. Visualizing the data.

Let’s get started. We will fetch some sample data — the IRIS Dataset, which is a very common dataset used when you want to get started with Machine Learning and Deep Learning.

  1. Collection of Data: The data for any application can be found on several websites like Kaggle, UCI, etc., or has to be built specifically for the application. For example, if we want to classify between a dog and a cat, we don’t need to build a dataset by collecting images of dogs and cats ourselves, as several such datasets are already available. Here, let’s inspect the Iris Dataset.

Let’s fetch the data:

    from sklearn.datasets import load_iris
    import pandas as pd

    data = load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)

load_iris() fetches the dataset that sklearn ships by default. pd.DataFrame converts the dataset into a pandas data frame, which is very commonly used to explore a dataset with row-column attributes.

The first 5 rows of the data can be viewed using:

    df.head()
[Image: Iris Dataset]

The number of rows and columns, and the names of the columns, can be checked with:

    print(df.shape)
    print(df.columns)

Output:

    (150, 4)
    Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
           'petal width (cm)'],
          dtype='object')

We can also download the dataset directly from the UCI Machine Learning Repository. The downloaded CSV file can be loaded into the data frame as:

    df = pd.read_csv("path to csv file")
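One note here: the raw UCI file (iris.data) ships without a header row, so the column names have to be supplied explicitly. A minimal sketch, assuming the usual five columns; the file path is a placeholder:

    import pandas as pd

    # iris.data from UCI has no header row, so name the columns ourselves
    columns = [
        "sepal length (cm)", "sepal width (cm)",
        "petal length (cm)", "petal width (cm)", "class",
    ]
    df = pd.read_csv("iris.data", header=None, names=columns)
    print(df.head())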

2. Structuring the Data: Very often a dataset will have several features that don’t directly affect the output. Using such features is wasteful, as it leads to unnecessary memory consumption and sometimes even errors.

We can check which columns are important, i.e., which affect the output column the most, by checking the correlation of the output column with the inputs. Let us try that out:

    df.corr()
[Image: Correlation Matrix]

Clearly, the correlation matrix above helps us understand how all the features are related to one another.

So if our output column were supposed to be sepal length (cm), my output y would be “sepal length (cm)” and my inputs X would be ‘petal length (cm)’ and ‘petal width (cm)’, as they have the highest correlation with y.

Note: If ‘sepal width (cm)’ had a correlation of -0.8, we would take it as an input too; even though the correlation is negative, it has a huge impact on the output y (it is inversely proportional).

Note: The value of a correlation in a correlation matrix can vary between -1 (inversely proportional) and +1 (directly proportional).
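As a minimal sketch of this selection rule (the target column and the 0.5 cutoff here are arbitrary choices for illustration):

    # keep the inputs whose absolute correlation with the target exceeds a cutoff
    target = "sepal length (cm)"
    corr = df.corr()[target].drop(target)   # correlation of every other column with y
    selected = corr[corr.abs() > 0.5].index.tolist()

    X = df[selected]   # e.g. petal length and petal width for this target
    y = df[target]
    print(selected)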

3. Visualize the Data: This is a very important step, as it can help in several ways:

  1. It helps you understand important points, such as how the data is distributed, i.e., whether the values cluster in a small range or spread more widely.
  2. It helps you understand decision boundaries.
  3. It lets you present the data to people in a form that is easier to grasp than a table.

There are several ways to plot and present the data, such as histograms, bar charts, pair plots, etc.

Let’s see how to plot a histogram for the Iris dataset.

    df.plot.hist(subplots=True, grid=True)
[Image: Histogram Plot]

By looking at the histogram, it’s easier for us to understand the range of values each feature takes.

Let’s simply plot the data now.

    df.plot(subplots=True)

Apart from these, there are several other graphs that can be plotted easily, depending on the application.
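For instance, a pair plot (mentioned earlier) can be drawn with pandas’ built-in scatter_matrix. A minimal sketch, reusing the df from above:

    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    # scatter plots for every pair of features,
    # with a histogram of each feature on the diagonal
    scatter_matrix(df, figsize=(8, 8), diagonal="hist")
    plt.show()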

Hence, we can conclude with one simple fact: a well-structured dataset is the first key to a good and efficient Machine Learning Model.

What’s the Difference Between a Data Analyst, Data Scientist, and Machine Learning Engineer?

Explore the distinction between these common job titles with the analogy of a track meet.


This article uses the metaphor of a track team to differentiate between the role of a data analyst, data scientist, and machine learning engineer. We’ll start with the idea that conducting a data science project is similar to running a relay race. Hopefully, this analogy will help you make more informed choices around your education, job applications, and project staffing.

🔵 Data Analyst

The data analyst is capable of taking data from the “starting line” (i.e., pulling data from storage), doing data cleaning and processing, and creating a final product like a dashboard or report. The data analyst may also be responsible for transforming data for use by a data scientist, a hand-off that we’ll explore in a moment.

[Image: The data analyst is capable of running half a lap]

You might say that the data analyst is very capable of running the first part of the race, but no further.

🔴 Data Scientist

The data scientist has all the skills of the data analyst, though they might be less well-versed in dashboarding and perhaps a bit rusty at report writing. The data scientist can run further than the data analyst, though, in terms of their ability to apply statistical methodologies to create complex data products.

[Image: The data scientist is capable of running the full lap…]

The data scientist is capable of racing the entire lap. That means they have the skills required to query data, explore features to assess predictive power, select an appropriate crop of models for training and testing, conduct hyperparameter tuning, and ultimately arrive at a statistics-powered model that provides business value through classification or prediction. However, if an organization loads its data scientist with all these responsibilities — from data ingest through data modeling — the data scientist won’t be able to run as well as if he or she were asked to run only the second part of the race, focused on the data modeling.

[Image: …the data scientist will run faster if only tasked with running the second half of the relay]

Overall, the team’s performance will improve if a data analyst conducts the querying and data cleaning steps, allowing the data scientist to focus on statistical modeling.

🔶 Machine Learning Engineer

The machine learning engineer could be thought of as the team’s secret weapon. You might conceptualize the MLE as the person designing track shoes that empower the other runners to race at top speeds.

[Image: The machine learning engineer is a versatile player, capable of developing advanced methodologies]

The machine learning engineer may also be focused on bringing state-of-the-art solutions to the data science team. For example, an MLE may be more focused on deep learning techniques compared to a data scientist’s classical statistical approach.

[Image: Machine learning engineers take it to the next level.]

Increasingly, the distinction between these positions is blurring, as statistics becomes the domain of easy-to-implement packages in Python and R. Don’t get me wrong: a fundamental understanding of statistical testing remains paramount in this career field. However, with growing frequency, the enterprise data scientist is asked to execute models powered by deep learning. This refers to the field of data science enabled by GPU-based computing, where typical models include neural networks such as CNNs, RNNs, LSTMs, and transformers.

Machine learning researchers at companies such as Google Brain, OpenAI, and DeepMind design new algorithmic approaches to advance toward state-of-the-art performance on specific use cases and, ultimately, the goal of building artificial general intelligence.

🚌 ML Ops

Another job title related to data science is MLOps. This refers to the responsibility of productionizing a model, in other words, creating a version of the model that is accessible to end users. MLOps is focused on creating a robust pipeline from data ingest, through preprocessing, to model inference (i.e., use in the real world to make classifications or predictions). This role’s responsibilities are closely related to those of the DevOps practitioner in software development.

[Image: MLOps is the bus driver, responsible for getting everyone to the track meet]

Summary

We explored the job titles of data analyst, data scientist, and a few positions related to machine learning using the metaphor of a track team. The data analyst might start off the relay, before passing cleaned data to the data scientist for modeling. The machine learning engineer is like an experienced coach, specialized in deep learning. Finally, the MLOps practitioner is like the bus driver responsible for getting the team to the track meet.