Handling missing values is an important data preprocessing step in machine learning pipelines.
Pandas is versatile in terms of detecting and handling missing values. However, when it comes to model training and evaluation with cross validation, there is a better approach.
The imputer of scikit-learn, combined with pipelines, provides a more practical way of handling missing values in the cross validation process.
In this post, we will first do a few examples that show different ways to handle missing values with Pandas. After that, I will explain why we need a different approach to handle missing values in cross validation.
Finally, we will do an example using the missing value imputer and pipeline of scikit-learn.
Let’s start with Pandas. Here is a simple dataframe with a few missing values.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(8,5)), columns=list('ABCDE'))
df.iloc[[1,4],[0,3]] = np.nan
df.iloc[[3,7],[1,2,4]] = np.nan
The isna function returns Boolean values indicating the cells with missing values. The isna().sum() gives us the number of missing values in each column.
df.isna().sum()
A    2
B    2
C    2
D    2
E    2
dtype: int64
The fillna function is used to handle missing values. It provides many options to fill in. Let’s use a different method for each column.
The missing values in columns A, B, and C are filled with the mean, median, and mode of the column, respectively. For column D, we used the ‘ffill’ method, which fills a missing value with the previous value in the column. The ‘bfill’ method, used for column E, does the opposite: it fills a missing value with the next value in the column.
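The fill step itself is not shown, so here is a minimal sketch of what it could look like. It assumes column E is the one filled with ‘bfill’, rebuilds the frame with a float dtype and a fixed seed (both my choices, not the article's), and uses the Series.ffill() and Series.bfill() shortcuts, which do the same job as the ‘ffill’ and ‘bfill’ methods of fillna.

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one above; float dtype so NaN can be stored directly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(8, 5)).astype(float), columns=list("ABCDE"))
df.iloc[[1, 4], [0, 3]] = np.nan
df.iloc[[3, 7], [1, 2, 4]] = np.nan

df["A"] = df["A"].fillna(df["A"].mean())      # mean of the column
df["B"] = df["B"].fillna(df["B"].median())    # median of the column
df["C"] = df["C"].fillna(df["C"].mode()[0])   # most frequent value
df["D"] = df["D"].ffill()                     # previous value in the column
df["E"] = df["E"].bfill()                     # next value in the column

remaining = df.isna().sum()
print(remaining)
```

Only a column whose last row is missing can keep a gap after a backward fill, because there is no later value to copy from.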
Here is the updated version of the dataframe:
We still have one missing value in column E because we used the ‘bfill’ method for this column. With this method, missing values are filled with the values that come after them. Since the last value in the column is missing, there is nothing after it to copy, so it was not changed.
The fillna function also accepts constant values. Let’s replace the last missing value with a constant.
As you have seen, the fillna function is pretty flexible. However, when it comes to training machine learning models, we need to be careful in handling the missing values.
Unless we use constant values, the missing values need to be handled after splitting the training and test sets. Otherwise, the model will be given information about the test set which causes data leakage.
Data leakage is a serious issue in machine learning. Machine learning models should not be given any information about the test set. The data points in the test sets need to be previously unseen.
If we use the mean of the entire data set to fill in missing values, we leak information about the test set to the model.
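To make this concrete, here is a small sketch of the leak-free alternative: the fill value is computed from the training rows only and then reused for the test rows. The single-feature frame, the seed, and the 75/25 split are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical single-feature frame with a few missing values.
rng = np.random.default_rng(42)
data = pd.DataFrame({"x": rng.normal(50, 10, size=100)})
data.loc[rng.choice(100, size=10, replace=False), "x"] = np.nan

train, test = train_test_split(data, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

train_mean = train["x"].mean()               # statistic learned from training rows only
train["x"] = train["x"].fillna(train_mean)
test["x"] = test["x"].fillna(train_mean)     # test rows reuse the training statistic

print(train_mean, data["x"].mean())          # the two means generally differ
```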
One solution is to handle missing values after train-test split. It is definitely an acceptable way. What if we want to do cross validation?
Cross validation means partitioning the data set into subsets (i.e. folds) and then running several iterations with different combinations, so that each example is used in both training and testing.
Consider the case with 5-fold cross validation. The data set is divided into 5 subsets (i.e. folds). At each iteration, 4 folds are used in training and 1 fold is used in testing. After 5 iterations, each fold will be used in both training and testing.
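The fold mechanics can be sketched with scikit-learn's KFold. The 10-row array is a hypothetical stand-in for a real data set, and the shuffle and seed settings are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(len(X), dtype=int)
for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    test_counts[test_idx] += 1
    print(f"iteration {i}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")

# Every row lands in the test fold exactly once across the 5 iterations.
print(test_counts)
```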
We need a practical way to handle missing values in the cross validation process in order to prevent data leakage.
One way is to create a Pipeline with scikit-learn. The pipeline accepts data preprocessing functions and can be used in the cross validation process.
Let’s create a new dataframe that fits a simple linear regression task.
The “regressor” pipeline contains a simple imputer that fills in the missing values with the mean. The linear regression model does the prediction task.
We can now use this pipeline as the estimator in cross validation.
from sklearn.model_selection import cross_val_score
X = df.drop('F', axis=1)
y = df['F']
scores = cross_val_score(regressor, X, y, cv=4, scoring='r2')
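The definitions of the dataframe and of the pipeline are not shown, so here is a self-contained sketch of the whole example under my own assumptions: a hypothetical frame where F is a noisy linear combination of three features, ten values removed per feature, and a pipeline named regressor made of SimpleImputer and LinearRegression. The step names ("imputer", "model") are arbitrary labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical data: F is a noisy linear combination of A, B and C.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 10, size=(200, 3)), columns=list("ABC"))
df["F"] = 2 * df["A"] + 3 * df["B"] - df["C"] + rng.normal(0, 0.5, size=200)

# Knock out 10 values in each feature column (the target stays complete).
for col in "ABC":
    df.loc[rng.choice(200, size=10, replace=False), col] = np.nan

regressor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # mean is learned per training fold
    ("model", LinearRegression()),
])

X = df.drop("F", axis=1)
y = df["F"]
scores = cross_val_score(regressor, X, y, cv=4, scoring="r2")
print(scores, scores.mean())
```

Because the imputer sits inside the pipeline, cross_val_score refits it on the training folds of every iteration, so the held-out fold never contributes to the fill values.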
The R-squared score is pretty high because this is a pre-designed data set.
The important point here is to handle missing values after splitting train and test sets. It can easily be done with pandas if we do a regular train-test split.
However, if you want to do cross validation, it will be tedious to use Pandas. The pipelines of scikit-learn library provide a more practical and easier way.
The scope of pipelines is quite broad. You can also add other preprocessing techniques to a pipeline, such as a scaler for numerical values. Using pipelines allows automating certain tasks and thus optimizing processes.
If you’re new to data science, here’s a good place to start
One of the most well-known and essential sub-fields of data science is machine learning. The term machine learning was first used in 1959 by IBM researcher Arthur Samuel. From there, the field of machine learning gained much interest from others, especially for its use in classifications.
When you start your journey into learning and mastering the different aspects of data science, perhaps the first sub-field you come across is machine learning. Machine learning is the name used to describe a collection of computer algorithms that can learn and improve by gathering information while they are running.
Any machine learning algorithm is built upon some data. Initially, the algorithm uses some “training data” to build an intuition of solving a specific problem. Once the algorithm passes the learning phase, it can then use the knowledge it gained to solve similar problems based on different datasets.
In general, we categorize machine learning algorithms into 4 categories:
Supervised algorithms: Algorithms that involve some supervision from the developer during the operation. To do that, the developer labels the training data and sets strict rules and boundaries for the algorithm to follow.
Unsupervised algorithms: Algorithms that do not involve direct control from the developer. In this case, the algorithms’ desired results are unknown and need to be defined by the algorithm.
Semi-supervised algorithms: Algorithms that combine aspects of both supervised and unsupervised algorithms. For example, not all training data will be labeled, and not all rules will be provided when initializing the algorithm.
Reinforcement algorithms: In these types of algorithms, a technique called exploration/exploitation is used. The gist of it is simple: the machine takes an action, observes the outcomes, and then considers those outcomes when executing the next action, and so on.
Each of these categories is designed for a purpose; for example, supervised learning is designed to scale the training data’s scope and make predictions of future or new data based on that. On the other hand, unsupervised algorithms are used to organize and filter data to make sense of it.
Under each of those categories lie various specific algorithms that are designed to perform certain tasks. This article will cover 5 basic algorithms every data scientist must know to cover machine learning basics.
Regression algorithms are supervised algorithms used to find possible relationships among different variables to understand how much the independent variables affect the dependent one.
You can think of regression analysis as an equation. For example, if I have the equation y = 2x + z, y is my dependent variable, and x and z are the independent ones. Regression analysis finds how much x and z affect the value of y.
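As a quick illustration of that equation, the sketch below generates hypothetical x and z values, builds y = 2x + z without noise, and lets scikit-learn's LinearRegression recover how much each variable contributes. The sample size and value ranges are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
z = rng.uniform(0, 10, size=50)
y = 2 * x + z            # the relationship from the text, with no noise

model = LinearRegression().fit(np.column_stack([x, z]), y)
coef_x, coef_z = model.coef_
print(round(coef_x, 3), round(coef_z, 3), round(model.intercept_, 3))
```

The fitted coefficients are the "how much" in the text: one unit of x moves y by about 2, one unit of z by about 1.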
The same logic applies to more advanced and complex problems. To adapt to the various problems, there are many types of regression algorithms; perhaps the top 5 are:
Linear Regression: The simplest regression technique; it uses a linear approach to model the relationship between the dependent (predicted) variable and the independent variables (predictors).
Logistic Regression: This type of regression is used on binary dependent variables and is widely used to analyze categorical data.
Ridge Regression: When the regression model becomes too complex, ridge regression shrinks the size of the model’s coefficients.
Lasso Regression: Lasso (Least Absolute Shrinkage and Selection Operator) regression is used to select and regularize variables.
Polynomial Regression: This type of algorithm is used to fit non-linear data. Using it, the best prediction is not a straight line; it is a curve that tries to fit all data points.
Classification in machine learning is the process of grouping items into categories based on a pre-categorized training dataset. Classification is considered a supervised learning algorithm.
These algorithms use the training data’s categorization to calculate the likelihood that a new item will fall into one of the defined categories. A well-known example of classification algorithms is filtering incoming emails into spam or not-spam.
There are different types of classification algorithms; the top 4 ones are:
K-nearest neighbor: KNN is an algorithm that classifies a new data point by finding the k closest data points in the training dataset and assigning the most common category among them.
Decision trees: You can think of a decision tree as a flow chart that splits the data points into two groups at a time, then splits each of those into two more, and so on.
Naive Bayes: This algorithm calculates the probability that an item falls under a specific category using the conditional probability rule.
Support Vector Machine (SVM): In this algorithm, the data is classified based on its degree of polarity, which can go beyond the X/Y prediction.
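To show what using one of these classifiers looks like in practice, here is a minimal k-nearest-neighbor sketch on scikit-learn's bundled Iris data. The dataset, the 75/25 split, and k = 5 are my choices for illustration, not something the article prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 is an arbitrary choice
knn.fit(X_train, y_train)                   # "training" stores the labeled points
accuracy = knn.score(X_test, y_test)        # share of test flowers classified correctly
print(accuracy)
```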
Ensembling algorithms are supervised algorithms that combine the predictions of two or more other machine learning algorithms to produce more accurate results. The results can be combined either by voting or by averaging. Voting is often used for classification and averaging for regression.
Ensembling algorithms have 3 basic types: Bagging, Boosting, and Stacking.
Bagging: In bagging, the algorithms are run in parallel on different training sets, all equal in size. All algorithms are then tested using the same dataset, and voting is used to determine the overall results.
Boosting: In the case of boosting, the algorithms are run sequentially. Then the overall results are chosen using weighted voting.
Stacking: As the name suggests, stacking has two levels stacked on top of each other: the base level is a combination of algorithms, and the top level is a meta-algorithm based on the base-level results.
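The voting idea mentioned at the start of this section can be sketched with scikit-learn's VotingClassifier. This is plain majority voting over three different models, not bagging, boosting, or stacking; the Iris data and the three base models are hypothetical picks.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote over the predicted classes
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print(ensemble_accuracy)
```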
Clustering algorithms are a group of unsupervised algorithms used to group data points. Points within the same cluster are more similar to each other than to points in different clusters.
There are 4 types of clustering algorithms:
Centroid-based Clustering: This clustering algorithm organizes the data into clusters around central points (centroids), and its results are sensitive to initial conditions and outliers. k-means is the best-known and most widely used centroid-based clustering algorithm.
Density-based Clustering: In this clustering type, the algorithm connects high-density areas into clusters creating arbitrary-shaped distributions.
Distribution-based Clustering: This clustering algorithm assumes the data is composed of probability distributions and then clusters the data into various versions of that distribution.
Hierarchical Clustering: This algorithm creates a tree of hierarchical data clusters, and the number of clusters can be varied by cutting the tree at the correct level.
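As an example of the centroid-based type, the sketch below runs k-means on synthetic points. The three blobs produced by make_blobs, the cluster count, and the seeds are assumptions made purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D points scattered around three centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # cluster index assigned to every point
print(kmeans.cluster_centers_)          # one centroid per cluster
```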
Association algorithms are unsupervised algorithms used to discover how likely some items are to occur together in a specific dataset. They are mostly used in market-basket analysis.
The most used association algorithm is Apriori.
The Apriori algorithm is a mining algorithm commonly used on transactional databases. Apriori is used to mine frequent itemsets and generate association rules from those itemsets.
For example, if a person buys milk and bread, then they are likely to also get some eggs. These insights are built upon previous purchases from various clients. Association rules are then formed according to a chosen confidence threshold, based on how frequently these items are bought together.
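The two quantities behind such a rule, support and confidence, can be computed by hand. The sketch below is not the full Apriori algorithm, only the measures it thresholds on, applied to six made-up shopping baskets.

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "bread", "eggs", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"milk", "bread"}
consequent = {"eggs"}

rule_support = support(antecedent | consequent)       # all three items together
rule_confidence = rule_support / support(antecedent)  # P(eggs | milk and bread)
print(rule_support, rule_confidence)
```

A rule such as {milk, bread} -> {eggs} is kept only if its confidence reaches the chosen threshold.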
Machine learning is one of the most famous and well-researched sub-fields of data science. New machine learning algorithms are always under development to reach better accuracy and faster execution.
Regardless of the algorithm, it can generally be placed in one of four categories: supervised, unsupervised, semi-supervised, and reinforcement algorithms. Each of these categories holds many algorithms that are used for different purposes.
In this article, I have gone through 5 types of supervised/unsupervised algorithms that every machine learning beginner should be familiar with. These algorithms are so well studied and widely used that you only need to understand how to use them rather than how to implement them.
Most famous Python machine learning modules, such as scikit-learn, contain a pre-defined version of most, if not all, of these algorithms.
My advice is: understand the mechanics, master the usage, and start building.
Today’s world is moving fast towards using AI and Machine Learning in all fields. The most important key to this is DATA. Data is the key to everything. If, as Machine Learning Engineers, we are able to understand and restructure the data to fit our needs, we have completed half the task.
Let us try to learn to perform EDA (Exploratory Data Analysis) on data.
What we will learn in this tutorial :
Collect data for our application.
Structure the data to our needs.
Visualize the data.
Let’s get started. We will fetch some sample data: the Iris dataset, a very common dataset used when you want to get started with Machine Learning and Deep Learning.
1. Collection of Data: The data for any application can be found on several websites like Kaggle, UCI, etc., or has to be created specifically for the application. For example, if we want to classify between a dog and a cat, we don’t need to build a dataset by collecting images of dogs and cats, as several datasets are already available. Here, let’s inspect the Iris dataset.
Let’s fetch the data:
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris() #3.
df = pd.DataFrame(data.data, columns=data.feature_names)#4.
Line #3 fetches the dataset that ships with sklearn. Line #4 converts the dataset into a pandas data frame, which is very commonly used to explore datasets with row-column attributes.
The first 5 rows of the data can be viewed using :
The number of rows and columns, and the names of the columns of the dataset can be checked with :
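The snippets for these two steps are not shown. They are most likely the standard pandas calls, so here is a sketch that assumes head(), shape, and columns are what was meant.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

print(df.head())         # first 5 rows
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # names of the columns
```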
We can even download the dataset directly from UCI from here. The CSV file downloaded can be loaded into the df as :
df = pd.read_csv("path to csv file")
2. Structuring the Data: Very often the dataset will have several features that don’t directly affect our output. Using such features is wasteful, as it consumes memory unnecessarily and sometimes also introduces errors.
We can check which columns are important or affect the output column more by checking the correlation of the output column with the inputs. Let us try that out :
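One way to run this check is DataFrame.corr(), which computes pairwise Pearson correlations. The sketch below assumes that call is what was used and, as a hypothetical choice, treats ‘sepal length (cm)’ as the output column.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

corr = df.corr()                     # pairwise Pearson correlation matrix
print(corr.round(2))

# Correlation of every other feature with the assumed output column.
target_corr = corr["sepal length (cm)"].drop("sepal length (cm)")
print(target_corr.sort_values(ascending=False))
```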
As you can see above, the correlation matrix helps us understand how the features are related to one another. For more information about the correlation matrix click here.
So if our output column were sepal length (cm), the output y would be ‘sepal length (cm)’ and the input X would be ‘petal length (cm)’ and ‘petal width (cm)’, as they have a higher correlation with y.
Note: If ‘sepal width (cm)’ had a correlation of -0.8, we would also include it, because a correlation that large has a strong impact on the output y even though it is negative (the two move in opposite directions).
Note: The value of correlation in a correlation matrix can vary between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship).
3. Visualize the Data: This is a very important step, as it can help in three ways:
It helps you understand important points such as how the data is spread, i.e. whether it lies within a small range of values or a wider one.
It helps you understand decision boundaries.
It lets you present the data to people so they understand it, rather than showing them tables.
There are several ways to plot and present the data, like histograms, bar charts, pair plots, etc.
Let’s see how we plot a histogram for the IRIS dataset.
df.plot.hist(subplots=True, grid=True)
By looking at the histogram, it’s easier for us to understand the range of values for each feature.
Let’s simply plot the data now.
Apart from these, there are several other graphs which can be plotted easily depending on the application.
Hence, we can conclude by stating one simple fact: a well-structured dataset is the first key to a good and efficient Machine Learning model.
Explore the distinction between these common job titles with the analogy of a track meet.
This article uses the metaphor of a track team to differentiate between the role of a data analyst, data scientist, and machine learning engineer. We’ll start with the idea that conducting a data science project is similar to running a relay race. Hopefully, this analogy will help you make more informed choices around your education, job applications, and project staffing.
🔵 Data Analyst
The data analyst is capable of taking data from the “starting line” (i.e., pulling data from storage), doing data cleaning and processing, and creating a final product like a dashboard or report. The data analyst may also be responsible for transforming data for use by a data scientist, a hand-off that we’ll explore in a moment.
You might say that the data analyst is very capable of running the first part of the race, but no further.
🔴 Data Scientist
The data scientist has all the skills of the data analyst, though they might be less well-versed in dashboarding and perhaps a bit rusty at report writing. The data scientist can run further than the data analyst, though, in terms of their ability to apply statistical methodologies to create complex data products.
The data scientist is capable of racing the entire lap. That means they have the skills required to query data, explore features to assess predictive power, select an appropriate crop of models for training and testing, conduct hyperparameter tuning, and ultimately arrive at a statistics-powered model that provides business value through classification or prediction. However, if an organization loads its data scientist with all these responsibilities — from data ingest through data modeling — the data scientist won’t be able to run as well as if he or she were asked to run only the second part of the race, focused on the data modeling.
Overall, the team’s performance will improve if a business analyst conducts the querying and data cleaning steps, allowing the data scientist to focus on statistical modeling.
🔶 Machine Learning Engineer
The machine learning engineer could be thought of as the team’s secret weapon. You might conceptualize the MLE as the person designing track shoes that empower the other runners to race at top speeds.
The machine learning engineer may also be focused on bringing state-of-the-art solutions to the data science team. For example, an MLE may be more focused on deep learning techniques compared to a data scientist’s classical statistical approach.
Increasingly, the distinction between these positions is blurring, as statistics becomes the domain of easy-to-implement packages in Python and R. Don’t get me wrong: a fundamental understanding of statistical testing remains paramount in this career field. However, with growing frequency, the enterprise data scientist is asked to execute models powered by deep learning. This refers to the field of data science enabled by GPU-based computing, where typical models include neural networks like CNNs, RNNs, LSTMs, and transformers.
Machine learning researchers at companies such as Google Brain, OpenAI, and DeepMind design new algorithmic approaches to advance toward state-of-the-art performance on specific use cases and, ultimately, the goal of building artificial general intelligence.
🚌 ML Ops
Another job title related to data science is MLOps. This refers to the responsibility of productionizing a model —in other words, creating a version of the model that is accessible to end users. MLOps is focused on creating a robust pipeline from data ingest, through preprocessing, to model inference (i.e., use in the real world to make classifications or predictions). This role’s responsibilities are closely related to those of the DevOps practitioner in software development.
We explored the job titles of data analyst, data scientist, and a few positions related to machine learning using the metaphor of a track team. The data analyst might start off the relay, before passing cleaned data to the data scientist for modeling. The machine learning engineer is like an experienced coach, specialized in deep learning. Finally, the MLOps practitioner is like the bus driver responsible for getting the team to the track meet.
An outlier is a data point that is distant from the other observations in a data set, that is, a point that lies outside the overall distribution of the dataset. In layman’s terms, an outlier is something that behaves differently from the rest of the data.
Outliers can be very informative about the subject-area and data collection process. It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. To understand outliers, we need to go through these points:
What causes the outliers?
Impact of the outlier
Methods to identify outliers
What causes the outliers?
Before dealing with the outliers, one should know what causes them. There are three causes for outliers: data entry or experimental measurement errors, sampling problems, and natural variation.
1. Data entry or experimental measurement error
An error can occur while running an experiment or entering data. During data entry, a typo can introduce a wrong value by mistake. Consider a dataset of ages in which one person’s age is recorded as 356, which is impossible. This is a data entry error.
These types of errors are easy to identify. If you determine that an outlier value is an error, you can fix it by deleting the data point, because you know it’s an incorrect value.
2. Sampling problems
Outliers can occur while collecting random samples. Consider an example where we have records of bone density for various subjects, and one subject shows unusual bone growth. After analysis, it was discovered that the subject had diabetes, which affects bone health. The goal was to model bone density growth in girls with no health conditions that affect bone growth. Since this data point is not part of the target population, we do not include it.
3. Natural variation
Suppose we need to check the reliability of a machine. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.
Impact of the outlier
Outliers can change the results of the data analysis and statistical modeling. Following are some impacts of outliers in the data set:
It may cause a significant impact on the mean and the standard deviation
If the outliers are non-randomly distributed, they can decrease normality
They can bias or influence estimates that may be of substantive interest
They can also violate the basic assumptions of regression, ANOVA, and other statistical models.
To understand the impact deeply, let’s take an example to check what happens to a data set with and without outliers in the data set.
Let’s examine what can happen to a data set with outliers. For the sample data set:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4
We find the following mean, median, mode, and (sample) standard deviation:
Mean = 2.45
Median = 2
Mode = 2
Standard Deviation = 1.04
If we add an outlier to the data set:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400
The new values of our statistics are:
Mean = 35.58
Median = 2.5
Mode = 2
Standard Deviation = 114.77
As you can see, having outliers often has a significant effect on your mean and standard deviation.
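These figures are easy to recompute with Python's standard statistics module. The sketch below uses the two lists exactly as written above and the sample standard deviation (statistics.stdev); the helper name summarize is just an illustrative choice.

```python
import statistics

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
with_outlier = data + [400]

def summarize(values):
    """Return the four summary statistics discussed in the text."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": round(statistics.stdev(values), 2),  # sample standard deviation
    }

before = summarize(data)
after = summarize(with_outlier)
print(before)
print(after)
```

The median and mode barely move, while the mean and standard deviation are dominated by the single extreme value.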
Methods to Identify outliers
There are various ways to identify outliers in a dataset, following are some of them:
Sorting the data
Using graphical Method
Using z score
Using the IQR interquartile range
Sorting the data
Sorting the dataset is the simplest effective method to check for unusual values. Let us consider an example of an age dataset:
In the above dataset, we have sorted the ages and found that 398 is an outlier. The sorting method is most effective on small datasets.
Using graphical Method
We can detect outliers with the help of graphical representation like Scatter plot and Boxplot.
1. Scatter Plot
Scatter plots often have a pattern. We call a data point an outlier if it doesn’t fit the pattern. Here we have a scatter plot of Weight vs height. Notice how two of the points don’t fit the pattern very well. There is no special rule that tells us whether or not a point is an outlier in a scatter plot. When doing more advanced statistics, it may become helpful to invent a precise definition of “outlier”.
2. Box Plot
A box plot is one of the most effective ways of identifying outliers in a dataset. When reviewing a box plot, an outlier is a data point located beyond the whiskers, as seen in the box plot of bill vs days. The box plot uses the interquartile range (IQR) to detect outliers.
The z-score (also called a standard score) measures how many standard deviations a data point lies above or below the mean.
z-score = (x - mean) / standard deviation
In a normal distribution, approximately:
68% of the data points lie within ±1 standard deviation of the mean.
95% of the data points lie within ±2 standard deviations of the mean.
99.7% of the data points lie within ±3 standard deviations of the mean.
A common convention is therefore to treat points with a z-score above 3 or below -3 as outliers. With μ as the mean and σ as the standard deviation, the formula reads:
z = (X - μ) / σ
Let us consider a dataset:
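The article's dataset is not reproduced here, so the sketch below uses a small hypothetical sample with one obviously extreme value. The cutoff of 3 standard deviations is a common convention that I am assuming, not a fixed rule.

```python
import numpy as np

# Hypothetical sample: 19 ordinary values and one extreme value.
values = np.array([10, 12, 11, 13, 12, 14, 15, 13, 12, 11,
                   14, 13, 12, 15, 10, 11, 13, 14, 12, 102])

z_scores = (values - values.mean()) / values.std()   # population std (ddof=0)
threshold = 3                                         # assumed cutoff
z_outliers = values[np.abs(z_scores) > threshold].tolist()
print(z_outliers)
```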
Using the IQR interquartile range
The interquartile range (IQR) is just the width of the box in the box plot, and it can be used as a measure of how spread out the values are. An outlier is any value that lies more than one and a half times the length of the box from either end of the box.
Arrange the data in increasing order
Calculate first(q1) and third quartile(q3)
Find interquartile range (q3-q1)
Find the lower bound: q1 - 1.5 * IQR
Find the upper bound: q3 + 1.5 * IQR
Anything that lies outside of lower and upper bound is an outlier
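The steps above can be sketched with NumPy. The sample is hypothetical (it is not the article's dataset), and np.percentile with its default linear interpolation is one of several accepted ways to compute quartiles, so other tools may give slightly different bounds.

```python
import numpy as np

# Hypothetical sample: 19 ordinary values and one extreme value.
values = np.array([10, 12, 11, 13, 12, 14, 15, 13, 12, 11,
                   14, 13, 12, 15, 10, 11, 13, 14, 12, 102])

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # width of the box
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

iqr_outliers = values[(values < lower) | (values > upper)].tolist()
print(lower, upper, iqr_outliers)
```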
Let us take the same example as of Z-score:
As you can see, we have found the lower and upper bounds, 7.5 and 19.5, so anything that lies outside these values is an outlier.
Data preprocessing is a data mining technique that transforms raw data into an understandable format. Raw (real-world) data is almost always incomplete and cannot be sent through a model as is; doing so would cause errors. That is why we need to preprocess data before sending it through a model.
At the heart of this intricate process is data. Your machine learning tools are only as good as the quality of your data; sophisticated algorithms will not make up for poor data. In this article I will try to break the exercise of data preprocessing, in other words the rituals programmers usually follow before data is ready to be used in machine learning models, into steps.
Steps in Data Preprocessing
Here are the steps a data scientist follows:
1. Import libraries
2. Read data
3. Checking for missing values
4. Checking for categorical data
5. Standardize the data
6. PCA transformation
7. Data splitting
Step 1: Import Libraries: Libraries are modules that you can call upon when you need them. They are essential collections that a data scientist or analyst needs in Python for data processing and for arriving at a decisive outcome or output, e.g.
import pandas as pd
Step 2: Import Dataset: Datasets come in many formats, but a lot of them come as CSV. Keep the dataset in the same directory as your program and you can read it using the read_csv method, which can be found in the pandas library.
import pandas as pd
dataset = pd.read_csv('YourFile.csv')
After importing the dataset, you do EDA (Exploratory Data Analysis). “Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process.” In industry, a data scientist often works with large datasets, and it is impossible to understand an entire dataset in one go. So first get an idea of what you are dealing with by taking a subset of the dataset as a sample. Do not make any modifications at this stage; you are just observing the dataset and getting an idea of how to tackle it.
After studying the dataset carefully, we create a matrix of features (X) and a dependent vector (Y) with their respective observations. To read the columns, we use pandas’ iloc (integer-position based selection), which takes two parameters: [row selection, column selection].
X = dataset.iloc[:, :-1].values
: as a parameter selects all. So the above piece of code selects all the rows. For columns we have :-1, which means all the columns except the last one.
Step 3: Taking care of Missing Data in Dataset: When you first get your data from a source, most of the time it comes incomplete: values are missing or measurements are incompatible (cm or meters, dollars or pounds sterling), so you have to normalize or standardize it. We will not get into scaling right now.
Sometimes you may find that some data is missing in the dataset, and we need to be equipped to handle the problem when we come across it. You could remove the entire row of data, but what if you are unknowingly removing crucial information? Of course we would not want to do that. One of the most common ideas for handling the problem is to take the mean of all the values in the same column and use it to replace the missing data. The library to use is scikit-learn; its impute module contains the SimpleImputer class, which helps with missing data.
import numpy as np
from sklearn.impute import SimpleImputer
Create an object of this class so we can call its methods. (The older sklearn.preprocessing.Imputer class, along with its axis parameter, was removed in scikit-learn 0.22; SimpleImputer always imputes column by column.) The constructor takes several parameters, including:
i. missing_values: the placeholder that marks a missing value, for example np.nan.
ii. strategy: we want the average, so we set it to mean. We can also set it to median or most_frequent (for the mode) as necessary.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Next we fit the imputer object to our data. Fitting means learning the replacement values (here, the column means) from the data.
imputer = imputer.fit(X[:, 1:3])
The code above fits the imputer object to our matrix of features X. Since we used :, it selects all rows, and 1:3 selects the second and third columns (why? because in Python indexing starts from 0, so 1 means the second column, and the upper bound is excluded. If we wanted to include the fourth column as well, we would have written 1:4).
Now we just replace the missing values with the mean of the column using the transform method, a step called data transformation.
X[:, 1:3] = imputer.transform(X[:, 1:3])
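Here is a self-contained sketch of the fit and transform steps on a small hypothetical array. It uses SimpleImputer from sklearn.impute, the class that current scikit-learn releases provide for this job; the numbers in X are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 is an id, columns 1 and 2 have gaps.
X = np.array([
    [1.0, 44.0, 72000.0],
    [2.0, 27.0, 48000.0],
    [3.0, np.nan, 54000.0],
    [4.0, 38.0, np.nan],
    [5.0, 35.0, 58000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X[:, 1:3])            # learn the mean of columns 1 and 2
X[:, 1:3] = imputer.transform(X[:, 1:3])    # replace NaN with those means
print(X)
```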
In the next article, Part 2, we will discuss the remaining steps in data preprocessing and exploratory data analysis, which are:
Checking for categorical data
Standardize the data
PCA transformation
Data splitting (training and testing)
As data scientists, we often work with tons of data. The data we want to load can be stored in different ways. The most common formats are CSV files, Excel files, or databases. Also, the data can be available through web services. Of course, there are many other formats. To work with the data, we need to represent it in a tabular structure. Anything tabular is arranged in a table with rows and columns.
In some cases, the data is already tabular and it’s easy to load it. In other cases, we work with unstructured data. The unstructured data is not organized in a pre-defined manner (plain text, images, audio, web pages). In this post, we’ll focus on loading data from CSV (Comma Separated Values) files.
Pandas is an open source library for the Python programming language developed by Wes McKinney. This library is very efficient and provides easy-to-use data structures and analysis tools.
Pandas contains a fast and efficient object for data manipulation called DataFrame. A commonly used alias for Pandas is pd. The library can load many different formats of data. When our data is clean and structured, every row represents an observation and every column a feature. The rows and the columns can have labels.
In the examples below, I’ll mark some parts with transparent rectangles for a better understanding of what we’re changing. Also, we’ll work with a very small subset from a dataset for simplicity. This dataset contains mobile cellular subscriptions for a given country and year. The full data can be found here. I’ve done some cleaning beforehand to make the data tidy.
Here is the data we want to load into a Pandas DataFrame. It’s uploaded to this GitHub Gist, where it’s already displayed with a tabular structure. However, we can also see it in the raw format here, and there we can see that the file contains comma-separated values.
To load this data, we can use the pd.read_csv() function.
To create these examples, I’m using a Jupyter Notebook. If the last line in a code cell contains a value, it’s printed. That’s why I’ve put the cellular_data variable in the last line of the example.
We can see that the data is loaded, but there is something strange. What is this Unnamed: 0 column? We don’t have such a column in our CSV file. Well, in our case this column contains the row labels (row index) of the data, and we have to tell Pandas that. We can do this using the index_col argument.
In other cases, our data may come without row labels. In these cases, pandas will auto-generate the labels, starting from 0 and ending at the number of rows minus 1. Let’s see examples with the same data, without the row labels.
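As a rough sketch of the behavior described above, the snippet below loads a tiny CSV from an in-memory string instead of the real file. The rows and numbers are invented for illustration; only the column name cellular_subcription follows the one used in this post.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file. The first column has
# no header, which is what happens when row labels are saved with the data.
csv_text = """,country,year,cellular_subcription
0,Bulgaria,2015,9.2
1,France,2015,66.7
2,United Kingdom,2015,78.5
"""

# Without index_col, the unlabeled first column shows up as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col=0, the first column is used as the row index instead.
cellular_data = pd.read_csv(io.StringIO(csv_text), index_col=0)
cellular_data
```

In a real notebook you would pass the file path or URL to pd.read_csv() instead of the io.StringIO object.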
Now our DataFrame looks fine. Sometimes, we want to change the row labels in order to work easily with our data later. Here we can set the row labels to be the country code for each row. We can do that by setting the index attribute of a Pandas DataFrame to a list. The length of the list and the length of the rows must be the same. After that, we can easily subset our data or look at a given country using the country codes.
In many cases, we don’t want to set the index manually; we want the index to be one of the columns in the DataFrame. In such cases, we can use the DataFrame method called set_index. Note that pandas doesn’t set the index permanently unless we tell it to. If we want to set the index permanently, we can use the inplace argument.
In the example above, we don’t tell pandas to set the index permanently, and when we print the cellular_data DataFrame we see that the index is not changed. Let’s try again, with the inplace argument.
Now we can clearly see that when we use inplace = True, our DataFrame’s index is changed permanently.
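The difference between the two calls can be sketched as follows. The DataFrame contents and the country_code column name are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical data; the values are made up.
cellular_data = pd.DataFrame({
    "country_code": ["BGR", "FRA", "GBR"],
    "country": ["Bulgaria", "France", "United Kingdom"],
    "cellular_subcription": [9.2, 66.7, 78.5],
})

# set_index returns a new DataFrame; the original keeps its default index.
indexed_copy = cellular_data.set_index("country_code")
index_before = cellular_data.index.tolist()
print(index_before)

# With inplace=True the original DataFrame itself is modified.
cellular_data.set_index("country_code", inplace=True)
index_after = cellular_data.index.tolist()
print(index_after)
```

Reassigning the result, as in cellular_data = cellular_data.set_index("country_code"), has the same effect as inplace=True.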
Index and Select Data
There are many ways in which you can select data from DataFrames. In this blog post, we’ll see how to use square brackets and the methods loc and iloc to achieve this.
With square brackets, you can select a choice from the rows or you can select a choice from the columns. For a row selection, we can use a list of indexes or a slice. We can select rows using slicing like this: sliceable[start_index:end_index:step]
The end_index is not inclusive. I’ve already written about slicing in one of my previous blog posts, Python Basics for Data Science. You can quickly look at the “Subsetting lists” part to understand it. Although the examples there use lists, the idea here is the same. We just use DataFrames here, which are also sliceable.
For a column selection, we can use a list of the wanted columns. If we pass only one column as a string instead of a list, the result will be a pandas Series. A pandas Series is a one-dimensional labeled array. If we put 2 or more Series together, we create a DataFrame. In some cases, we might want to select only one column but keep the data in a DataFrame. In such cases, we can pass a list with one column name.
The square brackets are useful, but their functionality is limited. We can select only columns or only rows from a given DataFrame. In many cases, we need to select both columns and rows. The loc and iloc methods give us this power.
The loc method allows us to select rows and columns of our data based on labels. First, you specify the row labels on the left side, then you specify the column labels on the right side. The iloc method allows the same thing, but based on the integer positions in our DataFrame.
If we want to select all rows or columns we can simply type : for the rows or for the columns side. Also if we want to select specific rows but all columns, we can just pass only the rows labels.
Understanding with examples is easier, so let’s see some. In these examples, we’ll compare the usage of these 2 methods.
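Here is a small sketch comparing the two methods on the same selection. The DataFrame is a made-up stand-in with country codes as row labels.

```python
import pandas as pd

# Hypothetical data indexed by country code; the values are made up.
cellular_data = pd.DataFrame(
    {
        "country": ["Bulgaria", "France", "United Kingdom"],
        "year": [2015, 2015, 2015],
        "cellular_subcription": [9.2, 66.7, 78.5],
    },
    index=["BGR", "FRA", "GBR"],
)

# loc: row labels on the left, column labels on the right.
by_label = cellular_data.loc[["BGR", "GBR"], ["country", "cellular_subcription"]]

# iloc: the same selection, expressed with integer positions.
by_position = cellular_data.iloc[[0, 2], [0, 2]]

# ":" selects everything on that side.
all_rows_one_column = cellular_data.loc[:, "year"]

# Passing only row labels returns all columns for those rows.
one_row = cellular_data.loc["FRA"]

print(by_label)
print(by_position)
```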
Comparison operators in Python
The comparison operators tell us how 2 values relate to each other. In many cases, Python can’t order 2 values of different types (for example, comparing a string with an integer using < raises a TypeError), but there are some exceptions. For example, we can compare float and integer numbers. Something to keep in mind is that we can compare Booleans with integers: True corresponds to 1 and False corresponds to 0. These operators are very straightforward.
Let’s see some very simple examples.
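A few plain Python comparisons illustrate these points. The variable names are chosen only for this sketch.

```python
# An int and a float can be compared directly.
int_vs_float = 2 == 2.0

# True behaves like 1 and False like 0.
bool_vs_int = True == 1
bool_sum = True + True

# Strings are compared lexicographically.
text_order = "apple" < "banana"

# Ordering a string against an integer is not defined and raises TypeError.
try:
    "2" < 3
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(int_vs_float, bool_vs_int, bool_sum, text_order, mixed_ok)
```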
Filtering pandas DataFrame
The comparison operators can be used with pandas series. This can help us to filter our data by specific conditions. We can use comparison operators with series, the result will be a boolean series. Each item of these series will be True if the condition is met, and False otherwise. After we have these Boolean series, we can apply a row selection to get a filtered DataFrame as a result.
Note that we’ve used another syntax here to get the cellular_subcription column. The DataFrame[column_name] and DataFrame.column_name expressions return the same result.
However, be careful with the dot syntax (used in these examples), because a column can have the same name as one of the DataFrame’s methods. For example, if we have a column called “min”, we can’t use the dot syntax to get the values from that column, because the DataFrame object has a method called “min”. Let’s now see how we can use the Boolean Series from above to filter our DataFrame.
Let’s see another example. Imagine that we want to get all records where the country is the United Kingdom.
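The filtering steps above can be sketched like this. The data and the threshold of 50 are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the values are made up.
cellular_data = pd.DataFrame({
    "country": ["Bulgaria", "France", "United Kingdom"],
    "year": [2015, 2015, 2015],
    "cellular_subcription": [9.2, 66.7, 78.5],
})

# Comparing a Series with a value gives a Boolean Series.
over_fifty = cellular_data["cellular_subcription"] > 50
print(over_fifty)

# Using the Boolean Series for row selection keeps only the True rows.
filtered = cellular_data[over_fifty]
print(filtered)

# All records where the country is the United Kingdom.
united_kingdom = cellular_data[cellular_data["country"] == "United Kingdom"]
print(united_kingdom)
```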
Now that we know how to generate a Boolean series that meets some conditions, we can now use Boolean operators on them to create more complex filtering.
There are 3 types of Boolean operations:
and – takes 2 Boolean values and returns True if both values are True. This operator short-circuits: it only evaluates the second argument if the first one is True.
or – takes 2 Boolean values and returns True if at least one of them is True. This operator also short-circuits: it only evaluates the second argument if the first one is False.
not – takes a Boolean value and returns the opposite. This operator has a lower priority than non-Boolean operators. For example, not x == y is interpreted as not (x == y), and x == not y is a syntax error. It is commonly used when we need to combine different Boolean operations and then want to negate the result.
Simple Boolean Operations
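The behavior of the three operators, including short-circuiting and the priority of not, can be checked with a short sketch in plain Python. The values of x and y are arbitrary.

```python
x, y = 5, 10

both = x > 0 and y > 0      # True only if both sides are True
either = x > 7 or y > 7     # True if at least one side is True
negated = not x == y        # parsed as not (x == y)

# Short-circuit: the right-hand side is never evaluated here,
# so the division by zero does not raise an error.
short_and = False and (1 / 0 > 0)
short_or = True or (1 / 0 > 0)

print(both, either, negated, short_and, short_or)
```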
Subsetting by Multiple Conditions
When we want to filter our DataFrame by multiple conditions, we can use the Boolean operators. An important note here is that when we want to use Boolean operators with pandas, we must use them as follows:
& for and
| for or
~ for not
When we apply a Boolean operation on 2 Boolean series with the same size, the Boolean operation will apply for each pair.
Using the “and” operator
We can see that pandas doesn’t work with the and operator; it expects the & operator. Now, let’s try again. The goal here is to get only the flights that have more than 240 passengers and fewer than 300 passengers.
Using the “or” operator
Let’s find all flights that have fewer than 200 or more than 375 passengers. Remember that for the or operator we use the pipe | character.
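A sketch of both filters follows, using a small flights DataFrame assumed for illustration (columns year, month, and passengers). Note the parentheses around each condition: & and | bind more tightly than the comparison operators.

```python
import pandas as pd

# Small illustrative flights table; treat the numbers as sample values.
flights = pd.DataFrame({
    "year": [1952, 1952, 1954, 1954, 1956, 1958, 1960],
    "month": ["August", "November", "February", "November",
              "September", "October", "July"],
    "passengers": [242, 172, 188, 203, 355, 359, 622],
})

# The Python "and" keyword fails on Series because their truth value is ambiguous.
try:
    flights[(flights["passengers"] > 240) and (flights["passengers"] < 300)]
    and_failed = False
except ValueError:
    and_failed = True

# "&" applies the and operation pair by pair.
between_240_and_300 = flights[(flights["passengers"] > 240) & (flights["passengers"] < 300)]

# "|" applies the or operation pair by pair.
low_or_high = flights[(flights["passengers"] < 200) | (flights["passengers"] > 375)]

print(between_240_and_300)
print(low_or_high)
```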
Reversing conditions using the not operator
In some cases, we want to negate our condition. In such cases, we can use the not operator. For this operator, we use the tilde ~ character.
Let’s say that we want to get all flights where the month is not November.
We can do more complex filtering based on very specific conditions.
Let’s get all flights in November for the years 1952 and 1954.
Now, let’s get all flights that are between the years 1952 and 1954 and where the month is August or September.
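These three filters might be written as in the sketch below. The flights DataFrame is again a small stand-in assumed for illustration.

```python
import pandas as pd

# Small illustrative flights table; treat the numbers as sample values.
flights = pd.DataFrame({
    "year": [1952, 1952, 1954, 1954, 1956, 1958, 1960],
    "month": ["August", "November", "February", "November",
              "September", "October", "July"],
    "passengers": [242, 172, 188, 203, 355, 359, 622],
})

# ~ negates a Boolean Series: every month except November.
not_november = flights[~(flights["month"] == "November")]

# November flights in 1952 or 1954.
november_52_54 = flights[
    (flights["month"] == "November")
    & ((flights["year"] == 1952) | (flights["year"] == 1954))
]

# Flights from 1952 through 1954 in August or September.
aug_sep_52_to_54 = flights[
    (flights["year"] >= 1952)
    & (flights["year"] <= 1954)
    & ((flights["month"] == "August") | (flights["month"] == "September"))
]

print(not_november)
print(november_52_54)
print(aug_sep_52_to_54)
```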
The isin method
Imagine that we want to compare equality of a single column to multiple values. Let’s say that we want to get all flights that are in the months: February, August, and September. One way to achieve this is with multiple or conditions like this.
The code is repetitive, and this is tedious. There is a better way to achieve the same result: the isin method. We pass the values to this method as a list or set, and it returns the wanted Boolean series.
Of course, we can combine the Boolean series returned by this method with other Boolean series.
Let’s say that we want to get the flights from 1954 that are in February, August, or September.
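A sketch comparing the repeated or conditions with isin, on a small flights DataFrame assumed for illustration:

```python
import pandas as pd

# Small illustrative flights table; treat the numbers as sample values.
flights = pd.DataFrame({
    "year": [1952, 1952, 1954, 1954, 1956, 1958, 1960],
    "month": ["August", "November", "February", "November",
              "September", "October", "July"],
    "passengers": [242, 172, 188, 203, 355, 359, 622],
})

# The tedious version: one equality check per month, joined with |.
with_or = (
    (flights["month"] == "February")
    | (flights["month"] == "August")
    | (flights["month"] == "September")
)

# The same Boolean Series with isin.
with_isin = flights["month"].isin(["February", "August", "September"])

selected_months = flights[with_isin]

# Combining isin with another condition.
selected_months_1954 = flights[(flights["year"] == 1954) & with_isin]

print(selected_months)
print(selected_months_1954)
```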
The between method
This method can make our code cleaner when we want to select values inside a range. Instead of writing 2 Boolean conditions, we can use this method.
Let’s say that we want to get all flights between the years 1955 and 1960, inclusive.
Again, we can combine this method with other conditional filtering.
Let’s get all the flights that are between the years 1955 and 1960 and are in October.
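A sketch of both filters with the between method follows, on a small flights DataFrame assumed for illustration. By default, between includes both endpoints.

```python
import pandas as pd

# Small illustrative flights table; treat the numbers as sample values.
flights = pd.DataFrame({
    "year": [1952, 1952, 1954, 1954, 1956, 1958, 1960],
    "month": ["August", "November", "February", "November",
              "September", "October", "July"],
    "passengers": [242, 172, 188, 203, 355, 359, 622],
})

# Equivalent to (year >= 1955) & (year <= 1960).
in_range = flights["year"].between(1955, 1960)
flights_55_to_60 = flights[in_range]

# Combined with a condition on the month.
october_55_to_60 = flights[in_range & (flights["month"] == "October")]

print(flights_55_to_60)
print(october_55_to_60)
```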
The isnull and isna methods
This isna method indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike). The isnull method is an alias for the isna method. This means that these 2 methods are exactly the same, but with different names.
I’ve changed the flights DataFrame which we have used. There are some NaN values in the month column. Let’s see how we can get all the records which have a missing month.
In many cases, we want to get the data that have no missing values. Let’s try to get the flights that have no missing month. We can use the not operator with the ~ character to negate the Boolean series returned by the isna method.
The notna method
There is also a method called notna. This method is the opposite of the isna method. We can achieve the same result as in the last example using this method.
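The three approaches can be sketched as follows, on a small made-up flights DataFrame in which some months are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical flights table with two missing months (NaN and None).
flights = pd.DataFrame({
    "year": [1952, 1954, 1956, 1958],
    "month": ["August", np.nan, "September", None],
    "passengers": [242, 188, 355, 359],
})

# Records with a missing month.
missing_month = flights[flights["month"].isna()]

# Records with a month, by negating isna with ~ ...
has_month_tilde = flights[~flights["month"].isna()]

# ... or directly with notna.
has_month_notna = flights[flights["month"].notna()]

print(missing_month)
print(has_month_notna)
```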
Subsetting by a Condition in One Line
All the examples we have looked at so far can be written in one line. Some people like that, others hate it. When we’re subsetting by only one condition, in many cases it’s preferable and easy to write our filter in one line.
Let’s first see a subsetting example only with one condition.
Subsetting by Multiple Conditions in One Line
In some cases, it’s okay to write a simple expression in one line, but in other cases, it’s very unreadable. My suggestion here is to write the simple ones in one line and the complex ones in multiple lines. If your row is very long it can be unreadable, so be careful.
Subsetting with boolean series using the .loc method.
Remember the .loc method? We can select rows and columns based on labels with this method. The nice thing is that we can pass a Boolean series instead of labels for a row or column selection, and it will work.
All of the Boolean series we generated for subsetting in the examples above can be passed for a row selection.
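As a sketch, a Boolean series can take the place of the row labels in .loc while the columns are still chosen by label. The flights DataFrame is a small stand-in assumed for illustration.

```python
import pandas as pd

# Small illustrative flights table; treat the numbers as sample values.
flights = pd.DataFrame({
    "year": [1952, 1952, 1954, 1954, 1956, 1958, 1960],
    "month": ["August", "November", "February", "November",
              "September", "October", "July"],
    "passengers": [242, 172, 188, 203, 355, 359, 622],
})

# Boolean Series for the rows, column labels for the columns.
in_1954 = flights["year"] == 1954
months_and_passengers_1954 = flights.loc[in_1954, ["month", "passengers"]]

print(months_and_passengers_1954)
```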
Conditional Probability and Unconditional Probability
Conditional probability may be explained as the likelihood of an event or outcome occurring, given that a previous event or outcome has occurred. It is usually calculated by dividing the probability that both events occur by the probability of the preceding event.
Problems in which the occurrence of one event affects the likelihood of a following event are classic conditional probability examples.
In the context of mathematics, the probability of occurrence of any event A when another event B in relation to A has already occurred is known as conditional probability.
Our discussion would also include differences between Conditional and Unconditional Probability and round off with the basic differences between Conditional and Joint Probability.
Definition of Conditional Probability
Conditional probability may be defined as the probability of one event occurring given that one or more other events have occurred.
Note that conditional probability does not imply a causal relationship between the two events, nor does it indicate that both events occur simultaneously.
It’s primarily related to the Bayes’ theorem, which is one of the most influential theories in statistics.
The formula for conditional probability is:
P(A|B) = P(A ∩ B) / P(B)
where:
P(A|B) – the probability of event A occurring given that event B has already occurred
P(A ∩ B) – the joint probability of events A and B; the probability that both events A and B occur
P(B) – the probability of event B, which must be greater than zero
The formula above is applied to the calculation of the conditional probability of events that are neither independent nor mutually exclusive.
Another way of calculating conditional probability is Bayes’ theorem. The theorem can be used to determine the conditional probability of event A given that event B has occurred, from the conditional probability of event B given that event A has occurred and the individual probabilities of events A and B. Mathematically, Bayes’ theorem can be written as follows:
P(A|B) = P(B|A) P(A) / P(B)
Conditional Probability for Independent Events
Two events are independent of each other if the probability of the outcome of one event does not influence the probability of the outcome of the other. Therefore, for two independent events A and B:
P(A|B) = P(A)
P(B|A) = P(B)
Conditional Probability for Mutually Exclusive Events
In probability theory, mutually exclusive events are events that cannot occur simultaneously. In other words, if one of the events has occurred, the other cannot occur. Thus, the conditional probability of one mutually exclusive event given the other is always zero.
P(A|B) = 0
P(B|A) = 0
Conditional Probability Examples
Examples using a table of data
A two-way table of data is one of the most common settings for conditional probability problems. Here, we take a look at how to find different probabilities using such a table.
In a survey conducted by a college, both full-time and part-time students were asked how often they had visited the college’s tutoring center in the last month. The results are shown below.
Suppose that a surveyed student is randomly selected.
(a) What is the probability the student visited the tutoring center four or more times, given that the student is full time?
Conditional probability is all about focusing on the information you know. When calculating this probability, we are given that the student is full time. Therefore, we should only look at full-time students to find the probability.
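The probability that the student visited the tutoring center four or more times, given that the student is full time, is the number of full-time students with four or more visits divided by the total number of full-time students.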
(b) Suppose that a student is part-time. What is the probability that the student visited the tutoring center one or fewer times?
This one is a bit trickier, because of the wording. Let us put it in the following way:
Find: probability student visited the tutoring center one or fewer times
Assume or given: a student is part-time (“suppose that a student is part-time”)
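The probability that the student visited the tutoring center one or fewer times, given that the student is part-time, is the number of part-time students with one or fewer visits divided by the total number of part-time students.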
Since we are assuming (or supposing) the student is part-time, we will only look at part-time students for this calculation.
(c) If the student visited the tutoring center four or more times, what is the probability he or she is a part-time student?
As stated above, we must make sure we know what is given, and what we are finding.
Find probability he or she is part-time
Assume or given: the student visited the tutoring center four or more times (“if the student visited the tutoring center four or more times…”)
For this question, we are only looking at students who visited the tutoring center four or more times.
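The probability that the student is part-time, given that the student visited the tutoring center four or more times, is the number of part-time students with four or more visits divided by the total number of students with four or more visits.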
Difference between Conditional & Joint Probability
What is Joint Probability
Joint probability may be explained as a measure of how likely it is for two (or more) things to both occur. For instance, if you roll two dice, there is a probability of getting a six on the first and a four on the second. This is a classic example of joint probability: the probability that both results occur together.
What is Conditional Probability
Conditional probability, on the other hand, may be explained as a measure of how likely one event is to happen if you know that another event has occurred.
For example, what is the probability that the second die shows a four if the sum of the numbers on the two dice is ten? If you know that the sum is ten, it turns out that it is far more likely that the second die is a four than if you knew nothing about the sum.
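The two dice examples can be checked by enumerating all 36 equally likely outcomes. This is a small sketch in plain Python; the variable names are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Joint probability: six on the first die AND four on the second.
joint = Fraction(
    sum(1 for first, second in outcomes if first == 6 and second == 4),
    len(outcomes),
)

# Unconditional probability that the second die shows a four.
unconditional = Fraction(
    sum(1 for _, second in outcomes if second == 4),
    len(outcomes),
)

# Conditional probability: restrict attention to outcomes whose sum is ten,
# then count how many of those have a four on the second die.
sum_is_ten = [(first, second) for first, second in outcomes if first + second == 10]
conditional = Fraction(
    sum(1 for _, second in sum_is_ten if second == 4),
    len(sum_is_ten),
)

print(joint, unconditional, conditional)  # 1/36 1/6 1/3
```

Knowing that the sum is ten leaves only three possible outcomes, (4, 6), (5, 5), and (6, 4), which is why the probability of a four on the second die rises from 1/6 to 1/3.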
Difference between Conditional & Unconditional Probability
Definition of Conditional Probability
Conditional Probability may be explained as a probability that considers some other piece of information, knowledge, or evidence.
Definition of Unconditional Probability
Unconditional Probability may be explained as a probability that does not consider any other information, knowledge, or evidence.
Krishna Singh, an expert on mathematics and statistics, explains the difference between Conditional and Unconditional Probability with the following example:
Conditional Probability Examples
Pulling an ace out of a deck of cards and then drawing a second card without replacing the first. You have a 4/52 chance of getting an ace on the first draw. The chance of an ace on the second draw is 3/51 if the first card was an ace and 4/51 if it was not, so the second probability is conditional on the result of the first.
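The arithmetic for drawing without replacement can be sketched with exact fractions. The variable names are chosen only for this illustration.

```python
from fractions import Fraction

# First draw: 4 aces among 52 cards.
p_first_ace = Fraction(4, 52)

# Second draw depends on what the first card was.
p_second_ace_given_first_ace = Fraction(3, 51)
p_second_ace_given_first_not_ace = Fraction(4, 51)

# Probability of drawing two aces in a row.
p_two_aces = p_first_ace * p_second_ace_given_first_ace

# Overall probability that the second card is an ace, averaging over
# both possibilities for the first card (law of total probability).
p_second_ace = (
    p_first_ace * p_second_ace_given_first_ace
    + (1 - p_first_ace) * p_second_ace_given_first_not_ace
)

print(p_two_aces)    # 1/221
print(p_second_ace)  # 1/13
```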
Unconditional Probability Examples
Rolling a die. The fact that you got a 6 on one roll has no effect on whether you will roll a 6 later on.
Data Science and Conditional Probability
Data science often uses statistical inference to predict or analyze trends from data, and statistical inference makes use of probability distributions of data. Therefore, knowing probability and its applications is important for working effectively with data.
Many data science techniques rely on Bayes’ theorem. Bayes’ theorem is a formula that describes how to update the probabilities of hypotheses when given evidence. You can build a learner using Bayes’ theorem that predicts the probability of the response variable belonging to some class, given a new set of attributes.
Data Science is inextricably linked to Conditional Probability. Data Science professionals must have a thorough understanding of probability to solve complicated data science problems. A strong base in Probability and Conditional Probability is essential to fully understand and implement relevant algorithms for use.
Do you aspire to be a Data Analyst, and then grow further to become a Data Scientist? Does finding answers to complex business challenges interest you? Whatever you want, start early to gain a competitive advantage. You must be fluent in the programming languages and tools that will help you get hired.
You may start as a Data Analyst, go on to become a Data Scientist with some years of experience, and eventually a data evangelist. Data Science offers lucrative career options. There is enough scope for growth and expansion.
You might be a programmer, a mathematics graduate, or simply a bachelor of Computer Applications. Students with a master’s degree in Economics or Social Science can also become data scientists. Take up a Data Science or Data Analytics course to learn Data Science skills and prepare yourself for the Data Scientist job you have been dreaming of.
A Career in Data Science
Taking up a good Data Science or Data Analytics course teaches you the key Data Science skills and prepares you for the Data Scientist role that you aspire to in the near future. Do not forget to include all your skills in your data scientist resume.
Classification may be defined as the process of predicting class or category from observed values or given data points. The categorized output can have the form such as “Black” or “White” or “spam” or “no spam”.
Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which targets are provided along with the input data set.
An example of a classification problem is spam detection in emails. There can be only two categories of output, “spam” and “no spam”; hence this is a binary classification.
To implement this classification, we first need to train the classifier. For this example, “spam” and “no spam” emails would be used as the training data. After successfully training the classifier, we can use it to classify an unknown email.
Types of Learners in Classification
We have two types of learners with respect to classification problems −
Lazy learners − As the name suggests, these learners store the training data and wait for the testing data to appear. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbor and case-based reasoning.
Eager learners − In contrast to lazy learners, eager learners construct a classification model from the training data without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).
Building a Classifier in Python
Scikit-learn, a Python library for machine learning can be used to build a classifier in Python. The steps for building a classifier in Python are as follows −
Step 1: Importing necessary python package
For building a classifier using scikit-learn, we need to import it. We can import it by using following script −
Step 2: Importing dataset
After importing the necessary package, we need a dataset to build the classification model. We can import one from sklearn’s datasets or use another one as per our requirement. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic Database. We can import it with the help of the following script −
from sklearn.datasets import load_breast_cancer
The following script will load the dataset;
data = load_breast_cancer()
We also need to organize the data and it can be done with the help of following scripts −
Step 3: Organizing data into training & testing sets
As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use train_test_split() function of sklearn python package to split the data into sets. The following command will import the function −
from sklearn.model_selection import train_test_split
Now, the next command will split the data into training and testing sets. In this example, we are taking 40 percent of the data for testing and 60 percent of the data for training −
The above series of 0s and 1s in output are the predicted values for the Malignant and Benign tumor classes.
Step 5: Finding accuracy
We can find the accuracy of the model built in the previous step by comparing the two arrays, test_labels and preds. We will use the accuracy_score() function to determine the accuracy.
from sklearn.metrics import accuracy_score
The above output shows that the Naïve Bayes classifier is 95.17% accurate.
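Putting the steps together, an end-to-end sketch might look like the following. Several details are assumptions made for this illustration: the Gaussian variant of Naïve Bayes (GaussianNB), the random_state value, and all variable names other than test_labels and preds. Because the split depends on the random seed, the accuracy you get may differ from the figure quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Step 2: load the dataset and organize it into features and labels.
data = load_breast_cancer()
features = data["data"]
labels = data["target"]          # 0 = malignant, 1 = benign
print(data["target_names"])

# Step 3: keep 40 percent of the data for testing, 60 percent for training.
# random_state is an assumed value that only makes the split repeatable.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)

# Train the classifier and predict on the unseen test set.
# GaussianNB is an assumption about which Naive Bayes variant to use.
gnb = GaussianNB()
model = gnb.fit(train, train_labels)
preds = model.predict(test)

# Step 5: compare the predictions with the true test labels.
accuracy = accuracy_score(test_labels, preds)
print(accuracy)
```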
Classification Evaluation Metrics
The job is not done even when you have finished implementing your machine learning application or model. We must find out how effective our model is. There can be different evaluation metrics, but we must choose carefully, because the choice of metric influences how the performance of a machine learning algorithm is measured and compared.
The following are some of the important classification evaluation metrics among which you can choose based upon your dataset and kind of problem −
Confusion Matrix − It is the easiest way to measure the performance of a classifier where the output can be one of two or more classes.
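As a small sketch, a confusion matrix for a binary problem can be computed with scikit-learn’s confusion_matrix function. The label arrays here are made up purely for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight samples.
true_labels = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(true_labels, predicted)
print(matrix)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)
```

Here one sample of class 1 was predicted as 0 (a false negative) and one sample of class 0 was predicted as 1 (a false positive); the other six predictions were correct.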
Various ML Classification Algorithms
The following are some important ML classification algorithms −
Support Vector Machine (SVM)
We will be discussing all these classification algorithms in detail in further chapters.
Some of the most important applications of classification algorithms are as follows −