Data binning (or bucketing) groups data into bins (or buckets): it replaces the values that fall within a small interval with a single representative value for that interval. Binning can sometimes improve the accuracy of predictive models. This tutorial covers two approaches:
numeric to categorical, which converts numeric variables into categorical ones
sampling, which corresponds to data quantization.
You can download the full code of this tutorial from the GitHub repository.
In this tutorial we use the cupcake.csv dataset, which contains the search trend for the word cupcake on Google Trends. Data are extracted from this link. We use the pandas library to import the dataset and load it into a dataframe through the read_csv() function.
import pandas as pd

df = pd.read_csv('cupcake.csv')
df.head(5)
Numeric to categorical binning
In this case we group the values of the column Cupcake into three groups: small, medium and big. To do so, we need to calculate the intervals within which each group falls. We calculate the overall range as the difference between the maximum and minimum value, and then split it into three equal parts, one for each group. We use the min() and max() functions of the dataframe to calculate the minimum and maximum value of the column Cupcake.
Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):
small: (edge1, edge2)
medium: (edge2, edge3)
big: (edge3, edge4)
We can use the linspace() function of the numpy package to calculate the 4 bin edges, equally spaced.
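import numpy as np

# minimum and maximum of the Cupcake column, as described above
min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
bins = np.linspace(min_value, max_value, 4)
bins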
which gives the following output:
array([ 4., 36., 68., 100.])
Now we define the labels:
labels = ['small', 'medium', 'big']
We can use the cut() function to convert the numeric values of the column Cupcake into the categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True in order to include also the minimum value.
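As a minimal sketch of the cut() step, the example below uses a small synthetic Cupcake column instead of cupcake.csv (the values are an assumption made for illustration) and stores the result in a column named bins.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Cupcake column (assumption: the real values come from cupcake.csv).
df = pd.DataFrame({"Cupcake": [4, 10, 36, 37, 68, 69, 100]})

# Four equally spaced edges delimit three intervals.
bins = np.linspace(df["Cupcake"].min(), df["Cupcake"].max(), 4)
labels = ["small", "medium", "big"]

# include_lowest=True closes the first interval on the left,
# so the minimum value is assigned to "small" instead of NaN.
df["bins"] = pd.cut(df["Cupcake"], bins=bins, labels=labels, include_lowest=True)
print(df)
```

Intervals are closed on the right by default, so a value equal to an inner edge (36 here) falls into the lower group.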
We can plot the distribution of values, by using the hist() function of the matplotlib package.
import matplotlib.pyplot as plt

plt.hist(df['bins'], bins=3)
Sampling is another data binning technique. It reduces the number of samples by grouping similar or contiguous values. There are three approaches to sampling:
by bin means: each value in a bin is replaced by the mean value of the bin.
by bin median: each bin value is replaced by its bin median value.
by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.
In order to perform sampling, the binned_statistic() function of the scipy.stats package can be used. This function receives two arrays as input, x_data and y_data, as well as the statistics to be used (e.g. median or mean) and the number of bins to be created. The function returns the values of the bins as well as the edges of each bin. We can calculate the x values (x_bins) corresponding to the binned values (y_bins) as the values at the center of the bin range.
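The following sketch shows sampling by bin means with binned_statistic(). The x_data and y_data arrays are synthetic (an assumption for illustration); with the real dataset they would be the time index and the Cupcake values.

```python
import numpy as np
from scipy.stats import binned_statistic

# Synthetic data (assumption): 12 ordered x positions and their y values.
x_data = np.arange(12)
y_data = np.array([1, 2, 3, 10, 11, 12, 4, 5, 6, 7, 8, 9], dtype=float)

# Replace the y values falling in each of 4 equal-width bins with the bin mean.
y_bins, bin_edges, bin_number = binned_statistic(
    x_data, y_data, statistic="mean", bins=4
)

# Use the centre of each bin as its representative x value.
x_bins = (bin_edges[:-1] + bin_edges[1:]) / 2

print(x_bins)
print(y_bins)
```

Passing statistic="median" instead gives sampling by bin median; the twelve original points are reduced to four either way.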
In this tutorial I have illustrated how to perform data binning, which is a technique for data preprocessing. Two approaches can be followed. The first approach converts numeric data into categorical data, the second approach performs data sampling, by reducing the number of samples.
Data binning is very useful when discretization is needed.
This is the second part of a Machine Learning series, in which we discuss how to handle features before using the data in machine learning models.
This article covers the basic concepts of modifying features, since data needs to be refined before it can be used for prediction. We need to remove garbage from the data and turn the raw features into high-quality ones.
Your features need to be represented as quantitative (preferably numeric) attributes of the thing you’re sampling. They can be real-world values, such as the readings from a sensor, and other discernible, physical properties. Alternatively, your features can be calculated derivatives, such as the presence or absence of certain edges and curves in an image.
But there is no guarantee that will be the case, and you will often encounter data in textual or other unstructured forms. Luckily, there are a few techniques that when applied, clean up these scenarios.
If you have a categorical feature, the way to represent it in your dataset depends on if it’s ordinal or nominal. For ordinal features, map the order as increasing integers in a single numeric feature.
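A minimal sketch of the ordinal case, assuming a hypothetical size column whose levels have a natural order:

```python
import pandas as pd

# Hypothetical ordinal feature; the column name and levels are illustrative.
df = pd.DataFrame({"size": ["small", "big", "medium", "small"]})

# Spell the order out explicitly: small < medium < big.
order = {"small": 0, "medium": 1, "big": 2}
df["size_code"] = df["size"].map(order)
print(df)
```

Writing the mapping by hand keeps the integer order under your control, rather than leaving it to alphabetical sorting.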
On the other hand, if your feature is nominal (and thus has no obvious numeric ordering), you have two options. The first is to encode it the same way as above. This is a fast-and-dirty approach that imposes an arbitrary order on the values, which may or may not cause problems later. If you aren’t getting the results you hoped for, or you would like to improve accuracy further, a more precise approach is to separate the distinct values out into individual boolean features.
These newly created features are called boolean features because the only values they can contain are 0 for non-inclusion or 1 for inclusion. Pandas’ get_dummies() function lets you replace a single nominal feature with multiple boolean indicator features. It has many configurable options, including sparse output (sparse=True) and column prefixing. Its benefit is that no erroneous ordering is introduced into your dataset.
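As an illustration, here is a small sketch of get_dummies() on a hypothetical nominal color column (the data is made up for the example):

```python
import pandas as pd

# Hypothetical nominal feature with no meaningful order.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Replace "color" with one 0/1 indicator column per distinct value.
dummies = pd.get_dummies(df, columns=["color"], prefix="color", dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, so no ordering between red, green and blue is implied.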
Pure Textual Features
If you are trying to “featurize” a body of text such as a webpage, a tweet, a passage from a newspaper, an entire book, or a PDF document, creating a corpus of words and counting their frequency is an extremely powerful encoding tool. This is known as the Bag of Words model, implemented by the CountVectorizer class in scikit-learn.
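To make the idea concrete, the sketch below builds a bag of words by hand with the standard library. It is a simplified illustration of the counting idea, not scikit-learn's CountVectorizer, and the two documents are made up.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Tokenize naively on whitespace (real tokenizers also handle punctuation).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all words in the corpus.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word.
counts = []
for doc in tokenized:
    freq = Counter(doc)
    counts.append([freq[word] for word in vocabulary])

print(vocabulary)
print(counts)
```

Every document becomes a fixed-length numeric vector, whatever its original length, which is what makes the encoding usable as model input.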
In addition to text and natural language processing, bag of words has successfully been applied to images by categorizing a collection of regions and describing only their appearance, ignoring any spatial structure. However, this is not the typical approach used to represent images as features, and it requires you to come up with methods of categorizing image regions. More commonly used methods include:
Split the image into a grid of smaller areas, and attempt feature extraction at each locality. Return a combined array of all discovered features.
Use variable-length gradients and other transformations as the features, such as regions of high / low luminosity, histogram counts for horizontal and vertical black pixels, stroke and edge detection, etc.
Resize the picture to a fixed size, convert it to grayscale, then encode every pixel as an element in a uni-dimensional feature array.
If you’re wondering what the :: is doing, that is called extended slicing. Notice the .reshape(-1) call. This tells NumPy to take your 2D image and flatten it into a 1D array. It is an all-purpose method you can use to change the shape of your arrays, so long as you maintain the number of elements, for example reshaping a [10, 10] array to [100, 1] or [4, 25]. Another method, .ravel(), does the same thing as .reshape(-1): it unravels a multi-dimensional ndarray into a one-dimensional one. It is important to reshape your 2D images into one-dimensional arrays because each image represents a single sample, and scikit-learn expects your data to have shape [num_samples, num_features].
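The steps described above can be sketched with NumPy as follows. A small synthetic array stands in for a grayscale image (an assumption; a real image would be loaded from disk first).

```python
import numpy as np

# Synthetic 4x6 "grayscale image" (assumption: stands in for real pixel data).
img = np.arange(24).reshape(4, 6)

# Extended slicing: keep every second row and every second column.
small = img[::2, ::2]

# Flatten the 2D image into a 1D feature vector, in two equivalent ways.
flat = small.reshape(-1)
same = small.ravel()

# Stack several flattened images into [num_samples, num_features].
X = np.stack([flat, flat])
print(small.shape, flat.shape, X.shape)
```

Here X has two rows because the same image is stacked twice, purely to show the final layout.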
Most of the time, we will have many non-informative features, for example Name or ID variables, and keeping them results in “garbage in, garbage out”. Extra features also make a model more complex, slower to train, and harder to deploy in production. Many machine learning algorithms suffer from the curse of dimensionality: they do not perform well when given a large number of variables or features. So it’s better to remove highly irrelevant or redundant features to simplify the problem and improve performance.
For instance, if your dataset has columns you don’t need, you can remove them with the drop() method by specifying the column names. axis=1 tells drop() to delete columns, while axis=0 deletes rows.
Or, if you want only selected columns for analysis or visualization, you can select them by enclosing their names in double square brackets.
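Both options can be sketched on a hypothetical dataframe (the column names are illustrative):

```python
import pandas as pd

# Hypothetical dataset with two non-informative columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "height": [170, 182, 165],
    "weight": [65, 80, 58],
})

# Remove unwanted columns; axis=1 means "drop columns".
trimmed = df.drop(["id", "name"], axis=1)

# Or keep only the columns of interest with double square brackets.
subset = df[["height", "weight"]]
print(trimmed.equals(subset))
```

Note that drop() returns a new dataframe and leaves df unchanged.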
Sometimes, we want to remove a feature but use it as an index instead. We can do this by specifying the column as the index when loading the data, for example through the index_col parameter of read_csv().
We can set the column as index later as well by using set_index() method.
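A short sketch of both ways to turn a column into the index, reading a tiny in-memory CSV (made up for the example) instead of a file:

```python
import io
import pandas as pd

csv_text = "id,height,weight\n1,170,65\n2,182,80\n"

# Option 1: choose the index column while loading.
at_load = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Option 2: load first, then promote the column to the index.
later = pd.read_csv(io.StringIO(csv_text)).set_index("id")

print(at_load)
```

In both cases id no longer appears among the feature columns.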
We can further improve the situation of having too many features through dimensionality reduction.
Commonly used techniques are:
PCA (Principal Component Analysis): Considered more of a statistical approach than a machine learning one. It tries to preserve the directions along which the data varies most and to discard those with little variation. Note that it is an unsupervised dimensionality reduction technique: it groups similar data points based on the correlation between features, without using any labels.
t-SNE (t-distributed Stochastic Neighbor Embedding): The target number of dimensions is typically 2 or 3, so t-SNE is used a lot for visualization, since data with more than 3 dimensions is hard for the human brain to picture. t-SNE is remarkably good at keeping points that are close in the high-dimensional space close in the low-dimensional one.
Feature embedding: Based on training a separate machine learning model to encode a large number of features into a small number of features.
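To give a feel for the first technique, here is a minimal PCA sketch written directly with NumPy's SVD rather than a dedicated library. The data is synthetic (an assumption): three features, of which the second is almost a multiple of the first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, feature 2 is nearly 2 * feature 1.
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# PCA: centre the data, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance carried by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two components: 3 features become 2.
X_reduced = X_centered @ Vt[:2].T
print(explained_variance_ratio)
```

Because two of the three features are nearly redundant, almost all of the variance should survive the reduction to two components.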
Pandas will automatically attempt to figure out the best data type for each series in your dataset. Most of the time it does this flawlessly, but other times it fails horribly! The .read_html() method in particular is notorious for defaulting all series data types to Python objects. You should check, and double-check, the actual type of each column in your dataset to avoid unwanted surprises. If your data types don’t look the way you expected, explicitly convert them to the desired type using the pd.to_datetime(), pd.to_numeric(), and pd.to_timedelta() functions:
Take note how to_numeric converts to float or integer depending on the data it finds. The errors='coerce' parameter instructs Pandas to enter a NaN in any field where the conversion fails.
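A small sketch of these conversions on a hypothetical frame whose columns were all read as strings:

```python
import pandas as pd

# Hypothetical raw data: every column arrived as text.
raw = pd.DataFrame({
    "amount": ["1", "2.5", "n/a"],
    "when": ["2021-01-05", "2021-02-10", "2021-03-15"],
})
print(raw.dtypes)

# "n/a" cannot be parsed, so errors="coerce" turns it into NaN.
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
raw["when"] = pd.to_datetime(raw["when"])
print(raw.dtypes)
```

Printing dtypes before and after is a quick way to confirm that the conversion did what you expected.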
Sometimes, even though the data type is correct, we still need to modify the values in a feature. For example, we may need to divide all the values by 10, or convert them to their logarithms.
Just as oil needs to be refined before it is used, similarly data needs to be refined before we use it for machine learning. Sometimes, we need to derive new features out of existing features. The process of extracting new features from existing ones is called feature engineering. Classical Machine Learning depends on feature engineering much more than Deep Learning.
Below are some types of Feature Engineering.
Aggregation: New features are created by computing a count, sum, mean, or median over a group of entities.
Part-Of: New features are created by extracting a part of a data structure, e.g. extracting the month from a date.
Binning: You group your entities into bins and then apply aggregations over those bins. Example: group customers by age and then calculate the average purchase within each group.
Flagging: You derive a boolean (0/1 or True/False) value for each entity.
Example: we summarize data by finding its sum, average, minimum or maximum value and then create new features from those values.
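The four types listed above can be sketched on a hypothetical orders table. All column names, thresholds and values below are assumptions chosen for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "c"],
    "age": [23, 23, 41, 67],
    "date": pd.to_datetime(["2021-01-15", "2021-03-02", "2021-03-20", "2021-07-04"]),
    "purchase": [10.0, 30.0, 50.0, 20.0],
})

# Aggregation: total purchase per customer, repeated on each of their rows.
orders["customer_total"] = orders.groupby("customer")["purchase"].transform("sum")

# Part-Of: extract the month from the date.
orders["month"] = orders["date"].dt.month

# Binning: group customers by age, then average the purchases within each group.
orders["age_group"] = pd.cut(
    orders["age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"]
)
orders["group_avg_purchase"] = (
    orders.groupby("age_group", observed=True)["purchase"].transform("mean")
)

# Flagging: mark purchases above an (arbitrary) threshold.
orders["large_purchase"] = orders["purchase"] > 25
print(orders)
```

Using transform() rather than agg() keeps one value per original row, so each result can be stored directly as a new feature column.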
Covariance and correlation are widely used measures in statistics, and thus both are very important concepts in data science. Covariance and correlation provide insight into the relationship between random variables or features in a dataset. Although these two concepts are closely related, we need to interpret them carefully to avoid misunderstandings.
Covariance is a quantitative measure of how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of their deviations from their mean (or expected) value. Consider the random variables “X” and “Y”. Some realizations of these variables are shown in the figure below. The orange dot shows the mean of X and the mean of Y. As the values of X move away from the mean of X in the positive direction, the values of Y tend to change in a similar way. The same relation holds in the negative direction as well.
The formula for the covariance of two random variables X and Y is:
$$\operatorname{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
where E means the expectation and µ is the mean.
If X and Y change in the same direction, as in the figure above, covariance is positive. Let’s confirm with the covariance function of numpy:
np.cov() returns the covariance matrix. The covariance of X and Y is 0.11. The value at position [0,0] is the covariance of X with itself and the value at [1,1] is the covariance of Y with itself. If you run np.cov(X,X), every entry of the returned matrix equals the value at position [0,0], which is 0.07707877 in this case. Similarly, every entry of np.cov(Y,Y) equals the value at position [1,1].
The covariance of a variable with itself is the variance of that variable:
$$\operatorname{Cov}(X, X) = E\left[(X - \mu_X)^2\right] = \operatorname{Var}(X)$$
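The sketch below reproduces this behaviour of np.cov() on synthetic data. The variables are generated here for illustration, so the numbers differ from those quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic variables that tend to move in the same direction.
X = rng.normal(size=500)
Y = 0.8 * X + rng.normal(scale=0.5, size=500)

cov_matrix = np.cov(X, Y)
cov_xy = cov_matrix[0, 1]

# The diagonal holds the sample variances (np.cov divides by n - 1).
print(cov_matrix)
print(np.var(X, ddof=1), np.var(Y, ddof=1))
```

The ddof=1 argument matters: np.var() divides by n by default, whereas np.cov() divides by n - 1.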
Let’s go over another example. The figure below shows some realizations of random variables Z and T. As we can see, as T increases, Z tends to decrease. Thus, the covariance of Z and T should be negative:
We may also see variables whose variations are independent of each other. For example, in the figure below, the realizations of variables A and B seem to change randomly with respect to each other. In this case, we expect to see a covariance value that is close to zero. Let’s confirm:
The following example will provide a little more intuition about the calculation of covariance.
Covariance describes how similarly two random variables deviate from their means. The red lines show the means of the series. The mean of s1 is the vertical line (x=8.5) and the mean of s2 is the horizontal line (y=9.3). Deviation from the mean is the difference between a value and the mean. Covariance is proportional to the product of the deviations of the s1 and s2 values. Consider the upper right rectangle in the plot above. Both s1 and s2 values are higher than the means of s1 and s2, respectively, so the deviations are positive. When we multiply two positive values, we get a positive value. In the lower left rectangle, s1 and s2 values are lower than the means of s1 and s2, respectively. The deviations are negative, but we get a positive number when two negative numbers are multiplied. For the points in the lower right and upper left rectangles, the deviation of s1 is positive when the deviation of s2 is negative, and vice versa, so we get a negative number when the two deviations are multiplied. All the products are combined to get the covariance. Hence, if the points in the negative regions outweigh those in the positive regions, we get a negative covariance.
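The same reasoning can be traced numerically. The two short series below are hypothetical (they are not the s1 and s2 from the plot); the sketch multiplies the deviations point by point and compares the result with np.cov().

```python
import numpy as np

# Hypothetical series, chosen so that every point lies in the
# upper right or lower left region relative to the two means.
s1 = np.array([5.0, 7.0, 9.0, 13.0])
s2 = np.array([6.0, 8.0, 11.0, 12.0])

# Deviation of each value from its own mean.
dev1 = s1 - s1.mean()
dev2 = s2 - s2.mean()

# Products are positive when both deviations share a sign.
products = dev1 * dev2

# Sample covariance: sum of products divided by n - 1.
cov_manual = products.sum() / (len(s1) - 1)
print(products, cov_manual, np.cov(s1, s2)[0, 1])
```

All four products are positive here, so the covariance is positive as well.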
Correlation is a normalization of covariance by the standard deviation of each variable:
$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where σ is the standard deviation.
This normalization cancels out the units, so the correlation value always lies between -1 and 1: its absolute value is between 0 and 1, and the sign is negative when the two variables are negatively correlated. If we are comparing the relationships among three or more variables, it is better to use correlation, because differing value ranges or units may lead to false conclusions.
Consider the dataframe below:
We want to measure the relationship between X-Y and X-Z. We want to find out which variable (Y or Z) is more correlated with X. Let’s use covariance first:
The covariance of X and Z is much higher than the covariance of X and Y. We might think the relationship between the deviations in X and Z is much stronger than that of X and Y. However, that is not the case. The covariance of X and Z is higher because of the value ranges. The Z values range between 22 and 222, whereas the values of Y are around 1 (most of them are less than 1). Therefore, we need to use correlation to eliminate the effect of different value ranges.
As we can see from the correlation matrix, X and Y are actually more correlated than X and Z.
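The effect can be reproduced with a synthetic dataframe. The values below are generated for illustration and are not the data from the table above: Y is small in scale but tightly tied to X, while Z is large in scale but noisy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=300)

df = pd.DataFrame({
    "X": x,
    "Y": 0.5 * x + rng.normal(scale=0.1, size=300),  # small scale, tight relationship
    "Z": 100 * x + rng.normal(scale=200, size=300),  # large scale, noisy relationship
})

cov = df.cov()
corr = df.corr()
print(cov.loc["X", ["Y", "Z"]])
print(corr.loc["X", ["Y", "Z"]])
```

Covariance ranks Z above Y purely because of its scale, while correlation ranks Y above Z.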
Learn how to present the relationships amongst the features using multivariate charts and plots in Python
While dealing with a big dataset, it is important to understand the relationship between the features. That is a big part of data analysis. The relationships can be between two variables or amongst several variables. I will discuss how to present the relationships between multiple variables with some simple techniques. Python’s Numpy, Pandas, Matplotlib, and Seaborn libraries will be used.
First, import the necessary packages and the dataset to be used.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

df = pd.read_csv("nhanes_2015_2016.csv")
This dataset is very large. At least too large to show a screenshot here. Here are the columns in this dataset.
Column names may look strange to you. I will keep explaining as we keep using them.
In this dataset, we have two systolic blood pressure measurements (‘BPXSY1’, ‘BPXSY2’) and two diastolic blood pressure measurements (‘BPXDI1’, ‘BPXDI2’). It is worth checking whether there is any relationship between them. Observe the relationship between the first and second systolic blood pressure.
Scatter plots have long been used to find the relation between two variables. They are the most popular, basic, and easily understandable way of looking at a relationship between two variables.
The relationship between the two systolic blood pressures is positively linear. There is a lot of overlapping observed in the plot.
2. To understand the systolic and diastolic blood pressure data and their relationships more, make a joint plot. Jointplot shows the density of the data and the distribution of both the variables at the same time.
From the two correlation charts above, the correlation between the two systolic blood pressures is 1% higher in the female population than in the male population. If these things are new to you, I encourage you to try examining the correlation between the two diastolic blood pressures, or between systolic and diastolic blood pressure.
4. Human behavior can change with so many different factors such as gender, education level, ethnicity, financial situation, and so on. In this dataset, we have ethnicity (“RIDRETH1”) information as well. Check the effect of both ethnicity and gender on the relationship between both the systolic blood pressures.
Across ethnic origins and genders, the correlations change a little but generally stay positive and linear, as before.
5. Now, focus on some other variables in the dataset. Find the relationship between education and marital status.
Both the education column (‘DMDEDUC2’) and the marital status column (‘DMDMARTL’) are categorical. First, replace the numerical values with strings that make sense. We also need to get rid of values that add no useful information to the chart: for example, the education column has some ‘Don’t know’ values and the marital status column has some ‘Refused’ values.
Finally, we got this DataFrame that is clean and ready for the chart.
x = pd.crosstab(db.DMDEDUC2x, db.DMDMARTLx)
x
Here is the result. The numbers look very simple to understand. But a chart of population proportions will be a more appropriate presentation. I am getting a population proportion based on marital status.
x.apply(lambda z: z/z.sum(), axis=1)
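The whole sequence (recode, filter, cross-tabulate, normalize) can be sketched on a tiny synthetic table. The codes and labels below are hypothetical and are not the official NHANES coding.

```python
import pandas as pd

# Hypothetical coded survey answers (not the real NHANES codes).
db = pd.DataFrame({
    "education": [1, 1, 2, 2, 2, 9],
    "marital": [1, 2, 1, 1, 2, 1],
})

# Replace the numeric codes with readable strings.
db["education_x"] = db["education"].map({1: "School", 2: "College", 9: "Don't know"})
db["marital_x"] = db["marital"].map({1: "Married", 2: "Single"})

# Drop answers that add no information to the chart.
db = db[db["education_x"] != "Don't know"]

# Counts per (education, marital status) pair, then proportions within each row.
counts = pd.crosstab(db["education_x"], db["marital_x"])
proportions = counts.apply(lambda row: row / row.sum(), axis=1)
print(proportions)
```

With axis=1 the function is applied to each row, so the proportions in every education level sum to 1.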
6. Find the population proportion of marital status segregated by Ethnicity (‘RIDRETH1’) and education level.
First, replace the numeric values in the ethnicity column with meaningful strings. I found these string values on the Centers for Disease Control and Prevention (CDC) website.
Here, blue shows the male population distribution and orange the female population distribution. Only the ‘never married’ and ‘living with partner’ categories have similar distributions for the male and female populations. Every other category shows a notable difference between the two.
Introducing the exploretransform package for Python
Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis¹
76% of data scientists view data preparation as the least enjoyable part of their work²
In this article, I will be demonstrating Python’s exploretransform package. It can save time during data exploration and transformation and hopefully make your data preparation more enjoyable!
I originally developed exploretransform for use in my own projects, but I figured it might be useful for others. My intention was to create a simple set of functions and classes that returned results in common Python data formats. This would enable practitioners to easily utilize the outputs or extend the original functions as part of their workflows.
How to use exploretransform:
Installation and import
!pip install exploretransform
import exploretransform as et
Let’s start by loading the Boston corrected dataset.
df, X, y = et.loadboston()
At this stage, I like to check that the data types align with the data dictionary and first five observations. Also, the # of lvls can indicate potential categorical features or features with high cardinality. Any dates or other data that need reformatting can also be detected here. We can use peek() here.
After analyzing the data types, we can use explore() to identify missing, zero, and infinity values.
Earlier, we saw that town was likely a categorical feature with high cardinality. We can use freq() to analyze categorical or ordinal features; it provides the count, percent, and cumulative percent for each level.
t = et.freq(X['town'])
t
To visualize the results of freq() we can use plotfreq(). It generates a bar plot showing the levels in descending order.
To pair with histograms you probably normally examine, skewstats() returns the skewness statistics and magnitude for each numeric feature. When you have too many features to easily plot, this function becomes more useful.
In order to determine the association between the predictors and the target, ascores() calculates pearson, kendall, spearman, mic, and dcor statistics. A variety of these scores is useful since certain scores measure linear associations and others will detect non-linear relationships.
Correlation matrices can get unwieldy once we hit a certain number of features. While the Boston dataset is well below this threshold, one can imagine that a table might be more useful than a matrix when dealing with high dimensionality. The corrtable() function returns a table of all pairwise correlations and uses the average correlation for the row and column to decide on potential drop/filter candidates. You can use any of the methods you normally would with the pandas corr function:
CategoricalOtherLevel() is a custom transformer that creates an “other” level in categorical / ordinal data based on a threshold. This is useful in situations where you have high cardinality predictors and when new categories may appear in future data.
co = et.CategoricalOtherLevel(colname = 'town', threshold = 0.015).fit_transform(cs)
co.iloc[0:15, :]
CorrelationFilter() is a custom transformer that filters numeric features based on pairwise correlation. It uses corrtable() and calcdrop() to perform the drop evaluations and calculations. For more information on how it works please see: Are you dropping too many correlated features?
This article is going to be about the first look every data enthusiast has taken into their project’s dataset. Before machine learning, before modeling, before feature selection — there has to be a fundamental understanding of the data you are using. That’s what we are doing — exploring. This article is about EDA, exploratory data analysis.
We will go through several steps of analysis and introduce a few techniques that help us determine the best course of action. For this article, I am going to assume you understand the difference between continuous and categorical data and have some knowledge of the different packages Python has to offer.
First, we are going to load a dataset that is relatively small and easy to understand. Luckily, Seaborn has a few datasets to choose from, and I’ve decided to go with the ‘tips’ dataset, along with some other packages for this session: Matplotlib, Pandas, and SciPy. So the first step is to load it into a data frame so it can be used.
While we are not going to necessarily need it for this data frame, you may run into data sets with hundreds of features and thousands of rows.
1. Descriptive Statistics
Pandas has a great method for when you want an overview of a dataset: the .describe() method. Used on a data frame, it shows the statistical breakdown of the data frame. It is a great place to start, and we can tailor it to the types of features we have. In this data frame, we have both continuous and categorical features. The statistical breakdown works differently on each, but you can use .describe() on both.
The output for both at the same time is interesting, so let’s take a look.
The output from the code gives us insight into the statistics of the data frame. At the top of the breakdown is information about the data frame itself. The count is the number of observations (rows) that are in the data frame.
Before I go into the rest of the rows, I want to point out the NaN values. When you use the .describe() method on all of the features at once you apply all the statistics to all of them. This means that you will have statistics that aren’t applicable to one feature or another. For instance, the mean doesn’t make sense on a feature using days of the week. Mean is for numbers, the day of the week is a word. The same goes for unique — you can’t ask for the unique values of tips because there could be any number of different tip amounts.
Below that, are the statistics for categorical values. I am going to give a breakdown of the rest of the categorical statistics.
Remember, the statistics in the list below will result in NaN for continuous variables.
Unique: how many different entries are in the variable.
Top: the category that appears most frequently in the data frame.
Freq: the number of times the most frequent value appears in the data frame.
Next are the statistics for continuous variables, which will result in NaN for categorical variables.
Mean: the average of all of the observations in that feature
Std: the standard deviation of all of the observations in that feature
Min: the lowest value out of all the observations
25%: the lower quartile; a quarter of the observations fall at or below this value
50%: the median of the feature
75%: the upper quartile; three quarters of the observations fall at or below this value
Max: the largest value of the feature
I would recommend using the include=’all’ parameter when using the .describe() method. It saves time and is still very easy to understand.
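As a small sketch of this recommendation, the example below calls .describe(include='all') on a four-row frame that imitates two columns of the tips dataset (the values are made up):

```python
import pandas as pd

# Tiny stand-in for the tips data: one continuous and one categorical column.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "day": ["Sat", "Sun", "Sat", "Sat"],
})

summary = tips.describe(include="all")
print(summary)
```

The unique, top and freq rows are NaN for total_bill, and the mean, std and quartile rows are NaN for day, exactly as described above.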
An extension of the .describe() approach is the .groupby() method. We can delve deeper into different features and use that information to make more informed decisions. Assume you want to know whether there is any relationship between the day a meal was eaten and the total bill.
When this code was run, it returned the following:
What is seen above is each day with the average value of the total bill and the size of the party. This can be done with any combination of continuous variables. More documentation on Pandas .groupby() can be found here.
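A grouping of this kind can be sketched as follows, again on a small made-up frame rather than the full tips dataset:

```python
import pandas as pd

tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 30.0, 20.0, 40.0],
    "size": [2, 4, 2, 6],
})

# Average total bill and party size for each day.
by_day = tips.groupby("day")[["total_bill", "size"]].mean()
print(by_day)
```

Swapping .mean() for .median(), .sum() or .agg([...]) gives other aggregate views of the same groups.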
The default order of .value_counts() is descending, so it returns the most frequently occurring day at the top. Notice the last line, which shows Name and dtype. It simply gives the name of the column that was counted and the type of the values returned. Each of the counts is returned in the int64 format.
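The counting behaviour described here matches pandas' Series.value_counts(). A minimal sketch on a hypothetical day column:

```python
import pandas as pd

days = pd.Series(["Sat", "Sun", "Sat", "Fri", "Sat", "Sun"], name="day")

# Count how often each day occurs; the most frequent comes first.
counts = days.value_counts()
print(counts)
```

Passing ascending=True reverses the order, and normalize=True returns proportions instead of counts.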
Testing for correlation is the process of establishing a relationship or connection between two or more measures. Right now, we are going to look at whether an increase in the total bill results in an increase in the tip left for the server.
From the plot, we can see there is some linear relationship between the two variables. However, just because two variables seem to increase at the same time doesn’t mean that we know to what degree they both change. For this analysis, we need a more refined method of seeking correlation using statistics.
4. Correlation Statistics
To understand the idea of correlation statistics on a fundamental level, we need to know about two concepts: the Pearson coefficient and the p-value. First, let’s talk about the Pearson coefficient.
The Pearson coefficient is a statistic that measures the linear correlation between two variables a and b. It has a value between +1 and −1. A value of +1 is a total positive linear correlation, 0 is no linear correlation, and −1 is a total negative linear correlation.
As stated above, correlations of +1 and -1 are very strong, positive and negative respectively. A strong correlation does not indicate the slope of the line but only the tightness of fit to the sloped line.
The p-value is the probability that a correlation at least as strong as the observed one would occur by random chance alone if the two variables were in fact unrelated. The lower the p-value, the less likely it is that the connection between the two variables arose randomly.
SciPy stats Module
Finally, we will use Python to determine the Pearson coefficient and the p-value for the relationship between the total bill and the tip given. Using the stats module of the SciPy package, we call the pearsonr() function.
The output from the code above is:
Analyzing our output, we can come to two very important conclusions. Firstly, the Pearson coefficient is relatively high. With a value of 0.67, there is a relatively strong positive correlation between the total bill and the tips given.
Secondly, the p-value is very small. So small, in fact, that we can reject the hypothesis that there is no linear correlation between the two variables.
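For readers who want to try pearsonr() without downloading anything, here is a sketch on synthetic bills and tips. The data is generated for illustration, so the coefficient will not equal the 0.67 reported above.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)

# Synthetic data (assumption): tips are roughly 15% of the bill plus noise.
total_bill = rng.uniform(5, 50, size=200)
tip = 0.15 * total_bill + rng.normal(scale=1.5, size=200)

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = pearsonr(total_bill, tip)
print(r, p_value)
```

The coefficient agrees with the off-diagonal entry of np.corrcoef(total_bill, tip); what pearsonr() adds is the p-value.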
To summarize what we’ve learned today:
It is important to get a feel for the dimensions of a data frame before beginning to work with it.
Usage of the .describe() method to examine the continuous and categorical variables
We use .groupby() to see how specific attributes stack up in terms of aggregate functions
Correlation is the idea that two variables change together.
The Pearson coefficient gives the correlation value. +1 is a total positive linear correlation, -1 is a total negative linear correlation, and 0 is no linear correlation.
The p-value indicates how likely a result at least as extreme as the observed one would be if there were no real relationship. A large p-value means the result is consistent with chance. A very small p-value means chance alone is an unlikely explanation, so the result is considered statistically significant.
I hope this guide helped you with the beginnings of your data science project.
Investigating Population, Gender Equality in Education & Income for Singapore, United States and China
Exploratory Data Analysis (EDA) is one of the most important aspects of every data science or data analysis problem. It gives us a greater understanding of our data and can reveal hidden insights that aren’t obvious to us. This post focuses on graphical EDA in Python using matplotlib, regression lines and even motion charts!
The datasets we are using for this article can be obtained from Gapminder, and cover Population, Gender Equality in Education and Income.
The Population data contains yearly data regarding the estimated resident population, grouped by countries around the world between 1800 and 2018.
The Gender Equality in Education data contains yearly data between 1970 and 2015 on the ratio of females to males in school among 25 to 34 year olds, covering primary, secondary and tertiary education across different countries.
The Income data contains yearly data of income per person adjusted for differences in purchasing power (in international dollars) across different countries around the world, for the period between 1800 and 2018.
EDA on Population
Let’s first plot the population data over time, and focus mainly on the three countries Singapore, United States and China. We will use matplotlib library to plot 3 different line charts on the same figure.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# read in data
population = pd.read_csv('./population.csv')

# plot for the 3 countries
plt.plot(population.Year, population.Singapore, label="Singapore")
plt.plot(population.Year, population.China, label="China")
plt.plot(population.Year, population["United States"], label="United States")

# add legends, labels and title
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population Growth over time')
plt.show()
As seen in the figure, the population values for the 3 countries Singapore, China and United States are increasing over time, though Singapore is not that visible since the axis is in billions, while the population in Singapore is only in the millions.
Now, let’s try to fit a linear regression line to the Singapore population data using linregress, and plot the linear fit. We can even try predicting the Singapore population in 2020 and 2100.
from scipy.stats import linregress

# set up regression line
slope, intercept, r_value, p_value, std_err = linregress(population.Year, population.Singapore)
line = [slope*xi + intercept for xi in population.Year]

# plot the regression line and the data points
plt.plot(population.Year, line, 'r-', linewidth=3, label='Linear Regression Line')
plt.scatter(population.Year, population.Singapore, label='Population of Singapore')
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population Growth of Singapore over time')
plt.show()

# Calculate correlation coefficient to see how good the linear fit is
print("The correlation coefficient is " + str(r_value))

# Use the linear fit to predict the resident population in Singapore in 2020 and 2100.
# Using equation y=mx + c, i.e. population=slope*year + intercept
print("The predicted population in Singapore in 2020 will be " + str((slope*2020)+intercept))
print("The predicted population in Singapore in 2100 will be " + str((slope*2100)+intercept))
From the figure, we see that the linear fit does not match the population of Singapore that well, even though the correlation coefficient is close to 1. The prediction was also way off: the current population of Singapore in 2020 is around 5.6 million, well above the 3.4 million predicted.
Notice that the fitted population before the 1850s is negative, which is impossible. Since Singapore became an independent nation in 1965, let’s filter to only use data from 1965 onwards.
from scipy.stats import linregress

# keep only the rows from 1965 onwards
recent = population[population.Year >= 1965]

# set up regression line
slope, intercept, r_value, p_value, std_err = linregress(recent.Year, recent.Singapore)
line = [slope*xi + intercept for xi in recent.Year]

# plot the regression line and the data points
plt.plot(recent.Year, line, 'r-', linewidth=3, label='Linear Regression Line')
plt.scatter(recent.Year, recent.Singapore, label='Singapore')
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population Growth of Singapore from 1965 onwards')
plt.show()

# Calculate correlation coefficient to see how good the linear fit is
print("The correlation coefficient is " + str(r_value))

# Use the linear fit to predict the resident population in Singapore in 2020 and 2100.
# Using equation y=mx + c, i.e. population=slope*year + intercept
print("The predicted population in Singapore in 2020 will be " + str((slope*2020)+intercept))
print("The predicted population in Singapore in 2100 will be " + str((slope*2100)+intercept))
This linear regression line fits much better, as shown by both the graph and the correlation coefficient. Furthermore, the predicted 2020 population closely matches the current population of Singapore. Let’s hope the 2100 prediction does not come true, since the land area of Singapore is quite small.
EDA on Gender Equality in Education
Moving onto the second dataset, let’s try to plot the gender ratio (females to males) in schools for Singapore, China and the United States over time. We can also look at the maximum and minimum gender ratio percentage in Singapore.
# reading in data
gender_equality = pd.read_csv('./GenderEquality.csv')

# plot the graphs
plt.plot(gender_equality.Year, gender_equality.Singapore, label="Singapore")
plt.plot(gender_equality.Year, gender_equality.China, label="China")
plt.plot(gender_equality.Year, gender_equality["United States"], label="United States")

# set up legends, labels and title
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Gender Ratio of Female to Male in school')
plt.title('Gender Ratio of Female to Male in school over time')
plt.show()

# What are the maximum and minimum values for gender ratio in Singapore over the time period?
print("The maximum value is: " + str(max(gender_equality.Singapore)) + " and the minimum is " + str(min(gender_equality.Singapore)))
The gender ratios were generally increasing over time, as seen in the output above. The gender ratios for China and Singapore increased roughly linearly over time. For the United States, there were certain periods in which the gender ratio was stagnant before increasing again. The minimum gender ratio for Singapore was 79.5 while the maximum was 98.9. This was expected, since in the past education in Singapore was considered more important for males than for females.
Let’s plot the linear regression line on the gender ratio for Singapore.
# set up regression line
slope, intercept, r_value, p_value, std_err = linregress(gender_equality.Year, gender_equality["Singapore"])
line = [slope*xi + intercept for xi in gender_equality.Year]

# plot the regression line and the data
plt.plot(gender_equality.Year, line, 'r-', linewidth=3, label='Linear Regression Line')
plt.plot(gender_equality.Year, gender_equality["Singapore"], label='Singapore')
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Gender Ratio of Female to Male in school')
plt.title('Gender Ratio of Female to Male in school for Singapore over time')
plt.show()
print("The correlation coefficient is " + str(r_value))
The correlation coefficient suggests that the line is a good fit and that the gender ratio could reach 100% in the future. This is plausible, as education is no longer a privilege in Singapore: both males and females have equal opportunities to receive formal education.
EDA on Income
Let’s finally move to Income data and plot the income of Singapore, United States and China over time.
# read in data
income = pd.read_csv('./Income.csv')

# plot the graphs (the Singapore line must use the Singapore column)
plt.plot(income.Year, income.Singapore, label="Singapore")
plt.plot(income.Year, income.China, label="China")
plt.plot(income.Year, income["United States"], label="United States")

# set up legends, labels, title
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Income per person')
plt.title('Income per person over time')
plt.show()
Surprisingly, the income per person in Singapore is comparable to the United States, with both above those in China.
Motion Chart — Visualising relationships over time
Now, let’s try to build a motion chart to visualise relationships over time for all three factors: Population, Gender Ratio and Income. In order to build a motion chart in Python, we will need the motionchart library.
Before that, we will need to merge all three datasets into a single one to plot our motion chart easily. Merging can be done using common pandas commands.
# Convert columns into rows for each data set based on country and population/gender ratio/income
population = pd.melt(population, id_vars=['Year'], var_name='Country', value_name='Population')
gender_equality = pd.melt(gender_equality, id_vars=['Year'], var_name='Country', value_name='Gender Ratio')
income = pd.melt(income, id_vars=['Year'], var_name='Country', value_name='Income')

# Merge the 3 datasets into one on common year and country
overall = pd.merge(population, gender_equality, how="inner", on=["Year","Country"])
overall = pd.merge(overall, income, how="inner", on=["Year","Country"])
To visualise relationships over time, we need to set the Year attribute as the key in our motion chart. Our x-axis will be the Gender Ratio, the y-axis the Income, the size of each bubble the Population and, lastly, the colour of each bubble the Country.
If we explore this motion chart, we see that Afghanistan and Yemen had the lowest gender ratios in education, 23.7 and 30.1 respectively. Lesotho, in southern Africa, has the highest gender ratio throughout (note the little pink dot at the bottom right).
There is generally no clear relationship between income and gender ratio in education. Over the whole period, the gender ratio generally increased for all countries, but income neither consistently rose nor consistently fell along with it. Income showed a mix of stagnation, increase and decrease, with no clear relationship to the gender ratio.
Let’s focus on building a motion chart for just Singapore.
Interestingly for Singapore, besides Population increasing over time, Gender Ratio in Education and Income also seem to increase steadily over time. Income was at 11400 in 1970 and increased tremendously to 80900 in 2015.
In this article, we made use of Python matplotlib, linear regression and fancy motion charts to conduct exploratory data analysis on three datasets, namely Population, Gender Ratio in Education and Income. Through these graphical methods, we can discover insights in our data and potentially make better predictions. Hope you guys enjoy this graphical approach to Exploratory Data Analysis in Python, and have fun playing with your fancy motion charts!
Handling missing values is an important data preprocessing step in machine learning pipelines.
Pandas is versatile in terms of detecting and handling missing values. However, when it comes to model training and evaluation with cross validation, there is a better approach.
The imputer of scikit-learn, combined with pipelines, provides a more practical way of handling missing values in the cross validation process.
In this post, we will first do a few examples that show different ways to handle missing values with Pandas. After that, I will explain why we need a different approach to handle missing values in cross validation.
Finally, we will do an example using the missing value imputer and pipeline of scikit-learn.
Let’s start with Pandas. Here is a simple dataframe with a few missing values.
import numpy as np
import pandas as pd
# cast to float so that NaN can be assigned without changing the column dtype
df = pd.DataFrame(np.random.randint(10, size=(8,5)), columns=list('ABCDE')).astype(float)
df.iloc[[1,4],[0,3]] = np.nan
df.iloc[[3,7],[1,2,4]] = np.nan
The isna function returns Boolean values indicating the cells with missing values. The isna().sum() gives us the number of missing values in each column.
df.isna().sum()
A    2
B    2
C    2
D    2
E    2
dtype: int64
The fillna function is used to handle missing values. It provides many options to fill in. Let’s use a different method for each column.
The missing values in columns A, B, and C are filled with the mean, median, and mode of the column, respectively. For column D, we used the ‘ffill’ method, which uses the previous value in the column to fill a missing value. The ‘bfill’ method, used for column E, does the opposite.
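The fill strategies described above might look like the sketch below. It is an illustration and not the original snippet: the seeded random generator and the use of bfill on column E are assumptions made so the example is reproducible and self-contained.

```python
import numpy as np
import pandas as pd

# Same layout as the dataframe above, with a fixed seed (an assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(8, 5)).astype(float),
                  columns=list("ABCDE"))
df.iloc[[1, 4], [0, 3]] = np.nan
df.iloc[[3, 7], [1, 2, 4]] = np.nan

df["A"] = df["A"].fillna(df["A"].mean())     # mean of the column
df["B"] = df["B"].fillna(df["B"].median())   # median of the column
df["C"] = df["C"].fillna(df["C"].mode()[0])  # most frequent value
df["D"] = df["D"].ffill()                    # carry the previous value forward
df["E"] = df["E"].bfill()                    # pull the next value backward

remaining = df.isna().sum()
print(remaining)
```

Because the last row of column E is missing and has no later value to copy from, one missing value is left in that column.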
Here is the updated version of the dataframe:
We still have one missing value in column E because we used the ‘bfill’ method for this column. With this method, missing values are filled with the value that comes after them. Since the last value in the column is missing, there is nothing after it to copy, so it was not changed.
The fillna function also accepts constant values. Let’s replace the last missing value with a constant.
As you have seen, the fillna function is pretty flexible. However, when it comes to training machine learning models, we need to be careful about how we handle missing values.
Unless we use constant values, the missing values need to be handled after splitting the training and test sets. Otherwise, the model will be given information about the test set which causes data leakage.
Data leakage is a serious issue in machine learning. Machine learning models should not be given any information about the test set. The data points in the test sets need to be previously unseen.
If we use the mean of the entire data set to fill in missing values, we leak information about the test set to the model.
One solution is to handle missing values after train-test split. It is definitely an acceptable way. What if we want to do cross validation?
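For the plain train-test case, a minimal sketch of leakage-free imputation is shown below. The column names x and y, the synthetic values and the split size are all assumptions for illustration. The key point is that the fill value is computed from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with a few missing values in column "x"
rng = np.random.default_rng(42)
data = pd.DataFrame({"x": rng.normal(size=20), "y": rng.normal(size=20)})
data.loc[[2, 7, 15], "x"] = np.nan

# Split first, then compute the statistic on the training part only
train, test = train_test_split(data, test_size=0.25, random_state=0)
train_mean = train["x"].mean()

# Apply the same training-set mean to both parts
train = train.assign(x=train["x"].fillna(train_mean))
test = test.assign(x=test["x"].fillna(train_mean))
```

The test set never contributes to the mean, so no information about it reaches the model.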
Cross validation means partitioning the data set into subsets (i.e. folds) and then running several iterations with different combinations of folds, so that each example is used in both training and testing.
Consider the case with 5-fold cross validation. The data set is divided into 5 subsets (i.e. folds). At each iteration, 4 folds are used in training and 1 fold is used in testing. After 5 iterations, each fold will be used in both training and testing.
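The 5-fold scheme can be made concrete with scikit-learn's KFold splitter. This is only a sketch on ten dummy samples (an assumption) and counts how often each sample lands in the test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten dummy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_counts[test_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every sample is tested exactly once and trained on in the other 4 folds
print(test_counts)
```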
We need a practical way to handle missing values in cross validation process in order to prevent data leakage.
One way is to create a Pipeline with scikit-learn. The pipeline accepts data preprocessing functions and can be used in the cross validation process.
Let’s create a new dataframe that fits a simple linear regression task.
The “regressor” pipeline contains a simple imputer that fills in the missing values with the column mean. The linear regression model then does the prediction task.
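The original dataframe and pipeline definitions are not reproduced here, so the following is a sketch under stated assumptions: a synthetic dataframe whose target column F is a linear function of A, B and C plus noise, and a pipeline named regressor with the step names imputer and model chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption): F depends linearly on A, B and C
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(40, 5)), columns=list("ABCDE"))
df["F"] = 2 * df["A"] + 3 * df["B"] - df["C"] + rng.normal(scale=0.1, size=40)

# Knock out a few feature values; the target column F stays complete
df.iloc[[1, 4, 9], [0, 3]] = np.nan
df.iloc[[3, 7, 20], [1, 2, 4]] = np.nan

# Imputation happens inside the pipeline, so it is re-fitted on each training fold
regressor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
```

Because the imputer is a pipeline step, its means are learned only from whatever data the pipeline is fitted on, which is exactly what cross validation needs.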
We can now use this pipeline as estimator in cross validation.
from sklearn.model_selection import cross_val_score

X = df.drop('F', axis=1)
y = df['F']
scores = cross_val_score(regressor, X, y, cv=4, scoring='r2')
The R-squared score is pretty high because this is a pre-designed data set.
The important point here is to handle missing values after splitting train and test sets. It can easily be done with pandas if we do a regular train-test split.
However, if you want to do cross validation, it will be tedious to use Pandas. The pipelines of the scikit-learn library provide a more practical and easier way.
The scope of pipelines is quite broad. You can also add other preprocessing techniques to a pipeline, such as a scaler for numerical values. Using pipelines lets you automate certain tasks and thus streamline your workflow.
If you’re new to data science, here’s a good place to start
One of the most well-known and essential sub-fields of data science is machine learning. The term machine learning was first used in 1959 by IBM researcher Arthur Samuel. From there, the field of machine learning gained much interest from others, especially for its use in classifications.
When you start your journey into learning and mastering the different aspects of data science, perhaps the first sub-field you come across is machine learning. Machine learning is the name used to describe a collection of computer algorithms that can learn and improve by gathering information while they are running.
Any machine learning algorithm is built upon some data. Initially, the algorithm uses some “training data” to build an intuition for solving a specific problem. Once the algorithm passes the learning phase, it can use the knowledge it gained to solve similar problems on different datasets.
In general, we categorize machine learning algorithms into 4 categories:
Supervised algorithms: Algorithms that involve some supervision from the developer during the operation. To do that, the developer labels the training data and sets strict rules and boundaries for the algorithm to follow.
Unsupervised algorithms: Algorithms that do not involve direct control from the developer. In this case, the desired results are unknown and need to be defined by the algorithm.
Semi-supervised algorithms: Algorithms that combine aspects of both supervised and unsupervised algorithms. For example, not all training data will be labeled, and not all rules will be provided when initializing the algorithm.
Reinforcement algorithms: In these types of algorithms, a technique called exploration/exploitation is used. The gist of it is simple: the machine takes an action, observes the outcome, and then considers that outcome when choosing the next action, and so on.
Each of these categories is designed for a purpose. For example, supervised learning is designed to generalize from the labeled training data and make predictions on future or new data. Unsupervised algorithms, on the other hand, are used to organize and filter data to make sense of it.
Under each of those categories lie various specific algorithms that are designed to perform certain tasks. This article will cover 5 basic algorithms every data scientist must know to cover machine learning basics.
Regression algorithms are supervised algorithms used to find possible relationships among different variables to understand how much the independent variables affect the dependent one.
You can think of regression analysis as an equation. For example, if I have the equation y = 2x + z, y is my dependent variable, and x and z are the independent ones. Regression analysis finds how much x and z affect the value of y.
The same logic applies to more advanced and complex problems. To adapt to the various problems, there are many types of regression algorithms; perhaps the top 5 are:
Linear Regression: The simplest regression technique. It uses a linear approach to model the relationship between the dependent (predicted) variable and the independent variables (predictors).
Logistic Regression: This type of regression is used on binary dependent variables. It is widely used to analyze categorical data.
Ridge Regression: When the regression model becomes too complex, ridge regression shrinks the model’s coefficients by penalizing their size.
Lasso Regression: Lasso (Least Absolute Shrinkage and Selection Operator) regression is used to select and regularize variables.
Polynomial Regression: This type of algorithm is used to fit non-linear data. With it, the best prediction is not a straight line but a curve that tries to fit all data points.
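To connect this back to the y = 2x + z example, here is a small sketch using scikit-learn's LinearRegression. The random inputs are an assumption, and the data is noise-free so the fit should recover the two coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate inputs and compute y from the example equation y = 2x + z
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
z = rng.uniform(0, 10, size=50)
y = 2 * x + z

# Fit a linear model with x and z as the independent variables
model = LinearRegression().fit(np.column_stack([x, z]), y)
print("coefficients:", model.coef_)   # how much x and z each affect y
print("intercept:", model.intercept_)
```

The fitted coefficients quantify how much each independent variable affects the dependent one, which is exactly what regression analysis is after.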
Classification in machine learning is the process of grouping items into categories based on a pre-categorized training dataset. Classification is a supervised learning task.
These algorithms use the training data’s categorization to calculate the likelihood that a new item will fall into one of the defined categories. A well-known example of classification algorithms is filtering incoming emails into spam or not-spam.
There are different types of classification algorithms; the top 4 ones are:
K-nearest neighbor: KNN classifies a new data point by finding the k closest points in the training dataset and assigning the category most common among them.
Decision trees: You can think of a decision tree as a flow chart that splits the data points into two groups at a time, then splits each group into two more, and so on.
Naive Bayes: This algorithm calculates the probability that an item falls under a specific category using the conditional probability rule.
Support Vector Machine (SVM): In this algorithm, the data is classified by finding the boundary that separates the categories with the widest possible margin, which can go beyond a simple two-dimensional X/Y separation.
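As a small illustration of one of these classifiers, the sketch below uses scikit-learn's KNeighborsClassifier. The six training points and their labels are hypothetical.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D training points: three near (1, 1), three near (8, 8)
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["a", "a", "a", "b", "b", "b"]

# Each new point gets the majority label of its 3 nearest training points
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predictions = knn.predict([[2, 2], [7, 9]])
print(predictions)
```

The first query point sits next to the "a" group and the second next to the "b" group, so each is assigned the label of its neighbours.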
Ensembling algorithms are supervised algorithms that combine the predictions of two or more other machine learning algorithms to produce more accurate results. The results can be combined either by voting or by averaging. Voting is often used for classification and averaging for regression.
Ensembling algorithms have 3 basic types: Bagging, Boosting, and Stacking.
Bagging: In bagging, the algorithms are run in parallel on different training sets, all equal in size. All algorithms are then tested using the same dataset, and voting is used to determine the overall results.
Boosting: In the case of boosting, the algorithms are run sequentially. Then the overall results are chosen using weighted voting.
Stacking: As the name suggests, stacking has two levels stacked on top of each other. The base level is a combination of algorithms, and the top level is a meta-algorithm trained on the base level results.
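Bagging, for instance, can be sketched with scikit-learn's BaggingClassifier. The synthetic dataset, the choice of decision trees as base models and the number of models are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 25 trees, each trained on its own bootstrap sample; predictions are combined by voting
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
bagging.fit(X_train, y_train)

accuracy = bagging.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```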
Clustering algorithms are a group of unsupervised algorithms used to group data points. Points within the same cluster are more similar to each other than to points in different clusters.
There are 4 types of clustering algorithms:
Centroid-based Clustering: This clustering algorithm organizes the data into clusters around center points (centroids) and is sensitive to initial conditions and outliers. k-means is the best-known and most widely used centroid-based clustering algorithm.
Density-based Clustering: In this clustering type, the algorithm connects high-density areas into clusters creating arbitrary-shaped distributions.
Distribution-based Clustering: This clustering algorithm assumes the data is composed of probability distributions and then clusters the data into various versions of that distribution.
Hierarchical Clustering: This algorithm creates a tree of hierarchical data clusters, and the number of clusters can be varied by cutting the tree at the correct level.
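As a sketch of centroid-based clustering, the example below runs scikit-learn's KMeans on two well-separated synthetic blobs. The blob positions and sizes are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of 30 points each, centred at (0, 0) and (10, 10)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(30, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(30, 2))
points = np.vstack([blob_a, blob_b])

# Ask for two clusters; no labels are given to the algorithm
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_
print(labels)
print(kmeans.cluster_centers_)
```

Points from the same blob end up with the same cluster label, although which blob is called 0 and which is called 1 is arbitrary.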
Association algorithms are unsupervised algorithms used to discover how likely some items are to occur together in a specific dataset. They are mostly used in market-basket analysis.
The most used association algorithm is Apriori.
The Apriori algorithm is a mining algorithm commonly used in transactional databases. Apriori is used to mine frequent itemsets and generate association rules from those itemsets.
For example, if a person buys milk and bread, then they are likely to also get some eggs. These insights are built upon previous purchases from various clients. Association rules are then formed according to a specific confidence threshold, based on how frequently these items are bought together.
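The two quantities behind such a rule, support and confidence, can be computed by hand. The sketch below does this for the rule {milk, bread} -> {eggs} on five made-up transactions. It illustrates the measures only and is not a full Apriori implementation.

```python
# Hypothetical shopping baskets
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

antecedent = {"milk", "bread"}
consequent = {"eggs"}

# Support of the whole rule, and confidence = support(A and B) / support(A)
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)

print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
```

A rule is kept only if its confidence clears the chosen threshold. Apriori's contribution is finding the frequent itemsets efficiently before rules like this are evaluated.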
Machine learning is one of the most famous, well-researched sub-fields of data science. New machine learning algorithms are always under development to reach better accuracy and faster execution.
Regardless of the algorithm, it can generally be placed in one of four categories: supervised, unsupervised, semi-supervised, and reinforcement algorithms. Each one of these categories holds many algorithms that are used for different purposes.
In this article, I have gone through 5 types of supervised/unsupervised algorithms that every machine learning beginner should be familiar with. These algorithms are so well-studied and widely used that you only need to understand how to use them rather than how to implement them.
Most famous Python machine learning modules — such as Scikit Learn — contain a pre-defined version of most — if not all — of these algorithms.
My advice is: understand the mechanics, master the usage, and start building.
Today’s world is moving quickly towards using AI and Machine Learning in all fields. The most important key to this is DATA. Data is the key to everything. If, as Machine Learning Engineers, we are able to understand and restructure the data to suit our needs, we will have completed half the task.
Let us learn to perform EDA (Exploratory Data Analysis) on data.
What we will learn in this tutorial :
Collect data for our application.
Structure the data to our needs.
Visualize the data.
Let’s get started. We will fetch some sample data: the Iris dataset, a very common dataset used when you want to get started with Machine Learning and Deep Learning.
1. Collection of Data: The data for any application can be found on several websites such as Kaggle or UCI, or has to be built specifically for the application. For example, if we want to classify between a dog and a cat, we don’t need to build a dataset by collecting images of dogs and cats, as several datasets are already available. Here, let’s inspect the Iris dataset.
Let’s fetch the data:
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris() #3.
df = pd.DataFrame(data.data, columns=data.feature_names) #4.
Line #3 fetches the dataset that ships with sklearn. Line #4 converts the dataset into a pandas data frame, which is very commonly used to explore datasets with row-column attributes.
The first 5 rows of the data can be viewed using :
The number of rows and columns, and the names of the columns of the dataset can be checked with :
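The snippets for these two steps are not shown here, so the following is a minimal sketch using standard pandas calls on the data frame built above.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

print(df.head())            # first 5 rows
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # names of the columns
```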
We can even download the dataset directly from UCI from here. The CSV file downloaded can be loaded into the df as :
df = pd.read_csv("path to csv file")
2. Structuring the Data: Very often the dataset will have several features that don’t directly affect our output. Keeping such features is wasteful, as it uses memory unnecessarily and can sometimes introduce errors.
We can check which columns are important or affect the output column more by checking the correlation of the output column with the inputs. Let us try that out :
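A correlation matrix for the Iris features can be computed with the pandas corr() method. This sketch rebuilds the data frame so that it runs on its own. Rounding to two decimals is just a display choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Pairwise Pearson correlation between all feature columns
corr = df.corr()
print(corr.round(2))

# Correlation of every feature with a chosen output column, strongest first
print(corr["sepal length (cm)"].sort_values(ascending=False))
```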
As you can see above, the correlation matrix helps us understand how all the features are related to one another. For more information about the correlation matrix click here.
So if our output column were sepal length (cm), my output y would be ‘sepal length (cm)’ and my input X would be ‘petal length (cm)’ and ‘petal width (cm)’, as they have a higher correlation with y.
Note: If ‘sepal width (cm)’ had a correlation of -0.8, we would also include it, because a strongly negative correlation still has a large (inverse) impact on the output y.
Note: The value of correlation in a correlation matrix can vary between -1(inversely proportional) and +1(directly proportional).
3. Visualize the Data: This is a very important step, as it can help in several ways:
It helps you understand how the data is distributed, i.e. whether it lies within a small range of values or is spread more widely.
It helps you understand decision boundaries.
It lets you present your data to people in a way they can understand, rather than showing them tables.
There are several ways to plot and present the data, such as histograms, bar charts, pair plots, etc.
Let’s see how we plot a histogram for the IRIS dataset.
df.plot.hist( subplots = True, grid = True)
By looking at the histogram, it is easier to see the range of values for each feature.
Let’s simply plot the data now.
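The plotting snippet itself is not shown here. One possible sketch uses the pandas plot() method, which draws one line per feature. The Agg backend and the axis labels are assumptions so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption, not needed in a notebook)
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# One line per feature, with the sample index on the x-axis
ax = df.plot(grid=True)
ax.set_xlabel("Sample index")
ax.set_ylabel("Measurement (cm)")
```

In a notebook, the figure is displayed inline. In a script, calling plt.show() from matplotlib.pyplot with an interactive backend displays it.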
Apart from these, there are several other graphs which can be plotted easily depending on the application.
Hence, we can conclude by stating one simple fact: a well-structured dataset is the first key to a good and efficient Machine Learning model.