Data Preprocessing with Python Pandas — Binning

Image for post

Data binning (or bucketing) groups data in bins (or buckets), in the sense that it replaces values contained into a small interval with a single representative value for that interval. Sometimes binning improves accuracy in predictive models.

Data binning is a type of data preprocessing, a mechanism which includes also dealing with check this link missing valuesformattingnormalization and standardization.

There are two approaches to perform data binning:

  • numeric to categorical, which converts numeric into categorical variables
  • sampling, wihch corresponds to data quantization.

You can download the full code of this tutorial from Github repository

Data Import

In this tutorial we exploit the cupcake.csv dataset, which contains the trend search of the word cupcake on Google Trends. Data are extracted from this link. We exploit the pandas library to import the dataset and we transform it into a dataframe through the read_csv() function.

import pandas as pd
df = pd.read_csv('cupcake.csv')
df.head(5)
Image for post

Numeric to categorical binning

In this case we group values related to the column Cupcake into three groups: smallmedium and big. In order to do it, we need to calculate the intervals within each group falls. We calculate the interval range as the difference between the maximum and minimum value and then we split this interval into three parts, one for each group. We exploit the functions min() and max() of dataframe to calculate the minimum value and the maximum value of the column Cupcake.

min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
print(min_value)
print(max_value)

which gives the following output

4
100

Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):

  • small — (edge1, edge2)
  • medium — (edge2, edge3)
  • big — (edge3, edge4) We can use the linspace() function of the numpy package to calculate the 4 bins, equally distributed.
import numpy as np
bins = np.linspace(min_value,max_value,4)
bins

which gives the following output:

array([  4.,  36.,  68., 100.])

Now we define the labels:

labels = ['small', 'medium', 'big']

We can use the cut() function to convert the numeric values of the column Cupcake into the categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True in order to include also the minimum value.

df['bins'] = pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)

We can plot the distribution of values, by using the hist() function of the matplotlib package.

import matplotlib.pyplot as pltplt.hist(df['bins'], bins=3)
Image for post

Sampling

Sampling is another technique of data binning. It permits to reduce the number of samples, by grouping similar values or contiguous values. There are three approaches to perform sampling:

  • by bin means: each value in a bin is replaced by the mean value of the bin.
  • by bin median: each bin value is replaced by its bin median value.
  • by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.

In order to perform sampling, the binned_statistic() function of the scipy.stats package can be used. This function receives two arrays as input, x_data and y_data, as well as the statistics to be used (e.g. median or mean) and the number of bins to be created. The function returns the values of the bins as well as the edges of each bin. We can calculate the x values (x_bins) corresponding to the binned values (y_bins) as the values at the center of the bin range.

from scipy.stats import binned_statistic
x_data = np.arange(0, len(df))
y_data = df['Cupcake']
y_bins,bin_edges, misc = binned_statistic(x_data,y_data, statistic="median", bins=10)
x_bins = (bin_edges[:-1]+bin_edges[1:])/2
x_bins

which gives the following output:

array([ 10.15,  30.45,  50.75,  71.05,  91.35, 111.65, 131.95, 152.25, 172.55, 192.85])

Finally, we plot results.

plt.plot(x_data,y_data)
plt.xlabel("X");
plt.ylabel("Y")plt.scatter(x_bins, y_bins, color= 'red',linewidth=5)
plt.show()
Image for post

Summary

In this tutorial I have illustrated how to perform data binning, which is a technique for data preprocessing. Two approaches can be followed. The first approach converts numeric data into categorical data, the second approach performs data sampling, by reducing the number of samples.

Data binning is very useful when discretization is needed.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s