Data binning (or bucketing) groups data into bins (or buckets): it replaces the values that fall within a small interval with a single representative value for that interval. Binning can sometimes improve accuracy in predictive models.
There are two approaches to perform data binning:
- numeric to categorical, which converts numeric variables into categorical ones
- sampling, which corresponds to data quantization.
You can download the full code of this tutorial from the GitHub repository.
In this tutorial we use the cupcake.csv dataset, which contains the search trend for the word cupcake on Google Trends. Data are extracted from this link. We use the pandas library to import the dataset and load it into a dataframe through the read_csv() function.
import pandas as pd
df = pd.read_csv('cupcake.csv')
Numeric to categorical binning
In this case we group the values of the column Cupcake into three groups: small, medium and big. To do so, we need to calculate the interval within which each group falls. We calculate the overall range as the difference between the maximum and minimum value, and then split it into three equal parts, one for each group. We use the min() and max() functions of the dataframe to calculate the minimum and maximum value of the column Cupcake:
min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
For this dataset, the minimum value is 4 and the maximum value is 100.
Now we can calculate the boundaries of each interval, i.e. its minimum and maximum value. Since we have 3 groups, we need 4 interval edges (bins):
- small: (edge1, edge2)
- medium: (edge2, edge3)
- big: (edge3, edge4)
We can use the linspace() function of the numpy package to calculate the 4 equally spaced edges.
import numpy as np
bins = np.linspace(min_value,max_value,4)
which gives the following output:
array([ 4., 36., 68., 100.])
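To see where these edges come from, here is a minimal sketch on a made-up array (the values are illustrative and are not taken from cupcake.csv; they were only chosen so that the minimum is 4 and the maximum is 100). Each of the three bins has width (max - min) / 3, and linspace() should give the same edges as adding that width step by step.

```python
import numpy as np

# Hypothetical values, chosen so that min=4 and max=100
values = np.array([4, 20, 36, 50, 68, 90, 100])

n_groups = 3
min_value, max_value = values.min(), values.max()

# Width of each equal-width bin: (100 - 4) / 3
width = (max_value - min_value) / n_groups

# n_groups bins need n_groups + 1 edges
edges = np.linspace(min_value, max_value, n_groups + 1)

# The same edges computed by hand, one width at a time
manual_edges = min_value + width * np.arange(n_groups + 1)
```

Both edges and manual_edges contain the same four numbers, which is why three groups always need four edges.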
Now we define the labels:
labels = ['small', 'medium', 'big']
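We can use the cut() function of pandas to convert the numeric values of the column Cupcake into categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True so that the minimum value is also included. By default the intervals are open on the left, so the minimum value would otherwise fall outside the first bin.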
df['bins'] = pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)
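The effect of include_lowest is easiest to see on a small example. The sketch below uses a hypothetical series (not the cupcake data) whose minimum sits exactly on the first edge, and compares the result with and without the parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical series; 4 is the minimum and lies exactly on the first edge
s = pd.Series([4, 10, 36, 37, 68, 100])
bins = np.array([4.0, 36.0, 68.0, 100.0])
labels = ['small', 'medium', 'big']

# Default behaviour: intervals are closed on the right only,
# so the first bin is (4, 36] and the value 4 gets no label (NaN)
without_lowest = pd.cut(s, bins=bins, labels=labels)

# include_lowest=True also closes the first interval on the left
with_lowest = pd.cut(s, bins=bins, labels=labels, include_lowest=True)
```

Note that 36 is labelled small and 68 is labelled medium in both cases, because each interval includes its right edge.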
We can plot the distribution of values by using the hist() function of the matplotlib package.
import matplotlib.pyplot as plt
plt.hist(df['bins'], bins=3)
plt.show()
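A histogram of a categorical column is essentially a count per label. As an alternative sketch, the counts can be computed directly with value_counts(); the series below is hypothetical and only stands in for the Cupcake column.

```python
import pandas as pd

# Hypothetical values standing in for the Cupcake column
s = pd.Series([4, 10, 36, 37, 68, 100])
labels = ['small', 'medium', 'big']

binned = pd.cut(s, bins=[4, 36, 68, 100], labels=labels, include_lowest=True)

# Count how many values fall in each category,
# then order the result by category (small, medium, big)
counts = binned.value_counts().sort_index()
```

If a chart is needed, these counts can be passed to a bar plot, which gives the same picture as the histogram above.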
Sampling
Sampling is another data binning technique. It reduces the number of samples by grouping similar or contiguous values. There are three approaches to perform sampling:
- by bin means: each value in a bin is replaced by the mean value of the bin.
- by bin median: each bin value is replaced by its bin median value.
- by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.
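The three approaches can be sketched with plain numpy. This is only an illustration under stated assumptions: the values are made up, and they are split into three bins holding the same number of values (equal-frequency bins), which is simpler to show than the equal-width bins used elsewhere in this tutorial.

```python
import numpy as np

# Hypothetical sorted values, split into 3 bins of 3 values each
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
bins = values.reshape(3, 3)  # each row is one bin

# By bin means: every value is replaced by the mean of its bin
by_mean = np.repeat(bins.mean(axis=1), 3)

# By bin median: every value is replaced by the median of its bin
by_median = np.repeat(np.median(bins, axis=1), 3)

# By bin boundary: every value is replaced by the closer of
# the minimum or the maximum of its bin (ties go to the minimum)
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundary = np.where(bins - lo <= hi - bins, lo, hi).ravel()
```

In the first bin, for example, 8 is closer to 4 than to 15, so the boundary approach replaces it with 4, while the mean approach replaces all three values with 9.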
To perform sampling, we can use the binned_statistic() function of the scipy.stats package. This function receives two arrays as input, x_data and y_data, as well as the statistic to be used (e.g. median or mean) and the number of bins to be created. It returns the value of the statistic in each bin, the edges of the bins, and the index of the bin to which each input sample belongs. We can calculate the x values (x_bins) corresponding to the binned values (y_bins) as the centers of the bins.
from scipy.stats import binned_statistic
x_data = np.arange(0, len(df))
y_data = df['Cupcake']
y_bins,bin_edges, misc = binned_statistic(x_data,y_data, statistic="median", bins=10)
x_bins = (bin_edges[:-1]+bin_edges[1:])/2
The resulting x_bins values are:
array([ 10.15, 30.45, 50.75, 71.05, 91.35, 111.65, 131.95, 152.25, 172.55, 192.85])
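To make the three return values of binned_statistic() concrete, here is a small sketch on synthetic data (20 made-up points, not the cupcake series). It uses the mean as statistic and 4 bins so that the numbers are easy to check by hand.

```python
import numpy as np
from scipy.stats import binned_statistic

# Synthetic data: x is the sample position, y grows linearly
x = np.arange(20)
y = np.arange(20, dtype=float) * 2

# stat: mean of y in each bin
# edges: 5 edges delimiting the 4 bins over the range of x
# binnumber: 1-based index of the bin each sample falls into
stat, edges, binnumber = binned_statistic(x, y, statistic='mean', bins=4)

# Center of each bin, used as the x coordinate of the binned value
centers = (edges[:-1] + edges[1:]) / 2
```

Here the 20 original samples are reduced to 4 points, (centers[i], stat[i]), one per bin.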
Finally, we plot the results.
plt.ylabel("Y")
plt.scatter(x_bins, y_bins, color='red', linewidth=5)
plt.show()
In this tutorial I have illustrated how to perform data binning, a data preprocessing technique. Two approaches can be followed: the first converts numeric data into categorical data, while the second performs data sampling, reducing the number of samples.
Data binning is very useful when discretization is needed.