
Data binning (or bucketing) groups data into bins (or buckets): it replaces the values that fall within a small interval with a single representative value for that interval. Binning can sometimes improve the accuracy of predictive models.
Data binning is a type of data preprocessing, which also includes handling missing values, formatting, normalization and standardization.
There are two approaches to data binning:
- numeric to categorical, which converts numeric variables into categorical ones
- sampling, which corresponds to data quantization.
You can download the full code of this tutorial from the GitHub repository.
Data Import
In this tutorial we use the cupcake.csv dataset, which contains the search trend for the word cupcake on Google Trends. The data are extracted from this link. We use the pandas library to import the dataset, loading it into a dataframe with the read_csv() function.
import pandas as pd
df = pd.read_csv('cupcake.csv')
df.head(5)

Numeric to categorical binning
In this case we group the values of the column Cupcake into three groups: small, medium and big. To do so, we need to calculate the interval within which each group falls. We calculate the overall range as the difference between the maximum and the minimum value, and then split it into three equal parts, one for each group. We use the min() and max() functions of the dataframe to calculate the minimum and the maximum value of the column Cupcake.
min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
print(min_value)
print(max_value)
which gives the following output:
4
100
Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):
- small: (edge1, edge2)
- medium: (edge2, edge3)
- big: (edge3, edge4)
We can use the linspace() function of the numpy package to calculate the 4 edges, equally spaced between the minimum and the maximum value.
import numpy as np
bins = np.linspace(min_value,max_value,4)
bins
which gives the following output:
array([ 4., 36., 68., 100.])
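The arithmetic behind these edges can be checked by hand: the range 100 - 4 = 96 is split into three intervals of width 32. The following is a minimal sketch of that calculation; the helper name equal_width_edges is hypothetical and not part of the tutorial or of numpy.

```python
import numpy as np

# Hypothetical helper (an assumption for illustration, not a numpy function):
# computes n_bins + 1 equally spaced edges between min_value and max_value.
def equal_width_edges(min_value, max_value, n_bins):
    width = (max_value - min_value) / n_bins  # (100 - 4) / 3 = 32.0
    return [min_value + i * width for i in range(n_bins + 1)]

# Same minimum and maximum as printed above for the Cupcake column.
edges = equal_width_edges(4, 100, 3)
print(edges)  # [4.0, 36.0, 68.0, 100.0]

# The manual edges agree with np.linspace(min, max, n_bins + 1).
print(np.allclose(edges, np.linspace(4, 100, 4)))  # True
```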
Now we define the labels:
labels = ['small', 'medium', 'big']
We can use the cut() function to convert the numeric values of the column Cupcake into categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True so that the minimum value is also included in the first bin.
df['bins'] = pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)
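To see how cut() treats values that sit exactly on an edge, here is a small self-contained sketch. The values are made up for illustration and merely stand in for the Cupcake column. It assumes the default right=True, under which each bin includes its right edge.

```python
import pandas as pd

# Made-up values (an assumption) chosen to sit on and around the bin edges.
values = pd.Series([4, 20, 36, 37, 68, 69, 100])
bins = [4, 36, 68, 100]
labels = ['small', 'medium', 'big']

# With include_lowest=True the first bin is closed on the left, so 4 is 'small'.
binned = pd.cut(values, bins=bins, labels=labels, include_lowest=True)
print(binned.tolist())
# ['small', 'small', 'small', 'medium', 'medium', 'big', 'big']

# Without include_lowest, the minimum value falls outside every bin and becomes NaN.
without_lowest = pd.cut(values, bins=bins, labels=labels)
print(without_lowest.isna().tolist())
# [True, False, False, False, False, False, False]

# Number of values per group, in category order.
print(binned.value_counts(sort=False))
```

A value equal to an inner edge, such as 36 or 68, goes into the lower group, because each interval is open on the left and closed on the right.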
We can plot the distribution of values by using the hist() function of the matplotlib package.
import matplotlib.pyplot as plt
plt.hist(df['bins'], bins=3)

Sampling
Sampling is another data binning technique. It reduces the number of samples by grouping similar or contiguous values. There are three approaches to sampling:
- by bin means: each value in a bin is replaced by the mean value of the bin.
- by bin median: each bin value is replaced by its bin median value.
- by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.
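The three replacement rules above can be sketched on a small, made-up series. This is an illustration under stated assumptions, not the tutorial's method: the values are invented, already sorted, and split into three bins of three values each, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up, sorted values (an assumption), split into 3 bins of 3 values each.
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_ids = np.arange(len(values)) // 3  # [0, 0, 0, 1, 1, 1, 2, 2, 2]

# By bin means: every value is replaced by the mean of its bin.
by_mean = values.groupby(bin_ids).transform('mean')

# By bin median: every value is replaced by the median of its bin.
by_median = values.groupby(bin_ids).transform('median')

# By bin boundary: every value is replaced by the closer of the bin's
# minimum and maximum (ties go to the minimum in this sketch).
lo = values.groupby(bin_ids).transform('min')
hi = values.groupby(bin_ids).transform('max')
by_boundary = lo.where(values - lo <= hi - values, hi)

print(by_mean.tolist())      # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(by_median.tolist())    # [8.0, 8.0, 8.0, 21.0, 21.0, 21.0, 28.0, 28.0, 28.0]
print(by_boundary.tolist())  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```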
To perform sampling, we can use the binned_statistic() function of the scipy.stats package. This function receives two arrays as input, x_data and y_data, as well as the statistic to be computed (e.g. median or mean) and the number of bins to be created. It returns the statistic computed for each bin, the edges of the bins, and the index of the bin to which each input value belongs. We can calculate the x values (x_bins) corresponding to the binned values (y_bins) as the centers of the bins.
from scipy.stats import binned_statistic
x_data = np.arange(0, len(df))
y_data = df['Cupcake']
y_bins,bin_edges, misc = binned_statistic(x_data,y_data, statistic="median", bins=10)
x_bins = (bin_edges[:-1]+bin_edges[1:])/2
x_bins
which gives the following output:
array([ 10.15, 30.45, 50.75, 71.05, 91.35, 111.65, 131.95, 152.25, 172.55, 192.85])
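To inspect what binned_statistic() returns without needing the CSV file, the same calls can be run on a tiny synthetic series. The data below are made up (y is simply twice x), and the mean is used instead of the median only to keep the numbers easy to verify by hand.

```python
import numpy as np
from scipy.stats import binned_statistic

# Synthetic data (an assumption): 20 points with y = 2 * x.
x = np.arange(0, 20)
y = x * 2.0

# Four equal-width bins over the range of x, i.e. edges at 0, 4.75, 9.5, 14.25, 19.
stat, edges, binnumber = binned_statistic(x, y, statistic='mean', bins=4)

# Bin centers, computed exactly as x_bins in the tutorial.
centers = (edges[:-1] + edges[1:]) / 2

print(stat)       # mean of y in each bin: 4, 14, 24, 34
print(centers)    # 2.375, 7.125, 11.875, 16.625
print(binnumber)  # 1-based index of the bin each x falls into
```

Each bin here holds five consecutive points, so 20 samples are reduced to 4 representative values.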
Finally, we plot results.
plt.plot(x_data, y_data)
plt.xlabel("X")
plt.ylabel("Y")
plt.scatter(x_bins, y_bins, color='red', linewidth=5)
plt.show()

Summary
In this tutorial I have illustrated how to perform data binning, which is a data preprocessing technique. Two approaches can be followed: the first converts numeric data into categorical data, while the second performs data sampling, reducing the number of samples.
Data binning is particularly useful when discretization is needed.