Data binning (or bucketing) groups data into bins (or buckets): it replaces the values that fall within a small interval with a single representative value for that interval. Sometimes binning improves accuracy in predictive models.

Data binning is a type of data preprocessing, a process which also includes dealing with missing values, formatting, normalization and standardization.

There are two approaches to perform data binning:

- **numeric to categorical**, which converts numeric variables into categorical variables
- **sampling**, which corresponds to data quantization.

You can download the full code of this tutorial from the GitHub repository.

# Data Import

In this tutorial we use the `cupcake.csv` dataset, which contains the search trend of the word `cupcake` on Google Trends, from which the data are extracted. We use the `pandas` library to import the dataset and load it into a dataframe through the `read_csv()` function.

import pandas as pd

df = pd.read_csv('cupcake.csv')

df.head(5)

# Numeric to categorical binning

In this case we group the values of the `Cupcake` column into three groups: *small*, *medium* and *big*. To do so, we need to calculate the interval within which each group falls. We compute the full range as the difference between the maximum and minimum value, and then split it into three equal parts, one for each group. We use the `min()` and `max()` methods of the column to obtain the minimum and maximum value of `Cupcake`.

min_value = df['Cupcake'].min()

max_value = df['Cupcake'].max()

print(min_value)

print(max_value)

which gives the following output

4

100

Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):

- small: (edge1, edge2)
- medium: (edge2, edge3)
- big: (edge3, edge4)

We can use the `linspace()` function of the `numpy` package to calculate the 4 bin edges, equally spaced.

import numpy as np

bins = np.linspace(min_value,max_value,4)

bins

which gives the following output:

array([ 4., 36., 68., 100.])
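The arithmetic behind these edges can be checked by hand. The following is a minimal sketch that hardcodes the minimum and maximum printed above (4 and 100) instead of reading the dataset; the names `n_groups`, `width`, `manual_edges` and `linspace_edges` are introduced here for illustration only.

```python
import numpy as np

# Values taken from the output above: the column ranges from 4 to 100.
min_value, max_value = 4, 100
n_groups = 3

# Each of the three equal-width bins spans one third of the full range.
width = (max_value - min_value) / n_groups

# n_groups bins need n_groups + 1 edges.
manual_edges = [min_value + i * width for i in range(n_groups + 1)]
linspace_edges = np.linspace(min_value, max_value, n_groups + 1)

print(width)         # 32.0
print(manual_edges)  # [4.0, 36.0, 68.0, 100.0]
```

Both computations give the same edges, so `linspace()` is simply a shortcut for adding the bin width repeatedly to the minimum.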

Now we define the labels:

labels=['small', 'medium', 'big']

We can use the `cut()` function of `pandas` to convert the numeric values of the `Cupcake` column into categorical values. We need to specify the bins and the labels. In addition, we set the `include_lowest` parameter to `True` so that the minimum value is also included in the first bin; by default the intervals are open on the left, so the minimum would otherwise be left without a category.

df['bins']=pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)
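To see what `include_lowest` changes, here is a small sketch on made-up values rather than on the real dataset. Only the edges (4, 36, 68, 100) and the labels come from the tutorial; the sample values and the names `values`, `with_lowest` and `without_lowest` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; 4 and 100 match the column's min and max.
values = pd.Series([4, 20, 36, 37, 68, 90, 100])
bins = np.linspace(4, 100, 4)  # edges 4, 36, 68, 100
labels = ['small', 'medium', 'big']

# Intervals are closed on the right by default: (4, 36], (36, 68], (68, 100].
with_lowest = pd.cut(values, bins=bins, labels=labels, include_lowest=True)
without_lowest = pd.cut(values, bins=bins, labels=labels)

print(with_lowest.tolist())
# ['small', 'small', 'small', 'medium', 'medium', 'big', 'big']

# Without include_lowest the minimum (4) falls outside every interval.
print(without_lowest.isna().sum())
# 1
```

Note that a value lying exactly on an inner edge, such as 36 or 68, is assigned to the lower group because the intervals are closed on the right.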

We can plot the distribution of values by using the `hist()` function of the `matplotlib` package.

import matplotlib.pyplot as plt

plt.hist(df['bins'], bins=3)

# Sampling

Sampling is another data binning technique. It reduces the number of samples by grouping similar or contiguous values. There are three approaches to perform sampling:

- by bin means: each value in a bin is replaced by the mean value of the bin.
- by bin median: each bin value is replaced by its bin median value.
- by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.
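The three approaches can be compared on a tiny example. This is only a sketch under stated assumptions: the data are made up, already sorted, and split into three bins of four values each, which is one possible way of forming bins and not necessarily the one used in the rest of this tutorial.

```python
import numpy as np

# Made-up, already sorted sample split into three bins of four values each.
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = data.reshape(3, 4)

# Bin means / medians: every value is replaced by its bin's statistic.
by_mean = np.repeat(bins.mean(axis=1), 4)
by_median = np.repeat(np.median(bins, axis=1), 4)

# Bin boundaries: every value is replaced by the closer of its bin's
# minimum and maximum (ties go to the minimum, an arbitrary choice).
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundary = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_mean.tolist())
# [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
print(by_median.tolist())
# [8.5, 8.5, 8.5, 8.5, 22.5, 22.5, 22.5, 22.5, 28.5, 28.5, 28.5, 28.5]
print(by_boundary.tolist())
# [4.0, 4.0, 4.0, 15.0, 21.0, 21.0, 25.0, 25.0, 26.0, 26.0, 26.0, 34.0]
```

Means and medians collapse each bin to a single value, while the boundary approach keeps two distinct values per bin.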

To perform sampling, we can use the `binned_statistic()` function of the `scipy.stats` package. This function receives two arrays as input, `x_data` and `y_data`, as well as the statistic to be used (e.g. median or mean) and the number of bins to be created. It returns the statistic computed for each bin, the edges of the bins and the index of the bin each input value falls into. We can calculate the `x` values (`x_bins`) corresponding to the binned values (`y_bins`) as the centers of the bin ranges.

from scipy.stats import binned_statistic

x_data = np.arange(0, len(df))

y_data = df['Cupcake']

y_bins,bin_edges, misc = binned_statistic(x_data,y_data, statistic="median", bins=10)

x_bins = (bin_edges[:-1]+bin_edges[1:])/2

x_bins

which gives the following output:

array([ 10.15, 30.45, 50.75, 71.05, 91.35, 111.65, 131.95, 152.25, 172.55, 192.85])
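The three values returned by `binned_statistic()` are easier to read on a very small input. The sketch below uses a made-up series of 8 points and 2 bins, with `statistic="mean"` instead of the median; the variable names are illustrative. It also shows how the third return value (ignored as `misc` above) can map every original sample to its bin's statistic.

```python
import numpy as np
from scipy.stats import binned_statistic

# Made-up series of 8 points; x is just the sample index.
x = np.arange(8)
y = np.array([1.0, 3.0, 2.0, 6.0, 5.0, 7.0, 10.0, 12.0])

# Two equal-width bins over x: [0, 3.5) and [3.5, 7].
means, edges, bin_index = binned_statistic(x, y, statistic="mean", bins=2)
centers = (edges[:-1] + edges[1:]) / 2

# bin_index is 1-based, so subtract 1 to look up each sample's bin mean.
smoothed = means[bin_index - 1]

print(means.tolist())      # [3.0, 8.5]
print(edges.tolist())      # [0.0, 3.5, 7.0]
print(bin_index.tolist())  # [1, 1, 1, 1, 2, 2, 2, 2]
print(centers.tolist())    # [1.75, 5.25]
print(smoothed.tolist())   # [3.0, 3.0, 3.0, 3.0, 8.5, 8.5, 8.5, 8.5]
```

Plotting `centers` against `means` gives one point per bin, whereas `smoothed` keeps the original length and replaces every sample with its bin value.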

Finally, we plot the results.

plt.plot(x_data,y_data)

plt.xlabel("X")

plt.ylabel("Y")

plt.scatter(x_bins, y_bins, color='red', linewidth=5)

plt.show()

# Summary

In this tutorial I have illustrated how to perform data binning, a data preprocessing technique. Two approaches can be followed: the first converts numeric data into categorical data, while the second performs data sampling, which reduces the number of samples.

Data binning is very useful when discretization is needed.