Hyperplane in SVM Algorithm

In this post, we are going to introduce you to the Support Vector Machine (SVM) machine learning algorithm. We will follow a similar process to our recent post Naive Bayes for Dummies; A Simple Explanation, keeping it short and not overly technical. The aim is to give those of you who are new to machine learning a basic understanding of the key concepts of this algorithm.

Support Vector Machines – What are they?

 
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems and as such, this is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.

SVM

Support Vectors

 
Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

What is a hyperplane?

 
As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new test data is added, whichever side of the hyperplane it lands on determines the class that we assign to it.

How do we find the right hyperplane?

 
Or, in other words, how do we best segregate the two classes within the data?

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.

SVM

But what happens when there is no clear hyperplane?

 
This is where it can get tricky. Data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below, which represent a linearly non-separable dataset.

SVM

In order to classify a dataset like the one above, it's necessary to move away from a 2D view of the data to a 3D view. Explaining this is easiest with another simplified example. Imagine that our two sets of colored balls above are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This 'lifting' of the balls represents the mapping of the data into a higher dimension. This is known as kernelling. You can read more on kernelling here.

SVM

Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
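To make the kernel idea concrete, here is a minimal sketch (my own illustration using scikit-learn, which the original post does not show) of fitting an SVM with an RBF kernel on a toy dataset that is not linearly separable in two dimensions; the kernel performs the higher-dimensional mapping implicitly.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: impossible to split with a straight line in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a separating hyperplane exists.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("Training accuracy:", clf.score(X, y))
print("Support vectors per class:", clf.n_support_)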

Pros & Cons of Support Vector Machines

 
Pros

  • Accuracy
  • Works well on smaller cleaner datasets
  • It can be more efficient because it uses a subset of training points

Cons

  • Isn’t suited to larger datasets as the training time with SVMs can be high
  • Less effective on noisier datasets with overlapping classes

SVM Uses

 
SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification. SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.

There you have it, a very high level introduction to Support Vector Machines.

Improve Your ML Model Accuracy

6 Ways to Improve Your ML Model Accuracy

Simple steps for much better results

Image for post

One of the most frustrating things that can happen (and it happens more often than data scientists like to admit) is to spend hours upon hours gathering data, cleaning it, labeling it, and using it to train and develop a machine learning model, only to end up with a model with low accuracy or a large error range.

In machine learning, the term model accuracy refers to the measurements made to decide how well a certain model describes the relationship between the different problem variables. We use training data (sample data) to train a model that will then be applied to new, unseen data.

If our model has good accuracy, it will perform well on both the training data and the new data. Having a model with high accuracy is essential to the overall project's success, and if you're building it for a client, it's important for your paycheck!

From a business perspective, performance equals money; if a model's accuracy is low, it will result in more errors, which can be very costly. And I am not just talking about the financial aspect; imagine a model used to diagnose cancer or another serious disease; a wrong diagnosis will not only cost the hospital money but will also cost the patient and their family unnecessary emotional trauma.

So, how can we avoid all of that and improve the accuracy of our machine learning model? There are several ways a data scientist can improve a model's accuracy; in this article, we will go through six of them. Let's jump right in…

Most ML engineers are familiar with the quote, "Garbage in, garbage out". Your model can only perform as well as the data it is trained on; if that data poorly represents the actual scenario, performance will suffer. What do I mean by 'representative'? It refers to how well the training data population mimics the target population: the proportions of the different classes, the point estimates (like the mean or median), and the variability (like the variance, standard deviation, or interquartile range) of the training and target populations.

Generally, the larger the dataset, the more likely it is to be representative of the target population to which you want to generalize. If you want to generalize to the population of students in Grades 1 to 12 of a school, you cannot just sample mostly from Grade 8; predictions for the other grades will be unreliable because the training data does not represent them. It is crucial to have a good understanding of the distribution of your target population in order to devise the right data collection techniques. Once you have the data, study it (the exploratory data analysis phase) to determine its distribution and representativeness.

Outliers, missing values, and outright wrong or false data are some of the other considerations that you might have. Should you cap outliers at a certain value? Or remove them entirely? How about normalizing the values? Should you include data with some missing values? Or use the mean or median values instead to replace the missing values? Does the data collection method support the integrity of the data? These are some of the questions that you must evaluate before thinking about the model. Data cleaning is probably the most important step after data collection.

Method 1: Add more data samples

Data tells a story only if you have enough of it. Every data sample adds some input and perspective to the overall story your data is trying to tell. Perhaps the easiest and most straightforward way to improve your model's performance and increase its accuracy is to add more data samples to the training data.

Doing so will add more details to your data and fine-tune your model, resulting in more accurate performance. Remember, after all, the more information you give your model, the more it will learn and the more cases it will be able to identify correctly.

Method 2: Look at the problem differently

Sometimes adding more data isn't the answer to your model's inaccuracy problem. You're providing your model with a good technique and the correct dataset, but you're still not getting the results you hoped for; why?

Maybe you're just asking the wrong questions or trying to hear the wrong story. Looking at the problem from a new perspective can add valuable information to your model and help you uncover hidden relationships between the story variables. Asking different questions may lead to better results and, eventually, better accuracy.

Method 3: Add some context to your data.

Context is important in any situation, and training a machine learning model is no different. Sometimes, one point of data can't tell a story on its own, so you need to add more context before any algorithm applied to this data can perform well.

More context can always lead to a better understanding of the problem and, eventually, better performance of the model. Imagine I tell you I am selling a car, a BMW. That alone doesn’t give you much information about the car. But, if I add the color, model and distance traveled, then you’ll start to have a better picture of the car and its possible value.

Method 4: Finetune your hyperparameter

Training a machine learning model is a skill that you can only hone with practice. Yes, there are rules you can follow to train your model, but these rules don't give you the answer you're seeking, only the way to reach that answer.

However, getting to that answer requires some trial and error. When I first started learning the different machine learning algorithms, such as K-means, I was lost on how to choose the best number of clusters to reach optimal results. The way to optimize the results is to tune the algorithm's hyperparameters, and careful tuning will usually lead to better accuracy.

Method 5: Train your model using cross-validation

In machine learning, cross-validation is a technique used to enhance the model training process by dividing the overall training set into smaller chunks (folds), then training the model on some chunks and validating it on the held-out chunk.

Using this approach, we train and evaluate the model on the different chunks and average the results, which gives a more reliable estimate of its performance. Cross-validation is used to optimize the model's performance, and it is very popular because it is simple and easy to implement.
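As a concrete illustration (a minimal sketch of my own, not code from the article; the dataset and model are arbitrary choices), scikit-learn's cross_val_score runs k-fold cross-validation in a few lines:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split the data into 5 folds; train on 4, validate on the held-out fold,
# rotate, and average the five scores for a more reliable estimate.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())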

Method 6: Experiment with a different algorithm.

What if you tried all the approaches we talked about so far and your model still results in a low or average accuracy? What then?

Sometimes we choose an algorithm that doesn't really apply to the data we have, and so we don't get the results we expect. In that case, consider changing the algorithm you're using to implement your solution. Trying out different algorithms will help you uncover more details about your data and the story it's trying to tell.

Takeaways

One of the most difficult things to learn as a new data scientist and to master as a professional one is improving your machine learning model’s accuracy. If you’re a freelance developer, own your own company, or have a role as a data scientist, having a high accuracy model can make or break your entire project.

A machine learning model with low accuracy can cause more than just financial loss. If the model is used in a sensitive scope, such as any medical application, an error in that model can lead to trauma and emotional loss for people involved with that model's results.

Luckily, there are various simple yet effective steps you can take to increase your model's accuracy and save the time, money, and effort that would otherwise be wasted on mitigating errors caused by low accuracy.

Improving the accuracy of a machine learning model is a skill that can only improve with practice. The more projects you build, the better your intuition will get about which approach you should use next time to improve your model’s accuracy. With time, your models will become more accurate and your projects more concrete.

Hyperparameters and Parameters

A Machine Learning model is defined as a mathematical model with a number of parameters that need to be learned from the data. By training a model with existing data, we are able to fit the model parameters. However, there is another kind of parameter, known as hyperparameters, that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model, such as its complexity or how fast it should learn.

Some examples of model hyperparameters include:

  1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
  2. The learning rate for training a neural network.
  3. The C and sigma hyperparameters for support vector machines.
  4. The k in k-nearest neighbors.

The aim of this article is to explore various strategies for tuning the hyperparameters of a machine learning model.

Models can have many hyperparameters, and finding the best combination of values can be treated as a search problem. Two of the best strategies for hyperparameter tuning are:

GridSearchCV
In the GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter values. The approach is called GridSearchCV because it searches for the best set of hyperparameters from a grid of hyperparameter values.

For example, suppose we want to set the two hyperparameters C and Alpha of a Logistic Regression classifier, each with its own set of candidate values. The grid search technique will construct many versions of the model, one for each possible combination of hyperparameters, and will return the best one.

As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C=0.3 and Alpha=0.2 gives the highest performance score of 0.726 and is therefore selected.
GridSearchCV

The following code illustrates how to use GridSearchCV:

# Necessary imports
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# X and y are assumed to hold the features and labels of the dataset
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

Output:

Tuned Logistic Regression Parameters: {‘C’: 3.7275937203149381}
Best score is 0.7708333333333334

Drawback: GridSearchCV goes through all possible combinations of the hyperparameters, which makes grid search computationally very expensive.
 
RandomizedSearchCV
RandomizedSearchCV addresses the drawback of GridSearchCV, as it goes through only a fixed number of hyperparameter settings. It moves within the grid in a random fashion to find the best set of hyperparameters. This approach reduces unnecessary computation.

The following code illustrates how to use RandomizedSearchCV:

# Necessary imports
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Creating the hyperparameter grid
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiating RandomizedSearchCV object
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Output:

Tuned Decision Tree Parameters: {‘min_samples_leaf’: 5, ‘max_depth’: 3, ‘max_features’: 5, ‘criterion’: ‘gini’}
Best score is 0.7265625

Decision Tree Classifier in Python and Scikit-Learn

Decision Tree Classifier for building a classification model using Python and Scikit

Decision Tree Classifier is a classification model that can be used for simple classification tasks where the data space is not huge and can be easily visualized. Despite being simple, it shows very good results on such tasks and can outperform other, more complicated models.

Article Overview:

  • Decision Tree Classifier Dataset
  • Decision Tree Classifier in Python with Scikit-Learn
  • Decision Tree Classifier – preprocessing
  • Training the Decision Tree Classifier model
  • Using our Decision Tree model for predictions
  • Decision Tree Visualisation

Decision Tree Classifier Dataset

Recently I’ve created a small dummy dataset to use for simple classification tasks. I’ll paste the dataset here again for your convenience.

Decision Tree Classifier - training data
Decision Tree Classifier – training data

The purpose of this data is, given 3 facts about a certain moment (the weather, whether it is a weekend or a workday, and whether it is morning, lunch or evening), can we predict if there's a traffic jam in the city?

Decision Tree Classifier in Python with Scikit-Learn

We have 3 dependencies to install for this project, so let’s install them now. Obviously, the first thing we need is the scikit-learn library, and then we need 2 more dependencies which we’ll use for visualization.

pip3 install scikit-learn
pip3 install matplotlib
pip3 install pydotplus

Decision Tree Classifier – installing dependencies

Now let’s import what we need from these packages.

from sklearn import preprocessing
from sklearn import tree
from IPython.display import Image
import pydotplus

Decision Tree Classifier – importing dependencies

def getWeather():
    return ['Clear', 'Clear', 'Clear', 'Clear', 'Clear', 'Clear',
            'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
            'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy']

def getTimeOfWeek():
    return ['Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend']

def getTimeOfDay():
    return ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            ]

def getTrafficJam():
    return ['Yes', 'No', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'Yes', 'No', 'Yes'
            ]

Decision Tree Classifier – loading the data

Decision Tree Classifier – preprocessing

We know that computers have a really hard time when dealing with text and we can make their lives easier by converting the text to numerical values.

Label Encoder

We will use this encoder provided by scikit to transform categorical data from text to numbers. If we have n possible values in our dataset, then LabelEncoder model will transform it into numbers from 0 to n-1 so that each textual value has a number representation.

For example, let’s encode our time of day values.

    timeOfDay = ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            ]
    labelEncoder = preprocessing.LabelEncoder()
    encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
    print (encodedTimeOfDay)
    
    # Prints [2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0]

Decision Tree Classifier – encoding our data

Training the Decision Tree Classifier model

Now let's train our model. Remember, since all our features are textual values, we need to encode all our values; only then can we jump to training.

if __name__=="__main__":
    # Get the data
    weather = getWeather()
    timeOfWeek = getTimeOfWeek()
    timeOfDay = getTimeOfDay()
    trafficJam = getTrafficJam()

    labelEncoder = preprocessing.LabelEncoder()

    # Encode the features and the labels
    encodedWeather = labelEncoder.fit_transform(weather)
    encodedTimeOfWeek = labelEncoder.fit_transform(timeOfWeek)
    encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
    encodedTrafficJam = labelEncoder.fit_transform(trafficJam)

    # Build the features
    features = []
    for i in range(len(encodedWeather)):
        features.append([encodedWeather[i], encodedTimeOfWeek[i], encodedTimeOfDay[i]])

    classifier = tree.DecisionTreeClassifier()
    classifier = classifier.fit(features, encodedTrafficJam)

Decision Tree Classifier – training our model

Using our Decision Tree model for predictions

Now we can use the model we have trained to make predictions about the traffic jam.


    # ["Snowy", "Workday", "Morning"]
    print(classifier.predict([[2, 1, 2]]))
    # Prints [1], meaning "Yes"
    # ["Clear", "Weekend", "Lunch"]
    print(classifier.predict([[0, 0, 1]]))
    # Prints [0], meaning "No"

Decision Tree Classifier – making predictions

And it seems to be working! It correctly predicts the traffic jam situations given our data.

Decision Tree Visualisation

Scikit also provides us with a way of visualizing a Decision Tree model. Here’s a quick helper method I wrote to generate a png image from our decision tree.

def printTree(classifier):
    feature_names = ['Weather', 'Time of Week', 'Time of Day']
    # LabelEncoder maps 'No' to 0 and 'Yes' to 1, so list the class names in that order
    target_names = ['No', 'Yes']
    # Build the data
    dot_data = tree.export_graphviz(classifier, out_file=None,
                                    feature_names=feature_names,
                                    class_names=target_names)
    # Build the graph
    graph = pydotplus.graph_from_dot_data(dot_data)

    # Show the image
    Image(graph.create_png())
    graph.write_png("tree.png")

Decision Tree Classifier – visualizing the decision tree

And here’s the result from that.

Decision Tree Classifier – visualization

Descriptive Statistics: Expectations vs. Reality (Exploratory Data Analysis)

An easy descriptive statistics approach to summarizing numeric and categorical variables through measures of central tendency and measures of spread, for any exploratory data analysis process.

Image for post

About the Exploratory Data Analysis (EDA)

EDA is the first step in the data analysis process. It allows us to understand the data we are dealing with by describing and summarizing the dataset’s main characteristics, often through visual methods like bar and pie charts, histograms, boxplots, scatterplots, heatmaps, and many more.

Why is EDA important?

  • Maximize insight into a dataset (be able to listen to your data)
  • Uncover underlying structure/patterns
  • Detect outliers and anomalies
  • Extract and select important variables
  • Increase computational efficiency
  • Test underlying assumptions (e.g. business intuition)

Moreover, to be able to explore and explain the dataset's features with all their attributes, getting insights and efficient numeric summaries of the data, we need help from descriptive statistics.

Statistics is divided into two major areas:

  • Descriptive statistics: describe and summarize data;
  • Inferential statistics: methods for using sample data to make general conclusions (inferences) about populations.

This tutorial focuses on descriptive statistics of both numerical and categorical variables and is divided into two parts:

  • Measures of central tendency;
  • Measures of spread.

Descriptive statistics

Also named Univariate Analysis (one feature analysis at a time), descriptive statistics, in short, help describe and understand the features of a specific dataset, by giving short numeric summaries about the sample and measures of the data.

Descriptive statistics are mere exploration, as they do not allow us to draw conclusions beyond the data we have analysed or to reach conclusions regarding any hypotheses we might have made.

Numerical and categorical variables, as we will see shortly, have different descriptive statistics approaches.

Let’s review the type of variables:

Image for post
Type of variables — Image by author
  • Numerical continuous: The values are not countable and have an infinite number of possibilities (Someone’s age: 25 years, 4 days, 11 hours, 24 minutes, 5 seconds and so on to the infinite).
  • Numerical discrete: The values are countable and have a finite number of possibilities (It is impossible to count 27.52 countries in the EU).
  • Categorical ordinal: There is an order implied in the levels (January always comes before February and after December).
  • Categorical nominal: There is no order implied in the levels (Female/male, or the wind direction: north, south, east, west).

Numerical variables

Image for post
  • Measures of central tendency: Mean, median
  • Measures of spread: Standard deviation, variance, percentiles, maximum, minimum, skewness, kurtosis
  • Others: Size, unique, number of uniques

One approach to displaying the data is through a boxplot. It gives you the 5 basic stats: the minimum, the 1st quartile (25th percentile), the median, the 3rd quartile (75th percentile), and the maximum.

Image for post

Categorical variables

Image for post
Bar plot of the categorical ordinal variable. Image by author
  • Measures of central tendency: Mode (most common)
  • Measures of spread: Number of uniques
  • Others: Size, % Highest unique

Understanding:

Measures of central tendency

  • Mean (average): The total sum of values divided by the total number of observations. The mean is highly sensitive to outliers.
  • Median (center value): The middle value of an ordered sequence of numbers. The median is not affected by outliers.
  • Mode (most common): The value most frequently observed. There can be more than one modal value in the same variable.

Measures of spread

  • Variance (variability from the mean): The square of the standard deviation. It is also affected by outliers.
  • Standard deviation (concentrated around the mean): The standard amount of deviation (distance) from the mean. The std is affected by the outliers. It is the square root of the variance.
  • Percentiles: The value below which a percentage of data falls. The 0th percentile is the minimum value, the 100th is the maximum, the 50th is the median.
  • Minimum: The smallest or lowest value.
  • Maximum: The greatest or highest value.
  • The number of uniques (total distinct): The total amount of distinct observations.
  • Uniques (distinct): The distinct values or groups of values observed.
  • Skewness (symmetry): How much a distribution deviates from the normal distribution.
    >> The skewness concept is explained in the next section.
  • Kurtosis (volume of outliers): How long the tails are and how sharp the peak of the distribution is.
    >> The kurtosis concept is explained in the next section.

Others

  • Count (size): The total number of observations. Counting is also necessary for calculating the mean, median, and mode.
  • % highest unique (relativity): The proportion of the most frequent unique value relative to all observations.

Skewness

In a perfect world, the data’s distribution assumes the form of a bell curve (Gaussian or normally distributed), but in the real world, data distributions usually are not symmetric (= skewed).

Therefore, the skewness indicates how much our distribution deviates from the normal distribution (which has a skewness value of zero or very close to it).
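In symbols, the standard moment-based definition (added here for reference) is the third standardized moment:

\[ \text{Skewness}(X) = \mathbb{E}\!\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right] \]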

Image for post
Skewness curves. Image by author

There are three generic types of distributions:

  • Symmetrical [median = mean]: In a normal distribution, the mean (average) divides the data symmetrically at the median value or close.
  • Positive skew [median < mean]: The distribution is asymmetrical, the tail is skewed/longer towards the right-hand side of the curve. In this type, the majority of the observations are concentrated on the left tail, and the value of skewness is positive.
  • Negative skew [median > mean]: The distribution is asymmetrical and the tail is skewed/longer towards the left-hand side of the curve. In this type of distribution, the majority of the observations are concentrated on the right tail, and the value of skewness is negative.

Rules of thumb:

  • Symmetric distribution: values between –0.5 to 0.5.
  • Moderate skew: values between –1 and –0.5, or between 0.5 and 1.
  • High skew: values < –1 or > 1.

Kurtosis

Kurtosis is another useful tool when it comes to quantifying the shape of a distribution. It measures both how long the tails are and, most importantly, how sharp the peak of the distribution is.

If the distribution has a sharper and taller peak and shorter tails, then it has a higher kurtosis while a low kurtosis can be observed when the peak of the distribution is flatter with thinner tails. There are three types of kurtosis:

Image for post
  • Leptokurtic: The distribution is tall and thin. The value of a leptokurtic must be > 3.
  • Mesokurtic: This distribution looks the same or very similar to a normal distribution. The value of a “normal” mesokurtic is = 3.
  • Platykurtic: The distributions have a flatter and wider peak and thinner tails, meaning that the data is moderately spread out. The value of a platykurtic must be < 3.

The kurtosis values determine the volume of the outliers only.

Kurtosis is calculated by raising the standardized data to the fourth power and averaging. If we raise any standardized value smaller than 1 (in absolute terms) to the 4th power, the result is a very small number, somewhere close to zero. Such a small value would not contribute much to the kurtosis. The conclusion is that the values that make a difference to the kurtosis are the ones far away from the region of the peak or, to put it in other words, the outliers.
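In symbols, this is the fourth standardized moment (standard definition, added here for reference). Note that scipy.stats.kurtosis, used later in this post, reports excess kurtosis by default, i.e. this quantity minus 3:

\[ \text{Kurtosis}(X) = \mathbb{E}\!\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right] \]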

The Jupyter notebook — IPython

In this section, we will be giving short numeric stats summaries concerning the different measures of central tendency and dispersion of the dataset.

Let's work through some practical examples of descriptive statistics in Pandas.

Start by importing the required libraries:

import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the dataset:
df = pd.read_csv("sample.csv", sep=";")

Print the data:
df.head()

Image for post

Before any stats calculus, let’s just take a quick look at the data:
df.info()

Image for post
Image by author

The dataset consists of 310 observations and 2 columns. One of the attributes is numerical, and the other categorical. Both columns have no missing values.

Numerical variable

The numerical variable we are going to analyze is age. First step is to visually observe the variable. So let’s plot an histogram and a boxplot.

plt.hist(df.age, bins=20)
plt.xlabel("Age")
plt.ylabel("Absolute Frequency")
plt.show()
Image for post
sns.boxplot(x="age", data=df, orient="h").set(xlabel="Age", title="Numeric variable 'Age'");
Image for post

It is also possible to visually observe the variable with both a histogram and a boxplot combined. I find it a useful graphical combination and use it a lot in my reports.

age = df.age

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.8, 1.2)})

mean = np.array(age).mean()
median = np.median(age)

sns.boxplot(age, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')

sns.distplot(age, ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--')
ax_hist.axvline(median, color='g', linestyle='-')

plt.legend({'Mean': mean, 'Median': median})
plt.title("'Age' histogram + boxplot")
ax_box.set(xlabel='')
plt.show()
Image for post

Measures of central tendency

1. Mean:
df.age.mean()

35.564516129032256

2. Median:
df.age.median()

32.0

Measures of spread

3. Standard deviation:
df.age.std()

18.824363618000913

4. Variance:
df.age.var()

354.3566656227164

5. a) Percentiles 25%:
df.age.quantile(0.25)

23.0

b) Percentile 75%:
df.age.quantile(0.75)

45.0

c) In one go:
df.age.quantile(q=[.25, .75])

0.25    23.0
0.75 45.0
Name: age, dtype: float64

6. Minimum and maximum:
df.age.min(), df.age.max()

(3, 98)

7. Skewness (with scipy):
scipy.stats.skew(df.age)

0.9085582496839909

8. Kurtosis (with scipy):
scipy.stats.kurtosis(df.age)

0.7254158742250474

Others

9. Size (number of rows):
df.age.count()

310

10. Number of uniques (total distinct)
df.age.nunique()

74

11. Uniques (distinct):
df.age.unique()

array([46, 22, 54, 33, 69, 35, 11, 97, 50, 34, 67, 43, 21, 12, 23, 45, 89, 76, 5, 55, 65, 24, 27, 57, 38, 28, 36, 60, 56, 53, 26, 25, 42, 83, 16, 51, 90, 10, 70, 44, 20, 31, 47, 30, 91, 7, 6, 41, 66, 61, 96, 32, 58, 17, 52, 29, 75, 86, 98, 48, 40, 13, 4, 68, 62, 9, 18, 39, 15, 19,  8, 71, 3, 37])

Categorical variable

The categorical variable we are going to analyze is city. Let’s plot a bar chart and get a visual observation of the variable.

df.city.value_counts().plot.bar()
plt.xlabel("City")
plt.ylabel("Absolute Frequency")
plt.title("Categoric variable 'City'")
plt.show()
Image for post

Measures of central tendency

1. Mode:
df.city.mode()[0]

'Paris'

Measures of spread

2. Number of uniques:
df.city.nunique()

6

3. Uniques (distinct):
df.city.unique()

array(['Lisbon', 'Paris', 'Madrid', 'London', 'Luxembourg', 'Berlin'], dtype=object)

4. Most frequent unique (value count):
df.city.value_counts().head(1)

Paris     67
Name: city, dtype: int64

Others

5. Size (number of rows):
df.city.count()

310

6. % of the highest unique (fraction of the most common unique in regards to all the others):
p = df.city.value_counts(normalize=True)[0]
print(f"{p:.1%}")

21.6%

The describe() method shows the descriptive statistics gathered in one table. By default, stats for numeric data. The result is represented as a pandas dataframe.
df.describe()

Image for post

Adding other non-standard values, for instance, the ‘variance’.
describe_var = df.describe()
describe_var.append(pd.Series(df.var(), name='variance'))

Image for post

Displaying categorical data.
df.describe(include=["O"])
<=> df.describe(exclude=['float64','int64'])
<=> df.describe(include=['object'])

Image for post

Passing the parameter include='all' displays both numeric and categorical variables at once.
df.describe(include='all')

Image for post

Conclusion

These are the basics of descriptive statistics when developing an exploratory data analysis project with the help of Pandas, Numpy, Scipy, Matplotlib and/or Seaborn. When well performed, these stats help us to understand and transform the data for further processing.

10 Must-Know Statistical Concepts for Data Scientists

Statistics is a building block of data science

Image for post

Data science is an interdisciplinary field. One of the building blocks of data science is statistics. Without a decent level of statistics knowledge, it would be highly difficult to understand or interpret the data.

Statistics helps us explain the data. We use statistics to infer results about a population based on a sample drawn from that population. Furthermore, machine learning and statistics have plenty of overlaps.

Long story short, one needs to study and learn statistics and its concepts to become a data scientist. In this article, I will try to explain 10 fundamental statistical concepts.

1. Population and sample

Population is all elements in a group. For example, college students in the US is a population that includes all of the college students in the US. 25-year-old people in Europe is a population that includes all of the people who fit the description.

It is not always feasible or possible to do analysis on population because we cannot collect all the data of a population. Therefore, we use samples.

Sample is a subset of a population. For example, 1000 college students in the US is a subset of the "college students in the US" population.

2. Normal distribution

Probability distribution is a function that shows the probabilities of the outcomes of an event or experiment. Consider a feature (i.e. column) in a dataframe. This feature is a variable and its probability distribution function shows the likelihood of the values it can take.

Probability distribution functions are quite useful in predictive analytics or machine learning. We can make predictions about a population based on the probability distribution function of a sample from that population.

Normal (Gaussian) distribution is a probability distribution function that looks like a bell.

Image for post
A typical normal distribution curve (image by author)

The peak of the curve indicates the most likely value the variable can take. As we move away from the peak, the probability of the values decreases.

3. Measures of central tendency

Central tendency is the central (or typical) value of a probability distribution. The most common measures of central tendency are mean, median, and mode.

  • Mean is the average of the values in a series.
  • Median is the value in the middle when values are sorted in ascending or descending order.
  • Mode is the value that appears most often.

4. Variance and standard deviation

Variance is a measure of the variation among values. It is calculated by adding up the squared differences between each value and the mean and then dividing the sum by the number of samples.

Image for post
(image by author)

Standard deviation is a measure of how spread out the values are. To be more specific, it is the square root of variance.
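Written out (standard population definitions, restated here since the formula image is not reproduced; note that the sample variance, which pandas' .var() computes by default, divides by n - 1 instead of n):

\[ \sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \mu)^{2}, \qquad \sigma = \sqrt{\sigma^{2}} \]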

Note: Mean, median, mode, variance, and standard deviation are basic descriptive statistics that help to explain a variable.

5. Covariance and correlation

Covariance is a quantitative measure that represents how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) value.

The figure below shows some values of the random variables X and Y. The orange dot represents the mean of these variables. The values change similarly with respect to the mean value of the variables. Thus, there is positive covariance between X and Y.

Image for post
(image by author)

The formula for covariance of two random variables:

Image for post
(image by author)

where E is the expected value and µ is the mean.
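Written out (the standard definition, restated here since the formula appears only as an image):

\[ \operatorname{Cov}(X, Y) = \mathbb{E}\big[(X - \mu_{X})(Y - \mu_{Y})\big] \]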

Note: The covariance of a variable with itself is the variance of that variable.

Correlation is a normalization of covariance by the standard deviation of each variable.

Image for post
(image by author)

where σ is the standard deviation.
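For reference, the standard form of this normalization is:

\[ \rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_{X}\,\sigma_{Y}} \]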

This normalization cancels out the units, and the absolute value of the correlation is always between 0 and 1; the signed correlation lies between -1 and 1, with negative values indicating a negative relationship between the two variables. If we are comparing the relationship among three or more variables, it is better to use correlation, because differing value ranges or units may cause false assumptions.

6. Central limit theorem

In many fields including natural and social sciences, when the distribution of a random variable is unknown, normal distribution is used.

Central limit theorem (CLT) justifies why normal distribution can be used in such cases. According to the CLT, as we take more samples from a distribution, the sample averages will tend towards a normal distribution regardless of the population distribution.

Consider a case where we need to learn the distribution of the heights of all 20-year-old people in a country. It is almost impossible, and of course not practical, to collect this data. So, we take samples of 20-year-old people across the country and calculate the average height of the people in each sample. The CLT states that as we take more samples from the population, the sampling distribution of these averages will get close to a normal distribution.

Why is it so important to have a normal distribution? Normal distribution is described in terms of mean and standard deviation which can easily be calculated. And, if we know the mean and standard deviation of a normal distribution, we can compute pretty much everything about it.

7. P-value

P-value is a measure of the likelihood of a value that a random variable takes. Consider we have a random variable A and the value x. The p-value of x is the probability that A takes the value x or any value that has the same or less chance to be observed. The figure below shows the probability distribution of A. It is highly likely to observe a value around 10. As the values get higher or lower, the probabilities decrease.

Image for post
Probability distribution of A (image by author)

We have another random variable B and want to see if B is greater than A. The average of the sample means obtained from B is 12.5. The p-value for 12.5 is the green area in the graph below. The green area indicates the probability of getting 12.5 or a more extreme value (higher than 12.5 in our case).

Image for post
(image by author)

Let’s say the p value is 0.11 but how do we interpret it? A p value of 0.11 means that we are 89% sure of the results. In other words, there is 11% chance that the results are due to random chance. Similarly, a p value of 0.05 means that there is 5% chance that the results are due to random chance.

Note: Lower p values show more certainty in the result.

If the average of sample means from the random variable B turns out to be 15 which is a more extreme value, the p value will be lower than 0.11.

Image for post
(image by author)

8. Expected value of random variables

The expected value of a random variable is the weighted average of all possible values of the variable. The weight here means the probability of the random variable taking a specific value.

The expected value is calculated differently for discrete and continuous random variables.

  • Discrete random variables take finitely many or countably infinitely many values. The number of rainy days in a year is a discrete random variable.
  • Continuous random variables take uncountably infinitely many values. For instance, the time it takes from your home to the office is a continuous random variable. Depending on how you measure it (minutes, seconds, nanoseconds, and so on), it takes uncountably infinitely many values.

The formula for the expected value of a discrete random variable is:

Image for post
(image by author)

The expected value of a continuous random variable is calculated with the same logic but using different methods. Since continuous random variables can take uncountably infinitely many values, we cannot talk about a variable taking a specific value. We rather focus on value ranges.

In order to calculate the probability of value ranges, probability density functions (PDF) are used. PDF is a function that specifies the probability of a random variable taking value within a particular range.

Image for post
(image by author)
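For reference, the standard formulas described above (restated here since the images are not reproduced) are:

\[ \mathbb{E}[X] = \sum_{i} x_{i}\, p(x_{i}) \quad \text{(discrete)}, \qquad \mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx \quad \text{(continuous)} \]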

9. Conditional probability

Probability simply means the likelihood of an event occurring and always takes a value between 0 and 1 (0 and 1 inclusive). The probability of event A is denoted as p(A) and calculated as the number of desired outcomes divided by the number of all outcomes. For example, when you roll a die, the probability of getting a number less than three is 2 / 6. The number of desired outcomes is 2 (1 and 2); the number of total outcomes is 6.

Conditional probability is the likelihood of an event A to occur given that another event that has a relation with event A has already occurred.

Suppose that we have 6 blue balls and 4 yellow balls placed in two boxes as seen below. I ask you to randomly pick a ball. The probability of getting a blue ball is 6 / 10 = 0.6. What if I ask you to pick a ball from box A? The probability of picking a blue ball clearly decreases. The condition here is to pick from box A, which clearly changes the probability of the event (picking a blue ball). The probability of event A given that event B has occurred is denoted as p(A|B).
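In symbols (the standard definition, added here for reference):

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad \text{provided } P(B) > 0 \]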

Image for post
(image by author)

10. Bayes’ theorem

According to Bayes' theorem, the probability of event A given that event B has already occurred can be calculated using the probabilities of event A and event B and the probability of event B given that A has already occurred.

Image for post
(image by author)
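For reference, Bayes' theorem in symbols (the standard form) is:

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]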

Bayes' theorem is so fundamental and ubiquitous that a field called "Bayesian statistics" exists. In Bayesian statistics, the probability of an event or hypothesis is updated as evidence comes into play. Therefore, prior probabilities and posterior probabilities differ depending on the evidence.

The Naive Bayes algorithm is built by combining Bayes' theorem with some naive assumptions: it assumes that features are independent of each other and that there is no correlation between features.

Conclusion

We have covered some basic yet fundamental statistical concepts. If you are working or plan to work in the field of data science, you are likely to encounter these concepts.

There is, of course, much more to learn about statistics. Once you understand the basics, you can steadily build your way up to advanced topics.

Data Preprocessing with Python Pandas — Binning

Image for post

Data binning (or bucketing) groups data into bins (or buckets), in the sense that it replaces values contained in a small interval with a single representative value for that interval. Sometimes binning improves accuracy in predictive models.

Data binning is a type of data preprocessing, a mechanism which also includes dealing with missing values, formatting, normalization and standardization.

There are two approaches to perform data binning:

  • numeric to categorical, which converts numeric into categorical variables
  • sampling, which corresponds to data quantization.

You can download the full code of this tutorial from the GitHub repository.

Data Import

In this tutorial we exploit the cupcake.csv dataset, which contains the trend search of the word cupcake on Google Trends. Data are extracted from this link. We exploit the pandas library to import the dataset and we transform it into a dataframe through the read_csv() function.

import pandas as pd
df = pd.read_csv('cupcake.csv')
df.head(5)
Image for post

Numeric to categorical binning

In this case we group the values of the column Cupcake into three groups: small, medium and big. In order to do this, we need to calculate the intervals within which each group falls. We calculate the interval range as the difference between the maximum and minimum values and then split this interval into three parts, one for each group. We exploit the functions min() and max() of the dataframe to calculate the minimum value and the maximum value of the column Cupcake.

min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
print(min_value)
print(max_value)

which gives the following output

4
100

Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):

  • small — (edge1, edge2)
  • medium — (edge2, edge3)
  • big — (edge3, edge4)

We can use the linspace() function of the numpy package to calculate the 4 bins, equally distributed.

import numpy as np
bins = np.linspace(min_value, max_value, 4)
bins

which gives the following output:

array([  4.,  36.,  68., 100.])

Now we define the labels:

labels = ['small', 'medium', 'big']

We can use the cut() function to convert the numeric values of the column Cupcake into the categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True in order to include also the minimum value.

df['bins'] = pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)

We can plot the distribution of values, by using the hist() function of the matplotlib package.

import matplotlib.pyplot as plt
plt.hist(df['bins'], bins=3)
Image for post

Sampling

Sampling is another technique of data binning. It makes it possible to reduce the number of samples by grouping similar or contiguous values. There are three approaches to perform sampling:

  • by bin means: each value in a bin is replaced by the mean value of the bin.
  • by bin median: each bin value is replaced by its bin median value.
  • by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.

In order to perform sampling, the binned_statistic() function of the scipy.stats package can be used. This function receives two arrays as input, x_data and y_data, as well as the statistics to be used (e.g. median or mean) and the number of bins to be created. The function returns the values of the bins as well as the edges of each bin. We can calculate the x values (x_bins) corresponding to the binned values (y_bins) as the values at the center of the bin range.

from scipy.stats import binned_statistic
x_data = np.arange(0, len(df))
y_data = df['Cupcake']
y_bins, bin_edges, misc = binned_statistic(x_data, y_data, statistic="median", bins=10)
x_bins = (bin_edges[:-1]+bin_edges[1:])/2
x_bins

which gives the following output:

array([ 10.15,  30.45,  50.75,  71.05,  91.35, 111.65, 131.95, 152.25, 172.55, 192.85])

Finally, we plot results.

plt.plot(x_data, y_data)
plt.xlabel("X")
plt.ylabel("Y")
plt.scatter(x_bins, y_bins, color='red', linewidth=5)
plt.show()
Image for post

Summary

In this tutorial I have illustrated how to perform data binning, which is a technique for data preprocessing. Two approaches can be followed. The first approach converts numeric data into categorical data, the second approach performs data sampling, by reducing the number of samples.

Data binning is very useful when discretization is needed.

Feature Handling in Machine Learning

Image for post

This is the second part in the Machine Learning series, where we discuss feature handling before using the data for machine learning models. The article contains the parts below:

  1. Feature representation
  2. Feature selection
  3. Feature transformation
  4. Feature engineering

This article covers the basic concepts of modifying features, as data needs to be refined before it can be used for prediction. We need to remove the garbage from the data and turn raw features into high-quality features.

Feature Representation

Your features need to be represented as quantitative (preferably numeric) attributes of the thing you're sampling. They can be real-world values, such as the readings from a sensor and other discernible, physical properties. Alternatively, your features can also be calculated derivatives, such as the presence of certain edges and curves in an image, or lack thereof.

But there is no guarantee that will be the case, and you will often encounter data in textual or other unstructured forms. Luckily, there are a few techniques that when applied, clean up these scenarios.

Textual Categorical-Features

If you have a categorical feature, the way to represent it in your dataset depends on if it’s ordinal or nominal. For ordinal features, map the order as increasing integers in a single numeric feature.

Image for post
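Since the original example appears only as an image, here is a small sketch of what that mapping can look like in pandas (the 'size' column and its values are invented for illustration):

import pandas as pd

# Hypothetical ordinal feature: shirt sizes have a natural order.
df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})

# Map the order as increasing integers in a single numeric feature.
order = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(order)
print(df)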

On the other hand, if your feature is nominal (and thus there is no obvious numeric ordering), then you have two options. The first is that you can encode it as integers, similarly to the ordinal case above. This is a fast-and-dirty approach that may or may not cause problems for you in the future. If you aren't getting the results you hoped for, or even if you are but would like to further increase accuracy, then a more precise encoding approach is to separate the distinct values out into individual boolean features:

Image for post

These newly created features are called boolean features because the only values they can contain are either 0 for non-inclusion, or 1 for inclusion. Pandas' .get_dummies() method allows you to completely replace a single, nominal feature with multiple boolean indicator features. This method is quite powerful and has many configurable options, including the ability to return a SparseDataFrame, and other prefixing options. Its benefit is that no erroneous ordering is introduced into your dataset.
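As a quick sketch of the boolean-indicator approach (the 'color' column here is a made-up example, not data from the article):

import pandas as pd

# Hypothetical nominal feature with no meaningful ordering.
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Replace the single nominal column with one boolean indicator column per value.
dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df.drop(columns='color'), dummies], axis=1)
print(df)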

Pure Textual Features

If you are trying to “featurize” a body of text such as a webpage, a tweet, a passage from a newspaper, an entire book, or a PDF document, creating a corpus of words and counting their frequency is an extremely powerful encoding tool. This is also known as the Bag of Words model, implemented with the CountVectorizer() method in SciKit-Learn.

Image for post
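Since the original snippet appears only as an image, here is a minimal sketch of the Bag of Words model with CountVectorizer (the corpus is invented):

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus of three short documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Build a vocabulary and count word frequencies per document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one row of counts per document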

Graphical Features

In addition to text and natural language processing, bag of words has successfully been applied to images by categorizing a collection of regions and describing only their appearance, ignoring any spatial structure. However this is not the typical approach used to represent images as features, and requires you come up with methods of categorizing image regions. More often used methods include:

  1. Split the image into a grid of smaller areas, and attempt feature extraction at each locality. Return a combined array of all discovered features.
  2. Use variable-length gradients and other transformations as the features, such as regions of high / low luminosity, histogram counts for horizontal and vertical black pixels, stroke and edge detection, etc.
  3. Resize the picture to a fixed size, convert it to grayscale, then encode every pixel as an element in a uni-dimensional feature array.
Image for post

If you're wondering what the :: is doing, that is called extended slicing. Notice the .reshape(-1) line. This tells Pandas to take your 2D image and flatten it into a 1D array. This is an all-purpose method you can use to change the shape of your dataframes, so long as you maintain the number of elements, for example reshaping a [10, 10] to [100, 1] or [4, 25], etc. Another method called .ravel() will do the same thing as .reshape(-1), that is, unravel a multi-dimensional NDArray into a one-dimensional one. The reason why it's important to reshape your 2D array images into one-dimensional ones is that each image will represent a single sample, and Sklearn expects your dataframe to be shaped [num_samples, num_features].
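A tiny sketch of that flattening step (the image here is a synthetic array, not one from the article):

import numpy as np

# A synthetic 10x10 "grayscale image".
img = np.arange(100).reshape(10, 10)

# Flatten it into a 1D feature vector: one sample, 100 features.
flat = img.reshape(-1)   # shape (100,)
same = img.ravel()       # equivalent result
print(flat.shape, same.shape)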

Feature Selection

Most of the time, we will have many non-informative features, for example Name or ID variables, and they result in "garbage in, garbage out". Also, extra features make a model complex, time-consuming, and harder to implement in production. Many machine learning algorithms suffer from the curse of dimensionality; that is, they do not perform well when given a large number of variables or features. So it's better to remove highly irrelevant or redundant features to simplify the situation and improve performance.

For instance, if your dataset has columns you don't need, you can remove them using the drop() method by specifying the column names. axis=1 indicates that the deletion happens column-wise, while axis=0 implies that it happens row-wise.

Image for post
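Since the original snippet is shown only as an image, here is a small sketch (the 'ID' and 'name' columns are invented):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'name': ['a', 'b', 'c'],
                   'value': [10, 20, 30]})

# Drop non-informative columns; axis=1 means the deletion is column-wise.
df = df.drop(['ID', 'name'], axis=1)
print(df)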

Or, if you want only select columns for analysis or visualization purposes, you can select those columns by enclosing them within double square brackets.

Image for post

Sometimes, we want to remove a feature but use it as an index instead. We can do this by specifying the column name as the index when loading the data.

Image for post

We can set the column as index later as well by using set_index() method.

Image for post

We can further improve the situation of having too many features through dimensionality reduction.

Commonly used techniques are:

  • PCA (Principal Component Analysis) — Considered a more statistical approach than a machine learning one. It tries to preserve the directions of the data that carry the most variation and discard the parts with little variation. One important thing to note is that it is an unsupervised dimensionality reduction technique: it relies only on the correlations between the features, without using any labels. A short scikit-learn sketch follows this list.
  • t-SNE (t-Distributed Stochastic Neighbor Embedding) — In this approach, the target number of dimensions is typically 2 or 3, which means t-SNE is used a lot for visualizing data, since data with more than 3 dimensions is hard for the human brain to picture. t-SNE has a remarkable capability of keeping points that are close in the high-dimensional space close in the two-dimensional space.
  • Feature embedding — Based on training a separate machine learning model to encode a large number of features into a small number of features.
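As a minimal sketch of PCA with scikit-learn (toy data; in practice X would be your own feature matrix):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)            # toy feature matrix: 100 samples, 10 features
pca = PCA(n_components=2)              # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of the variance captured by each component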

Feature Transformation

Pandas will automatically attempt to figure out the best data type to use for each series in your dataset. Most of the time it does this flawlessly, but other times it fails horribly! The .read_html() method in particular is notorious for defaulting all series data types to Python objects. You should check, and double-check, the actual type of each column in your dataset to avoid unwanted surprises. If your data types don't look the way you expect, explicitly convert them to the desired type using the .to_datetime(), .to_numeric(), and .to_timedelta() methods:

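A minimal sketch with made-up columns (the names and values are purely illustrative):

import pandas as pd

df = pd.DataFrame({"price": ["3.5", "7", "n/a"],
                   "date": ["2020-01-01", "2020-02-30", "2020-03-15"]})
df["price"] = pd.to_numeric(df["price"], errors="coerce")   # "n/a" becomes NaN
df["date"] = pd.to_datetime(df["date"], errors="coerce")    # the invalid date becomes NaT
print(df.dtypes)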

Take note of how to_numeric properly converts to float or integer depending on the data it finds. The errors='coerce' parameter instructs Pandas to insert a NaN at any field where the conversion fails.

Sometimes, even though the data type is correct, we still need to modify the values of a feature. For example, we may need to divide all the values by 10 or convert them to their logarithms.

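A quick sketch, again with a hypothetical column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [1000, 2500, 40000]})
df["income_scaled"] = df["income"] / 10    # divide every value by 10
df["income_log"] = np.log(df["income"])    # logarithmic transform
print(df)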

Feature Engineering

Just as oil needs to be refined before it is used, data needs to be refined before we use it for machine learning. Sometimes, we need to derive new features out of existing ones. This process of extracting new features from existing ones is called feature engineering. Classical machine learning depends on feature engineering much more than deep learning does.

Below are some types of Feature Engineering.

  1. Aggregation — New features are created by taking a count, sum, average, or median over a group of entities.
  2. Part-of — New features are created by extracting a part of a data structure, e.g. extracting the month from a date.
  3. Binning — Here you group your entities into bins and then apply aggregations over those bins, e.g. grouping customers by age and then calculating the average purchase within each group.
  4. Flagging — Here you derive a boolean (0/1 or True/False) value for each entity.

For example, we may need to summarize data by finding its sum, average, minimum, or maximum value and then create new features from those values, as in the sketch below.

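A minimal sketch with hypothetical purchase data, covering aggregation and binning:

import pandas as pd

purchases = pd.DataFrame({"customer_age": [23, 35, 35, 51, 51, 51],
                          "amount": [10.0, 25.0, 40.0, 5.0, 30.0, 12.0]})

# Aggregation: summary statistics per group become new features
stats = purchases.groupby("customer_age")["amount"].agg(["count", "sum", "mean", "min", "max"])

# Binning: put ages into bins first, then aggregate over the bins
purchases["age_group"] = pd.cut(purchases["customer_age"], bins=[0, 30, 50, 100])
avg_by_group = purchases.groupby("age_group")["amount"].mean()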

Covariance vs Correlation in the Art of Data Science.

Detailed explanation with examples


Covariance and correlation are widely-used measures in the field of statistics, and thus both are very important concepts in data science. They provide insight into the relationship between random variables or features in a dataset. Although the two concepts are highly related, we need to interpret them carefully to avoid misunderstandings.

Covariance

Covariance is a quantitative measure that represents how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) values. Consider the random variables X and Y. Some realizations of these variables are shown in the figure below. The orange dot marks the mean of X and the mean of Y. As the values of X move away from the mean of X in the positive direction, the values of Y tend to change in a similar way. The same relationship holds in the negative direction as well.

[Figure: scatter plot of X and Y showing positive covariance]

The formula for covariance of two random variables:

Cov(X, Y) = E[(X - µ_X)(Y - µ_Y)]

where E denotes the expectation and µ_X, µ_Y are the means of X and Y.

If X and Y change in the same direction, as in the figure above, the covariance is positive. Let's confirm with NumPy's covariance function:
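A minimal sketch with synthetic stand-ins for X and Y (the exact numbers will differ from the ones quoted below, which come from the realizations plotted above):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=50)
Y = X + rng.normal(scale=0.5, size=50)   # Y tends to move with X

cov_matrix = np.cov(X, Y)   # [[var(X), cov(X, Y)], [cov(X, Y), var(Y)]]
print(cov_matrix)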


np.cov() returns the covariance matrix. For the realizations plotted above, the covariance of X and Y is 0.11. The value at position [0,0] is the covariance of X with itself and the value at [1,1] is the covariance of Y with itself. If you run np.cov(X, X), every entry of the result equals the value at position [0,0], which is 0.07707877 in this case. Similarly, np.cov(Y, Y) gives the value at position [1,1].

The covariance of a variable with itself is actually the variance of that variable:

Cov(X, X) = E[(X - µ_X)(X - µ_X)] = Var(X)

Let’s go over another example. The figure below shows some realizations of random variables Z and T. As we can see, as T increases, Z tends to decrease. Thus, the covariance of Z and T should be negative:

[Figure: scatter plot of Z and T, with np.cov output showing negative covariance]

We may also see variables whose variations are independent of each other. For example, in the figure below, realizations of variables A and B seem to change randomly with respect to each other. In this case, we expect to see a covariance value close to zero. Let's confirm:

[Figure: scatter plot of A and B, with np.cov output showing a covariance close to zero]

The following example will provide a little more intuition about the calculation of covariance.

[Figure: scatter plot of s1 and s2 with their mean lines at x = 8.5 and y = 9.3]

Covariance describes how similarly two random variables deviate from their means. The red lines show the means of the two series: the mean of s1 is the vertical line (x = 8.5) and the mean of s2 is the horizontal line (y = 9.3). Deviation from the mean is the difference between a value and the mean. Covariance is proportional to the product of the deviations of s1 and s2. Consider the upper right rectangle in the plot above. Both s1 and s2 values are higher than the mean of s1 and s2, respectively, so both deviations are positive, and multiplying two positive values gives a positive value. In the lower left rectangle, s1 and s2 values are lower than their respective means; the deviations are negative, but we get a positive number when two negative numbers are multiplied. For the points in the lower right and upper left rectangles, the deviation of s1 is positive when the deviation of s2 is negative, and vice versa, so the product of the deviations is negative. All these products are combined to get the covariance. Hence, if we have more points in the negative regions than in the positive regions, we will get a negative covariance.
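The same calculation can be written out by hand and checked against np.cov. A minimal sketch with hypothetical values for s1 and s2 (so the means will not be exactly 8.5 and 9.3):

import numpy as np

s1 = np.array([7.0, 8.0, 9.0, 10.0, 8.5])
s2 = np.array([9.0, 9.5, 10.0, 9.8, 8.2])
dev1 = s1 - s1.mean()                              # deviations of s1 from its mean
dev2 = s2 - s2.mean()                              # deviations of s2 from its mean
manual_cov = (dev1 * dev2).sum() / (len(s1) - 1)   # combine the products of deviations
print(manual_cov, np.cov(s1, s2)[0, 1])            # the two values agree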

Correlation

Correlation is a normalization of covariance by the standard deviation of each variable.

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)

where σ_X and σ_Y are the standard deviations of X and Y.

This normalization cancels out the units, and the correlation value is always between -1 and 1: between 0 and 1 for a positive relationship and between 0 and -1 for a negative one. If we are comparing the relationships among three or more variables, it is better to use correlation, because differences in value ranges or units can lead to false conclusions.

Consider the dataframe below:

[Table: a dataframe with columns X, Y, and Z]

We want to measure the relationship between X-Y and X-Z. We want to find out which variable (Y or Z) is more correlated with X. Let’s use covariance first:

[Output: the covariance matrix of X, Y, and Z]

The covariance of X and Z is much higher than the covariance of X and Y. We might think the relationship between X and Z is much stronger than that between X and Y, but it is not the case. The covariance of X and Z is higher only because of the value ranges: Z ranges between 22 and 222, whereas the values of Y are around 1 (most of them less than 1). Therefore, we need to use correlation to eliminate the effect of different value ranges.

[Output: the correlation matrix of X, Y, and Z]

As we can see from the correlation matrix, X and Y are actually more correlated than X and Z.
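A minimal sketch of the same comparison with made-up data (the exact values will not match the ones above, but the pattern is the same: covariance is dominated by scale, correlation is not):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = rng.normal(size=100)
df = pd.DataFrame({"X": X,
                   "Y": 0.1 * X + rng.normal(scale=0.02, size=100),     # small values, tightly tied to X
                   "Z": 120 + 50 * X + rng.normal(scale=60, size=100)}) # large values, loosely tied to X

print(df.cov())    # Cov(X, Z) dwarfs Cov(X, Y) because of Z's scale
print(df.corr())   # Corr(X, Y) is higher than Corr(X, Z)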

How to Present the Relationships Amongst Multiple Variables in Python

Learn how to present the relationships amongst the features using multivariate charts and plots in Python

While dealing with a big dataset, it is important to understand the relationship between the features. That is a big part of data analysis. The relationships can be between two variables or amongst several variables. I will discuss how to present the relationships between multiple variables with some simple techniques. Python’s Numpy, Pandas, Matplotlib, and Seaborn libraries will be used.

First, import the necessary packages and the dataset to be used.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.read_csv("nhanes_2015_2016.csv")

This dataset is very large, at least too large to show in full here. Here are its columns.

df.columns
#Output:
Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210'], dtype='object')

Now, let's make the dataset smaller by keeping only a few columns, so it's easier to handle and show in this article.

df = df[['SMQ020', 'RIAGENDR', 'RIDAGEYR','DMDCITZN', 
'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ','SDMVPSU',
'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'RIDRETH1']]
df.head()
[Output: the first rows of the reduced dataframe]

Column names may look strange to you. I will keep explaining as we keep using them.

  1. In this dataset, we have two systolic blood pressure readings ('BPXSY1', 'BPXSY2') and two diastolic blood pressure readings ('BPXDI1', 'BPXDI2'). It is worth checking whether there is any relationship between them. Observe the relationship between the first and second systolic blood pressures.

To find the relationship between two variables, scatter plots have been used for a long time. They are the most popular, basic, and easily understandable way of looking at the relationship between two variables.

sns.regplot(x = "BPXSY1", y="BPXSY2", data=df, fit_reg = False, scatter_kws={"alpha": 0.2})
[Plot: scatter of BPXSY1 vs BPXSY2]

The relationship between the two systolic blood pressures is positively linear. There is a lot of overlap among the points in the plot.

2. To understand the systolic and diastolic blood pressure data and their relationships further, make a joint plot. A joint plot shows the density of the data and the distribution of both variables at the same time.

sns.jointplot(x = "BPXSY1", y="BPXSY2", data=df, kind = 'kde')
[Plot: joint KDE of BPXSY1 and BPXSY2]

This plot shows very clearly that the densest area is between 115 and 135. Both the first and second systolic blood pressure distributions are right-skewed, and both have some outliers.

3. Find out if the correlation between the first and second systolic blood pressures is different in the male and female populations.

df["RIAGENDRx"] = df.RIAGENDR.replace({1: "Male", 2: "Female"}) 
sns.FacetGrid(df, col = "RIAGENDRx").map(plt.scatter, "BPXSY1", "BPXSY2", alpha =0.6).add_legend()
[Plot: BPXSY1 vs BPXSY2 scatter, faceted by gender]

This picture shows that both relationships are positively linear. Let's compute the correlations to see more clearly.

print(df.loc[df.RIAGENDRx=="Female",["BPXSY1", "BPXSY2"]].dropna().corr())
print(df.loc[df.RIAGENDRx=="Male",["BPXSY1", "BPXSY2"]].dropna().corr())
[Output: correlation matrices for the female and male populations]

From the two correlation matrices above, the correlation between the two systolic blood pressures is about 1% higher in the female population than in the male population. If these ideas are new to you, I encourage you to try working out the correlation between the two diastolic blood pressures, or between systolic and diastolic blood pressure.

4. Human behavior can change with so many different factors such as gender, education level, ethnicity, financial situation, and so on. In this dataset, we have ethnicity (“RIDRETH1”) information as well. Check the effect of both ethnicity and gender on the relationship between both the systolic blood pressures.

sns.FacetGrid(df, col="RIDRETH1", row="RIAGENDRx").map(plt.scatter, "BPXSY1", "BPXSY2", alpha = 0.5).add_legend()
[Plot: BPXSY1 vs BPXSY2 scatter, faceted by ethnicity and gender]

Across ethnic origins and genders, the correlations change a little but generally stay positively linear, as before.

5. Now, focus on some other variables in the dataset. Find the relationship between education and marital status.

Both the education column ('DMDEDUC2') and the marital status column ('DMDMARTL') are categorical. First, replace the numerical codes with string values that make sense. We also need to get rid of values that do not add useful information to the chart, such as the 'Don't know' values in the education column and the 'Refused' values in the marital status column.

df["DMDEDUC2x"] = df.DMDEDUC2.replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 7: "Refused", 9: "Don't know"})df["DMDMARTLx"] = df.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married", 6: "Living w/partner", 77: "Refused"})db = df.loc[(df.DMDEDUC2x != "Don't know") & (df.DMDMARTLx != "Refused"), :]

Finally, we got this DataFrame that is clean and ready for the chart.

x = pd.crosstab(db.DMDEDUC2x, db.DMDMARTLx)
x
[Output: cross-tabulation of education level and marital status]

Here is the result. The raw counts are simple enough to understand, but a chart of population proportions is a more appropriate presentation. Below, I compute the proportion of each marital status within every education level.

x.apply(lambda z: z/z.sum(), axis=1)
[Output: marital-status proportions within each education level]

6. Find the population proportion of marital status segregated by Ethnicity (‘RIDRETH1’) and education level.

First, replace the numeric codes in the ethnicity column with meaningful strings. I found these string values on the Centers for Disease Control and Prevention (CDC) website.
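One possible mapping, assuming the standard NHANES coding for RIDRETH1 (check the CDC documentation for the exact labels):

db = db.copy()   # avoid modifying a view of the original dataframe
db["RIDRETH1x"] = db.RIDRETH1.replace({1: "Mexican American", 2: "Other Hispanic", 3: "Non-Hispanic White", 4: "Non-Hispanic Black", 5: "Other Race"})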

db.groupby(["RIDRETH1x", "DMDEDUC2x", "DMDMARTLx"]).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)
[Output: marital-status proportions by ethnicity and education level]

7. Observe the difference in education level with age.

Here, education level is a categorical variable and age is a continuous variable. A good way of observing the difference in education levels with age is to make a boxplot.

plt.figure(figsize=(12, 4))
a = sns.boxplot(x=db.DMDEDUC2x, y=db.RIDAGEYR)
[Plot: boxplot of age by education level]

This plot shows that the rate of college education is higher among younger people. A violin plot may give a better picture.

plt.figure(figsize=(12, 4))
a = sns.violinplot(x=db.DMDEDUC2x, y=db.RIDAGEYR)
[Plot: violin plot of age by education level]

The violin plot shows the full distribution. Most college-educated people are around age 30, while most people with less than a 9th-grade education are about 68 to 88 years old.

8. Show the marital status distributed by and segregated by gender.

fig, ax = plt.subplots(figsize = (12,4))
ax = sns.violinplot(x= "DMDMARTLx", y="RIDAGEYR", hue="RIAGENDRx", data= db, scale="count", split=True, ax=ax)
[Plot: violin plot of age by marital status, split by gender]

Here, the blue color shows the male population distribution and the orange color represents the female population distribution. Only the 'never married' and 'living with partner' categories have similar distributions for the male and female populations; every other category shows a notable difference between the two.