In this post, we are going to introduce you to the Support Vector Machine (SVM) machine learning algorithm. We will follow a similar process to our recent post Naive Bayes for Dummies; A Simple Explanation by keeping it short and not overly-technical. The aim is to give those of you who are new to machine learning a basic understanding of the key concepts of this algorithm.
Support Vector Machines – What are they?
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems and as such, this is what we will focus on in this post.
SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.
Support Vectors
Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.
What is a hyperplane?
As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.
Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.
So when new test data is added, whichever side of the hyperplane it lands on determines the class we assign to it.
How do we find the right hyperplane?
Or, in other words, how do we best segregate the two classes within the data?
The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
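To make this concrete, here is a minimal sketch (not from the original post; the toy data and parameter values are made up) of fitting a maximum-margin linear SVM with scikit-learn:

from sklearn import svm

# Two tiny, linearly separable point clouds (toy data for illustration only)
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear SVM looks for the maximum-margin hyperplane;
# a larger C penalizes margin violations more heavily
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)   # the points that define the margin
print(clf.predict([[3, 3]]))  # a new point is classified by the side it falls on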
But what happens when there is no clear hyperplane?
This is where it can get tricky. Data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below, which represent a linearly non-separable dataset.
In order to classify a dataset like the one above, it's necessary to move away from a 2d view of the data to a 3d view. Explaining this is easiest with another simplified example. Imagine that our two sets of colored balls above are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This 'lifting' of the balls represents the mapping of data into a higher dimension. This is known as kernelling. You can read more on kernelling here.
Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
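As a rough illustration of the kernel idea (again a sketch, not the original post's code), scikit-learn's RBF kernel implicitly performs this kind of higher-dimensional mapping for a dataset that is not linearly separable in 2D:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy dataset that is not linearly separable in 2D (one class surrounds the other)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles here, while the RBF kernel implicitly maps the data
# into a higher-dimensional space where a separating hyperplane exists
linear_clf = SVC(kernel='linear').fit(X, y)
rbf_clf = SVC(kernel='rbf', gamma=2).fit(X, y)

print("linear kernel accuracy:", linear_clf.score(X, y))
print("RBF kernel accuracy:", rbf_clf.score(X, y))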
Pros & Cons of Support Vector Machines
Pros
Accuracy
Works well on smaller cleaner datasets
It can be more efficient because it uses a subset of training points
Cons
Isn’t suited to larger datasets as the training time with SVMs can be high
Less effective on noisier datasets with overlapping classes
SVM Uses
SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification. SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.
There you have it, a very high level introduction to Support Vector Machines.
One of the most frustrating things that happens, more often than data scientists like to admit, is spending hours upon hours gathering data, cleaning it, labeling it, and using it to train and develop a machine learning model, only to end up with a model with low accuracy or a large error range.
In machine learning, the term model accuracy refers to the measurements used to judge how well a model describes the relationship between the different problem variables. We typically use training data (sample data) to train a model that will then be applied to new, unseen data.
If our model has good accuracy, it will perform well on both the training data and the new data. Having a model with high accuracy is essential to the overall project's success, and if you're building it for a client, it's important for your paycheck!
So, how can we avoid all of that and improve the accuracy of our machine learning model? There are different ways a data scientist can improve their model's accuracy; in this article, we will go through 6 of them. Let's jump right in…
Most ML engineers are familiar with the quote, "Garbage in, garbage out". Your model can only perform so well when the data it is trained on poorly represents the actual scenario. What do I mean by 'representative'? It refers to how well the training data population mimics the target population: the proportions of the different classes, the point estimates (like the mean or median), and the variability (like the variance, standard deviation, or interquartile range) of the training and target populations.
Generally, the larger the dataset, the more likely it is to be representative of the target population to which you want to generalize. If you want to generalize to the population of students in Grades 1 to 12 of a school, you cannot just draw 80% of your data from Grade 8; predictions for the other grades will be unreliable because of the skewed dataset. It is crucial to have a good understanding of the distribution of your target population in order to devise the right data collection techniques. Once you have the data, study it (the exploratory data analysis phase) in order to determine its distribution and representativeness.
Outliers, missing values, and outright wrong or false data are some of the other considerations that you might have. Should you cap outliers at a certain value? Or remove them entirely? How about normalizing the values? Should you include data with some missing values? Or use the mean or median values instead to replace the missing values? Does the data collection method support the integrity of the data? These are some of the questions that you must evaluate before thinking about the model. Data cleaning is probably the most important step after data collection.
Method 1: Add more data samples
Data tells a story only if you have enough of it. Every data sample adds some input and perspective to the overall story your data is trying to tell. Perhaps the easiest and most straightforward way to improve your model's performance and increase its accuracy is to add more data samples to the training data.
Doing so will add more detail to your data and fine-tune your model, resulting in more accurate performance. Remember, after all, the more information you give your model, the more it will learn and the more cases it will be able to identify correctly.
Method 2: Look at the problem differently
Sometimes adding more data isn't the answer to your model's inaccuracy problem. You're providing your model with a good technique and the correct dataset, but you're not getting the results you hope for. Why?
Context is important in any situation, and training a machine learning model is no different. Sometimes, one point of data can't tell a story, so you need to add more context for the algorithm you intend to apply to this data to perform well.
More context can always lead to a better understanding of the problem and, eventually, better performance of the model. Imagine I tell you I am selling a car, a BMW. That alone doesn’t give you much information about the car. But, if I add the color, model and distance traveled, then you’ll start to have a better picture of the car and its possible value.
Method 4: Fine-tune your hyperparameters
Training a machine learning model is a skill that you can only hone with practice. Yes, there are rules you can follow to train your model, but these rules don't give you the answer you're seeking, only the way to reach that answer.
However, to get the answer, you will need to do some trial and error until you reach it. When I first started learning the different machine learning algorithms, such as K-means, I was lost on choosing the best number of clusters to reach optimal results. The way to optimize the results is to tune the algorithm's hyperparameters, and careful tuning can lead to noticeably better accuracy.
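As a small illustration (a sketch with made-up data, assuming scikit-learn), you can try several values of k for K-means and compare the within-cluster sum of squares; picking the 'elbow' of this curve is a common heuristic:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data whose "right" number of clusters we pretend not to know
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Try several values of k and inspect the inertia (within-cluster sum of squares)
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, model.inertia_)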
Method 5: Train your model using cross-validation
In machine learning, cross-validation is a technique used to make model training more reliable by dividing the overall training set into smaller chunks (folds); the model is trained on all but one chunk and validated on the held-out chunk, rotating through the chunks so that each one is used for validation once.
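For example, here is a minimal sketch (not from the original article) of 5-fold cross-validation with scikit-learn's cross_val_score on a built-in dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 chunks; each chunk takes a turn
# as the validation set while the model is trained on the remaining 4
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())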
What if you tried all the approaches we talked about so far and your model still results in a low or average accuracy? What then?
Sometimes we choose an algorithm that doesn't really suit the data we have, and so we don't get the results we expect. Changing the algorithm you're using to implement your solution can help. Trying out different algorithms will lead you to uncover more details about your data and the story it's trying to tell.
Takeaways
One of the most difficult things to learn as a new data scientist and to master as a professional one is improving your machine learning model’s accuracy. If you’re a freelance developer, own your own company, or have a role as a data scientist, having a high accuracy model can make or break your entire project.
Luckily, there are various simple yet effective steps you can take to increase the accuracy of your model and save much of the time, money, and effort that would otherwise be wasted on mitigating errors from a low-accuracy model.
Improving the accuracy of a machine learning model is a skill that can only improve with practice. The more projects you build, the better your intuition will get about which approach you should use next time to improve your model’s accuracy. With time, your models will become more accurate and your projects more concrete.
A machine learning model has hyperparameters, settings that must be chosen before training. Some examples of model hyperparameters include:
The penalty in the Logistic Regression classifier, i.e. L1 or L2 regularization
The learning rate for training a neural network.
The C and sigma hyperparameters for support vector machines.
The k in k-nearest neighbors.
The aim of this article is to explore various strategies for tuning the hyperparameters of a machine learning model.
Models can have many hyperparameters, and finding the best combination of values can be treated as a search problem. Two of the best strategies for hyperparameter tuning are:
GridSearchCV In the GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter values. This approach is called GridSearchCV because it searches for the best set of hyperparameters over a grid of hyperparameter values.
For example, if we want to set two hyperparameters C and Alpha of a Logistic Regression Classifier model, each with its own set of values, the grid search technique will construct many versions of the model with all possible combinations of hyperparameters and return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C=0.3 and Alpha=0.2 gives the highest performance score of 0.726, so it is selected.
The following code illustrates how to use GridSearchCV.
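The original listing is not reproduced here; a sketch consistent with the output below, assuming a Logistic Regression whose C is searched over a logarithmic grid and that X and y hold the features and labels of the author's dataset, might look like this:

# Necessary imports
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Creating the hyperparameter grid (a logarithmic range of C values)
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating the logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object with 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)  # X, y: feature matrix and labels of the dataset

# Print the tuned parameter and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))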
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334
Drawback: GridSearchCV will go through all the possible combinations of hyperparameters, which makes grid search computationally very expensive.
RandomizedSearchCV RandomizedSearchCV addresses the drawback of GridSearchCV, as it goes through only a fixed number of hyperparameter settings. It moves within the grid in a random fashion to find the best set of hyperparameters. This approach reduces unnecessary computation. The following code illustrates how to use RandomizedSearchCV.
# Necessary imports
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Creating the hyperparameter grid
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiating RandomizedSearchCV object
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
Output:
Tuned Decision Tree Parameters: {'min_samples_leaf': 5, 'max_depth': 3, 'max_features': 5, 'criterion': 'gini'}
Best score is 0.7265625
Decision Tree Classifier for building a classification model using Python and Scikit
Decision Tree Classifier is a classification model that can be used for simple classification tasks where the data space is not huge and can be easily visualized. Despite being simple, it can show very good results on such tasks and sometimes outperforms other, more complicated models.
Article Overview:
Decision Tree Classifier Dataset
Decision Tree Classifier in Python with Scikit-Learn
Decision Tree Classifier – preprocessing
Training the Decision Tree Classifier model
Using our Decision Tree model for predictions
Decision Tree Visualisation
Decision Tree Classifier Dataset
Recently I’ve created a small dummy dataset to use for simple classification tasks. I’ll paste the dataset here again for your convenience.
Decision Tree Classifier – training data
The purpose of this data is: given 3 facts about a certain moment (the weather, whether it is a weekend or a workday, and whether it is morning, lunch, or evening), can we predict if there's a traffic jam in the city?
Decision Tree Classifier in Python with Scikit-Learn
We have 3 dependencies to install for this project, so let’s install them now. Obviously, the first thing we need is the scikit-learn library, and then we need 2 more dependencies which we’ll use for visualization.
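The exact install commands are not listed here; assuming the two visualization dependencies are pydotplus and IPython (both appear in the visualization code further down), the installation might look like this:

pip install scikit-learn
pip install pydotplus   # also requires the Graphviz system package to render the tree
pip install ipython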
We know that computers have a really hard time when dealing with text and we can make their lives easier by converting the text to numerical values.
Label Encoder
We will use this encoder provided by scikit-learn to transform categorical data from text to numbers. If a column has n possible values, the LabelEncoder will transform them into the numbers 0 to n-1 so that each textual value has a numeric representation.
Now let's train our model. Remember, since all our features are textual values, we need to encode them, and only then can we jump to training.
from sklearn import preprocessing, tree

if __name__ == "__main__":
    # Get the data (helper functions that return the columns of the dummy dataset)
    weather = getWeather()
    timeOfWeek = getTimeOfWeek()
    timeOfDay = getTimeOfDay()
    trafficJam = getTrafficJam()

    labelEncoder = preprocessing.LabelEncoder()

    # Encode the features and the labels
    encodedWeather = labelEncoder.fit_transform(weather)
    encodedTimeOfWeek = labelEncoder.fit_transform(timeOfWeek)
    encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
    encodedTrafficJam = labelEncoder.fit_transform(trafficJam)

    # Build the features
    features = []
    for i in range(len(encodedWeather)):
        features.append([encodedWeather[i], encodedTimeOfWeek[i], encodedTimeOfDay[i]])

    # Train the Decision Tree classifier
    classifier = tree.DecisionTreeClassifier()
    classifier = classifier.fit(features, encodedTrafficJam)
Decision Tree Classifier – training our model
Using our Decision Tree model for predictions
Now we can use the model we have trained to make predictions about the traffic jam.
# ["Snowy", "Workday", "Morning"]
print(classifier.predict([[2, 1, 2]]))
# Prints [1], meaning "Yes"
# ["Clear", "Weekend", "Lunch"]
print(classifier.predict([[0, 0, 1]]))
# Prints [0], meaning "No"
Decision Tree Classifier – making predictions
And it seems to be working! It correctly predicts the traffic jam situations given our data.
Decision Tree Visualisation
Scikit also provides us with a way of visualizing a Decision Tree model. Here’s a quick helper method I wrote to generate a png image from our decision tree.
import pydotplus
from IPython.display import Image

def printTree(classifier):
    feature_names = ['Weather', 'Time of Week', 'Time of Day']
    target_names = ['Yes', 'No']
    # Build the data
    dot_data = tree.export_graphviz(classifier, out_file=None,
                                    feature_names=feature_names,
                                    class_names=target_names)
    # Build the graph
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Show the image (in a notebook) and save it to disk
    Image(graph.create_png())
    graph.write_png("tree.png")
Decision Tree Classifier – visualizing the decision tree
An easy descriptive statistics approach to summarizing numeric and categorical data variables through measures of central tendency and measures of spread, for every exploratory data analysis process.
About the Exploratory Data Analysis (EDA)
EDA is the first step in the data analysis process. It allows us to understand the data we are dealing with by describing and summarizing the dataset's main characteristics, often through visual methods like bar and pie charts, histograms, boxplots, scatterplots, heatmaps, and many more.
Why is EDA important?
Maximize insight into a dataset (be able to listen to your data)
Uncover underlying structure/patterns
Detect outliers and anomalies
Extract and select important variables
Increase computational efficiency
Test underlying assumptions (e.g. business intuition)
Moreover, to be able to explore and explain the dataset's features with all their attributes, and to get insights and efficient numeric summaries of the data, we need help from descriptive statistics.
Statistics is divided into two major areas:
Descriptive statistics: describe and summarize data;
Inferential statistics: methods for using sample data to make general conclusions (inferences) about populations.
This tutorial focuses on descriptive statistics of both numerical and categorical variables and is divided into two parts:
Measures of central tendency;
Measures of spread.
Descriptive statistics
Also named Univariate Analysis (one feature analysis at a time), descriptive statistics, in short, help describe and understand the features of a specific dataset, by giving short numeric summaries about the sample and measures of the data.
Descriptive statistics are mere exploration, as they do not allow us to make conclusions beyond the data we have analysed or to reach conclusions regarding any hypotheses we might have made.
Numerical and categorical variables, as we will see shortly, have different descriptive statistics approaches.
Let’s review the type of variables:
Type of variables — Image by author
Numerical continuous: The values are not countable and have an infinite number of possibilities (Someone’s age: 25 years, 4 days, 11 hours, 24 minutes, 5 seconds and so on to the infinite).
Numerical discrete: The values are countable and have a finite number of possibilities (It is impossible to count 27.52 countries in the EU).
Categorical ordinal: There is an order implied in the levels (January comes always before February and after December).
Categorical nominal: There is no order implied in the levels (Female/male, or the wind direction: north, south, east, west).
Numerical variables
Measures of central tendency: Mean, median
Measures of spread: Standard deviation, variance, percentiles, maximum, minimum, skewness, kurtosis
Others: Size, unique, number of uniques
One approach to display the data is through a boxplot. It gives you the 5-basic-stats, such as the minimum, the 1st quartile (25th percentile), the median, the 3rd quartile (75th percentile), and the maximum.
Categorical variables
Bar plot of the categorical ordinal variable. Image by author
Measures of central tendency: Mode (most common)
Measures of spread: Number of uniques
Others: Size, % Highest unique
Understanding:
Measures of central tendency
Mean (average): The total sum of values divided by the number of observations. The mean is highly sensitive to outliers.
Median (center value): The middle value of an ordered sequence of numbers, i.e. the value that splits the ordered data in half. The median is not affected by outliers.
Mode (most common): The values most frequently observed. There can be more than one modal value in the same variable.
Measures of spread
Variance (variability from the mean): The square of the standard deviation. It is also affected by outliers.
Standard deviation (concentrated around the mean): The standard amount of deviation (distance) from the mean. The std is affected by the outliers. It is the square root of the variance.
Percentiles: The value below which a percentage of data falls. The 0th percentile is the minimum value, the 100th is the maximum, the 50th is the median.
Minimum: The smallest or lowest value.
Maximum: The greatest or highest value.
The number of uniques (total distinct): The total amount of distinct observations.
Uniques (distinct): The distinct values or groups of values observed.
Skewness (symmetry): How much a distribution deviates from the normal distribution. >> The skew concept is explained in the next section.
Kurtosis (volume of outliers): How long the tails are and how sharp the peak of the distribution is. >> The kurtosis concept is explained in the next section.
Others
Count (size): The total number of observations. Counting is also necessary for calculating the mean, median, and mode.
% highest unique (relativity): The proportion of the most frequent unique observation relative to all the unique values or groups of values.
Skewness
In a perfect world, the data’s distribution assumes the form of a bell curve (Gaussian or normally distributed), but in the real world, data distributions usually are not symmetric (= skewed).
Therefore, the skewness indicates how much our distribution deviates from the normal distribution (whose skewness value is zero or very close to it).
Skewness curves. Image by author
There are three generic types of distributions:
Symmetrical [median = mean]: In a normal distribution, the mean (average) divides the data symmetrically at the median value or close.
Positive skew [median < mean]: The distribution is asymmetrical, the tail is skewed/longer towards the right-hand side of the curve. In this type, the majority of the observations are concentrated on the left tail, and the value of skewness is positive.
Negative skew [median > mean]: The distribution is asymmetrical and the tail is skewed/longer towards the left-hand side of the curve. In this type of distribution, the majority of the observations are concentrated on the right tail, and the value of skewness is negative.
Rules of thumb:
Symmetric distribution: values between –0.5 to 0.5.
Moderate skew: values between –1 and -0.5 and 0.5 and 1.
High skew: values <-1 or >1.
Kurtosis
Kurtosis is another useful tool when it comes to quantifying the shape of a distribution. It measures how long the tails are and, most importantly, how sharp the peak of the distribution is.
If the distribution has a sharper and taller peak and shorter tails, then it has a higher kurtosis while a low kurtosis can be observed when the peak of the distribution is flatter with thinner tails. There are three types of kurtosis:
Leptokurtic: The distribution is tall and thin. The value of a leptokurtic must be > 3.
Mesokurtic: This distribution looks the same or very similar to a normal distribution. The value of a “normal” mesokurtic is = 3.
Platykurtic: The distributions have a flatter and wider peak and thinner tails, meaning that the data is moderately spread out. The value of a platykurtic must be < 3.
The kurtosis values determine the volume of the outliers only.
Kurtosis is calculated by raising the average of the standardized data to the fourth power. If we raise any standardized number less than 1 to the 4th power, the result is a very small number, somewhere close to zero. Such a small value does not contribute much to the kurtosis. The conclusion is that the values that make a difference to the kurtosis are the ones far away from the region of the peak; in other words, the outliers.
The Jupyter notebook — IPython
In this section, we will be giving short numeric stats summaries concerning the different measures of central tendency and dispersion of the dataset.
Let's work on some practical examples through a descriptive statistics environment in Pandas.
Start by importing the required libraries:
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Load the dataset: df = pd.read_csv("sample.csv", sep=";")
Print the data: df.head()
Before any stats calculus, let's just take a quick look at the data: df.info()
Image by author
The dataset consists of 310 observations and 2 columns. One of the attributes is numerical, and the other categorical. Both columns have no missing values.
Numerical variable
The numerical variable we are going to analyze is age. The first step is to visually observe the variable, so let's plot a histogram and a boxplot.
It is also possible to visually observe the variable with both a histogram and a boxplot combined. I find it a useful graphical combination and use it a lot in my reports.
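The plotting code is not shown here; assuming the numerical column is named age, a sketch of the combined view might look like this:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram and boxplot of the numerical variable (column name 'age' assumed)
fig, (ax_hist, ax_box) = plt.subplots(2, 1, figsize=(8, 6))
sns.histplot(df['age'], ax=ax_hist)
sns.boxplot(x=df['age'], ax=ax_box)
plt.show()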
4. Most frequent unique (value count): df.city.value_counts().head(1)
Paris 67 Name: city, dtype: int64
Others
5. Size (number of rows): df.city.count()
310
6. % of the highest unique (fraction of the most common unique value with regard to all the others):
p = df.city.value_counts(normalize=True)[0]
print(f"{p:.1%}")
21.6%
The describe() method shows the descriptive statistics gathered in one table; by default, it shows stats for numeric data. The result is returned as a pandas dataframe. df.describe()
Adding other non-standard values, for instance the variance:
describe_var = df.describe()
describe_var.append(pd.Series(df.var(), name='variance'))
Passing the parameter include='all' displays both numeric and categorical variables at once. df.describe(include='all')
Conclusion
These are the basics of descriptive statistics when developing an exploratory data analysis project with the help of Pandas, Numpy, Scipy, Matplotlib and/or Seaborn. When well performed, these stats help us to understand and transform the data for further processing.
Data science is an interdisciplinary field. One of the building blocks of data science is statistics. Without a decent level of statistics knowledge, it would be highly difficult to understand or interpret the data.
Statistics helps us explain the data. We use statistics to infer results about a population based on a sample drawn from that population. Furthermore, machine learning and statistics have plenty of overlaps.
Long story short, one needs to study and learn statistics and its concepts to become a data scientist. In this article, I will try to explain 10 fundamental statistical concepts.
1. Population and sample
Population is all elements in a group. For example, college students in the US is a population that includes all of the college students in the US. 25-year-old people in Europe is a population that includes all of the people that fit the description.
It is not always feasible or possible to do analysis on population because we cannot collect all the data of a population. Therefore, we use samples.
Sample is a subset of a population. For example, 1000 college students in the US is a subset of the "college students in the US" population.
2. Normal distribution
Probability distribution is a function that shows the probabilities of the outcomes of an event or experiment. Consider a feature (i.e. column) in a dataframe. This feature is a variable and its probability distribution function shows the likelihood of the values it can take.
Probability distribution functions are quite useful in predictive analytics or machine learning. We can make predictions about a population based on the probability distribution function of a sample from that population.
Normal (Gaussian) distribution is a probability distribution function that looks like a bell.
A typical normal distribution curve (image by author)
The peak of the curve indicates the most likely value the variable can take. As we move away from the peak, the probability of the values decreases.
3. Measures of central tendency
Central tendency is the central (or typical) value of a probability distribution. The most common measures of central tendency are mean, median, and mode.
Mean is the average of the values in a series.
Median is the value in the middle when values are sorted in ascending or descending order.
Mode is the value that appears most often.
4. Variance and standard deviation
Variance is a measure of the variation among values. It is calculated by adding up the squared differences between each value and the mean and then dividing the sum by the number of samples.
(image by author)
Standard deviation is a measure of how spread out the values are. To be more specific, it is the square root of variance.
Note: Mean, median, mode, variance, and standard deviation are basic descriptive statistics that help to explain a variable.
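As a quick illustration (made-up values, using pandas), these measures are one-liners:

import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up sample

print(values.mean())     # mean
print(values.median())   # median
print(values.mode()[0])  # mode (most frequent value)
print(values.var())      # variance (pandas uses the sample variance, dividing by n-1)
print(values.std())      # standard deviation (square root of the variance)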
5. Covariance and correlation
Covariance is a quantitative measure that represents how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) value.
The figure below shows some values of the random variables X and Y. The orange dot represents the mean of these variables. The values change similarly with respect to the mean value of the variables. Thus, there is positive covariance between X and Y.
(image by author)
The formula for covariance of two random variables:
(image by author)
where E is the expected value and µ is the mean.
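Since the formula image is not reproduced here, the standard definition it shows is:

Cov(X, Y) = E[(X - µ_X)(Y - µ_Y)]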
Note: The covariance of a variable with itself is the variance of that variable.
Correlation is a normalization of covariance by the standard deviation of each variable.
(image by author)
where σ is the standard deviation.
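Again, since the image is not reproduced, the standard definition is:

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)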
This normalization cancels out the units, and the correlation value always lies between -1 and 1: between 0 and 1 for a positive relationship and between 0 and -1 for a negative one. If we are comparing the relationship among three or more variables, it is better to use correlation because different value ranges or units may cause false assumptions.
6. Central limit theorem
In many fields including natural and social sciences, when the distribution of a random variable is unknown, normal distribution is used.
Central limit theorem (CLT) justifies why normal distribution can be used in such cases. According to the CLT, as we take more samples from a distribution, the sample averages will tend towards a normal distribution regardless of the population distribution.
Consider a case that we need to learn the distribution of the heights of all 20-year-old people in a country. It is almost impossible and, of course not practical, to collect this data. So, we take samples of 20-year-old people across the country and calculate the average height of the people in samples. CLT states that as we take more samples from the population, sampling distribution will get close to a normal distribution.
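A quick simulation (a sketch with made-up numbers, using numpy; a skewed exponential distribution stands in for the unknown population) shows the effect: the means of repeated samples pile up in a bell shape around the population mean.

import numpy as np

# Draw many samples from a skewed (exponential) population and average each one;
# by the CLT, the distribution of these sample means approaches a normal distribution
rng = np.random.default_rng(0)
sample_means = [rng.exponential(scale=170, size=50).mean() for _ in range(10_000)]

print(np.mean(sample_means))  # close to the population mean (170)
print(np.std(sample_means))   # roughly 170 / sqrt(50), as the CLT predicts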
Why is it so important to have a normal distribution? Normal distribution is described in terms of mean and standard deviation which can easily be calculated. And, if we know the mean and standard deviation of a normal distribution, we can compute pretty much everything about it.
7. P-value
P-value is a measure of the likelihood of a value that a random variable takes. Consider we have a random variable A and the value x. The p-value of x is the probability that A takes the value x or any value that has the same or less chance to be observed. The figure below shows the probability distribution of A. It is highly likely to observe a value around 10. As the values get higher or lower, the probabilities decrease.
Probability distribution of A (image by author)
We have another random variable B and want to see if B is greater than A. The average of the sample means obtained from B is 12.5. The p value for 12.5 is the green area in the graph below. The green area indicates the probability of getting 12.5 or a more extreme value (higher than 12.5 in our case).
(image by author)
Let’s say the p value is 0.11 but how do we interpret it? A p value of 0.11 means that we are 89% sure of the results. In other words, there is 11% chance that the results are due to random chance. Similarly, a p value of 0.05 means that there is 5% chance that the results are due to random chance.
Note: Lower p values show more certainty in the result.
If the average of sample means from the random variable B turns out to be 15 which is a more extreme value, the p value will be lower than 0.11.
(image by author)
8. Expected value of random variables
The expected value of a random variable is the weighted average of all possible values of the variable. The weight here means the probability of the random variable taking a specific value.
The expected value is calculated differently for discrete and continuous random variables.
Discrete random variables take finitely many or countably infinitely many values. The number of rainy days in a year is a discrete random variable.
Continuous random variables take uncountably infinitely many values. For instance, the time it takes from your home to the office is a continuous random variable. Depending on how you measure it (minutes, seconds, nanoseconds, and so on), it takes uncountably infinitely many values.
The formula for the expected value of a discrete random variable is:
(image by author)
The expected value of a continuous random variable is calculated with the same logic but using different methods. Since continuous random variables can take uncountably infinitely many values, we cannot talk about a variable taking a specific value. We rather focus on value ranges.
In order to calculate the probability of value ranges, probability density functions (PDF) are used. PDF is a function that specifies the probability of a random variable taking value within a particular range.
(image by author)
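Since the formula images are not reproduced here, the standard definitions are:

E[X] = Σ x_i p(x_i)        (discrete random variable)
E[X] = ∫ x f(x) dx         (continuous random variable, with PDF f)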
9. Conditional probability
Probability simply means the likelihood of an event occurring and always takes a value between 0 and 1 (0 and 1 inclusive). The probability of event A is denoted as p(A) and calculated as the number of desired outcomes divided by the number of all possible outcomes. For example, when you roll a die, the probability of getting a number less than three is 2 / 6. The number of desired outcomes is 2 (1 and 2); the number of total outcomes is 6.
Conditional probability is the likelihood of an event A to occur given that another event that has a relation with event A has already occurred.
Suppose that we have 6 blue balls and 4 yellow balls placed in two boxes as seen below. I ask you to randomly pick a ball. The probability of getting a blue ball is 6 / 10 = 0.6. What if I ask you to pick a ball from box A? The probability of picking a blue ball clearly decreases. The condition here is to pick from box A, which clearly changes the probability of the event (picking a blue ball). The probability of event A given that event B has occurred is denoted as p(A|B).
(image by author)
10. Bayes’ theorem
According to Bayes' theorem, the probability of event A given that event B has already occurred can be calculated using the probabilities of event A and event B and the probability of event B given that A has already occurred.
(image by author)
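In symbols (restating the formula image), Bayes' theorem is:

P(A|B) = P(B|A) P(A) / P(B)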
Bayes' theorem is so fundamental and ubiquitous that a field called "Bayesian statistics" exists. In Bayesian statistics, the probability of an event or hypothesis is updated as evidence comes into play. Therefore, prior probabilities and posterior probabilities differ depending on the evidence.
The Naive Bayes algorithm is structured by combining Bayes' theorem and some naive assumptions. Naive Bayes assumes that features are independent of each other and that there is no correlation between features.
Conclusion
We have covered some basic yet fundamental statistical concepts. If you are working or plan to work in the field of data science, you are likely to encounter these concepts.
There is, of course, much more to learn about statistics. Once you understand the basics, you can steadily build your way up to advanced topics.
Data binning (or bucketing) groups data in bins (or buckets), in the sense that it replaces values contained into a small interval with a single representative value for that interval. Sometimes binning improves accuracy in predictive models.
There are two approaches to data binning:
numeric to categorical binning, which converts numeric variables into categorical variables;
sampling, which corresponds to data quantization.
You can download the full code of this tutorial from the GitHub repository.
Data Import
In this tutorial we exploit the cupcake.csv dataset, which contains the trend search of the word cupcake on Google Trends. Data are extracted from this link. We exploit the pandas library to import the dataset and we transform it into a dataframe through the read_csv() function.
import pandas as pd
df = pd.read_csv('cupcake.csv')
df.head(5)
Numeric to categorical binning
In this case we group the values of the column Cupcake into three groups: small, medium and big. In order to do this, we need to calculate the intervals within which each group falls. We calculate the interval range as the difference between the maximum and minimum value and then split this interval into three parts, one for each group. We exploit the min() and max() functions of the dataframe to calculate the minimum value and the maximum value of the column Cupcake.
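The code for this step is not shown above; assuming the dataframe loaded earlier, it might look like this:

min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
print(min_value, max_value)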
Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):
small — (edge1, edge2)
medium — (edge2, edge3)
big — (edge3, edge4)
We can use the linspace() function of the numpy package to calculate the 4 bins, equally distributed.
import numpy as np
bins = np.linspace(min_value, max_value, 4)
bins
which gives the following output:
array([ 4., 36., 68., 100.])
Now we define the labels:
labels = ['small', 'medium', 'big']
We can use the cut() function to convert the numeric values of the column Cupcake into categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True in order to also include the minimum value.
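The call itself is not shown above; a sketch, storing the result in a new column named bins (the name assumed from the plotting code below), might look like this:

df['bins'] = pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)
df.head()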
We can plot the distribution of values, by using the hist() function of the matplotlib package.
import matplotlib.pyplot as plt
plt.hist(df['bins'], bins=3)
Sampling
Sampling is another technique of data binning. It permits us to reduce the number of samples by grouping similar or contiguous values. There are three approaches to performing sampling:
by bin means: each value in a bin is replaced by the mean value of the bin.
by bin median: each bin value is replaced by its bin median value.
by bin boundary: each bin value is replaced by the closest boundary value, i.e. maximum or minimum value of the bin.
In order to perform sampling, the binned_statistic() function of the scipy.stats package can be used. This function receives two arrays as input, x_data and y_data, as well as the statistics to be used (e.g. median or mean) and the number of bins to be created. The function returns the values of the bins as well as the edges of each bin. We can calculate the x values (x_bins) corresponding to the binned values (y_bins) as the values at the center of the bin range.
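The original code is not shown here; a sketch of sampling by bin means, assuming the sample index is used as the x axis, might look like this:

import numpy as np
from scipy.stats import binned_statistic

x_data = np.arange(len(df))      # assumed x axis: the position of each sample
y_data = df['Cupcake'].values

# Replace each bin with its mean value (sampling "by bin means")
y_bins, bin_edges, _ = binned_statistic(x_data, y_data, statistic='mean', bins=50)

# x values at the center of each bin range
x_bins = (bin_edges[:-1] + bin_edges[1:]) / 2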
In this tutorial I have illustrated how to perform data binning, which is a technique for data preprocessing. Two approaches can be followed. The first approach converts numeric data into categorical data, the second approach performs data sampling, by reducing the number of samples.
Data binning is very useful when discretization is needed.
This is the second part of the Machine Learning series, where we discuss how to handle features before using the data for machine learning models. The article contains the parts below:
Feature representation
Feature selection
Feature transformation
Feature engineering
This article covers the basic concepts of modifying features, as data needs to be refined before it can be used for prediction. We need to remove the garbage from the data and turn the features into high-quality features.
Feature Representation
Your features need to be represented as quantitative (preferably numeric) attributes of the thing you're sampling. They can be real-world values, such as the readings from a sensor, and other discernible, physical properties. Alternatively, your features can also be calculated derivatives, such as the presence of certain edges and curves in an image, or lack thereof.
But there is no guarantee that will be the case, and you will often encounter data in textual or other unstructured forms. Luckily, there are a few techniques that when applied, clean up these scenarios.
Textual Categorical-Features
If you have a categorical feature, the way to represent it in your dataset depends on if it’s ordinal or nominal. For ordinal features, map the order as increasing integers in a single numeric feature.
On the other hand, if your feature is nominal (and thus there is no obvious numeric ordering), then you have two options. The first is that you can encode it similarly to the ordinal case above. This is a fast-and-dirty approach that may or may not cause problems for you in the future. If you aren't getting the results you hoped for, or even if you are but would like to further increase the accuracy, then a more precise encoding approach is to separate the distinct values out into individual boolean features:
These newly created features are called boolean features because the only values they can contain are either 0 for non-inclusion or 1 for inclusion. Pandas' .get_dummies() method allows you to completely replace a single, nominal feature with multiple boolean indicator features. This method is quite powerful and has many configurable options, including the ability to return a SparseDataFrame, and other prefixing options. Its benefit is that no erroneous ordering is introduced into your dataset.
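A small sketch (made-up column names, not the article's dataset) of both encodings:

import pandas as pd

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium'],
                   'color': ['red', 'blue', 'green', 'red']})

# Ordinal feature: map the order as increasing integers
ordering = {'small': 0, 'medium': 1, 'large': 2}
df['size'] = df['size'].map(ordering)

# Nominal feature: replace it with boolean indicator (dummy) features
df = pd.get_dummies(df, columns=['color'])
print(df)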
Pure Textual Features
If you are trying to “featurize” a body of text such as a webpage, a tweet, a passage from a newspaper, an entire book, or a PDF document, creating a corpus of words and counting their frequency is an extremely powerful encoding tool. This is also known as the Bag of Words model, implemented with the CountVectorizer() method in SciKit-Learn.
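A minimal sketch (made-up corpus) of the bag-of-words encoding with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["machine learning is fun",
          "learning from text is also machine learning"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)   # sparse matrix of word frequencies

print(vectorizer.get_feature_names_out())   # the corpus vocabulary
print(counts.toarray())                     # word counts per document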
Graphical Features
In addition to text and natural language processing, bag of words has successfully been applied to images by categorizing a collection of regions and describing only their appearance, ignoring any spatial structure. However, this is not the typical approach used to represent images as features, and it requires you to come up with methods of categorizing image regions. More often used methods include:
Split the image into a grid of smaller areas and attempt feature extraction at each locality. Return a combined array of all discovered features.
Use variable-length gradients and other transformations as the features, such as regions of high / low luminosity, histogram counts for horizontal and vertical black pixels, stroke and edge detection, etc.
Resize the picture to a fixed size, convert it to grayscale, then encode every pixel as an element in a uni-dimensional feature array.
If you're wondering what the :: is doing, that is called extended slicing. Notice the .reshape(-1) line. This tells Pandas to take your 2D image and flatten it into a 1D array. This is an all-purpose method you can use to change the shape of your arrays and dataframes, so long as you maintain the number of elements. For example, reshaping a [10, 10] to [100, 1] or [4, 25], etc. Another method called .ravel() will do the same thing as .reshape(-1), that is, unravel a multi-dimensional NDArray into a one-dimensional one. The reason why it's important to reshape your 2D image arrays into one-dimensional ones is that each image will represent a single sample, and Sklearn expects your dataframe to have the shape [num_samples, num_features].
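For example (a sketch with a made-up image), flattening keeps all 100 pixel values but turns them into a single row of features:

import numpy as np

image = np.random.rand(10, 10)   # a made-up 10x10 grayscale image
flat = image.reshape(-1)         # flatten into a 1D array of 100 pixel features

print(image.shape, flat.shape)   # (10, 10) -> (100,)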
Feature Selection
Most of the time, we will have many non-informative features, for example Name or ID variables, and they result in "garbage in, garbage out". Also, extra features make a model complex, time-consuming, and harder to implement in production. Many machine learning algorithms suffer from the curse of dimensionality; that is, they do not perform well when given a large number of variables or features. So it's better to remove highly irrelevant or redundant features to simplify the situation and improve performance.
For instance, if your dataset has columns you don't need, you can remove them using the drop() method by specifying the names of the columns. axis=1 indicates that deletion will happen column-wise, while axis=0 implies that deletion will happen row-wise.
Or, if you want only selected columns for analysis or visualization purposes, you can select those columns by enclosing them within double square brackets.
Sometimes, we want to remove a feature but use it as an index instead. We can do this by specifying the column name as the index during the data load.
We can also set the column as the index later by using the set_index() method.
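A short sketch of these three operations (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")                    # hypothetical file and column names

df = df.drop(['id', 'name'], axis=1)            # drop non-informative columns (axis=1: column-wise)
subset = df[['age', 'income']]                  # keep only selected columns (double brackets)

df2 = pd.read_csv("data.csv", index_col='id')   # use a column as the index while loading
df3 = pd.read_csv("data.csv").set_index('id')   # ...or set it as the index afterwards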
We can further improve the situation of having too many features through dimensionality reduction.
Commonly used techniques are:
PCA (Principal Component Analysis) — Considered more of a statistical approach than a machine learning approach. It tries to preserve the essential parts of the data that have more variation and remove the non-essential parts with less variation. One important thing to note is that it is an unsupervised dimensionality reduction technique: you can cluster similar data points based on the correlation between them without any labels.
t-SNE (t-Distributed Stochastic Neighbor Embedding) — In this approach, the target number of dimensions is typically 2 or 3, which means that t-SNE is used a lot for visualizing your data, as visualizing data with more than 3 dimensions is not easy for the human brain. t-SNE has a remarkable capability of keeping points that are close in the multi-dimensional space close in the two-dimensional space.
Feature embedding — This is based on training a separate machine learning model to encode a large number of features into a small number of features.
Feature Transformation
Pandas will automatically attempt to figure out the best data type to use for each series in your dataset. Most of the time it does this flawlessly, but other times it fails horribly! Particularly, the .read_html() method is notorious for defaulting all series data types to Python objects. You should check, and double-check, the actual type of each column in your dataset to avoid unwanted surprises. If your data types don't look the way you expected, explicitly convert them to the desired type using the .to_datetime(), .to_numeric(), and .to_timedelta() methods:
Take note of how to_numeric properly converts to decimal or integer depending on the data it finds. The errors='coerce' parameter instructs Pandas to enter a NaN at any field where the conversion fails.
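A small sketch (made-up values) of the coercing behavior:

import pandas as pd

s = pd.Series(['3', '7.5', 'n/a', '42'])      # made-up example with one bad value

numbers = pd.to_numeric(s, errors='coerce')   # 'n/a' becomes NaN instead of raising an error
dates = pd.to_datetime(pd.Series(['2020-01-01', 'not a date']), errors='coerce')

print(numbers)
print(dates)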
Sometimes, even though the data type is correct, we still need to modify the values in features. For example, we may need to divide all the values by 10 or convert them to their logarithmic values.
Feature Engineering
Just as oil needs to be refined before it is used, similarly data needs to be refined before we use it for machine learning. Sometimes, we need to derive new features out of existing features. The process of extracting new features from existing ones is called feature engineering. Classical Machine Learning depends on feature engineering much more than Deep Learning.
Below are some types of Feature Engineering.
Aggregation — New features are created by getting a count, sum, average, mean, or median from a group of entities.
Part-Of — New features are created by extracting a part of data-structure. E.g. Extracting the month from a date.
Binning — Here you group your entities into bins and then you apply those aggregations over those bins. Example — group customers by age and then calculating average purchases within each group
Flagging — Here you derive a boolean (0/1 or True/False) value for each entity.
Example — we need to summarize data by finding its sum, average, minimum or maximum value and then creating new features with those new values.
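A tiny sketch of binning plus aggregation (made-up data and column names):

import pandas as pd

purchases = pd.DataFrame({'age_group': ['18-25', '18-25', '26-35', '26-35'],
                          'amount': [20, 35, 50, 70]})

# Average purchase amount per age group becomes a new, engineered feature
avg_per_group = purchases.groupby('age_group')['amount'].mean()
print(avg_per_group)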
Covariance and correlation are widely-used measures in the field of statistics, and thus both are very important concepts in data science. Covariance and correlation provide insight about the relationship between random variables or features in a dataset. Although these two concepts are highly related, we need to interpret them carefully so as not to cause any misunderstandings.
Covariance
Covariance is a quantitative measure that represents how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) value. Consider the random variables "X" and "Y". Some realizations of these variables are shown in the figure below. The orange dot shows the mean of X and the mean of Y. As the values of X move away from the mean of X in the positive direction, the values of Y tend to change in a similar way. The same relation is valid for the negative direction as well.
Positive covariance
The formula for covariance of two random variables:
where E means the expectation and µ is the mean.
If X and Y change in the same direction, as in the figure above, covariance is positive. Let’s confirm with the covariance function of numpy:
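The article's data is not available here, so the numbers below will not match the ones quoted next, but the call itself looks like this (a sketch with generated data):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=100)
Y = X + rng.normal(scale=0.5, size=100)   # Y varies together with X

print(np.cov(X, Y))   # 2x2 covariance matrix; the off-diagonal entries are Cov(X, Y)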
np.cov() returns the covariance matrix. The covariance of X and Y is 0.11. The value at position [0,0] shows the covariance of X with itself and the value at [1,1] shows the covariance of Y with itself. If you run the code np.cov(X,X), you will get the value at position [0,0] which is 0.07707877 in this case. Similarly, np.cov(Y,Y) will return the value at position [1,1].
The covariance of a variable with itself is actually the variance of that variable:
Let’s go over another example. The figure below shows some realizations of random variables Z and T. As we can see, as T increases, Z tends to decrease. Thus, the covariance of Z and T should be negative:
Negative covariance
We may also see variables whose variations are independent of each other. For example, in the figure below, realizations of variables A and B seem to change randomly with respect to each other. In this case, we expect to see a covariance value that is close to zero. Let's confirm:
Covariance is close to zero
The following example will provide a little more intuition about the calculation of covariance.
Covariance describes how similarly two random variables deviate from their mean. The red lines show the means of the series. The mean of s1 is the vertical line (x=8.5) and the mean of s2 is the horizontal line (y=9.3). Deviation from the mean is the difference between the values and the mean. Covariance is proportional to the product of the deviations of the s1 and s2 values. Consider the upper right rectangle in the plot above. Both s1 and s2 values are higher than the mean of s1 and s2, respectively. So, the deviations are positive. When we multiply two positive values, we get a positive value. In the lower left rectangle, s1 and s2 values are lower than the mean of s1 and s2, respectively. Thus, the deviations are negative, but we get a positive number when two negative numbers are multiplied. For the points in the lower right and upper left rectangle areas, the deviation of s1 is positive when the deviation of s2 is negative and vice versa. So we get a negative number when the two deviations are multiplied. All the deviations are combined to get the covariance. Hence, if we have more points in negative regions than positive regions, we will get a negative covariance.
Correlation
Correlation is a normalization of covariance by the standard deviation of each variable.
where σ is the standard deviation.
This normalization cancels out the units, and the correlation value always lies between -1 and 1: between 0 and 1 for a positive relationship and between 0 and -1 for a negative one. If we are comparing the relationship among three or more variables, it is better to use correlation because different value ranges or units may cause false assumptions.
Consider the dataframe below:
We want to measure the relationship between X-Y and X-Z. We want to find out which variable (Y or Z) is more correlated with X. Let’s use covariance first:
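The author's dataframe is not reproduced here; the sketch below builds made-up data that matches the described ranges (Y around 1, Z roughly between 22 and 222) just to show the two calls:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(20, 120, size=100)
df = pd.DataFrame({'X': x,
                   'Y': x / 100 + rng.normal(scale=0.05, size=100),   # values around 1
                   'Z': x + rng.uniform(0, 100, size=100)})           # values roughly 20-220

print(df.cov())    # Cov(X, Z) dwarfs Cov(X, Y) purely because of the value ranges
print(df.corr())   # correlation removes the scale effect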
The covariance of X and Z is much higher than the covariance of X and Y. We may think the relationship between the deviations in X and Z is much stronger than that of X and Y. However, that is not the case. The covariance of X and Z is higher because of the value ranges. The Z values range between 22 and 222, whereas the values of Y are around 1 (most of them are less than 1). Therefore, we need to use correlation to eliminate the effect of different value ranges.
As we can see from the correlation matrix, X and Y are actually more correlated than X and Z.
Learn how to present the relationships amongst the features using multivariate charts and plots in Python
While dealing with a big dataset, it is important to understand the relationship between the features. That is a big part of data analysis. The relationships can be between two variables or amongst several variables. I will discuss how to present the relationships between multiple variables with some simple techniques. Python’s Numpy, Pandas, Matplotlib, and Seaborn libraries will be used.
First, import the necessary packages and the dataset to be used.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.read_csv("nhanes_2015_2016.csv")
This dataset is very large. At least too large to show a screenshot here. Here are the columns in this dataset.
Column names may look strange to you. I will keep explaining as we keep using them.
In this dataset, we have two systolic blood pressure measurements ('BPXSY1', 'BPXSY2') and two diastolic blood pressure measurements ('BPXDI1', 'BPXDI2'). It is worth checking whether there is any relationship between them. Observe the relationship between the first and second systolic blood pressure.
To find out the relation between two variables, scatter plots have been used for a long time. It is the most popular, basic, and easily understandable way of looking at a relationship between two variables.
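The plotting code is not shown in this excerpt; a plausible sketch (using the columns named above) is:

plt.figure(figsize=(6, 6))
sns.scatterplot(x='BPXSY1', y='BPXSY2', data=df, alpha=0.3)
plt.show()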
The relationship between the two systolic blood pressures is positively linear. There is a lot of overlapping observed in the plot.
2. To understand the systolic and diastolic blood pressure data and their relationships more, make a joint plot. Jointplot shows the density of the data and the distribution of both the variables at the same time.
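Again, a plausible sketch of the joint plot (not necessarily the author's exact call):

sns.jointplot(x='BPXSY1', y='BPXSY2', data=df, kind='kde')
plt.show()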
In this plot, it shows very clearly that the densest area is from 115 to 135. Both the first and second systolic blood pressure distributions are right-skewed. Also, both of them have some outliers.
3. Find out if the correlation between the first and second systolic blood pressures are different in the male and female population.
From the two correlation charts above, the correlation between the two systolic blood pressures is about 1% higher in the female population than in the male population. If these things are new to you, I encourage you to try understanding the correlation between the two diastolic blood pressures, or between systolic and diastolic blood pressures.
4. Human behavior can change with so many different factors such as gender, education level, ethnicity, financial situation, and so on. In this dataset, we have ethnicity (“RIDRETH1”) information as well. Check the effect of both ethnicity and gender on the relationship between both the systolic blood pressures.
With different ethnic origins and genders, the correlations seem to change a little bit but generally stay positively linear as before.
5. Now, focus on some other variables in the dataset. Find the relationship between education and marital status.
Both the education column ('DMDEDUC2') and the marital status column ('DMDMARTL') are categorical. First, replace the numerical values with string values that make sense. We also need to get rid of values that do not add good information to the chart; for example, the education column has some 'Don't know' values and the marital status column has some 'Refused' values.
Finally, we got this DataFrame that is clean and ready for the chart.
x = pd.crosstab(db.DMDEDUC2x, db.DMDMARTLx)
x
Here is the result. The numbers look very simple to understand. But a chart of population proportions will be a more appropriate presentation. I am getting a population proportion based on marital status.
x.apply(lambda z: z/z.sum(), axis=1)
6. Find the population proportion of marital status segregated by Ethnicity (‘RIDRETH1’) and education level.
First, replace the numeric value with meaningful strings in the ethnicity column. I found these string values from the Center for Disease Control website.
7. Observe the difference in education level with age.
Here, education level is a categorical variable and age is a continuous variable. A good way of observing the difference in education levels with age will be to make a boxplot.
plt.figure(figsize=(12, 4))
a = sns.boxplot(db.DMDEDUC2x, db.RIDAGEYR)
This plot shows that the rate of college education is higher among younger people. A violin plot may provide a better picture.
plt.figure(figsize=(12, 4))
a = sns.violinplot(db.DMDEDUC2x, db.RIDAGEYR)
So, the violin plot shows a distribution. Most college-educated people are around age 30, while most people with less than a 9th-grade education are about 68 to 88 years old.
8. Show the marital status distributed by and segregated by gender.
Here, the blue color shows the male population distribution and the orange color represents the female population distribution. Only the 'never married' and 'living with partner' categories have similar distributions for the male and female populations. Every other category has a notable difference between the male and female populations.