what is Outliers in Data Point

An outlier is a data point in a data set that is distant from all other observations. A data point that lies outside the overall distribution of the dataset. Or in a layman term, we can say, an outlier is something that behaves differently from the combination/collection of the data.

Outliers can be very informative about the subject-area and data collection process. It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. To understand outliers, we need to go through these points:

  1. what causes the outliers?
  2. Impact of the outlier
  3. Methods to Identify outliers

What causes the outliers?

Before dealing with the outliers, one should know what causes them. There are three causes for outliers — data entry/An experiment measurement errors, sampling problems, and natural variation.

  1. Data entry /An experimental measurement error

An error can occur while experimenting/entering data. During data entry, a typo can type the wrong value by mistake. Let us consider a dataset of age, where we found a person age is 356, which is impossible. So this is a Data entry error.

Image for post

These types of errors are easy to identify. If you determine that an outlier value is an error, we can fix this error by deleting the data point because you know it’s an incorrect value.

2. Sampling problems

Outliers can occur while collecting random samples. Let us consider an example where we have records of bone density of various subjects, but there is an unusual growth of bone in a subject, after analyzing this has been discovered that the subject had diabetes, which affects bone health. The goal was to model bone density growth in girls with no health conditions that affect bone growth. Since the data is not a part of the target population so we will not consider this.

3. Natural variation

Suppose we need to check the reliability of a machine. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.

Impact of the outlier

Outliers can change the results of the data analysis and statistical modeling. Following are some impacts of outliers in the data set:

  1. It may cause a significant impact on the mean and the standard deviation
  2. If the outliers are non-randomly distributed, they can decrease normality
  3. They can bias or influence estimates that may be of substantive interest
  4. They can also impact the basic assumption of Regression, ANOVA, and other statistical model assumptions.

To understand the impact deeply, let’s take an example to check what happens to a data set with and without outliers in the data set.

Let’s examine what can happen to a data set with outliers. For the sample data set:

1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4

We find the following mean, median, mode, and standard deviation:

Mean = 2.58

Median = 2.5

Mode = 2

Standard Deviation = 1.08

If we add an outlier to the data set:

1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400

The new values of our statistics are:

Mean = 35.38

Median = 2.5

Mode = 2

Standard Deviation = 114.74

As you can see, having outliers often has a significant effect on your mean and standard deviation.

Methods to Identify outliers

There are various ways to identify outliers in a dataset, following are some of them:

  1. Sorting the data
  2. Using graphical Method
  3. Using z score
  4. Using the IQR interquartile range

Sorting the data

Sorting the dataset is the simplest and effective method to check unusual value. Let us consider an example of age dataset:

Image for post

In the above dataset, we have sort the age dataset and get to know that 398 is an outlier. Sorting data method is most effective on the small dataset.

Using graphical Method

We can detect outliers with the help of graphical representation like Scatter plot and Boxplot.

Image for post

1. Scatter Plot

Scatter plots often have a pattern. We call a data point an outlier if it doesn’t fit the pattern. Here we have a scatter plot of Weight vs height. Notice how two of the points don’t fit the pattern very well. There is no special rule that tells us whether or not a point is an outlier in a scatter plot. When doing more advanced statistics, it may become helpful to invent a precise definition of “outlier”.

Image for post

2. Box-Plot

Box-plot is one of the most effective ways of identifying Outliers in a dataset. When reviewing a box plot, an outlier is defined as a data point that is located outside the box of the box plot. As seen in the box plot of bill vs days. Box-Plot uses the Interquartile range(IQR) to detect outliers.

Using z-score

Image for post

Z-score (also called a standard score) gives you an idea of how many standard deviations away a data point is from the mean.. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is.

Z score = (x -mean) / std. deviation

In a normal distribution, it is estimated that

68% of the data points lie between +/- 1 standard deviation.

95% of the data points lie between +/- 2 standard deviation.

99.7% of the data points lie between +/- 3 standard deviation.

Formula for Z score = (Observation — Mean)/Standard Deviation

z = (X — μ) / σ

Let us consider a dataset:

Image for post
Image for post

Using the IQR interquartile range

Image for post

Interquartile range(IQR), is just the width of the box in the box-plot which can be used as a measure of how spread out the values are. An outlier is any value that lies more than one and a half times the length of the box from either end of the box.


  1. Arrange the data in increasing order
  2. Calculate first(q1) and third quartile(q3)
  3. Find interquartile range (q3-q1)
  4. Find lower bound q1*1.5
  5. Find upper bound q3*1.5

Anything that lies outside of lower and upper bound is an outlier

Let us take the same example as of Z-score:

Image for post
Image for post

As you can see we have found Lower and upper values that is: 7.5 and 19.5, so anything that lies outside these values is an outlier.

This is what Outliers is about.

Data Preprocessing in Machine Learning for Data Science

It is a data mining technique that transforms raw data into an understandable format. Raw data(real world data) is always incomplete and that data cannot be sent through a model. That would cause certain errors. That is why we need to preprocess data before sending through a model.

At the heart of this intricate process is data. Your machine learning tools are as good as the quality of your data. Sophisticated algorithms will not make up for poor data.In this article I will try to simplify the exercise of data preprocessing, or in other words, the rituals programmers usually follow before it is ready to be used for machine learning models into steps

Steps in Data Preprocessing

Here are the steps Data Scientist follows
1. Import libraries
2. Read data
3. Checking for missing values
4. Checking for categorical data
5. Standardize the data
6. PCA transformation
7. Data splitting

Step 1: Import Libraries:Libraries are modules that you can call upon when you need them,essential collections that Data Scientist or Analyst need in Python to use in Data processing and to arrive at a decisive outcome or output.e.g

import pandas as pd.

step2:Import Dataset: Datasets comes in many formats but a lot of comes in CSV formats.keep the datasets in the same directory as your program and you can read data using the method read_csv which can be found in the Library called pandas

import pandas as pd

dataset = pd.read_csv(YourFile.csv’)

After importing datasets you do EDA(Exploratory Data Analysis).“Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process.After studying our datasets carefully we creates a matrix of features in our dataset (X) and create a dependent vector (Y) with their respective observations.read the columns, we will use iloc of pandas (used to fix the indexes for selection) which takes two parameters — [row selection, column selection].In the industry, a data scientist often works with large datasets. It is impossible to understand the entire dataset at one go. So first get an idea of what you are dealing with by taking a subset of the entire dataset as a sample. Do not make any modifications in this stage. You are just observing the dataset and getting an idea of how to tackle it.

X = dataset.iloc[:, :-1].values

: as a parameter selects all. So the above piece of code selects all the rows. For columns we have :-1, which means all the columns except the last one.

Step 3: Taking care of Missing Data in Dataset:When you first get your data from a source most of the time it comes incomplete,missing data,incompatible measurement(cm or meters,dollars or pound sterling)you have to Normalized or standardized your data.Of course we would not get into Scaling right now.

Sometimes you may find some data are missing in the dataset. We need to be equipped to handle the problem when we come across them. you could remove the entire line of data but what if you are unknowingly removing crucial information? Of course we would not want to do that. One of the most common idea to handle the problem is to take a mean of all the values of the same column and have it to replace the missing data.The Library to use is Scikit-learn preprocessing it contains the class imputer which helps in the missing data

from sklearn.preprocessing import Imputer

Create an object of the same class to call the functions that are in that class object imputer it will take many parameters:

i. missing_values — We can either give it an integer or “NaN” for it to find the missing values.
ii. strategy — we will find the average so we will set it to mean. We can also set it to median or most_frequent (for mode) as necessary.
iii. axis — we can either assign it 0 or 1, 0 to impute along columns and 1 to impute along rows

.imputer = Imputer(missing_values = “NaN”, strategy = “mean”, axis = 0)

we will fit the imputer object to our data.fit is means training or imposing the model to our data

imputer = imputer.fit(X[:,1:3])

The code above will fit the imputer object to our matrix of features X. Since we used :, it will select all rows and 1:3 will select the second and the third column (why? because in python index starts from 0 so 1 would mean the second column and the upper-bound is excluded. If we wanted to include the third column instead, we would have written 1:4).

Now we will just replace the missing values with the mean of the column by the method transform.which i called data transformation

X[:, 1:3] = imputer.transform(X[:, 1:3])

in the next article or Part 2 we will discuss about the rest steps in Data Preprocessing and Data Exploratory analysis which are

Checking for categorical data
Standardize the data
PCA transformation
Data splitting(Training and Testing)

Loading Subsetting and Filtering Data in panddas.


As data scientists, we often work with tons of data. The data we want to load can be stored in different ways. The most common formats are the CSV filesExcel files, or databases. Also, the data can be available throughout web services. Of course, there are many other formats. To work with the data, we need to represent it in a tabular structure. Anything tabular is arranged in a table with rows and columns.

In some cases, the data is already tabular and it’s easy to load it. In other cases, we work with unstructured data. The unstructured data is not organized in a pre-defined manner (plain textimagesaudioweb pages). In this post, we’ll focus on loading data from CSV (Comma Separated Values) files.


Pandas is an open source library for the Python programming language developed by Wes McKinney. This library is very efficient and provides easy-to-use data structures and analysis tools.


Pandas contains a fast and efficient object for data manipulation called DataFrame. A commonly used alias for Pandas is pd. The library can load many different formats of data. When our data is clean and structured, every row represents an observation and every column a feature. The rows and the columns can have labels.

In the examples below, I’ll mark some parts with transparent rectangles for a better understanding of what we’re changing. Also, we’ll work with a very small subset from a dataset for simplicity. This dataset contains mobile cellular subscriptions for a given country and year. The full data can be found here. I’ve done some cleaning beforehand to make the data tidy.

Here is the data we want to load into a Pandas DataFrame. It’s uploaded in the in this GitHubGist web app and it’s already visualized with a tabular structure here. However, we can see it the raw format here. Also, we can see that this file contains comma separated values.

To load this data, we can use the pd.read_csv() function.

Loading data from a CSV file.

To create these examples, I’m using a Jupyter Notebook. If the last row in a code cell contains value it’s printed. So, that’s why I’ve put the cellular_data variable in the last row of the example.

We can see that the data is loaded, but there is something strange. What is this Unnamed: 0 column? We don’t have such column in our CSV file. Well, in our case this column contains the row labels (row index) of the data and we have to tell Pandas that. We can do this using the index_col argument.

Loading data from a CSV file using index_col

In other cases, our data can be without the row labels. In these cases, pandas will auto-generate these labels starting from 0 to the length of the rows — 1. Let’s see examples with the same data, without the row labels.

Loading data from a CSV file.

Now our DataFrame looks fine. Sometimes, we want to change the row labels in order to work easily with our data later. Here we can set the row labels to be the country code for each row. We can do that by setting the index attribute of a Pandas DataFrame to a list. The length of the list and the length of the rows must be the same. After that, we can easily subset our data or look at a given country using the country codes.

In many cases, we don’t want to set the index manually and we want the index to be one of the columns in the DataFrame. In such cases, we can use the DataFrame object’s method called set_index. Note that pandas doesn’t set the index permanently unless we tell it. In case we want to set the index permanently, we can use the inplace argument to achieve this.

Setting the country column to be the index for our DataFrame.
Setting the country column to be the index for our DataFrame.

In the example above, we don’t tell pandas to set the index permanently and when we print the cellular_data DataFrame we see that the index is not changed. Let’s try again, with the inplace argument.

Setting (inplace) the country column to be the index for our DataFrame.

Now, we can clearly see that when we use inplace = True, our DataFrame’s index is changed permanently.

Index and Select Data

There are many ways in which you can select data from DataFrames. In this blog post, we’ll see how to use square brackets and the methods loc and iloc to achieve this.

With square brackets, you can select a choice from the rows or you can select a choice from the columns. For a row selection, we can use a list of indexes or a slice. We can select rows using slicing like this: sliceable[start_index:end_index:step]

The end_index is not inclusive. I’ve already written about slicing in one of my previous blog post called Python Basics for Data Science. You can quickly look at the “Subsetting lists” part to understand it. Although the examples there are with lists, the idea here is the same. We just use DataFrames here, they are also sliceable.

Select all rows.
Select the first 2 rows.
Select the all rows from the third to the end.
Select the second row.

For a column selection, we can use a list of the wanted columns. If we pass only one column as a string instead of a list, the result will be pandas Series. The pandas Series are a one-dimensional array which can be labeled. If we paste 2 or more Series together, we’ll create a DataFrame. In some cases, we might want to select only one column, but keep the data in a DataFrame. In such cases, we can pass a list with one column name.

Select the “country” column only as series.
Select the “country” and “cellular_subscriptions” columns.

The square brackets are useful, but their functionality is limited. We can select only columns or only rows from a given DataFrame. In many cases, we need to select both columns and rows. The loc and iloc methods give us this power.

The loc method allows us to select rows and columns of your data based on labels. First, you specify the row labels to the left side, then you specify the column labels to the right side. The iloc allows us the same thing but based on the integer positions of our DataFrame.

If we want to select all rows or columns we can simply type : for the rows or for the columns side. Also if we want to select specific rows but all columns, we can just pass only the rows labels.

Understanding with examples is easier, so let’s see some. In these examples, we’ll compare the usage of these 2 methods.

Select the first row as Series.
Select the first row as DataFrame.
Select the rows for Bulgaria and Romania.
Select the rows for Russia and the United Kingdom and the “year” and “cellular_subscriptions” columns.
Select all the rows and the “year” and “cellular_subscriptions” columns.
Select the all the columns and the rows for Bulgaria, Russia, and Denmark.

Comparison operators in Python

The comparison operators can tell us how 2 values relate to each other. In many cases, Python can’t tell us how 2 values of different types relate to each other, but there are some exceptions. For example, we can compare float and integer numbers. Something to keep in mind is that we can compare Booleans with integersTrue correspond to 1 and False correspond to 0. These operators are very straightforward.

The comparison operators in Python.

Let’s see some very simple examples.Simple Comparison Operators

Filtering pandas DataFrame

The comparison operators can be used with pandas series. This can help us to filter our data by specific conditions. We can use comparison operators with series, the result will be a boolean series. Each item of these series will be True if the condition is met, and False otherwise. After we have these Boolean series, we can apply a row selection to get a filtered DataFrame as a result.

Creating a boolean series called is_small_subscr

Note that, we’ve used another syntax here to get the cellular_subcription columnDataFrame[column_name] and DataFrame.column_name code blocks returns the same result.

However, be careful with the dot syntax (used in these examples), because your column can be with the same name of the DataFrame’s methods. For example, if we have a column called “min”, we can’t use the dot syntax to get the values from that column. That’s because the DataFrame object has a method called “min”. Let’s now see how we can use the Boolean Series from above to filter our DataFrame.

Filtering the DataFrame using the boolean series from the previous example

Let’s see another example. Imagine that we want to get all records where the country is the United Kingdom.

Get all records where the country is the United Kingdom

Boolean Operators

Now that we know how to generate a Boolean series that meets some conditions, we can now use Boolean operators on them to create more complex filtering.

There are 3 types of Boolean operations

  • and – takes 2 Boolean values and return True if both the values are True. This operator is a short-circuit, it only evaluates the second argument if the first one is True.
  • or – takes 2 Boolean values and return True if at least one of them is True. This operator is also a short-circuit, it only evaluates the second argument if the first one is False.
  • not – take a Boolean value and return the opposite. This operator has a low priority than non-Boolean operators. For example not x == y is interpreted as not (x == y) and x == not y is a syntax error. Also, it is commonly used when we need to combine different Boolean operations and then want to negate the result.

Simple Boolean Operations

Subsetting by Multiple Conditions

When we want to filter our DataFrame by multiple conditions, we can use the Boolean operators. An important note here is that when we want to use Boolean operators with pandas, we must use them as follows:

  • & for and
  • | for or
  • ~ for not

When we apply a Boolean operation on 2 Boolean series with the same size, the Boolean operation will apply for each pair.

Using the “and” operator

We can see that pandas doesn’t work with and operator, it expects the & operator. Now, let’s try again. The goal here is to get only the flights that have more than 240 passengers and less than 300 passengers.

Using the “or” operator

Let’s find all flights that have lower than 200 or greater than 375 passengers. Remember that for the or operator we use the pipe | character.

Reversing conditions using the not operator

In some cases, we want to negate our condition. In such cases, we can use the not operator. For this operator, we use the tilde ~ character.

Let’s say that we want to get all flights that the month is not November.

Complex Conditions

We can make a more complex filtering based very specific conditions.

Let’s get all flights that in November for the 1952 and 1954 years.

Now, let’s get all flights that are between the 1952 and 1954 years and the month is August or September.

The isin method

Imagine that we want to compare equality of a single column to multiple values. Let’s say that we want to get all flights that are in the months: FebruaryAugust, and September. One way to achieve this is with multiple or conditions like this.

There is a repeated code and this is tedious. There is a better way to achieve the same result by using the isin method. We need to pass as a list or set the values to this method and it will return the wanted Boolean series.

Of course, we can combine the returned Boolean series of this method with other Boolean series.

Let’s say that we want to get the flights that are in the 1954 year and in the FebruaryAugust, and September months.

The between method

This method can make our code cleaner when we want to select values inside in a rangeInstead of writing 2 Boolean conditions, we can use this method.

Let’s say that we want to get all flights between the 1955 and 1960 years inclusive.

Again, we can combine this method with another conditional filtering.

Let’s get all the flights that are between the 1955 and 1960 years and are in the October month.

The isnull and isna methods

This isna method indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike). The isnull method is an alias for the isna method. This means that these 2 methods are exactly the same, but with different names.

I’ve changed the flights DataFrame which we have used. There are some NaN values in the month column. Let’s see how we can get all the records which have a missing month.

In many cases, we want to get the data that have no missing values. Let’s try to get the flights that have no missing month. We can use the not operator with the ~ character to negate the Boolean series returned by the isna method.

The notna method

There is also a method called notna. This method is the opposite of the isna method. We can achieve the same result from the last example using this method.

Subsetting by a Condition in One Line

All the examples we look at for now can be written in one line. Some people like that, others hate it. When we’re subsetting by only one condition in many cases it’s more preferable and easy to write our filter in one line.

Let’s first see a subsetting example only with one condition.

Subsetting by Multiple Conditions in One Line

In some cases, it’s okay to write a simple expression in one line, but in other cases, it’s very unreadable. My suggestion here is to write the simple ones in one line and the complex ones in multiple lines. If your row is very long it can be unreadable, so be careful.

Subsetting with boolean series using the .loc method.

Remember the .loc method? We can select rows and columns based on labels with this method. The nice thing is that we can pass Boolean series instead of passing labels for a rows or columns selection and it will work.

All of the generated Boolean series of the examples above we used for subsetting can be passed for a row selection.

Conditional Probability: Definition and Key Concepts

Conditional Probability and Unconditional Probability

Conditional Probability: Definition and Key Concepts

Conditional Probability may be explained as the likelihood of an event or outcome occurring based on the occurrence of a previous event or outcome. Usually, it is calculated by multiplying the probability of the preceding event by the updated probability of the succeeding, or conditional, event.

My general observation says that in problems where the occurrence of one event affects the happening of the following event. These scenarios of probability are classic conditional probability examples.

In the context of mathematics, the probability of occurrence of any event A when another event B in relation to A has already occurred is known as conditional probability.

Our discussion would also include differences between Conditional and Unconditional Probability and round off with the basic differences between Conditional and Joint Probability.Conditional Probability

Conditional Probability

Definition of Conditional Probability

The conditional probability may be defined as the probability of one event occurring with some relationship to one or more other events.

It is to be noted that the conditional probability does not state that there is always a causal relationship between the two events, as well as it does not indicate that both events occur simultaneously.

It’s primarily related to the Bayes’ theorem, which is one of the most influential theories in statistics.

The Formula for Conditional Probability may be explained as:

P(A|B) –  the probability of event A occurring given that event B has already occurred

P (A ∩ B) – the joint probability of events A and B; the probability that both events A and B occur at the same time

P(B) – the probability of event BConditional Probability

Formula of Conditional Probability

The formula above is applied to the calculation of the conditional probability of events that are neither independent nor mutually exclusive.

Experts on Conditional Probability suggest another way of calculating it by using the Bayes’ theorem. The theorem can be used to determine the conditional probability of event A, given that the event B has occurred by knowing the conditional probability of event B, given the event A has occurred, as well as the individual probabilities of the event A and B. Mathematically, the Bayes’ theorem can be denoted in the following way:Conditional Probability

Baye’s Theorem

Conditional Probability for Independent Events

Conditional Probability may be explained as two events that are independent of each other if the probability of the outcome of one event does not influence the probability of the outcome of another event. Therefore, the two independent events A and B may be represented as:

P(A|B) = P(A)

P(B|A) = P(B)Conditional Probability

Conditional Probability of two independent events.

Conditional Probability for Mutually Exclusive Events

In probability theory, mutually exclusive events may be explained as the events that cannot occur simultaneously. In other words, if an event has already occurred, another event cannot occur. Thus, the conditional probability of the mutually exclusive events is always zero.

P(A|B) = 0

P(B|A) = 0

Conditional Probability Examples

Examples using a table of data

According to a research paper, a two-way table of data is one of the most common problems we see in Conditional Probability. Here, we take a look at how to find different probabilities using such a table.


A survey asked full time and part-time students how often they had visited the college’s tutoring center in the last month. The results are shown below.

In a survey conducted by a college both full time and part-time, students were asked how often they had visited the college’s tutoring center in the last two months. The results may be represented as follows:Conditional Probability

Conditional Probability Example using Table Data

Suppose that a surveyed student is randomly selected.

(a) What is the probability the student visited the tutoring center four or more times, given that the student is full time?

Conditional probability is all about focusing on the information you know. When calculating this probability, we are given that the student is full time. Therefore, we should only look at full-time students to find the probability.Conditional Probability

The probability the student visited the tutoring center four or more times,

(b) Suppose that a student is part-time. What is the probability that the student visited the tutoring center one or fewer times?

This one is a bit trickier, because of the wording. Let us put it in the following way:

Find: probability student visited the tutoring center one or fewer times

Assume or given: a student is part-time (“suppose that a student is part-time”)Conditional Probability

The probability that the student visited the tutoring center one or fewer times. The student is part-time.

Since we are assuming (or supposing) the student is part-time, we will only look at part-time students for this calculation.

(c) If the student visited the tutoring center four or more times, what is the probability he or she is a part-time student?

As stated above, we must make sure we know what is given, and what we are finding.

Find probability he or she is part-time

Assume or given: the student visited the tutoring center four or more times (“if the student visited the tutoring center four or more times…”)

For this question, we are only looking at students who visited the tutoring center four or more times.Conditional Probability

The probability that the student is a part-time assuming that the student visited the tutoring center four or more times

Difference between Conditional & Joint Probability

What is Joint Probability

The joint probability may be explained as a measure of how likely it is for two (or more) things to occur.  For instance, if you roll two dice, you have the probability of getting a six on the first and a four on the second. This is a classic example of   Joint Probability, where the probability of occurrence of both results is possible.

What is Conditional Probability

Conditional probability, on the other hand, maybe explained as a measure of how likely one situation is likely to happen if you are aware of the occurrence of another event.  

For example, what is the probability that the second die shows a four if the sum of the numbers on the two dice is ten? If you know that the sum is ten, it turns out that it is far more likely that the second die is a four than if you knew nothing about the sum.Conditional Probability

Difference between Conditional & Joint Probability

Difference between Conditional & Unconditional Probability

Definition of Conditional Probability

Conditional Probability may be explained as a probability that considers some other piece of information, knowledge, or evidence.

Definition of Unconditional Probability

Unconditional Probability may be explained as a probability that does not consider any other information, knowledge, or evidence.

Krishna Singh, an expert on mathematics and statistics, explains the difference between Conditional and Unconditional Probability with the following example:

Conditional Probability Examples

Pulling an ace out of a deck of cards and then drawing a second ace without replacing the first. You would have a 4/52 chance of getting the first ace, but a 4/51 chance (if you didn’t pull an ace) of the second, making the second conditional upon the results of the first.

Unconditional Probability Examples

Rolling a die. The fact that you got a 6 on one roll has no effect on whether you will roll a 6 later on.Conditional Probability

Data Collection, Data Processing & Finished Result

Data Science and Conditional Probability

Data Science often uses statistical inferences to predict or analyze trends from data, while statistical inferences make use of probability distributions of data. Therefore, knowing probability and its applications are important to work effectively on data

Most data science techniques rely on Bayes’ theorem. Bayes’ theorem is a formula that describes at large how to update the probabilities of hypotheses when given evidence. You can build a learner using the Bayes’ theorem to predicts the probability of the response variable belonging to some class, given a new set of attributes.

Data Science is inextricably linked to Conditional Probability. Data Science professionals must have a thorough understanding of probability to solve complicated data science problems. A strong base in Probability and Conditional Probability is essential to fully understand and implement relevant algorithms for use.Conditional Probability

Data Science & Conditional Probability

Do you aspire to be a Data Analyst, and then grow further to become a Data Scientist?  Do you like finding answers to complex business challenges interests? Whatever you want, start early to gain a competitive advantage. You must be fluent in the programming languages and tools that will help you get hired.

You may also read my earliest post on How to Create a Killer Data Analyst Resume to create winning CVs that will leave an impression in the mind of the recruiters.

You may start as a Data Analyst, go on to become a Data Scientist with some years of experience, and eventually a data evangelist. Data Science offers lucrative career options. There is enough scope for growth and expansion.

You might be a programmer, a mathematics graduate, or simply a bachelor of Computer Applications. Students with a master’s degree in Economics or Social Science can also be a data scientist. Take up a Data Science or Data Analytics course, to learn Data Science skills and prepare yourself for the Data Scientist job, you have been dreaming of.Conditional Probability

A Career in Data Science

Taking up a good Data Science or Data Analytics course teaches you the key Data Science skills and prepares you for the Data Scientist, Data Scientist role (that you aspire for) in the near future. Do not forget to include all your skills in your data scientist’s resume.

In addition, students also get lifetime access to online course matter, 24×7 faculty support, expert advice from industry stalwarts, and assured placement support that prepares them better for the vastly expanding Data Science market.

What is Perceptron in Neural Network

Perceptron is a single layer neural network and a multi-layer perceptron is called Neural Networks.

Perceptron is a linear classifier (binary). Also, it is used in supervised learning. It helps to classify the given input data. But how the heck it works ?

A normal neural network looks like this as we all know

Get this book 👇

Introduction to Machine Learning with Python: A Guide for Data Scientists

It helped me a lot. 🙌 👍

As you can see it has multiple layers.

The perceptron consists of 4 parts.

  1. Input values or One input layer
  2. Weights and Bias
  3. Net sum
  4. Activation Function

FYI: The Neural Networks work the same way as the perceptron. So, if you want to know how neural network works, learn how perceptron works.

Fig : Perceptron

But how does it work?

The perceptron works on these simple steps

a. All the inputs x are multiplied with their weights w. Let’s call it k.

Fig: Multiplying inputs with weights for 5 inputs

b. Add all the multiplied values and call them Weighted Sum.

Fig: Adding with Summation

c. Apply that weighted sum to the correct Activation Function.

For Example: Unit Step Activation Function.

Fig: Unit Step Activation Function

Why do we need Weights and Bias?

Weights shows the strength of the particular node.

A bias value allows you to shift the activation function curve up or down.

Why do we need Activation Function?

In short, the activation functions are used to map the input between the required values like (0, 1) or (-1, 1).

Where we use Perceptron?

Perceptron is used to classify data into two parts,therefore is known as Linear Binary classifier.

Classification Algorithms(Using NaiveBayes Classifier)

Classification may be defined as the process of predicting class or category from observed values or given data points. The categorized output can have the form such as “Black” or “White” or “spam” or “no spam”.

Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It is basically belongs to the supervised machine learning in which targets are also provided along with the input data set.

An example of classification problem can be the spam detection in emails. There can be only two categories of output, “spam” and “no spam”; hence this is a binary type classification.

To implement this classification, we first need to train the classifier. For this example, “spam” and “no spam” emails would be used as the training data. After successfully train the classifier, it can be used to detect an unknown email.

Types of Learners in Classification

We have two types of learners in respective to classification problems −

Lazy Learners

As the name suggests, such kind of learners waits for the testing data to be appeared after storing the training data. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbor and case-based reasoning.

Eager Learners

As opposite to lazy learners, eager learners construct classification model without waiting for the testing data to be appeared after storing the training data. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).

Building a Classifier in Python

Scikit-learn, a Python library for machine learning can be used to build a classifier in Python. The steps for building a classifier in Python are as follows −

Step 1: Importing necessary python package

For building a classifier using scikit-learn, we need to import it. We can import it by using following script −

import sklearn

Step 2: Importing dataset

After importing necessary package, we need a dataset to build classification prediction model. We can import it from sklearn dataset or can use other one as per our requirement. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic Database. We can import it with the help of following script −

from sklearn.datasets import load_breast_cancer

The following script will load the dataset;

data = load_breast_cancer()

We also need to organize the data and it can be done with the help of following scripts −

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

The following command will print the name of the labels, ‘malignant’ and ‘benign’ in case of our database.


The output of the above command is the names of the labels −

['malignant' 'benign']

These labels are mapped to binary values 0 and 1. Malignant cancer is represented by 0 and Benign cancer is represented by 1.

The feature names and feature values of these labels can be seen with the help of following commands −


The output of the above command is the names of the features for label 0 i.e. Malignant cancer −

mean radius

Similarly, names of the features for label can be produced as follows −


The output of the above command is the names of the features for label 1 i.e. Benign cancer −

mean texture

We can print the features for these labels with the help of following command −


This will give the following output −

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]

We can print the features for these labels with the help of following command −


This will give the following output −

[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
7.017e-02  1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03  1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01  1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01  8.902e-02]

Step 3: Organizing data into training & testing sets

As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use train_test_split() function of sklearn python package to split the data into sets. The following command will import the function −

from sklearn.model_selection import train_test_split

Now, next command will split the data into training & testing data. In this example, we are using taking 40 percent of the data for testing purpose and 60 percent of the data for training purpose −

train, test, train_labels, test_labels = 
   train_test_split(features,labels,test_size = 0.40, random_state = 42)

Step 4: Model evaluation

After dividing the data into training and testing we need to build the model. We will be using Naïve Bayes algorithm for this purpose. The following commands will import the GaussianNB module −

from sklearn.naive_bayes import GaussianNB

Now, initialize the model as follows −

gnb = GaussianNB()

Next, with the help of following command we can train the model −

model = gnb.fit(train, train_labels)

Now, for evaluation purpose we need to make predictions. It can be done by using predict() function as follows −

preds = gnb.predict(test)

This will give the following output −

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
 0 0 1 1 0 1]

The above series of 0s and 1s in output are the predicted values for the Malignant and Benign tumor classes.

Step 5: Finding accuracy

We can find the accuracy of the model build in previous step by comparing the two arrays namely test_labels and preds. We will be using the accuracy_score() function to determine the accuracy.

from sklearn.metrics import accuracy_score

The above output shows that NaïveBayes classifier is 95.17% accurate.

Classification Evaluation Metrics

The job is not done even if you have finished implementation of your Machine Learning application or model. We must have to find out how effective our model is? There can be different evaluation metrics, but we must choose it carefully because the choice of metrics influences how the performance of a machine learning algorithm is measured and compared.

The following are some of the important classification evaluation metrics among which you can choose based upon your dataset and kind of problem −

Confusion Matrix

  • Confusion Matrix − It is the easiest way to measure the performance of a classification problem where the output can be of two or more type of classes.

Various ML Classification Algorithms

The followings are some important ML classification algorithms −

  • Logistic Regression
  • Support Vector Machine (SVM)
  • Decision Tree
  • Naïve Bayes
  • Random Forest

We will be discussing all these classification algorithms in detail in further chapters.


Some of the most important applications of classification algorithms are as follows −

  • Speech Recognition
  • Handwriting Recognition
  • Biometric Identification
  • Document Classification

A Comprehensive Guide to Types of Neural Networks

Much of modern technology is based on computational models known as artificial neural networks. There are many different types of neural networks which function on the same principles as the nervous system in the human body.

As Howard Rheingold said, “The neural network is this kind of technology that is not an algorithm, it is a network that has weights on it, and you can adjust the weights so that it learns. You teach it through trials.”

What are Artificial Neural Networks?

An artificial neural network is a system of hardware or software that is patterned after the working of neurons in the human brain and nervous system. Artificial neural networks are a variety of deep learning technology which comes under the broad domain of Artificial Intelligence.

Deep learning is a branch of Machine Learning which uses different types of neural networks. These algorithms are inspired by the way our brain functions and therefore many experts believe they are our best shot to moving towards real AI (Artificial Intelligence).

Deep learning is becoming especially exciting now as we have more amounts of data and larger neural networks to work with.

Moreover, the performance of neural networks improves as they grow bigger and work with more and more data, unlike other Machine Learning algorithms which can reach a plateau after a point.

How do Neural Networks work?

A neural network has a large number of processors. These processors operate parallelly but are arranged as tiers. The first tier receives the raw input similar to how the optic nerve receives the raw information in human beings.

Each successive tier then receives input from the tier before it and then passes on its output to the tier after it. The last tier processes the final output.

Small nodes make up each tier. The nodes are highly interconnected with the nodes in the tier before and after. Each node in the neural network has its own sphere of knowledge, including rules that it was programmed with and rules it has learnt by itself.  

The key to the efficacy of neural networks is they are extremely adaptive and learn very quickly. Each node weighs the importance of the input it receives from the nodes before it. The inputs that contribute the most towards the right output are given the highest weight.

What are the Different Types of Neural Networks?

Different types of neural networks use different principles in determining their own rules. There are many types of artificial neural networks, each with their unique strengths. You can take a look at this video to see the different types of neural networks and their applications in detail.

Here are some of the most important types of neural networks and their applications.

1. Feedforward Neural Network – Artificial Neuron

This is one of the simplest types of artificial neural networks. In a feedforward neural network, the data passes through the different input nodes till it reaches the output node.

In other words, data moves in only one direction from the first tier onwards until it reaches the output node. This is also known as a front propagated wave which is usually achieved by using a classifying activation function.

Unlike in more complex types of neural networks, there is no backpropagation and data moves in one direction only. A feedforward neural network may have a single layer or it may have hidden layers.

In a feedforward neural network, the sum of the products of the inputs and their weights are calculated. This is then fed to the output. Here is an example of a single layer feedforward neural network.Types of Neural Networks Source - analyticsindiamag.com

Feedforward Neural Network – Artificial Neuron

Feedforward neural networks are used in technologies like face recognition and computer vision. This is because the target classes in these applications are hard to classify.

A simple feedforward neural network is equipped to deal with data which contains a lot of noise. Feedforward neural networks are also relatively simple to maintain.

2. Radial Basis Function Neural Network

A radial basis function considers the distance of any point relative to the centre. Such neural networks have two layers. In the inner layer, the features are combined with the radial basis function.

Then the output of these features is taken into account when calculating the same output in the next time-step. Here is a diagram which represents a radial basis function neural network.Types of Neural Networks Source analyticsindiamag.com

Radial Basis Function Neural Network

The radial basis function neural network is applied extensively in power restoration systems. In recent decades, power systems have become bigger and more complex.

This increases the risk of a blackout. This neural network is used in the power restoration systems in order to restore power in the shortest possible time.

3. Multilayer Perceptron

A multilayer perceptron has three or more layers. It is used to classify data that cannot be separated linearly. It is a type of artificial neural network that is fully connected. This is because every single node in a layer is connected to each node in the following layer.

A multilayer perceptron uses a nonlinear activation function (mainly hyperbolic tangent or logistic function). Here’s what a multilayer perceptron looks like.Types of Neural Networks Source medium.com

Multilayer Perceptron

This type of neural network is applied extensively in speech recognition and machine translation technologies.

4. Convolutional Neural Network

A convolutional neural network(CNN) uses a variation of the multilayer perceptrons. A CNN contains one or more than one convolutional layers. These layers can either be completely interconnected or pooled.

Before passing the result to the next layer, the convolutional layer uses a convolutional operation on the input. Due to this convolutional operation, the network can be much deeper but with much fewer parameters.

Due to this ability, convolutional neural networks show very effective results in image and video recognition, natural language processing, and recommender systems.

Convolutional neural networks also show great results in semantic parsing and paraphrase detection. They are also applied in signal processing and image classification.

CNNs are also being used in image analysis and recognition in agriculture where weather features are extracted from satellites like LSAT to predict the growth and yield of a piece of land. Here’s an image of what a Convolutional Neural Network looks like.Types of Neural Networks Source medium.com

Convolutional Neural Network

5. Recurrent Neural Network(RNN) – Long Short Term Memory

A Recurrent Neural Network is a type of artificial neural network in which the output of a particular layer is saved and fed back to the input. This helps predict the outcome of the layer.

The first layer is formed in the same way as it is in the feedforward network. That is, with the product of the sum of the weights and features. However, in subsequent layers, the recurrent neural network process begins.

From each time-step to the next, each node will remember some information that it had in the previous time-step. In other words, each node acts as a memory cell while computing and carrying out operations. The neural network begins with the front propagation as usual but remembers the information it may need to use later.

If the prediction is wrong, the system self-learns and works towards making the right prediction during the backpropagation. This type of neural network is very effective in text-to-speech conversion technology.  Here’s what a recurrent neural network looks like.Types of Neural Networks Source analyticsindiamag.com

Recurrent Neural Network(RNN) – Long Short Term Memory

6. Modular Neural Network

A modular neural network has a number of different networks that function independently and perform sub-tasks. The different networks do not really interact with or signal each other during the computation process. They work independently towards achieving the output.

As a result, a large and complex computational process can be done significantly faster by breaking it down into independent components. The computation speed increases because the networks are not interacting with or even connected to each other.  Here’s a visual representation of a Modular Neural Network.Types of Neural Networks Source analyticsindiamag.com

Modular Neural Network

7. Sequence-To-Sequence Models

A sequence to sequence model consists of two recurrent neural networks. There’s an encoder that processes the input and a decoder that processes the output. The encoder and decoder can either use the same or different parameters. This model is particularly applicable in those cases where the length of the input data is not the same as the length of the output data.  

Sequence-to-sequence models are applied mainly in chatbots, machine translation, and question answering systems.

Summing up

There are many types of artificial neural networks that operate in different ways to achieve different outcomes. The most important part about neural networks is that they are designed in a way that is similar to how neurons in the brain work.

As a result, they are designed to learn more and improve more with more data and more usage. Unlike traditional machine learning algorithms which tend to stagnate after a certain point, neural networks have the ability to truly grow with more data and more usage.

That’s why many experts believe that different types of neural networks will be the fundamental framework on which next-generation Artificial Intelligence will be built. Thus taking a Machine Learning Course will prove to be a added benefit.

Hopefully, by now you must have understood the concept of Neural Networks and its types. Moreover, if you are also inspired by the opportunity of Machine Learning, enroll in our Machine Learning using Python Course.

Data Cleaning with Python and Pandas: Detecting Missing Values

Data cleaning can be a tedious task.

It’s the start of a new project and you’re excited to apply some machine learning models.

You take a look at the data and quickly realize it’s an absolute mess.

According to IBM Data Analytics you can expect to spend up to 80% of your time cleaning data.

data cleaning from ibm analytics

In this post we’ll walk through a number of different data cleaning tasks using Python’s Pandas library.  Specifically, we’ll focus on probably the biggest data cleaning task, missing values.

After reading this post you’ll be able to more quickly clean data.  We all want to spend less time cleaning data, and more time exploring and modeling.Click here to get the FREE Data Cleaning Cheat Sheet

Sources of Missing Values

Before we dive into code, it’s important to understand the sources of missing data.  Here’s some typical reasons why data is missing:

  • User forgot to fill in a field.
  • Data was lost while transferring manually from a legacy database.
  • There was a programming error.
  • Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.

As you can see, some of these sources are just simple random mistakes.  Other times, there can be a deeper reason why data is missing.

It’s important to understand these different types of missing data from a statistics point of view.  The type of missing data will influence how you deal with filling in the missing values.

Today we’ll learn how to detect missing values, and do some basic imputation.  For a detailed statistical approach for dealing with missing data, check out these awesome slides from data scientist Matt Brems.

Keep in mind, imputing with a median or mean value is usually a bad idea, so be sure to check out Matt’s slides for the correct approach.

Getting Started

Before you start cleaning a data set, it’s a good idea to just get a general feel for the data.  After that, you can put together a plan to clean the data.

I like to start by asking the following questions:

  • What are the features?
  • What are the expected types (int, float, string, boolean)?
  • Is there obvious missing data (values that Pandas can detect)?
  • Is there other types of missing data that’s not so obvious (can’t easily detect with Pandas)?

To show you what I mean, let’s start working through the example.The data we’re going to work with is a very small real estate data.

Here’s a quick look at the data:

real estate data

This is a much smaller dataset than what you’ll typically work with.  Even though it’s a small dataset, it highlights a lot of real-world situations that you will encounter on projects.

A good way to get a quick feel for the data is to take a look at the first few rows.  Here’s how you would do that in Pandas:

# Importing libraries
import pandas as pd
import numpy as np

# Read csv file into a pandas dataframe
df = pd.read_csv("property data.csv")

# Take a look at the first few rows
print df.head()
0   104.0     PUTNAM            Y           3.0
1   197.0  LEXINGTON            N           3.0
2     NaN  LEXINGTON            N           3.0
3   201.0   BERKELEY          NaN           1.0
4   203.0   BERKELEY            Y           3.0

I know that I said we’ll be working with Pandas, but you can see that I also imported Numpy. We’ll use this a little bit later on to rename some missing values, so we might as well import it now.

After importing the libraries we read the csv file into a Pandas dataframe. You can think of the dataframe as a spreadsheet.

With the .head()method, we can easily see the first few rows.

Now I can answer my original question, what are my features?  It’s pretty easy to infer the following features from the column names:

  • ST_NUM: Street number
  • ST_NAME: Street name
  • OWN_OCCUPIED: Is the residence owner occupied
  • NUM_BEDROOMS: Number of bedrooms

We can also answer, what are the expected types?

  • ST_NUM: float or int… some sort of numeric type
  • ST_NAME: string
  • OWN_OCCUPIED: string… Y (“Yes”) or N (“No”)
  • NUM_BEDROOMS: float or int, a numeric type

To answer the next two questions, we’ll need to start getting more in-depth width Pandas. Let’s start looking at examples of how to detect missing values

Standard Missing Values

So what do I mean by “standard missing values”? These are missing values that Pandas can detect.

Going back to our original data set, let’s take a look at the “Street Number” column.

standard missing values

In the third column there’s an empty cell. In the seventh row there’s an “NA” value.

Clearly these are both missing values. Let’s see how Pandas deals with these.

# Looking at the ST_NUM column
print df['ST_NUM']
print df['ST_NUM'].isnull()
# Looking at the ST_NUM column
0    104.0
1    197.0
2      NaN
3    201.0
4    203.0
5    207.0
6      NaN
7    213.0
8    215.0

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False

Taking a look at the column, we can see that Pandas filled in the blank space with “NA”. Using the isnull() method, we can confirm that both the missing value and “NA” were recognized as missing values. Both boolean responses are True.

This is a simple example, but highlights an important point. Pandas will recognize both empty cells and “NA” types as missing values. In the next section, we’ll take a look at some types that Pandas won’t recognize.

Non-Standard Missing Values

Sometimes it might be the case where there’s missing values that have different formats.

Let’s take a look at the “Number of Bedrooms” column to see what I mean.

non-standard missing values

In this column, there’s four missing values.

  • n/a
  • NA
  • na

From the previous section, we know that Pandas will recognize “NA” as a missing value, but what about the others? Let’s take a look.

# Looking at the NUM_BEDROOMS column
print df['NUM_BEDROOMS']
print df['NUM_BEDROOMS'].isnull()

0      3
1      3
2    n/a
3      1
4      3
5    NaN
6      2
7     --
8     na

0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False

Just like before, Pandas recognized the “NA” as a missing value. Unfortunately, the other types weren’t recognized.

If there’s multiple users manually entering data, then this is a common problem. Maybe i like to use “n/a” but you like to use “na”.

An easy way to detect these various formats is to put them in a list. Then when we import the data, Pandas will recognize them right away. Here’s an example of how we would do that.

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property data.csv", na_values = missing_values)

Now let’s take another look at this column and see what happens.

# Looking at the NUM_BEDROOMS column
print df['NUM_BEDROOMS']
print df['NUM_BEDROOMS'].isnull()

0    3.0
1    3.0
2    NaN
3    1.0
4    3.0
5    NaN
6    2.0
7    NaN
8    NaN

0    False
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8     True

This time, all of the different formats were recognized as missing values.

You might not be able to catch all of these right away. As you work through the data and see other types of missing values, you can add them to the list.

It’s important to recognize these non-standard types of missing values for purposes of summarizing and transforming missing values. If you try and count the number of missing values before converting these non-standard types, you could end up missing a lot of missing values.

In the next section we’ll take a look at a more complicated, but very common, type of missing value.

Unexpected Missing Values

So far we’ve seen standard missing values, and non-standard missing values. What if we an unexpected type?

For example, if our feature is expected to be a string, but there’s a numeric type, then technically this is also a missing value.

Let’s take a look at the “Owner Occupied” column to see what I’m talking about.

unexpected missing values

From our previous examples, we know that Pandas will detect the empty cell in row seven as a missing value. Let’s confirm with some code.

# Looking at the OWN_OCCUPIED column
print df['OWN_OCCUPIED']
print df['OWN_OCCUPIED'].isnull()
# Looking at the ST_NUM column
0      Y
1      N
2      N
3     12
4      Y
5      Y
6    NaN
7      Y
8      Y

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False

In the fourth row, there’s the number 12. The response for Owner Occupied should clearly be a string (Y or N), so this numeric type should be a missing value.

This example is a little more complicated so we’ll need to think through a strategy for detecting these types of missing values. There’s a number of different approaches, but here’s the way that I’m going to work thorugh this one.

  1. Loop through the OWN_OCCUPIED column
  2. Try and turn the entry into an integer
  3. If the entry can be changed into an integer, enter a missing value
  4. If the number can’t be an integer, we know it’s a string, so keep going

Let’s take a look at the code and then we’ll go through it in detail.

# Detecting numbers 
for row in df['OWN_OCCUPIED']:
        df.loc[cnt, 'OWN_OCCUPIED']=np.nan
    except ValueError:

In the code we’re looping through each entry in the “Owner Occupied” column. To try and change the entry to an integer, we’re using int(row).

If the value can be changed to an integer, we change the entry to a missing value using Numpy’s np.nan.

On the other hand, if it can’t be changed to an integer, we pass and keep going.

You’ll notice that I used try and except ValueError. This is called exception handling, and we use this to handle errors.

If we were to try and change an entry into an integer and it couldn’t be changed, then a ValueError would be returned, and the code would stop. To deal with this, we use exception handling to recognize these errors, and keep going.

Another important bit of the code is the .loc method. This is the preferred Pandas method for modfiying entries in place. For more info on this you can check out the Pandas documentation.

Now that we’ve worked through the different ways of detecting missing values, we’ll take a look at summarizing, and replacing them.

Summarizing Missing Values

After we’ve cleaned the missing values, we will probably want to summarize them. For instance, we might want to look at the total number of missing values for each feature.

# Total missing values for each feature
print df.isnull().sum()
ST_NUM          2
ST_NAME         0

Other times we might want to do a quick check to see if we have any missing values at all.

# Any missing values?
print df.isnull().values.any()

We might also want to get a total count of missing values.

# Total number of missing values
print df.isnull().sum().sum()

Now that we’ve summarized the number of missing values, let’s take a look at doing some simple replacements.


Often times you’ll have to figure out how you want to handle missing values.

Sometimes you’ll simply want to delete those rows, other times you’ll replace them.

As I mentioned earlier, this shouldn’t be taken lightly. We’ll go over some basic imputations, but for a detailed statistical approach for dealing with missing data, check out these awesome slides from data scientist Matt Brems.

That being said, maybe you just want to fill in missing values with a single value.

# Replace missing values with a number
df['ST_NUM'].fillna(125, inplace=True)

More likely, you might want to do a location based imputation. Here’s how you would do that.

# Location based replacement
df.loc[2,'ST_NUM'] = 125

A very common way to replace missing values is using a median.

# Replace using median 
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'].fillna(median, inplace=True)

We’ve gone over a few simple ways to replace missing values, but be sure to check out Matt’s slides for the proper techniques.


Dealing with messy data is inevitable. Data cleaning is just part of the process on a data science project.

In this article we went over some ways to detect, summarize, and replace missing values.