Data Preprocessing in Machine Learning for Data Science

Data preprocessing is a data mining technique that transforms raw data into an understandable format. Raw, real-world data is often incomplete or inconsistent, and feeding it directly to a model tends to cause errors or poor results. That is why we preprocess data before sending it through a model.

Data is at the heart of this process. Your machine learning tools are only as good as the quality of your data, and sophisticated algorithms will not make up for poor data. In this article I will try to break data preprocessing, the routine programmers usually follow to get data ready for machine learning models, into simple steps.

Steps in Data Preprocessing

Here are the steps a data scientist follows:
1. Import libraries
2. Read data
3. Checking for missing values
4. Checking for categorical data
5. Standardize the data
6. PCA transformation
7. Data splitting

Step 1: Import Libraries: Libraries are collections of modules that you can call upon when you need them. They are the essential tools a data scientist or analyst uses in Python to process data and arrive at a result, e.g.

import pandas as pd

Step 2: Import Dataset: Datasets come in many formats, but a lot of them come as CSV files. Keep the dataset in the same directory as your program and you can read it with the read_csv function, which is found in the pandas library.

import pandas as pd

dataset = pd.read_csv('YourFile.csv')
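As a rough illustration of what read_csv gives back, the sketch below reads a tiny CSV held in a string instead of a file on disk, so it runs on its own. The column names and values are made up for the example; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""

# read_csv accepts a path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)    # (number of rows, number of columns)
print(dataset.head(2))  # first two rows
print(dataset.dtypes)   # type pandas inferred for each column
```

Looking at the shape, the first rows and the inferred types is a cheap way to confirm the file was parsed the way you expected.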

After importing the dataset you do EDA (Exploratory Data Analysis). “Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process.” In the industry, a data scientist often works with large datasets, and it is impossible to understand an entire dataset in one go. So first get an idea of what you are dealing with by taking a subset of the dataset as a sample. Do not make any modifications at this stage. You are just observing the dataset and getting an idea of how to tackle it.

After studying the dataset carefully, we create a matrix of features (X) and a dependent vector (Y) with their respective observations. To read the columns we will use iloc of pandas (selection by integer position), which takes two selectors: [row selection, column selection].

X = dataset.iloc[:, :-1].values

A bare : as a selector selects everything, so the code above selects all the rows. For the columns we have :-1, which means all the columns except the last one.
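The dependent vector Y can be taken the same way. The sketch below assumes, as the X example does, that the target sits in the last column; the small DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: three feature columns, target in the last column
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
Y = dataset.iloc[:, -1].values   # all rows, only the last column

print(X.shape)
print(Y)
```

If the target is not the last column in your data, the column positions passed to iloc would need to change accordingly.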

Step 3: Taking care of Missing Data in Dataset: When you first get your data from a source, most of the time it comes incomplete, with missing values or incompatible units of measurement (cm or meters, dollars or pounds sterling), so you have to normalize or standardize it. We will not get into scaling right now.
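Before deciding how to handle gaps, it helps to see how many there are and where. A minimal sketch with pandas, using a made-up DataFrame with two missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in Age and one in Salary
dataset = pd.DataFrame({
    "Age": [44, 27, np.nan, 38],
    "Salary": [72000, np.nan, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Number of missing entries in each column
missing_per_column = dataset.isnull().sum()

# Only the rows that contain at least one missing entry
rows_with_gaps = dataset[dataset.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```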

Sometimes you may find that some data is missing from the dataset, and we need to be equipped to handle the problem when we come across it. You could remove the entire row, but what if you are unknowingly removing crucial information? Of course we would not want to do that. One of the most common ideas is to take the mean of all the values in the same column and use it to replace the missing data. Scikit-learn contains an imputer class that helps with missing data.

import numpy as np
from sklearn.impute import SimpleImputer

In current scikit-learn releases the class is SimpleImputer in sklearn.impute. The older sklearn.preprocessing.Imputer, which also took an axis parameter (0 to impute along columns, 1 along rows), was removed in version 0.22. Create an object of the class to call its methods. It takes several parameters, including:

i. missing_values: the placeholder that marks a missing value, such as an integer or np.nan.
ii. strategy: we want the average, so we set it to mean. We can also set it to median or most_frequent (for the mode) as necessary.

SimpleImputer has no axis parameter; it always imputes column by column.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Next we fit the imputer object to our data. Fitting means the imputer learns what it needs from the data, here the mean of each selected column.

imputer = imputer.fit(X[:,1:3])

The code above fits the imputer object to our matrix of features X. Since we used :, it selects all rows, and 1:3 selects the second and the third columns. Why? Because indexing in Python starts from 0, so 1 means the second column, and the upper bound is excluded. If we wanted to include the fourth column as well, we would have written 1:4.

Now we replace the missing values with the mean of the column using the transform method. This step is called data transformation.
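The slicing rule is easy to check on a small array. In this sketch each entry equals its column index, so the output shows directly which columns a slice picks up; the array itself is made up for the illustration.

```python
import numpy as np

# 2 rows, 5 columns; each value is its own column index
X = np.array([[0, 1, 2, 3, 4],
              [0, 1, 2, 3, 4]])

second_and_third = X[:, 1:3]   # indexes 1 and 2; the upper bound 3 is excluded
second_to_fourth = X[:, 1:4]   # indexes 1, 2 and 3

print(second_and_third)
print(second_to_fourth)
```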

X[:, 1:3] = imputer.transform(X[:, 1:3])
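Putting the fit and transform steps together, here is a minimal end-to-end sketch. It uses SimpleImputer from sklearn.impute, the imputer class available in current scikit-learn releases, and a small made-up feature matrix in which columns 1 and 2 each have one gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 is an ID, columns 1 and 2 have gaps
X = np.array([
    [1.0, 44.0, 72000.0],
    [2.0, 27.0, np.nan],
    [3.0, np.nan, 54000.0],
    [4.0, 38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit learns one mean per selected column, ignoring the missing entries
imputer = imputer.fit(X[:, 1:3])
print(imputer.statistics_)

# transform fills each gap with the mean learned for its column
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```

The learned means live in the statistics_ attribute after fitting, which is handy for checking what value will be written into the gaps before you overwrite anything.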

In the next article, Part 2, we will discuss the remaining steps in data preprocessing and exploratory data analysis, which are:

Checking for categorical data
Standardize the data
PCA transformation
Data splitting (Training and Testing)