The Basic Fundamentals of Data Preprocessing

Preprocessing refers to the transformations applied to our data before feeding it to the algorithm; in other words, data preprocessing is the technique of converting raw data into a clean data set.

Step 1: Importing the required libraries

NumPy: a library that provides fast numerical arrays and mathematical functions.

pandas: a library used to import and manage data sets.
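A minimal sketch of the two imports and what each library is for (the sample values are made up for illustration):

```python
import numpy as np   # numerical arrays and mathematical functions
import pandas as pd  # tabular data import and manipulation

# NumPy provides fast array math:
arr = np.array([1.0, 2.0, 3.0])
print(arr.mean())  # → 2.0

# pandas provides the DataFrame for tabular data:
df = pd.DataFrame({"Age": [25, 30], "Salary": [50000, 60000]})
print(df.shape)  # → (2, 2)
```

The `np` and `pd` aliases are the conventional short names used in virtually all NumPy and pandas code.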

Step 2: Importing the dataset

Datasets are generally available in .csv format. A CSV file stores tabular data in plain text. We use the read_csv method of the pandas library to read a local CSV file into a DataFrame.
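A short sketch of this step. The CSV content here is a small made-up example held in memory; in practice you would pass a file path (e.g. a hypothetical "Data.csv") to read_csv:

```python
import io
import pandas as pd

# Simulate a small CSV file in memory (in real use: pd.read_csv("Data.csv")).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Separate the feature matrix X from the target vector y (last column).
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X.shape)  # → (3, 3)
```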

Step 3: Handling the Missing Data

The missing values in the data need to be handled so that they do not reduce the performance of our Machine Learning model. We can replace the missing data with the mean or median of the entire column. Older versions of scikit-learn provided an Imputer class in sklearn.preprocessing for this task; in current versions it has been replaced by SimpleImputer in sklearn.impute.
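A minimal sketch of mean imputation using the current SimpleImputer API, on a made-up array with one missing salary:

```python
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the older Imputer class

X = np.array([[44.0, 72000.0],
              [27.0, np.nan],   # missing salary
              [30.0, 54000.0]])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # → 63000.0, the mean of 72000 and 54000
```

Passing strategy="median" instead replaces missing values with the column median.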

Step 4: Encoding Categorical Data

Categorical data are variables that contain label values rather than numeric values. Since most models require numeric input, these labels must be encoded as numbers; for this we import the LabelEncoder class from the sklearn.preprocessing library.
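A small sketch of label encoding on a made-up country column:

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "France"]

# Map each distinct label to an integer (classes are sorted alphabetically).
encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)
print(list(encoded))           # → [0, 2, 1, 0]
print(list(encoder.classes_))  # → ['France', 'Germany', 'Spain']
```

Note that integer codes impose an artificial ordering, so for nominal features with no natural order, one-hot encoding (e.g. sklearn's OneHotEncoder) is often preferred.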

Step 5: Splitting the dataset into training and testing

We make two partitions of the dataset: one for training the model, called the training set, and the other for testing the performance of the trained model, called the test set. The split is generally 80/20.
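A sketch of an 80/20 split using scikit-learn's train_test_split, on made-up data of 10 samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # → (8, 2) (2, 2)
```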

Step 6: Feature Scaling

Many Machine Learning algorithms use the Euclidean distance between two data points in their computations, so features with widely varying magnitudes can dominate the result. Feature scaling brings all features onto a comparable scale, typically by feature standardization (Z-score normalization), which subtracts each column's mean and divides by its standard deviation. The StandardScaler class of sklearn.preprocessing is imported for this.
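A minimal sketch of standardization with StandardScaler, on made-up age/salary columns of very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50000.0],
              [30.0, 60000.0],
              [35.0, 70000.0]])

# Rescale each column to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # → approximately [0. 0.]
print(X_scaled.std(axis=0))   # → approximately [1. 1.]
```

As with the imputer and encoder, fit the scaler on the training set only, then apply the same transform to the test set, so no test-set information leaks into training.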

And there you have it.
