Pre-processing refers to the transformations applied to our data before feeding it to the algorithm. Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.
Need of Data Preprocessing
- For achieving better results from the applied model in Machine Learning projects the format of the data has to be in a proper manner. Some specified Machine Learning model needs information in a specified format, for example, Random Forest algorithm does not support null values, therefore to execute random forest algorithm null values have to be managed from the original raw data set.
- Another aspect is that the data set should be formatted in such a way that more than one Machine Learning and Deep Learning algorithm are executed in one data set, and best out of them is chosen.
This article contains 3 different data preprocessing techniques for machine learning.
The Pima Indian diabetes dataset is used in each technique.
This is a binary classification problem where all of the attributes are numeric and have different scales.
It is a great example of a dataset that can benefit from pre-processing.
You can find this dataset on the UCI Machine Learning Repository webpage.
Note that the program might not run on Geeksforgeeks IDE, but it can run easily on your local python interpreter, provided, you have installed the required libraries.
1. Rescale Data
- When our data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.
- This is useful for optimization algorithms used in the core of machine learning algorithms like gradient descent.
- It is also useful for algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like K-Nearest Neighbors.
- We can rescale your data using scikit-learn using the MinMaxScaler class.
Code: Python code to Rescale data (between 0 and 1)
Python
# importing libraries
import pandas
import scipy
import numpy
[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]0 [[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]1import [[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]3
[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]4
[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]5
[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]6[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]7 [[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]8
[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]9
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]0[[ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 1. 0. 1. 1. 1.] [ 1. 1. 1. 0. 0. 1. 1. 1.] [ 1. 1. 1. 1. 1. 1. 1. 1.] [ 0. 1. 1. 1. 1. 1. 1. 1.]]7 [[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]2[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]3[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]5[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]7[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]9[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4# importing libraries1[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4# importing libraries3[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4# importing libraries5[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4# importing libraries7[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426] [-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191] [ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106] [-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042] [-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]4# importing libraries9import0