What function can the analyst use to eliminate the NULL values?


Null values are a common problem in machine learning and deep learning. If you use sklearn, TensorFlow, or another machine learning or deep learning package, you usually have to clean up null values before you pass your data to the framework. Otherwise, you will often get a long and ugly error message.

In this article, we will look at how to handle null values. We will start with some very simple methods and gradually move toward more complex and more effective ones.

To demonstrate the handling of null values, we will use the famous Titanic dataset.


The preview is already showing some null values. Let’s check how many null values there are in each column:
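A minimal sketch of that check, assuming pandas. The small DataFrame is a made-up stand-in for the Titanic data so that the snippet runs on its own; the column names follow the article, and loading the real data through seaborn is an assumption about how the dataset was obtained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
# Assumption: the real dataset could be loaded with seaborn, e.g.
#   import seaborn as sns; titanic = sns.load_dataset("titanic")
titanic = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton", "Queenstown"],
    "deck": [None, "C", None, "C", None],
})

# isnull() marks every missing cell as True; sum() counts them per column.
null_counts = titanic.isnull().sum()
print(null_counts)
```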

Output:

The age column has 177 null values and the embark_town column has 2. The deck column has the most: 688 out of 891 rows. For machine learning or any other kind of data analysis, I would drop this column completely.

We will focus on the age and embark_town columns and deal with the null values of those columns.

Let’s start!

I will start with the simplest strategy and gradually move toward the more involved tricks.

1. Simply Drop

The simplest strategy is to drop the rows that contain null values, which in pandas is typically done with dropna(). This works if you have enough data to spare.

But the Titanic dataset is not very big. In the real world, too, we often do not have the luxury of deleting data, because too little would be left after dropping every row with a null value.
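Dropping rows might look like the sketch below. It uses a three-row toy frame with made-up values, and the `subset` variant is an extra illustration rather than something the article relies on.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names follow the article.
titanic = pd.DataFrame({
    "age": [19.0, np.nan, 26.0],
    "embark_town": ["Southampton", "Southampton", None],
})

# Drop every row that has a null in any column.
dropped = titanic.dropna()

# Drop only the rows where 'age' is null and keep nulls in other columns.
dropped_age_only = titanic.dropna(subset=["age"])

print(len(titanic), len(dropped), len(dropped_age_only))
```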

2. Filling with Zeros

This is another very simple approach. You can fill all the null values with zeros to keep the process really simple. We can fill the null values in the age column with zeros like this:
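A small sketch of the zero fill, assuming pandas and a toy frame with made-up ages. The index labels 887 to 889 are only there to echo the rows discussed in the article.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up ages; the index labels mimic the rows the article discusses.
titanic = pd.DataFrame({"age": [19.0, np.nan, 26.0]}, index=[887, 888, 889])

# fillna(0) replaces every null in the column with zero.
age_zero_filled = titanic["age"].fillna(0)
print(age_zero_filled)
```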

Output:

Look at row 888. It was null before and now it is zero. This is also a very naive approach, especially in this case, because age cannot be zero.

3. Forward and Backward Fill

This is another common technique for filling null values. Forward fill replaces a null value with the previous value in the series, and backward fill replaces it with the next value in the series.
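Both fills can be sketched on a three-value toy series, assuming pandas. The values are made up, and the `ffill()` and `bfill()` methods are used here in place of whatever call the original code made.

```python
import numpy as np
import pandas as pd

# Toy series with made-up ages; the index labels mimic the rows the article discusses.
age = pd.Series([19.0, np.nan, 26.0], index=[887, 888, 889], name="age")

# Forward fill: a null takes the previous non-null value.
age_ffill = age.ffill()

# Backward fill: a null takes the next non-null value.
age_bfill = age.bfill()

print(age_ffill.tolist())
print(age_bfill.tolist())
```

Note that a forward fill cannot fill a null in the very first row, and a backward fill cannot fill one in the very last row, because there is no neighbor to copy from.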

Output:

Notice that row 888 is now 19, the same as row 887.

Output:

This time, row 888 has taken the value from row 889.

4. Mean and Median Fill

I prefer this one in most cases: filling the null values with the mean or the median. Here I am using the median:
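A sketch of the median fill, assuming pandas and a toy frame with made-up ages. Swapping `median()` for `mean()` gives the mean fill.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up ages.
titanic = pd.DataFrame({"age": [19.0, np.nan, 26.0, 28.0, 40.0]})

# median() skips nulls by default, so it is computed from the known ages only.
age_median = titanic["age"].median()

# Every null in the column gets that single value.
age_median_filled = titanic["age"].fillna(age_median)
print(age_median, age_median_filled.tolist())
```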

Output:

Now the null value in row 888 is 28, which is the median of the age column.

5. Mean and Median Fill with Groupby

Filling with the median or mean of the whole column is the simple approach, but I like something a bit more specific. Instead of filling every null value with the median of the whole age column, it should be more accurate to fill each one with the mean age of its ‘pclass’ and ‘alive’ group.

Let’s check whether the mean age varies across the ‘pclass’ and ‘alive’ groups:
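The group means can be inspected with a groupby, sketched here on a toy frame. The ages are made up; only the column names come from the article.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names follow the article.
titanic = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "alive": ["yes", "yes", "no", "no", "no", "yes"],
    "age": [30.0, 40.0, 50.0, 20.0, np.nan, 24.0],
})

# Mean age for every (pclass, alive) combination; nulls are ignored.
group_means = titanic.groupby(["pclass", "alive"])["age"].mean()
print(group_means)
```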

Output:

Yes, they do vary a lot.

It will be more accurate if we fill up the null values of each of these groups with the corresponding means.
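One way to do the group-wise fill is `groupby(...).transform("mean")`, which returns the group mean aligned to every row. This is a sketch on made-up data and not necessarily the exact code the article used.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names follow the article.
titanic = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "alive": ["yes", "yes", "no", "no", "no", "yes"],
    "age": [30.0, 40.0, 50.0, 20.0, np.nan, 24.0],
})

# transform("mean") gives each row the mean age of its own (pclass, alive) group.
group_mean_per_row = titanic.groupby(["pclass", "alive"])["age"].transform("mean")

# Only the null ages are replaced; known ages stay as they are.
age_group_filled = titanic["age"].fillna(group_mean_per_row)
print(age_group_filled.tolist())
```

A group whose ages are all null has no mean, so its rows would stay null and need a fallback such as the overall median.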

Output:

Now row 888 is 21, compared with 28 from the whole-column median.

Categorical Null Values Imputation

The embark_town column is categorical, but we can still apply the same process to it.

Before that, though, the values of the ‘embark_town’ column need to be expressed as numeric values:
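A minimal sketch of the encoding, assuming pandas. The particular town-to-number mapping is hypothetical; any consistent mapping works, and the null stays null so it can be imputed afterwards.

```python
import pandas as pd

# Toy frame; the town-to-code mapping below is a hypothetical choice.
titanic = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", None, "Queenstown"],
})

town_codes = {"Southampton": 0, "Cherbourg": 1, "Queenstown": 2}

# map() replaces each town with its code; missing towns remain NaN.
embark_encoded = titanic["embark_town"].map(town_codes)
print(embark_encoded)
```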

Output:

I will group the data by ‘pclass’ and ‘alive’ and fill the null values with the group median.
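The group-median fill for the encoded column could look like this sketch. The data is made up and the column name `embark_code` is a hypothetical name for the numeric version of ‘embark_town’.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 'embark_code' is a hypothetical name for
# the numerically encoded 'embark_town' column.
titanic = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "alive": ["yes", "yes", "yes", "no", "no", "no"],
    "embark_code": [0.0, 0.0, np.nan, 2.0, 2.0, 0.0],
})

# Median code of each (pclass, alive) group, aligned to every row.
group_median = titanic.groupby(["pclass", "alive"])["embark_code"].transform("median")

# Fill only the null codes with the median of their group.
embark_filled = titanic["embark_code"].fillna(group_median)
print(embark_filled.tolist())
```

With an even number of known values in a group, the median can fall between two codes (for example 0.5), which does not correspond to any town. In that case rounding or using the most frequent code is one way to handle it.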

6. Iterative Imputation With a Machine Learning Model

This is a very good and effective way of imputing null values. In this process, the null values in every column get filled. One by one, each column with null values is treated as the label for a specified machine learning model, with the other columns as features.

The rows where that column is not null are used to train the model, and the trained model then predicts the values for the rows where it is null.

For this demonstration, I am using only a few columns. Remember that at the beginning we checked which columns have null values and saw that the age, embark_town, and deck columns do.

The deck column had too many null values and we wanted to avoid it. But for demonstration purposes, I am also adding the deck column to the titanic1 dataset.

I am using RandomForestRegressor here. You can use any other regression model.

Let’s do the imputation now. I will save the imputed dataset as titanic2.
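A sketch of how this could be done with scikit-learn's IterativeImputer and a RandomForestRegressor. The toy values, the choice of columns (including `fare`), and the settings `n_estimators=20` and `max_iter=5` are assumptions for illustration. Categorical columns are assumed to be numerically encoded already.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for titanic1 with made-up values; embark_town is already encoded.
titanic1 = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "fare": [71.3, 53.1, 13.0, 26.0, 7.9, 8.1, 7.3, 51.9],
    "age": [38.0, 35.0, np.nan, 27.0, 22.0, np.nan, 19.0, 54.0],
    "embark_town": [1.0, 0.0, 0.0, np.nan, 0.0, 2.0, 0.0, 0.0],
})

# Each column with nulls is predicted in turn from the other columns.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
titanic2 = pd.DataFrame(
    imputer.fit_transform(titanic1),
    columns=titanic1.columns,
    index=titanic1.index,
)
print(titanic2.isnull().sum())
```

Because a regressor is used, imputed codes for an encoded categorical column may not be whole numbers and may need rounding before they are mapped back to categories.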


The age in row 888 is 32 now!

We should check whether any null values remain in the titanic2 dataset:

Output:

There are no null values in any of the columns anymore.

These are all the tips I wanted to share about null values today.

Conclusion

If you want, you can also choose different techniques for different columns. Please feel free to share any other techniques you find more effective for dealing with null values.
