9 tricks to master Pandas drop() and speed up your data analysis
Data manipulation refers to the process of adjusting data to make it organised and easier to read. Frequently, there is data that is unusable and can interfere with what matters. Unnecessary or inaccurate data should be cleaned and deleted.
Source from solvexia.com [1]
Deleting one or more rows or columns from a Pandas DataFrame can be achieved in multiple ways. Among them, the most common is the drop() method. The method seems fairly straightforward to use, but there are still some tricks you should know to speed up your data analysis.
In this article, you’ll learn Pandas drop() tricks to deal with the following use cases:
- Delete a single row
- Delete multiple rows
- Delete rows based on row position and custom range
- Delete a single column
- Delete multiple columns
- Delete columns based on column position and custom range
- Working with MultiIndex DataFrame
- Perform the operation in place with inplace=True
- Suppress errors with errors='ignore'
Please check out the Notebook for source code. More tutorials are available from Github Repo.
1. Delete a single row
By default, Pandas drop() removes rows based on their index values. Most often, the index is a 0-based integer per row. Specifying an index value deletes the matching row. For example, to delete the row with the index value 1:
df.drop(1)
# It's equivalent to
df.drop(labels=1)
Note that the argument axis must be set to 0 for deleting rows (in Pandas drop(), axis defaults to 0, so it can be omitted). If axis=1 is specified, it will delete columns instead.
Alternatively, a more intuitive way to delete a row from a DataFrame is to use the index argument.
# A more intuitive way
df.drop(index=1)
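To make the equivalence concrete, here is a minimal sketch. The DataFrame below is hypothetical sample data (the names and scores are assumptions, not taken from the Notebook); only the 'math' and 'physics' column names follow the examples in this article.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Tom", "Nick", "Julia", "Anna"],
    "math": [90, 85, 70, 60],
    "physics": [80, 75, 65, 95],
})

# Labels passed positionally; axis defaults to 0 (rows)
by_labels = df.drop(1)

# Same operation, spelled out with the index argument
by_index = df.drop(index=1)

print(by_index)
```

Both calls return a new DataFrame without the row labelled 1, and df itself still has all four rows.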
2. Delete multiple rows
Pandas drop() can take a list to delete multiple rows:
df.drop(labels=[1,2])
Similarly, a more intuitive way to delete multiple rows is to pass a list to the index argument:
# A more intuitive way
df.drop(index=[1,2])
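As a quick sketch of dropping several rows at once, assuming the same kind of hypothetical sample data with a default 0-based integer index:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Tom", "Nick", "Julia", "Anna"],
    "math": [90, 85, 70, 60],
})

# Drop the rows labelled 1 and 2 in a single call
result = df.drop(index=[1, 2])

print(result)
```

Only the rows labelled 0 and 3 remain in result.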
3. Delete rows based on row position and custom range
The DataFrame index values may not be ascending integers. They can be any other values, for example, datetime or string labels. In these cases, we can delete rows based on their position. For instance, to delete the 2nd row, we can look up its label with df.index[1] and pass it to the index argument:
df.drop(index=df.index[1])
To delete the last row, we can use the shortcut -1, which identifies the last index value:
df.drop(index=df.index[-1])
We can also use slicing to select a range of rows, for instance:
- Delete the last 2 rows: df.drop(index=df.index[-2:])
- Delete every other row: df.drop(index=df.index[::2])
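The position-based calls above can be tried on a small DataFrame with string labels. This is a sketch under assumptions: the labels 'a' to 'e' and the 'score' column are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a non-integer index
df = pd.DataFrame(
    {"score": [90, 85, 70, 60, 75]},
    index=["a", "b", "c", "d", "e"],
)

second_removed = df.drop(index=df.index[1])        # drops label 'b'
last_removed = df.drop(index=df.index[-1])         # drops label 'e'
last_two_removed = df.drop(index=df.index[-2:])    # drops labels 'd' and 'e'
every_other_removed = df.drop(index=df.index[::2]) # drops labels 'a', 'c', 'e'

print(every_other_removed)
```

Keep in mind that df.index[...] only looks up labels, so drop() still removes rows by label. If the index contains duplicate labels, every row sharing a selected label is removed.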
If you want to learn more about the slice technique and how to use row index for selecting data, you can check out this article:
4. Delete a single column
Similar to deleting rows, Pandas drop() can be used to delete columns by setting the axis argument to 1:
df.drop('math', axis=1)
# It's equivalent to
df.drop(labels='math', axis=1)
A more intuitive way to delete a column from a DataFrame is to use the columns argument.
# A more intuitive way
df.drop(columns='math')
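Here is a small sketch comparing the two spellings on hypothetical sample data (the values are assumptions; the 'math' column name follows the example above):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Tom", "Nick", "Julia"],
    "math": [90, 85, 70],
    "physics": [80, 75, 65],
})

# Column label plus axis=1
by_axis = df.drop("math", axis=1)

# Same operation with the columns argument
by_columns = df.drop(columns="math")

print(by_columns)
```

Both results contain only the 'name' and 'physics' columns.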
5. Delete multiple columns
Similarly, we can pass a list to delete multiple columns:
df.drop(['math', 'physics'], axis=1)
# It's equivalent to
df.drop(labels=['math', 'physics'], axis=1)
A more intuitive way to delete multiple columns is to pass a list to the columns argument:
# A more intuitive way
df.drop(columns=['math', 'physics'])
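A minimal sketch of dropping two columns in one call, again on hypothetical sample data (the 'chemistry' column is an assumption added for illustration):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Tom", "Nick", "Julia"],
    "math": [90, 85, 70],
    "physics": [80, 75, 65],
    "chemistry": [70, 88, 92],
})

# Drop both columns by passing a list to the columns argument
result = df.drop(columns=["math", "physics"])

print(result)
```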
6. Delete columns based on column position and custom range
We can delete a column based on its position. For instance, to delete the 2nd column, we can look up its label with df.columns[1] and pass it to the columns argument:
df.drop(columns=df.columns[1])
To delete the last column, we can use the shortcut -1, which identifies the last column label:
df.drop(columns=df.columns[-1])
Similarly, we can also use slicing to select a range of columns, for instance:
- Delete the last 2 columns: df.drop(columns=df.columns[-2:])
- Delete every other column: df.drop(columns=df.columns[::2])
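The column variants can be sketched the same way. The four-column DataFrame below is hypothetical sample data chosen so that each slice is easy to follow:

```python
import pandas as pd

# Hypothetical sample data: columns are name, math, physics, chemistry
df = pd.DataFrame({
    "name": ["Tom", "Nick"],
    "math": [90, 85],
    "physics": [80, 75],
    "chemistry": [70, 88],
})

second_removed = df.drop(columns=df.columns[1])        # drops 'math'
last_removed = df.drop(columns=df.columns[-1])         # drops 'chemistry'
last_two_removed = df.drop(columns=df.columns[-2:])    # drops 'physics', 'chemistry'
every_other_removed = df.drop(columns=df.columns[::2]) # drops 'name', 'physics'

print(every_other_removed)
```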
7. Working with MultiIndex
A MultiIndex (also known as a hierarchical index) DataFrame allows us to have multiple columns acting as a row identifier and multiple rows acting as a header identifier:
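As an illustration, a small MultiIndex DataFrame could be built as follows. This is a sketch under assumptions: the labels 'Oxford', '2019-07-04', 'Day' and 'Weather' appear in the examples of this section, while 'London', '2019-07-05', 'Night', 'Wind' and all cell values are made up to fill out the structure.

```python
import pandas as pd

# Two row levels: city and date (level names are assumptions)
index = pd.MultiIndex.from_product(
    [["London", "Oxford"], ["2019-07-04", "2019-07-05"]],
    names=["city", "date"],
)

# Two column levels: time of day and measurement
columns = pd.MultiIndex.from_product(
    [["Day", "Night"], ["Weather", "Wind"]],
)

# Made-up values, one row per (city, date) pair
df = pd.DataFrame(
    [
        ["Sunny", "SW 16mph", "Cloudy", "W 10mph"],
        ["Rain", "S 20mph", "Rain", "SW 12mph"],
        ["Cloudy", "W 8mph", "Clear", "NW 5mph"],
        ["Sunny", "SW 10mph", "Cloudy", "W 7mph"],
    ],
    index=index,
    columns=columns,
)

print(df)
```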
When calling Pandas drop() on a MultiIndex DataFrame, a single label is matched against level 0 of the index or columns by default, and every row or column under that label is removed.
# Delete all Oxford rows
df.drop(index='Oxford')
# Delete all Day columns
df.drop(columns='Day')
To specify a level to be removed, we can set the level argument:
# Remove all 2019-07-04 rows at level 1
df.drop(index='2019-07-04', level=1)
# Drop all Weather columns at level 1
df.drop(columns='Weather', level=1)
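The default and level-specific behaviours can be compared side by side. This is a minimal sketch on a hypothetical MultiIndex DataFrame: the 'London', '2019-07-05', 'Night' and 'Wind' labels and the integer cell values are placeholders, not data from the Notebook.

```python
import pandas as pd

# Hypothetical MultiIndex DataFrame with placeholder integer values
index = pd.MultiIndex.from_product(
    [["London", "Oxford"], ["2019-07-04", "2019-07-05"]]
)
columns = pd.MultiIndex.from_product(
    [["Day", "Night"], ["Weather", "Wind"]]
)
values = [[4 * r + c for c in range(4)] for r in range(4)]
df = pd.DataFrame(values, index=index, columns=columns)

# Default: the label is matched against level 0
no_oxford = df.drop(index="Oxford")
no_day = df.drop(columns="Day")

# Explicit level: the label is matched against level 1
no_july_4 = df.drop(index="2019-07-04", level=1)
no_weather = df.drop(columns="Weather", level=1)

print(no_july_4)
```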
In some cases, we would like to delete a specific index or column combination. To do that, we can pass a tuple to the index or columns argument:
# Drop the index combination 'Oxford' and '2019-07-04'
df.drop(index=('Oxford', '2019-07-04'))
# Drop the column combination 'Day' and 'Weather'
df.drop(columns=('Day', 'Weather'))
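To see that a tuple removes exactly one row or column, here is a sketch on a hypothetical MultiIndex DataFrame (labels other than 'Oxford', '2019-07-04', 'Day' and 'Weather', and all cell values, are placeholders):

```python
import pandas as pd

# Hypothetical MultiIndex DataFrame with placeholder integer values
index = pd.MultiIndex.from_product(
    [["London", "Oxford"], ["2019-07-04", "2019-07-05"]]
)
columns = pd.MultiIndex.from_product(
    [["Day", "Night"], ["Weather", "Wind"]]
)
values = [[4 * r + c for c in range(4)] for r in range(4)]
df = pd.DataFrame(values, index=index, columns=columns)

# A tuple is treated as one full label, not as a list of labels
one_row_removed = df.drop(index=("Oxford", "2019-07-04"))
one_column_removed = df.drop(columns=("Day", "Weather"))

print(one_row_removed)
```

The other Oxford row and the other Day column are kept, because only the exact combination is matched.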
If you want to learn more about accessing data in a MultiIndex DataFrame, please check out this article:
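8. Perform the operation in place with inplace=True
By default, Pandas drop() returns a new DataFrame and leaves the original untouched. We can set the argument inplace=True to modify the DataFrame in place, which avoids the reassignment. In that case, drop() returns None. Note that an in-place operation does not necessarily reduce memory usage, since pandas may still create an intermediate copy internally.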
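The difference between the two modes can be sketched as follows, using hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"math": [90, 85, 70], "physics": [80, 75, 65]})

# Default: a new DataFrame is returned and df is left unchanged
new_df = df.drop(index=1)
rows_before = len(df)

# inplace=True: df itself is modified and the call returns None
returned = df.drop(index=1, inplace=True)

print(returned)
print(df)
```

A common mistake is writing df = df.drop(..., inplace=True), which replaces df with None.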
9. Suppress errors with errors='ignore'
You may notice that Pandas drop() raises a KeyError when any of the given rows or columns doesn't exist. We can set the argument errors='ignore' to suppress the error, so that only the labels that exist are dropped.
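A short sketch of both behaviours, using hypothetical sample data and a 'biology' column that deliberately does not exist:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"math": [90, 85], "physics": [80, 75]})

# Default (errors='raise'): a missing label raises KeyError
try:
    df.drop(columns="biology")
    raised = False
except KeyError:
    raised = True

# errors='ignore': the missing label is skipped, existing ones are dropped
result = df.drop(columns=["math", "biology"], errors="ignore")

print(raised)
print(result)
```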
Conclusion
In this article, we have covered 9 use cases for deleting rows and columns using Pandas drop(). The method itself is straightforward to use, and it's one of the most popular methods for manipulating data during data preprocessing.
Thanks for reading. Please check out the Notebook for the source code and stay tuned if you are interested in the practical aspect of machine learning. More tutorials are available from the Github Repo.
References
[1] 5 Tips for Data manipulation