Python-Pandas cheat sheet: 30 functions-methods
In this blog, I have mentioned 30 useful python-pandas functions/methods used frequently in data analysis. The following activities can be done using the functions/methods mentioned in the blog:
- Loading and understanding data
- Selecting columns/rows or both
- Filtering data using one or multiple columns
- Sorting data and dropping a column
- Summarizing data
The data used is Airlines.csv from Kaggle.
We will start by loading the pandas package
import pandas as pd
The functions/methods are as follows:
- Loading a csv file
# 1. Loading the data from a csv filedf = pd.read_csv("Airlines.csv")
2. Getting the count of rows and columns in the dataframe
# 2. Shape of a dataframedf.shape
3. Getting first/last few rows of the dataframe
# 3. Head and Tail of the data framedf.head(n=10)df.tail(n=10)
4. Getting data types of the columns to understand if the columns are loaded properly with correct types
# 4. data types of the columnsdf.dtypes
5. Getting the column names as list
# 5. Getting column namesdf.columns.tolist()
6. Getting the summary stats for numeric columns
# 6. Summary statsdf.describe()
7. Getting the count of NA values in the columns
# 7. Checking NA values in columnsdf.isna().sum()
8. Selecting columns with data type as object/int64
# 8. Selecting columns with data type as objectdf.select_dtypes(include = 'object').columns# Selecting columns with data type as int64df.select_dtypes(include = 'int64').columns
9. Getting count of values in a column
# 9. Getting value counts from the columnsdf['Airline'].value_counts(ascending=True)
10. Getting unique values in a column
# 10. Getting unique names of values in a columndf['Airline'].unique()
11. Selecting a few columns
# 11. Select a few columns from dfdf[['id', 'Airline', 'Flight']]
12. Selecting a few rows using .iloc
# 12. Select a few rowsdf.iloc[:10,]
13. Selecting a few rows and columns together using .loc
# 13. Select a few rows and columnsdf.loc[:5, ['id', 'Airline', 'Flight']]
14. Filtering the data using a column
# 14. Filter the data using a columndf[df['Airline'] == 'US']
15. Filtering the data using multiple columns
# 15. Filter the data using multiple columnsdf[(df['Airline'] == 'US') & (df['AirportFrom'] == 'PHX') & (df['DayOfWeek'] == 1)]
16. Filtering the data using OR conditions
# 16. Filter data using OR conditionsdf[(df['Airline'] == 'US') | (df['AirportFrom'] == 'PHX')]
17. Filtering the data using a list
# 17. Filter data using a listairline_list = ['DL','US']df[df['Airline'].isin(airline_list)]
18. Filtering for the data not in the list
# 18. Filter data not in listairline_list = ['DL','US']df[~df['Airline'].isin(airline_list)]
19. Sorting the data using a column
# 19. Sort the datadf.sort_values(by='Airline',ascending=False)
20. Rename columns
# 20. Rename a columndf.rename(columns={"Airline": "Airline_Code", "AirportFrom":"Airport_From"})
21. Summarise data using groupby
# 21. Summarise using groupbydf.groupby(['Airline','AirportFrom','AirportTo'], as_index=False)['id'].agg('count')
22. Summarise and sort
# 22. Summarise and sortdf_summ = df.groupby(['Airline','AirportFrom','AirportTo'], as_index=False)['id'].agg('count')df_summ.sort_values(by='id', ascending = False)
23. Summarise for multiple values
# 23. Summarise for multiple valuesdf.groupby(['Airline','AirportFrom','AirportTo'])['Time'].agg(['sum','count']).reset_index()
24. Summarise for multiple columns and values
# 24. Summarise for multiple columns and valuesdf.groupby(['Airline','AirportFrom','AirportTo']).aggregate({'id':'count','Time':'sum'}).reset_index()
25. Adding a new column
# 25. Adding a new columndf['Country'] = 'USA'
26. Adding a column using existing columns
# 26. Adding a column using existing columnsdf['CO_SFO'] = (df['Airline'] =='CO') & (df['AirportFrom'] == 'SFO')
27. Dropping a column
# 27. Dropping a Columndf.drop(['CO_SFO'], axis = 1)
28. Summarise using pivot_table with aggregation
# 28. Summarise Using pivot_tabledf.pivot_table(index = ['Airline','AirportFrom','AirportTo'],values = ['Time'], aggfunc=['sum','count']).reset_index(col_level=1)
29. Pivot data using pivot_table where unique rows are not available for index columns
# 29. Pivot data using pivot_table when unique rows are not available for index columnsdf.pivot_table(index = 'Airline', columns='DayOfWeek', values='id',aggfunc='count', fill_value = 0).reset_index()
30. Summarise using groupby and pivot
# 30. Summarise using groupby and pivotdf1 = df.groupby(['Airline','DayOfWeek'], as_index=False)['id'].agg('count').reset_index()df1.pivot(index = ['Airline'], columns = 'DayOfWeek', values = 'id').reset_index()
The corresponding notebook is published at JetBrains Datalore: A powerful environment for Jupyter notebooks.
Please share the blog if you like it.