Repeats: A Business Case Study using Python

Jyoti Kumar
5 min readJul 22, 2023

For any business, there are mainly two types of customers — new and repeats. We can get new customers through organic search, websites and marketing channels. Getting new customers through marketing channels are costly, so any business would like to have their customers making more repeat purchases. So, it is important to understand the repeat rates for overall business and by different categories. Some of the products/ geographies may be having higher repeat rates and it becomes important for marketing team to target more such customers.

In this blog, we would explore a hypothetical data with a few columns:

  • Customer
  • Product
  • Purchase Date

The objective is to look at overall repeat rate for the business and compare the base repeat rate with repeat rates for each products.

# Reading the libraries
import pandas as pd
import numpy as np

import seaborn as sns
# Reading the data
purchase_data = pd.read_csv("/data/workspace_files/Purchase_Data.csv")
Raw Data
# Changing date to the standard format
purchase_data['Purchase_Date'] = pd.to_datetime(purchase_data['Purchase_Date'], format = '%m/%d/%Y').dt.date

To calculate repeat rate, we will need purchase no for all purchases made and maximum purchases made by each customer. To achieve this we will follow the following steps:

  • Sort the data frame by Customer and Purchase Date in increasing order so that the first purchase of a customer is marked as 1 and so on.
  • Calculate purchase_no by grouping the data frame by Customer and counting
  • Calculate max_purchase by getting max of purchase_no for each customer
### Order the data frame and add purchase number for a customer

purchase_data = purchase_data.sort_values(by=['Customer', 'Purchase_Date'], ascending=[True, True])

purchase_data['purchase_no'] = purchase_data.groupby(['Customer'])['Purchase_Date'].cumcount()+1

purchase_data['max_purchase'] = purchase_data.groupby(['Customer'])['purchase_no'].transform('max')

The output after these steps is like this:

Data with purchase_no and max_purchase

From this data we can easily get count of customers at each purchase no and hence we can calculate the repeat rates.

# Count of customers at each Purchase no
repeat_count = purchase_data['purchase_no'].value_counts().reset_index()

repeat_count.columns = ['Purchase_no', 'Count']

repeat_count
Count by Purchase no

The repeat rate from 1st Purchase to 2nd Purchase is 290/433 = 67%

Similarly, the repeat rate from 2nd Purchase to 3rd Purchase is 163/290 = 56%

We can observe here that 67% customers made purchase for the second time but repeat rate is decreasing with number of purchases.

To dig deep and understand the repeat rates for each product, we will have to do a few more data transformations so that the (n)th and (n+1)th purchases of the customer are in the adjacent columns.

The steps are as follows:

  • Add a column next_purchase_no by shifting purchase_no one cell up (lead)
  • Replace the nulls by 0 in the column next_purchase_no and convert it to int
# Add a new colummn - next purchase no
purchase_data['next_purchase_no'] = purchase_data.groupby(['Customer'])['purchase_no'].shift(-1)

purchase_data['next_purchase_no'] = purchase_data['next_purchase_no'].apply(lambda x: 0 if pd.isnull(x) else x)

purchase_data['next_purchase_no'] = purchase_data['next_purchase_no'].astype('int')
Data after adding next purchase no

We can observe that a few customers at the top have not made any second purchase and hence their next_purchase_no is 0.

We need to self-join this table with itself to bring data to the desired format. The steps are as follows:

  • Adding keys for self join — combination of customer & purchase_no and customer & next_purchase_no
  • Self-join the data frame using these keys
# Adding keys for self-join
purchase_data['key_n'] = purchase_data['Customer'].astype('str')+'-'+purchase_data['purchase_no'].astype('str')
purchase_data['key_n1'] = purchase_data['Customer'].astype('str')+'-'+purchase_data['next_purchase_no'].astype('str')
# Self-join
coln = ['Customer', 'Product', 'Purchase_Date', 'purchase_no', 'max_purchase',
'next_purchase_no', 'key_n1']

coln1 = ['Product', 'Purchase_Date', 'key_n']

purchase_data_join = purchase_data[coln].merge(purchase_data[coln1], left_on = ['key_n1'],
right_on = ['key_n'], how = 'left',
suffixes=('_n', '_n1'))

purchase_data_join

The output with next purchase columns as adjacent columns is as under:

Data with next purchase columns

Let’s explore the repeat rate from 1st Purchase to 2nd Purchase for each product.

# Filter for purchase_no == 1 and group by Product_n and next_purchase_no
product_repeat12 = purchase_data_join[purchase_data_join['purchase_no'].isin([1])].groupby(['Product_n', 'next_purchase_no'])['Customer'].agg('count').unstack()

product_repeat12['total_count'] = product_repeat12[0]+product_repeat12[2]
product_repeat12['repeat_rate'] = product_repeat12[2]/(product_repeat12['total_count'])*100
product_repeat12['repeat_rate'] = product_repeat12['repeat_rate'].round(2)
product_repeat12
Data with Repeat Rates for Products

We can observe that the Repeat Rate from 1st purchase to 2nd purchase is highest for Product ‘H’ followed by ‘F’ and ‘G’.

It would be interesting to observe the 2nd purchase made by customers who bought ‘H’, ‘F’ and ‘G’ in their first purchase.

# Matrix of customers movement from Products in 1st Purchase to Products in 2nd Purchase
purchase_product12 = purchase_data_join[purchase_data_join['purchase_no'].isin([1])].groupby(['Product_n', 'Product_n1'])['Customer'].agg('count').unstack()

purchase_product12
Product choices in 1st and 2nd Purchases

We can observe that out of repeaters from 1st to 2nd purchase, customers who bought ‘H’ in their 1st Purchase have repeated mostly with ‘H’ or ‘A’ followed by ‘B’, ‘C’ and ‘G’.

Similarly, customers who bought ‘F’ and ‘G’ in their 1st Purchase have preferred Product ‘E’ in their 2nd Purchase.

We can observe this in a better way in a row-percentage table.

# Heatmap of customers movement from Products in 1st Purchase to Products in 2nd Purchase 
purchase_row_perc = purchase_product.div(purchase_product.sum(axis=1), axis=0)*100

purchase_row_perc = purchase_row_perc.round(2)

sns.heatmap(purchase_row_perc, annot=True, cmap="crest")
Heat Map of Product Choices in 1st and 2nd Purchase

From the heatmap above, we can observe that the highest repeat rate (31%) is from Product ‘G’ in 1st Purchase to Product ‘E’ in their 2nd Purchase.

We can explore further and look at the histogram of days between 1st and 2nd Purchase for overall and for different products as well.

If there are more variables, we can create such matrices easily to understand the movement of customers from 1st Purchase to 2nd Purchase and for other Purchases as well.

I hope this case study is useful in some way in your projects. The hypothetical data set used in this analysis can be downloaded from here.

Please share the blog if you like it.

--

--

Jyoti Kumar

I have experience in Predictive Modelling and Dashboards. I have rich working experience on various tools and software like Python, R, Tableau, Power BI and SQL