Useful trick using dplyr : Create a Sequence After Group by

Jyoti Kumar
4 min readJun 27, 2020

--

In this blog, I will discuss a trick using dplyr package that makes life easier. I will create a data frame in R and use it for analysis.

dplyr functions group_by and summarise are used to get summarised values for the variable used in group_by. Many summarise functions are available that help us get the desired result.

The problem discussed in the blog uses a combination of dplyr functions to get the answer.

The data used can be created using the code below:

# Creating an array of customercustomer = 1001:1010# Creating an array of order amountorderamount = round(rnorm(30, 500, 2))# Creating an array of order quantityqty = round(rnorm(30, 10, 2))# Creating a data frame with three columns — customer, orderamt and qtydata <- data.frame(customer = rep(customer, 3), orderamt = orderamount, qty = qty)

The first few rows of the data frame data is as under:

The data frame data contains 10 customer from 1001 to 1010.

For every customer there are 3 order amount values and 3 order quantity values.

The following questions are to be answered for the data:

  1. Find the max order amount for each customer
  2. Find the min order amount for each customer
  3. Find the 2nd max order amount for each customer
  4. Find the quantity for each customer when the order amount is max

Let’s try to find the answer to each problem

1. Find the max order amount for each customer

This can be done easily using group by and max function in summarise. The code below can be used to get the answer

# The data is grouped for each customer using group_by and
# max function is used to get max of order amount
data %>%
group_by(customer) %>%
summarise(ordermax = max(orderamt)) %>%
ungroup()
Result 1

2. Find the min order amount for each customer

This can also be done easily using group by and min function in summarise. The code below can be used to get the answer

# The data is grouped for each customer using group_by and
# min function is used to get max of order amount
data %>%
group_by(customer) %>%
summarise(ordermax = min(orderamt)) %>%
ungroup()
Result 2

3. Find the 2nd max order amount for each customer

This part is tricky and will require some effort to get the answer.

The result can be obtained using the following steps:

  1. group the data by customer
  2. arrange the grouped data in decreasing order of order amount
  3. create a variable that adds sequence for each customer in decreasing order of order amount
  4. ungroup the data
  5. filter for n = 2 as data was arranged in decreasing order of order amount and n= 2 will give 2nd max order amount for each customer

The code below will execute the above mentioned steps to get the result

Result 3
data %>% 
# step 1
group_by(customer) %>%
# step 2
arrange(desc(orderamt)) %>%
# step 3
mutate(n=1:n()) %>%
# step 4
ungroup() %>%
# step 5
filter(n==2)

The code above with little modifications can be used to answer all the 4 questions.

4. Find the quantity for each customer when the order amount is max

In the screenshot for the result 3 above, quantity values are available. This provides the hint that the result can be obtained by modifying the code above and replacing n=1 for n=2

data %>% 
group_by(customer) %>%
arrange(desc(orderamt)) %>%
mutate(n=1:n()) %>%
ungroup() %>%
filter(n==1)

Conclusion:

I find the code block below useful in many cases:

data %>% 
# Group by a variable for which sequence is required
group_by(variable1) %>%
# Arrange by the variable using which you want to create a sequence
# Do not use desc if ascending order is required
arrange(desc(variable2)) %>%
# Create the sequence using 1:n() where n() is the number of rows in # the variable with which it is grouped
mutate(n = 1:n())
# Ungroup
ungroup() %>%
# Filter to get n=1,2 or any other value
filter( n=1 )

Some problems that can be solved using a variant of the code above are:

  1. Find the date for each customer when the 2nd order was placed (variable2 is a date field)
  2. Find the items ordered in the 3rd order of a customer (variable2 is a date field and for n = 3 above the required column can be observed for the answer)

I have tried my best to explain the usefulness of the code block above in real life problems. I use it at my work when such questions are to be answered.

Note: The code block can be used while querying a database using R.

Please share the blog if you like it and share your feedback.

--

--

Jyoti Kumar
Jyoti Kumar

Written by Jyoti Kumar

I have experience in Predictive Modelling and Dashboards. YouTube : LearnToPlayWithData (https://www.youtube.com/channel/UChAOvvDYCu7zHcQq4O3Ov3Q)

Responses (1)