Logical Joins in R
Joins are an essential part of analytics. Joins can be achieved in R using many ways. I use dplyr mostly for joining data frames in R. It serves the purpose all the time.
Recently, I came across a SQL question where the solution was obtained using join with an where clause. I tried the same using R by creating two data frame and I was not successful using dpyr.
Data: There is a campaign table where start and end date of each campaign is stored in different columns.
Question: List the dates from 1st Jan, 2020 to 31st May, 2020 when there were no campaigns.
The SQL server code to solve this problem is as under:
--- Declaring start and end date
declare @startdate date = '2020-01-01',
@enddate date = '2020-05-31';--- Recursive CTE to generate a table with dates from start and end --- datewith datetab as
(
select @startdate d
union all
select DATEADD(day, 1, d)
from datetab
where dateadd(Day,1,d) <= @enddate
)--- Query to get the resultselect datetab.d as non_camp, c.camp_namefrom datetableft join Campaign c
--- Logical Joinon datetab.d between c.start_d and c.end_dWHERE camp_name is null;
The result will be like this
non_camp camp_name2020-01-01 NULL2020-01-02 NULL2020-01-03 NULL2020-01-04 NULL2020-01-05 NULL2020-01-06 NULL2020-01-07 NULL2020-01-08 NULL2020-01-09 NULL2020-01-10 NULL2020-01-11 NULL2020-01-12 NULL2020-01-13 NULL2020-01-14 NULL2020-01-26 NULL2020-01-27 NULL2020-01-28 NULL2020-01-29 NULL...
To achieve the same in R, Let’s create the tables first
Campaign table is created using the code below
campaign <- data.frame(camp_name = c("A", "B", "C"),
start_d = as.Date(c("2020/1/15", "2020/2/10", "2020/03/11")),
end_d = as.Date(c("2020/1/25", "2020/2/11", "2020/03/21")))
The dates table is created as under with dates from 1st Jan, 2020 to 31st May, 2020
period <- seq(as.Date("2020/1/1"), as.Date("2020/5/31"), "days")period_df <- data.frame(d = period)
Let’s try dplyr package to do logical joins
date_camp <- period_df %>%
left_join(campaign, by = c("d" >= "start_d", "d" <= "end_d"))
The code above does not work and throws the error below
Error: `by` must be a (named) character vector, list, or NULL, not a logical vector. Run `rlang::last_error()` to see where the error occurred.
Let’s try with data.table package now
date_camp_DT <- setDT(campaign)[period_df, on = .(start_d <= d, end_d >= d)][is.na(camp_name),]
The screenshot of the result is attached.
For normal joins dplyr is good. For logical joins data.table does the work.
If there is a better solution in R to the problem please share it in comments.