R for Querying Database —Tricks — SQL not required — Part II

Jyoti Kumar
3 min readFeb 15, 2020

In the previous post R for Querying Database — SQL not required — Part I, I have discussed how R can be used to query a MySQL database. You can use the skills acquired in querying data frame in R to query a database.

In this blog, I will discuss some tips that will help in writing queries. I will cover the following sub topics in the blog.

  1. Selecting columns
  2. Using sub query
  3. Using mutate to create new columns

Database: The database is financial and is available for download using MySQL Workbench

Requirements — Familiarity with dplyr

Trick 1: Selecting columns
While querying a database it may happen that some columns of a table are not used frequently. In such cases select can be used just after reference to the table so that when queries are executed only the selected columns will appear in the result and there is no need to select columns again.

Let us assume that for the query only trans and account tables are required. From the trans tables only trans_id to amount columns are required. In such case reference to the tables can be modified as below:

# Step 1: Reference to the table
trans <- tbl(mydb, “trans”)
# Step 2: Query to refer to the selected columns only
trans <- trans %>% select(trans_id, account_id, date, type, operation, amount)
# Step 3: Query to check column names
trans %>% colnames()

Other ways to achieve same result for Step 2 are as follows:

# Selecting columns in sequence by mentioning first and last 
trans <- trans %>% select(trans_id:amount)
# Removing columns in sequence by using "-" sign with first and last
trans <- trans %>% select(-balance:-account)

So, when the query below is used only the selected columns are returned

account <- tbl(mydb, "account")trans <- tbl(mydb, “trans”)
trans <- trans %>% select(trans_id:amount)
account %>%
# Joining trans table
left_join(trans, by = “account_id”) %>%
# getting the column names
colnames()
# Result
[1] “account_id” “district_id” “frequency” “date.x” “trans_id” “date.y” “type” “operation”
[9] “amount”

Trick 2: Using sub query
Let us assume district_ids are required for the district in which maximum amount for the customers are above 50000.
The result can be obtained in different ways. The code below uses a sub query to get the result.
Account table is selected with two columns and is right joined to the sub query:

# Sub Query
trans %>%
group_by(account_id) %>%
# summarizing to get max_amount for each account_id
summarise(max_amount = max(amount)) %>%
ungroup() %>%
# filtering rows
filter(max_amount >= 50000)

The sub query above summarizes the trans table for each account_id to get max(amount) and then filters to get account_ids that have max(amount) ≥ 50000

The R code using sub query to get distinct district_ids is as under:

d1 <- account %>% 
# Selecting a few columns
select(account_id, district_id) %>%
# Right joining summarised trans table to get results for account_ids satisfying the filter
right_join(# Sub Query
trans %>%
group_by(account_id) %>%
# summarising to get max_amount for each account_id
summarise(max_amount = max(amount)) %>%
ungroup() %>%
# filtering rows
filter(max_amount >= 50000), by = "account_id") %>%
distinct(district_id) %>%
collect()

Trick 3: Using mutate to create new columns
Let us assume district_ids are required for the districts in which count of accounts with type PRIJEM and amount above 10000 is greater than 100.

The code below uses mutate to create a column pr_acc in trans table while querying and summarizing the table for each account_id to get a summarized table that is joined with account table to get distinct district_ids.

d2 <- trans %>% 
# Creating column for account type PRIJEM with amount >= 10000
mutate(pr_acc = ifelse(type == “PRIJEM” & amount >= 10000, 1, 0)) %>%
group_by(account_id) %>%
# summarizing to get count of pr_account using sum
summarise(pr_acc_count = sum(pr_acc)) %>%
ungroup() %>%
# Filtering rows
filter(pr_acc_count > 100) %>%
# Left joining account table
left_join(account, by = "account_id") %>%
# getting distinct district_ids
distinct(district_id) %>%
collect()

There are a few more tricks that I have learnt while using R for querying database which I will discuss in the next part of the blog.

Also, if you are familiar with SQL, please write the SQL versions of the query as comment to the post below.

--

--

Jyoti Kumar

I have experience in Predictive Modelling and Dashboards. I have rich working experience on various tools and software like Python, R, Tableau, Power BI and SQL