R for Querying Database — SQL not required — Part I

Jyoti Kumar
3 min readFeb 4, 2020

--

Querying is the most used skill for any data scientist/analyst job. Is learning SQL mandatory for querying? May be not if you are able to connect your database with R Studio and you are comfortable with dplyr package in R.

Learning SQL is important but what if you have started you data science journey using R with no prior experience of querying a database. You have learned data science by loading csv files in R Studio. You have used dplyr/ data.table/ base r packages for querying the data frame obtained by loading the csv files. The skills acquired in querying a data frame in R will be helpful in querying a single table using SQL query. You will have to use any one or a combination of mostly these six functions select, filter, mutate, group by, summarize, arrange for querying and summarizing the result. You may have to use some joins if more than one data frame are required to get the result.

Requirements — Familiarity with the dplyr

In this blog I will use a MySQL database available free for querying using R Studio. The database is financial and is available for download using MySQL Workbench. I have used R Studio to connect the database. I have referred to this blogpost to connect to the database.

The packages required are as follows:

library(RMySQL)
library(dplyr)

The code to connect with the database in R Studio is as follows:

mydb = dbConnect(MySQL(), 
user=’guest’,
password=’relational’,
dbname=’financial’,
host=’relational.fit.cvut.cz’)

The database has 8 tables which can be listed using the function:

dbListTables(mydb)[1] "account"  "card"     "client"   "disp"     "district" "loan"     "order"    "trans"

The ER diagram for the dataset is is available in the link to the dataset.

Let us assume the results for the following questions are required.
1. Sum of amount for transactions from the trans table for each district (column A3 and column A2)
2. Sum of total amount, average duration and max payments from loan table for each district (column A3) , account_id and status from loan table

To get answers to the questions above, first create reference to the tables in the database using the tbl function.

The references to the tables are created below using the tbl function:

account <- tbl(mydb, “account”)card <- tbl(mydb, “card”)client <- tbl(mydb, “client”)disp <- tbl(mydb, “disp”)district <- tbl(mydb, “district”)loan <- tbl(mydb, “loan”)order <- tbl(mydb, “order”)trans <- tbl(mydb, “trans”)

Answer to the first question will require three tables — account, district and trans.

Joining the three tables above using the required keys, a large table with all the required columns will be created that can be summarized to get the result.

The R code used to get the result is as under:

d1 <- account %>% 
# Joining district table to get the columns A3 and A2 for the result
left_join(district, by = "district_id") %>%
# Joining trans table to get the amount column
left_join(trans, by = "account_id") %>%
# Using group by to get result for each A3 and A2 combination
group_by(A3, A2) %>%
# using summarise to get the sum of amount
summarise(tot_amount = sum(amount)) %>%
ungroup() %>%
# Collect is used to get the query result
collect()

The query using dplyr is easy to follow and understand.
After each line of code the table generated can be checked and further steps can be coded to get the result.
The function head can be used to observe the result after each step.

Similarly, for the second question the R code used to get the result is as under:

d2 <- account %>% 
# Joining district to get column A3
left_join(district, by = “district_id”) %>%
# Joining loan to the get amount, duration and payments column
left_join(loan, by = “account_id”) %>%
# filtering the table to get only rows where amount is not missing
filter(!is.na(amount)) %>%
# Using group by to get result for each A3, account_id and status combination
group_by(A3, account_id, status) %>%
# using summarise to get the required summarised values
summarise(tot_amount = sum(amount, na.rm = TRUE),
av_duration = mean(duration, na.rm = TRUE),
max_payments = max(payments, na.rm = TRUE)) %>%
ungroup() %>%
# Collect is used to get the query result
collect()

In the code above filter function is used to remove rows with amount value as NA.

In the next post I will share some other dplyr tips to query the database.

Also, if you are familiar with SQL please write the SQL versions of the query as comment to the post below.

--

--

Jyoti Kumar
Jyoti Kumar

Written by Jyoti Kumar

I have experience in Predictive Modelling and Dashboards. YouTube : LearnToPlayWithData (https://www.youtube.com/channel/UChAOvvDYCu7zHcQq4O3Ov3Q)

No responses yet