jDataLab Jager

3 minute read

A data scientist has to work with many different types of data storage and there are chances when you need to pull data from enterprise data warehouse into your analysis environment.

We can write SQL queries to retrieve data from database tables and write the data to a local CSV file. Read the local CSV file into R or Python in the analysis environment.

Alternatively, if you work with the R environment, RMarkdown supports database connection. With the connection, we extract data directly from a remote database sever. RMarkdown is a documentation tool which enables reproducibility of data analysis, and help data scientists turn their analysis and findings into high quality documents.

R Packages

To reproduce the operations in this post, you need to install the following R libraries:

  • tidyverse
  • DBI
  • RMySQL
  • kableExtra

Connect to MySQL

Reading data from MySQL into R workspace, it requires two R libraries, RMySQL and DBI.

The connection data should not be embedded in analysis code. Separate the connection code in another script. The script should set up the connection and save it into the workspace.

The saved connection is accessible by its name in the analysis code.

In the dbConnect function, you need to replace dbname, username, pwd, dbserver and port with the actual values of your remote database.

jsea <- dbConnect(RMySQL::MySQL(),dbname='jsea',username='username',password='pwd', host='dbsever',port=3306)
knitr::opts_chunk$set(connection = "jsea")

Read Data From Relational Table

After a connection has been established, dbReadTable can read a table into a data frame.

vehicles <- as_tibble(DBI::dbReadTable(jsea,"vehicles"))
vehicles %>% slice(sample(1:10000,5)) 
## # A tibble: 5 x 12
##      id make    model       year class            trans      drive        cyl displ fuel    hwy   cty
##   <dbl> <chr>   <chr>      <dbl> <chr>            <chr>      <chr>      <dbl> <dbl> <chr> <dbl> <dbl>
## 1 31815 Dodge   Caliber     2012 Small Station W~ Manual 5-~ Front-Whe~     4   2   Regu~    32    24
## 2 21620 Chevro~ Corvette    2006 Two Seaters      Automatic~ Rear-Whee~     8   6   Prem~    25    15
## 3 28701 Audi    A6 Avant ~  2010 Midsize Station~ Automatic~ All-Wheel~     6   3   Prem~    26    18
## 4  9242 Chevro~ C2500 Pic~  1992 Standard Pickup~ Manual 5-~ Rear-Whee~     6   4.3 Regu~    20    15
## 5   635 Buick   Century W~  1985 Midsize-Large S~ Automatic~ Front-Whe~     6   3   Regu~    21    15

Apply dplyr and ggplot to Relational data

Once you have the table in a tibble, you can easily tidy the data and explore it by using the fantastic tidyverse package.

frequency <- vehicles %>%
  count(make) %>%

frequency$make <- 

frequency %>% 
  slice(1:30) %>%
  geom_bar(stat="identity", fill="#87cefa") + 
  coord_flip() + 
  xlab("") + 
  ylab("") + 

Write New Data to Database Table

DBI::dbWriteTable can write a data frame or tibble into a database table.

DBI::dbWriteTable(conn=jsea, name="iris", value=iris, row.names=FALSE, overwrite=TRUE, append=FALSE)

Run SQL in RMarkdown

We also can run SQL in a code chunk of type sql. By setting the connection in the code chuck and adding the option output.var, the resulting table from the SQL is written into a variable in R.

```{sql connection=jsea, output.var="report_model_by_make"}
SELECT make, count(*) as num_of_models
FROM vehicles 
ORDER BY num_of_models DESC

Now we can access the report with the name report_model_by_make.

report_model_by_make %>% slice(1:10)
##             make num_of_models
## 1      Chevrolet          3461
## 2           Ford          2807
## 3          Dodge          2350
## 4            GMC          2265
## 5         Toyota          1727
## 6            BMW          1395
## 7         Nissan          1181
## 8  Mercedes-Benz          1087
## 9     Volkswagen           958
## 10    Mitsubishi           956

Disconnect from Database

After finishing the work with the database, close the connection.

comments powered by Disqus