jDataLab Jager

3 minute read

A data scientist has to work with many different types of data storage and there are chances when you need to pull data from enterprise data warehouse into your analysis environment.

We can write SQL queries to retrieve data from database tables and write the data to a local CSV file. Read the local CSV file into R or Python in the analysis environment.

Alternatively, if you work with the R environment, RMarkdown supports database connection. With the connection, we extract data directly from a remote database sever. RMarkdown is a documentation tool which enables reproducibility of data analysis, and help data scientists turn their analysis and findings into high quality documents.

R Packages

To reproduce the operations in this post, you need to install the following R libraries:

  • tidyverse
  • DBI
  • RMySQL
  • kableExtra

Connect to MySQL

Reading data from MySQL into R workspace, it requires two R libraries, RMySQL and DBI.

The connection data should not be embedded in analysis code. Separate the connection code in another script. The script should set up the connection and save it into the workspace.

The saved connection is accessible by its name in the analysis code.

In the dbConnect function, you need to replace dbname, username, pwd, dbserver and port with the actual values of your remote database.

library(RMySQL)
library(DBI)
jsea <- dbConnect(RMySQL::MySQL(),dbname='jsea',username='username',password='pwd', host='dbsever',port=3306)
knitr::opts_chunk$set(connection = "jsea")

Read Data From Relational Table

After a connection has been established, dbReadTable can read a table into a data frame.

library(tidyverse)
vehicles <- as_tibble(DBI::dbReadTable(jsea,"vehicles"))
vehicles %>% slice(sample(1:10000,5)) 
## # A tibble: 5 x 12
##      id make    model   year class   trans  drive    cyl displ fuel    hwy   cty
##   <dbl> <chr>   <chr>  <dbl> <chr>   <chr>  <chr>  <dbl> <dbl> <chr> <dbl> <dbl>
## 1 31594 Cadill~ SRX 2~  2012 Sport ~ Autom~ Front~     6   3.6 Gaso~    24    17
## 2 27587 Americ~ Eagle~  1984 Specia~ Manua~ 4-Whe~     4   2.5 Regu~    23    19
## 3 28583 Chevro~ Cobal~  2010 Compac~ Manua~ Front~     4   2.2 Regu~    35    25
## 4 12975 Chevro~ Picku~  1996 Standa~ Manua~ Rear-~     8   5   Regu~    18    14
## 5  9238 Chevro~ C1500~  1992 Standa~ Manua~ Rear-~     8   6.2 Dies~    23    18

Apply dplyr and ggplot to Relational data

Once you have the table in a tibble, you can easily tidy the data and explore it by using the fantastic tidyverse package.

frequency <- vehicles %>%
  count(make) %>%
  arrange(desc(n))

frequency$make <- 
  factor(frequency$make,level=frequency$make[order(frequency$n)])

frequency %>% 
  slice(1:30) %>%
  ggplot(aes(x=make,y=n))+
  geom_bar(stat="identity", fill="#87cefa") + 
  coord_flip() + 
  xlab("") + 
  ylab("") + 
  theme_light()

bar_chart_frequency

Write New Data to Database Table

DBI::dbWriteTable can write a data frame or tibble into a database table.

DBI::dbWriteTable(conn=jsea, name="iris", value=iris, row.names=FALSE, overwrite=TRUE, append=FALSE)

Run SQL in RMarkdown

We also can run SQL in a code chunk of type sql. By setting the connection in the code chuck and adding the option output.var, the resulting table from the SQL is written into a variable in R.

```{sql connection=jsea, output.var="report_model_by_make"}
SELECT make, count(*) as num_of_models
FROM vehicles 
GROUP BY make
ORDER BY num_of_models DESC
```

Now we can access the report with the name report_model_by_make.

report_model_by_make %>% slice(1:10)
##             make num_of_models
## 1      Chevrolet          3461
## 2           Ford          2807
## 3          Dodge          2350
## 4            GMC          2265
## 5         Toyota          1727
## 6            BMW          1395
## 7         Nissan          1181
## 8  Mercedes-Benz          1087
## 9     Volkswagen           958
## 10    Mitsubishi           956

Disconnect from Database

After finishing the work with the database, close the connection.

DBI::dbDisconnect(jsea)
comments powered by Disqus