jDataLab Jager

2 minute read

A data scientist has to work with many different types of data storage and there are chances when you need to pull data from enterprise data warehouse into your analysis environment.

We can write SQL queries to retrieve data from database tables and write the data to a local CSV file. Read the local CSV file into R or Python in the analysis environment.

Alternatively, if you work with the R environment, RMarkdown supports database connection. By using RMySQL, with the connection, we extract data directly from a remote database sever. RMarkdown is a documentation tool which enables reproducibility of data analysis, and help data scientists turn their analysis and findings into high quality documents.

Screen recording: Access MySQL in R Markdown

The screen recording shows in R Markdown, how to execute SQL commands, use either R or SQL chunk in R Markdown, write the resulting set to data frame, and perform exploratory data analysis and visualization.

  1. Connect to MySQL

The connection data should not be embedded in analysis code. Separate the connection code in another script. The script should set up the connection and save it into the workspace. The saved connection is accessible by its name in the analysis code.

library(RMySQL)
jsea <- dbConnect(RMySQL::MySQL(),dbname='jsea',username='username',password='pwd', host='dbsever',port=3306)
knitr::opts_chunk$set(connection = "jsea")
  1. Read Data From Relational Table

After a connection has been established, dbReadTable can read a table into a data frame.

library(tidyverse)
vehicles <- as_tibble(DBI::dbReadTable(jsea,"vehicles"))
vehicles %>% slice(sample(1:10000,5)) 
  1. Run SQL in R chunk

  2. Run dplyr and ggplot

Once you have the table in a tibble, you can easily tidy the data and explore it by using the fantastic tidyverse package.

frequency <- vehicles %>% count(make) %>% arrange(desc(n))

frequency$make <- 
  factor(frequency$make,level=frequency$make[order(frequency$n)])

frequency %>% 
  slice(1:30) %>%
  ggplot(aes(x=make,y=n))+
  geom_bar(stat="identity", fill="#87cefa") + 
  coord_flip() + 
  xlab("") + 
  ylab("") + 
  theme_light()
  1. Write New Data to Database Table
DBI::dbWriteTable(conn=jsea, name="iris", value=iris, row.names=FALSE, overwrite=TRUE, append=FALSE)
  1. Run SQL in SQL chunk

We also can run SQL in a code chunk of type sql. By setting the connection in the code chuck and adding the option output.var, the resulting table from the SQL is written into a variable in R. Access the report with the name report_model_by_make.

```{sql connection=jsea, output.var="report_model_by_make"}
SELECT make, count(*) as num_of_models
FROM vehicles 
GROUP BY make
ORDER BY num_of_models DESC
```

In the end, disconnect from Database with DBI::dbDisconnect().