# Handling Overplotting in Large Datasets

Scatterplots reveal relationships among the variables in a data set and are a popular way of visualizing data before applying learning algorithms. As more and more data points are plotted, many of them overlap and dark regions appear on the plot; this is referred to as overplotting. Overplotting can obscure clusters and patterns.
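
To make the idea concrete, the degree of overlap can be estimated by counting how many points land in the same small cell of the plotting area. The sketch below uses simulated data rather than the sales data; the log-normal and Poisson parameters and the 200 x 200 grid are illustrative assumptions only.

```r
# Simulated stand-in data (assumption): skewed totals, small integer counts
set.seed(1)
n <- 10000
x <- rlnorm(n, meanlog = 5, sdlog = 1)
y <- 1 + rpois(n, lambda = 1.4)

# Assign every point to a cell of a 200 x 200 grid over the plotting area
cell_id <- paste(cut(x, breaks = 200, labels = FALSE),
                 cut(y, breaks = 200, labels = FALSE))

# Number of points in each occupied cell
per_cell <- table(cell_id)

# Share of points that sit in a cell together with at least one other point
overlap_share <- mean(per_cell[cell_id] > 1)
overlap_share
```

A share close to 1 means that almost every marker would be drawn on top of, or right next to, another marker.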

A dataset of 10,000 rows is used here to demonstrate overplotting. The first 10 rows are listed to show the data schema, which includes three variables: sales_total, num_of_orders and gender.

    sales <- read.csv("data/yearly_sales.csv")
    sales[1:10, 2:4]

    sales_total  num_of_orders  gender
         800.64              3       F
         217.53              3       F
          74.58              2       M
         498.60              3       M
         723.11              4       F
          69.43              2       F
          40.15              2       M
          58.61              2       M
         364.63              2       F
          44.31              2       M

## Descriptive Statistics

First, we explore the sales data by calculating its descriptive statistics. The R function summary calculates basic statistics including the minimum, maximum, mean, median, 1st quartile and 3rd quartile.

Two variables are of interest here: sales_total and num_of_orders.

The output from the summary function indicates that for sales_total, 75% of the data lie within roughly the first 3.5% of the range between the minimum and the maximum; for num_of_orders, 75% of the data are packed into roughly the first 10% of the range. Thus, when these variables are drawn in a scatterplot, overplotting will occur in the regions where the data points are highly concentrated.

    summary(sales[ ,2:4])
      sales_total       num_of_orders    gender
     Min.   :  30.02   Min.   : 1.000   F:5035
     1st Qu.:  80.29   1st Qu.: 2.000   M:4965
     Median : 151.65   Median : 2.000
     Mean   : 249.46   Mean   : 2.428
     3rd Qu.: 295.50   3rd Qu.: 3.000
     Max.   :7606.09   Max.   :22.000
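
The share of each variable's range that lies at or below the third quartile can be recomputed from the summary output as (Q3 - min) / (max - min). The following is a minimal sketch; the helper name range_share is made up for this illustration, and the numbers are copied from the output above.

```r
# Values copied from the summary output above
sales_min  <- 30.02; sales_q3  <- 295.50; sales_max  <- 7606.09
orders_min <- 1;     orders_q3 <- 3;      orders_max <- 22

# Fraction of the range [min, max] that lies at or below the third quartile
# (range_share is a hypothetical helper, not part of any package)
range_share <- function(lo, q3, hi) (q3 - lo) / (hi - lo)

sales_share  <- range_share(sales_min, sales_q3, sales_max)
orders_share <- range_share(orders_min, orders_q3, orders_max)

# Expressed in percent of the range: about 3.5 for sales_total
# and about 9.5 for num_of_orders
round(100 * c(sales_total = sales_share, num_of_orders = orders_share), 1)
```

Three quarters of the points therefore fall into a small corner of the plotting area, which is where the dark regions appear.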


To make a scatterplot of num_of_orders against sales_total, use either the R function plot or ggplot.

## Using plot

The following three scatterplots contain 100 rows, about 200 rows (rows 500 through 700) and 10,000 rows, respectively. A linear regression line is drawn over each scatterplot to show the linear trend in the data. As you can see, the last plot, with 10,000 points, suffers from overplotting.

    # three subsets of increasing size; columns 2:3 are sales_total and num_of_orders
    d1 <- sales[1:100, 2:3]
    d2 <- sales[500:700, 2:3]
    d3 <- sales[ ,2:3]

    # stack the three plots vertically in one figure
    mat <- matrix(c(1,2,3), nrow=3, ncol=1)
    layout(mat)

    plot(d1, main="The first 100 rows")
    abline(lm(d1$num_of_orders ~ d1$sales_total), col="red")

    plot(d2, main="200 rows from 500th through 700th")
    abline(lm(d2$num_of_orders ~ d2$sales_total), col="red")

    plot(d3, main="The entire data set of 10,000 rows")
    abline(lm(d3$num_of_orders ~ d3$sales_total), col="red")

## Using ggplot

ggplot is the main entry point of the R graphics package ggplot2, which typically produces more polished graphics than the basic R plot functions.

ggplot constructs graphics in layers. A call to ggplot first creates a base frame, to which additional layers are added as needed to specify the plot type, the coordinate system, and other aesthetics and geometric shapes. Colors and shapes do more than beautify a plot: they add information that helps in visualizing patterns.
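
The layering can be seen on a small example. This is only a sketch on simulated data; the data frame toy and its distribution parameters are assumptions made for illustration and are not the sales data.

```r
library(ggplot2)

# Simulated stand-in data (assumption), using the same column names as the sales data
set.seed(42)
toy <- data.frame(
  sales_total   = rlnorm(200, meanlog = 5, sdlog = 1),
  num_of_orders = 1 + rpois(200, lambda = 1.4)
)

# Base frame: data and aesthetic mapping only, no geometry yet
base_frame <- ggplot(toy, aes(x = sales_total, y = num_of_orders))

# Each "+" adds one layer on top of the previous ones
with_points <- base_frame + geom_point(size = 2)
with_trend  <- with_points + geom_smooth(method = "lm", formula = y ~ x)

length(base_frame$layers)  # 0: nothing would be drawn inside the panel
length(with_trend$layers)  # 2: points plus the fitted line
```

Because every intermediate object is a complete plot, the same base frame can be reused with different layers.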

The following code snippet makes three scatterplots by using ggplot.

    install.packages(c("gridExtra", "cowplot"))

    library(ggplot2)

    # user-defined function myplot: scatterplot with a linear fit drawn in one colour
    # note: ggplot2 >= 3.4.0 prefers linewidth over size for lines and rectangle borders
    myplot <- function(mydata, mycolor, mytitle) {
        p <- ggplot(data=mydata, aes(x=sales_total, y=num_of_orders))
        p <- p + geom_point(size=2)
        p <- p + geom_smooth(method="lm", fill=mycolor, colour=mycolor)
        p <- p + ggtitle(mytitle)
        p <- p + labs(y="Number of Orders", x="Sales Total")
        p <- p + theme(axis.text = element_text(size=6),
                       axis.title = element_text(size=6, face="bold"),
                       plot.title = element_text(size=6, face="bold"))
        p <- p + theme(panel.background = element_rect(fill="lightblue", colour="lightblue", size=0.5, linetype="solid"),
                       panel.grid.major = element_line(size=0.5, linetype="solid", colour="white"),
                       panel.grid.minor = element_line(size=0.25, linetype="solid", colour="white"))
        return(p)
    }

    d1 <- sales[1:100, 2:3]
    d2 <- sales[500:700, 2:3]
    d3 <- sales[ ,2:3]

    titles <- c("The first 100 rows", "200 rows from 500th through 700th", "The entire data set of 10,000 rows")
    colornames <- c("red", "blue", "red")

    # pass one colour and one title per plot, not the whole vectors
    p1 <- myplot(d1, colornames[1], titles[1])
    p2 <- myplot(d2, colornames[2], titles[2])
    p3 <- myplot(d3, colornames[3], titles[3])

    # plot_grid comes from cowplot
    library(cowplot)
    library(gridExtra)
    plot_grid(p1, p2, p3, labels=c("d1", "d2", "d3"), ncol=1, nrow=3)

## Overplotting

Now it’s time to figure out how to prevent overplotting, or at least how to reveal how dense the overplotted region is. Although adjusting color and transparency can partly address the issue, there are better alternatives.
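
For reference, the transparency approach mentioned above can be sketched in base R as follows. The data are simulated (an assumption, not the sales data), and the alpha value of 0.05 is an arbitrary choice that would need tuning for a real plot.

```r
# Simulated stand-in data (assumption)
set.seed(2)
n <- 10000
x <- rlnorm(n, meanlog = 5, sdlog = 1)
y <- 1 + rpois(n, lambda = 1.4)

# Nearly transparent black: a single marker is faint, and a region gets
# darker as more markers are drawn on top of each other
faded <- adjustcolor("black", alpha.f = 0.05)

plot(x, y, pch = 16, col = faded,
     xlab = "Sales Total (simulated)", ylab = "Number of Orders (simulated)")
```

Transparency shows relative density, but once enough markers overlap the region saturates and further differences in density become invisible, which is one reason to look at the alternatives below.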

The R package ggExtra has been developed for adding marginal histograms, boxplots or density plots to ggplot2 scatterplots.

Run the command install.packages("ggExtra") to install ggExtra.

Firstly, a base ggplot for the dataset d3 is prepared before adding marginal plots, as shown in the following snippet.

    library(ggplot2)

    # base scatterplot of d3 with a linear fit; the marginal plots are added to it later
    p.base <- ggplot(data=d3, aes(x=sales_total, y=num_of_orders))
    p.base <- p.base + geom_point(size=2, color="yellow")
    p.base <- p.base + geom_smooth(method="lm", fill="lightblue", colour="lightblue")
    p.base <- p.base + labs(y="Number of Orders", x="Sales Total")
    p.base <- p.base + theme(axis.text = element_text(size=6),
                             axis.title = element_text(size=6, face="bold"),
                             plot.title = element_text(size=6, face="bold"))
    p.base <- p.base + theme(panel.background = element_rect(fill="gray50", colour="lightblue", size=0.5, linetype="solid"),
                             panel.grid.major = element_line(size=0.5, linetype="solid", colour="white"),
                             panel.grid.minor = element_line(size=0.25, linetype="solid", colour="white"))

### Add marginal density plots to the plot p.base

    library(ggExtra)
    library(miniUI)

    # density curves along both the x and y margins of the scatterplot
    ggExtra::ggMarginal(
      p.base,
      type = 'density',
      margins = 'both',
      size = 3,
      col = '#FF0000',
      fill = '#FFA500'
    )

### Log scale sales_total

Here, because the magnitude of sales_total is much larger than that of num_of_orders, sales_total is transformed to a log scale. To do so, create a new ggplot p.log by adding scale_x_log10() to the ggplot p.base.

    # same plot as p.base, with sales_total shown on a log10 x axis
    p.log <- p.base + scale_x_log10()

    ggExtra::ggMarginal(
      p.log,
      type = 'density',
      margins = 'both',
      size = 5,
      col = '#FF0000',
      fill = '#FFA500'
    )
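
The effect of the log transform on the crowding of sales_total can be checked with simple arithmetic on the values reported by summary earlier. The sketch below compares where the third quartile sits within the range on the linear scale and on the log10 scale.

```r
# Values of sales_total copied from the summary output
sales_min <- 30.02
sales_q3  <- 295.50
sales_max <- 7606.09

# Position of Q3 within [min, max] on the linear scale
linear_share <- (sales_q3 - sales_min) / (sales_max - sales_min)

# Position of Q3 within [log10(min), log10(max)] on the log scale
log_share <- (log10(sales_q3) - log10(sales_min)) /
             (log10(sales_max) - log10(sales_min))

# In percent: roughly 3.5 on the linear scale versus roughly 41 on the log scale
round(100 * c(linear = linear_share, log10 = log_share), 1)
```

On the log axis the lower three quarters of the data spread over about two fifths of the axis instead of a thin strip at the left edge.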

### Add marginal histograms to ggplot

The following snippet adds histograms to both axes.

    ggExtra::ggMarginal(
      p.log,
      type = 'histogram',
      margins = 'both',
      size = 4,
      col = '#0F0101',
      fill = '#37AAE8'
    )

### Add marginal boxplots to ggplot

    ggExtra::ggMarginal(
      p.log,
      type = 'boxplot',
      margins = 'both',
      size = 5,
      col = '#FF0000',
      fill = '#FFA500'
    )

### Hexagonal binning

Hexagonal binning is another way of preventing overplotting in large datasets. Data points are grouped into hexagonal bins rather than the square bins of a two-dimensional histogram. Hexagons are closer to circles than squares are, so they represent the clustering of nearby points better. The shade or color of each hexagonal bin represents the density of the data in it.
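
Before plotting, it can help to look at the counts that a hexagonal binning produces. The sketch below uses the hexbin function from the hexbin package on simulated data; the data and the choice of 30 bins along the x axis are assumptions for illustration.

```r
library(hexbin)

# Simulated stand-in data (assumption), not the sales data
set.seed(3)
n <- 10000
x <- rlnorm(n, meanlog = 5, sdlog = 1)
y <- 1 + rpois(n, lambda = 1.4)

# Bin on log10(x) so the hexagons are not all crowded at the left edge
hb <- hexbin(log10(x), y, xbins = 30)

hb@ncells      # number of non-empty hexagons
sum(hb@count)  # every point is counted exactly once, so this equals n
max(hb@count)  # number of points in the densest hexagon
```

The count slot is what a hexbin plot maps to shade or color, so thousands of overlapping markers are replaced by a comparatively small number of cells.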

### hexbinplot: the R function for creating a hexbin plot.

    library(hexbin)

    # counts are shown on a square-root scale; type 'g' adds a grid, 'r' a regression line
    hexbinplot(d3$num_of_orders~d3$sales_total, trans=sqrt,
               inv=function(x) x^2, type=c('g','r'), xlab="Sales Total", ylab="Number of Orders")

### ggplot: adding a geom_hex() layer

    p <- ggplot(data=d3, aes(x=sales_total, y=num_of_orders))
    p <- p + labs(y="Number of Orders", x="Sales Total")
    p <- p + theme(axis.text = element_text(size=9),
                   axis.title = element_text(size=9, face="bold"),
                   plot.title = element_text(size=9, face="bold"))
    p <- p + theme(panel.background = element_rect(fill="white", colour="lightblue", size=0.5, linetype="solid"),
                   panel.grid.major = element_line(size=0.5, linetype="solid", colour="white"),
                   panel.grid.minor = element_line(size=0.25, linetype="solid", colour="white"))

    sales.total.q3 <- quantile(d3$sales_total, 0.75)   # Q3 of sales_total
    num.orders.q3 <- quantile(d3$num_of_orders, 0.75)  # Q3 of num_of_orders

    # geom_hex needs the hexbin package to be installed
    # the fixed fill colours every hexagon blue; density is shown through alpha
    # newer ggplot2 versions prefer after_stat(count) over ..count..
    p.hexbin <- p + geom_hex(aes(alpha=log(..count..)), fill="#0000ff") +
                    scale_alpha_continuous("Log of Count", breaks=seq(0,10,1)) +
                    geom_hline(yintercept = num.orders.q3, size = 0.5, color="red", linetype = 2) +
                    geom_vline(xintercept = sales.total.q3, size = 0.5, color="red", linetype = 2)

    p.hexbin