jDataLab

3 minute read

Outliers in a collection of data are the values which are far away from most other points. A boxplot is usually used to visualize a dataset for spotting unusual data points. However, is an outlier abnormal or normal? It needs to be decided by data analysts.

The boxplot displays five descriptive values which are minimum, \(Q_1\), median, \(Q_3\) and maximum.

The First Quartile and Third Quartile

Place a sample variable into ascending order. Split the sample set into two halves. The first quartile, denoted by \(Q_1\), is the median of the lower half of the set. This means that about 25% of the values are less than \(Q_1\).

The third quartile, denoted by \(Q_3\), is the median of the upper half of the set. This means that about 75% of the values are less than \(Q_3\).

\(IQR\)

An interval, IQR (Inter-Quartile Range), is calculated as the difference between \(Q_3\) and \(Q_1\).

Outliers

IQR is often used to filter out outliers. If an observation falls outside of the following interval,

$$ [~Q_1 - 1.5 \times IQR, ~ ~ Q_3 + 1.5 \times IQR~] $$

it is considered as an outlier.

Boxplot Example

It is easy to create a boxplot in R by using either the basic function boxplot or ggplot.

A dataset of 10,000 rows is used here as an example dataset. Two variables, num_of_orders, sales_total and gender are of interest to analysts if they are looking to compare buying behavior between women and men.

Firstly, load the data into R.

sales <- read.csv("data/yearly_sales.csv")

Select the variable sales_total and inspect the variable by calling the function summary:

summary(sales$sales_total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.02   80.29  151.60  249.50  295.50 7606.00 

The summary function returns all the five descriptive values for the variable sales_total. Run summary on gender too.

summary(sales$gender)
F 5035
M 4965

As gender is a factor of two levels, F and M, the summary function returns the number of each level.

Boxplot by using boxplot

The following snippet will create three boxplots of sales_total by the basic R function boxplot. Each boxplot has a specific aesthetics setting

 1slug: outlier-boxplot
 2# set up a layout for plotting
 3mat <- matrix(c(1,2,3), nrow=1, ncol=3)
 4slug: outlier-boxplot
 5layout(mat)
 6# 1. boxplot for all customers
 7boxplot(sales$sales_total, pch=19, xlab='F and M')
 8# 2. boxplot for all customers, log scale 
 9boxplot(sales$sales_total,pch=19,log='y',xlab='F and M',ylab='The Log of sales_total')
10# 3. one boxplot for each gender level group, log scale
11boxplot(sales$sales_total~sales$gender, pch=19,log='y',col='bisque',xlab='Gender',ylab='The Log of sales_total')

Boxplot by using ggplot

install.packages(“colorspace”)

 1# BOXPLOT BY GENDER GROUP
 2library(ggplot2)
 3library(Rmisc)
 4
 5p1 <- ggplot(data = sales, aes(x=gender, y=sales_total)) + 
 6            scale_y_log10() +
 7            geom_point(aes(color=gender), alpha=0.2) +
 8            geom_boxplot(outlier.size=4, outlier.colour='blue', alpha=0.1)
 9
10plot(p1)

Jittering

Noticeably, there is the problem of overplotting with the points in both boxplots. Often, we can add a little random noise to the points, referred to as jittering data. In the geom_point layer of ggplot, assign jitter to the parameter position, which is shown in the following ggplot snippet.

p2 <- ggplot(data = sales, aes(x=gender, y=sales_total)) + 
        scale_y_log10() +
        geom_point(aes(color=gender), alpha=0.2, position='jitter') + 
        geom_boxplot(outlier.size=5, alpha=0.1)

plot(p2)