Outlier Detection by Data Visualization with Boxplot

Outliers in a collection of data are the values which are far away from most other points. A boxplot is usually used to visualize a dataset for spotting unusual data points. However, is an outlier abnormal or normal? It needs to be decided by data analysts.

The boxplot displays five descriptive values which are minimum, $$Q_1$$, median, $$Q_3$$ and maximum.

The First Quartile and Third Quartile

Place a sample variable into ascending order. Split the sample set into two halves. The first quartile, denoted by $$Q_1$$, is the median of the lower half of the set. This means that about 25% of the values are less than $$Q_1$$.

The third quartile, denoted by $$Q_3$$, is the median of the upper half of the set. This means that about 75% of the values are less than $$Q_3$$.

$$IQR$$

An interval, IQR (Inter-Quartile Range), is calculated as the difference between $$Q_3$$ and $$Q_1$$.

Outliers

IQR is often used to filter out outliers. If an observation falls outside of the following interval,

$$[~Q_1 - 1.5 \times IQR, ~ ~ Q_3 + 1.5 \times IQR~]$$

it is considered as an outlier.

Boxplot Example

It is easy to create a boxplot in R by using either the basic function boxplot or ggplot.

A dataset of 10,000 rows is used here as an example dataset. Two variables, num_of_orders, sales_total and gender are of interest to analysts if they are looking to compare buying behavior between women and men.

Firstly, load the data into R.

sales <- read.csv("data/yearly_sales.csv")

Select the variable sales_total and inspect the variable by calling the function summary:

summary(sales$sales_total)  Min. 1st Qu. Median Mean 3rd Qu. Max. 30.02 80.29 151.60 249.50 295.50 7606.00  The summary function returns all the five descriptive values for the variable sales_total. Run summary on gender too. summary(sales$gender)
F 5035
M 4965

As gender is a factor of two levels, F and M, the summary function returns the number of each level.

Boxplot by using boxplot

The following snippet will create three boxplots of sales_total by the basic R function boxplot. Each boxplot has a specific aesthetics setting

 1slug: outlier-boxplot
2# set up a layout for plotting
3mat <- matrix(c(1,2,3), nrow=1, ncol=3)
4slug: outlier-boxplot
5layout(mat)
6# 1. boxplot for all customers
7boxplot(sales$sales_total, pch=19, xlab='F and M') 8# 2. boxplot for all customers, log scale 9boxplot(sales$sales_total,pch=19,log='y',xlab='F and M',ylab='The Log of sales_total')
10# 3. one boxplot for each gender level group, log scale
11boxplot(sales$sales_total~sales$gender, pch=19,log='y',col='bisque',xlab='Gender',ylab='The Log of sales_total')

Boxplot by using ggplot

install.packages(“colorspace”)

 1# BOXPLOT BY GENDER GROUP
2library(ggplot2)
3library(Rmisc)
4
5p1 <- ggplot(data = sales, aes(x=gender, y=sales_total)) +
6            scale_y_log10() +
7            geom_point(aes(color=gender), alpha=0.2) +
8            geom_boxplot(outlier.size=4, outlier.colour='blue', alpha=0.1)
9
10plot(p1)

Jittering

Noticeably, there is the problem of overplotting with the points in both boxplots. Often, we can add a little random noise to the points, referred to as jittering data. In the geom_point layer of ggplot, assign jitter to the parameter position, which is shown in the following ggplot snippet.

p2 <- ggplot(data = sales, aes(x=gender, y=sales_total)) +
scale_y_log10() +
geom_point(aes(color=gender), alpha=0.2, position='jitter') +
geom_boxplot(outlier.size=5, alpha=0.1)

plot(p2)