Outlier Detection by Data Visualization with Boxplot

January 27, 2017


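Outliers are values in a dataset that lie far from most other points. A boxplot is a common way to visualize a dataset and spot such unusual points. Whether a flagged point is a genuine anomaly or a legitimate extreme value, however, is a judgment the data analyst has to make.

The boxplot displays five descriptive values: the minimum, $Q_1$, the median, $Q_3$ and the maximum. In R's default boxplot, the whiskers stop at the most extreme observations within $1.5 \times IQR$ of the box, and points beyond them are drawn individually.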

The First Quartile and Third Quartile

Sort the values of a sample variable into ascending order and split the sorted set into two halves. The first quartile, denoted by $Q_1$, is the median of the lower half of the set, so about 25% of the values are less than $Q_1$.

The third quartile, denoted by $Q_3$, is the median of the upper half of the set. This means that about 75% of the values are less than $Q_3$.
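As a minimal sketch of the median-of-halves rule, the snippet below works through a made-up vector (the values in x are illustrative and not taken from the sales data). Note that R's quantile function uses a different interpolation rule by default (type 7), so its quartiles can differ slightly, while fivenum returns Tukey's hinges, which agree with the median-of-halves rule for this vector.

```r
# Illustrative values only; an even length avoids the question of
# whether the median itself belongs to either half.
x <- sort(c(7, 1, 3, 9, 5, 11, 13, 15))
n <- length(x)

lower_half <- x[1:(n / 2)]        # 1 3 5 7
upper_half <- x[(n / 2 + 1):n]    # 9 11 13 15

q1 <- median(lower_half)          # 4
q3 <- median(upper_half)          # 12

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
hinges <- fivenum(x)              # 1 4 8 12 15

# Default (type 7) quantiles interpolate between order statistics
q_type7 <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # 4.5 11.5
```

The small gap between 4 and 4.5 is a reminder that "the first quartile" depends on the convention in use; on large samples the difference is usually negligible.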

$IQR$

The interquartile range (IQR) is the difference between $Q_3$ and $Q_1$, that is, $IQR = Q_3 - Q_1$. It measures the spread of the middle 50% of the values.

Outliers

IQR is often used to filter out outliers. If an observation falls outside of the following interval,

$$ [~Q_1 - 1.5 \times IQR, ~ ~ Q_3 + 1.5 \times IQR~] $$

it is considered an outlier.
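The rule above can be sketched as a small function. The name iqr_outliers and the sample vector are hypothetical, introduced only for illustration; the quartiles come from R's default quantile rule, whereas the base function boxplot.stats applies the same 1.5 multiplier to Tukey's hinges, so the two can disagree on borderline points.

```r
# Hypothetical helper: flag values outside [Q1 - k * IQR, Q3 + k * IQR].
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  keep <- !is.na(x) & (x < lower | x > upper)
  list(lower = lower, upper = upper, outliers = x[keep])
}

x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 50)  # illustrative values
res <- iqr_outliers(x)

res$lower      # -2.5
res$upper      # 15.5
res$outliers   # 50

# Base R's own rule (hinge-based) flags the same point here
boxplot.stats(x)$out  # 50
```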

Boxplot Example

It is easy to create a boxplot in R with either the base function boxplot or the ggplot2 package.

A dataset of 10,000 rows is used here as an example. Three variables, num_of_orders, sales_total and gender, are of interest to analysts who want to compare buying behavior between women and men.

First, load the data into R.

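# stringsAsFactors = TRUE keeps gender as a factor (R >= 4.0 reads strings as character by default)
sales <- read.csv("data/yearly_sales.csv", stringsAsFactors = TRUE)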

Select the variable sales_total and inspect the variable by calling the function summary:

summary(sales$sales_total)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.02   80.29  151.60  249.50  295.50 7606.00

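The summary function returns the five descriptive values, plus the mean, for the variable sales_total. The mean (249.50) is well above the median (151.60), which indicates a right-skewed distribution. Run summary on gender too.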

summary(sales$gender)

   F    M
5035 4965

As gender is a factor with two levels, F and M, the summary function returns the number of observations in each level.
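Returning to sales_total, the outlier interval can be worked out by hand from the quartiles that summary printed. This is only a rough check: summary rounds its output, so the fences below are approximate.

```r
# Quartiles copied from the summary(sales$sales_total) output
q1 <- 80.29
q3 <- 295.50

iqr   <- q3 - q1            # 215.21
lower <- q1 - 1.5 * iqr     # -242.525
upper <- q3 + 1.5 * iqr     # 618.315

c(lower = lower, upper = upper)
```

The minimum (30.02) lies above the lower fence, so no value is flagged on the low side, while the maximum (7606.00) is far beyond the upper fence, so all flagged points are unusually large purchases.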

Boxplot by using boxplot

The following snippet creates three boxplots of sales_total with the base R function boxplot. Each boxplot uses different aesthetic settings.

# set up a 1 x 3 layout for plotting
mat <- matrix(c(1, 2, 3), nrow = 1, ncol = 3)
layout(mat)
# 1. boxplot for all customers
boxplot(sales$sales_total, pch = 19, xlab = 'F and M')
# 2. boxplot for all customers, log scale
boxplot(sales$sales_total, pch = 19, log = 'y', xlab = 'F and M', ylab = 'sales_total (log scale)')
# 3. one boxplot for each gender level group, log scale
boxplot(sales_total ~ gender, data = sales, pch = 19, log = 'y', col = 'bisque', xlab = 'Gender', ylab = 'sales_total (log scale)')
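Besides drawing the plot, boxplot also returns the points it flags, which is handy when the outliers need to be inspected rather than just seen. The sketch below assumes the CSV file is not at hand and builds a synthetic stand-in; the log-normal distribution and its parameters are assumptions chosen only to give a right-skewed sales_total, not a reproduction of the real data.

```r
# Synthetic stand-in for the sales data (assumed shape, not the real file)
set.seed(42)
n <- 10000
sales <- data.frame(
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  sales_total = round(30 + rlnorm(n, meanlog = 5, sdlog = 1), 2)
)

# plot = FALSE computes the boxplot statistics without drawing anything
bp <- boxplot(sales_total ~ gender, data = sales, plot = FALSE)

bp$names          # group labels, in plotting order
head(bp$out)      # values drawn as individual points
table(bp$group)   # number of flagged points per group (1 = F, 2 = M)
```

Because the fences are computed on the raw values here, a strongly right-skewed variable will show many flagged points on the high side; that is a property of the data, not necessarily a sign of errors.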

Boxplot by using ggplot

install.packages("colorspace")

# BOXPLOT BY GENDER GROUP
library(ggplot2)
library(Rmisc)

p1 <- ggplot(data = sales, aes(x = gender, y = sales_total)) +
        scale_y_log10() +
        geom_point(aes(color = gender), alpha = 0.2) +
        geom_boxplot(outlier.size = 4, outlier.colour = 'blue', alpha = 0.1)

plot(p1)

Jittering

The points in both boxplots are clearly overplotted. A common remedy is to add a little random noise to the point positions, which is known as jittering. In the geom_point layer, set the position parameter to 'jitter', as in the following ggplot snippet.

p2 <- ggplot(data = sales, aes(x=gender, y=sales_total)) + 
        scale_y_log10() +
        geom_point(aes(color=gender), alpha=0.2, position='jitter') + 
        geom_boxplot(outlier.size=5, alpha=0.1)

plot(p2)
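One possible refinement, sketched here on synthetic data (the demo data frame and its log-normal parameters are assumptions for illustration): position = 'jitter' moves points both horizontally and vertically, so the plotted sales_total values are slightly off. Using position_jitter with height = 0 keeps the noise horizontal only, and outlier.shape = NA stops geom_boxplot from drawing the outliers a second time on top of the jittered points.

```r
library(ggplot2)

# Synthetic, right-skewed demo data (assumed, not the real sales file)
set.seed(1)
demo <- data.frame(
  gender = factor(rep(c("F", "M"), each = 200)),
  sales_total = 30 + rlnorm(400, meanlog = 5, sdlog = 1)
)

p3 <- ggplot(demo, aes(x = gender, y = sales_total)) +
  scale_y_log10() +
  # horizontal noise only, so the y values stay exact
  geom_point(aes(color = gender), alpha = 0.2,
             position = position_jitter(width = 0.2, height = 0)) +
  # hide the boxplot's own outlier markers; the points layer already shows them
  geom_boxplot(outlier.shape = NA, alpha = 0.1)

print(p3)
```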
