
A bar chart is a common way to show the distribution of a variable: each bar represents a value or interval, and its length shows how often that value occurs. If the observations also carry a group label, the share of each group within each bar can be shown by filling the bar segments with colors, one distinct color per group.

The sample dataset contains the mean education values of 32038 houses, each located in a zip code zone. The meaneducation column holds continuous numeric values, and the zipzone column holds the first digit of each house's zip code. Zip codes beginning with 0 are in New England, while zip codes beginning with 9 are on the West Coast. The page https://en.wikipedia.org/wiki/ZIP_code shows the mapping from the first digits 0 through 9 to the corresponding areas.
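As a small illustration of how a zipzone value relates to a full zip code, the sketch below derives the first digit from a few zip codes. The zip vector and the names zip5 and zipzone_demo are hypothetical and not part of the dataset; the sketch assumes the zip codes were read as numbers, so leading zeros have to be restored first.

```r
# Hypothetical zip codes read as numbers, so leading zeros are lost
zip <- c(2139, 10001, 60614, 94105)

# Pad back to five digits, then keep the first character as the zone
zip5 <- sprintf("%05d", as.integer(zip))
zipzone_demo <- as.integer(substr(zip5, 1, 1))

print(data.frame(zip5, zipzone_demo))
```

Without the padding step, a New England zip code such as 02139 would be assigned to zone 2 instead of zone 0.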

Load educationbyzipzone.csv into a data frame x.

x <- read.csv("data/educationbyzipzone.csv")

Creating an ordered factor of eight levels

Because the original mean education values are numeric, first specify eight buckets with distinct string labels, then assign the values to the buckets with the cut function, and finally apply the factor function to create an ordered factor from the bins. With right=FALSE each interval includes its lower bound and excludes its upper bound, and setting include.lowest makes the last interval also include the top boundary 21. The following script generates an ordered factor education_level with 8 levels. The level labels are defined in the vector labels as

<2, 2-5), 5-8), 8-10), 10-13), 13-15), 15-19), 19-
# set up boundaries for intervals/bins
breaks <- c(0,2,5,8,10,13,15,19,21)
# specify interval/bin labels
labels <- c("<2", "2-5)", "5-8)", "8-10)", "10-13)", "13-15)", "15-19)", "19-")
# bucketing data points into bins
bins <- cut(x$meaneducation, breaks, include.lowest = TRUE, right = FALSE, labels = labels)
# create an ordered factor
education_level <- factor(bins, levels = labels, ordered = TRUE)
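To check how boundary values are bucketed, the same breaks and labels can be applied to a few hand-picked numbers. This is only a sketch: demo_values, demo_bins and demo_level are made-up names, and the values are chosen to sit on or near the bin edges.

```r
breaks <- c(0, 2, 5, 8, 10, 13, 15, 19, 21)
labels <- c("<2", "2-5)", "5-8)", "8-10)", "10-13)", "13-15)", "15-19)", "19-")

# Made-up values on or near the bin edges; 22 lies outside the breaks
demo_values <- c(0, 2, 4.9, 5, 19, 21, 22)
demo_bins <- cut(demo_values, breaks, include.lowest = TRUE, right = FALSE, labels = labels)
demo_level <- factor(demo_bins, levels = labels, ordered = TRUE)

print(data.frame(demo_values, demo_level))

# Ordered factors support comparisons between levels
print(demo_level[1] < demo_level[4])

# Values outside the breaks become NA
print(anyNA(demo_level))
```

A value of exactly 2 lands in 2-5) rather than <2 because the intervals are closed on the left, and 21 is kept in the last bin only because include.lowest is set.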

Now append the ordered factor education_level to the data frame x.

# append the ordered factor to the data frame
x <- cbind(x,education_level)
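Before plotting, it can help to look at the numbers a stacked bar chart displays: a cross-tabulation of education level by zip zone. The sketch below uses a tiny made-up data frame named toy (not the real dataset) with only three levels, so the output stays readable.

```r
# Made-up stand-in for the data frame, with three levels only
toy <- data.frame(
  zipzone = c(0, 0, 1, 9, 9, 9),
  education_level = factor(c("<2", "2-5)", "2-5)", "2-5)", "19-", "19-"),
                           levels = c("<2", "2-5)", "19-"), ordered = TRUE)
)

# Each cell is the count of one bar segment (level x zone)
counts <- table(toy$education_level, toy$zipzone, dnn = c("education_level", "zipzone"))
print(counts)

# Row sums are the total counts per education level
print(rowSums(counts))
```

On the real data, the same call with x$education_level and x$zipzone should give the per-segment counts.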

The frequency of each education level is plotted from the ordered factor education_level, and the column zipzone determines the fill color of the bar segments. In addition, define custom legend labels for each zip zone.

# define legends for each zip code zone
legends <- c("0 = Connecticut (CT), Massachusetts (MA), Maine (ME), New Hampshire (NH), New Jersey (NJ), New York (NY, Fishers Island only),\n Puerto Rico (PR), Rhode Island (RI), Vermont (VT), Virgin Islands (VI), Army Post Office Europe (AE), Fleet Post Office Europe (AE)",
"1 = Delaware (DE), New York (NY), Pennsylvania (PA)",
"2 = District of Columbia (DC), Maryland (MD), North Carolina (NC), South Carolina (SC), Virginia (VA), West Virginia (WV)",
"3 = Alabama (AL), Florida (FL), Georgia (GA), Mississippi (MS), Tennessee (TN), Army Post Office Americas (AA), Fleet Post Office Americas (AA)",
"4 = Indiana (IN), Kentucky (KY), Michigan (MI), Ohio (OH)",
"5 = Iowa (IA), Minnesota (MN), Montana (MT), North Dakota (ND), South Dakota (SD), Wisconsin (WI)",
"6 = Illinois (IL), Kansas (KS), Missouri (MO), Nebraska (NE)",
"7 = Arkansas (AR), Louisiana (LA), Oklahoma (OK), Texas (TX)",
"8 = Arizona (AZ), Colorado (CO), Idaho (ID), New Mexico (NM), Nevada (NV), Utah (UT), Wyoming (WY)",
"9 = Alaska (AK), American Samoa (AS), California (CA), Guam (GU), Hawaii (HI), Marshall Islands (MH), Federated States of Micronesia (FM),\n Northern Mariana Islands (MP), Oregon (OR), Palau (PW), Washington (WA), Army Post Office Pacific (AP), Fleet Post Office Pacific (AP)")
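An unnamed label vector is matched to the fill levels by position, so the labels must follow the sorted order of the zipzone values and there must be exactly one label per zone present in the data. A minimal check, using made-up zone values and shortened hypothetical labels (zone_values and legend_demo are not from the dataset), might look like this:

```r
# Made-up zone values and shortened labels, for illustration only
zone_values <- c(9, 0, 1, 1, 9)
legend_demo <- c("0 = New England", "1 = DE, NY, PA", "9 = West Coast")

# Levels of the fill factor, in the order the legend will use
fill_levels <- levels(as.factor(zone_values))

# The leading digit of each label should equal the level at that position
label_digits <- substr(legend_demo, 1, 1)
stopifnot(length(legend_demo) == length(fill_levels))
aligned <- label_digits == fill_levels

print(data.frame(fill_levels, legend_demo, aligned))
```

If a zone were absent from the data, the full ten-entry vector would no longer line up with the levels, so a check like this is worth running on the real zipzone column.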

Stacked bar chart

Now everything is ready for plotting the eight education levels. The following snippet creates a stacked bar chart that colors each bar segment by zip zone and writes the count in each segment.

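library(ggplot2)
# after_stat() requires ggplot2 >= 3.3.0; older versions use ..count.. instead
ggplot(data=x, aes(x=education_level, fill=as.factor(zipzone))) + 
  # one bar per education level, stacked by zip zone
  geom_bar(color='black', alpha=1) + 
  # write the count of each segment at the end of that segment
  stat_count(geom="text", aes(label=after_stat(count)), size=3, hjust=1) +
  theme_bw() + 
  labs(y="Number of Houses", x="Mean Education per House") +
  # horizontal bars, with counts on a log scale
  coord_flip() +
  scale_y_log10() +
  # legend above the plot, one left-aligned entry per line
  theme(legend.position='top',
        legend.direction = "vertical",
        legend.text.align = 0,
        legend.text = element_text(size = 7, colour = "black"),
        legend.justification=c("left", "top")) + 
  scale_fill_discrete(name="Zip Code Zone", labels=legends) +
  # keep education levels that have no observations
  scale_x_discrete(drop=FALSE)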

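Note: By default, factor levels with no observations are dropped from a discrete axis. To show all eight bins, including empty ones, add the layer scale_x_discrete(drop=FALSE).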
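To see which bins this applies to, count the observations per level; levels with a zero count are the ones a default discrete axis would omit. The sketch below uses a made-up ordered factor demo_level rather than the real education_level column.

```r
labels <- c("<2", "2-5)", "5-8)", "8-10)", "10-13)", "13-15)", "15-19)", "19-")

# Made-up factor in which only three of the eight levels occur
demo_level <- factor(c("<2", "2-5)", "2-5)", "19-"), levels = labels, ordered = TRUE)

# Levels with zero observations are the empty bins
level_counts <- table(demo_level)
empty_bins <- names(level_counts)[level_counts == 0]
print(empty_bins)

# droplevels() removes unused levels, as a default discrete axis does
print(nlevels(droplevels(demo_level)))
```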
