jDataLab Jie Wang


This is a simple guide to shaping raw shopping basket data into the required format before mining association rules in R with the packages arules and arulesViz. The R package tidyverse is used for fast data wrangling.

Association rules capture regularities among items that occur together in a set, such as sale items, web link clicks or web page visits. The apriori command in the R package arules mines frequent itemsets, association rules and class association rules using the Apriori algorithm.

The apriori command expects its input as an object of the transactions class, or as data that can be coerced to one.

We normally view transaction data as a two-dimensional matrix, where each row is one transaction and each column represents one item, such as a sale item available in a grocery store.
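
As a small illustration of this matrix view, the sketch below builds a logical matrix and coerces it to a transactions object. The toy data and the names m and toy are made up for this example; it assumes the arules package is installed.

```r
library(arules)

# Hypothetical toy data: 3 transactions (rows) over 4 items (columns).
# TRUE means the item was bought in that transaction.
m <- matrix(
  c(TRUE,  FALSE, TRUE,  FALSE,
    FALSE, TRUE,  FALSE, FALSE,
    TRUE,  TRUE,  FALSE, TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("bread", "milk", "butter", "eggs"))
)

# Coerce the logical matrix to the transactions class.
toy <- as(m, "transactions")

# List each transaction by enumerating its items.
inspect(toy)

# Number of items in each transaction.
size(toy)
```

Each row of m becomes one transaction, and the column names become the item labels.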


The Transactions Class

The arules package has a sample dataset, Groceries, which is an object of the transactions class. Run the command ?Groceries to display its R help page.

The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.

Run the command data(Groceries) to read the Groceries data. The following command shows the class type.

data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"

To view the first six transactions (head returns six by default),

inspect(head(Groceries))
##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}
## [6] {whole milk,              
##      butter,                  
##      yogurt,                  
##      rice,                    
##      abrasive cleaner}

The content returned from the inspect command is represented in the common way of describing a transaction by enumerating its items. However, the transactions class stores the transaction data in a special way.

The transaction data matrix is often very sparse: most cells are empty, because most items are not purchased in any given transaction. In the Groceries data only about 2.6% of the cells are non-empty, so the class stores the matrix in a sparse format.
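
As a rough check of how sparse the matrix is, the share of non-empty cells can be computed from three counts reported for Groceries: 9835 transactions, 169 items, and 43367 stored item indexes (the length of the @i vector in the str output). The variable names are chosen for this sketch only.

```r
# Counts reported for the Groceries data.
n_transactions <- 9835
n_items        <- 169
n_nonzero      <- 43367  # length of Groceries@data@i

# Share of matrix cells that hold a purchased item.
density <- n_nonzero / (n_items * n_transactions)
round(100 * density, 2)
## [1] 2.61
```

If arules is loaded, summary(Groceries) also reports this density.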

To find the components inside the Groceries data, run the str command.

str(Groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':	169 obs. of  3 variables:
##   .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
##   .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
##   .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
##   ..@ itemsetInfo:'data.frame':	0 obs. of  0 variables

A transactions object has three slots: @data, @itemInfo and @itemsetInfo.

@itemsetInfo Component

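For Groceries, @itemsetInfo is an empty data frame. In a transactions object this slot holds optional per-transaction information, such as transaction IDs. Running the apriori function does not fill it, because apriori returns its result as a separate object.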

@itemInfo Component

The @itemInfo slot is a data frame with one row per item. The labels column stores the item labels or names. The following command returns the first 20 item names.

Groceries@itemInfo$labels[1:20]
##  [1] "frankfurter"       "sausage"           "liver loaf"       
##  [4] "ham"               "meat"              "finished products"
##  [7] "organic sausage"   "chicken"           "turkey"           
## [10] "pork"              "beef"              "hamburger meat"   
## [13] "fish"              "citrus fruit"      "tropical fruit"   
## [16] "pip fruit"         "grapes"            "berries"          
## [19] "nuts/prunes"       "root vegetables"

@data Component

The @data component holds the transaction data.

Groceries@data@i lists the item indexes of every transaction, concatenated from the first transaction to the last one. Groceries@data@p gives, for each transaction, the starting position in Groceries@data@i from which its item indexes are read.

For instance, we know from the earlier inspect output that the first transaction contains four items: citrus fruit, semi-finished bread, margarine, ready soups.

How is the first transaction stored in @data?

First, find the item indexes of these four items. A quick way of doing so is the which function, which returns the positions of the matching elements in a vector.

which(Groceries@itemInfo$labels == 'citrus fruit')
## [1] 14
which(Groceries@itemInfo$labels == 'semi-finished bread')
## [1] 61
which(Groceries@itemInfo$labels == 'margarine')
## [1] 70
which(Groceries@itemInfo$labels == 'ready soups')
## [1] 79
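
The four lookups can also be done in one call. This sketch uses base R's match together with the arules accessor itemLabels; the names wanted and idx are introduced here for illustration, and the arules package is assumed to be installed.

```r
library(arules)
data(Groceries)

# Items of the first transaction, as shown by inspect().
wanted <- c("citrus fruit", "semi-finished bread", "margarine", "ready soups")

# Position of each label in the item label vector (1-based, as usual in R).
idx <- match(wanted, itemLabels(Groceries))
idx
## [1] 14 61 70 79

# The same positions counted from zero.
idx - 1L
## [1] 13 60 69 78
```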

Note that the indexes in @data begin at zero, so each one is one less than the position found above. The first four integers in Groceries@data@i

13 60 69 78

are the included items for the first transaction. The first integer, 0, in Groceries@data@p indicates the starting position in Groceries@data@i when reading the items for the first transaction. Similarly, the second integer in Groceries@data@p, 4, indicates the starting position in Groceries@data@i when reading the items for the second transaction, and so on.
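
The way the two vectors work together can be sketched in base R. The values below are the leading entries of @i and @p copied from the str output above, and items_of is a hypothetical helper written for this example, not an arules function.

```r
# Leading entries of Groceries@data@i and Groceries@data@p from str().
i <- c(13L, 60L, 69L, 78L, 14L, 29L, 98L, 24L)
p <- c(0L, 4L, 7L, 8L)

# Hypothetical helper: 1-based item indexes of transaction k.
# Transaction k occupies positions p[k] + 1 through p[k + 1] of i.
items_of <- function(k) {
  n <- p[k + 1L] - p[k]          # number of items in transaction k
  i[p[k] + seq_len(n)] + 1L      # shift from zero-based to one-based
}

items_of(1)
## [1] 14 61 70 79
items_of(2)
## [1] 15 30 99
items_of(3)
## [1] 25
```

The result for the first transaction matches the four positions found with which, and the lengths 4, 3 and 1 match the first three transactions shown by inspect.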