This is a simple guide to shaping raw shopping basket data into the required format before mining association rules in R with the packages arules and arulesViz. The R package tidyverse is used for fast data wrangling.
Association rules reflect regularities of items or elements in a set of items, such as sale items, web link clicks or web page visits. The apriori command in the R package arules mines frequent itemsets, association rules and class association rules using the Apriori algorithm.
The apriori command requires its input to be an object of the transactions class.
We normally view transaction data as a two-dimensional matrix, where each row is one transaction and each column represents one item, such as a sale item available in a grocery store.
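The matrix view can be sketched in a few lines of R. This is a minimal illustration, not data from the article: the three baskets and the four item names are hypothetical. The final step assumes arules is installed and is skipped otherwise.

```r
# Three hypothetical baskets over four items (names chosen for illustration).
items <- c("whole milk", "yogurt", "coffee", "butter")
baskets <- list(
  c("whole milk"),
  c("yogurt", "coffee"),
  c("whole milk", "butter", "yogurt")
)

# One row per transaction, one column per item; TRUE means the item was bought.
m <- t(sapply(baskets, function(b) items %in% b))
colnames(m) <- items
print(m)

# If arules is installed, the logical matrix can be coerced to transactions.
if (requireNamespace("arules", quietly = TRUE)) {
  suppressPackageStartupMessages(library(arules))
  trans <- as(m, "transactions")
  inspect(trans)
}
```

Even in this tiny example half of the cells are FALSE; with hundreds of items the share of empty cells grows quickly.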
The Transactions Class
The arules package includes a sample dataset, Groceries, which is an object of the transactions class. Run the command ?Groceries and the R help document displays the data information.
The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
Run the command data(Groceries) to load the Groceries data. The following command shows its class.
data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
To view the first six transactions (the default number returned by head):
inspect(head(Groceries))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner}
The content returned from the inspect command is represented in the common way of describing a transaction by enumerating its items. However, the transactions class stores the transaction data in a special way.
This is because the transaction data matrix is usually very sparse: most cells are empty, since most items are not purchased in any given transaction. In the Groceries data only about 2.6% of the cells are filled.
To see the components inside the Groceries data, run the str command.
str(Groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 3 variables:
## .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
## .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
## .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
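The sparsity mentioned earlier can be checked with the numbers in the str output above. This is a small arithmetic sketch; the variable names are chosen for illustration and the values are copied by hand from the output.

```r
# Numbers taken from the str(Groceries) output above.
n_items <- 169          # rows of @data (first entry of Dim)
n_transactions <- 9835  # columns of @data (second entry of Dim)
n_filled <- 43367       # length of @data@i, one entry per purchased item

# Share of matrix cells that are filled.
density <- n_filled / (n_items * n_transactions)
print(round(density, 4))                    # 0.0261

# Average number of items per transaction.
print(round(n_filled / n_transactions, 2))  # 4.41
```

With arules loaded, summary(Groceries) also reports the density, so the hand calculation is only needed to see where the figure comes from.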
A transactions object has three component slots: @data, @itemInfo and @itemsetInfo.
@itemsetInfo Component
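For Groceries, @itemsetInfo is an empty data frame. In a transactions object this slot holds optional per-transaction information, such as transaction IDs. It is not filled in by the apriori function, which returns its results as a separate object.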
@itemInfo Component
The @itemInfo component is a data frame. The labels column stores the item labels or names. The following command returns the first 20 item names.
Groceries@itemInfo$labels[1:20]
## [1] "frankfurter" "sausage" "liver loaf"
## [4] "ham" "meat" "finished products"
## [7] "organic sausage" "chicken" "turkey"
## [10] "pork" "beef" "hamburger meat"
## [13] "fish" "citrus fruit" "tropical fruit"
## [16] "pip fruit" "grapes" "berries"
## [19] "nuts/prunes" "root vegetables"
@data Component
The @data component holds the transaction data as a sparse matrix. Groceries@data@i lists the item indexes included in each transaction, concatenated from the first transaction to the last one. Groceries@data@p gives the starting position of each transaction when reading its item indexes from Groceries@data@i.
For instance, we know from the previous inspect command that the first transaction contains four items: citrus fruit, semi-finished bread, margarine and ready soups. How is the first transaction stored in @data?
First, find the item indexes of these four items. A quick way of doing so is the which function, which returns the positions of the matching elements in a vector.
which(Groceries@itemInfo$labels == 'citrus fruit')
## [1] 14
which(Groceries@itemInfo$labels == 'semi-finished bread')
## [1] 61
which(Groceries@itemInfo$labels == 'margarine')
## [1] 70
which(Groceries@itemInfo$labels == 'ready soups')
## [1] 79
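Several labels can also be looked up in one call with match, which returns the position of each name in the label vector. A minimal sketch: to stay runnable without arules it uses only the first 20 labels printed earlier, and the three names looked up are chosen for illustration.

```r
# First 20 item labels, copied from the output of
# Groceries@itemInfo$labels[1:20] shown earlier.
labels <- c("frankfurter", "sausage", "liver loaf", "ham", "meat",
            "finished products", "organic sausage", "chicken", "turkey",
            "pork", "beef", "hamburger meat", "fish", "citrus fruit",
            "tropical fruit", "pip fruit", "grapes", "berries",
            "nuts/prunes", "root vegetables")

# Names to look up (illustrative choice).
wanted <- c("citrus fruit", "tropical fruit", "pip fruit")

# match() returns the 1-based position of each name, or NA if it is absent.
idx <- match(wanted, labels)
print(idx)  # 14 15 16
```

With arules loaded, the same call on the full vector Groceries@itemInfo$labels should reproduce the four which results above in a single step.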
Note that the indexes in @data are zero-based, so each one is one less than the position returned by which. The first four integers in Groceries@data@i, 13 60 69 78, are the items of the first transaction. The first integer in Groceries@data@p, 0, is the starting position in Groceries@data@i when reading the items for the first transaction. Similarly, the second integer in Groceries@data@p, 4, is the starting position in Groceries@data@i when reading the items for the second transaction, and so on.
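How the pointer vector @data@p and the index vector @data@i work together can be sketched in base R. This is an illustrative sketch only: the vectors hold the first few values copied from the str output earlier, and items_of is a hypothetical helper, not an arules function.

```r
# First values of Groceries@data@i and Groceries@data@p, copied from the
# str(Groceries) output shown earlier.
i <- c(13, 60, 69, 78, 14, 29, 98, 24, 15, 29)
p <- c(0, 4, 7, 8, 12)

# Hypothetical helper: 1-based item positions of transaction k.
# p[k] is the zero-based start of transaction k in i, and p[k + 1] is the
# start of the next transaction, so their difference is the item count.
items_of <- function(k) {
  n <- p[k + 1] - p[k]
  i[p[k] + seq_len(n)] + 1  # +1 converts zero-based indexes to R positions
}

print(items_of(1))  # 14 61 70 79
print(items_of(2))  # 15 30 99
print(items_of(3))  # 25
```

The result for the first transaction matches the four positions returned by which. With arules loaded, using the full Groceries@data@i and Groceries@data@p vectors and indexing Groceries@itemInfo$labels with the result should give back the item names of any transaction.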