Is amount of information measurable? Can we quantify the information that is contained in a dataset? For a given probability distribution of a categorical attribute (which will be referred to as class label in the following part), the entropy is a measure of the amount of information that indicates level of uncertainty about which class label will be chosen.

A larger entropy value indicates a higher level of uncertainty or diversity, implying lower purity of the distribution.

In the following, a small open dataset, the weather data, will be used to explain the computation of information entropy for a class distribution.

1. Example Dataset

The dataset contains 14 samples about weather conditions for playing golf or not. Each sample is described with five nominal/categorical attributes whose names are listed in the first row.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

The last attribute play is a binary class which has two possible labels yes and no. The class distribution is listed below.

1
play {no, no, yes, yes, yes, no, yes, no, yes, yes, yes, yes, yes, no}

2. Formula of Shannon Entropy

In a space A of \(k\) discrete labels, the entropy is computed by the formula

\(k\) : the total number of distinct class labels in the space A

\(p_i\) : the probability of the \(ith\) class label


Note: The term \(p\log{p}\) is defined as zero when \(p\) is zero.


Estimate \(p_i\)

For discrete labels, \(p_i\) is estimated by

Information Units

The entropy is expressed in a unit of information, depending on the base of logarithm. If taking base 2 in the logarithm, information units are bits.

Two Special Cases

For discrete labels, there are two special cases:

  1. When the labels are equally distributed, i.e., every label has the same count, \(H\) equals 1.

  2. When all instances have the same label, i.e., the probability is one for the label and zero for all the other labels, \(H\) equals 0.



3. Compute H(play) by hand

Now let’s compute the information in the sample space of play from the previous weather data.

1
play {no, no, yes, yes, yes, no, yes, no, yes, yes, yes, yes, yes, no}

Firstly, count the label yes and no. It comprises of 9 instances with the yes label and 5 with the no label.

By denoting yes with + and no with -, the space can be represented by the following notation:

Now calculate the probability for each label by the formula (2):

Applying the formula (1) to calculate the information in the space \(S:[9+,5-]\)

By the entropy value above, the amount of information of the play attribute in the weather data is about \(0.94\) bits. This also indicates the level of uncertainty.

Workout

A dataset comprises of 20 instances, each of which is labeled with a class label from a category set {1, 2, 3}. Five (5) instances are in class 1. Eight(8) instances are in class 2. Seven(7) instances are in class 3. To predict a class label for a new instance from the same distribution, how much information is needed, measured in bits? Show the intermediate steps.

Next Read: Entropy of Continuous Attribute

If the attribute is continuous, discretize the attribute values into discrete intervals, as known as data binning.

How to perform data binning is explained in the notes Data Binning and Plotting

Cut the label range into a number of bins of same width. Each bin defines an numerical interval. Perform data binning that will place each label into one bin if the label value falls into the interval that the bin covers. Apply the formula (1) to the bins instead of labels.