jDataLab


The data.frame object in R groups a number of column vectors into a single data set. A data.frame organizes data the way a spreadsheet does, as a two-dimensional frame of rows and columns. A tibble is a modern variant of the classical data.frame and is used in some R packages. A data.frame can only hold named columns of the same length.

data.frame is included in base R. A similar data structure is available in Python through the pandas library.

On the surface, a data.frame is like an Excel spreadsheet with columns and rows. In statistical terms, each column is a variable and each row is an observation. In data mining terms, each column is an attribute and each row is an instance. In machine learning terms, each column is a feature and each row is an object.

Creating data.frame

A data.frame object is created by passing a set of named vectors to the data.frame function. For example, create a data.frame containing Chicago temperature forecasts for the next five days:

Chicago Temperature Forecasts
Day       Date   TempF
Thursday  Feb 1  26
Friday    Feb 2  22
Saturday  Feb 3  30
Sunday    Feb 4  32
Monday    Feb 5  24

# Assign column names during creation
chicWeather <- data.frame(
  Day = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24)
)
print(chicWeather)

Or create the named columns first and then combine them:

Day <- c("Thursday", "Friday", "Saturday", "Sunday", "Monday")
Date <- c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5")
TempF <- c(26, 22, 30, 32, 24)

chicWeather <- data.frame(Day, Date, TempF)

print(chicWeather)

Cautions

Nonmatching Vector Lengths

If we attempt to create a data frame from vectors with nonmatching lengths, R stops with an error. For example, the following command:

data.frame(x = 1:5, y = 1:2)

produces the error message:

Error in data.frame(x = 1:5, y = 1:2) : arguments imply differing number of rows: 5, 2
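The length check can be explored with a small sketch in base R. The names res and recycled are illustrative only. One detail worth knowing: when a shorter vector's length divides the longest length evenly, data.frame recycles it instead of raising the error.

```r
# Capture the error message instead of stopping the script
res <- tryCatch(
  data.frame(x = 1:5, y = 1:2),
  error = function(e) conditionMessage(e)
)
print(res)

# Lengths 4 and 2 divide evenly, so y is recycled to 1, 2, 1, 2
recycled <- data.frame(x = 1:4, y = 1:2)
print(recycled)
```

Silent recycling can hide a data-entry mistake, so it is worth checking nrow() after building a data frame from vectors of different lengths.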

Encode String Input to Factor

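In versions of R before 4.0.0, data.frame converts character vectors to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE and strings stay as characters. The str command displays the type of each column in chicWeather. The output below comes from an older version of R, where the conversion applies.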

str(chicWeather)
## 'data.frame':    5 obs. of  3 variables:
##  $ Day  : Factor w/ 5 levels "Friday","Monday",..: 5 1 3 4 2
##  $ Date : Factor w/ 5 levels "Feb 1","Feb 2",..: 1 2 3 4 5
##  $ TempF: num  26 22 30 32 24

Both Day and Date are of the factor type. To prevent this conversion from character to factor, add stringsAsFactors=FALSE to data.frame:

# Assign column names during creation
chicWeather <- data.frame(
  Day = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24),
  stringsAsFactors = FALSE
)
str(chicWeather)
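To see the effect of the setting regardless of the R version in use, the argument can be given explicitly both ways. This is a minimal sketch; the names days, asFactor, asChar and backToChar are made up for illustration.

```r
days <- c("Thursday", "Friday", "Saturday")

# Same input, with the conversion switched on and off explicitly
asFactor <- data.frame(Day = days, stringsAsFactors = TRUE)
asChar   <- data.frame(Day = days, stringsAsFactors = FALSE)

class(asFactor$Day)  # "factor"
class(asChar$Day)    # "character"

# A factor column can be converted back to character when needed
backToChar <- as.character(asFactor$Day)
class(backToChar)    # "character"
```

Stating stringsAsFactors explicitly is one way to keep a script behaving the same across old and new R versions.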

EX1

Write R code to create the following data frame of 5 rows and 5 named columns, store it in a variable named weather, and print it.

outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes



Inspecting the Data

A few base R functions help us quickly understand a data frame: its size, the data type of each column, and some sample values. Run the following commands on a data.frame, here the weather data frame from EX1.

class(weather)    # class of the object: "data.frame"
nrow(weather)     # row count in the data
ncol(weather)     # column count in the data
colnames(weather) # column names
rownames(weather) # row names
dim(weather)      # dimensions of the data: rows, then columns
dim(weather)[2]   # column count in the data
str(weather)      # view the structure of the data
head(weather)     # return the first few rows of the data
tail(weather)     # return the last few rows of the data
summary(weather)  # summary statistics of each column

The View of Data.frame

Because R views a data.frame as simply a named list of column vectors, each element of a data frame is a column vector. Therefore,

  • The length function returns the number of column vectors
  • The names function returns the element (column) names.

The following example shows the outputs of the two functions:

length(weather)
## [1] 5
names(weather)
## [1] "outlook"     "temperature" "humidity"    "windy"       "play"
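Because a data frame is a list, list tools also work on it column by column. The sketch below uses a small made-up data frame named df rather than weather, so that it runs on its own; colClasses is an illustrative name.

```r
df <- data.frame(
  Day = c("Thursday", "Friday"),
  TempF = c(26, 22),
  stringsAsFactors = FALSE
)

is.list(df)   # TRUE: a data.frame is a list underneath
length(df)    # 2, the number of columns
names(df)     # "Day" "TempF"

# sapply visits each list element, i.e. each column
colClasses <- sapply(df, class)
print(colClasses)
```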

EX2

R ships with many built-in datasets in the base package datasets. Check that the datasets package is available in your library. Because it is a base package, it should be present in any standard R installation. To view a built-in dataset, e.g., iris, simply type the name of the dataset and run.

iris

Open the help doc for built-in dataset

Each R built-in dataset comes with a help document, explaining the values inside. To open the help document, run the command help.

help(iris)

Inspect the object

Run the commands from the Inspecting the Data section on iris.

Find data type of a column (attribute)

To find the type of a vector, e.g., 'Species' in iris, run the command:

class(iris$Species)

The result shows the column is of type factor. A factor stores categories and enumerated values.
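A few base R functions show what a factor holds. This is a small sketch on the built-in iris data; the variable name species is illustrative.

```r
species <- iris$Species

levels(species)            # the distinct categories
nlevels(species)           # the number of categories
table(species)             # how many rows fall in each category
head(as.integer(species))  # the integer codes stored underneath
```

Each value is stored as an integer code that points into the vector of levels, which is why a factor suits categorical data.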

Write the code to find the type of the column `Sepal.Length` in `iris`.

Selecting one column

As with a list, we can reference a single column (vector) of a data frame in any of the following styles:

  • Double square brackets
  • The $ sign with the column name
  • Subscripting as in a numerical matrix, with square brackets

chicWeather[[3]]  # double square brackets
## [1] 26 22 30 32 24
chicWeather$TempF # the $ symbol
## [1] 26 22 30 32 24
chicWeather[,3]   # subscripting
## [1] 26 22 30 32 24
chicWeather[,"TempF"]
## [1] 26 22 30 32 24
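All of the styles above return a plain vector. If a one-column data frame is wanted instead, single square brackets or drop = FALSE can be used. A minimal sketch, rebuilding a two-column version of chicWeather so that it runs on its own (asVector, asFrame and keepDims are illustrative names):

```r
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  TempF = c(26, 22, 30, 32, 24)
)

asVector <- chicWeather[["TempF"]]                # numeric vector
asFrame  <- chicWeather["TempF"]                  # one-column data.frame
keepDims <- chicWeather[, "TempF", drop = FALSE]  # also a data.frame

class(asVector)
class(asFrame)
class(keepDims)
```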

Subscripting data.frame like Matrix

R allows us to reference a data frame as if it were a matrix. We can select rows and columns of a data.frame with the same subscripting methods used for mathematical matrices. For example:

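chicWeather[,3]                    # the third column, all rows
chicWeather[1:3,]                  # the first three rows, all columns
chicWeather[1:2,c(1,3)]            # rows 1-2 of columns 1 and 3, by index
chicWeather[1:2,c("Day","TempF")]  # the same selection, by column name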

Logical subscripting a single column

In addition, a logical expression can filter the values in a single column, returning only those for which the criterion evaluates to TRUE. The following expression returns only the values in TempF that are higher than 25.

chicWeather$TempF[ chicWeather$TempF > 25 ] # logical subscript
## [1] 26 30 32
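It can help to look at the logical vector on its own before using it as a subscript. The sketch below works on a standalone TempF vector with the same values; isWarm is an illustrative name.

```r
TempF <- c(26, 22, 30, 32, 24)

isWarm <- TempF > 25  # TRUE FALSE TRUE TRUE FALSE
TempF[isWarm]         # 26 30 32: values at the TRUE positions
which(isWarm)         # 1 3 4: positions of the TRUE values
sum(isWarm)           # 3: TRUE counts as 1, so this counts matches
```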

Logical subscripting for filtering rows

To filter rows, test a logical expression row by row and keep only the rows that satisfy the criterion. The following code returns a subset containing only the rows with temperatures higher than 25F.

chicWeather[ chicWeather$TempF > 25, ]
##        Day  Date TempF
## 1 Thursday Feb 1    26
## 3 Saturday Feb 3    30
## 4   Sunday Feb 4    32

The following code shows subsetting for days when temperature is 22F.

chicWeather[ chicWeather$TempF == 22, ]
##      Day  Date TempF
## 2 Friday Feb 2    22

Logical subscripting (subsetting) both rows and columns

If the columns need to be filtered too, add a vector of column names or indexes as the second subscript.

chicWeather[ chicWeather$TempF > 25, c("Day", "TempF") ]
##        Day TempF
## 1 Thursday    26
## 3 Saturday    30
## 4   Sunday    32
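Several conditions can be combined with & (and) or | (or). Base R also offers the subset function as a shorthand for the same selection. subset is not used elsewhere in this tutorial, and the names mild and mild2 are illustrative. A minimal sketch:

```r
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date  = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24)
)

# Two conditions combined element-wise with &
mild <- chicWeather[chicWeather$TempF > 22 & chicWeather$TempF < 32,
                    c("Day", "TempF")]
print(mild)

# The same rows and columns using subset(); column names are unquoted
mild2 <- subset(chicWeather, TempF > 22 & TempF < 32, select = c(Day, TempF))
print(mild2)
```

The bracket form is explicit about what is being evaluated, while subset is shorter to type at the console.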

EX3

Retrieve a column from iris

As the columns in iris are named, the $ symbol provides an intuitive way of referencing them. For example, to retrieve the column named Species, run the following expression; it returns a vector containing the requested column.

iris$Species

Write the code to retrieve the column Petal.Length in the iris dataset.

Run the following code. Describe the values in set1 and set2.

df <- data.frame(X = -2:2, Y = 1:5)
set1 <- df$Y[ df$X > 0 ]
set2 <- df[ df$X > 0, ]

Subsetting iris data

Write R code to retrieve the following subsets from iris:

  • The first 50 rows
  • The first 2 columns
  • The columns Sepal.Length and Petal.Length
  • All of the columns excluding the last column Species
  • The rows whose Species equals 'setosa'
  • The rows whose Species is not 'setosa'

Adding New Columns to data.frame

To add a new column, assign a vector of matching length to a new column name with the $ symbol. For example, add a new column named TempC to chicWeather, containing the temperature in degrees Celsius:

TempC <- round((chicWeather$TempF - 32) * 5/9)
chicWeather$TempC <- TempC
print(chicWeather)
##        Day  Date TempF TempC
## 1 Thursday Feb 1    26    -3
## 2   Friday Feb 2    22    -6
## 3 Saturday Feb 3    30    -1
## 4   Sunday Feb 4    32     0
## 5   Monday Feb 5    24    -4

Alternatively, use the data.frame function to combine the existing data frame with the new column:

Humidity <- c(2,1,8,5,4)
chicWeather <- data.frame(chicWeather, Humidity)
print(chicWeather)
##        Day  Date TempF TempC Humidity
## 1 Thursday Feb 1    26    -3        2
## 2   Friday Feb 2    22    -6        1
## 3 Saturday Feb 3    30    -1        8
## 4   Sunday Feb 4    32     0        5
## 5   Monday Feb 5    24    -4        4
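Two other base R functions, transform and cbind, can also add columns, and assigning NULL removes one. The sketch below rebuilds a small chicWeather so that it runs on its own. The Windy values are made up purely for illustration, and withC and withWind are illustrative names.

```r
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  TempF = c(26, 22, 30, 32, 24)
)

# transform() returns a copy with a column computed from existing ones
withC <- transform(chicWeather, TempC = round((TempF - 32) * 5 / 9))

# cbind() appends one or more columns on the right (hypothetical values)
withWind <- cbind(withC, Windy = c(TRUE, FALSE, FALSE, TRUE, FALSE))

# Assigning NULL to a column removes it
withWind$TempC <- NULL
print(withWind)
```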

EX4

Load the built-in mtcars dataset. Read the help doc of mtcars to understand the origin of the data. Use mtcars to:

  • Print only the first five rows.
  • Print the last five rows.
  • How many rows and columns does the data have?
  • Look at the data in the RStudio data viewer (if you are using RStudio).
  • Print the mpg column of the data.
  • Print the mpg column of the data where the corresponding cyl column is 6.
  • Print all rows of the data where cyl is 6.
  • Print all rows of the data where mpg is greater than 25, but only for the mpg and cyl columns.

EX5

This exercise uses the diamonds dataset, which ships with the ggplot2 package.

  • Install the ggplot2 package, which contains the diamonds dataset.
  • Load the ggplot2 package, then load the diamonds data.
  • Run the command ?diamonds. The help page will open under the Help tab. Read the document to understand the origin of the data and its attributes.
  • Print the first five rows
  • Print the row count and column count
  • Select the rows whose cut equals Very Good, and find the number of rows in the returned subset
  • Find how many diamonds have a carat greater than 3.0
  • Return rows where color is D, but only for the color and cut columns
  • Run the summary command with the diamonds data. Read the average price.