The data.frame object in R groups a number of column vectors into a single data set. It organizes data much like a spreadsheet, as a two-dimensional frame of rows and columns. A tibble is a modern version of the classical data.frame and is used in some R packages. A data.frame can only hold named columns of the same length.
data.frame is included in the R base. The same data structure is implemented in Python with the module
On the surface, a data.frame is like an Excel spreadsheet with columns and rows. Statistically, each column is a variable and each row is an observation. In data mining terms, each column is an attribute and each row is an instance. In machine learning terms, each column is a feature and each row is an object.
A data.frame object is created by passing a set of named vectors to the data.frame function. For example, create a data.frame containing Chicago temperature forecasts over the next five days:
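The code for this example is not shown here, so the following is a minimal sketch. The column values are reconstructed from the output printed later in this section, and `stringsAsFactors = TRUE` is an assumption made so that the text columns come out as factors, as in that output.

```r
# Sketch: values are reconstructed from the output printed later in this section.
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date  = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24),
  stringsAsFactors = TRUE  # assumption: gives factor columns on any R version
)
print(chicWeather)
```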
Or create named columns and combine them:
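A minimal sketch of this second style, assuming the same three columns (Day, Date, TempF) that appear in the output later in this section. Each vector is created on its own, and data.frame takes the column names from the variable names.

```r
# Sketch: build each column as a separate vector, then combine them.
Day   <- c("Thursday", "Friday", "Saturday", "Sunday", "Monday")
Date  <- c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5")
TempF <- c(26, 22, 30, 32, 24)

# Column names are taken from the variable names.
chicWeather <- data.frame(Day, Date, TempF)
names(chicWeather)
```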
Nonmatching Vector Lengths
If you attempt to create a data frame from vectors with nonmatching lengths, R prints an error message. For example, calling data.frame(x = 1:5, y = 1:2) produces the following error message:
Error in data.frame(x = 1:5, y = 1:2) : arguments imply differing number of rows: 5, 2
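To see the error without stopping a script, the call can be wrapped in tryCatch. This is only a sketch; the variable names `result` and `recycled` are illustrative. It also shows that a shorter vector is recycled, rather than rejected, when its length divides the longer one evenly.

```r
# Capture the error message instead of halting the script.
result <- tryCatch(
  data.frame(x = 1:5, y = 1:2),
  error = function(e) conditionMessage(e)
)
print(result)

# Lengths that divide evenly are recycled instead of raising an error.
recycled <- data.frame(x = 1:4, y = 1:2)
print(recycled)
```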
Encode String Input to Factor
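In R versions before 4.0.0, data.frame encodes an input string vector as a factor unless we turn off the default setting (since R 4.0.0, strings are kept as character vectors by default). The str command displays the type of each column in chicWeather: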
str(chicWeather)
## 'data.frame': 5 obs. of 3 variables:
##  $ Day  : Factor w/ 5 levels "Friday","Monday",..: 5 1 3 4 2
##  $ Date : Factor w/ 5 levels "Feb 1","Feb 2",..: 1 2 3 4 5
##  $ TempF: num 26 22 30 32 24
Both Day and Date are of the factor type. To prevent this conversion from character to factor, add the argument stringsAsFactors = FALSE to the data.frame call.
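One way to keep the text columns as character vectors is the `stringsAsFactors` argument of data.frame. The sketch below assumes the same Chicago forecast values shown in the str output above.

```r
# Sketch: keep Day and Date as character vectors instead of factors.
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date  = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24),
  stringsAsFactors = FALSE
)
str(chicWeather)  # Day and Date are now reported as chr
```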
Write R code to create the following data frame of 5 rows and 5 named columns, store it in a variable named weather, and print the data.
Inspecting the Data
It is useful to look at a few base R functions that help you quickly understand the data stored in a data frame, including the data type and sample values of each column. Run the following commands on a data.frame.
class(weather)      # class of the object
nrow(weather)       # row count in the data
ncol(weather)       # column count in the data
colnames(weather)   # column names
rownames(weather)   # row names
dim(weather)        # dimension of the data
str(weather)        # view the structure of the data
head(weather)       # return the first few rows of the data
tail(weather)       # return the last few rows of the data
summary(weather)    # summary statistics of each column
The View of Data.frame
Because R views a data.frame as simply a named list of column vectors, each element of a data frame is a column vector. Therefore,
- the length function returns the number of column vectors
- the names function returns the element (column) names.
The following example shows the outputs of the two functions:
length(weather)
## [1] 5
names(weather)
## [1] "outlook"     "temperature" "humidity"    "windy"       "play"
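The list view can be checked directly. Because the contents of weather are not shown here, this sketch uses a small hypothetical data frame `df` with made-up columns `id` and `score`.

```r
# Hypothetical two-column data frame, used only for illustration.
df <- data.frame(id = 1:3, score = c(90, 85, 77))

is.list(df)    # a data frame is a list under the hood
length(df)     # number of column vectors
names(df)      # column (element) names
df[["score"]]  # extract one element, exactly as with a list
```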
R ships with many built-in datasets in the base package datasets. Check whether the datasets package is available in your library; it comes with the R installation, so you should already have it. If the package is not available, install it. To import a built-in dataset, e.g., iris, simply type the name of the dataset and run it.
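A minimal sketch of both steps, assuming a standard R installation. The variable name `available` is illustrative; requireNamespace reports whether a package can be loaded.

```r
# Check that the datasets package can be loaded.
available <- requireNamespace("datasets", quietly = TRUE)
print(available)

# Built-in datasets are available by name; show the first rows of iris.
head(iris)
```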
Open the help doc for built-in dataset
Each R built-in dataset comes with a help document explaining the values inside. To open the help document, run the help command with the dataset name, for example help(iris) or ?iris.
Inspect the object
Run the commands from the Inspecting the Data section on iris.
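As a sketch, these are the same inspection functions applied to iris, which has 150 rows and 5 columns:

```r
class(iris)     # "data.frame"
nrow(iris)      # 150
ncol(iris)      # 5
dim(iris)       # rows and columns together
colnames(iris)  # column names
str(iris)       # structure: type and sample values of each column
head(iris)      # first few rows
tail(iris)      # last few rows
summary(iris)   # summary statistics of each column
```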
Find data type of a column (attribute)
To find the type of a vector, e.g., Species in iris, run the class command on that column. The result shows the column is of type factor. A factor stores categories and enumerated values.
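A short sketch of that check, with two related calls (levels and table) added for illustration to show which categories the factor holds:

```r
class(iris$Species)   # "factor"
levels(iris$Species)  # the distinct categories stored in the factor
table(iris$Species)   # number of rows in each category
```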
Write the code to find the type of the column `Sepal.Length` in `iris`.
Selecting one column
As with a list, we can reference a single element (vector) from the data frame using any of these styles:
- Double square brackets
- The $ sign with column name
- Subscripting as in a numerical matrix, with square brackets
chicWeather[["TempF"]]   # double square brackets
## [1] 26 22 30 32 24
chicWeather$TempF        # the $ symbol
## [1] 26 22 30 32 24
chicWeather[, 3]         # subscripting by column index
## [1] 26 22 30 32 24
chicWeather[, "TempF"]   # subscripting by column name
## [1] 26 22 30 32 24
Subscripting data.frame like Matrix
R allows us to reference the data frame as if it were a matrix. We can filter rows and columns in a data.frame with the same subscripting methods used for mathematical matrices. For example:
chicWeather[, 3]                        # third column
chicWeather[1:3, ]                      # first three rows
chicWeather[1:2, c(1, 3)]               # first two rows, columns 1 and 3
chicWeather[1:2, c("Day", "TempF")]     # the same selection by column name
Logical subscripting a single column
A logical expression can also filter the values in a single column, returning only the values for which a given criterion evaluates to TRUE. The following expression returns only the values in TempF that are higher than 25.
chicWeather$TempF[ chicWeather$TempF > 25 ]   # logical subscript
## [1] 26 30 32
Logical subscripting for filtering rows
Test a logical expression rowwise. Only choose the rows which satisfy the criteria. The following code returns a subset containing only rows with temperatures higher than 25F.
chicWeather[ chicWeather$TempF > 25, ]
##        Day  Date TempF
## 1 Thursday Feb 1    26
## 3 Saturday Feb 3    30
## 4   Sunday Feb 4    32
The following code shows subsetting for days when temperature is 22F.
chicWeather[ chicWeather$TempF == 22, ]
##      Day  Date TempF
## 2 Friday Feb 2    22
Logical subscripting (subsetting) both rows and columns
If the columns need to be filtered too, a vector of names or indexes is added as the second argument.
chicWeather[ chicWeather$TempF > 25, c("Day", "TempF") ]
##        Day TempF
## 1 Thursday    26
## 3 Saturday    30
## 4   Sunday    32
Retrieve a column from iris
As the columns in iris are named, the $ symbol provides an intuitive way of referencing named columns. For example, to retrieve the column named Species, run the following expression and it will return a vector which contains the requested column.
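A minimal sketch of that expression. The variable name `species` is illustrative, and the result is a factor vector with one entry per row of iris.

```r
# Retrieve the Species column with the $ symbol.
species <- iris$Species

length(species)  # one value per row of iris
head(species)    # first few values, followed by the factor levels
```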
Write the code to retrieve the column Petal.Length from iris.
Run the following code. Describe the values in set1 and set2.
df <- data.frame(X = -2:2, Y = 1:5)
set1 <- df$Y[ df$X > 0 ]
set2 <- df[ df$X > 0, ]
Subsetting iris data
Write R code to retrieve the following subsets from iris:
- The first 50 rows
- The first 2 columns
- The columns
- All of the columns excluding the last column
- The rows whose
- The rows whose
Adding New Columns to data.frame
For example, add a new column named TempC to chicWeather, containing the temperature in degrees Celsius:
TempC <- round((chicWeather$TempF - 32) * 5/9)
chicWeather$TempC <- TempC
print(chicWeather)
##        Day  Date TempF TempC
## 1 Thursday Feb 1    26    -3
## 2   Friday Feb 2    22    -6
## 3 Saturday Feb 3    30    -1
## 4   Sunday Feb 4    32     0
## 5   Monday Feb 5    24    -4
# Alternatively, combine the existing data frame with a new vector.
Humidity <- c(2, 1, 8, 5, 4)
chicWeather <- data.frame(chicWeather, Humidity)
print(chicWeather)
##        Day  Date TempF TempC Humidity
## 1 Thursday Feb 1    26    -3        2
## 2   Friday Feb 2    22    -6        1
## 3 Saturday Feb 3    30    -1        8
## 4   Sunday Feb 4    32     0        5
## 5   Monday Feb 5    24    -4        4
Load the built-in mtcars dataset. Read the help doc of mtcars to understand the origin of the data. Use mtcars to:
- Print only the first five rows.
- Print the last five rows.
- How many rows and columns does the data have?
- Look at the data in the RStudio data viewer (if you are using RStudio).
- Print the mpg column of the data.
- Print the mpg column of the data where the corresponding cyl column is 6.
- Print all rows of the data where
- Print all rows of the data where mpg is greater than 25, but only for the
Install the ggplot2 package, which contains the diamonds dataset, and load the diamonds data.
- Import the ggplot2 package. Load the diamonds data.
- Run the command ?diamonds. The help page will open under the Help tab. Read the document to understand the origin of the data and its attributes.
- Print the first five rows
- Print the row count and column count
- Select the rows whose cut equals Very Good, and find the total number of rows in the returned subset
- Find how many diamonds have a carat greater than 3.0
- Return rows where color is D, but only for the color and cut columns
- Run the summary command with the diamonds data. Read the average price.