
Twitter offers three types of APIs: Search, Streaming and “firehose”. Developers and data scientists use the Streaming API (https://dev.twitter.com/streaming/overview) to access multiple types of real-time streams, including public, user and site streams. The Search API (https://dev.twitter.com/rest/public/search) retrieves past tweets by searching against a sampling of recent tweets posted in the past 7 days. The Search API focuses on relevance rather than completeness. If completeness matters, consider the Streaming API instead.

The following steps describe how to gather tweets with the Search API in R.

Step 1: Sign Up for a Twitter Account and Create an Application

If you do not yet have a Twitter account, visit twitter.com and create one. Sign in to your account, open https://apps.twitter.com and click Create New App. Enter a name, description and website. The website URL can be a placeholder such as http://placeholder.com

Once the app is created, click modify app permissions and select Read, Write and Access direct messages.

Click the Keys and Access Tokens tab and scroll to the bottom of the page. Under Token Actions, click Create my access token. Your access token will be generated.

As specified in this tab,

  • “This access token can be used to make API requests on your own account’s behalf. Do not share your access token secret with anyone.”

  • “Keep the Consumer Secret a secret. This key should never be human-readable in your application.”

Step 2: Install R Packages

If you have not installed R and RStudio, download and install both from their official websites.

The Search API workflow below needs a few R packages. Launch RStudio and run the following snippet to install them.

# Install packages
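# Pass the package names as one character vector; a second positional
# argument would be interpreted as the library directory (lib).
install.packages(c("RCurl", "RJSONIO", "bitops", "stringr"))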
install.packages("ROAuth")
install.packages("twitteR")
install.packages('base64enc')
install.packages('streamR')
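
When the snippet is rerun on a machine that already has some of these packages, reinstalling everything is unnecessary. The following is a minimal sketch of installing only what is missing; the helper names missing_packages and install_missing are hypothetical and not part of any package.

```r
# Packages used in this walkthrough
required <- c("RCurl", "RJSONIO", "bitops", "stringr",
              "ROAuth", "twitteR", "base64enc", "streamR")

# Hypothetical helper: return the names in pkgs that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the missing packages and
# invisibly return the names that were passed to install.packages()
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Calling install_missing(required) then downloads only the packages that are absent from the current library paths.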

Step 3: Create a New Project in RStudio

When the code is split across multiple scripts, it helps to keep them together in a project.

To make a project in RStudio, follow the menu command path: File -> New Project.

Choose New Directory -> Empty Directory.

In the Create New Project window, enter the project name, for example “TwitterSearch”, in the Directory name field. Then choose the parent directory in which the project will be created.

Once the project has been created, it opens automatically and its directory is shown under the Files tab.

Step 4: Set up a Crawler in a Function

First, create a new script by following the menu command path: File -> New File -> R Script. Save the script as setupCrawler.R.

The script should contain the function setupCrawler listed below. Replace the four placeholder strings with your own keys and tokens.

# function
setupCrawler <- function() {
  # Load the packages
  library(bitops)
  library(RCurl)
  library(RJSONIO)
  library(ROAuth)
  library(twitteR)
  library(stringr)
  
  # Provide Tokens (apps.twitter.com)
  api_key <- "placeholder:your_api_key" 
  api_secret <- "placeholder:your_api_secret" 
  token <- "placeholder:your_token" 
  token_secret <- "placeholder:your_token_secret" 
  
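  # Create the Twitter connection.
  # setup_twitter_oauth() prints "Using direct authentication" and, in an
  # interactive session, asks:
  #   Use a local file to cache OAuth access credentials between R sessions?
  #   1: Yes
  #   2: No
  # Setting httr_oauth_cache beforehand answers that question in advance.
  origop <- options(httr_oauth_cache = TRUE)  # options() returns the old value
  on.exit(options(origop))                    # restore it when the function exits
  setup_twitter_oauth(api_key, api_secret, token, token_secret)
}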
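
Because the consumer secret and access token secret should never be shared, one option is to keep them out of the script altogether. The following is a minimal sketch, assuming the four values are stored in environment variables. The variable names (TWITTER_API_KEY and so on) and the helper read_twitter_credentials are hypothetical; they are not part of twitteR.

```r
# Hypothetical helper: read the four credentials from environment variables
# instead of hard-coding them in the script.
read_twitter_credentials <- function() {
  # Names on the left are the list entries returned to the caller;
  # values on the right are the (assumed) environment variable names.
  vars <- c(api_key      = "TWITTER_API_KEY",
            api_secret   = "TWITTER_API_SECRET",
            token        = "TWITTER_ACCESS_TOKEN",
            token_secret = "TWITTER_ACCESS_TOKEN_SECRET")

  values <- Sys.getenv(vars, unset = "", names = FALSE)
  names(values) <- names(vars)

  # Fail early with a readable message if anything is unset or empty
  missing <- vars[!nzchar(values)]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}
```

Inside setupCrawler, the returned list could replace the four hard-coded strings: creds <- read_twitter_credentials(), followed by setup_twitter_oauth(creds$api_key, creds$api_secret, creds$token, creds$token_secret).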

Step 5: Test the Search API

Now create another R script to test the Search API by crawling tweets with the searchTwitter function. The sample script below exercises the crawler; the details of the searchTwitter function follow the code block.

# Set up Twitter Search API 
source("setupCrawler.R") #include external script source
setupCrawler() #call the function setupCrawler

# Example 1: Crawl 500 tweets relevant to java
tweets <- searchTwitter("java", n=500, lang="en")

# Transform tweets list into a data frame
tweets.df <- twListToDF(tweets)

# Write the data frame to a CSV file.
# row.names = FALSE keeps the header aligned with the data columns.
write.csv(tweets.df, file="tweets_java.csv", row.names=FALSE, na="NA")
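
One way to check that an exported file can be read back without damage is a small round trip. The sketch below uses a hand-made data frame as a stand-in for tweets.df (the ids and texts are invented, not real API output) and assumes the export is a CSV file. Tweet ids are too large to be stored exactly as numbers, so the id column is read back as character.

```r
# Stand-in for tweets.df: invented values, not real API output
sample.df <- data.frame(
  id   = c("829468251890274304", "829468251890274305"),
  text = c("Learning java, one class at a time", "He said \"java\" twice"),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file; write.csv doubles embedded quotes
out_file <- tempfile(fileext = ".csv")
write.csv(sample.df, file = out_file, row.names = FALSE)

# Read it back, forcing id to character so the long ids keep every digit
restored.df <- read.csv(out_file, colClasses = c(id = "character"),
                        stringsAsFactors = FALSE)

# Both comparisons should be TRUE if the round trip preserved the data
ids_match  <- identical(restored.df$id, sample.df$id)
text_match <- identical(restored.df$text, sample.df$text)
```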

The searchTwitter Function

This function will issue a search of Twitter based on a supplied search string.

Usage:

searchTwitter(searchString, n=25, lang=NULL, since=NULL, until=NULL,
locale=NULL, geocode=NULL, sinceID=NULL, maxID=NULL,
resultType=NULL, retryOnRateLimit=120)

Sample Query:

searchTwitter('DataScience', n=1000, lang='en', since='2017-02-01', until='2017-02-07',
locale=NULL, geocode='41.5443613,-87.5099057,50mi', sinceID=NULL, maxID=NULL,
resultType='recent', retryOnRateLimit=120)

Arguments:

  • searchString

A search term can contain spaces; separate multiple terms with “+”.

You can formulate the query string at https://twitter.com/search-advanced?lang=en and validate the search query string at https://twitter.com/search-home

  • n

The maximum number of tweets to return

  • lang

If not NULL, restricts tweets to the given language, given by an ISO 639-1 code

  • since

If not NULL, restricts tweets to those since the given date. Date is to be formatted as YYYY-MM-DD

  • until

If not NULL, restricts tweets to those up until the given date. Date is to be formatted as YYYY-MM-DD

  • locale

If not NULL, will set the locale for the search. As of 03/06/11 only ja is effective, as per the Twitter API

  • geocode

If not NULL, returns tweets by users located within a given radius of the given latitude/longitude. The values are given in the format latitude,longitude,radius, where the radius can have either mi (miles) or km (kilometers) as a unit. For instance, geocode="41.5443613,-87.5099057,50mi"

  • sinceID

If not NULL, returns tweets with IDs greater (i.e. newer) than the specified ID

  • maxID

If not NULL, returns tweets with IDs smaller (i.e. older) than the specified ID

  • resultType

If not NULL, returns filtered tweets as per value. It specifies the type of search results received in API response. Default is mixed. Allowed values are mixed (includes popular + real time results), recent (returns the most recent results) and popular (returns only the most popular results).

  • retryOnRateLimit

If non-zero, the search command will block and retry up to X times when the rate limit is reached. This might lead to a much longer run time, but the task will eventually complete if the retry count is high enough
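
Several of the arguments above are plain strings with a fixed format, so they can be assembled by small helpers instead of typed by hand. The following is a minimal sketch; build_search_string, build_geocode and build_date_window are hypothetical helper names, not functions provided by twitteR.

```r
# Join several query terms with "+", as described for searchString
build_search_string <- function(terms) {
  paste(terms, collapse = "+")
}

# Format a geocode string as latitude,longitude,radius with unit mi or km
build_geocode <- function(lat, lon, radius, unit = c("mi", "km")) {
  unit <- match.arg(unit)  # rejects anything other than "mi" or "km"
  paste0(lat, ",", lon, ",", radius, unit)
}

# Return since/until dates (YYYY-MM-DD) covering the last `days` days
build_date_window <- function(days = 7, today = Sys.Date()) {
  list(since = format(today - days, "%Y-%m-%d"),
       until = format(today, "%Y-%m-%d"))
}

# Usage (needs an authenticated session, so it is left commented out):
# window <- build_date_window()
# tweets <- searchTwitter(build_search_string(c("data", "science")),
#                         n = 100, lang = "en",
#                         since = window$since, until = window$until,
#                         geocode = build_geocode(41.5443613, -87.5099057, 50))
```

The default window of 7 days mirrors the span of recent tweets that the Search API samples from.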
