How often do people talk about a specific topic? How popular is a hashtag on Twitter? To answer these kinds of questions, we can examine how soon the next tweet will arrive. This post shows how to visualize the inter-arrival times of tweets with a specific hashtag.

  1. Collecting Tweets
  2. Calculating inter-arrival times
  3. Creating a histogram plot of the inter-arrival times
  4. Calculating cumulative probabilities
  5. Grouping breaks, cumulative probabilities and the hashtag into a data frame
  6. Plotting the cumulative probability distribution of the inter-arrival times
  7. Finding the probability of the next tweet arriving in less than x seconds
  8. Performing a Poisson test

1. Collecting Tweets

Gathering sample tweets with R is discussed in the post Crawling Tweets by using the Search API.

Follow the steps in that post to:

  • Set up Twitter connection
  • Pull n(=1000) tweets with a specific hashtag or search query
  • Store the tweets collection into a data frame

Here, we choose the hashtag #coffee as the query and request 1000 sample tweets. The function twListToDF converts the tweet list into a data frame tweets.df, from which we will extract the creation time of each tweet via the created field.

library(twitteR)
tweets <- searchTwitter("#coffee", n=1000)
tweets.df <- twListToDF(tweets)

Creating a Histogram Plot of the Creation Times

We will first take a look at the creation times of all the tweets. Each tweet has a created field whose value is its creation time in the UTC time zone. The following shows statistics of the creation times, obtained by calling the R function summary.

summary(tweets.df$created)
                 Min.               1st Qu.                Median 
"2017-02-14 16:24:47" "2017-02-14 17:02:24" "2017-02-14 17:43:05" 
                 Mean               3rd Qu.                  Max. 
"2017-02-14 17:41:03" "2017-02-14 18:19:46" "2017-02-14 18:57:11" 

Then we create a histogram plot of the column created of the data frame tweets.df. The bins are left-side inclusive.

# hist
hist(tweets.df$created,breaks=20,freq=TRUE,include.lowest=TRUE,
     main="",xlab="Creation time", col='bisque')


# ggplot2
library(ggplot2)
ggplot(tweets.df, aes(x=created)) + 
  geom_histogram(aes(fill=..count..), bins=18, closed='left', colour='black', alpha=0.2) +
  scale_fill_gradient('Count', low='aliceblue', high='blue') 


2. Calculating Inter-Arrival Times

How soon will the next tweet arrive? We need to calculate the time interval between every two consecutive tweets. The sample tweets are not ordered, so we first sort them by creation time in ascending order.

Running the following shows that the type of the created vector is POSIXct.

class(tweets.df$created)

[1] "POSIXct" "POSIXt"

We coerce it to integer type (seconds since the Unix epoch) so that the later difference calculation returns plain integers.

The function as.integer converts the times to integers, and the function sort orders them ascending. Run the following:
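As a side note, sorting also works directly on POSIXct values; the integer coercion mainly makes the later diff step return plain seconds. A minimal sketch with made-up timestamps (not the tweet data):

```r
# synthetic timestamps for illustration only
times <- as.POSIXct(c("2017-02-14 16:24:50",
                      "2017-02-14 16:24:47",
                      "2017-02-14 16:24:55"), tz="UTC")
sorted.int <- sort(as.integer(times))          # integer seconds since the epoch
sorted.ct  <- sort(times)                      # same ordering, still POSIXct
identical(sorted.int, as.integer(sorted.ct))   # TRUE
diff(sorted.int)                               # gaps in plain seconds: 3 5
```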

# integer casting and sorting
created.sort <- sort(as.integer(tweets.df[,'created']))

After running the statement above, we have the sorted time integers in a vector created.sort. We can inspect the first 10 values in created.sort by running the expression:

# inspect the first 10 values
created.sort[1:10]
 [1] 1487089487 1487089496 1487089505 1487089507 1487089512 1487089521
 [7] 1487089524 1487089525 1487089527 1487089534

To find the difference in seconds between each pair, simply call the function diff, which computes the difference between every two consecutive integers in the vector.

# find the difference in seconds for every two consecutive tweets
created.diff <- diff(created.sort)
# inspect created.diff
summary(created.diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   6.000   9.153  13.000  57.000 


Next, we will visualize the frequency of the inter-arrival times in created.diff.


3. Creating a Histogram Plot of the Inter-Arrival Times

# Frequency plot
ggplot(data=as.data.frame(created.diff), aes(x=created.diff)) + 
  geom_histogram(aes(fill=..count..), bins=32, closed='left', colour='black', alpha=0.2) +
  scale_fill_gradient('Count', low='blue', high='orange')  


# Plot probabilities of each bin
ggplot(data=as.data.frame(created.diff), aes(x=created.diff)) + 
  geom_histogram(aes(y=..density.., fill=..count..), bins=32, closed='left', alpha=0.2) +
  geom_density(fill='red', alpha=0.2) 


The density of a random variable describes the relative likelihood of the variable taking on a given value.
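A quick numerical check of this definition (on synthetic exponential waits, not the tweet data): the area under an estimated density curve is approximately one.

```r
set.seed(42)
waits <- rexp(1000, rate = 1/9)     # synthetic inter-arrival times, mean 9 s
d <- density(waits)                 # kernel density estimate
area <- sum(d$y) * mean(diff(d$x))  # numerical integral over the grid
round(area, 2)                      # close to 1
```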

In the next step, we will calculate the cumulative probability for each possible interval. For a given time interval, the cumulative probability is the likelihood that the wait for the next tweet is shorter than the length of that interval.


4. Calculating cumulative probabilities

The following snippet will calculate the cumulative probabilities in cumProb.

# Calculating the cumulative probabilities of created.diff
# 1. create bins
bin.width <- 1 #specify the width of each bin
min <- min(created.diff)
max <- max(created.diff)  
breaks <- seq(min, max+1, by = bin.width) # specify end points of bins
cuts <- cut(created.diff, breaks, right=FALSE) # assign each interval with a bin

# 2. the table function returns a table for counts/frequency of each level/bin
freq <- table(cuts)
# 3. returns a vector whose elements are the cumulative sums
cumFreq <- cumsum(freq)
# 4. divide the cumulative frequency by the total to get cumulative probability
cumProb <- cumFreq/length(created.diff)
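The cut/table/cumsum pipeline above is essentially a hand-rolled empirical CDF, so base R's ecdf can serve as a cross-check. A sketch on synthetic integer gaps (assumed to mimic created.diff, which holds whole seconds):

```r
set.seed(7)
gaps <- rpois(500, lambda = 9)                 # synthetic integer gaps
breaks <- seq(min(gaps), max(gaps) + 1, by = 1)
cuts <- cut(gaps, breaks, right = FALSE)
cumProb <- cumsum(table(cuts)) / length(gaps)
# For integer data, the cumulative probability of bin [v, v+1) equals
# ecdf(gaps)(v) = P(gap <= v), so the two agree at each left endpoint.
Fn <- ecdf(gaps)
isTRUE(all.equal(as.vector(cumProb), Fn(min(gaps):max(gaps))))  # TRUE
```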

5. Grouping breaks, cumulative probabilities and the hashtag into a data frame

Before plotting, we want to make a data frame to wrap all the data that will be used in the plot.

# create a sequence from min to max
x <- min(created.diff):max(created.diff)
# convert cumProb to a vector
y <- as.vector(cumProb)
# create a factor for the legend
tag <- c('#coffee')
legend <- as.factor(rep(tag, times=length(y))) # factor vector of the same length as y
# group x, y and legend into a data frame
dat <- data.frame(x, y, legend) 
# inspect dat
dat[1:5,]
  x          y  legend
1 0 0.05705706 #coffee
2 1 0.16416416 #coffee
3 2 0.24824825 #coffee
4 3 0.33333333 #coffee
5 4 0.39439439 #coffee

6. Plotting Cumulative Probability Distribution of Inter-Arrival Times

First, be sure to have easyGgplot2 installed in R. If not, run the following:

install.packages("devtools")
library(devtools)
install_github("kassambara/easyGgplot2")

Then run the following snippet which will plot the cumulative probability distribution of inter-arrival times for #coffee.

# one
p1 <- ggplot(data = dat, aes(x=x, y=y)) + geom_line()
# two
p2 <- p1  + 
  geom_line(aes(colour=legend)) +
  xlim(-1,60) +
  xlab('Inter-Arrival Time in Seconds') +
  ylab('Cumulative Probability') + 
  theme(axis.text=element_text(size=7),
        axis.title=element_text(size=7),
        plot.title= element_text(lineheight=.2),
        legend.text=element_text(size=7),
        legend.position='bottom')
# three
p3 <- p2 + geom_point(aes(colour=legend))

library(easyGgplot2)
ggplot2.multiplot(p1,p2,p3, cols=3)


The generated plot shows the classic cumulative distribution of inter-arrival times for a Poisson arrival process; for such a process, the inter-arrival times follow an exponential distribution.

The Poisson distribution is a discrete frequency distribution that gives the probability of a number of independent events occurring in a fixed period of time. It deals with mutually independent events occurring at a known and constant rate r per unit (of time or space), where r is the expected number of events per unit.
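As an illustration with assumed numbers (not estimated from the tweet sample), R's dpois and ppois give Poisson probabilities directly:

```r
r <- 6                       # assume an average of 6 tweets per minute
dpois(4, lambda = r)         # P(exactly 4 tweets in one minute)
ppois(4, lambda = r)         # P(at most 4 tweets in one minute)
sum(dpois(0:4, lambda = r))  # the same cumulative value, summed term by term
```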

The faster the tweets arrive, the steeper the curve becomes.
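For a Poisson process, the waiting-time curve is the exponential CDF, which makes the effect of the rate concrete. A sketch with two assumed rates:

```r
fast <- 1/3  # one tweet every 3 seconds on average
slow <- 1/9  # one tweet every 9 seconds on average
pexp(5, rate = fast)  # P(wait <= 5 s) for the fast stream, about 0.81
pexp(5, rate = slow)  # P(wait <= 5 s) for the slow stream, about 0.43
```

The higher-rate stream puts much more probability mass on short waits, which is exactly the steeper cumulative curve described above.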

7. Finding the Probability of the Next Tweet Arriving Less Than x Seconds

To find the likelihood of seeing the next tweet with the hashtag #coffee in 5 seconds or less, run the following snippet:

sum(created.diff<=5)/length(created.diff)

0.456456456456456

The result above tells us that the probability of the wait time being at most 5 seconds is about 0.46. In other words, it is slightly more likely than not that we will wait longer than 5 seconds for the next #coffee tweet.

To find the wait time that 75 percent of the inter-arrival times fall below, run the quantile function:

quantile(created.diff, 0.75)

75% 
 13 

The result shows that 75 percent of the tweets arrive within 13 seconds of the previous one.
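As a cross-check against an exponential model (assuming Poisson arrivals with the observed mean gap of 9.153 s from the summary above), pexp and qexp land close to these empirical values:

```r
mu <- 9.153      # mean inter-arrival time from summary(created.diff)
rate <- 1 / mu   # estimated arrivals per second
pexp(5, rate)    # model P(wait <= 5 s), about 0.42 (empirical: 0.46)
qexp(0.75, rate) # model 75th percentile, about 12.7 s (empirical: 13 s)
```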

8. Performing a Poisson Test

total <- 1000
mu <- mean(created.diff)      # mean inter-arrival time in seconds
c <- sum(created.diff <= mu)  # number of gaps no longer than the mean

poisson.test(c, total) 
    	Exact Poisson test
    
    data:  c time base: total
    number of events = 630, time base = 1000, p-value < 2.2e-16
    alternative hypothesis: true event rate is not equal to 1
    95 percent confidence interval:
     0.5817592 0.6811739
    sample estimates:
    event rate 
          0.63 

The Poisson test shows that, for a sample of 1000 tweets with #coffee, the estimated event rate (the proportion of inter-arrival times of at most mu seconds) is 0.63, with a 95 percent confidence interval of 58.2% to 68.1%.