January 1, 2017

Modelling Tweets as Markov Chains

To celebrate the end of 2016, let's model president-elect Trump's 2016 tweets as a Markov Chain!


Project Code on: GitHub | Live Demo

What is a Markov Chain?

A Markov Chain is a stochastic process that satisfies the Markov property. Markov Chains are described as "memoryless": the probability of the next state depends only on the current state, not on the sequence of states that preceded it.

Are the words within a tweet generated by a stochastic process that satisfies the Markov property? Absolutely not.

So why would you want to model tweets as a Markov Chain? Because it's hilarious.

Getting the Tweet Data

Twitter provides an API for reading and writing tweet data. Unfortunately, it's useless for our purposes because Twitter only returns tweets from the last three weeks. Very sad.

But it looks like trumptwitterarchive.com has a database of the president-elect's historical tweets, so I've copied all of the tweets from 2016 into a local text file.

Tweet Data Preprocessing

We'll need to pre-process our data so we can produce a sensible-looking transition matrix. String elements such as article links, quotes, and URLs are removed using regular expressions.
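For reference, a minimal sketch of that clean-up step (the exact patterns here are my assumption; tweak them for your own dataset):

// Strip urls, quotes and stray whitespace from a raw tweet string.
function preprocessTweet(tweet) {
  return tweet
    .replace(/https?:\/\/\S+/g, '')    // remove links and urls
    .replace(/["\u201C\u201D]/g, '')   // remove straight and curly quote marks
    .replace(/\s+/g, ' ')              // collapse runs of whitespace
    .trim();
}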

Modelling Tweets as a Markov Chain

Strictly speaking, we are modelling the process of writing out a tweet as a Markov Chain. Each word of the tweet is represented by a system state. With knowledge of the present state (the current word), we can obtain the next state (the next word) from the transition matrix.
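As a purely hypothetical illustration (made-up words and numbers, not measured from the real dataset), one row of the transition matrix might say that after the word 'make', the next word is 'America' 70% of the time and 'a' 30% of the time:

// Hypothetical transition probabilities for the current word 'make'.
let exampleTransition = {
  'make': { 'America': 0.7, 'a': 0.3 }
};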

A Tweet's First Word:

We will need to set an initial condition for our system before the Markov process of writing out subsequent words can begin. From our dataset, we can determine the probability distribution of a tweet's first word (the initial condition):

function calculateFirstWordsProbabilities(tweets)
{
  let tweetFirstWords = [];
  tweets.forEach(extractFirstWordofTweet);

  // Convert the observed first words into a probability distribution.
  // tweetFirstWordsProbability is assumed to be declared at module scope.
  tweetFirstWordsProbability = convertToProbabilityArray(tweetFirstWords);

  function extractFirstWordofTweet(tweet) {
    let words = tweet.split(' ');
    tweetFirstWords.push(words[0]);
  }
}

If your dataset is small, you could conceivably just keep an array of every tweet's first word (with repeats) and sample from it uniformly. That simplifies things, since we no longer need to calculate a cumulative distribution function, but it will require a lot of storage for any decently sized dataset.
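The convertToProbabilityArray helper isn't shown in this post's snippets; a minimal sketch of one way it could work (an assumption on my part, not necessarily the project's actual implementation) is to count how often each word appears and build a cumulative distribution we can sample from with a single random number:

function convertToProbabilityArray(words) {
  // Count how many times each word appears.
  let counts = {};
  words.forEach(word => { counts[word] = (counts[word] || 0) + 1; });

  // Turn the counts into a cumulative probability distribution.
  let probabilities = [];
  let cumulative = 0;
  for (let word in counts) {
    cumulative += counts[word] / words.length;
    probabilities.push({ word: word, cumulativeProbability: cumulative });
  }
  return probabilities;
}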

Determining the Transition Matrix

Given that the state space is very large (every unique word in our tweet dataset), I've chosen to implement the transition matrix as a hash table, with the key representing the current state and the value being an array of future states and their associated probabilities. This minimises storage requirements, since the transition matrix is very sparse.

function populateMarkovFrequencies(tweet) {
  let words = tweet.split(' ');

  for (let i = 0; i < words.length; i++) {
    let word = words[i];
    let isLastWordinTweet = (i === (words.length - 1));

    // First time we've seen this word: start an empty list of successors.
    if (markovTransition[word] === undefined) {
      markovTransition[word] = [];
    }

    if (isLastWordinTweet) {
      // Use a newline as a terminal state marking the end of a tweet.
      markovTransition[word].push('\n');
    } else {
      let nextWordInTweet = words[i + 1];
      markovTransition[word].push(nextWordInTweet);
    }
  }
}

function processMarkovTransition() {
  // Collapse each word's list of observed successors into a probability distribution.
  for (let word in markovTransition) {
    markovTransition[word] = convertToProbabilityArray(markovTransition[word]);
  }
}
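To wire it all together, a sketch of the full build step might look like this (it assumes the cleaned-up tweets are already loaded into an array called tweets, and that markovTransition and tweetFirstWordsProbability live at module scope):

// Module-scope state assumed by the functions above.
let markovTransition = {};
let tweetFirstWordsProbability = [];

calculateFirstWordsProbabilities(tweets);   // distribution of each tweet's first word
tweets.forEach(populateMarkovFrequencies);  // record every observed word-to-word transition
processMarkovTransition();                  // collapse those observations into probabilities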
Done!

All we need to do now is use the transition matrix to generate a tweet of our own!
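As a rough sketch of that final step (it assumes the cumulative-probability shape from the convertToProbabilityArray sketch above), generating a tweet could look something like this:

// Pick a word from a cumulative probability distribution using one random number.
function sampleFromDistribution(probabilities) {
  let r = Math.random();
  for (let entry of probabilities) {
    if (r <= entry.cumulativeProbability) {
      return entry.word;
    }
  }
  return probabilities[probabilities.length - 1].word;
}

// Walk the chain: sample a first word, then keep sampling the next word
// from the transition matrix until we hit the terminal '\n' state.
function generateTweet() {
  let word = sampleFromDistribution(tweetFirstWordsProbability);
  let tweet = [];

  while (word !== '\n' && markovTransition[word] !== undefined) {
    tweet.push(word);
    word = sampleFromDistribution(markovTransition[word]);
  }

  return tweet.join(' ');
}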