Modelling Tweets as Markov Chains
To celebrate the end of 2016, let's model president-elect Trump's 2016 tweets as a Markov Chain!
Project Code on: GitHub | Live Demo
What is a Markov Chain?
Markov Chains are stochastic processes that satisfy the Markov property. They are often described as "memoryless": the probability of the next state depends only on the current state, not on the sequence of states that came before it.
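More formally, for a sequence of random variables X1, X2, ..., the Markov property says that P(Xn+1 = x | Xn, Xn-1, ..., X1) = P(Xn+1 = x | Xn); the conditional distribution of the next state is determined entirely by the present one.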
How are Tweets and Markov Chain related?
Are the words contained within a tweet generated by a stochastic process that satisfies the Markov property? Absolutely not.
So why model tweets as a Markov Chain? Because it's hilarious.
Getting the Tweet Data
Twitter provides an API for reading and writing tweet data. Unfortunately, it's useless for our purposes because it only returns tweets from the last three weeks. Very sad.
But it looks like trumptwitterarchive.com has a database of historical tweets from the president-elect, so I've copied all of the 2016 tweets to a local text file.
Tweet Data Preprocessing
We'll need to pre-process the data so we can produce a sensible-looking transition matrix. String elements such as links to articles, quotes, and URLs are stripped out with regular expressions.
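A minimal sketch of that clean-up step might look something like the following (the exact regex patterns used in the project may well differ):

function cleanTweet(tweet) {
    return tweet
        .replace(/https?:\/\/\S+/g, '')  // strip URLs
        .replace(/"[^"]*"/g, '')         // strip quoted text
        .replace(/\s+/g, ' ')            // collapse leftover whitespace
        .trim();
}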
Modelling Tweets as a Markov Chain
Strictly speaking, we are modelling the process of writing out a tweet as a Markov Chain. Each word of the tweet represents a system state. With knowledge of the present state (the current word), we can obtain the next state (the next word) using the transition matrix.
A Tweet's First Word:
We will need to set an initial condition for our system before the Markov process of writing out subsequent words can begin. From our dataset, we can determine the probability distribution of the first word (the initial condition) of a tweet:
function calculateFirstWordsProbabilities(tweets) {
    let tweetFirstWords = [];

    // Collect the first word of every tweet (with repeats)
    tweets.forEach(extractFirstWordofTweet);

    // Convert the raw list into a probability distribution
    tweetFirstWordsProbability = convertToProbabilityArray(tweetFirstWords);

    function extractFirstWordofTweet(tweet) {
        let words = tweet.split(' ');
        tweetFirstWords.push(words[0]);
    }
}
If your dataset is small, you could conceivably just keep the array of every tweet's first word (with repeats) and sample from it directly. That simplifies things, since we would no longer need to calculate a cumulative distribution function, but it would require a lot of storage for any decently sized dataset.
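The convertToProbabilityArray helper isn't shown here; a plausible sketch, assuming it counts occurrences and normalises them into { value, probability } pairs, could look like this:

function convertToProbabilityArray(items) {
    // Count how often each item appears
    let counts = {};
    items.forEach(function (item) {
        counts[item] = (counts[item] || 0) + 1;
    });

    // Normalise the counts into { value, probability } entries
    return Object.keys(counts).map(function (key) {
        return { value: key, probability: counts[key] / items.length };
    });
}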
Determining the Transition Matrix
Given that the state space is very large (every unique word in our tweet dataset), I've chosen to implement the transition matrix as a hash table, with the key representing the current state and the value being an array of possible future states and their associated probabilities. This minimises the storage requirement, as the transition matrix is very sparse.
function populateMarkovFrequencies(tweet) {
    let words = tweet.split(' ');

    // markovTransition is assumed to be declared at module level (e.g. let markovTransition = {};)
    for (let i = 0; i < words.length; i++) {
        let word = words[i];
        let isLastWordinTweet = (i === (words.length - 1));

        // First time we've seen this word: start a new list of successors
        if (markovTransition[word] === undefined) {
            markovTransition[word] = [];
        }

        if (isLastWordinTweet) {
            // Use a newline as the terminating state of the tweet
            markovTransition[word].push('\n');
        } else {
            let nextWordInTweet = words[i + 1];
            markovTransition[word].push(nextWordInTweet);
        }
    }
}
function processMarkovTransition() {
    // Collapse each word's raw list of successors into a probability distribution
    for (let word in markovTransition) {
        markovTransition[word] = convertToProbabilityArray(markovTransition[word]);
    }
}
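Putting the pieces together, the build step might be wired up roughly like this (the top-level variable declaration is illustrative):

let markovTransition = {};

tweets.forEach(populateMarkovFrequencies);  // accumulate raw successor lists
processMarkovTransition();                  // convert them into probabilities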
Done!
All we need to do now is use the transition matrix to generate our own tweet!
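As a rough sketch of that final step, assuming the { value, probability } shape sketched above (sampleFromDistribution and generateTweet are my names, not necessarily the project's):

// Sample one entry from an array of { value, probability } pairs
function sampleFromDistribution(distribution) {
    let r = Math.random();
    let cumulative = 0;
    for (let entry of distribution) {
        cumulative += entry.probability;
        if (r <= cumulative) {
            return entry.value;
        }
    }
    // Guard against floating-point rounding
    return distribution[distribution.length - 1].value;
}

function generateTweet() {
    let word = sampleFromDistribution(tweetFirstWordsProbability);
    let tweet = [];

    // Walk the chain until we hit the terminating newline state
    while (word !== '\n') {
        tweet.push(word);
        word = sampleFromDistribution(markovTransition[word]);
    }
    return tweet.join(' ');
}

console.log(generateTweet());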