Endymion: A Poetic Romance (1818)

Text Generation using Markov Chain Algorithm

Harjot Kaur
4 min read · Nov 9, 2020

Introduction

A Markov chain is a mathematical model of a stochastic process that predicts the condition of the next state based solely on the condition of the current one. Mathematically speaking, the conditional probability distribution of the next state depends on the current state and not on the past states. That is, s(t) depends only on s(t-1), where s(t) is the state at time t. This is called a first-order Markov model. Using this principle, a Markov chain can predict the next word based on the last word typed. It models the transition probabilities between states, where in NLP each state is represented by a term/word. Markov chains are a simple yet effective way to build a text generation model.
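
To see the idea in miniature before we touch the Keats data, here is a sketch of a first-order model trained on a hypothetical two-sentence corpus (the corpus and variable names are made up purely for illustration):

from collections import defaultdict

# Hypothetical toy corpus, already tokenised
corpus = "the cat sat . the cat ran .".split()

# Count how often each word follows the previous one
counts = defaultdict(lambda: defaultdict(int))
for prev, curr in zip(corpus, corpus[1:]):
    counts[prev][curr] += 1

# Normalise the counts into transition probabilities P(next | current)
transition_probs = {
    prev: {word: count / sum(nxt.values()) for word, count in nxt.items()}
    for prev, nxt in counts.items()
}

print(transition_probs['cat'])  # {'sat': 0.5, 'ran': 0.5}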

Let us understand text generation using Markov Chain with a simple example.

I’ve grown up reading the poetry of John Keats (the famous English Romantic poet, who died in 1821 when he was only 25). Let us try generating his poetry using the Markov chain algorithm.

Methodology

For the new poem generation, we will make use of a 2nd-order Markov model (we could also use a 3rd- or 4th-order model for this). First, we need to clean up the data, and then train a Markov model on the cleaned-up data. The training of the Markov model can be divided into the following stages:

i) Tokenisation

  • Firstly, we perform tokenisation, where each sentence is broken into words.

ii) Building the state pairs

  • The second stage consists of forming the previous and current state pairs. Since we are building a 2nd-order Markov model, our previous state will consist of two words.
  • Additionally, we have to consider two peculiar cases: the first word and the second word. Neither of them has two previous words, so we have to handle them differently.
  • For the first word, we will just calculate the initial state distribution.
  • For the second word, we will treat it as a 1st-order Markov model, since it contains one previous word.
  • Finally, for the end of the sentence, we will add an additional identification token ‘END’ and form pairs.

iii) Determining the probability distribution

  • Here, for each current state, we calculate the probability of every possible next state, just as we did for the initial and second-word distributions.

Once we have completed the training, we will have the initial word distribution, second-word distribution and the state transition distributions.
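
To make these three distributions concrete, here is a hand-traced example on a made-up two-line corpus (not the actual Keats data):

Toy corpus:
    a thing of beauty is a joy
    a thing of beauty endures

Initial word distribution:    {'a': 1.0}
Second-word distribution:     {'a': {'thing': 1.0}}
State transition distributions:
    ('a', 'thing')        -> {'of': 1.0}
    ('thing', 'of')       -> {'beauty': 1.0}
    ('of', 'beauty')      -> {'is': 0.5, 'endures': 0.5}
    ('beauty', 'is')      -> {'a': 1.0}
    ('is', 'a')           -> {'joy': 1.0}
    ('a', 'joy')          -> {'END': 1.0}
    ('beauty', 'endures') -> {'END': 1.0}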

To generate the poem, all we need is a function that samples from the distributions created above.

Code for training the Markov Model

The complete Markov model code can be fetched from https://www.kaggle.com/wildflowerhd/text-generation-using-markov-chain-algorithm. The snippet below also defines the helpers it relies on (remove_punctuation, add2dict, list2probabilitydict) and uses a placeholder corpus path, so that it can run on its own.

# Trains a Markov model based on the data in training_data_file
import string

training_data_file = 'keats_poems.txt'  # placeholder: path to your local copy of the corpus

initial_word = {}   # distribution of sentence-initial words
second_word = {}    # 1st-order distribution for the second word of a sentence
transitions = {}    # 2nd-order transitions: (w[i-2], w[i-1]) -> w[i]

def remove_punctuation(text):
    # Strip punctuation characters from a line
    return text.translate(str.maketrans('', '', string.punctuation))

def add2dict(dictionary, key, value):
    # Append value to the list of outcomes stored under key
    dictionary.setdefault(key, []).append(value)

def list2probabilitydict(given_list):
    # Turn a list of outcomes into an {outcome: probability} dict
    probability_dict = {}
    for item in given_list:
        probability_dict[item] = probability_dict.get(item, 0) + 1
    for key, value in probability_dict.items():
        probability_dict[key] = value / len(given_list)
    return probability_dict

def train_markov_model():
    for line in open(training_data_file):
        tokens = remove_punctuation(line.rstrip().lower()).split()
        tokens_length = len(tokens)
        for i in range(tokens_length):
            token = tokens[i]
            if i == 0:
                # First word: count towards the initial state distribution
                initial_word[token] = initial_word.get(token, 0) + 1
            else:
                prev_token = tokens[i - 1]
                if i == tokens_length - 1:
                    # Last word: mark the end of the sentence
                    add2dict(transitions, (prev_token, token), 'END')
                if i == 1:
                    # Second word: only one previous word, so 1st-order
                    add2dict(second_word, prev_token, token)
                else:
                    prev_prev_token = tokens[i - 2]
                    add2dict(transitions, (prev_prev_token, prev_token), token)
    # Normalize the counts into probability distributions
    initial_word_total = sum(initial_word.values())
    for key, value in initial_word.items():
        initial_word[key] = value / initial_word_total

    for prev_word, next_word_list in second_word.items():
        second_word[prev_word] = list2probabilitydict(next_word_list)

    for word_pair, next_word_list in transitions.items():
        transitions[word_pair] = list2probabilitydict(next_word_list)

    print('Training successful.')
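
As a quick sanity check, assuming the placeholder corpus path above points at a real text file, we can train the model and peek at the learned distributions:

train_markov_model()

# A few of the most common opening words
print(sorted(initial_word.items(), key=lambda kv: -kv[1])[:5])

# Which words tend to follow the pair ('of', 'the')?
print(transitions.get(('of', 'the')))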

Text Generation

Now that the Markov model is trained, let us generate a random poem.
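
A sketch of that sampling code is below, consistent with the distributions built during training (sample_word and number_of_sentences are names introduced here for illustration, not necessarily those used in the linked notebook):

import random

number_of_sentences = 10

def sample_word(distribution):
    # Draw a word from a {word: probability} dict
    p0 = random.random()
    cumulative = 0.0
    for word, probability in distribution.items():
        cumulative += probability
        if p0 < cumulative:
            return word
    return word  # guard against floating-point rounding

def generate():
    for _ in range(number_of_sentences):
        sentence = []
        # Sample the first word from the initial distribution
        word0 = sample_word(initial_word)
        sentence.append(word0)
        # Sample the second word from the 1st-order distribution
        word1 = sample_word(second_word[word0])
        sentence.append(word1)
        # Keep sampling 2nd-order transitions until we hit 'END'
        while True:
            word2 = sample_word(transitions[(word0, word1)])
            if word2 == 'END':
                break
            sentence.append(word2)
            word0, word1 = word1, word2
        print(' '.join(sentence))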

generate()

full of sweet dreams and health and quiet breathing
its loveliness increases it will never
till i heard chapman speak out loud and bold
with streams that deepen freshly into bowers
about old forests while the willow trails
hum about globes of clover and sweet peas
there let its trumpet blow and quickly dress
the passion poesy glories infinite
a fullborn beauty new and exquisite
hide in deep herbage and ere yet the bees

Conclusion

This surely doesn’t sound like a poem by John Keats :) because the model is only looking at a pair of words to predict the next word. While a short run of words may make sense, there is no flow to the poem as a whole. A Markov model can work remarkably well when generating the next word doesn’t depend on words further back in the history. Where the next word has to flow from that longer history, other techniques such as LSTMs gain relevance.

Hope you guys found the application of Markov Model interesting and easy to grasp! Do reach out to me in case of questions.


Harjot Kaur

I'm a strategy and analytics expert passionate about simplifying ML and AI. My medium articles aim to demystify these technologies and share their applications.