
Text Generation With LSTM

To learn how Long Short-Term Memory (LSTM) networks work, please refer to this blog first.

Note: This article is a primitive implementation of text generation. To understand its shortcomings and the advancements that address them, you should go through this article first. An advancement on this article is provided here.

 

My idea behind starting this entire series of text-based tutorials is to build a fully functional chatbot. More importantly, making things easy to understand is the primary goal of every blog post I write.

Today I will give you an idea of how text generation with an LSTM works. We will also see how text generation can be used for data augmentation.

In simple words, text generation involves a system capable of generating text as if it were written by a human. However, reaching human-level text generation is a vast and challenging field. In this tutorial I will train the system on a small chunk of text (~5 MB); after training, the system will be able to generate related text on its own.


Text generation can be done in two ways:

1) Word generation. Word generation requires very powerful resources: a system that generates whole words should already be well trained on character generation, and if softmax is used as the final activation it can eat up a huge amount of memory, because a word has to be selected directly from millions or billions of word probabilities. Word generation is the penultimate goal of text-generating systems; the ultimate goal is to generate human-like, well-defined sentences with proper grammar.

2) Character generation. Character generation is a more basic and less resource-intensive operation. A character generation model works by predicting character Cn from the previous m characters Cn-1 to Cn-m. Every time a character is generated, it is added as a new character and helps in predicting the next one. Mathematically, in terms of a Markov chain, it can be represented as p(Cn | Cn-1, Cn-2, ..., Cn-m).
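To make the character-generation idea concrete before bringing in an LSTM, here is a minimal sketch (not the model used in this post) of a count-based character model: it estimates p(Cn | Cn-1, ..., Cn-m) from a toy string and samples the next character from those counts. All names and the toy corpus here are purely illustrative.

import random
from collections import Counter, defaultdict

text = "quick brown fox jumps over the lazy dog. " * 20  # toy corpus
m = 5  # number of previous characters to condition on

# count which character follows each m-character context
counts = defaultdict(Counter)
for i in range(len(text) - m):
    context = text[i:i + m]
    counts[context][text[i + m]] += 1

def next_char(context):
    # sample Cn from the estimated p(Cn | previous m characters)
    options = counts[context[-m:]]
    chars, freqs = zip(*options.items())
    return random.choices(chars, weights=freqs)[0]

seed = "quick"
for _ in range(40):
    seed += next_char(seed)
print(seed)

An LSTM replaces these raw counts with a learned model of the same conditional distribution, which is what the rest of this post builds.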


Let's understand the logic behind character generation first. While generating characters we use a "seed"; the seed provides all the previous characters, which then help in predicting the new character. Let's take the seed "quick brown fox jumps over the" and say the seed length is 15, so the last 15 characters decide the next character. The table below shows how the text grows character by character, and a small sketch after it illustrates the same sliding-seed loop.

SEED -> GENERATED CHARACTER
[quick brown fox jumps over th] -> [e]
[quick brown fox jumps over the] -> [ ]
[quick brown fox jumps over the ] -> [l]
[quick brown fox jumps over the l] -> [a]
[quick brown fox jumps over the la] -> [z]
[quick brown fox jumps over the laz] -> [y]
[quick brown fox jumps over the lazy] -> [ ]
[quick brown fox jumps over the lazy ] -> [d]
[quick brown fox jumps over the lazy d] -> [o]
[quick brown fox jumps over the lazy do] -> [g]
[quick brown fox jumps over the lazy dog] -> [.]
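Here is a minimal sketch of this sliding-seed loop, with a toy stand-in for the trained model. The predict_next_char function here just peeks at a known sentence; the real LSTM-based prediction is built later in this post.

# toy stand-in for a trained model: look up the next character in a known sentence
target = "quick brown fox jumps over the lazy dog."

def predict_next_char(last_chars):
    # a real model would return the most likely character given last_chars;
    # here we simply peek at where those characters sit in the target sentence
    pos = target.find(last_chars) + len(last_chars)
    return target[pos]

seed = "quick brown fox jumps over the"
seed_len = 15  # how many trailing characters the model looks at

while len(seed) < len(target):
    next_char = predict_next_char(seed[-seed_len:])
    seed += next_char
    print(seed)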

Iterating like that, the entire sentence is built. Training happens in a similar way; let's see how. The entire text that the machine has to learn is fed to the algorithm in the same fashion:

DATA -> LABEL
[brown fox jumps over th] -> [e]
[rown fox jumps over the] -> [ ]
[own fox jumps over the ] -> [q]
[wn fox jumps over the q] -> [u]
[n fox jumps over the qu] -> [i]
[ fox jumps over the qui] -> [c]
[fox jumps over the quic] -> [k]
[ox jumps over the quick] -> [ ]

In practice, we take somewhere between 20 and 200 previous characters to predict the next character.
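The DATA -> LABEL pairs shown above can be generated with a few lines of windowing; this is just a toy illustration (the real dataset construction with maxlen = 100 follows below):

text = "brown fox jumps over the quick brown fox"
window = 23  # number of characters used as data

for i in range(8):
    data = text[i:i + window]     # sliding window of previous characters
    label = text[i + window]      # the character that follows the window
    print('[' + data + '] -> [' + label + ']')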

Let's walk through the code step by step.

# imports used throughout this post (Keras with a Sequential model)
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM

# read file content
filename = "Xyz.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()

# all unique characters in the text
chars = set(raw_text)

All unique characters present in the text will look like this:

chars = {'\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '}', '~'}

All these characters will be converted to corresponding integers, as the machine does not understand characters.

# mapping character to integer
char_indices = dict((c, i) for i, c in enumerate(chars))
# mapping integer back to character, for decoding
indices_char = dict((i, c) for i, c in enumerate(chars))
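As a quick illustration (a toy example with a made-up string, not part of the original code), this is what encoding and decoding with such mappings looks like:

sample = "the quick brown fox"
sample_chars = set(sample)
c2i = dict((c, i) for i, c in enumerate(sample_chars))
i2c = dict((i, c) for i, c in enumerate(sample_chars))

encoded = [c2i[c] for c in sample]          # text -> list of integers
decoded = ''.join(i2c[i] for i in encoded)  # integers -> text
print(encoded)
print(decoded)  # "the quick brown fox"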

Here we define how many previous characters to look at when predicting the next character.

# number of previous characters required to predict the next character
maxlen = 100

For the entire text we construct the training dataset; the logic behind constructing it has already been discussed. We take a window of 100 characters and define the 101st character as the label, then slide this window over the entire length of the text. While constructing the dataset, every character is converted to its corresponding integer, and at the end everything is converted to numpy arrays.

sentences = []
next_chars = []
for i in range(0, len(raw_text) - maxlen, 1):
    sentences.append(raw_text[i: i + maxlen])
    next_chars.append(raw_text[i + maxlen])
print('nb sequences:', len(sentences))

X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
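To get a feel for these shapes on something small (a purely illustrative toy input, with a much shorter window than the maxlen = 100 used above, and assuming the imports above):

toy_text = "hello world, hello lstm"
toy_chars = set(toy_text)
toy_maxlen = 5

toy_X = np.zeros((len(toy_text) - toy_maxlen, toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_text) - toy_maxlen, len(toy_chars)), dtype=bool)
print(toy_X.shape)  # (18, 5, 12) -> (samples, window length, unique characters)
print(toy_y.shape)  # (18, 12)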

Let's say we have 5000 such samples; then the dimension of X would be (5000, 100, len(chars)) and y would be (5000, len(chars)). In y, the integer position of the character to be predicted is marked as 1 and everything else stays 0. In short, this is called one-hot encoding. Now that we are done with the dataset, we define the model.

# defining the model
model = Sequential()
model.add(LSTM(512, input_shape=(maxlen, len(chars)), return_sequences=True))
model.add(Dropout(0.20))

# you may uncomment these extra layers for a bigger dataset; I'm not using them here
# model.add(LSTM(512, return_sequences=True))
# model.add(Dropout(0.20))
# model.add(LSTM(512, return_sequences=True))
# model.add(Dropout(0.20))

model.add(LSTM(256, return_sequences=False))
model.add(Dropout(0.20))

model.add(Dense(len(chars)))
model.add(Activation('softmax'))

As I am using a very small dataset, I am not using some parts of the model; you may uncomment those layers when working with a bigger dataset.

# compiling the model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
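If you want to double-check the architecture and parameter count before training, Keras can print a summary of the stacked layers:

# prints each layer with its output shape and number of parameters
model.summary()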

The last piece is the text generation function. It takes two inputs: the first is the current iteration number and the second is the trained model.

def generatetext(i, model):
    """
    Generate text from a trained model.
    i     = can be any integer; while training you may pass the epoch iterator
            as i to keep watch on the quality of the model.
    model = a trained model
    """
    # the seed text provides the previous n (here 100) characters on the basis
    # of which the (n+1)th character will be predicted.
    seed_text = "Another great John Wayne movie. Lots of action. Lots of stars. A great attempt to keep reality in play and of course the good guys win.".lower()
    generated = '' + seed_text[-100:]
    print(i, "Seed", seed_text)

    # will print the next 300 characters
    for iteration in range(300):
        # create the x vector from the seed to predict on,
        # using the same one-hot encoding as the training data
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(seed_text[-100:]):
            x[0, t, char_indices[char]] = 1.

        # predict the next character
        preds = model.predict(x, verbose=0)[0]
        next_index = np.argmax(preds)
        next_char = indices_char[next_index]

        # append the next character to the seed text; on the basis of the new
        # last 100 characters, the character after that is generated, and so on
        generated += next_char
        seed_text = seed_text[1:] + next_char
        # print(seed_text, next_char)

    print('follow up with: ' + generated)

This function takes the model and uses the seed text. At each iteration the model predicts the next character and appends it to the seed; the new seed is then given to the model, and this continues for the specified number of steps. Here it is repeated 300 times, so for each seed the next 300 characters are generated.
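One thing to note is that generatetext always picks the most probable character (np.argmax), which tends to produce repetitive text. A common alternative, not used in this post, is to sample from the predicted distribution with a "temperature"; here is a minimal sketch of such a sampling helper:

def sample(preds, temperature=1.0):
    # re-weight the predicted probabilities and draw one index from them;
    # lower temperature -> safer, more repetitive text; higher -> more surprising
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

# inside the generation loop you could then use, for example:
# next_index = sample(preds, temperature=0.5)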

Finally, we train the model:

# run for the specified number of epochs
# (newer Keras versions use epochs= instead of nb_epoch= in model.fit)
epochs = 20
for i in range(0, epochs):
    model.fit(X, y, batch_size=1000, nb_epoch=1)
    model.save_weights('data_augmentation_' + str(i) + '.h5')
    generatetext(i, model)

While the model is being trained, we can watch the beauty of the text generation engine unfold iteration over iteration.
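Since the weights are saved after every epoch, you can also come back later, reload any checkpoint, and generate text without retraining. A minimal sketch (the filename is one of the checkpoints produced by the loop above):

# assuming the model defined above exists in this session,
# load weights from one of the checkpoints saved during training
model.load_weights('data_augmentation_19.h5')
generatetext(0, model)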

That is all about how text generation works. I have applied the same logic to a couple of datasets; to see how the model learns and generates text after each iteration, follow my next blog.

If you like this tutorial, please share it with your colleagues. Discuss doubts and ask for changes on GitHub. It's free; no charges for anything. Let me get inspired by your responses and deliver even better content.
