AutoCorrect / Spell Check using Deep Learning in Python!

Piyush M
5 min read · May 30, 2021

Introduction

We have all been victims of AutoCorrect at least once in our lives, and our usual reaction is “Damn AutoCorrect ( ͠❛ ͟ʖ ͠❛ )”. Today, we are going to build our very own character-level Seq2Seq model for AutoCorrect and Spell Check using Python and Keras. Want to check out the code now? Click here

Data Preprocessing

We still do not have a good off-the-shelf dataset that can be used to train such a model, so we build a custom one. Here we use the “English to French” dataset commonly used for machine translation.

!curl -O http://www.manythings.org/anki/fra-eng.zip
!unzip fra-eng.zip

Our process is quite simple: we take the English data from the corpus (ignoring the French text), add some noise to the text, and use that as the input to our model. The target output is the unchanged English text from the corpus.

Here we take 120,000 samples from the dataset so as not to exceed memory capacity. If you have a GPU with more memory, you can play around with the number of samples. The data is read in lines 10 and 11 and stored in “lines”. We loop through the dataset on line 13. Each sentence is repeated 5 times so the model learns the data with different noise, which further increases the variance.

We use “\t” as our SOS (start of sentence) token and “\n” as our EOS (end of sentence) token. This lets the model know to stop predicting the sequence once it reaches the EOS token. These tokens are added to our target text with:

target_text = "\t" + input_text + "\n"
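
The preprocessing code itself is embedded in the linked notebook; as a rough, minimal sketch of what those lines do (the file name fra.txt, the variable names, and the add_noise helper sketched below are assumptions on my part):

import numpy as np

num_samples = 120_000  # cap the corpus so we stay within memory

# fra.txt comes from the fra-eng.zip archive; each line is "English<TAB>French<TAB>attribution"
with open("fra.txt", encoding="utf-8") as f:
    lines = f.read().split("\n")

input_texts, target_texts = [], []
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text = line.split("\t")[0].lower()   # keep only the English side
    target_text = "\t" + input_text + "\n"     # "\t" = SOS, "\n" = EOS
    for _ in range(5):                         # 5 noisy copies per sentence
        input_texts.append(add_noise(input_text))  # add_noise is sketched below
        target_texts.append(target_text)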

We add noise to our data in line 21 by replacing random characters in the text. The number of characters changed is determined by

np.random.choice(np.arange(0, 2), p=[0.1, 0.9])
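
The corruption routine itself lives in the embedded gist. A hypothetical add_noise in the same spirit, where for each word the draw above decides how many characters (0 or 1) to substitute; the exact scheme in the original may differ:

import numpy as np
import string

ALPHABET = list(string.ascii_lowercase + " ")

def add_noise(text):
    """Corrupt a sentence by substituting random characters (hypothetical helper)."""
    noisy_words = []
    for word in text.split(" "):
        chars = list(word)
        # 0 substitutions with probability 0.1, 1 substitution with probability 0.9
        n_changes = np.random.choice(np.arange(0, 2), p=[0.1, 0.9])
        for _ in range(n_changes):
            if chars:
                pos = np.random.randint(0, len(chars))
                chars[pos] = np.random.choice(ALPHABET)
        noisy_words.append("".join(chars))
    return " ".join(noisy_words)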

Sample Output

    input             target
    uom misled you    tom misled you\n
    tou uisled you    tom misled you\n
    tomvmisledvyou    tom misled you\n
    tgm misled ygu    tom misled you\n
    tom misled you    tom misled you\n

Since ML/DL models do not understand raw text, we need to convert our text into numerical data before feeding it to the model. The “input_text” and “target_text” are tokenized in lines 25 to 30. This ensures that every character in the data has a corresponding numerical value.

The input token index and target token index are computed in lines 14 and 15. The whole dataset is converted to numerical data based on these tokens in lines 27 to 39. We finally get “encoder_input_data” and “decoder_target_data”, which can be fed to our model for training.
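
The vectorization code is again an embedded gist. A minimal one-hot sketch in the spirit of the Keras seq2seq tutorial referenced below (decoder_input_data is assumed here because the model takes two inputs; the original may instead feed integer indices into an Embedding layer):

# Build the character vocabularies and token indices
input_characters = sorted(set("".join(input_texts)))
target_characters = sorted(set("".join(target_texts)))
input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}

max_encoder_seq_length = max(len(t) for t in input_texts)
max_decoder_seq_length = max(len(t) for t in target_texts)

# One-hot encode inputs and targets
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, len(input_characters)), dtype="float32")
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, len(target_characters)), dtype="float32")
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, len(target_characters)), dtype="float32")

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0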

Input Token Index:

{' ': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8,
 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16,
 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24,
 'y': 25, 'z': 26}

Model

Sequence to Sequence (often abbreviated to seq2seq) models are a special class of Recurrent Neural Network architectures typically used (but not restricted) to solve complex language problems like Machine Translation, Question Answering, creating Chatbots, Text Summarization, etc. Here we treat the spell-check and autocorrect task as similar to machine translation, and therefore use the same kind of model.

Kindly read the blog posts below to understand seq2seq models in detail.
1. https://www.analyticsvidhya.com/blog/2020/08/a-simple-introduction-to-sequence-to-sequence-models/
2. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Model Architecture

Configurations (you can play with these depending on your compute):

batch_size = 64   # Batch size for training.
epochs = 100      # Number of epochs to train for.
latent_dim = 128  # Latent dimensionality of the encoding space.
output_dim = 64

We have a single LSTM layer each for the encoder and the decoder, with a dropout of 40%. The encoder states (h, c) are computed and passed to the decoder layer as its initial state. The whole model is then built in line 22, with two inputs and one output. (We are applying Teacher Forcing to our model; to learn more about it, click here.)
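
The model definition is embedded as a gist; a sketch of a matching architecture under the configuration above (the wiring follows the Keras seq2seq tutorial, so details such as an Embedding layer in the original may differ):

from tensorflow import keras
from tensorflow.keras import layers

num_encoder_tokens = len(input_token_index)
num_decoder_tokens = len(target_token_index)

# Encoder: a single LSTM; we keep only its final states (h, c)
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder_lstm = layers.LSTM(latent_dim, return_state=True, dropout=0.4)
_, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: a single LSTM initialised with the encoder states. Teacher forcing:
# at each step it is fed the ground-truth previous character as input.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Two inputs, one output
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)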

We compile our model with the “adam” optimizer and “categorical_crossentropy” loss. The model is trained for 15 epochs with a batch_size of 64 and a validation split of 20%.
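
In code, that training step might look like this (reusing the arrays and model from the sketches above):

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,   # 64, from the configuration above
    epochs=15,               # the post trains for 15 epochs
    validation_split=0.2,
)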

Inference

Once we have our model ready, we can perform inference on our input text. We first reconstruct the same model used for training.
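
A sketch of that reconstruction, following the standard Keras seq2seq inference setup and reusing the trained layers from the model sketch above:

# Encoder inference model: maps an input sequence to its state vectors
encoder_model = keras.Model(encoder_inputs, encoder_states)

# Decoder inference model: runs one step at a time, fed its own previous states
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)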

Next, we initialize the reverse token index to get back our English words once the model has auto-corrected our text.

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

Now we define our function “decode_sequence”, which gives back the corrected text.

Here the model keeps predicting the next character until the EOS token is produced or “max_decoder_seq_length” is exceeded. To perform inference on our text, we can run the following snippet.
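
The original snippet lives in the linked notebook; a minimal sketch of decode_sequence and the inference loop, under the same one-hot assumptions as the earlier sketches:

def decode_sequence(input_seq):
    # Encode the input sentence into state vectors
    states_value = encoder_model.predict(input_seq, verbose=0)

    # Start the target sequence with the SOS token "\t"
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    decoded_sentence = ""
    while True:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value, verbose=0)

        # Greedily pick the most likely next character
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]

        # Stop on the EOS token or when the sequence gets too long
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            break
        decoded_sentence += sampled_char

        # Feed the sampled character and the updated states back in
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0
        states_value = [h, c]

    return decoded_sentence

for seq_index in range(4):
    input_seq = encoder_input_data[seq_index : seq_index + 1]
    print("Input :", input_texts[seq_index])
    print("Output:", decode_sequence(input_seq))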

We get the following output

Input : tgm misled ygu
Output : tom missed you
Input : yeu mvst lefve
Output : you must leave

Download the whole code or run the code on Google Colab using this link:
https://github.com/piyush0511/SpellChecker-AutoCorrect/blob/main/SpellCheck%20-%20seq2seq.ipynb

Further Improvements

  1. We reached an accuracy of 84% on training and 76% on validation. We could add regularization methods and more dropout to reduce the overfitting.
  2. We used only one LSTM layer each for encoding and decoding; adding more LSTM layers might give better results.
  3. We could use a Bi-Directional LSTM instead of a Uni-Directional one for the encoding layer.
  4. We used only 120,000 samples for training; English is sparse, and the model needs a huge amount of data to perform well.

References

  1. https://bhashkarkunal.medium.com/spelling-correction-using-deep-learning-how-bi-directional-lstm-with-attention-flow-works-in-366fabcc7a2f
  2. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
  3. https://arxiv.org/abs/1409.3215
  4. https://arxiv.org/abs/1406.1078

Stay tuned for the next blog, as we will utilize Attention Mechanism to further improve our accuracy.

Thank you for reading,
Piyush
