Building a sentiment classification model

Did the movie get a positive or negative review?

Sentiment classification uses natural language processing and machine learning to interpret the emotions expressed in input text. It is a text analysis technique that detects polarity, that is, whether a piece of text is positive or negative.

This article explains the steps for building a sentiment classification model using the “IMDB dataset of 50k movie reviews”.

Understanding the dataset

We load the data from a CSV file into a data frame and make note of the data type in each column.
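As a minimal sketch of this step, the snippet below loads a CSV into a pandas data frame and prints the column types. The article does not give the file name, so an in-memory sample stands in for the real file, and the column name "review" is an assumption.

```python
import io

import pandas as pd

# Stand-in for the real CSV file. The "review" column name and the
# "positive"/"negative" label values are assumptions for illustration.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful little film.",positive\n'
    '"Dull and far too long.",negative\n'
)

# With the real dataset, pass the path of the CSV file instead.
df = pd.read_csv(sample_csv)

print(df.head())   # first few rows
print(df.dtypes)   # data type of each column
```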

Importing libraries and packages

All necessary libraries and packages for text preprocessing and building the model are imported.

Text preprocessing

This process refers to the steps taken to transform text from human language into a machine-readable format for further analysis. The better your text is preprocessed, the more accurate the results tend to be.

It includes the following:

  • Changing data types

We need to change the data type of the sentiment column to suit a classification model, so we use the pandas apply function to carry this out.
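A minimal sketch of the conversion, assuming the sentiment column holds the text labels "positive" and "negative" and that 1 and 0 are the target values:

```python
import pandas as pd

df = pd.DataFrame({
    "review": ["A wonderful little film.", "Dull and far too long."],
    "sentiment": ["positive", "negative"],  # assumed label values
})

# Map the text labels to integers with apply: 1 = positive, 0 = negative.
df["sentiment"] = df["sentiment"].apply(
    lambda label: 1 if label == "positive" else 0
)

print(df["sentiment"].tolist())
```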

  • Removing punctuation
  • Removing stopwords

Stop words are commonly used words such as ‘a’, ‘the’, and ‘an’ that do not add much meaning to the sentence. Removing them reduces the dataset size and the time needed to train the model without significantly affecting accuracy. It also leaves fewer, more meaningful tokens in the input, which can help classification accuracy.
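The two steps can be sketched with the standard library alone. The stop word set here is a small hand-picked sample chosen for illustration, not the list used in the article; libraries such as NLTK ship much longer ones.

```python
import string

# Small illustrative stop word set (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "and", "to", "in", "it"}


def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text):
    """Drop tokens that appear in STOP_WORDS (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


cleaned = remove_stopwords(
    remove_punctuation("The plot was thin, and the ending is a mess!")
)
print(cleaned)  # plot thin ending mess
```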

  • Stemming

Stemming refers to the process of converting the various related forms of a word to a common base form. For example, English has suffixes like “-ed” and “-ing” that can be cut off in order to map the words ‘play’, ‘playing’, and ‘played’ all to the same stem, ‘play’.
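To make the idea concrete, here is a deliberately naive suffix stripper that handles only the two suffixes mentioned above. It is a toy written for illustration; a real project would more likely use an established stemmer such as NLTK's PorterStemmer.

```python
def simple_stem(word):
    """Strip a trailing "ing" or "ed" if at least three letters remain."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [simple_stem(w) for w in ["play", "playing", "played"]]
print(stems)  # ['play', 'play', 'play']
```

The length check keeps short words such as ‘red’ or ‘sing’ from being cut down to a fragment.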

  • Removing URLs and HTML tags
  • Converting all letters to lower case

Lowercasing is one of the simplest and most effective forms of text preprocessing. It makes sure the same word is treated as one token regardless of capitalization, which helps with the consistency of the output.
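These last two steps could look roughly like the following. The regular expressions are simple assumptions that cover common cases such as `<br />` tags and http links, not every possible URL or markup form.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_text(text):
    """Strip HTML tags and URLs, lowercase, and collapse whitespace."""
    text = HTML_TAG_PATTERN.sub(" ", text)  # e.g. <br /> line breaks
    text = URL_PATTERN.sub(" ", text)
    text = text.lower()
    return " ".join(text.split())


print(clean_text("Great film!<br /><br />See http://example.com for MORE."))
# great film! see for more.
```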

Checking the length of reviews to input

This is a crucial step for choosing appropriate values for some of the hyperparameters used in the rest of the process, for example, the maximum length of the reviews to be input.
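A minimal sketch of such a check, assuming length is measured in whitespace-separated words (the article does not say which unit it used) and using a few made-up reviews:

```python
from statistics import mean

# Made-up reviews standing in for the preprocessed dataset.
reviews = [
    "a wonderful little film",
    "dull and far too long",
    "one of the best movies i have seen this year",
]

# Length of each review in words.
lengths = [len(review.split()) for review in reviews]

print("min:", min(lengths), "max:", max(lengths), "mean:", mean(lengths))
```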

Splitting the data set into train and test data

It is always better to first check the distribution of positive and negative reviews and make sure the train and test data each contain equal numbers of both categories for unbiased classification.
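One way to keep the classes balanced is a stratified split, sketched below in plain Python. The helper name `stratified_split`, the 20% test fraction, and the toy data are assumptions for illustration; scikit-learn's `train_test_split` with its `stratify` argument is a common library alternative.

```python
import random
from collections import Counter


def stratified_split(texts, labels, test_fraction=0.2, seed=42):
    """Split so each label keeps the same proportion in train and test."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)
        test += [(text, label) for text in items[:n_test]]
        train += [(text, label) for text in items[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


texts = [f"review {i}" for i in range(20)]
labels = [1, 0] * 10  # 10 positive, 10 negative

print(Counter(labels))  # distribution before splitting
train, test = stratified_split(texts, labels)
print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```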

Processing input

  • Tokenization
  1. fit_on_texts: This creates the vocabulary index based on word frequency. Every word gets a unique integer value starting from 1, since 0 is reserved for padding. The lower the integer, the more frequent the word.
  2. texts_to_sequences: This takes each word in the reviews and replaces it with its corresponding integer value assigned in the previous step.
  • Padding

When feeding sequences into a network for training, the sequences all need to be uniform in size. Currently, the sequences have varied lengths, so the next step is to make them all the same size by padding them with zeros, truncating them, or both.

I decided to use a length of 500 words since the mean length obtained was around 780.
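To show what the tokenization and padding steps do, here is a plain-Python sketch of the same ideas. It is not the Keras implementation; the function names are hypothetical, and the padding pads and truncates at the front of each sequence, which is an assumption about the settings used.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: 1 = most frequent, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its integer; unknown words are skipped."""
    return [
        [word_index[w] for w in text.split() if w in word_index]
        for text in texts
    ]


def pad(sequences, maxlen):
    """Make every sequence exactly maxlen long."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                             # truncate at the front
        padded.append([0] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


texts = ["good movie good cast", "bad movie"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index)
padded = pad(sequences, maxlen=3)

print(word_index)
print(sequences)
print(padded)
```

A tiny maxlen of 3 is used here only to show both padding and truncation in one run.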

  • Converting lists into arrays

The model

  • A simple model that uses word embeddings

This model produces output with an accuracy of 85.79%.
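The article does not show the architecture behind this figure, so the following is only a sketch of what a simple embedding-based classifier could look like in Keras. The layer choices and sizes, and the function name `build_embedding_model`, are assumptions.

```python
from tensorflow import keras


def build_embedding_model(vocab_size, embedding_dim):
    """Embedding -> average over the sequence -> small dense classifier."""
    model = keras.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(16, activation="relu"),    # assumed hidden size
        keras.layers.Dense(1, activation="sigmoid"),  # probability of "positive"
    ])
    model.compile(
        loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return model
```

The sigmoid output pairs with the binary cross-entropy loss, which suits a two-class positive/negative target.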

  • Long short-term memory (LSTM) models

These are a special kind of recurrent neural network that preserves information from inputs that have already passed through it, using the hidden state.

To read more about how an LSTM model functions please do refer to this link.

Types of LSTM models:

  1. Unidirectional LSTM: preserves the information of the past because the only inputs it has seen are from the past.
  2. Bidirectional LSTM: runs your inputs two ways, one from past to future and one from future to past. In the LSTM that runs backwards, you preserve information from the future and by combining the two hidden states at any point in time you can preserve information from both past and future.
  3. Multiple bidirectional LSTM: To achieve even better results, we can stack multiple bidirectional LSTM layers on top of each other, but we must take into consideration that this increases training time.

This model produces output with an accuracy of 86.88% and reduces the loss too.
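The article does not state which of the three variants produced this figure. As an illustration only, a stacked bidirectional LSTM could be sketched in Keras as follows; the layer sizes and the function name `build_stacked_bilstm` are assumptions.

```python
from tensorflow import keras


def build_stacked_bilstm(vocab_size, embedding_dim, units=64):
    """Two bidirectional LSTM layers on top of an embedding layer."""
    model = keras.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        # return_sequences=True hands the full sequence to the next LSTM layer.
        keras.layers.Bidirectional(
            keras.layers.LSTM(units, return_sequences=True)
        ),
        keras.layers.Bidirectional(keras.layers.LSTM(units // 2)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return model
```

Dropping the second `Bidirectional` layer gives a single bidirectional LSTM, and removing the `Bidirectional` wrapper gives the unidirectional variant.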

To see how the code works in practice, refer to this notebook.

As observed, LSTM models do increase the accuracy of the results obtained. To get better results, you could also include a spell check during preprocessing, or tune hyperparameters such as ‘vocab_size’, ‘embedding_dim’, and ‘maxlen’.

Please do have a look at this article of mine to see what other techniques can be implemented to improve the results obtained.

Hope this gave you a good insight into sentiment classification :)

Sri Lankan living in Berlin. Mathematics master's student at Freie Universität. Interested in data science and machine learning.
