Did the movie get a positive or negative review?
This article explains the steps for building a sentiment classification model using the “IMDB dataset of 50k movie reviews”.
Understanding the dataset
We load the data from a CSV file into a pandas data frame and note the data type of each column.
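A minimal sketch of this loading step is shown here. It reads a tiny in-memory sample rather than the real file, so the two rows are made up for illustration, and the column names `review` and `sentiment` are assumptions about the CSV layout.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV file. The rows are invented and the
# column names "review" and "sentiment" are assumptions about the file layout.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and flat acting.",negative\n'
)

# With the real dataset, pass the path to the downloaded CSV file instead.
df = pd.read_csv(sample_csv)

# Inspect the first rows and the data type of each column.
print(df.head())
print(df.dtypes)
```

Checking `df.dtypes` at this point shows which columns still hold text and will need converting before modelling.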
Importing libraries and packages
All necessary libraries and packages for text preprocessing and building the model are imported.
Text preprocessing
This process refers to the steps taken to convert text from human language into a machine-readable format for further analysis. The better the text is preprocessed, the more accurate the results tend to be.
It includes the following steps:
- Changing data types
A classification model expects a numeric target, so we change the data type of the sentiment column using the pandas apply function.
- Removing punctuation
- Removing stopwords
Stop words are commonly used words such as ‘a’, ‘the’, and ‘an’ that add little meaning to a sentence. Removing them reduces the dataset size and the time needed to train the model without significantly affecting accuracy. It also leaves the input with fewer, more meaningful tokens, which can improve classification accuracy.
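The three preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the article's exact code: the two-row data frame, the `positive`/`negative` label values, the short `STOP_WORDS` set, and the helper `clean_text` are all hypothetical, and lowercasing is an added step so that capitalised stop words such as "The" are matched. A real pipeline would likely use a fuller stop word list from an NLP library.

```python
import string

import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
df = pd.DataFrame({
    "review": ["A wonderful, moving film!", "The plot was dull; the acting was flat."],
    "sentiment": ["positive", "negative"],
})

# Changing data types: map the text label to an integer (1 = positive, 0 = negative).
df["sentiment"] = df["sentiment"].apply(lambda label: 1 if label == "positive" else 0)

# Small illustrative stop word set (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "was", "is", "and"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop stop words."""
    no_punct = text.lower().translate(PUNCT_TABLE)
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


# Removing punctuation and stop words from every review.
df["review"] = df["review"].apply(clean_text)
print(df)
```

Keeping the cleaning logic in one function applied with `apply` means the same transformation can be reused on new reviews at prediction time.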