Analyzing Movie Reviews - Sentiment Analysis

Abstract

The problem at hand is sentiment analysis or opinion mining, where we want to analyze textual documents and predict their sentiment or opinion based on their content. Sentiment analysis is perhaps one of the most popular applications of natural language processing and text analytics, with a vast number of websites, books, and tutorials on the subject. It typically works best on subjective text, where people express opinions, feelings, and moods. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews of movies, places, commodities, and more. The idea is to analyze and understand people's reactions toward a specific entity and take insightful actions based on their sentiment.

Problem Statement 

The main objective in this internship project is to predict the sentiment for a number of movie reviews obtained from the Internet Movie Database (IMDb). This dataset contains 50,000 movie reviews that have been pre-labeled with "positive" and "negative" sentiment class labels based on the review content. Besides these, there are additional movie reviews that are unlabeled. The dataset can be obtained from http://ai.stanford.edu/~amaas/data/sentiment/, courtesy of Stanford University and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. They provide the data as raw text as well as in an already processed bag-of-words format. We will only be using the raw labeled movie reviews for our analyses. Hence our task will be to predict the sentiment of 10,000 labeled movie reviews and use the remaining 40,000 reviews for training our supervised models.

Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, Machine Learning, and linguistics to extract important information or data points from unstructured text. This in turn can help us derive qualitative outputs like the overall sentiment being on a positive, neutral, or negative scale and quantitative outputs like the sentiment polarity, subjectivity, and objectivity proportions.

In this Coding Internship project by Suven Consultants and Technology Pvt. Ltd., we focus on analyzing a large corpus of movie reviews and deriving their sentiment.


In this first part, we cover a wide variety of techniques for analyzing sentiment, which include the following.

  • Unsupervised lexicon-based models
  • Traditional supervised Machine Learning models

Besides looking at various approaches and models, we also focus on important aspects of the Machine Learning pipeline, including text pre-processing, normalization, and in-depth analysis of models, including model interpretation and topic models. The key idea is to understand how to tackle a problem like sentiment analysis on unstructured text, learn various techniques and models, and understand how to interpret the results. This will enable you to use these methodologies on your own datasets in the future. Let's get started!



Sentiment Analysis


Sentiment polarity is typically a numeric score that's assigned to both the positive and negative aspects of a text document based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has 0 polarity, since it does not express any specific sentiment; positive sentiment has polarity > 0, and negative sentiment has polarity < 0. Of course, you can always change these thresholds based on the type of text you are dealing with.

How to classify Sentiment?

Machine Learning:

This approach employs a machine-learning technique and diverse features to construct a classifier that can identify text expressing sentiment. Nowadays, deep-learning methods are popular because they learn feature representations directly from the data.

Lexicon-Based:

This method uses a lexicon of words annotated with polarity scores to decide the overall sentiment score of a given piece of content. The strongest asset of this technique is that it does not require any training data, while its weakest point is that a large number of words and expressions are not included in sentiment lexicons.

Hybrid:

The combination of machine learning and lexicon-based approaches to sentiment analysis is called hybrid. Though not commonly used, this method usually produces more promising results than either of the approaches mentioned above.


Notes: The NLP libraries used include spacy, nltk, and gensim. Do remember to check that your installed nltk version is at least 3.2.4; otherwise, the ToktokTokenizer class may not be present. If you want to use a lower nltk version for some reason, you can use any other tokenizer, like the default word_tokenize() based on the TreebankWordTokenizer. The gensim version should be at least 2.3.0, and for spacy, the version used was 1.9.0. We recommend using the recently released latest version of spacy (version 2.x), which has fixed several bugs and added several improvements.

Text Pre-Processing and Normalization

An initial step in text and sentiment classification is pre-processing. A number of techniques are applied to the data to improve classification effectiveness. This enables standardization across the document corpus, which helps build meaningful features and reduces the dimensionality and noise that can be introduced by factors like irrelevant symbols, special characters, and XML and HTML tags.

Cleaning Text - strip HTML

Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing sentiment. Hence we need to make sure we remove them before extracting features. The BeautifulSoup library does an excellent job of providing the necessary functions for this. Our strip_html_tags(...) function handles cleaning and stripping out HTML tags.
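As an illustration, a minimal sketch of such a function using BeautifulSoup (pip install beautifulsoup4) might look as follows; the project notebook's exact implementation may differ slightly.

```python
from bs4 import BeautifulSoup

def strip_html_tags(text):
    """Parse the document and keep only the visible text, dropping all HTML tags."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

print(strip_html_tags("<br/><p>Great movie, <b>loved</b> it!</p>"))
# Great movie, loved it!
```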

Removing accented characters

In our dataset, we are dealing with reviews in the English language, so we need to make sure that characters in any other format, especially accented characters, are converted and standardized into ASCII characters. A simple example is converting é to e. Our remove_accented_chars(...) function helps us in this respect.
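A minimal sketch using Python's built-in unicodedata module; this is a common way to implement such a function, though the project code may differ.

```python
import unicodedata

def remove_accented_chars(text):
    # Decompose accented characters (NFKD), then drop the non-ASCII combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")

print(remove_accented_chars("Sómě Áccěntěd cafés"))
# Some Accented cafes
```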


Expanding Contractions

In the English language, contractions are shortened versions of words or syllables. Contractions pose a problem in text normalization because we have to deal with special characters like the apostrophe, and we also have to convert each contraction to its expanded, original form. Our expand_contractions(...) function uses regular expressions and a mapping of contractions to their expansions to expand all contractions in our text corpus.
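Here is a hedged sketch of the idea with a deliberately tiny contraction map; the actual project presumably uses a much larger mapping (often a CONTRACTION_MAP dictionary with a hundred or more entries).

```python
import re

# Illustrative subset only; a real contraction map is much larger
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "you'll": "you will",
}

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # Build one regex that matches any known contraction, case-insensitively
    pattern = re.compile("({})".format("|".join(re.escape(k) for k in contraction_mapping)),
                         flags=re.IGNORECASE)

    def expand_match(match):
        matched = match.group(0)
        expanded = contraction_mapping.get(matched.lower(), matched)
        # Preserve a leading capital, e.g. "Don't" -> "Do not"
        if matched[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(expand_match, text)

print(expand_contractions("You'll love it, don't miss it!"))
# You will love it, do not miss it!
```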


Removing Special Characters

Simple regexes can be used to achieve this. Our remove_special_characters(...) function helps us remove special characters. In our code we have retained numbers, but you can also remove them if you do not want them in your normalized corpus.
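A minimal sketch with a single regex; the remove_digits flag here is an illustrative parameter for optionally dropping numbers, as mentioned above.

```python
import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters and whitespace; keep digits too unless remove_digits is set
    pattern = r"[^a-zA-Z0-9\s]" if not remove_digits else r"[^a-zA-Z\s]"
    return re.sub(pattern, "", text)

print(remove_special_characters("Well this was fun! What do you think? 8/10"))
# Well this was fun What do you think 810
```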


Lemmatizing text

Word stems are usually the base form of possible words that can be created by attaching affixes like prefixes and suffixes to the stem to create new words. This is known as inflection. The reverse process of obtaining the base form of a word is known as stemming. The nltk package offers a wide range of stemmers, like the PorterStemmer and LancasterStemmer. Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word, not the root stem. The difference is that the root word is always a lexicographically correct word present in the dictionary, whereas the root stem may not be. We use lemmatization only in our normalization pipeline, to retain lexicographically correct words. The function lemmatize_text(...) helps us with this aspect.
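A minimal sketch using spacy; it assumes the small English model is installed (python -m spacy download en_core_web_sm). Older spacy versions (1.x) load the model with spacy.load('en') instead.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):
    # Replace every token with its dictionary root form (lemma)
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatize_text("My system keeps crashing, his crashed yesterday"))
# roughly: "my system keep crash , his crash yesterday" (output varies by model version)
```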


Removing Stopwords

Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually the words with the highest frequency if you compute a simple term or word frequency over a document corpus. Words like a, an, and the are considered stopwords. There is no universal stopword list; we use a standard English stopword list from nltk, and you can add your own domain-specific stopwords if needed. The function remove_stopwords(...) removes stopwords and retains the words with the most significance and context in a corpus.
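A sketch using nltk's English stopword list with the ToktokTokenizer mentioned in the notes; it requires nltk.download('stopwords') on first use.

```python
import nltk
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words("english")

def remove_stopwords(text, is_lower_case=False):
    tokens = [token.strip() for token in tokenizer.tokenize(text)]
    if is_lower_case:
        filtered_tokens = [t for t in tokens if t not in stopword_list]
    else:
        filtered_tokens = [t for t in tokens if t.lower() not in stopword_list]
    return " ".join(filtered_tokens)

print(remove_stopwords("The, and, if are stopwords, computer is not"))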


Normalize text corpus - tying it all together

We use all these components and tie them together in the following function called normalize_corpus(...), which can be used to take a document corpus as input and return the same corpus with cleaned and normalized text documents.
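A rough sketch of how the pieces fit together, assuming the helper functions shown above; the flags mirror the steps described in this section, though the actual signature in the project code may differ.

```python
def normalize_corpus(corpus, html_stripping=True, accented_char_removal=True,
                     contraction_expansion=True, text_lower_case=True,
                     text_lemmatization=True, special_char_removal=True,
                     stopword_removal=True):
    normalized_corpus = []
    for doc in corpus:
        if html_stripping:
            doc = strip_html_tags(doc)
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        if contraction_expansion:
            doc = expand_contractions(doc)
        if text_lower_case:
            doc = doc.lower()
        if text_lemmatization:
            doc = lemmatize_text(doc)
        if special_char_removal:
            doc = remove_special_characters(doc)
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
        normalized_corpus.append(doc)
    return normalized_corpus
```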


Load and normalize data

We can now load our IMDb movie reviews dataset, use the first 40,000 reviews for training models and the remaining 10,000 reviews as the test dataset to evaluate model performance.
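A sketch of the loading and splitting step; the CSV file name and column names here are assumptions (a file with 'review' and 'sentiment' columns), so adapt them to however you stored the raw dataset.

```python
import numpy as np
import pandas as pd

dataset = pd.read_csv("movie_reviews.csv")  # hypothetical file name
reviews = np.array(dataset["review"])
sentiments = np.array(dataset["sentiment"])

# First 40,000 reviews for training, remaining 10,000 held out for testing
train_reviews, train_sentiments = reviews[:40000], sentiments[:40000]
test_reviews, test_sentiments = reviews[40000:], sentiments[40000:]

# Normalize the test reviews for the unsupervised lexicon models below
norm_test_reviews = normalize_corpus(test_reviews)
```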


Sentiment Analysis - Unsupervised Lexical

Even though we have labeled data, this section should give you a good idea of how lexicon-based models work, and you can apply the same techniques to your own datasets when you do not have labeled data.

Unsupervised sentiment analysis models use well curated knowledge bases, ontologies, lexicons, and databases that have detailed information pertaining to subjective words and phrases, including sentiment, mood, polarity, objectivity, subjectivity, and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words, specifically aligned toward sentiment analysis. Usually these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of negative or positive score), parts of speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality, and so on. You can use these lexicons to compute the sentiment of a text document by matching the presence of specific words from the lexicon, looking at additional factors like the presence of negation, surrounding words, overall context, and phrases, and aggregating the overall sentiment polarity scores to decide the final sentiment score.

There are several popular lexicon models used for sentiment analysis. Some of them are mentioned as follows.

  • Bing Liu’s Lexicon
  • MPQA Subjectivity Lexicon
  • Pattern Lexicon
  • AFINN Lexicon
  • SentiWordNet Lexicon
  • VADER Lexicon

This is not an exhaustive list of lexicon models, but it definitely lists among the most popular ones available today. Since we have labeled data, it will be easy for us to see how well the actual sentiment values for these movie reviews match our lexicon model-based predicted sentiment values. We will be covering the last three lexicon models in more detail, use them to predict sentiment, and see how well each model performs based on model evaluation metrics like accuracy, precision, recall, and F1-score.

Sentiment Analysis with AFINN

The AFINN lexicon is perhaps one of the simplest and most popular lexicons used extensively for sentiment analysis. It is a list of words rated for valence with an integer between minus five (negative) and plus five (positive). The current version of the lexicon is AFINN-en-165.txt, and it contains over 3,300 words, each with an associated polarity score. The author has also created a nice wrapper library on top of this in Python called afinn, which we will be using for our analysis. AFINN also takes into account other aspects like emoticons and exclamations. We can instantiate an Afinn object and use it to compute the polarity of our chosen four sample reviews, as sketched below. The results let you compare the actual sentiment label for each review with the predicted sentiment polarity score. A negative polarity typically denotes negative sentiment.
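A minimal sketch of scoring with the afinn package (pip install afinn); the thresholding helper below is an illustrative addition, not part of the library itself.

```python
from afinn import Afinn

afn = Afinn(emoticons=True)

print(afn.score("This movie was absolutely awesome!"))   # positive score (e.g. 4.0)
print(afn.score("A horrible, boring waste of time :("))  # negative score

def afinn_sentiment(review, threshold=0.0):
    # Map the raw valence score to a sentiment label
    return "positive" if afn.score(review) >= threshold else "negative"

predicted_sentiments = [afinn_sentiment(review) for review in norm_test_reviews]
```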

We get an overall F1-score of 72%, which is quite decent considering it's an unsupervised model. Looking at the confusion matrix, we can clearly see that quite a number of positive reviews have been misclassified as negative (1,848), which leads to the lower recall of 63% for the positive sentiment class. Performance for the negative class is better with regard to recall and F1-score: we correctly predicted 4,131 out of 5,041 negative reviews, but precision is 69% because of the many wrong negative predictions made on positive reviews.

Sentiment Analysis with SentiWordNet

The WordNet corpus is definitely one of the most popular corpora for the English language, used extensively in natural language processing and semantic analysis. WordNet gave us the concept of synsets, or synonym sets. The SentiWordNet lexicon is based on WordNet synsets and can be used for sentiment analysis and opinion mining. The SentiWordNet lexicon typically assigns three sentiment scores to each WordNet synset: a positive polarity score, a negative polarity score, and an objectivity score. We will be using the nltk library, which provides a Pythonic interface into SentiWordNet. Consider the adjective awesome; we can inspect its synset sentiment scores as shown below.
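For example, we can look up the synset scores for awesome as follows (this requires nltk.download('sentiwordnet') and nltk.download('wordnet') on first use).

```python
from nltk.corpus import sentiwordnet as swn

# Take the first (most common) adjective synset for 'awesome'
awesome = list(swn.senti_synsets("awesome", "a"))[0]
print("Positive polarity score:", awesome.pos_score())
print("Negative polarity score:", awesome.neg_score())
print("Objective score:", awesome.obj_score())
```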


Let's now build a generic function to extract and aggregate sentiment scores for a complete textual document based on matched synsets in that document. Our function basically takes in a movie review, tags each word with its corresponding POS tag, extracts sentiment scores for any matched synset token based on its POS tag, and finally aggregates the scores; a sketch follows. We can clearly see the predicted sentiment along with sentiment polarity scores and an objectivity score for each sample movie review, depicted in formatted dataframes.
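The following is a hedged sketch of such an aggregation function; the project's actual implementation may handle pre-processing and word-sense selection differently. It needs nltk's punkt and averaged_perceptron_tagger resources downloaded.

```python
import nltk
from nltk.corpus import sentiwordnet as swn, wordnet as wn

def analyze_sentiment_sentiwordnet(review, threshold=0.0):
    pos_score = neg_score = obj_score = token_count = 0
    # Map Penn Treebank tag prefixes to WordNet POS tags
    tag_map = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(review)):
        wn_tag = tag_map.get(tag[:2])
        if wn_tag is None:
            continue
        synsets = list(swn.senti_synsets(word, wn_tag))
        if not synsets:
            continue
        ss = synsets[0]  # use the most common sense of the word
        pos_score += ss.pos_score()
        neg_score += ss.neg_score()
        obj_score += ss.obj_score()
        token_count += 1
    if token_count == 0:
        return "neutral"
    # Net polarity per matched token decides the final label
    final_score = (pos_score - neg_score) / token_count
    return "positive" if final_score >= threshold else "negative"

print(analyze_sentiment_sentiwordnet("The plot was dull and the acting was awful"))
```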


We get an overall F1-score of 55%, which is definitely a step down from our AFINN-based model. While fewer negative reviews are misclassified as positive, the other aspects of model performance suffer.

Sentiment Analysis with VADER

The VADER lexicon, developed by C.J. Hutto, is a lexicon paired with a rule-based sentiment analysis framework, specifically tuned to analyze sentiment in social media. VADER stands for Valence Aware Dictionary and Sentiment Reasoner. You can use it via nltk's interface in the nltk.sentiment.vader module, and you can also download the actual lexicon or install the framework from https://github.com/cjhutto/vaderSentiment. The file titled vader_lexicon.txt contains the sentiment scores associated with words, emoticons, and slang terms (like wtf, lol, nah, and so on). There were a total of over 9,000 lexical features, from which over 7,500 curated lexical features were finally selected for the lexicon, with properly validated valence scores. Each feature was rated on a scale from "[-4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". Lexical features were kept if they had a non-zero mean rating and a standard deviation of less than 2.5 across ten independent raters.

Now let's use VADER to analyze our movie reviews! We build our own modeling function, sketched below. In it, we do some basic pre-processing but keep the punctuation and emoticons intact. Besides this, we use VADER to get the sentiment polarity as well as the proportions of the review text that are positive, neutral, and negative. We also predict the final sentiment based on a user-input threshold for the aggregated sentiment polarity. We get an overall F1-score and model accuracy of 72%, quite similar to the AFINN-based model; AFINN wins by only a small margin, and both models perform similarly.
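A minimal sketch of such a modeling function using nltk's VADER interface (requires nltk.download('vader_lexicon')); the 0.1 threshold is just an illustrative default for the user-input parameter.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment_vader(review, threshold=0.1):
    # polarity_scores returns 'pos', 'neu', 'neg' proportions and a 'compound' polarity
    scores = analyzer.polarity_scores(review)
    sentiment = "positive" if scores["compound"] >= threshold else "negative"
    return sentiment, scores

print(analyze_sentiment_vader("The movie was AMAZING, I loved it!! :)"))
```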

  1. Setting up Dependencies
  2. Sentiment Analysis using AFINN
       • Model Training, Prediction and Evaluation


  3. Sentiment Analysis using SentiWordNet
  4. Sentiment Analysis using VADER

Link to Jupyter Notebook

Classifying Sentiment with Supervised Learning

Introduction:

We will be building an automated sentiment text classification system in subsequent sections. The major steps to achieve this are mentioned as follows.

  1. Prepare train and test datasets (optionally a validation dataset)
  2. Pre-process and normalize text documents
  3. Feature engineering
  4. Model training
  5. Model prediction and evaluation

In our scenario, the documents are the movie reviews and the classes are the review sentiments, which can be either positive or negative, making this a binary classification problem.



  1. Setting up Dependencies
  2. Text Normalisation (using Text_normalizer.py) & Feature Engineering

A text corpus consists of multiple text documents, and each document can range from a single sentence to a complete document with multiple paragraphs. Textual data, in spite of being highly unstructured, can be classified into two major types of documents. Factual documents typically depict some form of statements or facts with no specific feelings or emotion attached to them; these are also known as objective documents. Subjective documents, on the other hand, contain text that expresses feelings, moods, emotions, and opinions.

  3. Model Training, Prediction and Evaluation using Model_evaluation_util.py (a sketch of this flow follows the outline below)

  4. Summary: the model built using traditional supervised learning achieves an F1-score of approximately 89.68% and an accuracy of approximately 89.69%.
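As a hedged illustration of steps 2 and 3, the flow below uses scikit-learn's TF-IDF features with logistic regression; the project's Text_normalizer.py and Model_evaluation_util.py helpers presumably wrap similar logic, but the specific choices here (n-gram range, classifier settings) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Text normalization, re-using the normalize_corpus pipeline from earlier
norm_train_reviews = normalize_corpus(train_reviews)
norm_test_reviews = normalize_corpus(test_reviews)

# Feature engineering: TF-IDF weighted uni- and bi-grams
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True)
train_features = vectorizer.fit_transform(norm_train_reviews)
test_features = vectorizer.transform(norm_test_reviews)

# Model training
lr = LogisticRegression(max_iter=1000)
lr.fit(train_features, train_sentiments)

# Prediction and evaluation
predictions = lr.predict(test_features)
print("Accuracy:", accuracy_score(test_sentiments, predictions))
print(classification_report(test_sentiments, predictions))
```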

Overall Summary
 

Method       | F1 Score (%) | Accuracy (%)
------------ | ------------ | ------------
AFINN        | 70.6         | 72.8
SentiWordNet | 68.3         | 68.7
VADER        | 70.6         | 72.4


Therefore, the best unsupervised lexicon model is AFINN.


Conclusion

Comparing the overall F1-score and model accuracy of the supervised ML model with those of the best unsupervised lexicon model (AFINN), we conclude that supervised learning gives us a considerably more accurate model than the unsupervised lexicon approach.







