- V. Notebook 2 – Feature Selection, Model Training, Evaluation, & Tuning
- VI. Notebook 3 – Logistic Regression & Feature Contributions Evaluation
In this project, we developed several machine learning models (e.g. Logistic Regression, Naïve Bayes) to classify movie reviews as positive or negative using the IMDb Dataset of Movie Reviews.
Our workflow included data preprocessing, visualization, feature selection, and model building and tuning. We concluded by visualizing feature contributions and explaining predictions for logistic regression, the best-performing model.
The dataset contains 150,000 movie reviews and 14,206 unique movie titles, sourced from Kaggle, with a total size of 384.06 MB in CSV format. Each entry includes:
- Rating: User rating from 1 to 10.
- Review: User review in English.
- Movie: Name of the movie.
- Resenhas: Portuguese translation of the reviews.
Since our focus was on English reviews, we excluded the Portuguese ones.
Implementing sentiment analysis in NLP offers several advantages:
- Understanding Customers: Helps businesses learn customer opinions about their products or services, leading to meaningful improvements.
- Better Customer Service: Enables quick responses to customer concerns, boosting satisfaction and loyalty.
- Staying Competitive: Monitors competitor mentions to reveal market trends and opportunities.
- Smart Market Research: Gauges public sentiment on new products, predicting success and guiding refinements.
Sentiment analysis of IMDb movie reviews offers several benefits:
- Content Improvement: Filmmakers can assess audience reactions to pinpoint strengths and areas for improvement.
- Marketing Strategies: Helps craft targeted campaigns that align with audience preferences.
- Reputation Management: Enables timely responses to negative reviews, maintaining a positive brand image.
- Consumer Insights: Aggregates sentiment data to reveal audience trends and guide future content creation.
In summary, sentiment analysis of IMDb reviews provides actionable insights to improve content, refine marketing, and understand audience preferences.
For each notebook, we began by:
- Importing relevant libraries and loading the IMDb reviews dataset from CSV (the data was stored in a folder named “Data” using the path Data/IMDB_Dataset.csv).
- Customizing stop words by adding and removing specific words.
- Cleaning review data using multiple preprocessing steps (e.g. removing stop words, expanding contractions), as in the sketch below.

Note: Our cleaning method was updated early on to reduce or remove specific repeated word sequences discovered during initial exploration.
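A minimal sketch of this setup, assuming pandas and NLTK: the column names Rating, Review, and Resenhas come from the dataset description above, while the specific stop-word edits, the contraction map, and the clean_review helper are illustrative assumptions, not the project's exact code.

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

# Load the dataset and drop the Portuguese translations, keeping English only.
df = pd.read_csv("Data/IMDB_Dataset.csv")
df = df.drop(columns=["Resenhas"])

# Customize stop words by adding and removing specific words (examples are assumed).
CUSTOM_STOP_WORDS = set(stopwords.words("english"))
CUSTOM_STOP_WORDS |= {"movie", "film"}     # hypothetical additions
CUSTOM_STOP_WORDS -= {"not", "no", "nor"}  # keep negations, useful for sentiment

# Tiny illustrative contraction map; a fuller mapping would be used in practice.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_review(text: str) -> str:
    """Lowercase, expand contractions, strip punctuation, collapse repeats, drop stop words."""
    text = text.lower()
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    text = re.sub(r"[^a-z\s]", " ", text)             # remove punctuation and digits
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text)  # "good good" -> "good"
    tokens = [w for w in text.split() if w not in CUSTOM_STOP_WORDS]
    return " ".join(tokens)

df["clean_review"] = df["Review"].apply(clean_review)
```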
For each notebook (except the first), we also:
- Assigned binary sentiment labels to ratings (0 for negative and 1 for positive) and filtered the data to exclude neutral ratings.
- Defined a tokenizer class to lemmatize words in the text (see the sketch after this list).
- Split the data into training and testing sets.
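A sketch of these steps, assuming NLTK's WordNetLemmatizer and scikit-learn's train_test_split; the LemmaTokenizer class name, the neutral-rating band, and the split parameters are illustrative assumptions.

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")
from sklearn.model_selection import train_test_split

class LemmaTokenizer:
    """Callable tokenizer that lemmatizes each token; passed to TfidfVectorizer later."""
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(tok) for tok in word_tokenize(doc)]

# Binary labels: drop an assumed neutral band, then map low/high ratings to 0/1.
labeled = df[~df["Rating"].isin([5, 6])].copy()
labeled["label"] = (labeled["Rating"] > 6).astype(int)  # 7-10 -> 1, 1-4 -> 0

X_train, X_test, y_train, y_test = train_test_split(
    labeled["clean_review"], labeled["label"],
    test_size=0.2, random_state=42, stratify=labeled["label"],
)
```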
- We performed exploratory data analysis (EDA) with statistics (e.g. checking for missing values, counting unique values) and visuals (word clouds of positive and negative reviews; histograms of the distributions of reviews, ratings, and words).
- We visualized the top 20 unigrams, bigrams, trigrams, 4-grams, and 5-grams in positive and negative reviews, as in the sketch below.
- We identified bigrams, trigrams, 4-grams, and 5-grams with identical words (e.g. "good good," "blah blah blah," "la la la la") across all reviews. These were either reduced to a single word (if meaningful) or completely removed after updating our cleaning process.
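One way to compute such counts is with scikit-learn's CountVectorizer; the top_ngrams helper below is an illustrative assumption, not the project's code.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n, k=20):
    """Return the k most frequent n-grams in a collection of texts."""
    vectorizer = CountVectorizer(ngram_range=(n, n))
    counts = vectorizer.fit_transform(texts)
    totals = counts.sum(axis=0).A1              # total count per n-gram
    vocab = vectorizer.get_feature_names_out()
    order = totals.argsort()[::-1][:k]
    return [(vocab[i], int(totals[i])) for i in order]

# Example: top 20 bigrams in positive reviews (assumed DataFrame layout).
positive = labeled[labeled["label"] == 1]["clean_review"]
print(top_ngrams(positive, n=2))
```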
In Notebook 2:
- We vectorized the text data using TfidfVectorizer with lemmatization and 5,000 1- to 3-grams, then trained and evaluated baseline models (e.g. Logistic Regression, Naïve Bayes); see the sketch after this list.
- After tuning model hyperparameters using GridSearchCV, the optimized models performed better overall than the baselines, with Random Forest showing significantly less overfitting.
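A sketch of the vectorize, train, and tune step, assuming scikit-learn; the logistic regression parameter grid is an illustrative assumption rather than the grid the project searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# 5,000 TF-IDF features over 1- to 3-grams, tokenized with the lemmatizer above.
tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer(), ngram_range=(1, 3),
                        max_features=5000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

# Baseline model.
baseline = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)
print("baseline test accuracy:", baseline.score(X_test_vec, y_test))

# Hyperparameter tuning with GridSearchCV (illustrative grid).
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy", cv=5,
)
grid.fit(X_train_vec, y_train)
print("best params:", grid.best_params_, "cv accuracy:", grid.best_score_)
```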
In Notebook 3:
- We vectorized the text data using TfidfVectorizer with lemmatization and 5,000 1- to 3-grams.
- Using the baseline logistic regression model, we plotted a confusion matrix to visualize true vs. predicted labels.
- We stored false positive and false negative reviews in dictionaries and displayed the first few of each based on their ratings.
- We visualized the marginal contribution of features for the logistic regression model, with notable features including “bad,” “great,” “worst,” and “fun.”
- We explained model predictions for a false positive and a false negative review based on feature contributions, providing better insight into how features influenced outcomes; a sketch follows this list.
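For a linear model such as logistic regression, a feature's contribution to a single prediction can be read as its coefficient times the feature's TF-IDF value in that review. A sketch under that assumption, reusing names from the earlier snippets (the explain helper is hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix: true vs. predicted labels for the baseline model.
ConfusionMatrixDisplay.from_estimator(baseline, X_test_vec, y_test)
plt.show()

# Global view: the largest negative and positive coefficients.
coefs = baseline.coef_[0]
vocab = tfidf.get_feature_names_out()
order = np.argsort(coefs)
print("most negative features:", vocab[order[:10]])   # e.g. "bad," "worst"
print("most positive features:", vocab[order[-10:]])  # e.g. "great," "fun"

# Local view: per-feature contributions for a single review.
def explain(review_text, k=10):
    vec = tfidf.transform([review_text])
    contrib = vec.toarray()[0] * coefs               # coefficient * tf-idf value
    top = np.argsort(np.abs(contrib))[::-1][:k]
    return [(vocab[i], round(contrib[i], 3)) for i in top if contrib[i] != 0]
```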
Looking at the distribution of review lengths:
- Both positive and negative review lengths are right-skewed and follow a similar pattern.
- Many reviews are under 1,000 characters, with only some exceeding 2,000 characters.
Frequent words in positive reviews:
- Notable words include "like," "good," and "great."
- Interestingly, "not" also appears frequently due to the lack of contextual pairing in unigrams.
Frequent words in negative reviews:
- Common words include "not" and "bad."
- Surprisingly, "like" and "good" are also frequent due to the lack of contextual pairing.
Frequent bigrams in positive reviews:
- Examples include "one best," "see movie," and "good movie."
- However, pairs like "movie not," "not like," and "not really" highlight the inability of bigrams to capture full context.
Frequent bigrams in negative reviews:
- Common pairs include "waste time," "not good," and "bad movie."
- Others like "much better" and "watch movie" also appear, again showing limitations in contextual interpretation.
Based on the tables above:
- The Logistic Regression model performs best in both the baseline and optimized versions, followed by Naïve Bayes, Random Forest, and AdaBoost.
- Random Forest is the only model that overfitted, especially in the baseline version, with the optimized version showing significantly less overfitting and better performance.
Here are explanations of the baseline logistic regression test evaluation metrics:
- Accuracy (89.16%): Percentage of correctly predicted sentiments (positive or negative) out of all reviews; measures overall performance in classifying IMDb reviews accurately.
- AUC (95.81%): Measures the model's ability to distinguish between positive and negative sentiments in IMDb reviews; a higher AUC indicates better separation between the two sentiment classes.
- F1 (89.18%): Harmonic mean of precision and recall; balances false positives and false negatives, which is especially useful for imbalanced datasets.

The similar accuracy (89.16%) and F1 score (89.18%) suggest that the model's predictions are well balanced between true positives and true negatives, as the classes were balanced.
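These figures can be reproduced with scikit-learn's metric functions; a minimal sketch reusing the earlier variable names:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_pred = baseline.predict(X_test_vec)
y_prob = baseline.predict_proba(X_test_vec)[:, 1]  # P(positive), used for AUC

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")  # e.g. 89.16%
print(f"AUC:      {roc_auc_score(y_test, y_prob):.2%}")   # e.g. 95.81%
print(f"F1:       {f1_score(y_test, y_pred):.2%}")        # e.g. 89.18%
```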