This project leverages Apache Spark to perform sentiment analysis on the Amazon Reviews dataset. Using natural language processing (NLP) and machine learning techniques, it classifies reviews as positive or negative and compares the performance of different classification models.
The repository includes the following components:
- main.py: Main script for data loading, preprocessing, and model training.
- README.md: Project documentation.
- Amazon Review Sentiment Analysis Report.docx: Project report.
- output_results.pdf: All results of our project.
- spark_history.pdf: The history of our project.
To run the project, ensure the following environment setup:
- Operating System: Windows/Linux/MacOS
- Python Version: 3.8+
- Spark Version: 3.3.0+
- Dependencies: Listed in requirements.txt
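The version requirements above can be checked with a short script before running anything. This is a minimal sketch and not part of the repository: the helpers `parse_version` and `meets_minimum` are hypothetical names, and the comparison only handles dotted numeric versions such as 3.3.0.

```python
import sys

MIN_PYTHON = "3.8"   # minimum Python version listed above
MIN_SPARK = "3.3.0"  # minimum Spark version listed above


def parse_version(version):
    """Turn a dotted version string such as '3.3.0' into a tuple of ints.

    Only the leading purely numeric components are kept,
    so '3.4.0.dev0' becomes (3, 4, 0).
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    have, need = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '3.8' and '3.8.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    print("Python", python_version, "ok:", meets_minimum(python_version, MIN_PYTHON))
    try:
        import pyspark  # only importable once the dependencies are installed
        print("Spark", pyspark.__version__, "ok:",
              meets_minimum(pyspark.__version__, MIN_SPARK))
    except ImportError:
        print("pyspark is not installed; install the dependencies in requirements.txt")
```

On a managed cluster the Spark version is normally fixed by the cluster image, so this check is mostly useful for local test runs.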
This project uses the Amazon Reviews dataset, which includes the following features:
- polarity: Sentiment label (1 = Negative, 2 = Positive)
- title: Review title
- text: Review content
- Training data path: gs://your-bucket-name/train.csv
- Testing data path: gs://your-bucket-name/test.csv
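To make the column layout concrete, the sketch below parses two made-up rows with Python's csv module and maps polarity to a binary label. It assumes the CSV files have no header row and store the columns in the order polarity, title, text. The sample rows, the `parse_row` helper, and the 0/1 label encoding are illustrative assumptions and are not taken from main.py.

```python
import csv
import io

# Two made-up rows in the assumed layout: polarity, title, text (no header row).
SAMPLE_CSV = (
    '"2","Great read","Could not put it down, highly recommended."\n'
    '"1","Broke in a week","Stopped working after a few days, very disappointed."\n'
)


def parse_row(row):
    """Convert one CSV row into a dict with a binary sentiment label.

    polarity 1 (negative) maps to label 0, polarity 2 (positive) to label 1.
    """
    polarity, title, text = row
    polarity = int(polarity)
    if polarity not in (1, 2):
        raise ValueError(f"unexpected polarity value: {polarity}")
    return {"label": polarity - 1, "title": title, "text": text}


reviews = [parse_row(row) for row in csv.reader(io.StringIO(SAMPLE_CSV))]
for review in reviews:
    print(review["label"], review["title"])
```

Shifting the labels to 0 and 1 is a common convention for binary classifiers; whether main.py does the same is not stated here.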
You can download the dataset from Kaggle - Amazon Reviews Dataset.
- Upload Dataset: Upload the training and testing datasets to your cloud storage bucket (e.g., Google Cloud Storage).
- Modify the Code: Open the script main.py and update the dataset paths to your bucket paths. Example:
  train_path = "gs://your-bucket-name/train.csv"
  test_path = "gs://your-bucket-name/test.csv"
  output_path = "gs://your-bucket-name/output_all_results"
- Upload the Code: Upload your modified script (main.py) to your bucket.
- Create a Cluster: Set up a Spark cluster in your cloud environment (e.g., Google Cloud Dataproc, AWS EMR, or Alibaba E-MapReduce).
- Submit the Job: Select your cluster, choose PySpark as the job type, enter the code path "gs://your-bucket-name/main.py", and attach the arguments "gs://your-bucket-name/train.csv" and "gs://your-bucket-name/test.csv". Use the following command to submit the job to your cluster:
  spark-submit gs://your-bucket-name/main.py
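The steps above mention both editing the paths inside main.py and attaching them as job arguments. One way to reconcile the two, sketched below, is to read the paths from the command line and fall back to the placeholder paths when none are given. This is an assumption about how a script could handle its arguments, not a description of what main.py actually does; `resolve_paths` and `build_submit_command` are hypothetical helpers.

```python
import shlex
import sys

# Placeholder paths from the steps above; replace with your own bucket.
DEFAULT_TRAIN = "gs://your-bucket-name/train.csv"
DEFAULT_TEST = "gs://your-bucket-name/test.csv"
SCRIPT_PATH = "gs://your-bucket-name/main.py"


def resolve_paths(argv):
    """Pick the train/test paths from positional arguments, else use defaults.

    `argv` is the argument list without the script name, i.e. sys.argv[1:].
    """
    if len(argv) >= 2:
        return argv[0], argv[1]
    return DEFAULT_TRAIN, DEFAULT_TEST


def build_submit_command(script, train_path, test_path):
    """Assemble a spark-submit command line that passes both dataset paths."""
    return shlex.join(["spark-submit", script, train_path, test_path])


if __name__ == "__main__":
    train_path, test_path = resolve_paths(sys.argv[1:])
    print(build_submit_command(SCRIPT_PATH, train_path, test_path))
```

Anything that follows the script path on a spark-submit command line is passed to the script as application arguments, so a script written this way would receive the two dataset paths in `sys.argv`.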