Sentiment Analysis and Classification Using Apache Spark

Introduction

This project leverages Apache Spark to perform sentiment analysis on the Amazon Reviews dataset. Using natural language processing (NLP) and machine learning techniques, it classifies reviews as positive or negative and compares the performance of different classification models.

Project Structure

The repository includes the following components:

main.py: Main script for data loading, preprocessing, and model training.
README.md: Project documentation.
Amazon Review Sentiment Analysis Report.docx: Project report.
output_results.pdf: Results produced by the project.
spark_history.pdf: Spark history for the project.

Environment Requirements

Running the project requires the following environment:

Operating System: Windows/Linux/macOS
Python Version: 3.8+
Spark Version: 3.3.0+
Dependencies: Listed in requirements.txt

Dataset

This project uses the Amazon Reviews dataset, which includes the following features:

polarity: Sentiment label (1 = Negative, 2 = Positive)
title: Review title
text: Review content
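
As a rough illustration of the schema above, the following sketch parses rows with Python's standard csv module and converts the polarity values to binary labels. It assumes that the CSV files have no header row and that the columns appear in the order polarity, title, text. The names Review and parse_reviews and the 0/1 label encoding are illustrative and are not taken from main.py.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Review(NamedTuple):
    label: int  # 0 = negative, 1 = positive (illustrative encoding)
    title: str
    text: str


# Polarity values as documented above: 1 = Negative, 2 = Positive.
POLARITY_TO_LABEL = {"1": 0, "2": 1}


def parse_reviews(lines: Iterable[str]) -> Iterator[Review]:
    """Yield Review records from CSV lines, skipping malformed rows."""
    for row in csv.reader(lines):
        # Assumed column order: polarity, title, text (no header row).
        if len(row) != 3 or row[0] not in POLARITY_TO_LABEL:
            continue
        yield Review(POLARITY_TO_LABEL[row[0]], row[1], row[2])


# Two made-up rows, used only to exercise the parser.
sample = io.StringIO(
    '"2","Great product","Works exactly as described."\n'
    '"1","Disappointed","Stopped working after a week, sadly."\n'
)
reviews = list(parse_reviews(sample))
for review in reviews:
    print(review.label, review.title)
```

In a Spark job the same mapping would more likely be expressed as a column transformation on a DataFrame; the plain-Python version here is only meant to make the label convention concrete.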

Dataset Paths

Training data path: gs://your-bucket-name/train.csv
Testing data path: gs://your-bucket-name/test.csv

The dataset is available for download from Kaggle (Amazon Reviews Dataset).

Setup and Execution

Steps to Run the Code

Upload Dataset
Upload the training and testing datasets to your cloud storage bucket (e.g., Google Cloud Storage).

Modify the Code
Open the script main.py and update the dataset paths to your bucket paths. Example:

train_path = "gs://your-bucket-name/train.csv"
test_path = "gs://your-bucket-name/test.csv"
output_path = "gs://your-bucket-name/output_all_results"
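
The job submission step attaches the training and testing paths as arguments, so one way to avoid editing main.py for every run is to read the paths from the command line and fall back to the defaults shown above. The sketch below assumes that approach. The function parse_paths, the positional argument order, and the --output flag are hypothetical and may not match how main.py actually handles its arguments.

```python
import argparse

# Defaults mirror the placeholder paths shown above.
DEFAULT_TRAIN = "gs://your-bucket-name/train.csv"
DEFAULT_TEST = "gs://your-bucket-name/test.csv"
DEFAULT_OUTPUT = "gs://your-bucket-name/output_all_results"


def parse_paths(argv=None):
    """Parse dataset paths; argv=None means "use sys.argv[1:]"."""
    parser = argparse.ArgumentParser(
        description="Sentiment analysis job (hypothetical argument layout)."
    )
    # Optional positionals: omitted values fall back to the defaults.
    parser.add_argument("train_path", nargs="?", default=DEFAULT_TRAIN)
    parser.add_argument("test_path", nargs="?", default=DEFAULT_TEST)
    parser.add_argument("--output", dest="output_path", default=DEFAULT_OUTPUT)
    return parser.parse_args(argv)


# Example with explicit arguments, as a cluster job would supply them.
args = parse_paths(["gs://my-bucket/train.csv", "gs://my-bucket/test.csv"])
print(args.train_path, args.test_path, args.output_path)
```

With this layout, running the script without arguments behaves like the hard-coded version, while arguments supplied at submission time override the defaults.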

Upload the Code
Upload your modified script (main.py) to your bucket.

Create a Cluster
Set up a Spark cluster in your cloud environment (e.g., Google Cloud Dataproc, AWS EMR, or Alibaba E-MapReduce).

Submit the Job
Select your cluster and choose PySpark as the job type. Enter the code path "gs://your-bucket-name/main.py" and attach the arguments "gs://your-bucket-name/train.csv" and "gs://your-bucket-name/test.csv". Use the following command to submit the job to your cluster:

spark-submit gs://your-bucket-name/main.py
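
For Google Cloud Dataproc specifically, the same submission can also be made from the gcloud command line instead of the web console. The sketch below only assembles and prints that command. The cluster name and region are placeholders, and build_dataproc_command is a hypothetical helper, not part of this repository.

```python
import shlex


def build_dataproc_command(cluster, region, main_uri, job_args):
    """Return the argv list for submitting a PySpark job to Dataproc."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_uri,
        f"--cluster={cluster}",
        f"--region={region}",
        "--",  # everything after this separator is passed to main.py
        *job_args,
    ]


command = build_dataproc_command(
    cluster="my-cluster",      # placeholder cluster name
    region="us-central1",      # placeholder region
    main_uri="gs://your-bucket-name/main.py",
    job_args=[
        "gs://your-bucket-name/train.csv",
        "gs://your-bucket-name/test.csv",
    ],
)
# Print a copy-pasteable shell command rather than executing it.
print(shlex.join(command))
```

To actually submit the job, the list could be passed to subprocess.run on a machine where the gcloud CLI is installed and authenticated. Other platforms such as AWS EMR use different submission tools.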

Repository: ChengqinLi1206/CS777