MovieGenre-Prediction-Apache-Spark-

Implemented a predictive analytics pipeline using Spark to predict the genres associated with a movie.

The movie plots were preprocessed using the Spark NLP library, which is built on top of Apache Spark and Spark ML.

Tokenization was done with RegexTokenizer, and stop words were removed with a stop-words remover (StopWordsRemover in Spark ML).
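The two preprocessing steps can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch of what regex tokenization and stop-word removal compute, not the Spark API itself. The split pattern, the tiny stop-word list, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical, very short stop-word list; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "his"}


def tokenize(text, pattern=r"\W+"):
    """Lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


plot = "The detective returns to the city, and his past follows."
tokens = tokenize(plot)
filtered = remove_stop_words(tokens)
print(filtered)
```

In the Spark pipeline the same two stages would run as DataFrame transformers over the plot column, so the output of the tokenizer feeds directly into the stop-word stage.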

A term-document matrix built from the plots was used as input to the machine learning model.
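As a rough illustration of a term-document matrix, the sketch below counts how often each vocabulary term occurs in each tokenized plot. It is plain Python rather than Spark code, and the toy documents, the sorted-vocabulary convention, and the function names are assumptions.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct term to a column index, in sorted order."""
    terms = sorted({tok for doc in tokenized_docs for tok in doc})
    return {term: idx for idx, term in enumerate(terms)}


def term_document_matrix(tokenized_docs, vocab):
    """Return one row of raw term counts per document."""
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # vocab preserves insertion order, so columns follow the sorted terms.
        matrix.append([counts.get(term, 0) for term in vocab])
    return matrix


# Toy tokenized plots (hypothetical data).
docs = [["space", "crew", "alien"], ["crew", "heist", "crew"]]
vocab = build_vocabulary(docs)
matrix = term_document_matrix(docs, vocab)
print(vocab)
print(matrix)
```

Each row is one plot and each column one term, which is the shape a classifier expects as its feature input.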

A logistic regression model was created in Spark to predict the genres.
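Because a movie can carry several genres, one common way to apply logistic regression is to train an independent binary classifier per genre. The README does not say how the multi-label setup was handled, so that choice is an assumption here. The sketch below is a plain-Python gradient-descent illustration on hypothetical toy data, not Spark's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary_lr(rows, labels, lr=0.5, epochs=500):
    """Fit weights and a bias for one genre with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        # Average the gradient over the batch before stepping.
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(model, x, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    weights, bias = model
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= threshold)


# Toy term-count rows and per-genre binary labels (hypothetical data).
X = [[1, 1, 0, 1], [0, 2, 1, 0], [1, 0, 0, 2], [0, 1, 2, 0]]
genres = {"sci-fi": [1, 0, 1, 0], "crime": [0, 1, 0, 1]}

# One independent binary model per genre.
models = {genre: train_binary_lr(X, y) for genre, y in genres.items()}
predicted = {genre: predict(model, [1, 0, 0, 1]) for genre, model in models.items()}
print(predicted)
```

Training the genres separately lets a plot receive zero, one, or several labels, at the cost of ignoring correlations between genres.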

Performance was improved with a feature engineering technique based on term frequency-inverse document frequency (TF-IDF).
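To show what the TF-IDF reweighting does to raw counts, here is a small plain-Python sketch. It uses the smoothed formula idf(t) = log((N + 1) / (df(t) + 1)), where N is the number of documents and df(t) the number of documents containing term t. The toy count matrix and function names are assumptions.

```python
import math


def inverse_document_frequency(count_matrix):
    """Compute log((N + 1) / (df + 1)) for every column of the count matrix."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    idf = []
    for j in range(n_terms):
        doc_freq = sum(1 for row in count_matrix if row[j] > 0)
        idf.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return idf


def tf_idf(count_matrix):
    """Scale each raw term count by the inverse document frequency of its term."""
    idf = inverse_document_frequency(count_matrix)
    return [[tf * w for tf, w in zip(row, idf)] for row in count_matrix]


# Toy count matrix: rows are plots, columns are terms (hypothetical data).
counts = [[1, 1, 0, 1], [0, 2, 1, 0]]
weighted = tf_idf(counts)
print(weighted)
```

The second column occurs in every document, so its weight drops to zero, while terms that appear in only one plot keep a positive weight. This is why TF-IDF tends to help over raw counts: ubiquitous words stop dominating the feature vector.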

A custom feature engineering method, Word2Vec, was implemented, and an increase in performance was noted.
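With Word2Vec features, each plot is typically represented by the average of the embedding vectors of its words. The sketch below illustrates only that averaging step in plain Python. The 3-dimensional embedding table is hypothetical (a trained model would learn these vectors from the plots), and skipping unknown tokens is an assumption that may differ from how a given library treats them.

```python
# Hypothetical 3-dimensional word embeddings, for illustration only.
EMBEDDINGS = {
    "space": [0.9, 0.1, 0.0],
    "alien": [0.7, 0.3, 0.2],
    "heist": [0.0, 0.8, 0.4],
}


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        # No known tokens: fall back to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


doc_vec = document_vector(["space", "alien"], EMBEDDINGS)
print(doc_vec)
```

Unlike the count-based features, the resulting vector has a fixed, small dimension regardless of vocabulary size, and plots that use related words end up close to each other.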

Predictions for the test set were uploaded to the Kaggle website.

Part 1.ipynb
Part 2.ipynb
Part 3.ipynb
README.md
mapping.csv
sample.csv
specification.pdf
test.csv.zip
train.csv.zip
