- `filter.py`
This script goes through all the JSON files in the `dataset` folder and stores a tweet only if it meets the following criteria:
- `extended_tweet` is NOT null
- `lang` is `en` (English)
- The tweet contains one or more words defined in the `keyWords` list
It does not store every detail of a tweet, only the features we need for our purpose:
- Twitter user description
- Tweet
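The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `filter.py`: it assumes each tweet is one JSON object laid out like the Twitter v1.1 streaming payload (`extended_tweet.full_text`, `user.description`), that all criteria must hold at once, and that keywords are matched as case-insensitive substrings. The function name `extract_features` and the sample keywords are hypothetical.

```python
import json

# Hypothetical keyword list; the real keyWords list is defined in filter.py.
KEY_WORDS = ["migration", "refugee"]


def extract_features(raw_json, key_words=KEY_WORDS):
    """Return (user description, tweet text) if the tweet passes every filter, else None."""
    tweet = json.loads(raw_json)

    extended = tweet.get("extended_tweet")
    if not extended:                       # extended_tweet must not be null
        return None
    if tweet.get("lang") != "en":          # keep English tweets only
        return None

    # Assumption: the full tweet text lives in extended_tweet.full_text.
    text = extended.get("full_text", "")
    lowered = text.lower()
    if not any(word.lower() in lowered for word in key_words):
        return None                        # no keyword found in the tweet

    # Assumption: the user description lives in user.description.
    description = (tweet.get("user") or {}).get("description") or ""
    return description, text


sample = json.dumps({
    "lang": "en",
    "extended_tweet": {"full_text": "New migration policy announced"},
    "user": {"description": "Reporter"},
})
print(extract_features(sample))
```

Returning `None` for rejected tweets keeps the caller simple: it can loop over every record and write only the non-`None` results as rows.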
All this information is stored in CSV format (saved as `all_data.csv`).
- `label.py`
Since we need to manually annotate all the selected tweets, this script will provide a simple command line interface to help with that.
The script presents the user with one tweet at a time (from `all_data.csv`, line by line), and the user enters `1` or `0`, where:
- `1`: Tweet is migration relevant
- `0`: Tweet is NOT migration relevant
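A labelling loop of this kind might look like the sketch below. It is an illustration under assumptions, not the contents of `label.py`: it assumes `all_data.csv` has no header row, that each output row in `train_label.csv` is the input row with the label appended, and that invalid input is simply asked for again. The function name `label_tweets` and its `ask` parameter are hypothetical.

```python
import csv


def label_tweets(in_path="all_data.csv", out_path="train_label.csv", ask=input):
    """Show each input row and append it, with the entered 0/1 label, to out_path."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            print(" | ".join(row))
            label = ask("Migration relevant? [1/0]: ").strip()
            while label not in ("0", "1"):   # ask again until the input is valid
                label = ask("Please enter 1 or 0: ").strip()
            writer.writerow(row + [label])
            dst.flush()                      # keep finished labels if the session is interrupted
```

Calling `label_tweets()` with no arguments starts an interactive session on the default file names. The `ask` parameter exists only so the prompt can be replaced by a stub when testing.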
Once the user hits Enter, the label is stored in `train_label.csv`.
- `annotation.ipynb`
This notebook trains classifiers on the labelled data and evaluates them.
Pipeline (for now):
- Import the data and remove rows with null values in any column
- Balance the dataset using SMOTE
- Prepare the TF-IDF and Doc2Vec feature extraction techniques
- Feed the data and labels to both techniques, then train classifiers on the resulting feature vectors
- Perform classification on a separate validation set
- Print and Plot results!
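To make the TF-IDF step of the pipeline concrete, here is a dependency-free sketch of how TF-IDF feature vectors can be computed. It is not the notebook's implementation, which is not shown here, and it leaves out SMOTE, Doc2Vec and classifier training. It assumes whitespace tokenisation, term frequency normalised by document length, and the plain `log(N / df)` inverse document frequency. The function name `tfidf_vectors` and the sample tweets are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector (list of floats) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1              # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


tweets = ["refugees cross border", "football match today", "border match"]
vocabulary, vectors = tfidf_vectors(tweets)
print(vocabulary)
print(vectors[0])
```

Terms that occur in fewer documents receive a higher weight, so in the first sample tweet `refugees` scores higher than `border`, which also appears in the third tweet.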
harshildarji/DataScienceLab
About
Data Science Lab - SS - 2019