
colinsimon/Global-News-Analysis-Using-Cloud-Tools


GA-DSI-Project-5 Executive Summary

Nate Bukowski and Colin Simon

Problem Statement

Using data science to identify commodity supply events using global news data.

Data Summary

Data Source:

  • The data for this project comes from the GDELT Project (gdeltproject.org).
  • It was pulled using Google Cloud Platform's BigQuery.
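
As a rough illustration of pulling GDELT records through BigQuery, the sketch below builds a date-bounded query and runs it with the `google-cloud-bigquery` client. The table name (`gdelt-bq.gdeltv2.gkg_partitioned`), the selected columns, and the helper names `build_gkg_query` and `run_query` are assumptions made for this example; the project's actual queries are not shown in this summary.

```python
from datetime import date


def build_gkg_query(start: str, end: str, limit: int = 1000) -> str:
    """Build a date-bounded query against an assumed GDELT GKG table.

    `start` and `end` are ISO dates (YYYY-MM-DD). The table and column
    names are assumptions for illustration, not the project's own query.
    """
    # Parse the dates so that only well-formed values reach the SQL text.
    start_day, end_day = date.fromisoformat(start), date.fromisoformat(end)
    if end_day < start_day:
        raise ValueError("end date must not precede start date")
    return (
        "SELECT DATE, DocumentIdentifier, V2Themes, V2Locations, V2Tone\n"
        "FROM `gdelt-bq.gdeltv2.gkg_partitioned`\n"
        f"WHERE _PARTITIONTIME BETWEEN TIMESTAMP('{start_day}') "
        f"AND TIMESTAMP('{end_day}')\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Run the query and return a DataFrame.

    Needs the google-cloud-bigquery package (plus pandas) and GCP credentials.
    """
    from google.cloud import bigquery  # imported lazily; optional dependency

    client = bigquery.Client()
    return client.query(sql).to_dataframe()


if __name__ == "__main__":
    print(build_gkg_query("2019-01-01", "2019-01-31", limit=50))
```

Restricting the query to a date range and a row limit keeps the amount of data scanned small, which matters because BigQuery bills by bytes processed.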

Datasets Analyzed:

Mapping

  • For this project, mapping serves three purposes:
    1. EDA: to understand the data.
    2. Presentation: to communicate the data to any audience.
    3. Analytics: as a means of data science in its own right. Although mapping is not predictive, it is crucial given the nature of the dataset.
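
One simple way to use mapping analytically is to count articles per latitude/longitude grid cell and rank the busiest cells. The sketch below is only an illustration under that assumption: the function name `hotspot_counts`, the one-degree cell size, and the sample coordinates are made up for the example and do not come from the project's data or code.

```python
import math
from collections import Counter


def hotspot_counts(points, cell_deg=1.0):
    """Count (lat, lon) points per grid cell of `cell_deg` degrees."""
    counts = Counter()
    for lat, lon in points:
        # Flooring maps each coordinate to the south-west corner of its cell.
        cell = (
            math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg,
        )
        counts[cell] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical article locations, not taken from GDELT.
    sample = [(40.7, -74.0), (40.2, -73.9), (51.5, -0.1)]
    for cell, n in hotspot_counts(sample).most_common():
        print(cell, n)
```

The resulting counts can be passed to any plotting or mapping library to shade cells by article volume.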

Models and Techniques

  • The goal of our modeling was to use K-Means Clustering, PCA and CountVectorizer to find the classification model that performed best at predicting the month of the year in which an article was written. A pipeline with a range of parameters was run through GridSearch for each of the following models:

    • K-Nearest Neighbors
    • Random Forest
    • Logistic Regression
    • Support Vector Machine

    Below are the accuracy scores for the best performing models:

    • K-Nearest Neighbors:

      • Train accuracy: 58.9%
      • Test accuracy: 37.5%
    • Random Forest:

      • Train accuracy: 100%
      • Test accuracy: 44.6%
    • Logistic Regression:

      • Train accuracy: 69.6%
      • Test accuracy: 36.9%
    • Support Vector Machine:

      • Train accuracy: 99.3%
      • Test accuracy: 42.4%
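
The general pattern of running a pipeline through a grid search, then comparing train and test accuracy, might look like the sketch below. It is not the project's code: it uses synthetic data from `make_classification` in place of the GDELT features, includes only PCA (no K-Means or CountVectorizer step), and tries just two of the four model families with a small, arbitrary parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's features; NOT the GDELT data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder, swapped by the grid
])

# Each dict is searched separately, so every model gets its own parameters.
param_grid = [
    {
        "pca__n_components": [5, 10],
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0],
    },
    {
        "pca__n_components": [5, 10],
        "clf": [KNeighborsClassifier()],
        "clf__n_neighbors": [3, 7],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(f"best params: {search.best_params_}")
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Reporting both scores side by side, as in the list above, makes overfitting visible: a model that scores near 100% on the training set but far lower on the test set has memorized the training data rather than generalized.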

Conclusions, Limitations and Recommendations:

Conclusions:

  • This is a tremendously powerful dataset that updates many times a day.
  • Event hotspots can be located.
  • The month in which an article was written can be predicted from its location, tone and theme, though only with modest accuracy (best test accuracy: 44.6%).

Limitations:

  • We used only English-language articles.
  • Our modeling dataset was small (1,100 data points).
  • Multiple levels of machine learning were used.
  • Cloud computing is expensive.
  • Dataset could not be used for time series modeling.

Recommendations:

  • Research the Google and GDELT algorithms and NLP classifications.
  • A nonprofit or significant sponsorship angle is needed.
  • Clean and engineer the data to allow for time-series modeling and to improve the clustering model.

About

Ensemble classification of global news data from gdeltproject.org gathered using Google Cloud BigQuery
