Google Cloud native web analytics using Google Compute Engine, Cloud SQL (MySQL), Cloud Dataflow, Google Kubernetes Engine and Google Deployment Manager

Mohitsai/google-analytics-simulation

Google Analytics Simulation – Google Cloud Platform

Project Overview

This project simulates a large-scale web analytics system using Google Cloud Platform (GCP). The system deploys 10,000 webpages on Google Cloud Storage, tracks 100,000+ user requests, and applies machine learning to predict user demographics based on web traffic behavior. The project incorporates cloud-native big data processing, machine learning, and scalable orchestration.

Key Features

  • Web Traffic Simulation:
    • Hosted 10,000 webpages in Google Cloud Storage.
    • Served content via a Google Compute Engine (GCE) Virtual Machine.
    • Tracked 100,000+ HTTP requests, capturing metadata such as location, age, and gender.
  • Real-Time Traffic Analysis with Cloud SQL & Pub/Sub:
    • Logged all user requests in Google Cloud SQL (MySQL 8.0).
    • Used Google Pub/Sub to track and handle requests from banned countries.
  • PageRank Computation with Google Cloud Dataflow:
    • Processed webpage link structures in real time using Apache Beam on Cloud Dataflow.
    • Identified high-authority pages, improving search and ranking insights.
  • Machine Learning for User Demographics Prediction:
    • Trained an ML model to predict user demographics based on web request metadata.
    • Achieved 99.7% accuracy in classification.
  • Scalable Orchestration with Google Kubernetes Engine (GKE):
    • Deployed the system in GKE, ensuring high availability and fault tolerance.
    • Used Google Deployment Manager for automated infrastructure setup.

System Architecture

1️⃣ Web Serving Layer

  • Google Cloud Storage (GCS): Stores 10,000 webpages.
  • Compute Engine (GCE VM): Serves webpages to users.
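
The serving layer has to do two small things per request: work out which bucket object a URL refers to, and read the demographic metadata sent with the request. The sketch below shows one way that could look. The header names (`X-country`, `X-age`, `X-gender`) and both helper functions are assumptions made for illustration; the README does not say how the metadata is transmitted.

```python
from typing import Mapping, Optional, TypedDict


class RequestMetadata(TypedDict):
    country: Optional[str]
    age: Optional[str]
    gender: Optional[str]


def extract_metadata(headers: Mapping[str, str]) -> RequestMetadata:
    """Pull demographic fields out of request headers (header names are assumed)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return {
        "country": lowered.get("x-country"),
        "age": lowered.get("x-age"),
        "gender": lowered.get("x-gender"),
    }


def object_name_for_path(url_path: str) -> str:
    """Map a URL path such as '/pages/42.html' to a bucket object name."""
    # Cloud Storage object names do not start with a slash.
    return url_path.lstrip("/")
```

A missing header simply yields `None`, so the caller can decide whether to log the request with partial metadata or reject it.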

2️⃣ Data Ingestion & Storage

  • Cloud SQL (MySQL 8.0): Stores 100,000+ user requests.
  • Pub/Sub Topic & Subscription:
    • Tracks banned country requests.
    • Notifies the logging system of policy violations.
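
The banned-country check described above can be sketched as a pure function that decides the response status and, for violations, builds the bytes that would be published to Pub/Sub. The country names, the 403 status code, and the JSON payload layout are all hypothetical; the README does not specify any of them.

```python
import json
from typing import Optional, Tuple

# Hypothetical list; the README does not name the banned countries.
BANNED_COUNTRIES = frozenset({"examplestan", "testland"})


def is_banned(country: Optional[str]) -> bool:
    """Case-insensitive membership test; a missing country is not treated as banned."""
    if not country:
        return False
    return country.strip().lower() in BANNED_COUNTRIES


def build_violation_message(country: str, path: str) -> bytes:
    """Serialize the violation as UTF-8 JSON, since Pub/Sub message data must be bytes."""
    payload = {"country": country, "path": path}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def handle_request(country: Optional[str], path: str) -> Tuple[int, Optional[bytes]]:
    """Return (http_status, pubsub_payload). The payload is None for allowed requests."""
    if is_banned(country):
        # 403 is an assumed choice of status for a rejected request.
        return 403, build_violation_message(country, path)
    return 200, None
```

Keeping the decision separate from the publish call makes the policy easy to unit test without a Pub/Sub connection.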

3️⃣ Real-Time Analytics & Processing

  • Google Cloud Dataflow (Apache Beam):
    • Computes PageRank for all 10,000 webpages.
    • Identifies influential pages for performance insights.
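
For readers unfamiliar with PageRank, here is a minimal in-memory sketch of the iteration in plain Python. It is not the project's Beam pipeline: the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are conventional choices assumed here, and the README does not state which variant was used.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iteratively compute PageRank for a graph given as page -> outgoing links."""
    # Include pages that only ever appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks
```

With this normalization the ranks always sum to 1, so they can be read as the probability that a random surfer is on each page.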

4️⃣ Machine Learning & Deployment

  • ML Model:
    • Predicts user demographics from metadata.
    • Achieves 99.7% accuracy.
  • Google Kubernetes Engine (GKE):
    • Deploys the prediction service for real-time analytics.
  • Google Deployment Manager:
    • Automates resource provisioning.
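
The accuracy figure quoted for the ML model is the fraction of held-out samples whose predicted label matches the true label. The sketch below shows that evaluation with a deliberately trivial stand-in classifier (the most frequent label per feature value). The README does not say which model or features were used, so everything here apart from the definition of accuracy is a hypothetical illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (feature value, label), e.g. (country, age group)


def fit_majority_by_key(train: List[Sample]) -> Dict[str, str]:
    """Stand-in model: remember the most common label seen for each feature value."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for key, label in train:
        counts[key][label] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


def accuracy(model: Dict[str, str], test: List[Sample],
             default: str = "unknown") -> float:
    """Fraction of test samples whose predicted label equals the true label."""
    if not test:
        raise ValueError("test set must not be empty")
    # Feature values never seen in training fall back to the default label.
    correct = sum(1 for key, label in test if model.get(key, default) == label)
    return correct / len(test)
```

Evaluating on samples that were not used for training is what makes the number meaningful; accuracy measured on the training data itself tends to be optimistic.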

Deployment Steps

1️⃣ Prerequisites

  • Google Cloud SDK installed & authenticated.
  • Cloud Storage Bucket with webpages.
  • Cloud SQL Instance set up with MySQL.
  • Pub/Sub Topic & Subscription created.

2️⃣ Deploy Web Serving VM

gcloud compute instances create web-server \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --metadata=startup-script-url=gs://your-bucket/startup-script.sh

3️⃣ Enable Web Traffic Logging

gcloud sql instances create web-traffic-db \
    --database-version=MYSQL_8_0 \
    --tier=db-f1-micro \
    --region=us-central1
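
Once the instance exists, each request needs to be written to a table. The sketch below uses the standard-library `sqlite3` module as a local stand-in so it runs without a Cloud SQL connection. The table name, the columns, and both helper functions are assumptions; the README does not show the project's schema. MySQL drivers typically use `%s` placeholders instead of `?`.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical schema for the request log.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    country TEXT,
    age TEXT,
    gender TEXT,
    path TEXT NOT NULL,
    status INTEGER NOT NULL
)
"""


def log_request(conn: sqlite3.Connection, country: Optional[str], age: Optional[str],
                gender: Optional[str], path: str, status: int) -> None:
    """Insert one request row."""
    # Parameterized query: values are never interpolated into the SQL text.
    conn.execute(
        "INSERT INTO requests (country, age, gender, path, status) "
        "VALUES (?, ?, ?, ?, ?)",
        (country, age, gender, path, status),
    )
    conn.commit()


def requests_per_country(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Aggregate the log by country, sorted alphabetically."""
    rows = conn.execute(
        "SELECT country, COUNT(*) FROM requests GROUP BY country ORDER BY country"
    )
    return rows.fetchall()
```

The same aggregate query is the kind of building block behind the per-region traffic breakdown mentioned in the results.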

4️⃣ Deploy Banned Country Tracker

from google.cloud import pubsub_v1

PROJECT_ID = "ds-561-mohitsai"
SUBSCRIPTION_ID = "banned-country-topic-sub"

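def callback(message):
    # Log the flagged request, then ack so Pub/Sub does not redeliver it.
    print(f"Received banned country request: {message.data.decode('utf-8')}")
    message.ack()

def listen_for_banned_requests():
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    # subscribe() returns a future; callbacks run on background threads.
    future = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening for messages on {subscription_path}...")

    # Closing the client on exit releases the underlying gRPC channel.
    with subscriber:
        try:
            future.result()  # Block until the stream fails or is cancelled.
        except KeyboardInterrupt:
            future.cancel()  # Request shutdown of the streaming pull.
            future.result()  # Wait for the shutdown to complete.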

if __name__ == "__main__":
    listen_for_banned_requests()
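
The subscriber above only receives messages; something on the web server side has to publish them. A minimal sketch of that side follows. The function name and the JSON payload layout are assumptions. The publisher is passed in as an argument, so the function works with a `pubsub_v1.PublisherClient` in production and with a simple fake in tests.

```python
import json
from typing import Any


def publish_banned_request(publisher: Any, topic_path: str,
                           country: str, path: str) -> Any:
    """Publish one banned-country event and return the client's publish future."""
    # Pub/Sub message data must be bytes, so encode the JSON payload.
    data = json.dumps({"country": country, "path": path}).encode("utf-8")
    # Extra keyword arguments to publish() become string attributes on the message.
    return publisher.publish(topic_path, data, country=country)


# Usage with the real client (the topic name is an assumption inferred from the
# subscription ID above):
#   from google.cloud import pubsub_v1
#   publisher = pubsub_v1.PublisherClient()
#   topic_path = publisher.topic_path("your-project-id", "banned-country-topic")
#   publish_banned_request(publisher, topic_path, "Testland", "/1.html").result()
```

Calling `.result()` on the returned future blocks until the message is accepted by Pub/Sub, which is useful in scripts but usually unnecessary inside a request handler.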

5️⃣ Compute PageRank on Cloud Dataflow

gcloud dataflow jobs run compute-pagerank \
    --gcs-location gs://your-bucket/path-to-pagerank-pipeline \
    --region us-central1
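
Before PageRank can run, the pipeline needs each page's outgoing links. One way to extract them with only the standard library is sketched below. The class and function names are hypothetical, and it is an assumption that the hosted pages express links as ordinary `<a href="...">` tags.

```python
from html.parser import HTMLParser
from typing import List


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return outgoing link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

In a Beam pipeline a function like `extract_links` would typically be applied per page to produce (page, links) pairs for the ranking step.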

6️⃣ Train & Deploy ML Model on GKE

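gcloud container clusters create analytics-cluster \
    --num-nodes=3 --zone=us-central1-a

gcloud builds submit --tag gcr.io/$PROJECT_ID/predictor

# Point kubectl at the new cluster, then run the image on GKE.
gcloud container clusters get-credentials analytics-cluster \
    --zone=us-central1-a

kubectl create deployment predictor-service \
    --image=gcr.io/$PROJECT_ID/predictor

# Port 8080 is an assumption; use the port the predictor container listens on.
kubectl expose deployment predictor-service \
    --type=LoadBalancer --port=80 --target-port=8080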

Insights & Results

  • Top Web Pages: Identified via PageRank computation.
  • Traffic Analysis: Logged 100,000+ requests, categorized by region, age, and gender.
  • Banned Country Tracking: Requests flagged via Pub/Sub logs.
  • ML Model Accuracy: Achieved 99.7% accuracy in demographic predictions.

Contributing & Usage

  • Modify and adapt configurations as needed.
  • Ensure credentials & permissions are set correctly.
  • Star ⭐ the repo if you found this useful!

Contact

Feel free to reach out via:


© 2025 Mohit Sai Gutha | Built using Google Cloud, Dataflow & GKE
