Google Cloud Data Extraction Component Setup

Kadi-7 edited this page Jul 5, 2023 · 10 revisions

Author: Abdelkader Alkadour

Date: 20.06.2023

This documentation describes how to set up and configure the Google Cloud environment needed to run the Data Extraction Component. Follow the steps below in order to ensure a successful setup.

Prerequisites

Before you begin, make sure you have the following:

  • Google Cloud account
  • Project ID
  • Billing account connected to the project

Configuration Steps

  1. Open the Google Cloud Console and sign in to your Google Cloud account.

  2. Initialize the Google Cloud CLI by running the following command in the terminal:

    gcloud init

    This command will sign you in and configure the project.

  3. Connect the project to a billing account. Go to the Google Cloud Console, navigate to "Billing" under "IAM & Admin," and follow the instructions to connect a billing account to your project.
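
Whether a billing account is actually linked can also be checked from the terminal. The following is a minimal sketch, assuming that `gcloud billing projects describe` is available in your gcloud version and that its `billingEnabled` field is printed as `True` or `False`; the helper name `billing_enabled` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when billing is enabled for the given project.
billing_enabled() {
    local project_id="$1" enabled
    # Assumption: the value() formatter prints the boolean as "True" or "False".
    enabled="$(gcloud billing projects describe "$project_id" \
        --format='value(billingEnabled)')" || return 1
    [ "$enabled" = "True" ]
}

# Usage (replace the placeholder with your project ID first):
# billing_enabled "<ProjectID>" && echo "billing is linked"
```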

  4. Enable the Compute Engine API by running the following command:

    gcloud services enable compute.googleapis.com
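
To confirm that the API was enabled, the list of enabled services can be queried. This is only a sketch: it assumes that `gcloud services list --enabled` exposes the service name in the `config.name` field, and `api_enabled` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given API is listed as enabled.
api_enabled() {
    local api="$1" found
    # Assumption: enabled services expose their name in the config.name field.
    found="$(gcloud services list --enabled \
        --filter="config.name=$api" \
        --format='value(config.name)')" || return 1
    [ "$found" = "$api" ]
}

# Usage:
# api_enabled compute.googleapis.com && echo "Compute Engine API is enabled"
```
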
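  5. Create the VM instance by running the following command. Replace <ProjectID> with the ID of your project and <Region> with the zone in which you want to deploy the Data Extraction Component. Note that a full zone name is required here, not just a region, because both the --zone flag and the disk type path expect a zone.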

    gcloud compute instances create data-retriver \
        --zone=<Region> \
        --machine-type=e2-highmem-2 \
        --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default \
        --maintenance-policy=MIGRATE \
        --provisioning-model=STANDARD \
        --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
        --create-disk=auto-delete=yes,boot=yes,device-name=instance-1,image=projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20230616,mode=rw,size=20,type=projects/<ProjectID>/zones/<Region>/diskTypes/pd-balanced \
        --no-shielded-secure-boot \
        --shielded-vtpm \
        --shielded-integrity-monitoring \
        --labels=goog-ec-src=vm_add-gcloud \
        --reservation-affinity=any
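
Both the `--zone` flag and the `zones/...` segment of the disk type path expect a zone such as `europe-west3-a`, while a region is the zone name without its trailing letter. The sketch below shows that relationship and how the disk type path is assembled. The helper names `zone_to_region` and `disk_type_path` and the example values are assumptions for illustration, not part of the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the trailing zone letter,
# e.g. "europe-west3-a" becomes "europe-west3".
zone_to_region() {
    printf '%s\n' "${1%-*}"
}

# Hypothetical helper: build the disk type path passed to --create-disk.
disk_type_path() {
    local project_id="$1" zone="$2"
    printf 'projects/%s/zones/%s/diskTypes/pd-balanced\n' "$project_id" "$zone"
}

# Example values; both are assumptions, substitute your own.
ZONE="europe-west3-a"
PROJECT_ID="my-project-id"

zone_to_region "$ZONE"
disk_type_path "$PROJECT_ID" "$ZONE"
```

Defining the zone and project ID once as variables and reusing them avoids replacing the placeholders by hand in several places of the long command.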
    
  6. SSH into the instance by running the following command:

    gcloud compute ssh --zone <Region> "root@data-retriver"

    Replace <Region> with the zone of your instance.

    If this command fails, open the dropdown menu next to the instance's SSH button in the Google Cloud Console and select "View gcloud command". The displayed command can be run in your local shell.

    Once you are connected to the machine, enter the following command to switch to the root user:

    sudo su -
  7. Update the OS and install Miniconda:

    sudo apt update
    sudo apt upgrade -y
    curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
    sudo bash Miniconda3-latest-Linux-x86_64.sh -b -p /home/ubuntu/miniconda3
    /home/ubuntu/miniconda3/bin/conda init 
    exit
    
  8. Reconnect to the VM, clone the project, and add the tokens file:

    Reconnect:

    gcloud compute ssh --zone <Region> "root@data-retriver"

    Clone the project:

    git clone https://github.com/amosproj/amos2023ss03-qachat.git 
    cd amos2023ss03-qachat/
    export PYTHONPATH="$PYTHONPATH:$(pwd)"
    cd QAChat/Data_Processing
    pip install -r requirements.txt
    conda install -c conda-forge poppler
    sudo apt-get install tesseract-ocr tesseract-ocr-deu
    pip install pdf2image
    python -m spacy download de_core_news_sm
    python -m spacy download xx_ent_wiki_sm
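
Before continuing, it can help to confirm that the command-line tools installed above are actually on the `PATH`. This is a small sketch, assuming the binaries are named `python`, `tesseract`, and `pdftoppm` (the latter ships with poppler and is used by `pdf2image`); `missing_tools` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every given command that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing="$(missing_tools python tesseract pdftoppm)"
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Not found on PATH:" $missing
else
    echo "All required tools are available."
fi
```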

    Add API credentials:

    nano /root/amos2023ss03-qachat/tokens.env
    nano /root/amos2023ss03-qachat/QAChat/Data_Processing/credentials_file.json
  9. Add a startup script. This script runs the data extraction when the VM boots and ensures that the VM is shut down once the task has finished.

    Click on the name of the VM.

    Then click on Edit.

    Scroll down to Metadata and add the following startup script:

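    #!/bin/bash
    # Make the project importable as a Python package root.
    export PYTHONPATH=$PYTHONPATH:/root/amos2023ss03-qachat/
    
    # Export every KEY=VALUE line of tokens.env as an environment variable.
    env_file="/root/amos2023ss03-qachat/tokens.env"
    while IFS= read -r line || [ -n "$line" ]; do
       line="${line// /}"    # Replace space with nothing
       line="${line//\"/}"   # Replace double quotes with nothing
       case "$line" in ''|\#*) continue ;; esac   # Skip empty lines and comments
       export "$line"
    done < "$env_file"
    
    export CREDENTIALS_JSON_FILE=/root/amos2023ss03-qachat/QAChat/Data_Processing/credentials_file.json
    
    # Run the data extraction with the Miniconda Python interpreter.
    /home/ubuntu/miniconda3/bin/python /root/amos2023ss03-qachat/QAChat/Data_Processing/main.py
    
    # Shut the VM down once the task has finished.
    sudo shutdown -h now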
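
To see what the `while` loop in the startup script does with the tokens file, the same stripping steps can be tried on a throwaway file. This is only an illustration: the variable names `EXAMPLE_TOKEN` and `EXAMPLE_URL` and their values are made up and are not the keys the project expects.

```shell
#!/usr/bin/env bash
# Write a throwaway env file; the names and values are made up.
env_file="$(mktemp)"
cat > "$env_file" <<'EOF'
EXAMPLE_TOKEN = "abc123"
EXAMPLE_URL="https://example.com/api"
EOF

# The same stripping steps as in the startup script's loop.
while IFS= read -r line || [ -n "$line" ]; do
    line="${line// /}"    # remove every space
    line="${line//\"/}"   # remove every double quote
    export "$line"        # e.g. export "EXAMPLE_TOKEN=abc123"
done < "$env_file"
rm -f "$env_file"

echo "$EXAMPLE_TOKEN"
echo "$EXAMPLE_URL"
```

Because every space is removed from each line, values that must contain spaces cannot be passed through the tokens file with this loop.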

    Finally, click on Save.
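
As an alternative to the console, the startup script can probably also be attached from the terminal with `gcloud compute instances add-metadata`. The sketch below assumes the script has been saved locally under the hypothetical file name `startup.sh`; `set_startup_script` is likewise a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upload a local file as the startup script
# of the data-retriver instance.
set_startup_script() {
    local zone="$1" script_file="$2"
    gcloud compute instances add-metadata data-retriver \
        --zone="$zone" \
        --metadata-from-file=startup-script="$script_file"
}

# Usage (replace <Region> with the zone of your instance):
# set_startup_script "<Region>" startup.sh
```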
