De Bias


Overview

This repository hosts the Debias project, which shows relationships between different concepts in the news.

We cover different geographical locations (mainly USA and UK), different political positions (taken from AllSides) and various news providers.

The final goal is to create an interactive visualization that shows how concepts are interconnected across different time periods and from different points of view.

Technologies

Python
Docker
Redis
MinIO
NATS
Postgres
Playwright
Litestar
Polars
D3.js

NLP:

Transformers
SpaCy

Services

Scraper

Scraper is a service which scrapes news from different news providers. The service recursively calls itself to scrape the next news pages. If a page requires rendering, it is sent to the renderer service. If a page is static, its content is stored in the s3 service, its metadata is stored in the metastore service, and the processor service is called to process the page.
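The routing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the Page class, the needs_render flag, the action names, and the link-following helper are all hypothetical, and the source does not say how the scraper decides that a page needs rendering.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class Page:
    url: str
    html: str
    needs_render: bool  # hypothetical flag, detection is not described in the source


def next_urls(page: Page) -> list[str]:
    """Resolve the links found on a page; these would be scraped next."""
    collector = LinkCollector()
    collector.feed(page.html)
    return [urljoin(page.url, href) for href in collector.links]


def route_page(page: Page) -> list[str]:
    """Return the ordered (hypothetical) actions taken for a page."""
    if page.needs_render:
        # Dynamic pages are handed over to the renderer service.
        return ["send_to_renderer"]
    # Static pages are stored, then passed on for processing.
    return ["store_html_in_s3", "store_metadata_in_metastore", "notify_processor"]
```

In the real service each action would presumably be a call to the corresponding service rather than a string, and each URL returned by next_urls would be queued for another scraping round.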

Renderer

Renderer is a service which renders news pages using a browser API. It is called by the scraper service. After rendering, it saves the HTML content to the s3 service and the metadata to the metastore service, then sends a request to the processor service to process the page.

Processor

Processor is a service which processes news pages. It extracts human-readable text from the page, runs the NLP pipeline, and stores the results in the wordstore service.

NLP pipeline

Classifier
- A zero-shot classifier from HuggingFace Transformers. In particular, MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli, chosen for its comparatively small size.
Extractor
- A keyword extraction algorithm built on SpaCy. SpaCy is used to extract Named Entities, which are used as keywords after processing.
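The two pipeline stages could look roughly like the sketch below. It is an illustration under assumptions: the candidate topic labels, the score threshold, the SpaCy model name en_core_web_sm, and the normalization applied to entities are all hypothetical, since the source does not specify them. The heavy libraries are imported lazily so the pure helper functions can be used without them.

```python
from collections import Counter
from typing import Optional

MODEL_NAME = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
# Hypothetical topic labels; the source does not list the ones it uses.
CANDIDATE_TOPICS = ("politics", "economy", "sports", "technology")


def classify_topic(text: str, labels=CANDIDATE_TOPICS) -> dict:
    """Run zero-shot classification (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: heavy dependency

    # A real service would build the pipeline once and reuse it.
    classifier = pipeline("zero-shot-classification", model=MODEL_NAME)
    return classifier(text, candidate_labels=list(labels))


def top_topic(result: dict, min_score: float = 0.5) -> Optional[str]:
    """Pick the best label from a zero-shot result, or None if too uncertain."""
    # The pipeline returns "labels" and "scores" sorted by descending score.
    label, score = result["labels"][0], result["scores"][0]
    return label if score >= min_score else None


def extract_entities(text: str) -> list[str]:
    """Extract named-entity strings with SpaCy."""
    import spacy  # lazy import: heavy dependency

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    return [ent.text for ent in nlp(text).ents]


def entities_to_keywords(entities: list[str]) -> Counter:
    """Normalize entity strings and count them (assumed post-processing)."""
    normalized = (" ".join(entity.split()).lower() for entity in entities)
    return Counter(keyword for keyword in normalized if keyword)
```

Here entities_to_keywords(extract_entities(text)) would yield keyword frequencies for a page, and top_topic(classify_topic(text)) its most likely topic.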

Server

Web server which serves the results of the processor. It aggregates the statistics of the words, precomputes and caches aggregations, and serves them to the client. It serves the frontend files as well.
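To illustrate the idea of precomputing and caching aggregations, here is a minimal standard-library sketch. The row layout (leaning, keyword, count), the sample data, and the function names are hypothetical, and functools.lru_cache is only a stand-in for whatever cache the server actually uses.

```python
from collections import Counter, defaultdict
from functools import lru_cache

# Hypothetical rows as they might come from the wordstore: (leaning, keyword, count).
ROWS = (
    ("left", "election", 3),
    ("left", "election", 2),
    ("left", "tax", 1),
    ("right", "tax", 4),
)


def aggregate(rows) -> dict:
    """Sum keyword counts per political leaning."""
    totals = defaultdict(Counter)
    for leaning, keyword, count in rows:
        totals[leaning][keyword] += count
    return dict(totals)


@lru_cache(maxsize=None)
def top_keywords(leaning: str, limit: int = 10) -> tuple:
    """Return the most frequent keywords for a leaning; repeated calls hit the cache."""
    counts = aggregate(ROWS).get(leaning, Counter())
    return tuple(counts.most_common(limit))
```

A request handler would then return top_keywords("left") directly, paying the aggregation cost only on the first call for each argument combination.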

Metastore

A postgres database which stores metadata of the scraped pages.

S3

An S3 provider which stores the static pages. It can be a local MinIO deployment or an external S3 cloud service.

Wordstore

A postgres database which stores the processed pages, keywords, topics, and their corresponding frequencies.

Message queue

A NATS message queue which is used for service-to-service (S2S) communication.
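As a rough sketch of what a service-to-service message might look like, the example below serializes a request to JSON bytes and publishes it with the nats-py client. The ProcessRequest fields, the subject name processor.requests, and the server URL are assumptions made for illustration; the source does not describe the actual message schema or subjects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProcessRequest:
    """Hypothetical message asking the processor to handle a stored page."""

    url: str
    s3_key: str


def encode(message: ProcessRequest) -> bytes:
    """Serialize a message to the bytes payload NATS expects."""
    return json.dumps(asdict(message)).encode("utf-8")


def decode(payload: bytes) -> ProcessRequest:
    """Rebuild a message from a received payload."""
    return ProcessRequest(**json.loads(payload))


async def publish(
    message: ProcessRequest,
    subject: str = "processor.requests",  # assumed subject name
    server: str = "nats://localhost:4222",  # assumed server URL
) -> None:
    """Publish one message using nats-py (imported lazily)."""
    import nats

    connection = await nats.connect(server)
    await connection.publish(subject, encode(message))
    await connection.drain()  # flush pending messages and close
```

With a NATS server running, a message could be sent with asyncio.run(publish(ProcessRequest(url="...", s3_key="..."))).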

Deploy

The JavaScript visualization is available at https://debias.dartt0n.ru/

Using external S3 provider

Create a .env file and fill in the following variables:

PG_USERNAME=...
PG_PASSWORD=...

Create configuration files

debias/scraper/config.toml
debias/server/config.toml
debias/processor/config.toml
debias/renderer/config.toml

Note

You can find example configuration in the following files:

Pre-download ML models

mkdir models
uv run --group processor download-models.py

Run services

docker compose -f docker-compose.yml up --build --detach

Using local S3 provider

Create a .env file and fill in the following variables:

MINIO_ACCESS_KEY=...
MINIO_SECRET_KEY=...
MINIO_BUCKET=...

PG_USERNAME=...
PG_PASSWORD=...

Create configuration files

debias/scraper/config.toml
debias/server/config.toml
debias/processor/config.toml
debias/renderer/config.toml

Note

You can find example configuration in the following files:

Pre-download ML models

mkdir models
uv run --group processor download-models.py

Create MinIO S3 service using docker:

docker compose -f minio.docker-compose.yml up minio_setup

Scale services for better performance!

The following services could be automatically scaled horizontally for better performance:

scraper
renderer
processor

For easy scaling, use the docker compose --scale option.

E.g., the following command will launch 5 scraper instances, 2 renderer instances, and 2 processor instances:

docker compose up --detach \
  --scale scraper=5 \
  --scale renderer=2 \
  --scale processor=2

Remove services

To stop and remove all containers AND THEIR VOLUMES:

docker compose -f minio.docker-compose.yml down --volumes
# or
docker compose -f docker-compose.yml down --volumes

Development

Structure

.
└── debias              # shared code root
    ├── core            # reusable components - s3, metastore, configs, etc
    ├── scraper         # scraper related code
    ├── processor       # NLP processor related code
    ├── renderer        # browser renderer related code
    └── server          # server related code
        └── frontend    # frontend related code

Adding new service

To add new service:

Create new directory in debias directory
Create dockerfile prefixed with servicename (e.g. scraper.dockerfile)
Add all the required dependencies to pyproject.toml under --group servicename
Add new package to tool.hatch.build.targets.wheel config in pyproject.toml

Frontend Development

Create a .env file and fill in the following variables:

PG_USERNAME=...
PG_PASSWORD=...

Launch the database container using docker compose:

docker compose up -d database

Generate random data. Set the environment variable POSTGRES_CONNECTION to the connection string of the database (replace USERNAME and PASSWORD with your actual username and password):

POSTGRES_CONNECTION="postgresql://USERNAME:PASSWORD@localhost:5432/postgres" uv run generate-data.py

Create the server configuration file config.toml. Replace ${PG_USERNAME} and ${PG_PASSWORD} with your actual username and password:

[pg]
connection = "postgresql://${PG_USERNAME}:${PG_PASSWORD}@localhost:5432/postgres"
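If the username or password contains characters such as @ or :, pasting them into the connection string verbatim produces an invalid URL. The following sketch shows one way to build the string safely from the PG_USERNAME and PG_PASSWORD variables of the .env file. The helper names are hypothetical, and the host, port, and database defaults simply mirror the example above.

```python
import os
from urllib.parse import quote


def pg_connection(
    username: str,
    password: str,
    host: str = "localhost",
    port: int = 5432,
    database: str = "postgres",
) -> str:
    """Build a Postgres connection URL, percent-encoding the credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{database}"


def pg_connection_from_env(env=os.environ) -> str:
    """Read PG_USERNAME and PG_PASSWORD from the environment (or any mapping)."""
    return pg_connection(env["PG_USERNAME"], env["PG_PASSWORD"])
```

The resulting string can be pasted into config.toml or exported as POSTGRES_CONNECTION.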

Launch backend server with hot reload

CONFIG=config.toml uv run litestar --app debias.server:app run --debug --reload

EDA

We have collected 38 news sources from the USA and the UK and determined their political positions.

Distribution of political positions overall

Distribution of political positions in the USA

Distribution of political positions in the UK

Bonus: Distribution of political positions of sources which require VPN

It seems left parties are indeed more liberal.

We have parsed several news articles using Python and prepared a deployment describing general trends in these articles.

The deployment can be found on GitHub Pages

Visualization

The visualization is divided into 3 parts:

Comparison of topics distribution for Left-Leaning and Right-Leaning media.
Comparison of keywords networks for Left-Leaning and Right-Leaning media.
Sandbox network with filtering functionality.

All visualizations are created using D3.js.

You can view the visualization at https://debias.dartt0n.ru/
