Welcome to the SEA Wikipedia Code Repository. This repo contains the scripts used to generate the data in SEA Wikipedia HF. The data are extracted from Wikipedia HF and processed with the scripts available in this repository for reproducibility purposes. For licensing purposes, this codebase adopts the GNU General Public License, while the generated data follows the Wikipedia license of CC-BY-SA 4.0.
Please refer to the HF Repo on SEA Wiki HF to see the `load_dataset` implementation for the ready-to-use data.
You can also inspect the data yourself in several folders of this repo. The complete data can be found in the corresponding HF Repo mentioned previously.
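As a quick illustration, loading the ready-to-use data from the Hub could look like the sketch below. The repo id and split name are placeholders (assumptions), so check the SEA Wiki HF dataset card for the exact identifiers and configs.

```python
# Minimal sketch of loading the ready-to-use data from the HF Hub.
# NOTE: "username/sea_wiki" and the split name are placeholders;
# refer to the SEA Wiki HF dataset card for the actual repo id and configs.
from datasets import load_dataset

dataset = load_dataset("username/sea_wiki", split="train")
print(dataset)               # schema and number of rows
print(dataset[0]["title"])   # peek at the first article title (columns assumed: id, url, title, text)
```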
To reproduce the data generation yourself:
1. Set up a new Python/Conda environment (recommended Python version: 3.9.6 to 3.9.18 or 3.10.0 to 3.10.13) and install the requirements in `requirements.txt` via `pip install -r requirements.txt`.
2. Activate the chosen Python/Conda environment in which the requirements were installed.
3. Force-install `multiprocess==0.70.15` using `pip install multiprocess==0.70.15` to avoid this issue (there is no other workaround for now).
4. Run the `sh` script for extraction from Wikipedia HF using `sh extract_raw_wiki_data_sea.sh`. This script runs `extract_raw_wiki_data.py` to construct the Wiki Dataset.
5. Run the `sh` script for deduplication of the data extracted in Step 4 using `sh dedup_raw_wiki_data_sea.sh`. This script runs `dedup_raw_wiki_data.py` to do the Wiki Dataset cleansing. Please note that the cleansing process can be language/dialect specific. A minimal Python sketch chaining Steps 4 and 5 is shown after this list.
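If you prefer to drive both shell scripts from Python instead of running them manually, here is a minimal sketch; it assumes you run it from the repo root and that the scripts need no extra CLI arguments (check each `.sh` file for configurable variables).

```python
# Minimal sketch chaining the extraction (Step 4) and deduplication (Step 5) scripts.
# Assumption: run from the repo root, and the scripts take no extra CLI arguments.
import subprocess

subprocess.run(["sh", "extract_raw_wiki_data_sea.sh"], check=True)  # Step 4: extract raw Wiki data
subprocess.run(["sh", "dedup_raw_wiki_data_sea.sh"], check=True)    # Step 5: deduplicate/cleanse
```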
How is the data preprocessed? What makes it different from loading it directly from Wikipedia HF?
The data available here are processed with the following flow:
- Raw data is deduplicated on `title` and `text` (the text content of a given article) to remove articles containing boilerplate text (template text usually used for unavailable information or for requesting contributions of content to that article), which is usually deemed noisy for NLP data.
- Furthermore, the `title` and `text` data are checked for string-matching duplication (duplication of text after pre-processing, i.e. symbols removed, HTML tags stripped, and ASCII/UTF-8 chars validated). You may check the `dedup_raw_wiki_data.py` script to understand its implementation; a simplified sketch of the idea follows this list.
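To make the flow above concrete, here is a simplified, hypothetical sketch of the deduplication idea using pandas. The normalization rules and file names are assumptions; the real, language/dialect-aware logic lives in `dedup_raw_wiki_data.py`.

```python
# Simplified sketch of the dedup flow described above (not the actual script).
# The normalization steps and file names below are assumptions; see
# dedup_raw_wiki_data.py for the real, language-aware implementation.
import re
import pandas as pd

def normalize(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", str(text))      # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)           # remove symbols
    return re.sub(r"\s+", " ", text).strip().lower()

df = pd.read_csv("raw_wiki_data.csv")              # assumed columns: id, url, title, text

# 1) Exact deduplication on title and text.
df = df.drop_duplicates(subset=["title", "text"])

# 2) String-matching deduplication on the normalized title/text.
df["_norm_title"] = df["title"].map(normalize)
df["_norm_text"] = df["text"].map(normalize)
df = df.drop_duplicates(subset=["_norm_title", "_norm_text"])

df.drop(columns=["_norm_title", "_norm_text"]).to_csv("dedup_wiki_data.csv", index=False)
```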
You may check the script `extract_raw_wiki_data.py` to understand its implementation, or you can adjust the bash script provided in `extract_raw_wiki_data_sea.sh` to run the extraction on your own.
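For reference, a minimal extraction sketch using the HF `wikipedia` loader is shown below. The language code and dump date are placeholders, and the exact arguments (e.g. `beam_runner`) may differ from what `extract_raw_wiki_data.py` actually uses, so treat this as an assumption-laden illustration.

```python
# Minimal sketch of pulling one language's dump via the HF "wikipedia" loader
# and saving it to CSV. The language code ("ace") and dump date are placeholders;
# the actual arguments used by extract_raw_wiki_data.py may differ.
from datasets import load_dataset

raw = load_dataset(
    "wikipedia",
    language="ace",             # placeholder language code (Acehnese)
    date="20231101",            # placeholder dump date from the Wikipedia Dump Index
    beam_runner="DirectRunner", # the legacy loader processes raw dumps with Apache Beam
)["train"]

raw.to_csv("raw_wiki_data_ace.csv", index=False)
```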
You may visit this Wikipedia Dump Index to check the latest available data, and this Wikipedia Language Coverage link to map to any languages you want to extract. Please note that this dataset is extensible to any languages of your choice.
Don't worry! You can do batched loading by looking at the scripts `extract_raw_wiki_data_batched.py` and `extract_raw_wiki_data_batched_example.sh`. Please note that the batched approach outputs the same data as the direct flow, though possibly in a different order (this can be verified by joining on `id`).
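A hedged sketch of that verification is below; the file names and the presence of an `id` column in the outputs are assumptions.

```python
# Sketch of verifying that batched outputs match the direct (single-run) output
# by aligning rows on "id". File names and the "id" column are assumptions
# about the outputs of extract_raw_wiki_data(_batched).py.
import glob
import pandas as pd

direct = pd.read_csv("raw_wiki_data_direct.csv")
batched = pd.concat(
    (pd.read_csv(path) for path in glob.glob("batched_output/*.csv")),
    ignore_index=True,
)

# Order both frames by id, then compare row by row.
direct = direct.sort_values("id").reset_index(drop=True)
batched = batched.sort_values("id").reset_index(drop=True)
assert direct.equals(batched), "Batched output differs from the direct flow"
```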
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}

@ONLINE{wikipedia-hf,
    title = "Huggingface Wikipedia Dataset",
    url   = "https://huggingface.co/datasets/wikipedia"
}