Welcome to the SEA Wikipedia Code Repository. This repo contains the scripts used to generate the data in SEA Wikipedia HF. The data are extracted from Wikipedia HF and processed with the scripts available in this repository for reproducibility purposes. For licensing purposes, this codebase adopts the GNU General Public License, while the generated data follows the Wikipedia license of CC-BY-SA 4.0.
Please refer to the HF Repo on SEA Wiki HF to see the `load_dataset` implementation for the ready-to-use data.
You can also inspect the data yourself in several folders of this repo. The complete data can be found in the corresponding HF Repo mentioned previously.
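As a quick illustration, loading the ready-to-use data from the Hub could look like the sketch below. The repo id and split name are placeholders (assumptions), so check the SEA Wiki HF dataset card for the exact identifiers and configs.

```python
# Minimal sketch of loading the ready-to-use data from the HF Hub.
# NOTE: "username/sea_wiki" and the split name are placeholders;
# refer to the SEA Wiki HF dataset card for the actual repo id and configs.
from datasets import load_dataset

dataset = load_dataset("username/sea_wiki", split="train")
print(dataset)               # schema and number of rows
print(dataset[0]["title"])   # peek at the first article title (columns assumed: id, url, title, text)
```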
To reproduce the data generation yourself:
1. Set up a new Python/Conda environment (recommended Python version: 3.9.6 to 3.9.18 or 3.10.0 to 3.10.13) and install the requirements in `requirements.txt` via `pip install -r requirements.txt`.
2. Activate the chosen Python/Conda environment in which the requirements were installed.
3. Force-install `multiprocess==0.70.15` using `pip install multiprocess==0.70.15` to avoid this issue (there is no other workaround for now).
4. Run the `sh` script for extraction from Wikipedia HF using `sh extract_raw_wiki_data_sea.sh`. This script runs `extract_raw_wiki_data.py` to construct the Wiki Dataset.
5. Run the `sh` script for deduplication of the data extracted in Step 4 using `sh dedup_raw_wiki_data_sea.sh`. This script runs `dedup_raw_wiki_data.py` to do the Wiki Dataset cleansing. Please note that the cleansing process can be language/dialect specific. A minimal Python sketch chaining Steps 4 and 5 is shown after this list.
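If you prefer to drive both shell scripts from Python instead of running them manually, here is a minimal sketch; it assumes you run it from the repo root and that the scripts need no extra CLI arguments (check each `.sh` file for configurable variables).

```python
# Minimal sketch chaining the extraction (Step 4) and deduplication (Step 5) scripts.
# Assumption: run from the repo root, and the scripts take no extra CLI arguments.
import subprocess

subprocess.run(["sh", "extract_raw_wiki_data_sea.sh"], check=True)  # Step 4: extract raw Wiki data
subprocess.run(["sh", "dedup_raw_wiki_data_sea.sh"], check=True)    # Step 5: deduplicate/cleanse
```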
How is the data preprocessed? What makes it different from loading it directly from Wikipedia HF?
The data available here are processed with the following flow:
- Raw data is deduplicated on `title` and `text` (the text content of a given article) to remove articles containing boilerplate text (template text usually used for unavailable information or for requesting contributions of content to that article), which is usually deemed noisy for NLP data.
- Furthermore, the `title` and `text` data are checked for string-matching duplication (duplication of text after pre-processing, i.e. symbols removed, HTML tags stripped, and ASCII/UTF-8 chars validated). You may check the `dedup_raw_wiki_data.py` script to understand its implementation; a simplified sketch of the idea follows this list.
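To make the flow above concrete, here is a simplified, hypothetical sketch of the deduplication idea using pandas. The normalization rules and file names are assumptions; the real, language/dialect-aware logic lives in `dedup_raw_wiki_data.py`.

```python
# Simplified sketch of the dedup flow described above (not the actual script).
# The normalization steps and file names below are assumptions; see
# dedup_raw_wiki_data.py for the real, language-aware implementation.
import re
import pandas as pd

def normalize(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", str(text))      # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)           # remove symbols
    return re.sub(r"\s+", " ", text).strip().lower()

df = pd.read_csv("raw_wiki_data.csv")              # assumed columns: id, url, title, text

# 1) Exact deduplication on title and text.
df = df.drop_duplicates(subset=["title", "text"])

# 2) String-matching deduplication on the normalized title/text.
df["_norm_title"] = df["title"].map(normalize)
df["_norm_text"] = df["text"].map(normalize)
df = df.drop_duplicates(subset=["_norm_title", "_norm_text"])

df.drop(columns=["_norm_title", "_norm_text"]).to_csv("dedup_wiki_data.csv", index=False)
```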
You may check the script `extract_raw_wiki_data.py` to understand its implementation, or you can adjust the bash script provided in `extract_raw_wiki_data_sea.sh` to run the extraction on your own.
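For reference, a minimal extraction sketch using the HF `wikipedia` loader is shown below. The language code and dump date are placeholders, and the exact arguments (e.g. `beam_runner`) may differ from what `extract_raw_wiki_data.py` actually uses, so treat this as an assumption-laden illustration.

```python
# Minimal sketch of pulling one language's dump via the HF "wikipedia" loader
# and saving it to CSV. The language code ("ace") and dump date are placeholders;
# the actual arguments used by extract_raw_wiki_data.py may differ.
from datasets import load_dataset

raw = load_dataset(
    "wikipedia",
    language="ace",             # placeholder language code (Acehnese)
    date="20231101",            # placeholder dump date from the Wikipedia Dump Index
    beam_runner="DirectRunner", # the legacy loader processes raw dumps with Apache Beam
)["train"]

raw.to_csv("raw_wiki_data_ace.csv", index=False)
```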
You may visit this Wikipedia Dump Index to check the latest available data, and this Wikipedia Language Coverage link to map to any languages you want to extract. Please note that this dataset is extensible to any languages of your choice.
Don't worry! You can do batched loading by looking at the scripts `extract_raw_wiki_data_batched.py` and `extract_raw_wiki_data_batched_example.sh`. Please note that the batched approach outputs the same data as the direct flow, though possibly in a different order (this can be verified by joining on `id`).
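A hedged sketch of that verification is below; the file names and the presence of an `id` column in the outputs are assumptions.

```python
# Sketch of verifying that batched outputs match the direct (single-run) output
# by aligning rows on "id". File names and the "id" column are assumptions
# about the outputs of extract_raw_wiki_data(_batched).py.
import glob
import pandas as pd

direct = pd.read_csv("raw_wiki_data_direct.csv")
batched = pd.concat(
    (pd.read_csv(path) for path in glob.glob("batched_output/*.csv")),
    ignore_index=True,
)

# Order both frames by id, then compare row by row.
direct = direct.sort_values("id").reset_index(drop=True)
batched = batched.sort_values("id").reset_index(drop=True)
assert direct.equals(batched), "Batched output differs from the direct flow"
```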
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}

@ONLINE{wikipedia-hf,
    title = "Huggingface Wikipedia Dataset",
    url   = "https://huggingface.co/datasets/wikipedia"
}