Added BrWac dataset #3880
- remove unused imports - change test input data and number of test examples
Regarding unchecked boxes in the checklist:
Hello @marcospiau , and thank you for your contribution! Datasets with manually downloaded files do not need checksum files. We would be happy to merge your PR -- could you please resolve the conflict in the
Hi guys, thanks for reviewing the code! I've solved the conflicts on Best,
Hi @ccl-core , could you please take a look and confirm everything is OK? Best,
Hello @marcospiau , thank you for the heads-up! I requested the manual dataset at the given homepage but I received an email that their server is down.
Hi @ccl-core, I tested the form request just now and had the same problem. I will contact the dataset maintainers and get back to you as soon as I have an answer. Best,
Hi @ccl-core,
Thank you very much, @marcospiau !
Hi, @ccl-core. It took a little longer than expected, but the new links are already working. Could you please check if everything is OK now? PS.: the links are different from the previous ones, so the form needs to be filled out again. Best,
Thank you @marcospiau , I'm having a look! I was wondering whether the dependency on
Hi, @ccl-core. Thanks for the review! This dependency is included because the raw text contains many errors due to mojibake. One could write code to replicate what this dependency does, but the final code would probably be very similar to
Dear @marcospiau , I understand. By the way, it seems like the You can register the new checksums with Thank you!
Hi @ccl-core! The link for downloading this dataset is available once a form is filled out, so a manual download is required. Is it possible to generate checksums for manually downloaded files?
Hi @marcospiau ! Yes, it is possible :) See e.g. the kaggle_wit dataset as an example:
Hi @ccl-core! I tried using the command provided, but only got an empty I don't know whether the instructions were different when I first submitted my PR, but at that time I was informed that checksum files are not required for manually downloaded datasets.
Hi @ccl-core , just a heads-up! Can we proceed with the onboarding process?
Hi @ccl-core ! Just a heads up! Can we proceed?
Thank you for your contribution!
Please read https://www.tensorflow.org/datasets/contribute#pr_checklist to make sure your PR follows the guidelines.
Add Dataset
dataset_info.json
Gist: https://gist.github.com/marcospiau/58e9b3528288b2fbc419dd86ec1f18b7#file-brwac_dataset_info-json
Description
The Brazilian Portuguese Web as Corpus is a large corpus constructed following the Wacky framework, which was made public for research purposes. The current corpus version, released in January 2017, is composed of 3.53 million documents and 2.68 billion tokens. In order to use this dataset, you must request access by filling out the form on the official homepage. Please note that this resource is available solely for academic research purposes, and you agree not to use it for any commercial applications.
Title and text fields are preprocessed using ftfy (Speer, 2019) Python library.
PS.: Description is extracted from official homepage.
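For readers unfamiliar with the mojibake that motivates the ftfy preprocessing, here is a minimal standard-library sketch of one common case: UTF-8 bytes wrongly decoded as Latin-1. It is only an illustration and not what ftfy or this PR actually does; ftfy handles many more cases. The function names `make_mojibake` and `repair_mojibake` and the sample string are hypothetical.

```python
def make_mojibake(text: str) -> str:
    """Simulate one common mix-up: UTF-8 bytes decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo that one specific mix-up; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


# Hypothetical sample text (not taken from the corpus).
original = "Português: coração, ação"
garbled = make_mojibake(original)
repaired = repair_mojibake(garbled)

print(garbled)
print(repaired)
```

Text that is already clean and contains accented characters fails the round trip and is returned as-is, which is why the `except` branch matters for a corpus that mixes clean and damaged documents.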
Checklist
__init__.py
download_and_prepare
successfully
BibTeX
format
scipy
), use lazy_imports (if applicable)