
Added BrWac dataset #3880

Open
wants to merge 21 commits into
base: master

marcospiau
Contributor

@marcospiau marcospiau commented Apr 12, 2022

Thank you for your contribution!

Please read https://www.tensorflow.org/datasets/contribute#pr_checklist to make sure your PR follows the guidelines.

Add Dataset

Description

The Brazilian Portuguese Web as Corpus is a large corpus constructed following the Wacky framework, which was made public for research purposes. The current corpus version, released in January 2017, is composed of 3.53 million documents and 2.68 billion tokens. In order to use this dataset, you must request access by filling out the form on the official homepage. Please note that this resource is available solely for academic research purposes, and you agree not to use it for any commercial applications.

Title and text fields are preprocessed using the ftfy (Speer, 2019) Python library.
PS: The description is taken from the official homepage.

Checklist

  • Address all TODO's
  • Add alphabetized import to subdirectory's __init__.py
  • Run download_and_prepare successfully
  • Add checksums file
  • Properly cite in BibTeX format
  • Add passing test(s)
  • Add test data
  • If using additional dependencies (e.g. scipy), use lazy_imports (if applicable)
  • Add data generation script (if applicable)
  • Lint code

@marcospiau marcospiau marked this pull request as ready for review April 12, 2022 13:24
@marcospiau
Contributor Author

marcospiau commented Apr 12, 2022

Regarding unchecked boxes in the checklist:

  • This dataset uses a manually downloaded file, and I was not able to generate the checksums file, even though I ran tfds build with the --register_checksums flag.
  • I didn't call download_and_prepare directly, but I built the dataset successfully from the command line using tfds build brwac --register_checksums --manual_dir=<DIRECTORY_WITH_MANUAL_DOWNLOAD>.
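
For reference, the build invocation described in the list above can be assembled from Python. This is a minimal sketch and not part of the PR: the helper names `tfds_build_command` and `manual_dir_is_ready` are hypothetical, the directory path is a placeholder, and only the `tfds build` flags quoted above are used.

```python
import shlex
from pathlib import Path


def tfds_build_command(dataset: str, manual_dir: str,
                       register_checksums: bool = True) -> list[str]:
    """Return the argv for the `tfds build` call quoted in the comment above.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    argv = ["tfds", "build", dataset]
    if register_checksums:
        argv.append("--register_checksums")
    argv.append(f"--manual_dir={manual_dir}")
    return argv


def manual_dir_is_ready(manual_dir: str) -> bool:
    """Check that the manual download directory exists and is not empty."""
    path = Path(manual_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    # Placeholder path; replace it with the directory holding the manual download.
    cmd = tfds_build_command("brwac", "/path/to/manual_download")
    print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once `manual_dir_is_ready` returns `True`, which avoids starting a build against an empty directory.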

@ccl-core ccl-core self-assigned this May 2, 2022
@ccl-core
Collaborator

Hello @marcospiau , and thank you for your contribution!

Datasets with manually downloaded files do not need checksum files.

We would be happy to merge your PR -- could you please resolve the conflict in the setup.py file before that?
This typically requires running:

# On your feature branch
git fetch origin master
git rebase origin/master

@marcospiau
Contributor Author

marcospiau commented May 19, 2022

Hi guys, thanks for reviewing the code!

I've resolved the conflicts in setup.py. Please let me know if there is anything else I can help with.

Best,
Marcos

@marcospiau
Contributor Author

Hi @ccl-core , could you please take a look and confirm everything is OK?

Best,
Marcos

@ccl-core ccl-core added the copybara-import Internal label for PR management label Jun 29, 2022
@ccl-core
Collaborator

Hello @marcospiau , thank you for the heads-up!

I requested the manual dataset at the given homepage, but I received an email saying that their server is down.
Did you also encounter a similar problem? As soon as I can access the data to put in the manual_dir I'll finish the testing and I'll be able to complete the onboarding process.

@marcospiau
Contributor Author

Hi @ccl-core,

I tested the form request just now and had the same problem. I will contact the dataset maintainers and get back to you as soon as I have an answer.

Best,
Marcos

@marcospiau
Contributor Author

Hi @ccl-core,
I spoke to them: the servers are being migrated and should be OK by the end of next week. I'll let you know as soon as the manual download is working again.
Best,
Marcos

@ccl-core
Collaborator

ccl-core commented Jul 6, 2022

Thank you very much, @marcospiau !

@marcospiau
Contributor Author

Hi, @ccl-core. It took a little longer than expected, but the new links are already working. Could you please check if everything is OK now? PS.: the links are different from the previous ones, so the form needs to be filled out again.

Best,
Marcos

@ccl-core
Collaborator

Thank you @marcospiau , I'm having a look!

I was wondering whether the dependency on ftfy is really necessary here? Or would there be a workaround?

@marcospiau
Contributor Author

Hi, @ccl-core. Thanks for the review! This dependency is included because the raw text contains many errors due to mojibake. One could write code to replicate what this dependency does, but the final code would probably be very similar to ftfy; besides, the few existing large language models pretrained in Portuguese use BrWac with ftfy preprocessing, so I think it's a good idea to use it as default preprocessing. What do you think?
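
To illustrate the kind of error mentioned above, here is a minimal standard-library sketch of how mojibake arises and how one simple case can be reversed. It is not ftfy's algorithm: ftfy handles many more encodings and mixed cases with heuristics. The function names `make_mojibake` and `repair_mojibake` and the sample word are assumptions chosen for illustration.

```python
def make_mojibake(text: str) -> str:
    """Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo the single UTF-8-read-as-Latin-1 case; leave other text unchanged.

    Hypothetical helper for illustration only. It covers one encoding
    mix-up, whereas real corpora usually contain several kinds.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean), so return as-is.
        return text


if __name__ == "__main__":
    broken = make_mojibake("coração")
    print(broken)                   # garbled form of the Portuguese word
    print(repair_mojibake(broken))  # original word restored
```

A hand-written fix like this only reverses one specific mix-up, which is why replicating a dedicated library's coverage would take considerably more code.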

@ccl-core
Collaborator

Dear @marcospiau , I understand.

By the way, it seems the checksums.tsv file is still missing. See tensorflow_datasets/text/bool_q/checksums.tsv as an example.

You can register the new checksums with tfds build --register_checksums

Thank you!

@marcospiau
Contributor Author

Hi @ccl-core! The link for downloading this dataset is available once a form is filled out, so a manual download is required. Is it possible to generate checksums for manually downloaded files?

@ccl-core
Collaborator

Hi @marcospiau ! Yes, it is possible :)

See, for example, the wit_kaggle dataset:
https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/vision_language/wit_kaggle

@marcospiau
Contributor Author

Hi @ccl-core! I tried the command you provided, but only got an empty checksums.tsv file. The example you pointed to (wit_kaggle) also has an empty checksums.tsv. Can I create a checksums.tsv file manually?

I don't know whether the instructions were different when I first submitted my PR, but at that time I was told that checksum files are not required for manually downloaded datasets:

Hello @marcospiau , and thank you for your contribution!

Datasets with manually downloaded files do not need checksum files.

We would be happy to merge your PR -- could you please resolve the conflict in the setup.py file before that? This typically requires running:

# On your feature branch
git fetch origin master
git rebase origin/master
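
On the question of creating a checksums file by hand: the thread does not establish whether TFDS accepts a hand-written checksums.tsv for manual downloads, and the row layout of that file is not shown here. The values such a file would most plausibly need, a byte size and a SHA-256 digest, can be computed for a local file with the standard library alone. This is a sketch under that assumption, and `file_size_and_sha256` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def file_size_and_sha256(path: str, chunk_size: int = 1 << 20) -> tuple[int, str]:
    """Return (size in bytes, hex SHA-256 digest) of a local file.

    The file is read in chunks so that a multi-gigabyte download
    does not have to fit in memory.
    """
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()
```

Calling `file_size_and_sha256` on the manually downloaded archive yields the two values. How they map onto a checksums.tsv row should be checked against a populated example in the repository before writing one by hand.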

@marcospiau
Contributor Author

marcospiau commented Oct 17, 2022

Hi @ccl-core , just a heads-up! Can we proceed with the onboarding process?

@marcospiau
Contributor Author

Hi @ccl-core ! Just a heads up! Can we proceed?
