Add malaria bbbc dataset #1025
base: master
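Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
📝 Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.
What to do if you already signed the CLA
Individual signers
- It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data (https://cla.developers.google.com/clas) and verify that your email is set on your git commits (https://help.github.com/articles/setting-your-email-in-git/).
Corporate signers
- Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (public version: https://opensource.google.com/docs/cla/#troubleshoot).
- The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data (https://cla.developers.google.com/clas) and verify that your email is set on your git commits (https://help.github.com/articles/setting-your-email-in-git/).
- The email used to register you as an authorized contributor must also be attached to your GitHub account (https://github.com/settings/emails).
ℹ️ Googlers: Go here for more info.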
@googlebot I signed it!
On Tue, Sep 24, 2019 at 7:54 PM, googlebot wrote: [quoted CLA notice omitted]
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
Ouwen
left a comment
Overall looks good.
us
left a comment
            "label": tfds.features.ClassLabel(names=_NAMES),
        }),
        supervised_keys=("image", "label"),
        urls=[_URL],
This URL should be the dataset's home page link, not the download link.
@us The dataset is released under the following license and is free to redistribute:
https://creativecommons.org/licenses/by-nc-sa/3.0/
Ouwen
left a comment
Tests pass, but the dataset server is very slow; it would be ideal to move the data to GCP. @powergkrry, can you give this a sanity check via the Colab notebook below?
https://colab.research.google.com/drive/1ozOwyvehz-XUeu9JLxhZLUYVstKuxylR
    import tensorflow as tf
    import tensorflow_datasets.public_api as tfds


    _URL = "http://people.duke.edu/~kk349/bbbc.zip"
@us @Conchylicultor What is the process for adding datasets like this to the TFDS bucket? Unfortunately, downloads from these personal Duke servers run at around 3 MB/s.
Unfortunately, for legal reasons, we are not allowed to host the datasets ourselves. Only the original author of a dataset can host it on a faster server.
@powergkrry, you are the original author of this dataset, correct?
Hi Ouwen,
We are not the original authors of the dataset. The data is obtained from https://data.broadinstitute.org/bbbc/BBBC041/
We are free to copy and redistribute the data though.
    consists of two classes of uninfected cells(red blood cells and leukocytes) and
    four classes of infected cells(gametocytes, rings, schizonts, and trophozoites).
    The Malaria dataset contains a total of 74,238 cell images and we have removed 441 cell images
    because annotators marked the cells as difficult as it was not clearly in one of the cell classes.
Line lengths should be PEP 8 compliant: 72 characters preferred, 79 characters maximum.
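As a rough illustration of the line-length point, the sketch below flags lines longer than 79 characters and re-wraps one of the description lines quoted above using the standard library. The helper name `find_overlong_lines` and the constant `PEP8_MAX_LINE_LENGTH` are hypothetical. This is only one way to check the limit; a linter would normally do this job.

```python
import textwrap

PEP8_MAX_LINE_LENGTH = 79


def find_overlong_lines(text, limit=PEP8_MAX_LINE_LENGTH):
    """Return (line_number, length) pairs for lines longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# One of the description lines from the diff, joined into a single line.
long_line = (
    "The Malaria dataset contains a total of 74,238 cell images "
    "and we have removed 441 cell images"
)

# Re-wrap the text so that no line exceeds the limit.
rewrapped = textwrap.fill(long_line, width=PEP8_MAX_LINE_LENGTH)
```

Calling `find_overlong_lines(long_line)` reports the original line, while `find_overlong_lines(rewrapped)` reports nothing.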
Ouwen
left a comment
https://colab.research.google.com/drive/1ozOwyvehz-XUeu9JLxhZLUYVstKuxylR
Tests pass and the dataset loads. It's unfortunate that the dataset download is slow, @powergkrry. Consider using Duke's data hosting service: https://research.repository.duke.edu/
Please make the minor documentation changes.
Hi @powergkrry, this doesn't seem to be an official dataset. If the data is from https://data.broadinstitute.org/bbbc/index.html, I'm not sure I understand why you're not using the official data from the website, which contains 2 GB of data (https://data.broadinstitute.org/bbbc/BBBC041/malaria.zip), versus 600 MB for this one. TFDS should redistribute datasets exactly as originally published; otherwise users may be misled and train their models on the wrong data.
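One way to confirm that a downloaded archive is byte-for-byte the official one is to compare its size and SHA-256 digest against values recorded from the original source. The sketch below uses only the standard library. The function names `sha256_and_size` and `matches_official` are hypothetical, and the expected digest and size must come from the official archive; none are supplied here.

```python
import hashlib


def sha256_and_size(path, chunk_size=1 << 20):
    """Stream a file and return its (hex SHA-256 digest, size in bytes)."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_official(path, expected_sha256, expected_size):
    """Return True only if both the digest and the size match."""
    actual_sha256, actual_size = sha256_and_size(path)
    return actual_sha256 == expected_sha256 and actual_size == expected_size
```

A repackaged 600 MB archive would fail this check against values taken from the 2 GB official file, which makes the mismatch visible before anyone trains on it.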