@powergkrry
No description provided.

@googlebot
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot added the "cla: no" label (Author has not signed CLA) on Sep 24, 2019
@powergkrry (Author) commented Sep 24, 2019 via email

@googlebot
CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot added the "cla: yes" label (Author has signed CLA) and removed the "cla: no" label (Author has not signed CLA) on Sep 24, 2019
@powergkrry changed the title from "version_1" to "Add malaria bbbc dataset" on Sep 25, 2019

@Ouwen (Contributor) left a comment

Overall looks good.

@us (Contributor) left a comment

"label": tfds.features.ClassLabel(names=_NAMES),
}),
supervised_keys=("image", "label"),
urls=[_URL],

This URL should be the dataset's home page link, not the download link.

@us The dataset is released under the following license and is free to redistribute:
https://creativecommons.org/licenses/by-nc-sa/3.0/

@Ouwen (Contributor) left a comment

Tests pass, but the dataset server is very slow; it would be ideal to move the data into GCP. @powergkrry can you give this a sanity check via the colab notebook below?

https://colab.research.google.com/drive/1ozOwyvehz-XUeu9JLxhZLUYVstKuxylR

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

_URL = "http://people.duke.edu/~kk349/bbbc.zip"

@us @Conchylicultor what is the process for adding datasets like this into the tfds bucket? The download from these personal duke servers is around 3MB/s unfortunately.

Unfortunately, for legal reasons, we are not allowed to host the datasets ourselves. Only the original author of a dataset can host it on a faster server.

@powergkrry you are the original author of this dataset, correct?

Hi Ouwen,
We are not the original authors of the dataset. The data is obtained from https://data.broadinstitute.org/bbbc/BBBC041/
We are free to copy and redistribute the data though.

consists of two classes of uninfected cells(red blood cells and leukocytes) and
four classes of infected cells(gametocytes, rings, schizonts, and trophozoites).
The Malaria dataset contains a total of 74,238 cell images and we have removed 441 cell images
because annotators marked the cells as difficult as it was not clearly in one of the cell classes.

Line length should be PEP 8 compliant: 72 characters preferred, 79 characters maximum.
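
As an illustration of the line-length comment above, a small standard-library sketch that finds overlong lines and re-flows description text. The helper names `overlong_lines` and `wrap_description` are hypothetical and do not come from the PR.

```python
import textwrap

MAX_LINE = 79   # hard limit named in the review comment
PREFERRED = 72  # preferred limit named in the review comment


def overlong_lines(text, limit=MAX_LINE):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


def wrap_description(text, width=PREFERRED):
    """Re-flow text so that no line exceeds `width` characters."""
    # Collapse existing line breaks first so the text is wrapped afresh.
    return textwrap.fill(" ".join(text.split()), width=width)
```

Running `overlong_lines` on the quoted description would report the lines that need attention, and `wrap_description` gives one possible re-flow to paste back into the docstring.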

@Ouwen (Contributor) left a comment

https://colab.research.google.com/drive/1ozOwyvehz-XUeu9JLxhZLUYVstKuxylR

Tests pass and the dataset loads. It is unfortunate that the dataset download is slow, @powergkrry. Consider using Duke's data hosting service: https://research.repository.duke.edu/

Please make the minor documentation changes.

@Conchylicultor added the "dataset request" label (Request for a new dataset to be added) on Oct 15, 2019
@Conchylicultor (Member) commented

Hi @powergkrry, this doesn't seem to be an official dataset.
If it is, could you provide the official homepage of this dataset?

If the data is from https://data.broadinstitute.org/bbbc/index.html, I'm not sure I understand why you're not using the official data from the website, which contains 2GB of data (https://data.broadinstitute.org/bbbc/BBBC041/malaria.zip), vs 600MB for this one.
Similarly, the original data provides both train and test sets, while this data only provides a test set.

TFDS should redistribute datasets as they were originally published; otherwise users may be misled and train their models on the wrong data.
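
For a sense of scale, a rough estimate of transfer time using the figures quoted in this thread. It assumes a constant 3 MB/s (the rate reported for the Duke-hosted archive; the speed of the official server is not stated here) and takes 2 GB as 2000 MB. `download_seconds` is a hypothetical helper.

```python
def download_seconds(size_mb, rate_mb_per_s):
    """Estimated transfer time in seconds at a constant rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


RATE_MB_PER_S = 3.0   # speed reported earlier in the thread (assumed constant)
SUBSET_MB = 600.0     # archive used by this PR
OFFICIAL_MB = 2000.0  # official malaria.zip, taking 2 GB as 2000 MB

for name, size in [("subset", SUBSET_MB), ("official", OFFICIAL_MB)]:
    minutes = download_seconds(size, RATE_MB_PER_S) / 60
    print(f"{name}: about {minutes:.1f} minutes")
```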

@cyfra added the "icebox" label (No response from the author for at least 1 month.) on Jan 3, 2020
@cyfra added the "author:please_respond" label (Author - please respond to the recent comments.) and removed the "cannot_merge:under review" label on Feb 8, 2020