
Datasets

Bastian Eichenberger edited this page Sep 21, 2020 · 14 revisions

❗ Disclaimer - This does not yet work and will be added in a PR very soon.

Here you'll learn how to create your own dataset for training or evaluation. We divide the creation process into the following steps:

  1. ✏️ Labelling
  2. 📤 Export labels
  3. 🗃️ Create dataset npz file

Labelling

Labelling is done in Fiji (download here) using the Multi-point Tool. To open this tool, right-click on the Point Tool. You should then see the view shown below. Double-click on the icon to configure the tool (e.g. to remove label numbers or change the point size).

Multi-point Tool in Fiji

After opening an image, label each blob by clicking on the spot, which adds a point. If you aren't happy with your selection, click+drag to move a point (wait until the cursor turns into a hand), option(alt)+click to remove a point, or press shift+A to delete all current points.

Example image labelling in Fiji

Now all that's left to do is to save the file with its labels into an empty directory of your choice (see the next step).

Export labels

After the previous step, you should have a directory with labelled images only. Please download our Fiji export macro, unzip, and open / drag it into Fiji and execute as shown below (if the download link does not work, save the raw file).

Run Fiji export macro

After execution, you should have a labels directory inside the selected image directory.

Create dataset npz file

Now, we can use deepBlink to convert the raw files into a single npz file ready for usage. Please run:

deepblink create --input INPUT --name NAME

Quick explanation on what is going on:

  • The NAME is the name of your dataset. The generated file will have the name NAME.npz. Feel free to pass in a path to change the saving location.
  • The INPUT is the directory containing all images and, by default, a labels subdirectory (as created in the previous step).
  • If you have a different file structure you can use the --labels LABELS flag to customise the path to the labels. The INPUT will then only be used as path to the images.
  • Change the train / validation / test split ratio with the --validsplit VALIDSPLIT and --testsplit TESTSPLIT flags. Both take values between 0 and 1 giving the fraction of images used (e.g. a TESTSPLIT of 0.2 uses 20% of images for testing). TESTSPLIT is applied to the entire dataset first. VALIDSPLIT is then applied to the remaining non-test data, so a VALIDSPLIT of 0.2 corresponds to less than 20% of the full dataset, depending on TESTSPLIT (e.g. 16% with a TESTSPLIT of 0.2).
  • To resize images uniformly, use the --size SIZE flag. Note that for deepBlink to work properly, training images have to be square with a side length that is a power of two (e.g. 256, 512, 1024). To avoid training on duplicate images, any crops that would overlap with existing images are ignored. Similarly, images smaller than the specified size are not included in the dataset.
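
The split order and the size constraint described above can be illustrated with a small Python sketch. This is not deepBlink's own code: the helper names split_counts and is_power_of_two are hypothetical, and the use of round() is an assumption, so the exact counts deepBlink produces may differ by an image.

```python
def split_counts(n_images: int, testsplit: float, validsplit: float) -> dict:
    """Hypothetical helper: apply the test split first, then the validation
    split to the remaining non-test images. Rounding is an assumption."""
    n_test = round(n_images * testsplit)
    n_valid = round((n_images - n_test) * validsplit)
    n_train = n_images - n_test - n_valid
    return {"train": n_train, "valid": n_valid, "test": n_test}


def is_power_of_two(size: int) -> bool:
    """Hypothetical helper: check that a side length is a power of two."""
    return size > 0 and (size & (size - 1)) == 0


counts = split_counts(100, testsplit=0.2, validsplit=0.2)
print(counts)  # the validation share is 0.2 of the 80 non-test images

for size in (256, 500, 512):
    print(size, is_power_of_two(size))
```

With 100 images and both splits set to 0.2, the validation set receives 16 images rather than 20, because VALIDSPLIT only sees the 80 images left after the test split.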

Additional insights

This dataset npz file is nothing more than six NumPy arrays bundled together. These arrays are x_train, y_train, x_valid, y_valid, x_test, and y_test, where x denotes the input / images and y the ground truth / labels. An npz file can easily be read in Python using our deepblink.io.load_npz function.
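
To make the file layout concrete, here is a minimal sketch that writes and reads a toy npz file with plain NumPy instead of deepblink.io.load_npz. The array shapes, the random values, and the file name toy.npz are placeholders chosen for illustration. The label layout in particular (three points with two coordinates per image) is an assumption, not deepBlink's documented format. Real datasets with a varying number of spots per image may be stored as object arrays, in which case np.load would need allow_pickle=True.

```python
import os
import tempfile

import numpy as np

keys = ["x_train", "y_train", "x_valid", "y_valid", "x_test", "y_test"]

# Build toy arrays: shapes and values are placeholders, not real data.
rng = np.random.default_rng(0)
arrays = {}
for split, n_images in (("train", 4), ("valid", 2), ("test", 2)):
    arrays[f"x_{split}"] = rng.random((n_images, 8, 8))  # images
    arrays[f"y_{split}"] = rng.random((n_images, 3, 2))  # assumed label layout

# Bundle the six arrays into a single npz file.
path = os.path.join(tempfile.mkdtemp(), "toy.npz")
np.savez(path, **arrays)

# Read the arrays back by name.
with np.load(path) as data:
    loaded = {key: data[key] for key in keys}

for key, array in loaded.items():
    print(key, array.shape)
```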
