Methods section on choosing compressor defaults #26

Closed
jeromekelleher opened this issue Mar 28, 2024 · 7 comments

Comments

@jeromekelleher
Contributor

jeromekelleher commented Mar 28, 2024

@shz9 has done a nice analysis of how various settings affect compression ratios:

sgkit-dev/bio2zarr#74

We should incorporate this into the paper. I've made an initial Methods section "Choosing default compressor settings", where we can write a few paragraphs discussing what we did and the basic conclusions. (It's interesting to note that PackBits didn't do much, for example.)

I guess we want some sort of supplementary figure or table summarising the results as well?

We also want to bring the code for doing this into the repo. A suggested sketch:

  • Make a new directory real_data, and create a Makefile to download the file of interest and create the starting-point Zarr (can copy lots from scaling/Makefile)
  • Add the script for doing the analysis to src
  • Save the data to a CSV in plot_data
  • Add the code to plot the figure(s) to src/plot.py

Basically we want to keep everything here in the repo so it's all nice and reproducible.
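For illustration only, the analysis script described in the list above could be shaped roughly like this. It is a minimal sketch and not the code from the bio2zarr analysis: it uses only the standard library, with zlib plus a hand-written byte shuffle standing in for the Blosc compressor and its shuffle setting, synthetic int32 data standing in for a real call_* array, and hypothetical function and column names. It writes the CSV to an in-memory buffer where the repo script would write a file under plot_data.

```python
import csv
import io
import itertools
import random
import zlib

ITEMSIZE = 4  # bytes per value (int32); an assumption for this sketch


def byte_shuffle(data, itemsize):
    """Group the i-th byte of every item together, as a byte-shuffle filter does."""
    return b"".join(data[i::itemsize] for i in range(itemsize))


def stored_size(raw, chunk_bytes, shuffle, level=6):
    """Total compressed size when each chunk is compressed independently."""
    total = 0
    for start in range(0, len(raw), chunk_bytes):
        chunk = raw[start:start + chunk_bytes]
        if shuffle:
            chunk = byte_shuffle(chunk, ITEMSIZE)
        total += len(zlib.compress(chunk, level))
    return total


def run_grid(raw, chunk_sizes, shuffles=(False, True)):
    """Compute one compression ratio per (chunk size, shuffle) combination."""
    rows = []
    for chunk_bytes, shuffle in itertools.product(chunk_sizes, shuffles):
        size = stored_size(raw, chunk_bytes, shuffle)
        rows.append(
            {
                "chunk_bytes": chunk_bytes,
                "shuffle": shuffle,
                "compression_ratio": len(raw) / size,
            }
        )
    return rows


def write_csv(rows, handle):
    """Write the grid results as CSV to an open text handle."""
    fieldnames = ["chunk_bytes", "shuffle", "compression_ratio"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Synthetic stand-in data: small non-negative integers stored as int32.
rng = random.Random(42)
raw = b"".join(rng.randrange(50).to_bytes(ITEMSIZE, "little") for _ in range(20_000))

# Chunk sizes are multiples of ITEMSIZE so the shuffle never splits a value.
rows = run_grid(raw, chunk_sizes=[4_000, 16_000, 80_000])
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping the results in long format (one row per parameter combination) means the plotting code only has to read one CSV and facet on the parameter columns.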

How does this sound @shz9?

@shz9
Contributor

shz9 commented Apr 2, 2024

Sounds good. I'll be preparing for and traveling to conferences for the next couple of weeks, but I will do this once that's all done.

@jeromekelleher
Contributor Author

How is this looking @shz9? I'd like to get a draft of this into the document as soon as we can. Keen to get this preprinted in the next few weeks...

@shz9
Contributor

shz9 commented Apr 24, 2024

I've been busy recently. I will try to get it done this weekend.

@shz9
Contributor

shz9 commented Apr 28, 2024

OK, so I created the real_data/Makefile file and it seems to be working fine (learned quite a bit from the scaling/Makefile script, thanks!). The updates are here.

I'm planning to re-run the experiments as part of this new pipeline to make sure that the whole thing runs from start to finish. Once the results are in, I will describe them in the LaTeX document and do a pull request. Just need some help with the issue raised earlier.

Some notes:

  • The WGS sequencing file for CHR22 is quite large and slow to download/convert. This can make the experiments unnecessarily difficult to run. That's one reason I gave an option in the Makefile to use older genotype-only data as an alternative. It should be <1 GB, so it's more manageable and will make the experiments faster to run. However, there's an issue with converting those files at the moment.
  • I need to use seaborn for the grid figures. Should we add a requirements.txt to list all Python dependencies? I think we should definitely list the PyPI version of bio2zarr in there to make sure the experiments are reproducible.
  • Should we add a .gitignore file and list data directories in there?

@jeromekelleher
Contributor Author

Great, thanks @shz9!

> The WGS sequencing file for CHR22 is quite large and slow to download/convert. This can make the experiments unnecessarily difficult to run.

I think the simplest thing here is to download the first gigabyte or so of the full data using the HTTP access features in bcftools. See the validation data Makefile in the bio2zarr repo for examples of how to do this. You'll need to tweak the number of lines to "head" to suit.
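As a rough sketch of that approach (not copied from the bio2zarr Makefile), the pipeline can be assembled like this. The URL, output path, and line count below are placeholders, and the helper name is hypothetical. It assumes a bcftools build with remote (libcurl) access enabled, and it only builds the command string rather than running it.

```python
import shlex


def subset_command(url, num_lines, out_path):
    """Build a shell pipeline that keeps the header plus the first records of a remote VCF.

    `head` closes the pipe after `num_lines` lines, so the download stops early
    instead of streaming the whole file.
    """
    return (
        f"bcftools view {shlex.quote(url)}"
        f" | head -n {int(num_lines)}"
        f" | bcftools view -O z -o {shlex.quote(out_path)} -"
    )


# Placeholder values; tune the line count until the output is about the size you want.
cmd = subset_command(
    "https://example.org/chr22.vcf.gz",
    num_lines=100_000,
    out_path="real_data/chr22.subset.vcf.gz",
)
print(cmd)
# To execute it: subprocess.run(cmd, shell=True, check=True)
```

Because header lines count towards the total passed to head, the number of variant records kept is the line count minus the header length.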

> I need to use seaborn for the grid figures. Should we add a requirements.txt to list all python dependencies? I think we should definitely list the PyPI version of bio2zarr in there to make sure the experiments are reproducible.

Yes please, that would be excellent.

> Should we add a .gitignore file and list data directories in there?

Yep, any usability tweaks like this just go for it.

@shz9
Contributor

shz9 commented May 7, 2024

Sorry for the delays. The pipeline is now ready and the figures have been added to the latest version. I also added requirements.txt and .gitignore as we discussed. Do you mind taking a look and making suggestions?

I added 3 figures that highlight 3 aspects of our discussions:

  1. figures/compression_ratio_grid.pdf: This shows the effects of the chunksize and shuffle parameters across all the call_* arrays. The figure is quite big. Do you think it makes sense to condense it a bit, since a lot of the arrays have similar compression profiles?
  2. figures/compression_packbits.pdf: This shows the effect of the PackBits filter applied to the call_genotype_mask array.
  3. figures/compression_dim_shuffle.pdf (optional): This shows the effect of shuffling the dimensions on the quality of the compression for the call_AD array (the only one where I saw a notable effect).
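For readers unfamiliar with the PackBits comparison in figure 2, here is a minimal standard-library sketch of the idea: a boolean mask is stored either as one byte per flag or packed eight flags per byte, and both versions are then compressed. This is an assumption-laden stand-in, not the pipeline code. The real filter is PackBits in numcodecs, whose exact byte layout is not reproduced here, zlib stands in for the actual compressor, and the synthetic mask only loosely resembles call_genotype_mask.

```python
import random
import zlib


def pack_bits(flags):
    """Pack booleans into bytes, 8 flags per byte, most significant bit first."""
    out = bytearray()
    for start in range(0, len(flags), 8):
        group = flags[start:start + 8]
        byte = 0
        for flag in group:
            byte = (byte << 1) | int(bool(flag))
        byte <<= 8 - len(group)  # pad a final partial byte with zero bits
        out.append(byte)
    return bytes(out)


# Synthetic mask: mostly False with a few scattered True entries (an assumption).
rng = random.Random(0)
flags = [rng.random() < 0.01 for _ in range(80_000)]

unpacked = bytes(int(flag) for flag in flags)  # one byte per flag
packed = pack_bits(flags)                      # one bit per flag

print("raw bytes:       ", len(unpacked), "vs packed:", len(packed))
print("compressed bytes:", len(zlib.compress(unpacked)), "vs packed:", len(zlib.compress(packed)))
```

Packing always shrinks the uncompressed array eightfold, but the gap after compression can be much smaller, because a general-purpose compressor already removes most of the redundancy in a sparse byte-per-flag mask.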

Once you get a chance to provide some feedback on the figures, I can write up our conclusions and do a pull request.

@jeromekelleher
Contributor Author

jeromekelleher commented May 7, 2024

That's great, can you open a PR please? It doesn't need to be final, and it's easier for me to give feedback there.

2 participants