Methods section on choosing compressor defaults #26

jeromekelleher · 2024-03-28T16:51:44Z

@shz9 has done a nice analysis of how various settings affect compression ratios:

We should incorporate this into the paper. I've made an initial Methods section "Choosing default compressor settings", where we can write a few paragraphs discussing what we did, and the basic conclusions. (Interesting to note that PackBits didn't do much, e.g.)

I guess we want some sort of supplementary figure or table summarising the results as well?

We also want to bring the code for doing this into the repo. A suggested sketch:

Make a new directory real_data, and create a Makefile to download the file of interest and create the starting-point Zarr (can copy lots from scaling/Makefile)
Add the script for doing the analysis to src
Save the data to a CSV in plot_data
Add the code to plot the figure(s) to src/plot.py
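
As a rough illustration of the analysis-script and CSV steps in the list above, here is a minimal sketch. It is not the script discussed in this thread: it uses only the Python standard library, with zlib standing in for the Blosc compressors used with Zarr, a hand-written byte shuffle, and synthetic int32 data in place of a real call_* array. The function names, the chunk sizes, and the CSV column names are all assumptions made for the example.

```python
import csv
import io
import random
import struct
import zlib


def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group the i-th byte of every item together (a plain byte shuffle)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def compression_ratio(raw: bytes, chunk_bytes: int, shuffle: bool, itemsize: int) -> float:
    """Compress raw in independent chunks; return uncompressed / compressed size."""
    compressed = 0
    for start in range(0, len(raw), chunk_bytes):
        chunk = raw[start:start + chunk_bytes]
        if shuffle:
            chunk = byte_shuffle(chunk, itemsize)
        compressed += len(zlib.compress(chunk, 6))
    return len(raw) / compressed


# Synthetic stand-in for a call_* array: small non-negative int32 values.
rng = random.Random(42)
values = [rng.randint(0, 60) for _ in range(50_000)]
raw = struct.pack(f"<{len(values)}i", *values)

# Hypothetical parameter grid: chunk size in bytes, shuffle on or off.
rows = []
for chunk_bytes in (4_000, 40_000, 200_000):
    for shuffle in (False, True):
        ratio = compression_ratio(raw, chunk_bytes, shuffle, itemsize=4)
        rows.append({"chunk_bytes": chunk_bytes, "shuffle": shuffle, "ratio": round(ratio, 3)})

# Build the CSV in memory; a real script would write a file under plot_data.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["chunk_bytes", "shuffle", "ratio"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The plotting code would then only need to read the CSV, which keeps the slow compression runs separate from the figure generation.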

Basically we want to keep everything here in the repo so it's all nice and reproducible.

How does this sound @shz9?


shz9 · 2024-04-02T02:18:44Z

Sounds good. I'll be preparing for and traveling to conferences for the next couple of weeks, but I will do this once that's all done.

jeromekelleher · 2024-04-24T13:16:07Z

How is this looking @shz9? I'd like to get a draft of this into the document as soon as we can. Keen to get this preprinted in the next few weeks...

shz9 · 2024-04-24T23:07:26Z

I've been busy recently. I will try to get it done this weekend.

shz9 · 2024-04-28T23:55:33Z

OK, so I created the real_data/Makefile file and it seems to be working fine (learned quite a bit from the scaling/Makefile script, thanks!). The updates are here.

I'm planning to re-run the experiments as part of this new pipeline to make sure that the whole thing runs from start to finish. Once the results are in, I will describe them in the LaTeX document and do a pull request. Just need some help with the issue raised earlier.

Some notes:

The WGS sequencing file for CHR22 is quite large and slow to download/convert. This can make the experiments unnecessarily difficult to run. That's one reason I gave an option in the Makefile to use older genotype-only data as an alternative. It should be <1 GB, so it's more manageable and will make the experiments run faster. However, there's an issue with converting those files at the moment.
I need to use seaborn for the grid figures. Should we add a requirements.txt to list all python dependencies? I think we should definitely list the PyPI version of bio2zarr in there to make sure the experiments are reproducible.
Should we add a .gitignore file and list data directories in there?

jeromekelleher · 2024-04-29T08:54:02Z

Great, thanks @shz9!

> The WGS sequencing file for CHR22 is quite large and slow to download/convert. This can make the experiments unnecessarily difficult to run.

I think the simplest thing here is to download the first gigabyte or so of the full data using bcftools HTTP access features. See validation data Makefile in the bio2zarr repo for examples of how to do this. You'll need to tweak the number of lines to "head" to suit.

> I need to use seaborn for the grid figures. Should we add a requirements.txt to list all python dependencies? I think we should definitely list the PyPI version of bio2zarr in there to make sure the experiments are reproducible.

Yes please, that would be excellent

> Should we add a .gitignore file and list data directories in there?

Yep, any usability tweaks like this just go for it.

shz9 · 2024-05-07T20:00:29Z

Sorry for the delays. The pipeline is now ready and the figures have been added to the latest version. I also added requirements.txt and .gitignore as we discussed. Do you mind taking a look and making suggestions?

I added 3 figures that highlight 3 aspects of our discussions:

figures/compression_ratio_grid.pdf: This shows the effects of the chunksize and shuffle parameters across all the call_* arrays. The figure is quite big. Do you think it makes sense to condense it a bit, since many of the arrays have similar compression profiles?
figures/compression_packbits.pdf: This shows the effect of the PackBits filter applied to the call_genotype_mask array.
figures/compression_dim_shuffle.pdf (optional): This shows the effect of shuffling the dimensions on the quality of the compression for the call_AD array (the only one where I saw a notable effect).
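
For readers unfamiliar with what the PackBits filter in the second figure does, here is a minimal sketch of the idea: store 8 boolean flags per byte before handing the data to the compressor. This is a simplified standard-library illustration, not the numcodecs implementation (whose byte layout differs), and the synthetic mask, the 1% density, and the helper names are assumptions made for the example. zlib again stands in for the real compressor.

```python
import random
import zlib


def pack_bits(flags):
    """Pack booleans into bytes, 8 flags per byte, most significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(packed, n):
    """Recover the first n booleans from bytes produced by pack_bits."""
    return [bool(packed[i // 8] & (0x80 >> (i % 8))) for i in range(n)]


# Synthetic stand-in for a mask array: mostly False, roughly 1% True.
rng = random.Random(1)
mask = [rng.random() < 0.01 for _ in range(100_000)]

unpacked_raw = bytes(mask)     # one byte per flag
packed_raw = pack_bits(mask)   # one bit per flag

print("raw bytes, unpacked:", len(unpacked_raw))
print("raw bytes, packed:  ", len(packed_raw))
print("zlib bytes, unpacked:", len(zlib.compress(unpacked_raw, 6)))
print("zlib bytes, packed:  ", len(zlib.compress(packed_raw, 6)))
```

Packing cuts the uncompressed size by a factor of 8, but a general-purpose compressor already removes much of the redundancy in the one-byte-per-flag layout, so the difference after compression can be considerably smaller than the difference before it.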

Once you get a chance to provide some feedback on the figures, I can write up our conclusions and do a pull request.

jeromekelleher · 2024-05-07T20:50:23Z

That's great, can you open a PR please? It doesn't need to be final, and it's easier for me to give feedback there.

jeromekelleher closed this as completed May 23, 2024
