Add support for 'tabix' formatted genomics/genetics data #1568

Open
MattBrauer opened this issue Sep 1, 2022 · 4 comments

MattBrauer commented Sep 1, 2022

Geneticists and genomics scientists cannot natively get data from S3 when it is compressed using the most common format for that data type.
Tabular data in genetics and genomics is often compressed using the so-called "tabix" format (bgzip): a block-compressed, gzip-compatible format that allows indexing into a file by genome position. Although the file suffix is .gz and gunzip can decompress such files locally (e.g., via pandas.read_csv with compression="gzip"), AWS Data Wrangler cannot fetch these data from S3 (via awswrangler.s3.read_csv).

What I'd like to see
I'd like to be able to get data via awswrangler.s3.read_csv(S3_uri, compression="bgzip"). While gunzip will decompress the file on local storage, bgzip -d is the preferred method; I believe that subtle differences between gzip and bgzip corrupt the reading of the data from S3.

Alternatives
Transferring a file to local storage (aws s3 cp <uri> ./) and then using the pandas read_csv function works, but involves an extra copy step.
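A minimal sketch of that workaround (the S3 URI, local path, and tab-separated layout are assumptions for illustration):

```python
import subprocess

import pandas as pd

# Hypothetical URI/path for illustration; substitute the real object.
s3_uri = "s3://my-bucket/variants.tsv.gz"   # bgzip-compressed, tabix-indexed table
local_path = "variants.tsv.gz"

# The extra copy step: pull the object down to local storage first.
subprocess.run(["aws", "s3", "cp", s3_uri, local_path], check=True)

# bgzip output is gzip-compatible, so pandas can decompress it as gzip.
df = pd.read_csv(local_path, sep="\t", compression="gzip")
```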

The genomics community's use of bgzip for tabix-indexed files may be idiosyncratic, but there is a large and growing number of users in this space.

Adding support for this compression method would support the genetics field and the biopharma industry.

@malachi-constant (Contributor)

Hi @MattBrauer, can you help us replicate your scenario? You mention using pandas' read_csv method, but I get an error when using bgzip as the compression value:

```
ValueError: Unrecognized compression type: bgzip
Valid compression types are ['infer', None, 'bz2', 'gzip', 'xz', 'zip', 'zstd']
```

@MattBrauer (Author)

Hello. The bgzip format is compatible with gzip, so pandas' read_csv can be used successfully with gzip compression. I'd like it to be possible to do the same with awswrangler.
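A minimal sketch of that compatibility applied to S3 directly, bypassing the local copy (the bucket, key, and tab-separated layout are assumptions, and this uses plain boto3 rather than awswrangler):

```python
import gzip
import io

import boto3
import pandas as pd

# Hypothetical bucket/key for illustration.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="variants.tsv.gz")

# bgzip writes a series of standard gzip members, which gzip.decompress
# handles transparently, so the decompressed bytes parse like any TSV.
raw = gzip.decompress(obj["Body"].read())
df = pd.read_csv(io.BytesIO(raw), sep="\t")
```

This reads the whole object into memory, which is fine for moderately sized files; very large files would need a streaming wrapper instead.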

@malachi-constant (Contributor)

OK, so you are using pandas.read_csv(.., compression="gzip") after copying the file to local storage today?

@MattBrauer (Author)

That is correct. That's a reasonable workaround, but it would be nice to have wr.s3.read_csv behave the same way, if possible (an interim no-copy sketch follows below).

Thanks for the attention on this. If necessary I can get you a file that demonstrates the problem, but since they contain restricted data I'd have to do some obfuscation work first.
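As an interim no-copy option until wr.s3.read_csv gains bgzip support, pandas can read the S3 URI directly when the s3fs package is installed; a minimal sketch (the URI and tab separator are assumptions):

```python
import pandas as pd

# Requires s3fs so pandas can open s3:// URLs via fsspec.
# Hypothetical URI for illustration; compression="gzip" decompresses the
# bgzip (gzip-compatible) stream after the object is fetched.
df = pd.read_csv(
    "s3://my-bucket/variants.tsv.gz",
    sep="\t",
    compression="gzip",
)
```

Credentials are picked up from the usual AWS environment/config, and compression="infer" would also work here given the .gz suffix.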
