Truncation issue with GZIPInputStream when reading bgzip-compressed streams #1691

@jkmatila

Description of the issue

When reading bgzip-compressed data from a stream (e.g. over the network or from a pipe), htsjdk can stop early without reading the stream all the way to the end, and so processes only partial data. This is due to the use of java.util.zip.GZIPInputStream, which has a known bug, JDK-7036144, that affects reading concatenated gzip streams when the complete size of the stream is not known beforehand. Since a bgzip file is precisely a series of concatenated gzip members, the bug applies directly to bgzip streams.
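
To see the JDK bug in isolation from htsjdk, here is a minimal sketch (not part of the original report; the class names Jdk7036144Demo and TrickleInputStream are made up for illustration). It concatenates two gzip members, which is the same structure a bgzip file has, and serves them through a stream whose available() reports 0, as a network stream or pipe legitimately may. On affected JDKs, GZIPInputStream silently stops after the first member:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class Jdk7036144Demo {

    // Compresses the given text into a single complete gzip member.
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Hands out one byte per read and reports available() == 0, which a
    // network stream or pipe may legitimately do while data is in flight.
    static final class TrickleInputStream extends InputStream {
        private final InputStream in;
        TrickleInputStream(InputStream in) { this.in = in; }
        @Override public int read() throws IOException { return in.read(); }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            return in.read(b, off, Math.min(len, 1));
        }
        @Override public int available() { return 0; }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream concatenated = new ByteArrayOutputStream();
        concatenated.write(gzipMember("first member\n"));
        concatenated.write(gzipMember("second member\n"));

        // After decoding the first member's trailer, GZIPInputStream consults
        // available() to decide whether another member follows; since it
        // reports 0 here, the second member is silently dropped.
        try (GZIPInputStream gz = new GZIPInputStream(
                new TrickleInputStream(
                        new ByteArrayInputStream(concatenated.toByteArray())))) {
            // Prints only "first member" on affected JDKs.
            System.out.print(new String(gz.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}

This is the same truncation that the htsjdk repro below exhibits, just at bgzip-block granularity.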

The problem was discussed earlier in the comment thread of broadinstitute/gatk issue #4224, but it seems that part of the recommended fix, replacing the use of the buggy GZIPInputStream class, was never adopted. There is also a comment in SamInputResource.java that references this problem and the aforementioned JDK bug.

For example, GzipCompressorInputStream from Apache Commons Compress is an alternative that does not suffer from this problem, as noted in its Javadoc.
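
As a sketch of what that replacement could look like (assuming a dependency on Apache Commons Compress; the URL below is a placeholder, not from the original report), passing decompressConcatenated=true makes GzipCompressorInputStream read every member of a concatenated stream:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class CommonsCompressSketch {

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://example.org/data.vcf.gz"); // placeholder URL
        try (InputStream raw = url.openStream();
             // 'true' = keep decompressing across member boundaries,
             // which is exactly what reading a bgzip stream requires.
             GzipCompressorInputStream gz = new GzipCompressorInputStream(raw, true);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gz, StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("Read " + lines + " lines");
        }
    }
}

Note that the flag matters: with decompressConcatenated left at its default of false, GzipCompressorInputStream also stops after the first member.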

Your environment

  • version of htsjdk: 4.0.2
  • version of Java: OpenJDK 17.0.8.1 (Eclipse Temurin-17.0.8.1+1), OpenJDK 21 (Eclipse Temurin-21+35)
  • which OS: Linux x86_64 (Ubuntu 22.04.3)

Steps to reproduce

Here is an example program that demonstrates one instance of the problem: VCFIteratorBuilder uses the problematic GZIPInputStream class when reading a bgzipped VCF stream, and only part of the stream gets read. (Note: the code attempts to read a 200 MB file from the Web, although in practice, due to the bug, it does not get very far.)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import htsjdk.variant.vcf.VCFIterator;
import htsjdk.variant.vcf.VCFIteratorBuilder;

public class Repro {

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz");
        int expectedNumRecords = 1103547;

        int numRecords = 0;
        try (InputStream in = url.openStream();
             VCFIterator vcfIterator = new VCFIteratorBuilder().open(in)) {
            while (vcfIterator.hasNext()) {
                vcfIterator.next();
                numRecords++;
                if (numRecords % 100000 == 0) {
                    System.out.println("Read " + numRecords + " records...");
                }
            }
        }

        // With the GZIPInputStream bug, iteration ends early and far fewer
        // records are parsed than the file contains.
        if (numRecords != expectedNumRecords) {
            System.err.println("URL: " + url);
            System.err.println("Expected: " + expectedNumRecords + " records");
            System.err.println("Parsed: " + numRecords + " records");
            System.exit(1);
        }
    }

}

There are also other places where GZIPInputStream is used in htsjdk.

Expected behaviour

The compressed VCF stream should be read all the way to the end and all records parsed.

Actual behaviour

The code stops reading the stream at some point without parsing all records. No error is raised and parsing appears to succeed, but in fact only part of the expected data is produced. The exact point at which reading stops probably depends on how the gzip block boundaries happen to coincide with the buffer size; with a large enough input file, reading will likely stop at some point before the end.

On both OpenJDK 17 (Eclipse Temurin-17.0.8.1+1) and OpenJDK 21 (Eclipse Temurin-21+35), using the official Eclipse Temurin Docker images, the above code outputs the following:

URL: https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Expected: 1103547 records
Parsed: 7395 records
