Parse a large uploaded gzipped csv file in the browser #1074

@psaffrey-biomodal

The file I'm using is about 200 MB. I already have this working by loading everything into memory, but I want to use streams and add some custom logic to the parsing step to speed it up.

Doing this with streams is easy enough in Node:

import fs from 'fs'
import zlib from 'zlib'
import papaparse from 'papaparse'

const filePath = "/path/to/file.csv.gz";
const data = [];

// With NODE_STREAM_INPUT, parse() returns a stream that other streams can be piped into.
const parser = papaparse.parse(papaparse.NODE_STREAM_INPUT);

parser.on('data', (chunk) => {
  data.push(chunk);
});

parser.on('end', () => {
  console.log(data);
});

fs.createReadStream(filePath)
  .pipe(zlib.createGunzip())
  .pipe(parser);

To do the same in the browser (as far as I can tell), you need to turn the file from an <input> into a stream, push it through a DecompressionStream('gzip').writable, and then push that into a stream-capable papaparse parser. So I have this:

const fileInput = document.getElementById('selectFileBtn');

async function parseGzippedCsv(file) {
  const parser = papaparse.parse(papaparse.NODE_STREAM_INPUT)
  parser.on('data', (chunk) => {
      data.push(chunk);
      }
  );

  parser.on('end', () => {
      // I actually need to wrap in a Promise to make this work, but it fails before it gets here
      resolve(data);
    }
  );
  
  file.stream()        
    .pipeTo(new DecompressionStream('gzip').writable)
    .pipeTo(parser);
}

fileInput.addEventListener('change', async function(e) {
  const file = e.target.files[0];
  const data = await parseGzippedCsv(file);
});

The error is: TypeError: Cannot read properties of null (reading 'stream') on the line that creates the papaparse parser, so maybe I can't use papaparse.NODE_STREAM_INPUT in the browser...?
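
For reference, here is a minimal sketch of the decompression half of the pipeline using only Web Streams, with no papaparse involved. It chains with pipeThrough, which returns a readable stream, rather than pipeTo, which returns a Promise and so cannot be chained further. The names splitLines and gunzipLines are hypothetical helpers made up for this sketch, and splitting on newlines assumes no quoted CSV field contains a line break.

```javascript
// Hypothetical helper: re-chunk arbitrary text chunks into complete lines.
function splitLines() {
  let tail = ''; // partial line carried over between chunks
  return new TransformStream({
    transform(chunk, controller) {
      const parts = (tail + chunk).split(/\r?\n/);
      tail = parts.pop(); // the last piece may be an incomplete line
      for (const line of parts) controller.enqueue(line);
    },
    flush(controller) {
      // Emit whatever is left if the input did not end with a newline.
      if (tail !== '') controller.enqueue(tail);
    },
  });
}

// Hypothetical helper: yield the lines of a gzipped text File/Blob one at a time.
async function* gunzipLines(blob) {
  const reader = blob
    .stream()
    .pipeThrough(new DecompressionStream('gzip')) // bytes -> decompressed bytes
    .pipeThrough(new TextDecoderStream())         // bytes -> UTF-8 text
    .pipeThrough(splitLines())                    // text chunks -> lines
    .getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return;
      yield value;
    }
  } finally {
    reader.releaseLock();
  }
}
```

Each yielded line could then go through custom per-row logic, for example by passing it to papaparse.parse(line) as a plain string. Whether that is fast enough for a 200 MB file is something I have not measured.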

I've also tried to do something similar with csv-parse without success. I'm a bit surprised nobody else wants to do this in the browser 🤔
