Add the Jelly output format #258

Merged
merged 4 commits into from
Jun 26, 2025

@Ostrzyciel Ostrzyciel commented Jun 8, 2025

This PR adds support for outputting files in the Jelly format, a high-performance binary RDF format based on Protocol Buffers.

  • Jelly is a compressed streaming format. Like N-Triples, you can write or read a Jelly file triple-by-triple, without knowing the entire dataset up-front.
    • This means that a Jelly file can be of essentially infinite size, and the memory usage for writing or parsing such a file would be constant.
  • Like HDT, it compresses the data efficiently, which makes the resulting file smaller.
  • Unlike HDT, it is not indexed, so you can't query the file directly. However, you can parse it very efficiently (up to 15 million triples/s in our benchmarks).
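
To illustrate the streaming property described above, here is a minimal sketch using only the Java standard library. It does not use the Jelly encoding at all: it writes a simplified N-Triples-like text line per triple, following the comparison with N-Triples above. The class `StreamingTriples`, its methods, and the `http://example.org/` IRIs are hypothetical names made up for this illustration. The point is only that the writer and the reader each handle one triple at a time, so neither needs the whole dataset in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.stream.IntStream;

// Hypothetical illustration; not part of RMLMapper, RDF4J, or Jelly.
final class StreamingTriples {

    // Lazily produces n made-up triples; nothing is materialized up front.
    static Iterator<String[]> generate(int n) {
        return IntStream.range(0, n)
                .mapToObj(i -> new String[] {
                        "http://example.org/s" + i,
                        "http://example.org/p",
                        "http://example.org/o" + i})
                .iterator();
    }

    // Writes each triple as soon as it is produced and returns the count.
    // Only the current triple and the writer's fixed-size buffer are held.
    static long writeAll(Iterator<String[]> triples, OutputStream out) throws IOException {
        Writer w = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        long count = 0;
        while (triples.hasNext()) {
            String[] t = triples.next();
            w.write("<" + t[0] + "> <" + t[1] + "> <" + t[2] + "> .\n");
            count++;
        }
        w.flush(); // flush but do not close: the caller owns the stream
        return count;
    }

    // Reads the stream one line (one triple) at a time and counts them.
    static long countAll(InputStream in) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        long count = 0;
        while (r.readLine() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long written = writeAll(generate(1000), sink);
        long read = countAll(new ByteArrayInputStream(sink.toByteArray()));
        System.out.println("written=" + written + ", read=" + read);
    }
}
```

In this sketch the in-memory `ByteArrayOutputStream` is used only to keep the example self-contained; with a file or network stream as the sink, the memory held by the writer would not grow with the number of triples.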

Implementation

I used the jelly-rdf4j library, which integrates nicely with the RDF4J Rio subsystem. GitHub: https://github.com/Jelly-RDF/jelly-jvm

Jelly is a binary format, so it can't be written to a Java Writer. For this reason, I added a new write method to QuadStore that takes an OutputStream as input. In modern RDF4J there is really no reason to use a Writer for output, as the only legal encoding is UTF-8 anyway, and RDF4J is perfectly happy writing to a raw binary stream (it should even be faster). However, I assume that the Writer-based method must be kept for API compatibility, so for now only Jelly uses the native OutputStream output. The remaining RDF4J formats can be migrated later by simply changing the type of the out parameter to OutputStream; I already tested that and it works fine.
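
As a minimal sketch of why a binary payload cannot safely go through a Writer, the following standard-library example pushes the same four bytes through an OutputStream and through a UTF-8 Writer. The class `WriterVsStream` and the payload are hypothetical and have nothing to do with the actual Jelly encoding. A Writer deals in characters, so every byte at or above 0x80 is re-encoded as a two-byte UTF-8 sequence and the output no longer matches the input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of RMLMapper, RDF4J, or Jelly.
final class WriterVsStream {

    // Made-up binary payload: two of the four bytes are >= 0x80.
    static final byte[] PAYLOAD = {0x0A, (byte) 0x80, (byte) 0xFF, 0x01};

    // Byte-oriented path: the payload arrives unchanged.
    static byte[] viaOutputStream(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(payload);
        return sink.toByteArray();
    }

    // Character-oriented path: each byte is treated as a char and then
    // encoded as UTF-8, which expands every value >= 0x80 to two bytes.
    static byte[] viaWriter(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            for (byte b : payload) {
                w.write(b & 0xFF);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("stream: " + Arrays.toString(viaOutputStream(PAYLOAD)));
        System.out.println("writer: " + Arrays.toString(viaWriter(PAYLOAD)));
    }
}
```

Here the OutputStream path returns the original 4 bytes, while the Writer path returns 6 bytes, which is why a binary format needs the byte-oriented method.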

These changes may be useful later to reduce the hackiness of the current HDT serializer implementation (which currently bypasses write with an additional conditional branch), or to add support for other binary formats.
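
A rough sketch of that idea, using only the Java standard library: if every serializer writes to an OutputStream, text and binary formats can share a single code path with no format-specific bypass. Everything here (`UnifiedWrite`, the `Serializer` interface, and the placeholder payloads) is hypothetical and is not the actual RMLMapper, RDF4J, or HDT API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of RMLMapper, RDF4J, or Jelly.
final class UnifiedWrite {

    // One byte-oriented contract shared by all formats.
    interface Serializer {
        void write(OutputStream out) throws IOException;
    }

    // A text format wraps the stream in a UTF-8 Writer internally.
    static Serializer text(String content) {
        return out -> {
            Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            w.write(content);
            w.flush(); // flush but do not close: the caller owns the stream
        };
    }

    // A binary format writes its bytes directly.
    static Serializer binary(byte[] content) {
        return out -> out.write(content);
    }

    // Single code path for every format: no conditional branch per format.
    static byte[] serialize(Serializer serializer) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        serializer.write(sink);
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] asText = serialize(text("<a> <b> <c> .\n"));
        byte[] asBinary = serialize(binary(new byte[] {(byte) 0x80, 0x01}));
        System.out.println(asText.length + " text bytes, " + asBinary.length + " binary bytes");
    }
}
```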

Tests

I added a test to verify that the output is saved correctly in the Jelly format.

The target_output.jelly was generated using jelly-cli with this command:

$ jelly-cli rdf to-jelly --opt.rdf-star=false --opt.generalized-statements=false ../output-turtle/target_output.ttl > target_output.jelly

Using the output

You can use jelly-cli to play around with the generated Jelly files and convert them to other formats. You can also load them into Apache Jena Fuseki by installing the Jelly plugin for Jena.

You can also use pyjelly to load the file into rdflib in Python.

Dependencies

This adds 3 new dependencies, which together weigh around 2 megabytes in JAR form:

  • jelly-core – generic serialization code for Jelly
  • jelly-rdf4j – integration of jelly-core with RDF4J
  • protobuf-java – Google's Protobuf library

protobuf-java is a very popular library, used in many projects, with a robust security policy. The Jelly libraries are extensively tested (8000+ test cases in the main suite) and have mitigations for known security risks, which are tested in CI. They are production-grade and are currently being used, for example, in the nanopublication services for inter-service communication.

@bjdmeest bjdmeest requested a review from ghsnd June 26, 2025 11:28
@bjdmeest
Collaborator

This is an absolutely beautiful PR, thanks for that. I'll let @ghsnd doublecheck and merge if he's as happy as I am :)

Contributor

@ghsnd ghsnd left a comment

Looks good indeed!

After this is merged, I'll change QuadStore#write to take an OutputStream instead of a Writer as its parameter; then the store code can be a bit more concise. I think Writer was used because there was no output to a binary format yet.

@ghsnd ghsnd merged commit a28e7a0 into RMLio:master Jun 26, 2025
@Ostrzyciel Ostrzyciel deleted the add-jelly branch June 26, 2025 16:34