Proposed `Dataset` API changes #2591

namedgraph · 2023-09-12T12:18:21Z

namedgraph
Sep 12, 2023

The Dataset is quite weird and assumes that standalone Graphs have identifiers, which will be phased out (#2537). For example, adding a named graph to a Dataset looks like this:

g = Graph(identifier=URIRef("urn:named_graph"))
g.add((..., ... ,...))
...
d = Dataset()
d.add_graph(g)

Moreover, Dataset uses the term context when referring to named graphs. I think it should be phased out as well.
If in doubt, I suggest just copying Jena's Dataset API.

My suggestions for Dataset:

add add_named_graph(uri: IdentifiedNode, graph: Graph) method
add has_named_graph(uri: IdentifiedNode) method
add remove_named_graph(uri: IdentifiedNode) method
add replace_named_graph(uri: IdentifiedNode, graph: Graph) method
add graphs() method as an alias for contexts()
add default_graph property as an alias for default_context
add get_named_graph as an alias for get_graph
deprecate graph(graph) method
deprecate remove_graph(graph) method
deprecate contexts() method

Using IdentifiedNode as a super-interface for URIRef and BNode (since both are allowed as graph names in RDF 1.1).

The above example would become something like this after these changes:

g = Graph()
g.add((..., ... ,...))
...
d = Dataset()
d.add_named_graph(URIRef("urn:named_graph"), g)
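
To make the proposed method names concrete, here is a minimal toy model in plain Python. It is not RDFLib code: `ToyGraph`, `ToyDataset`, and the use of plain strings for graph names and terms are assumptions made purely for illustration. Only the method names come from the list above, and the merge-on-add behaviour is a guess rather than something the proposal specifies.

```python
from __future__ import annotations

Triple = tuple[str, str, str]


class ToyGraph:
    """A graph is just a set of triples; it carries no identifier."""

    def __init__(self, triples: set[Triple] | None = None) -> None:
        self.triples: set[Triple] = set(triples or ())

    def add(self, triple: Triple) -> None:
        self.triples.add(triple)


class ToyDataset:
    """One default graph plus a mapping from graph name to graph."""

    def __init__(self) -> None:
        self.default_graph = ToyGraph()
        self._named: dict[str, ToyGraph] = {}

    def add_named_graph(self, uri: str, graph: ToyGraph) -> None:
        # Assumption: adding under an existing name merges the triples.
        self._named.setdefault(uri, ToyGraph()).triples.update(graph.triples)

    def has_named_graph(self, uri: str) -> bool:
        return uri in self._named

    def get_named_graph(self, uri: str) -> ToyGraph | None:
        return self._named.get(uri)

    def remove_named_graph(self, uri: str) -> None:
        self._named.pop(uri, None)

    def replace_named_graph(self, uri: str, graph: ToyGraph) -> None:
        # Unlike add_named_graph, this discards whatever was stored before.
        self._named[uri] = ToyGraph(graph.triples)

    def graphs(self) -> list[ToyGraph]:
        return list(self._named.values())


g = ToyGraph()
g.add(("urn:s", "urn:p", "urn:o"))
d = ToyDataset()
d.add_named_graph("urn:named_graph", g)
```

The point of the sketch is that the name lives only in the dataset's mapping, so the same `ToyGraph` could be stored under several names without the graph itself knowing.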

namedgraph · 2023-09-12T12:20:53Z

namedgraph
Sep 12, 2023
Author

Also aligns with #2446 and #2407

0 replies

namedgraph · 2023-09-13T08:23:28Z

namedgraph
Sep 13, 2023
Author

It also looks like a simple Graph.add(self, g: Graph) method is missing 🤷‍♂️ There's only __add__.
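
A rough sketch of the difference between the operator and an in-place method, using a toy class rather than RDFLib's `Graph`. The method name `add_graph` is hypothetical (in RDFLib, `Graph.add` already takes a single triple), and as far as I know RDFLib's `Graph` also defines `__iadd__`, so `g1 += g2` may already cover the in-place case.

```python
from __future__ import annotations

Triple = tuple[str, str, str]


class ToyGraph:
    def __init__(self, triples=()) -> None:
        self.triples: set[Triple] = set(triples)

    def __add__(self, other: ToyGraph) -> ToyGraph:
        # Union that returns a new graph; both operands stay untouched.
        return ToyGraph(self.triples | other.triples)

    def add_graph(self, other: ToyGraph) -> ToyGraph:
        # Hypothetical in-place variant: mutates self, returns it for chaining.
        self.triples |= other.triples
        return self


a = ToyGraph({("urn:s1", "urn:p", "urn:o")})
b = ToyGraph({("urn:s2", "urn:p", "urn:o")})
union = a + b      # new object; a and b are unchanged at this point
a.add_graph(b)     # a now also holds b's triple
```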

0 replies

nicholascar · 2023-10-13T11:06:54Z

nicholascar
Oct 13, 2023
Maintainer

I support all of the proposals in this discussion. This has been a long time coming: we've noticed these things for years but have never done anything about them, and they still hurt us. @edmondchuc is battling with datasets in a current project.

I suggest we also remove the ConjunctiveGraph class and fold any differences it has with Dataset into Dataset constructor parameters.

0 replies

namedgraph · 2023-10-20T15:49:46Z

namedgraph
Oct 20, 2023
Author

I don't think ConjunctiveGraph has to go -- as I understand, it provides a "union graph" which most triplestores support. But Dataset probably does not need to extend it.
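
A sketch of what a "union graph" could look like as a read-only view instead of a base class. `UnionView` and the string-based terms are hypothetical, and including the default graph in the union is an assumption.

```python
from __future__ import annotations

from itertools import chain

Triple = tuple[str, str, str]


class UnionView:
    """Read-only view over the default graph and every named graph."""

    def __init__(self, default: set[Triple], named: dict[str, set[Triple]]) -> None:
        # The view keeps references, so later changes to the dataset show up.
        self._default = default
        self._named = named

    def __iter__(self):
        seen: set[Triple] = set()
        for triple in chain(self._default, *self._named.values()):
            if triple not in seen:  # the same triple may sit in several graphs
                seen.add(triple)
                yield triple

    def __contains__(self, triple: object) -> bool:
        return triple in self._default or any(
            triple in graph for graph in self._named.values()
        )


default = {("urn:s1", "urn:p", "urn:o")}
named = {"urn:g1": {("urn:s1", "urn:p", "urn:o"), ("urn:s2", "urn:p", "urn:o")}}
view = UnionView(default, named)
```

With this shape, Dataset would not need to extend anything: it could simply hand out a view on request.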

0 replies

recalcitrantsupplant · 2024-11-26T07:08:10Z

recalcitrantsupplant
Nov 26, 2024

I've written down my thoughts on what the expected interfaces are, in a pseudo-Python/rdflib format:

without reference to the current implementation. As such, I don't expect what I've written below to be backwards compatible - I'd think it should be changed for this reason.
without reading others' suggestions

Hopefully it's a coherent perspective; it may take some effort to reconcile / integrate with others'. Will have a go at this next.

Minimal class definitions

Only enough to illustrate the thinking/scenarios
Graph:

from enum import Enum
from typing import Literal

from rdflib import URIRef


class GraphType(Enum):
    DEFAULT = "default"
    NAMED = "named"


class Graph:
    def __init__(
        self,
        identifier: URIRef | None = None,
        graph_type: GraphType | None = None,
    ):
        pass

Dataset:

class Dataset:
    def __init__(self):
        pass

    # context selects which graphs to read from; None means "everything"
    def quads(
        self,
        context: GraphType | URIRef | list[GraphType | URIRef] | None = None,
    ):
        pass

    def triples(
        self,
        context: GraphType | URIRef | list[GraphType | URIRef] | None = None,
    ):
        pass

    # target, if given, overrides the identifier of the graph being added
    def add_graph(
        self,
        graph: Graph,
        target: URIRef | Literal[GraphType.DEFAULT] | None = None,
    ):
        pass

Graph Scenarios

Scenario 1: Default Graph (Start with Triple)

Graph instantiated without context becomes a "default" or contextless graph when the first thing added is a triple.

g = Graph()
g.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]

print(g.graph_type)
> default  # graph type is now "default"; any triples or quads added after this have no context
g.parse(data="<ex:s2> <ex:p2> <ex:o2> <ex:graph> .", format="nquads")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>'), ('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None), ('<ex:s2>', '<ex:p2>', '<ex:o2>', None)]

Scenario 2: Named Graph (Start with Quad)

Graph instantiated without context gets context from parsed quad.
Subsequently parsed triples inherit the context.

g = Graph()
g.parse(data="<ex:s2> <ex:p2> <ex:o2> <ex:g2> .", format="nquads")
print(list(g.triples()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(g.quads()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g2>')]

print(g.graph_type)
> named
g.parse(data="<ex:s3> <ex:p3> <ex:o3> .", format="turtle")
print(list(g.triples()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>'), ('<ex:s3>', '<ex:p3>', '<ex:o3>')]

print(list(g.quads()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g2>'), ('<ex:s3>', '<ex:p3>', '<ex:o3>', '<ex:g2>')]

Scenario 3: Named Graph with Identifier

Triples added to graph inherit the context.

g = Graph(identifier="ex:g1")
print(g.graph_type)
> named
g.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', '<ex:g1>')]

Scenario 4: Add quad to default graph

Context is ignored.

g = Graph(graph_type="default")
g.parse(data="<ex:s1> <ex:p1> <ex:o1> <ex:graph> .", format="nquads")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]
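
Scenarios 1 to 4 can be modelled with a small stdlib-only class. This is a sketch under assumptions: `ToyGraph` is hypothetical, parsing is left out (statements are passed to `add` directly as strings), and a graph name on a later statement is simply ignored once the graph's type is fixed, which the scenarios imply but do not state for every case.

```python
from __future__ import annotations

from enum import Enum


class GraphType(Enum):
    DEFAULT = "default"
    NAMED = "named"


class ToyGraph:
    def __init__(
        self,
        identifier: str | None = None,
        graph_type: GraphType | None = None,
    ) -> None:
        if identifier is not None:
            graph_type = GraphType.NAMED  # Scenario 3: an identifier implies "named"
        self.identifier = identifier
        self.graph_type = graph_type      # stays None until the first statement
        self._triples: list[tuple[str, str, str]] = []

    def add(self, s: str, p: str, o: str, g: str | None = None) -> None:
        if self.graph_type is None:
            # Scenarios 1 and 2: the first statement fixes the graph type.
            self.graph_type = GraphType.DEFAULT if g is None else GraphType.NAMED
        if self.graph_type is GraphType.NAMED and self.identifier is None:
            self.identifier = g
        # From here on, the statement's own graph name is ignored (Scenario 4).
        self._triples.append((s, p, o))

    def triples(self) -> list[tuple[str, str, str]]:
        return list(self._triples)

    def quads(self) -> list[tuple[str, str, str, str | None]]:
        return [(s, p, o, self.identifier) for s, p, o in self._triples]


g = ToyGraph()
g.add("ex:s2", "ex:p2", "ex:o2", "ex:g2")  # quad first: graph becomes named
g.add("ex:s3", "ex:p3", "ex:o3")           # later triple inherits ex:g2
```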

Dataset Scenarios

Scenario 5: Add a Default Graph to a Dataset

g = Graph(graph_type="default")
g.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")

ds = Dataset()
ds.add_graph(g)

print(list(ds.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(ds.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]

Scenario 6: Add a Named Graph to a Dataset

g = Graph(identifier="ex:g1")
g.parse(data="<ex:s2> <ex:p2> <ex:o2> .", format="turtle")

ds = Dataset()
ds.add_graph(g)

print(list(ds.triples()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.quads()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g1>')]

Scenario 7: Add a Graph to the Default Context

Graph ID of graph being added (if present) is overridden by "target".

g = Graph(identifier="ex:g1")
g.parse(data="<ex:s3> <ex:p3> <ex:o3> .", format="turtle")

ds = Dataset()
ds.add_graph(g, target="default")

print(list(ds.triples()))
> [('<ex:s3>', '<ex:p3>', '<ex:o3>')]

print(list(ds.quads()))
> [('<ex:s3>', '<ex:p3>', '<ex:o3>', None)]

Scenario 8: Add Graphs to Dataset changing the graph

Graph ID of graph being added (if present) is overridden by "target".

g = Graph(identifier="ex:g2", graph_type="named")
g.parse(data="<ex:s4> <ex:p4> <ex:o4> .", format="turtle")

ds = Dataset()
ds.add_graph(g, target="ex:newg")

print(list(ds.triples()))
> [('<ex:s4>', '<ex:p4>', '<ex:o4>')]

print(list(ds.quads()))
> [('<ex:s4>', '<ex:p4>', '<ex:o4>', '<ex:newg>')]
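
The `target` override in Scenarios 5 to 8 could be sketched like this. `ToyGraph` and `ToyDataset` are hypothetical stand-ins, terms are plain strings, and the default graph is stored under the key `None` purely as an implementation convenience.

```python
from __future__ import annotations

from dataclasses import dataclass, field

Triple = tuple[str, str, str]


@dataclass
class ToyGraph:
    identifier: str | None = None
    triples: list[Triple] = field(default_factory=list)


class ToyDataset:
    def __init__(self) -> None:
        # Key None holds the default graph; other keys are graph names.
        self._graphs: dict[str | None, list[Triple]] = {}

    def add_graph(self, graph: ToyGraph, target: str | None = None) -> None:
        if target is None:
            name = graph.identifier  # Scenarios 5 and 6: keep the graph's own name
        elif target == "default":
            name = None              # Scenario 7: force into the default graph
        else:
            name = target            # Scenario 8: rename on the way in
        self._graphs.setdefault(name, []).extend(graph.triples)

    def triples(self) -> list[Triple]:
        return [t for ts in self._graphs.values() for t in ts]

    def quads(self) -> list[tuple[str, str, str, str | None]]:
        return [(s, p, o, name) for name, ts in self._graphs.items() for s, p, o in ts]


g = ToyGraph(identifier="ex:g2", triples=[("ex:s4", "ex:p4", "ex:o4")])
ds = ToyDataset()
ds.add_graph(g, target="ex:newg")
```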

Scenario 9: Iterate Over Triples with Contexts

g1 = Graph(graph_type="default")
g1.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")

g2 = Graph(identifier="ex:g1")
g2.parse(data="<ex:s2> <ex:p2> <ex:o2> .", format="turtle")

ds = Dataset()
ds.add_graph(g1)
ds.add_graph(g2)

print(list(ds.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>'), ('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.triples(context=["NAMED", "DEFAULT"])))  # equivalent to default behaviour when not specifying context 
> [('<ex:s1>', '<ex:p1>', '<ex:o1>'), ('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.triples(context="NAMED")))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.triples(context="DEFAULT")))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(ds.triples(context=["DEFAULT", "ex:g2"])))  # ex:g2 is not in the dataset so no data returned from this graph.
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

Scenario 10: Iterate Over Quads with Contexts

g1 = Graph(graph_type="default")
g1.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")

g2 = Graph(identifier="ex:g1")
g2.parse(data="<ex:s2> <ex:p2> <ex:o2> .", format="turtle")

ds = Dataset()
ds.add_graph(g1)
ds.add_graph(g2)

print(list(ds.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None), ('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g1>')]

print(list(ds.quads(context="NAMED")))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g1>')]

print(list(ds.quads(context="DEFAULT")))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]

print(list(ds.quads(context=["DEFAULT", "ex:g2"])))  # ex:g2 is not in the dataset so no data returned from this graph.
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]
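
The `context` filtering in Scenarios 9 and 10 boils down to resolving selectors to graph names. A minimal sketch, assuming the dataset is a plain dict from graph name to triples (key `None` for the default graph) and that results follow the order of the selectors; both are illustration choices, not part of the proposal.

```python
from __future__ import annotations


def quads(graphs: dict, context=None) -> list[tuple]:
    """graphs maps graph name -> list of triples; key None is the default graph."""
    if context is None:
        selectors = ["DEFAULT", "NAMED"]  # inclusive behaviour when unset
    elif isinstance(context, str):
        selectors = [context]
    else:
        selectors = list(context)

    wanted: list = []
    for selector in selectors:
        if selector == "DEFAULT":
            names = [None]
        elif selector == "NAMED":
            names = [name for name in graphs if name is not None]
        else:
            names = [selector]  # a specific graph name
        for name in names:
            # Graphs that are not in the dataset contribute nothing.
            if name in graphs and name not in wanted:
                wanted.append(name)
    return [(s, p, o, name) for name in wanted for (s, p, o) in graphs[name]]


def triples(graphs: dict, context=None) -> list[tuple]:
    return [(s, p, o) for s, p, o, _ in quads(graphs, context)]


graphs = {
    None: [("ex:s1", "ex:p1", "ex:o1")],
    "ex:g1": [("ex:s2", "ex:p2", "ex:o2")],
}
named_only = triples(graphs, context="NAMED")
```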

12 replies

nicholascar Dec 2, 2024
Maintainer

Confirming what @ashleysommer says: this is about fixing things, not patching. Yes, things will break. For people unable to adopt v8, v7 will still be available for quite some time, as v4.22, v5 & v6 still are.

recalcitrantsupplant Dec 5, 2024

Thanks for the feedback.

To clarify a few points around my thinking, the terminology I'm using is:

Inclusive Dataset

covers all triples/quads in the Default Graph and all Named Graphs

Exclusive Dataset

graphs targeted must be specified or inferred.
- default union off: refers to the Default Graph
- default union on: refers to the union of all Named Graphs.

Dataset Default Context

in SPARQL:
- set by FROM and FROM NAMED
- when not set using FROM and FROM NAMED, can be inclusive (e.g. GraphDB) or exclusive (e.g. Fuseki)
in RDFLib:
- I'm proposing setting it via the context parameter. It would be an array of one or more of "Named", "Default", and URIs for Graphs.
- when not set, e.g. Graph().triples() or Dataset().quads(), I'm proposing that the behaviour is inclusive. In my experience many people find the exclusive behaviour counterintuitive.

Default Graph: the "unnamed" graph.

What am I proposing:
The context parameter on the triples and quads methods (and on all related methods that iterate over the dataset, e.g. subject_objects) is equivalent to the Dataset clauses in SPARQL, i.e. FROM and FROM NAMED; it is not equivalent to Graph clauses.

I think the options are:

| Target Data | Context for Inclusive Dataset | Context for Exclusive Dataset |
| --- | --- | --- |
| Everything (Default Graph + all Named Graphs) | Context not set OR `["Default", "Named"]` | `["Default", "Named"]` |
| Default Graph only | `["Default"]` | Default Union = true: `["Default"]`. Default Union = false: Context not set |
| All Named Graphs only | `["Named"]` | Default Union = true: Context not set. Default Union = false: `["Named"]` |
| Specific Named Graphs | `[uri1, uri2]` | `[uri1, uri2]` |
| Default Graph + specific Named Graphs | `["Default", uri1, uri2]` | `["Default", uri1, uri2]` |
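
Read as a function, the options above only differ in what an unset context means. A small sketch; the function name and the `inclusive` / `default_union` parameters are hypothetical and just encode the two columns:

```python
from __future__ import annotations


def effective_context(
    context: list[str] | None,
    inclusive: bool = True,
    default_union: bool = False,
) -> list[str]:
    """Return the selectors a call would actually read from."""
    if context is not None:
        return list(context)        # an explicit context means the same in both modes
    if inclusive:
        return ["Default", "Named"]  # unset context covers everything
    # Exclusive dataset: an unset context depends on the default-union setting.
    return ["Named"] if default_union else ["Default"]


unset_inclusive = effective_context(None)
unset_exclusive = effective_context(None, inclusive=False)
unset_exclusive_union = effective_context(None, inclusive=False, default_union=True)
```

The inclusive branch never consults `default_union`, which is the "no need for default union setting" argument in code form.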

I think Inclusive would be the better option here:

no need for default union setting
easier for new users to not "lose" their data:
- with an Exclusive dataset, adding a quad to the Dataset when Default Union is off means the quad won't be found when using the triples or quads methods unless context is set to "Named". The inverse is true when adding a triple while Default Union is on.
- equivalent scenarios for Inclusive Dataset only occur when a user explicitly sets context= to something that does not include their data.

Perhaps the inclusive/exclusive options are a good place to start as the other methods depend on these:

are the options as I've drawn them correct? other options?
what are others' preferences?

namedgraph Dec 18, 2024
Author

@nicholascar I understand 8.x will contain breaking changes, but I still think they should be kept to a minimum. More specifically, they should be limited to where RDFLib is non-compliant or the API is awkward, but not to rename or redesign the APIs that already make sense, just for the sake of redesigning them.

namedgraph Dec 18, 2024
Author

@recalcitrantsupplant honestly I don't understand/like your proposal :) Why is GraphType and context necessary at all? As for dataset, I think get_graph(identifier) which retrieves a named Graph is much more useful than quads().

recalcitrantsupplant Dec 18, 2024

Hey Martynas, I'll write this up more clearly soon outside of this discussion, including what I think shouldn't be changed. Given the feedback on Graph specifically, i.e. that having a name or identifier for it is inconsistent with the spec, I'll remove that from proposed changes.

The thinking with quads and related methods (by which I mean "subjects", "subject_predicates", "subject_objects" etc.) is that there should be a way to provide "dataset subsetting", equivalent to SPARQL FROM and FROM NAMED but entirely separate from the query method. In my day-to-day use of rdflib this would be quite useful.

The main pain point for me with Dataset otherwise is adding and accessing triples or quads, so I'll include proposed changes here too.

afs · 2024-12-06T13:47:15Z

afs
Dec 6, 2024

Observations from afar ...

The context parameter on the triples and quads methods

Would the dataset (the storage unit) have a default context setting? Otherwise if an app changes, then it might require every API call to be tracked down and changed.

FWIW Fuseki has both modes - union default graph is SPARQL only, and it is a view of the dataset at query time. The usual way is to have a setting on the dataset but it can be set per query execution.

For update, where do new triples go to in an inclusive dataset?

1 reply

recalcitrantsupplant Jan 7, 2025

Would the dataset (the storage unit) have a default context setting? Otherwise if an app changes, then it might require every API call to be tracked down and changed.

For update, where do new triples go to in an inclusive dataset?

I would propose not having a default context setting for either storage or queries; triples added to the dataset without explicitly stating a graph are stored in the Default Graph, and quads in the relevant graph.
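
A sketch of that routing rule, with a hypothetical `ToyDataset` and plain strings for terms: a 3-tuple goes to the Default Graph (stored under the key `None` here), a 4-tuple to the graph it names.

```python
from __future__ import annotations

from collections import defaultdict


class ToyDataset:
    def __init__(self) -> None:
        # Key None is the Default Graph; every other key is a graph name.
        self._graphs: defaultdict = defaultdict(set)

    def add(self, statement: tuple) -> None:
        if len(statement) == 3:
            s, p, o = statement
            g = None                 # no graph stated: Default Graph
        elif len(statement) == 4:
            s, p, o, g = statement   # quad: the graph it names
        else:
            raise ValueError("expected a triple or a quad")
        self._graphs[g].add((s, p, o))

    def quads(self) -> set[tuple]:
        return {(s, p, o, g) for g, ts in self._graphs.items() for s, p, o in ts}


ds = ToyDataset()
ds.add(("ex:s1", "ex:p1", "ex:o1"))
ds.add(("ex:s2", "ex:p2", "ex:o2", "ex:g1"))
```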

To confirm: when you say "if an app changes", is this in relation to the current use of default contexts, i.e. that those users would need to refactor their code to explicitly state where to store/query triples with each call?

PicouAymeric · 2025-01-24T23:42:06Z

PicouAymeric
Jan 24, 2025

Trying to summarize what we have to do:

Graph and Dataset classes in two different files
Delete the ConjunctiveGraph class (merge it into Dataset)
Rename "context" to "named graph" in Dataset
Add add_graph/remove_graph/has_graph etc. on Dataset
A graph is just a set of triples: the graph identifier (to be renamed to graph name) shouldn't be on the graph but rather on the dataset; same for the store
Graph doesn't inherit from Node
Queries are always executed against an RDF dataset, not a graph
Dataset is not a Graph but is rather based on the following structure: one default graph + either None or a dictionary of IRI to Graph.

What else?

I had a look, and the refactoring looks complex: in order not to break things, we need to understand how most of the project works.
I have some time I could spend on this, but probably not enough, given that I don't know the project well enough.

10 replies

namedgraph Jan 27, 2025
Author

I wonder if it would be easier to make RDFLib 8.0 just a wrapper for Oxigraph (given that Oxrdflib already exists) rather than try to refactor this legacy codebase 😅
SPARQL would perform faster, too. @Tpt WDYT? :)

ashleysommer Jan 27, 2025
Maintainer

@namedgraph

I wonder if it would be easier to make RDFLib 8.0 just a wrapper for Oxigraph

I get where you're coming from, but for the last 25 years RDFLib has been the pure-Python RDF implementation. The whole point of RDFLib is that it is pure Python. If you want to wrap C or Rust or Java code, there are better libraries out there (personally, I just use PyOxigraph for my projects).

Having said that, I am planning to have an official Oxigraph-store backend for RDFlib ready for the RDFlib v8 release (something similar to Oxrdflib), that will be an optional python-"extra" in the same way we have optional orjson support.
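
For readers unfamiliar with the optional-"extra" pattern, a generic sketch follows. It shows the usual try-import-and-fall-back shape using orjson as the example; the helper name `dumps_text` is hypothetical, and this is not a claim about how RDFLib wires up orjson or how an Oxigraph backend would be loaded.

```python
import json

try:
    import orjson  # optional accelerator; only present if the extra is installed
except ImportError:
    orjson = None


def dumps_text(obj) -> str:
    """Serialize to a JSON string with whichever backend is available."""
    if orjson is not None:
        return orjson.dumps(obj).decode("utf-8")  # orjson.dumps returns bytes
    return json.dumps(obj)


payload = dumps_text({"graph": "urn:g1", "size": 2})
```

The core stays pure Python and works without the extra; installing it only swaps in a faster implementation behind the same function.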

Tpt Jan 28, 2025

+1 to @ashleysommer: RDFLib is the pure-Python RDF implementation. Moving everything on top of pyoxigraph seems like more work than cleaning up RDFLib internals.

Moreover, while pyoxigraph is much faster at loading an RDF file or running a SPARQL query, plain RDFLib is significantly faster at creating an RDF term or accessing term values (IRI string...), because of the overhead of converting between Python str and Rust String. So for users who don't care about SPARQL but hand-write RDF manipulations in Python, RDFLib is likely to be faster.

I am planning to have an official Oxigraph-store backend for RDFlib ready for the RDFlib v8 release

This would be amazing! However, feel free to just take over oxrdflib if you prefer to have a smaller RDFLib core and dogfood the RDFLib Store plugin API with pyoxigraph. Feel free to reach out if you need help in any case.

recalcitrantsupplant Jan 29, 2025

I've created a draft PR here: #3060

namedgraph Feb 12, 2025
Author

I've provided some comments on the PR
