Open
Description
I was looking into using spaCy Layout in a project I've been working on. I tested it out and noticed that the page headers and footers are not being included in the Document object.
I took a look through the source code and I see this is where the text spans are put together:
    for node, _ in document.iterate_items():
        ...

By default, iterate_items iterates only over the Docling NodeItems in the "body" content layer. The page headers and footers seem to be in the other layer, "furniture", so they get excluded.
I understand that this makes sense to do for most NLP projects, but for ours we actually need those headers and footers. Could there be an option added to override this behavior?
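To make the request concrete, here is a rough sketch of the kind of override I have in mind. The helper name `iter_layout_nodes` and its `layers` argument are hypothetical and not part of spaCy Layout today. It also assumes that the installed docling-core version accepts an `included_content_layers` argument on `iterate_items`, which I have not checked against every release.

```python
from typing import Any, Iterable, Iterator, Optional


def iter_layout_nodes(
    document: Any, layers: Optional[Iterable[Any]] = None
) -> Iterator[Any]:
    """Yield Docling items, optionally from more than the default "body" layer.

    `document` is expected to be a DoclingDocument. `layers` is a hypothetical
    opt-in argument, not an existing spaCy Layout setting.
    """
    if layers is None:
        # Current behaviour: rely on Docling's default (only the "body" layer).
        items = document.iterate_items()
    else:
        # Assumption: docling-core supports `included_content_layers` here.
        items = document.iterate_items(included_content_layers=set(layers))
    for node, _level in items:
        yield node


# Possible usage with a real DoclingDocument (requires docling-core):
#   from docling_core.types.doc.document import ContentLayer
#   nodes = iter_layout_nodes(doc, {ContentLayer.BODY, ContentLayer.FURNITURE})
```

Keeping `layers=None` as the default would leave the existing behaviour unchanged for everyone who does not need the headers and footers.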
basavarm