Cannot include page headers and footers #32

@jtnicholl-cosairus

I was looking into using spaCy Layout in a project I've been working on. I tested it out and noticed that the page headers and footers are not being included in the Document object.

I took a look through the source code and I see this is where the text spans are put together:

for node, _ in document.iterate_items():
    ...

By default, iterate_items only iterates over the Docling NodeItems in the "body" content layer. The page headers and footers seem to be in a separate layer, "furniture", so they get excluded.

I understand that this makes sense to do for most NLP projects, but for ours we actually need those headers and footers. Could there be an option added to override this behavior?
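
To make the request concrete, here is a minimal sketch of the kind of option I mean. It does not import Docling or spaCy Layout: `Node`, the simplified `iterate_items`, `collect_texts`, and the `include_furniture` flag are all hypothetical stand-ins made up for illustration. It assumes the real `iterate_items` can be given a set of layers to include (something like an `included_content_layers` argument), which would need to be checked against the installed Docling version.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, List, Optional, Set, Tuple


class ContentLayer(str, Enum):
    """Stand-in for the two content layers mentioned above."""
    BODY = "body"
    FURNITURE = "furniture"


@dataclass
class Node:
    """Hypothetical, heavily simplified stand-in for a Docling NodeItem."""
    text: str
    content_layer: ContentLayer = ContentLayer.BODY


def iterate_items(
    nodes: Iterable[Node],
    included_content_layers: Optional[Set[ContentLayer]] = None,
) -> Iterator[Tuple[Node, int]]:
    """Yield (node, level) pairs, keeping only nodes in the included layers.

    Mirrors the behavior described above: body only unless told otherwise.
    The level is always 0 here because this stand-in has no nesting.
    """
    layers = included_content_layers or {ContentLayer.BODY}
    for node in nodes:
        if node.content_layer in layers:
            yield node, 0


def collect_texts(nodes: Iterable[Node], include_furniture: bool = False) -> List[str]:
    """Hypothetical opt-in flag: also keep page headers and footers."""
    layers = {ContentLayer.BODY}
    if include_furniture:
        layers.add(ContentLayer.FURNITURE)
    return [node.text for node, _ in iterate_items(nodes, included_content_layers=layers)]


# Made-up page content, in reading order.
page = [
    Node("ACME Corp - Confidential", ContentLayer.FURNITURE),  # page header
    Node("Quarterly results were strong."),
    Node("Page 1 of 3", ContentLayer.FURNITURE),  # page footer
]

body_only = collect_texts(page)
with_furniture = collect_texts(page, include_furniture=True)
```

With the flag left at its default, only the body paragraph is kept, so existing behavior would not change; setting it to True keeps the header and footer in document order as well.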
