Replace Boilerpipe functionality with more modern Readability clone #557

@mjsuhonos

Problem
Class ExtractBoilerpipeText doesn't fully do what it purports to do: it sometimes leaves portions (often large) of non-content text, e.g. header and comment-thread text, in the output. Boilerpipe was last updated 10 years ago.

Preferred solution
Boilerplate removal that is more consistent with Readability.js, built on a more modern Java/Scala library.

Alternatives considered
There are two libraries available:

Additional context
Advice on selecting a library would be much appreciated; I suspect the main consideration will be maintainability.

I'm happy to write up a test and PR for this issue once there's a decision. I can provide failing examples if that's helpful.
