Open
Labels
dependencies, enhancement
Description
Problem
The ExtractBoilerpipeText class doesn't fully do what it purports to: it sometimes leaves (often large) portions of boilerplate, e.g. header and comment-thread text, in the output. Boilerpipe was last updated 10 years ago.
Preferred solution
Boilerplate removal that is more consistent with Readability.js, built on a more modern Java/Scala library.
Alternatives considered
There are two libraries available:
- Readability4J: more recent, more active
- readability4s: pure Scala, but moribund
Additional context
Advice on selecting a library would be much appreciated; I suspect the main consideration will be maintainability.
I'm happy to write up a test and PR for this issue once there's a decision. I can provide failing examples if that's helpful.
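To give a rough idea of what such a test might check, here is a minimal sketch. It assumes only that an extractor can be treated as a function from an HTML string to plain text. The class name `BoilerplateLeakCheck`, the method `leakedPhrases`, and the tag-stripping stand-in extractor are all hypothetical and are not part of the toolkit, Boilerpipe, or either candidate library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: not part of any existing library.
public class BoilerplateLeakCheck {

    // Returns the known boilerplate phrases that still appear in the
    // extractor's output; an empty list means nothing leaked through.
    public static List<String> leakedPhrases(Function<String, String> extractor,
                                             String html,
                                             List<String> boilerplatePhrases) {
        String text = extractor.apply(html);
        List<String> leaked = new ArrayList<>();
        for (String phrase : boilerplatePhrases) {
            if (text.contains(phrase)) {
                leaked.add(phrase);
            }
        }
        return leaked;
    }

    public static void main(String[] args) {
        String html = "<html><body><header>Site navigation</header>"
                + "<article>Main story text.</article>"
                + "<div class=\"comments\">First comment</div></body></html>";

        // Stand-in extractor (assumption): strips tags only, so header and
        // comment text survive. A real test would plug in the extractor
        // under evaluation here instead.
        Function<String, String> naive = h -> h.replaceAll("<[^>]+>", " ");

        System.out.println(
                leakedPhrases(naive, html, Arrays.asList("Site navigation", "First comment")));
    }
}
```

Run against a handful of saved pages, a check like this would make it straightforward to compare the current extractor with either candidate library on the same inputs.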