Open
73 commits
5e8ecfd
make gradle build and add github actions
Mar 29, 2024
d545b5d
read grobid-home from configuration
Mar 29, 2024
33648de
disable superfluous tests
Mar 29, 2024
49a07b6
fix build
Mar 29, 2024
5d2872e
add simple test on analyzer to get started
Mar 29, 2024
8bc2987
enable jacoco report
Mar 29, 2024
fd84d88
fix build docker
Mar 29, 2024
ffb5bea
disable docker build for the moment
Mar 29, 2024
bb48f37
add parameter to enable/disable sentence segmentation for TEI processing
Apr 18, 2024
f05f68b
Update docker build (#1)
lfoppiano Apr 26, 2024
981ac95
implement tei processing for datasets
Apr 26, 2024
d668625
fix output JSON streaming
Apr 26, 2024
33d4f13
Merge branch 'master' into add-tei-processing-dataset
lfoppiano May 1, 2024
288850f
add the rest of the processing
May 2, 2024
12dcc37
disable broken tests
May 2, 2024
23c2dd5
add XML JATS entry point
May 2, 2024
0213c78
add CC-BY sample documents
May 2, 2024
52ffc23
revert to the original port
May 2, 2024
4448437
enable TEI processing in UI - javascript joy
May 2, 2024
4aad23d
correct parameter
May 2, 2024
6989335
attach URLs obtained from Grobid's TEI
May 6, 2024
7f0cdd5
fix frontend
May 7, 2024
1c5ff72
fix github action
May 7, 2024
4cd7390
fix wrong ifs - thanks intellij!
lfoppiano May 9, 2024
df86b81
avoid exception when entities are empty
lfoppiano May 9, 2024
843463c
avoid injecting null stuff
lfoppiano May 9, 2024
1b1da5f
reduce the timeout for checking the disambiguation service
lfoppiano May 12, 2024
75dd711
fix the convention for sentence segmentation and enable it
lfoppiano May 20, 2024
758f418
update examples
lfoppiano May 21, 2024
91fe70d
add sequence (sentence, paragraph) identifier in each mention
lfoppiano May 21, 2024
cc1cd2a
Fix sentence switch
lfoppiano May 21, 2024
c58502e
Fix incorrect xpath on children
lfoppiano May 23, 2024
6977bda
Cleanup text when extracting from XML, normalise unicode character, r…
lfoppiano Jun 4, 2024
cc01140
Fix bug in the xpaths that were used wrongly to select sentences or p…
lfoppiano Jun 4, 2024
3c3af44
Try to get possible sections in the <back> in which the das is hidden…
lfoppiano Jun 4, 2024
fda5831
Add DataSeer ML multi-sentence processing
lfoppiano Jun 7, 2024
dbbb6e1
add backward compatibility to dataseer-ml
lfoppiano Jun 7, 2024
c466ce8
fix body extraction in certain pub2tei tei output with multiple div l…
lfoppiano Jun 21, 2024
56c8042
update to grobid 0.8.1
lfoppiano Sep 14, 2024
7b6fe06
update to grobid 0.8.1, and catch up other changes
lfoppiano Sep 14, 2024
f7eef79
remove jar
lfoppiano Sep 15, 2024
21df13d
update documentation, add TEI processing
lfoppiano Oct 4, 2024
795f866
add output data format
lfoppiano Oct 4, 2024
849465d
update manual build
lfoppiano Oct 10, 2024
72edfb1
remove more stuff
lfoppiano Oct 10, 2024
538c0eb
cleanup the space
lfoppiano Oct 10, 2024
da6746c
retrieve URLs from the TEI XML in all the sections that are of interest
lfoppiano Oct 13, 2024
2162720
retrieve URLs from the TEI XML in all the sections that are of interest
lfoppiano Oct 13, 2024
a2b5bbb
update github actions
lfoppiano Oct 13, 2024
920323f
fix xpath to fall back into div into TEI/back
lfoppiano Oct 13, 2024
e3a4890
fix xpath to fall back into div into TEI/back
lfoppiano Oct 13, 2024
e256ffa
cleanup
lfoppiano Oct 13, 2024
c92adb1
fix reference mapping
lfoppiano Oct 13, 2024
27194da
fix references extraction
lfoppiano Oct 14, 2024
127fbc2
update copyright
lfoppiano Oct 14, 2024
371f520
cleanup
lfoppiano Oct 13, 2024
1483aab
fix reference mapping
lfoppiano Oct 13, 2024
4ab67a6
fix references extraction
lfoppiano Oct 14, 2024
b54c567
cleanup API
lfoppiano Oct 16, 2024
4172805
fix regression
lfoppiano Oct 22, 2024
0a5cedd
cosmetics
lfoppiano Oct 22, 2024
774dd78
fix regression
lfoppiano Oct 22, 2024
b18454b
cosmetics
lfoppiano Oct 22, 2024
1e658fd
fix regressions in the way we attach references from TEI
lfoppiano Oct 22, 2024
962f7eb
fix regressions in the way we attach references from TEI
lfoppiano Oct 22, 2024
450e5f2
update config docker
lfoppiano Oct 22, 2024
0206b5c
propagate maximum parallel requests to grobid
lfoppiano Oct 22, 2024
cfc48db
revert wrong config change
lfoppiano Oct 22, 2024
3b343c6
allow xml:id to be string using a wrapper that generates integer to m…
lfoppiano Jan 1, 2025
54bc62a
Merge branch 'add-tei-processing-dataset'
lfoppiano Jan 1, 2025
39c0e43
fix extraction of urls that are not well formed (supplementary-materi…
lfoppiano Jan 2, 2025
701e1f4
update download location using HF
lfoppiano Apr 9, 2025
2894f13
also include the dataseer-ml models
lfoppiano Apr 13, 2025
65 changes: 65 additions & 0 deletions .github/workflows/ci-build-manual.yml
@@ -0,0 +1,65 @@
name: Build and push a development version on docker

on:
  workflow_dispatch:
    inputs:
      custom_tag:
        type: string
        description: Docker image tag
        required: true
        default: "latest-develop"

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

  docker-build:
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          sudo rm -rf "/usr/local/share/boost"
          sudo rm -rf "$AGENT_TOOLSDIRECTORY"
          sudo rm -rf /opt/hostedtoolcache
          sudo rm -rf /opt/google/chrome
          sudo rm -rf /opt/microsoft/msedge
          sudo rm -rf /opt/microsoft/powershell
          sudo rm -rf /opt/pipx
          sudo rm -rf /usr/lib/mono
          sudo rm -rf /usr/local/julia*
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /usr/local/lib/node_modules
          sudo rm -rf /usr/local/share/chromium
          sudo rm -rf /usr/local/share/powershell
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /usr/share/swift
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v6
        with:
          dockerfile: Dockerfile.datastet
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/datastet
          registry: docker.io
          pushImage: true
          tags: |
            latest-develop, ${{ github.event.inputs.custom_tag }}
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
71 changes: 71 additions & 0 deletions .github/workflows/ci-build.yml
@@ -0,0 +1,71 @@
name: Build unstable

on: [push]

concurrency:
  group: gradle
  # cancel-in-progress: true


jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

      - name: Test with Gradle Jacoco and Coveralls
        run: ./gradlew test jacocoTestReport coveralls --no-daemon

      - name: Coveralls GitHub Action
        uses: coverallsapp/github-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          format: jacoco

  docker-build:
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: |
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /opt/ghc
          sudo rm -rf "/usr/local/share/boost"
          sudo rm -rf "$AGENT_TOOLSDIRECTORY"
          sudo rm -rf /opt/hostedtoolcache
          sudo rm -rf /opt/google/chrome
          sudo rm -rf /opt/microsoft/msedge
          sudo rm -rf /opt/microsoft/powershell
          sudo rm -rf /opt/pipx
          sudo rm -rf /usr/lib/mono
          sudo rm -rf /usr/local/julia*
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /usr/local/lib/node_modules
          sudo rm -rf /usr/local/share/chromium
          sudo rm -rf /usr/local/share/powershell
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /usr/share/swift
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v6
        with:
          dockerfile: Dockerfile.datastet
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/datastet
          registry: docker.io
          pushImage: ${{ github.event_name != 'pull_request' }}
          tags: latest-develop
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
32 changes: 32 additions & 0 deletions .github/workflows/ci-integration-manual.yml
@@ -0,0 +1,32 @@
name: Run integration tests manually

on:
  # push:
  #   branches:
  #     - master
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout grobid home
        uses: actions/checkout@v4
        with:
          repository: kermitt2/grobid
          path: ./grobid
      - name: Checkout Datastet
        uses: actions/checkout@v4
        with:
          path: ./grobid/datastet
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build and run integration tests
        working-directory: ./grobid/datastet
        run: ./gradlew copyModels integration --no-daemon

74 changes: 74 additions & 0 deletions .github/workflows/ci-release.yml
@@ -0,0 +1,74 @@
name: Build release

on:
  workflow_dispatch:
  push:
    tags:
      - 'v*'

concurrency:
  group: docker
  cancel-in-progress: true


jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17.0.10+7'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

      - name: Test with Gradle Jacoco and Coveralls
        run: ./gradlew test jacocoTestReport coveralls --no-daemon

      - name: Coveralls GitHub Action
        uses: coverallsapp/github-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          format: jacoco


  docker-build:
    needs: [build]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
      - name: Set tags
        id: set_tags
        run: |
          DOCKER_IMAGE=lfoppiano/datastet
          VERSION=""
          if [[ $GITHUB_REF == refs/tags/v* ]]; then
            VERSION=${GITHUB_REF#refs/tags/v}
          fi
          if [[ $VERSION =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
            TAGS="${VERSION}"
          else
            TAGS="latest"
          fi
          echo "TAGS=${TAGS}"
          echo "tags=${TAGS}" >> "$GITHUB_OUTPUT"
      - uses: actions/checkout@v4
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v6
        with:
          dockerfile: Dockerfile.local
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/datastet
          registry: docker.io
          pushImage: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.set_tags.outputs.tags }}
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
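
Review note on the "Set tags" step in ci-release.yml: the tag-selection rule is easy to misread inline, so here is a minimal sketch that isolates it for local testing. The function name `derive_tag` is hypothetical and is not part of the workflow. The sketch is written in POSIX sh (`case` and `grep -E`) rather than the bash `[[ ]]` tests used in the step, on the assumption that both forms accept the same refs.

```shell
#!/bin/sh
# Hypothetical helper (not in the workflow): mirrors the "Set tags" logic.
# Takes a Git ref and prints the Docker tag the release job would pick.
derive_tag() {
  ref="$1"
  version=""
  # Strip the "refs/tags/v" prefix only when the ref is a v-prefixed tag.
  case "$ref" in
    refs/tags/v*) version="${ref#refs/tags/v}" ;;
  esac
  # Only a plain MAJOR.MINOR.PATCH version (1 to 3 digits each) is used as-is;
  # anything else, including branch refs and pre-release tags, becomes "latest".
  if printf '%s\n' "$version" | grep -Eq '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$'; then
    printf '%s\n' "$version"
  else
    printf '%s\n' "latest"
  fi
}

derive_tag "refs/tags/v0.8.1"      # prints 0.8.1
derive_tag "refs/tags/v0.8.1-rc1"  # prints latest
derive_tag "refs/heads/master"     # prints latest
```

One consequence worth checking: a tag such as `v0.8.1-rc1` matches the `v*` trigger but fails the version pattern, so under this reading it would publish the image as `latest`.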