This repository contains the code to reproduce the results from our paper "CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents", which introduces:
- A Contextual Ranking Method (CRAWLDoc) - A novel document-as-query approach for robust identification of bibliographic sources across web documents, using:
  - Unified embeddings of content, URLs, and anchor texts
  - Layout-aware processing of HTML/PDF documents
  - Maximum Inner Product Search (MIPS) ranking (see the sketch after this list)
- A New Benchmark Dataset - A comprehensive dataset for bibliographic source retrieval containing:
  - 600 publications from 6 major CS publishers (ACM, IEEE, Springer, etc.)
  - 72,483 annotated document relevance labels
  - Complete bibliographic records with author affiliations
  - Publisher layout variations for robustness testing

Key features:
- Layout Independence: Robust ranking across publisher website variations
- Multi-Format Support: Processes both HTML and PDF documents
- One-Hop Context: Evaluates linked resources within a single crawl depth
- Reproducible Baseline: Includes a pre-configured Jina Embeddings v2 model setup
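The MIPS ranking step can be illustrated with a short sketch: the bibliographic record acts as the query document, every one-hop candidate is embedded with the same encoder, and candidates are ranked by inner product. The snippet below uses the off-the-shelf Jina Embeddings v2 model via `sentence-transformers` as a stand-in for the encoders trained by `train_retrieval.py`; all document strings are made-up examples.

```python
# Minimal MIPS ranking sketch. Uses the off-the-shelf Jina Embeddings v2
# model as a stand-in; the paper's encoders are trained with train_retrieval.py.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en",
                            trust_remote_code=True)

# Query document: a unified representation of content, URL, and anchor text.
query = "https://doi.org/10.1000/example | landing page | Example Paper Title"

# One-hop candidates: resources linked from the landing page (made-up).
candidates = [
    "https://publisher.example/paper.pdf | Download PDF | Example Paper Title",
    "https://publisher.example/privacy | Privacy policy | Terms and conditions",
]

q = model.encode([query])          # shape (1, d)
c = model.encode(candidates)       # shape (n, d)

scores = (c @ q.T).ravel()         # inner products, shape (n,)
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {candidates[idx]}")
```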
The repository is structured as follows:
- `dataset`: Contains the dataset with the bibliographic metadata and the linked websites
- `run_scripts`: Contains the scripts to train and test the models in a robustness-check setup
To reproduce the results, use the dataset from the `dataset` folder. Each record is structured as follows:
```json
{
  "doi": "The DOI of the publication",
  "publisher_doi": "The DOI of the publisher",
  "publisher": "The publisher of the publication",
  "year": "The year of the publication",
  "title": "The title of the publication",
  "authors": [
    [
      "The name of the author",
      [
        "The affiliations of the author"
      ]
    ]
  ],
  "linked_websites": [
    {
      "id": "The id of the linked website",
      "anchor": "The anchor text of the linked website",
      "website": "The URL of the linked website",
      "label": "The label of the linked website"
    }
  ]
}
```

For legal reasons, we cannot provide the websites themselves. However, the scripts to crawl them can be found in `/dataset`.
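Once the websites have been crawled, records can be loaded with plain Python. A minimal sketch, assuming the records are stored as a JSON array in a file called `dataset/dataset.json` (the actual file names in the `dataset` folder may differ):

```python
# Sketch of loading and inspecting records; the file name dataset/dataset.json
# is an assumption — adjust it to the actual files in the dataset folder.
import json

with open("dataset/dataset.json", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    print(record["doi"], "-", record["title"])
    for site in record["linked_websites"]:
        # Each linked website carries its anchor text, URL, and relevance label.
        print(f'  [{site["label"]}] {site["anchor"]} -> {site["website"]}')
```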
To reproduce the results, run the following Python files:
- `train_retrieval.py`: Trains the retrieval models (document and query encoder) with the CRAWLDoc procedure
- `eval_ranking.py`: Evaluates the retrieval models
The hyperparameter search was conducted with Weights & Biases. The sweep configuration is stored in `sweep.yaml`.
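To rerun the search, the sweep configuration can be registered and executed through the W&B Python API. A minimal sketch, in which the project name and the training entry point are assumptions:

```python
# Sketch of launching the sweep via the W&B Python API; the project name
# "crawldoc" and the train() entry point are assumptions, not repo code.
import yaml
import wandb

with open("sweep.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep=sweep_config, project="crawldoc")

def train():
    wandb.init()
    # Placeholder: invoke the training routine from train_retrieval.py here,
    # reading hyperparameters from wandb.config.

wandb.agent(sweep_id, function=train)
```

Equivalently, the standard CLI workflow (`wandb sweep sweep.yaml` followed by `wandb agent <sweep-id>`) can be used.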