eclipse-basyx/basyx-pdf-to-aas

PDF Datasheets to AAS

Python library and scripts to extract technical data from PDFs or text using transformers, especially Large Language Models (LLMs), and export it as an Asset Administration Shell (AAS) submodel.

Usage

```python
from pdf2aas import PDF2AAS

pdf2aas = PDF2AAS()
pdf2aas.convert('path/to/datasheet.pdf', 'eclass id e.g. 27274001', 'path/to/submodel.json')
```

The default PDF2AAS toolchain saves the submodel in JSON format to the given path, using:

  • pypdfium2 preprocessor
  • ECLASS dictionary
  • PropertyLLMSearch extractor with OpenAI client
  • AAS Technical Data Submodel generator

See Modules for details.

Examples

You can find example toolchains with intermediate steps in the examples folder. To run them, make sure pdf2aas is installed according to the Setup section below. Then execute, for example: python default_toolchain.py. Use the -h or --help argument for more information on the command line arguments.

WebUI

A Gradio-based web UI is available by starting examples/demo_gradio.py. It is meant as an example to showcase some of the library's features and to experiment with settings, rather than as a production-ready tool.

To use it, install the additional dependencies listed in demo-requirements: pip install -r demo-requirements.txt. The web server is also built as a Windows executable, which can be downloaded from the job artifacts:

Package gradio example with PyInstaller - Windows

The web UI features the selection of an asset class from the ECLASS, ETIM, or CDD dictionary in different releases. If no dictionary is selected, the extractor searches for all technical properties without using definitions. An AAS template (in fact, any aasx package) can be opened and searched for properties, which can then be used as definitions to search for as well.

Screenshot of the webui showing the tab where definitions (Dictionary, Release and Class) are selected

A datasheet (PDF, text, CSV, HTML, ...) can be uploaded, and the properties of the selected dictionary class or AAS template can be extracted by an LLM. The extracted properties can be downloaded as xlsx, json, technical data submodel (json), or AAS (aasx). If an AAS template was given, the aasx can be downloaded with the property values updated from the extraction. The UI shows a table of the extracted properties and marks their references in the preprocessed text if found.

Screenshot of the webui showing the tab where the PDF is uploaded and extracted values are presented and results can be downloaded.

The Raw Results tabs show the raw and formatted prompt and answer from the LLM in chatbot style and as JSON. The Settings tab additionally allows configuring different extractor and client parameters.

Docker

A Docker image can be built via the contained Dockerfile. It uses the WebUI example as entry point to showcase some features of the library; the entrypoint can easily be overwritten to use the library directly. A prebuilt container image can be used, e.g. via the docker-compose file, to start the WebUI.

Dictionary Cache

Because the conversion of dictionary releases and web requests take some time, the dictionaries are cached in a temp/dict folder, which is mapped into the container. They are stored in a custom JSON format. This also allows adding ECLASS or ETIM releases, for example as CSV zip files: ETIM-9.0-ALL-SECTORS-CSV-METRIC-EI-2022-12-05.zip, ECLASS-14.0-CSV.zip. They take some time to be converted to the internal format on first startup.
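
The cached release files follow a simple naming convention. A small helper like the following (illustrative only, not part of the library, and the actual matching logic may differ) shows how a zip file in temp/dict could be recognized:

```python
import re

def parse_release_zip(filename: str):
    """Parse dictionary name and release version from a cached zip filename.

    Illustrative helper only; the library's actual matching may differ.
    Expected patterns: 'ECLASS-14.0-CSV.zip', 'ETIM-9.0-...-CSV-....zip'.
    """
    match = re.match(r"(ECLASS|ETIM)-(\d+(?:\.\d+)?)", filename)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_release_zip("ECLASS-14.0-CSV.zip"))  # ('ECLASS', '14.0')
```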

WebUI Settings

To customize the settings, add a settings.json file to the working directory. To create a well-formed settings file, an example can be downloaded from the WebUI on the Settings tab by clicking "Save Settings File". The settings path can additionally be overwritten via the command line argument --settings SETTINGS.
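
Loading such a file with the path override could look like the following sketch (the fallback keys shown are hypothetical examples; the authoritative schema comes from "Save Settings File" in the WebUI):

```python
import json
import os
from pathlib import Path

def load_settings(default_path: str = "settings.json") -> dict:
    """Load WebUI settings, honoring a SETTINGS path override.

    Illustrative sketch only; the keys in the fallback dict are
    hypothetical, not the library's actual settings schema.
    """
    path = Path(os.environ.get("SETTINGS", default_path))
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"client": "openai", "temperature": 0.0}  # hypothetical defaults

settings = load_settings()
```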

Some settings can be set via environment variables to avoid exposing secrets. For example, an .env file can be loaded via docker-compose:

```
# OpenAI Endpoint
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=...
# Azure Endpoint
AZURE_OPENAI_API_KEY=...
AZURE_ENDPOINT=https://...openai.azure.com
AZURE_DEPLOYMENT=gpt-4...
AZURE_API_VERSION=2025-...
# Optional ETIM API
ETIM_CLIENT_ID=...
ETIM_CLIENT_SECRET=...
# Optional proxy setup
HTTP_PROXY=..
HTTPS_PROXY=..
NO_PROXY=...
```

Workflow

```mermaid
flowchart LR
    datasheet(Data Sheet)
    dictDB[("Property 
                 Dictionary")]
    llm{{LLM Client}}
    submodel(AAS Submodel)
    table(Table)
    aasx(AASX Package)

    subgraph pdf2aas

      preprocessor[[Preprocessor]]
      dictionary[[Dictionary]]
      extractor[[Extractor]]
      generator[[Generator]]

      preprocessor --text--> extractor
      dictionary --"property 
          definitions"-->  extractor
      extractor --prompt--> llm --properties--> extractor
      extractor --property list--> generator
    end

    datasheet --pdf/csv/html/...---> preprocessor
    datasheet -.classification.-> dictionary

    dictDB --"class and
      property definition"---> dictionary

    generator -.property definitions
      (AAS template).-> extractor
    generator --json--> submodel & aasx
    generator --csv--> table
```

Remarks

  • Typical Property Dictionaries are ECLASS, CDD, ETIM, EDIBATEC, EPIC, GPC, UniClass.
  • The classification (e.g. the ECLASS or ETIM class of the device) needs to be done manually, but could be automated (e.g. via LLMs, RAG, etc.) in the future.
  • Additional PDF Preprocessors might be added in the future, e.g. specialized on table or image extraction. LLMs might also be used to preprocess the PDF content first, e.g. summarize it in JSON format.
  • Property definitions can be derived from an AAS template (or instance), instead of providing the property definitions from a class of a dictionary directly.

Modules

  • preprocessor: converts the PDF to a text format that can be processed by LLMs, keeping layout and table information.
    • PDFium: Uses pypdfium2, based on PDFium, to extract text from the PDF without layout information.
    • PDF2HTML: Uses pdf2htmlEX to convert the PDF datasheets to HTML. The converted HTML is preprocessed further to reduce token usage for the LLMs.
    • PDFPlumber: Uses pdfplumber, based on pdfminer.six, to extract text from the PDF.
    • PDFPlumberTable: Uses pdfplumber to extract tables from the PDF. Can output the extracted tables in various formats, e.g. markdown, using tabulate.
    • Text: Opens the file as a text file, allowing text-based formats like txt, html, csv, json, etc.
  • dictionary: defines classes and properties semantically.
    • ECLASS: loads property definitions from the ECLASS website for a given ECLASS class.
      • To load from a release, the CSV version needs to be placed as a zip file in temp/dict, named similar to ECLASS-14.0-CSV.zip.
      • Make sure to comply with the ECLASS license.
    • ETIM: loads property definitions via the ETIM API
    • CDD: loads property definitions from IEC CDD website for a given CDD class.
      • Make sure to comply with CDD license. We are only using "FREE ATTRIBUTES" according to the current license.
  • extractor: extracts technical properties from the preprocessed data sheet.
    • PropertyLLM: Prompts an LLM client to extract all properties (without definitions) from a datasheet text.
    • PropertyLLMSearch: Prompts an LLM to search for values of given property definitions from a datasheet text.
    • PropertyLLMMap: Prompts an LLM client to extract all properties from a datasheet text and maps them with given property definitions (currently only by the label).
    • Clients: The PropertyLLM extractors can be used with OpenAI, AzureOpenAI and a CustomLLMClientHTTP (defined here) at different local or cloud endpoints.
  • generator: transforms an extracted property-value list into different formats.
    • AASSubmodelTechnicalData: outputs the properties in a technical data submodel.
    • AASTemplate: loads an aasx file as template to search for and update all contained properties.
    • CSV: outputs the extracted properties as a CSV file.
  • model: Python classes to handle properties, their definitions, and dictionary classes inside the library.
  • evaluation: Python classes to evaluate the library against existing AASes. Needs optional dependencies, cf. Evaluation.
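
A simplified sketch of what such a property model might look like (the actual classes in pdf2aas.model have more fields and may use different names):

```python
from dataclasses import dataclass

@dataclass
class PropertyDefinition:
    """Semantic definition of a property, e.g. from ECLASS or ETIM.

    Simplified illustration; not the library's actual class.
    """
    id: str
    name: str
    unit: str = ""

@dataclass
class Property:
    """An extracted value linked to its definition and source text."""
    definition: PropertyDefinition
    value: object = None
    reference: str = ""  # text span in the datasheet the value came from

width = Property(PropertyDefinition("0173-1#02-...", "width", "mm"), 42, "Width: 42 mm")
```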

Setup

  • Virtual Environment (Optional but highly recommended)
    • Create a virtual environment, e.g. with venv in current directory with python -m venv .venv
    • Activate it
      • Windows: .venv\Scripts\activate
      • Linux: source .venv/bin/activate
  • Install requirements: python -m pip install -r requirements.txt
  • Install the pdf2aas package: python -m pip install .

Development

  • Install dev requirements: python -m pip install -r dev-requirements.txt. This allows running the tests, installer, etc.
  • Install pdf2aas as an editable package with python -m pip install -e . This makes the package importable on the entire local system while remaining editable.

Specific Toolchains

  • For the pdf2htmlEX (HTML) preprocessor, the binary needs to be downloaded and installed. Currently it is only available for Linux distributions, but it can be used via WSL or Docker on Windows.
  • To run a local model, the extractor needs to be initialized or configured with an OpenAI-API-compatible api_endpoint, or with a CustomLLMClientHTTP.
  • For some toolchains, specific environment variables need to be set, e.g. via an .env file and the python-dotenv package.
    • OPENAI_API_KEY: to use the extractor via the OpenAI public endpoint.
    • AZURE_OPENAI_API_KEY, AZURE_ENDPOINT, AZURE_DEPLOYMENT, AZURE_API_VERSION: To use an AzureOpenAI client.
    • ETIM_CLIENT_ID and ETIM_CLIENT_SECRET: to use the ETIM dictionary via ETIM API.
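
A stdlib-only sketch of resolving these variables could look as follows. The variable names are from the list above, but the selection logic (preferring Azure when its key is present) is an illustrative assumption, not the library's behavior:

```python
import os

def resolve_llm_credentials() -> dict:
    """Pick an LLM endpoint configuration from environment variables.

    Illustrative only: prefers Azure if its key is set, else OpenAI.
    """
    if os.getenv("AZURE_OPENAI_API_KEY"):
        return {
            "client": "azure",
            "endpoint": os.getenv("AZURE_ENDPOINT"),
            "deployment": os.getenv("AZURE_DEPLOYMENT"),
            "api_version": os.getenv("AZURE_API_VERSION"),
        }
    return {"client": "openai", "api_key": os.getenv("OPENAI_API_KEY")}

config = resolve_llm_credentials()
```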

Tests


  • If not already done, install the dev dependencies via python -m pip install -r dev-requirements.txt
  • Run tests with pytest
  • To check code coverage use pytest --cov=pdf2aas
  • To check code style use ruff, e.g. ruff check
    • You can use ruff format etc. to format accordingly.

Evaluation

The evaluation module allows evaluating the extraction against existing pairs of an AAS and its datasheet. To use it, install the additional dependencies listed in eval-requirements: pip install -r eval-requirements.txt. An example script can be found in the examples folder.
