feat: amazon textract integration #2017

Draft
wants to merge 1 commit into
base: main

kanenorman
Contributor

Related Issues

Proposed Changes:

Introduce new components for AWS Textract integration.

| Textract API | Proposed Haystack Component | Supported Sources | Description |
| --- | --- | --- | --- |
| DetectDocumentText | AmazonTextractDocumentConverter | local, s3, PIL | Basic OCR. Extracts lines and words without forms or tables. |
| AnalyzeDocument | AmazonTextractStructuredDocumentConverter | local, s3, PIL | Extracts structured data like forms, tables, and optional queries. |
| AnalyzeID | AmazonTextractIDConverter | local, s3, PIL | Extracts identity fields from official IDs (e.g., driver's license, passport). |
| AnalyzeExpense | AmazonTextractExpenseConverter | local, s3, PIL | Extracts fields, totals, and line items from receipts and invoices. |
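For context on what the basic OCR row involves, here is a minimal sketch of turning a DetectDocumentText response into plain text. It is not the code in this PR: the helper names `lines_from_response` and `detect_text` are hypothetical, and the sketch assumes the converter keeps only `LINE` blocks in the order Textract returns them.

```python
from typing import Any


def lines_from_response(response: dict[str, Any]) -> str:
    """Join the text of all LINE blocks, in response order, one per line.

    Hypothetical helper: PAGE and WORD blocks are skipped because each LINE
    block already carries the concatenated text of its words.
    """
    blocks = response.get("Blocks", [])
    return "\n".join(b["Text"] for b in blocks if b.get("BlockType") == "LINE")


def detect_text(image_bytes: bytes) -> str:
    """Call the synchronous DetectDocumentText API on raw document bytes."""
    import boto3  # imported lazily so the parser above has no AWS dependency

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_from_response(response)
```

Keeping the response parsing separate from the API call means the parsing can be unit tested with a hand-written response dictionary and no AWS credentials.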

Example Usage

from datetime import datetime

from haystack_integrations.components.generators.amazon_textract import AmazonTextractDocumentConverter

converter = AmazonTextractDocumentConverter()

results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)

Implementation Details

  • AnalyzeID has no asynchronous API, so it won't support multipage sources.
  • Multipage PDFs must be stored on S3, since Textract's asynchronous operations only read documents from S3.
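The two constraints above could be encoded as a small routing step before any request is sent. The sketch below is only an illustration, not the PR's implementation: `choose_operation` and both lookup tables are hypothetical names, and the returned strings are the boto3 Textract client method names for each API.

```python
# Synchronous boto3 Textract client methods, keyed by Textract API name.
SYNC_OPERATIONS = {
    "DetectDocumentText": "detect_document_text",
    "AnalyzeDocument": "analyze_document",
    "AnalyzeID": "analyze_id",
    "AnalyzeExpense": "analyze_expense",
}

# Asynchronous counterparts. AnalyzeID is absent because it has none.
ASYNC_OPERATIONS = {
    "DetectDocumentText": "start_document_text_detection",
    "AnalyzeDocument": "start_document_analysis",
    "AnalyzeExpense": "start_expense_analysis",
}


def choose_operation(api: str, source: str, page_count: int = 1) -> str:
    """Return the client method to call, or raise if the source is unsupported.

    Hypothetical helper: `source` is a local path or an s3:// URI, and
    `page_count` is assumed to be known before the call.
    """
    if api not in SYNC_OPERATIONS:
        raise ValueError(f"Unknown Textract API: {api}")
    if page_count <= 1:
        return SYNC_OPERATIONS[api]
    if api not in ASYNC_OPERATIONS:
        raise ValueError(f"{api} has no asynchronous API; multipage sources are unsupported")
    if not source.startswith("s3://"):
        raise ValueError("Multipage documents must be stored on S3")
    return ASYNC_OPERATIONS[api]
```

Failing early with a `ValueError` gives the caller a clear message instead of an opaque service error for an unsupported combination.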

How did you test it?

I will add thorough unit and integration tests in the final PR.

Notes for the reviewer

This is an early draft intended to gather feedback and input from the core team. To keep the scope of this PR focused and manageable, I'm starting with the AmazonTextractDocumentConverter. After gathering feedback, I will finish the implementation, testing, and docs. Pending approval, I plan to follow up with additional PRs implementing the other proposed components.

Checklist

@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Jun 27, 2025
Development

Successfully merging this pull request may close these issues.

Add support for AWS textract