Commit 84d56bd

Merge pull request #31 from shcherbak-ai/dev
v0.6.0
2 parents 9ffc087 + 10fd115 commit 84d56bd

68 files changed, +52557 -37724 lines changed


.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -41,7 +41,7 @@ repos:
4141
- id: export-requirements
4242
name: Export requirements files
4343
entry: python
44-
args: ["-c", "import subprocess; subprocess.run(['poetry', 'export', '-f', 'requirements.txt', '--output', 'dev/requirements/requirements.main.txt', '--without-hashes']); subprocess.run(['poetry', 'export', '-f', 'requirements.txt', '--output', 'dev/requirements/requirements.dev.txt', '--with', 'dev', '--without-hashes'])"]
44+
args: ["-c", "import subprocess; subprocess.run(['poetry', 'export', '-f', 'requirements.txt', '--output', 'dev/requirements/requirements.main.txt']); subprocess.run(['poetry', 'export', '-f', 'requirements.txt', '--output', 'dev/requirements/requirements.dev.txt', '--with', 'dev'])"]
4545
language: python
4646
pass_filenames: false
4747
always_run: true
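
The only change in this hook is that `--without-hashes` is dropped from both `poetry export` calls. Because the commands sit inside a one-line `python -c` string, the difference is easy to miss. The sketch below unrolls the updated commands into a readable form; the helper names `build_export_commands` and `run_exports` are illustrative and do not exist in the repository.

```python
import subprocess


def build_export_commands() -> list[list[str]]:
    """Return the two commands the updated hook passes to subprocess.run.

    Hypothetical helper: the real hook keeps these inline in a `-c` string.
    """
    base = ["poetry", "export", "-f", "requirements.txt", "--output"]
    return [
        # Main dependencies only.
        base + ["dev/requirements/requirements.main.txt"],
        # Main dependencies plus the "dev" group.
        base + ["dev/requirements/requirements.dev.txt", "--with", "dev"],
    ]


def run_exports() -> None:
    # Mirrors the hook: each export runs in turn, without check=True,
    # so a failing export does not raise here.
    for command in build_export_commands():
        subprocess.run(command)
```

With the flag gone, Poetry's default export behavior applies to both files, which should mean the exported requirements now carry hashes.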

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
55

66
- **Refactor**: Code reorganization that doesn't change functionality but improves structure or maintainability
77

8+
## [0.6.0](https://github.com/shcherbak-ai/contextgem/releases/tag/v0.6.0) - 2025-06-03
9+
### Added
10+
- LabelConcept - a classification concept type that categorizes content using predefined labels.
11+
812
## [0.5.0](https://github.com/shcherbak-ai/contextgem/releases/tag/v0.5.0) - 2025-05-29
913
### Fixed
1014
- Params handling for reasoning (CoT-capable) models other than OpenAI o-series. Enabled automatic retry of LLM calls with dropping unsupported params if such unsupported params were set for the model. Improved handling and validation of LLM call params.

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ To sign the agreement:
104104
pytest
105105
```
106106

107-
Please note that we use pytest-vcr to record and replay LLM API interactions. Your changes may require re-recording VCR cassettes for the tests. See [VCR Cassette Management](#vcr-cassette-management) section below for details.
107+
Please note that we use [pytest-recording](https://github.com/kiwicom/pytest-recording) to record and replay LLM API interactions. Your changes may require re-recording VCR cassettes for the tests. See [VCR Cassette Management](#vcr-cassette-management) section below for details.
108108

109109
4. **Commit your changes** using Conventional Commits format:
110110

@@ -171,7 +171,7 @@ By submitting issues or feature requests to this project, you acknowledge that t
171171

172172
### VCR Cassette Management
173173

174-
We use pytest-vcr to record and replay HTTP interactions with LLM APIs. This allows tests to run without making actual API calls after the initial recording.
174+
We use [pytest-recording](https://github.com/kiwicom/pytest-recording) to record and replay HTTP interactions with LLM APIs. This allows tests to run without making actual API calls after the initial recording.
175175

176176
#### When to Re-record Cassettes
177177

NOTICE

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ Development Dependencies:
4242
- pre-commit: Pre-commit hooks
4343
- pytest: Testing framework
4444
- pytest-cov: Coverage plugin for pytest
45-
- pytest-vcr: Recording HTTP interactions for tests
45+
- pytest-recording: Recording HTTP interactions for tests
4646
- python-dotenv: Environment variable management
4747
- sphinx: Documentation generator
4848
- sphinx-autodoc-typehints: Type annotation support for Sphinx

README.md

Lines changed: 31 additions & 28 deletions
@@ -17,20 +17,23 @@
1717
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-blue?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
1818
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md)
1919
[![DeepWiki](https://img.shields.io/static/v1?label=DeepWiki&message=Chat%20with%20Code&labelColor=%23283593&color=%237E57C2&style=flat-square)](https://deepwiki.com/shcherbak-ai/contextgem)
20+
[![GitHub latest commit](https://img.shields.io/github/last-commit/shcherbak-ai/contextgem?label=latest%20commit)](https://github.com/shcherbak-ai/contextgem/commits/main)
2021

2122
<img src="https://contextgem.dev/_static/tab_solid.png" alt="ContextGem: 2nd Product of the week" width="250">
2223
<br/><br/>
2324

2425
ContextGem is a free, open-source LLM framework that makes it radically easier to extract structured data and insights from documents — with minimal code.
2526

27+
---
28+
2629

2730
## 💎 Why ContextGem?
2831

2932
Most popular LLM frameworks for extracting structured data from documents require extensive boilerplate code to extract even basic information. This significantly increases development time and complexity.
3033

3134
ContextGem addresses this challenge by providing a flexible, intuitive framework that extracts structured data and insights from documents with minimal effort. Complex, most time-consuming parts are handled with **powerful abstractions**, eliminating boilerplate code and reducing development overhead.
3235

33-
Read more on the project [motivation](https://contextgem.dev/motivation.html) in the documentation.
36+
📖 Read more on the project [motivation](https://contextgem.dev/motivation.html) in the documentation.
3437

3538

3639
## ⭐ Key features
@@ -158,8 +161,9 @@ Read more on the project [motivation](https://contextgem.dev/motivation.html) in
158161

159162
\* See [descriptions](https://contextgem.dev/motivation.html#the-contextgem-solution) of ContextGem abstractions and [comparisons](https://contextgem.dev/vs_other_frameworks.html) of specific implementation examples using ContextGem and other popular open-source LLM frameworks.
160163

164+
## 💡 What you can build
161165

162-
## 💡 With **minimal code**, you can:
166+
With **minimal code**, you can:
163167

164168
- **Extract structured data** from documents (text, images)
165169
- **Identify and analyze key aspects** (topics, themes, categories) within documents ([learn more](https://contextgem.dev/aspects/aspects.html))
@@ -253,17 +257,17 @@ for item in anomalies_concept.extracted_items:
253257

254258
---
255259

256-
See more examples in the documentation:
260+
### 📚 More Examples
257261

258-
### Basic usage examples
262+
**Basic usage:**
259263
- [Aspect Extraction from Document](https://contextgem.dev/quickstart.html#aspect-extraction-from-document)
260264
- [Extracting Aspect with Sub-Aspects](https://contextgem.dev/quickstart.html#extracting-aspect-with-sub-aspects)
261265
- [Concept Extraction from Aspect](https://contextgem.dev/quickstart.html#concept-extraction-from-aspect)
262266
- [Concept Extraction from Document (text)](https://contextgem.dev/quickstart.html#concept-extraction-from-document-text)
263267
- [Concept Extraction from Document (vision)](https://contextgem.dev/quickstart.html#concept-extraction-from-document-vision)
264268
- [LLM chat interface](https://contextgem.dev/quickstart.html#lightweight-llm-chat-interface)
265269

266-
### Advanced usage examples
270+
**Advanced usage:**
267271
- [Extracting Aspects Containing Concepts](https://contextgem.dev/advanced_usage.html#extracting-aspects-with-concepts)
268272
- [Extracting Aspects and Concepts from a Document](https://contextgem.dev/advanced_usage.html#extracting-aspects-and-concepts-from-a-document)
269273
- [Using a Multi-LLM Pipeline to Extract Data from Several Documents](https://contextgem.dev/advanced_usage.html#using-a-multi-llm-pipeline-to-extract-data-from-several-documents)
@@ -302,15 +306,13 @@ docx_text = converter.convert_to_text_format(
302306

303307
```
304308

305-
Learn more about [DOCX converter features](https://contextgem.dev/converters/docx.html) in the documentation.
306-
309+
📖 Learn more about [DOCX converter features](https://contextgem.dev/converters/docx.html) in the documentation.
307310

308311
## 🎯 Focused document analysis
309312

310313
ContextGem leverages LLMs' long context windows to deliver superior extraction accuracy from individual documents. Unlike RAG approaches that often [struggle with complex concepts and nuanced insights](https://www.linkedin.com/pulse/raging-contracts-pitfalls-rag-contract-review-shcherbak-ai-ptg3f), ContextGem capitalizes on continuously expanding context capacity, evolving LLM capabilities, and decreasing costs. This focused approach enables direct information extraction from complete documents, eliminating retrieval inconsistencies while optimizing for in-depth single-document analysis. While this delivers higher accuracy for individual documents, ContextGem does not currently support cross-document querying or corpus-wide retrieval - for these use cases, modern RAG systems (e.g., LlamaIndex, Haystack) remain more appropriate.
311314

312-
Read more on [how ContextGem works](https://contextgem.dev/how_it_works.html) in the documentation.
313-
315+
📖 Read more on [how ContextGem works](https://contextgem.dev/how_it_works.html) in the documentation.
314316

315317
## 🤖 Supported LLMs
316318

@@ -320,8 +322,7 @@ ContextGem supports both cloud-based and local LLMs through [LiteLLM](https://gi
320322
- **Model Architectures**: Works with both reasoning/CoT-capable (e.g. o4-mini) and non-reasoning models (e.g. gpt-4.1)
321323
- **Simple API**: Unified interface for all LLMs with easy provider switching
322324

323-
Learn more about [supported LLM providers and models](https://contextgem.dev/llms/supported_llms.html), how to [configure LLMs](https://contextgem.dev/llms/llm_config.html), and [LLM extraction methods](https://contextgem.dev/llms/llm_extraction_methods.html) in the documentation.
324-
325+
📖 Learn more about [supported LLM providers and models](https://contextgem.dev/llms/supported_llms.html), how to [configure LLMs](https://contextgem.dev/llms/llm_config.html), and [LLM extraction methods](https://contextgem.dev/llms/llm_extraction_methods.html) in the documentation.
325326

326327
## ⚡ Optimizations
327328

@@ -342,36 +343,35 @@ ContextGem allows you to save and load Document objects, pipelines, and LLM conf
342343
- Transfer extraction results between systems
343344
- Persist pipeline and LLM configurations for later reuse
344345

345-
Learn more about [serialization options](https://contextgem.dev/serialization.html) in the documentation.
346-
346+
📖 Learn more about [serialization options](https://contextgem.dev/serialization.html) in the documentation.
347347

348348
## 📚 Documentation
349349

350-
Full documentation is available at [contextgem.dev](https://contextgem.dev).
351-
352-
A raw text version of the full documentation is available at [`docs/docs-raw-for-llm.txt`](https://github.com/shcherbak-ai/contextgem/blob/main/docs/docs-raw-for-llm.txt). This file is automatically generated and contains all documentation in a format optimized for LLM ingestion (e.g. for Q&A).
350+
📖 **Full documentation:** [contextgem.dev](https://contextgem.dev)
353351

354-
You can also explore the repository through [DeepWiki](https://deepwiki.com/shcherbak-ai/contextgem), an AI-powered conversational interface that provides visual architecture maps and natural language Q&A for the codebase.
352+
📄 **Raw documentation for LLMs:** Available at [`docs/docs-raw-for-llm.txt`](https://github.com/shcherbak-ai/contextgem/blob/main/docs/docs-raw-for-llm.txt) - automatically generated, optimized for LLM ingestion.
355353

356-
For a history of changes, improvements, and bug fixes, see the [CHANGELOG](https://github.com/shcherbak-ai/contextgem/blob/main/CHANGELOG.md).
354+
🤖 **AI-powered code exploration:** [DeepWiki](https://deepwiki.com/shcherbak-ai/contextgem) provides visual architecture maps and natural language Q&A for the codebase.
357355

356+
📈 **Change history:** See the [CHANGELOG](https://github.com/shcherbak-ai/contextgem/blob/main/CHANGELOG.md) for version history, improvements, and bug fixes.
358357

359358
## 💬 Community
360359

361-
If you have a feature request or a bug report, feel free to [open an issue](https://github.com/shcherbak-ai/contextgem/issues/new) on GitHub. If you'd like to discuss a topic or get general advice on using ContextGem for your project, start a thread in [GitHub Discussions](https://github.com/shcherbak-ai/contextgem/discussions/new/).
360+
🐛 **Found a bug or have a feature request?** [Open an issue](https://github.com/shcherbak-ai/contextgem/issues/new) on GitHub.
362361

362+
💭 **Need help or want to discuss?** Start a thread in [GitHub Discussions](https://github.com/shcherbak-ai/contextgem/discussions/new/).
363363

364364
## 🤝 Contributing
365365

366-
We welcome contributions from the community - whether it's fixing a typo or developing a completely new feature! To get started, please check out our [Contributor Guidelines](https://github.com/shcherbak-ai/contextgem/blob/main/CONTRIBUTING.md).
366+
We welcome contributions from the community - whether it's fixing a typo or developing a completely new feature!
367367

368+
📋 **Get started:** Check out our [Contributor Guidelines](https://github.com/shcherbak-ai/contextgem/blob/main/CONTRIBUTING.md).
368369

369370
## 🔐 Security
370371

371372
This project is automatically scanned for security vulnerabilities using [CodeQL](https://codeql.github.com/). We also use [Snyk](https://snyk.io) as needed for supplementary dependency checks.
372373

373-
See [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.
374-
374+
🛡️ **Security policy:** See [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.
375375

376376
## 💖 Acknowledgements
377377

@@ -388,17 +388,20 @@ ContextGem relies on these excellent open-source packages:
388388

389389
## 🌱 Support the project
390390

391-
ContextGem is just getting started, and your support means the world to us! If you find ContextGem useful, the best way to help is by sharing it with others and giving the project a ⭐. Your feedback and contributions are what make this project grow!
391+
ContextGem is just getting started, and your support means the world to us!
392392

393+
⭐ **Star the project** if you find ContextGem useful
394+
📢 **Share it** with others who might benefit
395+
🔧 **Contribute** with feedback, issues, or code improvements
393396

394-
## 📄 License & Contact
397+
Your engagement is what makes this project grow!
395398

396-
This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.
399+
## 📄 License & Contact
397400

398-
Copyright © 2025 [Shcherbak AI AS](https://shcherbak.ai), an AI engineering company building tools for AI/ML/NLP developers.
401+
**License:** Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.
399402

400-
Shcherbak AI is now part of Microsoft for Startups.
403+
**Copyright:** © 2025 [Shcherbak AI AS](https://shcherbak.ai), an AI engineering company building tools for AI/ML/NLP developers.
401404

402-
[Connect with us on LinkedIn](https://www.linkedin.com/in/sergii-shcherbak-10068866/) for questions or collaboration ideas.
405+
**Connect:** [LinkedIn](https://www.linkedin.com/in/sergii-shcherbak-10068866/) for questions or collaboration ideas.
403406

404407
Built with ❤️ in Oslo, Norway.

contextgem/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,7 @@
2020
ContextGem - Effortless LLM extraction from documents
2121
"""
2222

23-
__version__ = "0.5.0"
23+
__version__ = "0.6.0"
2424
__author__ = "Shcherbak AI AS"
2525

2626
from contextgem.public import (
@@ -36,6 +36,7 @@
3636
JsonObjectClassStruct,
3737
JsonObjectConcept,
3838
JsonObjectExample,
39+
LabelConcept,
3940
LLMPricing,
4041
NumericalConcept,
4142
Paragraph,
@@ -58,6 +59,7 @@
5859
"RatingConcept",
5960
"JsonObjectConcept",
6061
"DateConcept",
62+
"LabelConcept",
6163
# Documents
6264
"Document",
6365
# Pipelines

contextgem/internal/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -52,11 +52,13 @@
5252
_IntegerItem,
5353
_IntegerOrFloatItem,
5454
_JsonObjectItem,
55+
_LabelItem,
5556
_StringItem,
5657
)
5758
from contextgem.internal.llm_output_structs import (
5859
_get_aspect_extraction_output_struct,
5960
_get_concept_extraction_output_struct,
61+
_LabelConceptItemValueModel,
6062
)
6163
from contextgem.internal.loggers import logger
6264
from contextgem.internal.typings import (
@@ -119,6 +121,7 @@
119121
# LLM output structs
120122
"_get_aspect_extraction_output_struct",
121123
"_get_concept_extraction_output_struct",
124+
"_LabelConceptItemValueModel",
122125
# Typings
123126
"NonEmptyStr",
124127
"LLMRoleAny",
@@ -162,6 +165,7 @@
162165
"_BooleanItem",
163166
"_JsonObjectItem",
164167
"_DateItem",
168+
"_LabelItem",
165169
# Logging
166170
"logger",
167171
# Utils

contextgem/internal/base/llms.py

Lines changed: 16 additions & 4 deletions
@@ -1465,9 +1465,15 @@ def merge_usage_data(existing: _LLMUsage | None, new: _LLMUsage) -> _LLMUsage:
14651465
if add_justifications or add_references:
14661466
for i in concept_dict["extracted_items"]:
14671467
# Process the item value with a custom function on the concept
1468-
i["value"] = relevant_concept._process_item_value(
1469-
i["value"]
1470-
)
1468+
try:
1469+
i["value"] = relevant_concept._process_item_value(
1470+
i["value"]
1471+
)
1472+
except ValueError as e:
1473+
logger.error(
1474+
f"Error processing extracted item value: {e}"
1475+
)
1476+
return None, all_usage_data
14711477
concept_extracted_item_kwargs = {"value": i["value"]}
14721478
if add_justifications:
14731479
concept_extracted_item_kwargs["justification"] = i[
@@ -1569,7 +1575,13 @@ def merge_usage_data(existing: _LLMUsage | None, new: _LLMUsage) -> _LLMUsage:
15691575
else:
15701576
for i in concept_dict["extracted_items"]:
15711577
# Process the item value with a custom function on the concept
1572-
i = relevant_concept._process_item_value(i)
1578+
try:
1579+
i = relevant_concept._process_item_value(i)
1580+
except ValueError as e:
1581+
logger.error(
1582+
f"Error processing extracted item value: {e}"
1583+
)
1584+
return None, all_usage_data
15731585
sources_mapper[relevant_concept.unique_id][
15741586
"extracted_items"
15751587
].append(relevant_concept._item_class(value=i))
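
Both hunks in this file apply the same pattern: the call to `_process_item_value` is wrapped in `try`/`except ValueError`, the error is logged, and the function returns `None` together with `all_usage_data`, so one invalid item makes the whole extraction report failure instead of raising. A minimal, self-contained sketch of that control flow follows. The names `process_items` and `to_label_list` are stand-ins invented for illustration, and the standard `logging` module is used in place of the project's own `logger`.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("sketch")


def process_items(
    raw_items: list[Any],
    process_value: Callable[[Any], Any],
    usage_data: list[str],
) -> tuple[Optional[list[Any]], list[str]]:
    """Apply process_value to every item; stop at the first ValueError."""
    processed = []
    for item in raw_items:
        try:
            processed.append(process_value(item))
        except ValueError as e:
            # Same shape as the diff: log, then return None plus usage data.
            logger.error(f"Error processing extracted item value: {e}")
            return None, usage_data
    return processed, usage_data


def to_label_list(value: Any) -> list[str]:
    """Hypothetical stand-in for a concept's _process_item_value hook."""
    if not isinstance(value, list) or not value:
        raise ValueError(f"expected a non-empty list of labels, got {value!r}")
    return value
```

Returning the usage data even on failure keeps the accounting for the LLM call intact, which appears to be why the diff returns a pair rather than re-raising.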

contextgem/internal/items.py

Lines changed: 29 additions & 0 deletions
@@ -208,3 +208,32 @@ def from_dict(cls, obj_dict: dict[str, Any]) -> Self:
208208

209209
# Use the parent class's from_dict method
210210
return super().from_dict(obj_dict_copy)
211+
212+
213+
class _LabelItem(_ExtractedItem):
214+
"""
215+
Represents an extracted item that holds a list of label values.
216+
217+
:ivar value: A list of label strings. Always returns a list for API consistency,
218+
containing one or more labels depending on the classification type.
219+
:type value: list[NonEmptyStr]
220+
"""
221+
222+
value: list[NonEmptyStr] = Field(..., min_length=1, frozen=True)
223+
224+
@field_validator("value")
225+
@classmethod
226+
def _validate_value(cls, value: list[NonEmptyStr]) -> list[NonEmptyStr]:
227+
"""
228+
Validates the input list of labels. Ensures there are no duplicates in the list.
229+
230+
:param value: List of label strings to validate.
231+
:type value: list[NonEmptyStr]
232+
:return: The same list provided as input, if it passes validation.
233+
:rtype: list[NonEmptyStr]
234+
:raises ValueError: If the list contains duplicate labels.
235+
"""
236+
if len(value) != len(set(value)):
237+
raise ValueError("_LabelItem value cannot contain duplicate labels.")
238+
239+
return value
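
The validator above detects duplicates by comparing the list length with the length of its set. The sketch below reproduces that check without third-party dependencies, using a frozen dataclass in place of the pydantic model. `LabelItemSketch` is a hypothetical name, and the non-empty-string check is only a rough approximation of `NonEmptyStr`, whose exact rules are not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelItemSketch:
    """Stdlib stand-in for _LabelItem (hypothetical, not the real class)."""

    value: tuple[str, ...]

    def __post_init__(self) -> None:
        # Counterpart of min_length=1 on the pydantic field.
        if len(self.value) < 1:
            raise ValueError("value must contain at least one label.")
        # Assumed approximation of NonEmptyStr: reject blank labels.
        if any(not isinstance(v, str) or not v.strip() for v in self.value):
            raise ValueError("labels must be non-empty strings.")
        # Same duplicate check as the validator in the diff.
        if len(self.value) != len(set(self.value)):
            raise ValueError("value cannot contain duplicate labels.")
```

A single-label result is still wrapped in a sequence, for example `LabelItemSketch(value=("NDA",))`, which matches the docstring's note that the value is always a list for API consistency.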

contextgem/internal/llm_output_structs/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,7 @@
2121
)
2222
from contextgem.internal.llm_output_structs.concept_structs import (
2323
_get_concept_extraction_output_struct,
24+
_LabelConceptItemValueModel,
2425
)
2526
from contextgem.internal.llm_output_structs.utils import _create_root_model
2627

@@ -31,4 +32,5 @@
3132
"_get_aspect_extraction_output_struct",
3233
# Concept structs
3334
"_get_concept_extraction_output_struct",
35+
"_LabelConceptItemValueModel",
3436
]
