Add Support for Ollama and OpenAI LLM Providers #69

AhmadHakami · 2025-09-09T13:02:36Z

This pull request introduces comprehensive support for Ollama and OpenAI LLM providers, significantly enhancing the flexibility of the synthetic-data-kit for both local and cloud-based inference workflows.

Key Changes

Provider Integration
- Extended the LLM client to support:
  - Ollama (local models like llama3.2:3b)
  - OpenAI (cloud models like gpt-4o)
- Integrated with the CLI via the --provider flag for seamless model selection.
Configuration Updates
- Updated config files and utilities to handle new providers.
- Maintains full backward compatibility.
Testing Enhancements
- Added unit tests and standalone test scripts.
- Fixed PYTHONPATH issues for reliable module imports.
- Relocated test and demo files to appropriate directories.
Dependencies
- Updated pyproject.toml with:
  - ollama
  - pytest
Documentation
- Enhanced README.md with:
  - Provider-specific usage examples
  - Setup instructions for local and cloud LLMs

Benefits

These changes allow users to leverage both local and cloud-based LLMs throughout the synthetic data pipeline (ingest → create → curate → save-as). This improves accessibility, performance, and deployment flexibility—without disrupting existing workflows.

Related Issue

Fixes: N/A (New feature implementation)

Type of Change

New feature (non-breaking change which adds functionality)

- Add table of contents for better navigation - Fix typos and broken links - Improve overall structure and organization - Consolidate repeated information - Add badges and improve formatting - Enhance examples and troubleshooting sections

meta-cla · 2025-09-09T13:02:42Z

Hi @AhmadHakami!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

meta-cla · 2025-09-09T14:12:06Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

- Add _ollama_batch_completion() method for proper Ollama API support - Fix batch completion routing for LLM providers (OpenAI, Ollama, vLLM, api-endpoint)

New _openai_include_sampling_params(): skips temperature/top_p for gpt-5-*. Applied in both sync and async chat.completions code paths.

…cluded

Use max_completion_tokens instead of max_tokens for o1/o2/o3/o4 models. Omit temperature/top_p for O-series models that only support defaults. Updated cli.py system-check to avoid sending temperature to restricted models.

Pass difficulty through directory create flow and core.create.process_file Apply difficulty in QA, CoT, and multimodal-qa generators’ prompts Preserve backward compatibility and tests green

…ogs with --verbose CoT: print chunk/batch progress and totals; stop requesting once target reached Perf: cap CoT tokens (cot_max_tokens default 768) and summary to 256 for faster replies Ollama: use batch_size=1 and pass max_tokens to reduce latency/stalls CLI: with --verbose, print plain “Generating …” instead of spinner for live logs Honors --chunk-size/--chunk-overlap overrides; non-verbose still shows batch progress Smoke-tested: ollama qwen3:4b on sample text, output saved and progress visible

… create; forwarded to processing. Implemented page_range support in PDF parser and threaded through ingest/create and directory flows previously. Added tests: tests/unit/test_pdf_page_range.py: checks pdfminer page_numbers mapping and invalid ranges. tests/functional/test_page_range_cli.py: validates CLI forwarding for ingest and create (QA and COT). tests/functional/test_real_pdf_path.py: uses

…ified pages, with randomized coverage to avoid summary-only questions (addressing the issue you observed in the attached CoT file)

Restored correct indentation and structure to remove syntax errors. Added structured prompt with explicit fences and rules: Use only SOURCE_TEXT. Forbid page numbers, headers/footers, formatting, or difficulty echo. Prefer content-focused questions; advanced requires multi-step reasoning. Slightly over-generates per chunk and applies a post-filter to drop trivial/meta items: Filters references to page/صفحة/heading/difficulty, numeric-only answers, “years/page numbers/section title” questions, and difficulty word echoes. Randomizes chunk order for coverage within the selected range. Injects language instruction into the system message so functional language tests can detect it. CLI ingest fix: Guarded --page-range parsing to avoid attempting .strip() on Typer’s OptionInfo when called directly. Now safely casts and strips: rng = str(page_range).strip(). CoT generator reliability: Kept single-call generation simple and removed the extra pre-call to match tests that assert one chat_completion. Added deterministic fallback examples when parsing fails, matching unit test expectations and request count.

Extract arrays from code fences or free text safely. Normalize keys: support question/answer in Arabic and aliases (q/a, response/reply, answers[]). Clean trailing commas and whitespace reliably. Adjusted QA filters in qa_generator.py: Numeric-heavy heuristic relaxed for Arabic (allow brief date-like answers if some letters appear). Still blocks page numbers, headers/footers, difficulty echoes, and trivial patterns.

Apply similar enforcement to related generators (multimodal).

…nerator returns exactly the requested number of pairs. Extended test_qa_generator.py with test_generate_qa_pairs_exact_count_and_dedup to verify deduplication and exact count

AhmadHakami added 2 commits September 9, 2025 15:41

ntegrate Ollama & OpenAI, fix PYTHONPATH, update pyproject and tests

344a171

Enhance README.md for clarity and readability

f6fbfba

- Add table of contents for better navigation - Fix typos and broken links - Improve overall structure and organization - Consolidate repeated information - Add badges and improve formatting - Enhance examples and troubleshooting sections

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 9, 2025

AhmadHakami added 3 commits September 10, 2025 08:00

Resolve empty QA pairs by fixing batch completion routing

659a925

- Add _ollama_batch_completion() method for proper Ollama API support - Fix batch completion routing for LLM providers (OpenAI, Ollama, vLLM, api-endpoint)

Resolve empty QA pairs by fixing batch completion routing

e4473ab

- Add _ollama_batch_completion() method for proper Ollama API support - Fix batch completion routing for LLM providers (OpenAI, Ollama, vLLM, api-endpoint)

_openai_token_param(): picks max_completion_tokens for gpt-5-*.

95b16b2

New _openai_include_sampling_params(): skips temperature/top_p for gpt-5-*. Applied in both sync and async chat.completions code paths.

AhmadHakami force-pushed the main branch from 0d035b2 to 95b16b2 Compare September 10, 2025 05:24

AhmadHakami added 14 commits September 10, 2025 08:26

remove new_dirctory folder

5ca3aa2

feat(cli,generators): add language control and .env support; tests in…

c422537

…cluded

fix: preserve non-ASCII in JSON outputs; add language flag docs/tests

3ec5f42

Updated LLMClient to adapt to OpenAI’s O-series constraints:

c7e091f

Use max_completion_tokens instead of max_tokens for o1/o2/o3/o4 models. Omit temperature/top_p for O-series models that only support defaults. Updated cli.py system-check to avoid sending temperature to restricted models.

Add CLI flag --difficulty [easy|medium|advanced]

ae1843f

Pass difficulty through directory create flow and core.create.process_file Apply difficulty in QA, CoT, and multimodal-qa generators’ prompts Preserve backward compatibility and tests green

QA/CoT now derive content from intelligently chunked text in the spec…

23e7d3f

…ified pages, with randomized coverage to avoid summary-only questions (addressing the issue you observed in the attached CoT file)

Enforce difficulty and specificity in QA and CoT prompts

9f27376

QA: harden prompts + parsing (Arabic/fenced JSON) and anti-meta filters

9c3aacb

Enforce requested number of pairs in QA.

7b358eb

Apply similar enforcement to related generators (multimodal).

tests/unit/test_multimodal_qa_generator.py: ensures the multimodal ge…

c3f4c3d

…nerator returns exactly the requested number of pairs. Extended test_qa_generator.py with test_generate_qa_pairs_exact_count_and_dedup to verify deduplication and exact count

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Support for Ollama and OpenAI LLM Providers #69

Add Support for Ollama and OpenAI LLM Providers #69

Uh oh!

AhmadHakami commented Sep 9, 2025

Uh oh!

meta-cla bot commented Sep 9, 2025

Uh oh!

meta-cla bot commented Sep 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Add Support for Ollama and OpenAI LLM Providers #69

Are you sure you want to change the base?

Add Support for Ollama and OpenAI LLM Providers #69

Uh oh!

Conversation

AhmadHakami commented Sep 9, 2025

Key Changes

Benefits

Related Issue

Type of Change

Uh oh!

meta-cla bot commented Sep 9, 2025

Action Required

Process

Uh oh!

meta-cla bot commented Sep 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant