Skip to content

Conversation

AhmadHakami
Copy link

This pull request introduces comprehensive support for Ollama and OpenAI LLM providers, significantly enhancing the flexibility of the synthetic-data-kit for both local and cloud-based inference workflows.

Key Changes

  • Provider Integration

    • Extended the LLM client to support:
      • Ollama (local models like llama3.2:3b)
      • OpenAI (cloud models like gpt-4o)
    • Integrated with the CLI via the --provider flag for seamless model selection.
  • Configuration Updates

    • Updated config files and utilities to handle new providers.
    • Maintains full backward compatibility.
  • Testing Enhancements

    • Added unit tests and standalone test scripts.
    • Fixed PYTHONPATH issues for reliable module imports.
    • Relocated test and demo files to appropriate directories.
  • Dependencies

    • Updated pyproject.toml with:
      • ollama
      • pytest
  • Documentation

    • Enhanced README.md with:
      • Provider-specific usage examples
      • Setup instructions for local and cloud LLMs

Benefits

These changes allow users to leverage both local and cloud-based LLMs throughout the synthetic data pipeline (ingest → create → curate → save-as). This improves accessibility, performance, and deployment flexibility—without disrupting existing workflows.

Related Issue

Fixes: N/A (New feature implementation)

Type of Change

  • New feature (non-breaking change which adds functionality)

- Add table of contents for better navigation
- Fix typos and broken links
- Improve overall structure and organization
- Consolidate repeated information
- Add badges and improve formatting
- Enhance examples and troubleshooting sections
Copy link

meta-cla bot commented Sep 9, 2025

Hi @AhmadHakami!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 9, 2025
Copy link

meta-cla bot commented Sep 9, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

- Add _ollama_batch_completion() method for proper Ollama API support
- Fix batch completion routing for LLM providers (OpenAI, Ollama, vLLM, api-endpoint)
- Add _ollama_batch_completion() method for proper Ollama API support
- Fix batch completion routing for LLM providers (OpenAI, Ollama, vLLM, api-endpoint)
New _openai_include_sampling_params(): skips temperature/top_p for gpt-5-*.
Applied in both sync and async chat.completions code paths.
Use max_completion_tokens instead of max_tokens for o1/o2/o3/o4 models.
Omit temperature/top_p for O-series models that only support defaults.
Updated cli.py system-check to avoid sending temperature to restricted models.
Pass difficulty through directory create flow and core.create.process_file
Apply difficulty in QA, CoT, and multimodal-qa generators’ prompts
Preserve backward compatibility and tests green
…ogs with --verbose

CoT: print chunk/batch progress and totals; stop requesting once target reached
Perf: cap CoT tokens (cot_max_tokens default 768) and summary to 256 for faster replies
Ollama: use batch_size=1 and pass max_tokens to reduce latency/stalls
CLI: with --verbose, print plain “Generating …” instead of spinner for live logs
Honors --chunk-size/--chunk-overlap overrides; non-verbose still shows batch progress
Smoke-tested: ollama qwen3:4b on sample text, output saved and progress visible
… create; forwarded to processing.

Implemented page_range support in PDF parser and threaded through ingest/create and directory flows previously.
Added tests:
tests/unit/test_pdf_page_range.py: checks pdfminer page_numbers mapping and invalid ranges.
tests/functional/test_page_range_cli.py: validates CLI forwarding for ingest and create (QA and COT).
tests/functional/test_real_pdf_path.py: uses
…ified pages, with randomized coverage to avoid summary-only questions (addressing the issue you observed in the attached CoT file)
Restored correct indentation and structure to remove syntax errors.
Added structured prompt with explicit fences and rules:
Use only SOURCE_TEXT.
Forbid page numbers, headers/footers, formatting, or difficulty echo.
Prefer content-focused questions; advanced requires multi-step reasoning.
Slightly over-generates per chunk and applies a post-filter to drop trivial/meta items:
Filters references to page/صفحة/heading/difficulty, numeric-only answers, “years/page numbers/section title” questions, and difficulty word echoes.
Randomizes chunk order for coverage within the selected range.
Injects language instruction into the system message so functional language tests can detect it.
CLI ingest fix:

Guarded --page-range parsing to avoid attempting .strip() on Typer’s OptionInfo when called directly.
Now safely casts and strips: rng = str(page_range).strip().
CoT generator reliability:

Kept single-call generation simple and removed the extra pre-call to match tests that assert one chat_completion.
Added deterministic fallback examples when parsing fails, matching unit test expectations and request count.
Extract arrays from code fences or free text safely.
Normalize keys: support question/answer in Arabic and aliases (q/a, response/reply, answers[]).
Clean trailing commas and whitespace reliably.
Adjusted QA filters in qa_generator.py:
Numeric-heavy heuristic relaxed for Arabic (allow brief date-like answers if some letters appear).
Still blocks page numbers, headers/footers, difficulty echoes, and trivial patterns.
Apply similar enforcement to related generators (multimodal).
…nerator returns exactly the requested number of pairs.

Extended test_qa_generator.py with test_generate_qa_pairs_exact_count_and_dedup to verify deduplication and exact count
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant