
Speed up prepdocs for file strategy with parallel async pools #2553


Open: wants to merge 5 commits into main

Conversation

tonybaloney (Contributor)

Purpose

Implements #2516

@tonybaloney (Contributor, Author)

| Concurrency | Time to run prepdocs |
| --- | --- |
| 1 | 8 min 1 sec |
| 4 | 5 min 50 sec |
| 10 | 6 min 20 sec |

At a concurrency of 10 it hit the rate limit for the embeddings API; most of the runtime is spent on embeddings API calls.
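The timing results above suggest that raising file concurrency mostly increases pressure on the embeddings service. One way to keep the benefit of parallel files while capping API pressure is a semaphore around the embeddings calls themselves. This is a toy sketch, not the prepdocs code; `embed_batch` and the limit value are hypothetical:

```python
import asyncio

EMBEDDINGS_LIMIT = 4  # assumption: tuned to stay under the service quota

async def embed_batch(semaphore: asyncio.Semaphore, batch: list[str]) -> list[list[float]]:
    # Only EMBEDDINGS_LIMIT batches are in flight at once, regardless of
    # how many files are being processed concurrently.
    async with semaphore:
        await asyncio.sleep(0.01)  # placeholder for the real embeddings API call
        return [[0.0] * 3 for _ in batch]

async def embed_all(batches: list[list[str]]) -> list[list[list[float]]]:
    semaphore = asyncio.Semaphore(EMBEDDINGS_LIMIT)
    return await asyncio.gather(*(embed_batch(semaphore, b) for b in batches))

results = asyncio.run(embed_all([["a", "b"], ["c"]]))
```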

@tonybaloney tonybaloney changed the title Make the file strategy parallel with a default of 10 files Speed up prepdocs for file strategy with parallel async pools Jun 2, 2025
@tonybaloney tonybaloney marked this pull request as ready for review June 2, 2025 03:33
@pamelafox (Collaborator)

FYI, I am working on a change that will also add calls to the Vision API and Chat Completions API during file processing, for developers who want support for multimodal documents. In the current version of multimodal ingestion those calls are made after file processing, but I'm moving them into the parse_file function. We may need to add more retry logic if we add more concurrency.
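The extra retry logic mentioned here might look like exponential backoff with jitter around rate-limited calls. A minimal sketch, assuming a hypothetical `RateLimitError` as a stand-in for the real SDK's 429 error (not the actual prepdocs implementation):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the 429 error raised by the real SDK (hypothetical)."""

async def with_retries(coro_factory, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async call with exponential backoff and jitter on rate limits."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # back off ~1s, ~2s, ~4s, ... scaled by random jitter
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            await asyncio.sleep(delay)

# Demo: a call that is rate-limited twice before succeeding.
calls = {"n": 0}

async def flaky_embed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "embedding"

result = asyncio.run(with_retries(flaky_embed, base_delay=0.01))
```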

@pamelafox (Collaborator)

Also FYI, @mattgotteiner is porting prepdocs to an Azure Function, to be used as a skillset by AI Search. That will use a specific FunctionFileStrategy, though, similar to UploadFileStrategy (per-file basis), so it may be orthogonal to this change.

@cforce (Contributor) commented Jul 8, 2025

Please merge

@pamelafox pamelafox requested review from pamelafox and Copilot July 8, 2025 17:42
Copilot (comment marked as outdated)

@pamelafox (Collaborator)

This change introduces the drawback that the log from prepdocs becomes harder to follow, since steps complete in different orders.
I've changed the logging statements to be more consistent so the output is easier to grok, with the filename always at the start of each line:

[Screenshot, 2025-07-08: prepdocs log output with filename-prefixed lines]

I'm okay with the drawback now that the logging is prettier. I'll push those changes to the branch.

@pamelafox pamelafox requested a review from Copilot July 8, 2025 18:01

@pamelafox pamelafox requested a review from Copilot July 8, 2025 18:14

@pamelafox pamelafox requested a review from Copilot July 8, 2025 19:54
@pamelafox (Collaborator) left a comment:

This looks good, now that I've cleaned up the logging output to be easier to read across multiple concurrently processed files.

@Copilot Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR adds parallel asynchronous processing for the file ingestion strategy and harmonizes log message formats across modules.

  • Unified and improved log message formatting (consistent quoting and prefixes).
  • Introduced a concurrency parameter in FileStrategy and CLI, leveraging asyncio.Semaphore and asyncio.gather for parallel file processing.
  • Changed some high-volume logs from INFO to DEBUG in embeddings.py.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| app/backend/prepdocslib/searchmanager.py | Standardized quoting in log messages |
| app/backend/prepdocslib/pdfparser.py | Updated log prefix format for local parsing |
| app/backend/prepdocslib/mediadescriber.py | Removed trailing ellipses in log entries |
| app/backend/prepdocslib/listfilestrategy.py | Adjusted log message format for MD5 checks |
| app/backend/prepdocslib/integratedvectorizerstrategy.py | Revised info log formatting |
| app/backend/prepdocslib/htmlparser.py | Updated log prefix for HTML parsing |
| app/backend/prepdocslib/filestrategy.py | Added concurrency support and parallel processing |
| app/backend/prepdocslib/embeddings.py | Lowered log level for embedding computation |
| app/backend/prepdocslib/blobmanager.py | Consistent log formatting for blob operations |
| app/backend/prepdocs.py | Introduced --concurrency CLI arg and adjusted log level setting |
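The new CLI argument might look roughly like the following argparse sketch. This is an illustration, not the real prepdocs.py parser (which defines many more options); the default value here assumes it mirrors `DEFAULT_CONCURRENCY`:

```python
import argparse

parser = argparse.ArgumentParser(prog="prepdocs")
parser.add_argument(
    "--concurrency",
    type=int,
    default=4,  # assumption: matches DEFAULT_CONCURRENCY in filestrategy.py
    help="Number of files to process concurrently",
)

# e.g. invoking with an explicit value
args = parser.parse_args(["--concurrency", "8"])
```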
Comments suppressed due to low confidence (2)

app/backend/prepdocslib/filestrategy.py:45 (`DEFAULT_CONCURRENCY = 4`)

  • [nitpick] Add a note in the class docstring explaining the purpose of `concurrency` and acceptable value ranges to improve discoverability.

app/backend/prepdocslib/filestrategy.py:104 (`async def run(self):`)

  • Introduce tests that simulate multiple files and verify the parallel processing behavior under different concurrency settings to ensure reliability.

```python
logger.info("Running with concurrency: %d", self.concurrency)
semaphore = asyncio.Semaphore(self.concurrency)
tasks = [process_file_worker(semaphore, file) async for file in files]
await asyncio.gather(*tasks)
```
Copilot AI commented Jul 8, 2025:

Consider using asyncio.gather(*tasks, return_exceptions=True) or handling exceptions within process_file_worker so that a single task failure doesn't cancel the entire batch.

Suggested change (replacing `await asyncio.gather(*tasks)`):

```python
results = await asyncio.gather(*tasks, return_exceptions=True)
for result in results:
    if isinstance(result, Exception):
        logger.error("Task failed with exception: %s", str(result), exc_info=True)
```


@pamelafox (Collaborator):

@tonybaloney Thoughts on this suggestion from Copilot? Is it correct?
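As a data point on the semantics in question: with `return_exceptions=True`, `asyncio.gather` places exception objects into the results list instead of raising the first failure out of the `await`, so the results of the other tasks are preserved. A minimal standalone demo:

```python
import asyncio

async def worker(i: int) -> int:
    # One task fails; the others should still report results.
    if i == 1:
        raise ValueError("boom")
    await asyncio.sleep(0.01)
    return i

async def main() -> list:
    # return_exceptions=True collects exceptions as results instead of
    # propagating the first one out of gather().
    return await asyncio.gather(*(worker(i) for i in range(3)), return_exceptions=True)

results = asyncio.run(main())
```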
