
Speed up prepdocs for file strategy with parallel async pools #2553


Open: wants to merge 5 commits into main

Conversation

tonybaloney (Contributor)

Purpose

Implements #2516

@tonybaloney (Contributor, Author)

| Concurrency | Time to run prepdocs |
| --- | --- |
| 1 | 8 min 1 sec |
| 4 | 5 min 50 sec |
| 10 | 6 min 20 sec |

At a concurrency of 10 it hit the rate limit for the embeddings API; most of the runtime is spent on embeddings API calls.
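The timing results above suggest that raising file concurrency mostly increases pressure on the embeddings service. One way to keep the benefit of parallel files while capping API pressure is a semaphore around the embeddings calls themselves. This is a toy sketch, not the prepdocs code; `embed_batch` and the limit value are hypothetical:

```python
import asyncio

EMBEDDINGS_LIMIT = 4  # assumption: tuned to stay under the service quota

async def embed_batch(semaphore: asyncio.Semaphore, batch: list[str]) -> list[list[float]]:
    # Only EMBEDDINGS_LIMIT batches are in flight at once, regardless of
    # how many files are being processed concurrently.
    async with semaphore:
        await asyncio.sleep(0.01)  # placeholder for the real embeddings API call
        return [[0.0] * 3 for _ in batch]

async def embed_all(batches: list[list[str]]) -> list[list[list[float]]]:
    semaphore = asyncio.Semaphore(EMBEDDINGS_LIMIT)
    return await asyncio.gather(*(embed_batch(semaphore, b) for b in batches))

results = asyncio.run(embed_all([["a", "b"], ["c"]]))
```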

@tonybaloney tonybaloney changed the title Make the file strategy parallel with a default of 10 files Speed up prepdocs for file strategy with parallel async pools Jun 2, 2025
@tonybaloney tonybaloney marked this pull request as ready for review June 2, 2025 03:33
@pamelafox (Collaborator)

FYI, I am working on a change that will also add calls to the Vision API and Chat Completions API during file processing, for developers who want support for multimodal documents. In the current version of multimodal ingestion those calls are made after file processing, but I'm moving them into the parse_file function. We may need to add more retry logic if we add more concurrency.
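The extra retry logic mentioned here might look like exponential backoff with jitter around rate-limited calls. A minimal sketch, assuming a hypothetical `RateLimitError` as a stand-in for the real SDK's 429 error (not the actual prepdocs implementation):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the 429 error raised by the real SDK (hypothetical)."""

async def with_retries(coro_factory, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async call with exponential backoff and jitter on rate limits."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # back off ~1s, ~2s, ~4s, ... scaled by random jitter
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            await asyncio.sleep(delay)

# Demo: a call that is rate-limited twice before succeeding.
calls = {"n": 0}

async def flaky_embed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "embedding"

result = asyncio.run(with_retries(flaky_embed, base_delay=0.01))
```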

@pamelafox (Collaborator)

Also FYI, @mattgotteiner is porting prepdocs to an Azure Function, to be used as a skillset by AI Search. That will use a specific FunctionFileStrategy, though, similar to UploadFileStrategy (per-file basis), so it may be orthogonal to this change.

@cforce (Contributor) commented Jul 8, 2025

Please merge

@pamelafox pamelafox requested review from pamelafox and Copilot July 8, 2025 17:42
Copilot (comment marked as outdated)

@pamelafox (Collaborator)

This change introduces the drawback that the log from prepdocs becomes harder to follow, since steps complete in different orders.
I've changed the logging statements to be more consistent so the output is easier to grok, with the filename always at the start of each line:

[Screenshot, 2025-07-08: prepdocs log output with filename-prefixed lines]

I'm okay with the drawback now that the logging is prettier. I'll push those changes to the branch.

@pamelafox pamelafox requested a review from Copilot July 8, 2025 18:01

@pamelafox pamelafox requested a review from Copilot July 8, 2025 18:14

@pamelafox pamelafox requested a review from Copilot July 8, 2025 19:54
@pamelafox (Collaborator) left a comment:

This looks good, now that I've cleaned up the logging output to be easier to read across multiple concurrently processed files.

@Copilot Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR adds parallel asynchronous processing for the file ingestion strategy and harmonizes log message formats across modules.

  • Unified and improved log message formatting (consistent quoting and prefixes).
  • Introduced a concurrency parameter in FileStrategy and CLI, leveraging asyncio.Semaphore and asyncio.gather for parallel file processing.
  • Changed some high-volume logs from INFO to DEBUG in embeddings.py.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| app/backend/prepdocslib/searchmanager.py | Standardized quoting in log messages |
| app/backend/prepdocslib/pdfparser.py | Updated log prefix format for local parsing |
| app/backend/prepdocslib/mediadescriber.py | Removed trailing ellipses in log entries |
| app/backend/prepdocslib/listfilestrategy.py | Adjusted log message format for MD5 checks |
| app/backend/prepdocslib/integratedvectorizerstrategy.py | Revised info log formatting |
| app/backend/prepdocslib/htmlparser.py | Updated log prefix for HTML parsing |
| app/backend/prepdocslib/filestrategy.py | Added concurrency support and parallel processing |
| app/backend/prepdocslib/embeddings.py | Lowered log level for embedding computation |
| app/backend/prepdocslib/blobmanager.py | Consistent log formatting for blob operations |
| app/backend/prepdocs.py | Introduced --concurrency CLI arg and adjusted log level setting |
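The new CLI argument might look roughly like the following argparse sketch. This is an illustration, not the real prepdocs.py parser (which defines many more options); the default value here assumes it mirrors `DEFAULT_CONCURRENCY`:

```python
import argparse

parser = argparse.ArgumentParser(prog="prepdocs")
parser.add_argument(
    "--concurrency",
    type=int,
    default=4,  # assumption: matches DEFAULT_CONCURRENCY in filestrategy.py
    help="Number of files to process concurrently",
)

# e.g. invoking with an explicit value
args = parser.parse_args(["--concurrency", "8"])
```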
Comments suppressed due to low confidence (2)

app/backend/prepdocslib/filestrategy.py:45 (`DEFAULT_CONCURRENCY = 4`)

  • [nitpick] Add a note in the class docstring explaining the purpose of `concurrency` and acceptable value ranges to improve discoverability.

app/backend/prepdocslib/filestrategy.py:104 (`async def run(self):`)

  • Introduce tests that simulate multiple files and verify the parallel processing behavior under different concurrency settings to ensure reliability.

```python
logger.info("Running with concurrency: %d", self.concurrency)
semaphore = asyncio.Semaphore(self.concurrency)
tasks = [process_file_worker(semaphore, file) async for file in files]
await asyncio.gather(*tasks)
```
Copilot AI commented Jul 8, 2025:

Consider using asyncio.gather(*tasks, return_exceptions=True) or handling exceptions within process_file_worker so that a single task failure doesn't cancel the entire batch.

Suggested change (replacing `await asyncio.gather(*tasks)`):

```python
results = await asyncio.gather(*tasks, return_exceptions=True)
for result in results:
    if isinstance(result, Exception):
        logger.error("Task failed with exception: %s", str(result), exc_info=True)
```


@pamelafox (Collaborator):

@tonybaloney Thoughts on this suggestion from Copilot? Is it correct?
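As a data point on the semantics in question: with `return_exceptions=True`, `asyncio.gather` places exception objects into the results list instead of raising the first failure out of the `await`, so the results of the other tasks are preserved. A minimal standalone demo:

```python
import asyncio

async def worker(i: int) -> int:
    # One task fails; the others should still report results.
    if i == 1:
        raise ValueError("boom")
    await asyncio.sleep(0.01)
    return i

async def main() -> list:
    # return_exceptions=True collects exceptions as results instead of
    # propagating the first one out of gather().
    return await asyncio.gather(*(worker(i) for i in range(3)), return_exceptions=True)

results = asyncio.run(main())
```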
