feat(microsoft_sharepoint): Add Microsoft SharePoint retriever integration #3429

Open
bogdankostic wants to merge 5 commits into main from ms_sharepoint

@bogdankostic

Contributor

Related Issues

Proposed Changes:

Adds a new integration, microsoft-sharepoint-haystack, providing the MSSharePointRetriever component.

Given a query, the retriever calls the Microsoft Graph Search API (POST /search/query) and returns matching SharePoint and OneDrive content as Haystack Documents. Each hit becomes a Document whose content is the search snippet (hit-highlight markup stripped, entities unescaped) and whose meta carries file_name, web_url, entity_type, created_date_time, last_modified_date_time, created_by, last_modified_by, mime_type, and file_extension (keys with no value are omitted).
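
The hit-to-Document mapping described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names are hypothetical, it returns a plain dict instead of a Haystack `Document`, it covers only four of the listed meta keys, and it assumes the Search API marks highlights with `<c0>...</c0>` tags and truncation with `<ddd/>`.

```python
import html
import re
from typing import Any

# Assumption: highlights arrive as <c0>...</c0> and truncation as <ddd/>.
_HIGHLIGHT_TAG = re.compile(r"</?c\d+>")
_TRUNCATION_TAG = re.compile(r"<ddd\s*/>")


def clean_summary(summary: str) -> str:
    """Strip hit-highlight markup first, then unescape HTML entities."""
    text = _TRUNCATION_TAG.sub("...", summary)
    text = _HIGHLIGHT_TAG.sub("", text)
    # Unescaping last keeps escaped literals such as &lt;c0&gt; intact.
    return html.unescape(text).strip()


def hit_to_document_fields(hit: dict[str, Any]) -> dict[str, Any]:
    """Map one search hit to `content` and `meta` (hypothetical helper)."""
    resource = hit.get("resource", {})
    meta = {
        "file_name": resource.get("name"),
        "web_url": resource.get("webUrl"),
        "created_date_time": resource.get("createdDateTime"),
        "last_modified_date_time": resource.get("lastModifiedDateTime"),
    }
    # Keys with no value are omitted, as the PR description states.
    meta = {key: value for key, value in meta.items() if value is not None}
    return {"content": clean_summary(hit.get("summary", "")), "meta": meta}
```

Stripping the markup before unescaping is a deliberate ordering choice in this sketch: text that only looks like a tag after unescaping is left alone.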

How did you test it?

  • Unit tests covering: init validation, serialization round-trip, hit-to-Document mapping, request-body and auth-header construction, KQL/fields omission, pagination with offset/size, top_k handling, error handling (401/403/other 4xx, 429 retry-then-succeed, give-up after max retries), running inside a Pipeline, and the async path.
  • Manual live verification against a real Microsoft 365 tenant, including a two-user per-user-scoping check: a user with access to a restricted site receives its content, while a user without access does not.

Notes for the reviewer

  • The access_token is provided at run time. It can be obtained from the OAuthResolver, which ships in a separate PR: feat(oauth): Add OAuth integration #3419
  • The component does not fetch the files themselves. It returns the summary provided by the Search API in the Document's content field, with the link to the file in the meta (web_url). Fetching the actual file contents requires additional downstream components.

Checklist

@github-actions github-actions Bot added the topic:CI and type:documentation labels Jun 11, 2026
@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

Coverage report (microsoft_sharepoint)

File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
  integrations/microsoft_sharepoint/src/haystack_integrations/components/retrievers/microsoft_sharepoint
  errors.py
  retriever.py | lines missing: 191, 227, 283-286, 301
Project Total

This report was generated by python-coverage-comment-action

@bogdankostic bogdankostic changed the title from "feat: Add Microsoft SharePoint retriever integration" to "feat(microsoft_sharepoint): Add Microsoft SharePoint retriever integration" Jun 11, 2026
@bogdankostic bogdankostic marked this pull request as ready for review June 16, 2026 10:55
@bogdankostic bogdankostic requested a review from a team as a code owner June 16, 2026 10:55
@bogdankostic bogdankostic requested review from sjrl and removed request for a team June 16, 2026 10:55
Comment thread .github/workflows/microsoft_sharepoint.yml Outdated
self.max_retries = max_retries

@component.output_types(documents=list[Document])
def run(self, query: str, access_token: str, top_k: int | None = None) -> dict[str, list[Document]]:

Contributor

Minor suggestion: would it also be worth accepting a Secret for access_token here, in addition to a plain string?

Contributor Author

Added support for Secret in e9ad7bc
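
For readers following the thread, accepting either a plain string or a Secret could look roughly like the sketch below. It is not the code from the commit above: the helper names are hypothetical, and a small structural protocol stands in for Haystack's `Secret` (which exposes `resolve_value()`) so the snippet runs without Haystack installed.

```python
from __future__ import annotations

from typing import Protocol


class SupportsResolve(Protocol):
    """Structural stand-in for haystack.utils.Secret (assumption for this sketch)."""

    def resolve_value(self) -> str | None: ...


def resolve_access_token(access_token: str | SupportsResolve) -> str:
    """Return the raw token string from either a str or a Secret-like object."""
    if isinstance(access_token, str):
        token = access_token
    else:
        token = access_token.resolve_value()
    if not token:
        # Fail early instead of sending an empty bearer token to Graph.
        raise ValueError("access_token resolved to an empty value.")
    return token


def build_auth_headers(access_token: str | SupportsResolve) -> dict[str, str]:
    """Build the request headers (hypothetical helper name)."""
    return {
        "Authorization": f"Bearer {resolve_access_token(access_token)}",
        "Content-Type": "application/json",
    }
```

Resolving the value only when headers are built keeps the raw token out of component state and serialization.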

Comment on lines +295 to +300
if status == _HTTP_UNAUTHORIZED:
    msg = (
        "Microsoft Graph rejected the access token (401 Unauthorized). The token may be expired, invalid, "
        "or missing the required delegated scopes (for example Files.Read.All / Sites.Read.All)."
    )
elif status == _HTTP_FORBIDDEN:

Contributor

Minor comment: I feel like these status code values are pretty common. Do we need a global variable for them?
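
One possible way to avoid module-level constants here is the standard library's `http.HTTPStatus` enum. The sketch below is only an illustration of that option: the function name is hypothetical, and the 403 and fallback messages are placeholders, not the wording used in this PR.

```python
from http import HTTPStatus


def describe_graph_error(status: int) -> str:
    """Map an HTTP status code to an error message (hypothetical helper)."""
    # HTTPStatus is an IntEnum, so it compares equal to plain ints.
    if status == HTTPStatus.UNAUTHORIZED:
        return (
            "Microsoft Graph rejected the access token (401 Unauthorized). "
            "The token may be expired, invalid, or missing the required delegated scopes."
        )
    if status == HTTPStatus.FORBIDDEN:
        # Placeholder wording; the PR's actual 403 message is not shown in the hunk.
        return "Microsoft Graph denied access to the requested resource (403 Forbidden)."
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        phrase = "Unknown"
    return f"Microsoft Graph request failed with status {status} ({phrase})."
```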

```python
from haystack import Pipeline
from haystack.utils import Secret
from haystack_integrations.components.connectors.oauth import OAuthResolver, RefreshTokenSource
```

Contributor

Maybe we could update this example so it doesn't depend on another integration? The docs page for this component could then carry a full example that explains that users also need to install the other Haystack integration.

Comment on lines +90 to +99
:param entity_types: The Microsoft Search entity types to query. Defaults to `["driveItem", "listItem"]`,
    which covers files, folders, SharePoint pages and news, and list items. Other valid values are
    `"list"` and `"site"`.
:param top_k: The maximum number of documents to return. Maps to the Search API `size` and is paginated
    when it exceeds a single page.
:param fields: Optional list of resource properties to request via the Search API `fields` selection
    (only honored for `listItem` and `driveItem` entity types).
:param query_template: Optional KQL query template used to scope the search, for example
    `'{searchTerms} path:"https://contoso.sharepoint.com/sites/Team"'`. The literal `{searchTerms}`
    placeholder is replaced by the run-time query.

Contributor

For entity_types, fields, and query_template, are there Microsoft docs links we could provide?
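
As an aside on the `top_k` docstring above, the "paginated when it exceeds a single page" behaviour can be pictured as splitting `top_k` into offset/size windows. The sketch below is an assumption about the general approach, not the PR's implementation; `page_windows` and `page_size` are hypothetical names, and the real per-page limit is whatever the Search API enforces.

```python
def page_windows(top_k: int, page_size: int) -> list[tuple[int, int]]:
    """Split a top_k request into (offset, size) windows of at most page_size."""
    if top_k < 0 or page_size <= 0:
        raise ValueError("top_k must be >= 0 and page_size must be > 0.")
    windows: list[tuple[int, int]] = []
    offset = 0
    while offset < top_k:
        # The last window is shortened so the total never exceeds top_k.
        size = min(page_size, top_k - offset)
        windows.append((offset, size))
        offset += size
    return windows
```

A caller would issue one request per window and could stop early once a page comes back with fewer hits than requested.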

Comment on lines +206 to +207
def _build_request_body(self, query: str, offset: int, size: int) -> dict[str, Any]:
    """Build the Microsoft Search `POST /search/query` request body for one page."""

Contributor

I realize it's internal, but if possible, a Microsoft docs link listing the valid parameters for this endpoint would be great for future devs.
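
For orientation, a one-page request body for `POST /search/query` might be assembled along these lines. This is a standalone sketch, not the PR's `_build_request_body`: the function name and signature are hypothetical, and the field names (`requests`, `entityTypes`, `query.queryString`, `query.queryTemplate`, `from`, `size`, `fields`) follow the Graph search request shape as I understand it and should be checked against the Microsoft documentation.

```python
from __future__ import annotations

from typing import Any


def build_search_request_body(
    query: str,
    entity_types: list[str],
    offset: int,
    size: int,
    fields: list[str] | None = None,
    query_template: str | None = None,
) -> dict[str, Any]:
    """Build a single-request search body for one page (hypothetical helper)."""
    query_block: dict[str, Any] = {"queryString": query}
    if query_template:
        # The {searchTerms} placeholder is sent as-is inside the template.
        query_block["queryTemplate"] = query_template
    request: dict[str, Any] = {
        "entityTypes": list(entity_types),
        "query": query_block,
        "from": offset,
        "size": size,
    }
    if fields:
        # Omitted entirely when unset, matching the "KQL/fields omission" tests.
        request["fields"] = list(fields)
    return {"requests": [request]}
```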

Comment on lines +239 to +241
if response.status_code in _RETRYABLE_STATUS and attempt < self.max_retries:
    attempt += 1
    delay = self._retry_delay(response, attempt)

Contributor

Maybe worth using tenacity instead of rolling our own retry mechanism?

We may even be able to reuse request_with_retry from haystack which uses tenacity and httpx under the hood.

Contributor

There is also an async version, async_request_with_retry, in haystack/utils/request_utils.py.
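
Whichever mechanism ends up being used, the behaviour under discussion (honour `Retry-After`, otherwise back off, give up after `max_retries`) can be sketched with the standard library alone. This is an illustration and not the PR's code or Haystack's `request_with_retry`: the names, the retryable status set, and the backoff parameters are all assumptions, and `send` is a hypothetical callable returning `(status, headers, body)`.

```python
from __future__ import annotations

import time
from typing import Any, Callable

# Assumption: the PR's _RETRYABLE_STATUS contents are not shown in the hunk.
RETRYABLE_STATUS = frozenset({429, 503, 504})


def retry_delay(retry_after: str | None, attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Prefer a numeric Retry-After header, else capped exponential backoff."""
    if retry_after is not None:
        try:
            return max(float(retry_after), 0.0)
        except ValueError:
            pass  # HTTP-date form is not parsed here; fall back to backoff.
    return min(base * 2 ** (attempt - 1), cap)


def send_with_retry(
    send: Callable[[], tuple[int, dict[str, str], Any]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Call send() until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status, headers, body = send()
        if status in RETRYABLE_STATUS and attempt < max_retries:
            attempt += 1
            sleep(retry_delay(headers.get("Retry-After"), attempt))
            continue
        return status, body
```

Injecting `sleep` keeps the loop testable without real waiting, which is one reason a small hand-rolled loop can be convenient in unit tests.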

Comment on lines +129 to +130
@component.output_types(documents=list[Document])
def run(self, query: str, access_token: str | Secret, top_k: int | None = None) -> dict[str, list[Document]]:

Contributor

Out of curiosity, is there a concept of filters we could also add here, to make it work more like our document store retrievers?

@sjrl

sjrl commented Jun 17, 2026

Contributor

@bogdankostic one thing I noticed: ideally this new SharePoint retriever should be usable inside the MultiRetriever, since that is our recommended retriever in the platform and removes the need to manually wire several retrievers to a document joiner.

That requires two sets of changes. The main one I want to propose for this integration is (I think) number 2: a wrapper component that conforms to the TextRetriever protocol. WDYT?

1. MultiRetriever — add a per-retriever run-inputs map (haystack)

A new optional run param (run + run_async) that maps retriever name → extra kwargs, merged only into that retriever's call. Generic, explicit routing, no signature introspection, protocol-safe (the protocol allows extra optional params), and not serialized (it's a run input, not init).

def run(self, query, filters=None, top_k=None, *, active_retrievers=None,
        retriever_inputs: dict[str, dict[str, Any]] | None = None):
    retriever_inputs = retriever_inputs or {}
    unknown = set(retriever_inputs) - self.retrievers.keys()
    if unknown:
        raise ValueError(f"Unknown retriever name(s) in retriever_inputs: {sorted(unknown)}")
    ...
    executor.submit(
        retriever.run,
        query=query, filters=resolved_filters, top_k=resolved_top_k,
        **retriever_inputs.get(name, {}),   # only this retriever gets the extras
    )
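
To make the routing rule concrete, here is a reduced, runnable version of the idea above. It is only a sketch: it calls the retrievers sequentially instead of through an executor, uses plain callables in place of retriever components, and `route_retriever_inputs` is a hypothetical name.

```python
from __future__ import annotations

from typing import Any, Callable


def route_retriever_inputs(
    retrievers: dict[str, Callable[..., Any]],
    query: str,
    retriever_inputs: dict[str, dict[str, Any]] | None = None,
    **shared: Any,
) -> dict[str, Any]:
    """Call every retriever, merging extras only into the named retriever's call."""
    retriever_inputs = retriever_inputs or {}
    unknown = set(retriever_inputs) - retrievers.keys()
    if unknown:
        raise ValueError(f"Unknown retriever name(s) in retriever_inputs: {sorted(unknown)}")
    results: dict[str, Any] = {}
    for name, run in retrievers.items():
        # A key present in both `shared` and the extras raises TypeError here.
        results[name] = run(query=query, **shared, **retriever_inputs.get(name, {}))
    return results
```

With this shape, an `access_token` passed under the SharePoint retriever's name never reaches the other retrievers.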

2. Create a new SharePointTextRetriever adapter

Wraps an OAuthResolver + MSSharePointRetriever, conforms to our TextRetriever protocol, and resolves the token per run. If the resolver's token source needs a per-request subject_token, the adapter declares it as a mandatory input too — mirroring the resolver's contract.

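from haystack import Document, component, logging
from haystack_integrations.components.connectors.oauth import OAuthResolver
# Import path assumed from the package layout shown in this PR's coverage report.
from haystack_integrations.components.retrievers.microsoft_sharepoint import MSSharePointRetriever

logger = logging.getLogger(__name__)

@component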
class SharePointTextRetriever:
    def __init__(self, *, retriever: MSSharePointRetriever, resolver: OAuthResolver):
        self.retriever = retriever
        self.resolver = resolver
        # public token-source attribute, not the resolver's private flag
        self._requires_subject_token = bool(getattr(resolver.token_source, "requires_subject_token", False))
        if self._requires_subject_token:
            component.set_input_type(self, "subject_token", str)  # no default => mandatory socket

    @component.output_types(documents=list[Document])
    def run(self, query: str, filters=None, top_k=None, **kwargs):
        if filters is not None:
            logger.warning("SharePoint retrieval ignores `filters`; scope via `query_template` instead.")
        access_token = self.resolver.run(**self._resolver_kwargs(kwargs))["access_token"]
        return self.retriever.run(query=query, access_token=access_token, top_k=top_k)

    @component.output_types(documents=list[Document])
    async def run_async(self, query: str, filters=None, top_k=None, **kwargs):
        # signature must match run() — enforced by @component
        if filters is not None:
            logger.warning("SharePoint retrieval ignores `filters`; scope via `query_template` instead.")
        result = await self.resolver.run_async(**self._resolver_kwargs(kwargs))
        return await self.retriever.run_async(query=query, access_token=result["access_token"], top_k=top_k)

    def _resolver_kwargs(self, kwargs):
        if not self._requires_subject_token:
            return {}
        return {"subject_token": kwargs.get("subject_token", "")}

    # also add to_dict / from_dict

2 participants