feat(microsoft_sharepoint): Add Microsoft SharePoint retriever integration#3429
feat(microsoft_sharepoint): Add Microsoft SharePoint retriever integration#3429bogdankostic wants to merge 5 commits into
Conversation
Coverage report (microsoft_sharepoint)Click to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||||||||
| self.max_retries = max_retries | ||
|
|
||
| @component.output_types(documents=list[Document]) | ||
| def run(self, query: str, access_token: str, top_k: int | None = None) -> dict[str, list[Document]]: |
There was a problem hiding this comment.
Minor suggestion: Would it also be worth supporting passing a Secret for access_token in addition here?
| if status == _HTTP_UNAUTHORIZED: | ||
| msg = ( | ||
| "Microsoft Graph rejected the access token (401 Unauthorized). The token may be expired, invalid, " | ||
| "or missing the required delegated scopes (for example Files.Read.All / Sites.Read.All)." | ||
| ) | ||
| elif status == _HTTP_FORBIDDEN: |
There was a problem hiding this comment.
Minor comment: I feel like these status code values are pretty common. Do we need a global variable for them?
| ```python | ||
| from haystack import Pipeline | ||
| from haystack.utils import Secret | ||
| from haystack_integrations.components.connectors.oauth import OAuthResolver, RefreshTokenSource |
There was a problem hiding this comment.
Maybe we could update this example to not use another integration? Then in the docs page we make for this I think we can add a full example where we can explain that users also need to install the other haystack integration.
| :param entity_types: The Microsoft Search entity types to query. Defaults to `["driveItem", "listItem"]`, | ||
| which covers files, folders, SharePoint pages and news, and list items. Other valid values are | ||
| `"list"` and `"site"`. | ||
| :param top_k: The maximum number of documents to return. Maps to the Search API `size` and is paginated | ||
| when it exceeds a single page. | ||
| :param fields: Optional list of resource properties to request via the Search API `fields` selection | ||
| (only honored for `listItem` and `driveItem` entity types). | ||
| :param query_template: Optional KQL query template used to scope the search, for example | ||
| `'{searchTerms} path:"https://contoso.sharepoint.com/sites/Team"'`. The literal `{searchTerms}` | ||
| placeholder is replaced by the run-time query. |
There was a problem hiding this comment.
For the entitu_types, fields and query_template are there microsoft docs links we could provide?
| def _build_request_body(self, query: str, offset: int, size: int) -> dict[str, Any]: | ||
| """Build the Microsoft Search `POST /search/query` request body for one page.""" |
There was a problem hiding this comment.
I realize it's internal but if possible providing a microsoft docs link for how to see what are valid params to this endpoint would be great for future devs
| if response.status_code in _RETRYABLE_STATUS and attempt < self.max_retries: | ||
| attempt += 1 | ||
| delay = self._retry_delay(response, attempt) |
There was a problem hiding this comment.
Maybe worth using tenacity instead of rolling our own retry mechanism?
We may even be able to reuse request_with_retry from haystack which uses tenacity and httpx under the hood.
There was a problem hiding this comment.
We also have an async version async_request_with_retry available in haystack in haystack/utils/request_utils.py
| @component.output_types(documents=list[Document]) | ||
| def run(self, query: str, access_token: str | Secret, top_k: int | None = None) -> dict[str, list[Document]]: |
There was a problem hiding this comment.
Out of curiosity is there a concept of filters we could also add here? Just to make it work more like our doc store retrievers
|
@bogdankostic one thing I noticed is that ideally we can use this new sharepoint retriever inside of the To do this though we will need two sets of changes. The main one I want to propose for this integration (I think) is number 2 where we create a wrapper component to align with the 1. A new optional run param ( def run(self, query, filters=None, top_k=None, *, active_retrievers=None,
retriever_inputs: dict[str, dict[str, Any]] | None = None):
retriever_inputs = retriever_inputs or {}
unknown = set(retriever_inputs) - self.retrievers.keys()
if unknown:
raise ValueError(f"Unknown retriever name(s) in retriever_inputs: {sorted(unknown)}")
...
executor.submit(
retriever.run,
query=query, filters=resolved_filters, top_k=resolved_top_k,
**retriever_inputs.get(name, {}), # only this retriever gets the extras
)2. Create a new Wraps an @component
class SharePointTextRetriever:
def __init__(self, *, retriever: MSSharePointRetriever, resolver: OAuthResolver):
self.retriever = retriever
self.resolver = resolver
# public token-source attribute, not the resolver's private flag
self._requires_subject_token = bool(getattr(resolver.token_source, "requires_subject_token", False))
if self._requires_subject_token:
component.set_input_type(self, "subject_token", str) # no default => mandatory socket
@component.output_types(documents=list[Document])
def run(self, query: str, filters=None, top_k=None, **kwargs):
if filters is not None:
logger.warning("SharePoint retrieval ignores `filters`; scope via `query_template` instead.")
access_token = self.resolver.run(**self._resolver_kwargs(kwargs))["access_token"]
return self.retriever.run(query=query, access_token=access_token, top_k=top_k)
@component.output_types(documents=list[Document])
async def run_async(self, query: str, filters=None, top_k=None, **kwargs):
# signature must match run() — enforced by @component
if filters is not None:
logger.warning("SharePoint retrieval ignores `filters`; scope via `query_template` instead.")
result = await self.resolver.run_async(**self._resolver_kwargs(kwargs))
return await self.retriever.run_async(query=query, access_token=result["access_token"], top_k=top_k)
def _resolver_kwargs(self, kwargs):
if not self._requires_subject_token:
return {}
return {"subject_token": kwargs.get("subject_token", "")}
# also add to_dict / from_dict |
Related Issues
Proposed Changes:
Adds a new integration,
microsoft-sharepoint-haystack, providing theMSSharePointRetrievercomponent.Given a query, the retriever calls the Microsoft Graph Search API (
POST /search/query) and returns matching SharePoint and OneDrive content as HaystackDocuments. Each hit becomes aDocumentwhosecontentis the search snippet (hit-highlight markup stripped, entities unescaped) and whosemetacarriesfile_name,web_url,entity_type,created_date_time,last_modified_date_time,created_by,last_modified_by,mime_type, andfile_extension(keys with no value are omitted).How did you test it?
Documentmapping, request-body and auth-header construction, KQL/fieldsomission, pagination with offset/size,top_khandling, error handling (401/403/other 4xx, 429 retry-then-succeed, give-up after max retries), running inside aPipeline, and the async path.Notes for the reviewer
access_tokenis provided at run time. It can be obtained from theOAuthResolver, which ships in a separate PR: feat(oauth): Add OAuth integration #3419Document'scontentfield, with the link to the file in themeta(web_url). Fetching the actual file contents requires additional downstream components.Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.