-
Couldn't load subscription status.
- Fork 10
Refactored static and dynamic enrichment APIs #336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| pd = _get_pandas() | ||
| tqdm = _get_tqdm() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason these are not just being imported at the top of the file?
cleanlab_studio/studio/enrichment.py
Outdated
| Returns: | ||
| Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
| """ | ||
| run_online = _get_run_online() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not just imported at the top of the file?
cleanlab_studio/studio/enrichment.py
Outdated
| Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
| """ | ||
| run_online = _get_run_online() | ||
| job_info = run_online(data, options, new_column_name, self._api_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think passing in self._api_key because run_online expects a Studio object?
cleanlab_studio/studio/enrichment.py
Outdated
| Args: | ||
| data (Union[pd.DataFrame, List[dict]]): The dataset to enrich. | ||
| options (EnrichmentOptions): Options for enriching the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to EnrichmentOptions docstring
| return job_info | ||
|
|
||
|
|
||
| def _validate_enrichment_options(options: EnrichmentOptions) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify why there is a separate _validate_enrichment_options defined here rather than using the validation function in run() here?
| regex: Union[str, Replacement, List[Replacement]], | ||
| ) -> Union[pd.Series, List[str]]: | ||
| column_data: Union["pd.Series", List[str]], | ||
| regex: Union[str, Tuple[str, str], List[Tuple[str, str]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Replacement is a type alias for the Tuple[str, str] type (ref here), not entirely sure why you made this change?
| @lru_cache(maxsize=None) | ||
| def _get_pandas(): | ||
| import pandas as pd | ||
|
|
||
| return pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is going on here?
pandas is already a dependency of this package, there should be no special logic to lazy-import it
Line 52 in c2a3013
| "pandas==2.*", |
| @lru_cache(maxsize=None) | ||
| def _get_tqdm(): | ||
| from tqdm import tqdm | ||
|
|
||
| return tqdm | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is going on here?
tqdm is already a dependency of this package, there should be no special logic to lazy import it
Line 57 in c2a3013
| "tqdm>=4.64.0", |
| @lru_cache(maxsize=None) | ||
| def _get_run_online(): | ||
| from cleanlab_studio.utils.data_enrichment.enrich import run_online | ||
|
|
||
| return run_online | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get rid of this and do a standard import, unsure why you are using such an odd approach
| from typing import Any, Dict, List, Optional, Tuple, TypedDict, Union, cast | ||
|
|
||
| import matplotlib.pyplot as plt | ||
| import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the PR description with:
-
User code if they just want to do some real-time data enrichment quickly.
-
User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Refactored client side code based on this task: https://www.notion.so/cleanlab/make-data-enrichment-client-side-API-match-backend-API-105c7fee85be8097b54dfb121b7dba4e
Goal: make Dynamic API match the Static API.
in all facets: cohesive naming of methods, argument types, regular expression libraries, etc.
This way user can use backend API to prototype Enrichment jobs and run them over a big static dataset, but then use client-API when they need to run the same logic in real-time over streaming data one at a time.
For any packages we need to import client-side, make these lazy optional imports, so cleanlab-studio package still works without those installed.
This will contain:
User code if they just want to do some real-time data enrichment quickly.
User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
Still doing some testing (unit test runs) but will request review on overall structure.