Refactored static and dynamic enrichment APIs #336
Changes from 3 commits
4e0db6a
1cdc0dd
5015254
35526a9
d07c74b
f2754e2
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,6 +17,7 @@ | |
| import pandas as pd | ||
| from tqdm import tqdm | ||
| from typing_extensions import NotRequired | ||
| from functools import lru_cache | ||
|
|
||
| from cleanlab_studio.errors import EnrichmentProjectError | ||
| from cleanlab_studio.internal.api import api | ||
|
|
@@ -49,6 +50,13 @@ def _response_timestamp_to_datetime(timestamp_string: str) -> datetime: | |
| return datetime.strptime(timestamp_string, response_timestamp_format_str) | ||
|
|
||
|
|
||
| @lru_cache(maxsize=None) | ||
| def _get_run_online(): | ||
| from cleanlab_studio.utils.data_enrichment.enrich import run_online | ||
|
|
||
| return run_online | ||
|
|
||
|
Comment on lines +54 to +59
Member
Get rid of this and do a standard import, unsure why you are using such an odd approach.
||
|
|
||
| class EnrichmentProject: | ||
| """Represents an Enrichment Project instance, which is bound to a Cleanlab Studio account. | ||
|
|
||
|
|
@@ -342,9 +350,11 @@ def list_all_jobs(self) -> List[EnrichmentJob]: | |
| id=job["id"], | ||
| status=job["status"], | ||
| created_at=_response_timestamp_to_datetime(job["created_at"]), | ||
| updated_at=_response_timestamp_to_datetime(job["updated_at"]) | ||
| if job["updated_at"] | ||
| else None, | ||
| updated_at=( | ||
| _response_timestamp_to_datetime(job["updated_at"]) | ||
| if job["updated_at"] | ||
| else None | ||
| ), | ||
| enrichment_options=EnrichmentOptions(**enrichment_options_dict), # type: ignore | ||
| average_trustworthiness_score=job["average_trustworthiness_score"], | ||
| job_type=job["type"], | ||
|
|
@@ -399,6 +409,27 @@ def resume(self) -> JSONDict: | |
| latest_job = self._get_latest_job() | ||
| return api.resume_enrichment_job(api_key=self._api_key, job_id=latest_job["id"]) | ||
|
|
||
| def run_online( | ||
| self, | ||
| data: Union[pd.DataFrame, List[dict]], | ||
| options: EnrichmentOptions, | ||
| new_column_name: str, | ||
| ) -> Dict[str, Any]: | ||
| """ | ||
| Enrich data in real-time using the same logic as the run() method, but client-side. | ||
|
||
|
|
||
| Args: | ||
| data (Union[pd.DataFrame, List[dict]]): The dataset to enrich. | ||
| options (EnrichmentOptions): Options for enriching the dataset. | ||
|
||
| new_column_name (str): The name of the new column to store the results. | ||
|
|
||
| Returns: | ||
| Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
| """ | ||
| run_online = _get_run_online() | ||
|
||
| job_info = run_online(data, options, new_column_name, self._api_key) | ||
|
||
| return job_info | ||
|
|
||
|
|
||
| class EnrichmentJob(TypedDict): | ||
| """Represents an Enrichment Job instance. | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -1,91 +1,71 @@ | ||||
| from typing import Any, List, Optional, Tuple, Union | ||||
| import pandas as pd | ||||
| from typing import Any, List, Tuple, Union, Dict | ||||
| from functools import lru_cache | ||||
| from cleanlab_studio.internal.enrichment_utils import ( | ||||
| extract_df_subset, | ||||
| get_prompt_outputs, | ||||
| get_regex_match_or_replacement, | ||||
| get_constrain_outputs_match, | ||||
| get_optimized_prompt, | ||||
| Replacement, | ||||
| ) | ||||
| from cleanlab_studio.studio.enrichment import EnrichmentOptions | ||||
|
|
||||
| from cleanlab_studio.studio.studio import Studio | ||||
|
|
||||
|
|
||||
| def enrich_data( | ||||
| studio: Studio, | ||||
| data: pd.DataFrame, | ||||
| prompt: str, | ||||
| *, | ||||
| regex: Optional[Union[str, Replacement, List[Replacement]]] = None, | ||||
| constrain_outputs: Optional[List[str]] = None, | ||||
| optimize_prompt: bool = True, | ||||
| subset_indices: Optional[Union[Tuple[int, int], List[int]]] = (0, 3), | ||||
| new_column_name: str = "metadata", | ||||
| disable_warnings: bool = False, | ||||
| **kwargs: Any, | ||||
| ) -> pd.DataFrame: | ||||
|
|
||||
| @lru_cache(maxsize=None) | ||||
| def _get_pandas(): | ||||
| import pandas as pd | ||||
|
|
||||
| return pd | ||||
|
||||
| "pandas==2.*", |
what is going on here?
tqdm is already a dependency of this package, so there should be no special logic to lazy-import it.
Line 57 in c2a3013
| "tqdm>=4.64.0", |
Is there a reason these are not just being imported at the top of the file?
Can you clarify why there is a separate _validate_enrichment_options defined here rather than using the validation function in run() here?
[nit] Replacement is a type alias for the Tuple[str, str] type (ref here); I'm not entirely sure why you made this change.
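To illustrate the point about the alias, a small sketch follows. It defines `Replacement` locally as `Tuple[str, str]`, as the comment describes. `apply_replacements` is a hypothetical helper written for this example; it is not the package's `get_regex_match_or_replacement`, whose behavior is not shown in the diff.

```python
import re
from typing import List, Tuple, Union

# Local stand-in for the alias described in the review comment.
Replacement = Tuple[str, str]  # (regex pattern, replacement string)


def apply_replacements(
    text: str, regex: Union[Replacement, List[Replacement]]
) -> str:
    """Hypothetical helper: apply one or more (pattern, replacement) pairs in order."""
    pairs = [regex] if isinstance(regex, tuple) else regex
    for pattern, replacement in pairs:
        text = re.sub(pattern, replacement, text)
    return text


# Because Replacement is only an alias, a plain tuple of two strings
# already satisfies the annotation for type checkers and at runtime.
single: Replacement = (r"\s+", " ")
cleaned = apply_replacements("a   b \t c", single)
chained = apply_replacements("Yes!!", [(r"!+", ""), (r"^Yes$", "yes")])
print(cleaned, chained)
```

Since the alias and the spelled-out tuple are interchangeable, swapping one for the other in a signature affects readability only.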
Please update the PR description with:
User code for when they just want to do some real-time data enrichment quickly.
User code for when they first want to run a data enrichment project over a big static dataset and later want to run real-time data enrichment over additional data.