Description
Problem
The `process_record` function currently tightly couples content extraction and aggregation logic. This makes it difficult to:
- Reuse the extraction logic across different parts of the codebase.
- Isolate and test the extraction logic effectively.
Proposed Improvement
Introduce a separate step for content extraction. This abstraction will:
- Encourage Reusability: By decoupling the logic, the content extraction step can be easily shared across modules or extended by the community.
- Enhance Testability: Since the extraction logic involves mostly pure and idempotent functions, isolating it would simplify testing and debugging.
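To illustrate the testability point, here is a minimal sketch of a pure extraction helper that can be tested without a Spark context or a real record. The name `extract_links` and the regular expression are assumptions made for illustration; they are not taken from the existing codebase.

```python
import re
from typing import List

# Hypothetical pattern: matches double-quoted href attribute values only.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def extract_links(html: str) -> List[str]:
    """Return link targets found in an HTML string (pure, no side effects)."""
    return HREF_PATTERN.findall(html)


def test_extract_links() -> None:
    html = '<a href="https://example.com/a">A</a> <a href="/b">B</a>'
    # Same input always yields the same output, so plain asserts suffice.
    assert extract_links(html) == ["https://example.com/a", "/b"]
    assert extract_links("no links here") == []


test_extract_links()
```

Because the helper depends only on its argument, the test needs no fixtures, mocks, or cluster setup.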
Implementation Suggestions
- Extract the content extraction logic into a dedicated function.
- Extract the content aggregation logic into a dedicated function.
- Modify `process_record` to delegate to the new abstractions.
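The three steps above could look roughly like the following sketch. It uses a plain dictionary as a stand-in for a record, and the names `extract_content` and `aggregate_content` are hypothetical; the real record type and the real aggregation logic will differ.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def extract_content(record: Dict[str, str]) -> List[str]:
    """Extraction step: pull lowercase tokens out of a single record."""
    # Assumption: the stand-in record keeps its payload under "text".
    return record.get("text", "").lower().split()


def aggregate_content(items: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Aggregation step: turn extracted items into (key, count) pairs."""
    yield from Counter(items).items()


def process_record(record: Dict[str, str]) -> Iterator[Tuple[str, int]]:
    """Delegates to the two steps instead of mixing them in one body."""
    yield from aggregate_content(extract_content(record))


pairs = list(process_record({"text": "Spark spark job"}))
```

With this split, `extract_content` can be reused or tested on its own, while `process_record` keeps the same generator-of-pairs shape for its callers.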
This can be implemented at two potential levels:
- At the `CCSparkJob` level:
  - Establish a standardized approach to content extraction, making it the principal way of handling such tasks in the codebase.
- At specific examples:
  - Implement the abstraction in specific examples like `ExtractLinksJob` to showcase the idea as a suggestion.
  - This provides flexibility for contributors to adopt or adapt the approach as needed.
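For the base-class option, one possible shape is a pair of overridable hooks that `process_record` chains together. The sketch below uses a hypothetical stand-in class, `ExtractingJob`, rather than the real `CCSparkJob`, and the hook names `extract` and `aggregate` are assumptions, not an existing API.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


class ExtractingJob:
    """Hypothetical stand-in for a job base class (not the real CCSparkJob)."""

    def extract(self, record: Any) -> Iterable[Any]:
        """Hook: subclasses return the items extracted from one record."""
        raise NotImplementedError

    def aggregate(self, items: Iterable[Any]) -> Iterator[Tuple[Any, int]]:
        """Default aggregation: emit each extracted item with a count of 1."""
        for item in items:
            yield item, 1

    def process_record(self, record: Any) -> Iterator[Tuple[Any, int]]:
        # The base class fixes the order of the steps; subclasses fill them in.
        yield from self.aggregate(self.extract(record))


class LinkJob(ExtractingJob):
    """Example-level job that only needs to override the extraction hook."""

    def extract(self, record: Dict[str, List[str]]) -> List[str]:
        # Assumption: the stand-in record exposes its links under "links".
        return record.get("links", [])


result = list(LinkJob().process_record({"links": ["/a", "/b"]}))
```

For the example-level option, the same two hooks could live directly on a single job class, leaving the base class untouched.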