Abstract process_record to Separate Content Extraction Step for Reusability and Testing #48

Open
@silentninja

Problem

The process_record function currently tightly couples content extraction and aggregation logic. This makes it difficult to:

  1. Reuse the extraction logic across different parts of the codebase.
  2. Isolate and test the extraction logic effectively.

Proposed Improvement

Introduce a separate step for content extraction. This abstraction will:

  • Encourage Reusability: By decoupling the logic, the content extraction step can be easily shared across modules or extended by the community.
  • Enhance Testability: Since the extraction logic involves mostly pure and idempotent functions, isolating it would simplify testing and debugging.

Implementation Suggestions

  1. Extract the content extraction logic into a dedicated function.
  2. Extract the content aggregation logic into a dedicated function.
  3. Modify process_record to delegate to the two new functions.
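
The three steps above could be sketched roughly as follows. This is only an illustration of the split, not the project's actual code: `Record`, `extract_links`, `aggregate_links`, and the regex-based link matching are all hypothetical stand-ins, and a real job would work on WARC records rather than this simplified type.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple
from urllib.parse import urljoin, urlparse


# Hypothetical, simplified stand-in for a crawl record (assumption for this sketch).
@dataclass
class Record:
    url: str
    html: str


# Naive href matcher, used only to keep the sketch self-contained.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def extract_links(record: Record) -> List[str]:
    """Step 1: pure extraction. Returns absolute link targets found in the record."""
    return [urljoin(record.url, href) for href in HREF_RE.findall(record.html)]


def aggregate_links(
    source_url: str, links: Iterable[str]
) -> Iterator[Tuple[Tuple[str, str], int]]:
    """Step 2: aggregation. Emits ((source_host, target_host), count) pairs."""
    source_host = urlparse(source_url).netloc
    counts = Counter(urlparse(link).netloc for link in links)
    for target_host, n in counts.items():
        yield (source_host, target_host), n


def process_record(record: Record) -> Iterator[Tuple[Tuple[str, str], int]]:
    """Step 3: thin wrapper that only delegates to the two functions above."""
    yield from aggregate_links(record.url, extract_links(record))
```

With this split, `extract_links` can be called and checked on its own, without running the aggregation or a Spark job.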

This can be implemented at two potential levels:

  1. At the CCSparkJob Level:

    • Establish a standardized approach to content extraction, making it the recommended way to handle such tasks in the codebase.

  2. At Specific Examples:

    • Implement the abstraction in specific examples like ExtractLinksJob to showcase the idea as a suggestion.
    • Give contributors the flexibility to adopt or adapt the approach as needed.
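
To make the difference between the two levels more concrete, here is a minimal sketch of what the base-class option might look like. `ExtractingJob`, `extract_content`, `aggregate_content`, and `WordCountExample` are hypothetical names invented for this illustration; they are not part of the existing CCSparkJob API, and plain strings stand in for real records.

```python
from typing import Any, Iterable, Iterator, Tuple


class ExtractingJob:
    """Hypothetical base class showing the role CCSparkJob could play."""

    def extract_content(self, record: Any) -> Iterable[Any]:
        # Subclasses supply the extraction logic for their record type.
        raise NotImplementedError

    def aggregate_content(
        self, record: Any, items: Iterable[Any]
    ) -> Iterator[Tuple[Any, Any]]:
        # Default aggregation: emit each extracted item with a count of 1.
        for item in items:
            yield item, 1

    def process_record(self, record: Any) -> Iterator[Tuple[Any, Any]]:
        # The base class fixes the order: extract first, then aggregate.
        yield from self.aggregate_content(record, self.extract_content(record))


class WordCountExample(ExtractingJob):
    """Hypothetical example job that only overrides the extraction step."""

    def extract_content(self, record: str) -> Iterable[str]:
        return record.lower().split()
```

Under the second option, the same two-method split would live only inside an individual example job, and the base class would stay unchanged.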
