Description
Datapipe is a real-time, incremental Python ETL library for machine learning with record-level dependency tracking.
The library is designed for describing data processing pipelines and tracks dependencies for each record in the pipeline. As a result, each task in the pipeline receives only the records that have been modified, which avoids reprocessing unchanged data.
Key Features:
- Incremental Processing: Datapipe processes only new or modified data, significantly reducing computation time and resource usage.
- Real-time ETL: The library supports real-time data extraction, transformation, and loading.
- Dependency Tracking: Automatic tracking of data dependencies and processing states.
- Python Integration: Seamlessly integrates with Python applications, offering a Pythonic way to describe data pipelines.
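
To make the idea of incremental, record-level processing concrete, here is a minimal conceptual sketch in plain Python. It does not use Datapipe's API: the `IncrementalStep` class, the `record_hash` helper, and the in-memory dictionaries are all hypothetical names introduced for illustration, and a real pipeline would presumably persist this state rather than keep it in memory. The sketch only shows the general technique of remembering a fingerprint per record and re-running the transform for records whose fingerprint changed.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def record_hash(record: Record) -> str:
    """Stable fingerprint of a record (hypothetical helper, not Datapipe API)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class IncrementalStep:
    """Illustrative step that re-processes only new or modified records."""

    def __init__(self, transform: Callable[[Record], Record]) -> None:
        self.transform = transform
        self.seen_hashes: Dict[str, str] = {}  # record key -> last processed hash
        self.outputs: Dict[str, Record] = {}   # record key -> transformed record

    def run(self, inputs: Dict[str, Record]) -> List[str]:
        """Process changed records and return the keys that were recomputed."""
        processed: List[str] = []
        for key, record in inputs.items():
            current = record_hash(record)
            if self.seen_hashes.get(key) != current:
                self.outputs[key] = self.transform(record)
                self.seen_hashes[key] = current
                processed.append(key)
        # Drop outputs whose input record no longer exists.
        for key in list(self.outputs):
            if key not in inputs:
                del self.outputs[key]
                del self.seen_hashes[key]
        return processed


step = IncrementalStep(lambda r: {"text_upper": r["text"].upper()})
data = {"a": {"text": "cat"}, "b": {"text": "dog"}}

first = step.run(data)    # both records are new: ['a', 'b']
data["b"] = {"text": "bird"}
second = step.run(data)   # only the modified record is recomputed: ['b']
```

Keeping the fingerprint per record, rather than per table or per batch, is what lets the second run skip record `a` entirely.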
Ideal projects for Datapipe
- Projects with complex ML pipelines with a human-in-the-loop component
- ML projects that require real-time model retraining based on newly labeled data
- Projects that require content moderation
GitHub
https://github.com/epoch8/datapipe – Datapipe Core
https://github.com/epoch8/datapipe-examples/ – Usage examples