
Pandas DataFrame Compatibility

DataStore provides comprehensive pandas DataFrame API compatibility, allowing you to use familiar pandas methods directly on DataStore objects while maintaining the benefits of SQL-based query optimization.

Our Approach

We don't guarantee 100% pandas compatibility—we optimize for practical migration.

Our compatibility strategy:

  1. Real-World Testing: We test against actual pandas code from Kaggle notebooks and common data analysis patterns using import datastore as pd.

  2. Prioritize Common Operations: We implement the pandas operations that appear most frequently in real workflows.

  3. Minimal Code Changes: The goal is that most existing pandas code works with just an import change.

  4. Document Differences: When behavior differs from pandas, we clearly document it.

# Typical migration
- import pandas as pd
+ import datastore as pd

# Most pandas code works unchanged
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()

Alternative: Fluent SQL-style API

If you prefer more explicit, SQL-like syntax over pandas conventions:

from datastore import DataStore

ds = DataStore.from_file("data.csv")
result = (ds
    .filter(ds.age > 25)
    .select('city', 'salary')
    .groupby('city')
    .agg({'salary': 'mean'})
    .to_df())

Both styles produce identical results. Choose based on your preference.

read_csv() Compatibility

DataStore provides a pandas-compatible read_csv() function that automatically chooses the optimal execution engine:

Default Behavior (Matches pandas)

import datastore as ds

# These work exactly like pandas
df = ds.read_csv("data.csv")                    # First row is header
df = ds.read_csv("data.csv", header=0)          # Explicit: first row is header
df = ds.read_csv("data.csv", header=None)       # No header, auto-generate column names
df = ds.read_csv("data.csv", sep=";")           # Semicolon delimiter
df = ds.read_csv("data.csv", sep="\t")          # Tab delimiter (uses TSV format)
df = ds.read_csv("data.csv", nrows=100)         # Read first 100 rows
df = ds.read_csv("data.csv", compression='gzip') # Compressed CSV

Parameters Handled by chDB SQL Engine

These parameters are translated to ClickHouse settings for optimal performance:

Parameter | ClickHouse Setting | Notes
sep=',' | Default CSV format | Comma delimiter
sep='\t' | TSVWithNames format | Tab delimiter uses native TSV
header=None | CSV format | No header row
skiprows=N | input_format_csv_skip_first_lines | Skip initial rows
nrows=N | LIMIT N | Read first N rows
compression | File function parameter | gzip, zstd, etc.
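
For intuition, here is a hedged sketch of the kind of chDB SQL these settings imply. The query DataStore actually generates may differ in detail, and the file name is illustrative:

import chdb

# Roughly what ds.read_csv("data.csv", skiprows=2, nrows=100) maps to
# (illustrative only; not DataStore's actual generated SQL)
sql = """
    SELECT *
    FROM file('data.csv', 'CSVWithNames')
    LIMIT 100
    SETTINGS input_format_csv_skip_first_lines = 2
"""
print(chdb.query(sql, "CSV"))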

Parameters That Fall Back to pandas

For full compatibility, these parameters automatically use pandas' read_csv():

  • Column customization: names, usecols, index_col
  • Type conversion: dtype, converters
  • Date parsing: parse_dates, date_parser, date_format
  • Custom delimiters: Any delimiter other than , or \t
  • Complex features: skipfooter, comment, thousands, chunksize

# These automatically use pandas for full compatibility
df = ds.read_csv("data.csv", usecols=['name', 'age'])    # Column selection
df = ds.read_csv("data.csv", dtype={'age': int})        # Type conversion
df = ds.read_csv("data.csv", parse_dates=['date_col'])  # Date parsing
df = ds.read_csv("data.csv", header=None, names=['a', 'b', 'c'])  # Custom names

Boolean Value Handling

By default, ClickHouse recognizes true/false (case-insensitive). For custom boolean strings:

# Custom boolean values (uses pandas fallback for full compatibility)
df = ds.read_csv("data.csv", 
                 true_values=['yes', 'Yes', 'TRUE'],
                 false_values=['no', 'No', 'FALSE'])

Best Practices

  1. Use standard CSV format: Files with comma delimiters and first-row headers work best with chDB engine
  2. Prefer chDB-supported parameters: nrows, compression for performance
  3. Fall back to pandas when needed: Complex parsing requirements are handled automatically
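
Putting these practices together, a minimal example (the file name is hypothetical; nrows and compression are chDB-supported parameters from the table above, so no pandas fallback is triggered):

import datastore as ds

# Standard comma-delimited CSV with a first-row header, read via the chDB engine
df = ds.read_csv("events.csv.gz", compression='gzip', nrows=10_000)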

Implementation Statistics

Pandas API Coverage

Category | Pandas Total | Implemented | Notes
DataFrame methods | 209 | 209 | All pandas DataFrame methods
Series methods | 210 | (via delegation) | Delegated to pandas
Series.str accessor | 56 | 56 | All pandas str methods
Series.dt accessor | 42 | 42+ | All pandas + ClickHouse extras

ClickHouse-Specific Accessors

Accessor | Methods | Description
Series.arr accessor | 37 | Array functions (ClickHouse-specific)
Series.json accessor | 13 | JSON functions
Series.url accessor | 15 | URL parsing functions
Series.ip accessor | 9 | IP address functions
Series.geo accessor | 14 | Geo/distance functions

Note: DataStore implements all pandas DataFrame API methods and all pandas Series.str/dt accessor methods. Additionally, it provides ClickHouse-specific accessors for array, JSON, URL, IP, and geo operations.

ClickHouse Functions

Engine | Functions
ClickHouse (ch_functions.json) | 1,475
Implemented in DataStore | 334
Implementation Rate | 22.6%

Overview

  • 209 Methods: 100% coverage of pandas DataFrame API
  • Seamless Integration: Mix SQL-style queries with pandas transformations
  • Automatic Wrapping: DataFrame/Series results automatically wrapped as DataStore
  • Immutable: All operations return new instances (no inplace=True)
  • Smart Execution: SQL operations build queries, pandas operations execute and cache results
  • Correct Chaining: Handles mixed SQL→pandas→pandas chains correctly

Quick Start

from datastore import DataStore

ds = DataStore.from_file("data.csv")

# Use any pandas method
df = (ds
    .drop(columns=['unused'])
    .fillna(0)
    .assign(revenue=lambda x: x['price'] * x['quantity'])
    .sort_values('revenue', ascending=False)
    .head(10))

# Mix SQL and pandas
result = (ds
    .select('*')
    .filter(ds.price > 100)              # SQL-style
    .assign(margin=lambda x: x['profit'] / x['revenue'])  # pandas-style
    .query('margin > 0.2')               # SQL-style
    .groupby('category').agg({'revenue': 'sum'}))  # pandas-style

Feature Checklist

✅ Attributes and Properties

  • df.index - Row labels
  • df.columns - Column labels
  • df.dtypes - Data types
  • df.values - NumPy array representation
  • df.shape - Dimensions (rows, cols)
  • df.size - Total elements
  • df.ndim - Number of dimensions
  • df.empty - Empty check
  • df.T - Transpose
  • df.axes - Axis labels

✅ Indexing and Selection

  • df.loc[...] - Label-based indexing
  • df.iloc[...] - Integer-based indexing
  • df.at[...] - Fast scalar access
  • df.iat[...] - Fast integer scalar access
  • df['col'] - Column selection
  • df[['col1', 'col2']] - Multiple columns
  • df.head(n) - First n rows
  • df.tail(n) - Last n rows
  • df.sample(n) - Random sample
  • df.select_dtypes() - Select by dtype
  • df.query() - Query by expression
  • df.where() - Conditional replacement
  • df.mask() - Inverse where
  • df.isin() - Value membership
  • df.get() - Safe column access
  • df.xs() - Cross-section
  • df.pop() - Remove and return column
  • df.insert() - Insert column

✅ Statistical Methods

  • df.describe() - Summary statistics
  • df.mean() - Mean values
  • df.median() - Median values
  • df.mode() - Mode values
  • df.std() - Standard deviation
  • df.var() - Variance
  • df.min() / df.max() - Min/Max values
  • df.sum() - Sum
  • df.prod() - Product
  • df.count() - Non-null counts
  • df.nunique() - Unique counts
  • df.value_counts() - Value frequencies
  • df.quantile() - Quantiles
  • df.corr() - Correlation matrix
  • df.cov() - Covariance matrix
  • df.corrwith() - Pairwise correlation
  • df.rank() - Rank values
  • df.abs() - Absolute values
  • df.round() - Round values
  • df.clip() - Clip values
  • df.cumsum() - Cumulative sum
  • df.cumprod() - Cumulative product
  • df.cummin() - Cumulative min
  • df.cummax() - Cumulative max
  • df.diff() - Difference
  • df.pct_change() - Percent change
  • df.skew() - Skewness
  • df.kurt() - Kurtosis
  • df.sem() - Standard error
  • df.all() / df.any() - Boolean aggregation
  • df.idxmin() / df.idxmax() - Index of min/max
  • df.eval() - Expression evaluation

✅ Data Manipulation

  • df.drop() - Drop rows/columns
  • df.drop_duplicates() - Remove duplicates
  • df.duplicated() - Mark duplicates
  • df.dropna() - Remove missing
  • df.fillna() - Fill missing
  • df.ffill() / df.bfill() - Forward/backward fill
  • df.interpolate() - Interpolate values
  • df.replace() - Replace values
  • df.rename() - Rename labels
  • df.rename_axis() - Rename axis
  • df.assign() - Add columns
  • df.astype() - Convert types
  • df.convert_dtypes() - Infer types
  • df.copy() - Copy data

✅ Sorting and Ranking

  • df.sort_values() - Sort by values
  • df.sort_index() - Sort by index
  • df.nlargest() - N largest values
  • df.nsmallest() - N smallest values

✅ Reindexing

  • df.reset_index() - Reset index
  • df.set_index() - Set index
  • df.reindex() - Conform to new index
  • df.reindex_like() - Match another's index
  • df.add_prefix() - Add prefix to labels
  • df.add_suffix() - Add suffix to labels
  • df.align() - Align two objects
  • df.set_axis() - Set axis labels
  • df.take() - Select by positions
  • df.truncate() - Truncate by range

✅ Reshaping

  • df.pivot() - Pivot table
  • df.pivot_table() - Pivot with aggregation
  • df.melt() - Unpivot
  • df.stack() - Stack columns to index
  • df.unstack() - Unstack index to columns
  • df.transpose() / df.T - Transpose
  • df.explode() - Explode lists to rows
  • df.squeeze() - Reduce dimensions
  • df.droplevel() - Drop index level
  • df.swaplevel() - Swap index levels
  • df.swapaxes() - Swap axes
  • df.reorder_levels() - Reorder levels

✅ Combining / Joining / Merging

  • df.append() - Append rows
  • df.merge() - SQL-style merge
  • df.join() - Join on index
  • df.concat() - Concatenate
  • df.compare() - Compare differences
  • df.update() - Update values
  • df.combine() - Combine with function
  • df.combine_first() - Combine with priority

✅ Binary Operators

  • df.add() / df.radd() - Addition
  • df.sub() / df.rsub() - Subtraction
  • df.mul() / df.rmul() - Multiplication
  • df.div() / df.rdiv() - Division
  • df.truediv() / df.rtruediv() - True division
  • df.floordiv() / df.rfloordiv() - Floor division
  • df.mod() / df.rmod() - Modulo
  • df.pow() / df.rpow() - Power
  • df.dot() - Matrix multiplication

✅ Comparison Operators

  • df.eq() - Equal
  • df.ne() - Not equal
  • df.lt() - Less than
  • df.le() - Less than or equal
  • df.gt() - Greater than
  • df.ge() - Greater than or equal

✅ Function Application

  • df.apply() - Apply function
  • df.applymap() - Apply element-wise
  • df.map() - Apply element-wise (alias)
  • df.agg() / df.aggregate() - Aggregate
  • df.transform() - Transform
  • df.pipe() - Pipe functions
  • df.groupby() - Group by (returns GroupBy)

✅ Time Series

  • df.rolling() - Rolling window
  • df.expanding() - Expanding window
  • df.ewm() - Exponentially weighted
  • df.resample() - Resample time series
  • df.shift() - Shift values
  • df.asfreq() - Convert frequency
  • df.asof() - Latest value as of time
  • df.at_time() - Select at time
  • df.between_time() - Select time range
  • df.first() / df.last() - First/last periods
  • df.first_valid_index() - First valid index
  • df.last_valid_index() - Last valid index
  • df.to_period() - Convert to period
  • df.to_timestamp() - Convert to timestamp
  • df.tz_convert() - Convert timezone
  • df.tz_localize() - Localize timezone

✅ Missing Data

  • df.isna() / df.isnull() - Detect missing
  • df.notna() / df.notnull() - Detect non-missing
  • df.dropna() - Drop missing
  • df.fillna() - Fill missing
  • df.ffill() - Forward fill
  • df.bfill() - Backward fill
  • df.backfill() - Backward fill (alias)
  • df.pad() - Forward fill (alias)
  • df.interpolate() - Interpolate
  • df.replace() - Replace values

✅ Export / IO

  • df.to_csv() - Export to CSV
  • df.to_json() - Export to JSON
  • df.to_excel() - Export to Excel
  • df.to_parquet() - Export to Parquet
  • df.to_feather() - Export to Feather
  • df.to_hdf() - Export to HDF5
  • df.to_sql() - Export to SQL database
  • df.to_stata() - Export to Stata
  • df.to_pickle() - Pickle to file
  • df.to_html() - Render as HTML
  • df.to_latex() - Render as LaTeX
  • df.to_markdown() - Render as Markdown
  • df.to_string() - Render as string
  • df.to_dict() - Convert to dictionary
  • df.to_records() - Convert to records
  • df.to_numpy() - Convert to NumPy
  • df.to_clipboard() - Copy to clipboard
  • df.to_xarray() - Convert to xarray
  • df.to_orc() - Export to ORC
  • df.to_gbq() - Export to BigQuery

✅ Iteration

  • df.items() - Iterate (column, Series) pairs
  • df.iterrows() - Iterate (index, Series) pairs
  • df.itertuples() - Iterate as namedtuples

✅ Plotting

  • df.plot - Plotting accessor
  • df.plot.* - Various plot types
  • df.hist() - Histogram
  • df.boxplot() - Box plot

✅ Accessors

  • df.str - String accessor (for Series)
  • df.dt - Datetime accessor
  • df.sparse - Sparse accessor
  • df.style - Styling accessor

Series.str Accessor

The .str accessor provides all 56 pandas Series.str methods. Methods are implemented in two ways:

Lazy Methods (SQL-based, 51 methods)

These methods return ColumnExpr and remain lazy until execution:

# All these are lazy - no execution until to_df()
ds['name'].str.upper()           # → ColumnExpr (lazy)
ds['name'].str.lower()           # → ColumnExpr (lazy)
ds['name'].str.len()             # → ColumnExpr (lazy)
ds['name'].str.contains('test')  # → ColumnExpr (lazy)
ds['name'].str.replace('a', 'b') # → ColumnExpr (lazy)

# Assign to column (still lazy)
ds['upper_name'] = ds['name'].str.upper()

# Execute when needed
df = ds.to_df()  # ← SQL executes here

Lazy methods include: upper, lower, len, strip, lstrip, rstrip, contains, startswith, endswith, replace, split, rsplit, slice, pad, center, ljust, rjust, zfill, repeat, find, rfind, index, rindex, match, fullmatch, extract, encode, decode, capitalize, title, swapcase, casefold, normalize, isalnum, isalpha, isdigit, isspace, islower, isupper, istitle, isnumeric, isdecimal, wrap, get, count, join, slice_replace, translate, removeprefix, removesuffix

Executing Methods (5 methods)

These methods must execute because they change the return structure:

Method | Return Type | Why Execution Required
partition(sep) | DataStore (3 columns) | Returns a DataFrame with 3 columns (left, sep, right); cannot be represented as a single SQL expression
rpartition(sep) | DataStore (3 columns) | Same as partition, but splits from the right
get_dummies(sep) | DataStore (N columns) | Creates a dynamic number of columns based on unique values; column count is unknown at query time
extractall(pat) | DataStore | Returns a MultiIndex DataFrame with all regex matches; row count changes
cat(sep) | str | Aggregates all strings into a single value; reduces N rows to 1 scalar

# These execute immediately and return results
ds['name'].str.partition('|')      # → DataStore (3 columns)
ds['name'].str.get_dummies('|')    # → DataStore (N dummy columns)
ds['name'].str.extractall(r'\d+')  # → DataStore (all matches)
ds['name'].str.cat(sep='-')        # → str ("John-Jane-Bob")

Why cat() Requires Execution

cat() is fundamentally different from other string methods:

# Other methods: row-wise transformation (N rows → N rows)
ds['name'].str.upper()  # ["john", "jane"] → ["JOHN", "JANE"]

# cat(): aggregation (N rows → 1 scalar)  
ds['name'].str.cat(sep='-')  # ["john", "jane"] → "john-jane"

cat() performs an aggregation operation that:

  1. Reads all values in the column
  2. Concatenates them with a separator
  3. Returns a single string

This cannot be expressed as a per-row SQL expression. It requires executing the data first, then calling pandas' str.cat() method.
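
Conceptually, the fallback behaves like this sketch (not the literal implementation; to_pandas() is the ColumnExpr conversion documented below):

# Sketch: execute first, then aggregate in pandas
series = ds['name'].to_pandas()    # materialize the column (runs the SQL)
result = series.str.cat(sep='-')   # N strings reduced to one string, in pandas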

✅ Comparison

  • df.equals() - Test equality
  • df.compare() - Show differences

✅ Miscellaneous

  • df.info() - Print summary
  • df.memory_usage() - Memory usage
  • df.copy() - Copy DataFrame

Key Differences from Pandas

1. Row Ordering Behavior

DataStore has different row ordering guarantees depending on the data source:

For DataFrame sources (in-memory): Row order IS preserved. DataStore uses chDB's built-in _row_id virtual column (available in chDB v4.0.0b5+) to maintain original row order and pandas index:

# ✅ Row order is preserved for DataFrame sources
df = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
ds = DataStore(df)
result = ds[ds['value'] > 10]
# Result preserves original row order and pandas index
assert list(result.index) == [1, 2]  # Original indices preserved

For file sources (CSV, Parquet): Row order is NOT guaranteed unless you explicitly specify ORDER BY:

# ❌ Order may vary between executions for file sources
ds = DataStore.from_file("data.csv")
result = ds.filter(ds.value > 50).to_df()
# Row order is NOT guaranteed to match the original file order

# ✅ Explicitly specify ORDER BY for deterministic order
result = ds.filter(ds.value > 50).order_by('id').to_df()
# Rows are ordered by 'id' column

Why the difference? For DataFrame sources, chDB provides a deterministic _row_id virtual column that represents the 0-based row position from the original DataFrame. For file sources, ClickHouse may return rows in any order for better performance (standard SQL behavior).
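
As a standalone illustration of the mechanism (assuming chdb's Python() table function and DataFrame output format; this is not DataStore's internal code):

import chdb
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
# _row_id is the 0-based source row position chDB exposes for DataFrame
# sources; ordering by it restores the original input order
out = chdb.query(
    "SELECT _row_id, id, value FROM Python(df) WHERE value > 10 ORDER BY _row_id",
    "DataFrame",
)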

Impact on comparisons:

  • For DataFrame sources: Results should match pandas behavior including row order
  • For file sources: Sort both DataFrames first or use set-based comparisons
  • Use df.sort_values('col').reset_index(drop=True) before pd.testing.assert_frame_equal() for file sources
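
The last bullet in concrete form, as a hedged sketch (the column names 'id' and 'value' are illustrative):

import pandas as pd
from datastore import DataStore

expected = pd.read_csv("data.csv")
expected = expected[expected['value'] > 50]

ds = DataStore.from_file("data.csv")
actual = ds.filter(ds.value > 50).to_df()

# Sort both sides before comparing, since file-source row order is not guaranteed
pd.testing.assert_frame_equal(
    expected.sort_values('id').reset_index(drop=True),
    actual.sort_values('id').reset_index(drop=True),
)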

2. Immutability

DataStore operations are immutable - inplace=True is not supported:

# ❌ Not supported
df.drop(columns=['col'], inplace=True)

# ✅ Correct usage
df = df.drop(columns=['col'])

3. Return Types

DataStore uses lazy evaluation for optimal performance:

# DataFrame methods return DataStore
result = ds.drop(columns=['col'])  # Returns DataStore
df = result.to_df()  # Get underlying DataFrame

# Column access returns ColumnExpr (lazy)
col = ds['column']  # Returns ColumnExpr (displays like Series)
pd_series = col.to_pandas()  # Convert to pd.Series when needed

# Aggregations return LazyAggregate (lazy)
mean_result = ds['age'].mean()  # Returns LazyAggregate
print(mean_result)  # Triggers execution, displays value

# Convert to pandas types
df = ds.to_df()  # DataStore → pd.DataFrame
series = ds['col'].to_pandas()  # ColumnExpr → pd.Series

4. Comparing Results with pandas

DataStore uses a Lazy Execution model, so column operations return ColumnExpr or LazySeries objects instead of pd.Series. When comparing DataStore results with pandas, you need to convert to pandas first:

import pandas as pd
import datastore as ds

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
ds_df = ds.DataFrame(df)

pd_col = df['a']
ds_col = ds_df['a']

# ❌ Wrong: pandas.equals() doesn't recognize DataStore objects
pd_col.equals(ds_col)  # Returns False (pandas limitation)

# ✅ Correct: Use to_pandas() to convert first
pd_col.equals(ds_col.to_pandas())  # Returns True

# ✅ Correct: Use DataStore's equals() method (works both ways)
ds_col.equals(pd_col)  # Returns True

# ✅ For testing: Use pandas testing utilities
pd.testing.assert_series_equal(pd_col, ds_col.to_pandas())

# ✅ For values only: Use numpy
import numpy as np
np.array_equal(pd_col.values, ds_col.values)  # Returns True

Why does pd_col.equals(ds_col) return False?

pandas Series.equals() checks isinstance(other, pd.Series) first. Since DataStore's ColumnExpr is not a pd.Series subclass, it immediately returns False - this is a pandas design limitation.

Recommended patterns:

Use Case | Recommended Method
Compare values | ds_col.equals(pd_col) or pd_col.equals(ds_col.to_pandas())
Test assertions | pd.testing.assert_series_equal(pd_result, ds_result.to_pandas())
Values only | np.array_equal(pd_col.values, ds_col.values)
Float comparison | np.allclose(pd_col.values, ds_col.values)

5. Series Handling and LazySeries

Operations that return Series in pandas return lazy objects in DataStore:

# Returns ColumnExpr (lazy), displays like pandas Series
series = ds['column']  
print(type(series))  # <class 'datastore.column_expr.ColumnExpr'>

# Convert to pandas when needed
pd_series = series.to_pandas()
print(type(pd_series))  # <class 'pandas.core.series.Series'>

# Multiple columns return DataStore
datastore = ds[['col1', 'col2']]
print(type(datastore))  # <class 'datastore.core.DataStore'>

# Method calls on columns return LazySeries
head_result = ds['column'].head(5)
print(type(head_result))  # <class 'datastore.lazy_result.LazySeries'>

# LazySeries executes on access (values, index, repr, etc.)
values = head_result.values  # Triggers execution

Why Lazy? This enables SQL query optimization and deferred execution.

6. The to_pandas() Method

Both DataStore and ColumnExpr provide to_pandas() for explicit conversion:

# DataStore to DataFrame
ds = DataStore.from_file("data.csv")
df = ds.to_pandas()  # Returns pd.DataFrame (alias for to_df())

# ColumnExpr to Series
col = ds['age']
series = col.to_pandas()  # Returns pd.Series

# Useful for library interoperability
import seaborn as sns
sns.histplot(ds['age'].to_pandas())

7. Method Naming

The INSERT VALUES method has been renamed to avoid conflicts:

# Old (conflicts with df.values property)
ds.insert_into('id', 'name').values(1, 'Alice')

# New (recommended)
ds.insert_into('id', 'name').insert_values(1, 'Alice')

Execution Model

DataStore implements a sophisticated Mixed Execution Engine that enables arbitrary mixing of SQL and pandas operations.

Key Innovation: SQL on DataFrames

After execution, SQL-style operations use chDB's Python() table function to execute SQL directly on cached DataFrames, enabling true mixed execution.

Three-Stage Execution

Stage 1: SQL Query Building (Lazy)

ds = DataStore.from_file("data.csv")
ds1 = ds.select('*')                    # Builds: SELECT *
ds2 = ds1.filter(ds.age > 25)           # Adds: WHERE age > 25
# ds2._executed = False (no execution yet)

Stage 2: Execution (First pandas Operation)

ds3 = ds2.add_prefix('emp_')            # ← Executes SQL here!
# ds3._executed = True
# ds3._cached_df = DataFrame with filtered data and prefixed columns

Stage 3: SQL on DataFrame (chDB Magic)

ds4 = ds3.filter(ds.emp_age > 30)       # SQL on DataFrame!
# Internally: SELECT * FROM Python(__datastore_cached_df__) WHERE emp_age > 30
# ds4._executed = True (result cached)

Arbitrary Mixing Examples

Example 1: SQL → Pandas → SQL → Pandas

result = (ds
    .filter(ds.age > 25)                      # SQL query building
    .add_prefix('emp_')                       # Pandas (executes)
    .filter(ds.emp_salary > 55000)            # SQL on DataFrame!
    .fillna(0))                               # Pandas on DataFrame

Example 2: Pandas → SQL → Pandas → SQL

result = (ds
    .rename(columns={'id': 'ID'})             # Pandas (executes)
    .filter(ds.ID > 5)                        # SQL on DataFrame
    .sort_values('salary')                    # Pandas
    .select('ID', 'name', 'salary'))          # SQL on DataFrame again!

Example 3: Complex Mixed Chain

result = (ds
    .select('*')                              # SQL 1
    .filter(ds.status == 'active')            # SQL 2
    .assign(revenue=lambda x: x['price'] * x['qty'])  # Pandas (executes)
    .filter(ds.revenue > 1000)                # SQL 3 on DataFrame
    .add_prefix('sales_')                     # Pandas
    .query('sales_revenue > 5000')            # Pandas
    .select('sales_id', 'sales_customer', 'sales_revenue'))  # SQL 4 on DataFrame

For detailed execution plan visualization, see Explain Method

Performance Tips

1. Use SQL for Filtering

Let the query engine do heavy filtering before pandas operations:

# ✅ Efficient
result = (ds
    .select('*')
    .filter(ds.date >= '2024-01-01')  # SQL filter
    .filter(ds.amount > 1000)         # SQL filter
    .assign(margin=lambda x: x['profit'] / x['revenue'])  # Pandas transform
    .groupby('category').agg({'revenue': 'sum'}))  # Pandas aggregation

# ❌ Less efficient
result = (ds
    .to_df()  # Load all data
    .query('date >= "2024-01-01" and amount > 1000'))  # Filter in memory

2. Understand Execution

Once executed (pandas operation applied), all subsequent operations use cached data:

ds = DataStore.from_file("big_data.csv")

# SQL operations - build query (lazy)
ds_filtered = ds.select('*').filter(ds.value > 0)  # No execution yet

# First pandas operation - executes
ds_prefixed = ds_filtered.add_prefix('col_')  # ← Query executes here!

# All subsequent operations use cached DataFrame
mean = ds_prefixed.mean()       # Uses cache, no SQL
std = ds_prefixed.std()         # Uses cache, no SQL
df = ds_prefixed.to_df()        # Returns cache, no SQL

3. Optimal Workflow Pattern

Best Practice: Filter in SQL, transform in pandas

# ✅ Optimal: SQL filtering → Pandas transformation
result = (ds
    .select('*')
    .filter(ds.date >= '2024-01-01')    # SQL: Filters billions of rows
    .filter(ds.amount > 1000)           # SQL: More filtering
    # ↑ Query built but not executed yet
    
    .add_prefix('col_')                 # ← Executes SQL here, caches result
    .fillna(0)                          # Pandas: Works on cached result
    .assign(margin=lambda x: x['col_profit'] / x['col_revenue']))  # Pandas

4. Chain Operations

Chain multiple operations for better readability and potential optimization:

result = (ds
    .drop(columns=['unused1', 'unused2'])
    .fillna(0)
    .assign(
        revenue=lambda x: x['price'] * x['quantity'],
        margin=lambda x: x['profit'] / x['revenue']
    )
    .query('margin > 0.2')
    .sort_values('revenue', ascending=False)
    .head(100))

Examples

Example 1: Data Cleaning

cleaned = (ds
    .drop(columns=['temp_col'])
    .dropna(subset=['important_col'])
    .drop_duplicates()
    .fillna({'numeric_col': 0, 'string_col': 'unknown'})
    .astype({'id': 'int64', 'amount': 'float64'}))

Example 2: Feature Engineering

featured = ds.assign(
    revenue=lambda x: x['price'] * x['quantity'],
    profit=lambda x: x['revenue'] - x['cost'],
    margin=lambda x: x['profit'] / x['revenue'],
    high_value=lambda x: x['revenue'] > 1000
)

Example 3: Time Series Analysis

ts_result = (ds
    .set_index('date')
    .sort_index()
    .asfreq('D')
    .ffill()
    .rolling(window=7).mean()
    .shift(1))

Example 4: Binary Operations

# Calculate year-over-year growth
growth = (current_year
    .set_index('product')
    .sub(last_year.set_index('product'))
    .div(last_year.set_index('product'))
    .mul(100))

Example 5: Conditional Operations

# Complex filtering and transformation
result = (ds
    .query('age > 18 and income > 50000')
    .assign(
        segment=lambda x: pd.cut(x['income'], 
                                  bins=[0, 75000, 150000, float('inf')],
                                  labels=['Low', 'Medium', 'High'])
    )
    .where(lambda x: x['score'] > 0, 0)
    .groupby('segment')
    .agg({'income': 'mean', 'score': 'sum'}))

Example 6: Mixing SQL and Pandas

# Optimal workflow
result = (ds
    # Use SQL for heavy filtering
    .select('customer_id', 'order_date', 'amount', 'product_category')
    .filter(ds.order_date >= '2024-01-01')
    .filter(ds.order_date < '2024-02-01')
    .filter(ds.amount > 0)
    
    # Use pandas for complex transformations
    .assign(
        month=lambda x: pd.to_datetime(x['order_date']).dt.month,
        is_high_value=lambda x: x['amount'] > x['amount'].quantile(0.75)
    )
    .groupby(['customer_id', 'month'])
    .agg({
        'amount': ['sum', 'mean', 'count'],
        'is_high_value': 'sum'
    })
    .reset_index()
    
    # Export
    .to_parquet('customer_monthly_summary.parquet'))

Limitations

Not Implemented

  • inplace=True parameter (DataStore is immutable)
  • Some deprecated pandas methods
  • Methods that don't make sense for DataStore (e.g., from_dict, from_records as instance methods)

Partial Support

  • df.groupby() - Returns pandas GroupBy object, not DataStore
  • Class methods - Return pandas objects, not DataStore

Pandas Version Compatibility

DataStore requires pandas >= 2.1.0 and Python >= 3.9.

Supported Features

Feature | Status | Notes
DataFrame.map() | ✅ Available | Recommended over applymap()
groupby.apply(include_groups=...) | ✅ Available | Control group column inclusion
Nullable type handling | ✅ Full support | Best dtype preservation
first()/last() deprecation | ✅ FutureWarning | Warns about offset string usage

Recommendations

  1. For production: Pin specific pandas version to ensure consistent behavior
  2. For best performance: Use pandas 2.2+ for latest optimizations

Summary

DataStore provides comprehensive pandas DataFrame API compatibility with seamless integration:

  • 209 pandas DataFrame methods implemented
  • 56 pandas .str accessor methods (all pandas str methods covered)
  • 42+ pandas .dt accessor methods (plus ClickHouse datetime extras)
  • 334 ClickHouse functions mapped to Pandas-like API
  • ClickHouse-specific accessors: .arr (37 methods), .json, .url, .ip, .geo
  • ✅ Mix SQL queries with pandas transformations
  • ✅ Automatic DataFrame/Series wrapping
  • ✅ Performance optimization through caching
  • ✅ Immutable, thread-safe operations

Use DataStore when you need the power of pandas with the performance of SQL query optimization!