DataStore provides comprehensive pandas DataFrame API compatibility, allowing you to use familiar pandas methods directly on DataStore objects while maintaining the benefits of SQL-based query optimization.
We don't guarantee 100% pandas compatibility—we optimize for practical migration.
Our compatibility strategy:
- Real-World Testing: We test against actual pandas code from Kaggle notebooks and common data analysis patterns using `import datastore as pd`.
- Prioritize Common Operations: We implement the pandas operations that appear most frequently in real workflows.
- Minimal Code Changes: The goal is that most existing pandas code works with just an import change.
- Document Differences: When behavior differs from pandas, we clearly document it.
# Typical migration
- import pandas as pd
+ import datastore as pd
# Most pandas code works unchanged
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()

Alternative: Fluent SQL-style API
If you prefer more explicit, SQL-like syntax over pandas conventions:
from datastore import DataStore
ds = DataStore.from_file("data.csv")
result = (ds
.filter(ds.age > 25)
.select('city', 'salary')
.groupby('city')
.agg({'salary': 'mean'})
.to_df())

Both styles produce identical results. Choose based on your preference.
DataStore provides a pandas-compatible read_csv() function that automatically chooses the optimal execution engine:
import datastore as ds
# These work exactly like pandas
df = ds.read_csv("data.csv") # First row is header
df = ds.read_csv("data.csv", header=0) # Explicit: first row is header
df = ds.read_csv("data.csv", header=None) # No header, auto-generate column names
df = ds.read_csv("data.csv", sep=";") # Semicolon delimiter
df = ds.read_csv("data.csv", sep="\t") # Tab delimiter (uses TSV format)
df = ds.read_csv("data.csv", nrows=100) # Read first 100 rows
df = ds.read_csv("data.csv", compression='gzip') # Compressed CSVThese parameters are translated to ClickHouse settings for optimal performance:
| Parameter | ClickHouse Setting | Notes |
|---|---|---|
| `sep=','` | Default CSV format | Comma delimiter |
| `sep='\t'` | TSVWithNames format | Tab delimiter uses native TSV |
| `header=None` | CSV format | No header row |
| `skiprows=N` | `input_format_csv_skip_first_lines` | Skip initial rows |
| `nrows=N` | `LIMIT N` | Read first N rows |
| `compression` | File function parameter | gzip, zstd, etc. |
For full compatibility, these parameters automatically use pandas' read_csv():
- Column customization: `names`, `usecols`, `index_col`
- Type conversion: `dtype`, `converters`
- Date parsing: `parse_dates`, `date_parser`, `date_format`
- Custom delimiters: Any delimiter other than `,` or `\t`
- Complex features: `skipfooter`, `comment`, `thousands`, `chunksize`
# These automatically use pandas for full compatibility
df = ds.read_csv("data.csv", usecols=['name', 'age']) # Column selection
df = ds.read_csv("data.csv", dtype={'age': int}) # Type conversion
df = ds.read_csv("data.csv", parse_dates=['date_col']) # Date parsing
df = ds.read_csv("data.csv", header=None, names=['a', 'b', 'c']) # Custom namesBy default, ClickHouse recognizes true/false (case-insensitive). For custom boolean strings:
# Custom boolean values (uses pandas fallback for full compatibility)
df = ds.read_csv("data.csv",
true_values=['yes', 'Yes', 'TRUE'],
false_values=['no', 'No', 'FALSE'])

- Use standard CSV format: Files with comma delimiters and first-row headers work best with the chDB engine
- Prefer chDB-supported parameters: `nrows`, `compression` for performance
- Fall back to pandas when needed: Complex parsing requirements are handled automatically
| Category | Pandas Total | Implemented | Notes |
|---|---|---|---|
| DataFrame methods | 209 | 209 | All pandas DataFrame methods |
| Series methods | 210 | (via delegation) | Delegated to pandas |
| Series.str accessor | 56 | 56 | All pandas str methods |
| Series.dt accessor | 42 | 42+ | All pandas + ClickHouse extras |
| Accessor | Methods | Description |
|---|---|---|
| Series.arr accessor | 37 | Array functions (ClickHouse-specific) |
| Series.json accessor | 13 | JSON functions |
| Series.url accessor | 15 | URL parsing functions |
| Series.ip accessor | 9 | IP address functions |
| Series.geo accessor | 14 | Geo/distance functions |
Note: DataStore implements all pandas DataFrame API methods and all pandas Series.str/dt accessor methods. Additionally, it provides ClickHouse-specific accessors for array, JSON, URL, IP, and geo operations.
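For a feel of how these accessors are used, here is a hypothetical sketch. The accessor names (`.arr`, `.json`, `.url`, `.ip`, `.geo`) come from the table above, but the individual method names below are assumptions for illustration only; see the accessor reference for the actual API.

```python
# Hypothetical sketch only: .url and .arr accessor usage on a DataStore column.
# domain() and length() are assumed method names, not confirmed API.
host = ds['page_url'].url.domain()   # parse the domain out of a URL column
n_tags = ds['tags'].arr.length()     # length of an array-typed column
```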
| Engine | Functions |
|---|---|
| ClickHouse (ch_functions.json) | 1,475 |
| Implemented in DataStore | 334 |
| Implementation Rate | 22.6% |
- 209 Methods: 100% coverage of pandas DataFrame API
- Seamless Integration: Mix SQL-style queries with pandas transformations
- Automatic Wrapping: DataFrame/Series results automatically wrapped as DataStore
- Immutable: All operations return new instances (no `inplace=True`)
- Smart Execution: SQL operations build queries, pandas operations execute and cache results
- Correct Chaining: Handles mixed SQL→pandas→pandas chains correctly
from datastore import DataStore
ds = DataStore.from_file("data.csv")
# Use any pandas method
df = (ds
    .drop(columns=['unused'])
    .fillna(0)
    .assign(revenue=lambda x: x['price'] * x['quantity'])
    .sort_values('revenue', ascending=False)
    .head(10))
# Mix SQL and pandas
result = (ds
.select('*')
.filter(ds.price > 100) # SQL-style
.assign(margin=lambda x: x['profit'] / x['revenue']) # pandas-style
.query('margin > 0.2') # SQL-style
.groupby('category').agg({'revenue': 'sum'}))  # pandas-style

- `df.index` - Row labels
- `df.columns` - Column labels
- `df.dtypes` - Data types
- `df.values` - NumPy array representation
- `df.shape` - Dimensions (rows, cols)
- `df.size` - Total elements
- `df.ndim` - Number of dimensions
- `df.empty` - Empty check
- `df.T` - Transpose
- `df.axes` - Axis labels
- `df.loc[...]` - Label-based indexing
- `df.iloc[...]` - Integer-based indexing
- `df.at[...]` - Fast scalar access
- `df.iat[...]` - Fast integer scalar access
- `df['col']` - Column selection
- `df[['col1', 'col2']]` - Multiple columns
- `df.head(n)` - First n rows
- `df.tail(n)` - Last n rows
- `df.sample(n)` - Random sample
- `df.select_dtypes()` - Select by dtype
- `df.query()` - Query by expression
- `df.where()` - Conditional replacement
- `df.mask()` - Inverse where
- `df.isin()` - Value membership
- `df.get()` - Safe column access
- `df.xs()` - Cross-section
- `df.pop()` - Remove and return column
- `df.insert()` - Insert column
- `df.describe()` - Summary statistics
- `df.mean()` - Mean values
- `df.median()` - Median values
- `df.mode()` - Mode values
- `df.std()` - Standard deviation
- `df.var()` - Variance
- `df.min()` / `df.max()` - Min/Max values
- `df.sum()` - Sum
- `df.prod()` - Product
- `df.count()` - Non-null counts
- `df.nunique()` - Unique counts
- `df.value_counts()` - Value frequencies
- `df.quantile()` - Quantiles
- `df.corr()` - Correlation matrix
- `df.cov()` - Covariance matrix
- `df.corrwith()` - Pairwise correlation
- `df.rank()` - Rank values
- `df.abs()` - Absolute values
- `df.round()` - Round values
- `df.clip()` - Clip values
- `df.cumsum()` - Cumulative sum
- `df.cumprod()` - Cumulative product
- `df.cummin()` - Cumulative min
- `df.cummax()` - Cumulative max
- `df.diff()` - Difference
- `df.pct_change()` - Percent change
- `df.skew()` - Skewness
- `df.kurt()` - Kurtosis
- `df.sem()` - Standard error
- `df.all()` / `df.any()` - Boolean aggregation
- `df.idxmin()` / `df.idxmax()` - Index of min/max
- `df.eval()` - Expression evaluation
- `df.drop()` - Drop rows/columns
- `df.drop_duplicates()` - Remove duplicates
- `df.duplicated()` - Mark duplicates
- `df.dropna()` - Remove missing
- `df.fillna()` - Fill missing
- `df.ffill()` / `df.bfill()` - Forward/backward fill
- `df.interpolate()` - Interpolate values
- `df.replace()` - Replace values
- `df.rename()` - Rename labels
- `df.rename_axis()` - Rename axis
- `df.assign()` - Add columns
- `df.astype()` - Convert types
- `df.convert_dtypes()` - Infer types
- `df.copy()` - Copy data
- `df.sort_values()` - Sort by values
- `df.sort_index()` - Sort by index
- `df.nlargest()` - N largest values
- `df.nsmallest()` - N smallest values
- `df.reset_index()` - Reset index
- `df.set_index()` - Set index
- `df.reindex()` - Conform to new index
- `df.reindex_like()` - Match another's index
- `df.add_prefix()` - Add prefix to labels
- `df.add_suffix()` - Add suffix to labels
- `df.align()` - Align two objects
- `df.set_axis()` - Set axis labels
- `df.take()` - Select by positions
- `df.truncate()` - Truncate by range
- `df.pivot()` - Pivot table
- `df.pivot_table()` - Pivot with aggregation
- `df.melt()` - Unpivot
- `df.stack()` - Stack columns to index
- `df.unstack()` - Unstack index to columns
- `df.transpose()` / `df.T` - Transpose
- `df.explode()` - Explode lists to rows
- `df.squeeze()` - Reduce dimensions
- `df.droplevel()` - Drop index level
- `df.swaplevel()` - Swap index levels
- `df.swapaxes()` - Swap axes
- `df.reorder_levels()` - Reorder levels
- `df.append()` - Append rows
- `df.merge()` - SQL-style merge
- `df.join()` - Join on index
- `df.concat()` - Concatenate
- `df.compare()` - Compare differences
- `df.update()` - Update values
- `df.combine()` - Combine with function
- `df.combine_first()` - Combine with priority
- `df.add()` / `df.radd()` - Addition
- `df.sub()` / `df.rsub()` - Subtraction
- `df.mul()` / `df.rmul()` - Multiplication
- `df.div()` / `df.rdiv()` - Division
- `df.truediv()` / `df.rtruediv()` - True division
- `df.floordiv()` / `df.rfloordiv()` - Floor division
- `df.mod()` / `df.rmod()` - Modulo
- `df.pow()` / `df.rpow()` - Power
- `df.dot()` - Matrix multiplication
- `df.eq()` - Equal
- `df.ne()` - Not equal
- `df.lt()` - Less than
- `df.le()` - Less than or equal
- `df.gt()` - Greater than
- `df.ge()` - Greater than or equal
- `df.apply()` - Apply function
- `df.applymap()` - Apply element-wise
- `df.map()` - Apply element-wise (alias)
- `df.agg()` / `df.aggregate()` - Aggregate
- `df.transform()` - Transform
- `df.pipe()` - Pipe functions
- `df.groupby()` - Group by (returns GroupBy)
- `df.rolling()` - Rolling window
- `df.expanding()` - Expanding window
- `df.ewm()` - Exponentially weighted
- `df.resample()` - Resample time series
- `df.shift()` - Shift values
- `df.asfreq()` - Convert frequency
- `df.asof()` - Latest value as of time
- `df.at_time()` - Select at time
- `df.between_time()` - Select time range
- `df.first()` / `df.last()` - First/last periods
- `df.first_valid_index()` - First valid index
- `df.last_valid_index()` - Last valid index
- `df.to_period()` - Convert to period
- `df.to_timestamp()` - Convert to timestamp
- `df.tz_convert()` - Convert timezone
- `df.tz_localize()` - Localize timezone
- `df.isna()` / `df.isnull()` - Detect missing
- `df.notna()` / `df.notnull()` - Detect non-missing
- `df.dropna()` - Drop missing
- `df.fillna()` - Fill missing
- `df.ffill()` - Forward fill
- `df.bfill()` - Backward fill
- `df.backfill()` - Backward fill (alias)
- `df.pad()` - Forward fill (alias)
- `df.interpolate()` - Interpolate
- `df.replace()` - Replace values
- `df.to_csv()` - Export to CSV
- `df.to_json()` - Export to JSON
- `df.to_excel()` - Export to Excel
- `df.to_parquet()` - Export to Parquet
- `df.to_feather()` - Export to Feather
- `df.to_hdf()` - Export to HDF5
- `df.to_sql()` - Export to SQL database
- `df.to_stata()` - Export to Stata
- `df.to_pickle()` - Pickle to file
- `df.to_html()` - Render as HTML
- `df.to_latex()` - Render as LaTeX
- `df.to_markdown()` - Render as Markdown
- `df.to_string()` - Render as string
- `df.to_dict()` - Convert to dictionary
- `df.to_records()` - Convert to records
- `df.to_numpy()` - Convert to NumPy
- `df.to_clipboard()` - Copy to clipboard
- `df.to_xarray()` - Convert to xarray
- `df.to_orc()` - Export to ORC
- `df.to_gbq()` - Export to BigQuery
- `df.items()` - Iterate (column, Series) pairs
- `df.iterrows()` - Iterate (index, Series) pairs
- `df.itertuples()` - Iterate as namedtuples
- `df.plot` - Plotting accessor
- `df.plot.*` - Various plot types
- `df.hist()` - Histogram
- `df.boxplot()` - Box plot
- `df.str` - String accessor (for Series)
- `df.dt` - Datetime accessor
- `df.sparse` - Sparse accessor
- `df.style` - Styling accessor
The .str accessor provides all 56 pandas Series.str methods. Methods are implemented in two ways:
These methods return ColumnExpr and remain lazy until execution:
# All these are lazy - no execution until to_df()
ds['name'].str.upper() # → ColumnExpr (lazy)
ds['name'].str.lower() # → ColumnExpr (lazy)
ds['name'].str.len() # → ColumnExpr (lazy)
ds['name'].str.contains('test') # → ColumnExpr (lazy)
ds['name'].str.replace('a', 'b') # → ColumnExpr (lazy)
# Assign to column (still lazy)
ds['upper_name'] = ds['name'].str.upper()
# Execute when needed
df = ds.to_df() # ← SQL executes here

Lazy methods include: upper, lower, len, strip, lstrip, rstrip, contains, startswith, endswith, replace, split, rsplit, slice, pad, center, ljust, rjust, zfill, repeat, find, rfind, index, rindex, match, fullmatch, extract, encode, decode, capitalize, title, swapcase, casefold, normalize, isalnum, isalpha, isdigit, isspace, islower, isupper, istitle, isnumeric, isdecimal, wrap, get, count, join, slice_replace, translate, removeprefix, removesuffix
These methods must execute because they change the return structure:
| Method | Return Type | Why Execution Required |
|---|---|---|
| `partition(sep)` | DataStore (3 columns) | Returns DataFrame with 3 columns (left, sep, right) - cannot be represented as single SQL expression |
| `rpartition(sep)` | DataStore (3 columns) | Same as partition, splits from right |
| `get_dummies(sep)` | DataStore (N columns) | Creates dynamic number of columns based on unique values - column count unknown at query time |
| `extractall(pat)` | DataStore | Returns MultiIndex DataFrame with all regex matches - row count changes |
| `cat(sep)` | str | Aggregates all strings into single value - reduces N rows to 1 scalar |
# These execute immediately and return results
ds['name'].str.partition('|') # → DataStore (3 columns)
ds['name'].str.get_dummies('|') # → DataStore (N dummy columns)
ds['name'].str.extractall(r'\d+') # → DataStore (all matches)
ds['name'].str.cat(sep='-') # → str ("John-Jane-Bob")

cat() is fundamentally different from other string methods:
# Other methods: row-wise transformation (N rows → N rows)
ds['name'].str.upper() # ["john", "jane"] → ["JOHN", "JANE"]
# cat(): aggregation (N rows → 1 scalar)
ds['name'].str.cat(sep='-') # ["john", "jane"] → "john-jane"cat() performs an aggregation operation that:
- Reads all values in the column
- Concatenates them with a separator
- Returns a single string
This cannot be expressed as a per-row SQL expression. It requires executing the data first, then calling pandas' str.cat() method.
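Conceptually, the fallback looks like this small sketch (the column name is an assumption; `to_pandas()` is described later in this document):

```python
# Materialize the column first, then let pandas perform the aggregation.
names = ds['name'].to_pandas()    # executes the query, returns a pd.Series
joined = names.str.cat(sep='-')   # pandas reduces N rows to a single string
```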
- `df.equals()` - Test equality
- `df.compare()` - Show differences
- `df.info()` - Print summary
- `df.memory_usage()` - Memory usage
- `df.copy()` - Copy DataFrame
DataStore has different row ordering guarantees depending on the data source:
For DataFrame sources (in-memory): Row order IS preserved. DataStore uses chDB's built-in _row_id virtual column (available in chDB v4.0.0b5+) to maintain original row order and pandas index:
# ✅ Row order is preserved for DataFrame sources
df = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
ds = DataStore(df)
result = ds[ds['value'] > 10]
# Result preserves original row order and pandas index
assert list(result.index) == [1, 2] # Original indices preserved

For file sources (CSV, Parquet): Row order is NOT guaranteed unless you explicitly specify ORDER BY:
# ❌ Order may vary between executions for file sources
ds = DataStore.from_file("data.csv")
result = ds.filter(ds.value > 50).to_df()
# Row order is NOT guaranteed to match the original file order
# ✅ Explicitly specify ORDER BY for deterministic order
result = ds.filter(ds.value > 50).order_by('id').to_df()
# Rows are ordered by 'id' column

Why the difference? For DataFrame sources, chDB provides a deterministic _row_id virtual column that represents the 0-based row position from the original DataFrame. For file sources, ClickHouse may return rows in any order for better performance (standard SQL behavior).
Impact on comparisons:
- For DataFrame sources: Results should match pandas behavior including row order
- For file sources: Sort both DataFrames first or use set-based comparisons
- Use `df.sort_values('col').reset_index(drop=True)` before `pd.testing.assert_frame_equal()` for file sources
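For example, a file-source comparison against pandas might look like this sketch (assuming data.csv has `id` and `value` columns):

```python
import pandas as pd
from datastore import DataStore

ds = DataStore.from_file("data.csv")
ds_result = ds.filter(ds.value > 50).to_df()

pd_df = pd.read_csv("data.csv")
pd_result = pd_df[pd_df['value'] > 50]

# Sort both sides before comparing, since file-source row order is not guaranteed
pd.testing.assert_frame_equal(
    ds_result.sort_values('id').reset_index(drop=True),
    pd_result.sort_values('id').reset_index(drop=True),
)
```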
DataStore operations are immutable - inplace=True is not supported:
# ❌ Not supported
df.drop(columns=['col'], inplace=True)
# ✅ Correct usage
df = df.drop(columns=['col'])

DataStore uses lazy evaluation for optimal performance:
# DataFrame methods return DataStore
result = ds.drop(columns=['col']) # Returns DataStore
df = result.to_df() # Get underlying DataFrame
# Column access returns ColumnExpr (lazy)
col = ds['column'] # Returns ColumnExpr (displays like Series)
pd_series = col.to_pandas() # Convert to pd.Series when needed
# Aggregations return LazyAggregate (lazy)
mean_result = ds['age'].mean() # Returns LazyAggregate
print(mean_result) # Triggers execution, displays value
# Convert to pandas types
df = ds.to_df() # DataStore → pd.DataFrame
series = ds['col'].to_pandas() # ColumnExpr → pd.Series

DataStore uses a Lazy Execution model, so column operations return ColumnExpr or LazySeries objects instead of pd.Series. When comparing DataStore results with pandas, you need to convert to pandas first:
import pandas as pd
import datastore as ds
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
ds_df = ds.DataFrame(df)
pd_col = df['a']
ds_col = ds_df['a']
# ❌ Wrong: pandas.equals() doesn't recognize DataStore objects
pd_col.equals(ds_col) # Returns False (pandas limitation)
# ✅ Correct: Use to_pandas() to convert first
pd_col.equals(ds_col.to_pandas()) # Returns True
# ✅ Correct: Use DataStore's equals() method (works both ways)
ds_col.equals(pd_col) # Returns True
# ✅ For testing: Use pandas testing utilities
pd.testing.assert_series_equal(pd_col, ds_col.to_pandas())
# ✅ For values only: Use numpy
import numpy as np
np.array_equal(pd_col.values, ds_col.values) # Returns True

Why does pd_col.equals(ds_col) return False?
pandas Series.equals() checks isinstance(other, pd.Series) first. Since DataStore's ColumnExpr is not a pd.Series subclass, it immediately returns False - this is a pandas design limitation.
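The effect is easy to reproduce with plain pandas; any non-Series argument is rejected by the type check before values are even compared:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

class NotASeries:          # stands in for any non-Series object, e.g. a ColumnExpr
    pass

print(s.equals(NotASeries()))   # False - the isinstance check fails first
```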
Recommended patterns:
| Use Case | Recommended Method |
|---|---|
| Compare values | ds_col.equals(pd_col) or pd_col.equals(ds_col.to_pandas()) |
| Test assertions | pd.testing.assert_series_equal(pd_result, ds_result.to_pandas()) |
| Values only | np.array_equal(pd_col.values, ds_col.values) |
| Float comparison | np.allclose(pd_col.values, ds_col.values) |
Operations that return Series in pandas return lazy objects in DataStore:
# Returns ColumnExpr (lazy), displays like pandas Series
series = ds['column']
print(type(series)) # <class 'datastore.column_expr.ColumnExpr'>
# Convert to pandas when needed
pd_series = series.to_pandas()
print(type(pd_series)) # <class 'pandas.core.series.Series'>
# Multiple columns return DataStore
datastore = ds[['col1', 'col2']]
print(type(datastore)) # <class 'datastore.core.DataStore'>
# Method calls on columns return LazySeries
head_result = ds['column'].head(5)
print(type(head_result)) # <class 'datastore.lazy_result.LazySeries'>
# LazySeries executes on access (values, index, repr, etc.)
values = head_result.values # Triggers execution

Why Lazy? This enables SQL query optimization and deferred execution.
Both DataStore and ColumnExpr provide to_pandas() for explicit conversion:
# DataStore to DataFrame
ds = DataStore.from_file("data.csv")
df = ds.to_pandas() # Returns pd.DataFrame (alias for to_df())
# ColumnExpr to Series
col = ds['age']
series = col.to_pandas() # Returns pd.Series
# Useful for library interoperability
import seaborn as sns
sns.histplot(ds['age'].to_pandas())

The INSERT VALUES method has been renamed to avoid conflicts:
# Old (conflicts with df.values property)
ds.insert_into('id', 'name').values(1, 'Alice')
# New (recommended)
ds.insert_into('id', 'name').insert_values(1, 'Alice')

DataStore implements a sophisticated Mixed Execution Engine that enables arbitrary mixing of SQL and pandas operations.
After execution, SQL-style operations use chDB's Python() table function to execute SQL directly on cached DataFrames, enabling true mixed execution.
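As an illustration of that underlying mechanism (not DataStore's internal code), chDB can run SQL over an in-memory DataFrame roughly like this; it assumes a chdb version that supports the Python() table function:

```python
import chdb
import pandas as pd

# Stand-in for the cached DataFrame produced by an earlier pandas step
cached_df = pd.DataFrame({'emp_age': [28, 35, 41], 'emp_salary': [50000, 62000, 70000]})

# chDB resolves `cached_df` from the calling scope via the Python() table function
result = chdb.query("SELECT * FROM Python(cached_df) WHERE emp_age > 30")
print(result)
```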
Stage 1: SQL Query Building (Lazy)
ds = DataStore.from_file("data.csv")
ds1 = ds.select('*') # Builds: SELECT *
ds2 = ds1.filter(ds.age > 25) # Adds: WHERE age > 25
# ds2._executed = False (no execution yet)

Stage 2: Execution (First pandas Operation)
ds3 = ds2.add_prefix('emp_') # ← Executes SQL here!
# ds3._executed = True
# ds3._cached_df = DataFrame with filtered data and prefixed columns

Stage 3: SQL on DataFrame (chDB Magic)
ds4 = ds3.filter(ds.emp_age > 30) # SQL on DataFrame!
# Internally: SELECT * FROM Python(__datastore_cached_df__) WHERE emp_age > 30
# ds4._executed = True (result cached)

Example 1: SQL → Pandas → SQL → Pandas
result = (ds
.filter(ds.age > 25) # SQL query building
.add_prefix('emp_') # Pandas (executes)
.filter(ds.emp_salary > 55000) # SQL on DataFrame!
.fillna(0)) # Pandas on DataFrame

Example 2: Pandas → SQL → Pandas → SQL
result = (ds
.rename(columns={'id': 'ID'}) # Pandas (executes)
.filter(ds.ID > 5) # SQL on DataFrame
.sort_values('salary') # Pandas
.select('ID', 'name', 'salary')) # SQL on DataFrame again!

Example 3: Complex Mixed Chain
result = (ds
.select('*') # SQL 1
.filter(ds.status == 'active') # SQL 2
.assign(revenue=lambda x: x['price'] * x['qty']) # Pandas (executes)
.filter(ds.revenue > 1000) # SQL 3 on DataFrame
.add_prefix('sales_') # Pandas
.query('sales_revenue > 5000') # Pandas
.select('sales_id', 'sales_customer', 'sales_revenue')) # SQL 4 on DataFrame

For detailed execution plan visualization, see Explain Method.
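A hypothetical sketch of what that might look like; the exact explain() signature and output are documented under Explain Method, so treat this call shape as an assumption:

```python
# Inspect where the SQL/pandas boundary falls in a mixed chain (assumed API)
plan = ds.filter(ds.status == 'active').add_prefix('sales_').explain()
print(plan)
```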
Let the query engine do heavy filtering before pandas operations:
# ✅ Efficient
result = (ds
.select('*')
.filter(ds.date >= '2024-01-01') # SQL filter
.filter(ds.amount > 1000) # SQL filter
.assign(margin=lambda x: x['profit'] / x['revenue']) # Pandas transform
.groupby('category').agg({'revenue': 'sum'})) # Pandas aggregation
# ❌ Less efficient
result = (ds
.to_df() # Load all data
.query('date >= "2024-01-01" and amount > 1000')) # Filter in memoryOnce executed (pandas operation applied), all subsequent operations use cached data:
ds = DataStore.from_file("big_data.csv")
# SQL operations - build query (lazy)
ds_filtered = ds.select('*').filter(ds.value > 0) # No execution yet
# First pandas operation - executes
ds_prefixed = ds_filtered.add_prefix('col_') # ← Query executes here!
# All subsequent operations use cached DataFrame
mean = ds_prefixed.mean() # Uses cache, no SQL
std = ds_prefixed.std() # Uses cache, no SQL
df = ds_prefixed.to_df() # Returns cache, no SQL

Best Practice: Filter in SQL, transform in pandas
# ✅ Optimal: SQL filtering → Pandas transformation
result = (ds
.select('*')
.filter(ds.date >= '2024-01-01') # SQL: Filters billions of rows
.filter(ds.amount > 1000) # SQL: More filtering
# ↑ Query built but not executed yet
.add_prefix('col_') # ← Executes SQL here, caches result
.fillna(0) # Pandas: Works on cached result
.assign(margin=lambda x: x['col_profit'] / x['col_revenue'])) # Pandas

Chain multiple operations for better readability and potential optimization:
result = (ds
.drop(columns=['unused1', 'unused2'])
.fillna(0)
.assign(
revenue=lambda x: x['price'] * x['quantity'],
margin=lambda x: x['profit'] / x['revenue']
)
.query('margin > 0.2')
.sort_values('revenue', ascending=False)
.head(100))

cleaned = (ds
.drop(columns=['temp_col'])
.dropna(subset=['important_col'])
.drop_duplicates()
.fillna({'numeric_col': 0, 'string_col': 'unknown'})
.astype({'id': 'int64', 'amount': 'float64'}))

featured = ds.assign(
revenue=lambda x: x['price'] * x['quantity'],
profit=lambda x: x['revenue'] - x['cost'],
margin=lambda x: x['profit'] / x['revenue'],
high_value=lambda x: x['revenue'] > 1000
)

ts_result = (ds
.set_index('date')
.sort_index()
.asfreq('D')
.ffill()
.rolling(window=7).mean()
.shift(1))

# Calculate year-over-year growth
growth = (current_year
.set_index('product')
.sub(last_year.set_index('product'))
.div(last_year.set_index('product'))
.mul(100))

# Complex filtering and transformation
result = (ds
.query('age > 18 and income > 50000')
.assign(
segment=lambda x: pd.cut(x['income'],
bins=[0, 75000, 150000, float('inf')],
labels=['Low', 'Medium', 'High'])
)
.where(lambda x: x['score'] > 0, 0)
.groupby('segment')
.agg({'income': 'mean', 'score': 'sum'}))

# Optimal workflow
result = (ds
# Use SQL for heavy filtering
.select('customer_id', 'order_date', 'amount', 'product_category')
.filter(ds.order_date >= '2024-01-01')
.filter(ds.order_date < '2024-02-01')
.filter(ds.amount > 0)
# Use pandas for complex transformations
.assign(
month=lambda x: pd.to_datetime(x['order_date']).dt.month,
is_high_value=lambda x: x['amount'] > x['amount'].quantile(0.75)
)
.groupby(['customer_id', 'month'])
.agg({
'amount': ['sum', 'mean', 'count'],
'is_high_value': 'sum'
})
.reset_index()
# Export
.to_parquet('customer_monthly_summary.parquet'))

- `inplace=True` parameter (DataStore is immutable)
- Some deprecated pandas methods
- Methods that don't make sense for DataStore (e.g., `from_dict`, `from_records` as instance methods)
- `df.groupby()` - Returns pandas GroupBy object, not DataStore
- Class methods - Return pandas objects, not DataStore
DataStore requires pandas >= 2.1.0 and Python >= 3.9.
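If you want to fail fast in scripts, a small guard reflecting those minimums might look like this sketch:

```python
import sys
import pandas as pd

assert sys.version_info >= (3, 9), "DataStore requires Python >= 3.9"
major, minor = (int(x) for x in pd.__version__.split(".")[:2])
assert (major, minor) >= (2, 1), "DataStore requires pandas >= 2.1.0"
```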
| Feature | Status | Notes |
|---|---|---|
| `DataFrame.map()` | ✅ Available | Recommended over `applymap()` |
| `groupby.apply(include_groups=...)` | ✅ Available | Control group column inclusion |
| Nullable type handling | ✅ Full support | Best dtype preservation |
| `first()`/`last()` deprecation | ✅ FutureWarning | Warns about offset string usage |
- For production: Pin specific pandas version to ensure consistent behavior
- For best performance: Use pandas 2.2+ for latest optimizations
- Documentation: See DataStore README
- Examples: Check examples/example_pandas_extended.py
- Pandas Docs: https://pandas.pydata.org/docs/reference/frame.html
DataStore provides comprehensive pandas DataFrame API compatibility with seamless integration:
- ✅ 209 pandas DataFrame methods implemented
- ✅ 56 pandas `.str` accessor methods (all pandas str methods covered)
- ✅ 42+ pandas `.dt` accessor methods (plus ClickHouse datetime extras)
- ✅ 334 ClickHouse functions mapped to Pandas-like API
- ✅ ClickHouse-specific accessors: `.arr` (37 methods), `.json`, `.url`, `.ip`, `.geo`
- ✅ Mix SQL queries with pandas transformations
- ✅ Automatic DataFrame/Series wrapping
- ✅ Performance optimization through caching
- ✅ Immutable, thread-safe operations
Use DataStore when you need the power of pandas with the performance of SQL query optimization!