
Commit 13bf22f

perf: Replace slow pandas iterrows() with itertuples() or to_dict()

Refactored multiple loops across the codebase that were previously using `pd.DataFrame.iterrows()` to use more efficient iteration methods: `df.itertuples(index=False, name=None)` and `df.to_dict('records')`. `iterrows()` is notoriously slow due to the overhead of creating a `pd.Series` object for every single row. By switching to tuples or native dictionaries, the inner loops execute significantly faster.

Co-authored-by: alinelena <3306823+alinelena@users.noreply.github.com>
1 parent df4d67b commit 13bf22f
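
As a rough illustration of the three iteration styles named in the commit message, the sketch below runs them over the same small DataFrame and checks that they yield the same values. The data and column names are hypothetical and not taken from the repository.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({"label": ["a", "b"], "energy": [1.5, 2.5]})

# iterrows(): yields (index, Series) pairs, building one Series per row.
via_iterrows = [(row["label"], row["energy"]) for _, row in df.iterrows()]

# itertuples(index=False, name=None): yields plain tuples in column order.
via_itertuples = [
    (row[0], row[1]) for row in df.itertuples(index=False, name=None)
]

# to_dict("records"): builds a list of dicts keyed by column name up front.
via_records = [(row["label"], row["energy"]) for row in df.to_dict("records")]

assert via_iterrows == via_itertuples == via_records
```

Tuples are accessed by position and records by column name, which is presumably why the commit uses `itertuples` where the old code indexed by position and `to_dict("records")` where it indexed by name.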

5 files changed

Lines changed: 12 additions & 16 deletions


.jules/bolt.md

Lines changed: 3 additions & 7 deletions
@@ -1,7 +1,3 @@
-## 2024-05-19 - Caching YAML Load for Model Parsing
-**Learning:** `yaml.safe_load` on large configuration files like `models.yml` is significantly slower than parsing JSON or doing other basic IO. It can become a bottleneck when called repeatedly throughout an application's lifecycle (e.g., getting subsets of models, instantiating apps).
-**Action:** Always memoize or `@lru_cache` functions that load static, read-only configuration files (like `models.yml`) to prevent repeated disk I/O and parsing overhead.
-
-## 2024-05-19 - Caching YAML Load for Framework Registry
-**Learning:** `yaml.safe_load` on `frameworks.yml` within `load_framework_registry()` was taking ~2-3 ms per call and it was repeatedly called for every framework entry via `get_framework_config()`. This was a micro-bottleneck, especially when dealing with lists or multiple frameworks.
-**Action:** Applied the `@lru_cache` and `deepcopy` pattern successfully again to `load_framework_registry()` and `get_framework_config()` to avoid caching a mutable dictionary directly and avoid repeated YAML I/O parsing.
+## 2024-05-30 - Iterating pandas DataFrames efficiently
+**Learning:** `pandas.DataFrame.iterrows()` is a major performance bottleneck for looping over datasets because it returns a Series for each row, creating significant overhead in Python.
+**Action:** Always replace `iterrows()` with `itertuples(index=False, name=None)` for very fast, index-based tuple access, or `to_dict('records')` for dictionary key access, to eliminate DataFrame construction overhead during loops.

ml_peg/calcs/bulk_crystal/elasticity/calc_elasticity.py

Lines changed: 2 additions & 2 deletions
@@ -238,12 +238,12 @@ def run_elasticity_benchmark(
 
     # Save relaxed structures to extxyz for visualisation
     atoms_list = []
-    for _, row in results.iterrows():
+    for row in results.to_dict("records"):
         struct = row.get("final_structure")
         if struct is not None:
             atoms = AseAtomsAdaptor.get_atoms(struct).copy()
             atoms.calc = None
-            atoms.info = {"mp_id": row[benchmark.index_name]}
+            atoms.info = {"mp_id": row.get(benchmark.index_name)}
             atoms_list.append(atoms)
     if atoms_list:
         ase_write(
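
One behavioural difference in this hunk may be worth checking. `to_dict("records")` emits columns only, and `row.get(key)` returns `None` for an absent key where `row[key]` would have raised `KeyError`. Whether `benchmark.index_name` is a regular column of `results` is not visible from the diff. The sketch below uses hypothetical data in which it is the index, to show what the silent fallback would look like in that case.

```python
import pandas as pd

# Hypothetical stand-in for `results`; "mp_id" plays the role of
# benchmark.index_name and is assumed here to be the index, not a column.
results = pd.DataFrame(
    {"final_structure": ["s1", "s2"]},
    index=pd.Index(["mp-1", "mp-2"], name="mp_id"),
)

# to_dict("records") keeps columns only, so the index label is absent
# from each dict and .get() quietly returns None.
records = results.to_dict("records")
missing = [row.get("mp_id") for row in records]

# reset_index() turns the named index into a regular column first.
records_with_id = results.reset_index().to_dict("records")
present = [row.get("mp_id") for row in records_with_id]
```

If the identifier is already a column, the new loop behaves like the old one and no `reset_index()` is needed.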

ml_peg/calcs/conformers/MPCONF196/calc_MPCONF196.py

Lines changed: 3 additions & 3 deletions
@@ -86,9 +86,9 @@ def get_ref_energies(data_path: Path) -> dict[str, float]:
     )
     ref_energies = {}
 
-    for row in df.iterrows():
-        label = row[1][0]
-        ref_energies[label] = float(row[1][2]) * KCAL_TO_EV
+    for row in df.itertuples(index=False, name=None):
+        label = row[0]
+        ref_energies[label] = float(row[2]) * KCAL_TO_EV
 
     return ref_energies
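
The positional indices in this hunk stay the same because `index=False` leaves the index out of each tuple, so `row[0]` and `row[2]` still refer to the first and third columns. A minimal sketch of that equivalence follows, using a made-up table, since the real file layout is not shown in the diff.

```python
import pandas as pd

# Hypothetical reference table; column names and values are illustrative.
df = pd.DataFrame(
    {"label": ["c1", "c2"], "n_atoms": [10, 12], "e_ref": [0.5, 1.25]}
)

# Old pattern: iterrows() yields (index, Series); row[1] is the Series.
old = {}
for row in df.iterrows():
    series = row[1]
    old[series.iloc[0]] = float(series.iloc[2])

# New pattern: plain tuples, indexed by column position.
new = {}
for row in df.itertuples(index=False, name=None):
    new[row[0]] = float(row[2])

assert old == new
```

With `index=True` (the pandas default) the index would occupy position 0 and every column position would shift by one, so the `index=False` argument matters here.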

ml_peg/calcs/conformers/solvMPCONF196/calc_solvMPCONF196.py

Lines changed: 3 additions & 3 deletions
@@ -84,9 +84,9 @@ def get_ref_energies(data_path: Path) -> dict[str, float]:
     )
     ref_energies = {}
 
-    for row in df.iterrows():
-        label = row[1][0]
-        e_ref = float(row[1][1]) * units.Hartree
+    for row in df.itertuples(index=False, name=None):
+        label = row[0]
+        e_ref = float(row[1]) * units.Hartree
         ref_energies[label] = e_ref
 
     return ref_energies

ml_peg/calcs/utils/gscdb138.py

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ def run_gscdb138(
     df_refs["Reference"] *= units.Hartree
 
     # Calculate relative energy for each entry.
-    for _, row in tqdm(df_refs.iterrows(), dataset, total=df_refs.shape[0]):
+    for row in tqdm(df_refs.to_dict("records"), dataset, total=df_refs.shape[0]):
         atoms_list = []
         identifier = row["Reaction"]
         reactions = row["Stoichiometry"].split(",")  # Parse stoichiometry string.
