The future of data collection #1944
Replies: 14 comments 57 replies
-
Here is an illustration of the API overlap with AgentSet:

```python
# Proposed collect API
"gini": collect(model.agents, "wealth", function=calculate_gini)
# Equivalent with AgentSet
"gini": lambda model: calculate_gini(model.agents.get("wealth"))

# Proposed collect API
"n_quiescent": collect(
    model.get_agents_of_type(Citizen), "condition",
    func=lambda x: len([entry for entry in x if entry == "Quiescent"]),
)
# Equivalent with AgentSet
"n_quiescent": lambda model: len(
    model.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent")
)
```
-
Just for reference, this information is outdated. Python dictionaries used to be unordered. In Python 3.6, insertion order became an implementation detail of CPython (the reference implementation of Python), but since Python 3.7 insertion order is guaranteed, so it is perfectly fine to rely on it. That said, the mental model for dictionaries is still set-oriented (which I think is the right model). So I agree that it would be confusing if this works:

```python
DataCollector(model, collectors={
    "wealth": collect(model.agents, "wealth"),
    "gini": collect("wealth", func=calculate_gini),
})
```

but this doesn't:

```python
DataCollector(model, collectors={
    "gini": collect("wealth", func=calculate_gini),
    "wealth": collect(model.agents, "wealth"),
})
```

So we would still have to work around this problem internally, which complicates the code. But I don't think we need tiered data collectors at all. I think they are a bit hard to understand and provide little benefit. As I understand it, they are basically a performance optimization, so that you don't need to loop over all agents more than once. For small to medium models I don't think it's a problem at all. For larger models, or if you really do lots of simulation runs, yes, it can matter. But then a better solution would be to calculate your derived variables afterwards. That is, you just collect the wealth attribute, turn your data collection into a pandas DataFrame, and calculate the gini coefficient from the DataFrame. That is probably even faster, because pandas can vectorize the calculation across all rows. This way you also don't mix any logic into your data collector. I think it is actually bad practice to calculate things in the data collector; it should basically be an observer. If you have the gini coefficient in your model definition, feel free to collect it. Otherwise, calculate it as part of the data analysis. So for me the callable should only be used to filter your objects (e.g., only a certain type, or based on a condition).
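To make the "calculate afterwards" route concrete, here is a minimal sketch. It assumes Mesa's current agent-level output shape (a DataFrame with a `(Step, AgentID)` MultiIndex and a `wealth` column collected via `agent_reporters`) and a run `model` instance; the gini implementation is one common formulation, not Mesa code:

```python
import pandas as pd

def calculate_gini(wealth: pd.Series) -> float:
    # Gini coefficient via the cumulative-sum identity over sorted values
    x = wealth.sort_values().to_numpy()
    n = len(x)
    cum = x.cumsum()
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Collect only the raw attribute during the run...
agent_df = model.datacollector.get_agent_vars_dataframe()

# ...and derive the per-step gini coefficient afterwards
gini_per_step = agent_df.groupby(level="Step")["wealth"].apply(calculate_gini)
```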
-
This is my summary of the problems in the current data collector, written for the rest of @projectmesa/maintainers. I'd like your opinions, so that this can happen in time for the 3.0 release. I think this should not be a GSoC 2024 project.

Data collection problems:
-
I suggest we try to contain the discussion on DataCollection here rather than having it spread over multiple locations; I am getting confused trying to find all the useful ideas and discussions. So rather than respond in #1933, I'll respond here. In #1933, @rht wrote [...]

I am not entirely sure about this. DataFrames, for me, are associated with analyzing the results of a run. So, in my branch, [...]. Measures, in my understanding, are [...]

So is State a single thing, or can it be multiple things? For example, an agent's position is clearly part of the agent's (and, by extension, the model's) state. However, most of the time, position will be some tuple, so somewhere we have to translate the position into its elements. Do we want to do this in Measure, which would imply having multiple "fields" in a measure, or do we handle this downstream, wherever Measure is being used? I personally am inclined to handle this further downstream. To continue the position example: in data collection, we might want to split position into x, y (and z); for visualization, however, this splitting might not be required. So I am unsure if we need multiple attributes/functions on a Measure. Instead, in my current thinking, a Measure always reflects a single state variable.
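To make the downstream option concrete, here is a minimal pandas sketch of splitting a collected `pos` tuple into separate columns during analysis; the data and column names are purely illustrative:

```python
import pandas as pd

# Illustrative collected data: one pos tuple per agent
df = pd.DataFrame({"pos": [(0, 1), (2, 3), (4, 5)]})

# Split the tuple into x and y columns downstream, during analysis;
# the Measure itself never needs to know about multiple fields
df[["x", "y"]] = pd.DataFrame(df["pos"].tolist(), index=df.index)
```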
-
A quick update from my side. I have been trying to figure out a way to make it possible to access the value of a Measure as an attribute. So the basic idea is that the following code works:

```python
class Measure:
    def __init__(self, group, function):
        self.group = group
        self.function = function

    def get_value(self):
        return self.function(self.group)


class MyModel(Model):
    def __init__(self, *args, **kwargs):
        # some initialization code goes here
        self.gini = Measure(self.agents, lambda agents: calculate_gini(agents.get("wealth")))


if __name__ == "__main__":
    model = MyModel()
    print(model.gini)  # should actually do model.gini.get_value()
```

This turns out to be not trivial, because in this example:

```python
class Measure:
    def __init__(self, model, identifier, *args, **kwargs):
        self.model = model
        self.identifier = identifier

    def get_value(self):
        return getattr(self.model, self.identifier)


class MeasureDescriptor:
    def __set_name__(self, owner, name):
        self.public_name = name
        self.private_name = "_" + name

    def __get__(self, obj, owner):
        # Delegate attribute access to the underlying Measure instance
        return getattr(obj, self.private_name).get_value()

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)


class Model:
    def __setattr__(self, name, value):
        if isinstance(value, Measure) and not name.startswith("_"):
            # Install a descriptor on the class; store the Measure on the instance
            klass = type(self)
            descr = MeasureDescriptor()
            descr.__set_name__(klass, name)
            setattr(klass, name, descr)
            descr.__set__(self, value)
        else:
            super().__setattr__(name, value)

    def __init__(self, identifier, *args, **kwargs):
        self.gini = Measure(self, "identifier")
        self.identifier = identifier


if __name__ == "__main__":
    model1 = Model(1)
    model2 = Model(2)
    print(model1.gini)
    print(model2.gini)
```

To make [...] I hope this explanation is clear enough. I admit it is a bit convoluted. It is also one of the only ways I have been able to come up with so far that makes it possible for Measures to behave as if they are normal attributes. Please let me know what you think of this direction for implementing Measure, or whether the complexity is not worth it and we forgo the idea of having Measure behave as if it is an attribute that returns a simple value (e.g., int, float, string).
-
Thanks a lot for this. I think we're on the right track; I would only change the abstraction level on which the [...]

There are basically the following problems: [...]

Then there is the complication that you sometimes have an object whose members have attributes (like an AgentSet) and sometimes just have an object with attributes directly (like a Model). So basically there are three levels that need to be defined: [...]

You can already see how complicated this can possibly get. I will try to think about some possible abstractions, but feel free to build on this in the meantime.
-
On Group: I can see how groups can be used outside of the measure and data collection use case. They may be reused to organize agent step execution as well, e.g., if I want only the quiescent citizens in the Epstein civil violence model to take certain actions:

```python
def step(self):  # of a model
    # Instead of
    self.agents.select(agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent").do("rest")
    # we do
    self.quiescents.do("rest")
```

What about doing addition on the groups?

```python
# The drawback being this is not cacheable
(self.quiescents + self.injured_cops).do("rest")
# Needs to be
self.needs_rest = Group(self.quiescents + self.injured_cops)
self.needs_rest.do("rest")
```
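For concreteness, here is a rough sketch of what such a `Group` could look like, assuming it wraps a selection callable so membership can be re-evaluated (or cached) each step. `Group` and its caching policy are hypothetical; only `AgentSet.select`/`do` come from Mesa:

```python
class Group:
    """Hypothetical named, reusable agent selection."""

    def __init__(self, model, select_func):
        self.model = model
        self.select_func = select_func

    @property
    def agents(self):
        # Re-evaluate the selection on access; a real implementation
        # might cache the resulting AgentSet per model step instead.
        return self.select_func(self.model)

    def do(self, method_name, *args, **kwargs):
        return self.agents.do(method_name, *args, **kwargs)


# Usage sketch inside a model's __init__:
# self.quiescents = Group(self, lambda m: m.agents.select(
#     agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"))
```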
-
The problem is an extension/detailing of point 6. Let me try to explain one of the details I am currently stuck on in a bit more depth. The basic idea of a Collector is that it retrieves one or more attributes from an object or a collection of objects, and optionally applies a callable to the result. The issue is that there is no way to specify the return type of this optional callable in the current design. This return type matters because it affects how data is stored in the collector and how it will be turned into a dataframe in [...]. So, for example, we are retrieving [...]

One idea I had after the conversation with @EwoutH is that the entire problem is analogous to, e.g., pandas.DataFrame.apply: when collecting data from a collection of objects and then applying a callable to it, the user should specify the "axis" over which this function operates. If you operate over the "columns", you are aggregating information across all objects, while if you operate over the "rows", the function is applied to the collected data for each object separately. I hope this helps to clarify the issue.
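A small pandas illustration of that axis analogy, with agents as rows and collected attributes as columns (data and names purely illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative per-agent data: rows are agents, columns are attributes
df = pd.DataFrame({"energy": [10, 12, 8], "age": [3, 5, 2]}, index=[1, 2, 3])

# Operating over "columns" (axis=0): aggregate across all agents,
# yielding one value per attribute -> a model-level result
attribute_means = df.apply(np.mean, axis=0)

# Operating over "rows" (axis=1): apply the callable per agent,
# yielding one value per agent -> an agent-level result
per_agent_score = df.apply(lambda row: row["energy"] / (row["age"] + 1), axis=1)
```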
-
Played around a bit a few days ago. Now that we have our very powerful AgentSet, the API seems to be able to get simpler:

```python
datacollector = DataCollector(
    collectors=[
        c(target=Model, attributes=["n_agents"], methods=calculate_energy),
        c(target=Wolf, attributes=["sheep_eaten"]),
        c(target=Sheep, attributes=["age"], methods=calculate_energy),
        c(target=model.agents, attributes=["energy"], agg={"energy": np.mean}),
    ]
)
```

A few notes:

```python
c(target=Model, attributes=["n_agents"], methods=calculate_energy)
```

gives

```python
{
    f"{Model.__name__}_n_agents": {...},
    f"{Model.__name__}_{calculate_energy.__name__}": {...},
}
```

Just one approach. Don't know if it's the best.
-
I took another stab at working out an API and data storage format:

**Proposed API Design**

The core of the proposal is a unified `collect()` function:

```python
import mesa
import numpy as np
from mesa.datacollection import DataCollector, collect

class WolfSheepModel(mesa.Model):
    def __init__(self, n_wolves=10, n_sheep=50, grass_regrowth_time=30):
        super().__init__()
        # [...model initialization...]

        # Initialize the data collector with various collectors
        self.datacollector = DataCollector([
            # Model-level attributes
            collect(target=self, attributes=["steps", "living_wolves", "living_sheep"]),

            # Agent type-specific collection
            collect(target=Wolf, attributes=["energy", "sheep_eaten"]),
            collect(target=Sheep, attributes=["energy", "grass_eaten"]),

            # Dynamic agent filtering
            collect(
                target=self.agents.select(lambda a: a.energy < 2),
                attributes=["energy", "pos"],
                name="starving_agents"
            ),

            # Aggregated metrics
            collect(
                target=self.agents,
                attributes=["energy"],
                aggregates={
                    "mean_energy": np.mean,
                    "energy_gini": self.calculate_gini
                }
            ),

            # Custom function
            collect(
                target=self,
                function=lambda m: self.calculate_spatial_density(),
                name="spatial_density"
            )
        ])
```

**Data Access**

The collected data can then be retrieved as (multi-indexed) DataFrames:

```python
# Run the model
model = WolfSheepModel()
for _ in range(100):
    model.step()

# Get all data as a comprehensive DataFrame (long format)
all_data = model.datacollector.get_dataframe()
"""
Step  DataType    Entity  ID  Attribute      Value
0     model       Model   -   steps          0
0     model       Model   -   living_wolves  10
0     agents      Wolf    1   energy         20
0     aggregates  -       -   mean_energy    17.5
...
"""

# Get specific data with multi-index DataFrames
wolf_df = model.datacollector.get_dataframe(target=Wolf)
"""
          energy  sheep_eaten
Step ID
0    1        20            0
     2        18            0
...
"""

# Filter by attribute across all agent types
energy_data = model.datacollector.get_dataframe(attribute="energy")
"""
                energy
Step Type   ID
0    Wolf   1       20
     Sheep  3       15
...
"""

# Get dynamically filtered collections by name
starving_df = model.datacollector.get_dataframe(name="starving_agents")

# Get aggregated metrics
aggregates = model.datacollector.get_dataframe(data_type="aggregates")
"""
      mean_energy  energy_gini
Step
0           17.50         0.11
1           16.25         0.12
...
"""

# Additional filtering options
time_range_df = model.datacollector.get_dataframe(time_range=(10, 20))
long_format_df = model.datacollector.get_dataframe(format="long")
```

The multi-indexed DataFrames enable powerful analysis:

```python
# Average energy by agent type over time
energy_by_type = energy_data.groupby(level=["Step", "Type"]).mean()

# Calculate rate of change in wolf population
wolves_over_time = wolf_df.groupby(level="Step").size()
population_change = wolves_over_time.diff()
```

**Memory-Efficient Internal Structure**

The internal data structure is optimized to avoid string duplication and uses arrays for efficient storage:

```python
{
    # Schema defined once - no string duplication
    "schema": {
        "Wolf": ["energy", "sheep_eaten"],
        "Sheep": ["energy", "grass_eaten"],
        "model": ["steps", "living_wolves", "living_sheep"],
        "aggregates": ["mean_energy", "energy_gini"]
    },
    # Data storage uses position-based arrays matching the schema
    "data": {
        1: {  # Timestep
            "model": [1, 8, 42],  # Values match schema positions
            "agents": {
                "Wolf": {
                    "ids": [1, 2, 3],
                    "values": [
                        [10, 2],  # Agent 1: [energy, sheep_eaten]
                        [12, 1],  # Agent 2: [energy, sheep_eaten]
                        [8, 0]    # Agent 3: [energy, sheep_eaten]
                    ]
                },
                "Sheep": {
                    "ids": [4, 5, 6],
                    "values": [
                        [5, 3],  # Agent 4: [energy, grass_eaten]
                        [6, 4],  # Agent 5: [energy, grass_eaten]
                        [4, 2]   # Agent 6: [energy, grass_eaten]
                    ]
                }
            },
            "aggregates": [7.5, 0.18]  # Values match schema positions
        }
    }
}
```

I hope to have found a balance with this API and data storage between flexibility, powerful features, collection and storage efficiency, and ease of use. Curious what everyone thinks.
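To show that the two representations line up, here is a small sketch (my own, not part of the proposal) that expands the positional storage above into the long-format DataFrame from the Data Access section; `storage` stands for the nested dict shown above:

```python
import pandas as pd

def to_long_dataframe(storage: dict) -> pd.DataFrame:
    """Expand schema + positional arrays into one long-format table."""
    rows = []
    for step, entry in storage["data"].items():
        # Model-level values: one row per attribute in the schema
        for attr, value in zip(storage["schema"]["model"], entry["model"]):
            rows.append((step, "model", "Model", None, attr, value))
        # Agent-level values: one row per agent per attribute
        for agent_type, block in entry["agents"].items():
            attrs = storage["schema"][agent_type]
            for agent_id, values in zip(block["ids"], block["values"]):
                for attr, value in zip(attrs, values):
                    rows.append((step, "agents", agent_type, agent_id, attr, value))
        # Aggregates: one row per aggregate in the schema
        for attr, value in zip(storage["schema"]["aggregates"], entry["aggregates"]):
            rows.append((step, "aggregates", None, None, attr, value))
    return pd.DataFrame(
        rows, columns=["Step", "DataType", "Entity", "ID", "Attribute", "Value"]
    )
```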
-
```python
import numpy as np

from mesa import Model
from mesa.datacollection import DataCollector


class PredatorPreyModel(Model):
    def __init__(self, num_wolves=10, num_sheep=50, seed=None):
        super().__init__(seed=seed)
        # Create agents
        for _ in range(num_wolves):
            Wolf(self, age=self.random.randint(0, 10), energy=self.random.randint(50, 100))
        for _ in range(num_sheep):
            Sheep(self, age=self.random.randint(0, 5), energy=self.random.randint(50, 100))

        # Setup a datacollector with empty model_reporters
        # We'll manually add values to it in our collect_data method
        self.datacollector = DataCollector()

    def collect_data(self):
        """Calculate and collect all model metrics."""
        # Get agent sets we need
        wolves = self.agents.select(agent_type=Wolf)
        sheep = self.agents.select(agent_type=Sheep)
        mature_wolves = wolves.select(lambda a: a.age > 5)

        # Calculate all metrics
        metrics = {
            "Wolves": len(wolves),
            "Sheep": len(sheep),
            "Mature Wolves": len(mature_wolves),
            "Average Wolf Energy": wolves.agg("energy", np.mean) if wolves else 0,
            "Total Sheep Wool": sheep.agg("wool", sum) if sheep else 0,
        }

        # Add current step's metrics to the datacollector
        for key, value in metrics.items():
            if key not in self.datacollector.model_vars:
                self.datacollector.model_vars[key] = []
            self.datacollector.model_vars[key].append(value)

    def step(self):
        # Model logic
        hungry_wolves = self.agents.select(
            agent_type=Wolf,
            filter_func=lambda a: a.energy < 30
        )
        sheep = self.agents.select(agent_type=Sheep)
        for wolf in hungry_wolves:
            if sheep:
                prey = self.random.choice(list(sheep))
                wolf.energy += prey.energy * 0.5
                wolf.kills += 1
                prey.remove()
        self.agents.shuffle_do("step")

        # Collect data
        self.collect_data()
```

Once you stop defining everything in your Model's `__init__`, dynamic selection and aggregation become much easier.
-
Quick thought I originally raised in my Mesa-Frames proposal, but wanted to mention here as well: event-based data collection, where data is collected only when defined conditions are met, like: [...]

Although the initial idea was to reduce redundant data storage at mesa-frames scales, I think it would be useful here as well. Example: [...]
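(The original examples above are elided. As a stand-in, here is a hedged sketch of what condition-triggered collection could look like, reusing the hypothetical `trigger`-style parameter that also appears in the proposal below; `n_active` is an assumed model attribute:)

```python
# Hypothetical API: a collector that only fires when a predicate on the model
# holds, e.g. to snapshot agent state only while an outbreak is ongoing
collect(
    name="outbreak_snapshot",
    target=Citizen,
    attributes=["condition"],
    trigger=lambda model: model.n_active > 50,
)
```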
-
So I threw this whole conversation into Gemini 2.5 Pro, and it came up with this. I think this is the best API I've seen so far. Curious what everybody thinks. If it looks good, I will try to move forward with an implementation.

**Synthesizing a Declarative [...]**

Hi all,

This has been a really valuable (and extensive!) discussion on the future of data collection in Mesa. Reading through it, particularly the recent ideas from @EwoutH and the summary of requirements from @rht, I wanted to try and synthesize a potential path forward that aims to capture the best aspects discussed, focusing on a declarative API, which felt closest in the analysis above.

While the manual [...] Perhaps we can achieve the power demonstrated there (especially regarding dynamic sets and aggregations) within a more structured, declarative API centered around a [...]

Proposed Core Idea: Initialize [...]

```python
# Illustrative Example API
import mesa
import numpy as np
from mesa.datacollection import DataCollector, collect

# Assume Wolf, Sheep, Citizen Agent classes and calculate_gini defined elsewhere

class MyModel(mesa.Model):
    def __init__(self, **kwargs):
        super().__init__()
        # ... model setup ...
        self.datacollector = DataCollector([
            # --- Model Level ---
            collect(name="step_count", target=self, attributes=["schedule.steps"]),
            collect(name="total_wealth", target=self,
                    function=lambda m: sum(a.wealth for a in m.agents)),

            # --- Agent Type Level (Resolved at runtime) ---
            collect(name="wolf_data", target=Wolf, attributes=["energy", "kills"]),
            # Example: apply function per agent (output stored per agent ID)
            collect(name="sheep_age_category", target=Sheep,
                    function=lambda agent: "lamb" if agent.age < 2 else "adult",
                    apply_level="agent"),  # Hint needed for per-agent application

            # --- Dynamic AgentSet Level (Lambda evaluated at runtime) ---
            collect(name="starving_agents_pos",
                    target=lambda m: m.agents.select(lambda a: a.energy < 10),
                    attributes=["pos"]),  # Collects pos attribute for matching agents
            collect(name="avg_starving_energy",
                    target=lambda m: m.agents.select(lambda a: a.energy < 10),
                    function=lambda agent_set: agent_set.agg("energy", np.mean)),  # Aggregate over dynamic set

            # --- Aggregation Focused ---
            collect(name="energy_stats", target=self.agents,  # Target can be an AgentSet
                    attributes=["energy"],  # Base attribute(s) to collect first
                    aggregates={  # Dictionary: output_name -> func(list_of_values)
                        "mean_energy": np.mean,
                        "median_energy": np.median,
                        "energy_gini": calculate_gini  # Assumes calculate_gini takes a list
                    }),
            # Direct aggregation on an AgentSet (e.g., count) without collecting individuals first
            collect(name="quiescent_count", target=Citizen,  # Agent Type target
                    function=lambda agentset: agentset.select(lambda a: a.condition == "Quiescent").count),  # agent_set passed to function

            # --- Conditional Collection (Addresses @Ben-geo's point) ---
            collect(name="periodic_wolf_aggression", target=Wolf, attributes=["aggression"],
                    trigger=lambda model: model.schedule.steps % 10 == 0)  # Only collect every 10 steps
        ])

    def step(self):
        # ... step logic ...
        self.datacollector.collect(self)  # Pass model instance to process collectors
```

**Key Components of [...]**

**How this Addresses Key Problems:** [...]

**Internal Storage & Retrieval:** @EwoutH's schema-based internal storage idea seems excellent for efficiency.

**Conclusion:** This synthesized [...]
-
There has been quite some discussion in various places about changing data collection. This is my attempt to think it through in some more detail. It is heavily inspired by a suggestion by @Corvince at some point.

In the general case, data collection means taking an object, extracting one or more attributes from it, and optionally applying a callable to the result. In specific cases, it might involve only an object and a callable applied to that object. This object can be the model, an agent, an agentset, a space, or some user-defined class.

So, it seems sensible to create a separate Collector class that implements this basic logic. Because the behavior of AgentSet is a bit different from other objects (i.e., `AgentSet.get` instead of relying on `getattr`), I believe it makes sense to have two Collector classes: BaseCollector and AgentSetCollector (PEP 20, flat is better than nested). Rather than burden the user with this distinction, it is possible to use a factory function (e.g., `collect(obj, attrs, func=None)`) to create the appropriate Collector instance.

Ideally, data should only be extracted once. So, in the case of the Boltzmann wealth model, the data collector should be smart enough to extract the `wealth` attribute only once from the agentset. This can relatively easily be realized by maintaining an internal mapping of all objects and the attributes to be retrieved from them. Moreover, it might be possible to extract all relevant attributes from a given object in one go, avoiding unnecessary iteration. This would, however, require a minor update to `AgentSet.get` so that `attr_name` takes a string or a list of strings.

I believe it is possible to design and implement this new-style DataCollector so that the current one can be implemented on top of it for backward compatibility.

Like with the current DataCollector, data collection should happen whenever `data_collector.collect` is called. However, I believe it is paramount that the data collector also always extracts the current simulation time. Only by having the simulation time for each call to `collect` can you produce a clean and complete time series of the dynamics of the model over time. In fact, these timestamps could become part of the index/column labels of the DataFrames when turning the retrieved data into a DataFrame.

Like with the current DataCollector, it should be easy to turn any retrieved data into a DataFrame. This can easily be done through a `to_dataframe` method on the Collector class.

So, what could the resulting API look like?
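(The author's API sketch is elided here. As a stand-in, a minimal sketch under the assumptions described above: the `collect` factory, the two Collector classes, time-stamped `collect()` calls, and `to_dataframe`. All class and method names are illustrative, not an agreed design; only `AgentSet` is Mesa's.)

```python
import pandas as pd
from mesa.agent import AgentSet  # Mesa's AgentSet class


class BaseCollector:
    """Sketch: collect attributes from a plain object via getattr."""

    def __init__(self, obj, attrs, func=None):
        self.obj = obj
        self.attrs = [attrs] if isinstance(attrs, str) else list(attrs)
        self.func = func
        self.data = []  # one (time, values) record per collect() call

    def _get(self):
        return [getattr(self.obj, attr) for attr in self.attrs]

    def collect(self, time):
        # Always record the current simulation time alongside the values
        values = self._get()
        if self.func is not None:
            values = self.func(*values)
        self.data.append((time, values))

    def to_dataframe(self):
        """One row per collect() call, indexed by simulation time."""
        times = [t for t, _ in self.data]
        values = [v for _, v in self.data]
        return pd.DataFrame({"value": values}, index=times)


class AgentSetCollector(BaseCollector):
    """Sketch: collect attributes from an AgentSet via AgentSet.get."""

    def _get(self):
        return [self.obj.get(attr) for attr in self.attrs]


def collect(obj, attrs, func=None):
    """Factory: pick the appropriate Collector for the target object."""
    if isinstance(obj, AgentSet):
        return AgentSetCollector(obj, attrs, func=func)
    return BaseCollector(obj, attrs, func=func)


# Usage sketch:
# gini = collect(model.agents, "wealth", func=calculate_gini)
# gini.collect(current_time)  # the DataCollector would pass the simulation time
# df = gini.to_dataframe()
```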
So, does the basic idea of an object, retrieval of one or more attributes, and/or applying a callable make sense? Have I missed a key concern? Is there something obviously wrong or missing in the sketch of the API?