Commit 6b9ee56

Add docs/concepts.md (#75)
1 parent 1be0e6a commit 6b9ee56

2 files changed

+225
-0
lines changed

docs/concepts.md

Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
1+
# Etre Concepts
2+
3+
**Etre is a data store and API for tracking and finding resources.**
4+
5+
Those aren't randomly chosen words; each one is meaningful:
6+
7+
* _Data store_: Etre is backed by a _durable_ data store, not a cache
8+
* _API_: Etre is accessed by a simple REST API—for humans and machines
9+
* _Tracking_: Etre is a source of truth for "what exists" (resources)
10+
* _Finding_: Etre uses the [Kubernetes label selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) syntax to make finding resources easy
11+
* _Resources_: Etre can store anything but it's built to store resources that actually exist
12+
13+
In large fleets, Etre allows humans and machines to find arbitrary groups of resources.
14+
For example, at [Block](https://github.com/block/) where Etre was created, there are thousands of databases.
15+
When a DBA needs to reboot a certain group of databases, how do they find and select only the databases they need?
16+
They use Etre.
17+
18+
Etre has three main concepts:
19+
20+
* **entity**: A unique resource of a specific type
21+
* **label**: Key-value pairs associated with an entity
22+
* **ID**: An ordered set of labels that uniquely identify an entity
23+
24+
The canonical example is physical servers:
25+
26+
```json
[
  {
    "hostname": "db1.local",
    "model": "Dell R760"
  },
  {
    "hostname": "db2.local",
    "model": "Dell R760",
    "rack": "34a"
  }
]
```
39+
40+
In this example, the user-defined entity type is `host`, and there are 2 host entities.
41+
42+
Each host entity has 2 labels: `hostname` and `model`.
43+
The second entity has a third label: `rack`.
44+
Labels are user-defined and can vary between entities.
45+
Only one label is special and required: the ID label.
46+
47+
Since servers are typically identified by hostname, the `hostname` label is the ID label.
48+
(An ID can consist of more than one label.)
49+
Uniqueness is enforced by a unique key on the backend data store; Etre currently supports MongoDB and MongoDB-compatible data stores.
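
As a rough illustration of what "uniquely identify" means here, the following Python sketch checks a list of entities for duplicate IDs. It is not how Etre enforces uniqueness (the data store's unique key does that); `ID_LABELS`, `entity_id`, and `find_duplicates` are hypothetical names used only for this example.

```python
# Hypothetical: the ordered ID labels for the "host" entity type.
ID_LABELS = ("hostname",)

def entity_id(entity, id_labels=ID_LABELS):
    """Return the tuple of ID label values that identifies an entity."""
    return tuple(entity[label] for label in id_labels)

def find_duplicates(entities, id_labels=ID_LABELS):
    """Return the set of IDs that appear more than once."""
    seen, dups = set(), set()
    for e in entities:
        key = entity_id(e, id_labels)
        if key in seen:
            dups.add(key)
        seen.add(key)
    return dups

hosts = [
    {"hostname": "db1.local", "model": "Dell R760"},
    {"hostname": "db2.local", "model": "Dell R760", "rack": "34a"},
]
```

With a multi-label ID, `id_labels` would simply list more than one label, and the tuple order would follow the order of the ID.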
50+
51+
## Data Modeling
52+
53+
Data modeling in Etre is denormalized by entity type.
54+
55+
This is more difficult than it sounds and more important than it seems.
56+
To understand why—and to learn how to model data properly in Etre—it's imperative to understand the fallacy of false structure.
57+
58+
### False Structure
59+
60+
_False structure_ is a subjective view of resource organization that purports to be objective.
61+
62+
When engineers insist "_This_ is how the resources are organized," it's usually a false structure.
63+
"This" is one way the resources can be organized, but there are probably many different ways depending on one's point of view.
64+
65+
Consider a typical setup of racks, servers, and databases:<br><br>
66+
67+
<img alt="Rack-Host-Db Tree" src="img/rack-host-db-tree.svg" style="width:480px"><br>
68+
69+
This hierarchical structure is not wrong, but there are at least three points of view:
70+
71+
* Data center engineer cares about racks and hosts
72+
* DBA cares about databases and maybe hosts
73+
* Application developer cares only about their databases
74+
75+
Depending on one's point of view, the structure changes.
76+
Application developers, for example, often see the resources structured this way:
77+
78+
```
app/
  env/
    region/
      db-cluster/
        db-node/
```
85+
86+
Also notice that the diagram and the directory tree capture different aspects of the resources.
87+
In the diagram, app, env, and region aren't visible.
88+
In the directory tree, these three aspects are clearly visible but hosts are not.
89+
90+
Realizing and accepting that <u>the same resources can have different views</u> is key to understanding Etre and proper Etre data modeling.
91+
92+
Any structure is valid if it helps a person (or machine) find what they need and accomplish what they're trying to do.
93+
However, that presents a challenge: how do you model "any structure"?
94+
Answer: you don't; you model what exists and let a user's point of view generate a (virtual) structure:
95+
96+
`What Exists + Point of View = Structure`
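
A minimal Python sketch of this formula, assuming flat database node entities with hypothetical labels (`db_id`, `host`, `rack`, `app`, `env`): the same entities yield different virtual structures depending on which labels, in which order, a point of view selects. `build_view` is an illustrative helper, not part of Etre.

```python
def build_view(entities, point_of_view, leaf):
    """Nest flat entities into a tree keyed by the given labels, in order.

    The last label in point_of_view maps to a list of `leaf` label values.
    """
    tree = {}
    for e in entities:
        node = tree
        for label in point_of_view[:-1]:
            node = node.setdefault(e[label], {})
        node.setdefault(e[point_of_view[-1]], []).append(e[leaf])
    return tree

# What exists: two database nodes (labels are assumptions for illustration).
db_nodes = [
    {"db_id": "db1", "host": "h1", "rack": "34a", "app": "pay", "env": "prod"},
    {"db_id": "db2", "host": "h2", "rack": "34a", "app": "pay", "env": "staging"},
]

# Point of view of a data center engineer: rack, then host.
dc_view = build_view(db_nodes, ["rack", "host"], "db_id")

# Point of view of an application developer: app, then env.
app_view = build_view(db_nodes, ["app", "env"], "db_id")
```

Neither tree is stored anywhere; each is generated on demand from the same two entities.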
97+
98+
### What Exists
99+
100+
Etre data modeling starts by identifying _what exists_.
101+
102+
Start with obvious, uncontentious resources:
103+
* A physical server (host) obviously exists
104+
* A network switch obviously exists
105+
* A database instance (node) obviously exists
106+
* A service (app) obviously exists
107+
108+
When it comes to less obvious resources, like a database cluster, follow these principles:
109+
110+
* **Entity Principle**: If removing it leaves nothing, it's an entity.
111+
(Or, the other way around: if it's removed and something remains, it's not an entity; what remains is the entity.)
112+
<br><br>
113+
* **Indivisible Principle**: An entity doesn't depend on parts to exist. Without getting into a [Ship of Theseus paradox](https://en.wikipedia.org/wiki/Ship_of_Theseus), a server is still a server when a hard drive is removed.
114+
<br><br>
115+
* **ID Principle**: An entity must be uniquely identifiable within its type.
116+
If an ID isn't obvious, then it might not be an entity.
117+
<br><br>
118+
* **Scalar Principle**: Labels and label values should be scalar (single value), not lists or enumerated.
119+
(Enumerated means label1, label2, ..., labelN.)
120+
If non-scalar labels or values are needed, the model is inverted: denormalize it.
121+
<br><br>
122+
* **Sparse Principle**: Entities should _not_ be sparse.
123+
If an entity has very few labels, it's probably not an entity.
124+
The labels probably belong to another entity type.
125+
<br><br>
126+
* **Stability Principle**: Labels and values should be stable and long-lived.
127+
(Write once, read many is ideal.)
128+
If not, it's probably a resource (or label) that should not be stored in Etre.
129+
<br><br>
130+
* **Binary Principle**: Querying more than 2 entity types is usually an anti-pattern and a sign of [false structure](#false-structure).
131+
<br><br>
132+
* **Duplication Principle**: Judicious duplication is necessary and normal.
133+
Trying to avoid duplication usually leads to [false structure](#false-structure).
134+
Duplication also helps avoid violating the Binary Principle.
135+
<br><br>
136+
* **Pragmatic Principle**: Whatever makes real-world usage fast and easy is acceptable.
137+
Use this principle sparingly and only as a last resort because it tends to create short-term solutions and long-term problems.
138+
<br><br>
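
One possible reading of the Scalar Principle, sketched in Python: a host entity that carries a list of databases is "inverted", so the list items become their own entities, each with a scalar label pointing back to the host. The `dbs` and `db_id` labels and the `invert` helper are assumptions made up for this example.

```python
def invert(host):
    """Turn a host carrying a list of databases into one entity per database."""
    return [{"db_id": db, "hostname": host["hostname"]} for db in host["dbs"]]

# Violates the Scalar Principle: the "dbs" value is a list.
non_scalar = {"hostname": "db1.local", "dbs": ["orders", "users"]}

# Denormalized: every label value is scalar, and hostname is duplicated.
db_entities = invert(non_scalar)
```

The duplication of `hostname` across the resulting entities is the kind of judicious duplication the Duplication Principle describes.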
139+
140+
Is a database cluster an entity?
141+
This can be argued both ways depending on the type of cluster:
142+
143+
_No_
144+
* A traditional database "cluster" (i.e. replication topology) is not an entity because if you take away the cluster the nodes remain&mdash;the Entity and Indivisible principles do not hold.
145+
It's also likely that the Sparse Principle doesn't hold either because, given the Duplication Principle, any cluster-level settings can and should be duplicated into the node entities.
146+
Instead, the database node entities should have a cluster label; and when a user (or program) wants to "build" the cluster, the query includes cluster=value.
147+
148+
_Yes_
149+
* Sometimes clusters can be "headless": have no database instances.
150+
In this case, the Entity and Indivisible principles hold.
151+
The Sparse Principle might still be violated, but then the Pragmatic Principle applies: headless clusters are unusual but they do exist for some purposes.
152+
In this case, the Scalar and Binary principles are important to avoid introducing false structures.
153+
Even with a cluster entity type, labels like cluster name should also be duplicated onto database instance entities (Duplication Principle).
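
The label-based approach for traditional clusters can be sketched in Python as follows. Cluster-level settings are copied onto each node entity, and the cluster is "built" at query time by matching the cluster label. The label names (`cluster`, `backup_retention`, `db_id`) and the helper `duplicate_cluster_labels` are hypothetical.

```python
def duplicate_cluster_labels(cluster, nodes):
    """Copy cluster-level labels onto each node; node labels win on conflict."""
    return [{**cluster, **node} for node in nodes]

# Hypothetical cluster-level settings and database nodes.
cluster = {"cluster": "orders", "backup_retention": "7d"}
nodes = [{"db_id": "n1"}, {"db_id": "n2"}]

node_entities = duplicate_cluster_labels(cluster, nodes)

# "Building" the cluster is just a match on the cluster label.
orders = [n for n in node_entities if n["cluster"] == "orders"]
```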
154+
155+
### Labels (Point of View)
156+
157+
`What Exists + Point of View = Structure`
158+
159+
Labels allow many points of view on the same underlying resources.
160+
When querying Etre, users select labels to match _and_ labels to return.
161+
An important design principle of Etre is that the usage of labels cannot be known ahead of time.
162+
163+
Granted, users can and should know the most common access patterns.
164+
But experience proves that there's always some new combination of labels applied to (or projected from) the resources.
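
To make "labels to match and labels to return" concrete, here is a minimal Python sketch of an in-memory query. It handles only a small subset of the Kubernetes label selector syntax (comma-separated `key=value` and `key!=value` requirements) and is not Etre's parser or API; `parse_selector`, `query`, and the sample labels are assumptions for illustration.

```python
def parse_selector(selector):
    """Parse 'k1=v1,k2!=v2' into (key, op, value) triples."""
    reqs = []
    for part in selector.split(","):
        if "!=" in part:
            key, value = part.split("!=", 1)
            reqs.append((key.strip(), "!=", value.strip()))
        else:
            key, value = part.split("=", 1)
            reqs.append((key.strip(), "=", value.strip()))
    return reqs

def matches(entity, reqs):
    """True if the entity satisfies every requirement."""
    for key, op, value in reqs:
        equal = entity.get(key) == value
        if (op == "=" and not equal) or (op == "!=" and equal):
            return False
    return True

def query(entities, selector, return_labels):
    """Match entities by selector, then project only the requested labels."""
    reqs = parse_selector(selector)
    return [
        {label: e[label] for label in return_labels if label in e}
        for e in entities
        if matches(e, reqs)
    ]

nodes = [
    {"db_id": "n1", "env": "prod", "aws_region": "us-east-1"},
    {"db_id": "n2", "env": "staging", "aws_region": "us-east-1"},
    {"db_id": "n3", "env": "prod", "aws_region": "us-west-2"},
]
result = query(nodes, "env=prod,aws_region!=us-west-2", ["db_id"])
```

Because both the selector and the returned labels are chosen at query time, a new combination needs no change to the stored entities.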
165+
166+
All labels should be actual, useful, and consistent:
167+
168+
* **Actual**: The label and value are actual properties of the resource.
169+
If a label is being put on an entity because it's unclear where else it should go, it probably indicates another entity type.
170+
No entity is a dumping ground for labels.
171+
* **Useful**: The label is either queried or returned (projected).
172+
If a label is never used, it shouldn't be stored. This helps guard against [what not to store](#what-not-to-store).
173+
* **Consistent**: Labels should be consistent across all entities and entity types. See [Conventions](#conventions).
174+
175+
The latter two are the usual problems: storing useless labels and not following (or violating) a convention.
176+
177+
A similar "pragmatic principle" applies.
178+
Storing, for example, a _slowly_ changing status value as a label is probably acceptable, especially if it eventually settles into a steady state.
179+
180+
## Conventions
181+
182+
* Singular noun entity names: host, switch, app
183+
* Snake_case names and labels: aws_region, db_id, backup_retention
184+
* Consistent values: "prod" or "production", not both
185+
* Terse but not cryptic: db_id _not_ database_identifier [1]
186+
* Lowercase names and labels
187+
188+
[1] Terse is important for a denormalized data model with high duplication. At scale, every byte matters.
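
The naming conventions above (lowercase, snake_case) lend themselves to a simple automated check. The Python sketch below is one way to express them as a regular expression; the exact pattern is an assumption, not a rule that Etre enforces, and it deliberately applies to user-defined names only.

```python
import re

# Assumed pattern: lowercase letters and digits, words separated by single underscores.
NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")

def follows_convention(name):
    """True if an entity type or label name is lowercase snake_case."""
    return NAME_RE.fullmatch(name) is not None
```

For example, `follows_convention("aws_region")` is true, while `follows_convention("AwsRegion")` and `follows_convention("backup-retention")` are false.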
189+
190+
## What _Not_ to Store
191+
192+
* Ephemeral data, values, properties (status, roles, etc.)
193+
* Non-atomic values: lists, documents/objects, CSV
194+
* Enumerated labels: `"foo1": "...", "foo2": "..."`
195+
* User-created features: indexes, feature flags, etc.
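
Two items in this list, non-atomic values and enumerated labels, can be detected mechanically. The following Python sketch flags both in a single entity; `problems` is a hypothetical helper, and the heuristic for enumerated labels (two or more labels that differ only by a trailing number) is an assumption that may produce false positives.

```python
import re

# A label name that ends in digits, e.g. "foo1" -> stem "foo".
ENUMERATED_RE = re.compile(r"^(.*\D)\d+$")

def problems(entity):
    """Return a list of human-readable issues found in an entity's labels."""
    issues = []
    stems = {}
    for key, value in entity.items():
        if isinstance(value, (list, dict, tuple, set)):
            issues.append(f"{key}: non-scalar value")
        m = ENUMERATED_RE.match(key)
        if m:
            stems.setdefault(m.group(1), []).append(key)
    for stem, keys in stems.items():
        if len(keys) > 1:
            issues.append(f"{stem}N: enumerated labels {sorted(keys)}")
    return issues

bad = {"hostname": "db1.local", "foo1": "a", "foo2": "b", "dbs": ["x"]}
good = {"hostname": "db1.local", "model": "Dell R760"}
```

Ephemeral data and user-created features cannot be detected this way; they require judgment about what the label means.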
196+
197+
## Data Format
198+
199+
Etre is pseudo-schemaless and format-flexible.
200+
Internally, every entity is stored like:
201+
202+
```json
{
  "_id": "b93aksdfjz09",
  "_type": "host",
  "_rev": 1,
  "key1": "val1",
  "keyN": "valN"
}
```
211+
212+
`_id`, `_type` and `_rev` are internal fields that the user cannot change.
213+
`_id` is the data store ID (not the user-level Etre ID label).
214+
`_type` is the Etre entity type.
215+
`_rev` is the revision of the entity, incremented on every write.
216+
The latter two fields are necessary for the change data capture (CDC) stream.
217+
218+
Entities are stored as JSON objects on the back end (in the data store), but they are also represented as key-value CSV _text_ (strings) by the CLI:
219+
220+
```
_id:b93aksdfjz09,_type:host,_rev:1,key1:val1,keyN:valN
```
223+
224+
The CLI defaults to key-value CSV text because it's intended to be processed with Bash and command-line tools like `cut`, `sed`, and `awk`.
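
For programs that prefer structured data over text tools, the key-value CSV line shown above can be turned back into a mapping. This Python sketch assumes that keys contain no colons and that values contain no commas; how the CLI handles such characters is not covered here, so treat `parse_kv_csv` as an illustration only.

```python
def parse_kv_csv(line):
    """Parse 'k1:v1,k2:v2' into a dict; all values remain strings."""
    entity = {}
    for pair in line.strip().split(","):
        # Split on the first colon only, so values may contain colons.
        key, value = pair.split(":", 1)
        entity[key] = value
    return entity

line = "_id:b93aksdfjz09,_type:host,_rev:1,key1:val1,keyN:valN"
entity = parse_kv_csv(line)
```

Note that `_rev` comes back as the string `"1"`, not a number, because the text format carries no type information.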

docs/img/rack-host-db-tree.svg

Lines changed: 1 addition & 0 deletions
