Description
As someone who is new to logs, I could follow the example. However, while discussing the subject with others, key terms and topics came up that were new to me. I think it would be useful to include some of this context in the README for those who stumble upon this repo without already knowing what a log is.
What is a log?
Also known as a write-ahead log, commit log, or transaction log. In this repo, 'log' does not refer to 'application logging', the kind of logging you might see for error messages.
A log is one of the simplest possible storage abstractions: an append-only, totally ordered sequence of records ordered by time. Logs are visualised horizontally, from left to right.
They're not all that different from a file or a table. If we consider a file as an array of bytes and a table as an array of records, then a log can be thought of as a kind of table where records are sorted by time.
Logs are event driven: they continuously record what happened and when. Because records are stored in the order in which the changes occurred, you can reconstruct the state of the data at any given point in time by reading the log up to that point. Records can be consumed in near real time, which makes logs well suited to analytics. They are also helpful after crashes or errors, because the full history of changes means the data can be restored. Keeping an immutable log of the history of your data means past records are never lost or changed. Publishers of data append to the log, and subscribers read and act on it, but the records themselves cannot be mutated.
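To make the description above concrete, here is a minimal in-memory sketch of an append-only log in Python. It is not taken from this repo or the article: the names `Record`, `Log`, `append`, `read_from` and `replay` are hypothetical, and the sketch assumes each record's payload is a `(key, value)` pair and that a numeric offset (rather than a timestamp) identifies a record.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)  # frozen: a record cannot be mutated once created
class Record:
    offset: int                 # position in the log, assigned on append
    payload: Tuple[str, Any]    # assumed shape: (key, value)


class Log:
    """Hypothetical append-only log: records can be added and read, never changed."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, payload: Tuple[str, Any]) -> int:
        """Publisher side: add a record to the end and return its offset."""
        record = Record(offset=len(self._records), payload=payload)
        self._records.append(record)
        return record.offset

    def read_from(self, offset: int) -> List[Record]:
        """Subscriber side: return a copy of all records from `offset` onwards."""
        return list(self._records[offset:])


def replay(log: Log, up_to: int) -> Dict[str, Any]:
    """Rebuild the state as it was after the record at offset `up_to`."""
    state: Dict[str, Any] = {}
    for record in log.read_from(0):
        if record.offset > up_to:
            break
        key, value = record.payload
        state[key] = value
    return state


log = Log()
log.append(("colour", "red"))
log.append(("colour", "blue"))
log.append(("size", "large"))

state_at_start = replay(log, up_to=0)   # state after the first record only
state_at_end = replay(log, up_to=2)     # state after all three records
```

Replaying the same records in the same order always produces the same state, which is what makes it possible to go back to an earlier point in time or to restore data after a crash.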
Keywords
Time series database: a database system optimised for handling time series data (arrays of numbers indexed by time). They handle queries over historical data and time ranges better than relational databases.
Data integration: making all the data an organisation has available in all its services and systems.
Log compaction: methods to tidy up a log by deleting data that is no longer needed.
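As an illustration of log compaction, one common approach is to keep only the most recent record for each key and drop the older ones. The sketch below assumes that approach and assumes records are `(key, value)` pairs; the function name `compact` is hypothetical and this is not necessarily how any particular system implements it.

```python
from typing import Any, Dict, List, Tuple


def compact(records: List[Tuple[str, Any]]) -> List[Tuple[str, Any]]:
    """Keep only the latest record per key, preserving log order of the survivors."""
    latest: Dict[str, Tuple[int, Any]] = {}
    for offset, (key, value) in enumerate(records):
        # A later record for the same key replaces the earlier one.
        latest[key] = (offset, value)
    # Sort the surviving records by their original offset.
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


records = [("a", 1), ("b", 2), ("a", 3)]
compacted = compact(records)  # the older ("a", 1) record is dropped
```

Note that compaction of this kind trades away the full history: the latest value for every key is still recoverable, but earlier values are not.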
Questions
- Are we performing physical (the data itself) or logical (the command or calculation which results in the data) logging?
- Are all logs time series databases?
- Does a log contain all of the fields of data that would be captured in a relational DB schema? Does this lead to there being a lot of empty rows because some columns aren't applicable to everything?
- Are all logs append only?
- What is a record in the context of a log? Is it the equivalent of a row in a table? Does it have column titles so all records have the same field titles?
- Do all logs run horizontally?
- Is a timestamp value mandatory in a record in a log? Do timestamps act as the unique identifier for records, or does a numeric ID do that?
- When are horizontal scaling partitions useful? I didn't understand in what context they'd be used from the article... Would you have a separate log per user and the partition is made for each user ID?
- The article referred a lot to 'distributed data systems' - if a log is a move away from them, what kind of system would you call a log system? Does it have a name? Is it an integrated system?
These notes and questions came from reading:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying