doc: initial design proposal #2
```
# We are using https://standardjs.com coding style. It is unfortunately incompatible with Prettier;
# see https://github.com/standard/standard/issues/996 and the linked issues.
# Standard does not handle Markdown files though, so we want to use Prettier formatting.
# The goal of the configuration below is to configure Prettier to lint only Markdown files.
**/*.*
!**/*.md

# Let's keep LICENSE.md in the same formatting as we use in other PL repositories
LICENSE.md
```

```
printWidth: 80
proseWrap: always
```

# piece-indexer

A lightweight IPNI node mapping Filecoin PieceCID → payload block CID.

- [Design doc](./docs/design.md)

# Design Doc: Piece Indexer

## Introduction

### Context

Filecoin defines a `Piece` as the main unit of negotiation for data that users
store on the Filecoin network. This is reflected in the on-chain metadata field
`PieceCID`.

On the other hand, content is retrieved from Filecoin using the CID of the
requested payload.

This dichotomy poses a challenge for retrieval checkers like Spark: for a given
deal storing some PieceCID, which payload CID should we request when testing the
retrieval?

Spark v1 relies on StorageMarket's DealProposal metadata `Label`, which is often
(but not always!) set by the client to the root CID of the payload stored.

In the first half of 2024, the Filecoin network added support for DDO (Direct
Data Onboarding) deals. These deals don't have a DealProposal, so there is no
`Label` field, only a `PieceCID`. We need to find a different solution to
support these deals.

In the longer term, we want IPNI to provide a reverse lookup functionality that
will allow clients like Spark to request a sample of payload block CIDs
associated with the given ContextID or PieceCID; see the
[design proposal for InterPlanetary Piece Index](https://docs.google.com/document/d/1jhvP48ccUltmCr4xmquTnbwfTSD7LbO1i1OVil04T2w).

To cover the gap until the InterPlanetary Piece Index is live, we want to
implement a lightweight solution that can serve as a stepping stone between
Label-based CID discovery and full Piece Index sampling.

### High-Level Idea

Let's implement a lightweight IPNI ingester that will process the advertisements
from Filecoin SPs and extract the list of `(ProviderID, PieceCID, PayloadCID)`
entries. Store these entries in a Postgres database. Provide a REST API endpoint
accepting `(ProviderID, PieceCID)` and returning a single `PayloadCID`.

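The lookup contract above can be sketched in JavaScript (the project's
language). This is a minimal in-memory stand-in for the Postgres table; the
class and method names are hypothetical, not part of any existing codebase:

```javascript
// Minimal sketch of the (ProviderID, PieceCID) -> PayloadCID lookup contract.
// A real deployment would back this with Postgres; a Map keyed by
// `${providerId}/${pieceCid}` stands in for the database table here.
class PieceIndex {
  constructor () {
    this.entries = new Map()
  }

  // Called by the ingester for each (ProviderID, PieceCID, PayloadCID) entry.
  add (providerId, pieceCid, payloadCid) {
    const key = `${providerId}/${pieceCid}`
    // Pieces are immutable, so the first payload CID we saw stays valid.
    if (!this.entries.has(key)) this.entries.set(key, payloadCid)
  }

  // Called by the REST API handler for GET /sample/{providerId}/{pieceCid}.
  lookup (providerId, pieceCid) {
    return this.entries.get(`${providerId}/${pieceCid}`) ?? null
  }
}

const index = new PieceIndex()
index.add('provider-1', 'bagaPIECE', 'bafyPAYLOAD')
console.log(index.lookup('provider-1', 'bagaPIECE')) // prints 'bafyPAYLOAD'
console.log(index.lookup('provider-2', 'bagaPIECE')) // prints null
```

Note that the store is keyed by the provider as well as the piece, reflecting
the per-provider scoping discussed in the Notes below.
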
#### Notes

Pieces are immutable. If we receive an advertisement saying that a payload block
CID was found in a piece CID, then this information remains valid forever, even
after the SP advertises that they are no longer storing that block. This means
our indexer can ignore `IsRm` advertisements.

Note that Spark builds its list of tasks for each round from the deal tracker's
list of active deals, so checkers will never query the piece indexer for pieces
from expired deals. Storing entries for expired deals merely increases storage
requirements, which is acceptable for a service we expect to run for only a few
months.

The indexer protocol does not provide any guarantees about the list of CIDs
advertised for the same PieceCID. Different SPs can advertise different lists
(e.g. the entries can be ordered differently) or can even cheat and submit CIDs
that are not part of the Piece. Our indexer must scope the information to each
index provider.

### Anatomy of IPNI Advertisements

Quoting from
[Ingestion](https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#ingestion):

> The indexer reads the advertisement chain starting from the head, reading
> previous advertisements until a previously seen advertisement, or the end of
> the chain, is reached. The advertisements and their entries are then processed
> in order from earliest to head.

An Advertisement has several properties (see
[the full spec](https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#advertisements));
we need the following ones:

- **`PreviousID`** is the CID of the previous advertisement, and is empty for
  the 'genesis'.

- **`Metadata`** represents additional opaque data. The metadata for Graphsync
  retrievals includes the PieceCID that we are looking for.

- **`Entries`** provide a paginated list of multihashes. For our purposes, it's
  enough to take the first entry and ignore the rest.

Advertisements are made available for consumption by indexer nodes as a set of
files fetched via HTTP.

Quoting from
[Advertisement Transfer](https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#advertisement-transfer):

> All IPNI HTTP requests use the IPNI URL path prefix, `/ipni/v1/ad/`. Indexers
> and advertisement publishers implicitly use and expect this prefix to precede
> the requested resource.
>
> The IPLD objects of advertisements and entries are represented as files named
> by their CIDs in an HTTP directory. These files are immutable, so can be
> safely cached or stored on CDNs. To fetch an advertisement or entries file by
> CID, the request made by the indexer to the publisher is
> `GET /ipni/v1/ad/{CID}`.

The IPNI instance running at https://cid.contact provides an API returning the
list of all index providers from which cid.contact has received announcements:

https://cid.contact/providers

The response provides all the metadata we need to download the advertisements:

- `Publisher.Addrs` describes where we can contact the SP's index provider.

- `LastAdvertisement` contains the CID of the head advertisement.

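A sketch of pulling these two fields out of one entry of the `/providers`
response. The sample object below is an assumed, trimmed shape based on the
IPNI HTTP API; consult the live endpoint for the authoritative schema:

```javascript
// Extract the fields our ingester needs from one /providers entry.
// Field names follow the IPNI HTTP API; the sample is illustrative only.
function extractProviderState (provider) {
  return {
    providerId: provider.AddrInfo.ID,
    providerAddress: provider.Publisher?.Addrs?.[0] ?? null,
    lastAdvertisement: provider.LastAdvertisement?.['/'] ?? null
  }
}

// Example entry resembling what https://cid.contact/providers returns.
const sample = {
  AddrInfo: { ID: '12D3KooWExample', Addrs: ['/ip4/10.0.0.1/tcp/1234'] },
  Publisher: { ID: '12D3KooWExample', Addrs: ['/dns/sp.example.com/tcp/443/https'] },
  LastAdvertisement: { '/': 'baguqeeraExampleHeadCid' }
}

console.log(extractProviderState(sample))
```

Providers that never announced to IPNI simply won't appear in this list, which
is why the Observability section below matters.
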
## Proposed Design

### Ingestion

Ingesting announcements from all Storage Providers is the most complex
component. For each storage provider, we need to periodically check for the
latest advertisement head and process the chain from the head until we find an
advertisement we have already processed before. The chain can be very long;
therefore, we need to account for the cases when the service restarts or a new
head is published before we finish processing the chain.

#### Proposed algorithm

Use the following per-provider state persisted in the database:

- `provider_id` - Primary key.

- `provider_address` - Provider's address where we can fetch advertisements
  from.

- `last_head` - The CID of the head where we started the previous walk. All
  advertisements from `last_head` to the end of the chain were already
  processed.

- `next_head` - The CID of the most recent head seen by cid.contact. This is
  where we need to start the next walk from.

- `head` - The CID of the head advertisement we started the current walk from.
  We update this value whenever we start a new walk.

- `tail` - The CID of the next advertisement in the chain that we need to
  process in the current walk.

Every minute, fetch the latest providers from cid.contact. For each provider
found, fetch the state from the database and run the following algorithm (using
the name `new_head` for the CID of the latest advertisement):

1. If `last_head` is not set, then we need to start the ingestion from scratch.
   Update the state as follows and start the chain walker:

   ```
   last_head = new_head
   next_head = new_head
   head = new_head
   tail = new_head
   ```

2. If `new_head` is the same as `next_head`, then there was no change since we
   checked the head last time and we are done.

3. If `tail` is not null, then there is an ongoing walk of the chain that we
   need to finish before we can ingest new advertisements. Update the state as
   follows and abort:

   ```
   next_head = new_head
   ```

4. If `tail` is null, we have finished ingesting all advertisements from `head`
   to the end of the chain. Update the state as follows and start the chain
   walker:

   ```
   next_head = new_head
   head = new_head
   tail = new_head
   ```

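The four head-update steps above can be sketched as a pure function over the
persisted state. Field and function names are illustrative; the real service
would run this inside a database transaction:

```javascript
// Sketch of the head-update algorithm as a pure function.
// `state` mirrors the per-provider database row; `newHead` is the CID of the
// latest advertisement reported by cid.contact.
function updateProviderState (state, newHead) {
  // 1. Nothing ingested yet for this provider: start from scratch.
  if (!state.lastHead) {
    return { lastHead: newHead, nextHead: newHead, head: newHead, tail: newHead }
  }
  // 2. No change since we checked the head last time: we are done.
  if (newHead === state.nextHead) {
    return state
  }
  // 3. A walk is in progress: remember the new head for later and abort.
  if (state.tail !== null) {
    return { ...state, nextHead: newHead }
  }
  // 4. The previous walk has finished: start a new walk from the new head.
  return { ...state, nextHead: newHead, head: newHead, tail: newHead }
}

// Example: a new head arrives while a walk is still in progress (step 3).
const walking = { lastHead: 'adA', nextHead: 'adB', head: 'adB', tail: 'adB' }
console.log(updateProviderState(walking, 'adC'))
// prints { lastHead: 'adA', nextHead: 'adC', head: 'adB', tail: 'adB' }
```
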
The chain-walking algorithm loops over the following steps:

1. If `tail == last_head || tail == null`, then we have finished the walk.
   Update the state as follows:

   ```
   last_head = head
   head = null
   tail = null
   ```

   If `next_head != last_head`, then start a new walk by updating the state as
   follows:

   ```
   head = next_head
   tail = next_head
   ```

2. Otherwise, take a step to the next item in the chain:

   1. Fetch the advertisement identified by `tail` from the index provider.
   2. Process the metadata and entries to extract one `(PieceCID, PayloadCID)`
      entry.
   3. Update the `tail` field using the `PreviousID` field from the
      advertisement:

      ```
      tail = PreviousID
      ```

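One iteration of the walk can be sketched as follows. `fetchAdvertisement`
(an HTTP `GET /ipni/v1/ad/{CID}` against the provider) and `processEntry` are
caller-supplied stand-ins; both names are hypothetical:

```javascript
// Execute one iteration of the chain walker for a single provider.
// `state` mirrors the database row described above.
async function walkOneStep (state, fetchAdvertisement, processEntry) {
  // 1. The walk ends when we reach the previously processed head or the
  //    genesis advertisement (whose PreviousID is empty).
  if (state.tail === state.lastHead || state.tail === null) {
    const done = { ...state, lastHead: state.head, head: null, tail: null }
    // A newer head arrived while we were walking: start the next walk from it.
    if (done.nextHead !== done.lastHead) {
      return { ...done, head: done.nextHead, tail: done.nextHead }
    }
    return done
  }
  // 2. Otherwise, ingest one advertisement and step to the previous one.
  const ad = await fetchAdvertisement(state.tail)
  processEntry(ad) // extract the (PieceCID, PayloadCID) entry here
  return { ...state, tail: ad.PreviousID ?? null }
}
```

Because each call performs at most one fetch and returns the updated state,
the walk survives restarts and can be interleaved across many providers.
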
#### Handling the Scale

At the time of writing this document, cid.contact was tracking 322 index
providers. From Spark's measurements, we know there are an additional 843
storage providers that don't advertise to IPNI. The number of storage providers
grows over time, so our system must be prepared to ingest advertisements from
thousands of providers.

Each storage/index provider produces tens of thousands to hundreds of thousands
of advertisements (in total). The initial ingestion run will take a while to
complete. We also must be careful not to overload the SPs by sending too many
requests.

The design outlined in the previous section divides the ingestion process into
small steps that can be scheduled and executed independently. This allows us to
avoid the complexity of managing long-running per-provider tasks and instead
repeatedly execute one step of the process.

Loop 1: Every minute, fetch the latest provider information from cid.contact and
update the persisted state as outlined above.

Loop 2: Discover walks in progress and make one step in each walk.

1. Find all provider state records where `tail != null`.

2. For each provider, execute one step as described above. We can execute these
   steps in parallel. Since each parallel job will query a different provider,
   we are not going to overload any single provider.

3. Optionally, we can introduce a small delay before the next iteration. I think
   we won't need it because the time to execute SQL queries should create enough
   delay.

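Loop 2 can be sketched like this, with `listWalksInProgress` standing in for
the `tail != null` query and `step` performing one chain-walk iteration; both
names are hypothetical:

```javascript
// One iteration of Loop 2: take one step of every in-progress walk.
// `listWalksInProgress` stands in for a query like
// `SELECT * FROM provider_state WHERE tail IS NOT NULL`.
async function runWalkIteration (listWalksInProgress, step) {
  const providers = await listWalksInProgress()
  // One request per provider, so no single provider is overloaded.
  const results = await Promise.allSettled(providers.map((p) => step(p)))
  // Report how many steps failed so the caller can log and retry them.
  return results.filter((r) => r.status === 'rejected').length
}
```

Using `Promise.allSettled` means one misbehaving provider cannot stall the
walks of all the others.
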
### REST API

Implement the following endpoint that will be called by Spark checker nodes. The
endpoint will sign the response using the server's private key to allow
spark-evaluate to verify the authenticity of results reported by checker nodes:

```
GET /sample/{providerId}/{pieceCid}?seed={seed}
```

Response in JSON format, when the piece was found:

```json
{
  "samples": ["exactly one CID of a payload block advertised for PieceCID"],
  "pubkey": "server's public key",
  "signature": "signature over dag-json{providerId,pieceCid,seed,samples}"
}
```

Response in JSON format, when the piece or the provider was not found:

```json
{
  "error": "code - e.g. PROVIDER_NOT_FOUND or PIECE_NOT_FOUND",
  "pubkey": "server's public key",
  "signature": "signature over dag-json{providerId,pieceCid,seed,error}"
}
```

In the initial version, the server will ignore the `seed` value and use it only
as a nonce preventing replay attacks. Spark checker nodes will set the seed
using the DRAND randomness string for the current Spark round.

In the future, when IPNI implements the proposed reverse-index sampling
endpoint, the seed will be used to pick the samples at random. See the
[IPNI Multihash Sampling API proposal](https://github.com/ipni/xedni/blob/526f90f5a6001cb50b52e6376f8877163f8018af/openapi.yaml).

### Observability

We need visibility into the status of ingestion for any given provider. Some
providers don't advertise at all, and some may have misconfigured their
integration with IPNI; we need to understand why our index does not include any
data for a provider.

Let's enhance the state table with another column describing the ingestion
status as a free-form string and implement a new REST API endpoint to query the
ingestion status:

```
GET /ingestion-status/{providerId}
```

Response in JSON format:

```json
{
  "providerId": "state.provider_id",
  "providerAddress": "state.provider_address",
  "ingestionStatus": "state.ingestion_status",
  "lastHeadWalkedFrom": "state.last_head",
  "piecesIndexed": 123
  // ^^ number of (PieceCID, PayloadCID) records found for this provider
}
```

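Assembling this response is a straightforward mapping from the state row and a
record count, both of which would come from SQL queries; the function name
below is hypothetical:

```javascript
// Build the /ingestion-status response body from the provider's state row
// and the count of (PieceCID, PayloadCID) records indexed for it.
function buildIngestionStatus (stateRow, piecesIndexed) {
  return {
    providerId: stateRow.provider_id,
    providerAddress: stateRow.provider_address,
    ingestionStatus: stateRow.ingestion_status,
    lastHeadWalkedFrom: stateRow.last_head,
    piecesIndexed
  }
}

console.log(buildIngestionStatus({
  provider_id: '12D3KooWExample',
  provider_address: '/dns/sp.example.com/tcp/443/https',
  ingestion_status: 'walking the chain',
  last_head: 'baguqeeraExampleHeadCid'
}, 123))
```
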