9 changes: 9 additions & 0 deletions .prettierignore
@@ -0,0 +1,9 @@
# We are using https://standardjs.com coding style. It is unfortunately incompatible with Prettier;
# see https://github.com/standard/standard/issues/996 and the linked issues.
# Standard does not handle Markdown files though, so we want to use Prettier formatting.
# The goal of the configuration below is to configure Prettier to lint only Markdown files.
**/*.*
!**/*.md

# Let's keep LICENSE.md in the same formatting as we use in other PL repositories
LICENSE.md
2 changes: 2 additions & 0 deletions .prettierrc.yaml
@@ -0,0 +1,2 @@
printWidth: 80
proseWrap: always
2 changes: 2 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
# piece-indexer

A lightweight IPNI node mapping Filecoin PieceCID → payload block CID.

- [Design doc](./docs/design.md)
299 changes: 299 additions & 0 deletions docs/design.md
@@ -0,0 +1,299 @@
# Design Doc: Piece Indexer

## Introduction

### Context

Filecoin defines a `Piece` as the main unit of negotiation for data that users
store on the Filecoin network. This is reflected in the on-chain metadata field
`PieceCID`.

On the other hand, content is retrieved from Filecoin using the CID of the
requested payload.

This dichotomy poses a challenge for retrieval checkers like Spark: for a given
deal storing some PieceCID, what payload CID should we request when testing the
retrieval?

Spark v1 relies on StorageMarket's DealProposal metadata `Label`, which is often
(but not always!) set by the client to the root CID of the payload stored.

In the first half of 2024, the Filecoin network added support for DDO (Direct
Data Onboarding) deals. These deals don't have any DealProposal, so there is no
`Label` field; only `PieceCID`. We need to find a different solution to support
these deals.

In the longer term, we want IPNI to provide a reverse lookup functionality that
will allow clients like Spark to request a sample of payload block CIDs
associated with the given ContextID or PieceCID; see the
[design proposal for InterPlanetary Piece Index](https://docs.google.com/document/d/1jhvP48ccUltmCr4xmquTnbwfTSD7LbO1i1OVil04T2w).

To cover the gap until the InterPlanetary Piece Index is live, we want to
implement a lightweight solution that can serve as a stepping stone between
Label-based CID discovery and full Piece Index sampling.

### High-Level Idea

Let's implement a lightweight IPNI ingester that will process the advertisements
from Filecoin SPs and extract the list of `(ProviderID, PieceCID, PayloadCID)`
entries. Store these entries in a Postgres database. Provide a REST API endpoint
accepting `(ProviderID, PieceCID)` and returning a single `PayloadCID`.

#### Notes

Pieces are immutable. If we receive an advertisement saying that a payload block
CID was found in a piece CID, then this information remains valid forever, even
after the SP advertises that they are no longer storing that block. This means


But if the SP says that they are no longer storing the piece then we should no longer make retrieval requests to that SP for any payloads in the piece, no?

Member Author:

Yes, we should not make any retrievals for such payloads.

I am envisioning the following architecture:

  1. A piece indexer I describe in this document. It will be eventually replaced by a proper IPNI reverse index solution.
  2. A deal tracker - a component that listens for actor events and builds a list of active deals - conceptually a list of pairs (piece_cid, miner_id). We will need to build this.

When Spark builds a list of tasks for the current round, it will ask the deal tracker for 1000 active deals. This ensures we test retrieval for active deals only.

When a Spark checker tests retrieval, it will first consult the piece indexer to convert the deal's piece_cid to a payload CID to retrieve.

It's ok if the piece indexer stores data for expired deals, because Spark is not going to ask for that data.

Of course, storing expired deals unnecessarily increases storage requirements, but since we want to run this service only for 2-6 months, I think it's fine.

Member:

Sgtm, could you add this context to the document please? I was also not aware of the distinction between the piece indexer and the deal tracker

our indexer can ignore `IsRm` advertisements.

The indexer protocol does not provide any guarantees about the list of CIDs
advertised for the same PieceCID. Different SPs can advertise different lists
(e.g. the entries can be ordered differently) or can even cheat and submit CIDs
that are not part of the Piece. Our indexer must scope the information to each
index provider.


By index provider do you mean the IPNI instance or the Storage Provider creating indexes/advertisements?

Member Author:

The index provider is the actor submitting data to the index. Typically, Boost and Venus Droplet.

@patrickwoodhead
How can I improve the text to make this easier to understand?

Member:

Is "index provider" the term to use? If so, I don't think this needs to be cleared up. Maybe we could add the term definition to a definition section in this document?

Member Author:

The term "index provider" is used by IPNI.

However, the spec uses just "provider" 🤷🏻‍♂️

https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#terminology

Publisher: This is an entity that publishes advertisements and index data to an indexer. It is usually, but not always, the same as the data provider. A publisher is identified by a libp2p peer ID.

Member Author:

I like the idea of adding a section to explain the terminology 👍🏻


### Anatomy of IPNI Advertisements

Quoting from
[Ingestion](https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#ingestion):

> The indexer reads the advertisement chain starting from the head, reading
> previous advertisements until a previously seen advertisement, or the end of
> the chain, is reached. The advertisements and their entries are then processed
> in order from earliest to head.

An Advertisement has several properties (see
[the full spec](https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#advertisements)),
we need the following ones:

- **`PreviousID`** is the CID of the previous advertisement, and is empty for
the 'genesis'.

- **`Metadata`** represents additional opaque data. The metadata for Graphsync
retrievals includes the PieceCID that we are looking for.

- **`Entries`** provides a paginated list of multihashes. For our purposes, it's
enough to take the first entry and ignore the rest.

Advertisements are made available for consumption by indexer nodes as a set of
files fetched via HTTP.

Quoting from
[Advertisement Transfer](https://github.com/ipni/specs/blob/90648bca4749ef912b2d18f221514bc26b5bef0a/IPNI.md#advertisement-transfer):

> All IPNI HTTP requests use the IPNI URL path prefix, `/ipni/v1/ad/`. Indexers
> and advertisement publishers implicitly use and expect this prefix to precede
> the requested resource.
>
> The IPLD objects of advertisements and entries are represented as files named
> by their CIDs in an HTTP directory. These files are immutable, so can be
> safely cached or stored on CDNs. To fetch an advertisement or entries file by
> CID, the request made by the indexer to the publisher is
> `GET /ipni/v1/ad/{CID}`.
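
For illustration, fetching one advertisement could look like the sketch below.
It assumes the publisher serves dag-json over plain HTTP and that the base URL
has already been derived from the provider's multiaddr; the function name is
ours, not part of any existing codebase.

```js
// Sketch: fetch one advertisement by CID from an index provider.
// Assumes `providerBaseUrl` is an HTTP(S) origin, e.g. 'https://sp.example.com',
// and that the advertisement is encoded as dag-json (an assumption).
async function fetchAdvertisement (providerBaseUrl, adCid) {
  const res = await fetch(`${providerBaseUrl}/ipni/v1/ad/${adCid}`)
  if (!res.ok) {
    throw new Error(`Cannot fetch advertisement ${adCid}: HTTP ${res.status}`)
  }
  return await res.json()
}
```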

The IPNI instance running at https://cid.contact provides an API returning the
list of all index providers from which cid.contact has received announcements:

https://cid.contact/providers

The response provides all the metadata we need to download the advertisements:

- `Publisher.Addrs` describes where we can contact the SP's index provider


what does this mean?

Member:

I believe it's the HTTP address where the SP can respond to indexer requests

Member Author:

I think it doesn't have to be an HTTP address; it could be a Graphsync/libp2p address, too.

However, there is a push from IPNI to move to HTTP transport. For our lightweight indexer, we will require SPs to support the HTTP protocol for handling indexer requests.

Unfortunately, this requires SPs to tweak their Boost configuration and provide the public hostname at which Boost can be reached from outside. On the bright side, I think it's most likely that SPs have to configure this option anyway if they want cid.contact to receive their advertisements, in which case our lightweight indexer is not adding any new requirements.

https://boost.filecoin.io/configuration/http-indexer-announcement

Member Author:

I reworded this item as follows:

Publisher.Addrs describes where we can contact SP's index provider to retrieve content for CIDs, e.g., advertisements.

- `LastAdvertisement` contains the CID of the head advertisement


Is this the last advertisement for a specific SP or just in the IPNI instance altogether?

Member Author:

It's the last advertisement for a specific SP.

All fields described in this section are per-provider.

How can I improve the text to make this more clear?

Member:

What about

The response provides all the metadata we need to download the advertisements:

->

The response provides all the per-SP metadata we need to download the advertisements:

Member Author:

Reworded as follows:

The response provides all the metadata we need to download the advertisements.
For each index provider, the response includes:

- `Publisher.Addrs` describes where we can contact SP's index provider to
  retrieve content for CIDs, e.g., advertisements.
- `LastAdvertisement` contains the CID of the head advertisement from the SP
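
To illustrate, reading this per-provider metadata could look like the sketch
below. The field access mirrors the response shape described above; treat the
exact encoding (e.g. `LastAdvertisement` serialized as a dag-json CID link) as
an assumption.

```js
// Sketch: list index providers known to cid.contact.
const res = await fetch('https://cid.contact/providers')
const providers = await res.json()
for (const p of providers) {
  const providerId = p.Publisher.ID // libp2p peer ID of the publisher
  const addrs = p.Publisher.Addrs // multiaddrs of the SP's index provider
  const headCid = p.LastAdvertisement['/'] // CID of the head advertisement
  console.log(providerId, addrs, headCid)
}
```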


## Proposed Design

### Ingestion

Ingesting announcements from all Storage Providers is the most complex
component. For each storage provider, we need to periodically check for the
latest advertisement head and process the chain from the head until we find an
advertisement we have already processed. The chain can be very long, so we must
account for cases where the service restarts or a new head is published before
we finish processing the chain.

#### Proposed algorithm

Use the following per-provider state persisted in the database (a schema sketch
follows the list):

- `provider_id` - Primary key.

- `provider_address` - Provider's address where we can fetch advertisements
from.

- `last_head` - The CID of the head where we started the previous walk. All
advertisements from `last_head` to the end of the chain were already
processed.

- `next_head` - The CID of the most recent head seen by cid.contact. This is


how does next_head differ from head? Do we not start each walk from the current head?

Member Author:

The initial walk will take a long time to complete. While we are walking the "old" chain, new advertisements (new heads) will be announced to IPNI.

  • next_head is the latest head announced to IPNI
  • head is the advertisement where the current walk-in-progress started

I suppose we don't need to keep track of next_head. When the current walk finishes, we will wait up to one minute until we make another request to cid.contact to find the latest heads for each SP.

In my current proposal, when the current walk finishes, we can immediately continue walking from the next_head.

Member Author:

I captured my explanation in the design doc.

where we need to start the next walk from.

- `head` - The CID of the head advertisement we started the current walk from.
We update this value whenever we start a new walk.

Member:

Why do we need this state? Should it not be enough to go from next_head to last_head and afterwards update last_head?

Or is this for the case where this takes a long time and we want to be able to resume the chain walk? In that case, what if we simplify the design by saying that we will only ever walk X links per iteration, thereby eliminating the need for the head state?

Member:

Then we could also remove tail

Member Author (bajtos, Jul 24, 2024):

The current walk starts from head and walks up to last_head. When the current walk reaches last_head, we need to set last_head ← head so that the next walk knows where to stop.

next_head is updated every minute when we query cid.contact for the latest heads. If the walk takes longer than a minute to finish, then next_head will change and we cannot use it for last_head.

What we can do is remove next_head, as I explained in https://github.com/filecoin-station/piece-indexer/pull/2/files#r1689865074

In this case, what if we simplify the design by saying that we will only ever walk X links per iteration, thereby eliminating the need for the head state?

We must always walk the chain all the way to the genesis or to the entry we have already seen and processed. Here is how the state looks in the middle of a walk:

    next_head
      ↓
    (entries announced after we started the current walk)
      ↓
    head
      ↓
    (entries visited in this walk)
      ↓
    tail
      ↓
    (entries NOT visited yet)
      ↓
    last_head
      ↓
    (entries visited in the previous walks)
      ↓
    (null)


- `tail` - The CID of the next advertisement in the chain that we need to
process in the current walk.
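
A possible shape for this table, sketched as a SQL migration run through
node-postgres; the table name and the choice of TEXT columns for CIDs are
illustrative, not a final schema:

```js
// Sketch: create the per-provider ingestion state table.
import pg from 'pg'

const client = new pg.Client(process.env.DATABASE_URL)
await client.connect()
await client.query(`
  CREATE TABLE IF NOT EXISTS provider_state (
    provider_id TEXT PRIMARY KEY,
    provider_address TEXT NOT NULL,
    last_head TEXT,
    next_head TEXT,
    head TEXT,
    tail TEXT
  )
`)
await client.end()
```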

Every minute, fetch the latest providers from cid.contact. For each provider


Each minute, we are running an algorithm whose complexity is on the order of the number of providers. Do we have any concerns about how much processing we will be doing each minute? Might this process take more than a minute to run?

Member:

What about a one-minute delay between every run?

Member Author (bajtos, Jul 25, 2024):

I think this should be fine. Every minute, we need to do these four steps:

  1. Run a SQL query to get the next_head for each SP
  2. Make one HTTP call to cid.contact to find the latest advertisement heads announced by all SPs
  3. Run one SQL query to update next_head for all SPs where there is a new head
  4. Kick off advertisement walks (up to one walk per provider). These walks are executed in the background and don't block this loop.

Member Author:

I updated the spec to capture my explanation.

found, fetch the state from the database and run the following algorithm, using
the name `new_head` for the CID of the latest advertisement (a code sketch
follows the steps).

1. If `last_head` is not set, then we need to start the ingestion from scratch.
Update the state as follows and start the chain walker:

```
last_head = new_head
next_head = new_head
head = new_head
tail = new_head
```

2. If `new_head` is the same as `next_head`, then nothing has changed since we
last checked the head and we are done.

3. If `tail` is not null, then there is an ongoing walk of the chain we
need to finish before we can ingest new advertisements. Update the state as
follows and abort.

```
next_head = new_head
```

4. `tail` is null, which means we have finished ingesting all
advertisements from `head` to the end of the chain. Update the state as
follows and start the chain walker.

```
next_head = new_head
head = new_head
tail = new_head
```
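
Putting the four steps together, the head-update logic could be sketched as
plain JavaScript over an in-memory state object. Persistence and the walker
itself are elided, and all names are illustrative:

```js
// Sketch: update per-provider state when cid.contact reports `newHead`.
// Returns true when a new chain walk should be started.
function updateProviderState (state, newHead) {
  if (!state.last_head) {
    // 1. No previous walk; start ingestion from scratch.
    state.last_head = newHead
    state.next_head = newHead
    state.head = newHead
    state.tail = newHead
    return true
  }
  if (newHead === state.next_head) {
    // 2. Nothing changed since the last check.
    return false
  }
  if (state.tail !== null) {
    // 3. A walk is in progress; remember the new head and abort.
    state.next_head = newHead
    return false
  }
  // 4. The previous walk finished; start a new walk from the new head.
  state.next_head = newHead
  state.head = newHead
  state.tail = newHead
  return true
}
```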

The chain-walking algorithm loops over the following steps (a code sketch
follows the list):

1. If `tail == last_head || tail == null`, then we finished the walk. Update
the state as follows:

```
last_head = head
head = null
tail = null
```

If `next_head != last_head`, then start a new walk by updating the state as
follows:

```
head = next_head
tail = next_head
```

2. Otherwise, take a step to the next item in the chain:

   1. Fetch the advertisement identified by `tail` from the index provider.
   2. Process the metadata and entries to extract one `(PieceCID, PayloadCID)`
      entry.
   3. Update the `tail` field using the `PreviousID` field from the
      advertisement.

      ```
      tail = PreviousID
      ```
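
One step of this loop might look like the sketch below. `fetchAdvertisement` is
the helper sketched earlier; `extractPieceEntry` and `savePieceEntry` are
hypothetical helpers for parsing the metadata and persisting the result.

```js
// Sketch: execute one step of the chain walk for a single provider.
async function walkOneStep (state, db) {
  if (state.tail === state.last_head || state.tail === null) {
    // The walk is finished; record where the next walk should stop.
    state.last_head = state.head
    state.head = null
    state.tail = null
    if (state.next_head !== state.last_head) {
      // New advertisements arrived during the walk; start the next one.
      state.head = state.next_head
      state.tail = state.next_head
    }
    return
  }
  const ad = await fetchAdvertisement(state.provider_address, state.tail)
  const entry = extractPieceEntry(ad) // one (PieceCID, PayloadCID) pair or null
  if (entry) await savePieceEntry(db, state.provider_id, entry)
  // In dag-json, PreviousID is a CID link; unwrapping is simplified here.
  state.tail = ad.PreviousID ? ad.PreviousID['/'] : null
}
```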

#### Handling the Scale

At the time of writing this document, cid.contact was tracking 322 index
providers. From Spark's measurements, we know there are an additional 843
storage providers that don't advertise to IPNI. The number of storage providers
grows over time, so our system must be prepared to ingest advertisements from
thousands of providers.

Each storage/index provider produces tens of thousands to hundreds of thousands
of advertisements (in total). The initial ingestion run will take a while to
complete. We must also be careful not to overload the SPs by sending too many
requests.

The design outlined in the previous section divides the ingestion process into
small steps that can be scheduled and executed independently. This allows us to
avoid the complexity of managing long-running per-provider tasks and instead
repeatedly execute one step of the process.

Loop 1: Every minute, fetch the latest provider information from cid.contact and
update the persisted state as outlined above.

Loop 2: Discover walks in progress and make one step in each walk (a code
sketch follows the list).

1. Find all provider state records where `tail != null`.

2. For each provider, execute one step as described above. We can execute these
steps in parallel. Since each parallel job will query a different provider,
we are not going to overload any single provider.

3. Optionally, we can introduce a small delay before the next iteration. I think
we won't need it because the time to execute SQL queries should create enough
delay.
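
Under these assumptions, Loop 2 could be sketched as follows (`walkOneStep` as
above; the SQL and scheduling details are illustrative):

```js
// Sketch: repeatedly advance all in-progress walks, one step per provider.
async function runWalkLoop (client) {
  while (true) {
    const { rows } = await client.query(
      'SELECT * FROM provider_state WHERE tail IS NOT NULL'
    )
    if (rows.length === 0) {
      // Nothing to walk right now; avoid a busy loop.
      await new Promise(resolve => setTimeout(resolve, 1000))
      continue
    }
    // Each job queries a different provider, so no single SP is overloaded.
    await Promise.all(rows.map(state => walkOneStep(state, client)))
  }
}
```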

### REST API

Implement the following endpoint that will be called by Spark checker nodes. The
endpoint will sign the response using the server's private key to allow
spark-evaluate to verify the authenticity of results reported by checker nodes:

```
GET /sample/{providerId}/{pieceCid}?seed={seed}
```

can we add an endpoint that allows an SP to see which records there are in the DB associated with them? As per F8 Ptrk's comment a week or so back

Member Author:

See the "Observability" section below. Do you think we need something different?


Response in JSON format when the piece was found:

```json
{
  "samples": ["exactly one CID of a payload block advertised for PieceCID"],
  "pubkey": "server's public key",
  "signature": "signature over dag-json{providerId,pieceCid,seed,samples}"
}
```

Response in JSON format when the piece or the provider was not found:

```json
{
  "error": "code - e.g. PROVIDER_NOT_FOUND or PIECE_NOT_FOUND",
  "pubkey": "server's public key",
  "signature": "signature over dag-json{providerId,pieceCid,seed,error}"
}
```

In the initial version, the server will ignore the `seed` value and use it only
as a nonce preventing replay attacks. Spark checker nodes will set the seed
using the DRAND randomness string for the current Spark round.
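
For illustration, a checker-side call might look like this; the host name and
all values are placeholders:

```js
// Sketch: look up one payload CID to test for a given (provider, piece).
const providerId = 'PROVIDER_PEER_ID' // placeholder
const pieceCid = 'PIECE_CID' // placeholder
const seed = 'DRAND_RANDOMNESS' // the current round's randomness
const res = await fetch(
  `https://piece-indexer.example.com/sample/${providerId}/${pieceCid}?seed=${seed}`
)
const body = await res.json()
if (body.error) {
  console.error('Cannot sample piece:', body.error)
} else {
  const [payloadCid] = body.samples // exactly one payload block CID
  console.log('Payload CID to test:', payloadCid)
}
```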

In the future, when IPNI implements the proposed reverse-index sampling
endpoint, the seed will be used to pick the samples at random. See the
[IPNI Multihash Sampling API proposal](https://github.com/ipni/xedni/blob/526f90f5a6001cb50b52e6376f8877163f8018af/openapi.yaml).

### Observability

We need visibility into the status of ingestion for any given provider. Some
providers don't advertise at all, and some may have a misconfigured integration
with IPNI; we need to understand why our index does not include any data for a
given provider.

Let's enhance the state table with another column describing the ingestion
status as a free-form string and implement a new REST API endpoint to query the
ingestion status.

```
GET /ingestion-status/{providerId}
```

Response in JSON format:

```json
{
  "providerId": "state.provider_id",
  "providerAddress": "state.provider_address",
  "ingestionStatus": "state.ingestion_status",
  "lastHeadWalkedFrom": "state.last_head",
  "piecesIndexed": 123
  // ^^ number of (PieceCID, PayloadCID) records found for this provider
}
```
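
Illustrative usage, with a placeholder host:

```js
// Sketch: inspect why our index has no data for a given provider.
const providerId = 'PROVIDER_PEER_ID' // placeholder
const res = await fetch(
  `https://piece-indexer.example.com/ingestion-status/${providerId}`
)
console.log(await res.json())
```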
24 changes: 21 additions & 3 deletions package-lock.json

Some generated files are not rendered by default.

7 changes: 6 additions & 1 deletion package.json
@@ -5,6 +5,8 @@
"type": "module",
"main": "index.js",
"scripts": {
"lint": "prettier --check .",
"lint:fix": "prettier --write .",
"test": "echo \"Error: no test specified\" && exit 1"
},
"repository": {
@@ -16,5 +18,8 @@
"bugs": {
"url": "https://github.com/filecoin-station/piece-indexer/issues"
},
"homepage": "https://github.com/filecoin-station/piece-indexer#readme"
"homepage": "https://github.com/filecoin-station/piece-indexer#readme",
"devDependencies": {
"prettier": "^3.3.2"
}
}