
Conversation

wy65701436 (Contributor)

No description provided.

Signed-off-by: wang yan <wangyan@vmware.com>

## Compatibility and Consistency

No breaking changes; tag permissions are enforced at the API level.
Member

There is a breaking change, because the way the tag links are stored is changed from distribution to Harbor.

wy65701436 (Contributor, Author), Jul 22, 2025

From the OCI perspective, it's not a breaking change; it's an internal behavior change.

Member

Yes, from the outside the OCI interfaces are unchanged, but internally you swap parts of the logic implemented in distribution for a different logic.

#### Pros: Simple to implement, immediate performance gain.
#### Cons: Leaves tag files behind, may cause confusion or inconsistencies for users browsing storage directly.
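
An out-of-band cleanup tool (also suggested later in this thread) could remove the leftover tag link files. Below is a minimal sketch, assuming a filesystem storage driver and distribution's v2 layout where tags live under `docker/registry/v2/repositories/<repo>/_manifests/tags/<tag>`; the `removeTagDirs` helper is illustrative, not existing Harbor tooling, and an object-store backend would need the storage driver API instead of `os`/`filepath`.

```go
// Illustrative sketch only, not part of Harbor. It assumes a filesystem
// storage driver and distribution's v2 layout, where each tag lives under
// <root>/docker/registry/v2/repositories/<repo>/_manifests/tags/<tag>.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeTagDirs walks the repositories tree and removes every "tags"
// directory, leaving manifests and blobs untouched. Run only when Harbor's
// database is the sole source of truth for tags.
func removeTagDirs(storageRoot string) error {
	reposRoot := filepath.Join(storageRoot, "docker", "registry", "v2", "repositories")
	return filepath.Walk(reposRoot, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() && filepath.Base(path) == "tags" &&
			filepath.Base(filepath.Dir(path)) == "_manifests" {
			fmt.Println("removing", path)
			if err := os.RemoveAll(path); err != nil {
				return err
			}
			return filepath.SkipDir
		}
		return nil
	})
}

func main() {
	if err := removeTagDirs("/storage"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```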

### Option 2: Batch Tag Deletion via Upstream Patch
Member

What effort do you expect in implementing option 3, instead of switching to the latest upstream registry, as proposed in option 2?

wy65701436 (Contributor, Author), Jul 22, 2025

I believe switching to the latest version of the registry is a separate topic, as it would require significantly more time to validate potential regressions — especially given that Harbor users rely on a wide variety of storage backends.

Option 2 is still based on the v2.8 release, but includes a backport of the necessary pull request to support parallel tag deletion.

Btw, none of the three options will impact the v3 bump.

#### Pros: Compatible with current Distribution behavior, improves performance.
#### Cons: Limited gain, still depends on Distribution API and backend performance.
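
As a rough illustration of what the backported parallel tag deletion could look like (the `TagDeleter` interface below is a stand-in modeled on distribution's tag service, not the actual code from distribution/distribution#4329):

```go
// Illustrative sketch of deleting tags concurrently; not the actual backport
// from distribution/distribution#4329. TagDeleter stands in for the tag
// service exposed by the registry's storage layer.
package gc

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type TagDeleter interface {
	Untag(ctx context.Context, tag string) error
}

// deleteTagsParallel removes all given tags with at most `workers`
// concurrent delete calls against the storage backend.
func deleteTagsParallel(ctx context.Context, td TagDeleter, tags []string, workers int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(workers)
	for _, tag := range tags {
		tag := tag
		g.Go(func() error {
			return td.Untag(ctx, tag)
		})
	}
	return g.Wait()
}
```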

### Option 3: Do Not Land Tag Files in Backend (Proposed)
Member

Is this implementation not just a stopgap measure that will be removed once distribution v3 is used in Harbor?

wy65701436 (Contributor, Author)

No, the proposal is permanent, since the tag link is no longer stored in the storage.

Contributor

> Is this implementation not just a stopgap measure that will be removed once distribution v3 is used in Harbor?

@Vad1mo @wy65701436 Could you clarify whether this is related to distribution v3? Will GC performance be improved in distribution v3?

@Vad1mo Have you done any tests to verify this?

chlins (Member) commented on Jul 24, 2025

I think options 2 and 3 can work together to complement each other. For historical stock images, option 2 can improve performance. For newly built image scenarios, option 3 ensures that there are no tags in the distribution. In this case, calling its tag deletion API should theoretically also be fast (perhaps this needs to be confirmed?). This way, we can address the issue of slow garbage collection for both existing and newly built images.

wy65701436 (Contributor, Author)

> I think options 2 and 3 can work together to complement each other. For historical stock images, option 2 can improve performance. For newly built image scenarios, option 3 ensures that there are no tags in the distribution. In this case, calling its tag deletion API should theoretically also be fast (perhaps this needs to be confirmed?). This way, we can address the issue of slow garbage collection for both existing and newly built images.

I don't clearly understand: option 3 is to remove tag deletion permanently. How can we make those two options work together?

Vad1mo (Member) left a comment

Slow GC performance has been a problem in Harbor and Distribution since day one, so for more than 10 years.

  • This issue affects <0.1%-1% of users; >99% will never run into this issue.
  • It only impacts users who are using an object store (afaik more on Azure than S3).
  • It only impacts users who run GC occasionally (every couple of months).
  • The issue won't occur if the GC runs e.g. weekly/monthly.

Option 3: One-Way Door Decision, Split Logic, No Way Back

I think option 3 is the worst option for the future of Harbor, because this is a one-way door decision, even though two two-way-door decisions exist.

  • No option to disable the feature
  • Not futureproof
  • Solution for ~0.1% of users while having an unpredictable impact on 100% of users.
  • If a better solution is introduced in distribution in the future, Harbor can only adopt it with a lot of effort.
  • Extremely difficult to debug and trace down errors in operation correctly. E.g. what happens if a transaction rolls back?
  • Split logic, as some data is in the DB and some data is in the object store. There is a high risk that things can get out of sync.
  • Creating a backup of only images becomes impossible.
  • Image/object store and DB need to be in sync; this is not the case yet, but currently this only affects some rare edge cases.
  • If DB and object store are out of sync, what is the strategy to fix it?

Option 2: Take Patch From Upstream Distribution V3 or Tell Users to Use V3

  • While Harbor currently only supports V2, there is nothing stopping users from using V3 instead of V2.
  • There is a fix for this in upstream distribution v3: distribution/distribution#4329
  • It does not look problematic to backport it, since we already forked V2.
  • When Distribution V3 is adopted in Harbor, the problem will be solved upstream.
  • Forward and backward compatible.
  • Users are already adopting V3 with Harbor.

Option 1: Allow Skip Tag Deletion if Users Really Need It

As said, this is the simplest option.

  • Will provide relief to 0.1% of users suffering from this issue while not impacting the other 99%
  • Can be enabled or disabled if needed
  • Future proof
  • Forward and backward compatible
  • Tags can be deleted out of band with a script if needed

Going forward, I am suggesting adopting Option 1 or Option 2 with a feature switch.
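
For illustration, an Option 1 style feature switch could be as small as a GC job parameter that short-circuits tag deletion; the `skip_tag_deletion` parameter name and the surrounding types in this sketch are hypothetical, not existing Harbor settings.

```go
// Hypothetical sketch of an Option 1 style switch inside a GC job;
// the parameter name and types are illustrative only.
package gc

import "context"

type Registry interface {
	DeleteTag(ctx context.Context, repo, tag string) error
}

// deleteTags removes the given tags unless the job was started with the
// hypothetical skip_tag_deletion parameter, in which case the tag link
// files are intentionally left behind for an out-of-band cleanup.
func deleteTags(ctx context.Context, reg Registry, repo string, tags []string, params map[string]interface{}) error {
	if skip, ok := params["skip_tag_deletion"].(bool); ok && skip {
		return nil // Option 1: leave tag files in storage, clean them up separately.
	}
	for _, tag := range tags {
		if err := reg.DeleteTag(ctx, repo, tag); err != nil {
			return err
		}
	}
	return nil
}
```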

Vad1mo changed the title from "add proposal for gc enhancement" to "Garbage Collection Performance Enhancement, Store Tags in DB" on Jul 30, 2025
wy65701436 (Contributor, Author) commented on Aug 21, 2025

@Vad1mo, thanks for your comments and I am happy to see the discussion.

Regarding the proposed option 3, I'd like to provide some background: since Harbor v2.0, Harbor no longer leverages tags in the back-end storage. Tags are stored in both the storage and the database, and Harbor core interacts with the storage using digests. This means users can still pull by tag, but on the Harbor core side the tag is translated into the corresponding digest, and the request is proxied as a pull-by-digest. In short, Harbor stores the tag information but doesn't actually use it when interacting with storage, which is why we eventually need to clean it up.
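
As a rough sketch of that translation (the function names below are hypothetical, not Harbor's actual middleware), a pull-by-tag manifest request can be rewritten into a pull-by-digest request before it is proxied to the backend:

```go
// Illustrative sketch of resolving a tag to a digest before proxying;
// resolveDigestFromDB and the handler wiring are hypothetical names,
// not Harbor's actual middleware.
package proxy

import (
	"fmt"
	"net/http"
	"strings"
)

// resolveDigestFromDB looks up the manifest digest for repo:tag in the
// database (stand-in for the real artifact/tag tables).
func resolveDigestFromDB(repo, tag string) (string, error) {
	// e.g. SELECT digest FROM artifact JOIN tag ON ... WHERE repo = $1 AND tag.name = $2
	return "", fmt.Errorf("not implemented in this sketch")
}

// rewriteManifestRequest turns GET /v2/<repo>/manifests/<tag> into
// GET /v2/<repo>/manifests/<digest> so the backend never needs tag files.
func rewriteManifestRequest(r *http.Request, repo, reference string) error {
	if strings.HasPrefix(reference, "sha256:") {
		return nil // already a pull-by-digest
	}
	digest, err := resolveDigestFromDB(repo, reference)
	if err != nil {
		return err
	}
	r.URL.Path = fmt.Sprintf("/v2/%s/manifests/%s", repo, digest)
	return nil
}
```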

Going even further: Harbor stores it but doesn't use it, yet the GC still has to delete it.

Regarding the SYNC point you mentioned, this is a good observation, but Harbor does not sync tag information between the database and storage at all. Harbor only trusts and uses the database. For example, if you add a tag through the Harbor UI/API, Harbor will not create a corresponding tag in storage. The only time a tag file is created in storage is when an image is pushed by tag. In that case, Harbor proxies the original request to the back end, which lands a tag file in storage. However, even then, Harbor itself never uses that tag file — it only causes performance overhead, which is exactly the issue this proposal aims to address.

Here’s a concrete scenario:

  1. Push harbor-ip/library/hello-world:latest into Harbor.
  2. Remove the latest tag from the UI.
  3. Pull harbor-ip/library/hello-world:latest; you will see a "not found" error in the client.
  4. But you can still see the tag in the storage at this point, since Harbor doesn't sync it and has no need to sync it.

From this perspective, GC acts like a sync mechanism by cleaning up these unused tag files. Additionally, because there is no sync between Harbor’s DB and storage, I renamed the binary to registry_do_not_use_gc. The reason is that the native Distribution GC may incorrectly treat an image as garbage even though Harbor still considers it valid in the database. Running the native GC directly could therefore lead to data loss in storage.

Also, from a design perspective, this isn’t a good practice, because Harbor only stores tags that are pushed by clients, and does not store tags created via the Harbor API.

All in all, that's why I proposed option 3: avoid landing tag files in storage altogether.

This won't affect any functionality, except for one edge case you mentioned (a user configuring Distribution directly against Harbor's storage), which I assume we don't need to support.

Lastly, from a performance perspective, avoiding deletion altogether is far more efficient than performing deletions in parallel.

wy65701436 (Contributor, Author)

More generally, from Harbor’s perspective, distribution acts only as a storage driver that locates files by repository and digest to fulfill GET blob and GET manifest requests. Harbor does not rely on it to satisfy the full OCI Distribution v2 API. For example, APIs like /v2/_catalog and /tags are served entirely from Harbor’s database, not from Distribution.
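
For example, the tag listing can be answered straight from the database; the table and column names in this sketch are placeholders rather than Harbor's actual schema:

```go
// Sketch of serving /v2/<name>/tags/list from the database instead of
// distribution; the SQL schema here is a placeholder, not Harbor's real one.
package api

import (
	"context"
	"database/sql"
)

// listTags returns all tag names for a repository from the DB, so the
// request never reaches the storage backend.
func listTags(ctx context.Context, db *sql.DB, repoName string) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT t.name
		   FROM tag t
		   JOIN repository r ON t.repository_id = r.id
		  WHERE r.name = $1
		  ORDER BY t.name`, repoName)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var tags []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		tags = append(tags, name)
	}
	return tags, rows.Err()
}
```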

wy65701436 changed the title from "Garbage Collection Performance Enhancement, Store Tags in DB" to "Garbage Collection Performance Enhancement -- avoid landing tag files in storage" on Aug 21, 2025
bupd (Contributor) left a comment

@wy65701436 Thank you for this proposal. This is something the community has been requesting for a long time.

Here are my suggestions:
The current idea stated in the proposal only solves part of the performance issue (i.e., not storing tags in the filesystem); it's like crossing half the river. If Harbor also stores both digests and their tag relations directly in its DB, we can remove the need for digest lookups during garbage collection done by distribution.

From what we have observed, the time taken to traverse cross-linked blobs by the distribution is actually greater than the time taken to delete an item. This becomes even more of a problem in S3-like backends where traversal is far more expensive than put or delete operations.

A better long-term approach is to upgrade to distribution v3. This enables batch deletion of unreferenced blobs, which is very helpful for deleting multiple blobs in a single request. Pairing this with a Harbor GC that runs entirely at the DB level, without relying on distribution to traverse references, would be a great performance improvement.

So distribution would only be responsible for storing and deleting objects, while Harbor manages all reference tracking. This removes the costly traversal and simplifies GC overall.
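
A minimal sketch of what DB-level reference tracking for GC candidate selection could look like; the `blob` and `artifact_blob` tables here are placeholders for whatever schema actually holds the references:

```go
// Sketch of selecting GC candidates from the database instead of traversing
// the storage backend; table names are placeholders, not Harbor's schema.
package gc

import (
	"context"
	"database/sql"
)

// unreferencedBlobs returns digests of blobs that no artifact references
// anymore, so the registry only has to issue plain deletes for them.
func unreferencedBlobs(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT b.digest
		   FROM blob b
		  WHERE NOT EXISTS (
		        SELECT 1 FROM artifact_blob ab WHERE ab.digest = b.digest)`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var digests []string
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			return nil, err
		}
		digests = append(digests, d)
	}
	return digests, rows.Err()
}
```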

wy65701436 (Contributor, Author)

> @wy65701436 Thank you for this proposal. This is something the community has been requesting for a long time. [...] So distribution would only be responsible for storing and deleting objects, while Harbor manages all reference tracking. This removes the costly traversal and simplifies GC overall.

Thanks @bupd for your comments.

I’m not opposed to upgrading to v3 from Harbor’s perspective—in fact, I support it since I’m also one of its maintainers. However, one concern we’ve already discussed in the community is that we’d like to wait until v3 is stable before making the move, as many Harbor users run it in production environments.

That said, upgrading to v3 is not within the scope of this proposal. Option 3 does not prevent us from upgrading in the future. I understand the benefits of v3, but this is a separate topic. Even if we move to v3, as long as Harbor continues to store tag files in the backend, we will still need to handle tag deletion regardless of any performance improvements.

For the reasoning behind my proposal to avoid storing tags in the backend, please see the earlier comments.

In summary, this proposal is intended to drive discussion based on the current situation, which is still built on distribution v2.

bupd (Contributor) commented on Aug 22, 2025

@wy65701436 I understand your concern about v3 not being fully mature, but I'm not too worried about v3 at this stage. What I'd really like is your thoughts on the traversal problem.

When running GC, the time taken to traverse cross-linked blobs by distribution is actually greater than the time taken to delete an item. This becomes even more of a problem in S3-like backends where traversal is far more expensive than put or delete operations.

Without addressing the traversal overhead, this would be like crossing half the river. If we do fix this traversal problem, we will see huge improvements in GC performance (i.e., from days to ~hours). The larger the registry, the slower the traversal of unreferenced blobs gets.

wy65701436 (Contributor, Author)

> @wy65701436 I understand your concern about v3 not being fully mature, but I'm not too worried about v3 at this stage. What I'd really like is your thoughts on the traversal problem. [...]

@bupd

I believe there are two main areas we can improve:

  • Tag deletion performance in v2 – this needs to be addressed regardless of which option we choose, even if none of the three proposed options are ideal.
  • Performance benefits by moving to v3 – upgrading to v3 could bring additional improvements.

These two areas are independent and can be tackled separately without conflict. For this proposal, the focus is on area 1. For area 2, we can start a separate discussion thread.

I would also like to hear your suggestions on the three options, especially option 3, which I proposed.

wy65701436 (Contributor, Author) commented on Aug 25, 2025

> When running GC, the time taken to traverse cross-linked blobs by distribution is actually greater than the time taken to delete an item. [...]

As for the traversal, Harbor doesn't rely on distribution to locate the files; it tracks the references in the Harbor DB. So we don't have any traversal performance problem.

bupd (Contributor) left a comment

lgtm

wy65701436 (Contributor, Author)

> Regarding the proposed option 3, I'd like to provide some background: since Harbor v2.0, Harbor no longer leverages tags in the back-end storage. [...] Lastly, from a performance perspective, avoiding deletion altogether is far more efficient than performing deletions in parallel.

@Vad1mo any updates?


Add a cleanup tool to remove orphaned tag link files (optional).

Benchmark GC performance in a real S3 environment before and after the change.
Contributor

I think the benchmark is necessary to help us make a decision, so it should be done in the PoC phase rather than post-implementation.

I feel we need option 1 to have an immediate improvement for existing images, and option 3 to have long-term consistency.

wy65701436 (Contributor, Author)

I've updated the proposal to merge options 1 and 3, and I will do testing in an S3-based environment once I get a valid environment.

1. Remove options 1, 2, and 3.
2. Combine options 1 and 3 as parts of the solution.

Signed-off-by: wang yan <yan-yw.wang@broadcom.com>
chlins (Member) left a comment

lgtm
