Critical Need to Prioritize Substrate Upgrades to Mitigate Operational Risks #1030

Open
sameh-farouk opened this issue Feb 19, 2025 · 5 comments

@sameh-farouk
Member

sameh-farouk commented Feb 19, 2025

Describe the bug

Our blockchain infrastructure currently faces significant technical debt due to delayed framework updates. Specifically:

  • Vulnerability Exposure:
    We encountered a critical bug that was only resolved in the Substrate polkadot-1.16.0 release. Remaining on our current version (1.1.0) leaves the chain exposed to multiple known vulnerabilities patched in later releases.

  • Operational Risks:
    Failing to upgrade regularly risks catastrophic chain halts (e.g., finality stalls, block production failures). Although we recovered from a recent multi-day outage, a recurrence in production could:

    • Damage the company’s reputation

    • Force costly emergency measures (chain reset/restart)

As the sole maintainer of the TFChain project at the moment, I am facing significant resource constraints. Despite the critical nature of this work, I am continually redirected to other priorities and projects, which leaves essential TFChain development and maintenance tasks at risk of delay or neglect. I recommend allocating dedicated engineering resources to upgrade Substrate systematically. Postponing this work compounds technical debt and steadily increases operational risk.

Additional context

#1029

@despiegk
Contributor

how much work to upgrade?

@sameh-farouk
Member Author

sameh-farouk commented Feb 19, 2025

> how much work to upgrade?

The work will span several months (including iterations of development, deployment, and testing) and cannot be estimated precisely: there are approximately 15 releases to migrate through, each of which may introduce breaking changes to the Substrate core pallets we depend on.

  • The upgrade roadmap is incremental: releases must be applied one after another and cannot be executed in parallel.
  • Each upgrade typically includes migrations for core pallets, requiring careful execution and thorough testing before deployment.
  • Every release must be deployed on development networks and rigorously tested to identify and resolve potential issues.

@sameh-farouk
Member Author

sameh-farouk commented Feb 19, 2025

I can provide more accurate timelines once I dedicate focused effort to this matter, potentially after leading one or two upgrades myself.

BTW, this issue was known and flagged earlier (by Dylan before he left) but was deprioritized due to resource constraints.
Since joining the TFChain project, one of my main focuses has been addressing the backlog of bug reports, particularly critical billing-related issues, all of which were resolved in the last milestone. Erwan was previously tasked with leading upgrades but left before completing them.

@despiegk
Contributor

Maybe it would be easier to just create a new TFChain 4.0.
Without billing, we wouldn't need capacity tracking either,
and we could migrate the nodes to it.

Would that be less work?

@sameh-farouk
Member Author

sameh-farouk commented Feb 20, 2025

Possibly, yes. If there’s no plan to keep version 3 operational alongside version 4, then deprioritizing it is justified. I just felt it necessary to bring this issue to attention.

That said, I’m curious: what is the timeline for planning and developing TFChain 4? I’d appreciate being involved in the discussions early on so that I can contribute.
