
Conversation

@sitole (Member) commented Oct 7, 2025

  • Deploys Traefik v3.5 as a Nomad job with configuration for auto-discovery of Nomad services.
  • Adds an ingress load balancer that will handle traffic for services behind Traefik.
  • Moves "additional domains" into Secret Manager so they can be set from other IaC projects and DNS can propagate correctly for each domain routed through Traefik.

Note

Deploys Traefik-based ingress with a managed HTTPS load balancer and moves additional domain routing to Secret Manager, adding ingress vars and wiring across modules.

  • Ingress:
    • Deploy Traefik (nomad/jobs/ingress.hcl) via nomad_job.ingress with configurable count, CPU/memory, and ports (see the jobspec sketch after this list).
    • Add GCP HTTPS LB for ingress (nomad-cluster/network/ingress.tf): health check, backend service, URL map, target HTTPS proxy, global IP, forwarding rule; reference cert map.
    • Open firewall and expose named port for ingress (nomad-cluster/network/main.tf, nodepool-api.tf).
  • Domains configuration:
    • Create Secret Manager secret routing-domains and initial version (init/main.tf); output routing_domains_secret_name (init/outputs.tf).
    • Read routing-domains and merge with env ADDITIONAL_DOMAINS to form local.additional_domains (main.tf); pass to cluster (see the merge sketch after this list).
  • Variables & plumbing:
    • Introduce ingress_port (with defaults) and ingress_count (variables.tf); surface through modules (main.tf, nomad/variables.tf, nomad-cluster/variables.tf, nomad-cluster/network/variables.tf).
    • Wire ingress_port to network, firewall, and instance group named ports; pass ingress_count to Nomad job.
    • Makefile: add INGRESS_COUNT to Terraform var propagation.
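
For orientation, a minimal sketch of what the nomad/jobs/ingress.hcl jobspec could look like (illustrative only; the count, port, and resource values stand in for the ingress_count, ingress_port, and CPU/memory variables that the real job receives from Terraform):

job "ingress" {
  type        = "service"
  datacenters = ["*"]

  group "ingress" {
    count = 2 # wired from the ingress_count Terraform variable in the real job

    network {
      port "http" {
        static = 8080 # stands in for the ingress_port variable
      }
    }

    task "traefik" {
      driver = "docker"

      config {
        image = "traefik:v3.5"
        ports = ["http"]
        args = [
          "--entrypoints.web.address=:8080",
          "--providers.nomad=true", # auto-discovery of Nomad services
          "--providers.nomad.endpoint.address=http://127.0.0.1:4646",
        ]
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}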
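And a hedged sketch of the routing-domains read and merge in main.tf, assuming the secret holds a comma-separated domain list (the actual format is not shown in this PR):

data "google_secret_manager_secret_version" "routing_domains" {
  secret = "routing-domains"
}

locals {
  # Merge the secret's domains with the env-driven ADDITIONAL_DOMAINS variable,
  # dropping duplicates and empty entries.
  additional_domains = distinct(compact(concat(
    split(",", trimspace(data.google_secret_manager_secret_version.routing_domains.secret_data)),
    split(",", var.additional_domains),
  )))
}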

Written by Cursor Bugbot for commit 8d137bc.

@sitole sitole added the improvement Improvement for current functionality label Oct 7, 2025
linear bot commented Oct 7, 2025

@sitole (Member Author) commented Oct 7, 2025

This is a prerequisite for https://github.com/e2b-dev/belt/pull/217, which shows how a service is exposed via the new ingress.
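
For context, with Traefik's Nomad provider enabled, a downstream service typically opts into routing via service tags. A minimal sketch (the service name and domain are hypothetical, not taken from that PR):

service {
  name     = "belt" # hypothetical service name
  port     = "http"
  provider = "nomad"

  tags = [
    "traefik.enable=true",
    "traefik.http.routers.belt.rule=Host(`belt.example.dev`)", # hypothetical domain
  ]
}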

@sitole sitole marked this pull request as ready for review October 7, 2025 15:42

}
}

resource "google_compute_url_map" "ingress" {
Contributor:

Can we get a better name here? I think we now have two load balancers, "ingress" and "orch_map", and neither name helps explain what it does. Maybe "traefik" and "direct"?

Member Author:

I like ingress, as it's a common name and it describes what it is. Yes, "orch_map" is, in my opinion, a mistake, as it no longer makes sense. Ideally, I would like to transition away from the current load balancer once the migration is complete, or rename it to something like "ingress-sandboxes" to distinguish the two better.

I don't want to call it Traefik, since we can switch the ingress backend at any time in the future, but I'm open to a better name if you come up with one.

Contributor:

A note here: if we want to rename orch_map to ingress-sandboxes, maybe we should name this ingress something like ingress-api, ingress-management, or ingress-services.

Member Author:

I still don't like that we would need two load balancers just because we cannot filter sandbox traffic. I will look into it again tomorrow.

Yep, we can rename ingress to something else. I am not sure about management/api, as we may use it for something different in the future. "Ingress services" sounds okay to me.

Contributor:

I thought it was actually quite nice to have separate LBs for users' sandbox traffic and our services traffic (different limitations, limits, HTTP support, etc.), but maybe it's unnecessary.

Member Author:

Ideally, we should be able to match sandbox traffic with different rules (right now it's a catch-all fallback) so we can apply different limits/Armor rules to it; then we don't need separate LBs.

For supporting newer versions of HTTP, etc., we can still migrate everything relatively easily, and I'm not sure we would ever need a special LB that cannot handle both sandbox and services traffic.

Member Author:

Okay, I discovered that a GCP Cloud Armor policy rule allows filtering based on a host regex, so we can use one shared backend and apply dynamic rules based on the domain there. This solves our issue of needing two load balancers. I would stick with the ingress naming; once the migration is completed, we can remove orch-map as outdated.

Below is an example of a regexp rule that catches sandbox traffic and applies rate limiting. In the same way, we can create rules for API limiting and other restrictions (a hedged sketch of such a rule follows the example). The good thing is that this only appends rules to an already existing security policy, so we can push rules even from a private monorepo that handles blocking/rate limiting for services that are not open source.

resource "google_compute_security_policy_rule" "sandbox-throttling-ip" {
  security_policy = google_compute_security_policy.default["session"].name
  action          = "throttle"
  priority        = "500"

  match {
    expr {
      expression = <<-EOT
request.headers["host"].matches("^(?i)[0-9]+-[a-z0-9-]+\\.e2b-jirka\\.dev$")
EOT
    }
  }

  rate_limit_options {
    conform_action = "allow"
    exceed_action  = "deny(429)"

    enforce_on_key = ""

    enforce_on_key_configs {
      enforce_on_key_type = "IP"
    }

    rate_limit_threshold {
      count        = 40000
      interval_sec = 60
    }
  }

  description = "Requests to sandboxes from IP address"
}
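
For illustration, an API rule "in the same way" could look like the following hedged sketch; the resource name, priority, host, and threshold are hypothetical:

resource "google_compute_security_policy_rule" "api-throttling-ip" {
  security_policy = google_compute_security_policy.default["session"].name
  action          = "throttle"
  priority        = "400" # hypothetical; must not collide with other rules

  # Exact-match the API host instead of the sandbox subdomain pattern.
  match {
    expr {
      expression = <<-EOT
        request.headers["host"] == "api.e2b-jirka.dev"
      EOT
    }
  }

  rate_limit_options {
    conform_action = "allow"
    exceed_action  = "deny(429)"
    enforce_on_key = ""

    enforce_on_key_configs {
      enforce_on_key_type = "IP"
    }

    rate_limit_threshold {
      count        = 6000 # hypothetical API budget per IP per minute
      interval_sec = 60
    }
  }

  description = "Requests to the API from IP address"
}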

Member Author:

Internal docs that will contain all info related to migration and current state -> https://www.notion.so/e2bdev/Ingress-Migration-288b8c296873807a8264f1615602d11d

@sitole sitole requested a review from djeebus October 8, 2025 09:11
@jakubno jakubno self-assigned this Oct 8, 2025
@dobrac (Contributor) left a comment:

Just to confirm: this doesn't route any traffic yet and is just preparation?


@sitole sitole force-pushed the feat/ingress-for-additional-services-deployed-to-our-cloud-eng-3138 branch from 9e24600 to 8d137bc on October 10, 2025 10:13
@sitole (Member Author) commented Oct 10, 2025

Simplified diagram of the current state:

flowchart TD
    B("Google Load Balancer") --> C{"Rule matching"}
    n1["Services Traffic"] --> B
    A["Sandbox Traffic"] --> B
    C -- "<span style=background-color:>#1 api.e2b.dev</span>" --> F["API GCP Backend"]
    C -- "<span style=background-color:>#2 docker.e2b.dev</span>" --> n4["Docker GCP Backend"]
    C -- #3 catch all --> n2["Sandbox Proxy GCP Backend"]
    n2 -- Sandbox rate limiting --> n5["Client Proxy"]
    n4 --> n6["Docker proxy instance"]
    F -- Api rate limiting --> n7["API instance"]

    n1@{ shape: rect}
    n4@{ shape: rect}
    n2@{ shape: rect}
    n5@{ shape: rect}
    n6@{ shape: rect}
    n7@{ shape: rect}

This is the state that should be the outcome of the ingress migration. Theoretically, if tested properly, we could use Traefik even for sandbox traffic routing, but that is a separate topic for now. With the proposed changes, we can later gradually migrate the API and Docker proxy services to the new ingress-services load balancer.

The only reason we cannot easily use a single load balancer is that we are capturing sandbox traffic with a "catch all" rule and applying sandbox rate limits to it. If in the future this is done at the app level (in the client proxy, for example), since customer limits will differ per team/sandbox, we can migrate to a single load balancer for everything.

flowchart TD
    B("Current Load Balancer") --> C{"Rule matching"}
    A["Sandbox Traffic"] --> B
    C -- #3 catch all --> n2["Sandbox Proxy GCP Backend"]
    n2 -- Sandbox rate limiting --> n5["Client Proxy"]
    n4["Ingress Backend"] -- Subdomain based rate limits --> n6["Traefik"]
    n1["Services Traffic"] --> n8["Ingress Load Balancer"]
    n8 --> n9["Rule matching"]
    n9 --> n4
    n6 --> n10["API"] & n11["Docker Proxy"]

    n2@{ shape: rect}
    n5@{ shape: rect}
    n4@{ shape: rect}
    n6@{ shape: rect}
    n1@{ shape: rect}
    n8@{ shape: rounded}
    n9@{ shape: diam}
    n10@{ shape: rect}
    n11@{ shape: rect}
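For context on the "Subdomain based rate limits" edge above: Traefik can attach a rateLimit middleware to an individual router through Nomad service tags. A hedged sketch with hypothetical names and numbers:

service {
  name = "docker-proxy" # hypothetical service name
  port = "http"

  tags = [
    "traefik.enable=true",
    # Route by subdomain, then attach a rate-limit middleware to this router only.
    "traefik.http.routers.docker.rule=Host(`docker.e2b.dev`)",
    "traefik.http.routers.docker.middlewares=docker-ratelimit",
    "traefik.http.middlewares.docker-ratelimit.ratelimit.average=100", # req/s, hypothetical
    "traefik.http.middlewares.docker-ratelimit.ratelimit.burst=200",
  ]
}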

cc @dobrac

@sitole (Member Author) commented Oct 10, 2025

Internal docs with ingress migration status and next steps:
https://www.notion.so/e2bdev/Ingress-Migration-288b8c296873807a8264f1615602d11d

@sitole sitole requested a review from dobrac October 12, 2025 18:40
@dobrac dobrac assigned dobrac and unassigned jakubno Oct 14, 2025
