Commit a4273e4

Add custom email domain to Azure Communication Services (#884)
### Summary & Motivation

Switch transactional email sending from the auto-provisioned `*.azurecomm.net` Azure-managed sender to a brand-aligned `no-reply@<custom-domain>` sender on Azure Communication Services. The Azure-managed sender disqualifies the brand from DMARC alignment, sender reputation tied to the custom domain, and anti-phishing signals such as Apple Mail's OTP autofill, which requires the sender's eTLD+1 to match the bound verify-form domain.

- Add an optional `domainName` parameter to `cloud-infrastructure/modules/communication-services.bicep`. When set, the module provisions a `CustomerManaged` domain alongside the existing `AzureManagedDomain`, creates a `no-reply` sender username on it, and links both domains to the `communicationServices` resource.
- Promote the `CustomerManaged` domain to be the active sender automatically once its DNS verification status reaches `Verified`. Until then, the `fromSenderDomain` output stays on the Azure-managed fallback so transactional email keeps flowing through the verification window. The existing `SENDER_EMAIL_ADDRESS=no-reply@${communicationService.outputs.fromSenderDomain}` wiring on `account-api` and `main-api` picks up the new sender on the next container app revision, with no env-var or app-code change.
- Pass the cluster's `domainName` parameter (already wired to the `STAGING_DOMAIN_NAME` and `PRODUCTION_DOMAIN_NAME` GitHub variables and used today for cluster ingress) into the communication-services module so the email sender matches the host the user lands on.
- Extend the existing `Show DNS Configuration` step in `.github/workflows/_deploy-infrastructure.yml` to also surface the ACS Email custom-domain DNS records. The step uses the same gating pattern as the cluster-ingress DNS check: silent on first deploy when the resource does not exist, one line on subsequent runs once the domain is fully verified, and a list of TXT and CNAME records to add at the registrar while verification is still pending. While pending, the step also calls `az communication email domain initiate-verification` for each record still in `NotStarted`, `VerificationFailed`, or `CancellationRequested` (idempotent, required because Azure does not auto-trigger verification when DNS records appear).

### Checklist

- [x] I have added tests, or done manual regression tests
- [x] I have updated the documentation, if necessary
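
The gating described in the last bullet can be sketched without any Azure calls. This is only an illustration under stated assumptions: the function names `needs_verification_kick` and `classify_email_domain` are hypothetical and do not appear in the commit, and the statuses are passed in as plain arguments instead of being read from the `az` JSON output.

```shell
# Hypothetical helper: succeeds (exit 0) when a verification status is one of
# the three the commit message lists as needing an explicit
# `initiate-verification` call.
needs_verification_kick() {
  case "$1" in
    NotStarted|VerificationFailed|CancellationRequested) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: classify the three outcomes of the DNS step.
# $1 is "yes" when the domain resource exists; the remaining arguments are
# the per-record statuses (Domain, SPF, DKIM, DKIM2 in the workflow).
classify_email_domain() {
  exists="$1"; shift
  if [ "$exists" != "yes" ]; then echo "absent"; return; fi
  for status in "$@"; do
    if [ "$status" != "Verified" ]; then echo "pending"; return; fi
  done
  echo "verified"
}

classify_email_domain no                                     # absent
classify_email_domain yes Verified Verified Verified Verified    # verified
classify_email_domain yes Verified NotStarted Verified Verified  # pending
```

In this sketch "absent" corresponds to the silent first deploy, "verified" to the one-line confirmation, and "pending" to printing the records and kicking verification.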
2 parents 5b7d84f + fd1ec40 commit a4273e4

5 files changed

Lines changed: 210 additions & 3 deletions

.github/workflows/_deploy-infrastructure.yml

Lines changed: 129 additions & 0 deletions
@@ -87,6 +87,35 @@ jobs:
       - name: Plan Shared Environment Resources
         run: bash ./cloud-infrastructure/environment/deploy-environment.sh ${{ inputs.unique_prefix }} ${{ inputs.azure_environment }} ${{ inputs.shared_location }} ${{ inputs.production_service_principal_object_id }} --plan
 
+      - name: Detect Email Custom Domain Verification
+        if: ${{ inputs.domain_name != '' && inputs.domain_name != '-' }}
+        run: |
+          # Auto-detect whether the email custom domain (eTLD+1 of domain_name) is fully verified in
+          # Azure Communication Services. If yes, export USE_CUSTOM_EMAIL_DOMAIN=true to $GITHUB_ENV
+          # so the next step (Plan Cluster Resources) inherits it; the bicepparam reads the env var
+          # and the Bicep links the CustomerManaged domain + flips SENDER_EMAIL_ADDRESS to no-reply@<apex>.
+          # When not verified (or when the resource does not exist yet on the very first deploy), the
+          # env var stays unset, deploy-cluster.sh defaults it to false, the link is skipped, and mail
+          # keeps flowing on the AzureManaged sender. This means there is no operator toggle to flip:
+          # add DNS, verification completes, the next deploy auto-flips the sender.
+          CLUSTER_RESOURCE_GROUP_NAME="${{ inputs.unique_prefix }}-${{ inputs.azure_environment }}-${{ inputs.cluster_location_acronym }}"
+          email_domain_name=$(echo "${{ inputs.domain_name }}" | awk -F. '{ if (NF >= 2) print $(NF-1)"."$NF; else print $0 }')
+
+          az extension add --name communication --allow-preview true --only-show-errors 2>/dev/null || true
+          email_domain_details=$(az communication email domain show --name "$email_domain_name" --email-service-name $CLUSTER_RESOURCE_GROUP_NAME --resource-group $CLUSTER_RESOURCE_GROUP_NAME -o json 2>/dev/null || echo "")
+
+          if [[ -n "$email_domain_details" ]] && echo "$email_domain_details" | jq empty 2>/dev/null; then
+            domain_status=$(echo "$email_domain_details" | jq -r '.verificationStates.Domain.status // "NotStarted"')
+            spf_status=$(echo "$email_domain_details" | jq -r '.verificationStates.SPF.status // "NotStarted"')
+            dkim_status=$(echo "$email_domain_details" | jq -r '.verificationStates.DKIM.status // "NotStarted"')
+            dkim2_status=$(echo "$email_domain_details" | jq -r '.verificationStates.DKIM2.status // "NotStarted"')
+
+            if [[ "$domain_status" == "Verified" ]] && [[ "$spf_status" == "Verified" ]] && [[ "$dkim_status" == "Verified" ]] && [[ "$dkim2_status" == "Verified" ]]; then
+              echo "USE_CUSTOM_EMAIL_DOMAIN=true" >> $GITHUB_ENV
+              echo "$(date +"%Y-%m-%dT%H:%M:%S") Email custom domain '$email_domain_name' is verified - sender will be no-reply@$email_domain_name."
+            fi
+          fi
+
       - name: Plan Cluster Resources
         id: deploy_cluster
         env:
@@ -142,6 +171,81 @@ jobs:
                 echo "- A CNAME record with the Host name '${{ inputs.back_office_domain_name }}' that points to address 'back-office.$default_domain'."
               fi
             fi
+
+            # Email custom domain DNS. Derived from domain_name as its eTLD+1 (apex) - cluster ingress
+            # is typically a CNAME and DNS forbids TXT/other records at the same name as a CNAME
+            # (RFC 1034); the apex of a domain cannot be a CNAME, so SPF/DKIM records work there. The
+            # awk derivation takes the last two parts of the host (correct for .net, .com, .io; wrong
+            # for multi-part public suffixes like .co.uk - revisit if that ever applies).
+            # Skip silently when the resource does not exist yet, report "configured correctly" once
+            # fully Verified, and only print the records (and kick verification, idempotent) while
+            # there is real work to do. The communication extension is in preview and emits a stderr
+            # banner on every call, so we discard stderr and validate captured stdout is JSON.
+            email_domain_name=$(echo "${{ inputs.domain_name }}" | awk -F. '{ if (NF >= 2) print $(NF-1)"."$NF; else print $0 }')
+            az extension add --name communication --allow-preview true --only-show-errors 2>/dev/null || true
+            email_domain_details=$(az communication email domain show --name "$email_domain_name" --email-service-name $CLUSTER_RESOURCE_GROUP_NAME --resource-group $CLUSTER_RESOURCE_GROUP_NAME -o json 2>/dev/null || echo "")
+
+            if [[ -n "$email_domain_details" ]] && echo "$email_domain_details" | jq empty 2>/dev/null; then
+              domain_status=$(echo "$email_domain_details" | jq -r '.verificationStates.Domain.status // "NotStarted"')
+              spf_status=$(echo "$email_domain_details" | jq -r '.verificationStates.SPF.status // "NotStarted"')
+              dkim_status=$(echo "$email_domain_details" | jq -r '.verificationStates.DKIM.status // "NotStarted"')
+              dkim2_status=$(echo "$email_domain_details" | jq -r '.verificationStates.DKIM2.status // "NotStarted"')
+
+              if [[ "$domain_status" == "Verified" ]] && [[ "$spf_status" == "Verified" ]] && [[ "$dkim_status" == "Verified" ]] && [[ "$dkim2_status" == "Verified" ]]; then
+                echo "$(date +"%Y-%m-%dT%H:%M:%S") Email custom domain '$email_domain_name' is already configured correctly. Sender 'no-reply@$email_domain_name' is active."
+              else
+                # Verification does not auto-start when DNS records appear; kick it on every relevant
+                # type. Idempotent. Domain ownership must be kicked too - without this call, Azure
+                # leaves it in NotStarted forever even after the ms-domain-verification TXT is in DNS.
+                for verification_type in Domain SPF DKIM DKIM2; do
+                  current_status=$(echo "$email_domain_details" | jq -r --arg t "$verification_type" '.verificationStates[$t].status // "NotStarted"')
+                  if [[ "$current_status" == "NotStarted" ]] || [[ "$current_status" == "VerificationFailed" ]] || [[ "$current_status" == "CancellationRequested" ]]; then
+                    az communication email domain initiate-verification --domain-name "$email_domain_name" --email-service-name $CLUSTER_RESOURCE_GROUP_NAME --resource-group $CLUSTER_RESOURCE_GROUP_NAME --verification-type "$verification_type" --only-show-errors 1>/dev/null 2>&1 || true
+                  fi
+                done
+
+                # Red ANSI on the headline so the records jump out in the log; the records themselves
+                # stay plain so they are easy to copy. The final ::error:: directive is a GitHub Actions
+                # workflow command - it surfaces a red annotation in the workflow summary and on any PR
+                # check, not only inline.
+                # Azure returns absolute names for apex records (Domain TXT, SPF TXT) and relative names
+                # for subdomain records (DKIM CNAMEs). DNS UIs (Microsoft 365, Azure DNS, Cloudflare,
+                # Route 53, ...) want "@" as shorthand for the apex - typing the literal full domain
+                # often causes the panel to append the zone again (e.g., "platformplatform.net" becomes
+                # "platformplatform.net.platformplatform.net"). Translate apex hits to "@" so the
+                # operator can paste verbatim into M365's DNS panel.
+                normalize_name() {
+                  if [[ "$1" == "$email_domain_name" ]]; then echo "@"; else echo "$1"; fi
+                }
+
+                echo -e "$(date +"%Y-%m-%dT%H:%M:%S") \033[0;31mPlease add the following DNS entries for the email custom domain and then retry:\033[0m"
+                domain_record_name=$(echo "$email_domain_details" | jq -r '.verificationRecords.Domain.name // ""')
+                domain_record_value=$(echo "$email_domain_details" | jq -r '.verificationRecords.Domain.value // ""')
+                if [[ -n "$domain_record_name" ]]; then echo "- A TXT record with the name '$(normalize_name "$domain_record_name")' (apex of '$email_domain_name') and the value '$domain_record_value' (Domain ownership). This is a separate TXT record - it coexists alongside any existing SPF or other TXT records at the same name."; fi
+                spf_record_name=$(echo "$email_domain_details" | jq -r '.verificationRecords.SPF.name // ""')
+                spf_record_value=$(echo "$email_domain_details" | jq -r '.verificationRecords.SPF.value // ""')
+                if [[ -n "$spf_record_name" ]]; then echo "- A TXT record with the name '$(normalize_name "$spf_record_name")' (apex of '$email_domain_name') and the value '$spf_record_value' (SPF). If a TXT record with this exact value already exists at this name (e.g., from Microsoft 365), no action is needed; if a different SPF value exists, merge the includes so only one SPF record remains."; fi
+                dkim_record_name=$(echo "$email_domain_details" | jq -r '.verificationRecords.DKIM.name // ""')
+                dkim_record_value=$(echo "$email_domain_details" | jq -r '.verificationRecords.DKIM.value // ""')
+                if [[ -n "$dkim_record_name" ]]; then echo "- A CNAME record with the Host name '$dkim_record_name' that points to address '$dkim_record_value' (DKIM selector1)."; fi
+                dkim2_record_name=$(echo "$email_domain_details" | jq -r '.verificationRecords.DKIM2.name // ""')
+                dkim2_record_value=$(echo "$email_domain_details" | jq -r '.verificationRecords.DKIM2.value // ""')
+                if [[ -n "$dkim2_record_name" ]]; then echo "- A CNAME record with the Host name '$dkim2_record_name' that points to address '$dkim2_record_value' (DKIM selector2)."; fi
+
+                # Fail the plan when the email custom domain exists but verification is incomplete.
+                # Letting plan succeed sends the operator into a multi-cycle loop where the apply
+                # eventually links the domain (once verified by a later run) but each interim deploy
+                # silently runs without the desired sender. Surfacing the records and failing now is
+                # cheaper than a 30-minute cycle each iteration.
+                echo "::error::Email custom domain '$email_domain_name' verification is incomplete. Add the DNS records above and re-run this workflow."
+                exit 1
+              fi
+            else
+              # Mirror the cluster-ingress style "checked but no work to confirm" hint so the operator
+              # sees the email check ran. The customer-managed domain only exists after the first apply
+              # with this domain configured; a follow-up workflow run will surface the records.
+              echo "$(date +"%Y-%m-%dT%H:%M:%S") Email custom domain '$email_domain_name' is not yet provisioned in Azure Communication Services. Re-run this workflow after the next apply completes."
+            fi
           else
             echo "$(date +"%Y-%m-%dT%H:%M:%S") DNS configuration instructions will be shown after the Container Apps Environment is created."
           fi
@@ -174,6 +278,31 @@ jobs:
       - name: Deploy Shared Environment Resources
         run: bash ./cloud-infrastructure/environment/deploy-environment.sh ${{ inputs.unique_prefix }} ${{ inputs.azure_environment }} ${{ inputs.shared_location }} ${{ inputs.production_service_principal_object_id }} --apply
 
+      - name: Detect Email Custom Domain Verification
+        if: ${{ inputs.domain_name != '' && inputs.domain_name != '-' }}
+        run: |
+          # Mirror of the plan job's auto-detect step. Re-queries verification because $GITHUB_ENV does
+          # not cross job boundaries; the second az call is cheap. When verified, exports
+          # USE_CUSTOM_EMAIL_DOMAIN=true so the apply links the CustomerManaged domain and the next
+          # account-api/main-api revision picks up the new sender.
+          CLUSTER_RESOURCE_GROUP_NAME="${{ inputs.unique_prefix }}-${{ inputs.azure_environment }}-${{ inputs.cluster_location_acronym }}"
+          email_domain_name=$(echo "${{ inputs.domain_name }}" | awk -F. '{ if (NF >= 2) print $(NF-1)"."$NF; else print $0 }')
+
+          az extension add --name communication --allow-preview true --only-show-errors 2>/dev/null || true
+          email_domain_details=$(az communication email domain show --name "$email_domain_name" --email-service-name $CLUSTER_RESOURCE_GROUP_NAME --resource-group $CLUSTER_RESOURCE_GROUP_NAME -o json 2>/dev/null || echo "")
+
+          if [[ -n "$email_domain_details" ]] && echo "$email_domain_details" | jq empty 2>/dev/null; then
+            domain_status=$(echo "$email_domain_details" | jq -r '.verificationStates.Domain.status // "NotStarted"')
+            spf_status=$(echo "$email_domain_details" | jq -r '.verificationStates.SPF.status // "NotStarted"')
+            dkim_status=$(echo "$email_domain_details" | jq -r '.verificationStates.DKIM.status // "NotStarted"')
+            dkim2_status=$(echo "$email_domain_details" | jq -r '.verificationStates.DKIM2.status // "NotStarted"')
+
+            if [[ "$domain_status" == "Verified" ]] && [[ "$spf_status" == "Verified" ]] && [[ "$dkim_status" == "Verified" ]] && [[ "$dkim2_status" == "Verified" ]]; then
+              echo "USE_CUSTOM_EMAIL_DOMAIN=true" >> $GITHUB_ENV
+              echo "$(date +"%Y-%m-%dT%H:%M:%S") Email custom domain '$email_domain_name' is verified - linking the CustomerManaged domain and flipping the sender."
+            fi
+          fi
+
       - name: Deploy Cluster Resources
         id: deploy_cluster
         env:
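
The apex-to-`@` translation that the `Show DNS Configuration` step performs in `normalize_name` can be tried in isolation. The sketch below is a hypothetical two-argument variant (the workflow's version reads `$email_domain_name` from the enclosing script), and the DKIM-style record name is only an illustrative value, not output captured from Azure.

```shell
# Hypothetical two-argument variant of the workflow's normalize_name helper.
# $1 is the record name as returned by Azure, $2 is the zone apex.
normalize_name() {
  record_name="$1"
  zone_apex="$2"
  # Apex records are shown as "@" so DNS panels do not append the zone twice.
  if [ "$record_name" = "$zone_apex" ]; then echo "@"; else echo "$record_name"; fi
}

# Apex record (Domain TXT, SPF TXT): translated to "@".
normalize_name "platformplatform.net" "platformplatform.net"
# Relative subdomain record (illustrative DKIM-style name): left unchanged.
normalize_name "selector1._domainkey" "platformplatform.net"
```

Only an exact match with the apex is rewritten, so relative names such as the DKIM CNAME hosts pass through untouched.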

cloud-infrastructure/cluster/deploy-cluster.sh

Lines changed: 8 additions & 0 deletions
@@ -54,6 +54,14 @@ export GOOGLE_OAUTH_CLIENT_SECRET
 export STRIPE_PUBLISHABLE_KEY
 export STRIPE_API_KEY
 export STRIPE_WEBHOOK_SECRET
+# Set to "true" by the deploy workflow's "Detect Email Custom Domain Verification" step once it
+# observes the CustomerManaged email domain (eTLD+1 of DOMAIN_NAME) as fully verified. When true,
+# Bicep links the domain to the CommunicationServices resource and the SENDER_EMAIL_ADDRESS env var
+# on account-api/main-api flips from no-reply@<azurecomm.net> to no-reply@<apex of DOMAIN_NAME>.
+# Defaults to "false" so mail keeps flowing on the AzureManaged sender during the verification window
+# and so the first apply (which always precedes verification) does not fail trying to link an
+# unverified domain. Operators do not flip this manually.
+export USE_CUSTOM_EMAIL_DOMAIN="${USE_CUSTOM_EMAIL_DOMAIN:-false}"
 
 export CONTAINER_REGISTRY_NAME=$UNIQUE_PREFIX$ENVIRONMENT
 export GLOBAL_RESOURCE_GROUP_NAME="$UNIQUE_PREFIX-$ENVIRONMENT-global"

cloud-infrastructure/cluster/main-cluster.bicep

Lines changed: 17 additions & 0 deletions
@@ -15,6 +15,7 @@ param mainVersion string
 param applicationInsightsConnectionString string
 param communicationServicesDataLocation string = 'europe'
 param mailSenderDisplayName string = 'PlatformPlatform'
+param useCustomEmailDomain bool = false
 param revisionSuffix string
 
 @description('Object ID of the Entra ID security group for PostgreSQL administration')
@@ -135,6 +136,20 @@ module stripeSecrets '../modules/key-vault-secrets.bicep' = if (!empty(stripeApi
   }
 }
 
+// Derive the email custom domain as the eTLD+1 (apex) of the cluster's ingress domainName. Cluster
+// ingress is typically a CNAME, and DNS rules forbid TXT/other records at the same name as a CNAME
+// (RFC 1034). The apex of a domain cannot itself be a CNAME, so SPF (TXT) and DKIM CNAMEs at
+// sub-subdomains of the apex can coexist freely with whatever else lives on the apex.
+// Apple Mail OTP autofill matches on eTLD+1, so a sender at the apex still autofills on any subdomain
+// of the same apex (e.g., sender no-reply@platformplatform.net autofills forms on
+// staging.platformplatform.net or app.platformplatform.net).
+// The "last two parts" derivation is correct for single-suffix TLDs (.net, .com, .io). It is wrong
+// for multi-part public suffixes like .co.uk - replace with an explicit param if that ever applies.
+var domainNameParts = split(domainName, '.')
+var emailDomainName = empty(domainName)
+  ? ''
+  : '${domainNameParts[length(domainNameParts) - 2]}.${domainNameParts[length(domainNameParts) - 1]}'
+
 module communicationService '../modules/communication-services.bicep' = {
   scope: clusterResourceGroup
   name: '${clusterResourceGroupName}-communication-services'
@@ -144,6 +159,8 @@ module communicationService '../modules/communication-services.bicep' = {
     dataLocation: communicationServicesDataLocation
     mailSenderDisplayName: mailSenderDisplayName
     keyVaultName: keyVault.outputs.name
+    emailDomainName: emailDomainName
+    useCustomEmailDomain: useCustomEmailDomain
   }
 }
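
The "last two parts" derivation used here in Bicep, and with `awk` in the workflow, is easy to check against sample hosts. A minimal sketch, assuming a hypothetical wrapper named `derive_email_domain` around the same `awk` one-liner the workflow steps use; the `example.co.uk` host is made up to show the limitation the comment warns about.

```shell
# Hypothetical wrapper around the awk expression from the workflow steps:
# print the last two dot-separated labels, or the input itself if it has
# fewer than two labels.
derive_email_domain() {
  echo "$1" | awk -F. '{ if (NF >= 2) print $(NF-1)"."$NF; else print $0 }'
}

derive_email_domain "staging.platformplatform.net"  # platformplatform.net
derive_email_domain "platformplatform.net"          # platformplatform.net
derive_email_domain "app.example.co.uk"             # co.uk (a public suffix, not a registrable domain)
```

The third call shows why a multi-part public suffix would need an explicit parameter instead of this derivation.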

cloud-infrastructure/cluster/main-cluster.bicepparam

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ param globalResourceGroupName = readEnvironmentVariable('GLOBAL_RESOURCE_GROUP_N
 param environment = readEnvironmentVariable('ENVIRONMENT')
 param containerRegistryName = readEnvironmentVariable('CONTAINER_REGISTRY_NAME')
 param domainName = readEnvironmentVariable('DOMAIN_NAME', '')
+param useCustomEmailDomain = readEnvironmentVariable('USE_CUSTOM_EMAIL_DOMAIN', 'false') == 'true'
 param backOfficeDomainName = readEnvironmentVariable('BACK_OFFICE_DOMAIN_NAME', '')
 param backOfficeEntraClientId = readEnvironmentVariable('BACK_OFFICE_ENTRA_CLIENT_ID')
 param backOfficeAdminsGroupId = readEnvironmentVariable('BACK_OFFICE_ADMINS_GROUP_ID', '')
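
How the flag travels from the environment into the `useCustomEmailDomain` bool can be approximated in shell. This is a sketch, not the deployed logic: `resolve_use_custom_email_domain` is a hypothetical name, and it assumes Bicep's `==` compares strings case-sensitively, so only the exact value `true` enables the custom domain.

```shell
# Hypothetical helper combining the two stages from the diff:
#   1. deploy-cluster.sh: ${USE_CUSTOM_EMAIL_DOMAIN:-false} (unset or empty -> "false")
#   2. main-cluster.bicepparam: ... == 'true' (assumed exact, case-sensitive match)
resolve_use_custom_email_domain() {
  raw="${1:-false}"
  if [ "$raw" = "true" ]; then echo "true"; else echo "false"; fi
}

resolve_use_custom_email_domain ""      # false (unset or empty falls back to the default)
resolve_use_custom_email_domain "true"  # true
resolve_use_custom_email_domain "True"  # false under the case-sensitivity assumption
```

Under these assumptions, any value other than the literal `true` written by the detection step leaves mail on the Azure-managed sender.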
