TL;DR:
Failures that occur outside of the handler of a Custom Resource result in long periods of inactivity when invoking CDK commands, and they're never raised as actual failures. This is a PITA.
So recently whilst working on HLS I was making some 🎸 Custom Resources 🤘
I'd wrapped my logic in beautiful try/except blocks and I'd handled the CFN callbacks so that my Custom Resource called back to the mothership 👽 🛸 to tell CFN what was happening. Custom Resources use these callbacks to tell CloudFormation (and CDK) whether a resource's creation/update/delete has been successful or not.
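For anyone who hasn't seen one of these callbacks, here's a rough sketch of what it boils down to: a JSON document PUT to the pre-signed `ResponseURL` that CloudFormation includes in the event. The function names `build_response_body` and `send_response` are made up for illustration; in practice a helper such as `cfnresponse` does this for you.

```python
import json
import urllib.request


def build_response_body(event, status, reason, physical_resource_id):
    """Build the JSON document CloudFormation waits for."""
    return {
        "Status": status,  # must be "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_resource_id,
        # The next three values are echoed back from the incoming event.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }


def send_response(event, status, reason, physical_resource_id):
    """PUT the response to the pre-signed URL carried in the event."""
    body = json.dumps(
        build_response_body(event, status, reason, physical_resource_id)
    ).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # Empty content type, as AWS's cfnresponse module sends.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

If nothing ever PUTs to that URL, CloudFormation has no way of knowing the resource failed and simply keeps waiting.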
BUT
When I ran cdk deploy, my deployment was seemingly stuck on creating the Custom Resource forever. Upon further inspection, I could see in the logs of the handler that it was erroring out as soon as it was invoked. Strange: this should be caught, and CloudFormation should be informed of the failure and begin the rollback logic.
So, I've got a Stack stuck deploying, my first thought? Delete the thing. So I deleted the stack... and it got stuck deleting the Custom Resource forever 🙃 .
u wot 🤨
So, this was confusing at first, but then I took a look at the error messages in CloudWatch. Let's say I had an index.py like:
import cfnresponse
import my_cool_module

def handler(event, context):
    try:
        my_cool_module.do_something()
        cfnresponse.send_success()  # This isn't real but you get the idea
    except my_cool_module.a_not_so_cool_exception as ex:
        print(ex)
        cfnresponse.send_failure(ex)
My importing of my_cool_module was erroring, not anything in my handler function. Because of this, I was never reaching any of my callback code, which meant that as far as CDK/CloudFormation were concerned, my Custom Resource was doing its thing and it'd hear from it eventually.
Because CloudFormation waits for these callbacks on every CDK action, a callback that never arrives results in seemingly endless (1+ hours) deploys/updates/destroys, which really wastes time.
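To make the failure mode concrete, here's a small local reproduction. It isn't Lambda itself, just plain Python loading a module the way any runtime must before it can call the handler. The module name `broken_index` and the missing dependency `module_that_is_not_packaged` are made up for the demo.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical handler module whose top-level import cannot be resolved.
INDEX_SOURCE = textwrap.dedent(
    """
    import module_that_is_not_packaged  # hypothetical missing dependency

    def handler(event, context):
        return "never reached"
    """
)


def try_load_handler():
    """Return the handler, or the ImportError raised while loading its module."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "broken_index.py").write_text(INDEX_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("broken_index")
            return module.handler
        except ImportError as err:
            # Loading failed before `handler` was even defined.
            return err
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("broken_index", None)


result = try_load_handler()
print(type(result).__name__)
```

The module blows up on its first line, so there is never a `handler` to call, and any try/except inside it never gets a chance to run.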
You might ask, did you not test your code locally @CiaranEvans?! - Well, I did. It worked beautifully because of how it was interpreting the import statement... not so correct when actually on its own in a Lambda 😭
So what should we do?
I suppose the easiest and grossest way could be:
import cfnresponse

try:
    import my_cool_module
except:
    cfnresponse.send_failure()

def handler(event, context):
    try:
        my_cool_module.do_something()
        cfnresponse.send_success()  # This isn't real but you get the idea
    except my_cool_module.a_not_so_cool_exception as ex:
        print(ex)
        cfnresponse.send_failure(ex)
I don't like try/catch or conditional imports. So if someone has a better idea or knows how we could gracefully handle this kind of issue, I'm all ears!
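One wrinkle with sending the failure at import time is that there's no event yet, so there's no `ResponseURL` to call back to. A possible variation (still a guarded import, sadly) is to remember the import error and report it as soon as the handler receives an event. This is only a sketch: `send_response` and `IMPORT_ERROR` are hypothetical names, the `send` parameter exists purely so the callback can be swapped out in tests, and the fallback physical ID is an assumption.

```python
import json
import urllib.request

try:
    import my_cool_module
    IMPORT_ERROR = None
except Exception as err:  # remember the failure instead of crashing at load
    my_cool_module = None
    IMPORT_ERROR = err


def send_response(event, status, reason):
    """PUT a SUCCESS/FAILED document to the event's pre-signed ResponseURL."""
    body = json.dumps({
        "Status": status,
        "Reason": reason,
        # Assumption: reuse the existing ID, or fall back to the logical ID.
        "PhysicalResourceId": event.get("PhysicalResourceId")
        or event["LogicalResourceId"],
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def handler(event, context, send=send_response):
    # Report a load-time failure now that we finally have a ResponseURL.
    if IMPORT_ERROR is not None:
        send(event, "FAILED", f"Import failed: {IMPORT_ERROR!r}")
        return
    try:
        my_cool_module.do_something()
        send(event, "SUCCESS", "Done")
    except Exception as ex:  # any unexpected error still reaches CloudFormation
        print(ex)
        send(event, "FAILED", str(ex))
```

This only covers imports that are wrapped like this. A syntax error in index.py itself, or a timeout, would still leave CloudFormation waiting.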
I imagine most languages will suffer from this kind of situation, or at least make it possible. Is there a way for CDK to treat any failure that's not explicitly handled as a failure for CloudFormation? 🤷
- Mind dump over.
cc. @developmentseed/earthdata-infrastructure