Well, the topic is clear. And if you’re here, that means you also couldn’t find anything helpful in the documentation on how to get notified of Glue Job failures. Or at least, not in a single place.
After some reading and several hours of chatting with AWS Support, I was able to get it done. The final solution may seem straightforward, but it didn’t come easy. Maybe I was too naive, or maybe it actually was complicated. Either way, here is how I configured notifications for when an AWS Glue Job fails.
The Setup
- Detect failure of the Glue Job.
- Trigger an AWS CloudWatch Rule from that.
- Push the event to a notification stream.
There are several ways of detecting failures of components in AWS. The most common way is AWS CloudWatch Alarms. We have configured Alarms for almost all of our components, and they all go to an SNS Topic that is linked to our team Slack channel. So our goal was to send any errors from our Glue Jobs to this SNS Topic, so that we get notified on Slack.
AWS has documentation for configuring alarms for Glue Jobs. But, as you might have guessed, it wasn’t so helpful. It also explains how to use the CloudWatch Metrics of the Glue Job to create the alarm from the web console. But we need CloudFormation!
HOWEVER, we couldn’t get it working using CloudFormation Templates. The only way we could get something working was to have an Event defined for a Lambda function.
MergerCompleteUpdaterLambda:
  Type: AWS::Serverless::Function
  Properties:
    ...
    Events:
      CloudWatch:
        Type: CloudWatchEvent
        Properties:
          Pattern:
            detail:
              jobName:
                - my-glue-job-name
              state:
                - FAILED
Note: It’s weird that they mix up parameter naming conventions. Also, Pattern is meant to be a JSON string, but we can always write it in YAML.
Then you can have this Lambda function manually trigger any alarm you want. It is not that elegant, because now we have an intermediate component, but it works.
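For illustration, here is a minimal sketch of what such an intermediate Lambda function could look like in Python. It is not the function we actually ran. It assumes the event carries the jobName and state fields used in the pattern above, and the ALARM_TOPIC_ARN environment variable and the build_message helper are hypothetical names introduced for this example.

```python
import json
import os


def build_message(event: dict) -> dict:
    """Turn a Glue job event into keyword arguments for sns.publish."""
    detail = event.get("detail", {})
    job_name = detail.get("jobName", "unknown-job")
    state = detail.get("state", "UNKNOWN")
    return {
        # SNS subjects are limited to 100 characters.
        "Subject": f"Glue job {job_name} {state}"[:100],
        "Message": json.dumps(detail, indent=2, sort_keys=True),
    }


def handler(event, context):
    """Lambda entry point: forward the failure event to the alarm topic."""
    import boto3  # imported lazily so build_message can be tried offline

    sns = boto3.client("sns")
    # ALARM_TOPIC_ARN is a hypothetical environment variable for this sketch.
    sns.publish(TopicArn=os.environ["ALARM_TOPIC_ARN"], **build_message(event))
```

Keeping the formatting in a separate function means the message can be checked locally without AWS credentials.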
But I wouldn’t be writing this article if we stopped there, right?
Capturing the failure
So, we resumed our quest to find out how we can trigger an alarm, or at least send a notification to our SNS Topic, with fewer or no additional components involved.
When we implemented the above Lambda function, we suspected that we could define the event mapping separately, just as we can for Kinesis-Lambda events. So we looked more into AWS CloudWatch Rules.
In the Rule, EventPattern and Targets are the most important properties for us. The EventPattern property defines what type of events we should listen to, which is Glue Job failures in our case. The Targets property defines the target it should trigger when such an event occurs, which is an SNS Topic in our case.
But unfortunately, there is no sign of Glue Jobs in the EventPattern documentation. So, we tried to imitate what’s already there for EC2 with the detail-type values found in another place.
EventPattern:
  detail-type: Glue Job State Change
  source: aws.glue
  resources:
    - arn:aws:glue:eu-west-1:<account_id>:job/my-glue-job-name
  detail:
    state: FAILED
But as you must have guessed already, this doesn’t work either. So we scoured the Internet, but couldn’t find any constructive solution. So we asked AWS Support for help.
Apparently, the way the source component is identified differs from one service to another. So, defining the Glue Job in the resources section doesn’t work! The job name should go in a field named jobName inside the detail section, and we shouldn’t specify resources at all!
And the Target is straightforward: it is where we define the ARN of the SNS Topic.
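The difference between the two approaches can be tried out offline. The Python sketch below is a deliberately simplified, exact-match-only imitation of how a rule compares a pattern with an event. It is not the service’s real matcher. The sample event is trimmed down and assumes, in line with what AWS Support told us, that a Glue Job State Change event names the job in detail.jobName and carries an empty resources list. The account ID is a placeholder.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified exact-value matcher (no prefix, numeric or exists operators)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern section, e.g. "detail".
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # A pattern leaf is a list of allowed values; an event array
            # matches if any of its elements is one of them.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


# Trimmed sample event; the empty "resources" list is an assumption.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "resources": [],
    "detail": {"jobName": "my-glue-job-name", "state": "FAILED"},
}

by_resources = {
    "source": ["aws.glue"],
    "resources": ["arn:aws:glue:eu-west-1:123456789012:job/my-glue-job-name"],
}
by_job_name = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"jobName": ["my-glue-job-name"], "state": ["FAILED"]},
}

print(matches(by_resources, sample_event))  # False
print(matches(by_job_name, sample_event))   # True
```

Under these assumptions, a pattern that filters on resources has nothing to match against, while the one that filters on detail.jobName matches.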
So here is the final CloudFormation template we came up with.
MyGlueJobFailEvent:
  Type: AWS::Events::Rule
  Properties:
    Description: This rule detects when the Glue Jobs fail
    EventPattern:
      source:
        - aws.glue
      detail-type:
        # It should be exactly like this.
        - Glue Job State Change
      detail:
        jobName:
          - !Ref MyGlueJob
          - my-other-glue-job-name
        state:
          - FAILED
    Name: !Sub
      - ${StackName}-fail-event
      - { StackName: !Ref "AWS::StackName" }
    State: ENABLED
    Targets:
      - Arn: arn-of-my-sns-topic
        Id: SNS-ID
There are two nice things about the above implementation.
- We can define multiple Glue Jobs as the source
It means we don’t have to define Rules for every Glue Job we have.
- We can have multiple targets
It means we can direct this failure event to many other components.
Well, our task is not yet done!
Set the permissions
We also learned that, unlike other components, CloudWatch Events needs permissions explicitly defined to be able to push messages to SNS. They are added automatically when you create the event binding from the web console. In the documentation, they explain a hacky way of adding this policy to the SNS Topic using the awscli. But we wanted everything to be defined in the CloudFormation templates.
Repositories (code) must reflect what’s on production.
Infrastructure as Code
For this, we need to define a TopicPolicy for the SNS Topic that allows CloudWatch Events to push messages to the corresponding SNS Topic.
Resources:
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub
        - ${StackName}--alarms-topic
        - { StackName: !Ref "AWS::StackName" }
      DisplayName: !Sub
        - ${StackName}--alarms-topic
        - { StackName: !Ref "AWS::StackName" }
  AlarmTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      PolicyDocument:
        Version: '2012-10-17'
        Id: CloudWatchEvents
        Statement:
          - Sid: EventsToSNS
            # This Statement allows CW Events to push messages to SNS
            Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: SNS:Publish
            Resource: !Ref AlarmTopic
          - Sid: DefaultSNS
            # This is the default Statement
            Effect: Allow
            Principal:
              AWS: "*"
            Action:
              - SNS:GetTopicAttributes
              - SNS:SetTopicAttributes
              - SNS:AddPermission
              - SNS:RemovePermission
              - SNS:DeleteTopic
              - SNS:Subscribe
              - SNS:ListSubscriptionsByTopic
              - SNS:Publish
              - SNS:Receive
            Resource: !Ref AlarmTopic
            Condition:
              StringEquals:
                AWS:SourceOwner: !Ref "AWS::AccountId"
      Topics:
        - !Ref AlarmTopic
We MUST include the Statement with the Sid DefaultSNS. Otherwise, the SNS Topic won’t have its default permissions.
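After deploying, it can be reassuring to confirm that the topic really carries the EventsToSNS permission. The following Python sketch is one possible check, not part of our stack. The allows_events_publish and fetch_policy helpers are hypothetical names introduced here, the topic ARN is a placeholder, and fetch_policy assumes boto3 is installed with credentials that may read the topic attributes.

```python
import json


def _as_list(value):
    """Policy fields may hold a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def allows_events_publish(policy: dict, topic_arn: str) -> bool:
    """Return True if some Allow statement lets CloudWatch Events publish."""
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        services = (
            _as_list(principal.get("Service", []))
            if isinstance(principal, dict)
            else []
        )
        # Action names are compared case-insensitively.
        actions = [a.lower() for a in _as_list(statement.get("Action", []))]
        resources = _as_list(statement.get("Resource", []))
        if (
            "events.amazonaws.com" in services
            and "sns:publish" in actions
            and topic_arn in resources
        ):
            return True
    return False


def fetch_policy(topic_arn: str) -> dict:
    """Read the deployed topic policy (needs boto3 and AWS credentials)."""
    import boto3

    attrs = boto3.client("sns").get_topic_attributes(TopicArn=topic_arn)
    return json.loads(attrs["Attributes"]["Policy"])


# Offline example with a placeholder ARN and a hand-written policy.
example_arn = "arn:aws:sns:eu-west-1:123456789012:example-alarms-topic"
example_policy = {
    "Statement": [
        {
            "Sid": "EventsToSNS",
            "Effect": "Allow",
            "Principal": {"Service": "events.amazonaws.com"},
            "Action": "SNS:Publish",
            "Resource": example_arn,
        }
    ]
}
print(allows_events_publish(example_policy, example_arn))
```

Against a real topic, the check would be allows_events_publish(fetch_policy(topic_arn), topic_arn).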
Now we have a Glue Job, SNS Topic, an SNS Topic Policy to allow pushing messages, and an Event Rule to bridge the events from the Glue Job to the SNS. Now we’re good to go!
Still… Even though all the error messages are in the SNS Topic, we should configure a mechanism to be notified of these messages. But that’s not in the scope of this article.
However, when we were looking for a solution to get AWS Alarms into a Slack channel, the easiest solution we could find was Opsidian. It’s super easy to configure and saves a lot of time. You should totally give it a try! And no, they or anyone else didn’t sponsor this article, though I would have loved if someone did 🙁
Hi,
I have been trying your solution to create the SNS Topic, event rule and topic policy, but I wasn’t able to create the CloudFormation stack. I am getting the following error.
The following resource(s) failed to create: [MyGlueJobFailEvent]. . Rollback requested by user.
Value of property Arn must be of type String.
kindly help.
Can you post your final CloudFormation template? It says the ARN should be a string. Make sure it’s enclosed in double quotes or resolves to a string (if you’re using !Sub or !ImportValue).
Yeah… it got resolved.
Do you have any idea how we can match Glue Jobs with names ending in “prod” in the event pattern?
What I usually do is define a parameter in the CloudFormation template (e.g. Environment) and pass the value when deploying. Then I can use this value with ${Environment}.
Ahhh… it worked out… I wasn’t supplying my AWS account ID. For now, I had to hard-code it from my user role. But can we pick up the account ID here?
Yes, you can use Pseudo parameter references to get the Account ID.
Check here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/pseudo-parameter-reference.html
Awesome article, thank you.
Hi Praneet, I was trying to do the same in a CloudWatch event rule, where I want to get an event when one of my crawlers fails. I tried adding “jobName” and “crawlerName”, but it didn’t work. I did a quick search on the Internet but could not find anything. Did you happen to work on it?
Can you show your CloudFormation template?