Get notified on AWS Glue Job failures

Well, the topic is clear. And if you’re here, that means you also couldn’t find anything helpful in the documentation. Or at least, not in a single place.

After some reading and several hours of chatting with AWS Support, I was able to get it done. The final solution may seem super straightforward, but it didn’t come easy. Maybe I was too naive, or maybe it actually was complicated. Either way, here is how I set things up to get notified when an AWS Glue Job fails.

The Setup

  1. Detect failure of the Glue Job.
  2. Trigger an AWS CloudWatch Rule from that.
  3. Push the event to a notification stream.

There are several ways of detecting failures of components in AWS. The most common way is AWS CloudWatch Alarms. We have configured Alarms for almost all of our components, and they are all sent to an SNS Topic, which we have linked to our team Slack channel. So our goal was to send any errors from our Glue Jobs to this SNS Topic, so that we would be notified on Slack.

AWS has documentation for configuring alarms for Glue Jobs. But, as you might have guessed, it wasn’t so helpful. It also explains how to use the CloudWatch Metrics of the Glue Job to create the alarm from the web console. But we need CloudFormation!


HOWEVER, we couldn’t get it working using CloudFormation Templates. The only way we could get something working was to have an Event defined for a Lambda function.

MergerCompleteUpdaterLambda:
  Type: AWS::Serverless::Function
  Properties:
    ...
    Events:
      CloudWatch:
        Type: CloudWatchEvent
        Properties:
          Pattern:
            detail:
              jobName:
                - my-glue-job-name
              state:
                - FAILED

Note: It’s weird that they mix up parameter naming conventions. Also, Pattern is meant to be a JSON string, but we can always write it in YAML.

You can then have this Lambda function trigger any alarm or notification you want. It is not that elegant, because now we need an intermediate component, but it works.
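A minimal sketch of what such a forwarding Lambda might look like in Python. This is an illustration, not the function we deployed. The ALARM_TOPIC_ARN environment variable is a hypothetical name, and the jobRunId and message fields are assumed to be present on the Glue event (the code tolerates them being missing).

```python
import json
import os


def build_notification(event):
    """Turn a Glue Job State Change event into an SNS subject and message."""
    detail = event.get("detail", {})
    job_name = detail.get("jobName", "unknown-job")
    state = detail.get("state", "UNKNOWN")
    # SNS subjects are limited to 100 characters, so truncate defensively.
    subject = f"Glue job {job_name} {state}"[:100]
    message = json.dumps(
        {
            "jobName": job_name,
            "state": state,
            "jobRunId": detail.get("jobRunId"),  # assumed field, may be None
            "message": detail.get("message"),  # assumed field, may be None
        },
        indent=2,
    )
    return subject, message


def handler(event, context):
    """Lambda entry point: forward the failure event to the alarm topic."""
    import boto3  # imported lazily so build_notification can be tested offline

    subject, message = build_notification(event)
    boto3.client("sns").publish(
        TopicArn=os.environ["ALARM_TOPIC_ARN"],  # hypothetical variable name
        Subject=subject,
        Message=message,
    )
```

The message formatting is kept separate from the publish call so it can be checked locally without AWS credentials.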

But I wouldn’t be writing this article if we stopped there, right!

So, we resumed our quest to find out how we can trigger an alarm, or at least send a notification to our SNS Topic, with fewer or no additional components involved.

When we implemented the above Lambda function, we suspected that we could define the event mapping separately, just as we can for Kinesis-Lambda events. So we looked more into AWS CloudWatch Rules.

In the Rule, EventPattern and Targets are the most important properties for us. The EventPattern property defines what type of events we should listen to, which is Glue Job failures in our case. The Targets property defines what should be triggered when such an event occurs, which is an SNS Topic in our case.

But unfortunately, there is no sign of Glue Jobs in the EventPattern documentation. So, we tried to imitate what’s already there for EC2 with the detail-type values found in another place.

EventPattern:
  detail-type: Glue Job State Change
  source: aws.glue
  resources:
  - arn:aws:glue:eu-west-1:<account_id>:job/my-glue-job-name
  detail:
    state: FAILED

But as you must have guessed already, this doesn’t work either. We scoured the Internet but couldn’t find a constructive solution, so we asked AWS Support for help.

Apparently, the way the source component is identified differs from one service to another. So, defining the Glue Job in the resources section doesn’t work! The job name should go in a field named jobName inside the detail section, and we shouldn’t specify resources at all!
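To see why the resources-based pattern never fires, here is a rough sketch in Python of exact-value pattern matching. It is a simplified stand-in, not the real CloudWatch Events matcher. The sample event is illustrative, and the key assumption is that Glue events arrive with an empty resources list. The account ID is a placeholder.

```python
def matches(pattern, event):
    """Simplified event-pattern check: every pattern key must match the event.

    Handles only exact-value matching with nested objects; real rules
    support more operators than this sketch does.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf values in a pattern are lists of allowed values.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


# Illustrative event; the empty "resources" list is the assumption that matters.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "resources": [],
    "detail": {"jobName": "my-glue-job-name", "state": "FAILED"},
}

by_resource = {
    "source": ["aws.glue"],
    "resources": ["arn:aws:glue:eu-west-1:123456789012:job/my-glue-job-name"],
}
by_job_name = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"jobName": ["my-glue-job-name"], "state": ["FAILED"]},
}

print(matches(by_resource, sample_event))  # False: nothing in resources to match
print(matches(by_job_name, sample_event))  # True
```

Note that every leaf value in a pattern is a list of allowed values, even when there is only one.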

And the Target is straightforward where we define the ARN of the SNS Topic.

So here is the final CloudFormation template we came up with.

MyGlueJobFailEvent:
  Type: AWS::Events::Rule
  Properties:
    Description: This rule detects Glue Job failures
    EventPattern:
      source:
        - aws.glue
      detail-type:
        # It should be exactly like this.
        - Glue Job State Change
      detail:
        jobName:
          - !Ref MyGlueJob
          - my-other-glue-job-name
        state:
          - FAILED
    Name: !Sub
      - ${StackName}-fail-event
      - { StackName: !Ref "AWS::StackName" }
    State: ENABLED
    Targets:
      - Arn: arn-of-my-sns-topic
        Id: SNS-ID

There are two nice things about the above implementation.

  1. We can define multiple GlueJobs as the source
    It means we don’t have to define Rules for every Glue Job we have.
  2. We can have multiple targets
    It means we can direct this failure event to many other components.

Well, our task is not yet done!

We also learned that, unlike other components, CloudWatch Events needs explicitly defined permissions to be able to push messages to SNS. They are added automatically when you create the event binding from the web console. The documentation describes a hacky way of adding this policy to the SNS Topic using the awscli, but we wanted everything to be defined in the CloudFormation templates.

Repositories (code) must reflect what’s on production.

Infrastructure as Code

For this, we need to define a TopicPolicy for the SNS Topic with permissions for CW Events to allow pushing messages to the corresponding SNS Topic.

Resources:
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub
        - ${StackName}--alarms-topic
        - { StackName: !Ref "AWS::StackName" }
      DisplayName: !Sub
        - ${StackName}--alarms-topic
        - { StackName: !Ref "AWS::StackName" }

  AlarmTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      PolicyDocument:
        Version: '2012-10-17'
        Id: CloudWatchEvents
        Statement:
          - Sid: EventsToSNS
            # This Statement allows CW Events to push messages to SNS
            Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: SNS:Publish
            Resource: !Ref AlarmTopic
          - Sid: DefaultSNS
            # This is the default Statement
            Effect: Allow
            Principal:
              AWS: "*"
            Action:
              - SNS:GetTopicAttributes
              - SNS:SetTopicAttributes
              - SNS:AddPermission
              - SNS:RemovePermission
              - SNS:DeleteTopic
              - SNS:Subscribe
              - SNS:ListSubscriptionsByTopic
              - SNS:Publish
              - SNS:Receive
            Resource: !Ref AlarmTopic
            Condition:
              StringEquals:
                AWS:SourceOwner: !Ref "AWS::AccountId"
      Topics:
        - !Ref AlarmTopic

We MUST include the EventsToSNS Statement, otherwise CloudWatch Events cannot publish to the SNS Topic. The DefaultSNS Statement matters too: without it, the Topic won’t have its default permissions.

Now we have a Glue Job, SNS Topic, an SNS Topic Policy to allow pushing messages, and an Event Rule to bridge the events from the Glue Job to the SNS. Now we’re good to go!

Still… Even though all the error messages are in the SNS Topic, we should configure a mechanism to be notified of these messages. But that’s not in the scope of this article.

However, when we were looking for a way to get AWS Alarms into a Slack channel, the easiest solution we could find was Opsidian. It’s super easy to configure and saves a lot of time. You should totally give it a try! And no, neither they nor anyone else sponsored this article, though I would have loved it if someone did 🙁
