Ken Muse

How to Handle Step and Job Errors in GitHub Actions


In most cases, we write a CI/CD workflow where all of the steps succeed. If a step fails, the job and workflow fail. But what do we do when we need a step to fail (such as with negative testing)? And how can we provide corrective logic to handle a failed job in the workflow (such as rolling back a deployment)? In these cases, we need a way to let the workflow keep processing our steps. In this post, I’ll be diving into one of the options and exploring a lesser-known GitHub Actions feature, continue-on-error.

The continue-on-error feature is a way to tell GitHub Actions to continue running even if a step (or job) fails. This allows you to run additional steps after the one that failed. The property can be applied at the job or step level, but its behavior differs slightly depending on where it is used. Why would you want to use this feature? One reason is to create workflows that are more resilient to failure. For example, the workflow can identify a deployment failure based on a failing step, then execute a rollback process. Another reason is to create workflows that can be used for negative testing. In this approach, we actually expect a particular step to fail (if it is working correctly). That requires the later steps to run so they can determine whether the failure occurred as expected.

Configured at the step level, it looks like this:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Step 1
        run: echo "This step will pass"
      - name: Step 2
        id: step2
        continue-on-error: true
        run: echo "This step will fail" && exit 1
      - name: Step 3
        run: echo "And this step ..."
  after-test:
    needs: [test]

In this case, Step 2 will fail, but the workflow will continue to Step 3. We can determine the state of Step 2 by looking at the value of steps.step2.outcome. If it is failure, then we know that the step failed as expected. If it is success, then we know that the step passed unexpectedly. Because continue-on-error is set to true, the job and step will succeed (with a green success bubble on the job). In fact, steps.step2.conclusion and needs.test.result will both be success. In short, the workflow proceeds successfully as if the step had never failed.
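For negative testing, a later step can act on that outcome value. Here is a minimal sketch, assuming it is added to the same job after Step 2; the step name and message are my own illustration rather than part of the workflow above:

```yaml
# Hypothetical verification step, placed after Step 2 in the same job
- name: Verify Step 2 failed as expected
  # outcome is the step's result before continue-on-error is applied
  if: ${{ steps.step2.outcome != 'failure' }}
  run: |
    echo "Expected Step 2 to fail, but its outcome was '${{ steps.step2.outcome }}'"
    exit 1
```

Because this step does not set continue-on-error itself, it fails the job whenever Step 2 unexpectedly passes.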

Now, let’s look at what happens when continue-on-error is set at the job level:

jobs:
  test:
    runs-on: ubuntu-latest
    continue-on-error: true
    steps:
      - name: Step 1
        run: echo "This step will pass"
      - name: Step 2
        id: step2
        run: echo "This step will fail" && exit 1
      - name: Step 3
        run: echo "And this step ..."
  after-test:
    needs: [test]

In this case, Step 2 will still fail. The values of steps.step2.outcome and steps.step2.conclusion will both be failure. The workflow will not, however, proceed to Step 3. The workflow will continue with the after-test job, and needs.test.result will report success. If you want to proceed to Step 3, you’d need to override the implied if: ${{ success() }} condition that exists on Step 3. For example:

- name: Step 3
  if: ${{ failure() && steps.step2.conclusion == 'failure' }}
  run: echo "And NOW this step will run"

At the job level, however, there is no additional field provided through needs that will indicate that the job failed. The result will be success. The only way to access the details about the previous test job is to provide an output from the job. For example:

jobs:
  test:
    runs-on: ubuntu-latest
    continue-on-error: true
    outputs:
      actualResult: ${{ steps.step2.outcome }}
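To show how a dependent job might read that output, here is a rough sketch. The runs-on value, step name, and echo command in after-test are assumptions added for illustration, and the sketch assumes the output is still populated when the step fails:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    continue-on-error: true
    outputs:
      # Expose the real result of Step 2 to dependent jobs
      actualResult: ${{ steps.step2.outcome }}
    steps:
      - name: Step 2
        id: step2
        run: echo "This step will fail" && exit 1
  after-test:
    needs: [test]
    runs-on: ubuntu-latest # assumed runner, not in the original example
    steps:
      # Hypothetical corrective step; needs.test.result alone would report success
      - name: Handle the failed test job
        if: ${{ needs.test.outputs.actualResult == 'failure' }}
        run: echo "The test job failed, so corrective logic would run here"
```

The output is the only signal after-test uses here, which is why the condition checks needs.test.outputs.actualResult rather than needs.test.result.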

There is one more difference when you use continue-on-error at the job level. In this case, despite the fact that the test job’s result reports success, the overall workflow will report a failure. The job will also have a red X to indicate that it failed. Because of that, the job-level continue-on-error is best used when the workflow still needs to ultimately fail.

And that’s the basics of continue-on-error. It’s a powerful feature that can be used to create more resilient workflows and to implement complex testing strategies. It’s not a feature that you’ll use every day, but when you need it, it’s a great tool to have in your toolbox!