GitHub Actions: Dealing with a failing step in a workflow

2022-09-18 | github-actions, devops, ci-cd

Here, a ‘failing step’ means a step that includes a command with a non-zero exit code.

Simple scenario: Single-line commands #

For very simple cases, using shell script notation do_something || do_something_else (‘or’) can be sufficient. The command after the double pipes is only executed if the first command do_something evaluates to false, or more importantly for us, returns a non-zero exit code (usually indicating failure).

In the example below, the step should create a kubernetes secret if it doesn’t exist already. To achieve this, the control flow should act as follows:

graph LR start{Get secret:
$SECRET_NAME} continue[Continue to next step] start-->|Not found:
exit code 1|create[Create Secret]-->continue start-->|Success:
exit code 0|continue

In a GitHub Actions step, this can be done with:

...
- name: Create kubernetes secret
  run: |
    kubectl get secret -n=$NAMESPACE $SECRET_NAME || \
    kubectl create secret <type> $SECRET_NAME -n=$NAMESPACE ...    

...

More involved example: Parsing error output, fix the cause, and reattempt a command #

Sometimes the control flow needs to use the contents of the error to gracefully resolve the problem. There are a few steps that the flow needs to do in this case:

Define an id for the step that may fail (so that we can refer to its outcome status later in the flow).
The step that may fail must have the flag continue-on-error: true defined - otherwise GitHub will stop and fail the action.
The failing steps needs to save the error message so that it can be processed later. This is done using the 2> err.txt to save this error output to a file.
Following steps can use the status of the failing step as a condition for execution.

The control flow looks like this:

graph LR fail_step[Step: Run command] gate{Outcome status?} cont[Continue flow] fix_step[Step: Fix issue] reattempt_step[Step: Reattempt command] final_gate{Reattempt
outcome?} fail_step --> gate gate-->|success|cont gate-->|fail|fix_step-->reattempt_step-->final_gate final_gate-->|success|cont final_gate-->|fail|fail[Fail the flow]

In a GitHub Actions syntax:

...
- name: Attempt a step that will potentially fail
  id: potentially-failing-step
  continue-on-error: true
  run: | # FYI: commands after a failing line will NOT be executed. Hence, "Success!" would NOT be printed if your_command fails.
    echo "Attempting the step..."
    your_command 2> err.txt
    echo "Success!"

- name: Dealing with the problem that caused the error
  if: steps.potentially-failing-step.outcome =='failure'
  run: | # This is where you could deal with the problem that caused the error, reading err.txt if needed - increment version, change a feature flag, retry compilation with a different dependency etc. Perhaps you changed a parameter. The syntax below shows how to save the NEW_PARAMETER as an environment variable for use later within the flow.
    cat err.txt | something
    echo "NEW_PARAMETER=$NEW_PARAMETER" >> $GITHUB_ENV
    

- name: Reattempt the earlier step after fixing
  if: steps.potentially-failing-step.outcome =='failure'
  run: | # Final attempt at running your_command - if it still fails here, the flow fails. Log with NEW_PARAMETER for better observability.
    your_command
    echo "Step successful with new parameter: ${{env.NEW_PARAMETER}}"  
...