Code that can handle failure

By Luis Dalmolin

“It has been estimated that up to 90% of an application’s code is related to handling exception or error conditions. (McConnell, 2004)”

If there’s one thing that’s certain, it’s that your code will fail. Failed code is not always your fault. There are a number of factors outside of your control that can cause failure: a network blip, an infrastructure outage, an unhandled exception, a cascading failure, and so on.

Writing code that can deal with these situations helps you avoid common incidents, unnecessary alerts, and noisy logs. And, hopefully, it gives you a better night of sleep.

Handling failure is hard. In any software, there are way more error paths than successful ones. However, there are a few helpful PHP and Laravel strategies that can be applied in any language or framework.

We can do better than programming with the expectation of success and writing TODOs to handle the errors “when we have time.”

(Image: a TODO comment to handle errors, last touched 3 years ago)

You can’t predict them all

There’s no way to plan for every error out there. At least not upfront. You should definitely implement error handling for all the things you know can happen, and for HOW they’ll happen (a few tips for this below), but it’s impossible to predict everything that could go wrong.

Logging

Use a generalized error handling mechanism that logs errors with the most information possible. Don’t let your code fail silently.

“Over-communicate. It’s better to tell someone something they already know than to not tell them something they needed to hear.” — Alex Irvine

Laravel comes with an extensive exception handler, making it easy to add context, render specific errors, and so on.
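
For example, an exception can expose a context method whose return value Laravel merges into the log entry when the exception is reported. Here's a minimal sketch; OrderSyncFailedException and its order property are made up for illustration:

class OrderSyncFailedException extends Exception
{
    public function __construct(public Order $order)
    {
        parent::__construct("Failed to sync order {$order->id}");
    }

    // Laravel merges this array into the exception's log context.
    public function context(): array
    {
        return ['order_id' => $this->order->id];
    }
}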

Monitoring & Alerts

Just logging the errors somewhere isn’t enough. All errors should be made visible in a way that helps you notice and fix issues before your users start reporting them. Centralized logging platforms like ELK, Papertrail, Datadog, etc. are key, letting you configure alerts in different channels, such as email and Slack, based on how significant the errors are.

Here are some examples of alerts triggers:

  • When any CRITICAL (or above) log happens;

  • When the rate of incoming errors increases;

  • When NO logs with a specific message come in, meaning a key part of the system isn’t working;

  • When key routines start taking more time than they should.

There’s a lot more to know about logging, monitoring, and alerting, but those are key basics. Now let’s focus on how to better deal with failures in code.

Better odds with retries

cURL error 6: Could not resolve host
cURL error 6: getaddrinfo() thread failed to start

I bet you’ve seen this error before. This type of error can happen for many different reasons outside your control. Retries are a way to prevent having your code fail in this situation, or at least give it another chance to succeed.

Let’s look at one example of dispatching a job:

ProcessSearch::dispatch($request->all());

Let’s say that AWS SQS is down for a moment and the above code fails because the HTTP request isn’t making it to SQS. That would stop this job from getting dispatched to the queue, and you would lose the HTTP request data.

We can do better by implementing some retry logic using Laravel’s retry helper:

retry(10, fn () => ProcessSearch::dispatch($request->all()), 100);

Here, we’ll try 10 times, with a 100ms interval between each try, before letting the code fail.

Another great use of the retry helper is when you’re using a service like Algolia for searching. Algolia can fail at any time, but usually the failures are quick and it swiftly recovers. So applying a backoff that grows with each attempt comes in pretty handy:

$locations = retry(
    5,
    fn () => $this->search($request->get('q')),
    fn ($attempt) => $attempt * 50
);

Retry logic can be implemented in a lot of different places. Let’s look at an example of retry logic in a Laravel queued job, releasing the job back onto the queue with a delay that grows with each attempt.

class ProcessSearch implements ShouldQueue
{
    public $tries = 5;

    public function handle()
    {
        try {
            $this->processSearch();
        } catch (SearchEngineUnavailableException $e) {
            $this->release($this->attempts() * 15);
        }
    }
}

Safe retries with idempotent routines

An idempotent operation can be executed multiple times without changing the result beyond the initial run. Writing jobs this way ensures that retrying the same job won’t charge your customer multiple times or keep sending the same email over and over.

Let’s say you have a queued job that charges the customer and sends a confirmation email.

class ChargeAndSendEmail implements ShouldQueue
{
    public $tries = 3;

    public $user;
    public $order;

    public function __construct(User $user, Order $order)
    {
        $this->user = $user;
        $this->order = $order;
    }

    public function handle(PaymentProvider $provider)
    {
        $provider->charge($this->user, $this->order);

        Mail::to($this->user)->send(new OrderApproverMail);
    }
}

Using this code, if the email provider is down, the job will fail; and since the job is configured to try 3 times, it could charge the user up to 3 times.

Sometimes, a simple check can do the trick:

public function handle(PaymentProvider $provider)
{
    if (! $this->order->charged()) {
        $provider->charge($this->user, $this->order);
    }

    Mail::to($this->user)->send(new OrderApproverMail);
}

Instead of sending the email directly from this job, we could also queue the email itself, so that if something like an email provider outage happens, we can easily retry only the email notification.
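
A minimal sketch of that idea, using Laravel’s queued mail support so a mail provider outage only fails (and retries) the email delivery, not the charge:

public function handle(PaymentProvider $provider)
{
    if (! $this->order->charged()) {
        $provider->charge($this->user, $this->order);
    }

    // queue() pushes the mailable onto the queue instead of sending it
    // inline, so it can be retried independently of the charge.
    Mail::to($this->user)->queue(new OrderApproverMail);
}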

This can come in handy when updating a database record via a job payload as well.

Let’s say you have a system that receives thousands of payloads to create/update/delete documents. Sometimes, for a number of reasons, these jobs fail. But if we re-run the jobs a few hours later, a newer update for the same document may already have been applied, so re-running the old job could override the new data with stale data.

class UpdateDocumentJob implements ShouldQueue
{
    public $document;
    public $payload;

    public function __construct($document, $payload)
    {
        $this->document = $document;
        $this->payload = $payload;
    }

    public function handle()
    {
        if ($this->document->updated_at > $this->payload->updated_at) {
            Log::info("[UpdateDocumentJob] Not updating document because the document was last updated after the payload", [
                'document' => $this->document->id,
                'document_updated_at' => $this->document->updated_at,
                'payload_updated_at' => $this->payload->updated_at,
            ]);

            return;
        }

        // ... update the document with the payload data
    }
}

It can still fail

Even with retries, your code can still fail. In a situation where something like AWS us-east-1 is down, the best you can do is have a plan in place to deal with these types of situations. How bad is it if you lose that request data? How can you save it?

Simply logging some of the data in case of final failure can make it possible to get things running quickly when all systems are stable again.
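
In Laravel, one place to do that is the job’s failed() hook, which runs once the job has exhausted all of its attempts. Here's a minimal sketch; the searchData property is illustrative:

class ProcessSearch implements ShouldQueue
{
    public $tries = 5;

    public function __construct(public array $searchData)
    {
    }

    public function handle()
    {
        // ...
    }

    // Called by Laravel after the job has exhausted all of its attempts.
    public function failed(Throwable $exception)
    {
        Log::critical('[ProcessSearch] Giving up after all retries', [
            'search_data' => $this->searchData,
            'exception' => $exception->getMessage(),
        ]);
    }
}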

If you are using Laravel, make sure to keep an eye on your failed_jobs table. It can hold some very valuable information about things that are failing without you even knowing it. You can also use something like the failed job monitor package to get notified when jobs fail, and our queue-batch-retry package to retry specific jobs in batches and save some of your precious time.

Exceptions are your friend

Exceptions are not just errors. Errors are things we try to prevent, while exceptions can happen even in the most stable environment.

One common pattern I see in code is returning either the resource, false, or null. This is fine, but what happens when you start having more than just two states (success or failure)? How do you get better information about an error with this type of approach?

$contact = $this->marketingSystemClient->createContact($data);

if (! $contact) {
    Log::error('Error creating contact');
}

Is it an API error? Validation error? Network blip? Without enough data it becomes hard to deal with exceptions. The key to handling failure well is to know which failures to handle.

try {
    $contact = $this->marketingSystemClient->updateContact($data);
} catch (GuzzleHttp\Exception\RequestException $e) {
    if ($e->getResponse() && $e->getResponse()->getStatusCode() === 404) {
        $contact = $this->marketingSystemClient->createContact($data);
    }
} catch (RateLimitException $e) {
    $this->release($e->getRetryAfterInSeconds());
}

By throwing and catching exceptions we know how to handle, it becomes very easy to read and extend this type of approach to cover other error situations as we identify them. As you gain experience with different errors and exceptions, you’ll pick up more strategies for solving and preventing them.

By throwing specific exceptions, you can create custom methods like getRetryAfterInSeconds which make it very clear and easy to read.
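
A sketch of what such an exception could look like, with the Retry-After value captured when the exception is thrown (how your client reads that value from the API response is an assumption here):

class RateLimitException extends Exception
{
    public function __construct(private int $retryAfterInSeconds)
    {
        parent::__construct("Rate limit reached, retry after {$retryAfterInSeconds} seconds.");
    }

    // Lets the calling code (e.g. a queued job) release itself back onto
    // the queue with the delay the API asked for.
    public function getRetryAfterInSeconds(): int
    {
        return $this->retryAfterInSeconds;
    }
}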

Domain-based exceptions

Throwing domain-based exceptions makes it a whole lot easier to understand and plan for exceptions within your code. Check this out:

try {
    $contact = $this->marketingSystemClient->updateContact($data);
} catch (MarketingSystem\NotFoundException $e) {
    $this->marketingSystemClient->createContact($data);
}

Using domain-based exceptions, you can hide implementation details while making the code a lot more manageable and readable for its consumers, including your own application.

If your error handling code feels a bit messy, ask yourself if you can maybe throw a specific exception or two at lower levels.
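
For example, at a lower level the API client could translate HTTP errors into those domain exceptions. This is a sketch assuming a Guzzle-based client; the method and class names are illustrative:

public function updateContact(array $data)
{
    try {
        $response = $this->http->put("contacts/{$data['id']}", ['json' => $data]);
    } catch (GuzzleHttp\Exception\RequestException $e) {
        if ($e->getResponse() && $e->getResponse()->getStatusCode() === 404) {
            // Hide the HTTP details behind a domain exception.
            throw new MarketingSystem\NotFoundException("Contact {$data['id']} not found", 0, $e);
        }

        throw $e;
    }

    return json_decode((string) $response->getBody(), true);
}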

Handle exceptions as close to the top as possible

The closer to the top you are when actually handling exceptions (retry, send notifications, update state in the database, etc), the better. For instance, catching an exception in the controller makes it very easy to return an error response to the user. Catching an exception in the queued job makes it very easy to retry it as needed.
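
For example, a controller can catch a domain exception and turn it into a friendly response right where the HTTP context is available. A sketch, with illustrative names for the checkout action and the exception:

public function store(Request $request)
{
    try {
        $order = $this->checkout->process($request->validated());
    } catch (PaymentDeclinedException $e) {
        // The controller knows how to talk to the user: redirect back
        // with a helpful error instead of a 500 page.
        return back()->withErrors(['payment' => $e->getMessage()]);
    }

    return redirect()->route('orders.show', $order);
}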

That doesn’t mean the rest of your code (API Clients, Actions, Services) shouldn’t be catching exceptions. Ideally, they are catching low-level exceptions and throwing domain-based exceptions as mentioned above, hiding as many of the unneeded details as possible.

Laravel renderable & reportable exceptions

Laravel has this really awesome feature of renderable and reportable exceptions, where you can define a render and a report method on the exception itself. If the exception bubbles up to the global exception handler, Laravel will report and/or render it based on those methods.

class PaymentFailedException extends Exception
{
    public function render($request)
    {
        return view('orders.errors.payment');
    }
}
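
The report side works the same way; here's a minimal sketch of the same exception also controlling how it gets logged:

class PaymentFailedException extends Exception
{
    public function render($request)
    {
        return view('orders.errors.payment');
    }

    // Called by the global handler instead of the default reporting logic.
    public function report()
    {
        Log::critical('[PaymentFailedException] ' . $this->getMessage());
    }
}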

Use Static Analysis

Running static analysis on your CI pipeline on every commit/pull request can catch a lot of things, and help you write code that’s easier for your IDE to understand. However, starting with static analysis and implementing it on an existing application can be a complicated and annoying process. Nuno Maduro has a great talk on the subject with practical tips on how to get started.

The best tools for the PHP ecosystem are PHPStan and Psalm.

Deal with them

“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” - Donald Rumsfeld

Most of the failures that you deal with won’t occur while writing new features, but when your application is running. You can identify failures through user reports, logging, monitoring, testing, and so on; and when you do identify them, deal with them.

Errors can be caused by anything from a bug or an infrastructure issue to a problem with a third-party service. If the cause of the error is a bug, ship the fix with a test and make sure it doesn’t happen again. When the infrastructure is the problem, fix it and add monitoring around it so you get notified before the problem can happen again. Issues coming from a third-party service can usually be handled with good retry logic, so the integration recovers on its own without you or your team having to jump in.

Programming by coincidence creates issues: it lets you ship a temporary fix without truly understanding the problem. Avoid future issues by building code that makes sense within your application, otherwise it’s likely to come back to bite you down the line.

Have a process to deal with bugs and failures, as well as a process for catching and solving incidents and exceptions. Knowing how to respond, who should respond, when to escalate, and the best way to communicate when things go wrong helps ensure that when errors happen, you’re ready to handle them. Share the knowledge learned from incidents through post-mortems and follow-up actions to craft a system that can handle any failure.

Luis Dalmolin
Head of Technology