Monthly Archives: May 2017

Don't Read Your Logs

Photo provided by Karen Arnold under the CC0 Public Domain license.

I’ve had a number of discussions, both offline and online, about logging practices. In my view, reading individual log lines is (almost) always a sign that there are gaps in your system’s monitoring tools – especially when dealing with software at a medium- or large scale. Just as I would never want to use tcpdump to analyze statsd metric packets individually, ideally, I don’t want to look at individual log lines either.

This advice follows the spirit of “disable port 22 on your machines”. While most developers would agree that using dedicated orchestration tools is preferable to manual server wrangling, I know of very few projects or companies that take the extreme stance of disabling SSH access to all machines. Nevertheless, treating manual SSH as a sign of gaps in your system’s orchestration tools is an excellent master cue for architecting scalable and maintainable systems.

Similarly, I know of none that truly send all logs to /dev/null, or block read access to all logs. But treating logs as an extreme anti-pattern provides an excellent forcing function for designing observable systems.

Logs as Metrics

Let’s start with a basic example:

log.Printf(“Loaded page in %f seconds”, duration)

In this case, the log line serves the same purpose as a statsd metric. We could replace it immediately with

statsd.Histogram(“page.load.time_ms”, duration)

and the result would be better, because we’d be able to use the full extent of aggregation tools at our disposal. It’s possible to extract this information from a log line into a structured form, but it’s more work, and it’s unnecessary. The log line doesn’t give us any information that the structured metric doesn’t.

Logs as Debugger Tracing

A more common example:

log.Printf(“about to make API request on”, obj_id)
obj = c.Load(obj_id)
if obj == nil {
    log.Printf(“could not load object”, obj_id)
} else {
    log.Printf(“loaded object”, obj_id)
}

Logs are oftentimes used as a runtime pseudo-debugger. In this case, we’re using logs as a way to verify that a particular line of code was called for a particular transaction. The actual text of the log doesn’t even matter. Instead of “About to make API request”, we could have written

log.Printf(“api.request_pre”, id)

or even

log.Printf(“I like potato salad”, obj_id)

As long as it’s unique to that particular line of code, it serves functionally the same purpose – it confirms that the program execution reached that point in the code.

When we use log lines this way, we’re forming a mental model of the code, and using the logs to virtually step through the code, exactly the way a traditional debugger like GDB might. Transaction-level (or request-level) tracing tools provide this same kind of visibility, with a better visual display.

Without actually counting, I’d estimate that at least 80% – if not more – of log lines that I’ve seen in most open-source projects fit this overall use case: using log lines to virtually “trace” the execution path of source code on a particular piece of problematic input.

Logs as Error Reporting

Another common pattern:

try:
    writeResult()
except Exception as e:
    log.info(“error writing result!”, e)

Here, we’re using logs as a way to capture context for an error. But again, this is a relatively inconvenient way to explore information like a stacktrace. Tools like Sentry or Crashlytics also allow exception reporting, but unlike logging tools, they allow us to classify and group exceptions. We can still view individual stacktraces, but we don’t have to sift through as much noise to identify problems. And by tracking the state of reported exceptions, we can identify regressions more easily. Structured logging systems are generally not capable of handling this – and even when the workflow is possible, it’s nowhere near as convenient as what dedicated exception tracking systems allow.

If you really can’t break the habit of logging errors, you can at least add a hook to log.error that sends the error (and a complete stacktrace) to an error reporting tool like Sentry.

Logs as Durable Records

Furthermore, you can’t even assume that the logs you try to write will actually get written! Even when operating on a small scale, logs can be a lossy pipeline, and the potential for failure only increases with scale.

For example, if you’re developing a script intended to run on your local machine, do you know how your code will behave if your disk hangs? If you run out of space on the partition? How reliable is your log rotation? What happens to your server if it fails?

For a very small script, these sorts of failures may not matter to you, but they can come back to bite you, even at that small scale. For larger-scale services with tighter reliability guarantees, there are ways to mitigate these specific problems – a buffer, a log collector, a distributed indexer – but each solution comes with its own risks and problems. If you try to keep patching these by introducing more tools to make your logging reliable, at some point, you’ll discover that you’ve reinvented your own distributed database. And writing your own database is not inherently a bad idea, but to do it well, it’s the sort of task that’s best undertaken intentionally, rather than by accident.

To be fair to logging, this problem is not unique to logs. It comes from the limitations set forth by the CAP theorem, which means that every monitoring tool has to figure out its own way to deal with them. The problem with logs, however, is that the failure modes are much more subtle and easy to overlook.

For example:

statsd.Increment(‘api.requests_total”, tags={“country”: country.String()}, rate=.1)

It’s relatively obvious to see that this line of code might not always emit a metric, because even under normal operating conditions, you know (a) network operations can have problems, (b) it uses UDP, and (c) the metrics are sampled at the specified rate.

It’s a lot less obvious that this line of code might fail, because STDOUT and STDERR are generally assumed to be always available under normal operating conditions:

log.Printf(“Received api request from %s”, country.String())

Compared to logging, the failure modes of using runtime metrics, request tracing tools, or exception tracking tools are more visible and well-defined.

Instead of using logs as an accidental database, consider your underlying use case and which dedicated database would serve that need better. There’s no one-size-fits-all answer here; you may find that your use case is best served by a NoSQL database like CouchDB, or you may find that you’re really aiming to replicate the functionality of a message queue like Kafka, or another database altogether. If any of these (or other) tools fit your use case, they’re almost guaranteed to be a better fit, long-term, than logs.

Don’t Stop Logging, But Stop Reading Log Lines

By this point, it may sound like I’m firmly anti-logging. I do consider myself an environmentally-friendly person, but when it comes to software, I support careful uses of logging.

Logging can be useful for some purposes. However, it’s rare that they’re the only tool for monitoring your code. And it’s even rarer that they’re the best tool. When writing software that scales, you need to be able to deal with aggregate information – the firehose is too unwieldy to parse mentally. Logs that can be aggregated are better than logs that can’t. In those cases, it’s best to keep logging, but when you need to diagnose a problem, you’ll be interested in reading aggregate queries across your logs, rather than viewing raw, unaggregated logs in chronological order. The former is a powerful way to absorb a lot of information about your systems quickly. The latter is a glorified tail -f | grep.

The next time you start to write a log line, ask yourself whether another observability tool would be a better fit. Oftentimes, it will! If not, that’s fine. Just remember that, ideally, nobody should ever be reading that raw line, so take care to structure the information in a way that facilitates the kind of aggregation queries you’ll need.