Category Archives: Observability

Observing Kubernetes Services With Veneur

Veneur is a distributed, fault-tolerant observability pipeline

What is Veneur?

Veneur is a data pipeline for observing distributed systems. You can use Veneur for aggregating application and system data, like runtime metrics or distributed traces, and intelligently routing the data to various tools for storage and analysis. Veneur supports a number of these tools – called “data sinks” – such as SignalFX, Datadog, or Kafka. For this walkthrough, we’ll use Datadog.

Collecting pod-local metrics with Veneur

The first step to observing our services with Veneur is to deploy a local sidecar instance of Veneur inside our pod. To keep things simple, we’ll create a simple application whose only feature is to emit a heartbeat metric. We’ll do this using veneur-emit.

Veneur-emit is a convenience utility we wrote that allows us to emit statsd-compatible metrics at the command line for testing. It’s equivalent to using statsd client libraries (e.g. statsd) within an application. We wrote it to synthesize data transmission and is used in a similar way to netcat.. So the following netcat command:

$ echo -n “resp_success:1|c|#method:post” | nc -4u localhost 8126

could be written as

$ veneur-emit -hostport udp://localhost:8126 -count 1 -name "resp_success" -tag "method:post"

Basic Example

In this example, veneur-emit is a stand-in for any application you want to observe, if it’s instrumented using statsd client libraries. Since it’s the main application – and first container – in our pod, we’ll begin with the vanilla setup:

That’s the first container in our pod, and if we deploy it as-is, it’ll start faithfully firing off UDP metrics on port 8126. But there’s nothing listening on that port, so it’s just talking into the void. Let’s give it an audience by adding a second container in the same pod:

- name: veneur

Veneur requires almost no configuration out of the box; it defaults to working values. This new container is almost ready to start collecting metrics. However, without a Datadog API key, it won’t be able to send the metrics anywhere, so at the very least, we’ll need to configure our downstream data sinks.


Configuring a Pod-Local Veneur Collector

When running in non-containerized environments, Veneur can read its configuration from a file. But in Kubernetes, environment variables are more convenient. Thanks to envconfig, we can use these two forms of configuration interchangeably. Every config option specified in the yaml config example has a corresponding environment variable, prefixed with VENEUR_. So, we provide the following:

              name: datadog
              key: datadog_api_key

For more information on setting the secret key values in Kubernetes, see the Kubernetes documentation on secrets.

In addition, we listen for UDP packets on port 8126 to receive the metrics, and we also listen for HTTP traffic on port 8127. Port 8127 is used for the healthcheck endpoint. It can also be used to listen for metrics that are emitted over TCP using authenticated TLS instead of UDP, but we won’t cover that here, as it’s not needed for most Kubernetes deployments.

So, putting it all together, we have:

If we apply this deployment (`kubectl apply -f veneur-emit.yaml`), we’ll start to see these metrics come in, at the rate of six per minute. We’re sending metrics all the way from our application to Datadog!

If all you care about is collecting pod-local metrics, that’s it! You can stop here. But, there’s a good chance you want global metric aggregation as well. For that, we’ll have to add a little more.


Global Aggregation – How it Works

Veneur is capable of supporting global metric aggregation.  

Let’s say you have an API server with 200 different replicas responding to stateless requests, load-balanced evenly across all instances. If you want to reduce your API latency, you probably want to know the 95th or 99th percentile for API latency across all 200 replicas, rather than the 99th percentile latency for the first pod, the 99th percentile for the second, and so forth. Having the separate latencies for the individual pods doesn’t help you, because percentiles aren’t monoids: there’s no mathematical operation to construct the global 99th percentile from the 200 different measurements of the 99th percentile on each pods.

Some time-series databases sidestep this problem by storing the raw data and performing the percentile aggregation at read-time, but this requires tremendous network, disk, and compute resources at scale, especially if we want good performance from our queries at read-time. Performing write-time aggregation reduces this load. But calculating percentiles before writing the results requires either sending all metrics to a single aggregator pod (which becomes a single-point-of-failure) or calculating them on a per-pod basis. Fortunately, Veneur has a nifty approach that gives the best of both worlds: global aggregate calculations at write-time, with no single-point-of-failure, and with an architecture that scales horizontally using existing Kubernetes primitives.

Veneur uses t-digests to compute extremely accurate percentile estimates in an online manner, using constant memory. While these percentile calculations are technically approximations, in practice, the error properties yield exact results for tail ends of the distribution, like the 99th percentile. Conveniently, that’s generally what we care about when observing our services, especially latencies.

Veneur’s provides a horizontally scalable mechanism for calculating global aggregates of timers, histograms and sets. The pod-local metrics – counters and gauges – are shipped off to Datadog immediately, but timers, histograms, and sets are forwarded to a separate service for global aggregation. The aggregation happens in two layers.


Veneur Proxy and Veneur Global

First, the pod-local veneur processes forward all timers, histograms, and sets to the veneur-proxy service, which is exposed by Kubernetes. Because Veneur is stateless, the choice of pod and node are arbitrary for the veneur-proxy service, and this can be handled by the built-in Kubernetes support for service discovery and load-balancing.

Under-the-hood, the veneur-proxy processes have to do a bit of coordination to split up their metrics properly. Assume that we have three proxy pods and two global pods, and that each proxy pod receives three metrics – named “red”, “blue”, and “pink” for convenience – from an arbitrary number of application pods. We can’t just lean on regular load-balancing within Kubernetes – we need to ensure that all metrics of the same name end up on the same global pod, or else we won’t have aggregated them properly.

Fortunately, this is all handled automatically by the veneur-proxy and veneur-global services. The veneur-proxy pods distribute the metrics consistently – all red metrics (the solid red arrow) end up on the same global pod, all blue metrics (blue dotted arrow) end up on the same global pod, and all pink metrics (dashed pink arrow) end up on the same global pod.

Of course, the number of replicas of the veneur-proxy and veneur-global services is arbitrary (and each veneur-global pod is capable of aggregating far more than one metric at a time!). Since both services are stateless, they can be horizontally scaled and adjusted dynamically to adapt to your particular metric load.

To create the veneur-proxy and veneur-global services, we just need to provide the following definitions.

For the proxy:

The second section is the same as what we saw before – veneur-proxy is a separate application binary, and for it to emit metrics about itself, we’ll want it to have its own, pod-local Veneur instance to talk to.

The global service configuration is even easier:

Resource Limits

While Veneur’s functionality extends beyond just collecting runtime metrics, compared to other dedicated metric collectors, Veneur’s resource usage is pretty lightweight. The portion that’s the most expensive computationally – global metric aggregation – is performed on dedicated pods (veneur-global).

Kubenetes has built-in support for resource limits, leveraging the power of cgroups in a convenient abstraction. We can use resource limits to cap the CPU and memory usage of the pod-local Veneur instances.

- name: veneur
          cpu: 25m


With this three-tier setup, we maintain a number of useful invariants about our observability system.

  • Stateless. Each Veneur process is entirely stateless. No data remains in memory for longer than ten seconds, so failure or eviction of a single pod running Veneur has very limited impact on our system’s overall observability.
  • Distributed: There is no single point of failure. If any pod-local veneur process is killed and restarted, we only lose data for that pod, and only for one flush cycle. If any proxy or global veneur process is killed and restarted, we only lose ten seconds of data, and only for 1/n metrics (where n is the number of replicas we’re running).
  • Horizontally scalable: We can run as many instances of the proxy and global boxes as we need to, to support arbitrarily large loads.
  • Fault-tolerant: If a global veneur instance becomes unreachable, all proxy boxes will update their hashing scheme immediately for the next flush cycle.


Veneur also supports pull-based metric aggregation with Prometheus. veneur-prometheus is a separate binary which polls a Prometheus metrics endpoint to ingest metrics into the Veneur pipeline.

….and more!

We’re just scratching the surface of how Veneur can help you get better observability into your Kubernetes systems. Observability is not just about runtime metrics – it’s about request traces and logs as well. Veneur is an embodiment of the philosophy that logs, runtime metrics, and traces are all deeply connected, and that there should be an intelligent metrics pipeline handling all three.

If any of this excites or intrigues you – or if you have more thoughts on what kinds of visibility you need in your Kubernetes deployments, you’re in luck! Veneur is actively developed, with tagged major releases every six weeks. Drop us a line over on the issue tracker!


Thanks to Jess Frazelle, Kris Nova, and Cory Watson for reading drafts of this post!

Don't Read Your Logs

Photo provided by Karen Arnold under the CC0 Public Domain license.

I’ve had a number of discussions, both offline and online, about logging practices. In my view, reading individual log lines is (almost) always a sign that there are gaps in your system’s monitoring tools – especially when dealing with software at a medium- or large scale. Just as I would never want to use tcpdump to analyze statsd metric packets individually, ideally, I don’t want to look at individual log lines either.

This advice follows the spirit of “disable port 22 on your machines”. While most developers would agree that using dedicated orchestration tools is preferable to manual server wrangling, I know of very few projects or companies that take the extreme stance of disabling SSH access to all machines. Nevertheless, treating manual SSH as a sign of gaps in your system’s orchestration tools is an excellent master cue for architecting scalable and maintainable systems.

Similarly, I know of none that truly send all logs to /dev/null, or block read access to all logs. But treating logs as an extreme anti-pattern provides an excellent forcing function for designing observable systems.

Logs as Metrics

Let’s start with a basic example:

log.Printf(“Loaded page in %f seconds”, duration)

In this case, the log line serves the same purpose as a statsd metric. We could replace it immediately with

statsd.Histogram(“page.load.time_ms”, duration)

and the result would be better, because we’d be able to use the full extent of aggregation tools at our disposal. It’s possible to extract this information from a log line into a structured form, but it’s more work, and it’s unnecessary. The log line doesn’t give us any information that the structured metric doesn’t.

Logs as Debugger Tracing

A more common example:

log.Printf(“about to make API request on”, obj_id)
obj = c.Load(obj_id)
if obj == nil {
    log.Printf(“could not load object”, obj_id)
} else {
    log.Printf(“loaded object”, obj_id)

Logs are oftentimes used as a runtime pseudo-debugger. In this case, we’re using logs as a way to verify that a particular line of code was called for a particular transaction. The actual text of the log doesn’t even matter. Instead of “About to make API request”, we could have written

log.Printf(“api.request_pre”, id)

or even

log.Printf(“I like potato salad”, obj_id)

As long as it’s unique to that particular line of code, it serves functionally the same purpose – it confirms that the program execution reached that point in the code.

When we use log lines this way, we’re forming a mental model of the code, and using the logs to virtually step through the code, exactly the way a traditional debugger like GDB might. Transaction-level (or request-level) tracing tools provide this same kind of visibility, with a better visual display.

Without actually counting, I’d estimate that at least 80% – if not more – of log lines that I’ve seen in most open-source projects fit this overall use case: using log lines to virtually “trace” the execution path of source code on a particular piece of problematic input.

Logs as Error Reporting

Another common pattern:

except Exception as e:“error writing result!”, e)

Here, we’re using logs as a way to capture context for an error. But again, this is a relatively inconvenient way to explore information like a stacktrace. Tools like Sentry or Crashlytics also allow exception reporting, but unlike logging tools, they allow us to classify and group exceptions. We can still view individual stacktraces, but we don’t have to sift through as much noise to identify problems. And by tracking the state of reported exceptions, we can identify regressions more easily. Structured logging systems are generally not capable of handling this – and even when the workflow is possible, it’s nowhere near as convenient as what dedicated exception tracking systems allow.

If you really can’t break the habit of logging errors, you can at least add a hook to log.error that sends the error (and a complete stacktrace) to an error reporting tool like Sentry.

Logs as Durable Records

Furthermore, you can’t even assume that the logs you try to write will actually get written! Even when operating on a small scale, logs can be a lossy pipeline, and the potential for failure only increases with scale.

For example, if you’re developing a script intended to run on your local machine, do you know how your code will behave if your disk hangs? If you run out of space on the partition? How reliable is your log rotation? What happens to your server if it fails?

For a very small script, these sorts of failures may not matter to you, but they can come back to bite you, even at that small scale. For larger-scale services with tighter reliability guarantees, there are ways to mitigate these specific problems – a buffer, a log collector, a distributed indexer – but each solution comes with its own risks and problems. If you try to keep patching these by introducing more tools to make your logging reliable, at some point, you’ll discover that you’ve reinvented your own distributed database. And writing your own database is not inherently a bad idea, but to do it well, it’s the sort of task that’s best undertaken intentionally, rather than by accident.

To be fair to logging, this problem is not unique to logs. It comes from the limitations set forth by the CAP theorem, which means that every monitoring tool has to figure out its own way to deal with them. The problem with logs, however, is that the failure modes are much more subtle and easy to overlook.

For example:

statsd.Increment(‘api.requests_total”, tags={“country”: country.String()}, rate=.1)

It’s relatively obvious to see that this line of code might not always emit a metric, because even under normal operating conditions, you know (a) network operations can have problems, (b) it uses UDP, and (c) the metrics are sampled at the specified rate.

It’s a lot less obvious that this line of code might fail, because STDOUT and STDERR are generally assumed to be always available under normal operating conditions:

log.Printf(“Received api request from %s”, country.String())

Compared to logging, the failure modes of using runtime metrics, request tracing tools, or exception tracking tools are more visible and well-defined.

Instead of using logs as an accidental database, consider your underlying use case and which dedicated database would serve that need better. There’s no one-size-fits-all answer here; you may find that your use case is best served by a NoSQL database like CouchDB, or you may find that you’re really aiming to replicate the functionality of a message queue like Kafka, or another database altogether. If any of these (or other) tools fit your use case, they’re almost guaranteed to be a better fit, long-term, than logs.

Don’t Stop Logging, But Stop Reading Log Lines

By this point, it may sound like I’m firmly anti-logging. I do consider myself an environmentally-friendly person, but when it comes to software, I support careful uses of logging.

Logging can be useful for some purposes. However, it’s rare that they’re the only tool for monitoring your code. And it’s even rarer that they’re the best tool. When writing software that scales, you need to be able to deal with aggregate information – the firehose is too unwieldy to parse mentally. Logs that can be aggregated are better than logs that can’t. In those cases, it’s best to keep logging, but when you need to diagnose a problem, you’ll be interested in reading aggregate queries across your logs, rather than viewing raw, unaggregated logs in chronological order. The former is a powerful way to absorb a lot of information about your systems quickly. The latter is a glorified tail -f | grep.

The next time you start to write a log line, ask yourself whether another observability tool would be a better fit. Oftentimes, it will! If not, that’s fine. Just remember that, ideally, nobody should ever be reading that raw line, so take care to structure the information in a way that facilitates the kind of aggregation queries you’ll need.