Observing Kubernetes Services With Veneur

April 5, 2018 | Observability | kubernetes, observability, programming, veneur | Aditya Mukerjee

What is Veneur?

Veneur is a data pipeline for observing distributed systems. You can use Veneur for aggregating application and system data, like runtime metrics or distributed traces, and intelligently routing the data to various tools for storage and analysis. Veneur supports a number of these tools – called “data sinks” – such as SignalFx, Datadog, or Kafka. For this walkthrough, we’ll use Datadog.

Collecting pod-local metrics with Veneur

The first step to observing our services with Veneur is to deploy a local sidecar instance of Veneur inside our pod. To keep things simple, we’ll create an application whose only feature is to emit a heartbeat metric. We’ll do this using veneur-emit.

Veneur-emit is a convenience utility we wrote that allows us to emit statsd-compatible metrics at the command line for testing. It’s equivalent to using statsd client libraries (e.g. statsd) within an application. We wrote it to synthesize data transmission, and it is used in a similar way to netcat. So the following netcat command:

$ echo -n "resp_success:1|c|#method:post" | nc -4u localhost 8126

could be written as

$ veneur-emit -hostport udp://localhost:8126 -count 1 -name "resp_success" -tag "method:post"
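To make the equivalence concrete, here is a minimal Python sketch of what both commands put on the wire: one DogStatsD-style datagram sent over UDP. The helper names `format_count` and `send_udp` are hypothetical and are not part of Veneur or of any statsd client library.

```python
import socket


def format_count(name, value, tags=()):
    """Build a DogStatsD-style counter datagram, e.g. 'resp_success:1|c|#method:post'."""
    datagram = f"{name}:{value}|c"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_udp(datagram, host="localhost", port=8126):
    """Fire-and-forget send: UDP gives no confirmation that anything received it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Same payload as the netcat and veneur-emit commands above.
payload = format_count("resp_success", 1, tags=["method:post"])
print(payload)

if __name__ == "__main__":
    send_udp(payload)
```

Because the transport is plain UDP, the send succeeds whether or not anything is listening on port 8126.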

Basic Example

In this example, veneur-emit is a stand-in for any application you want to observe, if it’s instrumented using statsd client libraries. Since it’s the main application – and first container – in our pod, we’ll begin with the vanilla setup:

https://gist.github.com/4bd96253a0003769231798d0779fd4af

That’s the first container in our pod, and if we deploy it as-is, it’ll start faithfully firing off UDP metrics on port 8126. But there’s nothing listening on that port, so it’s just talking into the void. Let’s give it an audience by adding a second container in the same pod:

- name: veneur
  image: index.docker.io/stripe/veneur:3.0.0

Veneur requires almost no configuration out of the box; it defaults to working values. This new container is almost ready to start collecting metrics. However, without a Datadog API key, it won’t be able to send the metrics anywhere, so at the very least, we’ll need to configure our downstream data sinks.

Configuring a Pod-Local Veneur Collector

When running in non-containerized environments, Veneur can read its configuration from a file. But in Kubernetes, environment variables are more convenient. Thanks to envconfig, we can use these two forms of configuration interchangeably. Every config option specified in the yaml config example has a corresponding environment variable, prefixed with VENEUR_. So, we provide the following:

env:                
    - name: VENEUR_DATADOGAPIKEY
      valueFrom:
          secretKeyRef:
              name: datadog
              key: datadog_api_key

For more information on setting the secret key values in Kubernetes, see the Kubernetes documentation on secrets.

In addition, we listen for UDP packets on port 8126 to receive the metrics, and we also listen for HTTP traffic on port 8127. Port 8127 is used for the healthcheck endpoint. It can also be used to listen for metrics that are emitted over TCP using authenticated TLS instead of UDP, but we won’t cover that here, as it’s not needed for most Kubernetes deployments.

So, putting it all together, we have:

https://gist.github.com/d873787a4a08a6eafee870e4fdaf426f

If we apply this deployment (`kubectl apply -f veneur-emit.yaml`), we’ll start to see these metrics come in, at the rate of six per minute. We’re sending metrics all the way from our application to Datadog!

If all you care about is collecting pod-local metrics, that’s it! You can stop here. But, there’s a good chance you want global metric aggregation as well. For that, we’ll have to add a little more.

Global Aggregation – How it Works

Veneur is capable of supporting global metric aggregation.

Let’s say you have an API server with 200 different replicas responding to stateless requests, load-balanced evenly across all instances. If you want to reduce your API latency, you probably want to know the 95th or 99th percentile for API latency across all 200 replicas, rather than the 99th percentile latency for the first pod, the 99th percentile for the second, and so forth. Having the separate latencies for the individual pods doesn’t help you, because percentiles aren’t monoids: there’s no mathematical operation to construct the global 99th percentile from the 200 different measurements of the 99th percentile on each pod.
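A small worked example shows why per-pod percentiles can’t be recombined. This is only an illustrative sketch: the latencies are made up, and `percentile` is a simple nearest-rank helper written for this example.

```python
def percentile(values, p):
    """Nearest-rank percentile; p is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using integer math
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: one busy pod and one quiet, slow pod.
pod_a = list(range(1, 1001))  # 1000 requests taking 1..1000 ms
pod_b = [5000] * 10           # 10 requests taking 5000 ms each

p99_a = percentile(pod_a, 99)
p99_b = percentile(pod_b, 99)
p99_global = percentile(pod_a + pod_b, 99)

print(p99_a, p99_b, p99_global)  # 990 5000 1000
```

The global 99th percentile (1000 ms) is not the mean (2995 ms), the minimum, or the maximum of the two per-pod values. Recovering it requires information about the underlying distributions, not just their percentiles.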

Some time-series databases sidestep this problem by storing the raw data and performing the percentile aggregation at read-time, but this requires tremendous network, disk, and compute resources at scale, especially if we want good performance from our queries at read-time. Performing write-time aggregation reduces this load. But calculating percentiles before writing the results requires either sending all metrics to a single aggregator pod (which becomes a single-point-of-failure) or calculating them on a per-pod basis. Fortunately, Veneur has a nifty approach that gives the best of both worlds: global aggregate calculations at write-time, with no single-point-of-failure, and with an architecture that scales horizontally using existing Kubernetes primitives.

Veneur uses t-digests to compute extremely accurate percentile estimates in an online manner, using constant memory. While these percentile calculations are technically approximations, in practice, the error properties yield exact results for tail ends of the distribution, like the 99th percentile. Conveniently, that’s generally what we care about when observing our services, especially latencies.
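The property that makes this work is that the summaries can be merged. The sketch below is not a t-digest: it swaps in a much cruder fixed-width histogram (with a hypothetical 10 ms bucket width) purely to show that mergeable summaries let us estimate a global percentile without shipping raw data around.

```python
from collections import Counter

BUCKET_MS = 10  # hypothetical fixed bucket width; a t-digest adapts its resolution instead


def sketch(latencies_ms):
    """Summarize raw values as a count per bucket."""
    return Counter(int(v // BUCKET_MS) for v in latencies_ms)


def merge(a, b):
    """Merging two summaries is just adding their bucket counts."""
    return a + b


def estimate_percentile(summary, p):
    """Upper edge of the bucket that contains the nearest-rank p-th percentile."""
    total = sum(summary.values())
    rank = -(-p * total // 100)  # ceil(p * total / 100)
    seen = 0
    for bucket in sorted(summary):
        seen += summary[bucket]
        if seen >= rank:
            return (bucket + 1) * BUCKET_MS
    raise ValueError("empty summary")


# Made-up per-pod latencies in milliseconds.
pod_a = list(range(1, 1001))
pod_b = [5000] * 10

# Each pod summarizes locally; only the summaries are merged.
merged = merge(sketch(pod_a), sketch(pod_b))
estimate = estimate_percentile(merged, 99)
print(estimate)  # 1010, within one bucket of the exact value (1000)
```

Merging the per-pod summaries gives exactly the summary of the combined data, which is what lets the aggregation be spread across processes.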

Veneur provides a horizontally scalable mechanism for calculating global aggregates of timers, histograms and sets. The pod-local metrics – counters and gauges – are shipped off to Datadog immediately, but timers, histograms, and sets are forwarded to a separate service for global aggregation. The aggregation happens in two layers.

Veneur Proxy and Veneur Global

First, the pod-local veneur processes forward all timers, histograms, and sets to the veneur-proxy service, which is exposed by Kubernetes. Because Veneur is stateless, the choice of pod and node are arbitrary for the veneur-proxy service, and this can be handled by the built-in Kubernetes support for service discovery and load-balancing.

Under the hood, the veneur-proxy processes have to do a bit of coordination to split up their metrics properly. Assume that we have three proxy pods and two global pods, and that each proxy pod receives three metrics – named “red”, “blue”, and “pink” for convenience – from an arbitrary number of application pods. We can’t just lean on regular load-balancing within Kubernetes – we need to ensure that all metrics of the same name end up on the same global pod, or else we won’t have aggregated them properly.

Fortunately, this is all handled automatically by the veneur-proxy and veneur-global services. The veneur-proxy pods distribute the metrics consistently – all red metrics (the solid red arrow) end up on the same global pod, all blue metrics (blue dotted arrow) end up on the same global pod, and all pink metrics (dashed pink arrow) end up on the same global pod.

Of course, the number of replicas of the veneur-proxy and veneur-global services is arbitrary (and each veneur-global pod is capable of aggregating far more than one metric at a time!). Since both services are stateless, they can be horizontally scaled and adjusted dynamically to adapt to your particular metric load.
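The routing requirement can be sketched with a consistent hash ring. This is an illustration of the general technique, not Veneur’s actual implementation: the `HashRing` class, the pod names, and the choice of MD5 with 64 virtual nodes per pod are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash (Python's built-in hash() differs between processes)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps metric names to pods; each pod owns several points on the ring."""

    def __init__(self, pods, vnodes=64):
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pod_for(self, metric_name):
        # First ring point clockwise from the metric's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(metric_name)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical pod names. Every proxy builds the same ring independently.
globals_ = ["veneur-global-0", "veneur-global-1", "veneur-global-2"]
proxy_1, proxy_2 = HashRing(globals_), HashRing(globals_)

for name in ("red", "blue", "pink"):
    # Both proxies agree on the destination without talking to each other.
    assert proxy_1.pod_for(name) == proxy_2.pod_for(name)
    print(name, "->", proxy_1.pod_for(name))
```

With this kind of scheme, removing one global pod only remaps the metrics that pod owned; metrics assigned to the surviving pods stay where they were.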

To create the veneur-proxy and veneur-global services, we just need to provide the following definitions.

For the proxy:

https://gist.github.com/50630f831efc628d4030482f297e7a80

https://gist.github.com/4caa8066bfc4836b57a02f4d9130e05f

The second section is the same as what we saw before – veneur-proxy is a separate application binary, and for it to emit metrics about itself, we’ll want it to have its own, pod-local Veneur instance to talk to.

The global service configuration is even easier:

https://gist.github.com/bdfe9e331411b89b6e9b99807a1d6990

https://gist.github.com/ce1000c4dfb4a95d6e32179185576603

Resource Limits

While Veneur’s functionality extends beyond just collecting runtime metrics, compared to other dedicated metric collectors, Veneur’s resource usage is pretty lightweight. The portion that’s the most expensive computationally – global metric aggregation – is performed on dedicated pods (veneur-global).

Kubernetes has built-in support for resource limits, leveraging the power of cgroups in a convenient abstraction. We can use resource limits to cap the CPU and memory usage of the pod-local Veneur instances.

- name: veneur
  image: index.docker.io/stripe/veneur:3.0.0
  resources:
      limits:
          cpu: 25m

Invariants

With this three-tier setup, we maintain a number of useful invariants about our observability system.

Stateless: Each Veneur process is entirely stateless. No data remains in memory for longer than ten seconds, so failure or eviction of a single pod running Veneur has very limited impact on our system’s overall observability.
Distributed: There is no single point of failure. If any pod-local veneur process is killed and restarted, we only lose data for that pod, and only for one flush cycle. If any proxy or global veneur process is killed and restarted, we only lose ten seconds of data, and only for 1/n metrics (where n is the number of replicas we’re running).
Horizontally scalable: We can run as many instances of the proxy and global boxes as we need to, to support arbitrarily large loads.
Fault-tolerant: If a global veneur instance becomes unreachable, all proxy boxes will update their hashing scheme immediately for the next flush cycle.

Prometheus

Veneur also supports pull-based metric aggregation with Prometheus. veneur-prometheus is a separate binary which polls a Prometheus metrics endpoint to ingest metrics into the Veneur pipeline.

https://gist.github.com/89d3d038b7858e050e6306369029607b
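Conceptually, bridging a pull-based endpoint into a push-based pipeline means scraping the text exposition output and re-emitting each sample. The sketch below is not veneur-prometheus’s code. It handles only simple `name{labels} value` lines, ignores escaping and timestamps, and emits every sample as a gauge for simplicity; the function name is hypothetical.

```python
import re

# Simplified patterns: real exposition format also allows escapes and timestamps.
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def to_statsd_gauges(exposition_text):
    """Convert scraped Prometheus samples into DogStatsD-style gauge datagrams."""
    datagrams = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        tags = [f"{key}:{val}" for key, val in LABEL.findall(labels or "")]
        datagram = f"{name}:{value}|g"
        if tags:
            datagram += "|#" + ",".join(tags)
        datagrams.append(datagram)
    return datagrams


scraped = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
process_open_fds 12
"""

for datagram in to_statsd_gauges(scraped):
    print(datagram)
```

A real bridge would also have to translate counter semantics (Prometheus counters are cumulative, statsd counters are deltas), which this sketch deliberately skips.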

…and more!

We’re just scratching the surface of how Veneur can help you get better observability into your Kubernetes systems. Observability is not just about runtime metrics – it’s about request traces and logs as well. Veneur is an embodiment of the philosophy that logs, runtime metrics, and traces are all deeply connected, and that there should be an intelligent metrics pipeline handling all three.

If any of this excites or intrigues you – or if you have more thoughts on what kinds of visibility you need in your Kubernetes deployments – you’re in luck! Veneur is actively developed, with tagged major releases every six weeks. Drop us a line over on the issue tracker!

Acknowledgements

Thanks to Jess Frazelle, Kris Nova, and Cory Watson for reading drafts of this post!

Remote Work: 2 Years In

October 3, 2017 | programming, Work | extrovert, introvert, pair programming, programming, remote, work from home | Aditya Mukerjee

I’ve been working for Stripe for over two years now. The company is based in San Francisco, and I’ve been working remotely from New York City the entire time. When I was deciding whether to accept my offer, I spoke with a number of people to get a feel for the company, the work I’d be doing, and the experience of working remotely.

One of the people I talked to was Julia Evans, who had conveniently written two great blog posts about her experiences working remotely, both three months in and eight months in. With remote work becoming more popular at companies, I’d like to share my experiences of remote work after two years.

I’ve experienced a wide range of remote working configurations: For the first year, I was working at a coworking space with a few other people from the same company. For logistical reasons, we ended up moving out of the space that we worked in together, so since then, I’ve been in a different coworking space by myself. During all of that, I’ve also spent some days here and there working from other places: my apartment, working from a house where I’m visiting friends or family, and working out of our international offices.

Based on that, and talking to friends who work remotely at other companies, here are some things I’ve learned for myself about working remotely. Your mileage may vary, but it’s what works well for me. Remote work has both pros and cons, and it might not be for everyone, but at this point, I’ve found I now actually prefer it to working in a traditional office.

Don’t work from home

Working remotely doesn’t mean you have to work from your couch or your bedroom. Especially if, like me, you live in a major city and have a small apartment, I’d strongly discourage it. Going to a coworking space every day, even if I’m the only one from my company working there, provides a mental transition in my day. When I’m there, I’m in work mode, and it’s easy to focus with no distractions. I’m not averse to working from home on occasion if I need to, but if I had to do it more than a day or two in a row, it’d be really stifling. (If you’re in New York or the tri-state area and are looking for coworking space recommendations, drop me a line – I’ve worked at a number of different ones over the years, and I can give you a good run-down of the pros/cons of each).

If you do decide to work from home, I would recommend a few things. First, set up a dedicated home office. Ideally, this is a separate room that you use only for work. At the very least, put up a curtain or folding partition to give some physical separation. Don’t use this space for anything else – if you’re a PC gamer, for example, put your gaming rig in a different room. If you live with others (roommates, significant others, kids, or pets), having a physical space that says “when I’m here, treat me like I’m at the office and I’m not at home” makes things easier.

Secondly, make sure you leave the house! Cabin fever is real. Give yourself reasons to leave the house at the beginning and end of your day. Going to the gym, getting coffee, walking the dog – make these part of your routine. One person I know would leave the house every morning, walk clockwise around the block, and call that his “commute” to the office. In the evenings, he’d repeat the process, but counter-clockwise. While it might seem a little artificial, it makes a difference!

Oh, and, if you’re ever being interviewed on live TV, make sure that the door locks.

The team matters

You don’t have to be on an all-remote team to have a good experience working remotely, but the rest of your team has to be on board with supporting remote-friendly workflows. That means making sure all meetings include video conference links, so you can participate. Or using asynchronous communication (email, JIRA, etc.) to document state which otherwise would only be exchanged verbally. In a pinch, even Twitter can work – whatever you do, just make sure that the rest of the team either already has remote-friendly habits or is willing to develop them. (Remote-friendly workflows are also beneficial even on teams that are 100% co-located, so it’s to their benefit to be doing this anyway, but that’s the topic of another post!).

This applies to the company as well. If your company isn’t committed to remote-friendly workflows, or to remote work in general, you’ll have a much harder time. You don’t need the company to be 100% distributed – or even majority distributed. If you’re considering working remotely for a company, ask about how many other people work remotely, and how the company ensures that those people are integrated into their workflow. If there aren’t any, that’s fine – someone has to be first! – in which case you’ll want to ask why they’re looking to hire people remotely, and what parts of their company workflow, policies, and structure they anticipate having to modify to transition into a remote-friendly company. (There’s no specific “right” answer here; you’re looking to see what their planning process is like).

Timezones require some effort, but aren’t too hard

On my team, I’m the only person on Eastern time. One is on Central time, and the rest are on Pacific. Most engineers on other teams are also on Pacific time. At first glance, that sounds like it’d be hard – I’m 3 hours offset from most of the people I work with – but in practice, it’s not been too difficult after some adjustment. It effectively divides my day in two: most of my meetings are after lunch, which leaves my mornings free to focus with uninterrupted time.

I still have approximately the same number of meetings per week as my teammates, so the downside is that this means my meetings are compressed into the afternoon hours. This can be tiring on a meeting-heavy day, but I’ve gotten accustomed to it.

The timezone difference is probably the biggest change between my first few months working remotely and my current experience. In the first couple of months at any job, you’re more likely to get stuck on something unfamiliar, or need to request access, or somehow involve another person in your work process. If you bump into one of these early in the day, it means you’ll need to wait until after lunch to get unblocked. The workaround is to make sure you have a few different projects in parallel that you can work on, so that getting blocked on one doesn’t kill your morning. Fortunately, this stage doesn’t last too long. Within a month or two, I found this happening less often – at least, no more often than I might get blocked on someone who worked in the same physical office.

Normally I work out of New York, but I spent a bit of time working from London and Singapore when I was traveling there for conferences. Working from London and Singapore was definitely harder, because there’s almost no overlap between normal working hours in California and London or Singapore. In both cases, it was short-term (less than a week), which meant that we could manage it with some careful planning (choosing projects that are encapsulated for that week). It would be harder to manage that longer-term without having at least some other engineers on my team working from an overlapping timezone.

Use a headset for video conferencing

For a long time, I used the built-in microphone for my laptop (along with headphones and an external webcam). Getting an all-in-one headset made a huge difference, both for me and my teammates. Between the latency and the (lack of) noise correction, using the built-in microphone meant a lot of slightly awkward pauses, or accidentally interrupting because the other person had started speaking, but I couldn’t hear it yet. With an all-in-one headset, the microphone is much closer to the source (your voice) and it doesn’t pick up the sound from your headset speakers the way your laptop microphone does when you use it without headphones. This makes the noise correction work much better, giving the illusion of lower latency, and so it feels a lot more like having a natural conversation. (I have the Jabra UC Voice, but I think any dedicated computer headset will work better than the built-in laptop microphone and speakers.)

The microphone on the other end matters as well. At our office, the microphones in the conference room have noise detection that’s eerily powerful. When I first started, it took some getting used to – I kept thinking that the audio on the other end had dropped, when in reality, nobody was speaking, and the background noise was getting filtered out! It’s not perfect, but it’s better than using Hangouts or Skype between two laptops.

Visit other coworkers often

Between Slack/IM and video conferencing, it’s easy to forget that I’m 3,000 miles away from my teammates. I’m good friends with all of them, and we chat a lot throughout the day. Even so, seeing your coworkers regularly in person is important.

The exact cadence depends on your personal situation and also your team layout – for a team that’s mostly distributed, you can probably get away with less frequent visits. For me, going to our headquarters every three months at a minimum helps me feel connected, not just to my teammates, but to coworkers on other teams that I work with less frequently.

At first, visiting the HQ office as a remote worker is a weird experience. I end up spending a lot of my time meeting with people, mostly in unstructured meetings (“coffee walks”, etc.). Compared to when I’m working the rest of the time, I spend a lot less time writing code. The best way I’ve found to approach it is to remember this: when I’m working remotely, I don’t have as much face time with my coworkers as they do with each other, and my visits to the office are my time to pay down that “debt”.

I think that one of the reasons working remotely works well for me is that my personal interest rate on this debt is low. I’m able to get most of what I need from email, Slack, or video conferencing, and for the rest, I’m able to catch up quickly enough from in-person visits. On the flip side, context-switching is expensive enough that I’d rather group my “catch-up” time into a fixed period than interleave it throughout my normal daily routine.

This might sound tiring if you’re introverted – I’ll get to this later – but it doesn’t need to be. Don’t overdo it. Just do what feels natural, and don’t worry if it feels like you’re less productive when you’re visiting. Remember: productivity isn’t just measured by lines of code, and the time you spend building relationships with your coworkers is part of your work.

Also, note that I said “other coworkers”, not “other offices”! The tech lead for our team also works remotely, and we’ve met up in other cities as well – he happened to be speaking at a conference in New York (where I live), and we both went to a conference in Portland together. As I write this, I realize that he and I have actually never spent any appreciable amount of time together in the San Francisco office, but I’ve still never felt a lack of relatedness, because I still see and interact with him enough.

At the end of the day, you’re optimizing for relatedness – the feeling of being connected to other people and caring about them – not the number or frequency of your in-person interactions. Relatedness is “the universal want to interact, be connected to, and experience caring for others” – some people need this to happen in person to satisfy their need for human interaction, but others may not. If you’re able to satisfy this for yourself through virtual interactions, you can probably get away with meeting in person less frequently (or even not at all – I know a couple of people who are totally fine never meeting their coworkers in person, but I think that’s rare).

Schedule recurring pair-programming meetings

I’d strongly recommend this for anyone, whether or not they’re working remotely, but it’s particularly useful for remote engineers. I have standing weekly pairing sessions with four people on my team. Ideally I could schedule a weekly pairing session with everyone, but there are only so many hours in the day!

There are a lot of benefits to regular, scheduled pairing (as opposed to ad-hoc pairing, or no pairing at all), but for now, I’ll just talk about the part that’s relevant to working remotely. When working remotely at a company that is not 100% remote, you have to make an active effort to make sure that your non-remote coworkers are aware of you, the work you’re doing, and your areas of expertise. Regular pairing sessions are a good forcing function for that: in any given week, whether you’re working on a project of yours or a project of theirs, you’re getting “face time” with your coworkers, and sharing knowledge about each other’s projects.

Introversion or extroversion

I’ve heard a lot of people argue very strongly that remote work is best for introverted people, because you spend most of your days alone. I’ve heard others argue that it’s best for extroverted people, because you have to spend more effort getting to know people across the company when you’re not physically present.

I don’t think it matters. I know people who work remotely with success who are introverted, extroverted, or ambiverted. People will adjust their work and life to match their personal styles wherever they are working, remotely or not. If you discover you need more social interaction, you’ll find ways to increase your interpersonal interaction with your coworkers, or to do that outside work. If you discover you need less, you’ll find ways to reduce it, by relying less on synchronous communication or adjusting your work schedule. Being remote isn’t only for introverts (or extroverts), the same way working in an office isn’t only for extroverts (or introverts).

If you’re working remotely (or considering working remotely), I’d love to talk more about this with you and exchange tips. Ask me on Twitter or drop me an email. Oh, and I just found out that we have a conference now too!

Thanks to Julia Evans and Karla Burnett for reading drafts of this post!

Don't Read Your Logs

May 19, 2017 | Observability | Aditya Mukerjee

Photo provided by Karen Arnold under the CC0 Public Domain license.

I’ve had a number of discussions, both offline and online, about logging practices. In my view, reading individual log lines is (almost) always a sign that there are gaps in your system’s monitoring tools – especially when dealing with software at a medium or large scale. Just as I would never want to use tcpdump to analyze statsd metric packets individually, ideally, I don’t want to look at individual log lines either.

This advice follows the spirit of “disable port 22 on your machines”. While most developers would agree that using dedicated orchestration tools is preferable to manual server wrangling, I know of very few projects or companies that take the extreme stance of disabling SSH access to all machines. Nevertheless, treating manual SSH as a sign of gaps in your system’s orchestration tools is an excellent master cue for architecting scalable and maintainable systems.

Similarly, I know of none that truly send all logs to /dev/null, or block read access to all logs. But treating logs as an extreme anti-pattern provides an excellent forcing function for designing observable systems.

Logs as Metrics

Let’s start with a basic example:

log.Printf("Loaded page in %f seconds", duration)

In this case, the log line serves the same purpose as a statsd metric. We could replace it immediately with

statsd.Histogram("page.load.time_ms", duration)

and the result would be better, because we’d be able to use the full extent of aggregation tools at our disposal. It’s possible to extract this information from a log line into a structured form, but it’s more work, and it’s unnecessary. The log line doesn’t give us any information that the structured metric doesn’t.
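To illustrate the extra work, here is a small Python sketch comparing the two routes to the same histogram sample. Both helper functions are hypothetical, and the log line format (including its timestamp prefix) is assumed for the example.

```python
import re


def metric_from_log(line):
    """Recover the measurement by parsing free text; breaks if the wording changes."""
    match = re.search(r"Loaded page in ([0-9.]+) seconds", line)
    if match is None:
        return None
    return f"page.load.time_ms:{float(match.group(1)) * 1000:.0f}|h"


def metric_direct(duration_seconds):
    """Emit the structured metric where the measurement is taken."""
    return f"page.load.time_ms:{duration_seconds * 1000:.0f}|h"


log_line = "2017/05/19 10:00:00 Loaded page in 0.250000 seconds"

print(metric_from_log(log_line))  # page.load.time_ms:250|h
print(metric_direct(0.25))        # page.load.time_ms:250|h

# A harmless-looking rewording of the log message silently drops the metric.
print(metric_from_log("2017/05/19 10:00:00 Page loaded in 0.25s"))  # None
```

The parsed route yields the same value, but only for as long as nobody edits the log message.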

Logs as Debugger Tracing

A more common example:

log.Printf("about to make API request on %v", obj_id)
obj = c.Load(obj_id)
if obj == nil {
    log.Printf("could not load object %v", obj_id)
} else {
    log.Printf("loaded object %v", obj_id)
}

Logs are oftentimes used as a runtime pseudo-debugger. In this case, we’re using logs as a way to verify that a particular line of code was called for a particular transaction. The actual text of the log doesn’t even matter. Instead of “About to make API request”, we could have written

log.Printf("api.request_pre %v", obj_id)

or even

log.Printf("I like potato salad %v", obj_id)

As long as it’s unique to that particular line of code, it serves functionally the same purpose – it confirms that the program execution reached that point in the code.

When we use log lines this way, we’re forming a mental model of the code, and using the logs to virtually step through the code, exactly the way a traditional debugger like GDB might. Transaction-level (or request-level) tracing tools provide this same kind of visibility, with a better visual display.
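As a rough illustration of what a tracing tool records in place of those three log lines, here is a minimal Python sketch of a span. The `span` helper, the in-memory `SPANS` list, and the stand-in for `c.Load` are all hypothetical; a real tracing client would ship spans to a collector rather than keep them in a list.

```python
import time
from contextlib import contextmanager

SPANS = []  # hypothetical in-memory sink standing in for a trace collector


@contextmanager
def span(name, **tags):
    """Record one timed, tagged unit of work, including whether it raised."""
    record = {"name": name, "tags": dict(tags), "error": False}
    start = time.monotonic()
    try:
        yield record
    except Exception:
        record["error"] = True
        raise
    finally:
        record["duration_s"] = time.monotonic() - start
        SPANS.append(record)


def load(obj_id):
    # One span replaces "about to load", "loaded", and "could not load".
    with span("api.load_object", obj_id=obj_id) as current:
        obj = {"id": obj_id} if obj_id > 0 else None  # stand-in for c.Load(obj_id)
        current["tags"]["found"] = obj is not None
        return obj


load(42)
load(-1)
for recorded in SPANS:
    print(recorded["name"], recorded["tags"])
```

Each span carries the same facts as the log lines (which code ran, for which object, with what outcome) plus timing, in a form that can be queried and visualized per request.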

Without actually counting, I’d estimate that at least 80% – if not more – of log lines that I’ve seen in most open-source projects fit this overall use case: using log lines to virtually “trace” the execution path of source code on a particular piece of problematic input.

Logs as Error Reporting

Another common pattern:

try:
    writeResult()
except Exception as e:
    log.info("error writing result! %s", e)

Here, we’re using logs as a way to capture context for an error. But again, this is a relatively inconvenient way to explore information like a stacktrace. Tools like Sentry or Crashlytics also allow exception reporting, but unlike logging tools, they allow us to classify and group exceptions. We can still view individual stacktraces, but we don’t have to sift through as much noise to identify problems. And by tracking the state of reported exceptions, we can identify regressions more easily. Structured logging systems are generally not capable of handling this – and even when the workflow is possible, it’s nowhere near as convenient as what dedicated exception tracking systems allow.

If you really can’t break the habit of logging errors, you can at least add a hook to log.error that sends the error (and a complete stacktrace) to an error reporting tool like Sentry.
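In Python, one way to add such a hook is a custom `logging.Handler`. The sketch below is an assumption-laden illustration: `ErrorReportingHandler` is a hypothetical class, and the `report` callable stands in for whatever client your error reporting tool provides (it does not use the Sentry SDK).

```python
import logging
import traceback


class ErrorReportingHandler(logging.Handler):
    """Forward ERROR-and-above records, with stack traces, to a reporting callable."""

    def __init__(self, report):
        super().__init__(level=logging.ERROR)
        self._report = report  # stand-in for an error reporting client

    def emit(self, record):
        stacktrace = ""
        if record.exc_info:
            stacktrace = "".join(traceback.format_exception(*record.exc_info))
        self._report({"message": record.getMessage(), "stacktrace": stacktrace})


reports = []  # in this sketch, "reporting" just appends to a list
log = logging.getLogger("writer")
log.addHandler(ErrorReportingHandler(reports.append))

try:
    raise IOError("disk full")
except Exception:
    # exc_info=True attaches the active exception and its traceback to the record.
    log.error("error writing result!", exc_info=True)

print(reports[0]["message"])
```

Existing `log.error` call sites stay unchanged; the handler decides where the error ends up.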

Logs as Durable Records

Furthermore, you can’t even assume that the logs you try to write will actually get written! Even when operating on a small scale, logs can be a lossy pipeline, and the potential for failure only increases with scale.

For example, if you’re developing a script intended to run on your local machine, do you know how your code will behave if your disk hangs? If you run out of space on the partition? How reliable is your log rotation? What happens to your server if it fails?

For a very small script, these sorts of failures may not matter to you, but they can come back to bite you, even at that small scale. For larger-scale services with tighter reliability guarantees, there are ways to mitigate these specific problems – a buffer, a log collector, a distributed indexer – but each solution comes with its own risks and problems. If you try to keep patching these by introducing more tools to make your logging reliable, at some point, you’ll discover that you’ve reinvented your own distributed database. And writing your own database is not inherently a bad idea, but to do it well, it’s the sort of task that’s best undertaken intentionally, rather than by accident.

To be fair to logging, this problem is not unique to logs. It comes from the limitations set forth by the CAP theorem, which means that every monitoring tool has to figure out its own way to deal with them. The problem with logs, however, is that the failure modes are much more subtle and easy to overlook.

For example:

statsd.Increment("api.requests_total", tags={"country": country.String()}, rate=.1)

It’s relatively obvious to see that this line of code might not always emit a metric, because even under normal operating conditions, you know (a) network operations can have problems, (b) it uses UDP, and (c) the metrics are sampled at the specified rate.
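Point (c) is worth spelling out. Client-side sampling typically works roughly like the Python sketch below, where `sampled_increment` is a hypothetical helper and the injectable `rng` parameter exists only to make the behavior easy to demonstrate.

```python
import random


def sampled_increment(name, rate, rng=random.random):
    """Return the datagram to send, or None when this call is sampled out."""
    if rng() >= rate:
        return None  # roughly (1 - rate) of calls never touch the network
    # The @rate suffix lets the server scale the count back up.
    return f"{name}:1|c|@{rate}"


# With a fixed "random" draw, the outcome is easy to see.
print(sampled_increment("api.requests_total", 0.1, rng=lambda: 0.05))
print(sampled_increment("api.requests_total", 0.1, rng=lambda: 0.50))  # None
```

The loss here is explicit: it is written into the call site and encoded in the datagram, so the receiving end can correct for it.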

It’s a lot less obvious that this line of code might fail, because STDOUT and STDERR are generally assumed to be always available under normal operating conditions:

log.Printf("Received api request from %s", country.String())

Compared to logging, the failure modes of using runtime metrics, request tracing tools, or exception tracking tools are more visible and well-defined.

Instead of using logs as an accidental database, consider your underlying use case and which dedicated database would serve that need better. There’s no one-size-fits-all answer here; you may find that your use case is best served by a NoSQL database like CouchDB, or you may find that you’re really aiming to replicate the functionality of a message queue like Kafka, or another database altogether. If any of these (or other) tools fit your use case, they’re almost guaranteed to be a better fit, long-term, than logs.

Don’t Stop Logging, But Stop Reading Log Lines

By this point, it may sound like I’m firmly anti-logging. I do consider myself an environmentally-friendly person, but when it comes to software, I support careful uses of logging.

Logs can be useful for some purposes. However, it’s rare that they’re the only tool for monitoring your code. And it’s even rarer that they’re the best tool. When writing software that scales, you need to be able to deal with aggregate information – the firehose is too unwieldy to parse mentally. Logs that can be aggregated are better than logs that can’t. In those cases, it’s best to keep logging, but when you need to diagnose a problem, you’ll be interested in reading aggregate queries across your logs, rather than viewing raw, unaggregated logs in chronological order. The former is a powerful way to absorb a lot of information about your systems quickly. The latter is a glorified tail -f | grep.

The next time you start to write a log line, ask yourself whether another observability tool would be a better fit. Oftentimes, it will! If not, that’s fine. Just remember that, ideally, nobody should ever be reading that raw line, so take care to structure the information in a way that facilitates the kind of aggregation queries you’ll need.
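
As a rough illustration of what “structured for aggregation” can look like, here is a small Python sketch. It assumes one JSON object per log line; the field names (event, country) and the helper names log_event and count_by are made up for this example, not part of any logging library.

```python
import json
from collections import Counter
from typing import Iterable


def log_event(event: str, **fields: str) -> str:
    """Serialize one log event as a single JSON line."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


def count_by(lines: Iterable[str], event: str, field: str) -> Counter:
    """Count occurrences of `field` across all lines matching `event`."""
    counts: Counter = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("event") == event:
            counts[record.get(field)] += 1
    return counts


# A few hypothetical log lines, as they might be written by a service.
lines = [
    log_event("api_request", country="US"),
    log_event("api_request", country="IN"),
    log_event("api_request", country="US"),
    log_event("login", country="US"),
]

requests_by_country = count_by(lines, "api_request", "country")
```

Nobody has to read the individual lines here; the question “which countries are sending API requests?” is answered by the aggregate. In practice the same query would run in whatever log aggregation tool you use, not in a hand-rolled script.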

I Can Text You 💩, But I Can’t Write My Name

March 17, 2015bengali, emoji, unicodeAditya Mukerjee

Today, Model View Culture published an article I wrote about Unicode, character encoding, and non-Latin alphabets. I’ve included an excerpt below:

I am an engineer, and I am a writer. As an engineer, I spend a lot of time thinking about how text is stored, but relatively little about what information the text actually represents. To the computer, text is an abstract entity – a stream of 0s and 1s, and any semantic meaning is in the eye of the beholder. As a writer, of course, the meaning is everything, and the mechanics of how the text is stored is merely a technical detail.

But in an economy that is increasingly digital, increasingly global, and increasingly multilingual, we can no longer maintain this distinction. The information we want to represent is intimately linked to how it is stored. We can no longer separate the two.

Read the rest at Model View Culture

Photo CC-BY TMAB2003.

What Would Body Cameras Do?

December 18, 2014akai gurley, body cameras, eric garner, icantbreathe, mike brown, susan sontag, sydette harryAditya Mukerjee

On December 3rd, a New York City grand jury failed to indict NYPD officer Daniel Pantaleo in the death of Eric Garner. Protesters around the world, from Oakland to New Delhi, reacted to this decision, demanding reforms to counterbalance the power wielded by law enforcement. They adopted as a slogan Garner’s chilling final words: I can’t breathe. I can’t breathe. I can’t breathe.

Garner is only one of several high-profile cases of black men killed by police. Sadly, these incidents are not rare. By some counts, a black man is killed by police officers nearly every day. Race plays heavily into this risk: a black, male teenager is 21 times more likely to be shot dead by a police officer than a white one.

Of all the varied proposals for reform, perhaps the most popular among politicians is to outfit all police officers with body cameras. President Obama recently requested over $250 million from Congress to fund body cameras and police training. Proponents of this plan claim that body cameras will ensure that evidence is available in all cases of alleged police misconduct. They note that people behave differently when they know they are being watched, and conclude that body cameras will reduce misconduct, both by police officers and by civilians.

This argument draws on a common narrative: photography as documentation. This narrative is by no means new. Susan Sontag wrote, “A photograph passes for incontrovertible proof that a given thing happened. The picture may distort; but there is always a presumption that something exists, or did exist, which is like what’s in the picture”.

And yet, we must ask ourselves: is there such a thing as an impartial photograph? After all, every photograph tells a story. Every photograph is narrated in the first person.

Sontag explains, “To photograph is to appropriate the thing photographed. It means putting oneself into a certain relation to the world that feels like knowledge – and, therefore, like power”. Body cameras, mounted on the bodies of the police, serve as a permanent record of what the officers see. Body cameras, mounted on the bodies of the police, ensure that police remain in the position of power: the allegedly infallible narrator. Body cameras, mounted on the bodies of the police, reinforce the same imbalance in power structures that they are purported to keep in check. They allow officers to appropriate every interaction by legitimizing the literal viewpoint of the officer. But the objective of reform is not to appropriate civilians targeted by law enforcement; it is to appropriate law enforcement itself.

It might be a different story if we ensured that this power would be reciprocal: that citizens would be able to appropriate law enforcement, just as law enforcement appropriates black lives. But this is not the case: citizen bystanders often get harassed by officers when recording encounters, even when recording officers is legal. In fact, in Garner’s case, Pantaleo was not indicted, but Ramsey Orta, the bystander who filmed Garner’s death, was. At the same time as we provide police with an additional form of power, we rob citizens of this same tool. Police officers may tell their story, but citizens remain “the thing photographed.”

Defendants are not required to testify before a grand jury. Their attorneys usually recommend against it, as it can be incredibly risky. Defendants’ attorneys are not permitted to be present, and with no judge, defendants are completely at the mercy of the prosecutor. Yet, despite these circumstances, Pantaleo felt confident enough to testify, and during the grand jury hearing, he narrated “three different videos of the arrest that were taken by bystanders”. If he had worn a body camera, perhaps he could have stayed at home; his account would have been presented as a fourth video, with him behind the camera.

So we must ask ourselves – would body cameras have made a difference in Garner’s case? If not, what is the goal of arming officers with one more weapon? Or more bluntly, as Sydette Harry asks, ‘Why must black death be broadcast and consumed to be believed, and what is it beyond spectacle if it cannot be used to obtain justice?’.

Thanks to Andrea Garcia-Vargas, Dan Mundy, Michael, and Jakob for reading drafts of this post

Image provided by Scott Robinson under the Creative Commons 2.0 License

Beyond Culture Fit: Community Value-Add

April 29, 2014UncategorizedAditya Mukerjee

Recently, a founder of an early-stage startup asked me for tips on evaluating culture fit when building an early team. As the founders of most successful startups will agree, picking the first few members of your team is important. They set the tone for your company as it grows.

Personally, I think that the term “culture fit” can be misleading. It implies a sort of homogeneity, which is actually the exact opposite of what most companies want. I make a point of the language here because I think it can be harmful to internalize the phrase “culture fit” when what you really want is to build a community. “Community value-add” might be a better term.

If you think of yourselves as evaluating “culture fit” you’re placing your brain in pattern-matching mode, using the team you already have as a pattern and evaluating individuals against that pattern to test their fit. Even if you don’t intend to, this means you may implicitly be looking for someone who is like you and your cofounder(s)/teammate(s). Those people aren’t necessarily bad to have, but it can be a limiting perspective. A good community has people who can create some tension (in the appropriate ways!), because that’s what creativity and thinking “outside the box” are all about: respectfully challenging the status quo, for the sake of improving the company, product, etc.

Taken to the extreme, a company trapped in pattern-matching mode might subconsciously only hire people who fit their background and demographics. Aside from being potentially illegal (discrimination, etc.), this is actually very bad for your company and product. A healthy company needs a variety of perspectives represented in product decisions and day-to-day operations.

So, what is it you really are looking to evaluate? You’re looking for someone who is excited to be a member of your workplace community, to build your product, and isn’t afraid to challenge your assumptions when necessary, but knows how to do so respectfully and appropriately.

Finding people who are excited to work with you is best done by letting them self-identify. Give them opportunities to express their interest, and they will make themselves known.

As for the last part (finding who knows how to respectfully disagree), pose tough questions in interviews. You don’t want to try to set up “mind tricks” (this usually backfires), but do give them a chance to play tug-of-war with you.

In short, don’t expect people to fit your existing company culture. Instead, ask yourself what that person brings to your company’s community, and then ask yourself if that is a valuable addition.

Allies

April 1, 2014UncategorizedAditya Mukerjee

In college, I served on the board of a student group that advocated sensible drug policy. During this time, our school’s chapter was named one of the top ten most successful chapters in the country. This honor was in part because we succeeded in passing a “Good Samaritan” policy to encourage students to seek medical attention for drug overdoses. It was also because we managed to unite a number of otherwise disparate groups – we co-hosted various events with the College Republicans, the College Democrats, the Arab students organization, Hillel (the center for Jewish student life), and more.

When we organized events with the College Republicans, they did not refuse to collaborate with us simply because one of our board members supported raising taxes to fund a single-payer healthcare system. When we organized events with the College Democrats, they did not refuse to collaborate with us simply because a different board member supported defunding Medicare and extending the Bush tax cuts.

Those issues were core to what these organizations believed in, and they were actively lobbying for both issues at the same time as they sponsored initiatives with us. However, they were able to recognize what was relevant to our collaboration and what wasn’t, and recognize the difference between the personal views of our members and our stance as an organization.

Effecting social change involves building a coalition, and a coalition is by nature diverse. While I would love for the leader of every company to agree that I deserve the right to marry, I also understand that one’s allies in one movement may not be allies in every other. Disagreement about other issues is not the sign of a bad coalition; it’s the sign of a broad one.

Bypassing a DNS man-in-the-middle attack against Google Drive

January 12, 2014UncategorizedAditya Mukerjee

Boston to New York City is a frequently traveled route, so a number of different bus lines provide service between the cities. Most offer free WiFi as an amenity.

However, not all WiFi is created equal. Today I was traveling by the Go Bus, and I assumed I’d be able to do some work on the bus.

I needed to access a document on Google Drive. However, when I tried to open Drive, I was greeted with this sight.

I use OpenDNS instead of relying on my ISP’s DNS servers, and I figured that there was some error on OpenDNS’s end. So, I changed my /etc/resolv.conf to use the Google DNS servers, figuring that that would work.

No luck.

At this point, I realized that the bus network must be hijacking traffic on port 53, which was easy to test.

dig gave me the following output:

Visiting 67.215.65.130 directly gives the following page.

Saucon TDS uses OpenDNS for DNS lookups, but they redirect undesired lookups to their block page. I confirmed this by asking my neighbor across the aisle to visit drive.google.com – he happened to be using Safari, which gave him a 404-esque page instead of the big red error message that Chrome gave, but that was enough for me to confirm that the bus was, indeed, hijacking traffic on port 53.

But how to fix it? The correct IP address for drive.google.com is actually 74.125.228.1 (ironically, I looked this up using OpenDNS: http://cachecheck.opendns.com/). However, entering that IP address into your browser will give you the Google homepage, because unlike most sites, their servers check the hostname (the same is true for all Google subdomains).

The fix is actually rather simple – add 74.125.228.1 to /etc/hosts. This will skip the DNS lookup altogether, but the browser will still think that you’re going to drive.google.com “normally” (in a way, you are).
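
For anyone who hasn’t edited /etc/hosts before: each line is an IP address followed by one or more hostnames, so the entry here pairs the address above with drive.google.com. The Python sketch below shows that entry and a simplified model of how such a file is consulted (first matching line wins, comments ignored). The lookup helper and the sample file contents are illustrative assumptions, not how any real resolver is implemented.

```python
from typing import Optional

# The line to append to /etc/hosts (address and hostname are from this post).
HOSTS_ENTRY = "74.125.228.1 drive.google.com"


def lookup(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the first address mapped to `hostname`, or None if absent."""
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if hostname in names:
            return address
    return None


# A hypothetical hosts file after the fix has been applied.
hosts_text = "\n".join([
    "127.0.0.1 localhost",
    "# added on the bus",
    HOSTS_ENTRY,
])
```

With the entry in place, lookup(hosts_text, "drive.google.com") finds the address locally, so no query for that name ever needs to reach port 53.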

I write this post to illustrate how easy it is to get around this kind of traffic shaping, for anybody else who has the misfortune of running into this problem.

On principle, supporters of net neutrality oppose traffic blocks based on content (instead of volume). However, Go Bus and Saucon TDS are not simply blocking traffic – they are hijacking it. My DNS queries are made to a third party, and yet they decide to redirect them to their own DNS servers anyway. From a user perspective, this is incredibly rude. From a security perspective, it’s downright malicious. I let them know over Twitter, though I haven’t received a response yet.

Other than using a VPN (which would have required advance preparation on my part), is there a long-term solution to authenticating DNS queries? Some people advocate DNSSEC. On the other hand, Tom Ptacek (tptacek), whom I tend to trust on other security matters, strongly opposes it and recommends DNSCurve instead.

In the meantime, let’s hope that providers treat customers with respect, and stop this malicious behavior.

Flipping a Coin Over the Phone

December 16, 2013UncategorizedAditya Mukerjee

Last week, a friend and I had to arrange an in-person meeting after work, by email. He’s based on the Upper East Side, and I’m in Chelsea. Neither one of us wanted to make the trek to the other’s office, and there was no logical place “in between” where we’d have a quiet space.

The obvious solution would be to flip a coin, which he suggested. But how do we know that the other is telling the truth?

The procedure for having Alice and Bob flip a coin over the phone is actually fairly simple. (Conveniently, my friend’s name begins with a ‘B’, so I’ll make him Bob, and I’ll be Alice).

First, Alice flips a coin, but keeps the result of the coin flip secret. Let’s say that ‘H’ is 1, and ‘T’ is 0.

Then, Alice finds a book – (almost) any book will do, as long as Bob has a copy of the book as well. She picks an arbitrary page in the book. If the coin flip was H (1) she should pick an odd page; otherwise, she should pick an even page. She notes both the page number and the first two words on that page.

Then, Alice emails Bob the first two words, and asks him to guess whether the page is even or odd. After Bob reveals his guess, Alice reveals the page number. Since Bob has a copy of the same book, he can verify that Alice is telling the truth about the parity of the page number (i.e., whether the number is odd or even).

This protocol works because it is easy to find the first two words on a page, but it is hard to find a page that begins with a given pair of words. This serves the purpose of a one-way-function (a function that is hard to invert). By telling Bob the first two words, Alice is telling Bob a signature, and promising that she knows a value that produces that signature. Because Bob has a copy of the book, he is able to verify this signature.

A few interesting things to note about this technique, which is known as a commitment scheme:

If Alice only chooses a single, common word (like ‘the’), it would be easy for her to find both an odd page and an even page that start with that word. This would let Alice change the outcome of the coin flip after Bob makes his guess.
If Alice chooses too many words (such as an entire sentence), she runs the risk of providing enough context for Bob to figure out where to find the sentence (particularly if he has read the book and knows the plot).
The ideal book is one that both Alice and Bob possess, but which neither one has read, for the above reason.
The book should be a work of fiction, as nonfictional books tend to have an index that provides a mapping of words -> page numbers.
This technique assumes that Bob does not have access to a digital version of the book that is easily searchable.

There are certainly a few ways in which this procedure could be cheated – either to guarantee a certain outcome, or to tip the results in one’s own favor. But in cryptography, we sometimes make certain concessions (such as assuming an “honest but curious” adversary, as opposed to a truly malicious one). In this case we assume that both Alice and Bob are “honest, but temptable” – i.e., Alice or Bob might be tempted to lie about a coin flip, but neither will go to the trouble of manually finding phrases that appear on both even and odd pages in the same book.
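
The same commit-then-reveal shape can be sketched in Python by swapping the book for a hash function. This is not the book protocol itself: here SHA-256 stands in as the one-way function, a random nonce plays the role of the page number, and the digest plays the role of the first two words. The function names commit and verify are my own, and the sketch assumes SHA-256 is hard to invert and that nobody can find two nonces that produce the same digest for different bits.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def commit(bit: int) -> Tuple[str, bytes]:
    """Alice: commit to a coin flip (0 or 1) without revealing it."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    # The random nonce keeps Bob from simply hashing 0 and 1 himself.
    nonce = secrets.token_bytes(16)
    digest = hashlib.sha256(nonce + bytes([bit])).hexdigest()
    return digest, nonce


def verify(commitment: str, nonce: bytes, bit: int) -> bool:
    """Bob: check that the revealed nonce and bit match the commitment."""
    expected = hashlib.sha256(nonce + bytes([bit])).hexdigest()
    return hmac.compare_digest(expected, commitment)


# 1. Alice flips a coin and sends Bob only the commitment.
alice_bit = secrets.randbelow(2)
commitment, nonce = commit(alice_bit)

# 2. Bob, seeing only the commitment, makes his guess.
bob_guess = secrets.randbelow(2)

# 3. Alice reveals the nonce and the bit; Bob verifies before accepting.
alice_was_honest = verify(commitment, nonce, alice_bit)
bob_wins = alice_was_honest and bob_guess == alice_bit
```

If Alice tried to claim the other outcome after hearing Bob’s guess, verify would fail, which is the digital equivalent of Bob opening the book and finding different words on the page.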

Image provided by Филип Романски under the Creative Commons Attribution-Share Alike 3.0 Unported license (via the Wikimedia Commons)

Dorm Room Fund

November 20, 2013UncategorizedAditya Mukerjee

I am very excited to announce that I’ve just joined the Dorm Room Fund team in New York City!

For those who aren’t familiar with the fund, Dorm Room Fund is a venture firm run by students and for students. The fund invests exclusively in student-run companies, providing seed financing on very founder-friendly terms. The goal is to serve as a springboard for driven, entrepreneurial students, providing support both financially and in other ways.

I have always had a special interest in students and young entrepreneurs, which is why I have been a mentor for groups such as hackNY and the Thiel Fellowship. I’ve found students are some of the most exciting entrepreneurs to work with – they bring fresh eyes to problems both new and old, and inspiring levels of energy and determination.

I’m looking forward to working with the rest of the team this year, as well as meeting all sorts of students working on a variety of enterprises.

/var/blog

blogging data from aditya mukerjee

