Shawn Burke - building the Catalyst serverless platform at Uber

Nov 3, 2017

Emit is the conference on event-driven, serverless architectures.

Shawn Burke knows something about scale—he works on scalability infrastructure at Uber. The number of total rides they've served has roughly quintupled in a mere ~2 years.

He arrived at Uber to discover a flurry of snowflakes: aka, the visual multiplication that results from having 1000+ microservices and 4 languages in regular use. Serverless was a great way for them to deal with this; they could decouple consumption from production across a variety of domains. That simplicity was a big deal.

They ended up building a very dev-centric serverless platform called Catalyst, which you can learn all about in Shawn's talk below (or by reading the transcript!).

More videos:

The entire playlist of talks is available on our YouTube channel here: Emit Conf 2017

To stay in the loop about Emit Conf, follow us at @emitconf and/or sign up for the newsletter.


Shawn: Hi, folks. So, everybody here today has been thinking about great ways to use serverless architectures. You're probably been thinking to yourself, "Gosh, I would sure like to build one of those systems." And so, that's what we decided to do at Uber, and today I'm gonna take you through that process of how that happened, why we're doing it, and we're gonna kind of get into the muck about how we built a little bit, some of the concerns that you have to think about if you build a system like this, for scale.

So, why serverless at a company like Uber? It's really about simplicity. So, when I arrived at, my background is, I worked in developer tools at Microsoft for like 15 years, I had a startup adventure and then ended up at Uber about two and a half years ago. And I arrived and what I was most surprised by, was the amount of complexity that was found in the ecosystem there. So, last few years, the world has been, like, "Microservices man, we just got to do microservices, this is gonna be amazing." When you have a company that's growing as quickly as Uber has Uber, when I arrived was growing very quickly on three axes. One axis was the number of developers that we had at the company. I was engineer like, 900, a week before I showed up they had like 150 engineers. So, number of people on the platform was scaling and then the amount of load on the platform was scaling, like Uber trips. It's insane, we did our billionth trip in like, late 2015 after I'd been there a couple of months. Probably you saw on the news, we've done 5 billion as of, I don't know, a couple months ago. And, you know, if you calculate the slope of that, it's insane. So, for all of the good times in "The New York Times" and "Hacker News for Uber" the business is still growing really quickly.

And then the number of businesses that we're getting into has also been scaling quite rapidly. So there was trips, and then there was pool, and then there was UberEATS, and then there's freight, and then there's just a lot of other businesses. So this puts a lot of pressure on infrastructure, and just to be honest, the infrastructure there wasn't quite ready for that because they had made some choices about growing really quickly. And those choices that allowed you to grow really quickly early in your career as a software company become very painful later in your career. Specifically, when I arrived there was well over 1000 micro-services, there was 4 languages in regular use, and then you take that matrix and you multiply it out, what happens is you have a bunch of snowflakes. You have a bunch of services that are built a little bit differently because you can't write good frameworks for four languages, you can't write good client libraries for four languages. So one team's client library, that's a Go team, so they have an amazing library in Go and then the Python library, sort of, sucks. And then another team they're a Python team, so the inverse happens. When you want to add a new feature to your RPC framework, which one do you add in first, how do you roll it out? There's just complexity up and down the stack.

So, serverless was a great way to deal with this because it allowed us to decouple consumption from production across a variety of domains, whether it's HTTP or RPC, we use mostly Thrift at Uber. Whether it was consuming Kafka, we do, like, a trillion Kafka messages a day at Uber, very big Kafka shop. So, that simplicity was a big deal, that ability to put an abstract platform between, between our customers, or our users, our developers, and our infrastructure.

And then, if you put that abstraction in place, you get some really great benefits out of it. Not only do you make it better for your new developers coming in because your onboarding got simpler and your docs got better, but it allows you to do some really cool things on the backend as well. Because, now instead of developers linking to all the libraries directly, they're linking through this sort of abstract, "Hey I just want this to happen," sort of, an interface. Then you can do some cool stuff that I'll talk about here in Catalyst as a way to go.

And so the first question that people ask, it's a good question is, "Why aren't you guys just using AWS Lambda?" And so there's a bunch of reasons for that, the major reason is that, we are not in an AWS shop in production. We run our own data centers and are moving towards a multi-DC strategy. We have to because of the number of places that our business operates in. In terms of around the world different, localities things like this. But also, at our scale, we need better introspection to the system than we get from sitting on top of another system. In terms of, what are the types of messages transports that we support, how fast is it, what are the SLAs guarantees for how fast messages are routed, can we get at the messages, can we inspect the packets? Lots of stuff like that. So, naturally, as an engineer, I was, like, "Sweet, let's build one, that sounds like a good thing to do." This was when the Seattle office was new, I had just joined the Seattle office, I was the fourth hire there. So, I got the charter and we built a team and we built it.

Yeah, and then one of the goals that we had here was to be able to really add different sorts of event sources into the system. So, what were the goals for Catalyst? I wanted this to be a very, very developer focused product. So, the tagline is essentially, "Write your code, tell the system when to run it, and you're done." And so we tried to keep that promise throughout the entire system. For example, Catalyst runs on your local desktop just fine, just like it runs in production. One of the goals I have for things like this, is I want to be able to write code on an airplane. That's my goal. If I can't write code on an airplane, something is wrong. So we built a system that runs happily on a local machine, and we came up with an architecture that is actually exactly the same locally, as it is in production. So it's really nice and I wish I had time to give a demo here today, maybe we can do it at Auckland later.

Overall, we wanted to dramatically simplify the process of developing business logic, our developers had a tremendous amount of other play code in their services. As these thousands of services were proliferating, each of them had this giant stack of like setting up its RPC server and getting configuration running and getting logging set up. Get the stuff out of there nobody cares. And then pluggability, we had four languages we pushed hard to get down to two, we've, sort of, focused on Go, or...Google tells us we're the biggest Go shop in the world, I'm sure they tell everybody that, but we think we might be the biggest Go shop in the world. So, we really focused on Go and Java for some of our,sort of finance and streaming cases. And that multi-language support was important to limit it to the fact that we didn't have to write every single library for every single language, and when we get to the architecture in a second, I'll show you how we solve that problem. And then it has to be minimal compute, minimal infrastructure tie-in, so we have a specific Uber compute, we run on Mase us Aurora, but we want to be able to run on AWS or GCP or these other things. It has to be really fast. So, the goal we had is a five-millisecond P99 tax for running a catalyst versus handling your own thing. And so, those are the goals that we had setting out, in about a year ago, is when we started building it.

And so, we wanted to end an experience, we want to deliver a product so that when developers came to Uber and did their job, they felt like they were interacting with a product and not a sort of a bunch of different technologies. There's a easy fallacy that a lot of organizations fall into, which is, well, the metric system is not that hard, you go and you look at the docs and figure out the metric systems work.Okay, now the auto-scaling system's not that hard, you go and look at the auto-scaling system and see how that works. Okay. So, logging system's not that hard, on and on and on and on and on, you get this proliferation of not that hard technologies. But you end up falling back to tribal knowledge, because until you've learned how they work, it's too hard to discover them, no we're not consistent. So, we really wanted to fight that battle, so with Catalyst, batteries are included, we make sure that developers logging, metrics, config., local run, running into bugger, all these things just work out of the box. No code needed. We wanna have, you know, important breaking glass so that you can get to the underlying systems if you need to. But for the standard developer, it just works. So, that's the product level experience that we wanted to deliver.

And then the other thing that we wanted to deliver, and we'll talk about some of the concerns here, is essentially, we wanted to insulate our developers from a large set of the distributed systems concerns that they face building stuff at scale. So, a lot of cases here are difficult. I mentioned we're a super heavy Kafka shop. What happens when one data center fails over? How do you deal with the fact that now you're running instances in another data center, and they're on different offsets because the Kafka's aren't actually connected, they're asynchronous replication? How do you deal with that? It's not something we want product level, full stack engineers having to reroll every time. Just generally, capacity, so you have three data centers, how do you make sure that you have the right capacity in each data center so that if usage is growing in the main data center and it's New Years Eve, and something bad happens and all that traffic flips over to the other one, that you don't end up having a cascading failure?

These are all really difficult concerns, both from a, "How do you manage your capacity?" Point of view, but also from a correctness point of view. How do you replicate this over, how do you make sure that if you start writing your second data center, and the other one comes back up, the systems come back to their direction? So, we're really trying to insulate developers from that with catalyst and make sure that it just works. And then the final thing that I really wanted to deliver was this just really enjoyable loop that we all get into. This, like, write code, test the code, debug what's wrong, deploy it, you know, rinse and repeat loops. That's what you're really focused on, you're not, like, trying to figure out how things work or frustrated that you just want to... You know, all of us, when we build something, there's this kind of motion we go through of getting everything set up. And you just really want to write your first line of code and hit that breakpoint and then start really doing your thing. And how fast can we get people to that? And with Catalyst, we've gotten people there in about 90 seconds from nothing, so works pretty well.

So let's see how Catalyst works, so I'll show you some Go code first since we're mostly Go and the interesting thing about doing Go first, as our first language, is that Go is really hard for this sort of stuff. Go does not have any, sort of a pluggability model, there's no library model, at runtime, everything has to be compiled together. The dependency management is still young, very difficult, and there's no metadata, there's no attributes or things that you would use to, sort of, add functionality of code. So, we went about a bunch of different models for this and I'll show you the one we settle on here. So with Catalyst, you write a handler, hello world handler, just a plain old function, not a big idea. This is the code that you wanna run your business logic and then you gotta tell Catalyst when to run it. So what we have is, essentially, we'll call this registration handler for you, we parse your code at, we have it built to a parse of the code and generates code for you. So what it's gonna do is your handler run, you declare your age to be handler in this case, and then down here, you tell Catalyst, Catalyst register, tell it the handler name and the route and the method, get post, put whatever. That's all you need to tell Catalyst, we'll figure out the rest.

Likewise, Kafka, exactly the same experience. Write my handler, tell Catalyst when to invoke it, all Catalyst really needs to know is the topic name, it can figure everything else out and then, that's it. Because we parse code, we can make it do other smart things. So, if you notice, that message there is a byte array, it's just a raw message, that's not very fun. Maybe your topic has got strings in it, so you just change that to a string Catalyst will figure out what, sort of, marshaling you need to do, to do that based on the code that you wrote. But usually things aren't strings or raw byte arrays, as we saw in the last talk, usually, they're Avro or some other encoded binary format so maybe it's JSON. So you just tell Catalyst that it's JSON and we figure it out. So, we're able to do that deserialization for you automatically, we're able to send error, smart errors if it doesn't work, and just free developers from a lot of that sort of stuff. Batteries included, so there's this thing called a context up there in the message. Hanging off that context is some accessor functions to logging, to configuration, to metrics, so that when you decide you need to log something or look at configuration, we've already worked out for you. So, the metrics interface you get is already tagged with your project name, and your service name, and the hostname and all that stuff that you would have, otherwise have to manually code up. The logger already has the same stuff on it and we have configurations pre-baked. So, you can handler level configuration or project level configuration but your code only sees the configuration that's valid for your scope.

So, we also support Java, this is how that sort of same functionality looks in Java. And I'll let you stare at this for a second, it's way nicer in Java, isn't it? Because there's attributes, and just metadata, you just say, "Hey, it's a handler, it's Kafka, there's the topic name, off you go, so. Architecturally, we can do this really simply, building a new language support in the Catalyst is a person weeks effort, it's not very difficult. And how do we accomplish that? So, first, before I get into the nitty-gritty details, some glossaries. So, we're gonna talk about a few different things, the first thing is a source. A source is Catalyst abstraction around how to get an event from the outside world into Catalyst. So, there's a Kafka source, runs a Kafka ingester. There's an HTTP source that runs an HTTP server, there's a thrift source that runs a thrift server, it goes on. We have an STK, you can build these, so these are extensible. Anything that you can sort of model as an external event, very easy to build a source on for Catalyst and we've got, sort of, a growing ecosystem of these at Uber right now.

A worker, when I talk about a worker, it's the binary that hosts user code, it's the handlers we wrote, end up in a worker. And then in the worker is handlers and then a group is a, just a grouping of handlers. Developers can group them however they want, but that's the inter-deployment, so when you roll out a group those are the things that version together. So, if you build a big project, you probably have a group of your Kafka handlers and a group of your RPC front ends and those are the things that you would version.

So, how does it work? So we, sort of, double-click into the infrastructure here and we'll get into the details of how this works. So, at runtime we have three main components, I already told you about sources and workers. We have a component called the nanny, which is a very thin layer that manages these other processes, that manages process lifecycle and monitors metrics, and it receives what we call goal states and we'll talk about goal states in the second. But this is essentially the same as our local...this is a container in production and this is local, what we run locally as well. So, this is our, sort of, unit of deployment that we can run locally. You can attach a debugger to these, and the sources themselves and the worker have STKs that we've developed that you build on top of. So, you get things for free, like metric, so no matter what source you have, we emit like traffic metrics, how many requests came in, how many were successful, how many had errors? You get all that stuff for free. More importantly, I mentioned this leverage point, we can do capture, replay, generically. So, with any source, you can run a command on RCLI and say, "capture traffic," and it'll capture traffic, bring it back and you can run it locally, we're working out the PI concerns for that, but from a technology point of view we can do that generically across Kafka and HTTP and everything else. And so, we've got a bunch of new stuff coming that we think is gonna be really exciting there, but that's that's basically how that works.

And then the important thing here is that Catalyst is a shared nothing architecture, meaning at runtime, out in production, the Catalyst Control plane can go down, your handlers just keep running. There's no intermediate layer in between, so all that stuff is local, super durable system design for availability. So, let's walk through what it takes to run Catalyst on your user desktop. So, you run this command called Catalyst, we've got a CLI that has all of our functionality in it, you scaffold out your project, you add some code, your run Catalyst Run. When we run Catalyst Run, we do two things, we build what's called a manifest, which is metadata file that has, like, oh, it's Shawn's project, it's got a Kafka handler and it's got an HTTP handler, and it's got some other stuff and then we build the actual binary, we don't care what language is after this point. Once it's built, we don't care, it's just it's just a package. And then, so this gives you an idea of what's in the manifest, it's just a Jason file, it's got, like, oh, I'm an HTTP source, this is the method, here's the path, same thing with Kafka. So, if you think about it, architecturally, what happens is this manifest goes up somewhere and then we can use this manifest to figure out which of those sources to start up and how to connect them all together, so this is how we glue the whole thing back together at runtime.

So then we send that manifest to the nanny, it downloads the HTTP source and the Kafka source, it downloads the worker binary, it makes sure they're all healthy, it connects them together, and then in comes traffic. That's pretty much it. So at the, at a user desktop, what's cool about this is that we actually have a process we can attach to with a debugger now. So that worker process, we have a mode where we'll launch that thing for you and wait, and so we can actually, sorry. We have configuration that allows you to bugger to launch that, the nanny knows to wait. So this means, now you have debugability and you can actually run them, you don't have to rely 100% on unit tests and other sorts of mocking, you can actually run it. We have a capture, replay for Kafka, so you can run a local Kafka, capture production traffic replay it locally as many times you want and do it in this scenario, so you have a debugger and you can run tests, it just really tightens that loop that I mentioned. So then at runtime, at production time, that same shape is what we run a container in, so it's exactly the same as I mentioned the same infrastructure that runs out in production. So it works great. Super reliable, it's easy to unit test, it's easy to run locally, it's easy to run in CI, it works really, really well.

So let's talk about what happens when we get a request. So, the source, if you're gonna sit down and write your own source, say you're writing a Kafka source or, say you're wanting to write an AWS SQS source, which you could totally do. You would take our STK, you would write a piece that would go and pull SQS based on configuration which we've been in that manifest because then it would know which thing to pull. And so it gets that manifest, it knows which things to pull or which...let's see, SCS ques to pull. And then it's super, most of the functionality is in a backend STK that we provide. So you just call into that, so SQS message, we get one from your poller, you send it to our backend, we figure out which handler idea it is, which we can just look up in that manifest. We know which things are in there, we know that we got a message off of this queue, we send it down into the backend, along with the raw payload. So, we do not deserialize in here. This is for performance, it's just raw bytes, we figure out the headers and the metadata but we don't actually touch the payload. Then send that thing, that message that we created, over to the worker, it's over a UDS socket, Unix domain socket over GRPC. It's like microseconds of overhead, it's superfast, hits the worker, this is the part that you have to write for every language in the worker, it's a dispatcher to take the thing and call the user's function. DC realizes that payload, calls the users function. So, whether that happens over reflection or some other code generation doesn't really matter, you just code fires and then this whole process works backwards if you want to return the result. That's basically how it works.

So, the key thing here is that these are separate processes. So I mentioned that level of abstraction, it means that this worker process is not bound to a Kafka seven library, or a Kafka eight library or Kafka nine library, it's just bound to a thing that says I care about Kafka with some topic name. And maybe a thing has a hint about what version of Kafka it is, but it's different in a library, it's really abstracted. Same thing with lots of other sources, so for example, right now we're on thrift, we're moving towards GRPC and proto at Uber. Same thing, we can decouple that process and not have to replicate that in every single handler or even necessarily have handlers have to rebuild to upgrade infrastructure. And then with, these things generate a ton of metrics and logging for free.

So, this is basically how it works in production now, so that was the basic flow for a worker and, sort of local to your sources in your workers and stuff like that. But what happens when you want to play this to production? So, that same CLI allows you to upload and deploy that code that you built to production when you do that, we have a component called the registry that stores those manifest I showed you, then that registry sends the actual binaries up to S3, is what we use for storage. There's a component called controller that's listening to registry, did anything change? Is there something interesting that's happened here? Once Registry says, "Yes, Shawn has a new project, he would like two instances." Controller goes to a component called Mission Control and says, "Please give me two instances." So, we maintain a warming pool of instances that we've already managed if we need more, we just grow it, if we need less, we just shrink it. So, we're ready to go, mission control gives us back two instances. That's all strongly consistent, so there's no weird stuff that happens there. And then mission control picks one of those instances, walks up to its nanny and says, "Please do this." The nanny then downloads us bits from S3, downloads the worker, downloads the sources, lights up, off it goes. So, that's how it works in production, really similar to how it works on your local desktop and it scales out beautifully because there's no shared components here. So, people ask us, "What's the max QPS that Catalyst can handle?" And we actually don't really care, doesn't matter.

So Catalyst Today, it's written in Go. We're onboarding customers, so it's a real thing, this isn't isn't dream-ware. We're using Persistent Containers, so we're not quite the same as the other serverless platforms today that we don't go away after each method invocation. That's on our roadmap, but right now, Uber has got servers sitting in a giant room somewhere doing something, so having a container sitting on then idle is not a big deal. And our tax right now is P99 in about two milliseconds. P50 is like 0.05 milliseconds, so it's really small, most of that is, like, Go GC and other stuff happening on the box. We're hardening in production, we're load testing, we're testing negative scenarios, I'm obsessive about testing negative scenarios, just, like when things fail do they fail correctly? When your worker panics, or times out, or who knows what it does that you get, like, really good information and tooling back.

And then where we're headed, we don't have auto-scaling today, we're working with the compute team to get auto-scaling. And we're working on some really cool stuff, long tail handlers, which means, how do we share a single source across a lot of workers for things that only get called once a day or twice a day, so we can sort of share that capacity. SLA-base priority, that says basically, when we're routing, can we hold certain messages to save compute infrastructure if they're low SLA? So, because those messages come into the source, they get converted into a Catalyst format, we could dump them somewhere else and then reingest them later and it would work fine, as long as they're not request response. And traffic-based placement, this is something that's really cool. So, there's a couple of core flows at Uber that are like all the time. It's like all the trip flows, essentially. Every time you, you know, request a trip or take a ride, those are core trip flows and we have tracing at Uber that tells us all the micro-services that are involved in that. And so we can compute the graph of things that run in the hot path, and then we wanna be able to tell the compute system to co-locate everything within the same switches so that there's no cross traffic. So, that's something that we're headed towards doing, and we do intend to open source this platform. We made an intentional decision to take the hit of not open sourcing it up front, so we could battle hardened it, get it super flexible and stable before we unleash it on the world. But that's where we're headed, and so that's pretty much it. Got two minutes for questions. Yeah?

Man: What are your message guarantee semantics from this source to a worker?

Shawn: So, the question is, what are the message guarantee semantics from this source to a worker? So, we fully expect the worker crash. So, what happens is that that GRPC connection, that source STK, has got a bunch of telemetry around that and so, what happens basically, is if the worker crashes, so if the worker returns an error, that's actually, from our point of that's fine, that it returned an error because an error is a successful indication. We mark that it was an error, but we don't do anything weird. If the worker was to crash, nanny is gonna restart it and reconnect it to the source. If the worker, we use governors, so if the worker or the source crashed a certain amount of time within a certain period of time, we'll actually mark the instance as bad, unroll it, roll out a new one. If enough instances do the same thing, at the next level we'll mark the whole deployment as bad try to roll back to the last version. So, does that's your question?

Man: I got the response that, the worker might [inaudible 00:24:01] at least once.

Shawn: Yes, it's definitely at least once. That's right, that's right. Yeah?

Woman: What customers are already using it?

Shawn: Question was, what customers are already using it? Some customers at uber that have, let's see, what are some of the use cases? One of the use cases that we're using right now is that there was a tax issue in New York that if we didn't have the correct setting on one of the vehicles, that we got charged, like, $1 million a day, or that's not that much. Like, it was a fairly reasonable fine, and we relied on ops people when something changed in a drivers profile, to change a thing manually and so, that was work that was too small for a team to actually pick up. It was this weird piece of work, but two guys did it over beers one night and so now, when that change happens they automatically update it and that doesn't happen. Saves us, like, half a million dollars a year. And there's a bunch of scenarios like that today. Anything else?

All right, thanks for paying attention.

Subscribe to our newsletter to get the latest product updates, tips, and best practices!

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.