Rob Gruhl - towards a serverless event-sourced Nordstrom

Oct 25, 2017

Emit is the conference on event-driven, serverless architectures.

Rob Gruhl kicked off Emit Conference with a peek behind the scenes at Nordstrom's architecture.

In his talk, Rob discussed Nordstrom's use of the event-sourced architecture with what it is and why it's a good fit for the Serverless paradigm. He also notes some difficulties with the event-sourcing pattern, the ways Nordstrom has worked around it, and ends with a few #awswishlist items that would make it easier to develop event-sourced applications with Serverless.

Nordstrom were an early adopter of the Serverless Framework for services from personalization to frequently-viewed items. They've since contributed some excellent resources back to the community, including the "Hello, Retail" project and a Serverless Artillery project for performance testing.

Watch the video or read the transcript below for a complete run-down.

More videos:

The entire playlist of talks is available on our YouTube channel here: Emit Conf 2017

To stay in the loop about Emit Conf, follow us at @emitconf and/or sign up for the newsletter.


Rob: Good morning, my name is Rob Gruhl and I support a small team of engineers at Nordstrom called "The Technology Acceleration Group." For the last three years, we've been exclusively focused on serverless patterns and practices both in production, proof of concepts through workshops. In the last year or so, we've become exceptionally fascinated with event sourced architectures.

So I'm gonna talk a little bit today about what we're doing in that area. So this is an image of our team that we used for a recent internal serverless workshop. These are great engineers on the left, and what I like about this is we're waving goodbye to all those old server problems, the coal-powered steamship, but kind of behind us is these new problems coming in. And some of you might recognize the guy driving that boat right there.

And we like to just stay a little bit pragmatic about some of the challenges that we face as we take on some of this new architectures, and that's a little bit of what I'm gonna be talking about today. So let's take a quick trip back. So back in the holiday of 2014, I was working on the personalization team and we had a simple feature called "Recently Viewed" which basically rendered a shelf on the bottom of every web page, and every mobile page showing customers what they viewed.

You view 10 things and then you see those 10 things show up in the shelf. So a simple function, a simple feature, but critical to our business. It drove a huge amount of directly attributable revenue every day. The challenge with the existing system was that it used a combination of cookies and batch processing, with some traditional server architecture behind the scenes, and it could get delayed by 15 to 16 minutes. There were problems with cookie settings. There were problems with multiple sessions or in private browsing.

So we decided to do a proof of concept using this distributed ledger. It's also called the stream. It's also called the log. I don't think anyone fully decided how to describe these things yet. But whether it's Kinesis or Kafka, we produced 50 million customer click events onto the stream and then we used a serverless function to process that stream and populate a NoSQL database. So this was an incredibly successful experiment. We reduced the latent feed to less than two seconds.

The cost came down tremendously. Maybe less than $100 a day for this processing capability. It also allowed us to use that exact same stream to do some more interesting things like frequently viewed. Maybe brand category affinity for personalization. So after that, we did a bunch of request-response services, both in production and as proof of concept. And one thing that we found is anytime you have a lot of functions, you have a lot of serverless architecture, you need a deployment framework to kind of keep yourself organized.

So we found the deployment framework and we were able to make a number of contributions that made it work well at Nordstrom. So this ability to update the framework, the responsiveness of the team, and the quality of the community, is one of the reasons why we continue to use this framework and recommend it.

So fast forward a few years. At this point we have, I would say on a given day, we have hundreds of millions of serverless invocations. We have hundreds and hundreds of serverless functions live. We have dozens of teams using serverless functions. And across all of these serverless functions, we've found that when you invoke a serverless function, you're looking for an event.

So you start looking through your architecture for events, like, how can I pull events. I could steam to find a batch. I could trigger off when something landing in an object store. And we started thinking about how can we have a better way of organizing these events, and feeding them into these serverless functions. And it turned out that this was not a new pattern. This is something that's been kicking around for a while in various forms, and some people call it event sourced architecture. You can also call it a central unified log.

You'll see, again, a number of different ways of referring to that. But what it basically means is instead of this connected graph microservices situation where you have to go and talk to an individual microservice to ask them to add an attribute, ask them to scale to your load, you have something more like up at the top where you have individual microservices either producing events some of the events stream or consuming events from the events stream. So it really simplifies how you think about your architecture, and allows you to decouple those producers and consumers.

There's a number of other benefits from a distributed system standpoint. As I mentioned, we're not the first to think about this. There's been a lot of people coming before us. This is some of our favorite reading, that we like, around event sourced architecture. If you're kind of newer to things or if you haven't read this one, this is our very favorite. We ask every candidate, that's interviewing with our team, to read through this e-book and take Eric to coffee and talk about it.

It covers a bit of the history of streaming and also the future potential of streaming, so it gives a lot of context. If you're more familiar with events sourced architecture, this unfinished e-book by Greg Young, we just ran across, and it is fantastic. It talks about a lot of the deeper and more complex situations that you'll run into in an event sourced architecture. So I highly recommend anything by these folks. They've done a tremendous job of writing that.

So let's jump in to "Hello, Retail." So we're hooked on this idea of event sourced architecture. What would the world look like at Nordstrom if we had one stream of product events, everything that's ever happened to a product from the moment it's conceived to the last item being sold on clearance. We have a stream of everything that customers have ever done, and we have a stream of everything our sales people have ever done.

Well, if we had that, it would really simplify a lot of what we do, and we'd be able to react very quickly and build new features. So we built a proof of concept called "Hello, Retail." So "Hello, Retail" it's an open-source, 100% serverless functional proof-of-concept showcasing an event-sourced approach as applied to the retail platform space. So I'll quickly kind of click you through what this might look like. If you wanna see it live, I have it on my phone. It's easy to walk you through.

So here I'm login in. This is a static web page with ReactJS. I have my top-level navigation. You can tell we're a back-end team and not a front-end team by our beautiful UIs. I can jump in as a customer and view a number of categories and products. So I go back out and I'm gonna register as a photographer. So here I put in my phone number that is produced as an event onto the events stream, and there's a microservice that picks that up. I go back into the menu. I create a new product. This product is CATs that were found in my Airbnb, and that event goes onto the events stream.

Now, when that event goes on the events stream, we have a category microservice that picks it up and populates the category list. A product microservice that populates the product details. But we also have this photographer microservice that says, "Do I have an available photographer? Go contact them using Twilio." That photographer takes a picture. Responds and say, "Thanks so much Rob Gruhl," and now that image becomes an event on the stream.

At this date and time a new image is available, and when I go back in and I look at my website, there I have my CAT. We also have a workshop that's associated with this, where you will extend the functionality of this, and that's also available through open-source. So here I've placed an order for the CATs. And, again, that order is just, again, an event produced onto the stream. So this is the basic architecture. The tube in the center is the ledger or the stream or the log, and it looks pretty complex.

We've got...I'll just walk you through a little bit, of what you're seeing here, with this pointer. Okay. So here's the...these are traditional request-response endpoints as are these. And so for the clients, most of the time they're just thinking about calling an API like you would do before, and really it's the behind the scenes architecture that's changed here. The thing that makes me say, "Don't panic here," is that most of that is just configuration. That's just like a serverless CML and the serverless framework.

There's only a small amount of code behind the scenes. So, as usual, our team spent a lot of time writing a small amount of code, which is kind of how we like it. And each one of those languages might be a 100 lines of code or less and maybe a lot of shared code across those. All right, so given that kind of background, I wanna talk a little bit about what we think we learned about event sourcing, and as it's kind of a new endeavor for us. Any of you that have feedback, I'm sure a lot of you have been thinking about this for longer than we have, we really love to hear from you.

Also, this goes up on YouTube. Comment below, I'd love to get your open and honest opinions. So we think that it's important to produce simple, high-quality events. So high-quality events are written in the past tense at this date and time this thing occurred. And it can either be an observation of something that happened in the world, like, at this date and time Rob started his talk, at this date and time he finished his talk, hopefully on time. But it can also be the record of a decision.

And this caused us to kinda twist our brains a little bit because when you say something like, "I'm producing an event onto the stream that says, 'At this date and time Rob added this item to his cart' it hasn't actually happened anywhere." There's no database that's been updated. So I started to think about this a little bit like a Royal Decree. Let it be known that at this date and time Rob has added his item to his cart, and it's up to all the microservices to scramble and go like, "Oh my God, I better, like, update a database or do something, right."

So it's really the decision itself that gets captured onto the stream before any of the actual work occurs. We think it's important to swim upstream. So when you're thinking about your event-sourced system, you say, "Okay, what is my source of events?" Maybe it's a batch of data that I have or an enterprise data warehouse. Well, those events came from somewhere, so can you follow those events upstream? Can you find the original person or idea or system that generated those events?

Because if you can, and you can generate the events at that point, it becomes more useful for all the systems down, you get fresher events with lower latency, and maybe at a higher quality. Now, when we do design reviews and we talk about this, often people say, "Well, I can't use that stream because there's something wrong with it or it doesn't have the attributes." I really think this is super critical, as you're making a transition into an event sourced environment, is go and fix the stream.

Find the team that runs that and help them out because oftentimes let's say you're doing a loyalty-reward system, what you really need is an omnichannel transaction stream, as does the personalization team, as does the purchase history team, as does the fraud detection team. So if you can all work together and find the source of this data and clean it up, everyone is going to benefit from that.

As we've looked at streams, I think we've found that there's two fundamental kinds of steams, and it's what I'm calling "Tech Debt Streams and Published Streams." So the "Tech Debt Stream" might be something like, you've got an Oracle database and you've got GoldenGate, and you're doing a change data capture off of that. And you're dropping it onto a stream, and it's just blah. All right, you've got all these table updates and it's really hard to understand. There's a lot of technical context.

Now, you have some stream processing layers, some stream transform layer that says, "Well, I'm gonna reverse engineer all those table changes and turn them into an event, like, this shirt now costs $10 less." That's the business event. That's the thing you wanna publish. Maybe on a developer portal you've got your request response APIs and there's Swagger. You've also got your streams and their schemes. So you want developers to be able to subscribe to or attach to those streams, and so that's the published stream.

And I think there's an interesting opportunity where you say, "I have this tech debt stream and it's being transformed into this published stream. Maybe when I refactor this system in the future, the goal of that system will just be to publish that original event onto the stream, and I can replace this legacy system. So something to think about there.

Ordering is a little bit controversial. You'll see a lot of systems that maintain ordering, and systems that don't maintain ordering. We're trying to take the high road and maintain ordering across all of our systems. It's fairly difficult, though, even once you've had those items ordered onto your Kinesis, your Kafka stream or whatever mechanism you're using. There's lots of systems and libraries, and tools and developers that want to process those events asynchronously.

So here we have a chess match, you know, with 51 different moves and someone says, "Great, I've pulled a batch of 51 moves. I'm gonna process them asynchronously. It's gonna be fantastic." So we're almost at a point now where you have to do work to back off of that asynchronous reflex and say, "No, we need to do this single-threaded synchronous processing of these events, and that can be very slow if each one of those events needs to be processed with an external system.

So we'll talk a little bit about an optimization that we've come up with for this a little bit later. Partition keys. Some teams say, "I don't wanna think about partition keys." It turns out we believe partition keys are really important because in a distributed system, in your partitions, it's only the partition key that allows you to guarantee ordering. So if you select customer ID, you're going to get that customer's events in order. If you select skew, you're gonna get skew events in order.

So you really have to think about it with your architecture, what is the partition key and how are you going to use it. And if you don't think you need ordering guarantees or partition keys, I would ask, "Will any system in your entire system behave differently if those events were to be reordered? And there are some subtle ways that those issues may come up.

So really think about and make a deliberate decision about whether you're going to maintain ordering across your system and what your partition key is going to be. All right, so it is interesting that we've got...I didn't know there's going to be...this is magnificent, Rube Goldberg machine over here. But one question I would have is if this is Austen pushing the first domino here as the first event, could we trigger a serverless function off of that that kind of went all the way over here and did this?

You know, can you bypass some of that intermediary system? So just like I mentioned, swimming upstream to find the source of your events, can you also look downstream and look at what the effect is because maybe you can hop over a few steps there. You're not just transforming data and handing it to the next system, you're asking the question, "What does that system do?" And if I have a dedicated computer intensive container working for me, can I just do that?

So it's not always going to be the case, but something to think about. I mentioned this is a list of things we think we know. I think we're pretty sure about this one. Distributed systems are hard and eventual consistency is weird. So it's a very different intuitive mindset. For example, you wanna do a simple acceptance test. You wanna add an item to your cart and then you wanna read your cart and you say, "I added, you know, this address and I went and I read the address is not there. Is the system broken or is it just eventual consistency? Or it looks different from the last time I saw it."

There's all kinds of these really interesting and challenging problems that come up. We absolutely love this book. If you don't have this book, you should get it. It's Martin Clinton's "Magnum Opus [SP]." It's like 500 or 600 pages long. It's kind of a 201 explanation of everything with tons and tons of footnotes and pointers. And the thing I love about this book is I read through a chapter and the whole team is reading through this book. I read through a chapter and there's a bunch of stuff and like, "Oh, I can use that. I really understand that."

There's a bunch of stuff I'm like, "I have no idea what he's talking about," but I feel dumber and that's an important thing, right, to understand what I don't understand, and there's always the opportunity to follow those footnotes and read those links. And there's some parts where he's like, "This is a complex topic. If you wanna learn more, read the textbook," footnote. So it's like RTF textbook.

Okay. So that's kind of what we think we know. Here are some of the areas that we're focused on and we're trying to figure out. So "Joints and Aggregates across Partition Keys." So I've got two different partition keys, like, maybe riders and taxis, and I've got a wonderful set of ordered events across those. But between those two streams, I can't make any causal guarantees. So if I replay into this system as two streams, and then I go and replay into this system as two streams, I'm gonna get a different ordering.

So how do I deal with that? Two approaches that we're taking. One, is you aggregate those two streams into a canonical joined stream, and by canonical I mean now everyone who wants that combination has to use that stream, and everyone is gonna agree on what that ordering is. The second opportunity, which I think could be an entire talk in itself and is super exciting, is CRDTs, these convergent data types.

I mentioned, at the beginning, that our technology acceleration team has been focused on serverless. We thought that was a big deal. Event sourcing we think that's a big deal. I'm guessing this is going to be the next big deal because it allows you to do things like occasionally connected database, and you can do joints across multiple streams. And I think it just works really well with partitioned event sourcing systems. So we're gonna be looking a lot more at CRDTs.

Okay. So I kind of hinted earlier that there might be a better approach, and this came from an idea that Eric had where I was saying, "I'm really worried about our Fanout Lambda and we either have an asynchronous mode or we have a synchronous mode. And synchronous mode I'm very worried about how slow it is." And Eric brought out the point that strictly speaking when you're doing ordering within a partition key, even within a batch of items within the same partition, you only care about the ordering for items with the same partition value.

So here A1 and A2 represent two items with the same partition value A and a value one and a value two. You can split these events up and process B, C, D, E, and A1, A2 group asynchronously, and only use your blocking single-threaded synchronous processing on A1, A2. So this is a hybrid approach between synchronous and asynchronous that we're exploring, to see if we can get some performance improvements out of that.

Effect after cause. So this one's pretty tricky, and it gets trickier when you have a system with multiple streams and multiple microservices consuming from those streams multiple different databases. What you wanna be able to say is, "Given that I've added this item to my cart, what are the contents of my cart?" So three different approaches we're using here.

The first approach, if you have the luxury of having a UI that can hold up in a dynamic connection, to your data layer, you can say, "Well, I'm just gonna show the contents of the carts and as that item drops into the cart it will show up, and maybe there's a little bit of delay. Maybe I can use a UI to represent that. There is a dynamically loading view." And really the latency here is usually on the order of seconds anyway, so maybe this is a perfectly acceptable solution if you have the luxury of having a customer facing UI and this works for you.

The second solution that we've implemented in production for a few of our systems is to put something like DynamoDB in front of the stream and then you write through that DynamoDB and use DynamoDB streams in order to populate your ledger. And now you have an immediate read after write strong consistency. The thing I don't like about this approach is that it couples your event producer and your event consumer and you lose some of that glorious, like, my job as a producer is I just through something on the stream and I walk away. My job is done. And as a consumer, I'm just watching the stream.

So we're doing a bunch of thinking and a little bit of work investigating this third option which is, "How about when I write to the ledger, I get a ledger receipt which gives me my offset." So I say, "At the cart, you're at offset 30." And then I walk over here to the cart microservice and I say, "Hey, cart microservice, once you've caught up to the ledger position 30, can you tell me the contents of the cart?" This gets again, I mentioned, a little more complex when you have multiple layers, so now we're looking at using some distributed tracing ideas to trace that providence of the event all the way through, and then update the database with a small graph that shows the relationships.

So now I can say, "All right, even though I'm consuming from this, I know that this links to this, links to this, and yet I'm up to the 30, so you should be good with this." So some really interesting things that start happening when you've got this glorious decoupling of your services, but then there's this question of if I through this event out, what happened? When did it happen and how do I synchronize across services?

All right, so my serverless event sourcing wish list. As I tell my kids when they go to Nordstrom to see the Santa there, they only get two. So I had to narrow it down to two. The first one is a really big one for us, and this is a...we love the tools that we use, but there's a significant problem with fanout. So you've got this amazing stream that everyone's excited to use, now you have 50 different teams that all want to consume from it. We use both Kafka and Kinesis internally.

Kafka does this a lot better. The challenge I have with Kafka is you really need maybe three full-time engineers for the care and feeding of that cluster. So we like to be able to use a managed version of this, but the managed version is strictly limited. You can only do five reads per second. The pulling frequency of the serverless function is once a second and this creates a significant bottleneck. The way we've solved this is we use a Fanout Lambda, and we fan it out to all of our subscribing partners.

So you're responsible for standing up your own Kinesis stream. We'll subscribe you to that Fanout Lambda and that will write to your stream. And now each team gets to manage that resource of their stream. The problem with that approach is now you have a single point of failure with that Fanout Lambda. Let's say somebody, one of your consuming or one of your subscribing streams changes the policy, then you can no longer write to that. So now what do you do? Do you stop? Do you create a second Fanout Lambda and subscribe the kind of the sick fish stream? How do you do that?

So it's a little bit clunky at this point getting to that enterprise scale of, I've got a whole bunch of developers. You know, the end vision is I've got my developer portal with my APIs and my streams. I want anyone to be able to look at those streams and connect to one of those streams. All right, so I'm waiting for someone to make this magic button because what I really wanna do, in an event sourced system, is recreate application state from the ledger. So I've got my streams. I have some amount of durability on those streams and your Kinesis is up to seven days.

And Kafka it's configurable, but really it depends on how much resources you're throwing against it, and at some point, you're gonna have to write it to an archive. So I want the magic button, and my magic button is pretty specific. I have a team that says, "We have a new feature. We want a request-response API that tells you the average sales velocity by blush color of all cosmetics sold at Nordstrom." So I wanna cold start a new database using the last three years of transactional events.

So I want to do that using the same serverless function that I'm gonna use on my live stream because I wanna write that function once. So I go in, I write my function. It says, "Grab a batch of events." If it's a cosmetics event look at the shade. Look at the price. Add the price to this table and done. Really simple piece of code. Maybe a couple of 100 lines of code. I want my job at that point to be done. What I want to do is point that function at my archive, push the magic button, and have the entire archive replay through that function at scale.

And at 10,000 times speed I get three years of transactional data in less than an hour. Maybe I insert $50 or $100 for all of this massively scaled capability. And then once it's caught up, that database is caught up to the archive, I wanna attach that to my real-time stream. And so now I come back from lunch, and I've got this new capability on my system. So this can also be used for things like disaster recovery and a number of other capabilities. Most of the systems now will do a good job archiving, but we're still in search of a system that will read from the archive.

Google dataflow system has an interesting approach where it treats the stream and archive very similarly, so we're investigating that as one possibility. But if you have any ideas or thoughts here, we would love to hear them. All right, so if you need a serverless or haven't tried serverless, I'll do a shameless plug here for our open source project called Serverless Artillery. We found, internally at Nordstrom, this is an excellent way for teams to get started with serverless because it's using serverless functions to test your existing architecture. So Serverless Artillery takes the core that was developed by [inaudible 00:30:26] in Shoreditch Ops, the good folks in London.

And we've hosted that inside of a serverless function which can invoke as many serverless functions as you need to generate arbitrary load instantaneously. So you can hook this into your deployment pipeline. You can run it once a minute as a health check. You can run it once as an acceptance test. It's very versatile, and it's very quick to invoke and very little cost. So the nice thing in a way that this is a good step into serverless is in order to use this as a load test against your service, you have to solve problems like how do I communicate with that service?

What subnet is it in? What are the firewalls? What are my authentication authorization concerns? How do I communicate the search, etc., etc. Once you've solved that, for this testing scenario, you've now also solved it for any functional serverless function that you want to communicate with your service. If you're thinking about event-sourcing, we learned so much when we did the recently viewed tray. It's a very simple use case, but it's one that' very visible. It was easy for us to understand when things went wrong. It's very visible on the website. It's very visible testing.

You can go and hit an A item and a B item and a C item, and see if they show up. So there's a lot of subtle issues that you'll run up against. So if you can find a use case where you have high visibility and maybe low criticality to it, you can get started and start establishing that stream. And as you establish that stream, think of it in terms of a published stream. See if you can get it to that high quality published stream, and think about how you can load all the attributes that were observed at the moment of that event's existence. So thanks. That's all I have today. We'd love to hear from you and here is a list of the open source products that I mentioned. Thank you.

Subscribe to our newsletter to get the latest product updates, tips, and best practices!

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.