Ajay Nair - being a good citizen in an event-driven world

Oct 31, 2017

Emit is the conference on event-driven, serverless architectures.

Ajay Nair, Lead Product Manager at AWS Lambda, talked to us about considerations when designing event-driven systems.

Okay, he says, we can all admit that serverless architectures tend to be event-driven: all logic is embodied as functions, triggered by events, that talk to something downstream. And eventually, it'll be more and more common for companies to emit event streams that let anyone do cool things with them.

All kinds of SaaS companies and ISVs will be event sources.

So, in Ajay's experience, what are the practicalities and ins-and-outs? This one was definitely a crowd favorite.

Watch his full talk below, or read the transcript.

More videos:

The entire playlist of talks is available on our YouTube channel here: Emit Conf 2017

To stay in the loop about Emit Conf, follow us at @emitconf and/or sign up for the Serverless.com newsletter.

Transcript

Ajay: All right. That's my cue. I was told if I see purple, I should go. Apparently good advice for life. All right, folks. Hi, I'm Ajay. I lead the PM team for AWS Lambda. I'm really excited to see how many people are excited about this whole event thing.

In the course of the conversations today, a lot of us have been talking about event-driven architecture, serverless, and serverless event-driven architectures. I'm taking a position in this talk, which is that while all serverless architectures tend to be event-driven, not all event-driven architectures are serverless. So, I'm gonna try and stick to the term event-driven as much as possible, but pardon me if I slip up a little bit.

So, to kind of get into it a little bit: we're all here because we're excited about this movement around event-driven architecture that the push to serverless has triggered. All of us are excited about the kinds of architectures we have seen shared today from Nordstrom and Capital One. There's one that I really like to kick off talks with, which is a company called Uber.

Many of you may have seen some of their Twitter posts and others. This is the architecture that we talk about as the "before" of serverless architectures: your conventional load balancers, a few [inaudible 00:02:04] servers talking to a standard database. And this is the future of serverless, as some people start talking about it.

Austin showed a simpler version of this, but this is serverless in all its glory. This particular customer has 170 functions and [inaudible 00:02:20]. Their deployment time dropped from 30 minutes to seconds. Their costs dropped by about 95%, while they ship about 15 times more features every month than they were previously doing.

So, cost benefits, agility, time-to-market: all of those are values that we're starting to realize with this. Apparently, my clicker isn't working. I'm sorry. This is what I was trying to talk to. All right. And we're seeing this pattern being spread across a wide variety of scenarios. We've heard customers talk about web applications, and frameworks coming out for building Flask-based and Express-based ones using simple synchronous invocations within a [inaudible 00:03:03] Lambda, all the way to very complex data processing applications: something like what Nordstrom was talking about with event sourcing and auto manipulation and recommendation engines, to something as complex as big data processing, even trending towards what you would call HPC.

If folks haven't checked it out, check out this framework called PyWren that one of the folks at UC Berkeley has put together for running massively distributed data processing apps; it allows you to do 25 trillion floating point operations over a billion records. But the point is, all of these new scenarios and applications are enabled by what actually boils down to a very simple, or grossly simplified, architecture pattern. One where all logic is embodied as functions, there are things called events that trigger those functions, and the functions then talk to something downstream, which in turn may themselves be event sources or actors that act on that particular thing.

Something that's common across event-driven architectures is the fact that every communication ideally happens through events and APIs. The execution layer is stateless and ephemeral, which means there is no concept of anything being retained over there. And that's almost a forcing function for the separation of logic, data, and state.

So, that's been standard event-driven architecture so far. One thing Austin told me when I was talking with him about this talk was, "Well, you can get on stage but you can't talk about Lambda," which cuts out 90% of the material that I have. So, what I'll talk about today is how you go about being an effective event source provider in this particular model. And the reason I bring this up is that many of you will be creating SaaS companies or ISVs or products of your own. And at some point, you're gonna be emitting events which you want to participate in applications that other people build.

So, say you're the next, next Uber, maybe the next Uber is already built, and you wanna emit an event stream that allows anyone to create a function for doing something smart with it. Say, creating a Stripe payment every time someone gets into an Uber, or whatever. What are the decisions that you need to make about creating an event source, or participating in this event source infrastructure? A lot of the patterns I'm gonna talk about today are ones that we use internally for our own services for emitting events across S3, Dynamo, and others.

Okay. So, how do you effectively become a good event source? To quickly start off, and to make sure that we're all talking about it and framing it the same way: we all talk about events as various things, but I'm trying to use a standard definition, which is that an event is an indication that something of interest happened, and a service is telling other people what that interesting thing was. The standard definition from the Reactive Manifesto is out there. I'm surprised no one mentioned it until now, so I'll take first credit for calling that out. But when you look at it, there are kind of two conceptual pieces to actually delivering events.

So, one is the event source itself, the component which is responsible for identifying the change that happened and then emitting a payload with some interesting information about what needs to happen. And there's a second, logically separate component, whether they merge or not remains to be seen, that's actually responsible for getting that event to the actual processor. Keeping with the principle of loosely coupling the processor and the source, you want that to be a potentially separate service. In fact, Austin talked about the Event Gateway, which fits nicely into this router concept.

Some examples of these particular pieces within AWS, just to bring the point home: the services on the right, sorry, I get mixed up, on the right act as event sources. So, data stores like Dynamo, S3, even EC2 instances all emit events. For example, EC2 can emit an event that says my instance has spun up or down. Dynamo can say my record has been updated. S3 says a new object has shown up or been deleted. And then you have various ways to route them to actors. Most commonly Lambda functions, but you can imagine an event being routed to a container service to get deployed. That could be SNS, which is a general pub/sub construct, or CloudWatch Events, which allows you to map arbitrary events to various destinations like SNS; Lambda has this concept of event source mapping, so on and so forth, right?

I'm not trying to be comprehensive in explaining these things, just to give you a real-life example of what I mean by these two types. Okay. So, with me so far? Okay. So, the first decision you make is what actually goes into your payload. And here there are two distinct patterns that we realized even while we were building our own ecosystem of event sources.

There's one thing which is common, sort of a standard baseline, which is that all your events must contain provenance information. And by provenance I mean: who is the source, what was the event that happened, and when did it happen, right? More often than not, timestamp information is something that you will find useful in terms of determining both.

In fact, interestingly, both times [inaudible 00:08:27] are interesting: when the event that was interesting actually happened, and when the event source, the thing that was watching, noticed that that particular event happened. More often than not those two are not necessarily separate, but both are potentially worth capturing.

What goes in the rest of the event payload varies depending on what your actual scenario is, and there are two patterns here. And it kind of goes back to your philosophy of what you believe an event can be. One pattern is to believe that an event is a notification, in the sense that it tells you that something interesting has happened. Or, if you want, one of my teammates called this a passive-aggressive notification, in the sense of: I'll tell you something happened, but if you wanna know more you have to come back and talk to me. I'm not gonna tell you what happened.

So, the idea behind this is: if your application logic is predicated on the fact that the original event source has to be talked back to, this is probably a reasonable pattern to follow. In this case, the event payload contains information about the event source resource that was affected, more data about who made the change, what went into it, the security construct under which it should happen, etc., and the assumption is that the component that's reacting to the event, the processor, reaches back to the event source to find out what happened.

So, let's take the example of the next-next Uber that we were talking about. In this model, if that service wanted to emit, it would assume that the function has the ability to talk back to that service and ask for something interesting, say, ice cream information or whatever it is that they wanna find out over there.

The real-life example, you know, our infamous S3 thumbnailing example that has now been beaten to death multiple times, but I present it to you yet again because it's a great one to talk about, is an example of this. When an object shows up inside S3 and a notification is published, the notification does not actually contain the payload of the object that was created. It simply says that an object was created. And then the function has the choice of reaching back into the S3 bucket, pulling that content back, and deciding what to do with it.
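As a rough sketch of what that callback pattern looks like in practice, here's a minimal Lambda handler in Python using boto3; the bucket contents and the thumbnailing step itself are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The S3 notification names the bucket and key, but carries no object body,
    # so the function has to reach back into S3 for the content it needs.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... generate a thumbnail from body and write it somewhere else ...
```

Note the second round trip: every processed event costs an extra read against the source, which is exactly the traffic-doubling trade-off discussed next.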

So, the tradeoff you have over here is much more lightweight communication across the services; however, you're ending up with somewhat tighter coupling, because each downstream service is now aware of the one that's upstream. The other thing to remember is the potential doubling of traffic on your event source. In the extreme case over here, imagine if every object that was being uploaded needed to be thumbnailed: you've now doubled the traffic on your bucket, one request for actually putting the object and one for reading the object back to process it.

So, there is a trade-off that you're setting up for yourself. Imagine you were in the place of S3: what does that look like for you? Are you okay with that additional traffic coming back to you? Or, if that's going to happen all the time, there's another pattern for you to consider.

The second model to consider is the state transfer pattern, sort of having the event pass state forward, on the assumption that the downstream service never comes back and talks to you. I'm borrowing both of these terms from Martin Fowler, because he gives a nice clean way to explain this. What this assumes is the opposite of what came before, which is that you can't talk back to the event source. Consider your connected-device story, where an IoT device is putting something out: there's potentially a very limited window in which you can go back, reach the device, and find out what's going on over there. Or again, if your service doesn't have a public endpoint or something that customers can call back to, you can't do it.

So, in that case, you would lean towards putting the payload in the event itself. The trade-off is obviously much more data being schlepped around. You now have to consider security constructs around what actually goes in there, and worry about how that gets passed forward. But more importantly, it brings in another component which we typically don't consider over here, which is this concept of an event store.

I'll dig into that just a little bit. These are two examples of AWS events that we actually emit today. One is an S3 event; the other is an IoT data blob being passed in through Kinesis, using Kinesis as the event store in that particular case, and I'll talk a little bit more about that. The portion on the top is the common element across the board: that's provenance. It's saying this particular event was created at this particular time, and the reason it happened was an S3 object or a Kinesis put, passed on forward. But what you see is a difference in the rest of the body. The S3 event surfaces things that are characteristics of the event source itself. It says: this was the bucket, this was the object, etc., giving me all the information that, as a function, I would need to go back and reach into it.

Kinesis, on the other hand, just stuffs the payload in there, encrypted of course, so that the downstream service can process it, and you rarely go back and talk to the thing that was there before Kinesis, the thing in front of it. And, going back to this: if you are a provider, think about which of these things matters to you. Is the information that you're passing to processors downstream valuable to stuff in the payload itself, or would you require them to talk back to you? And this doesn't necessarily have to follow the identity model shown there. It could be something as simple as a callback URL that you put in there, saying, "Hey, this is how you call back and talk to me if you need to, when you're ready."
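For a rough, abbreviated sense of how the two payload shapes differ (field names trimmed down from the real S3 and Kinesis record formats; the values are illustrative):

```python
# Notification style: S3 names the affected resource but not its contents.
s3_notification = {
    "eventSource": "aws:s3",                  # provenance: who
    "eventName": "ObjectCreated:Put",         # provenance: what
    "eventTime": "2017-10-31T12:00:00.000Z",  # provenance: when
    "s3": {
        "bucket": {"name": "my-bucket"},
        "object": {"key": "photos/cat.jpg", "size": 1048576},
    },
}

# State transfer style: Kinesis carries the payload itself inside the record.
kinesis_record = {
    "eventSource": "aws:kinesis",
    "eventName": "aws:kinesis:record",
    "kinesis": {
        "partitionKey": "device-42",
        "data": "eyJ0ZW1wIjogNzIuNX0=",  # base64-encoded payload travels with the event
    },
}
```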

So, going back to this previous thing I was saying about the event store: if you have state transfer, and you as an event source don't have a concept of state, that state needs to land somewhere. And, as we already said, your quintessential FaaS provider is not stateful, so you're not gonna have anywhere for it to land over there. So, you need something that is potentially an event store in the middle. And it's a decision to be made, not something to be taken lightly. On one hand, with an event store, you get enhanced durability and retention. And this is valuable for many, many scenarios that you see over there. Durability, for the reason that events are potentially still accessible if the event source itself is down.

So, God forbid your next, next Uber doesn't show up and is down for some reason, the events are still there for replaying and having a conversation with. Secondly, you have this concept of retention, which means events can be revisited, replayed, used for rehydrating production stores, etc. The quintessential scenario here is the event sourcing model: if you have a data store that is publishing a change log, you want it to go into a durable store that you can then go back and replay.

The flip side of it is you have potential additional storage, retention, and the complexity of another component being introduced. Previously you were dealing with a thing that emitted events and a thing that reacted to events, and now you have this thing in the middle that is retaining and potentially shifting events around.

So, and I'm gonna stress this again, you don't need it in two cases. Don't worry about it in the S3 model, where state is already present over there. And don't worry about it in the case where state is passed back, which is a fancy way of saying: if it's a synchronous invocation, don't bother.

So, let's use a couple of real examples, because there are a couple of event store models that you can go down. One is the streams model, the advantage being that your events can potentially be processed in order. You have this concept of broadcast showing up, with multiple consumers actually being able to act against it. The downside is you start dealing with things like Rob was talking about in his first talk: how do you deal with sharding or distribution of your events across all these different partitions that you have within your stream?

Your fan-out, kind of the "infinite scaling" behavior that you get with functions, is somewhat restricted because of the ordering fixation that you have over here. We'll talk a little bit about that. So, in this example, imagine this is a table that has, say, a product listing being written into it, and it's publishing the updates to that product table as a stream, which is then processed by Lambda and then backed up. You're essentially creating an eventually consistent copy, maybe in another region, as well as an entity snapshot for doing archiving in the background. Pretty straightforward. An architecture that many of you would consider. But in this model, the right approach is to make sure that you have a stream-based event source, or event store, as opposed to the next one I'm talking about, which is a queue-based one.
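A minimal sketch of that stream consumer in Python, assuming a hypothetical products-replica table in another region; DynamoDB Streams hands the handler batches of ordered change records:

```python
import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
replica = boto3.resource("dynamodb", region_name="us-west-2").Table("products-replica")

def handler(event, context):
    # Records within a shard arrive in order; each carries the item's new image
    # when the stream is configured with NEW_IMAGE (or NEW_AND_OLD_IMAGES).
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Convert DynamoDB's wire format ({"S": "..."} etc.) to plain Python values.
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            replica.put_item(Item=item)
```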

So, I'm not talking about anything revolutionary here. This is queue-based SOA; you know, maybe there are terms for it which even I don't know at this point. You can think of this as your quintessential "what if I use an enterprise service bus" approach for decoupling two systems. The nice thing about it is that it still applies to the concepts we're talking about here.

Concurrent processing on individual events: you can process each one discretely. The flip side being, obviously, no ordering guarantees, and the concept of having multiple consumers on it flips around. But there are scenarios where it's valuable. So, here's an example of an architecture that one of our customers has set up for running virus scans on machine images. Every time someone uploads machine images, or AMIs in AWS terms, into an S3 bucket, a notification gets published into an SQS queue, and then they have multiple workers, with their antivirus software installed, which actually spin up the machine image, run scans on it, and then notify if they find something interesting.

So, it's an example where the consumer isn't necessarily a function, going back to the whole thing about event-driven versus serverless, but the queue in the middle makes sense because now you have multiple workers, albeit with limited fan-out, that are actually able to act on the event. And it really doesn't matter which image gets processed first or later. It's the right thing for you to do.

In both these cases, the queue is acting as the durable store. And the reason this is important is that you may have a case where the entire fleet is down. Someone went home for the night and accidentally switched off the entire fleet. You don't want to lose sight of the fact that 50 images needed to have a virus scan run on them, and they just sort of evaporated because no one was there watching for them. That's the case where you need an event store in place. Both are viable models, but something for you to consider over here.
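Here's a sketch of what one of those workers might look like in Python; the queue URL and the scan step are hypothetical. The key point is that a message is only deleted after a successful scan, so a dead worker's messages become visible again and get retried:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-scan-queue"  # hypothetical

def scan_image(message_body):
    ...  # placeholder: spin up the image and run the antivirus scan

def worker_loop():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            scan_image(msg["Body"])
            # Delete only after the scan succeeds; if this worker dies first, the
            # message reappears after its visibility timeout, because the queue
            # is the durable store.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```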

Interestingly enough, like I was alluding to earlier, the event store you pick also has implications for how your scaling and retry behavior works. In the default case where there is no event store, if an event source is talking directly to your provider, it's the closest thing you can get to a synchronous operation, but the provider has no concept of doing, say, built-in retries or any kind of retention. You bring in something like queues, and you now have the concept of potentially doing retries, because the event itself is still present on the queue; you can try to reprocess that image and go from there.

And, by the way, this is the model that we use for Lambda's asynchronous API underneath the covers. And the reason why we chose to do this is to give you a miniature version of an event store inside our async API, so that when the function runs there's this concept of retry behavior that we can build into it.

And finally, when you talk about stream-based ones: because events are being sharded across multiple shards, your concurrency gets limited to the number of shards or partitions that you actually have on your stream, which is very different from the other model, where the concurrency or parallelism you're getting is driven by the individual events that are in play.

So, as an event source provider, the things you have to think about are: A) decide whether you wanna expose an event store or not, and B) if you do, which of these models do you pick? You don't have to. You can always choose to tell your customer, "Hey, give me an SQS queue to publish my stuff onto, or tell me your functions directly and I'll push to them directly." But more often than not, it's valuable for you to consider: do I expose, say, an event stream or an event queue which any consumer can then go and read off of?

And, as you can imagine, the stream-based model is the one that DynamoDB has chosen to adopt. DynamoDB Streams is a construct built into DynamoDB that shows up over there. The queue-based approach is what, to some degree, S3 has chosen to follow. They have an internal queue where they actually retain some of those messages and do one level of retries before the events even get to Lambda.

All right. So, the third decision is about the router construct. Again, recapping: we talked about a bunch of services within AWS which are routers, but the interesting thing is to think about what capabilities you need the router to have. And, as I was creating this slide, one thing I want to assert over here is that a lot of these capabilities are pretty implicitly bound to the downstream actor.

So, in many cases, you will find Lambda or [inaudible 00:21:01] or others providing services that embody these capabilities in some form or another. Take SNS as an example that I'm familiar with, which allows you to do a pub/sub construct on top of this. But you're welcome to write your own event source bridge. We have custom providers who use a Lambda function to act as their router, reading off their particular event store and then turning around and triggering any functions that they might have on their end.

So, in this particular case, there are three main checkpoints you want to have for your event source or your event router. One, the ability to securely associate multiple providers, multiple sources, and multiple destinations. And I stress the word securely over here. In a world where events are flowing all over the place and you have 50 components milling around, it is extremely essential to make sure that anyone who's emitting an event is authorized to do so, anyone who's consuming an event is authorized to do so, and you're sure that whoever is emitting it is who they say they are.

In Bobby's example, you don't want someone else saying, "Here is a $100,000 credit to my account," and slipping that into the event stream without anyone else knowing about it. So, you have to make sure that the mapper itself has the ability to map security between those two. The second key tenet that you will see over there is conditionals: the ability for you to filter down the events that the provider actually has to work on. This is an efficiency play more than anything else, but it's one that we've started embodying in almost all our routing services internally.

So, you will see this, say, within S3: you can filter by prefix and suffix. On SNS you'll start seeing some kind of a filtering construct as well. You can do this with Dynamo: you can specify which particular objects you actually want sent out over there. And then finally, because they are the ones responsible for communicating with the downstream service, the concept of on-failure semantics rests with the router as well.

And, this is where the affinity with the downstream consumption service becomes really interesting, because you will now have the ability to say, "Okay, if the function failed to execute on the event the first time, what do I do? Do I throw away the event? Do I put it onto a dead letter queue so that it gets replayed? Do I end up going and telling the function to do something differently? Do I rely on the failure semantics of the function itself?"
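If you did roll your own router as a Lambda function, a minimal sketch of those three checkpoints might look like this; the routing table, function name, and dead letter queue URL are all illustrative:

```python
import json
import boto3

lambda_client = boto3.client("lambda")
sqs = boto3.client("sqs")

# Illustrative routing table: which (authorized) destination gets which events.
ROUTES = [
    {
        "prefix": "uploads/",       # conditional: filter before dispatch
        "function": "thumbnailer",  # destination this source is securely mapped to
        "dlq": "https://sqs.us-east-1.amazonaws.com/123456789012/thumbnailer-dlq",
    },
]

def route(event_record):
    for r in ROUTES:
        if not event_record["key"].startswith(r["prefix"]):
            continue  # filtered out: the destination never sees this event
        try:
            lambda_client.invoke(
                FunctionName=r["function"],
                InvocationType="Event",  # asynchronous hand-off to the consumer
                Payload=json.dumps(event_record),
            )
        except Exception:
            # On-failure semantics live with the router: park the event on a
            # dead letter queue so it can be inspected or replayed later.
            sqs.send_message(QueueUrl=r["dlq"], MessageBody=json.dumps(event_record))
```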

So, if you are choosing to build your own event router, or if you're using a standard platform, put it through this checklist: is it able to give you these capabilities, and what does it do with them? You can imagine a whole bunch of more ideal scenarios showing up there. Imagine dynamic discovery of both the event sources and destinations that are there, and then dynamically binding the components back and forth. The ability to combine events with identical schemas and then do joins on them; throw GraphQL into the mix and see what happens on that particular front. Or you bring in something like multiplexing, where you have fan-in and fan-out being handled by a centralized service without each source and destination having to worry about it.

But, I think it's gonna be an evolution that you're gonna see happening more and more. So, to finish it all off, I just wanted to bring up this example. This is actually one of the services within AWS, a simple architecture for handling automated capacity ordering when someone puts in limit increases. So, the limit increase goes into DynamoDB. There's a stream that's published out of there. A Lambda function processes it and pushes it to SNS, which then both notifies an operator and puts it into a queue for some of our automated ordering processes to go and kick off on that particular front.

And, this kind of embodies a lot of the things we talked about here. The first DynamoDB table acts as an event source. The DynamoDB stream is the event store over there. The first Lambda function is acting as your consumer, but then also acting as a simple event source itself, because it's not telling a downstream service to do something, nor does it use a store. All it's saying is, "Hey, I've got an event," and it uses SNS as its router to go to multiple components downstream. And over here it uses SQS as its sort of retention window for downstream services to reliably go and process things.
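A sketch of that first Lambda function's role, assuming a hypothetical SNS topic: it consumes the DynamoDB stream, then acts as an event source itself by re-emitting through SNS, its router:

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:limit-increase-requests"  # hypothetical

def handler(event, context):
    # Consume the DynamoDB stream, then re-emit: SNS fans the notification
    # out to the operator alert and the automated-ordering queue downstream.
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            sns.publish(
                TopicArn=TOPIC_ARN,
                Message=json.dumps(record["dynamodb"]["Keys"]),
            )
```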

So, this may be kind of a bigger example, but you can replace any one of those components with your particular service and imagine what that particular workflow looks like. Do you want your consumers to be creating Kinesis streams and queues, or do you create them on their behalf?

All right. So, to quickly bring it together: be smart about what goes into your payload. Don't overstuff it with information you don't need. Have a good thought process about what your scenario is. If it needs your service to be involved, make it a notification. If it's something that you can just pass forward and then be completely disconnected from, put the payload in there.

Second, surface an event store where appropriate. I believe things like event streams are probably critical components moving forward, although there are scenarios where they're optional. But queues and streams give you a durable and potentially reliable way for events to be replayed, re-created, and otherwise moved forward. And finally, think about routers, though I wouldn't always recommend that you have to go and build routers yourself.

My hope is that folks like Serverless and all of us together enhance the router construct over there. But if you have to build one, there are a few guidelines that you can follow: support secure access across the two sides, support filtering, and make sure that you have clear semantics for failure and success that you can actually push forward on.

And, my hope is that the more event sources are out there, the more [inaudible 00:26:41] the serverless architectures of the future get. So, I'm looking forward to all of you contributing to the event source ecosystem. All right. Thanks.
