WebCat’s Transition to Step Functions

WebCat is Clouden’s domain registrar service. Our customers register and renew domains and manage their DNS records. This is simple on the surface, but involves quite a few integration points. We interact with external services like the .fi top-level-domain registry and external DNS services. We also have an order processing system that has grown more complicated over time.

In this article, we describe how we originally implemented our system with Amazon SNS and SQS and later transitioned to Step Functions. The goal was to enable more advanced features, such as processing of multi-item orders. We also made our system more observable and easier to extend.

Original solution based on Amazon SNS and SQS

When we initially launched WebCat, customers could register one domain at a time. We implemented all integrations using Amazon’s SNS and SQS messaging services. SNS is a publish-and-subscribe service for real-time one-to-many messaging. SQS is a message queue service for eventual one-to-one messaging.

Amazon SNS is useful for keeping services decoupled from each other. The sending service doesn’t have to know anything about the receiving services. It just publishes a message to a specific topic, which acts as the API endpoint. We can develop services independently and replace them with new implementations as necessary.

Amazon SQS is useful for making sure that usage peaks don’t cause service outages. Requests are queued and integration services process them as quickly as possible. For instance, we can queue hundreds or thousands of REFRESH_DOMAIN requests immediately. The integration service, which connects to the .fi top-level-domain registry, processes one request at a time and deletes it from the queue.

Shortcomings in the original solution

Although Amazon SNS and Amazon SQS work quite well for their intended purposes, they also have weaknesses when used to implement service integrations.

SNS is a one-way medium with no “receipt” feature. The sending service doesn’t know whether receiving services processed the message or not. Errors go unnoticed.
SQS tracks individual messages, but all it does with failed messages is move them to a dead-letter queue. There is no way to refer to one particular message and check its status.

Due to these weaknesses, we originally developed a complementary service that uses DynamoDB to keep track of requests and their statuses. When AWS released Step Functions, we realized that it provides the same thing out of the box. Consequently, we decided to transition our system to Step Functions.

Transitioning to Step Functions

Step Functions are a way to build and execute state machines in the cloud. When you use Step Functions to implement a service integration, the state machine defines the basic operating logic.

A state machine execution flows through each state of the machine and typically calls small Lambda functions in a predefined order. The state machine may include Choice states which act similarly to if-then statements. It may also include Map states which iterate over arrays like for-loops. It can also respond to error conditions in the same way as try-catch clauses work.
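As a rough illustration of these building blocks, the following sketch writes a small state machine definition in Amazon States Language as a TypeScript object. The state names, the Lambda ARNs, and the `$.items` and `$.validation.ok` fields are hypothetical and do not reflect WebCat's actual definition.

```typescript
// Hypothetical ARN prefix and state names, for illustration only.
const LAMBDA_PREFIX = "arn:aws:lambda:eu-west-1:123456789012:function:";

const orderStateMachine = {
  StartAt: "ValidateOrder",
  States: {
    ValidateOrder: {
      Type: "Task",
      Resource: LAMBDA_PREFIX + "validate-order",
      ResultPath: "$.validation",
      // Works like try-catch: any error routes to a handler state.
      Catch: [{ ErrorEquals: ["States.ALL"], ResultPath: "$.failure", Next: "HandleFailure" }],
      Next: "IsOrderValid",
    },
    IsOrderValid: {
      // Works like if-then-else.
      Type: "Choice",
      Choices: [{ Variable: "$.validation.ok", BooleanEquals: true, Next: "ProcessItems" }],
      Default: "HandleFailure",
    },
    ProcessItems: {
      // Works like a for-loop over the array at $.items.
      Type: "Map",
      ItemsPath: "$.items",
      Iterator: {
        StartAt: "ProcessItem",
        States: {
          ProcessItem: { Type: "Task", Resource: LAMBDA_PREFIX + "process-item", End: true },
        },
      },
      ResultPath: "$.results",
      End: true,
    },
    HandleFailure: {
      Type: "Task",
      Resource: LAMBDA_PREFIX + "handle-failure",
      End: true,
    },
  },
};

// The JSON text is what you would supply as the state machine definition.
const definitionJson = JSON.stringify(orderStateMachine, null, 2);
```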

Step Functions let you define Activities, which are similar to SQS message queues. An Activity State waits for an external service to process the Activity and report success or failure. The operation is synchronous, which means that the next Lambda function receives the activity result as its input.
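In a state machine definition, an Activity appears as a Task state whose Resource is the activity's ARN instead of a Lambda function's ARN. A minimal sketch follows; the activity name, the ARN, the timeout values, and the next state name are assumptions chosen for illustration.

```typescript
// Hypothetical activity ARN; a worker outside Step Functions polls this activity.
const REGISTRY_ACTIVITY_ARN =
  "arn:aws:states:eu-west-1:123456789012:activity:fi-registry-request";

const waitForRegistry = {
  Type: "Task",
  Resource: REGISTRY_ACTIVITY_ARN,
  // Fail the state if no worker reports a result within ten minutes.
  TimeoutSeconds: 600,
  // Fail earlier if the worker stops sending heartbeats.
  HeartbeatSeconds: 60,
  // Store the worker's result next to the original input.
  ResultPath: "$.registryResult",
  Next: "HandleRegistryResult",
};
```

The worker typically fetches work with the GetActivityTask API and reports back with SendTaskSuccess or SendTaskFailure, passing along the task token it received.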

Overall, Step Functions provide a useful abstraction that can correspond to real-world concepts like order processing and other end-user actions. They reduce the amount of low-level code you need to write in Lambda functions. They also make your business logic more observable and organize it into well-defined state machines.

Anatomy of an Execution

When a customer submits an order, we create a state machine execution to fulfill it. An execution has a state and a unique name. The initial state is always Running and eventually it becomes Succeeded, Failed, Timed out or Aborted. We use the execution name to identify the order.

To create a state machine execution, we need to know the state machine name and define the execution input as a JSON object. This is roughly equivalent to sending a JSON message to an SNS topic or an SQS queue. Since the execution has a unique name, we can also track it through its lifetime. At the end we will know the final execution status and receive the final output.
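The sketch below shows one way to assemble the parameters for the StartExecution API call. Note that the API identifies the state machine by its ARN and expects the input as JSON text. The `OrderInput` shape, the ARN, and the order ID are hypothetical.

```typescript
// Parameters accepted by the Step Functions StartExecution API.
interface StartExecutionParams {
  stateMachineArn: string;
  name: string;
  input: string; // JSON text
}

// Hypothetical order shape, for illustration only.
interface OrderInput {
  orderId: string;
  items: { action: string; domain: string }[];
}

function buildStartExecutionParams(
  stateMachineArn: string,
  order: OrderInput,
): StartExecutionParams {
  return {
    stateMachineArn,
    // The execution name doubles as the order identifier.
    name: order.orderId,
    input: JSON.stringify(order),
  };
}

const startParams = buildStartExecutionParams(
  "arn:aws:states:eu-west-1:123456789012:stateMachine:Order",
  { orderId: "order-0001", items: [{ action: "REGISTER_DOMAIN", domain: "example.fi" }] },
);
```

The resulting object can be passed to the StartExecution operation of the AWS SDK.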

However, we are rarely interested in the final status and the final output of a Step Function execution. Instead, we include a final Lambda function state as part of the state machine. In the normal case, this Lambda function receives the final output, finishes handling the order and captures any pending credit card charge. This way the final Lambda function is part of the execution and we can observe it in the same place as the rest of the state machine.

Error handling

What happens when a Step Function execution fails? For that purpose we have a separate Lambda function that subscribes to CloudWatch Events outside the state machine. It receives the final status and error message from failed Step Function executions and finishes handling the order with an error status. It also cancels any pending credit card charges.
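A minimal sketch of the core of such a function follows, assuming it is triggered by "Step Functions Execution Status Change" events. Only a subset of the event fields is modelled, and the optional `error` and `cause` fields are an assumption; the `OrderFailure` shape is hypothetical. Finishing the order and cancelling the charge are left out.

```typescript
// Subset of the execution status change event delivered by CloudWatch Events.
interface ExecutionStatusEvent {
  source: string;
  "detail-type": string;
  detail: {
    executionArn: string;
    name: string;
    status: "RUNNING" | "SUCCEEDED" | "FAILED" | "TIMED_OUT" | "ABORTED";
    // Assumption: error details may be absent, so they are optional here.
    error?: string;
    cause?: string;
  };
}

// Hypothetical record describing a failed order.
interface OrderFailure {
  orderId: string;
  status: string;
  message: string;
}

// Returns the failure to record, or null when the execution did not fail.
function toOrderFailure(event: ExecutionStatusEvent): OrderFailure | null {
  const { name, status, error, cause } = event.detail;
  if (status === "RUNNING" || status === "SUCCEEDED") return null;
  // The execution name identifies the order.
  return { orderId: name, status, message: cause || error || "Unknown error" };
}
```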

Under normal conditions, our Step Function executions don’t fail. When an individual service integration fails, we don’t throw an exception that would fail the entire execution. Instead, we return an Error attribute in the output JSON to indicate an error. This makes it possible to use Step Function logic like Choice states to handle errors in various ways, such as calling a specific Lambda function depending on the type of the error and where it occurred.
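One possible shape for this protocol is sketched below: a wrapper turns exceptions into an Error attribute, and a Choice state branches on whether that attribute is present. The `ActionResult` type, the error code, and the state names are hypothetical, and the synchronous `register` callback stands in for a real integration call.

```typescript
// Hypothetical result protocol: a present Error attribute marks a soft error.
interface ActionResult {
  domain: string;
  Error?: { code: string; message: string };
}

// Wraps an integration call so that exceptions become soft errors.
function runAction(domain: string, register: (domain: string) => void): ActionResult {
  try {
    register(domain);
    return { domain };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { domain, Error: { code: "REGISTRATION_FAILED", message } };
  }
}

// Choice state that branches on the presence of the Error attribute.
const checkActionResult = {
  Type: "Choice",
  Choices: [{ Variable: "$.result.Error", IsPresent: true, Next: "HandleActionError" }],
  Default: "ActionSucceeded",
};
```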

For example, we might be processing an order to register ten domains. An error may occur while registering one of the domains because it was snatched by someone else a few milliseconds earlier. In this case, we don’t want to fail the entire execution. We want to continue processing and register as many of the remaining domains as possible. The final output will indicate whether each registration was successful or not, and we calculate the final credit card charge based on that information.
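The charge calculation could then look roughly like this. The `RegistrationResult` shape and the per-domain price in cents are assumptions made for the example.

```typescript
// Hypothetical per-item result; a present Error marks a failed registration.
interface RegistrationResult {
  domain: string;
  priceCents: number;
  Error?: { code: string };
}

// Splits results into successes and failures and charges only for successes.
function summarizeOrder(results: RegistrationResult[]) {
  const succeeded = results.filter((r) => r.Error === undefined);
  const failed = results.filter((r) => r.Error !== undefined);
  const chargeCents = succeeded.reduce((sum, r) => sum + r.priceCents, 0);
  return {
    registered: succeeded.map((r) => r.domain),
    failed: failed.map((r) => r.domain),
    chargeCents,
  };
}
```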

Nested Step Functions

It is sometimes useful to split a state machine into several nested state machines. In our case, the top-level state machine is usually an Order. It processes each item of the order and executes an inner state machine to perform the related action, such as a domain registration. We call these inner state machines Actions.

This separation lets us detach service integration details from order processing. When we add new service integrations, we define them as new state machines. The top-level state machine doesn’t have to know anything about the details, just the state machine name that it needs to execute.
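A sketch of how an Order state machine might start one nested Action execution per item is shown below. It uses the Step Functions service integration for starting executions. The field names under `$.item` are hypothetical; in particular, the example assumes that each item already carries the ARN of its Action state machine and a unique execution name.

```typescript
const processOrderItems = {
  Type: "Map",
  ItemsPath: "$.items",
  // Pass each iteration only the current item, not the whole input.
  Parameters: {
    "item.$": "$$.Map.Item.Value",
  },
  Iterator: {
    StartAt: "RunAction",
    States: {
      RunAction: {
        Type: "Task",
        // Start a nested execution and wait for it to finish.
        Resource: "arn:aws:states:::states:startExecution.sync",
        Parameters: {
          // Hypothetical item fields, for illustration only.
          "StateMachineArn.$": "$.item.actionStateMachineArn",
          "Name.$": "$.item.executionName",
          Input: {
            "domain.$": "$.item.domain",
          },
        },
        End: true,
      },
    },
  },
  ResultPath: "$.actionResults",
  End: true,
};
```

With the `.sync` form of the resource, the nested execution's output arrives as JSON text; the `.sync:2` variant returns it as parsed JSON instead.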

Step Functions can also call many cloud service integrations directly. For instance, you can read and write data in DynamoDB without writing any code. This makes it increasingly feasible to implement all business logic as nested Step Functions, without necessarily having to write any Lambda functions at all.
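For example, a Task state can write an item to DynamoDB through the built-in service integration, with no Lambda function involved. The table name, the attribute names, and the `$.orderId` path in this sketch are hypothetical.

```typescript
const saveOrderStatus = {
  Type: "Task",
  Resource: "arn:aws:states:::dynamodb:putItem",
  Parameters: {
    TableName: "Orders", // hypothetical table name
    Item: {
      // Values use DynamoDB's attribute-value format.
      orderId: { "S.$": "$.orderId" },
      status: { S: "COMPLETED" },
    },
  },
  // Keep the original input and store the DynamoDB response beside it.
  ResultPath: "$.saveResult",
  End: true,
};
```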

Conclusion and some advice

To conclude, here’s some general advice based on our experiences in implementing Step Functions. Your mileage may vary, but we encourage you to consider these points when planning your own implementation.

DO

Define the JSON input and output structure of every Lambda function as strictly as possible. When your Step Functions grow more complex, you quickly lose track of those structures unless you use static typing in TypeScript. This is particularly true when you use nested executions, Choices, Map iterations and ResultPath.
Give your state machine executions explicit names to make it easier to understand them in the AWS Console. The name can be up to 80 characters, which can fit two UUID values and a human-readable label. Two UUID values are useful when you want to use a DynamoDB primary key as part of the name.
Also give your nested executions explicit names using the Name.$ parameter. Include the name in the input object so that you can refer to it with a path like $.executionName. You can also do this for nested executions inside a Map iterator, if you include the names for each iterated execution in the input. Be careful to make them unique.
Keep in mind that errors can occur on many different levels. States can return “soft errors” in their output objects, individual states can fail, and entire state machine executions can fail. Use an external Lambda function and CloudWatch Events to catch failures at the outermost level.
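To illustrate the naming advice, here is a small helper that fits a human-readable label and two UUIDs into the 80-character limit. The `label-uuid-uuid` scheme and the parameter names are our own assumptions for this example; two UUIDs with separators take 74 characters, which leaves six for the label.

```typescript
const MAX_EXECUTION_NAME_LENGTH = 80;

// Hypothetical naming scheme: "<label>-<uuid>-<uuid>".
function buildExecutionName(label: string, customerId: string, orderId: string): string {
  const suffix = `-${customerId}-${orderId}`;
  const room = MAX_EXECUTION_NAME_LENGTH - suffix.length;
  if (room < 1) throw new Error("IDs leave no room for a label");
  // Replace characters outside a conservative safe set, then trim to fit.
  const safeLabel = label.replace(/[^A-Za-z0-9_-]/g, "_").slice(0, room);
  return safeLabel + suffix;
}

const executionName = buildExecutionName(
  "register",
  "3f2b1c9e-8a47-4d2e-9b1a-0c5d6e7f8a90",
  "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
);
```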

DON’T

Don’t overwrite the original input object of your executions. Use ResultPath in your states. It retains the original input and passes it through to the next state. This ensures that all states always have access to the original input. You’ll be thankful for this later on, when you develop the state machine further and don’t have to refactor everything to access the input object.
Don’t pass the entire input object as a parameter to Map iterator states. You can end up multiplying the input object hundreds of times and exceed the payload size limit that Step Functions enforces. Design your data structures carefully so that this doesn’t happen.
Don’t throw errors to indicate failures. Instead, define your own protocol for returning error codes as part of the JSON output from Lambda functions. This way, your higher-level failures won’t terminate the entire Step Function execution and you have more flexibility in processing them.
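The effect of ResultPath can be pictured with a simplified model. The function below is not how Step Functions is implemented and only handles single-level paths such as `$.validation`; it merely shows that each state's result is attached beside the original input instead of replacing it. All field names are hypothetical.

```typescript
type Json = Record<string, unknown>;

// Simplified model of ResultPath for single-level paths such as "$.validation".
function applyResultPath(input: Json, resultPath: string, result: unknown): Json {
  if (!/^\$\.[A-Za-z0-9_]+$/.test(resultPath)) {
    throw new Error("This sketch only supports paths like $.field");
  }
  const field = resultPath.slice(2);
  // Copy the input and attach the result under the given field.
  return { ...input, [field]: result };
}

const originalInput: Json = { orderId: "order-0001", items: ["example.fi"] };
const afterValidate = applyResultPath(originalInput, "$.validation", { ok: true });
const afterCharge = applyResultPath(afterValidate, "$.charge", { cents: 900 });
```

After both steps, `afterCharge` still contains `orderId` and `items`, so later states can keep reading the original order.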

Thanks for reading and we hope this article was useful for you! If you’d like to see how our system works in practice, we warmly welcome you to register your next .fi domain at WebCat.
