dev-resources.site
for different kinds of informations.
Railway Event Processor: Safe integrations in Event Sourcing
An integral default approach for the design and implementation of third party integrations in event-sourced systems.
⚠️The abstractions proposed in this paper will be soon implemented in booster, the algorithms proposed are proven correct in this repository. A solution to a real world problem with these abstractions will be soon shared.
Railway Event Processor
When in doubt, treat your system as distributed
The need to integrate with third-party services is prevalent in the software industry. Many cross-cutting problems present massive and unique challenges that require specialized focus to solve; some of them would be:
- Making sure that transactional emails are sent to consumers without being filtered as spam requires such a degree of commitment that there are companies that do exclusively that.
- Receiving payments online, interacting with the myriad of platforms available, and with the banks reaches a point of complexity that many payment providers become banks themselves to make the service more reliable.
- Keeping track of interactions with customers has enough nuance to create an entire industry of Customer Relationship Management.
- Logistics and chain of supply.
- Providing customer support.
- Authentication and authorization.
The issue with this is that, no matter how simple you think your application is, having any of those increasingly necessary integrations, effectively turn any system into a distributed one, and in the words of Leslie Lamport:
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
Any integration with these systems, services in your own company, or actually anything that happens outside the memory assigned on your program, like a database, presents four major challenges:
- Orchestration.
- Fault tolerance.
- Concurrency handling.
- Communication.
⚠️ This research looks to address Fault Tolerance and Concurrency Handling, but since they can’t be addressed without taking into account the other two, orchestration will be handled using the Process Manager Pattern, with some lessons learned from the Saga Pattern, and Communication will be solved via event streams and read models. It might catch the attention of the reader that, given the spotlight that message streaming platforms have in enterprise environments, to the point that seminal books as enterprise integration patterns deal almost exclusively with messaging, messaging is not used. As a matter of fact, the read models that will be used for scheduling can, and often should, be stored via message queues, but that extra step will be omitted for simplicity.
The language described in this paper has the aspiration to be a framework used as a bridge between engineering and business teams, so the degree of complexity of any given integration is visible and understandable by everybody involved.
First, the problem will be laid out in terms of the nature of integrations and why they are a frequent source of errors and technical challenges. Then the core concepts will be introduced, the Task
, the Compromise
, and the Transition
.
All of these will be put together with an example in the Workflow
definition.
Finally, a design will be provided for the abstraction of every kind of Task
, with a formal proof of correctness for each. Together with the solutions, there will be proof of the unsolvable nature of the opaque integrations.
The solution design is built around the assumption of an event-sourced, or at least event-driven system, and it should be possible to apply these principles to such systems without any modification. A non event-oriented system can implement this pattern with relative ease as long as it follows the Command Query Responsibility Segregation
principle. Beyond that, major concessions might be needed, forfeiting the correctness provided out of the box by this framework.
The end goal is to provide a safe structure to connect every step of these processes, so they can be built and maintained as a deterministic state machine.
Modelling systems as a workflow of tasks
Before delving into the concept of a task as the key component of our solution, it is essential to provide a brief introduction to the concept of a workflow as a means of orchestration. Details regarding this concept and the communication between tasks will be discussed later. At this point, it suffices to understand that any system orchestration situation can be modeled as a combination of tasks that run sequentially.
For instance, consider a series of tasks, labeled A to D, where task B is an action to be performed with information obtained from the result of A. Task A has the responsibility to start B. Then, when B completes, it will decide whether to continue with C, because customer support has to make a review, or D, which is a straightforward automatic process. Finally, either of them will emit an event with the outcome of the workflow.
The aforementioned configuration can be represented using a diagram as follows:
This example serves merely as an illustration of a possible task orchestration scenario. The crucial idea to grasp at this stage is that tasks will run in sequence, and that they are unlikely to show unexpected behavior, forming execution graphs of arbitrary complexity. These execution graphs provide a mental map of the system's orchestration, enabling efficient and adaptable process modeling.
More importantly, there is no hidden complexity in these diagrams, every possible outcome is represented in the graph.
The Task
An abstraction over a single step in a workflow, it is modeled after a state of a state machine, and all its possible results, including those of failure, will point out to a different task.
To-Do Item
They require one external action to happen: a user interaction, receiving a message from a queue, a webhook from a payment provider, or other similar interactions. Once that is received, a decision will be made to move to the next step.
State
A step in the workflow that will perform an action synchronously and, using the result of such action, will decide on the next step.
⚠️ Notice that neither of them are concerned with fault, the goal of the abstraction is exclusively to make a decision.
The transition
One key element of the Railway Event Processor is that there is no concept of a failed task. Even for technical faults, the inability to attempt the task has to be translated into a business outcome.
That is, a payment might be successful or fail because of a lack of funds, this means that the payment itself can be rejected, but the task is not concerned whether the result was desirable or not, but about that, there was one of two possible results, and that each of them will lead to a different task in the flow.
In this example, a successful payment will lead to the fulfillment of the order, and a failed one will lead to a new request for payment to the user, possibly with a retry counter, which will eventually lead to the order being canceled.
⚠️ A task that “fails” for business reasons it’s not considered a failure in this framework, as business failures (such as a payment rejected or not having permissions to perform an action) are abstracted as just one of several possible outcomes.
e.g. Performing a fraud check can determine whether the user is trustworthy or not. Both cases are just possible outcomes of the check: A result of potential fraud is not a failure but a successful verification that returned useful information.
The transition is represented as one of the following options:
- A domain event that ends the workflow, this would be the outcome of the workflow.
- The
State
that follows the decision. - The
To Do Item
that follows the decision.
⚠️ The domain event that ends the workflow can be used to chain parts of a more complex workflow.
Technical categories of tasks
External action required
The Task
is waiting for the third party to continue the execution.
e.g. A webhook from a payment provider or a response sent by a human.
Idempotency enabled
The Task
is executed against a service that provides a mechanism to uniquely identify an action. Said mechanism allows us to try to perform the action knowing that it will only impact the system once even if we have attempted it on the past or there is a duplicated process attempting the same action. This removes concerns for concurrency and fault tolerance, as it is possible to just attempt the action until we get an acceptance or rejection result and succeed in directing the workflow toward the desired next Task
.
e.g. Many payment providers accept an Idempotency Key
in their requests, so if we attempt a refund, and we generate a unique key per refund, we know we will never refund more than once.
Simply add the Idempotency Key
to the request, any parallel executions, or retries due to failure to register the outcome will be ignored by the third party.
Optimistic lock enabled
This is aTask
with the following characteristics:
- Either the API of the third party offers:
- A way to attach a unique identifier to the action we want to perform.
- A query mechanism that allows filtering via the aforementioned unique identifier.
- Or:
- The nature of the task makes it so that the action can be identified deterministically without an additional identifier.
- And the API offers a query mechanism that allows querying for that.
- A multistep process that can be canceled at any time before it is completed.
This way, we can, before we even start processing it, detect whether there is another process trying to perform the same action or whether there was a previous attempt that faulted. This allows us to attempt the action and verify between steps again for faulty concurrent access. Finally, it is possible to run a cleanup after one process has been successful. That is, we virtually allow multiple processes to race concurrently, allowing only the first of them to write the result and canceling the others.
Pessimistic lock enabled
This is a Task
with the following characteristics:
- Either the API offers:
- A way to attach a unique identifier to the action we want to perform.
- A query mechanism that allows filtering via the aforementioned unique identifier.
- Or:
- The nature of the task makes it so that the action can be identified deterministically without an additional identifier.
- And the API offers a query mechanism that allows to query for that.
Without a multistep process that can be canceled at any time, we must rely on a central place to lock any other processes out of attempting the operation.
That is, we won’t allow a second attempt to run if there’s an ongoing process.
Black boxed
This Task
provides no ability to reliably determine if the operation is being performed by another process, or whether we are attempting it again from a previous failure.
For these processes, the beginning of an attempt is stored, so is the result, any attempt that starts, will check if there was a previous beginning recorded without a matching result, if that’s the case, a possible failure is detected, as it is impossible to know if the action was performed and its outcome.
This case would require compromising in case of failure. A decision needs to be made if it’s detected that there was a previous attempt that did not finish:
- Have a human review it.
- Ignore it and move on.
- Retry it up to a certain amount of times.
Common language definitions
All these definitions, for the sake of simplicity, will be condensed from the perspective of the common language into these three terms, so when discussing the design of workflows that integrate third parties, they will get referred to in the following three categories.
To Do
This is the External Action Required
type. An external service needs to reach to continue the workflow.
Trivial
Services that have an explicit idempotency mechanism, rare, but always preferable.
Integrating with this category poses no challenge both in terms of correctness and performance.
We can count on any action performed towards this service as that they will eventually complete.
Deterministic
This includes optimistic and pessimistic lock enabled third parties.
Integrating with this category poses a challenge in terms of correctness. The simplest approach is to consider all of them pessimistic and switch to the optimistic ones if there's a performance problem.
We can count on any action performed towards this service as that they will eventually complete.
Opaque
Black boxed.
These present an unsolvable challenge in terms of correctness, as we can neither guarantee a single execution nor failure (in the case of having a record of the attempt starting but not of the attempt outcome).
A business decision needs to be made in terms of what to do when failure is suspected.
Business concerns of a Black Boxed Task
This includes all the factors we need to take into account when making compromises over black-boxed tasks. This part of the framework is not formal, it aims to provide a checklist of factors to consider when making a decision, the decision itself is formal, as it will have an abstraction on top of it.
Visibility
How likely is the non-execution or failure of the task to be perceived.
- Self-evident. By its own nature, the failure will be noticed. e.g. If a refund is never sent, the user will eventually complain.
- Traceable. It’s possible to add reliable tracking mechanisms to follow up on the action being performed. e.g. The deduction of the loyalty points from a user that purchased a product using them fails. We would need to set an alert for it and have someone verify what happened.
- Invisible. We can’t guarantee to know the outcome of a task. e.g. We cannot determine whether an email has been opened or not. Mechanisms like tracking pixels might be blocked by the email service, so it’s possible that an email that was read shows as unread.
Necessity
How critical is the step to the workflow
- Must happen. The workflow can’t continue without this step. e.g. The payment must be confirmed before we start handling an order.
- Should happen. The workflow can continue, but there’s a consequence if it fails. e.g. Bank account number validations are known to put back customers when they make a typo, so it’s desirable to allow them to continue and then reach out to them about a failed payment.
- Nice to have. It’s not needed, but it is presumed that it adds some value to the product. e.g. Sending a satisfaction survey.
Concern
Who gets directly hurt if the task fails:
- Own. e.g. There’s a failure when registering an order as sent, and it gets sent twice.
- Third party. e.g. The delivery data for an order to a logistics company gets corrupted, and the provider needs to reach out to fix the issue.
- Customer. e.g. A refund is not sent when it’s due.
Risk
The kind of consequences suffered:
- Reputation. e.g. A customer gets charged twice for a product. Event after returning the amount, the customer might not come back.
- Legal. e.g. A Service Level Agreement is not met, a customer loses business because of that, and sues us for the losses.
- Data breach. e.g. There is an email with an invoice from customer "A" sent to customer "B", leaking GDPR protected information.
- Economic loss. e.g. An expensive product was bought, but an error caused two of them to be sent.
Compromise Decision of Black Boxed Tasks
Once taken the concerns into account, a decision has to be made about handling possible failure for the task:
⚠️ This is about
possible failure
it is often possible to guarantee that a task has failed, if that’s the case, that failure will be treated as it would in a deterministictask
.
At least once
If there’s a suspicion that the task has failed, it will be retried indefinitely, assuming the risk that it might happen more than once. It is important to understand that this means to attempt it until its success is registered
, or it expires, meaning that it might even succeed indefinitely.
e.g. Blocking a user that has been deemed hostile should be attempted until success is registered.
Up to n
times
If there’s a suspicion that the task has failed, it will be retried a certain amount of times, assuming the risk that it might happen more than once.
e.g. Sending a verification code as an SMS, it’s desirable to guarantee that it gets sent, but the costs could pile up rather quickly if the retries happen indefinitely for every user.
At most once
If there’s a suspicion that the task has failed, it’s given up silently.
e.g. Sending a welcome email.
Important: Human intervention required
If there’s a suspicion that the task has failed, there will be a step where a human has to make a decision. This is not a special kind of compromise, as there’s already a kind of task for this, the External Action Required
.
e.g. A user requested the deletion of their private data, but the confirmation email could not be sent, it is desirable to reach out to the user to avoid legal risks.
The conclusion
The outcome of a task that has been deemed possibly failed
needs to be mapped to a business outcome, that is, a transition
. All the other conclusions are mapped just like the deterministic ones.
Algorithms for each kind of Task
Every task requires:
Task identifier
A unique, human-readable identifier that is unique for the action that is being attempted and that has to be consistently and deterministically generated only with the information contained in the event that triggered it. It will be referred to as the Task identifier
. Optionally, they can be part of the signature of the events that trigger a task. Some examples would be:
refund-eaf1c4d0-3cfd-4fb4-bae0-864a6142446d
-
partial-payment-invoice-6ddcfdc3-8769-4f53-9cb8-66f6d6aa9b69-part-0
In this case, the event needs to carry the part, number and the aggregate responsible for it needs to keep track of which partial payment it is emitting the event for. payment-initiated-6ddcfdc3-8769-4f53-9cb8-66f6d6aa9b69
Expiration functions
- A function that receives the event that triggered the task, its
Task identifier
, and determines whether it still needs to be performed. - A function that maps the expiration to one of the possible transitions of the task.
To-Do Item
Tasks that expose an endpoint that a third party must complete, this might even be a human actor. The abstraction requires the following definitions:
- A read model definition.
- An endpoint definition:
- Access control.
- A data contract.
- Type.
- Message queue.
- Synchronous API route segment (web API, GraphQL…)
- A function that queries the read model based on the information in the data received.
- A function that makes a decision based on both the read model and the data received, and nothing else.
- A function that receives the event that leads to this task and generates a read model.
The algorithm:
State
Tasks that receive an event, attempt an action and, once successful, make a decision and emit a new event.
Retriability
All state tasks can define retry logic, this is done through a back off function that receives how many times the function has been attempted, the event, the Task identifier
and an optional error type (an exception in most languages), and returns either an amount of time to wait until the next attempt or a decision to not try again.
To implement this function, it is important to take into account the “retriability” of a task based on the risk assessment made in the business concern section.
- At least once
- Up to
n
times - At most once
In general, tasks falling into the first category can be retried until the timeout function stops the cycle. If that is the case, no retry back-off function is needed.
Fault tolerance
All state tasks need to define a function that takes an error type, represented as exceptions
in most languages, the event, and the Task identifier
, and returns a business decision. This function will be called if the retriability function decides not to retry again.
Idempotency enabled
This task is an integration with a third party that provides a mechanism to uniquely identify an action, often with an idempotency header, so the action can be attempted many times without producing the effect several times. The definition requires:
- A function that receives the event and the
Task identifier
, attempts the action and with the result of it, makes a decision.
The algorithm:
Optimistic lock enabled
This task is an integration with a third party that allows to perform an action in two steps, and determine whether the first step of the task is already underway by another process by querying for the first step based on some form of metadata that includes the Task identifier
. The definition consists on:
- A function that receives the event and the
Task identifier
, and attempts the first step of the task. - A function that receives the event and the
Task identifier
, and determines whether the action has already started. - A function that receives the event and the
Task identifier
, attempts the second step, and based - An optional function that receives the event and the
Task identifier
, and undoes any incomplete attempt of the task.
The Task will:
- Attempt the first step.
- Verify that the first step just made is the oldest one not expired for this identifier.
- If it isn't, it tries to delete the step and exits without an outcome.
- Verify that there isn't a completed second step for this identifier.
- If there is, it tries to delete the step and exits without an outcome.
- Attempt the second step.
- Generate an outcome with the result of the second step.
The algorithm:
Pessimistic lock enabled
This task is an integration with a third party that allows to query if an attempt on the action has been performed before based on some form of metadata that includes the Task identifier
. If it has, it has to be possible to get the outcome of that previous attempt. The definition consists on:
- A function that receives the event and the
Task identifier
, attempts the task, and uses its outcome to make a decision. - A function that receives the event and the
Task identifier
, and determines whether the action has been tried before. If it has, reads the outcome of the task and makes a decision. - A property that delimits how long can the task take in an attempt, cancelling it after that time.
The Task will:
- Write a lock for the
Task identifier
. - Verify that the lock just made is the oldest one not expired for this identifier.
- If it isn't, aborts without an outcome.
- Verify that there isn't a completed action in the service for this identifier.
- If there is, it is guaranteed that the original process has expired, so emit an outcome with the result available.
- Attempt the action.
- Generate an outcome with the result.
The algorithm:
Black boxed
This task in an integration with a third party that offers no way to determine whether an action or an attempt has been performed before. While it is possible to determine whether an attempt has started, it is not possible to determine if the third party actually received a request from the system unless the third party replied and the system was successful in storing the outcome.
Since the task is attempted via a pessimistic lock, it is possible this means that if an expired lock exists and no decision has been recorded, the state of possible failure
is reached.
The definition of this task consists on:
- A function that receives the event and the
Task identifier
, attempts the task, and uses its outcome to make a decision. - A function that receives the event, the
Task identifier
, and the time the previous attempt started and makes a decision.
The Task will:
- Write a lock for the
Task identifier
. - Verify that the lock just made is the oldest one.
- If it isn't, verify that every previous lock is past expiration, and with a registered fault.
- If they all meet both conditions, continue.
- If there is at least one expired without a fault, emit an uncertainty outcome.
- If there is at least one that is neither expired nor faulted, register a fault for this attempt and abort without an outcome.
- If it isn't, verify that every previous lock is past expiration, and with a registered fault.
- Attempt the action.
- If there's a fault, register the fault.
- Generate an outcome with the result.
The algorithm:
The Workflow
The core tenets of the framework
Rejection is not failure
The concept of a Task
is, when it comes to its outcome not concerned about which outcome means a success business wise, but about whether the action could be executed and a decision was made out of the outcome. This means that any possible result just represents the next possible state.
Undo "A" is just do "B" when "A" has been done
While the Task
is not concerned about business failure, the Workflow
itself is, there are actions that might need to be undone if subsequent Tasks
fail, this is not considered a special action in this framework, but just an action to perform after a certain precondition has been met.
All parties are third parties
Even when the services integrated are developed, maintained and owned by the same company, even by the same team, treating any external service as a third party that can fail to respond leads to predictable software. While it might seem extreme, bear in mind that this treatment is done only once, as part of the framework, so the pressure of having to consider every outcome is released once designing the state machine.
Yes, even people are third party services
In a To Do Item
, an external action is awaited, this might come from an employee of the company owning the system, a confirmation from a user, or even a user from the third party needing to click on a link. The expiration scenario, that is, when the action never happens, must be a clearly visible part of the design, as it is not a matter of “if”, but of “when” will it happen.
Insist, desist or react
As complex as the decisions for faults in a Task
might seem, there are only three possible outcomes:
- Retry the action, this is part of the retriability definition of the
Task
. - Desist in the attempt, that is, deciding not to retry, and in the fault mapping function, continue the workflow towards a different
State
. - React to the fault, that is, deciding not to retry, and in the fault mapping function, continue the workflow towards a
To Do Item
.
Optimism through deterministic pessimism
Unless there's a performance problem actually happening, treat both optimistic and pessimistic integrations as a pessimistic one.
⚠️ Since an optimistic integration needs to happen in two steps, simply define every step as a different integration.
⚠️ Remember that a performance concern is not a performance problem actually happening, the robustness of pessimistic locked actions allow for an aggressive distribution of the tasks amongst workers, reducing the time required to execute thousands of tasks to the longest execution time of a single one.
The framework, by example
The following example is the creation of an order. So simplify the model, any fault and expiration result that maps to a default exit, Order cancelled
in this case, can be left as an implicit mapping.
Legend
Glossary
Code safety:
Property of code that handles every possible state it can go through. A safe function has constraints in its input, so every possible value that is allowed by the compiler is valid, never throws exceptions for any valid value of its input, and always returns a valid value, specified by its return type.
Completed task:
A task that has executed without fault. This definition is not concerned whether the action was accepted or rejected.
e.g. A fraud check that returns a result, regardless of the user being deemed trustworthy or not, has completed.
Concurrency handling:
Capacity of a system to behave predictably when interacting with a resource that can be accessed by another process, be it a third party, another instance of the current system or another process that is part of the same system.
Failed task:
A task that could not be completed after all the fault tolerance mechanisms have exhausted any option to recover, or the task can’t be performed for other reasons.
e.g. An email provider throttled a request, a circuit breaker acts in to give the service some time to recover, after the request successfully goes through, it returns a Bad Request response because the content had some invalid characters.
e.g. A SMS provider rejects an attempt to send a message because the associated account does not have enough credits.
Fault tolerance:
Capacity of a system to continue functioning predictably after an external action fails to execute as expected, it is assumed that the fault is transient, that is, that the external system will be able to eventually respond.
Optimistic lock enabled action:
An integration that can be done in two steps, provides a deterministic mechanism to identify the process as unique, and allows for the process to be cancelled.
Pessimistic lock enabled action:
An integration that provides a deterministic mechanism to identify the process as unique, but lacks the other two characteristics of the optimistic lock enabled
Featured ones: