Introduction
A collection of notes on programming using actors.
Source code for this book (and program files) is available at https://github.com/jayson-lennon/actor-notes.
Quick DDD Primer
Here is a list of DDD tactical terms and how they map to Actor Systems:
-
Entity -> Unit of State with Unique Identity: A distinct "thing" in your system that you track individually over time.
- Actor Mapping: Often represented by the state within an Actor, with the Actor's address or ID representing the Entity's unique identity.
-
Value Object -> Immutable Data Structure: Data that describes something but has no identity of its own; it is defined only by its values.
- Actor Mapping: Data carried as the content of Actor Messages or stored as immutable fields within an Actor's state.
-
Aggregate -> Consistent State Cluster (managed by a Coordinator): A group of related data that must be kept valid together. One point controls all changes to this group.
- Actor Mapping: Naturally maps to a stateful Actor. The Actor is the coordinator, encapsulating and managing the state cluster, processing messages sequentially to ensure internal consistency.
-
Domain Service -> Domain Process/Operation: Logic for a significant action or calculation that doesn't belong to just one "thing."
- Actor Mapping: Implemented by dedicated Actors responsible for specific domain processes or orchestrations, possibly interacting with stateful Actors.
-
Application Service -> Use Case Coordinator: Code that directs steps to perform a specific action requested by the user or another system; uses the domain logic but isn't the logic itself.
- Actor Mapping: Implemented by Actors that serve as the entry point for commands or requests, orchestrating interactions by sending messages to other Actors (Domain Services, Aggregates).
-
Repository -> State Loader/Saver: Gets and saves the "Consistent State Clusters" (Aggregates) from/to storage.
- Actor Mapping: Used by Actors to manage their persistent state. In some frameworks, persistence is integrated into the Actor model itself.
-
Domain Event -> Something Significant Happened: A notification or record that a key occurrence took place in the system.
- Actor Mapping: Often implemented directly as Actor Messages that are emitted by Actors and can be subscribed to by other Actors or external systems.
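To make the mapping above concrete, here is a minimal Python sketch. All names (Money, ItemAdded, OrderActor, the "add_item" message) are hypothetical and chosen only for illustration; the handler is called directly rather than through a real actor runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Money:
    """Value Object: immutable, compared by value, no identity."""
    amount: int
    currency: str


@dataclass(frozen=True)
class ItemAdded:
    """Domain Event: a record that something significant happened."""
    order_id: str
    price: Money


class OrderActor:
    """Aggregate/Entity: the order_id is the identity, the actor guards the state."""

    def __init__(self, order_id):
        self.order_id = order_id
        self.total = Money(0, "USD")
        self.events = []

    def handle(self, message):
        kind, price = message
        if kind == "add_item":
            # The actor is the single point that keeps the cluster consistent.
            if price.currency != self.total.currency:
                raise ValueError("currency mismatch")
            self.total = Money(self.total.amount + price.amount, price.currency)
            event = ItemAdded(self.order_id, price)
            self.events.append(event)
            return event
        return None


order = OrderActor("order-42")
order.handle(("add_item", Money(1000, "USD")))
order.handle(("add_item", Money(500, "USD")))
```

After the two messages, order.total equals Money(1500, "USD") and order.events holds two ItemAdded records that could be published to other actors.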
Actor Systems
Let's think of an "actor system" not just as a technology, but as a model for structuring software. It’s a paradigm shift away from traditional imperative or sequential programming, offering a fundamentally different way to design and build applications – one that excels in handling concurrency, complexity, and resilience.
Instead of viewing your application as a series of steps executed linearly, an actor system envisions it as a collection of independent, self-contained entities interacting through messages. Think of it like many small independent programs communicating with each other. Its primary benefits include:
- Improved Extensibility: Each actor can be developed, tested, and scaled independently, making it easier to integrate new features without affecting existing ones.
- Enhanced Maintainability: The clear separation of concerns and the straightforward message-passing interfaces within an actor system make the codebase easier to understand and maintain.
- Enhanced Simplicity: Each actor operates as an independent unit, allowing you to focus solely on the messages it sends and receives, thereby simplifying the coding process.
- Easier Concurrent Programming: By breaking down applications into manageable, isolated components, actor systems simplify the implementation and management of concurrent operations.
By embracing this model, you can build more robust and adaptable software systems that are better equipped to handle complex operations and future growth.
What is an Actor System?
At its core, an actor system consists of:
- Actors: These are the fundamental building blocks. Think of them as lightweight, isolated computational units. Each actor has:
  - State: Private data that only it can directly access and modify. This eliminates a huge source of concurrency problems (more on this later).
  - Behavior: A set of rules defining how it reacts to incoming messages. It's essentially the logic for processing those messages, potentially updating its state and sending new messages to other actors.
  - Mailbox: A queue where incoming messages are stored. Actors process these messages one at a time (serially), ensuring order within their own context.
- Messages: The only way actors communicate with each other. They're simple data structures – no shared memory, no direct function calls. This is crucial.
- The System: The environment that manages the actors, schedules them for execution, and handles message routing.
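The three parts of an actor can be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not how any particular framework works: CounterActor, hammer, and the ("add", n) / ("get", reply_queue) message shapes are hypothetical, and a plain thread stands in for the actor system's scheduler.

```python
import queue
import threading


class CounterActor:
    """Hypothetical minimal actor: private state, a behavior, and a mailbox."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self):
        self._count = 0                # state: only the actor's own thread touches it
        self._mailbox = queue.Queue()  # mailbox: thread-safe FIFO queue
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Enqueue a message; the caller never touches the actor's state."""
        self._mailbox.put(message)

    def stop(self):
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        # Stand-in for the actor system: deliver one message at a time.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                return
            self._receive(message)

    def _receive(self, message):
        # Behavior: react to each message kind.
        kind, payload = message
        if kind == "add":
            self._count += payload
        elif kind == "get":
            payload.put(self._count)  # reply by message, not by return value


def hammer(actor, times):
    for _ in range(times):
        actor.send(("add", 1))


counter = CounterActor()
senders = [threading.Thread(target=hammer, args=(counter, 1000)) for _ in range(4)]
for thread in senders:
    thread.start()
for thread in senders:
    thread.join()

reply = queue.Queue()
counter.send(("get", reply))
total = reply.get(timeout=5)
counter.stop()
```

Four threads send 1000 increments each and total ends up at 4000 without any lock in the counter's own logic, because only the actor's thread ever reads or writes _count.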
Why Choose an Actor Model?
Traditional imperative programming often struggles with concurrency. You end up wrestling with locks, mutexes, semaphores – tools designed to prevent problems arising from shared mutable state. These tools are complex, error-prone (deadlocks!), and can significantly hinder performance due to contention. The actor model sidesteps this entire problem class.
Here's how the actor model shines compared to sequential or even traditional concurrent programming.
Elimination of Shared Mutable State
This is the key benefit. Because actors have private state and communicate only through messages, you avoid the data races and lock-related deadlocks that plague shared-memory concurrency. This dramatically simplifies reasoning about your code: you can understand the behavior of a single actor in isolation.
Natural Concurrency
Many tasks we think of as sequential are actually inherently concurrent. Consider:
-
CLI tooling
Command-line interfaces (CLIs) typically operate on files. You can enhance even basic tools by modeling each file as an actor, thus incorporating concurrency seamlessly.
-
Image Processing Pipeline
Different stages (loading, filtering, compression) can be represented as actors, each working independently on a portion of the image data.
-
Game Logic
AI agents, physics simulations, rendering – all could be handled by separate actors, allowing for parallel execution and responsiveness.
-
Web Server Handling Requests
Each request could be handled by an independent actor, allowing for parallel processing without complex threading management.
-
Financial Trading System
Order processing, risk management, market data analysis - each can be an actor reacting to events in real-time.
-
Resilience & Fault Tolerance
Actors are designed to fail gracefully. If one actor crashes, it doesn't bring down the entire system. Supervision strategies (built into many actor systems) allow parent actors to monitor their children and restart them if they fail, ensuring continued operation. This is incredibly difficult to achieve reliably in traditional architectures.
-
Modularity & Scalability
Actors promote modularity – each actor encapsulates a specific responsibility. This makes code easier to understand, test, and maintain. The inherent message-passing nature also lends itself well to scaling; actors can be distributed across multiple machines with relative ease.
-
Reactive Programming
Actor systems embody reactive programming by inherently responding asynchronously to events without the need for polling.
The actor model also enables some interesting techniques that are difficult to pull off in more traditional programming paradigms. Some of them are highlighted in the next section.
Key Advantages of Actors
The actor model enables patterns that are challenging or impractical in other paradigms:
-
Complex State Machines
Managing intricate state transitions becomes much cleaner when each state is represented by an actor, reacting to messages representing events. Imagine a complex workflow engine – each step could be an actor, simplifying the logic and making it easier to debug.
-
Decentralized Coordination
In systems where you don't want a central authority dictating behavior (e.g., distributed sensor networks), actors can coordinate their actions through message passing without relying on a single point of failure or bottleneck.
-
Adaptive Systems
Actors can dynamically adjust their behavior based on incoming messages and the state of other actors, creating systems that learn and adapt over time. Think of an autonomous vehicle – each component (perception, planning, control) could be an actor reacting to sensor data and coordinating with others.
-
Workflows
Actor systems excel at managing long-running workflows by allowing actors to suspend their work temporarily and resume later, maintaining state and ensuring seamless execution over extended periods (even days).
-
Event Journals
Actors can log actions and message exchanges to event journals, providing a comprehensive audit trail of system activities. This allows you to see exactly what the state of an actor was at any given moment in time.
The actor model shines when it comes to maintainability and architectural flexibility. The core principle – building systems from independent, message-passing actors – directly contributes to a more manageable and adaptable codebase.
System Maintainability
In traditional architectures, adding new features often involves modifying existing codebases, potentially introducing regressions or unintended side effects. With the actor model, this process is significantly cleaner. New functionality frequently translates into adding new actors.
Think of it like building with Lego bricks instead of sculpting a single block of clay. Want to add a new feature? Create an actor that encapsulates that feature's logic and connect it to existing actors via messages. This has several key benefits:
- Isolation: The new functionality is isolated within its own actor, minimizing the risk of impacting other parts of the system. No need to fear widespread code changes or complex refactoring.
- Clear Boundaries: Actors clearly define boundaries of responsibility. Adding a feature means defining what that actor does, and how it interacts with others – leading to better-defined interfaces and reduced coupling.
- Testability: Individual actors are much easier to test in isolation, as you can focus solely on their behavior without needing to mock or simulate complex interactions.
Example: Imagine a shopping cart system. Initially, you have actors for Cart, ProductCatalog, and PaymentProcessor. Now, you want to add a "Recommendation Engine." Instead of modifying the existing actors, you create a new RecommendationEngine actor that receives messages about items in the cart and sends back recommendations. This keeps the core logic clean and focused. The shopping cart and related functionality still run just as before (including their performance characteristics), but now there is new functionality that can provide recommendations.
Actors in Actor Systems
Actors are the primary components of actor systems, acting as isolated, lightweight computational units designed to handle messaging and state management without direct interference.
Key Components of an Actor
- State: Each actor maintains its own private data that only it can access and modify. This isolation prevents concurrent modifications and eliminates common concurrency issues.
- Behavior: Actors define behavior as a set of rules for processing incoming messages. This behavior includes updating the actor's state and sending messages to other actors.
- Mailbox: Actors have a mailbox, which is a queue where all incoming messages are stored. The actor processes these messages one at a time, ensuring that message processing adheres to a specific order.
Benefits
- Isolation: Actors do not share state directly, preventing concurrency conflicts.
- Concurrency Simplification: Serial processing of messages within each actor avoids synchronization issues.
- Modularity: Actors are independent units that can be tested and scaled separately.
Actor Processing
Below is a high-level overview of the steps taken by an actor:
-
Receive message via mailbox
Message receipt is asynchronous. The actor system will typically get the next message from the queue and hand it off to the actor for processing. The actor never enters a loop to get messages and instead has a handler that is called when a message is sent to it for processing.
-
Update its internal state
Upon receiving a message, an actor may modify its internal state based on the content and purpose of the message. This state update enables actors to maintain and alter their internal configurations or variables as required by their behavior logic.
-
Send messages to other actors (including itself)
An actor has the capability to send messages to other actors within the same system. This includes having the ability to send messages to itself, which can be useful for tasks that require iterative processing or callbacks.
-
Spawn new child actors
In addition to sending and receiving messages, an actor can also create new child actors. This feature supports hierarchical structures in actor systems, enabling parent-child relationships where actors manage their own lifecycle and interactions with other members of the system.
-
End message processing
After any state updates and message sends, the actor returns from its handler function, thereby handing control back to the actor system.
All functionality of a program can be modeled by putting small state changes in actors and using messages to communicate. Note how in the above steps there are no accesses to other actors' state and no waiting for messages from other actors. All an actor does is run a handler function upon message receipt, perform its processing, and then stop.
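The steps above can be sketched with a tiny single-threaded stand-in for an actor system. Everything here is hypothetical (System, Countdown, Logger, and the message shapes); the point is only that each handler receives one message, updates its own state, sends messages (including to itself), may spawn a child, and then returns.

```python
from collections import deque


class System:
    """Hypothetical actor system: owns the mailboxes and calls the handlers."""

    def __init__(self):
        self.actors = {}     # name -> actor object
        self.mailboxes = {}  # name -> pending messages (FIFO)

    def spawn(self, name, actor):
        self.actors[name] = actor
        self.mailboxes[name] = deque()

    def send(self, name, message):
        self.mailboxes[name].append(message)

    def run(self):
        # Deliver one message at a time until every mailbox is empty.
        progressed = True
        while progressed:
            progressed = False
            for name in list(self.actors):
                if self.mailboxes[name]:
                    message = self.mailboxes[name].popleft()
                    self.actors[name].handle(self, name, message)
                    progressed = True


class Logger:
    def __init__(self):
        self.lines = []

    def handle(self, system, me, message):
        self.lines.append(message)  # update internal state, then return


class Countdown:
    def __init__(self):
        self.remaining = 0

    def handle(self, system, me, message):
        kind, value = message
        if kind == "start":
            self.remaining = value            # update internal state
            system.spawn("logger", Logger())  # spawn a child actor
            system.send(me, ("tick", None))   # send a message to itself
        elif kind == "tick":
            system.send("logger", f"remaining={self.remaining}")
            if self.remaining > 0:
                self.remaining -= 1
                system.send(me, ("tick", None))
        # Returning hands control back to the system.


system = System()
system.spawn("countdown", Countdown())
system.send("countdown", ("start", 2))
system.run()
log_lines = system.actors["logger"].lines
```

The Countdown actor never loops or waits; iteration happens because it keeps sending itself "tick" messages, and log_lines ends up as ["remaining=2", "remaining=1", "remaining=0"].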
Message Passing
While message passing is fundamentally different from direct function calls, it's helpful to understand the conceptual similarity. In imperative programming, you have a function that takes arguments, performs some operation, and returns a value. In an actor system:
- The Actor: Represents the "function" – encapsulating logic and state.
- The Message: Acts as both the argument and the trigger for execution. It tells the actor what to do.
- The Response (Optional): The actor's actions, potentially including sending new messages to other actors, can be considered a form of "response" – communicating results or initiating further processing.
However, the crucial difference is that message passing is asynchronous and never blocks. The sender doesn’t wait for a direct return value; it continues its own work while the actor processes the message. This decoupling is key to concurrency and resilience. Actors control the flow of information by deciding which messages to process, when to send new messages, and how to update their internal state based on those messages. They act as gatekeepers, ensuring that data flows through the system in a controlled and predictable manner.
The actor's message handler should handle incoming messages promptly without getting stuck in blocking operations. If extensive computations or delays are inevitable, these tasks should be executed asynchronously using background threads or tasks managed by the actor.
Keeping the message handling function fast, so that messages are processed immediately (or nearly so), allows control messages to reach the actor while it is working. This enables interesting capabilities, such as:
-
Self-monitoring
Actors can schedule periodic messages to themselves to perform self-monitoring checks like offloading work to other actors if their queue becomes too large.
-
Temporary Suspension
An outside source can order an actor to suspend operation while keeping its message queue intact. For example, maybe a quota was hit and we need to stop processing, but we still want to resume work later after the quota is reset or increased.
-
Message Rejection
Perhaps some messages are no longer relevant, so the actor can be told to drop their pending jobs.
-
Graceful Shutdown
Shutdown messages can be sent to the actor which order it to stop. The actor can then perform various cleanup operations before quitting. This may include:
- dropping remaining messages
- finishing remaining messages
- creating a snapshot so work can be resumed later
This design allows actors to be highly responsive and flexible in handling various scenarios.
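A minimal sketch of such control messages follows. The Worker class and the message kinds ("job", "suspend", "resume", "drop_pending", "shutdown") are hypothetical, and the handler is called directly instead of through a mailbox to keep the example short. In a real system each unit of work would be short or handed to a background task.

```python
from collections import deque


class Worker:
    """Hypothetical worker whose handler separates control messages from jobs."""

    def __init__(self):
        self.pending = deque()  # jobs accepted but not yet run
        self.done = []
        self.suspended = False
        self.running = True

    def handle(self, message):
        kind, payload = message
        if kind == "suspend":
            self.suspended = True      # keep the queue, stop working
        elif kind == "resume":
            self.suspended = False
        elif kind == "drop_pending":
            self.pending.clear()       # message rejection
        elif kind == "shutdown":
            self.running = False
            return list(self.pending)  # snapshot so work can be resumed later
        elif kind == "job":
            self.pending.append(payload)
        # Work only while active; a suspended actor leaves its queue intact.
        while self.running and not self.suspended and self.pending:
            self.done.append(self.pending.popleft().upper())
        return None


worker = Worker()
snapshot = None
script = [
    ("job", "a"), ("suspend", None), ("job", "b"), ("job", "c"),
    ("resume", None), ("suspend", None), ("job", "d"),
    ("drop_pending", None), ("job", "e"), ("shutdown", None),
]
for message in script:
    result = worker.handle(message)
    if result is not None:
        snapshot = result
```

Jobs "b" and "c" wait while the worker is suspended and run after "resume"; "d" is dropped; "e" is still pending at shutdown and is returned in the snapshot.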
Mailboxes
An actor mailbox, or message queue, serves as a central storage unit specifically designed to hold messages directed at a particular actor instance in concurrent programming systems. This mechanism ensures that actors can receive and process their messages in a controlled and orderly manner without immediate interference from other operations or actors.
Purpose of a Mailbox
-
Message Buffering: Actors typically operate asynchronously, and it's possible for multiple messages to be sent to an actor at once before it has the opportunity to process them. The mailbox acts as a buffer, storing these messages until they can be processed one by one.
-
Isolation from Concurrency Issues: By handling message delivery and processing in a serialized manner through the mailbox, actors are isolated from direct concurrency issues such as data races or deadlocks.
-
Order of Message Processing: Mailboxes typically use First-In-First-Out (FIFO) ordering, which processes messages in the order they were received.
Implementation Characteristics
-
Thread Safety: Mailboxes are inherently thread-safe to allow multiple threads to deliver messages concurrently without corrupting the message queue.
-
Blocking Operations: Upon attempting to read from an empty mailbox, actor systems often implement blocking operations where the actor thread is paused until a new message arrives. Conversely, when attempting to write to a full mailbox, write operations may block or drop messages based on system policies.
-
Capacity Management: Mailboxes can have limited capacity, which influences how they handle overflow situations. This capacity management prevents excessive memory usage and helps manage system resources effectively.
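As an illustration of these characteristics, the sketch below wraps Python's thread-safe queue.Queue in a hypothetical BoundedMailbox. The drop-on-full policy is just one possible choice; a blocking send would be an equally valid policy.

```python
import queue


class BoundedMailbox:
    """Hypothetical mailbox with a fixed capacity and a drop-on-full policy."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)  # thread-safe FIFO
        self.dropped = 0

    def offer(self, message):
        """Non-blocking send: returns False and counts a drop when full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def take(self, timeout=None):
        """Blocking receive: waits for a message; raises queue.Empty on timeout."""
        return self._queue.get(timeout=timeout)


mailbox = BoundedMailbox(capacity=2)
accepted = [mailbox.offer(n) for n in range(3)]  # third message overflows
first = mailbox.take(timeout=1)
```

Here accepted is [True, True, False], one message is counted as dropped, and take returns the oldest message first.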
Communication channels
The method by which actors communicate with other actors is not dictated by the underlying mathematical model of actors. Therefore, how they communicate can vary significantly across different implementations.
Types of Actor Communication Channels
Each method of communication has different overhead and informs the way in which actor-based programs are written. Here is a short list of the primary methods used by actor frameworks or libraries:
Physical Actors (Actors with Affinity)
Some actor systems use "physical" or affinity-based communication channels where actors have direct bindings or shared memory spaces:
- Shared Memory: Actors residing on the same physical machine can communicate directly using shared memory, which provides low-latency interaction.
- Network Sockets: Network sockets are used for inter-machine communication. This introduces little overhead compared to higher-level messaging protocols.
With physical actors you will usually have access to an ActorRef or similar data structure. This represents a reference to a specific instance of an actor in the system (local or remote). If the actor dies, then the ActorRef becomes invalid and the communication channel is broken. Attempting to send a message on a broken channel will result in a failure at the sender. This gives the sender the opportunity to take a different action on failure, but it also requires managing the lifetime of the communication channel manually.
Many actor libraries use physical actors because they are simple to implement while also providing the bare minimum needed to fulfill the actor model without additional overhead.
Virtual Actors (Grains)
This approach is used by Microsoft Orleans, a framework that leverages a distributed computing model. In this paradigm, actors, or "grains," are essentially virtualized and opaque objects to the caller. If actor A wishes to communicate with actor B, it simply sends a message to B. The Orleans runtime handles the intricacies of managing these grains. Specifically:
- Automatic Activation: If actor B is not already operational when actor A attempts to send a message, the runtime automatically activates or "spins up" actor B.
- Transparent Location: Actor B can be located anywhere within the distributed system. The exact location becomes irrelevant since Orleans manages the communication channels between actors.
- Message Queueing and Routing: Messages are placed in a queue specific to the grain (actor) until it is ready to process them. This mechanism ensures that message delivery is reliable and follows a FIFO order where applicable.
With virtual actor systems, you'll have access to a VirtualActorRef or similar data structure. This represents the "identity" of an actor. For example, if you query for actor Foo with identity 123, then the returned VirtualActorRef will always route to Foo123 even if that actor dies. The communication channel to Foo123 always exists because the actor runtime performs work behind the scenes to create and maintain the communication channel regardless of the current system state.
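The activate-on-demand idea can be sketched in a few lines. This is not the Orleans API; VirtualRuntime, GreeterGrain, and deactivate are hypothetical names, delivery is synchronous, and everything lives in one process.

```python
class GreeterGrain:
    """Hypothetical grain: created by the runtime, never by the caller."""

    def __init__(self, identity):
        self.identity = identity
        self.calls = 0

    def handle(self, message):
        self.calls += 1
        return f"{self.identity} got {message} (call {self.calls})"


class VirtualRuntime:
    """Hypothetical runtime that routes by identity and activates on demand."""

    def __init__(self, factory):
        self._factory = factory
        self._active = {}     # identity -> live instance
        self.activations = 0

    def send(self, identity, message):
        actor = self._active.get(identity)
        if actor is None:
            # Automatic activation: the sender never notices the actor was absent.
            actor = self._factory(identity)
            self._active[identity] = actor
            self.activations += 1
        return actor.handle(message)

    def deactivate(self, identity):
        # Simulates the instance dying or being reclaimed.
        self._active.pop(identity, None)


runtime = VirtualRuntime(GreeterGrain)
first = runtime.send("Foo123", "hello")
runtime.deactivate("Foo123")
second = runtime.send("Foo123", "hello again")
```

The second send succeeds even though the instance was removed, because the identity "Foo123" is what the caller addresses. In this sketch the in-memory call counter resets on reactivation; a real runtime would typically reload persisted state.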
Event-Driven Channels
Event-driven channels involve actors listening to specific events or signals:
- Subscription Model: Actors can subscribe to certain types of events or messages. When these events occur, the event source sends out notifications that are delivered to interested actors.
- Asynchronous Notification: This model is inherently asynchronous, meaning that actors do not have to actively query for events but instead reactively process them.
Event-driven channels can use a global message broker. Actors subscribe to events on the broker and also send messages to the broker. The message broker then becomes responsible for routing messages to the correct subscribers.
Using a global message broker decouples actors from direct ActorRef communication, eliminating concerns about broken links. However, this setup introduces indirection, potentially increasing overhead in message transmission. Additionally, there's a risk of message loss unless acknowledged by the recipient. We assume continuous availability of the broker, ensuring that messages can always be sent; however, delivery is not guaranteed without waiting for an acknowledgment.
System Design
It takes some up-front effort to design an actor system, but it has several benefits:
-
Near 1:1 translation of design to implementation
In actor systems, each component (actor) is independent and communicates with other components via message passing. This parallels a common architectural approach where design documents outline discrete responsibilities and interactions between system components.
-
Clarity and Traceability
Because actors are isolated units, the flow of communication (messages) between them can be easily mapped back to the design documents (or vice-versa). This clear separation makes it simpler to understand how different parts of the system work together, improving traceability and accountability.
-
Ease of Debugging
With well-defined communication channels, debugging becomes more straightforward. You can trace message flows through actors, making it easier to pinpoint where issues occur.
-
Flexibility for Change
Design documents that closely mirror code allow for more flexible changes. If requirements evolve, modifying the design document first provides a clear path for updating the implementation without introducing bugs.
For these reasons, it's recommended that design documents be used for all actor-based projects regardless of size.
Supervision
Actor supervision is a key concept when using the actor model.
What is Actor Supervision?
In an actor system, implementing hierarchical supervision is a best practice. When an actor acts as a parent by spawning another actor (the child), the parent should take on the role of supervisor for the spawned child. The rationale behind this is that the parent has the knowledge of how to spawn the child. Therefore, in cases where the child terminates, the parent is equipped to respawn it.
The supervision hierarchy forms a tree structure where each supervisor is responsible for spawning its immediate children. This setup allows the entire system to be "bootstrapped" by starting with the root-level actor, known as the system supervisor. The system supervisor spawns its children, which can also be supervisors, and these in turn spawn their own children, continuing until the entire system is online. Since supervisors manage only their immediate children, this design facilitates restarting arbitrary parts of the system as needed.
Supervision strategies include:
-
Resume
The failing child actor is resumed, and its internal state is preserved. This is suitable for transient errors that do not corrupt the actor's state and where continuing processing is safe.
-
Restart
The failing child actor is stopped, a new instance of the actor is created with a fresh internal state, and the new instance resumes processing. This is typically used when the actor's state might be corrupted by the failure and a clean slate is needed. The message that caused the failure is usually skipped.
-
Stop
The failing child actor is permanently terminated. This directive is used for non-recoverable errors or when the failed actor is no longer needed.
-
Escalate
The failure is sent up to the supervisor's own supervisor (its parent). This is used when the current supervisor cannot handle the specific failure and delegates the decision to a higher level in the hierarchy.
A supervision strategy is activated by a supervision event. When a problem occurs with a child actor, the actor system itself will generate a supervision event and forward it to the child's supervisor. The supervisor can then apply its desired strategy based on its own state and the state of the child actor.
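The four strategies can be sketched as a small decision step inside a supervisor. The names below (Directive, SummingChild, Supervisor, run) are hypothetical, and an exception raised by the child's handler stands in for the supervision event that a real actor system would generate.

```python
from enum import Enum


class Directive(Enum):
    RESUME = "resume"
    RESTART = "restart"
    STOP = "stop"
    ESCALATE = "escalate"


class SummingChild:
    """Hypothetical child actor that fails on non-numeric messages."""

    def __init__(self):
        self.total = 0

    def handle(self, message):
        self.total += int(message)  # int("oops") raises ValueError


class Supervisor:
    def __init__(self, decide):
        self.decide = decide         # maps an exception to a Directive
        self.child = SummingChild()  # the parent knows how to spawn its child

    def deliver(self, message):
        if self.child is None:
            return                   # child was stopped; nothing to deliver to
        try:
            self.child.handle(message)
        except Exception as error:
            directive = self.decide(error)
            if directive is Directive.RESTART:
                self.child = SummingChild()  # fresh state, failing message skipped
            elif directive is Directive.STOP:
                self.child = None
            elif directive is Directive.ESCALATE:
                raise                        # let the next level up decide
            # Directive.RESUME: keep the child and its state as they are


def run(directive):
    supervisor = Supervisor(lambda error: directive)
    for message in ["1", "oops", "2"]:
        supervisor.deliver(message)
    return supervisor.child


resumed = run(Directive.RESUME)
restarted = run(Directive.RESTART)
stopped = run(Directive.STOP)
```

With Resume the child keeps its state across the failure (total 3); with Restart the state accumulated before the failure is lost (total 2); with Stop there is no child left to receive the final message.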
Benefits of Actor Supervision
-
Failure Isolation: By isolating failures at the actor level, supervision ensures that an error in one part of the application does not affect the rest.
-
Fault Tolerance: Applications can be designed to handle failures gracefully, providing a high availability of services even when parts of the system fail.
-
Ease of Debugging and Maintenance: Clearly defined recovery strategies make debugging easier and maintenance simpler.
Direct Actor Communication
Direct actor communication relies on one actor holding a reference (ActorRef) to a specific instance of another actor. Messages are sent directly from the sender actor's ActorRef to the receiver actor's ActorRef. This approach is straightforward and provides a clear path for point-to-point interactions.
However, a significant challenge arises when the target actor instance terminates. Because the ActorRef points to a specific instance, if that instance dies (due to an exception, planned shutdown, or other reason), the ActorRef becomes a "dead letter" reference. Any subsequent messages sent to this dead ActorRef will typically be routed to a system-defined dead letter mailbox, never reaching the intended recipient. The communication channel between the two specific instances is effectively broken.
Handling Channel Failure
Strategies for dealing with this broken channel include:
-
Death Watch
The sending actor can "watch" the receiving actor. When the receiving actor terminates, the sender receives a Terminated message. This signals the sender that the ActorRef is now stale and should no longer be used. The "watch" capability can be achieved by linking actors together in a supervision tree in order to be notified when the actor dies.
-
Message Delivery Failure
While not guaranteed in all actor systems, some systems might provide feedback (e.g., via acknowledgements or specific error messages) if a message cannot be delivered to a live instance. This would result in an error when attempting to send a message to the dead actor.
-
Protocol-Level Acknowledgements
Design the communication protocol such that the receiver explicitly acknowledges processing messages. If an acknowledgement isn't received within a timeout, the sender can infer failure, potentially due to a dead receiver.
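A protocol-level acknowledgement can be sketched with two queues and a timeout. The function names (receiver, send_with_ack) and the ("ack", payload) reply shape are hypothetical; a mailbox with no thread reading it stands in for a dead receiver.

```python
import queue
import threading


def receiver(mailbox):
    """Hypothetical receiving actor: acknowledges every message it processes."""
    while True:
        message = mailbox.get()
        if message is None:  # shutdown signal
            break
        payload, reply_to = message
        reply_to.put(("ack", payload))


def send_with_ack(mailbox, payload, timeout):
    """Send a message and wait briefly for an explicit acknowledgement."""
    reply_to = queue.Queue()
    mailbox.put((payload, reply_to))
    try:
        return reply_to.get(timeout=timeout)
    except queue.Empty:
        return None  # no ack in time: infer that the receiver may be dead


live_mailbox = queue.Queue()
live_thread = threading.Thread(target=receiver, args=(live_mailbox,), daemon=True)
live_thread.start()

dead_mailbox = queue.Queue()  # nobody is reading from this one

acked = send_with_ack(live_mailbox, "order-1", timeout=2.0)
unacked = send_with_ack(dead_mailbox, "order-2", timeout=0.05)

live_mailbox.put(None)
live_thread.join()
```

Blocking on the reply is used here only to keep the sketch short. Inside an actor, the sender would more likely schedule a timeout message to itself so that its handler stays non-blocking.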
Design Considerations
Given that a direct ActorRef connection to a specific instance can fail, architectural designs using this pattern should account for potential disruptions. Here are some strategies that can be used to manage recovery.
Finding a Replacement
If the target actor was part of a supervised hierarchy and is restarted, the sender needs a new ActorRef for the replacement instance. This often requires a mechanism outside the direct reference, such as:
- Querying a parent supervisor or registry for the new ActorRef.
- Using a naming service where actors register themselves by a logical ID.
In both of these scenarios, there will be a delay between the restart of the dead actor and the propagation of its new ActorRef to a registry or naming service. The sending actor should therefore retry with exponential backoff, because it may receive the same (dead) ActorRef from a registry while the dead actor restarts.
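A retry loop with exponential backoff might look like the sketch below. The lookup, is_alive, and sleep hooks are hypothetical and supplied by the caller; the simulated registry simply returns the stale reference twice before returning the replacement.

```python
import time


def resolve_with_backoff(lookup, is_alive, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Ask a registry for a reference until it returns a live one.

    lookup, is_alive, and sleep are hypothetical hooks supplied by the caller.
    """
    delay = base_delay
    for attempt in range(attempts):
        ref = lookup()
        if ref is not None and is_alive(ref):
            return ref
        if attempt < attempts - 1:
            sleep(delay)  # wait before asking again
            delay *= 2    # double the wait each time
    return None


# Simulated registry: returns the dead reference twice, then the replacement.
answers = iter(["cart#1", "cart#1", "cart#2"])
delays = []  # record the waits instead of actually sleeping
ref = resolve_with_backoff(
    lookup=lambda: next(answers),
    is_alive=lambda candidate: candidate != "cart#1",
    sleep=delays.append,
)
```

The call resolves to "cart#2" after two waits, the second twice as long as the first. Injecting sleep keeps the example fast and makes the backoff schedule easy to inspect.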
Idempotency
While not directly addressing ActorRef, it's still important to design messages and receiver logic to be idempotent where possible. This allows the sender to safely retry sending a message (potentially to a newly acquired ActorRef for a replacement actor) without causing unintended side effects if the original message was partially processed before the crash.
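One common way to get idempotent handling is to tag each message with a unique ID and have the receiver remember which IDs it has already applied. The sketch below assumes hypothetical names (PaymentActor, message_id) and keeps the seen set in memory; surviving a restart would require persisting it together with the state.

```python
class PaymentActor:
    """Hypothetical receiver that applies each message ID at most once."""

    def __init__(self):
        self.balance = 0
        self._seen = set()  # IDs of messages already applied

    def handle(self, message_id, amount):
        if message_id in self._seen:
            return "duplicate"  # a retry of an already-applied message
        self._seen.add(message_id)
        self.balance += amount
        return "applied"


payments = PaymentActor()
results = [
    payments.handle("msg-1", 50),
    payments.handle("msg-1", 50),  # sender retried after a timeout
    payments.handle("msg-2", 25),
]
```

The retried "msg-1" is recognized and ignored, so the balance ends at 75 rather than 125.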
Fault Tolerance Boundaries
Clearly define which parts of the system rely on direct actor communication and implement robust error handling and recovery strategies at those boundaries.
Put flaky connections and unstable crash-prone actors behind a "router" actor. Communication can then be reliably made to the router actor, which then relays the message to the unstable ones. The router actor can include the retry and queuing mechanisms which will allow senders to send messages without worrying about error handling.
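A router of this kind can be sketched as follows. FlakyService, Router, max_attempts, and the parked list are hypothetical names, and ConnectionError stands in for whatever failure signal the real target produces.

```python
class FlakyService:
    """Hypothetical unstable actor: fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def handle(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("target unavailable")
        self.received.append(message)


class Router:
    """Hypothetical router that hides retries and queuing from senders."""

    def __init__(self, target, max_attempts=3):
        self.target = target
        self.max_attempts = max_attempts
        self.parked = []  # messages kept for a later retry

    def send(self, message):
        for _ in range(self.max_attempts):
            try:
                self.target.handle(message)
                return True
            except ConnectionError:
                continue
        self.parked.append(message)  # give up for now, but do not lose it
        return False


recovering = FlakyService(failures=2)
delivered = Router(recovering).send("a")

broken = FlakyService(failures=5)
broken_router = Router(broken)
parked = broken_router.send("b")
```

The first message gets through on the third attempt; the second exhausts its attempts and is parked inside the router, so the sender never has to handle the error itself.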
Pub/Sub Communication
Publish/Subscribe (Pub/Sub) is a messaging pattern where senders (publishers) do not send messages directly to specific receivers (subscribers). Instead, publishers categorize messages into topics or channels without knowing which subscribers, if any, will receive them. Subscribers express interest in one or more topics and receive all messages published to the topics they are subscribed to.
In an actor system context, actors can act as publishers, subscribers, or both. An actor publishes a message to a specific topic managed by a dedicated Pub/Sub service or mediator actor within or external to the system. Other actors subscribe to this topic via the same service.
Pub/Sub and Event-Driven Design
This pattern aligns closely with event-driven design. Publishers emit "events" (messages) describing something that has happened (e.g., UserCreated, OrderPlaced, ItemUpdated). Subscribers, acting upon these events, react to changes in the system state without needing to know the source of the event. This creates a loosely coupled system where components interact primarily by reacting to a stream of events.
Trade-offs: Overhead vs. Management
A key characteristic of Pub/Sub is that all messages for a given topic must pass through the Pub/Sub service or mediator. This introduces a layer of indirection and potential overhead. The service needs to receive the message, determine which subscribers are interested, and relay the message to each of them. This adds latency compared to a direct actor-to-actor message send, and consumes resources in the Pub/Sub infrastructure itself.
However, this overhead comes with significant benefits, primarily in alleviating the management of direct communication channels. With Pub/Sub:
- Publishers do not need ActorRefs for specific subscribers. They only need to know the Pub/Sub service and the topic.
- The system becomes more resilient to the failure of individual subscriber instances. If a subscriber actor dies, the publisher is unaffected, and other subscribers continue to receive messages. The responsibility of ensuring a subscriber receives messages (potentially after a restart) often shifts to the Pub/Sub infrastructure or the subscriber's supervisor/registration mechanism.
- Subscribers can join or leave topics dynamically without affecting publishers or other subscribers.
- Adding new functionality to the system becomes straightforward: create a new actor, subscribe it to the relevant topic, and it is integrated.
Compared to direct ActorRef communication, where managing dead letters and finding replacement actor instances is a concern for the sender, Pub/Sub shifts the complexity. The challenge moves from managing many point-to-point connections to ensuring the reliability and scalability of the central Pub/Sub service and how actors register/deregister their subscriptions.
Hybrid Communication
A common and effective strategy in larger actor systems, especially those adhering to Domain-Driven Design principles, is to employ a hybrid communication model. This approach uses Publish/Subscribe for communication between distinct Bounded Contexts, and direct ActorRef messaging for communication within a Bounded Context.
A Bounded Context can be thought of as a logical boundary encompassing a set of related actors, data, and business logic that operates around a specific part of the domain (e.g., Order Management, Inventory, Shipping).
Inter-Context Communication via Pub/Sub
When functionality in one bounded context needs to interact with or be notified of events in another context, Pub/Sub is utilized. For example:
- An Order Management context publishes an OrderPlaced event to a topic.
- An Inventory context subscribes to the OrderPlaced topic to decrement stock levels.
- A Shipping context also subscribes to the OrderPlaced topic to initiate the shipping process.
Rationale: Pub/Sub provides loose coupling between contexts. The Order Management context doesn't need to know who is interested in an OrderPlaced event, only that it happened. This allows contexts to evolve independently. New contexts can subscribe to existing events without modifying the publisher. It abstracts away the location and specific instances of actors in other contexts. Failures in one subscriber context do not directly impact the publisher or other subscribers.
Intra-Context Communication via Direct ActorRef
Within a single bounded context, actors often have close relationships and need to interact directly and efficiently. For example:
- Within the Order Management context, an OrderAggregate actor might communicate directly with OrderItem actors or a PaymentProcessor actor within the same context.
Rationale: Direct ActorRef communication within a context is typically faster than routing through a Pub/Sub layer, because it skips the extra hop through the mediator. Actors within the same context are often managed together (e.g., by the same supervisor), and their interactions are tightly coupled by the context's specific business logic. The overhead and indirection of Pub/Sub are unnecessary and counterproductive for these direct, focused interactions. While direct ActorRefs require managing references and handling potential instance failures, this complexity is contained within the boundary of the context. The context's internal supervision and management strategies can handle failures of its internal actors.
Benefits of the Hybrid Approach
This hybrid model leverages the strengths of both patterns:
- Loose Coupling Between Contexts: Facilitates independent development, deployment, and scaling of different parts of the system. Changes within one context are less likely to break others, provided the published event contracts remain stable.
- Efficient Interaction Within Contexts: Enables high-performance, direct communication for tightly related actors and operations within a specific domain area.
- Containment of Complexity: The complexity of managing direct ActorRef lifecycles and failures is contained within the context boundary, while the complexity of managing subscriptions and message relay is centralized in the Pub/Sub layer.
- Clear Boundaries: Reinforces the separation defined by the bounded contexts, making the system architecture easier to understand and maintain.
Implementation Concerns
Note that actors within a context should not communicate directly with actors outside it. All messages crossing the context boundary must be routed through a gateway actor, which acts as the only entry and exit point for the context. This gateway actor also serves as an anti-corruption layer, translating messages at the context boundary. This allows the context to evolve over time without affecting the rest of the system.
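To make the translation role of the gateway concrete, here is a small Python sketch. The `InventoryGateway` class, the `ReserveStock` command, and the dictionary shape of the external event are all invented for this example; the only point is that the external contract is converted into an internal message type at the boundary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReserveStock:
    """Internal command used only inside the (hypothetical) Inventory context."""
    order_id: str
    sku: str
    quantity: int


class InventoryGateway:
    """Sole entry point for the context; translates external messages."""

    def __init__(self, forward: Callable[[ReserveStock], None]) -> None:
        self._forward = forward  # delivers a command to an internal actor

    def on_external_event(self, event: dict) -> None:
        # Ignore events this context does not care about.
        if event.get("type") != "OrderPlaced":
            return
        # Translate the external OrderPlaced contract into internal commands.
        for line in event["lines"]:
            self._forward(ReserveStock(event["order_id"], line["sku"], line["qty"]))
```

If the external event format changes, only `on_external_event` needs to be updated; actors inside the context keep receiving `ReserveStock`.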
Request-Response
Direct-to-target
Here are the steps for implementing a direct-to-target request/response pattern using a Pub/Sub message broker. This involves creating unique custom subscriptions per actor so that the responder knows where to publish their response.
1. Responding Actor Subscribes: The actor designed to respond (the "responder") subscribes to the Pub/Sub broker for a specific query topic, e.g., topic/query/my_thing. This tells the broker to forward all messages published to this topic to the responder actor.
2. Requesting Actor Prepares and Subscribes: The actor initiating the request (the "requester") generates a unique reply topic (e.g., based on its own ID and the request ID) and then subscribes to this unique topic with the broker. This ensures the requester will receive messages specifically addressed back to it.
3. Requesting Actor Publishes Request: The requester actor publishes the request message to the responder's query topic (topic/query/my_thing). The message payload includes the necessary query data and the unique reply topic generated in step 2 (often in a field like reply_to).
4. Broker Routes Request: The Pub/Sub broker receives the message published to topic/query/my_thing. Based on the subscription from step 1, the broker routes this message to the responding actor.
5. Responding Actor Processes and Publishes Response: The responder actor receives the request message, processes it, generates a response, and extracts the reply_to topic from the incoming message. It then publishes the response message to this extracted reply_to topic. The responder does not need to know the identity or ActorRef of the requester.
6. Broker Routes Response: The Pub/Sub broker receives the message published to the unique reply_to topic. Based on the subscription from step 2, the broker routes this response message to the requesting actor.
7. Requesting Actor Receives Response: The requester actor receives the response message on its unique reply topic.
This pattern effectively decouples the requesting and responding actors, as they only need to know about the Pub/Sub broker and agreed-upon topics, not each other's specific addresses.
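The steps above can be sketched in Python as follows. The `Broker`, `Responder`, and `Requester` classes are hypothetical stand-ins, the reply topic format `reply/<name>/<n>` is an assumption, and delivery is synchronous for brevity; a real system would deliver through actor mailboxes.

```python
from collections import defaultdict
from itertools import count


class Broker:
    """Hypothetical synchronous broker: topic -> list of handlers."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, message):
        for handler in list(self._subs[topic]):
            handler(message)


class Responder:
    def __init__(self, broker):
        self._broker = broker
        broker.subscribe("topic/query/my_thing", self.on_query)  # step 1

    def on_query(self, msg):
        # Step 5: reply on whatever topic the request named in reply_to.
        self._broker.publish(msg["reply_to"], {"result": msg["query"].upper()})


class Requester:
    _ids = count(1)  # shared counter used to build unique reply topics

    def __init__(self, broker, name):
        self._broker = broker
        self._name = name
        self.responses = []

    def request(self, query):
        reply_to = f"reply/{self._name}/{next(self._ids)}"       # step 2
        self._broker.subscribe(reply_to, self.responses.append)
        self._broker.publish(                                    # step 3
            "topic/query/my_thing", {"query": query, "reply_to": reply_to}
        )
```

Note that this sketch never unsubscribes from the per-request reply topic. In practice the requester would remove the subscription once the response arrives or a timeout expires.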
sequenceDiagram
    participant Requester as Requesting Actor
    participant Broker as Pub/Sub Broker
    participant Responder as Responding Actor
    Note over Requester,Responder: Setup Phase
    Responder->>Broker: Subscribe("topic/query/my_thing") (when actor spawns)
    Requester->>Broker: Subscribe("unique/reply/topic")
    Note over Requester,Responder: Request/Response Phase
    activate Requester
    Requester->>Broker: Publish("topic/query/my_thing", reply_to="unique/reply/topic")
    activate Broker
    Broker-->>Responder: Message("topic/query/my_thing", reply_to="unique/reply/topic")
    deactivate Broker
    activate Responder
    Responder->>Broker: Publish("unique/reply/topic", response_data)
    deactivate Responder
    activate Broker
    Broker-->>Requester: Message("unique/reply/topic", response_data)
    deactivate Broker
    deactivate Requester
Gateway Correlation ID Mapping
The direct request/response pattern using unique, per-request reply_to topics requires each requesting actor to manage its own temporary subscription. This can become cumbersome, especially if the requesting actors are short-lived or if there are a very large number of concurrent requests. An alternative approach leverages a central Gateway Actor to manage the response routing using Correlation IDs.
In this pattern:
1. Gateway Subscription: A dedicated Gateway Actor maintains a single, long-lived subscription to a fixed reply_to topic (e.g., gateway/replies). This is the only topic responses are sent to via the broker.
2. Responder Subscription: Responding actors subscribe to their specific query topics with the broker, same as before (e.g., topic/query/my_thing).
3. Request from Internal Actor: A requesting actor (behind the Gateway) sends a request message directly to the Gateway Actor, including the intended query topic and payload.
4. Gateway Action - Publish with Correlation ID: The Gateway Actor receives the internal request. It generates a unique Correlation ID (or reads the one the actor generated) for this request and stores a mapping internally, linking this Correlation ID to the specific requesting actor's ActorRef. The Gateway then publishes the request message to the broker using the intended query topic (topic/query/my_thing). The crucial difference is that the reply_to field in the published message is set to the Gateway's own fixed reply topic (gateway/replies), and the message payload includes the generated Correlation ID.
   - Gateway Map: CorrelationID -> RequestingActorRef
   - Published Message: Publish to topic/query/my_thing with payload including query data and correlation_id, reply_to: gateway/replies.
5. Broker Routes Request: The Pub/Sub broker routes the request message to the subscribing responding actor(s) based on the query topic.
6. Responding Actor Action - Publish Response: The responding actor receives the request. It processes it and prepares a response. It takes the correlation_id from the incoming message and includes it in the response payload. It then publishes the response message to the topic specified in the reply_to field of the incoming message, which is the Gateway's fixed reply topic (gateway/replies).
   - Published Message: Publish to gateway/replies with payload including response data and the same correlation_id.
7. Broker Routes Response: The Pub/Sub broker routes the response message to the Gateway Actor because the Gateway is subscribed to gateway/replies.
8. Gateway Action - Route to Requester: The Gateway Actor receives the response message. It extracts the correlation_id from the payload. Using its internal mapping, it looks up the ActorRef of the original requesting actor associated with that Correlation ID. It then sends the response message directly to the requesting actor's ActorRef. After routing, the Gateway may remove the mapping entry for this Correlation ID (depending on whether duplicate responses are possible).
   - Action: Receive message on gateway/replies. Extract correlation_id. Look up RequestingActorRef in internal map. Send response message directly to RequestingActorRef.
sequenceDiagram
    participant ReqInternal as Requesting Actor (Internal)
    participant Gateway as Gateway Actor
    participant Broker as Pub/Sub Broker
    participant Responder as Responding Actor (External)
    Note over ReqInternal,Responder: Setup Phase
    Gateway->>Broker: Subscribe("gateway/replies")
    Responder->>Broker: Subscribe("topic/query/my_thing")
    Note over ReqInternal,Responder: Request/Response Phase
    activate ReqInternal
    ReqInternal->>Gateway: Request(query_data)
    deactivate ReqInternal
    activate Gateway
    Gateway->>Gateway: Generate/Store Mapping (CorrID -> ReqInternal)
    Gateway->>Broker: Publish("topic/query/my_thing", correlation_id, reply_to="gateway/replies")
    deactivate Gateway
    activate Broker
    Broker-->>Responder: Message("topic/query/my_thing", correlation_id, reply_to="gateway/replies")
    deactivate Broker
    activate Responder
    Responder->>Broker: Publish("gateway/replies", correlation_id, response_data)
    deactivate Responder
    activate Broker
    Broker-->>Gateway: Message("gateway/replies", correlation_id, response_data)
    deactivate Broker
    activate Gateway
    Gateway->>Gateway: Lookup Mapping (CorrID -> ReqInternal)
    Gateway->>ReqInternal: Response(response_data)
    deactivate Gateway
Benefits
- Simplifies Requesting Actors: Requesting actors no longer need to manage individual subscriptions or unique reply topics. They simply send their request to the well-known Gateway.
- Better for Short-Lived Actors: This pattern is ideal when requesting actors are transient, as the long-lived Gateway handles the durable subscription needed to receive the response.
- Centralized Management: The complexity of correlating requests and responses is centralized in the Gateway.
Considerations
- State Management in Gateway: The Gateway must reliably store and manage the Correlation ID to ActorRef mapping. This requires careful consideration of potential Gateway restarts, memory usage for long-running transactions, and handling potential timeouts or orphaned mappings if a requesting actor dies before receiving a response.
- Single Point of Failure: If the Gateway actor fails, all in-flight requests awaiting responses will be lost unless the Gateway's state is persisted and recoverable.
This pattern is particularly useful when you have a boundary (like a Bounded Context boundary managed by the Gateway/Router actor) and many internal actors that need to initiate requests to external services or actors via a Pub/Sub bus without exposing their individual presence or managing complex per-request subscriptions.
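The core of the Gateway is the Correlation ID map. The following Python sketch shows one way it could work, under several assumptions: `Broker`, `Gateway`, and `make_responder` are hypothetical names, a plain callback stands in for the requester's ActorRef, delivery is synchronous, and timeouts for orphaned mappings are omitted.

```python
import uuid
from collections import defaultdict


class Broker:
    """Hypothetical synchronous broker: topic -> list of handlers."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, message):
        for handler in list(self._subs[topic]):
            handler(message)


class Gateway:
    REPLY_TOPIC = "gateway/replies"

    def __init__(self, broker):
        self._broker = broker
        self._pending = {}  # correlation_id -> callback (stand-in for an ActorRef)
        broker.subscribe(self.REPLY_TOPIC, self._on_reply)

    def request(self, topic, payload, reply_callback):
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = reply_callback  # store before publishing
        self._broker.publish(topic, {
            "payload": payload,
            "correlation_id": correlation_id,
            "reply_to": self.REPLY_TOPIC,
        })
        return correlation_id

    def _on_reply(self, msg):
        # pop() removes the mapping, so duplicate or unknown replies are ignored.
        callback = self._pending.pop(msg["correlation_id"], None)
        if callback is not None:
            callback(msg["payload"])

    @property
    def pending_count(self):
        return len(self._pending)


def make_responder(broker):
    """Register a responder that doubles the payload and echoes the correlation_id."""
    def on_query(msg):
        broker.publish(msg["reply_to"], {
            "correlation_id": msg["correlation_id"],
            "payload": msg["payload"] * 2,
        })
    broker.subscribe("topic/query/my_thing", on_query)
```

Because the map lives only in memory here, a Gateway restart would lose all in-flight mappings, which is exactly the concern raised under Considerations.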
Messaging Concepts
Beyond the fundamental patterns of sending and receiving messages, building robust and observable actor systems requires understanding certain key concepts related to message lifecycle, traceability, and delivery guarantees.
This section covers essential messaging concepts that provide support for debugging, monitoring, and ensuring the reliability of communication within and between actors.
Correlation IDs
In distributed systems and asynchronous messaging architectures, operations often span multiple messages, actors, or services. Tracking the flow of a single logical transaction or request through this series of interactions can be challenging. This is where Correlation IDs become invaluable.
A Correlation ID is a unique identifier assigned to the initial message or event of a specific transaction or workflow. This same ID is then included in all subsequent messages or events that are part of that same logical flow.
Correlation IDs are not typically used for general system-level messages (like supervision signals, acknowledgements handled by the framework, or broadcast system events) or for tracking the overall health of the messaging infrastructure itself. They are specifically for tracing the path and state of a specific business process or workflow that spans multiple actors or services.
Think of it as monitoring an individual product as it moves through different stations on an assembly line (the workflow), rather than monitoring the health or overall throughput of the assembly line machinery itself. Some examples of where you would use Correlation IDs in a workflow include:
- Tracking an individual order as it moves from Order Placed -> Payment Processed -> Inventory Reserved -> Shipped.
- Following a user's request through several microservices or actor interactions to fulfill that request.
- Monitoring the steps involved in processing a single incoming data record through a processing pipeline.
By limiting Correlation IDs to business workflows, they provide highly relevant context for debugging and understanding the journey of individual requests, without cluttering system-level communication.
Purpose and Usage
When Actor A sends a message that triggers a series of actions involving Actors B, C, and potentially others, Actor A generates a unique Correlation ID and includes it in the message sent to Actor B. When Actor B processes this message and sends a new message to Actor C, it copies the Correlation ID from the message it received and includes it in the message sent to C. This continues throughout the entire chain.
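A small Python sketch of this propagation rule is shown below. The `Message` type and its `follow_up` helper are hypothetical; the essential behavior is that a new ID is generated only for the first message of a workflow and copied into every message derived from it.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    kind: str
    # A fresh ID is generated only when the caller does not supply one,
    # i.e. for the first message of a workflow.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def follow_up(self, kind: str) -> "Message":
        # Copy the incoming correlation_id instead of generating a new one.
        return Message(kind=kind, correlation_id=self.correlation_id)


# Actor A starts the workflow; B and C derive their messages from what they received.
placed = Message("OrderPlaced")
paid = placed.follow_up("PaymentProcessed")
reserved = paid.follow_up("InventoryReserved")
```

All three messages carry the same `correlation_id`, so a log search for that one value returns the whole chain.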
Benefits
- Tracing and Debugging: By searching logs or message queues for a specific Correlation ID, developers and operators can easily follow the exact path a request took, identify which actors processed it, and pinpoint where failures or delays occurred. This is particularly crucial in complex systems with many interconnected components.
- Request/Response Matching: In the Pub/Sub request/response pattern, the Correlation ID allows the requester to match the incoming response to the specific request it sent, especially if multiple requests might be outstanding simultaneously. The requester includes a Correlation ID in its initial request, and the responder includes the same ID in the response.
- Auditing and Monitoring: Correlation IDs provide a mechanism to audit the execution path of specific transactions for compliance or performance analysis.
- Context Propagation: Beyond just tracing, Correlation IDs can help propagate context across service boundaries without requiring components to understand the full state of the transaction.
Implementation
Implementing Correlation IDs is fairly simple and typically involves:
- Generating a unique ID (e.g., a UUID) when a new transaction starts.
- Adding a dedicated field (e.g., correlation_id) to message structures.
- Ensuring that any actor or component processing a message and initiating subsequent messages copies the Correlation ID from the incoming message to the outgoing message.
Tell
In actor systems, the most fundamental message sending pattern is often referred to as tell, send, or cast (terminology varies between frameworks). This pattern is characterized by sending a message to a target actor's address or ActorRef without expecting an immediate response.
Key characteristics of the tell pattern:
- Asynchronous: The sender actor does not pause or block its execution after sending the message. It immediately returns to processing its next instruction or message.
- Unidirectional: The message flows in one direction only, from sender to receiver. There is no built-in mechanism in the tell operation itself to receive a reply.
- Fire-and-Forget: From the sender's perspective, it dispatches the message and moves on. It doesn't wait for confirmation of receipt or processing completion (though underlying transport might offer some level of delivery guarantee).
Importance for Event-Driven Architectures
The tell pattern is the bedrock of highly concurrent, scalable, and event-driven architectures built on actors. It aligns perfectly with the core principles of reacting to events:
- Decoupling: Publishers of events (tell messages) do not need to know who or how many actors are listening (subscribing) or what they will do with the event. They simply announce that "this happened" by sending the message. This creates immense decoupling between event sources and event consumers.
- Non-Blocking Operations: Because the sender doesn't wait for a reply, it can continue processing other tasks or handling incoming messages. This is crucial for maintaining responsiveness and throughput, especially when dealing with high volumes of events.
- Scalability: Since senders aren't blocked, they can handle a high rate of outgoing messages. Receivers (subscribers) process messages asynchronously from the sender, allowing the system to scale by adding more processing power to handle incoming message loads.
- Resilience: The failure of a recipient actor (e.g., one subscriber to an event) does not directly impact the sender or other recipients. The sender has already "fired and forgotten." While the specific message to the failed actor might become a "dead letter" (depending on the framework), the overall event flow from the publisher is not stalled.
- Natural Fit for Events: Events inherently represent something that has happened. The nature of an event is that it is broadcast for anyone interested to react. Tell is the natural way to broadcast or publish such information without requiring explicit coordination or response from each potential listener.
Contrast this with synchronous patterns where a sender calls a function and waits for a return value, or an asynchronous request/response pattern where the sender actively awaits a specific reply message. While these patterns are necessary for certain interactions, relying on them for core event propagation would quickly lead to bottlenecks, tight coupling, and reduced scalability in an event-driven system.
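For reference, a fire-and-forget send can be sketched with Python's asyncio as follows. The `Actor` class is a hypothetical minimal actor built from an `asyncio.Queue` mailbox, and the `"stop"` sentinel is an assumption used only to end the demo.

```python
import asyncio


class Actor:
    """Hypothetical minimal actor: a mailbox plus a loop that drains it."""

    def __init__(self) -> None:
        self._mailbox: asyncio.Queue = asyncio.Queue()
        self.seen: list[str] = []

    def tell(self, message: str) -> None:
        # Enqueue and return immediately; no reply, no waiting.
        self._mailbox.put_nowait(message)

    async def run(self) -> None:
        while True:
            message = await self._mailbox.get()
            if message == "stop":
                break
            self.seen.append(message)


async def demo() -> list[str]:
    actor = Actor()
    task = asyncio.create_task(actor.run())
    actor.tell("UserCreated")  # returns before the actor handles the message
    actor.tell("OrderPlaced")
    actor.tell("stop")
    await task                 # wait only so the demo can inspect the result
    return actor.seen
```

`tell` is a plain synchronous method here: the sender never awaits anything, which is what keeps it free to continue with its own work.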
Ask (a.k.a. Request-Response)
The ask pattern is a way to achieve a request-response interaction between actors where the sending actor explicitly expects a reply. Unlike the basic tell where the sender fires and forgets, ask provides a mechanism for the sender to wait for a result corresponding to its specific request. This can be modeled as a Future and used with async/await.
Using ask is much closer to sequential programming. Code within the actor executes line-by-line and waits for responses each time.
Problems with Ask
While ask seems convenient, its use introduces significant complications that often violate core actor principles and lead to hard-to-debug issues. Almost all stem from the fact that the actor's processing flow is now waiting on an external event (the response arriving to complete the Future) rather than solely processing its incoming message queue.
- Occupies the Actor's Execution Context: Until the Future completes (or times out) and the response is received, the actor cannot process other messages that arrive in its mailbox.
- Cannot Gracefully Shutdown: Because the actor is busy waiting on the ask Future, it may not be able to process a shutdown message received in its mailbox in a timely manner. It might be stuck waiting for a reply that never comes or waiting for a timeout.
- Cannot Process Control Messages: Similar to shutdown, critical control messages (like state queries, configuration updates, or explicit stop/cancel requests relevant to the actor's current task) might be delayed or missed because the actor is occupied with waiting for the ask response.
- Memory Leaks and Resource Exhaustion: If a response message is genuinely lost or the target actor fails without sending a reply, the Future returned by ask will never complete. Without explicitly setting a timeout on every ask call, the actor's logic waiting on that Future will effectively hang indefinitely, consuming memory and preventing the actor from processing further messages. Managing timeouts reliably for every request adds complexity.
- Message Loss: Actor mailboxes often use bounded queues to prevent resource exhaustion. If an actor is stuck waiting on a Future, its mailbox will fill up and messages will start to be dropped.
For these reasons it's recommended that actor systems be designed around tell (fire-and-forget messaging) instead of ask.
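To illustrate why every ask needs a timeout, here is a hedged asyncio sketch. The `ask` helper, the `echo_actor` coroutine, and the `(request, reply_future)` message shape are assumptions made for this example rather than any framework's API.

```python
import asyncio


async def ask(mailbox: asyncio.Queue, request: str, timeout: float) -> str:
    """Hypothetical ask: send a request with a reply Future and await it."""
    reply = asyncio.get_running_loop().create_future()
    mailbox.put_nowait((request, reply))
    # Without wait_for, a lost reply would leave the caller waiting forever.
    return await asyncio.wait_for(reply, timeout)


async def echo_actor(mailbox: asyncio.Queue, respond: bool) -> None:
    while True:
        request, reply = await mailbox.get()
        # respond=False simulates a target that never answers.
        if respond and not reply.done():
            reply.set_result(request.upper())


async def demo(respond: bool) -> str:
    mailbox: asyncio.Queue = asyncio.Queue()
    actor = asyncio.create_task(echo_actor(mailbox, respond))
    try:
        return await ask(mailbox, "ping", timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"
    finally:
        actor.cancel()
```

While `ask` is awaiting, the calling coroutine makes no progress on anything else. If that caller is an actor's message loop, its mailbox is not being drained for the whole duration of the wait.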
Kinds of Messages
In designing actor systems and distributed architectures, especially those leaning towards event-driven principles or CQRS (Command Query Responsibility Segregation), it's helpful to categorize the different types of messages flowing through the system based on their purpose. The primary distinctions are typically made between Commands, Events, and Queries.
Commands (Expressing Intent)
Commands represent a user's or system's intent to perform an action or change the state of the system. They are imperative requests. A Command is a request for the system to do something.
- Nature: Imperative, a request.
- Direction: Usually directed at a specific actor or a well-defined endpoint responsible for handling that specific type of action (e.g., an aggregate actor in DDD).
- Expectation: The sender expects the command to be attempted and potentially result in a state change and/or the emission of one or more events. The sender might expect an acknowledgment of receipt or a more detailed response indicating success/failure or the outcome of the action.
- Naming Convention: Often named in the imperative mood (verb-noun), like CreateUser, PlaceOrder, ProcessPayment.
- Examples:
| Command | Description |
| --- | --- |
| CreateUser | Request to create a new user |
| DeleteAccount | Request to delete a user account |
| UpdateProfile | Modify user profile information |
| ProcessPayment | Initiate a payment transaction |
| PlaceOrder | Submit an order for processing |
| AssignTask | Assign a task to an actor |
| SendEmail | Trigger an email to be sent |
| ScheduleReminder | Schedule a reminder for future action |
| ApproveRequest | Approve a pending request |
| RetryJob | Retry a failed background job |
Events (Representing Facts)
Events represent something that has already happened in the system. They are declarative statements of fact about a state change. Events are the result of successfully processing a Command (or sometimes external occurrences).
- Nature: Declarative, a statement of fact.
- Direction: Usually published to a Pub/Sub topic or message bus, to be consumed by any interested party. Events are often broadcast.
- Expectation: Publishers of events typically use tell (fire-and-forget) and do not expect a direct response from consumers. Consumers react to the event asynchronously.
- Naming Convention: Often named in the past tense (noun-verb), like UserCreated, OrderPlaced, PaymentProcessed.
- Examples:
| Event | Description |
| --- | --- |
| UserCreated | A new user has been successfully created |
| AccountDeleted | A user account has been removed |
| ProfileUpdated | A user profile was updated |
| PaymentProcessed | A payment was completed successfully |
| OrderPlaced | An order was submitted |
| TaskAssigned | A task was assigned to someone |
| EmailSent | An email was successfully delivered |
| ReminderTriggered | A reminder has fired |
| RequestApproved | A request has been approved |
| JobRetried | A job retry was initiated |
Events are crucial for decoupling. An actor emitting an event doesn't need to know who is interested or what they will do. Other actors (subscribers) can react to events to update read models, trigger side effects, or initiate subsequent commands/workflows.
Queries (Requesting Information)
Queries are requests for information about the current state of the system. They are not intended to change the system's state.
- Nature: Interrogative, a request for data.
- Direction: Directed at an actor or service responsible for providing the requested information (often read models optimized for querying).
- Expectation: The sender explicitly expects a response containing the requested data.
- Naming Convention: Often named to reflect the data being requested (e.g., GetUserProfile, ListOrders, GetOrderStatus).
- Examples:
| Query | Description |
| --- | --- |
| GetUserProfile | Retrieve details for a specific user |
| ListOrders | Get a list of orders for a user |
| GetOrderStatus | Check the current status of an order |
| FindProductsByCategory | Find products within a given category |
| CountActiveUsers | Get the total number of active users |
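The three kinds of messages can be expressed as distinct types. The Python sketch below is illustrative only: `OrderActor` and its `handle` method are hypothetical, and returning events from `handle` stands in for publishing them to a topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceOrder:        # Command: imperative, may be rejected
    order_id: str
    sku: str


@dataclass(frozen=True)
class OrderPlaced:       # Event: past tense, a fact that already happened
    order_id: str


@dataclass(frozen=True)
class GetOrderStatus:    # Query: asks for data, never changes state
    order_id: str


class OrderActor:
    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def handle(self, message):
        if isinstance(message, PlaceOrder):
            if message.order_id in self._status:
                return []                           # command rejected: no event
            self._status[message.order_id] = "placed"
            return [OrderPlaced(message.order_id)]  # events to publish
        if isinstance(message, GetOrderStatus):
            return self._status.get(message.order_id, "unknown")
        raise TypeError(f"unsupported message: {message!r}")
```

Keeping the kinds as separate types makes the intent of each message visible at the call site and lets the handler enforce that queries never mutate state.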
Persistence
Actor persistence allows actors to save their internal state and message processing history. This is crucial for ensuring reliability and fault tolerance, enabling an actor to recover its state after a system restart or crash. By persisting state and potentially incoming messages, the actor can pick up exactly where it left off, maintaining consistency and durability.
Journal
The Journal is a fundamental concept in actor systems designed for durability, fault tolerance, and recovery, especially those implementing event sourcing. It acts as a persistent, ordered log of the significant state changes or events that an actor has processed.
What is a Journal?
At its core, a Journal is an immutable, append-only sequence of events or commands associated with a specific actor instance. Instead of saving the actor's current state directly, the actor logs the facts (events) or decisions (commands leading to events) that alter its state.
Purpose and Use Cases
Journals are primarily used for:
- Durability: Ensuring that an actor's history and the consequences of its actions are not lost if the actor or the entire system crashes.
- Recovery: Allowing an actor to reliably rebuild its state by replaying the sequence of events from its journal after a failure or planned restart.
- Consistency: Providing a single, authoritative source of truth for an actor's history, enabling deterministic state reconstruction.
- Auditing: The journal provides a complete history of an actor's life and state changes, valuable for debugging, compliance, and analysis.
- Event Sourcing: Journals are the cornerstone of event-sourced systems, where the application state is determined solely by applying a sequence of events.
The Immutable Event Log
The central idea is that events are facts: they represent something that has already happened. Once an event is recorded in the journal, it is typically considered immutable and should never be changed or deleted (though some systems might allow compaction or snapshotting to manage size). The actor's current state is merely a projection or fold over this sequence of past events.
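The "state as a fold over events" idea can be shown in a few lines of Python. The bank-account events (`Deposited`, `Withdrawn`) and the in-memory `Journal` class are assumptions chosen for illustration; a real journal would write to durable storage.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


def apply(balance: int, event) -> int:
    """Pure state transition: previous state + event -> next state."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


class Journal:
    """Hypothetical in-memory, append-only event log."""

    def __init__(self) -> None:
        self._entries: list = []

    def append(self, event) -> None:
        self._entries.append(event)   # entries are only ever added

    def events(self) -> tuple:
        return tuple(self._entries)   # read-only copy for replay


def current_state(journal: Journal) -> int:
    # The state is never stored directly; it is derived from the history.
    return reduce(apply, journal.events(), 0)
```

Because `apply` is a pure function, replaying the same events always produces the same state.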
Actor-Specific Journals
Journal implementations are inherently actor-specific, or more accurately, tied to the type or behavior of an actor rather than being a generic, one-size-fits-all log.
- The events stored in a UserAccount actor's journal (AccountCreated, AddressChanged, PasswordReset) are specific to the concept of a user account.
- The events stored in an Order actor's journal (OrderPlaced, ItemAdded, PaymentProcessed) are specific to an order.
This actor-specific nature reinforces encapsulation. The interpretation and application of events in a UserAccount journal are the sole responsibility of the UserAccount actor logic. No other actor should directly read or try to interpret the raw events from another actor's journal; they should interact via messages (Commands or Queries) or react to published events from the actor they are interested in.
Journals vs. Snapshots
While the journal logs incremental changes (events), rebuilding state by replaying every event from the beginning can become time-consuming for long-lived actors with large journals. To mitigate this, snapshots can be used.
- A Snapshot captures the actor's full internal state at a specific point in time (i.e., after applying a certain number of events).
- Think of the Journal as the transaction history (the list of all deposits and withdrawals) and a Snapshot as the account balance at a specific date.
The Recovery Process
When an actor needs to recover its state (due to a crash, migration, or intentional restart):
- Load Latest Snapshot (Optional): The recovery process first checks for the most recent snapshot for this actor instance. If found, it loads this snapshot, restoring the state to that point.
- Replay Journaled Events: The process then reads the events from the journal that occurred after the snapshot was taken (or from the beginning if no snapshot was loaded).
- Apply Events: Each of these events is applied sequentially to the loaded state. This brings the actor's state up-to-date.
- Resume Processing: Once all relevant journal entries have been replayed, the actor is considered recovered and can resume processing new incoming messages from its mailbox.
This deterministic replay ensures the actor's state is consistent and accurate based on its recorded history.
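The recovery steps above can be sketched as a single function. In this Python example the journal is simplified to a list of balance changes and the `Snapshot` type records how many entries it already covers; both are assumptions made to keep the sketch small.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    events_applied: int  # number of journal entries already folded into balance
    balance: int


def recover(journal: list[int], snapshot: Optional[Snapshot] = None) -> int:
    """Rebuild the balance from an optional snapshot plus the remaining events."""
    # Step 1: start from the snapshot if one exists, otherwise from empty state.
    balance = snapshot.balance if snapshot else 0
    start = snapshot.events_applied if snapshot else 0
    # Steps 2 and 3: replay only the events recorded after the snapshot.
    for delta in journal[start:]:
        balance += delta
    # Step 4: the caller can now resume processing new messages.
    return balance


journal = [100, -30, 5, 20]
```

Recovering with or without a snapshot must give the same result; the snapshot only shortens the replay.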
Journal Implementations
The underlying storage for journals varies:
- Databases (SQL, NoSQL)
- File systems
- Specialized distributed log systems (like Apache Kafka, EventStoreDB)
The choice depends on requirements for throughput, latency, consistency guarantees, scalability, and operational complexity. Actor frameworks often provide pluggable journal backends.
Journal Sequence Numbers
A critical aspect of the Journal's functionality, especially for ensuring reliable recovery and consistency, is the use of Sequence Numbers.
For each persistent actor instance, the events or commands written to its Journal are assigned a strictly increasing, contiguous sequence number. This number uniquely identifies the position of an entry within that specific actor's journal log. Typically, the sequence starts from 1 for the first entry written by a given actor instance.
How Sequence Numbers are Used
-
Ordering and Determinism
Sequence numbers enforce the order in which events occurred. During recovery, the actor system reads entries from the journal in strict order of increasing sequence numbers. This guarantees that applying the same sequence of events in the same order will always result in the same final state, making recovery deterministic.
-
Tracking Progress
An actor knows how far along it is in processing its history by the sequence number of the last event it successfully processed and persisted. The highest sequence number for an actor's journal represents its most up-to-date persisted state change.
-
Coordination with Snapshots
Sequence numbers are the bridge between Journals and Snapshots. When a Snapshot of an actor's state is saved, it is always associated with the sequence number of the last journal entry that was applied to reach that state. This sequence number is stored alongside the snapshot.
-
Defining Recovery Range
During recovery with a snapshot, the system loads the snapshot and notes its associated sequence number (let's call it S). It then tells the Journal store to retrieve all events for this actor starting from sequence number S + 1. This ensures that only the events that occurred after the snapshot was taken are replayed on top of the snapshot state, avoiding redundant processing and potential inconsistencies.
-
Detecting Gaps
While ideally the journal store ensures contiguous sequence numbers, the numbers provide a way for the system to potentially detect if entries are missing or out of order (though the storage backend should handle this reliability).
It's important to remember that these sequence numbers are local to a specific actor instance's journal. They do not represent a global ordering of events across the entire system or across different actors. Each persistent actor manages its own independent sequence of journal numbers.
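To make the role of sequence numbers concrete, here is a small in-memory sketch for a single actor instance. The `Journal` type and its methods (`append`, `highest_seq`, `events_after`) and the `check_contiguous` helper are hypothetical names chosen for illustration; a real journal backend would persist entries and enforce ordering itself.

```rust
/// A minimal in-memory journal for ONE actor instance (illustrative only).
#[derive(Debug)]
pub struct Journal<E> {
    entries: Vec<(u64, E)>,
}

impl<E> Journal<E> {
    pub fn new() -> Self {
        Journal { entries: Vec::new() }
    }

    /// Appends an event and returns its sequence number; numbering starts at 1.
    pub fn append(&mut self, event: E) -> u64 {
        let seq = self.entries.len() as u64 + 1;
        self.entries.push((seq, event));
        seq
    }

    /// Highest persisted sequence number, or 0 for an empty journal.
    pub fn highest_seq(&self) -> u64 {
        self.entries.last().map(|(seq, _)| *seq).unwrap_or(0)
    }

    /// Entries with sequence numbers greater than `snapshot_seq`, i.e. from S + 1.
    /// Relies on the numbering being contiguous, so entry `n` lives at index `n - 1`.
    pub fn events_after(&self, snapshot_seq: u64) -> &[(u64, E)] {
        let start = (snapshot_seq as usize).min(self.entries.len());
        &self.entries[start..]
    }
}

/// Checks that `entries` continue contiguously from `after`.
/// On failure, returns the first sequence number that was expected but not found.
pub fn check_contiguous<E>(after: u64, entries: &[(u64, E)]) -> Result<(), u64> {
    let mut expected = after + 1;
    for (seq, _) in entries {
        if *seq != expected {
            return Err(expected);
        }
        expected += 1;
    }
    Ok(())
}
```

Because each actor instance would own its own `Journal`, two different actors can both have an entry numbered 1. The numbers only order events within one journal.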
Design Principles
When designing actors that use journals:
- Identify Core Events: Focus on the business-meaningful events that represent durable facts.
- Events are Immutable: Design event structures to be facts in the past tense.
- State is Computed: Ensure the actor's current state is a projection of its journaled events.
- Actor-Specific: Keep journal event formats and interpretation private to the actor type.
- Consider Snapshots: Plan for snapshotting if replay time becomes a concern.
Snapshots
While the Journal provides a complete, immutable history of an actor's state changes through its events, reconstructing the actor's current state from the Journal alone requires replaying every event from the beginning, potentially thousands or millions of them. For long-lived actors with extensive journals, this replay process can become prohibitively slow, delaying recovery and impacting availability.
Snapshots provide a shortcut. A Snapshot is a serialized representation of an actor's state at a specific point in time. It's a capture of the state after applying a sequence of events up to a certain journal sequence number.
Snapshots and Journals Together
Snapshots are not a replacement for the Journal; they are a complementary mechanism.
- The Journal remains the single source of truth, holding the complete, ordered sequence of events. It tells the story of how the actor reached a state.
- A Snapshot is a saved version of the result of applying events up to a certain point. It's a pre-computed state from the story.
When an actor recovers, instead of replaying all events from the beginning of its journal, it can load the most recent Snapshot and then only replay the events that occurred after that snapshot was taken.
The Recovery Process with Snapshots
The recovery workflow when using Snapshots is:
- Load Latest Snapshot: The actor system attempts to load the most recent Snapshot saved for this specific actor instance.
- Load Journal from Snapshot Point: If a Snapshot is successfully loaded (including its associated journal sequence number), the system then reads events from the Journal starting immediately after that sequence number.
- Apply Remaining Events: The actor applies only these subsequent journaled events to the state loaded from the Snapshot.
- Resume Processing: Once the remaining events are applied, the actor is fully recovered and can handle new messages.
- Without Snapshot: If no Snapshot is found (e.g., it's the first time recovering, or snapshots were lost), the actor recovers by replaying all events from the very beginning of its Journal.
This process significantly reduces the number of events that need to be replayed, speeding up recovery time.
When and What to Snapshot
- Frequency: Snapshots are typically saved periodically, for example after every N events, after a certain period of time, or when the actor's state reaches a certain size or complexity. Snapshotting too frequently adds overhead (serialization, storage), while snapshotting too rarely reduces the recovery speed benefit.
- What State: The snapshot should capture the actor's current state. It should include all the mutable fields that define the actor's current condition. This state must be Serializable so it can be written to persistent storage.
- Consistency: The snapshot must accurately reflect the state after processing events up to a specific, recorded journal sequence number. This sequence number is crucial for knowing where to start replaying events from the Journal.
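The "after every N events" option can be expressed as a small policy object that the actor consults after persisting each event. This is a sketch under assumptions: `SnapshotPolicy` and its methods are hypothetical names, and time-based or size-based triggers are not shown.

```rust
/// Decides when to snapshot, based on how many events were persisted since the last one.
#[derive(Debug)]
pub struct SnapshotPolicy {
    every_n_events: u64,
    last_snapshot_seq: u64,
}

impl SnapshotPolicy {
    pub fn new(every_n_events: u64) -> Self {
        assert!(every_n_events > 0, "interval must be positive");
        SnapshotPolicy { every_n_events, last_snapshot_seq: 0 }
    }

    /// True when at least `every_n_events` events were persisted since the last snapshot.
    pub fn should_snapshot(&self, current_seq: u64) -> bool {
        current_seq.saturating_sub(self.last_snapshot_seq) >= self.every_n_events
    }

    /// Records that a snapshot covering events up to and including `seq` was stored.
    /// This is the sequence number that must be saved alongside the snapshot.
    pub fn mark_snapshot(&mut self, seq: u64) {
        self.last_snapshot_seq = seq;
    }
}
```

With an interval of 3, the policy asks for a snapshot at sequence number 3, and after `mark_snapshot(3)` it stays quiet until sequence number 6. Tuning the interval is the trade-off between snapshot overhead and replay length described above.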
Considerations
- Storage: Snapshots require persistent storage (databases, file systems). This storage must be reliable and accessible during recovery.
- Serialization Overhead: Creating a snapshot involves serializing the actor's entire state, which can be CPU-intensive and time-consuming, especially for large states.
- Coupling: The snapshot format is tightly coupled to the actor's internal state structure. Changes to the actor's state model might require versioning or migration strategies for existing snapshots.
- Snapshot vs. Journal Integrity: The Journal remains the ultimate source of truth. If a snapshot is corrupt or missing, the system can still recover solely from the Journal, albeit more slowly. The system should prioritize Journal integrity.
Actor Startup and Ready State
Note that larger actor frameworks will provide some form of a persistence module. This should provide all of the functionality needed to restore actors to their original state. However, if you do not have access to a persistence-capable framework, then this page details how recovery works.
When an actor is created or restarted, it goes through an initialization phase before it is fully capable of processing regular messages from its mailbox. This phase is crucial for establishing the actor's initial state. For actors that use persistence, this startup phase involves state recovery.
An actor can be thought of as transitioning through distinct internal states:
-
Initializing / Recovering
The actor has been created but is actively loading its past state from a journal or other persistent store. It is not yet ready to handle its primary operational messages.
-
Ready
The actor has successfully loaded its historical state and is now capable of processing all types of incoming messages according to its main business logic.
The Startup Process: Entering the Initializing State
When a persistent actor instance is spawned by the system, its on_start logic begins. At this point, the actor enters the Initializing state. The actor's first task is typically to initiate the recovery process.
This involves:
- Starting out in an Initializing state.
- Sending a specific message to a known "Journal Reader" actor or service, requesting its historical data (events) based on its unique identity.
- Potentially sending a similar request to a "Snapshot Store" service if snapshots are used, asking for the latest snapshot.
The actor now waits for these recovery messages to arrive in its mailbox.
Handling Messages During Recovery: Stashing
Any message sent to the actor's ActorRef will be placed in its mailbox, but this needs to be handled carefully.
Since the actor cannot correctly process regular operational messages until its state is recovered, its handler logic must check its current state. If the actor is Initializing and receives a message that is not part of the recovery process (i.e., not a response from the Journal Reader or Snapshot Store), it must stash the message.
Stashing involves:
- Checking if the actor's current state is Initializing.
- If yes, storing the incoming message in a temporary internal collection, like a VecDeque (a double-ended queue), often called the "stash".
- Returning from the handler function without processing the message's content against the actor's business logic.
Messages related to recovery (like SnapshotLoaded(state, seq_num) or JournalEntry(event, seq_num)) are not stashed; they are processed immediately by the receive function's Initializing state logic to rebuild the actor's state.
Replaying History and Transitioning to Ready
As the actor receives recovery messages from the Journal Reader/Snapshot Store:
- It applies the loaded snapshot state (if any).
- It applies each journaled event in sequence number order, mutating its internal state just as it would for a new incoming command/event.
Once the actor has received and applied all historical data (e.g., indicated by a final message from the Journal Reader like RecoveryComplete(last_seq_num)), it knows its state is fully reconstructed. At this moment, the actor performs a critical state transition:
- It changes its state from Initializing to Ready.
Processing Stashed Messages and New Messages
Upon transitioning to the Ready state, the actor's priority changes:
- It first processes all messages currently held in its internal "stash". It iterates through the stash and passes each stashed message back to its receive function, this time processing it using the Ready state logic. This ensures messages received during recovery are handled in the order they arrived (relative to each other).
- After the stash is empty, the actor proceeds to process any new messages that subsequently arrive in its mailbox, using its Ready state logic.
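The whole startup sequence (recover, stash, transition, drain) can be sketched as a small state machine. This is a simplified, framework-free illustration: the `Counter` actor, the `Phase` enum, and the `Add` message are hypothetical, the recovery message payloads are reduced to plain integers, and messages are delivered by calling `receive` directly rather than through a real mailbox.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum Msg {
    // Recovery messages (names follow the text; payloads are simplified).
    SnapshotLoaded(i64, u64),
    JournalEntry(i64, u64),
    RecoveryComplete(u64),
    // A regular operational message (hypothetical).
    Add(i64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase {
    Initializing,
    Ready,
}

pub struct Counter {
    pub phase: Phase,
    pub total: i64,
    pub last_seq: u64,
    stash: VecDeque<Msg>,
}

impl Counter {
    /// A freshly spawned actor starts out in the Initializing phase.
    pub fn new() -> Self {
        Counter { phase: Phase::Initializing, total: 0, last_seq: 0, stash: VecDeque::new() }
    }

    /// Entry point for every message; dispatches on the current phase.
    pub fn receive(&mut self, msg: Msg) {
        match self.phase {
            Phase::Initializing => self.receive_initializing(msg),
            Phase::Ready => self.receive_ready(msg),
        }
    }

    fn receive_initializing(&mut self, msg: Msg) {
        match msg {
            // Recovery messages rebuild state immediately.
            Msg::SnapshotLoaded(state, seq) => {
                self.total = state;
                self.last_seq = seq;
            }
            Msg::JournalEntry(delta, seq) => {
                self.total += delta;
                self.last_seq = seq;
            }
            Msg::RecoveryComplete(_) => {
                self.phase = Phase::Ready;
                // Drain the stash in arrival order before handling anything newer.
                while let Some(stashed) = self.stash.pop_front() {
                    self.receive_ready(stashed);
                }
            }
            // Anything else is operational traffic: stash it for later.
            other => self.stash.push_back(other),
        }
    }

    fn receive_ready(&mut self, msg: Msg) {
        if let Msg::Add(n) = msg {
            self.total += n;
        }
        // Late recovery messages are ignored in this sketch.
    }
}
```

An `Add(5)` that arrives before `RecoveryComplete` is held in the stash and only applied after the snapshot and journal entries have rebuilt the state, so the final total does not depend on when the message happened to arrive during recovery.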
Design Patterns
This section explores various actor design patterns.
Resilience
Patterns that help actors stay healthy under failure, load, or instability.
Circuit Breaker
The Circuit Breaker pattern is a design approach used to prevent a system from repeatedly trying to execute an operation that is likely to fail. It wraps a protected function call (like a call to a remote service or another component) in a circuit breaker object, which monitors failures.
In the context of actor systems, this pattern can be applied to an actor that needs to send messages to another actor or external service which might be unreliable. Instead of the calling actor directly sending messages and potentially blocking or consuming resources on repeated failures, a "circuit breaker" logic is incorporated, often within the sending actor itself or a dedicated intermediary actor.
The circuit breaker typically operates in three states:
-
Closed
The default state. Messages are sent to the target actor/service. The breaker monitors for failures (e.g., timeouts, error responses). If the failure rate or count exceeds a threshold within a certain time window, the breaker trips and transitions to the Open state.
-
Open
The breaker immediately fails messages without attempting to send them to the target. It might return an error or a fallback result. This prevents overwhelming the failing target and saves resources. After a configured timeout period, the breaker transitions to the Half-Open state.
-
Half-Open
A limited number of trial messages are allowed through to the target. If these trial messages succeed, the breaker assumes the target has recovered and transitions back to the Closed state. If any of these trial messages fail, the breaker assumes the target is still unhealthy and immediately transitions back to the Open state, typically resetting or increasing the timeout period.
Implementing a circuit breaker involves adding state management within an actor's message processing logic. An actor might hold variables for the current state (Closed, Open, Half-Open), a failure counter, a success counter (for Half-Open state), and a timestamp for state transitions. When processing an outgoing message, the actor checks its circuit breaker state before attempting delivery. Incoming error or success messages from the target update the state. Timers can be managed through scheduled messages within the actor system itself.
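The three states and their transitions might look like the following sketch. It makes several simplifying assumptions that are not from the text: the breaker counts consecutive failures rather than a rate over a time window, the current time is passed in as milliseconds (as a scheduled tick message might supply it), the Open timeout is fixed rather than increasing, and the number of concurrent trial messages in Half-Open is not capped. All type and method names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BreakerState {
    Closed,
    Open { opened_at_ms: u64 },
    HalfOpen,
}

#[derive(Debug)]
pub struct CircuitBreaker {
    state: BreakerState,
    failures: u32,
    trial_successes: u32,
    failure_threshold: u32,
    success_threshold: u32,
    open_timeout_ms: u64,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, success_threshold: u32, open_timeout_ms: u64) -> Self {
        CircuitBreaker {
            state: BreakerState::Closed,
            failures: 0,
            trial_successes: 0,
            failure_threshold,
            success_threshold,
            open_timeout_ms,
        }
    }

    pub fn state(&self) -> BreakerState {
        self.state
    }

    /// Call before sending; returns true if the message may go to the target.
    pub fn allow_request(&mut self, now_ms: u64) -> bool {
        if let BreakerState::Open { opened_at_ms } = self.state {
            if now_ms.saturating_sub(opened_at_ms) < self.open_timeout_ms {
                return false; // Still open: fail fast without contacting the target.
            }
            // Timeout elapsed: allow trial messages through.
            self.state = BreakerState::HalfOpen;
            self.trial_successes = 0;
        }
        true
    }

    /// Call when the target replied successfully.
    pub fn record_success(&mut self) {
        match self.state {
            BreakerState::Closed => self.failures = 0,
            BreakerState::HalfOpen => {
                self.trial_successes += 1;
                if self.trial_successes >= self.success_threshold {
                    self.state = BreakerState::Closed;
                    self.failures = 0;
                }
            }
            BreakerState::Open { .. } => {}
        }
    }

    /// Call when the target timed out or replied with an error.
    pub fn record_failure(&mut self, now_ms: u64) {
        match self.state {
            BreakerState::Closed => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.state = BreakerState::Open { opened_at_ms: now_ms };
                }
            }
            // Any failed trial re-opens the breaker immediately.
            BreakerState::HalfOpen => self.state = BreakerState::Open { opened_at_ms: now_ms },
            BreakerState::Open { .. } => {}
        }
    }
}
```

A sending actor would hold a `CircuitBreaker` as part of its state, call `allow_request` before each outgoing message, and call `record_success` or `record_failure` when the reply or timeout message arrives. Passing the time in explicitly keeps the logic deterministic and easy to test.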
Isolation and Resource Protection
Patterns that ensure faults and heavy usage don’t spread across components.
Message Routing and Delivery
Patterns that govern how messages are sent, rerouted, or managed between actors.
State and Lifecycle Management
Patterns for managing actor state, evolution, and lifecycle events.
Coordination and Workflow
Patterns for orchestrating multiple actors to perform long-lived or structured tasks.
Distribution and Scalability
Patterns for spreading actors across nodes or machines while maintaining consistency or performance.
Monitoring and Observability
Patterns to instrument, measure, and gain visibility into actor behavior.