Introduction

A collection of notes on programming using actors.

Source code for this book (and program files) are available at https://github.com/jayson-lennon/actor-notes.

Quick DDD Primer

Here is a list of DDD tactical terms and how they map to Actor Systems:

  • Entity -> Unit of State with Unique Identity: A distinct "thing" in your system that you track individually over time.

    • Actor Mapping: Often represented by the state within an Actor, with the Actor's address or ID representing the Entity's unique identity.
  • Value Object -> Immutable Data Structure: Data that describes something but has no own identity; defined only by its values.

    • Actor Mapping: Data carried as the content of Actor Messages or stored as immutable fields within an Actor's state.
  • Aggregate -> Consistent State Cluster (managed by a Coordinator): A group of related data that must be kept valid together. One point controls all changes to this group.

    • Actor Mapping: Naturally maps to a stateful Actor. The Actor is the coordinator, encapsulating and managing the state cluster, processing messages sequentially to ensure internal consistency.
  • Domain Service -> Domain Process/Operation: Logic for a significant action or calculation that doesn't belong to just one "thing."

    • Actor Mapping: Implemented by dedicated Actors responsible for specific domain processes or orchestrations, possibly interacting with stateful Actors.
  • Application Service -> Use Case Coordinator: Code that directs steps to perform a specific action requested by the user or another system; uses the domain logic but isn't the logic itself.

    • Actor Mapping: Implemented by Actors that serve as the entry point for commands or requests, orchestrating interactions by sending messages to other Actors (Domain Services, Aggregates).
  • Repository -> State Loader/Saver: Gets and saves the "Consistent State Clusters" (Aggregates) from/to storage.

    • Actor Mapping: Used by Actors to manage their persistent state. In some frameworks, persistence is integrated into the Actor model itself.
  • Domain Event -> Something Significant Happened: A notification or record that a key occurrence took place in the system.

    • Actor Mapping: Often implemented directly as Actor Messages that are emitted by Actors and can be subscribed to by other Actors or external systems.

Actor Systems

Let's think of an "actor system" not just as a technology, but as a model for structuring software. It’s a paradigm shift away from traditional imperative or sequential programming, offering a fundamentally different way to design and build applications – one that excels in handling concurrency, complexity, and resilience.

Instead of viewing your application as a series of steps executed linearly, an actor system envisions it as a collection of independent, self-contained entities interacting through messages. Think of it like many small independent programs communicating with each other. Its primary benefits include:

  • Improved Extensibility: Each actor can be developed, tested, and scaled independently, making it easier to integrate new features without affecting existing ones.
  • Enhanced Maintainability: The clear separation of concerns and the straightforward message-passing interfaces within an actor system make the codebase easier to understand and maintain.
  • Enhanced simplicity: Each actor operates as an independent unit, allowing you to focus solely on the messages it sends and receives, thereby simplifying the coding process.
  • And of course easier concurrent programming: By breaking down applications into manageable, isolated components, actor systems simplify the implementation and management of concurrent operations.

By embracing this model, you can build more robust and adaptable software systems that are better equipped to handle complex operations and future growth.

What is an Actor System?

At its core, an actor system consists of:

  • Actors: These are the fundamental building blocks. Think of them as lightweight, isolated computational units. Each actor has:
    • State: Private data that only it can directly access and modify. This eliminates a huge source of concurrency problems (more on this later).
    • Behavior: A set of rules defining how it reacts to incoming messages. It's essentially the logic for processing those messages, potentially updating its state and sending new messages to other actors.
    • Mailbox: A queue where incoming messages are stored. Actors process these messages one at a time (serially), ensuring order within their own context.
  • Messages: The only way actors communicate with each other. They're simple data structures – no shared memory, no direct function calls. This is crucial.
  • The System: The environment that manages the actors, schedules them for execution, and handles message routing.

Why Choose an Actor Model?

Traditional imperative programming often struggles with concurrency. You end up wrestling with locks, mutexes, semaphores – tools designed to prevent problems arising from shared mutable state. These tools are complex, error-prone (deadlocks!), and can significantly hinder performance due to contention. The actor model sidesteps this entire problem class.

Here's how the actor model shines compared to sequential or even traditional concurrent programming.

Elimination of Shared Mutable State

This is the key benefit. Because actors have private state and communicate only through messages, you completely avoid race conditions and deadlocks that plague shared-memory concurrency. This dramatically simplifies reasoning about your code – you can understand the behavior of a single actor in isolation.

Natural Concurrency

Many tasks we think of as sequential are actually inherently concurrent. Consider:

  • CLI tooling

    Command-line interfaces (CLIs) typically operate on files. You can enhance even basic tools by modeling each file as an actor, thus incorporating concurrency seamlessly.

  • Image Processing Pipeline

    Different stages (loading, filtering, compression) can be represented as actors, each working independently on a portion of the image data.

  • Game Logic

    AI agents, physics simulations, rendering – all could be handled by separate actors, allowing for parallel execution and responsiveness.

  • Web Server Handling Requests

    Each request could be handled by an independent actor, allowing for parallel processing without complex threading management.

  • Financial Trading System

    Order processing, risk management, market data analysis - each can be an actor reacting to events in real-time.

  • Resilience & Fault Tolerance

    Actors are designed to fail gracefully. If one actor crashes, it doesn't bring down the entire system. Supervision strategies (built into many actor systems) allow parent actors to monitor their children and restart them if they fail, ensuring continued operation. This is incredibly difficult to achieve reliably in traditional architectures.

  • Modularity & Scalability

    Actors promote modularity – each actor encapsulates a specific responsibility. This makes code easier to understand, test, and maintain. The inherent message-passing nature also lends itself well to scaling; actors can be distributed across multiple machines with relative ease.

  • Reactive Programming

    Actor systems embody reactive programming by inherently responding asynchronously to events without the need for polling.

The actor model also enables some interesting techniques that are difficult to pull off in more traditional programming paradigms. Some of them are highlighted in the next section.

Key Advantages of Actors

The actor model enables patterns that are incredibly challenging or impractical in other paradigms, providing key advantages over traditional programming paradigms:

  • Complex State Machines

    Managing intricate state transitions becomes much cleaner when each state is represented by an actor, reacting to messages representing events. Imagine a complex workflow engine – each step could be an actor, simplifying the logic and making it easier to debug.

  • Decentralized Coordination

    In systems where you don't want a central authority dictating behavior (e.g., distributed sensor networks), actors can coordinate their actions through message passing without relying on a single point of failure or bottleneck.

  • Adaptive Systems

    Actors can dynamically adjust their behavior based on incoming messages and the state of other actors, creating systems that learn and adapt over time. Think of an autonomous vehicle – each component (perception, planning, control) could be an actor reacting to sensor data and coordinating with others.

  • Workflows

    Actor systems excel at managing long-running workflows with by allowing actors to suspend their work temporarily and resume later, maintaining state and ensuring seamless execution over extended periods (even days).

  • Event Journals

    Actors can log actions and message exchanges to event journals, providing a comprehensive audit trail of system activities. This allows you to see exactly what the state of an actor was at any given moment in time.

The actor model shines when it comes to maintainability and architectural flexibility. The core principle – building systems from independent, message-passing actors – directly contributes to a more manageable and adaptable codebase.

System Maintainability

In traditional architectures, adding new features often involves modifying existing codebases, potentially introducing regressions or unintended side effects. With the actor model, this process is significantly cleaner. New functionality frequently translates into adding new actors.

Think of it like building with Lego bricks instead of sculpting a single block of clay. Want to add a new feature? Create an actor that encapsulates that feature's logic and connect it to existing actors via messages. This has several key benefits:

  • Isolation: The new functionality is isolated within its own actor, minimizing the risk of impacting other parts of the system. No need to fear widespread code changes or complex refactoring.
  • Clear Boundaries: Actors clearly define boundaries of responsibility. Adding a feature means defining what that actor does, and how it interacts with others – leading to better-defined interfaces and reduced coupling.
  • Testability: Individual actors are much easier to test in isolation, as you can focus solely on their behavior without needing to mock or simulate complex interactions.

Example: Imagine a shopping cart system. Initially, you have actors for Cart, ProductCatalog, and PaymentProcessor. Now, you want to add a "Recommendation Engine." Instead of modifying the existing actors, you create a new RecommendationEngine actor that receives messages about items in the cart and sends back recommendations. This keeps the core logic clean and focused. The shopping cart and related functionality still run just as before (including their performance characteristics), but now there is new functionality that can provide recommendations.

Actors in Actor Systems

Actors are the primary components of actor systems, acting as isolated, lightweight computational units designed to handle messaging and state management without direct interference.

Key Components of an Actor

  • State: Each actor maintains its own private data that only it can access and modify. This isolation prevents concurrent modifications and eliminates common concurrency issues.
  • Behavior: Actors define behavior as a set of rules for processing incoming messages. This behavior includes updating the actor's state and sending messages to other actors.
  • Mailbox: Actors have a mailbox, which is a queue where all incoming messages are stored. The actor processes these messages one at a time, ensuring that message processing adheres to a specific order.

Benefits

  • Isolation: Actors do not share state directly, preventing concurrency conflicts.
  • Concurrency Simplification: Serial processing of messages within each actor avoids synchronization issues.
  • Modularity: Actors are independent units that can be tested and scaled separately.

Actor Processing

Below is a high-level overview of the steps taken by an actor:

  1. Receive message via mailbox

    Message receipt is asynchronous. The actor system will typically get the next message from the queue and hand it off to the actor for processing. The actor never enters a loop to get messages and instead has a handler that is called when a message is sent to it for processing.

  2. Update its internal state

    Upon receiving a message, an actor may modify its internal state based on the content and purpose of the message. This state update enables actors to maintain and alter their internal configurations or variables as required by their behavior logic.

  3. Send messages to other actors (including itself)

    An actor has the capability to send messages to other actors within the same system. This includes having the ability to send messages to itself, which can be useful for tasks that require iterative processing or callbacks.

  4. Spawn new child actors

    In addition to sending and receiving messages, an actor can also create new child actors. This feature supports hierarchical structures in actor systems, enabling parent-child relationships where actors manage their own lifecycle and interactions with other members of the system.

  5. End message processing

    After any state updates and message sends, the actor returns from it's handler function thereby handing off control back to the actor system.

All functionality of a program can be modeled by putting small state changes in actors and using messages to communicate. Note how in the above steps there are no accesses to other actor states and no waiting for messages from other actors. All an actor does is run a handler function upon message receipt, perform it's processing, and then stop.

Message Passing

While message passing is fundamentally different from direct function calls, it's helpful to understand the conceptual similarity. In imperative programming, you have a function that takes arguments, performs some operation, and returns a value. In an actor system:

  • The Actor: Represents the "function" – encapsulating logic and state.
  • The Message: Acts as both the argument and the trigger for execution. It tells the actor what to do.
  • The Response (Optional): The actor's actions, potentially including sending new messages to other actors, can be considered a form of "response" – communicating results or initiating further processing.

However, the crucial difference is that message passing is asynchronous and never blocks. The sender doesn’t wait for a direct return value; it continues its own work while the actor processes the message. This decoupling is key to concurrency and resilience. Actors control the flow of information by deciding which messages to process, when to send new messages, and how to update their internal state based on those messages. They act as gatekeepers, ensuring that data flows through the system in a controlled and predictable manner.

The actor's message handler should handle incoming messages promptly without getting stuck in blocking operations. If extensive computations or delays are inevitable, these tasks should be executed asynchronously using background threads or tasks managed by the actor.

Keeping the message handling function fast in order to process messages immediately (or near-immediate) allows for control messages to be processed by the actor while it's working. This enables interesting capabilities, such as:

  • Self-monitoring

    Actors can schedule periodic messages to themselves to perform self-monitoring checks like offloading work to other actors if their queue becomes too large.

  • Temporary Suspension

    An outside source can order an actor to suspend operation, but maintain their message queue. For example, maybe a quota was hit and we need to stop processing, but we still want to resume work later after the quota is reset or increased.

  • Message Rejection

    Perhaps some messages are no longer relevant, so the actor can be told to drop their pending jobs.

  • Graceful Shutdown

    Shutdown messages can be sent to the actor which order it to stop. The actor can then perform various cleanup operations before quitting. This may include:

    • dropping remaining messages
    • finishing remaining messages
    • creating a snapshot so work can be resumed later

This design allows actors to be highly responsive and flexible in handling various scenarios.

Mailboxes

An actor mailbox, or message queue, serves as a central storage unit specifically designed to hold messages directed at a particular actor instance in concurrent programming systems. This mechanism ensures that actors can receive and process their messages in a controlled and orderly manner without immediate interference from other operations or actors.

Purpose of a Mailbox

  1. Message Buffering: Actors typically operate asynchronously, and it's possible for multiple messages to be sent to an actor at once before it has the opportunity to process them. The mailbox acts as a buffer, storing these messages until they can be processed one by one.

  2. Isolation from Concurrency Issues: By handling message delivery and processing in a serialized manner through the mailbox, actors are isolated from direct concurrency issues such as data races or deadlocks.

  3. Order of Message Processing: Mailboxes typically use First-In-First-Out (FIFO) ordering which processes messages in the order they were received.

Implementation Characteristics

  • Thread Safety: Mailboxes are inherently thread-safe to allow multiple threads to deliver messages concurrently without corrupting the message queue.

  • Blocking Operations: Upon attempting to read from an empty mailbox, actor systems often implement blocking operations where the actor thread is paused until a new message arrives. Conversely, when attempting to write to a full mailbox, write operations may block or drop messages based on system policies.

  • Capacity Management: Mailboxes can have limited capacity, which influences how they handle overflow situations. This capacity management prevents excessive memory usage and helps manage system resources effectively.

Communication channels

The method in which actors communicate with other actors is not dictated by the underlying mathematical model of actors. Therefore, how they communicate can vary significantly across different implementations.

Types of Actor Communication Channels

Each method of communication has different overhead and informs the way in which actor-based programs are written. Here is a short list of the primary methods used by actor frameworks or libraries:

Physical Actors (Actors with Affinity)

Some actor systems use "physical" or affinity-based communication channels where actors have direct bindings or shared memory spaces:

  • Shared Memory: Actors residing on the same physical machine can communicate directly using shared memory, which provides low-latency interaction.
  • Network Sockets: Network sockets are used for inter-machine communication. This introduces a minimal overhead compared to higher-level messaging protocols.

With physical actors you will usually have access to an ActorRef or similar data structure. This represents a reference to a specific instance of an actor in the system (local or remote). If the actor dies, then the ActorRef becomes invalid and the communication channel is broken. Attempting to send a message on a broken channel will result in a failure at the sender. This gives the sender the opportunity to perform some kind of different action on failure, but then also requires managing the lifetime of the communication channel manually.

Many actor libraries use physical actors because they are the simple to implement while also providing the bare-minimum to fulfill the actor model without additional overhead.

Virtual Actors (Grains)

This approach is used by Microsoft Orleans, a framework that leverages a distributed computing model. In this paradigm, actors, or "grains," are essentially virtualized and opaque objects to the caller. If actor A wishes to communicate with actor B, it simply sends a message to B. The Orleans runtime handles the intricacies of managing these grains. Specifically:

  • Automatic Activation: If actor B is not already operational when actor A attempts to send a message, the runtime automatically activates or "spins up" actor B.
  • Transparent Location: Actor B can be located anywhere within the distributed system. The exact location becomes irrelevant since Orleans manages the communication channels between actors.
  • Message Queueing and Routing: Messages are placed in a queue specific to the grain (actor) until it is ready to process them. This mechanism ensures that message delivery is reliable and follows a FIFO order where applicable.

With virtual actor systems, you'll have access to a VirtualActorRef or similar data structure. This represents the "identity" of an actor. For example, if you query for actor Foo with identity 123, then the returned VirtualActorRef will always route to Foo123 even if that actor dies. The communication channel to Foo123 always exists because the actor runtime performs work behind the scenes to create and maintain the communication channel regardless of the current system state.

Event-Driven Channels

Event-driven channels involve actors listening to specific events or signals:

  • Subscription Model: Actors can subscribe to certain types of events or messages. When these events occur, the event source sends out notifications that are delivered to interested actors.
  • Asynchronous Notification: This model is inherently asynchronous, meaning that actors do not have to actively query for events but instead reactively process them.

Event-driven channels can use a global message broker. Actors subscribe to events on the broker and also send messages to the broker. The message broker then becomes responsible for routing messages to the correct subscribers.

Using a global message broker decouples actors from direct ActorRef communication, eliminating concerns about broken links. However, this setup introduces indirection, potentially increasing overhead in message transmission. Additionally, there's a risk of message loss unless acknowledged by the recipient. We assume continuous availability of the broker, ensuring that messages can always be sent; however, delivery is not guaranteed without waiting for an acknowledgment.

System Design

It takes some up-front effort to design an actor system, but it has several benefits:

  • Near 1:1 translation of design to implementation

    In actor systems, each component (actor) is independent and communicates with other components via message passing. This parallels a common architectural approach where design documents outline discrete responsibilities and interactions between system components.

  • Clarity and Traceability

    Because actors are isolated units, the flow of communication (messages) between them can be easily mapped back to the design documents (or vice-versa). This clear separation makes it simpler to understand how different parts of the system work together, improving traceability and accountability.

  • Ease of Debugging

    With well-defined communication channels, debugging becomes more straightforward. You can trace message flows through actors, making it easier to pinpoint where issues occur.

  • Flexibility for Change

    Design documents that closely mirror code allow for more flexible changes. If requirements evolve, modifying the design document first provides a clear path for updating the implementation without introducing bugs.

For these reasons, it's recommended that design documents be used for all actor-based projects regardless of size.

Supervision

Actor supervision is a key concept when using the actor model.

What is Actor Supervision?

In an actor system, implementing hierarchical supervision is a best practice. When an actor acts as a parent by spawning another actor (the child), the parent should take on the role of supervisor for the spawned child. The rationale behind this is that the parent has the knowledge of how to spawn the child. Therefore, in cases where the child terminates, the parent is equipped to respawn it.

The supervision hierarchy forms a tree structure where each supervisor is responsible for spawning its immediate children. This setup allows the entire system to be "bootstrapped" by starting with the root-level actor, known as the system supervisor. The system supervisor spawns its children, which can also be supervisors, and these in turn spawn their own children, continuing until the entire system is online. Since supervisors manage only their immediate children, this design facilitates restarting arbitrary parts of the system as needed.

Supervision strategies include:

  • Resume

    The failing child actor is resumed, and its internal state is preserved. This is suitable for transient errors that do not corrupt the actor's state and where continuing processing is safe.

  • Restart

    The failing child actor is stopped, a new instance of the actor is created with a fresh internal state, and the new instance resumes processing. This is typically used when the actor's state might be corrupted by the failure and a clean slate is needed. The message that caused the failure is usually skipped.

  • Stop

    The failing child actor is permanently terminated. This directive is used for non-recoverable errors or when the failed actor is no longer needed.

  • Escalate

    The failure is sent up to the supervisor's own supervisor (its parent). This is used when the current supervisor cannot handle the specific failure and delegates the decision to a higher level in the hierarchy.

A supervision strategy is activated by a supervision event. When a problem occurs with a child actor, the actor system itself will generate a supervision event and forward it to the child's supervisor. The supervisor can then apply it's desired strategy based on it's own state and the state of the child actor.

Benefits of Actor Supervision

  1. Failure Isolation: By isolating failures at the actor level, supervision ensures that an error in one part of the application does not affect the rest.

  2. Fault Tolerance: Applications can be designed to handle failures gracefully, providing a high availability of services even when parts of the system fail.

  3. Ease of Debugging and Maintenance: Clearly defined recovery strategies make debugging easier and maintenance simpler.

Direct Actor Communication

Direct actor communication relies on one actor holding a reference (ActorRef) to a specific instance of another actor. Messages are sent directly from the sender actor's ActorRef to the receiver actor's ActorRef. This approach is straightforward and provides a clear path for point-to-point interactions.

However, a significant challenge arises when the target actor instance terminates. Because the ActorRef points to a specific instance, if that instance dies (due to an exception, planned shutdown, or other reason), the ActorRef becomes a "dead letter" reference. Any subsequent messages sent to this dead ActorRef will typically be routed to a system-defined dead letter mailbox, never reaching the intended recipient. The communication channel between the two specific instances is effectively broken.

Handling Channel Failure

Strategies for dealing with this broken channel include:

  1. Death Watch

    The sending actor can "watch" the receiving actor. When the receiving actor terminates, the sender receives a Terminated message. This signals the sender that the ActorRef is now stale and should no longer be used. The "watch" capability can be achieved by linking actors together in a supervision tree in order to be notified when the actor dies.

  2. Message Delivery Failure

    While not guaranteed in all actor systems, some systems might provide feedback (e.g., via acknowledgements or specific error messages) if a message cannot be delivered to a live instance. This would result in an error when attempting to send a message to the dead actor.

  3. Protocol-Level Acknowledgements

    Design the communication protocol such that the receiver explicitly acknowledges processing messages. If an acknowledgement isn't received within a timeout, the sender can infer failure, potentially due to a dead receiver.

Design Considerations

Given that a direct ActorRef connection to a specific instance can fail, architectural designs using this pattern should account for potential disruptions. Here are some strategies that can be used to manage recovery.

Finding a Replacement

If the target actor was part of a supervised hierarchy and is restarted, the sender needs a new ActorRef for the replacement instance. This often requires a mechanism outside the direct reference, such as:

  • Querying a parent supervisor or registry for the new ActorRef.
  • Using a naming service where actors register themselves by a logical ID.

In both of these scenarios, there will be a time delay between the time the dead actor is restarted and the propagation of it's new ActorRef to a registry or naming service. This facilitates the need for the sending actor to use exponential backoff on retries because it may receive the same (dead) ActorRef from a registry while the dead actor restarts.

Idempotency

While not directly addressing ActorRef, it's still important to design messages and receiver logic to be idempotent where possible. This allows the sender to safely retry sending a message (potentially to a newly acquired ActorRef for a replacement actor) without causing unintended side effects if the original message was partially processed before the crash.

Fault Tolerance Boundaries

Clearly define which parts of the system rely on direct actor communication and implement robust error handling and recovery strategies at those boundaries.

Put flaky connections and unstable crash-prone actors behind a "router" actor. Communication can then be reliably made to the router actor, which then relays the message to the unstable ones. The router actor can include the retry and queuing mechanisms which will allow senders to send messages without worrying about error handling.

Pub/Sub Communication

Publish/Subscribe (Pub/Sub) is a messaging pattern where senders (publishers) do not send messages directly to specific receivers (subscribers). Instead, publishers categorize messages into topics or channels without knowing which subscribers, if any, will receive them. Subscribers express interest in one or more topics and receive all messages published to the topics they are subscribed to.

In an actor system context, actors can act as publishers, subscribers, or both. An actor publishes a message to a specific topic managed by a dedicated Pub/Sub service or mediator actor within or external to the system. Other actors subscribe to this topic via the same service.

Pub/Sub and Event-Driven Design

This pattern aligns closely with event-driven design. Publishers emit "events" (messages) describing something that has happened (e.g., UserCreated, OrderPlaced, ItemUpdated). Subscribers, acting upon these events, react to changes in the system state without needing to know the source of the event. This creates a loosely coupled system where components interact primarily by reacting to a stream of events.

Trade-offs: Overhead vs. Management

A key characteristic of Pub/Sub is that all messages for a given topic must pass through the Pub/Sub service or mediator. This introduces a layer of indirection and potential overhead. The service needs to receive the message, determine which subscribers are interested, and relay the message to each of them. This adds latency compared to a direct actor-to-actor message send, and consumes resources in the Pub/Sub infrastructure itself.

However, this overhead comes with significant benefits, primarily in alleviating the management of direct communication channels. With Pub/Sub:

  • Publishers do not need ActorRefs for specific subscribers. They only need to know the Pub/Sub service and the topic.
  • The system becomes more resilient to the failure of individual subscriber instances. If a subscriber actor dies, the publisher is unaffected, and other subscribers continue to receive messages. The responsibility of ensuring a subscriber receives messages (potentially after a restart) often shifts to the Pub/Sub infrastructure or the subscriber's supervisor/registration mechanism.
  • Subscribers can join or leave topics dynamically without affecting publishers or other subscribers.
  • Adding new functionality to the system becomes trivial: Create a new actor, subscribe to a specific topic, and it's now integrated

Compared to direct ActorRef communication where managing dead letters and finding replacement actor instances is a concern for the sender, Pub/Sub shifts the complexity. The challenge moves from managing many point-to-point connections to ensuring the reliability and scalability of the central Pub/Sub service and how actors register/deregister their subscriptions.

Hybrid Communication

A common and effective strategy in larger actor systems, especially those adhering to Domain-Driven Design principles, is to employ a hybrid communication model. This approach uses Publish/Subscribe for communication between distinct Bounded Contexts, and direct ActorRef messaging for communication within a Bounded Context.

A Bounded Context can be thought of as a logical boundary encompassing a set of related actors, data, and business logic that operates around a specific part of the domain (e.g., Order Management, Inventory, Shipping).

Inter-Context Communication via Pub/Sub

When functionality in one bounded context needs to interact with or be notified of events in another context, Pub/Sub is utilized. For example:

  • An Order Management context publishes an OrderPlaced event to a topic.
  • An Inventory context subscribes to the OrderPlaced topic to decrement stock levels.
  • A Shipping context also subscribes to the OrderPlaced topic to initiate the shipping process.

Rationale: Pub/Sub provides loose coupling between contexts. The Order Management context doesn't need to know who is interested in an OrderPlaced event, only that it happened. This allows contexts to evolve independently. New contexts can subscribe to existing events without modifying the publisher. It abstracts away the location and specific instances of actors in other contexts. Failures in one subscriber context do not directly impact the publisher or other subscribers.

Intra-Context Communication via Direct ActorRef

Within a single bounded context, actors often have close relationships and need to interact directly and efficiently. For example:

  • Within the Order Management context, an OrderAggregate actor might communicate directly with OrderItem actors or a PaymentProcessor actor within the same context.

Rationale: Direct ActorRef communication within a context is typically faster and more performant than routing through a Pub/Sub layer. Actors within the same context are often managed together (e.g., by the same supervisor), and their interactions are tightly coupled by the context's specific business logic. The overhead and indirection of Pub/Sub are unnecessary and counterproductive for these direct, focused interactions. While direct ActorRefs require managing references and handling potential instance failures, this complexity is contained within the boundary of the context. The context's internal supervision and management strategies can handle failures of its internal actors.

Benefits of the Hybrid Approach

This hybrid model leverages the strengths of both patterns:

  • Loose Coupling Between Contexts: Facilitates independent development, deployment, and scaling of different parts of the system. Changes within one context are less likely to break others, provided the published event contracts remain stable.
  • Efficient Interaction Within Contexts: Enables high-performance, direct communication for tightly related actors and operations within a specific domain area.
  • Containment of Complexity: The complexity of managing direct ActorRef lifecycles and failures is contained within the context boundary, while the complexity of managing subscriptions and message relay is centralized in the Pub/Sub layer.
  • Clear Boundaries: Reinforces the separation defined by the bounded contexts, making the system architecture easier to understand and maintain.

Implementation Concerns

Note that actors within the context do not communicate with any actors outside their context. All messages outside the context must be routed through a gateway actor which acts as the only entry and exit point for messages into and out of the context. This gateway actor also plays a role as an anti-corruption layer, translating messages at the context boundary. This allows the context to evolve over time without having an impact on the rest of the system.

Request-Response

Direct-to-target

Here are the steps for implementing a direct-to-target request/response pattern using a Pub/Sub message broker. This involves creating unique custom subscriptions per actor so that the responder knows where to publish their response.

  1. Responding Actor Subscribes

    The actor designed to respond (the "responder") subscribes to the Pub/Sub broker for a specific query topic, e.g., topic/query/my_thing. This tells the broker to forward all messages published to this topic to the responder actor.

  2. Requesting Actor Prepares and Subscribes

    The actor initiating the request (the "requester") generates a unique reply topic (e.g., based on its own ID and the request ID) and then subscribes to this unique topic with the broker. This ensures the requester will receive messages specifically addressed back to it.

  3. Requesting Actor Publishes Request

    The requester actor publishes the request message to the responder's query topic (topic/query/my_thing). The message payload includes the necessary query data and the unique reply topic generated in step 2 (often in a field like reply_to).

  4. Broker Routes Request

    The Pub/Sub broker receives the message published to topic/query/my_thing. Based on the subscription from step 1, the broker routes this message to the responding actor.

  5. Responding Actor Processes and Publishes Response

    The responder actor receives the request message, processes it, generates a response, and extracts the reply_to topic from the incoming message. It then publishes the response message to this extracted reply_to topic. The responder does not need to know the identity or ActorRef of the requester.

  6. Broker Routes Response

    The Pub/Sub broker receives the message published to the unique reply_to topic. Based on the subscription from step 2, the broker routes this response message to the requesting actor.

  7. Requesting Actor Receives Response

    The requester actor receives the response message on its unique reply topic.

This pattern effectively decouples the requesting and responding actors, as they only need to know about the Pub/Sub broker and agreed-upon topics, not each other's specific addresses.

sequenceDiagram
    participant Requester as Requesting Actor
    participant Broker as Pub/Sub Broker
    participant Responder as Responding Actor

    Note over Requester,Responder: Setup Phase
    Responder->>Broker: Subscribe("topic/query/my_thing") (when actor spawns)
    Requester->>Broker: Subscribe("unique/reply/topic")

    Note over Requester,Responder: Request/Response Phase
    activate Requester
    Requester->>Broker: Publish("topic/query/my_thing", reply_to="unique/reply/topic")
    activate Broker
    Broker-->>Responder: Message("topic/query/my_thing", reply_to="unique/reply/topic")
    deactivate Broker

    activate Responder
    Responder->>Broker: Publish("unique/reply/topic", response_data)
    deactivate Responder
    activate Broker
    Broker-->>Requester: Message("unique/reply/topic", response_data)
    deactivate Broker
    deactivate Requester

Gateway Correlation ID Mapping

The direct request/response pattern using unique, per-request reply_to topics requires each requesting actor to manage its own temporary subscription. This can become cumbersome, especially if the requesting actors are short-lived or if there are a very large number of concurrent requests. An alternative approach leverages a central Gateway Actor to manage the response routing using Correlation IDs.

In this pattern:

  1. Gateway Subscription

    A dedicated Gateway Actor maintains a single, long-lived subscription to a fixed reply_to topic (e.g., gateway/replies). This is the only topic responses are sent to via the broker.

  2. Responder Subscription

    Responding actors subscribe to their specific query topics with the broker, same as before (e.g., topic/query/my_thing).

  3. Request from Internal Actor

    A requesting actor (behind the Gateway) sends a request message directly to the Gateway Actor, including the intended query topic and payload.

  4. Gateway Action - Publish with Correlation ID

    The Gateway Actor receives the internal request. It generates a unique Correlation ID (or reads the one the actor generated) for this request and stores a mapping internally, linking this Correlation ID to the specific requesting actor's ActorRef. The Gateway then publishes the request message to the broker using the intended query topic (topic/query/my_thing). The crucial difference is that the reply_to field in the published message is set to the Gateway's own fixed reply topic (gateway/replies), and the message payload includes the generated Correlation ID.

    • Gateway Map: CorrelationID -> RequestingActorRef
    • Published Message: Publish to topic/query/my_thing with payload including query data and correlation_id, reply_to: gateway/replies.
  5. Broker Routes Request

    The Pub/Sub broker routes the request message to the subscribing responding actor(s) based on the query topic.

  6. Responding Actor Action - Publish Response

    The responding actor receives the request. It processes it and prepares a response. It takes the correlation_id from the incoming message and includes it in the response payload. It then publishes the response message to the topic specified in the reply_to field of the incoming message, which is the Gateway's fixed reply topic (gateway/replies).

  • Published Message: Publish to gateway/replies with payload including response data and the same correlation_id.
  1. Broker Routes Response

    The Pub/Sub broker routes the response message to the Gateway Actor because the Gateway is subscribed to gateway/replies.

  2. Gateway Action - Route to Requester

    The Gateway Actor receives the response message. It extracts the correlation_id from the payload. Using its internal mapping, it looks up the ActorRef of the original requesting actor associated with that Correlation ID. It then sends the response message directly to the requesting actor's ActorRef. After routing, the Gateway may remove the mapping entry for this Correlation ID (depending on whether duplicate responses are possible).

    • Action: Receive message on gateway/replies. Extract correlation_id. Look up RequestingActorRef in internal map. Send response message directly to RequestingActorRef.
sequenceDiagram
    participant ReqInternal as Requesting Actor (Internal)
    participant Gateway as Gateway Actor
    participant Broker as Pub/Sub Broker
    participant Responder as Responding Actor (External)

    Note over ReqInternal,Responder: Setup Phase
    Gateway->>Broker: Subscribe("gateway/replies")
    Responder->>Broker: Subscribe("topic/query/my_thing")

    Note over ReqInternal,Responder: Request/Response Phase
    activate ReqInternal
    ReqInternal->>Gateway: Request(query_data)
    deactivate ReqInternal

    activate Gateway
    Gateway->>Gateway: Generate/Store Mapping (CorrID -> ReqInternal)
    Gateway->>Broker: Publish("topic/query/my_thing", correlation_id, reply_to="gateway/replies")
    deactivate Gateway

    activate Broker
    Broker-->>Responder: Message("topic/query/my_thing", correlation_id, reply_to="gateway/replies")
    deactivate Broker

    activate Responder
    Responder->>Broker: Publish("gateway/replies", correlation_id, response_data)
    deactivate Responder

    activate Broker
    Broker-->>Gateway: Message("gateway/replies", correlation_id, response_data)
    deactivate Broker

    activate Gateway
    Gateway->>Gateway: Lookup Mapping (CorrID -> ReqInternal)
    Gateway->>ReqInternal: Response(response_data)
    deactivate Gateway

Benefits

  • Simplifies Requesting Actors: Requesting actors no longer need to manage individual subscriptions or unique reply topics. They simply send their request to the well-known Gateway.
  • Better for Short-Lived Actors: This pattern is ideal when requesting actors are transient, as the long-lived Gateway handles the durable subscription needed to receive the response.
  • Centralized Management: The complexity of correlating requests and responses is centralized in the Gateway.

Considerations

  • State Management in Gateway: The Gateway must reliably store and manage the Correlation ID to ActorRef mapping. This requires careful consideration of potential Gateway restarts, memory usage for long-running transactions, and handling potential timeouts or orphaned mappings if a requesting actor dies before receiving a response.
  • Single Point of Failure: If the Gateway actor fails, all in-flight requests awaiting responses will be lost unless the Gateway's state is persisted and recoverable.

This pattern is particularly useful when you have a boundary (like a Bounded Context boundary managed by the Gateway/Router actor) and many internal actors that need to initiate requests to external services or actors via a Pub/Sub bus without exposing their individual presence or managing complex per-request subscriptions.

Messaging Concepts

Beyond the fundamental patterns of sending and receiving messages, building robust and observable actor systems requires understanding certain key concepts related to message lifecycle, traceability, and delivery guarantees.

This section covers essential messaging concepts that provide support for debugging, monitoring, and ensuring the reliability of communication within and between actors.

Correlation IDs

In distributed systems and asynchronous messaging architectures, operations often span multiple messages, actors, or services. Tracking the flow of a single logical transaction or request through this series of interactions can be challenging. This is where Correlation IDs become invaluable.

A Correlation ID is a unique identifier assigned to the initial message or event of a specific transaction or workflow. This same ID is then included in all subsequent messages or events that are part of that same logical flow.

Correlation IDs are not typically used for general system-level messages (like supervision signals, acknowledgements handled by the framework, or broadcast system events) or for tracking the overall health of the messaging infrastructure itself. They are specifically for tracing the path and state of a specific business process or workflow that spans multiple actors or services.

Think of it as monitoring an individual product as it moves through different stations on an assembly line (the workflow), rather than monitoring the health or overall throughput of the assembly line machinery itself. Some examples of where you would use Correlation IDs in a workflow include:

  • Tracking an individual order as it moves from Order Placed -> Payment Processed -> Inventory Reserved -> Shipped.
  • Following a user's request through several microservices or actor interactions to fulfill that request.
  • Monitoring the steps involved in processing a single incoming data record through a processing pipeline.

By limiting Correlation IDs to business workflows, they provide highly relevant context for debugging and understanding the journey of individual requests, without cluttering system-level communication.

Purpose and Usage

When Actor A sends a message that triggers a series of actions involving Actors B, C, and potentially others, Actor A generates a unique Correlation ID and includes it in the message sent to Actor B. When Actor B processes this message and sends a new message to Actor C, it copies the Correlation ID from the message it received and includes it in the message sent to C. This continues throughout the entire chain.

Benefits

  1. Tracing and Debugging: By searching logs or message queues for a specific Correlation ID, developers and operators can easily follow the exact path a request took, identify which actors processed it, and pinpoint where failures or delays occurred. This is particularly crucial in complex systems with many interconnected components.
  2. Request/Response Matching: In the Pub/Sub request/response pattern, the Correlation ID allows the requester to match the incoming response to the specific request it sent, especially if multiple requests might be outstanding simultaneously. The requester includes a Correlation ID in its initial request, and the responder includes the same ID in the response.
  3. Auditing and Monitoring: Correlation IDs provide a mechanism to audit the execution path of specific transactions for compliance or performance analysis.
  4. Context Propagation: Beyond just tracing, Correlation IDs can help propagate context across service boundaries without requiring components to understand the full state of the transaction.

Implementation

Implementing Correlation IDs is fairly simple and typically involves:

  • Generating a unique ID (e.g., a UUID) when a new transaction starts.
  • Adding a dedicated field (e.g., correlation_id) to message structures.
  • Ensuring that any actor or component processing a message and initiating subsequent messages copies the Correlation ID from the incoming message to the outgoing message.

Tell

In actor systems, the most fundamental message sending pattern is often referred to as tell, send, or cast (terminology varies between frameworks). This pattern is characterized by sending a message to a target actor's address or ActorRef without expecting an immediate response.

Key characteristics of the tell pattern:

  1. Asynchronous: The sender actor does not pause or block its execution after sending the message. It immediately returns to processing its next instruction or message.
  2. Unidirectional: The message flows in one direction only, from sender to receiver. There is no built-in mechanism in the tell operation itself to receive a reply.
  3. Fire-and-Forget: From the sender's perspective, it dispatches the message and moves on. It doesn't wait for confirmation of receipt or processing completion (though underlying transport might offer some level of delivery guarantee).

Importance for Event-Driven Architectures

The tell pattern is the bedrock of highly concurrent, scalable, and event-driven architectures built on actors. It aligns perfectly with the core principles of reacting to events:

  • Decoupling

    Publishers of events (tell messages) do not need to know who or how many actors are listening (subscribing) or what they will do with the event. They simply announce that "this happened" by sending the message. This creates immense decoupling between event sources and event consumers.

  • Non-Blocking Operations

    Because the sender doesn't wait for a reply, it can continue processing other tasks or handling incoming messages. This is crucial for maintaining responsiveness and throughput, especially when dealing with high volumes of events.

  • Scalability

    Since senders aren't blocked, they can handle a high rate of outgoing messages. Receivers (subscribers) process messages asynchronously from the sender, allowing the system to scale by adding more processing power to handle incoming message loads.

  • Resilience

    The failure of a recipient actor (e.g., one subscriber to an event) does not directly impact the sender or other recipients. The sender has already "fired and forgotten." While the specific message to the failed actor might become a "dead letter" (depending on the framework), the overall event flow from the publisher is not stalled.

  • Natural Fit for Events

    Events inherently represent something that has happened. The nature of an event is that it is broadcast for anyone interested to react. Tell is the natural way to broadcast or publish such information without requiring explicit coordination or response from each potential listener.

Contrast this with synchronous patterns where a sender calls a function and waits for a return value, or an asynchronous request/response pattern where the sender actively awaits a specific reply message. While these patterns are necessary for certain interactions, relying on them for core event propagation would quickly lead to bottlenecks, tight coupling, and reduced scalability in an event-driven system.

Ask (a.k.a. Request-Response)

The ask pattern is a way to achieve a request-response interaction between actors where the sending actor explicitly expects a reply. Unlike the basic tell where the sender fires and forgets, ask provides a mechanism for the sender to wait for a result corresponding to its specific request. This can be modeled as a Future and used with async/await.

Using ask is much closer to sequential programming. Code within the actor executes line-by-line and waits for responses each time.

Problems with Ask

While ask seems convenient, its use introduces significant complications that often violate core actor principles and lead to hard-to-debug issues. Almost all stem from the fact that the actor's processing flow is now waiting on an external event (the response arriving to complete the Future) rather than solely processing its incoming message queue.

  1. Occupies the Actor's Execution Context

    Until the Future completes (or times out) and the response is received, the actor cannot process other messages that arrive in its mailbox.

  2. Cannot Gracefully Shutdown

    Because the actor is busy waiting on the ask Future, it may not be able to process a shutdown message received in its mailbox in a timely manner. It might be stuck waiting for a reply that never comes or waiting for a timeout.

  3. Cannot Process Control Messages

    Similar to shutdown, critical control messages (like state queries, configuration updates, or explicit stop/cancel requests relevant to the actor's current task) might be delayed or missed because the actor is occupied with waiting for the ask response.

  4. Memory Leaks and Resource Exhaustion

    If a response message is genuinely lost or the target actor fails without sending a reply, the Future returned by ask will never complete. Without explicitly setting a timeout on every ask call, the actor's logic waiting on that Future will effectively hang indefinitely, consuming memory and preventing the actor from processing further messages. Managing timeouts reliably for every request adds complexity.

  5. Message Loss

    Actor mailboxes uses bounded queues in order to prevent resource exhaustion. If an actor is stuck waiting on a Future, then the mailbox will fill up and messages will start to get dropped.

For these reasons it's recommended that actor systems be designed around tell (fire and forget messaging) instead of ask.

Kinds of Messages

In designing actor systems and distributed architectures, especially those leaning towards event-driven principles or CQRS (Command Query Responsibility Segregation), it's helpful to categorize the different types of messages flowing through the system based on their purpose. The primary distinctions are typically made between Commands, Events, and Queries.

Commands (Expressing Intent)

Commands represent a user's or system's intent to perform an action or change the state of the system. They are imperative requests. A Command is a request for the system to do something.

  • Nature: Imperative, a request.

  • Direction: Usually directed at a specific actor or a well-defined endpoint responsible for handling that specific type of action (e.g., an aggregate actor in DDD).

  • Expectation: The sender expects the command to be attempted and potentially result in a state change and/or the emission of one or more events. The sender might expect an acknowledgment of receipt or a more detailed response indicating success/failure or the outcome of the action.

  • Naming Convention: Often named in the imperative mood (verb-noun), like CreateUser, PlaceOrder, ProcessPayment.

  • Examples:

    CommandDescription
    CreateUserRequest to create a new user
    DeleteAccountRequest to delete a user account
    UpdateProfileModify user profile information
    ProcessPaymentInitiate a payment transaction
    PlaceOrderSubmit an order for processing
    AssignTaskAssign a task to an actor
    SendEmailTrigger an email to be sent
    ScheduleReminderSchedule a reminder for future action
    ApproveRequestApprove a pending request
    RetryJobRetry a failed background job

Events (Representing Facts)

Events represent something that has already happened in the system. They are declarative statements of fact about a state change. Events are the result of successfully processing a Command (or sometimes external occurrences).

  • Nature: Declarative, a statement of fact.

  • Direction: Usually published to a Pub/Sub topic or message bus, to be consumed by any interested party. Events are often broadcast.

  • Expectation: Publishers of events typically use tell (fire-and-forget) and do not expect a direct response from consumers. Consumers react to the event asynchronously.

  • Naming Convention: Often named in the past tense (noun-verb), like UserCreated, OrderPlaced, PaymentProcessed.

  • Examples:

    EventDescription
    UserCreatedA new user has been successfully created
    AccountDeletedA user account has been removed
    ProfileUpdatedA user profile was updated
    PaymentProcessedA payment was completed successfully
    OrderPlacedAn order was submitted
    TaskAssignedA task was assigned to someone
    EmailSentAn email was successfully delivered
    ReminderTriggeredA reminder has fired
    RequestApprovedA request has been approved
    JobRetriedA job retry was initiated

Events are crucial for decoupling. An actor emitting an event doesn't need to know who is interested or what they will do. Other actors (subscribers) can react to events to update read models, trigger side effects, or initiate subsequent commands/workflows.

Queries (Requesting Information)

Queries are requests for information about the current state of the system. They are not intended to change the system's state.

  • Nature: Interrogative, a request for data.

  • Direction: Directed at an actor or service responsible for providing the requested information (often read models optimized for querying).

  • Expectation: The sender explicitly expects a response containing the requested data.

  • Naming Convention: Often named to reflect the data being requested (e.g., GetUserProfile, ListOrders, GetOrderStatus).

  • Examples:

    QueryDescription
    GetUserProfileRetrieve details for a specific user
    ListOrdersGet a list of orders for a user
    GetOrderStatusCheck the current status of an order
    FindProductsByCategoryFind products within a given category
    CountActiveUsersGet the total number of active users

Persistance

Actor persistence allows actors to save their internal state and message processing history. This is crucial for ensuring reliability and fault tolerance, enabling an actor to recover its state after a system restart or crash. By persisting state and potentially incoming messages, the actor can pick up exactly where it left off, maintaining consistency and durability.

Journal

The Journal is a fundamental concept in actor systems designed for durability, fault tolerance, and recovery, especially those implementing event sourcing. It acts as a persistent, ordered log of the significant state changes or events that an actor has processed.

What is a Journal?

At its core, a Journal is an immutable, append-only sequence of events or commands associated with a specific actor instance. Instead of saving the actor's current state directly, the actor logs the facts (events) or decisions (commands leading to events) that alter its state.

Purpose and Use Cases

Journals are primarily used for:

  • Durability: Ensuring that an actor's history and the consequences of its actions are not lost if the actor or the entire system crashes.
  • Recovery: Allowing an actor to reliably rebuild its state by replaying the sequence of events from its journal after a failure or planned restart.
  • Consistency: Providing a single, authoritative source of truth for an actor's history, enabling deterministic state reconstruction.
  • Auditing: The journal provides a complete history of an actor's life and state changes, valuable for debugging, compliance, and analysis.
  • Event Sourcing: Journals are the cornerstone of event-sourced systems, where the application state is determined solely by applying a sequence of events.

The Immutable Event Log

The central idea is that events are facts: they represent something that has already happened. Once an event is recorded in the journal, it is typically considered immutable and should never be changed or deleted (though some systems might allow compaction or snapshotting to manage size). The actor's current state is merely a projection or fold over this sequence of past events.

Actor-Specific Journals

Journal implementations are inherently actor-specific, or more accurately, tied to the type or behavior of an actor rather than being a generic, one-size-fits-all log.

  • The events stored in a UserAccount actor's journal (AccountCreated, AddressChanged, PasswordReset) are specific to the concept of a user account.
  • The events stored in an Order actor's journal (OrderPlaced, ItemAdded, PaymentProcessed) are specific to an order.

This actor-specific nature reinforces encapsulation. The interpretation and application of events in a UserAccount journal are the sole responsibility of the UserAccount actor logic. No other actor should directly read or try to interpret the raw events from another actor's journal; they should interact via messages (Commands or Queries) or react to published events from the actor they are interested in.

Journals vs. Snapshots

While the journal logs incremental changes (events), rebuilding state by replaying every event from the beginning can become time-consuming for long-lived actors with large journals. To mitigate this, snapshots can be used.

  • A Snapshot captures the actor's full internal state at a specific point in time (i.e., after applying a certain number of events).
  • Think of the Journal as the transaction history (the list of all deposits and withdrawals) and a Snapshot as the account balance at a specific date.

The Recovery Process

When an actor needs to recover its state (due to a crash, migration, or intentional restart):

  1. Load Latest Snapshot (Optional): The recovery process first checks for the most recent snapshot for this actor instance. If found, it loads this snapshot, restoring the state to that point.
  2. Replay Journaled Events: The process then reads the events from the journal that occurred after the snapshot was taken (or from the beginning if no snapshot was loaded).
  3. Apply Events: Each of these events is applied sequentially to the loaded state. This brings the actor's state up-to-date.
  4. Resume Processing: Once all relevant journal entries have been replayed, the actor is considered recovered and can resume processing new incoming messages from its mailbox.

This deterministic replay ensures the actor's state is consistent and accurate based on its recorded history.

Journal Implementations

The underlying storage for journals varies:

  • Databases (SQL, NoSQL)
  • File systems
  • Specialized distributed log systems (like Apache Kafka, EventStoreDB)

The choice depends on requirements for throughput, latency, consistency guarantees, scalability, and operational complexity. Actor frameworks often provide pluggable journal backends.

Journal Sequence Numbers

A critical aspect of the Journal's functionality, especially for ensuring reliable recovery and consistency, is the use of Sequence Numbers.

For each persistent actor instance, the events or commands written to its Journal are assigned a strictly increasing, contiguous sequence number. This number uniquely identifies the position of an entry within that specific actor's journal log. Typically, the sequence starts from 1 for the first entry written by a given actor instance.

How Sequence Numbers are Used

  1. Ordering and Determinism

    Sequence numbers enforce the order in which events occurred. During recovery, the actor system reads entries from the journal in strict order of increasing sequence numbers. This guarantees that applying the same sequence of events in the same order will always result in the same final state, making recovery deterministic.

  2. Tracking Progress

    An actor knows how far along it is in processing its history by the sequence number of the last event it successfully processed and persisted. The highest sequence number for an actor's journal represents its most up-to-date persisted state change.

  3. Coordination with Snapshots

    Sequence numbers are the bridge between Journals and Snapshots. When a Snapshot of an actor's state is saved, it is always associated with the sequence number of the last journal entry that was applied to reach that state. This sequence number is stored alongside the snapshot.

  4. Defining Recovery Range

    During recovery with a snapshot, the system loads the snapshot and notes its associated sequence number (let's call it S). It then tells the Journal store to retrieve all events for this actor starting from sequence number S + 1. This ensures that only the events that occurred after the snapshot was taken are replayed on top of the snapshot state, avoiding redundant processing and potential inconsistencies.

  5. Detecting Gaps

    While ideally the journal store ensures contiguous sequence numbers, the numbers provide a way for the system to potentially detect if entries are missing or out of order (though the storage backend should handle this reliability).

It's important to remember that these sequence numbers are local to a specific actor instance's journal. They do not represent a global ordering of events across the entire system or across different actors. Each persistent actor manages its own independent sequence of journal numbers.

Design Principles

When designing actors that use journals:

  • Identify Core Events: Focus on the business-meaningful events that represent durable facts.
  • Events are Immutable: Design event structures to be facts in the past tense.
  • State is Computed: Ensure the actor's current state is a projection of its journaled events.
  • Actor-Specific: Keep journal event formats and interpretation private to the actor type.
  • Consider Snapshots: Plan for snapshotting if replay time becomes a concern.

Snapshots

While the Journal provides a complete, immutable history of an actor's state changes through its events, reconstructing the actor's current state requires replaying potentially thousands or millions of events from the beginning of time or from the last saved snapshot. For long-lived actors with extensive journals, this replay process can become prohibitively slow, delaying recovery and impacting availability.

Snapshots provide a shortcut. A Snapshot is a serialized representation of an actor's state at a specific point in time. It's a capture of the state after applying a sequence of events up to a certain journal sequence number.

Snapshots and Journals Together

Snapshots are not a replacement for the Journal; they are a complementary mechanism.

  • The Journal remains the single source of truth, holding the complete, ordered sequence of events. It tells the story of how the actor reached a state.
  • A Snapshot is a saved version of the result of applying events up to a certain point. It's a pre-computed state from the story.

When an actor recovers, instead of replaying all events from the beginning of its journal, it can load the most recent Snapshot and then only replay the events that occurred after that snapshot was taken.

The Recovery Process with Snapshots

The recovery workflow when using Snapshots is:

  1. Load Latest Snapshot: The actor system attempts to load the most recent Snapshot saved for this specific actor instance.
  2. Load Journal from Snapshot Point: If a Snapshot is successfully loaded (including its associated journal sequence number), the system then reads events from the Journal starting immediately after that sequence number.
  3. Apply Remaining Events: The actor applies only these subsequent journaled events to the state loaded from the Snapshot.
  4. Resume Processing: Once the remaining events are applied, the actor is fully recovered and can handle new messages.
  5. Without Snapshot: If no Snapshot is found (e.g., it's the first time recovering, or snapshots were lost), the actor recovers by replaying all events from the very beginning of its Journal.

This process significantly reduces the number of events that need to be replayed, speeding up recovery time.

When and What to Snapshot

  • Frequency: Snapshots are typically saved periodically – for example, after every N events, after a certain period of time, or when the actor's state reaches a certain size or complexity. Too frequent snapshotting adds overhead (serialization, storage), while too infrequent snapshotting reduces the recovery speed benefit.
  • What State: The snapshot should capture the actor's current state. It should include all the mutable fields that define the actor's current condition. This state must be Serializable so it can be written to persistent storage.
  • Consistency: The snapshot must accurately reflect the state after processing events up to a specific, recorded journal sequence number. This sequence number is crucial for knowing where to start replaying events from the Journal.

Considerations

  • Storage: Snapshots require persistent storage (databases, file systems). This storage must be reliable and accessible during recovery.
  • Serialization Overhead: Creating a snapshot involves serializing the actor's entire state, which can be CPU-intensive and time-consuming, especially for large states.
  • Coupling: The snapshot format is tightly coupled to the actor's internal state structure. Changes to the actor's state model might require versioning or migration strategies for existing snapshots.
  • Snapshot vs. Journal Integrity: The Journal remains the ultimate source of truth. If a snapshot is corrupt or missing, the system can (albeit slower) recover solely from the Journal. The system should prioritize Journal integrity.

Actor Startup and Ready State

Note that larger actor frameworks will provide some form of a persistence module. This should provide all of the functionality needed to restore actors to their original state. However, if you do not have access to a persistent-capable framework, then this page details how recovery works.


When an actor is created or restarted, it goes through an initialization phase before it is fully capable of processing regular messages from its mailbox. This phase is crucial for establishing the actor's initial state. For actors that use persistence, this startup phase involves state recovery.

An actor can be thought of as transitioning through distinct internal states:

  • Initializing / Recovering

    The actor has been created but is actively loading its past state from a journal or other persistent store. It is not yet ready to handle its primary operational messages.

  • Ready

    The actor has successfully loaded its historical state and is now capable of processing all types of incoming messages according to its main business logic.

The Startup Process: Entering the Initializing State

When a persistent actor instance is spawned by the system, its on_start logic begins. At this point, the actor enters the Initializing state. The actor's first task is typically to initiate the recovery process.

This involves:

  1. Starting out in an Initializing state.
  2. Sending a specific message to a known "Journal Reader" actor or service, requesting its historical data (events) based on its unique identity.
  3. Potentially sending a similar request to a "Snapshot Store" service if snapshots are used, asking for the latest snapshot.

The actor now waits for these recovery messages to arrive in its mailbox.

Handling Messages During Recovery: Stashing

Any message sent to the actor's ActorRef will be placed in its mailbox, but this needs to be handled carefully.

Since the actor cannot correctly process regular operational messages until its state is recovered, its handler logic must check its current state. If the actor is Initializing and receives a message that is not part of the recovery process (i.e., not a response from the Journal Reader or Snapshot Store), it must stash the message.

Stashing involves:

  • Checking if the actor's current state is Initializing.
  • If yes, storing the incoming message in a temporary internal collection, like a VecDeque (a double-ended queue), often called the "stash".
  • Returning from the handler function without processing the message's content against the actor's business logic.

Messages related to recovery (like SnapshotLoaded(state, seq_num) or JournalEntry(event, seq_num)) are not stashed; they are processed immediately by the receive function's Initializing state logic to rebuild the actor's state.

Replaying History and Transitioning to Ready

As the actor receives recovery messages from the Journal Reader/Snapshot Store:

  1. It applies the loaded snapshot state (if any).
  2. It applies each journaled event in sequence number order, mutating its internal state just as it would for a new incoming command/event.

Once the actor has received and applied all historical data (e.g., indicated by a final message from the Journal Reader like RecoveryComplete(last_seq_num)), it knows its state is fully reconstructed. At this moment, the actor performs a critical state transition:

  • It changes its state from Initializing to Ready.

Processing Stashed Messages and New Messages

Upon transitioning to the Ready state, the actor's priority changes:

  1. It first processes all messages currently held in its internal "stash". It iterates through the stash and passes each stashed message back to its receive function, this time processing it using the Ready state logic. This ensures messages received during recovery are handled in the order they arrived (relative to each other).
  2. After the stash is empty, the actor proceeds to process any new messages that subsequently arrive in its mailbox, using its Ready state logic.

Design Patterns

This section explores various actor design patterns.

Resilience

Patterns that help actors stay healthy under failure, load, or instability.

Circuit Breaker

The Circuit Breaker pattern is a design approach used to prevent a system from repeatedly trying to execute an operation that is likely to fail. It wraps a protected function call (like a call to a remote service or another component) in a circuit breaker object, which monitors failures.

In the context of actor systems, this pattern can be applied to an actor that needs to send messages to another actor or external service which might be unreliable. Instead of the calling actor directly sending messages and potentially blocking or consuming resources on repeated failures, a "circuit breaker" logic is incorporated, often within the sending actor itself or a dedicated intermediary actor.

The circuit breaker typically operates in three states:

  1. Closed

    The default state. Messages are sent to the target actor/service. The breaker monitors for failures (e.g., timeouts, error responses). If the failure rate or count exceeds a threshold within a certain time window, the breaker trips and transitions to the Open state.

  2. Open

    The breaker immediately fails messages without attempting to send them to the target. It might return an error or a fallback result. This prevents overwhelming the failing target and saves resources. After a configured timeout period, the breaker transitions to the Half-Open state.

  3. Half-Open

    A limited number of trial messages are allowed through to the target. If these trial messages succeed, the breaker assumes the target has recovered and transitions back to the Closed state. If any of these trial messages fail, the breaker assumes the target is still unhealthy and immediately transitions back to the Open state, typically resetting or increasing the timeout period.

Implementing a circuit breaker involves adding state management within an actor's message processing logic. An actor might hold variables for the current state (Closed, Open, Half-Open), a failure counter, a success counter (for Half-Open state), and a timestamp for state transitions. When processing an outgoing message, the actor checks its circuit breaker state before attempting delivery. Incoming error or success messages from the target update the state. Timers can be managed through scheduled messages within the actor system itself.

Isolation and Resource Protection

Patterns that ensure faults and heavy usage don’t spread across components.

Message Routing and Delivery

Patterns that govern how messages are sent, rerouted, or managed between actors.

State and Lifecycle Management

Patterns for managing actor state, evolution, and lifecycle events.

Coordination and Workflow

Patterns for orchestrating multiple actors to perform long-lived or structured tasks.

Distribution and Scalability

Patterns for spreading actors across nodes or machines while maintaining consistency or performance.

Monitoring and Observability

Patterns to instrument, measure, and gain visibility into actor behavior.