Every System is a Log: Avoiding Coordination in Distributed Applications

Building resilient distributed applications is a challenging task. Ideally, developers should focus on business logic and domain complexity. However, they often find themselves worrying about service crashes, API availability, concurrent invocations, and zombie processes corrupting state, leading to concerns over failover strategies, retries, race conditions, and locking mechanisms.

Many applications fail to handle these issues correctly, especially under load or failures. The question arises: how can we simplify this complexity? This article explores a fundamental concept for addressing these challenges by avoiding distributed coordination, drawing from the development of Apache Flink.

Every System is a Log

In distributed applications, every system can be seen as a log. Message queues such as Apache Kafka and Pulsar, databases, and distributed locking services like ZooKeeper inherently function as logs. When developing applications that interact with these systems, developers orchestrate multiple logs within their business logic.

Applications Need to Orchestrate Many Logs

Consider a processPayment handler triggered by a queue, involving fraud detection, account balance updates, and notifications. Issues arise with concurrent invocations, retries, and non-deterministic fraud models. Ensuring the ownership of locks and consistent state updates is a complex task due to distributed locking and failure detection limitations.

This complexity is due to the need to orchestrate various systems, each maintaining its own state. Ensuring consistent operations across these systems becomes the essence of many distributed application challenges.

What if It Were All the Same Log?

Imagine all these systems operate from a single log, specifically, the upstream log for the payment handler. State changes write records to this log, allowing for conditional appends and linked log entries. This approach simplifies retries and ensures strong workflow execution guarantees.

Lock and state updates are logged, transforming the log into a central point for coordinating events. This eliminates previous corner cases related to locks and state updates.

If Everything’s in One Log, There’s Nothing to Coordinate

By unifying states within a single log, the need for coordination diminishes. The complexity of distributed systems is reduced, increasing efficiency and simplicity. This approach contrasts with traditional methods that, like using ZooKeeper, shift rather than reduce complexity.

Adopting This Idea in Practice

This concept provides a blueprint for log-based architectures, akin to “Turning the Database Inside Out.” Currently, Restate implements this idea, managing the upstream log and reliably invoking event handlers, attaching execution journals, and ensuring conditionally appended log entries.

The implementation supports exactly-once semantics, stateful handlers, and facilitates robust and consistent application states without requiring coordination across distributed components.

Blast Radius and Separation of Concerns

While it’s impractical to use a single log for all operations in a multi-service architecture, targeting all state strictly scoped to a handler through a log achieves a balance, akin to event-driven architectures. Databases remain essential for shared data, while the log efficiently handles state machines and event-driven updates.

This approach provides robust building blocks for maintaining core state and enables overlays and metadata tracking in databases. The result is a highly reliable and efficient system.

What’s Next?

To explore this pattern, Restate is open source and available for download. The next updates will include distributed support and scalability features. Future articles will delve into the broker design and the unique log implementation backing this system, exploring its advantages over existing solutions like Kafka or Postgres.