The subtle beauty of Durable Objects, and what could be better about them
Cloudflare’s Durable Objects (and the official SDK, PartyKit) are genuinely beautiful infrastructure.
You write a stateful service. You give the service a name. Cloudflare guarantees there are 0 or 1 copies of a service with that name running anywhere on planet Earth. The service spins up in a few microseconds whenever someone sends a request to it; it spins back down when no one is actively using it. It generally spins up geographically close to the user who made the request.
I unhesitatingly recommend Durable Objects to everyone. It is close to the perfect solution for small, stateful services like collaborative text editing. We use (or have used) Durable Objects for everything: most notably our collaborative whiteboards, but also presence, collaborative text editing, and so on. Altogether we have written, and been on call for, thousands and thousands of lines of Durable Objects code.
So: Durable Objects (and PartyKit) are great and we love them. But writing collaborative apps is still astonishingly difficult, and today I want to talk about what I wish had been there when we started our journey.
Idempotency and side-effect validation should be built in
If you’re building a live, collaborative app, stop everything you are working on and do the following.
Make every request idempotent.
Change your app to send every single request twice. At least in dev and staging builds.
Seriously. You will find an incredible number of bugs. If you are not already doing this, you are years behind the curve, yes, years behind. React has been automatically re-running all side-effects to flush out bugs since React v16.3, which was released March 28, 2018. As of publication date, that’s 2,783 days or 7.7 of God’s own years in the past. For nearly a decade, they have been schooling backend at something it is supposed to be good at, and if you don’t want to believe that’s true, listen to Jamie Turner, CEO of Convex (who also ran Dropbox storage), or look at the Reboot docs to see why they built it directly into their platform (their CEO is the original author of the Mesos paper).
If you work at Cloudflare: it is now time to build this into PartyKit and/or Durable Objects. You can eliminate an entire class of excruciatingly hard-to-track-down bugs by providing this out of the box.
Look for yourself: search usePartySocket on Sourcegraph and you’ll see a bunch of code like the following, with no error handling whatsoever. Perhaps not every callsite like this is a bug, but undoubtedly many of them are. And more importantly, it will be hard to distinguish between the two.
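For illustration, here is something in the spirit of what those search results look like; this is a made-up callsite, not code quoted from any particular repo, and applyRemotePatch is a hypothetical local-state helper:

```tsx
import usePartySocket from "partysocket/react";

declare function applyRemotePatch(patch: unknown): void; // hypothetical local-state helper

function Editor({ roomId }: { roomId: string }) {
  const socket = usePartySocket({
    host: "my-project.example.partykit.dev", // placeholder host
    room: roomId,
    onMessage(event) {
      // Fire-and-forget: no acks, no retries, no handling of dropped or duplicated messages.
      applyRemotePatch(JSON.parse(event.data as string));
    },
  });

  const insertText = (text: string) => {
    socket.send(JSON.stringify({ type: "insert", text })); // did it arrive? who knows
  };

  return <button onClick={() => insertText("hello")}>Insert</button>;
}
```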
I understand this cuts directly against the “plain-old message-passing actors” aesthetic, but if there had been an API like await socket.idempotentlySend or await PartySocket.idempotentlyFetch, which allowed you to send the message idempotently, wait for a response, and double-send it in dev, it would have saved our team hundreds of person-hours as we added collaborative text editing to our product.
Yes, I mean that. Hundreds:
A user typing probably dispatches a couple dozen requests per minute.
Any request in the chain can get dropped or accepted multiple times, and you need to take care to handle all those cases correctly, or it will corrupt the document state, often totally silently.
It is excruciatingly hard to track down this kind of corruption. Every bug of this type cost us tens of hours: poring over Datadog to find offending requests, reconstructing the problematic state, developing alerts when there are symptoms of the problem, and so on.
This sounds like elementary distributed systems, and to some extent it is. But even though our team is very experienced with this kind of real-time system, our approach was mainly to budget time in QA to find this kind of bug.
When we finally made all requests idempotent and started double-submitting them, nearly all our state-corruption problems (in fact, all except exactly one) instantly became obvious in trivial testing, and we were able to fix most of them immediately.
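The mechanism itself is tiny. A minimal sketch of the kind of wrapper we mean, where the socket type and the DEV flag are placeholders for your own transport and build configuration:

```ts
// Sketch only: `socket` is any WebSocket-like transport and `DEV` is your build flag.
declare const DEV: boolean;

function sendMutation(
  socket: { send(data: string): void },
  type: string,
  payload: unknown
) {
  const message = JSON.stringify({
    type,
    payload,
    // The server deduplicates on this key, so a replayed message is a no-op.
    idempotencyKey: crypto.randomUUID(),
  });
  socket.send(message);
  if (DEV) {
    // Deliberately double-send in dev and staging to flush out non-idempotent handlers.
    socket.send(message);
  }
}
```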
I know the retort. Plain-old PartySocket.fetch gives completion semantics, and socket.send paired with a broadcast success message can do something similar today. I know there are performance considerations. And I know you can build this yourself—eventually we did!
But I’m here today to tell you that the perf hit for something like await socket.idempotentlySend or await PartySocket.idempotentlyFetch was fine for live, collaborative text editing, where the response budget is 30-50ms. I’d make the trade again every single day of the week, and I think nearly all applications building on Durable Objects would too, if they tried it and benchmarked it.
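Concretely, I imagine something with roughly this shape. To be clear, idempotentlySend, its options, and the ack it returns are all hypothetical; nothing like this exists in PartyKit or Durable Objects today:

```ts
// Hypothetical API sketch; not part of PartyKit or Durable Objects today.
const ack = await socket.idempotentlySend(
  { type: "insert", at: 120, text: "hello" },
  {
    idempotencyKey: crypto.randomUUID(), // server replays the original ack for duplicates
    timeoutMs: 5_000,
    retries: 3,
    doubleSendInDev: true, // the "send everything twice" behavior, built in
  }
);

if (!ack.ok) {
  // We actually know the mutation failed, instead of silently corrupting state.
  showReconnectBanner(); // hypothetical UI hook
}
```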
The point of a platform is to increase the probability that code you write on top of it is correct. I really think this should be the default. If you want to do something riskier you should do it only after you know this option is too slow.
Sidebar: doesn’t Yjs solve this? We argue that it does not. We have given some talks about why we think it is not appropriate for the use case of live-collaborative editing that supports > 10 concurrent editors. We’ve also written part 1 of a series of blog posts further discussing these and other objections.
“Transactional” requests via Cap’n Web
All collaborative apps have some shared piece of collaborative state that is constantly being updated and read by many users. In basically any case that isn’t completely trivial, updating the shared state is not atomic, and care needs to be taken to ensure that requests do not trample over each other or otherwise corrupt the state.
There are many answers to this, but the simplest one is to define classes of requests that are mutually exclusive. As I’ve built on Durable Objects, I’ve become convinced that the default behavior should be that:
n readers can access the shared state
…or, 1 writer can access the shared state
Sometimes you’ll need more fine-grained interleaving of writer code, but the simple model worked for our collaborative text editing case, so I’d argue those cases are orders of magnitude rarer, and that you should only take that route when you have clear reasons to believe your performance is not good enough.
As a straw proposal, I imagine these semantics built into PartyKit with an only-slightly-souped-up version of Cloudflare’s new RPC system, Cap’n Web.
Suppose we have two decorators @reader and @writer. The Durable Objects platform would allow any number of @reader calls to be executed concurrently, OR, a single @writer. This means that data in a writer never gets “stale”:
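A sketch of how that might read, assuming a PartyKit-style Server base class and hypothetical DocState/Patch/rebase/serialize helpers; the @reader and @writer decorators, and the scheduling behind them, are imagined rather than a shipping API:

```ts
// Imagined API: @reader / @writer do not exist in Durable Objects or PartyKit today.
class DocumentRoom extends Server {
  doc: DocState = emptyDoc();

  @reader
  async getSnapshot(): Promise<string> {
    // Any number of readers may run concurrently; none of them overlap a writer.
    return serialize(this.doc);
  }

  @writer
  async applyPatch(patch: Patch): Promise<{ ok: true }> {
    // At most one writer runs at a time, and never alongside readers, so
    // `this.doc` cannot change underneath us across the awaits below.
    const next = await rebase(this.doc, patch);
    this.doc = next;
    await this.ctx.storage.put("doc", serialize(next));
    return { ok: true };
  }
}
```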
So the guarantees here do a lot of the heavy lifting: you know there is at most 1 @writer and so you don’t have to worry about interleaved execution.
As a reference point, the code block that follows is (roughly) the same as the manually-locked version of our collaborative text editing implementation. Look at it and try to think through where you can move the lock. Is it resilient to many writers? To many readers? Where are the places it’s safe to put another await and why?
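This is not our real code, but a condensed sketch in the same shape, assuming an ordinary async Mutex plus hypothetical loadDoc/rebase/persist/serialize helpers; the questions above apply line by line:

```ts
// Condensed sketch of the manually-locked approach, not the real implementation.
class DocumentRoom extends Server {
  private lock = new Mutex(); // any ordinary async mutex
  private doc?: DocState;

  async onMessage(conn: Connection, raw: string) {
    const msg = JSON.parse(raw);

    if (msg.type === "snapshot") {
      // Safe without the lock? Only if no writer ever awaits while `doc`
      // is in a half-applied state.
      conn.send(serialize(this.doc));
      return;
    }

    const release = await this.lock.acquire();
    try {
      if (!this.doc) {
        this.doc = await loadDoc(this.ctx.storage); // must this happen inside the lock?
      }
      const next = await rebase(this.doc, msg.patch); // what if this throws halfway?
      this.doc = next;
      await persist(this.ctx.storage, next);          // and if only this write fails?
      this.broadcast(serialize(next));
    } finally {
      release();
    }
  }
}
```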
It’s definitely doable to figure out, but it’s a lot harder than it initially seems. Our team collectively has decades of experience working on these problems, and we have still caused subtle data corruption issues multiple times by moving various awaits around in the real version of this code.
We eventually did land on a version that uses the “n readers or 1 writer” semantics, and it was both much easier to reason about and easily met our 30-50ms perf budget for tens of readers and hundreds of writers.
Of course, you can do all of this with a simple R/W lock. But as I’ve argued before: the point of the platform is to increase the probability of writing safe code. If these semantics are vastly safer and also satisfy the vast majority of performance goals, they should be the platform default. People can break out of them when they really need the extra performance.
“Transactional” updates of many Durable Object instances
Sooner or later when you build a Durable Objects app, you will have to perform some administrative task. You’ll need to migrate all the persisted data in thousands of distinct Durable Object instances; you’ll need to spot-correct some bit of corrupted data for a customer; you’ll need to inspect the data for signs of intruders.
To do this you will find yourself building the following:
A second, parallel “shadow” API for administration.
…with a completely separate authorization path.
…and tools for “doing things” to the instance: pausing traffic, tracking which instances have been changed, rolling back changes that aren’t successful, etc.
The first two are painful but manageable. The third is nearly unbearable, something that is very hard to build without some notion of distributed multi-instance “transactions” built directly into the Durable Objects platform. This is probably the biggest gap in the platform overall.
Using Cap’n Web-like syntax again, imagine a @transaction decorator. This would allow us to use .ref to reference other Durable Object instances and call @writers on them. But, critically, either all the writes succeed, or none of them do.
Here, for example, is how a transactional Bank transfer implementation might work:
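A sketch, reusing the imagined @writer decorator from above and adding equally imaginary @transaction and this.ref; none of this exists in the platform today:

```ts
// Imagined API: @transaction and .ref are not part of Durable Objects today.
class Account extends Server {
  balance = 0;

  @writer
  async debit(amount: number) {
    if (this.balance < amount) throw new Error("insufficient funds");
    this.balance -= amount;
  }

  @writer
  async credit(amount: number) {
    this.balance += amount;
  }
}

class Bank extends Server {
  @transaction
  async transfer(fromId: string, toId: string, amount: number) {
    const from = this.ref(Account, fromId); // reference another Durable Object instance
    const to = this.ref(Account, toId);

    // Either both @writer calls commit, or neither does; a throw in `debit`
    // means `credit` is never applied, and vice versa via rollback.
    await from.debit(amount);
    await to.credit(amount);
  }
}
```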
Yes, we all know distributed transactions are fraught. Yes, we all know there are a lot of caveats around performance. But it is also true that eventually you are going to need to touch many Durable Object instances. This is a toy example, but things like migrations are a simple fact of life.
And when you do need to touch many Durable Object instances, the only option cannot be “implement your own two-phase commit, failure detection, etc.” And anyway, we know this is all possible and reasonable at least to some extent because platforms like Reboot already support this exact thing.
Better tools for stateful bringup
The blessing and the curse of Durable Objects is that they are cheap to restart. The platform is not at all shy about killing your instance and bringing it up somewhere else, so your app is going to have to deal gracefully with little blips in availability.
In particular, that means that if you have to load data from disk (or wherever) on startup, there is going to be an instant where you are not yet ready to process requests. You can return an error, but then (say) users collaboratively editing a document will occasionally get a flurry of 500s for any open requests they have while they are waiting for bringup to complete.
There are two big things that you can do to gracefully deal with this problem:
Far and away the best thing to do is build idempotency into the app, but we already covered that.
The platform should make it easier to “hold” requests while the server initializes.
In the case of our collaborative text editor, when the server gets reset while requests are in flight, we hold a read lock which blocks until the editor state is loaded from R2, open patches are applied from the log stored in Durable Objects k/v, and the server is ready for requests.
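A condensed sketch of that pattern, again assuming a PartyKit-style Server base class; loadSnapshotFromR2, readPatchLog, and rebuild stand in for whatever your loaders actually look like:

```ts
// Sketch of the "hold requests during bringup" pattern described above.
class DocumentRoom extends Server {
  private ready?: Promise<DocState>;

  private ensureLoaded(): Promise<DocState> {
    // Kicked off once per (re)start; every handler awaits the same promise,
    // so in-flight requests wait for bringup instead of getting 500s.
    this.ready ??= (async () => {
      const snapshot = await loadSnapshotFromR2(this.env.DOC_BUCKET, this.name);
      const patches = await readPatchLog(this.ctx.storage); // open patches from DO k/v
      return rebuild(snapshot, patches);
    })();
    return this.ready;
  }

  async onMessage(conn: Connection, raw: string) {
    const doc = await this.ensureLoaded();
    this.applyAndBroadcast(conn, JSON.parse(raw), doc); // hypothetical handler
  }
}
```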
This, again, is doable, but since this is part of the normal lifecycle of basically any nontrivial Durable Object app, it should just be built in.
Admin tools for object state
Ok this is cheating. I used this as an example for the section about distributed transactions.
This is really important though, and I would like to see the platform fully embrace the ops story. It does not feel like it has, yet.
Conclusion
Ok so a lot of what we wrote here looks like it is complaining about Durable Objects. I think it’s worth going back to the intro and remembering that the framing of this conversation was that Durable Objects are nearly perfect infrastructure, and also some stuff could be better. That is all still true.
So, with that, I want to leave you with a couple things.
First, a counterfactual. We eventually did boil our collab problems down to a few hundred lines of code that do seem to work, with tests, and so on. If you want a good idea of what it looked like before we moved to the new model, you can look at the @tldraw/sync-core code. tldraw are obviously immensely talented, and this is certainly not to take away from that. And while I do not know for certain that the code could be made simpler with better platform-level tools, I will say their code does look remarkably like ours did before these changes.
Second, inspiration. I really strongly recommend looking at Reboot. It is, essentially, Durable Objects with richer, better semantics. Much of what I have described here already exists there. In particular, I recommend looking at their Transaction and Workflow APIs, and the way they handle idempotency, which is (IMO) industry-best. I think their notion of hot state is not going to be a perfect fit for the direction Durable Objects should go, but it’s worth looking at anyway, just to see how other people are doing it.