
automerge-repo's Introduction

Automerge



Automerge is a library which provides fast implementations of several different CRDTs, a compact compression format for these CRDTs, and a sync protocol for efficiently transmitting those changes over the network. The objective of the project is to support local-first applications in the same way that relational databases support server applications - by providing mechanisms for persistence which allow application developers to avoid thinking about hard distributed computing problems. Automerge aims to be PostgreSQL for your local-first app.

If you're looking for documentation on the JavaScript implementation, take a look at https://automerge.org/docs/hello/. There are other implementations in both Rust and C, but they are at an earlier stage and don't have documentation yet. You can find them in rust/automerge and rust/automerge-c if you are comfortable reading the code and tests to figure out how to use them.

If you're familiar with CRDTs and interested in the design of Automerge in particular take a look at https://automerge.org/automerge-binary-format-spec.

Finally, if you want to talk to us about this project please join our Discord server!

Status

This project is formed of a core Rust implementation which is exposed via FFI in JavaScript+WASM, C, and soon other languages. Alex (@alexjg) is working full time on maintaining Automerge, other members of Ink & Switch are also contributing time, and there are several other maintainers. The focus is currently on shipping the new JS package. We expect to be iterating the API and adding new features over the next six months, so there will likely be several major version bumps in all packages in that time.

In general we try to respect semver.

JavaScript

A stable release of the JavaScript package is currently available as @automerge/[email protected]. Pre-release versions of 2.0.1 are available as 2.0.1-alpha.n. 2.0.1* packages are also available for Deno at https://deno.land/x/automerge

Rust

The Rust codebase is currently oriented around producing a performant backend for the JavaScript wrapper, and as such the API for Rust code is low level and not well documented. We will be returning to this over the next few months, but for now you will need to be comfortable reading the tests and asking questions to figure out how to use it. If you are looking to build Rust applications which use Automerge you may want to look into autosurgeon.

Repository Organisation

  • ./rust - the Rust implementation and also the Rust components of platform specific wrappers (e.g. automerge-wasm for the WASM API or automerge-c for the C FFI bindings)
  • ./javascript - The javascript library which uses automerge-wasm internally but presents a more idiomatic javascript interface
  • ./scripts - scripts which are useful for maintenance of the repository. This includes the scripts which are run in CI.
  • ./img - static assets for use in .md files

Building

To build this codebase you will need:

  • rust
  • node
  • yarn
  • cmake
  • cmocka

You will also need to install the following with cargo install

  • wasm-bindgen-cli
  • wasm-opt
  • cargo-deny

And ensure you have added the wasm32-unknown-unknown target for rust cross-compilation.

The various subprojects (the rust code, the wrapper projects) have their own build instructions, but to run the tests that will be run in CI you can run ./scripts/ci/run.

For macOS

These instructions worked to build locally on macOS 13.1 (arm64) as of Nov 29th 2022.

# clone the repo
git clone https://github.com/automerge/automerge
cd automerge

# install rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# install cmake, node, cmocka
brew install cmake node cmocka

# install yarn
npm install --global yarn

# install javascript dependencies
yarn --cwd ./javascript

# install rust dependencies
cargo install wasm-bindgen-cli wasm-opt cargo-deny

# get nightly rust to produce optimized automerge-c builds
rustup toolchain install nightly
rustup component add rust-src --toolchain nightly

# add wasm target in addition to current architecture
rustup target add wasm32-unknown-unknown

# Run ci script
./scripts/ci/run

If your build fails to find cmocka.h you may need to teach it about homebrew's installation location:

export CPATH=/opt/homebrew/include
export LIBRARY_PATH=/opt/homebrew/lib
./scripts/ci/run

Contributing

Please try to split your changes up into relatively independent commits which change one subsystem at a time, and add good commit messages which describe what the change is and why you're making it (err on the side of longer commit messages). git blame should give future maintainers a good idea of why something is the way it is.

Releasing

There are four artefacts in this repository which need releasing:

  • The @automerge/automerge NPM package
  • The @automerge/automerge-wasm NPM package
  • The automerge Deno package
  • The automerge rust crate

JS Packages

The NPM and Deno packages are all released automatically by CI tooling whenever the version number in the respective package.json changes. This means that the process for releasing a new JS version is:

  1. Bump the version in the rust/automerge-wasm/package.json (skip this if there are no new changes to the WASM)
  2. Bump the version of @automerge/automerge-wasm we depend on in javascript/package.json
  3. Bump the version in @automerge/automerge also in javascript/package.json

Put all of these bumps in a PR and wait for a clean CI run, then merge the PR. The CI tooling will pick up a push to main with a new version and publish it to NPM. This depends on an access token available as NPM_TOKEN in the actions environment; this token is generated with a 30 day expiry date, so it needs (manually) refreshing every so often.

Rust Package

This is much easier, but less automatic. The steps to release are:

  1. Bump the version in automerge/Cargo.toml
  2. Push a PR and merge once clean
  3. Tag the release as rust/automerge@<version>
  4. Push the tag to the repository
  5. Publish the release with cargo publish


automerge-repo's Issues

Websocket Version bugs

@alexjg I have been running some tests with a sync server, and there are two problems which remain with respect to versioning:

  1. If the versions don't match, the server closes the connection. As it stands, the client is set up so that it automatically retries the connection when it closes, causing it to reattempt connection infinitely.
  2. If the server returns an error, it gets passed back to the NetworkSubsystem which treats it as a message.

I have added some logic to fix the second issue in #142 but it might be better to just keep the error in the adapter for now?

On Document URLs, and Document IDs

It looks like I won't be able to finish this patch before I go out on holiday, so I'm going to write up my plan and push the branch so that it can either be critiqued or finished (or both).

I'm working on a new format to replace the UUID document IDs used by automerge-repo.

The goal of this change is twofold. First, to introduce a recognizable and consistently parsable URL for Automerge documents that can be stored in an Automerge document. In the future, this URL should support specifying heads, or perhaps branch IDs, or other kinds of tomfoolery, but for now it's just designed to be a recognizable URL. Second, to allow Automerge-Repo to immediately discard URLs that are either the wrong data type or the result of a transcription error.

The URL format is straightforward:
automerge:<checksummed-bs58-encoded-UUID>
It looks like this:
automerge:3f1w4KRPqEgwCrGGdyM6ATLYfMWo.

Let's discuss each part of the URL.

scheme / protocol

First, the scheme, automerge. I've chosen to use a custom scheme because Automerge is not run over HTTP (though it can tunnel over it via websockets), and because a traditional HTTP url would require us to provide elements like a hostname which doesn't really exist in this context.

Unfortunately, the automerge scheme can't be fetch()'d, at least not as of this writing, and I presume by extension it can't be intercepted by a service worker either. I ran some experiments looking for a scheme that would be accepted by fetch() and so on, and concluded that browser URL parsing inconsistencies and limitations on fetch meant there was no way to produce an authentic automerge URL that would work reliably without doing things like adding a made-up hostname or calling it HTTPS.

uuid

I've kept the UUID as the "deep" representation of the data to maintain some amount of consistency with past versions, and also because using ~128b of entropy to identify some unique resource is pretty much industry standard. I could have chosen a shorter data type (maybe 64 bits is enough) but there's no reason to be cute here.

bs58 / bs58check

The encoding serves two purposes. First, a 16 byte UUID encoded into hex is 38 bytes of text. A bs58 encoding is a third shorter, at 24. We then "spend" four of those bytes back to add a checksum which allows us to detect if the URL was copy-pasted with a character missing or (worse) if someone is just passing wildly unrelated values into the system.
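
As a rough illustration, here is a minimal sketch of producing such a URL in a Node context, assuming the off-the-shelf bs58check and uuid npm packages; the real implementation's checksum scheme may differ, and everything below is illustrative rather than the actual automerge-repo code.

import bs58check from "bs58check"
import { v4 as uuidv4, parse as parseUuid } from "uuid"

// Hypothetical sketch: encode a fresh 16-byte UUID as an automerge: URL.
const rawDocumentId = parseUuid(uuidv4())                      // 16 raw bytes
const encoded = bs58check.encode(Buffer.from(rawDocumentId))   // base58 text with a 4-byte checksum appended
const url = `automerge:${encoded}`

// Decoding throws if a character was dropped or altered, which is the point of the checksum.
const roundTripped = bs58check.decode(url.slice("automerge:".length))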

library internals

I've concluded the URL format should only exist "at the edge" of the library. Internally, on disk, and over the network we should use the most efficient 16 byte binary representation.

This URL format should be stored in the document as a string (for now), but at some point in the future we may add a custom type for it to optimize storage and retrieval of document connections.

Internally to the library, we should use the textual BS58check representation of the UUID for logging and anything user facing. This will allow the user to make visual comparisons to the URL they passed in.

Conclusions

URLs are hard! But this seems like a reasonable plan. The most likely critiques I anticipate are with the bs58check system, and my theory there is that it's better to use something off the shelf than to invent something new.

I welcome your questions / comments.

Vanilla JS demo

We should have a demo of how to use automerge repo without libraries

Device IDs

Peter mentioned the idea of having some sort of RepoId which would be able to identify a device, and may contain multiple peers.

If this is achievable, a few thoughts on where it might help. The first 3 which spring to mind are:

  1. Sync confidence - knowing that your data has successfully synced with another device
  2. Local saves - the issue of multiple local peers (eg. open tabs) writing to the same data source could be trivially solved if local changes are only written by the peer which makes them. This doesn't help with who should write changes from the network, but it's a start.
  3. Undo/Redo - Not yet part of Repo, and obviously a more complex topic, but at the very least you need a way to determine which changes have come from your own device and which have come from the rest of the network. What to do with that information is a conversation for another day.

Whatever the weather, I think being able to group and identify a cluster of Repos would be most useful.

Mismatched heads error when syncing multiple changes from different peers

TLDR: When syncing a large number of batched changes to another peer, a simultaneous update from another peer causes the document to fail with an "error inflating document chunk ops: mismatching heads" error when saving and then loading the document.


First a little context. I have been exploring how automerge-repo might work when used with an http based network adapter, rather than a tcp based one (eg. websocket). On the server end I am then storing the document in a postgres db using Automerge.save, then loading it back on each update using Automerge.load.

Having got it working, I thought it would be sensible to stress test it by creating a large number of changes to sync. This worked without issue, syncing successfully to an additional connected peer. However, due to the latency in the network and the load/save cycle, this took a short while to complete. My next test was to do the same thing, but simultaneously push a change from the additional peer. This is where I first saw the error.

To check that it wasn't an error with my http adapter implementation or improper transaction isolation with postgres, I mocked a version using (slightly modified) existing automerge-repo adapters. The result was the same. A reproduction is available here.

As this runs on one host, the only modification to the adapter was to introduce some latency so that the batched changes would take a short while to complete.

I suspect that there is a simpler way to recreate the issue, perhaps by simply trying to sync a change from a document which has not yet completed a round of sync messages to update it from another peer.


@alexjg - My hunch is that this is an upstream syncMessage issue in automerge itself, but hopefully there is enough information here to help get to the bottom of it.

move sync-server to another repository

The sync server is a particular application which uses automerge-repo. I feel that it doesn't belong in this repository. The examples do use it however so we'll need to make sure it's published to NPM in such a way that the examples can depend on it.

Network backwards compatibility post v1.0

One concern I have about repo is how to manage peers which are running out-of-sync versions of repo.

For example, an electron app is launched running v1.0. It syncs to a server running the same version.

How can we be sure that both adapters and repo itself can communicate correctly if only one of these is updated?

Talking with @pvh about this, it seems that at a minimum network messages should include some kind of version tag, likely for both repo itself and the adapter employed.

Document corruption when updating while the dev server is stopped

While running the counter demo on the latest main, I found that starting (yarn workspace automerge-repo-demo-counter dev) and stopping (ctrl-c) the dev server a few times while incrementing the counter in the browser causes the saved document (the one in localforage) to become corrupted (see footnote 1):

caught (in promise) error inflating document chunk ops: invalid changes: changes out of order
Repo.js:36

This persists across reloads (obviously), and only stops once you clear the IndexedDb (where localforage keeps data). I've attached a log containing the localforage::keyvaluepairs database (with Uint8Arrays encoded to hex).

localhost-1682658343400.log

Ultimately, this leaves the retrieved document handle in the loading state and causes the following error when you click the button and try to update the state:

Handle.js:170 Uncaught (in promise) Error: Cannot change while loading
    at DocHandle.change (DocHandle.js:170:19)
    at changeDoc (useDocument.js:21:16)
    at onClick (App.tsx:14:9)
    at HTMLUnknownElement.callCallback2 (react-dom.development.js:4164:14)
    at Object.invokeGuardedCallbackDev (react-dom.development.js:4213:16)
    at invokeGuardedCallback (react-dom.development.js:4277:31)
    at invokeGuardedCallbackAndCatchFirstError (react-dom.development.js:4291:25)
    at executeDispatch (react-dom.development.js:9041:3)
    at processDispatchQueueItemsInOrder (react-dom.development.js:9073:7)
    at processDispatchQueue (react-dom.development.js:9086:5)

Besides fixing the underlying data corruption, it would be nice to surface this error earlier, preferably in Repo.find, but probably the best you can do is add an error event to DocHandle. Otherwise there's no way of knowing that loading the document failed besides listening to the 'error' event on window.

Footnotes

  1. It happens pretty consistently, but it's hard to give a specific set of actions. Maybe I'll record a screencast to make it easier for you to reproduce.

update demos

The demos should not be in /packages but instead in a top level examples folder. We should check they are up to date with the current API and ideally build them in CI.

Improve incremental data storage by using hashes as keys instead of integers

The current implementation of the Storage Subsystem writes an extra file for each individual change. These files are identified as $docId.incremental.$number. This works okay if you have only a small number of incremental changes and only a single writer.

We could do better, I think, in two ways. For a storage engine that supports locking and append operations, we could simply append the change onto the end of the document. Automerge is capable of reading additional changes this way (and indeed, this is how we assemble a document at load time.)

For key value engines that don't allow edits, we could store the incremental edits at $docId/incremental/$hashOfChange. This would allow for a few nice properties. First, we don't have to worry about race conditions in writing (as much). If two agents both write an incremental change, no big deal! (When we go to clean up these incremental keys we should be careful not to delete all keys: more may have arrived from another process while we were working: just delete keys you can guarantee you've got in your current snapshot. You will also need to take a lock while making a snapshot so you don't have two nodes each trying to write a snapshot with disjoint data.)
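
As a sketch of what such a key layout might look like (assuming the Web Crypto API is available; the helper name and key shape are illustrative, not the actual StorageAdapter interface):

// Hypothetical helper: derive a content-addressed key for one incremental change.
async function incrementalKey(docId: string, change: Uint8Array): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", change)
  const hash = Array.from(new Uint8Array(digest))
    .map(byte => byte.toString(16).padStart(2, "0"))
    .join("")
  return `${docId}/incremental/${hash}`
}

// Two writers saving the same change produce the same key, so concurrent writes are idempotent.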

The cost of this approach is that it will require the storage engines to be able to enumerate all the keys in a sub-namespace. For a filesystem, this is pretty trivial: put the incremental keys in a subdirectory. Other databases will have similar properties. The one place I'm not sure we know how to support with this approach (but that I think we should support) is an S3 backend.

If anyone wants to tinker with this, please feel free to grab me and ask any clarifying questions. Otherwise I'll get around to it at some point, I'm sure.

[sync-server] this.repo.sharePolicy(...).then is not a function

Came here from hypermerge and really glad to see this project. Pluggability ftw!

I'm getting:

dist/synchronizer/CollectionSynchronizer.js:78
            void this.repo.sharePolicy(peerId, documentId).then(okToShare => {
                                                           ^

TypeError: this.repo.sharePolicy(...).then is not a function

Which is fixed in packages/automerge-repo-sync-server/src/index.js in the same way as here:

-   sharePolicy: (peerId) => false,
+    sharePolicy: async (peerId) => false,

Implementation of `DocSynchronizer.receiveSyncMessage` may cause messages to be received out of order

While writing #69, I noticed that receiveSyncMessage waits for the doc handle to be ready before calling update:

async receiveSyncMessage(
  peerId: PeerId,
  channelId: ChannelId,
  message: Uint8Array
) {
  if ((channelId as string) !== (this.documentId as string))
    throw new Error(`channelId doesn't match documentId`)

  // We need to block receiving the syncMessages until we've checked local storage
  // TODO: this is kind of an opaque way of doing this...
  await this.handle.loadAttemptedValue()

  this.handle.update(doc => {
    const [newDoc, newSyncState] = A.receiveSyncMessage(
      doc,
      this.#getSyncState(peerId),
      message
    )
    this.#setSyncState(peerId, newSyncState)
    // respond to just this peer (as required)
    this.#sendSyncMessage(peerId, doc)
    return newDoc
  })
}

receiveSyncMessage is invoked (indirectly through CollectionSynchronizer) in an on("message", ...) listener in Repo.ts:

networkSubsystem.on("message", msg => {
  const { senderId, channelId, message } = msg
  // TODO: this demands a more principled way of associating channels with recipients
  // Ephemeral channel ids start with "m/"
  if (channelId.startsWith("m/")) {
    // Ephemeral message
    this.#log(`receiving ephemeral message from ${senderId}`)
    ephemeralData.receive(senderId, channelId, message)
  } else {
    // Sync message
    this.#log(`receiving sync message from ${senderId}`)
    synchronizer.receiveSyncMessage(senderId, channelId, message)
  }
})

However, if we get two messages before doc handle is ready, I don't think there's any guarantee that the two listeners will resume from the await in the same order they reached it, which could cause sync messages to be processed out of order. I haven't reproduced this yet, so maybe I'm wrong about the spec and someone can correct me.
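
One way this could be addressed (a hedged sketch, not something the issue or the library commits to) is to funnel each document's incoming messages through a simple promise queue, so that awaiting the handle cannot reorder them:

// Hypothetical per-document queue: each handler runs only after the previous one settles.
class SyncMessageQueue {
  #tail: Promise<void> = Promise.resolve()

  enqueue(handleMessage: () => Promise<void>): Promise<void> {
    // Chain onto the previous promise so messages are processed strictly in arrival order,
    // even if an individual handler awaits (e.g. on loadAttemptedValue) or rejects.
    this.#tail = this.#tail.then(handleMessage, handleMessage)
    return this.#tail
  }
}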

Improve gossip protocol for ephemeral messages

In NetworkSubsystem, if we see a broadcast message (currently only ephemeral messages), we rebroadcast it to all our other peers.

networkAdapter.on("message", msg => {
  const { senderId, channelId, broadcast, message } = msg
  this.#log(`message from ${senderId}`)
  // If we receive a broadcast message from a network adapter we need to re-broadcast it to all
  // our other peers. This is the world's worst gossip protocol.
  // TODO: This relies on the network forming a tree! If there are cycles, this approach will
  // loop messages around forever.
  if (broadcast) {
    Object.entries(this.#adaptersByPeer)
      .filter(([id]) => id !== senderId)
      .forEach(([id, peer]) => {
        peer.sendMessage(id as PeerId, channelId, message, broadcast)
      })
  }
  this.emit("message", msg)
})

As noted in the comments, this is not a great gossip protocol. It relies on the network forming a tree. If there are cycles, this approach will loop messages around forever.
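
A common fix for this, sketched here on the assumption that broadcast messages could carry a unique ID (which they do not today), is to remember recently seen message IDs and drop duplicates instead of re-forwarding them:

// Hypothetical duplicate suppression for re-broadcast; `messageId` is not a field the
// current NetworkSubsystem messages actually have.
const seenMessageIds = new Set<string>()

function shouldRebroadcast(messageId: string): boolean {
  if (seenMessageIds.has(messageId)) return false // already forwarded once; break the cycle
  seenMessageIds.add(messageId)
  return true
}

// A real implementation would also evict old IDs to bound memory.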

Removing repo documents

I wanted to start a discussion on how we might go about removing a document/handle from a repo.

It seems to me that there are 2 stages to removing a document:

  1. A document can be locally "forgotten". This is relatively straightforward, as it would simply involve removing the document from storage and deleting it from the DocCollection cache.
  2. We can request for the document to be removed from one or more peers. This is obviously a more complex scenario, as both users and developers will want differing levels of control. There is also the matter of how delete requests might be propagated amongst peers.

I would like to propose that we implement stage 1 of the above. The mechanism of locally forgetting a document is useful as a standalone option and would also ease the implementation of stage 2, once we have settled on a process.

Happy to put together a forget PR. If there are any thoughts on naming the API, please let me know.

Types are incomplete for `Repo`

First of all, amazing project! Congrats on the launch of AutoMerge 2.0!

When playing around with this I noticed that TypeScript doesn't pick up on any Repo properties provided by DocCollection and EventEmitter. E.g., Repo.create turns up as any.


This can be explained when we look at Repo.d.ts where extends Mixin(...) has been replaced with extends Repo_base, which is of type any.


When I build from the main branch things seem to be working properly: Repo_base is not of type any but has proper typings.


Yarn install fails - missing deps

I tried a fresh install of automerge-repo in a project with yarn v3, and it seems like some peer dependencies are not set in the packages/ workspaces.

These monorepo packages are missing the following deps/devdeps/peerdeps from their package.json's:

  • automerge-repo-network-websocket missing:

    • @automerge/automerge
    • ws, requested by isomorphic-ws (peer dep)
  • automerge-repo missing:

    • @types/node, requested by ts-node
    • typescript, requested by ts-node

The following are all missing a "@automerge/automerge" dep/devdep/peerdep:

  • automerge-repo-network-broadcastchannel
  • automerge-repo-react-hooks
  • automerge-repo-storage-localforage

Share Policy use

I have been looking into how we might be able to remove local documents from a repo (#48) and started to look at how a deleted document might end up returning to a repo inadvertently.

NB. I know that work is being done on authentication and authorisation which will ultimately supplant the current share policy API, but I think this is worth discussing.

The current default share policy is that when 2 peers connect, all of the documents in each repo are synced automatically. This means that when one peer attempts to find a document which might have previously existed only on the other repo, it is already in its cache, so it finds it immediately.

A better option might be to have a default whereby documents are only shared when requested. However, I am not sure this is possible in the current implementation.

The sharePolicy is currently consulted (as far as I can tell) whenever a document is added to a repo, whenever a new peer connects and whenever a sync message is received. There are 4 situations which result from this:

  1. Bob has documents. Alice connects. Which documents should be shared?
  2. Bob creates a new document. Should it be shared with Alice?
  3. Bob tries to find a document which he doesn't already have. Should the share policy let him request it from Alice?
  4. Bob receives a sync message for one of his documents from Alice. Should he share his changes with her?

The issue is that we may only want to do some of these activities, and there isn't currently a way to discern which one we are attempting.

I would suggest that we add an additional parameter to the sharePolicy which allows the user to determine which stage of the process we are at. For example, we could add a 3rd type argument which is passed in to the sharePolicy.

export type ShareType = 'CONNECT' | 'CREATE' | 'FIND' | 'REQUEST';

I am sure there are other APIs which could be explored. What do people think?

Obviously this could easily be transferred to @HerbCaudill's auth work as and when
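
For illustration, a sharePolicy written against that proposed three-argument signature might look like the sketch below. The extra type parameter is the proposal in this issue, not the current API; PeerId and DocumentId are assumed to be the existing automerge-repo types.

// Hypothetical policy: only hand documents over when explicitly requested.
const sharePolicy = async (
  peerId: PeerId,
  documentId: DocumentId,
  type: ShareType
): Promise<boolean> => {
  switch (type) {
    case "FIND":
    case "REQUEST":
      return true    // share when a peer asks for this specific document
    default:
      return false   // don't proactively announce on CONNECT or CREATE
  }
}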

Better Timeout Handling

At the moment, documents which are not found locally are requested from peers. If a document is still not found after a period of time, the handle throws a timeout error.

There are a couple of improvements which could be made:

  1. Setting a custom timeout (see the sketch after this list)
  2. Letting the network inform repo that a document is unavailable so that the user can be informed as soon as it's known. This will also have the benefit of providing a more useful error message.
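
On the first improvement, a custom timeout can already be approximated in user code today; the sketch below is a generic helper you could wrap around handle.value(), not a proposed repo API.

// Reject if the wrapped promise doesn't settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    ),
  ])
}

// Usage: const doc = await withTimeout(handle.value(), 5_000)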

Delete protocol

When we delete a document from a local repo, all it does is set the local handle to a deleted state, emit an event, and remove the document from local storage.

While there is a test to ensure that a deleted document can be refetched, I don't think this works as expected. For example, while we remove the local handle from the cache, we don't delete the local synchronizer, nor do we inform the network that we have deleted the document.

I would suggest that we should remove the local synchronizer and probably inform the network of the deletion. Relevant peers should then remove the peer from their list of peers to synchronize with. This is of particular relevance to a sync server, which would otherwise continue syncing changes it receives from elsewhere.

At some point we should add a hook which a peer can use to determine if there is anything else which should be done with a delete message. For example, if you deleted in one tab, you might want to delete in a service worker and all other tabs but just stop syncing from your sync server.

Sync access to a Handle's underlying document

I think there is a case for a public api to access a DocHandle's document without the need for the handle to be "ready".

Whether or not the Handle is ready, it always has a document. It might not be in the expected state, but it exists nonetheless. At any point following handle.value() resolving, the document is in a valid state.

Leave means leave

There is an uncommon and unlikely-to-effect-the-real-world bug which occurs with the BroadcastChannel network adapter.

When something like Hot Module Replacement causes a page to be initialised a large number of times within a short window of time, any Repos on the page are instantiated with a new peerId. All of them join the BroadcastChannel network and they all start syncing with each other. It doesn't take long for this loop to get very much out of hand.

I think this could likely be mitigated by implementing leave on the adapter, and then exposing a close or leave method on Repo itself to allow the user to manually close network connections.

I have tested the issue against the latest pre-1.0 release and it also exists there, so this hasn't been introduced by any of our recent efforts.
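
As a rough sketch of the shape this could take (not the current adapter code), implementing leave might just close the underlying BroadcastChannel so that stale Repo instances drop out of the sync loop:

// Hypothetical sketch of a leave() on the BroadcastChannel adapter.
class BroadcastChannelNetworkAdapterSketch {
  #channel = new BroadcastChannel("automerge-repo")

  leave(): void {
    // After close(), this instance can neither send nor receive, so a hot-reloaded page's
    // orphaned Repo stops participating in the broadcast storm.
    this.#channel.close()
  }
}

// Repo could then expose something like repo.close() that calls leave() on every adapter (also hypothetical).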

Documentation

Currently the sole form of documentation is the typescript types. This is not nothing but we need more. To start I think we need:

  1. Typedoc comments on all the public methods (unless the name of the method is descriptive enough).
  2. A more "how to use this" style guide

Once the typedoc comments are complete we can publish them on automerge.org from https://github.com/automerge/automerge.github.io; we already do this for automerge, so it will hopefully be easy to add.

The source of the "how to" guide for automerge is entirely within https://github.com/automerge/automerge.github.io so maybe we should also write the automerge-repo how to guide there?

  • TypeDocs for Repo/DocCollection/Handle (others?)
  • Updated examples on automerge.org
  • READMEs of automerge-repo root
  • README automerge-repo-sync-server
  • READMEs of all example code
  • Check comments in examples
  • Check comments elsewhere in code
  • get Automerge-Repo into Docusaurus on Automerge.org

Support authentication & authorization

What follows is a sketch of a proposed approach for supporting authentication & authorization within automerge-repo.

Why

Automerge Repo currently offers no way to authenticate a peer, and very little in the way of access control.

Our current security model is the "Rumplestiltskin rule": If you know a document's ID, you can read that document, and everyone else who knows that ID will accept your changes.

That model is good enough for a surprising number of situations — the ID serves as an unguessable secret "password" for the document — but it has limitations. Without a way to establish a peer's identity, we can't revoke access for an individual peer — say if someone leaves a team, or if a device is lost. And we can't distinguish between read and write permissions, or limit access to specific documents.

An application might implement authentication and authorization in any number of ways, so this should be pluggable — like the existing network and storage adapters.

So initializing a repo might look something like this:

import { SuperCoolAuthProvider } from 'supercool-auth-library'

const authOptions = {
  // ...options specific to this type of authentication
}
const auth = new SuperCoolAuthProvider(options)

const repo = new Repo({ network, storage, auth })

API

An auth provider inherits from this abstract class:

export abstract class AuthProvider extends EventEmitter<AuthProviderEvents> {
  /**
   * Can this peer prove their identity? The provider implementation will
   * use the web socket to communicate with the peer.
   */
  abstract authenticate(peerId: PeerId, socket?: WebSocket): Promise<true | Error>

  // The following methods may be overriden by the provider and would replace
  // the existing `sharePolicy` that we pass to a repo.

  /** Should we tell this peer about the existence of this document? */
  async okToAdvertise(peerId: PeerId, documentId: DocumentId): Promise<boolean> {
    return false
  }

  /** Should we provide this document (and changes to it) to this peer when asked for it by ID? */
  async okToSend(peerId: PeerId, documentId: DocumentId): Promise<boolean> {
    return false
  }

  /** Should we accept changes to this document from this peer? */
  async okToReceive(peerId: PeerId, documentId: DocumentId): Promise<boolean> {
    return false
  }
}

Authentication

The auth provider's authenticate method is invoked when the network adapter emits a
peer-candidate event.

// NetworkSubsystem.ts
networkAdapter.on('peer-candidate', async ({ peerId, channelId, socket }) => {
  const { authenticationResult } = await authProvider.authenticate({ peerId, socket })
  if (!authenticationResult.isValid) {
    const { error } = authenticationResult
    this.emit('peer-authentication-failed', { peerId, channelId, error })
  } else {
    // ...

    this.emit('peer', { peerId, channelId })
  }
})

Authorization

Advertising

TODO

Sending changes

TODO

Receiving changes

TODO

A disconnected peer cannot rejoin

At present, if a network adapter fires a peer-disconnected event, the DocSynchronizer removes the peer from its list, but maintains its state.

When a new peer-candidate event is fired, the DocSynchronizer checks to see if it has a doc state for that peer before adding the peer back to its list. As the state is still present, the peer never gets added.

Happy to issue a pull request for a fix, but I have a question about implementation. Would it be preferred to:

  1. Remove the peer state when a peer is disconnected OR:
  2. Keep the state and always add to the peer list if not already present on reconnection

It seems to me that it would be better to preserve the known state so that we would not be losing information already known about the peer's doc. I see that the beginSync method already performs an encode/decode cycle.

NB. I spotted this as I was trying to find a way to reset the state from a network adapter in the case that the connection dropped or a syncMessage failed to get through.

Thoughts on architecture

Peter's aspiration is for this library to provide composable units that can be used independently, or swapped out for alternative implementations.

I'm on board with that aspiration in principle. Clearly the NetworkAdapter and StorageAdapter abstractions make a ton of sense, as does the proposed AuthProvider.

But I still can't picture what it would mean to use DocCollection, NetworkSubsystem, EphemeralData, or StorageSubsystem independently of Repo, or what the motivation would be.

In the meantime, we're paying a steep cognitive price for what feels to me like premature abstraction. The work that automerge-repo is doing is not all that complex, but it's hard to see what's actually load-bearing through the maze of event-driven orchestration between all these classes. All this indirection means that it takes work to track down who depends on what, what an event is intended to accomplish, and what's part of the public API vs what is internal only.

I'd like to do one of two things:

  1. document and test ways in which the current decomposition would be useful to someone; OR
  2. fold some or all of those classes into Repo

My preference would be to go ahead and do (2). We can always refactor later to reintroduce some abstraction if and when we have an actual use case for it, and in the meantime we'll just be able to see the shape of the thing more clearly.

Ephemeral message scope

Ephemeral messages are currently emitted with a channelId which is not used to scope messages in any way.

My understanding of the current setup is that this means that 2 independent peers connected via (for example) a central server, would receive any ephemeral messages that the other sent, even if they were accessing a completely different set of documents.

If this is the case, would it be useful to be able to scope messages to a specific document, such that it is only sent to peers with whom the document is already being synced?

Are there any other scenarios?

PatchCallback Signature

In automerge the PatchCallback has been recently updated so that the PatchInfo parameter includes a source property.

At the moment, the patch information is returned to the user as below:

this.emit("patch", { handle: this, patches, before, after }),

Can I propose that we adopt a similar signature to that of the PatchCallback?

const doc = A.init<T>({
  patchCallback: (patches, patchInfo) =>
    this.emit("patch", { handle: this, patches, patchInfo }),
})

Unable to use automerge-repo in Deno: Property 'on' missing from 'Repo'

It seems like automerge-repo currently cannot be used with Deno natively, nor via Deno's npm compatibility layer. When trying to run it natively, Deno complains about all of the imports of .js files. When trying to import it via npm, the typechecker seems to get hung up on the type of Repo (specifically its extension of EventEmitter)

Property 'on' does not exist on type 'Repo'.deno-ts(2339)

Note that the property is definitely there at run-time, but IDEs fail to find it.

Steps to reproduce:

  1. Make a file called main.ts with the following contents:
import {Repo} from "npm:@automerge/[email protected]";
import {MessageChannelNetworkAdapter} from "npm:@automerge/[email protected]";

const messageChannel = new MessageChannel();
const port = messageChannel.port1;

const repoConfig = {
    network: [new MessageChannelNetworkAdapter(port)]
};

const repo = new Repo(repoConfig);
repo.on("document", () => true);
  2. Open the project in an editor with Deno support (e.g., Visual Studio Code with the Deno extension, or Webstorm with its Deno plugin) and make sure it is enabled for the project. The editor should raise a type error.

Some additional info which is tangentially related: I ran into this issue when trying to wrap the main classes (DocHandle, Repo, etc) with my own classes that have the same interface as the ones being wrapped. This included my classes extending EventEmitter. A problem arose because Deno complained (at run-time) I could not invoke super() on a superclass twice, the superclass being EventEmitter. Is it an intended use-case of the eventemitter3 library to extend EventEmitter?

`DocHandle.value` and `DocHandle.doc` are confusing

Currently the DocHandle.doc attribute throws an exception when trying to access it before the document is ready, where ready means (as far as I can tell) one of:

  • The handle was created with Repo.create()
  • The handle was created with Repo.find() and we found something in storage
  • The handle was created with Repo.find(), nothing was found in storage but a peer sent us some changes for the document

The intention here is that the user call DocHandle.isReady before accessing DocHandle.doc. There is also an async method, DocHandle.value, which waits for the document to become ready and returns the value. This seems confusing to me because it's not clear from the naming or the types what the difference is between doc and value(), and the relationship to isReady is oblique.

I think a nicer API would be for DocHandle.doc to return null until the document is ready. For typescript users this will signal the difference between value() and doc in a way which hopefully will prompt reading the documentation for doc, which should include a mention of isReady and value().
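
To make the distinction concrete, the proposed shape would look roughly like this sketch (the types are illustrative, not the current DocHandle interface):

interface DocHandleSketch<T> {
  isReady(): boolean
  // null until the document has been created, loaded from storage, or received from a peer
  doc: T | null
  // resolves once the document is ready
  value(): Promise<T>
}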

Can't run sync server tests on Windows

I haven't been able to get the sync server tests to run on windows.

First, there's a line-endings issue with tests.sh. I can fix this manually by switching my local file from CRLF to LF. There's probably something we can do with the git config so that it doesn't do its normal magic with this one file's line endings on Windows.

Once that is fixed, it complains about imports in the js file.

Error message
↯ yarn workspace automerge-repo-sync-server test
yarn workspace v1.22.10
yarn run v1.22.10
$ bash ./scripts/tests.sh
yarn run v1.22.10
yarn run v1.22.10
$ mocha --no-warnings --experimental-specifier-resolution=node --exit
$ node ./src/index.js
/mnt/c/git/pvh/automerge-repo/packages/automerge-repo-sync-server/src/index.js:2
import fs from "fs"
    ^^

SyntaxError: Unexpected identifier
    at Module._compile (internal/modules/cjs/loader.js:723:23)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
    at Module.load (internal/modules/cjs/loader.js:653:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
    at Function.Module._load (internal/modules/cjs/loader.js:585:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:831:12)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:623:3)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Error: Not supported
    at Object.exports.doImport (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/nodejs/esm-utils.js:35:41)
    at formattedImport (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/nodejs/esm-utils.js:9:28)
    at Object.exports.requireOrImport (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/nodejs/esm-utils.js:42:34)   
    at Object.exports.loadFilesAsync (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/nodejs/esm-utils.js:100:34)   
    at Mocha.loadFilesAsync (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/mocha.js:447:19)
    at singleRun (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/cli/run-helpers.js:125:15)
    at exports.runMocha (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/cli/run-helpers.js:190:10)
    at Object.exports.handler (/mnt/c/git/pvh/automerge-repo/node_modules/mocha/lib/cli/run.js:370:11)
    at innerArgv.then.argv (/mnt/c/git/pvh/automerge-repo/node_modules/yargs/build/index.cjs:443:71)
    at process._tickCallback (internal/process/next_tick.js:68:7)
    at Function.Module.runMain (internal/modules/cjs/loader.js:834:11)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:623:3)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
./scripts/tests.sh: line 7: kill: (15) - No such process
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed.
Exit code: 1
Command: C:\Program Files\nodejs\node.exe
Arguments: C:\Users\herb\AppData\Roaming\npm\node_modules\yarn\lib\cli.js test
Directory: C:\git\pvh\automerge-repo\packages\automerge-repo-sync-server
Output:

info Visit https://yarnpkg.com/en/docs/cli/workspace for documentation about this command.

In localfirst/relay, the server is written as a typescript class, and there's a separate repo (relay-deployable) that just contains a small js file to instantiate the class and run the server. That setup makes it easy to test the server's functionality without having to spin up separate processes:

import { Server } from './Server'

 // ...

describe('Server', () => {
  let url: string
  let server: Server

  beforeAll(async () => {
    const port = await getAvailablePort({ port: 3100 })
    url = `ws://localhost:${port}`
    server = new Server({ port })
    await server.listen({ silent: true })
  })

  afterAll(() => {
    server.close()
  })

  // ...

  it('should make a connection', done => {
    const { aliceId } = setup()

    server.on('introductionConnection', userName => {
      expect(userName).toEqual(aliceId)
      expect(server.peers).toHaveProperty(aliceId)
      expect(server.documentIds).toEqual({})
      done()
    })

    // make a connection
    const alice = new WebSocket(`${url}/introduction/${aliceId}`)
  })

// ...

})

If that seems like a reasonable approach I'll just refactor the server and tests along those lines, rather than trying to get that script to work cross-platform.

Add tests for react-hooks useAwareness

  • Add useRemoteAwareness tests
  • Add useLocalAwareness tests
  • Move example to examples dir

Optional:

  • Add description to automerge-repo-react-hooks README
    • this should explain what ephemeral messages are and why this is useful

Create `DocHandle` with an initial document

Either

  • pass an initial value and do something like this to ensure that you get a valid initial value
const myInitialValue = {
  tasks: [],
  filter: "all",
}

const guaranteeInitialValue = (doc: any) => {
  if (!doc.tasks) doc.tasks = []
  if (!doc.filter) doc.filter = "all"

  return { ...myInitialValue, ...doc }
}

or

  • pass a "reify" function that takes a <any> and returns <T>
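
A sketch of what that second option might look like, with a hypothetical reify callback (the name and the repo.create option below are illustrative, not an existing API):

type Reify<T> = (doc: unknown) => T

// Upgrade an untyped/partial document into the expected shape, filling in defaults.
const reifyTodoDoc: Reify<{ tasks: string[]; filter: string }> = (doc) => {
  const partial = (doc ?? {}) as Partial<{ tasks: string[]; filter: string }>
  return { tasks: partial.tasks ?? [], filter: partial.filter ?? "all" }
}

// Hypothetical usage: repo.create({ reify: reifyTodoDoc })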

`NetworkSubsystem` Review

I have spent some time considering the current NetworkSubsystem and NetworkAdapter implementations and I think it's worth starting a discussion on reviewing and simplifying. I admit this is pretty opinionated, but I am not tied to any of these thoughts with any enormous force.

Most of this came from thinking about 2 steps:

  1. Allowing multiple message types to be passed between peers
  2. Simplifying the Adapter Interface to make implementing new adapters more straightforward

ChannelId
TLDR. I think we should lose/replace it.

The ChannelId (to the best of my understanding) is used in 3 ways. The first is as some kind of ID when we ask the network system and subsequently each adapter to join a "channel". This is done when the system is set up for the first time. We join a SYNC_CHANNEL channel, then there is never another channel joined, and the SYNC_CHANNEL is never referred to again. The ID in this case is redundant.

The second is to identify which document is being synced when sync messages are being passed between peers. I think we would be better to just use DocumentIds in this case, and update the NetworkAdapter API to match.

The third is to identify if a message is meant for broadcast, if it is ephemeral. In this case the ChannelId is prefixed with m/ as a way to differentiate it from other message types. I think a clearer way forward would be to specify the message type explicitly as ephemeral.

Join
At the moment, this is called when the network subsystem is set up, for any channels which are joined. However, when the subsystem is initialised, there are no channels, so nothing is joined. The only time we join is when the repo is initialised for the first time, at the top level of the constructor. I think we could likely do away with join on both the subsystem and the adapter level; when an adapter is initialised, it should run whatever is needed to join automatically. The only reason to hang on to it would be in the case that the leave method is actually useful.

Leave
As far as I can tell, the leave method in the NetworkSubsystem is never called. Do we need to actually stop a network adapter? At the moment the suggestion is that we can leave all adapters (channels, right now) at once but not individual ones. I can see a world where you might want to be able to stop syncing to a server, for example, but keep syncing between local tabs in your browser/app.

Adapters
At its simplest, a network adapter needs to be able to tell repo when:

  • It opens
  • It has a peer candidate
  • It loses a peer
  • It has a message from one of those peers
  • It closes

It also needs to be able to send messages to those peers.

All of these events currently exist on NetworkAdapter, but at no point is open fired by any of the existing adapters. If it were, I think we could bring all of those event types in house. For example:

  1. An adapter sets itself up, and emits an open event.
  2. The subsystem sends out a peer-candidate message type (discussed further below), announcing itself to the network.
  3. Any peers receiving the message log the PeerId, announce the connection to the DocSynchronizer and reply with their own peer-candidate message.

I might be underestimating some of the specificity of the initial handshakes required here, but it strikes me that essentially each adapter needs to set up and configure the network, then pass messages between instances of Repo on one peer and another.

Messages
So, to messages. I think we should essentially add message types. Off the top of my head, they would be:

peer-candidate: Sent by a peer when an adapter opens a network connection, or when a new peer connects.
peer-disconnected: Sent by a peer when it wants to let other peers know that it no longer wants to be a peer, but the network is still active
sync-message: exchanging sync messages between peers
ephemeral: sending broadcast messages to all peers
auth: authentication messages (to fit in with @HerbCaudill's work)
document-not-found: explicit response to a sync request on a document which a peer does not have, and cannot be found

There may be more here, but the point is that I think that all of these messages could/should be handled by the subsystem, so that the Adapters can be left to set up the network, and then just communicate.
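
As a sketch only, those message types could be modelled as a discriminated union along these lines (the field names are illustrative, not a settled wire format; PeerId and DocumentId are assumed to be the existing automerge-repo types):

type RepoMessageSketch =
  | { type: "peer-candidate"; senderId: PeerId }
  | { type: "peer-disconnected"; senderId: PeerId }
  | { type: "sync-message"; senderId: PeerId; documentId: DocumentId; data: Uint8Array }
  | { type: "ephemeral"; senderId: PeerId; data: Uint8Array }
  | { type: "auth"; senderId: PeerId; data: Uint8Array }
  | { type: "document-not-found"; senderId: PeerId; documentId: DocumentId }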

Would love to hear thoughts and corrections on the above. If there is anything I have misunderstood, please don't hesitate to point it out.

useHandles

There's a convenient hook for loading an array of DocHandles. It's not perfect -- you still need to do some bookkeeping of your own to work with it, but it's a step in the right direction when working with arrays of documents in React, which insists on always having the same number of hooks on every execution.

Code here. Would anyone else want this?

export interface DocIdMap<T> {
  [id: DocumentId]: T
}

export function useHandles<T>(docIds?: DocumentId[]): DocIdMap<T> {
  const handlersRef = useRef<DocIdMap<DocHandle<T>>>({})
  const repo = useRepo()
  const [documents, setDocuments] = useState<DocIdMap<T>>({})

  if (!docIds) {
    return documents
  }

  const handlers = handlersRef.current
  const prevHandlerIds = Object.keys(handlers) as DocumentId[]

  docIds.forEach((id) => {
    if (handlers[id]) {
      return
    }

    const handler = (handlers[id] = repo.find<T>(id))
    handler.value().then((doc) => {
      setDocuments((docs) => ({
        ...docs,
        [id]: doc,
      }))
    })

    // TODO: evt.handle.doc isn't awesome
    handler.on("change", (evt) => {
      setDocuments((docs) => ({
        ...docs,
        [id]: evt.handle.doc,
      }))
    })
  })

  prevHandlerIds.forEach((id) => {
    // skip ids that are still requested; only clean up handles no longer in docIds
    if (docIds.includes(id)) {
      return
    }

    const handler = handlers[id]
    handler.off("change")
    delete handlers[id]

    setDocuments((textDocs) => {
      const copy = { ...textDocs }
      delete copy[id]
      return copy
    })
  })

  return documents
}

Automerge-Repo in a SharedWorker under Vite loses "connect".

See also: Menci/vite-plugin-wasm#37

In the meantime, Orion has a workaround which just recreates the worker after a couple hundred MS but... it's ugly, and it makes using a SharedWorker very inconvenient. There are a few ways we could reduce the grossness of this without waiting for an upstream fix in Chrome but I think it might boil down to including an async import.

We don't 100% have to fix this before 1.0 but it's just so awful I'd like to fit it in if we have time.

Browser / vanilla javascript example?

Is there a vanilla JS browser example available?
I'm searching for a way to build a multi-user collaborative JavaScript app that is local-first and syncs with other clients, listening for other clients' changes, like a todo list, chat, or similar apps.

Control sharing on a per document level

I had a conversation with @pvh last week about how it would be best to go about managing which documents in a repo are permitted to be shared with which peers. At the moment, there is a sharePolicy option which takes a peerId parameter, but this only determines whether or not a peer should be shared with at all.

It seems to me that whether or not a document should be shared with peers depends on factors which may or may not be in Automerge's control or purview. As such, it makes sense to me to take control of that outside of automerge-repo and allow a user to permit or deny based on their own criteria.

I would love to put together a PR for this, but wanted to discuss a possible API for this before continuing. My preferred route right now would be:

  • Pass a callback as an additional option which takes both a peerId and a documentId. This would then be called any time a peer or a document is added to determine whether or not it should be shared.
  • While the current sharePolicy is set when the peer is added (the callback is evaluated once only), this new callback would be executed each time a new peer joined or a document was added.

I am sure there are other methods I have not considered, and would be grateful for peoples' thoughts on the subject.

`handle.value` resolves even if the document does not exist

At present, the following test fails:

it("cannot find a document which does not exist", async done => {
  const { repo } = setup()

  const handle = repo.find<TestDoc>("does-not-exist" as DocumentId)
  assert.equal(handle.isReady(), false)

  handle
    .value()
    .then(() => {
      done(
        new Error(
          "should not be able to find a document which does not exist"
        )
      )
    })
    .catch(() => {
      done()
    })
})

There likely needs to be some response when a document is not found, but this currently resolves to an empty Automerge doc.

New document sync and save

I have been looking in to how we might determine if a document is not available on the network. As a side effect, I was investigating how sync messaging is triggered.

At the moment, if we create a document on the repo, we set up a new empty document but don't persist anything to storage. We do, however, immediately start to sync it with other peers (if the sharePolicy allows).

When we find a document which doesn't exist, the same process happens, only the document stays in a requesting state.

Any connected peer will have no way of determining if the first (empty?) sync message is regarding an empty doc which has been created elsewhere and will shortly begin syncing changes, or if it is a non-existent document that the other peer is looking for.

I would suggest that the most logical solution would be to have a new document only sync over the network if it contains changes?

It might not matter, and we could be happy that a connected peer could respond to the initial sync in either case to say that the document isn't available from it. If so, we just ignore the message when it arrives at the creating peer.
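
If we went the "only sync once there are changes" route, the check could be as simple as this sketch, assuming Automerge's getHeads API (whether this is the right gate is exactly what's up for discussion here):

import * as A from "@automerge/automerge"

// A freshly initialised document has no changes and therefore no heads.
function hasLocalChanges<T>(doc: A.Doc<T>): boolean {
  return A.getHeads(doc).length > 0
}

// Hypothetical gate: only begin announcing/syncing a new document once hasLocalChanges(doc) is true.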

Consecutive changes do not emit distinct documents

When a handle is changed, the change event emits as expected. We can then resolve the handle value to get the underlying document.

However, if two changes are made in quick succession, only the final value is resolved:

const TEST_ID = "test-document-id" as DocumentId
const handle = new DocHandle<{foo: string}>(TEST_ID, { isNew: true });

handle.on("change", async ({ handle }) => {
  const d = await handle.value();
  console.log(d);
});

handle.change((doc) => {
  doc.foo = "bar";
});

handle.change((doc) => {
  doc.foo = "baz";
});

will log

{ foo: 'baz' } <-- should be 'bar'
{ foo: 'baz' }

A simple solution to this would be to include the document in the event parameter object? Happy to put together a PR if this is acceptable.
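
For illustration, and reusing the handle from the snippet above, the proposed shape would let the listener read the exact document produced by each change straight off the event payload (the doc field here is the proposal, not the current event shape):

// Hypothetical event payload carrying the post-change document.
handle.on("change", ({ handle, doc }) => {
  console.log(doc)   // logs { foo: "bar" } then { foo: "baz" }, one per change
})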
