Channel checkpointing in ForceMove

tom · September 16, 2019, 3:05pm

Through Andrew’s work on a TLA+ specification for force-move, we discovered an issue with force-move. It is still possible to run the protocol safely, but doing so requires storing > nParticipants states in some circumstances, while we were aiming for the property that a client should only ever have to store the last nParticipants states.

The problem manifests as follows: suppose that we have a force-move channel with participants = [Alice, Eve, Bob, Charlie] (so that Alice signs states numbered 0, 4, 8, ..., etc). Suppose that the largest turn number Alice has seen so far is 11, and she holds states s8, s9, s10, s11, signed by Alice/Eve/Bob/Charlie respectively. Now suppose that Eve launches a force-move with states s6, s7, s8, s9', where s9' != s9. How can Alice ensure that the channel concludes in state s11 or later? She has to cancel the force-move in some way, for which there are three options: (1) respond, (2) refute, (3) respondFromAlternative. She can’t respond, in general, because s9' -> s10 might not be a validTransition. She can’t refute, as she doesn’t hold a state with turnNum > 9 that is signed by Eve. In order to respondFromAlternative, she would need to supply states s7, s8, s9, s10, which is fine and works, but requires her to store s7, at least temporarily.

This isn’t a massive deal - after all, Eve has just provided s7 in order to launch the force-move. In thinking through this, we realised that there might be a neater solution though.

The proposal is to replace the respondFromAlternative method with a checkpoint method. This method would take a supportedState* with turnNum = n and:

if there’s an ongoing challenge
- for a state with turnNum < n
  - cancel the challenge
  - increase the turnNumRecord to n
- for a state with turnNum >= n
  - do nothing
otherwise
- increase the turnNumRecord to n, unless it is already larger

In the example above, this would mean Alice could call checkpoint(s8, s9, s10, 11), which would cancel Eve’s challenge.

Checkpointing has the additional feature that it can be used pre-emptively by a participant before going offline, as by increasing the turnNumRecord it prevents a force-move being called with an older state.

* In the simplest case, a supportedState is a sequence of signed states, of length nParticipants, with each consecutive pair a validTransition. It also allows for an optimization, where some participants sign the same state. See the force-move implementation for more details.

AndrewStewart · September 16, 2019, 7:46pm

To add some clarification over what it means for a state to be “supported”, I’ve wrote out a definition for it. To keep the discussion here on course, I’ve made a new post over here.

AndrewStewart · September 16, 2019, 8:21pm

Channel checkpointing like this is quite elegant, as it can be used to accomplish the same thing as respond, respondWithAlternative, and refute. IE. you could have an even smaller ForceMove protocol than the current one, with just forceMove, checkpoint and conclude.

On top of that, consider the following channel with participants [Alice, Eve]. Suppose they each hold states 99 and 100

Eve maliciously calls forceMove(s10) with an old state.
Alice calls refute(s99) in transaction tx1
Eve sees tx2 and front-runs a transaction tx2 containing respond(s11)
tx1 reverts, since the channel is now open
Eve then repeats this with turn 11,12,…

In the current ForceMove protocol, this lets Eve block the channel for a duration roughly equal to the latest turn number times the challenge expiration duration. With checkpoint, Alice can stop this attack by calling checkpoint([s99, s100]).

It might also be worth modifying the requirements for refute to increment Eve’s turnNumberRecord, and then clear the challenge if it exists. This would mean that tx1 would not revert in 4, and Alice would not have to subsequently call checkpoint in a second transaction to block the above attack.

There is still benefit to having respond and refute:

when calling respond, you only have to check one valid transition and signature, so it’s more gas-efficient than calling checkpoint
when calling refute, the adjudicator has the option to penalize the challenger

The checkpoint method could be written to detect these two scenarios and react accordingly, but that would complicate the implementation and would not be as gas-efficient as the bespoke functions.

AndrewStewart · September 17, 2019, 5:22pm

It seems like preventing the front-running attack here is not possible with the fully-optimized force-move protocol.

To prevent the front-running attack, checkpoint needs to work like this:

function checkpoint(input) {
  require(input is valid) // does not check if input matches channelStorage
  alwaysIncrementTurnNumbers() // turnNumber = max(turnNumber, input.turnNumber)
  clearChallengeIfExists()  // if (input matches channelStorage) { clear the challenge }
}

However, when the turnNumRecord is stored in the (hashed) channel storage, the only way to increment the turnNumRecord is by replacing the channel storage, and that is only allowed when the input exactly matches the current channel storage.

One way around this is to pull turnNumRecord off of channel storage, and store it un-hashed. This would introduce a 20k gas cost on the first forceMove, and a 5k gas cost on every subsequent call that increments the turnNumRecord.

We could still store a per-participant turnNumberRecord in channel storage to manage turn numbers when refute is called. The channel storage turnNumberRecord could be updated something like this in checkpoint and respond,

// in `checkpoint`, `respond`
channelStorage.turnNumRecord[participant] = maximum(
  globalTurnNumRecord,
  channelStorage.turnNumRecord[participant],
  input.turnNumRecord
)

// in `refute`
channelStorage.turnNumRecord[challenger] = maximum(
  globalTurnNumRecord,
  channelStorage.turnNumRecord[challenger],
)

EDIT:
If we use type uint32 to store turn numbers, we could probably get around the increased gas cost by only storing some of the bytes of the hashed storage:

struct ChannelStorage {
  uint32 turnNumRecord;
  uint128 storageHash;
}

Here, storageHash is obtained by bit-shifting keccak256(channelStorage) to make it 16 bytes instead of 32 bytes.

In this case, the turnNumRecord and storageHash can be packed into a single slot, not increasing the gas price of writing to storage. This is at the cost of increasing hash collisions, but 1/2^128 is probably still acceptable.

We might also get away with storing per-participant turn number records for free when the channel has 3 or fewer participants. This might be worth it when mainly hub-and-spoke virtual channels are created, which involve only two or three participants.

We could also store a bunch of other things, like finalizedAt (as a uint48), by tuning the hash collision rate via the number of bits of the storage hash actually stored on chain.

zakalwe · September 18, 2019, 1:33pm

Ah - hadn’t really thought about the front running situation you’d described - interesting! Although potentially very expensive for Eve if she gets it wrong or is front-front run…

Tom - back to your original point. Do you still have the principle that it’s “ok” for Eve to sign multiple states s9?

I’m pretty sure this problem goes away if you enforce that it’s only ok to sign one state transition? All Alice needs to do is to provide a different s9 signed by Eve, and you’re done…

(as you know, this is what we do…)

I think the checkpointing thing is fine, but it strikes me as a nicety. If it’s actually needed for the core protocol then we should think more about this?

zakalwe · September 18, 2019, 1:36pm

alternatively, can you explain this statement: She can’t refute, as she doesn’t hold a state with turnNum > 9 that is signed by Eve - why cant she use s9 here?

tom · September 18, 2019, 1:52pm

Yes, we still have this. The reason is that it’s necessary to enable replace-by-incentive workflows. For example, it allows Alice to send multiple micropayments to Bob, without Bob having to wait for a response between each one.

She can’t use s9, because the turnNum for s9 isn’t greater than 9. The general principle here is that a refute is used to cancel a provably old challenge. The challenge here isn’t provably old - Eve genuinely might not have received s10 yet, so could be calling force-move honestly.

zakalwe · September 18, 2019, 2:12pm

Eve genuinely might not have received s10 yet, so could be calling force-move honestly.

Well, she might, but the fact is that she has signed s9?

I really have big issues with allowing multiple variants of a state to be signed. It really does break my code completely, because of the way that the RNG works. I take a signed state to be final. If you send me signed s9 then I am able to release my next RNG seed because you can’t go back and change your mind.

Ignoring the RNG, though. Imagine chess. I make a move, s9, and sign it. then you make a move s10 which is checkmate. “Bugger” think, and force move with a different s9.

Ok - so I’ve have reduced this to the problem you described in your original post, I think.

But this means that we can theoretically end up with a situation where we have
s7->s8->s9
s7'->s8'->s9'
s7''->s8''->s9''
all being signed and all being valid, anyone of which could be checkpointed. Can I counter-checkpoint with a different set of valid states?

tom · September 18, 2019, 2:21pm

The rule with multiple states is that, if Alice has sent multiple signed states to Bob, then Bob gets to choose which of them to accept.

So in this case, I think it’s ok. I can try signing a different s9, but you can just choose the original one, so it’s safe for you to release the next RNG.

I think it’s ok here too. If you force-move with s9’, then I can respondFromAlternative (or checkpoint, in the new proposal) with states (s9, s10), which means that my checkmate still stands.

No, but this doesn’t matter. All that checkpointing does is assert that everyone has seen some valid chain of states up to a given point. So checkpointing (s7, s8, s9), just means that no-one can call a force-move on a turnNum < 9; it doesn’t mean that s9 necessarily has to be a state on the path to the final state of the channel.

tom · September 18, 2019, 3:18pm

I put together a few notes on some of our options here:

Possible protocol changes

[A1] Introduce per-participant turnNumRecords
[A2] Pull the global turnNumRecord out of the storage hash. Allow checkpoint to set the turnNumRecord, even if the other storage doesn’t match.
[A3] In addition to 2, pull finalizedAt out of the storage hash. Allow checkpoint to cancel a forceMove if the turnNumRecord increases and it isn’t final, even if the other storage doesn’t match.
[A4] In addition to 2, allow a forceMove to set the turnNumRecord, even if the other storage doesn’t match (and therefore the challenge isn’t registered).
[A5] In addition to 3 and 4, allow one forceMove to replace another if the turnNumRecord increases and it isn’t final.

Options and Comments

[O1] Do A1 alone:
1. Can still force the channel to reach an outcome in a finite amount of time.
2. But that finite amount of time scales with the current turnNum.
3. In the worst case scenario, Eve triggers and cancels a force-move at every turn number up to the the current turn number. (In a n-participant channel, where she controls every participant bar Alice, she could actually trigger n-1 force-moves at some (but not all) turn numbers - but even that is finite.)
4. This attack is expensive for Eve, as she has to pay for a force-move and respond or refute at every turnNum. But it potentially could prevent Alice from calling a force-move at a crucial moment, assuming a time-dependent channel.
[O2] Do A2 alone:
1. Can force a close in a constant amount of time, by calling checkpoint with your latest state.
2. Is it still worth having refute in this scenario?
  1. It can always be front-run
  2. It doesn’t prevent the same force-move from being called again
  3. It might make sense if we were to implement penalties
  4. It might still make sense as a cheap way to cancel a stale force-move
3. Have the property that (a) a checkpoint transaction will always succeed, and (b) afterwards the turnNumRecord will be at least what we expected.
4. Potential weirdness: you wanted to cancel a force-move with your checkpoint, but got the storage wrong. The transaction succeeds, but the force-move isn’t cancelled.
5. Also need to be careful about using the turnNumRecord to double up as storing the turnNum of the force-move, as the turnNumRecord can change during a challenge.
[O3] Do A1 and A2:
1. Does this make a significant change to when you would call refute?
  1. It now does prevent the same force-move from being called again
  2. But it can still always be front-run
  3. Maybe you’d try once, and if it’s front-run, fall back to checkpoint
[O4] Do A3, A2 (and maybe A1):
1. Removes the weirdness described in 2.d. I guess it could revert if the turnNumRecord didn’t increase.
2. Get the property that, after a successful checkpoint, the channel is always open and the turnNumRecord is increased.
[O5] Do A2, A4 (and maybe A1):
1. Additional weirdness that a force-move can be partially successful - I increase the turnNum but don’t manage to launch a challenge.
[O6] Do A5, A3, A4, A2:
1. Note that A5 is incompatible with A1: you can’t decide whether to replace a force-move based on finalizesAt and turnNumRecord alone, if you also need to check the per-participant turn number record.
2. Nice that a force-move can’t be partially successful. Either the turnNum increases and your challenge is launched, or neither of those things happen.

How to decide

I propose we choose between O1, O4, or O6. Motivation: as it is easier not to have to think about transactions being successful, but only part of the effect is carried out.

In making this decision, we also want to keep on-chain storage to a minimum, preferably restricting it to a single bytes32 per channel.

I propose that we choose between these options as follows:

If we can fit, turnNumRecord, finalizesAt and stateHash safely into the same bytes32, we choose O4 or O6 (depending on whether we want to allow one force-move to overwrite another).
Else, we choose O1.

Keen to hear feedback on the following:

What is an acceptable size for the maximum turnNum in a channel? Would uint64 be enough?
What are acceptable values for finalizesAt (depending on whether that’s seconds or block number)?
How many bits do we need for the channelStorageHash? Is 128 enough?

AndrewStewart · September 18, 2019, 10:04pm

Ah - hadn’t really thought about the front running situation you’d described - interesting! Although potentially very expensive for Eve if she gets it wrong or is front-front run…

While the attack described is expensive for Eve, that doesn’t preclude other attacks of a similar nature that are less expensive for Eve. The following scenario is also a justification for checkpoint perhaps being more than just a nicety.

Suppose Alice wants to go offline, and she wants to protect against Eve calling forceMove with a stale state that’s favourable to Eve. In the current protocol, the only option that Alice has is to call forceMove (1) with one of her more recent states* and then call respondWithMove, refute, or respondWithAlternativeMove.

In this situation, if Eve front-runs Alice just once, and Alice is offline, she cannot defend against the front-running attack. Or, she is forced to stay online as long as Eve is willing and able to front-run her, which is perhaps incongruent with Alice’s holiday plans.

This illustrates at least one scenario where a relatively short front-running attack (perhaps just one transaction) can be lucrative for Eve. It also demonstrates a key application of checkpoint as described by Tom: at (1), Alice could call checkpoint instead of forceMove, and accomplish the goal of protecting her assets while offline in just one transaction. Finally, it demonstrates the benefit of a checkpoint function which increments the turnNumber, and then clears a challenge if exists. (This version of checkpoint would require something like the modification described here)

*If she signed the most recent state herself, and she calls forceMove with the penultimate state, someone else can refute her, which in some settings might cost Alice a penalty.

AndrewStewart · September 18, 2019, 10:35pm

My opinion is that the ForceMove protocol needs to make the following guarantee:

G1: Suppose Alice holds a supported state at turn n.
Suppose a blockchain event E is emitted at time t.
Assume that Alice can reliably notice E, decide what to do, and submit a transaction tx that is mined before time t + challengeDuration.
Then Alice can ensure [optional: in constant time], that the channel cannot be finalized at a turn m < n.

With this guarantee, it is much easier to prove the security of Alice’s assets, both

in theory, when Alice has an unlimited amount of patience; and
in practice, when Alice might want to go on holiday, and protect her assets temporarily

The simplest way to do this is to choose the optional “in constant time” version of the guarantee, and O4 (without A1) provides this guarantee.

A stronger guarantee which might be nice is:

G2: The turnNumRecord stored on-chain is the turn number of the most recent supported state seen by the adjudicator.

This would require O6. (Note that G2 trivially implies G1.)

I think the most sensible thing to do is:

first, do O4 without A1
second, if desired do O6

Note that per-participant turnNumRecords might still fit into 256 bits if stored in a struct, assuming a small enough number of participants. For example, using 32 bit turn numbers and a 40 bit finalizedAt leaves 120 bits for struct bookkeeping plus hash-collision-security. It might be possible to fit this in 256 bits and still offer safety. This would allow O6 + A1, at the cost of losing the minimal storage usage in larger channels.

zakalwe · September 19, 2019, 7:43am

Suppose Alice wants to go offline

Ok - that does sound like a valid use case for this. It does start to drag in wider issues, though, like the liveness requirements of participants, the duration of a timeout, and what penalties are applied.

I keep referring to my implementation because I’m trying to see if we can bring the two together, or at least closer, but:
a) I enforce liveness - if you’re going away past the timeout period you should close the channel
b) you can’t resolve your own dispute. Actually, its’s a bit more subtle than that - only a channel participant can open a dispute. (force move). and that participant can’t be the person who’s turn it is next to act. I could, I think, remove this restriction, but it was there to prevent griefing - I’ll have a think about it
c) at least some of this should be the responsibility of watchtowers/guardians.

Suppose Alice wants to go offline, and she wants to protect against Eve calling forceMove with a stale state that’s favourable to Eve.

so - going back to your original example with 100 states, where Eve calls force move with s10. Alice has signed s99 and s100. Why can’t she do anything with these?

tom · September 19, 2019, 8:16am

With checkpointing, she can: she’d call checkpoint(s99, s100), and that would cancel the force-move. That’s one of the advantages of checkpoint.

There’s also the possibility that Eve front-runs her checkpoint(s99, s100) transaction, changing the on-chain state before it is mined. If we were to implement O4 or O6, the transaction could still succeed, and we can guarantee that, after the transaction is mined, the turnNumRecord is at least 100 (increased either by this transaction, or one before it), which therefore means that force-moves on turns below 100 are no-longer possible.

tom · September 19, 2019, 8:19am

Also, maybe it’s worth starting a new topic to discuss the differences between your implementation and force-move?

zakalwe · September 20, 2019, 9:24am

yeah, sorry, I’ll leave that out of this. I’m getting closer to understanding where you are now. Can I propose A6 for completeness sake:

[A6] explicitly pass in the data required to generate the existing storage hash in addition to the data needed for the checkpoint.

Advantages - you don’t need to change any other code
Disadvantage - small extra gas cost in function parameters

Which gives you
[O7] - do A6 (and maybe A5). None of the others are needed if I read this right?

tom · September 20, 2019, 9:59am

No problem. I’d be really interested to discuss that separately though :).

[A6] is pretty similar to what we do currently. The issue with A6 is its susceptibility to front-running: if Alice has to include the data to recreate existing storage hash in her transaction, and Eve manages to get a transaction that changes the storage state mined first, then Alice’s transaction will fail. If Eve can do this consistently, she can prevent any of Alice’s transactions from being mined.

That said, we probably should include A6 for completeness sake. I guess O7 is really “make no change”, which should definitely be an option!

For anyone interested, we’ve now updated the force-move contracts to use checkpoint over respondWithAlternative and @AndrewStewart currently has a PR open to implement a proof-of-concept for storing the turnNumRecord and finalizesAt in the storageHash. There’s also a separate discussion on whether this approach is sufficiently secure.

zakalwe · September 20, 2019, 11:22am

aha - I think I finally understand the front-running now.

hm… So, if it wasn’t for the respond() method, you wouldn’t need to store anything on chain apart from the turn number and the finalisesAt variable (which could easily double as a isInDispute flag), right, as checkpoint() doesn’t care what the currently declared state actually is?

Not sure that helps, but just saying…

tom · September 20, 2019, 11:33am

Yeah - I think that’s right. To summarise:

force-move, checkpoint, conclude just need access to turnNumRecord and finalizesAt
respond also needs access to the challengeState
refute also needs access to the challengerAddress (but not the challengeState)

By pulling the turnNumRecord and finalizesAt out of the hashed storage, we reduce the effect of front-running those transactions. (Interestingly, it seems that those actions commuting with all others, in terms of the effect on the channel storage, is what matters here…).

AndrewStewart · September 21, 2019, 7:52am

Here’s a draft implementation of [O4]. In this implementation, only respond and refute care about the precise challenge currently in storage. conclude, forceMove, and checkpoint only care about

the current turnNumRecord
the channel mode (ie. whether finalizesAt is in {0}, (0,now), or [now,\infty)

The “atomic” requirements are implemented here.

Topic		Replies	Views
Supported states force-move	3	485	September 18, 2019
Do we really need the post-fund-setup round? force-move	6	582	November 9, 2019
Composite Moves + Programmatic NextPlayer protocol design	2	517	September 1, 2020
Could/should a wallet move away from turn-taking for consensus updates? Wallet Behaviour	8	670	November 24, 2019
How do we sign states without requiring a pop up for the user every time? Wallet Behaviour	7	711	August 31, 2021

Channel checkpointing in ForceMove

Possible protocol changes

Options and Comments

How to decide

Related Topics