[tor-dev] [tor-assistants] Protocol grammars as probabilistic channels

Sat Jul 28 18:59:35 UTC 2012

On 7/28/12, Zack Weinberg <zack.weinberg at sv.cmu.edu> wrote:
> [I'm moving this from a giant cc: list to tor-dev, I hope that's okay.
>  I *think* everyone involved is already on that list.]
>
> [I apologize for not picking up this ball for so long, I have been ill
> and mostly without the brain.]
>
> On Sun, Jul 22, 2012 at 2:26 AM, Robert Ransom <rransom.8774 at gmail.com>
> wrote:
>>
>> Specifically, to obfuscate a ‘payload protocol’ data stream:
>>
>> * send it through an arithmetic-code encoder, using a model which
>> contains a (low-probability) ‘pad’ symbol;
>> * encrypt it with a stream cipher;
>> * send it through an arithmetic-code decoder, using a model which
>> matches the ‘cover protocol’ (the protocol you are trying to mimic).
>
> I like this way of modeling the problem, but I see some practical
> problems with it.  The payload protocol in this case is Tor's link
> protocol, which is already encrypted, so your first step would appear
> not to do anything (well, except to TLS's cleartext record headers,
> but I don't think that will gain you much).

The first step (the arithmetic-code encoder) allows the obfuscator to
introduce padding symbols into a bytestream without either
understanding or breaking the protocol it is obfuscating.  This is the
only reason for the obfuscator to use an arithmetic coder on the
payload protocol.

If the payload protocol is datagram-oriented and can tolerate junk
appended to the end of a datagram, or if the software implementing a
stream-oriented payload protocol can generate padding at the end of a
message when the obfuscator requests it, the payload-side arithmetic
coder should be omitted.  (Note that Tor itself doesn't satisfy either
of these requirements; but Tor-over-TCP-over-IP is a datagram-oriented
protocol with a length field in each datagram.)

>  The stream cipher
> provides no integrity protection for the message, which maybe we can
> get away with since the higher level has it, and more importantly, no
> sequencing.

The stream cipher in the middle prevents the payload-side
arithmetic-code encoder (or the underlying protocol) from making the
stream fed to the cover-side decoder (computationally) distinguishable
from a random bitstream.

A datagram-oriented payload protocol will need a different scrambling
layer as well.  The easiest solution I can think of is to encrypt the
message header with a large-block block cipher, and assume that
clients usually (but not always) keep the same IP address during a
protocol session in order to reduce the number of keys that the server
needs to try for each message.

>  Several cover protocols of interest (most importantly
> HTTP) break a session up into many short-lived TCP connections.  That
> means you have to allow for the possibility that cover-protocol
> messages will arrive at the decoder out of order.  You also need the
> ability to send control messages at a level below the payload
> protocol, for key agreement, reassociation of new connections with
> ongoing sessions, and possibly other things we haven't thought of yet.

TCP transports a stream protocol over an unreliable unordered datagram
protocol.  Surely it can't be too hard to add a security (i.e.
encryption and authentication) layer to an unreliable unordered
datagram protocol.

> The "chopper" component of StegoTorus tackles these problems.  Have
> you seen the draft paper?

Probably not.

>  It has its own problems -- the one I'm
> wrestling with now is, the _amount of payload data_ sent in any given
> message gets frozen the first time it is transmitted; if you need to
> _re_transmit, you'd better be able to send at least that much data
> right now! (Padding can of course be varied.)

TCP can increase or decrease the amount of data sent in a
retransmitted datagram.  (I suspect that real TCP implementations
actually do retransmit datagrams with less data, during path MTU
discovery.)

> I would be curious to know if you think your arithmetic-decoding
> approach to cover generation could reasonably be made to work within
> StegoTorus' framework.

The arithmetic-code decoder on the cover-protocol side may (should)
have been proposed before.  The other two pieces were my idea, but
they were for stream-oriented obfuscation, and you pointed out above
that they suck for a datagram-oriented obfuscator and need to be
replaced.

>>> George turned me on to this paper,
>>>
>>> "Provably Secure Steganography"
>>> http://www-users.cs.umn.edu/~hopper/tc-stego.pdf
>
> I have to say that I am very skeptical about the practical value of
> this paper, because it requires one to characterize the probability
> distribution of all possible messages in the cover protocol, which is
> not practically possible for any nontrivial protocol.  I could see an
> approximation being good enough for our purposes, though.

(Note that Zack is quoting someone other than me there.)

>> Don't hide information directly in the syntax -- it's easy enough for
>> Them to rearrange protocol messages in ways that will not interfere
>> with standard-conformant clients and servers, but will disrupt crappy
>> protocol obfuscators that don't implement the cover protocol
>> correctly.  Use a semantic model of the cover protocol instead.
>
> vmon and I have been wrestling with this one a bit in the context of
> HTTP and we've pretty much come to the conclusion that we need an
> actual implementation of HTTP, into which we plug data.  Fortunately
> there is no shortage of HTTP implementations.
>
>> ‘Blocking resistance’ is *very* different from ‘detection resistance’
>> (i.e. ‘indistinguishability from normal traffic’ for some value of
>> ‘normal traffic’).
>>
>> * ‘Blocking resistance’ does not require that a cover protocol be
>> ‘indistinguishable from X’ by the adversary, only that the adversary
>> be unable to block it.  For example, an adversary which can only
>> ‘block’ communications by disrupting a connection using extra (forged)
>> packets will not be able to block a UDP-based protocol.
>
> The adversary we're concerned with has control of all the border
> routers between the client and the server, so they can do things like
> blackhole all packets with a particular source or destination.  I
> would argue that detection resistance is in fact required for blocking
> resistance against this adversary.

But only ‘detection resistance’ against any box that They can use to
control a blocking system, and that can detect the protocol quickly
enough to usefully reduce the amount of data that each user can
exchange with each bridge.

>> * A protocol which is ‘indistinguishable from’ some protocol which the
>> adversary wants to not block may not provide ‘blocking resistance’.
>> For example, a crappy obfuscator which uses a syntactic model of HTTP
>> instead of a semantic model can be blocked by just about any
>> off-the-shelf HTTP proxy.
>
> That effect also provides a distinguisher from real HTTP, doesn't it?

Only if They can tell that They blocked an obfuscator's traffic.  They
may not be able to detect that.

Robert Ransom