[tor-dev] QUIC TOR Debugging Question (no attach)

Mon Apr 25 03:33:25 UTC 2016

> On 25 Apr 2016, at 06:44, Xiaofan Li <xli2 at andrew.cmu.edu> wrote:
> 
> Hi Tim and everyone on tor-dev,
> 
> Our QUIC + TOR project has almost been fully implemented. We are debugging the last few bits of bugs. Update:
> 	• We're now able to build many complete circuits with QUIC as the underlying protocol.
> 	• We have not debugged the actual communication part yet. We are aware of certain failure cases for QUIC (e.g. line 15642 of the log is being debugged right now). So we cannot actually send client data yet.
> 	• The current state uses QUIC for OR connections only. Thus a dual-path is implemented as suggested in my last email thread.
> 	• TLS is completely bypassed and important state (that is set up in tls_handshake functions) is preserved and refactored out. e.g. conn->/chan->state purpose, etc.
> 	• Some tinkering and re-designing of QUIC itself is also underway. The fact that QUIC is a transport protocol on application layer makes it painful to interact with the event and timer systems of TOR. We are trying to improve this aspect now.
> The debugging log I was attaching was too big for the tor-dev list. So if you are interested to take a look at the file, let me know.

Large debug logs contain too much information to be helpful to you or to us.

Try warning, notice, or info level logs, in that order.
Using high-level logs makes it easier to work out where your attempts to send data have broken down.
Once you've identified where communication has broken down, try to fix it.
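
If all you have is a debug-level log, a small filter can cut it down to the higher-severity lines. This is a minimal sketch in Python, assuming tor's usual line format of "Mon DD HH:MM:SS.mmm [severity] message". filter_log is a hypothetical helper, not part of tor.

```python
import re

# Tor log severities, from least to most severe.
SEVERITIES = ["debug", "info", "notice", "warn", "err"]

# Assumed line format: "Apr 25 03:33:25.000 [notice] Bootstrapped 100%: Done"
LINE_RE = re.compile(r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3} \[(\w+)\] ")


def filter_log(lines, min_severity="notice"):
    """Return only the lines logged at min_severity or above."""
    threshold = SEVERITIES.index(min_severity)
    kept = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group(1) not in SEVERITIES:
            continue  # not a recognisable log line, skip it
        if SEVERITIES.index(match.group(1)) >= threshold:
            kept.append(line)
    return kept
```

Start with min_severity="warn", and only drop to "notice" or "info" if that doesn't show where things stop working.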

If you can't fix it, you're welcome to ask for advice.
Please quote a small number of relevant log messages, tell us what you think they mean, and what you've tried to do to fix it.
Also feel free to provide a link to logs at that level for people to look through.

This makes it more likely that people will recognise your issue and respond by helping you to fix it.

> Some particularly concerning things in the log:
> 	• circuit_get_by_circid_channel_impl(): found nothing for circ_id 14801, channel ID 2 (0x7f758bb6b740)
> Then it just attaches this circ onto this channel.. Is this normal?
> 	• Line 4901 circuit_receive_relay_cell(): Passing on unrecognized cell.
> It happens a lot. Is this normal?
> 	• This sequence happened a lot around 7500.
> relay_send_command_from_edge_(): delivering 10 cell forward.
> circuit_package_relay_cell(): crypting a layer of the relay cell.
> circuit_package_relay_cell(): crypting a layer of the relay cell.
> circuit_package_relay_cell(): crypting a layer of the relay cell.
> It seems like it's decrypting and forwarding cells along. Is it normal for TOR (with TCP) to do this in a burst? Because I'm seeing about ~1s of repeated calls.

I honestly don't think these are concerning at all. But I don't really know.
And I can't find out, because I don't know which version of tor you've based your changes on.

Here's how you can find out whether these log messages are typical or not:

Run the original version of Tor that you've based your QUIC changes on, with the same network configuration.
(Does it work? If not, your QUIC network will likely never work either.)
Then compare the warning, notice, and info logs to tor with QUIC.
Stop at the first log that differs in non-trivial ways.
This is a log level that's useful for you.
(High-level logs will also cause you less concern about spurious messages.)

This way, you can answer your own questions about which logs and behaviours are normal, and which ones you've introduced.
Feel free to report back with any log messages from the unmodified version of Tor that might indicate bugs.
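
The comparison described above can be roughly automated. The sketch below (Python, with hypothetical helper names) strips timestamps, pointers, and numbers from each line, then reports the first place where the two logs disagree. It assumes both runs log in the same order, which is only approximately true for tor, so treat the result as a starting point for reading rather than a verdict.

```python
import re
from itertools import zip_longest


def normalize(line):
    """Strip details that differ between any two runs of the same network."""
    # Leading timestamp, e.g. "Apr 25 03:33:25.000 "
    line = re.sub(r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3} ", "", line)
    # Pointers such as 0x7f758bb6b740
    line = re.sub(r"0x[0-9a-fA-F]+", "<ptr>", line)
    # Circuit IDs, channel IDs, ports, counts
    line = re.sub(r"\d+", "<n>", line)
    return line.strip()


def first_divergence(baseline_lines, modified_lines):
    """Return (index, baseline_line, modified_line) at the first
    non-trivial difference, or None if the logs match throughout."""
    pairs = zip_longest(baseline_lines, modified_lines)
    for index, (base, mod) in enumerate(pairs):
        if base is None or mod is None:
            return index, base, mod  # one log ended early
        if normalize(base) != normalize(mod):
            return index, base, mod
    return None
```

Run it on the warning-level logs first, then notice, then info.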

> Some more general questions:
> 	• Internal Circuits: any docs? What is it used for? Measuring bandwidth?

Relay bandwidth testing, relay reachability testing (default chutney configs skip this using AssumeReachable), client directory fetches, hidden service directory document uploads, onion services (hidden services), …

Read the ~12 instances of CIRCLAUNCH_IS_INTERNAL in the tor source code for more details.

> How many internal circuits are required by the system?

As many as are necessary to support the operation of the Tor client / relay / onion service at the current time.
Initially, 2 or 3 (read circuit_predict_and_launch_new for more details).

> 	• circuit wide ID format. We had a bug regarding this last week. The check in process_create_cell always fails because line 281-295 in command.c always failed (the check for CIRC_ID_TYPE and id_is_high). Currently we commented out this check. What does it affect? And could we do this?

I can't see how this could be your client communication issue. It's only an issue if the circuit IDs collide, which should be unlikely in small networks.

When two relays create circuits on a connection, one uses the lower half of the circuit id space, and one uses the upper half. This prevents circuit IDs colliding. Read the definitions of circ_id_type, circ_id_type_t, and channel_set_circid_type for details.
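
As a toy model of that split (Python, not tor's actual code): I'm assuming here that the top bit of the circuit ID selects the half, that wide circuit IDs are 32 bits and narrow ones 16 bits, and that the receiver of a create cell rejects an ID from the half it uses itself. All the names below are made up for illustration.

```python
from enum import Enum


class CircIdType(Enum):
    LOWER = "lower"    # this side picks IDs with the top bit clear
    HIGHER = "higher"  # this side picks IDs with the top bit set


def id_is_high(circ_id, wide_circ_ids=True):
    """True if circ_id lies in the upper half of the ID space."""
    top_bit = 31 if wide_circ_ids else 15
    return bool(circ_id & (1 << top_bit))


def may_originate(circ_id, our_type, wide_circ_ids=True):
    """True if a side with our_type is allowed to pick circ_id."""
    return id_is_high(circ_id, wide_circ_ids) == (our_type is CircIdType.HIGHER)


def accept_create_cell(circ_id, our_type, wide_circ_ids=True):
    """The receiver accepts only IDs from the other side's half.

    An ID from our own half could collide with a circuit that we
    created on the same connection.
    """
    return not may_originate(circ_id, our_type, wide_circ_ids)
```

In this model, the check fails every time if both ends of a connection believe they own the same half.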

The version of the link protocol determines how this decision is made.
I assume that your tor has chan->conn->link_proto >= MIN_LINK_PROTO_FOR_WIDE_CIRC_IDS.
(You can check this by printing out the value of chan->conn->link_proto everywhere channel_set_circid_type is called.)

So you've removed TLS client identity and TLS server identity keys.
What do get_tlsclient_identity_key and get_server_identity_key return?
Null bytes?

Is there a publicly known key in QUIC that's known by both sides and stable for the life of a connection?
If so, use that.

If not, always pass 0 for consider_identity to channel_set_circid_type, so that the initiator uses the upper half of the circuit IDs, regardless of keys.
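
Here is a rough sketch of why the keys matter (Python, hypothetical names, with a plain byte comparison standing in for tor's real identity key comparison, and the tie-break direction assumed). If both sides see identical keys, for example all null bytes, the comparison gives the same answer on both ends, so both claim the same half. Ignoring the keys and deciding by who initiated the connection avoids that.

```python
def choose_circid_type(started_here, consider_identity,
                       our_key=b"", peer_key=b""):
    """Pick which half of the circuit ID space this side uses.

    Simplified model, not tor's implementation: keys are compared as
    raw bytes, and a tie is assumed to resolve to "higher".
    """
    if not consider_identity:
        # Initiator takes the upper half, regardless of keys.
        return "higher" if started_here else "lower"
    return "lower" if our_key < peer_key else "higher"


null_key = bytes(20)

# Identical keys: both ends make the same choice, so every ID collides
# with the check on the other side.
initiator = choose_circid_type(True, True, null_key, null_key)
responder = choose_circid_type(False, True, null_key, null_key)
same_half = initiator == responder

# Ignoring identity: the two ends always disagree, as required.
initiator_fixed = choose_circid_type(True, False)
responder_fixed = choose_circid_type(False, False)
```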

Breaking other parts of the circuit management code could also cause this issue.

> 	• From a high level, when a client sends data using a circuit, what is its code path? Which special (as in, specific to client-initiated communication) functions are called?

I'm not sure how to answer this question. The unhelpful but accurate answer is "not many codepaths are client-specific, if there are any at all".

Regardless of its role in the network, every tor instance performs common operations like retrieving consensus documents and building circuits. And, if configured to do so, tor instances can perform multiple roles.

Here are some high-level differences between client and server communication in the tor network that could be causing your issues:

Typically, clients, onion services, and bridges retrieve directory documents using "begindir", a TLS connection to the ORPort. Relays and authorities do this unencrypted over the DirPort. If you haven't replaced TLS with QUIC correctly, clients may fail to bootstrap or retrieve directory documents. There should be log messages about this.

Clients have a SOCKSPort open, and in response to application requests they make an AP (application) connection that's linked to a stream on a circuit that's been extended to the destination exit relay. They then send requests received on the SOCKSPort to the destination relay, and receive responses that they forward to the application. (The onion service setup is slightly more complex, but transmits data in a similar way.)
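
To trigger exactly one application connection and then read the client log, a minimal SOCKS5 probe is enough. This is a sketch in Python with hypothetical function names. It assumes the client's SOCKSPort is on 127.0.0.1:9050 (tor's default; a chutney network will use other ports) and that no SOCKS authentication is required.

```python
import socket
import struct


def socks5_greeting():
    # Version 5, one auth method offered, method 0 = no authentication.
    return b"\x05\x01\x00"


def socks5_connect_request(host, port):
    name = host.encode("ascii")
    # Version 5, command 1 = CONNECT, reserved, address type 3 = hostname.
    header = b"\x05\x01\x00\x03"
    return header + bytes([len(name)]) + name + struct.pack(">H", port)


def probe(host, port, socks_host="127.0.0.1", socks_port=9050, timeout=30):
    """Ask the SOCKSPort to connect to host:port.

    Returns the SOCKS5 reply code; 0 means the connection succeeded.
    """
    address = (socks_host, socks_port)
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(socks5_greeting())
        if sock.recv(2) != b"\x05\x00":
            raise RuntimeError("SOCKS5 method negotiation failed")
        sock.sendall(socks5_connect_request(host, port))
        reply = sock.recv(10)
        if len(reply) < 2:
            raise RuntimeError("short SOCKS5 reply")
        return reply[1]
```

Call probe("example.com", 80) once, then look at what the client and each relay logged at warning, notice, or info level around that time.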

Have you read torguts?
https://gitweb.torproject.org/user/nickm/torguts.git/

Any part of this process could break and cause client communication to fail.
Parts of the relay code could also break in ways that cause client communication to fail.

I can't see how to describe specific code paths without more specific (and precise) detail about what's failing, and whether it's failing on clients or relays. You can find this in the logs, if you log sensibly. Let us know what you find, and what you tried to do to fix it.

What high-level success or failure message (warning, notice, info) is logged on the client right after you try to make an application connection?
Does the connection reach a relay? The exit? DNS? The remote site?
What warning, notice, or info-level message is logged on the last tor node where the connection stops working?
(Or what DNS or HTTP request is sent to the remote server / site?)

> Any other comment on the log is greatly appreciated, since everyone here is probably more familiar than me with what a normal bootstrapping process would look like.

Don't worry too much about the log messages. They're designed to be used for debugging once there is a known issue.
The vast majority are harmless, and many need context to interpret. You can find this context by searching the tor code for unique words or phrases in the log message. (But keep in mind that log strings are often composed of shorter strings.)
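
A sketch of that kind of search, in Python: it walks a source tree and reports lines that contain every fragment you give it, so you can search for two or three short, distinctive pieces of a message rather than the whole string. find_log_sources and its parameters are hypothetical names, and the path to your tor checkout is whatever it is on your machine.

```python
import os


def find_log_sources(src_root, fragments, extensions=(".c", ".h")):
    """Yield (path, line_number, line) for each source line under
    src_root that contains every string in fragments."""
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue  # only look at C sources and headers
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if all(fragment in line for fragment in fragments):
                        yield path, number, line.rstrip("\n")
```

If a full phrase finds nothing, retry with fewer or shorter fragments, since the message may be assembled from several pieces or split across source lines.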

Some general requests for future questions:

It would be much easier and faster for me (and perhaps others) to help you if you asked questions after trying to identify and fix issues yourself. I encourage you to try some of the things I've suggested, and ask more precise questions next time.

Personally, I would find it easier to respond to targeted questions that come one at a time, every few days or every week, rather than a large email every few weeks.

It might also be helpful to be able to see the source code you're working on, rather than trying to guess what changes you've made from what I remember of what you said about your design in previous emails.

Tim

Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP 968F094B
ricochet:ekmygaiu4rzgsk6n
