[tor-dev] [GSoC] [Stegotorus] Recent activities

Thu Aug 23 18:34:41 UTC 2012

On Sun, Aug 19, 2012 at 8:25 PM, vmon <vmonmoonshine at gmail.com> wrote:
> Hey Zack,
>
> I know you are super busy packing for your big move, but I thought as tomorrow is
> the hard deadline for GSoC, is not a bad ideas to give you some final update.

I appreciate it.  I am still in the middle of moving and don't have
the time to go over your changes in detail, but I read over the diff
between 'master' and 'f--payload-servers' and I like it for the most
part.  I don't think we want to _merge_ it just yet, but it's
definitely something we can use.  I like the URL dictionary concept
and the use of query parameters, and it looks like you've got some
good cleanups in here too.

I have a few specific change requests for you.  First: boost should
not be used.  I know it's convenient when you're writing the code, but
it causes so many deployment headaches that it's best avoided.  It
*appears* that you're only using one thing, boost::filesystem::exists,
to check whether a file exists before attempting to open it.  That's
an antipattern.  The correct approach is, don't check first, just
attempt to open the file.  If it succeeds, the file existed.  If it
fails, then check whether errno == ENOENT to find out whether the
problem is that the file doesn't exist.  (If you needed boost for
something else, please let me know and we'll figure out a
replacement.)  Second: do not include the header <iostream>, ever.  It
is only *necessary* to include that header to get 'cin', 'cout', or
'cerr', and none of those should ever be used in this program.  (For
what you were doing you probably want to include <istream> instead.)
(This will also mean that you don't need to add std::__ioinit to the
global variables whitelist.)  Third: I appreciate your having fixed my
sloppy fwrite()s, but please, whenever you print an error message as a
direct consequence of an I/O operation failing, include
strerror(errno).  That is, instead of log_debug("Error writing data");
it should be log_warn("Error writing data: %s", strerror(errno));

... I really gotta find time to write a style guide for this thing :-/

Also, could you please explain in detail the problem you were having
with the DECLARE macros?  If they are getting in your way, then we
should change them, but I do not understand what the issue is.

> So most of last week I spent on the conflict that occurred between
> libcurl and libevent having their eyes on the same socket. I tried
> different ugly and uglier solutions but finally settled on this one
> which is not that bad:
>
> http://stackoverflow.com/questions/12021217/how-to-ask-libcurl-not-to-listen-on-a-socket-just-send-a-url-request-over-it?answertab=votes#tab-top
>
> After that, I'm passing all time line tests now.

Great.

> However, when I look at
> the client log, I see that some of the sequence no are missing in the
> recv: like this
>
> 203.7169 [debug] recv: <3.81> receiving block 42 <d=0 p=9 f=DAT>
> T:203.7170: ckt 3 <ntp 1 outq 0>: recv 42 <d=0 p=9 f=DAT>
>
> and the next recv is like this:
>
> 203.7926 [debug] recv: <3.99> receiving block 60 <d=0 p=1 f=DAT>
> T:203.7927: ckt 3 <ntp 1 outq 0>: recv 60 <d=0 p=1 f=DAT>
>
> Is this for sure an error and loss of data? or there is a normal
> situation that such a thing can happen? For example when the whole
> block can't be sent over one connection? (otherwise how do the tl tests all pass?).

Blocks are cut to fit, so it never happens that "the whole block can't
be sent over one connection".  What _can_ happen is that blocks arrive
in a different order than they are received.  Sometimes they get
reordered to an extent that seems incredible, e.g. block 60 arriving
immediately after block 42 as in your example (I suspect if you look
at the complete log you'll see that what actually happened is block 42
was *delayed* until after blocks 43 through 59 had already arrived).

That's perfectly normal by itself, but there is a bug. If block K+128
arrives while the receiver is still waiting for block K, the receiver
will declare a protocol error and kill the entire circuit.  That's by
design; the bug is that the sender has no way of knowing that the
receiver is still waiting for block K.  This is what I'm supposed to
be fixing right now (by introducing explicit acknowledgments).

> Further more, I merged the payload scraper and stegotorus. I this way,
> http_apache checks for the payload database and if it's not there it
> calls the scraper (it needs apache to be installed on the same
> machine). These all are commited on the github.

Sounds good.  When we get a config file we'll want to be able to say
that we expect to use one or the other, but for now this is a good
approach.

Are we now exclusively using libcurl to generate requests or is there
still legacy hand-generation code in there?

> I started writing a unit test that fetch a webpage directly (using curl)
> and through stegotorus and check if they are the same. I thought curl is
> best approximation of a browser that is available to us. But as I don't
> think I can finish that till tomorrow, I'm going to spend the remaining
> time on documenting the 'uri transport protocol' and deal with small
> issues.
>
> After we are done with the evaluation, I'll go back to unit testing.

Sounds good.

zw