[anti-censorship-team] Need to increase number of tor instances on snowflake-01 bridge, increased usage since yesterday
David Fifield
david at bamsoftware.com
Tue Sep 27 14:54:53 UTC 2022
On Mon, Sep 26, 2022 at 10:39:42AM +0200, Linus Nordberg via anti-censorship-team wrote:
> It seems likely that we're hitting a limit of some sort and next thing
> is to figure out if it's a soft limit that we can influence through
> system configuration or if it's a hardware resource limit.
tor has a default bandwidth limit, but we should be nowhere close to it,
especially distributed across 12 instances:
BandwidthRate N bytes|KBytes|MBytes|GBytes|TBytes|KBits|MBits|GBits|TBits
A token bucket limits the average incoming bandwidth usage on this node
to the specified number of bytes per second, and the average outgoing
bandwidth usage to that same value. If you want to run a relay in the
public network, this needs to be at the very least 75 KBytes for a
relay (that is, 600 kbits) or 50 KBytes for a bridge (400 kbits) — but
of course, more is better; we recommend at least 250 KBytes (2 mbits)
if possible. (Default: 1 GByte)
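To rule it out explicitly (assuming the instances' torrc files live under
/etc/tor/instances/, which may not be the exact layout on snowflake-01),
something like this should come back empty:
# grep -riE 'bandwidthrate|relaybandwidthrate' /etc/tor/instances/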
I do not see any rate limit enabled in /etc/haproxy/haproxy.cfg.
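A quick double-check, which only catches explicit directives such as
rate-limit sessions, a maxconn ceiling, or a stick-table based limit:
# grep -Ei 'rate-limit|maxconn|stick-table' /etc/haproxy/haproxy.cfg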
I checked the number of sockets connected to the haproxy frontend port,
thinking that we may be running out of localhost 4-tuples. It's still in
bounds (but we may have to figure something out for that eventually).
# ss -n | grep -c '127.0.0.1:10000\s*$'
27314
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 15000 64000
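That range gives roughly 49000 usable source ports for connections from
127.0.0.1 to 127.0.0.1:10000, so 27314 is a bit over half of the ceiling.
If it does become a problem, two possible mitigations (sketches, not
something I have tried here) are widening the range,
# sysctl -w net.ipv4.ip_local_port_range="1025 65000"
(reserving any listening ports that would then fall inside it via
net.ipv4.ip_local_reserved_ports), or having haproxy listen on additional
loopback addresses or ports so that more 4-tuples are available.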
According to https://stackoverflow.com/a/3923785, some other parameters
that may be important are:
# sysctl net.ipv4.tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60
# cat /proc/sys/net/netfilter/nf_conntrack_max
262144
# sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000
Ethernet txqueuelen (1000)
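In case these start to matter, the current conntrack table occupancy (to
compare against nf_conntrack_max) and the interface queue length and drop
counters can be read with:
# cat /proc/sys/net/netfilter/nf_conntrack_count
# ip -s link show eno1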
net.core.netdev_max_backlog is the "maximum number of packets, queued on
the INPUT side, when the interface receives packets faster than kernel
can process them."
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html#netdev-max-backlog
But if we were having trouble with backlog buffer sizes, I would expect
to see lots of dropped packets, and I don't:
# ethtool -S eno1 | grep dropped
rx_dropped: 0
tx_dropped: 0
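Likewise, the second hexadecimal column of /proc/net/softnet_stat counts
packets dropped because the per-CPU backlog (netdev_max_backlog) was full,
so it is another place such drops would show up:
# awk '{print $2}' /proc/net/softnet_stat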
It may be something inside snowflake-server, for example some central
scheduling algorithm that cannot run any faster. (Though if that were
the case, I'd expect to see one CPU core at 100%, which I do not.) I
suggest doing another round of profiling now that we have taken care of
the more obvious hotspots in
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/100
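If snowflake-server exposes (or is temporarily patched to expose) a
net/http/pprof endpoint (I don't remember offhand whether it currently
does, and the port below is a placeholder), grabbing a 30-second CPU
profile would be as simple as:
# go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'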