[tor-dev] Wide block cipher experiment.
Yawning Angel
yawning at schwanenlied.me
Thu Mar 19 10:12:45 UTC 2015
Hello,
Nickm mentioned to me that he was curious as to how LIONESS performs
these days (See #5460) with modern cryptographic primitives. I've
conveyed the results to several people, but I'm also sending them here
for posterity.
Code used: https://github.com/yawning/lioness (It may be incorrect;
don't use it for anything other than benchmarking. The numbers were
taken with a previous version of the code that lacked the initial
memcpy, which was added later so that the code in git could be used by
the extremely brave for other things.)
All measurements were taken on an i5-4250U, so the usual caveats about
turboboost and hyperthreading apply.
Baseline (from tests/bench, AES-NI enabled):
===== cell_ops =====
Inbound cells: 231.33 ns per cell. (0.45 ns per byte of payload)
Outbound cells: 224.39 ns per cell. (0.44 ns per byte of payload)
(Note: Outbound with AES-NI disabled is ~3.0 ns per byte)
LIONESS (BLAKE2b/ChaCha, 509 byte block size):
 * ChaCha20:
   * Ted Krovetz's AVX2-ed ChaCha20/Ref AVX BLAKE2b: ~6.6 ns/byte
     (~143 MiB/s)
   * AVX2-ed ChaCha20, Andrew Moon's AVX2-ed BLAKE2b: ~5.0 ns/byte
     (~190 MiB/s)
 * ChaCha12:
   * Ted Krovetz's AVX2-ed ChaCha12/Ref AVX BLAKE2b: ~6.1 ns/byte
     (~156 MiB/s)
   * AVX2-ed ChaCha12, Andrew Moon's AVX2-ed BLAKE2b: ~4.4 ns/byte
     (~213 MiB/s)
 * ChaCha8: (Yolo swag 420 blaze it)
   * Ted Krovetz's AVX2-ed ChaCha8/Ref AVX BLAKE2b: ~5.8 ns/byte
     (~164 MiB/s)
   * AVX2-ed ChaCha8, Andrew Moon's AVX2-ed BLAKE2b: ~4.1 ns/byte
     (~232 MiB/s)
NB: The variant using Andrew Moon's BLAKE2b isn't in git, because the
way I tested it was kind of kludgy.
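
For anyone who wants to move between the units used above, here is a
small sketch of the arithmetic (ns/byte to MiB/s, and ns/byte to the
cost of one 509 byte block). The function names are made up for
illustration. The MiB/s figures quoted above were presumably derived
from unrounded measurements, so recomputing them from the rounded
ns/byte values can differ by a few MiB/s.

```python
CELL_PAYLOAD_BYTES = 509  # block size quoted in the LIONESS results


def ns_per_byte_to_mib_per_s(ns_per_byte):
    """Convert a per-byte cost in nanoseconds to throughput in MiB/s."""
    bytes_per_second = 1e9 / ns_per_byte
    return bytes_per_second / (1024 * 1024)


def ns_per_block(ns_per_byte, block_bytes=CELL_PAYLOAD_BYTES):
    """Cost of processing one block, given a per-byte cost."""
    return ns_per_byte * block_bytes


if __name__ == "__main__":
    # Rounded ns/byte figures from the list above.
    for ns in (6.6, 5.0, 6.1, 4.4, 5.8, 4.1):
        print(f"{ns} ns/byte -> {ns_per_byte_to_mib_per_s(ns):.1f} MiB/s, "
              f"{ns_per_block(ns):.0f} ns per {CELL_PAYLOAD_BYTES} byte block")
```

The per-block figure is the one to compare against the baseline's "ns
per cell" numbers.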
Profiler output:
64.04% lioness_test_av lioness_test_avx2 [.] blake2b_compress
22.43% lioness_test_av lioness_test_avx2 [.] chacha_stream_xor
6.60% lioness_test_av lioness_test_avx2 [.] blake2b_init_key
2.72% lioness_test_av lioness_test_avx2 [.] blake2b
2.41% lioness_test_av libc-2.21.so [.] __memcpy_avx_unaligned
1.07% lioness_test_av lioness_test_avx2 [.] lioness_encrypt_block
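
For readers who don't have the construction in their head, here is a
minimal sketch of the four LIONESS rounds as described by Anderson and
Biham (two stream cipher rounds interleaved with two keyed hash
rounds). It is not the code from the repository: the 32 byte left half
and the key sizes are assumptions, and SHAKE-256 is used as a stand-in
keystream generator only because ChaCha isn't in the Python standard
library. Don't use it for anything.

```python
import hashlib

LEFT_LEN = 32  # assumed size of the left half (also the hash output size)


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _stream(key, n):
    # Stand-in keystream (SHAKE-256), NOT ChaCha; illustration only.
    return hashlib.shake_256(key).digest(n)


def _hash(key, data):
    # Keyed BLAKE2b, truncated to the size of the left half.
    return hashlib.blake2b(data, key=key, digest_size=LEFT_LEN).digest()


def lioness_encrypt(keys, block):
    k1, k2, k3, k4 = keys
    left, right = block[:LEFT_LEN], block[LEFT_LEN:]
    right = _xor(right, _stream(_xor(left, k1), len(right)))  # R ^= S(L ^ K1)
    left = _xor(left, _hash(k2, right))                       # L ^= H(K2, R)
    right = _xor(right, _stream(_xor(left, k3), len(right)))  # R ^= S(L ^ K3)
    left = _xor(left, _hash(k4, right))                       # L ^= H(K4, R)
    return left + right


def lioness_decrypt(keys, block):
    # Same rounds, applied in reverse order.
    k1, k2, k3, k4 = keys
    left, right = block[:LEFT_LEN], block[LEFT_LEN:]
    left = _xor(left, _hash(k4, right))
    right = _xor(right, _stream(_xor(left, k3), len(right)))
    left = _xor(left, _hash(k2, right))
    right = _xor(right, _stream(_xor(left, k1), len(right)))
    return left + right
```

Each block costs two keyed hash calls and two keystream passes over
nearly the whole block, which is at least consistent with the profile
being dominated by blake2b_compress, chacha_stream_xor and
blake2b_init_key.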
Ted Krovetz's ChaCha implementation isn't quite the fastest out there,
but it doesn't lag massively behind Andrew Moon's. Benchmarks on the
same hardware from Andrew Moon's chacha-opt/blake2b-opt are:
BLAKE2b:
576 byte(s):
avx2, 1468.00 cycles per call, 2.5486 cycles/byte
avx, 1674.00 cycles per call, 2.9062 cycles/byte
x86, 2020.00 cycles per call, 3.5069 cycles/byte
generic/64, 2638.00 cycles per call, 4.5799 cycles/byte
ChaCha20:
576 byte(s):
avx2, 694.00 cycles per call, 1.2049 cycles/byte
avx, 1104.00 cycles per call, 1.9167 cycles/byte
ssse3, 1112.00 cycles per call, 1.9306 cycles/byte
sse2, 1376.00 cycles per call, 2.3889 cycles/byte
x86, 2528.00 cycles per call, 4.3889 cycles/byte
generic, 3200.00 cycles per call, 5.5556 cycles/byte
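
The cycles/byte columns above are just cycles per call divided by the
576 byte message size. The sketch below recomputes the avx2 rows and
adds a very rough floor for a LIONESS block, assuming two hash passes
and two stream passes that each touch roughly the whole block. That
assumption ignores key setup, per-call overhead and the fact that the
benchmarks use 576 byte messages rather than 509 byte blocks, so treat
the result as a back-of-the-envelope bound, not a measurement. The
function names are made up for illustration.

```python
MESSAGE_BYTES = 576  # message size used in the benchmarks above

# avx2 rows, cycles per call, copied from the benchmark output above.
BLAKE2B_AVX2_CYCLES = 1468.0
CHACHA20_AVX2_CYCLES = 694.0


def cycles_per_byte(cycles_per_call, nbytes=MESSAGE_BYTES):
    """Per-byte cost of one primitive call over nbytes of input."""
    return cycles_per_call / nbytes


def lioness_floor_cycles_per_byte(hash_cpb, stream_cpb):
    # Assumption: 2 hash passes + 2 stream passes over ~the whole block,
    # with no key setup or call overhead accounted for.
    return 2 * hash_cpb + 2 * stream_cpb


if __name__ == "__main__":
    h = cycles_per_byte(BLAKE2B_AVX2_CYCLES)
    s = cycles_per_byte(CHACHA20_AVX2_CYCLES)
    print(f"BLAKE2b avx2:  {h:.4f} cycles/byte")
    print(f"ChaCha20 avx2: {s:.4f} cycles/byte")
    print(f"rough LIONESS floor: {lioness_floor_cycles_per_byte(h, s):.2f} "
          f"cycles/byte")
```

Turning cycles into ns needs the actual clock speed during the run,
which the turboboost caveat above makes hard to pin down.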
I don't think using CTR-AES (with AES-NI) in a LIONESS construct is
going to be that big of a win, at least on my hardware, and the
slowdown I'm seeing relative to the baseline feels like too much of a
performance hit to me.
Regards,
--
Yawning Angel