More than you necessarily wanted to know about tor_strndup performance improvements from late-2004 [was Re: tor callgrinds]
Watson Ladd
watsonbladd at gmail.com
Sun Feb 18 15:19:15 UTC 2007
Christopher Layne wrote:
> On Sat, Feb 17, 2007 at 04:04:04PM -0500, Nick Mathewson wrote:
>> On Fri, Feb 16, 2007 at 10:12:20PM -0500, Watson Ladd wrote:
>> [...]
>>> 2: Won't strlcpy run in time proportional to n? strncpy can't be much
>>> faster as both functions need to loop from 0..n, load the character and
>>> check if its null, and then copy it. The big difference is strncpy keeps
>>> on going.
>> You're confused about how strlcpy works when n is much smaller than
>> the length of the source string. Look at the return value of strlcpy:
>> it's the length of the original string. To find the length of the
>> original string, strlcpy basically has to walk over the whole thing.
>> So while the running time of strncpy(dst,src,siz) is O(siz),
>> strlcpy(dst,src,siz) is O(strlen(src)). That's a big problem for
>> tor_strndup: its most performance-relevant use case is copying a small
>> portion of a much longer string, as when we're parsing router
>> descriptors. In this use case, strncpy is obviously better.
>
> Technically though shouldn't strlcpy() not have to do this? It already
> knows thelength, as it is provided to it. The only anecdote to this is
> that it will supercede an 'actual' length if it detects a '\0' along the
> copy. This isn't typically bad as it's already walking the string
> anyways.
It provides the length of the source string, not the destination string.
This lets you compute the space you need, but is slower. I made the same
mistake.
>
> e.g. OpenBSD copy (same as Tor's):
> /*
> * Copy src to string dst of size siz. At most siz-1 characters
> * will be copied. Always NUL terminates (unless siz == 0).
> * Returns strlen(src); if retval >= siz, truncation occurred.
> */
> size_t strlcpy(char *dst, const char *src, size_t siz)
> {
> register char *d = dst;
> register const char *s = src;
> register size_t n = siz;
>
> if (n == 0)
> return(strlen(s));
> while (*s != '\0') {
> if (n != 1) {
> *d++ = *s;
> n--;
> }
> s++;
> }
> *d = '\0';
>
> return(s - src); /* count does not include NUL */
> }
>
> Provided the string 'onion\0routing\0', with n being 13, only 6 loop
> cycles will actually happen.
No, OpenBSD strlcpy is different. It returns strlen(src) which Tor copy
doesn't.
>
>> This isn't one of those hypothetical armchair optimizations, by the
>> way: we profiled Tor before and after, and found out that the time
>> spent copying strings went way down after we made the change. (I'm
>> afraid I don't still have the numbers; this change came in the
>> 0.0.9pre6 develoment, was back in November of 2004.)
>
> Unfortunately this may be due to the fact that strlcpy() vs strncpy()s
> implementation may have been one of C vs asm. :-/
>
> I've always been on the fence about the whole strlcpy vs strncpy vs
> snprintf vs sprintf("%.*s", len, str) issue. For one, I tend to not
> use many str* functions in general. Secondarily, the "bonus" of strncpy
> zero-filling the remaining end of the dst buffer as some kind of security
> benefit that others have brought up in the past, I find to be completely
> baseless as it's depending on defensive programming and the possibility of
> a buffer overflow happening in the first place.
>
> Also, when it comes to strlcpy, you're actually better off using:
>
> /* provided we've allocated dst as s_len + 1 bytes */
>
> l = MIN(d_len, s_len);
> memcpy(dst, src, l); dst[l] = '\0';
I don't think we need s_len+1 bytes. I think we need l+1 bytes as we
could be copying a small section of a giant string, so allocating
s_len+1 bytes will be too much most of the time. Isn't the basis for
using stncpy that we are only copying a small section of huge strings?
>
> The rationale here being that you've handled the length issues, you already
> know the length (typically) of the source ahead of time (how else would you
> pre-allocate for dst? if it's an auto or static buffer, then you know the
> length as well).
>
> In most asm implementations of str* functions the maximum length they will
> operate on, chunk-wise, is a 4-byte integer. The reason being that they
> will do the old 'is there a null in 4 bytes' trick, check it's alignment,
> and if everything looks good, copy off the equivalent space of an int in
> one call. But the bonus of using memcpy() is that memcpy() knows the entire
> length ahead of time - and if it can make use of, provided the asm backend
> code is there, it will use higher order optimizations such as mmx and/or
> sse ops to copy much more than 4 bytes at once. It also doesn't have to
> deal with even checking for nul's at all. It's entire goal is to copy data
> as fast as possible, especially since it's use is widespread inside the
> kernel and networking code.
>
> This is from Solaris. It does not do any mmx or amd-specific tricks to
> make things even faster, but it is relatively optimized and nothing out
> of the ordinary.
>
> memcpy.s:
> 40 ENTRY(memcpy)
> 41 movl %edi,%edx / save register variables
> 42 pushl %esi
> 43 movl 8(%esp),%edi / %edi = dest address
> 44 movl 12(%esp),%esi / %esi = source address
> 45 movl 16(%esp),%ecx / %ecx = length of string
> 46 movl %edi,%eax / return value from the call
> 47
> 48 shrl $2,%ecx / %ecx = number of words to move
> 49 rep ; smovl / move the words
> 50
> 51 movl 16(%esp),%ecx / %ecx = number of bytes to move
> 52 andl $0x3,%ecx / %ecx = number of bytes left to move
> 53 rep ; smovb / move the bytes
> 54
> 55 popl %esi / restore register variables
> 56 movl %edx,%edi
> 57 ret
> 58 SET_SIZE(memcpy)
>
>
> strncpy.s:
> 65 ENTRY(strncpy)
> 66 pushl %edi / save register variables
> 67 pushl %esi
> 68
> 69 movl 16(%esp), %eax / %eax = source string address
> 70 movl 12(%esp), %edi / %edi = destination string address
> 71 movl 20(%esp), %esi / %esi = number of bytes
> 72
> 73 testl $3, %eax / if %eax not word aligned
> 74 jnz .L1 / goto .L1
> 75 .L8:
> 76 cmpl $4, %esi / if number of bytes < 4
> 77 jb .L4 / goto .L4
> 78 .align 4
> 79 .L2:
> 80 movl (%eax), %edx / move 1 word from (%eax) to %edx
> 81 movl $0x7f7f7f7f, %ecx
> 82 andl %edx, %ecx / %ecx = %edx & 0x7f7f7f7f
> 83 addl $4, %eax / next word
> 84 addl $0x7f7f7f7f, %ecx / %ecx += 0x7f7f7f7f
> 85 orl %edx, %ecx / %ecx |= %edx
> 86 andl $0x80808080, %ecx / %ecx &= 0x80808080
> 87 cmpl $0x80808080, %ecx / if null byte in this word
> 88 jne .L3 / goto .L3
> 89 movl %edx, (%edi) / copy this word to (%edi)
> 90 subl $4, %esi / decrement number of bytes by 4
> 91 addl $4, %edi / next word
> 92 cmpl $4, %esi / if number of bytes >= 4
> 93 jae .L2 / goto .L2
> 94 jmp .L4 / goto .L4
> 95 .L3:
> 96 subl $4, %eax / post-incremented
> 97 .align 4
> 98 .L4:
> 99 / (number of bytes < 4) or (a null byte found in the word)
> 100 cmpl $0, %esi / if number of bytes == 0
> 101 je .L7 / goto .L7 (finished)
> 102 movb (%eax), %dl / %dl = a byte in (%eax)
> 103 decl %esi / decrement number of bytes by 1
> 104 movb %dl, (%edi) / copy %dl to (%edi)
> 105 incl %eax / next byte
> 106 incl %edi / next byte
> 107 cmpb $0, %dl / compare %dl with a null byte
> 108 je .L5 / if %dl is a null, goto .L5
> 109 jmp .L4 / goto .L4
> 110 .align 4
> 111 .L1:
> 112 / %eax not aligned
> 113 cmpl $0, %esi / if number of bytes == 0
> 114 je .L7 / goto .L7 (finished)
> 115 movb (%eax), %dl / %dl = a byte in (%eax)
> 116 decl %esi / decrement number of bytes by 1
> 117 movb %dl, (%edi) / copy %dl to (%edi)
> 118 incl %edi / next byte
> 119 incl %eax / next byte
> 120 cmpb $0, %dl / compare %dl with a null byte
> 121 je .L5 / if %dl is a null, goto .L5
> 122 testl $3, %eax / if %eax word aligned
> 123 jz .L8 / goto .L8
> 124 jmp .L1 / goto .L1 (not word aligned)
> 125 .align 4
> 126 .L5:
> 127 movl %esi, %ecx / %ecx = length to copy null bytes
> 128 xorl %eax, %eax / clear %eax
> 129 shrl $2, %ecx / %ecx = words to copy null bytes
> 130 rep ; sstol / rep;sstol is optimal
> 131 andl $3, %esi / %esi = leftover bytes
> 132 .L6:
> 133 cmpl $0, %esi / if number of bytes == 0
> 134 jz .L7 / goto .L7 (finished)
> 135 movb $0, (%edi) / move a null byte to (%edi)
> 136 decl %esi / decrement number of bytes by 1
> 137 incl %edi / next byte
> 138 jmp .L6 / goto .L6
> 139 .align 4
> 140 .L7:
> 141 movl 12(%esp), %eax / return the destination address
> 142 popl %esi / restore register variables
> 143 popl %edi
> 144 ret
> 145 SET_SIZE(strncpy)
>
>
> Both functions are in optimized assembly - but you don't even have be to be
> very knowledgable about asm (which I am not that much) to know that memcpy()
> will be faster than strncpy(), period.
>
>
>
> So consider this simple little test program on Linux 2.6.20 w/ gcc-4.1.1:
>
> $ cat strn.c
> #include <string.h>
> #include <stdlib.h>
>
> static char buf[] = "an example string to use for copying bytes";
> /* word offsets: 0 1 2 3 4 5 6 7 8 9 0 */
>
> static size_t strlcpy(char *dst, const char *src, size_t siz)
> {
> register char *d = dst;
> register const char *s = src;
> register size_t n = siz;
>
> if (n == 0)
> return(strlen(s));
> while (*s != '\0') {
> if (n != 1) {
> *d++ = *s;
> n--;
> }
> s++;
> }
> *d = '\0';
>
> return(s - src); /* count does not include NUL */
> }
>
> int main(int argc, char **argv)
> {
> char *q;
> int i, l = sizeof buf - 1;
>
> if (argc <= 1)
> return EXIT_FAILURE;
> if ((q = malloc(l * sizeof *q + 1)) == NULL)
> return EXIT_FAILURE;
>
> switch (*argv[1]) {
> case 'm':
> for (i = 16777216; i--; ) {
> /* one doesn't typically copy nul */
> memcpy(q, buf, l); q[l] = '\0';
> }
> break;
> case 'l':
> for (i = 16777216; i--; )
> strlcpy(q, buf, l);
> break;
> case 'n':
> for (i = 16777216; i--; )
> strncpy(q, buf, l);
> break;
> case 's':
> for (i = 16777216; i--; )
> strcpy(q, buf);
> break;
> default: break;
> }
>
> free(q);
>
> return 0;
> }
>
> $ gcc -static -O3 -W -Wall -pedantic -o strn strn.c
> $ for i in l m s n; do TIMEFORMAT="$i r: %3lR u: %3lU s: %3lS"; for n in 0 1 2 3 4 5 6 7; do time ./strn $i; done; echo; done
> l r: 0m2.593s u: 0m2.380s s: 0m0.004s
> l r: 0m2.491s u: 0m2.436s s: 0m0.004s
> l r: 0m2.494s u: 0m2.408s s: 0m0.000s
> l r: 0m2.425s u: 0m2.364s s: 0m0.012s
> l r: 0m2.494s u: 0m2.408s s: 0m0.008s
> l r: 0m2.432s u: 0m2.380s s: 0m0.000s
> l r: 0m2.530s u: 0m2.420s s: 0m0.000s
> l r: 0m2.440s u: 0m2.388s s: 0m0.000s
>
> m r: 0m0.546s u: 0m0.508s s: 0m0.008s
> m r: 0m0.526s u: 0m0.508s s: 0m0.000s
> m r: 0m0.545s u: 0m0.512s s: 0m0.000s
> m r: 0m0.545s u: 0m0.512s s: 0m0.004s
> m r: 0m0.545s u: 0m0.508s s: 0m0.000s
> m r: 0m0.534s u: 0m0.508s s: 0m0.000s
> m r: 0m0.547s u: 0m0.508s s: 0m0.004s
> m r: 0m0.526s u: 0m0.508s s: 0m0.004s
>
> s r: 0m1.738s u: 0m1.688s s: 0m0.000s
> s r: 0m1.760s u: 0m1.692s s: 0m0.004s
> s r: 0m1.813s u: 0m1.712s s: 0m0.004s
> s r: 0m1.736s u: 0m1.700s s: 0m0.004s
> s r: 0m1.764s u: 0m1.700s s: 0m0.004s
> s r: 0m1.757s u: 0m1.696s s: 0m0.004s
> s r: 0m1.745s u: 0m1.688s s: 0m0.008s
> s r: 0m1.749s u: 0m1.692s s: 0m0.000s
>
> n r: 0m1.913s u: 0m1.860s s: 0m0.004s
> n r: 0m1.945s u: 0m1.872s s: 0m0.000s
> n r: 0m1.920s u: 0m1.868s s: 0m0.000s
> n r: 0m1.951s u: 0m1.876s s: 0m0.004s
> n r: 0m1.973s u: 0m1.868s s: 0m0.036s
> n r: 0m1.918s u: 0m1.864s s: 0m0.000s
> n r: 0m1.922s u: 0m1.864s s: 0m0.004s
> n r: 0m1.935s u: 0m1.864s s: 0m0.004s
>
> Regardless of the differences between str, strn, and strl, memcpy just blows
> them out of the water. I even included the terminating nul for realistic use.
>
> -cl
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 249 bytes
Desc: OpenPGP digital signature
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20070218/e0a920d9/attachment.pgp>
More information about the tor-dev
mailing list