More than you necessarily wanted to know about tor_strndup performance improvements from late-2004 [was Re: tor callgrinds]
Christopher Layne
clayne at anodized.com
Sun Feb 18 13:54:00 UTC 2007
On Sat, Feb 17, 2007 at 04:04:04PM -0500, Nick Mathewson wrote:
> On Fri, Feb 16, 2007 at 10:12:20PM -0500, Watson Ladd wrote:
> [...]
> > 2: Won't strlcpy run in time proportional to n? strncpy can't be much
> > faster as both functions need to loop from 0..n, load the character and
> > check if its null, and then copy it. The big difference is strncpy keeps
> > on going.
>
> You're confused about how strlcpy works when n is much smaller than
> the length of the source string. Look at the return value of strlcpy:
> it's the length of the original string. To find the length of the
> original string, strlcpy basically has to walk over the whole thing.
> So while the running time of strncpy(dst,src,siz) is O(siz),
> strlcpy(dst,src,siz) is O(strlen(src)). That's a big problem for
> tor_strndup: its most performance-relevant use case is copying a small
> portion of a much longer string, as when we're parsing router
> descriptors. In this use case, strncpy is obviously better.
Technically though shouldn't strlcpy() not have to do this? It already
knows thelength, as it is provided to it. The only anecdote to this is
that it will supercede an 'actual' length if it detects a '\0' along the
copy. This isn't typically bad as it's already walking the string
anyways.
e.g. OpenBSD copy (same as Tor's):
/*
* Copy src to string dst of size siz. At most siz-1 characters
* will be copied. Always NUL terminates (unless siz == 0).
* Returns strlen(src); if retval >= siz, truncation occurred.
*/
size_t strlcpy(char *dst, const char *src, size_t siz)
{
register char *d = dst;
register const char *s = src;
register size_t n = siz;
if (n == 0)
return(strlen(s));
while (*s != '\0') {
if (n != 1) {
*d++ = *s;
n--;
}
s++;
}
*d = '\0';
return(s - src); /* count does not include NUL */
}
Provided the string 'onion\0routing\0', with n being 13, only 6 loop
cycles will actually happen.
> This isn't one of those hypothetical armchair optimizations, by the
> way: we profiled Tor before and after, and found out that the time
> spent copying strings went way down after we made the change. (I'm
> afraid I don't still have the numbers; this change came in the
> 0.0.9pre6 develoment, was back in November of 2004.)
Unfortunately this may be due to the fact that strlcpy() vs strncpy()s
implementation may have been one of C vs asm. :-/
I've always been on the fence about the whole strlcpy vs strncpy vs
snprintf vs sprintf("%.*s", len, str) issue. For one, I tend to not
use many str* functions in general. Secondarily, the "bonus" of strncpy
zero-filling the remaining end of the dst buffer as some kind of security
benefit that others have brought up in the past, I find to be completely
baseless as it's depending on defensive programming and the possibility of
a buffer overflow happening in the first place.
Also, when it comes to strlcpy, you're actually better off using:
/* provided we've allocated dst as s_len + 1 bytes */
l = MIN(d_len, s_len);
memcpy(dst, src, l); dst[l] = '\0';
The rationale here being that you've handled the length issues, you already
know the length (typically) of the source ahead of time (how else would you
pre-allocate for dst? if it's an auto or static buffer, then you know the
length as well).
In most asm implementations of str* functions the maximum length they will
operate on, chunk-wise, is a 4-byte integer. The reason being that they
will do the old 'is there a null in 4 bytes' trick, check it's alignment,
and if everything looks good, copy off the equivalent space of an int in
one call. But the bonus of using memcpy() is that memcpy() knows the entire
length ahead of time - and if it can make use of, provided the asm backend
code is there, it will use higher order optimizations such as mmx and/or
sse ops to copy much more than 4 bytes at once. It also doesn't have to
deal with even checking for nul's at all. It's entire goal is to copy data
as fast as possible, especially since it's use is widespread inside the
kernel and networking code.
This is from Solaris. It does not do any mmx or amd-specific tricks to
make things even faster, but it is relatively optimized and nothing out
of the ordinary.
memcpy.s:
40 ENTRY(memcpy)
41 movl %edi,%edx / save register variables
42 pushl %esi
43 movl 8(%esp),%edi / %edi = dest address
44 movl 12(%esp),%esi / %esi = source address
45 movl 16(%esp),%ecx / %ecx = length of string
46 movl %edi,%eax / return value from the call
47
48 shrl $2,%ecx / %ecx = number of words to move
49 rep ; smovl / move the words
50
51 movl 16(%esp),%ecx / %ecx = number of bytes to move
52 andl $0x3,%ecx / %ecx = number of bytes left to move
53 rep ; smovb / move the bytes
54
55 popl %esi / restore register variables
56 movl %edx,%edi
57 ret
58 SET_SIZE(memcpy)
strncpy.s:
65 ENTRY(strncpy)
66 pushl %edi / save register variables
67 pushl %esi
68
69 movl 16(%esp), %eax / %eax = source string address
70 movl 12(%esp), %edi / %edi = destination string address
71 movl 20(%esp), %esi / %esi = number of bytes
72
73 testl $3, %eax / if %eax not word aligned
74 jnz .L1 / goto .L1
75 .L8:
76 cmpl $4, %esi / if number of bytes < 4
77 jb .L4 / goto .L4
78 .align 4
79 .L2:
80 movl (%eax), %edx / move 1 word from (%eax) to %edx
81 movl $0x7f7f7f7f, %ecx
82 andl %edx, %ecx / %ecx = %edx & 0x7f7f7f7f
83 addl $4, %eax / next word
84 addl $0x7f7f7f7f, %ecx / %ecx += 0x7f7f7f7f
85 orl %edx, %ecx / %ecx |= %edx
86 andl $0x80808080, %ecx / %ecx &= 0x80808080
87 cmpl $0x80808080, %ecx / if null byte in this word
88 jne .L3 / goto .L3
89 movl %edx, (%edi) / copy this word to (%edi)
90 subl $4, %esi / decrement number of bytes by 4
91 addl $4, %edi / next word
92 cmpl $4, %esi / if number of bytes >= 4
93 jae .L2 / goto .L2
94 jmp .L4 / goto .L4
95 .L3:
96 subl $4, %eax / post-incremented
97 .align 4
98 .L4:
99 / (number of bytes < 4) or (a null byte found in the word)
100 cmpl $0, %esi / if number of bytes == 0
101 je .L7 / goto .L7 (finished)
102 movb (%eax), %dl / %dl = a byte in (%eax)
103 decl %esi / decrement number of bytes by 1
104 movb %dl, (%edi) / copy %dl to (%edi)
105 incl %eax / next byte
106 incl %edi / next byte
107 cmpb $0, %dl / compare %dl with a null byte
108 je .L5 / if %dl is a null, goto .L5
109 jmp .L4 / goto .L4
110 .align 4
111 .L1:
112 / %eax not aligned
113 cmpl $0, %esi / if number of bytes == 0
114 je .L7 / goto .L7 (finished)
115 movb (%eax), %dl / %dl = a byte in (%eax)
116 decl %esi / decrement number of bytes by 1
117 movb %dl, (%edi) / copy %dl to (%edi)
118 incl %edi / next byte
119 incl %eax / next byte
120 cmpb $0, %dl / compare %dl with a null byte
121 je .L5 / if %dl is a null, goto .L5
122 testl $3, %eax / if %eax word aligned
123 jz .L8 / goto .L8
124 jmp .L1 / goto .L1 (not word aligned)
125 .align 4
126 .L5:
127 movl %esi, %ecx / %ecx = length to copy null bytes
128 xorl %eax, %eax / clear %eax
129 shrl $2, %ecx / %ecx = words to copy null bytes
130 rep ; sstol / rep;sstol is optimal
131 andl $3, %esi / %esi = leftover bytes
132 .L6:
133 cmpl $0, %esi / if number of bytes == 0
134 jz .L7 / goto .L7 (finished)
135 movb $0, (%edi) / move a null byte to (%edi)
136 decl %esi / decrement number of bytes by 1
137 incl %edi / next byte
138 jmp .L6 / goto .L6
139 .align 4
140 .L7:
141 movl 12(%esp), %eax / return the destination address
142 popl %esi / restore register variables
143 popl %edi
144 ret
145 SET_SIZE(strncpy)
Both functions are in optimized assembly - but you don't even have be to be
very knowledgable about asm (which I am not that much) to know that memcpy()
will be faster than strncpy(), period.
So consider this simple little test program on Linux 2.6.20 w/ gcc-4.1.1:
$ cat strn.c
#include <string.h>
#include <stdlib.h>
static char buf[] = "an example string to use for copying bytes";
/* word offsets: 0 1 2 3 4 5 6 7 8 9 0 */
static size_t strlcpy(char *dst, const char *src, size_t siz)
{
register char *d = dst;
register const char *s = src;
register size_t n = siz;
if (n == 0)
return(strlen(s));
while (*s != '\0') {
if (n != 1) {
*d++ = *s;
n--;
}
s++;
}
*d = '\0';
return(s - src); /* count does not include NUL */
}
int main(int argc, char **argv)
{
char *q;
int i, l = sizeof buf - 1;
if (argc <= 1)
return EXIT_FAILURE;
if ((q = malloc(l * sizeof *q + 1)) == NULL)
return EXIT_FAILURE;
switch (*argv[1]) {
case 'm':
for (i = 16777216; i--; ) {
/* one doesn't typically copy nul */
memcpy(q, buf, l); q[l] = '\0';
}
break;
case 'l':
for (i = 16777216; i--; )
strlcpy(q, buf, l);
break;
case 'n':
for (i = 16777216; i--; )
strncpy(q, buf, l);
break;
case 's':
for (i = 16777216; i--; )
strcpy(q, buf);
break;
default: break;
}
free(q);
return 0;
}
$ gcc -static -O3 -W -Wall -pedantic -o strn strn.c
$ for i in l m s n; do TIMEFORMAT="$i r: %3lR u: %3lU s: %3lS"; for n in 0 1 2 3 4 5 6 7; do time ./strn $i; done; echo; done
l r: 0m2.593s u: 0m2.380s s: 0m0.004s
l r: 0m2.491s u: 0m2.436s s: 0m0.004s
l r: 0m2.494s u: 0m2.408s s: 0m0.000s
l r: 0m2.425s u: 0m2.364s s: 0m0.012s
l r: 0m2.494s u: 0m2.408s s: 0m0.008s
l r: 0m2.432s u: 0m2.380s s: 0m0.000s
l r: 0m2.530s u: 0m2.420s s: 0m0.000s
l r: 0m2.440s u: 0m2.388s s: 0m0.000s
m r: 0m0.546s u: 0m0.508s s: 0m0.008s
m r: 0m0.526s u: 0m0.508s s: 0m0.000s
m r: 0m0.545s u: 0m0.512s s: 0m0.000s
m r: 0m0.545s u: 0m0.512s s: 0m0.004s
m r: 0m0.545s u: 0m0.508s s: 0m0.000s
m r: 0m0.534s u: 0m0.508s s: 0m0.000s
m r: 0m0.547s u: 0m0.508s s: 0m0.004s
m r: 0m0.526s u: 0m0.508s s: 0m0.004s
s r: 0m1.738s u: 0m1.688s s: 0m0.000s
s r: 0m1.760s u: 0m1.692s s: 0m0.004s
s r: 0m1.813s u: 0m1.712s s: 0m0.004s
s r: 0m1.736s u: 0m1.700s s: 0m0.004s
s r: 0m1.764s u: 0m1.700s s: 0m0.004s
s r: 0m1.757s u: 0m1.696s s: 0m0.004s
s r: 0m1.745s u: 0m1.688s s: 0m0.008s
s r: 0m1.749s u: 0m1.692s s: 0m0.000s
n r: 0m1.913s u: 0m1.860s s: 0m0.004s
n r: 0m1.945s u: 0m1.872s s: 0m0.000s
n r: 0m1.920s u: 0m1.868s s: 0m0.000s
n r: 0m1.951s u: 0m1.876s s: 0m0.004s
n r: 0m1.973s u: 0m1.868s s: 0m0.036s
n r: 0m1.918s u: 0m1.864s s: 0m0.000s
n r: 0m1.922s u: 0m1.864s s: 0m0.004s
n r: 0m1.935s u: 0m1.864s s: 0m0.004s
Regardless of the differences between str, strn, and strl, memcpy just blows
them out of the water. I even included the terminating nul for realistic use.
-cl
More information about the tor-dev
mailing list