int atoi(const char *);
The function embodies the first rule in the joke about Unix:
In the early PDP-11 days, Unix programs had the following design parameters:
Rule 1. It didn’t have to be good, or even correct, but:
Rule 2. It had to be small.
Thus the toolkit approach, and so forth. Of course, over time computer hardware has become progressively more powerful: processors speed up, address spaces grow from 16 to 32 bits, memory gets cheaper. So Rule 2 has been relaxed.
The classic implementation of atoi is small and fast, and I bet this is how it was implemented in the original C runtime:
int atoi(char *c) {
    int res = 0;
    while (*c >= '0' && *c <= '9')
        res = res * 10 + *c++ - '0';
    return res;
}
The problem is, it only works on correct input, and there's no way to know when the input is incorrect: call it with "boo" and it happily returns 0, which is indistinguishable from the result of passing it "0". Worse, give it a long enough string of digits, and the return value will be wrong because of integer overflow.
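To make the failure modes concrete, here is a minimal demonstration (the overflow case is actually undefined behavior per the C standard, so whatever it prints is garbage):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%d\n", atoi("0"));           // 0 - a legitimate zero
    printf("%d\n", atoi("boo"));         // 0 - an error, indistinguishable from above
    printf("%d\n", atoi("99999999999")); // undefined behavior - overflow
    return 0;
}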
So if you're absolutely guaranteed that the number is well-formed, you can use atoi. But how often does that happen in real life? When parsing flags or user input, the result of this function simply cannot be trusted.
These days atoi is typically implemented on top of strtol, which is a better function:
long strtol(const char *nptr,
            char **endptr,
            int base);
With strtol it is possible to figure out which part of the input (or none) was actually converted. It also sets errno to indicate an overflow.
However, being a patch over atoi, it exhibits the same terrible interface - you CAN figure out if there are errors, but you have to bend over backwards to do it. Here's how it should be called to actually detect errors:
errno = 0;
char *p;
long num = strtol(str, &p, 10);
if (errno != 0 || *p != '\0') {
    // ...error handling
}
For this reason, every system that I've worked on so far has had its own internal string-to-number conversion functions. I even wrote a couple myself :-).
What would a production-worthy interface look like? I would argue that the best thing to do is to return an error explicitly, like this:
// Returns 0 if the input contains a parseable
// number, and an error number otherwise.
// Parameters:
//   str  Input string
//   res  Pointer to the resulting parsed integer
int atoi(char *str, int *res);
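For illustration, here is one possible sketch of that contract built on top of strtol (the name atoi_checked and the choice of EINVAL/ERANGE as error codes are mine, just for the example):

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

// Sketch: returns 0 on success, EINVAL for malformed input,
// ERANGE when the value does not fit into an int.
int atoi_checked(const char *str, int *res) {
    char *end;
    errno = 0;
    long num = strtol(str, &end, 10);
    if (end == str || *end != '\0')
        return EINVAL; // no digits at all, or trailing garbage
    if (errno == ERANGE || num < INT_MIN || num > INT_MAX)
        return ERANGE; // out of range for int
    *res = (int)num;
    return 0;
}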
Here's the C++-ish code that I use at home. It's not super-fast (to detect overflow it uses division, which is a big performance no-no), but it does thorough input validation.
Its advantage is that it's readable and easy to understand. I consider it generally acceptable when validating flags or user input - something that is either inherently slow or happens once per program run.
I would probably advise against using it in parsers though.
#include <windows.h>  // for WCHAR
#include <cctype>     // for tolower

template<class IntType>
bool AtoI(WCHAR *sz, IntType *out) {
    int i = 0;
    bool sign = false;
    if (sz[i] == '-') {
        ++i;
        sign = true;
    }
    // Detect the base: 0x... is hex, a leading 0 is octal, otherwise decimal.
    int base = 10;
    if (sz[i] == '0' && tolower(sz[i + 1]) == 'x') {
        i += 2;
        base = 16;
    } else if (sz[i] == '0' && sz[i + 1] != '\0') {
        ++i;
        base = 8;
    }
    IntType res = 0;
    int chars = 0;
    while (sz[i] != '\0') {
        WCHAR c = tolower(sz[i++]);
        if (c >= '0' && c <= '9')
            c = c - '0';
        else if (c >= 'a' && c <= 'f')
            c = c - 'a' + 10;
        else
            return false; // not a digit at all
        if (c >= base)
            return false; // not a valid digit in this base
        IntType res_sav = res;
        res *= (IntType)base;
        if (res / (IntType)base != res_sav) {
            // Overflow!
            return false;
        }
        res_sav = res;
        res += (IntType)c;
        if (res < res_sav) {
            // Overflow!
            return false;
        }
        ++chars;
    }
    if (chars == 0)
        return false; // no digits consumed
    *out = sign ? -res : res;
    return true;
}
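Usage is straightforward - for example (hypothetical values, just to show the calling convention):

int flag_value = 0;
if (AtoI<int>(L"0x1f", &flag_value)) {
    // flag_value == 31
} else {
    // malformed input or overflow
}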
If you need performance and you are coding on Windows, you can use intsafe.h. It's a library that does overflow-proof data conversion and arithmetic. It is also unbelievably ugly (think names like UnsignedMultiply128).
But it does the job, correctly, and probably much faster than any home-brew approach (with the same level of correctness, that is).
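As a rough sketch, the accumulation step of a parser built on it might look like this (I'm using the UIntMult/UIntAdd helpers here; check your SDK's intsafe.h for the exact set of names):

#include <windows.h>
#include <intsafe.h>

// Overflow-checked equivalent of res = res * base + digit:
// fails cleanly instead of silently wrapping around.
bool AccumulateDigit(UINT *res, UINT base, UINT digit) {
    UINT tmp;
    if (FAILED(UIntMult(*res, base, &tmp)))
        return false;
    if (FAILED(UIntAdd(tmp, digit, &tmp)))
        return false;
    *res = tmp;
    return true;
}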
12 comments:
If you are using C++, why not use ostringstream?
In C++, why not use boost::lexical_cast?
Don't both of them touch the heap?
I don't like hidden memory allocations. See here:
http://1-800-magic.blogspot.com/2007/11/guerilla-guide-to-native-memory.html
and here:
http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html
You forgot to increment c in the first code snippet.
That implementation has some serious weirdness going on. 01f will successfully parse as 23.
Fixed the atoi example - thanks. There was also a deref typo - it should have been *c++.
> That implementation has some serious weirdness going on. 01f will successfully parse as 23.
I just tried it:
int wmain(int argc, WCHAR **argv) {
    int res = 0;
    bool ret = AtoI<int>(L"01f", &res);
    printf("%s %d\n", ret ? "true" : "false", res);
    return 0;
}
It prints false 0, which is the expected result... Did something break in copy and paste?
You could also consider using OpenBSD's strtonum(3) - http://www.openbsd.org/cgi-bin/man.cgi?query=strtonum&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html
Being BSD, you could also freely integrate it into your projects. Source code is available here: http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/strtonum.c
You could also not use a horrible programming language.
WRT horrible programming languages, see here:
http://1-800-magic.blogspot.com/2007/11/domain-languages.html
:-)
If you change the last line in the first function to
return *c == '\0' ? res : -1;
it will report bad input.