Sunday, February 17, 2008

Down with atoi!

One of the interview questions that I ask is "What are the reasons one should dislike atoi?" atoi, of course, is an ancient C function that converts strings to integers (the a stands for ASCII):

int atoi(const char *);

The function embodies the first rule in the joke about Unix:

In the early PDP-11 days, Unix programs had the following design parameters:
Rule 1. It didn’t have to be good, or even correct, but:
Rule 2. It had to be small.
Thus the toolkit approach, and so forth. Of course, over time, computer hardware has become progressively more powerful: processors speed up, address spaces move from 16 to 32 bits, memory gets cheaper, and so forth. So Rule 2 has been relaxed.

The classic implementation of atoi is small and fast, and I bet this is how it was implemented in the original C runtime:

int atoi(char *c) {
    int res = 0;
    while (*c >= '0' && *c <= '9')
        res = res * 10 + *c++ - '0';
    return res;
}

The problem is, it only works on correct input, and there's no way to know if the input is incorrect: call it with "boo" and it happily returns 0, which is indistinguishable from a legitimate input of "0". Worse, give it a long enough string of digits, and the return value will be wrong because of integer overflow.

So if you're absolutely guaranteed that the number is well-formed, you can use atoi. But how often does that happen in real life? In most cases - parsing flags or user input - the result of this function is simply unreliable.

These days atoi is typically implemented in terms of strtol, which is a better function:

long strtol(const char *nptr,
            char **endptr,
            int base);

With strtol it is possible to figure out which part of the input (if any) was actually converted. It also sets errno to indicate an overflow.

However, being a patch over atoi, it exhibits the same terrible interface - you CAN figure out whether there were errors, but you have to bend over backwards to do it. Here's how it has to be called to actually detect errors:

errno = 0;
char *p;
long num = strtol(str, &p, 10);
if (errno != 0 || p == str || *p != '\0') {
    // ...error: overflow, no digits, or trailing garbage
}

For this reason, every system that I've worked on so far had its own internal string to number conversion functions. I even wrote a couple myself :-).

What would be a production-worthy interface? I would argue that the best thing to do is to return an error explicitly, like this:

// Returns 0 if the input contains a parseable
// number, and an error code otherwise.
// Parameters:
//   str  Input string
//   res  Pointer to the resulting parsed integer
int atoi(const char *str, int *res);
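One way to implement such an interface is as a thin wrapper over strtol. This is my own sketch (the name parse_int and the choice of EINVAL/ERANGE as error codes are assumptions, not from the original post):

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

// Returns 0 on success, EINVAL for malformed input,
// ERANGE if the value does not fit in an int.
int parse_int(const char *str, int *res) {
    char *end;
    errno = 0;
    long num = strtol(str, &end, 10);
    if (end == str || *end != '\0')
        return EINVAL;   // no digits, or trailing garbage
    if (errno == ERANGE || num < INT_MIN || num > INT_MAX)
        return ERANGE;   // out of range for long, or for int
    *res = (int)num;
    return 0;
}
```

With this shape the caller can't accidentally ignore a failure without also ignoring the return value, and parse_int("boo", &v) fails loudly instead of producing a quiet 0.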


Here's the C++-ish code that I use at home. It's not super-fast (to detect overflow it uses division, which is a big performance no-no), but it does thorough input validation.

Its advantage is that it's readable and easy to understand. I consider it generally acceptable for validating flags or user input - something that is either inherently slow or happens once per program.

I would probably advise against using it in parsers though.


template<class IntType>
bool AtoI(WCHAR *sz, IntType *out) {
    int i = 0;
    bool sign = false;
    if (sz[i] == '-') {
        ++i;
        sign = true;
    }

    int base = 10;
    if (sz[i] == '0' && tolower(sz[i + 1]) == 'x') {
        i += 2;
        base = 16;
    } else if (sz[i] == '0' && sz[i + 1] != '\0') {
        ++i;
        base = 8;
    }

    IntType res = 0;
    int chars = 0;
    while (sz[i] != '\0') {
        WCHAR c = tolower(sz[i++]);
        if (c >= '0' && c <= '9')
            c = c - '0';
        else if (c >= 'a' && c <= 'f')
            c = c - 'a' + 10;
        else
            return false;
        if (c >= base)
            return false;
        IntType res_sav = res;
        res *= (IntType)base;
        if (res / (IntType)base != res_sav) {
            // Overflow!
            return false;
        }
        res_sav = res;
        res += (IntType)c;
        if (res < res_sav) {
            // Overflow!
            return false;
        }
        ++chars;
    }
    if (chars == 0)
        return false;
    *out = sign ? -res : res;
    return true;
}


If you need performance and you are coding on Windows, you can use intsafe.h. It's a library that does overflow-proof data conversion and arithmetic. It is also unbelievably ugly (think names like UnsignedMultiply128).

But it does the job, correctly, and probably much faster than any home-brew approach (with the same level of correctness, that is).

12 comments:

Anonymous said...

If you are using c++, why not use ostringstream?

Anonymous said...

in c++, why not use boot::lexical_cast?

Anonymous said...

boost::lexcial_cast

Sergey Solyanik said...

Don't both of them touch heap?

I don't like hidden memory allocations. See here:
http://1-800-magic.blogspot.com/2007/11/guerilla-guide-to-native-memory.html
and here:
http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html

Anonymous said...

You forgot to increment c in the first code snippet.

Anonymous said...

That implementation has some serious weirdness going on. 01f will successfully parse as 23.

Sergey Solyanik said...

Fixed the atoi example - thanks, there was also a deref typo - it should have been *c++.

> That implementation has some serious weirdness going on. 01f will successfully parse as 23.

I just tried it:
int wmain(int argc, WCHAR **argv) {
int res = 0;
bool ret = AtoI&tl;int>(L"01f", &res);
printf ("%s %d\n", ret ? "true" : "false", res);
return 0;
}

It prints false 0, which is the excpected result... Did something break in copy and paste?

Nelson Castillo said...
This comment has been removed by the author.
Joe Goh said...

You could also consider using OpenBSD's strtonum(3) - http://www.openbsd.org/cgi-bin/man.cgi?query=strtonum&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html

Being BSD, you could also freely integrate it into your projects. Source code is available here: http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/strtonum.c

Anonymous said...

You could also not use a horrible programming language.

Sergey Solyanik said...

WRT horrible programming languages, see here:
http://1-800-magic.blogspot.com/2007/11/domain-languages.html

:-)

Ehren Metcalfe said...

If you change the last line in the first function to

return *c == '\0' ? res : -1;

it will report bad input.