1-800-MAGIC: Down with atoi!

Sunday, February 17, 2008

Down with atoi!

One of the interview questions that I ask is "What are the reasons one should dislike atoi?" atoi of course is an ancient C function that converts strings (a stands for ASCII) to integers:


    int atoi(char *);

The function embodies the first rule in the joke about Unix:

In the early PDP-11 days, Unix programs had the following design parameters:
Rule 1. It didn’t have to be good, or even correct, but:
Rule 2. It had to be small.
Thus the toolkit approach, and so forth. Of course, over time, computer hardware has become progressively more powerful: processors speed up, address spaces move from 16 to 32 bits, memory gets cheaper, and so forth. So Rule 2 has been relaxed.

Classic implementation of atoi is small and fast, and I bet that this is how it was implemented in original C runtime:


    int atoi(char *c) {
      int res = 0;
      while (*c >= '0' && *c <= '9'))
        res = res * 10 + *c++ - '0';
      return res;
    }

The problem is, it only works on correct input, and there's no way to know if the input is incorrect: call it with "boo" and it happily returns 0, which is the same as if it were passed with 0 input. Worse, give it a long-enough string of digits, and the return value would be incorrect because of the integer overflows.

So if you're absolutely guaranteed that the number is well-formatted, you can use atoi. But how often does this happen in real life? In most cases when parsing flags or user input the results of this function are non-deterministic.

These days atoi internal implementation is using strtol, which is a better function:


    long strtol(const char *nptr,
                char **endptr,
                int base);

With strtol it is possible to figure out which part of the input (or none) was actually converted. It also sets errno to indicate an overflow.

However, being a patch over atoi, it exhibits the same terrible interface - you CAN figure out if there are errors, but you have to bend over the backwards to do it. Here's how it should be called to actually process errors:


    errno = 0;
    char *p;
    long num = strtol(str, &p, 10);
    if (errno != 0 || *p != '\0') {
       ...error
    }

For this reason, every system that I've worked on so far had its own internal string to number conversion functions. I even wrote a couple myself :-).

What would be a production-worthy interface? I would argue that the best thing to do is to return an error explicitly, like this:


   // Returns 0 if the input contains parseable
   // number, and error number otherwise
   // Parameters:
   //   str Input string
   //   res Pointer to the resulting parsed integer
   int atoi(char *str, int *res);

Here's the C++-ish code that I use at home. It's not super-fast, (because to detect the overflow it uses division, which is a big performance no-no), but it does thorough input validation.

The advantage of it is that it's readable and easy to understand. I consider it generally acceptable when validating flags or user input - something that either is inherently slow, or happens once per program.

I would probably advise against using it in parsers though.


template<class IntType>
    bool AtoI(WCHAR *sz, IntType *out) {
  int i = 0;
  bool sign = false;
  if (sz[i] == '-') {
    ++i;
    sign = true;
  }

  int base = 10;
  if(sz[i] == '0' && tolower(sz[i+1]) == 'x') {
    i += 2;
    base = 16;
  } else if (sz[i] == '0' && sz[i+1] != '\0') {
    ++i;
    base = 8;
  }

  IntType res = 0;
  int chars = 0;
  while (sz[i] != '\0') {
    WCHAR c = tolower(sz[i++]);
    if (c >= '0' && c <= '9')
      c = c - '0';
    else if (c >= 'a' && c <= 'f')
      c = c - 'a' + 10;
    else
      return false;
    if (c >= base)
      return false;
    IntType res_sav = res;
    res *= (IntType)base;
    if (res / (IntType)base != res_sav) {
      // Overflow!
      return false;
    }
    res_sav = res;
    res += (IntType)c;
    if (res < res_sav) {
      // Overflow!
      return false;
    }
    ++chars;
  }
  if (chars == 0)
    return false;
  *out = sign ? -res : res;
  return true;
}

If you need performance and you are coding on Windows, you can use intsafe.h. It's a library that does overflow-proof data coversion and arithmetic. It is also unbelievably ugly (think names like UnsignedMultiply128).

But it does the job, correctly, and probably much faster than any home-brew approach (with the same level of correctness, that is).

12 comments:

Anonymous said...: If you are using c++, why not use ostringstream?; February 18, 2008 at 7:56 AM
Anonymous said...: in c++, why not use boot::lexical_cast?; February 18, 2008 at 9:12 AM
Anonymous said...: boost::lexcial_cast; February 18, 2008 at 9:13 AM
Sergey Solyanik said...: Don't both of them touch heap?

I don't like hidden memory allocations. See here:
http://1-800-magic.blogspot.com/2007/11/guerilla-guide-to-native-memory.html
and here:
http://1-800-magic.blogspot.com/2007/12/memory-is-not-free-more-on-vista.html; February 18, 2008 at 9:41 AM
Anonymous said...: You forgot to increment c in the first code snippet.; February 18, 2008 at 10:55 AM
Anonymous said...: That implementation has some serious weirdness going on. 01f will successfully parse as 23.; February 18, 2008 at 11:59 AM
Sergey Solyanik said...: Fixed the atoi example - thanks, there was also a deref typo - it should have been *c++.

> That implementation has some serious weirdness going on. 01f will successfully parse as 23.

I just tried it:
int wmain(int argc, WCHAR **argv) {
int res = 0;
bool ret = AtoI&tl;int>(L"01f", &res);
printf ("%s %d\n", ret ? "true" : "false", res);
return 0;
}

It prints false 0, which is the excpected result... Did something break in copy and paste?; February 18, 2008 at 12:49 PM
Nelson Castillo said...: This comment has been removed by the author.; February 18, 2008 at 6:33 PM
Joe Goh said...: You could also consider using OpenBSD's strtonum(3) - http://www.openbsd.org/cgi-bin/man.cgi?query=strtonum&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html

Being BSD, you could also freely integrate it into your projects. Source code is available here: http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/strtonum.c; February 18, 2008 at 9:51 PM
Anonymous said...: You could also not use a horrible programming language.; February 20, 2008 at 6:20 PM
Sergey Solyanik said...: WRT horrible programming languages, see here:
http://1-800-magic.blogspot.com/2007/11/domain-languages.html

:-); February 20, 2008 at 10:08 PM
Ehren Metcalfe said...: If you change the last line in the first function to

return *c == '\0' ? res : -1;

it will report bad input.; April 8, 2009 at 6:25 PM

1-800-MAGIC

Sunday, February 17, 2008

Down with atoi!

12 comments:

NIST Clock

About Me

Blog Archive

Links

Visitors