Aussie AI
C++ Lexical Bugs
Bonus Material for "Generative AI in C++"
by David Spuler, Ph.D.
Lexical Bugs
The lexical phase of C++ compilation is the first analysis of the code. This phase looks to group all of the characters you've typed into "tokens" such as comments, numbers, variable names, or constants. There are various ways that the compiler can misunderstand what you were trying to do.
Unclosed multi-line comments
The C-style /*...*/ comments do not nest in C++ (nor did they in C). The presence of "/*" inside another /*...*/ style comment does not start a nested comment. Instead, the second "/*" is ignored and the compiler keeps looking for */ to close the comment. This is the main reason you cannot use /*..*/ to comment-out multiple lines of code (use "#if 0" instead).
Leaving a comment unclosed accidentally can comment out part of your code. The /* of the unclosed comment matches the */ of a second comment, leaving out any code between the two. In the example shown below, only the first printf statement will be executed because the second is accidentally commented out:
printf("j = %d\n", j);  /* this is a comment - unclosed!
printf("i = %d\n", i);  /* this statement is commented out */
Many C++ compilers can warn about "nested comments" (for example, GCC's -Wcomment option, which is enabled by -Wall), but this isn't technically a compilation error, so the program will still build and run.
Note that multi-line "/*" comments starting inside a C++ "//" comment are also ignored.
printf("j = %d\n", j);  // This is a comment /* This is too...
printf("i = %d\n", i);  // This statement is NOT commented out
The above code works, but it's very suspicious. It looks like the programmer was writing a /*...*/ comment, but forgot to finish it. Hopefully, a compilation warning is emitted to remind the programmer to clean this up.
Comment Out of Nowhere
Another very strange instance of nested comment problems appears when the / and * operators are placed together, such as:
divisor = x /*ptr;
Instead of dividing x by what ptr points to, the /* sequence starts a comment. Like most instances of this error, the above code will usually cause a compilation error because the "divisor=x" statement will have no semicolon. However, there are pathological examples where no error is generated, and the / operation simply disappears!
Other Nested Comment Errors
Another common pitfall is attempting to comment out a statement that already contains comments. However, this is less of a problem than the pitfalls above in that the */ sequence of the enclosing comment is no longer actually in a comment, and is parsed as if it were part of a statement. Thus the error typically provokes a compilation diagnostic. The solution is to use #if 0 and #endif rather than comments.
Nesting comments is not usually any problem with C++-style // comments. The appearance of another // sequence before the newline is simply ignored. Using a /* comment to surround multiple lines containing // comments is also no problem.
A minor danger may occur if the programmer accidentally uses // to comment out the beginning of a multiline C-style /* comment, such as below:
// ... /* this is a multiline
          comment */
Fortunately, this situation will almost always cause a compilation error either from the contents of the comment or from the closing */ sequence. It is possible to generate pathological examples where the code will compile cleanly and fail at run-time, but such instances are so rare that I can't think of a likely one. Can you?
A very rare problem involving the C++-style // comment is that of accidentally commenting out a statement. One potential danger appears in the common practice of converting C-style /*..*/ comments into // comments by changing /* to // and deleting */. Consider what could happen to the code below during conversion:
for (i = 1; i <= 10; i++)
    /* do nothing */ ;
assert(i == 11);
The code becomes:
for (i = 1; i <= 10; i++)
    // do nothing ;
assert(i == 11);
The null statement (the semicolon) is commented out, and the assert macro call becomes the body of the loop.
Accidental string literal concatenation
String concatenation is a relatively obscure feature of C++ that allows consecutive string literals to be merged into a single string literal. Concatenation of string literals takes place after the usual preprocessing tasks (i.e., after macro expansion), but before parsing.
An example of its usage is the following code:
const char *prompt = "Hello " "world";
This looks like a typo to beginner C++ programmers, but is totally valid C++ that will be equivalent to:
const char *prompt = "Hello world";
Once you get used to it, this is a very helpful C++ feature that is most useful for writing long string literals on multiple lines. In particular, it avoids the pitfalls that line splicing (i.e. backslashes at the end of a line) has involving whitespace inside string literals.
Unfortunately, the fact that the compiler (or preprocessor) performs this concatenation automatically without any warning can also lead to strange errors. Consider the following definition of an array of strings:
const char *arr[] = { "a", "b" "c" }; // Bug (missing comma)
The absence of the second comma causes "b" and "c" to be concatenated to produce "bc", and arr is defined to hold 2 strings instead of 3. Even if the array size were explicitly declared as 3 (i.e., const char *arr[3]), many compilers would still not produce a warning, since having too few initializers is not an error.
Line splicing string constants
The placement of a backslash as the last character on a line is called line splicing, as it causes the lines to be joined as if there had been no backslash or newline. This feature is convenient for writing long string constants on multiple lines, such as:
const char *prompt = "Hello \
world";
There is one danger involving whitespace characters. The programmer must be careful about the number of spaces or tabs before the backslash, and also at the beginning of the next line, because these spaces will become part of the string constant. The following code fragment illustrates both mistakes:
const char *prompt = "Hello     \
        world";
The resulting string literal has many spaces between the two words. One solution to the problem is to use the facility of adjacent string literal concatenation wherever it is necessary to extend string literals over more than one source code line.
Note that this danger involving whitespace does not exist for the other common use of line splicing in creating multiple-line macros, because these spaces will not be "inside" a token.
Octal integer constants
Any integer constant beginning with 0 (unless the 0 is followed by x or b for hexadecimal or binary) is treated as an octal constant. This creates no problem with 0 itself since its value is the same in both octal and decimal, but there are dangers in using prefix zeros on integer constants. For example, the following use of prefix zeros to line up columns of integer initializers is erroneous:
int powers_of_10[] = {
    0001,  // Octal 1 == decimal 1
    0010,  // Octal 10 == decimal 8
    0100,  // Octal 100 == decimal 64
    1000,  // Decimal 1000
};
The correct solution is simply to use spaces instead of prefix zeros. Nevertheless, the temptation to use initial zeros can arise occasionally. For example, consider representing 4-digit phone extension numbers as integers:
struct {
    const char *name;
    int ext_number;
} arr[] = {
    { "Mary", 7234 },
    { "John", 3467 },
    { "Elaine", 0135 }  // Bug!
};
The extension number 0135 will be interpreted as an octal constant, and won't equal decimal 135. Its value is 1*64 + 3*8 + 5 = 93 in decimal.
Lowercase l suffix on integer constants
A problem can arise with programmers who use lowercase l as the suffix to indicate long constants. In some printed fonts, and to some extent on the screen, an l letter looks almost identical to 1 (one). Therefore, the use of lowercase l as a suffix is highly error-prone because it can be easily mistaken for a 1.
For example, consider the constant 10l; is it 10 of type long or 101 of type int? Unfortunately, the chance of the compiler noticing that the constant is of the wrong type is very slim, and this error can be very hard to detect. The simple solution is: use the uppercase suffix L.
Character escape errors
Novice C++ programmers occasionally confuse / with \ when used in printf format strings. The error is usually reasonably harmless, as it will appear as erroneous output. A typical example is:
printf("Hello world/n"); // Bug
Corrected code is:
printf("Hello world\n"); // Correct
Hexadecimal escape extra characters
A rare error can occur when using hexadecimal or octal escapes in string literals. Octal escapes use at most 3 octal digits (and need not start with the digit 0), so if the programmer writes too many digits, the extra digits are included as ordinary characters in the string. For example, the string "\000002" is 4 characters: '\000' (null byte), '0', '0' and '2'. Hexadecimal escapes with \x behave differently in standard C++: they have no maximum length and absorb every hexadecimal digit that follows. Thus "\xffff" is a single escape whose value does not fit in a char, which compilers typically diagnose, and "\x41bc" does not mean 'A' followed by "bc". Ending the literal after the escape, as in "\x41" "bc", stops the escape where intended.
These rules also create a portability problem for very old compilers, where there were different rules. In very old compilers the escape "\000002" would be a single character in a string literal; in modern C++ it is 4 characters.
Tabs in output statements
Using tab characters to align columns of output is an error-prone practice. The alignment of columns to tab stops will be different in the source code from how it appears in the output, because the source code is already indented a number of characters by whitespace and by the characters that make up the program statement.
Quotes in preprocessed-out code
A common mistake is to assume that code ignored by the preprocessor, such as by using #if 0, can contain any text. Unfortunately, such text must contain valid preprocessor tokens, so the following code illustrates a problem, since the apostrophe in the text should start a character constant token:
#if 0
this code shouldn't compile cleanly
#endif
The solution is to place this text inside comment delimiters, in which case the apostrophe is harmless:
#if 0
/* the comment below is incorrect! */
/* this code shouldn't compile cleanly */
#endif