Aussie AI
Appendix: Bug Symptom Diagnosis
Bonus Material for "Generative AI in C++"
by David Spuler, Ph.D.
It is very beneficial to the debugging process to be able to identify the cause of an error from its symptoms. Unfortunately, this is a very difficult process — otherwise debugging would be easy! Nevertheless, there are some common run-time errors with well-known causes, and this section attempts to provide a brief catalog of common error causes, mapping observable failure symptoms into the common errors.
Linux core dumps
There are a number of run-time error messages that occur mainly on large UNIX machines, but not usually on personal computers. Some of the common run-time error messages are:
- segmentation fault
- bus error
- illegal instruction
- trace/BPT trap
The message "core dumped" will often accompany the error message if it causes program termination, and this indicates that a file named "core" has been saved in the current directory. The "core" file can be used for postmortem debugging to locate the failure with a symbolic debugger.
Note that the dump of the core file can be prevented by providing an empty file named "core" that is set to protection mode 000 using chmod. This may be useful if disk space is limited and the core dumps are huge.
A segmentation fault occurs when the hardware detects that the program has attempted to access memory it is not allowed to use. For example, the address NULL cannot be referenced. In fact, the single most common cause of a segmentation fault (at least for the experienced programmer) is a NULL dereference, but there are many other causes.
A bus error occurs when an attempt is made to load an incorrect address into an address bus. Although this leads us to suspect bad pointers, this error can also arise via stack corruption (because this can cause bad pointer addresses), and so there are a variety of potential causes.
Segmentation faults and bus errors may be reported as the program receiving signal SIGSEGV or SIGBUS in some situations. The most common causes of a segmentation fault or bus error are listed below. Different architectures will have different results for these errors, but will usually produce either a segmentation fault or bus error.
- NULL pointer dereference
- wayward pointer dereference (memory allocation problem)
- noninitialized pointer dereference
- array index out of bounds
- wrong number or type of arguments to nonprototyped function
- bad arguments to scanf or printf
- forgetting the & on arguments to scanf
- deallocating nonallocated location using free or delete
- deallocating same address twice using free or delete
- executable file removed/modified while being executed (dynamic loading)
- stack overflow
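As one illustration of the scanf items in the list above, here is a minimal sketch of a conversion that passes the address of its output argument and checks the return value. The helper name parse_int is hypothetical and is used only for this example.

```cpp
#include <cstdio>

// Hypothetical helper: parses a decimal integer from text.
// Returns true on success and stores the result in out.
bool parse_int(const char* text, int& out)
{
    int value = 0;
    // Wrong (undefined behavior, often a segmentation fault):
    //     std::sscanf(text, "%d", value);
    // Right: the scanf family needs the ADDRESS of the destination.
    int fields = std::sscanf(text, "%d", &value);
    if (fields != 1) {
        return false;   // no conversion took place; out is left untouched
    }
    out = value;
    return true;
}
```

Checking the returned field count matters as much as the &, because otherwise a failed conversion leaves the caller working with whatever value was there before.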
Another common abnormal termination condition for UNIX machines is the message "illegal instruction," which usually causes a core dump. The most common causes of this method of termination are:
- assert macro has failed (causes abort call)
- abort library function called
- data has been executed somehow (uninitialized pointer-to-function?)
- stack corruption (e.g., write past end of local array)
- stack overflow
- C++ exception problem causing abort call
— unhandled exception was thrown
— unexpected exception from function with interface specification
— exception thrown in destructor during exception-related stack unwinding
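The "unhandled exception" case above can often be ruled out with a catch-all at the top level of the program. The following is a sketch only; run_guarded is a hypothetical wrapper name, and the error codes are arbitrary choices for this example.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical wrapper: runs the program body and converts any escaping
// exception into an error code, rather than letting the run-time library
// call std::terminate (which calls abort by default).
int run_guarded(const std::function<void()>& body)
{
    try {
        body();
        return 0;
    }
    catch (const std::exception& e) {
        std::cerr << "Unhandled exception: " << e.what() << '\n';
        return 1;
    }
    catch (...) {
        std::cerr << "Unhandled non-standard exception\n";
        return 2;
    }
}
```

Note that a wrapper like this does not help with an exception thrown from a destructor during stack unwinding; that case still terminates the program and has to be fixed in the destructor itself.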
Another run-time error message for UNIX machines is the message "fixed up nonaligned data access," although this does not necessarily lead to program termination. This indicates that hardware has detected an attempt to access a value through an address with incorrect alignment requirements. Typically it refers to attempting to read or write an integer or pointer at an odd-valued address (i.e., an address that is not word-aligned). Note that on machines without this automatic "fix-up" the same code will probably cause a bus error.
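A common way to avoid misaligned accesses is to copy the bytes rather than dereference a cast pointer. The sketch below assumes that approach; load_u32 and store_u32 are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>

// Reads a 32-bit value from an arbitrary (possibly odd) byte address.
// Casting p to std::uint32_t* and dereferencing it is the pattern that can
// trigger a bus error or a "fix-up"; memcpy has no alignment requirement.
std::uint32_t load_u32(const unsigned char* p)
{
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Writes a 32-bit value to an arbitrary byte address in the same way.
void store_u32(unsigned char* p, std::uint32_t value)
{
    std::memcpy(p, &value, sizeof value);
}
```

For example, store_u32(buffer + 1, x) followed by load_u32(buffer + 1) round-trips the value even though buffer + 1 is unlikely to be word-aligned.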
Program hangs infinitely
When one is faced with debugging a program that seems to get stuck, it is important to determine what type of "hang" has occurred. If the program is simply stuck in an infinite loop, you will still have control of the program and can interrupt it. One method of finding out where the program is stuck is to run the program from a debugger, or (under UNIX) to use the keyboard interrupt <ctrl-\> to cause a core dump, which can then be examined by a debugger. Some causes of this form of infinite looping are:
- NP-complete algorithm (i.e. basically anything in AI)
- infinite loop
- accidental semicolon on end of while/for loop header
- exit called within a destructor of global object (C++ only)
- handled/ignored signal is recurring (e.g., SIGSEGV, SIGBUS)
- waiting for input: getc/getch assigned to char
- linked data structure corrupted (contains pointer cycles)
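For the last item in the list above, a corrupted linked list can be checked for pointer cycles with the well-known two-pointer ("tortoise and hare") technique. This is a minimal sketch; the Node type and has_cycle function are hypothetical names for this example.

```cpp
// Hypothetical singly linked list node.
struct Node {
    int value;
    Node* next;
};

// Returns true if following next pointers from head ever revisits a node.
// The fast pointer moves two steps per iteration and the slow pointer one,
// so they can only meet again if the list loops back on itself.
bool has_cycle(const Node* head)
{
    const Node* slow = head;
    const Node* fast = head;
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;
        }
    }
    return false;
}
```

A check like this can be called from a debugging assertion before any traversal loop that is suspected of never terminating.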
If the program hangs for a period of time and then crashes, a likely candidate is a runaway recursive function. This will loop (almost) infinitely, consuming stack space all the time, until it runs out of stack space and (a) terminates abnormally (e.g., under UNIX or DOS with stack checking enabled), or (b) the stack overwrites some important memory and the second, more severe form of "hang" occurs (e.g., under DOS without stack checking).
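One way to turn a runaway recursion into an immediate, diagnosable failure is an explicit depth limit. The sketch below assumes a limit of 1000 levels, which is an arbitrary value for illustration; kMaxDepth and sum_to are hypothetical names.

```cpp
#include <stdexcept>

// Hypothetical limit; choose one comfortably above any legitimate depth.
constexpr int kMaxDepth = 1000;

// Sums 1..n recursively, but refuses to recurse beyond kMaxDepth levels,
// so a bad argument fails with an exception instead of exhausting the stack.
long sum_to(int n, int depth = 0)
{
    if (depth > kMaxDepth) {
        throw std::runtime_error("recursion depth limit exceeded");
    }
    if (n <= 0) {
        return 0;
    }
    return n + sum_to(n - 1, depth + 1);
}
```

The extra parameter costs little, and the exception message points directly at the function involved rather than at whatever memory the stack happened to overwrite.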
The most severe form of a "hung" program is one that will not respond. This rarely occurs under UNIX or other large systems because of memory protection, but it is common for personal computers. You know it's a bad bug when the reset button is the only thing that works. When this occurs, I recommend the use of any compiler run-time checks, especially stack overflow checking and array bounds checking (if available). An additional method is to recompile using a memory allocation debugging library. Some possible causes of a nonresponsive program crash are:
- infinite recursion
- stack overflow
- array index out of bounds
- modification via wayward pointer
- modification via noninitialized pointer
- modification via NULL pointer
- freeing a nonallocated block
- freeing a string constant
- nonterminated string was copied
- inconsistent compiler/linker options (e.g., object files with different memory models)
Note how many of these errors will cause a hung program on "smaller" computers but will receive segmentation fault or bus errors on UNIX systems.
Failure after long execution
A very annoying error is that of a program that runs perfectly for a long period of time and then suddenly fails for no apparent reason. This usually indicates a "memory leak" causing the system to use up all available memory and malloc to return NULL. However, there are other causes and a more complete list is:
- untested rare sequence of events is causing the error (try to repeat it)
- heap memory leak causing allocation failure (allocated memory not deallocated)
- running out of FILE* handles (files opened but not closed)
- some form of memory corruption (symptom of bug doesn't appear immediately)
- integer overflow (e.g., of some 16-bit counter)
- disk filling up
- peripheral error (e.g., printer out of paper)
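The integer overflow item above is easy to reproduce in isolation. The following sketch counts events in a 16-bit unsigned counter and in a 64-bit one; the function names are hypothetical.

```cpp
#include <cstdint>

// Counts events in a 16-bit counter. Unsigned arithmetic wraps modulo 65536,
// so after 65536 events the counter silently restarts from 0.
std::uint16_t count_events_16(unsigned long events)
{
    std::uint16_t counter = 0;
    for (unsigned long i = 0; i < events; ++i) {
        ++counter;
    }
    return counter;
}

// The same loop with a 64-bit counter, which will not wrap in practice.
std::uint64_t count_events_64(unsigned long events)
{
    std::uint64_t counter = 0;
    for (unsigned long i = 0; i < events; ++i) {
        ++counter;
    }
    return counter;
}
```

With 70000 events the 16-bit version reports 4464 (that is, 70000 minus 65536), which is exactly the kind of failure that only shows up after a long run.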
Optimizer-only bugs
A program that runs correctly with normal compilation but fails when the optimizer is invoked is a well-known problem. The immediate reaction is to blame a bug in the optimizer. However, although such bugs are not so rare as one would wish, there are a number of other potential causes. It is usually an indication that some erroneous or nonportable code has been working correctly more by luck than good programming, and the more aggressive optimizations have shown up the error. Some possible causes are:
- order of evaluation errors (optimizer rearranges expressions)
- special location not declared volatile
- use of an uninitialized variable
- wrong number/type of arguments to nonprototyped function
- wrong arguments to prototyped function not declared before use
- memory access problems (optimizer has rearranged some memory)
In this situation it may be useful to examine which compiler options are available to control the optimizations that are applied. For example, there may be an option to choose between traditional stack-based argument passing and pass by register. If so, recompilation with and without that option can help to test for argument passing errors. Argument passing errors can also be found more quickly by lint under UNIX.
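Order of evaluation errors are one of the causes that an optimizer can expose. The sketch below shows one way to remove the dependency by sequencing each side effect in its own statement; Counter, subtract, and first_minus_second are hypothetical names.

```cpp
// Hypothetical value source: each call returns the next integer in sequence.
struct Counter {
    int next_value = 0;
    int next() { return next_value++; }
};

int subtract(int a, int b) { return a - b; }

// Fragile: the two next() calls are evaluated in an unspecified order, so
// the result may be -1 or +1 depending on the compiler and its options.
//     return subtract(c.next(), c.next());

// Portable: each side effect is sequenced in its own statement.
int first_minus_second(Counter& c)
{
    const int first = c.next();
    const int second = c.next();
    return subtract(first, second);
}
```

The rewritten version gives the same answer at every optimization level because the language guarantees the order of the two statements, whereas it makes no such guarantee for the order of function arguments.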
Failure disappears when debugger used
A really annoying situation is a program that crashes when run normally, but does not fail when run via a symbolic debugger or interpreter. One fairly well known cause is the use of an uninitialized automatic variable. The error may disappear when run via the debugger, because some debuggers set these local variables to zero or NULL initially. Thus, some possible causes are:
- using uninitialized variable (especially a pointer)
- memory access problems (debugger has rearranged memory somehow)
— array index out of bounds
— modification via wayward pointer
— modification via noninitialized pointer
— NULL pointer dereference
— modification via NULL pointer
— freeing a nonallocated block
— freeing a string constant
The list of errors possibly causing a memory-related problem is comparable with the list of errors causing a nonrecoverable hung program.
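Since uninitialized variables head the list of causes, one defensive habit is to give every variable a defined starting value, so that behavior does not depend on whether a debugger zeroes the stack. This is a sketch with hypothetical names (ParseState, make_state, zero_initialized).

```cpp
#include <cstddef>

struct ParseState {
    // Default member initializers give every object a defined starting
    // state, whether or not the surrounding memory happens to be zero.
    const char* cursor = nullptr;
    std::size_t tokens_seen = 0;
    bool failed = false;
};

// Returns a freshly initialized local object.
ParseState make_state()
{
    ParseState state;      // members take their default initializers
    return state;
}

// Returns a value-initialized scalar: the empty braces guarantee zero.
int zero_initialized()
{
    int total{};           // compare: "int total;" would be indeterminate
    return total;
}
```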
Program crashes at startup
When a C++ program crashes on program startup, without even executing the first statement in main, we must suspect constructors of global objects. Use a run-time debugger to determine if main has been entered; but note that some debuggers allow debugging of constructors before main and others do not. Alternatively, place an output statement as the very first statement in main (even before the first declaration!) to ensure that the problem really is arising before main, rather than from instructions in main. Once a constructor problem has been identified, finding the root cause of the problem is a debugging matter. There are no forms of error particular to constructors, so the problem is something being done by a constructor that is probably some type of other error (e.g. a memory stomp error).
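One startup problem worth checking for is a global constructor that uses another global object before that object has itself been constructed. A common remedy is the "construct on first use" idiom, sketched below with hypothetical names (registry, Registrar, startup_registrar).

```cpp
#include <string>
#include <vector>

// Hypothetical global registry. As a plain global object in one file, it
// could be used by another file's global constructor before it exists.
// A function-local static is constructed the first time the function runs.
std::vector<std::string>& registry()
{
    static std::vector<std::string> names;
    return names;
}

// Hypothetical type whose constructor touches the registry.
struct Registrar {
    explicit Registrar(const std::string& name) { registry().push_back(name); }
};

// This global object is constructed before main is entered, and it is safe
// because registry() builds the vector on demand.
static Registrar startup_registrar("startup");
```

By the time main starts, registry() already contains "startup", regardless of the order in which the global objects in different source files were constructed.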
Program crashes on exit
The program can fail in a few obscure ways at the end of execution. Careful consideration of what actions are taking place at the end of execution is important (e.g., destructors are invoked in C++; any functions registered with atexit will be called). In my experience this failure is most common during the learning phase of C++ programming, when destructor errors are common. Some possible causes are:
- delete operation in object destructor is trashing memory
- destructor in global object calls exit
- main accidentally declared returning non-int (e.g., missing semicolon on class or struct declaration above main)
- setbuf buffer is a non-static local variable of main
- no call to exit, and no return statement in main (a few platforms only)
- file closed twice (e.g., double fclose error)
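A file closed twice can be guarded against by resetting the pointer after the first close. The helper below is a sketch under that assumption; close_once is a hypothetical name.

```cpp
#include <cstdio>

// Closes a stream at most once: the pointer is reset to null after the
// first close, so a second call does nothing.
// Returns true if a close was actually performed.
bool close_once(std::FILE*& stream)
{
    if (stream == nullptr) {
        return false;
    }
    std::fclose(stream);
    stream = nullptr;
    return true;
}
```

This only protects a single pointer variable; if the same FILE* has been copied elsewhere, the copies still dangle after the close.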
Function apparently not invoked
Consider the situation where you are debugging a program, and discover that a particular function seems to be having no effect. You put an output statement at its first statement and no output appears. Why isn't the function being invoked? Some possible causes are:
- No call to the function (!), e.g. you didn't rebuild properly (your fault), or a source code repo issue (someone else's fault) or you're looking in the wrong C++ source file (I've done it many times).
- Control flow or conditional test controlling the call is wrong.
- Missing brackets on function call (null-effect)
- Function is a macro at call location
- Function is a reserved library function name (wrong function is getting called)
- Nested comments deleting call to function
Garbage output
You accidentally started up your main competitor's AI engine? Oh, wait, you're not supposed to be running it.
When a program runs and produces strange output there are a number of possibilities (mostly related to misusing string variables). Note that it is important to distinguish whether the output of a statement is entirely garbage or whether it has a correct prefix (which may indicate a nonterminated string). Some causes are:
- Uninitialized variable
- Constructor not initializing all data members
- Missing argument to printf %s format
- Wrong type argument to printf %s format
- Returning address of automatic local string array
- Stack corruption (local array buffer overrun)
- strncpy leaves string nonterminated
- Pointer variable not initialized
- Address has already been deallocated
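The strncpy item above accounts for much "correct prefix followed by garbage" output. A minimal sketch of a copy that always terminates the destination follows; copy_terminated is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstring>

// Copies src into dest (capacity bytes in size), truncating if needed, and
// always leaves dest null-terminated. strncpy alone writes no terminator
// when src contains capacity or more characters.
void copy_terminated(char* dest, std::size_t capacity, const char* src)
{
    if (capacity == 0) {
        return;
    }
    std::strncpy(dest, src, capacity - 1);
    dest[capacity - 1] = '\0';
}
```

For example, copying "abcdefgh" into a 4-byte buffer leaves "abc" in the buffer, rather than four characters with no terminator.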
Failure on new platform
When a program appears to be running successfully on one machine, it is by no means guaranteed that porting the source code and recompiling on a new machine will not lead to new errors. When a new error is discovered, the first thing that must be tested is whether the same error exists for the same test data on the original machine. The bug might not be a portability problem — it might be an untested case.
However, if the bug appears on one machine but not on another there are a few common causes. The most frequent portability problem is a memory corruption error since these will often lurk undetected on one machine, and appear in the new memory layout of a different environment.
Another common class of portability problems typically arises when porting software from UNIX to DOS, because many DOS compilers have a 16-bit int, whereas UNIX compilers use a 32-bit int. For these errors it is worthwhile to examine all compiler warnings, since the compiler will often identify errors such as an integer constant that is too large to be stored in an int, or an attempt to compare an int with a value outside the range that can be stored in 16 bits.
Other possible causes are different compilation results that may arise when a new compiler uses more aggressive optimization. Hence, code that relies on an undocumented compiler feature (e.g., left to right function argument evaluation) may suddenly fail. Note that this implies that portability errors can arise after a compiler upgrade on the same machine, as well as when moving code to a new machine. Some common causes of portability errors are:
- Memory corruption errors
— array index out of bounds
— modification via wayward pointer
— modification via noninitialized pointer
— NULL pointer dereference
— modification via NULL pointer
— freeing a nonallocated block
— freeing a string constant
- 16-bit short int problem
— arithmetic overflow
— bit-shifting: 1<<16 should probably be 1L<<16
— assuming rand returns 32 bits
— scanf/printf using %d (%ld) on a long (int)
- Function has no return statement
- Order of evaluation error
— operators: a[i]=i++;
— function arguments: fn(i,i++);
— global object construction in separate files
- Special location not declared volatile
- Use of an uninitialized variable
— Constructor not initializing all data members
— new doesn't initialize non-class types
— malloc doesn't initialize any types
- Bit-field is plain int
- Plain char is signed/unsigned
— getc/getchar return value assigned to char
Most of these causes are fairly self-explanatory. However, the appearance of "function has no return statement" in the list may appear surprising; surely this will cause a bug on all implementations? In practice, it has been observed surprisingly frequently that a function that terminates without a return statement might accidentally return the correct value. Typically, this outcome occurs if, by coincidence, a local scalar variable that is intended to be returned happens to be in the hardware register that is used to hold the function return value. Since nothing else is loaded into that register when the function ends without a return statement, the correct result is accidentally returned, and there is no failure until a different compiler or environment is used.
Some compilers have compilation options to change various compiler-dependent features. For example, there may be options to change the default type of plain char and/or plain int bit-fields to signed or unsigned. If it is suspected that this may be the cause of the error, the code can be recompiled with different option settings to confirm this. Any run-time error checking options such as memory allocation debugging and stack overflow checking should also be enabled.
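Two of the portability items in the list above, the 1<<16 shift and getc assigned to char, can be written so that they do not depend on the platform's int width or char signedness. The following is a sketch; bit16 and read_all are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// A shift that does not depend on the width of int: the left operand is
// explicitly 32 bits wide, so the result is 65536 even where int is 16 bits.
std::uint32_t bit16()
{
    return std::uint32_t{1} << 16;
}

// Reads a whole stream. The result of getc is kept in an int so that EOF
// stays distinct from every valid byte, whatever the signedness of char.
std::string read_all(std::FILE* in)
{
    std::string text;
    int ch;                               // int, not char
    while ((ch = std::getc(in)) != EOF) {
        text.push_back(static_cast<char>(ch));
    }
    return text;
}
```

If ch were declared as char, a byte with value 0xFF could compare equal to EOF where char is signed (ending the loop early), and the comparison could never succeed where char is unsigned (an infinite loop).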