Copyright (c) Hyperion Entertainment and contributors.
Troubleshooting Your Software
Many Amiga programming errors have classic symptoms. This guide will help you to eliminate or avoid these problems in your software.
Typical Problems
Audio – Corrupted
The bit data for audio samples must be in Chip RAM. Check your compiler manual for directives or flags that will place your audio sample data in Chip RAM, or dynamically allocate Chip RAM and copy or load the audio sample there.
Character Set Problems
RAWKEY users must be aware that RAWKEY codes can map to different letters or symbols on national keyboards. If you need to use RAWKEY, run the codes through RawKeyConvert() (see the “Intuition Mouse and Keyboard” chapter) to translate them to the correct character codes. Improper display or processing of high-ASCII international characters can be caused by incorrect tolower()/toupper() functions, or by sign extension of character values when they are used in a switch statement or assigned to larger variables. Use unsigned types such as UBYTE (not char) for strings and characters whenever possible. Internationally correct string functions are provided in the 2.0 utility.library.
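The sign-extension problem is easy to reproduce in portable C. In this minimal sketch, UBYTE is a local typedef mirroring the one in <exec/types.h>, the two helper functions are hypothetical, and 0xE4 (ä in the Latin-1 character set the Amiga uses) serves as the sample character.

```c
#include <assert.h>

/* Local stand-in mirroring the UBYTE typedef in <exec/types.h>. */
typedef unsigned char UBYTE;

/* Hypothetical helper: widen a character held in a signed char.
   0xE4 becomes -28, so comparisons, table lookups, and switch cases
   written for the value 228 silently fail. */
int widen_signed(signed char c)
{
    return c;
}

/* Hypothetical helper: widen a character held in a UBYTE.
   0xE4 stays 228, which is what character tables expect. */
int widen_unsigned(UBYTE c)
{
    return c;
}
```

On a compiler with two's-complement signed chars, widen_signed((signed char)0xE4) yields -28 while widen_unsigned(0xE4) yields 228. Passing such a negative value to tolower() or toupper() is also undefined behavior in standard C, which is one way the incorrect case conversion mentioned above can arise.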
CLI Error Message Problems
Improper error messages are caused by calling exit(n) with an invalid or missing return value n. Assembler programmers using startup code should jump to the startup code’s _exit with a valid return value on the stack. Programs without startup code should return with a valid value in D0. Valid return values such as RETURN_OK, RETURN_WARN, RETURN_ERROR, and RETURN_FAIL are defined in <dos/dos.h> and <dos/dos.i>. Other values (-1, for instance) can cause invalid CLI error messages such as “not an object module”. Useful hint: if your program is called from a script, the script can branch on your return value (e.g., call the program, then perform actions based on IF WARN or IF NOT WARN). RETURN_FAIL will stop the script if a normal FAILAT value is in effect.
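As a sketch of the return-value rule, the helper below maps what a program encountered to one of the standard DOS return codes. The RETURN_* values are copied from <dos/dos.h> so the example compiles anywhere; the helper itself and its policy are hypothetical.

```c
#include <assert.h>

/* Values as defined in <dos/dos.h>. */
#define RETURN_OK     0
#define RETURN_WARN   5
#define RETURN_ERROR 10
#define RETURN_FAIL  20

/* Hypothetical policy: report the most serious condition seen, always
   using a standard DOS return code rather than an ad hoc value like -1. */
long dos_return_code(int fatal, int errors, int warnings)
{
    if (fatal)    return RETURN_FAIL;   /* e.g. a required library did not open */
    if (errors)   return RETURN_ERROR;
    if (warnings) return RETURN_WARN;
    return RETURN_OK;
}
```

A C program would then finish with exit((int)dos_return_code(fatal, errors, warnings)), so a calling script can test the result with IF WARN or IF ERROR.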
CLI Won’t Close on RUN
A CLI can’t close if a program has a Lock() on the CLI input or output stream ("*"). If your program is started with “RUN >NIL:” from a CLI, that CLI should be able to close unless your code or your compiler’s startup code explicitly opens "*".
Crashes and Memory Corruption
Memory corruption, address errors, and illegal instruction errors are generally caused by use of an uninitialized, incorrectly initialized, or already freed/closed pointer or memory. You may be using the pointer directly, or it may be one that you placed (or forgot to place) in a structure passed to system calls. Or you may be overwriting one of your arrays, or accidentally modifying or incrementing a pointer later used in a free/close. Be sure to test the return value of every open/allocation type function before using the result, and only close/free things that you successfully opened/allocated. Use watchdog/torture utilities such as Enforcer and MungWall in combination to catch use of uninitialized pointers or freed memory, and other memory misuse problems. Use the debugging tool TNT to get additional debugging information instead of a Software Error requester. You may also be overflowing your stack; your compiler’s stack checking option may be able to catch this. Reduce stack usage by dynamically allocating large structures, buffers, and arrays that are currently defined inside your functions.
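The test-before-use and free-only-what-you-got rules can be captured in a single acquire/release pair. This minimal sketch uses standard malloc()/calloc()/free() as stand-ins so it runs on any host; the structure and function names are hypothetical. The explicit NULL tests are deliberate: standard free() accepts NULL, but do not assume that every Amiga close/free call does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource set. On the Amiga these fields would hold
   library bases, windows, ports, I/O requests, and so on. */
struct AppResources {
    char *name_buf;
    int  *table;
    char *work_buf;
};

/* Release only what was actually obtained, in the reverse order of
   acquisition. Each pointer is cleared, so a second call does nothing. */
void release_resources(struct AppResources *r)
{
    if (r->work_buf) { free(r->work_buf); r->work_buf = NULL; }
    if (r->table)    { free(r->table);    r->table = NULL; }
    if (r->name_buf) { free(r->name_buf); r->name_buf = NULL; }
}

/* Test every result before relying on it. On any failure, undo whatever
   succeeded so far and return 0; return 1 when everything was obtained. */
int acquire_resources(struct AppResources *r)
{
    r->name_buf = NULL;
    r->table = NULL;
    r->work_buf = NULL;

    if ((r->name_buf = malloc(64)) == NULL) goto fail;
    if ((r->table = calloc(32, sizeof *r->table)) == NULL) goto fail;
    if ((r->work_buf = malloc(1024)) == NULL) goto fail;
    return 1;

fail:
    release_resources(r);
    return 0;
}
```

Because release_resources() clears each pointer it frees, calling it twice is harmless, which guards against the double-free crashes described under “Crashes – After Exit”.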
Corruption or crashes can also be caused by passing wrong or missing arguments to a system call (for example, SetAPen(3) or SetAPen(win,3) instead of SetAPen(rp,3)). C programmers should use function prototypes to catch such errors. If you compile with short (16-bit) integers, explicitly type long constants as long (e.g., 42L); otherwise an expression such as 1 << 17 may become zero. If corruption is occurring during exit, use printf() (or kprintf(), etc.) with Delay(n) to slow down your cleanup and report each step. A bad pointer that causes a system crash will often be reported as a standard 680x0 processor exception $0000 0003 or $0000 0004, or less often as a number in the range $0000 0006-$0000 000B. Alternatively, an Amiga-specific alert number may result; see <exec/alerts.h> for these numbers. Also see “Crashes – After Exit” below.
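The short-integer pitfall can be modeled on a modern host, where int is wider than 16 bits, by keeping only the low 16 bits of a result. Strictly speaking, shifting a 16-bit int by 17 is undefined in standard C, so this is a model of the typical outcome rather than a guarantee. The macro and helper are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* With 16-bit ints, a constant that needs more than 16 bits must be
   typed long: 1L << 17 is evaluated in long (at least 32-bit) arithmetic. */
#define BIT17_MASK (1L << 17)

/* Hypothetical model of a 16-bit int: only the low 16 bits of a value
   survive. Applied to 1 << 17, this yields the zero described above. */
uint16_t as_16bit_int(long value)
{
    return (uint16_t)value;
}
```

Here as_16bit_int(1L << 17) is 0, while BIT17_MASK keeps the full value 131072 because the constant is typed long.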
Crashes – After Exit
If the crash only happens when you start your program from Workbench, then you are probably calling UnLock() on one of the WBStartup message wa_Locks, or on the lock returned from an initial CurrentDir() call. If you call CurrentDir(), save the lock returned by the first call, and CurrentDir() back to it before you exit. Only UnLock() locks that you created.
If you are crashing from both Workbench and CLI, and only after exit, then you are probably freeing/closing something twice, freeing/closing something you did not actually allocate/open, or leaving an outstanding device I/O request or other wakeup request. You must abort and WaitIO() any outstanding I/O requests before you free things and exit (see the Autodocs for your device, and for Exec AbortIO() and WaitIO()). Similar problems can be caused by deleting a subtask that might be in a WaitTOF(). Only delete subtasks when you are sure they are in a safe state such as Wait(0L).
Crashes – Only on 68000 and 68010
These crashes can be caused by illegal instructions (80000000.00000004), such as new 68020/30/40 instructions or inline 68881/882 code. More often, however, they are caused by a word or longword access at an odd address. This is legal on the 68020 and above, but generates an Address Error (80000000.00000003) on a 68000 or 68010. Such accesses can come from using uninitialized pointers, using freed memory, or using system structures improperly (for example, referencing into IntuiMessage->IAddress as a struct Gadget * on a non-Gadget message).
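One way an odd-address access can slip in is casting a byte pointer into the middle of a buffer (for example, a file read into memory) to a WORD or LONG pointer. A portable alternative is to assemble the value from individual bytes, as in this minimal sketch. The typedefs mirror <exec/types.h>; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins mirroring <exec/types.h> (ULONG is 32 bits on the Amiga). */
typedef unsigned char  UBYTE;
typedef unsigned short UWORD;
typedef unsigned long  ULONG;

/* Read a big-endian (680x0 byte order) word from any offset using byte
   accesses only, so an odd offset cannot cause an Address Error. */
UWORD get_be_word(const UBYTE *buf, size_t offset)
{
    return (UWORD)((buf[offset] << 8) | buf[offset + 1]);
}

/* The same technique for a longword. */
ULONG get_be_long(const UBYTE *buf, size_t offset)
{
    return ((ULONG)buf[offset] << 24) | ((ULONG)buf[offset + 1] << 16) |
           ((ULONG)buf[offset + 2] << 8) | (ULONG)buf[offset + 3];
}
```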
Crashes – Only on 68040
Because of the instruction pipelining of the 68040, it is very difficult to recover from a bus error. If your program has an “Enforcer hit” (i.e., an illegal reference to memory), the resulting 68040 processor bus error will probably crash the machine. Use Enforcer (on an ’030) to track down your problems, then correct them.
Crashes – Subtasks, Interrupts
If part of your code runs on a different stack or the system stack, you must turn off compiler stack-checking options. If part of your code is called directly by the system or by other tasks, you must use long code/long data or use special compiler flags or options to ensure that the correct base registers are set up for your subtask or interrupt code.
Crashes – Window Related
Be careful not to CloseWindow() a window during a while(msg=GetMsg(...)) loop on that window’s port (the next GetMsg() would use a freed pointer). Also, use ModifyIDCMP(win, NULL) with care, especially if using one port with multiple windows. Be sure to ClearMenuStrip() any menus before closing a window, and do not free items such as dynamically allocated gadgets and menus while they are attached to a window. Do not reference an IntuiMessage’s IAddress field as a structure pointer of any kind before determining that it is one (this depends on the Class of the IntuiMessage). If a crash or problem occurs only when opening a window after extended use of your program, make sure that your program is properly freeing the signals allocated indirectly by CreatePort(), OpenWindow(), or ModifyIDCMP().
Crashes – Workbench Only
If you are crashing near the first DOS call, either your stack is too small or your startup code did not GetMsg() the WBStartup message from the process message port. If your program crashes during execution or during exit only when started from Workbench, and your startup code opens no stdio window or “NIL:” file handles for Workbench programs, make sure you are not writing anything to stdout (printf(), etc.) when started from Workbench (argc==0). See also “Crashes – After Exit”.
Device-related Problems
Device-related problems may be caused by: improperly initialized port or I/O request structures (use CreatePort() and CreateExtIO()); use of an I/O request that is too small (see the device’s include files and Autodocs for the required type of I/O request); re-use of an I/O request before it has returned from the device (use the debugging tool IO_Torture to catch this); failure to abort and wait for an outstanding device request before exiting; or waiting on a signal, port, or message allocated by a different task.
Disk Icon Won’t Go Away
This occurs when a program leaves a Lock() on one or more of a disk’s files or directories. A memory loss of exactly 24 bytes is usually a Lock() that has not been UnLock()ed.
DOS-related Problems
In general, any dos.library function that fills in a structure for you (for example, Examine()) requires that the structure be longword aligned. In most cases, the only way to ensure longword alignment in C is to allocate the structure dynamically. Unless documented otherwise, dos.library functions may only be called from a process, not from a task. Also note that a process’s pr_MsgPort is intended for the exclusive use of dos.library. (The port may be used to receive a WBStartup message, as long as the message is GetMsg()’d from the port before DOS is used.)
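The longword rule is largely a consequence of BCPL pointers (BPTRs), which dos.library uses when handing structures such as a FileInfoBlock to file system handlers: a BPTR stores a byte address divided by four, so an address that is not a multiple of four cannot be represented. This minimal sketch models that with hypothetical helpers patterned on the MKBADDR() and BADDR() macros in <dos/dos.h>; host malloc() stands in for AllocMem(), whose memory is always at least longword aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers modeled on MKBADDR()/BADDR() in <dos/dos.h>:
   a BPTR holds a byte address divided by four. uintptr_t is used so
   the sketch also runs on a 64-bit host. */
uintptr_t make_bptr(const void *p)
{
    return (uintptr_t)p >> 2;   /* the low two bits are discarded */
}

void *bptr_to_address(uintptr_t b)
{
    return (void *)(b << 2);
}

/* True if p lies on a longword (4-byte) boundary. */
int is_long_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

A structure at buf survives the round trip through a BPTR, but one at buf + 2 comes back as buf, so a handler would write into the wrong memory.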
Fails only on 68020/30
The following programming practices can cause this problem: using the upper bytes of addresses as flags; doing signed math on addresses; self-modifying code; using the MOVE SR assembler instruction (use Exec GetCC() instead); software delay loops; and assumptions about the order in which asynchronous tasks will finish. The following differences in the 68020/30 can also cause problems: data and/or instruction caches must be flushed if data or code is changed by DMA or other non-processor modification; the exception stack frame is different; interrupt autovectors may be moved by the VBR; and the 68020/30 CLR instruction does a single write access, unlike the 68000 CLR instruction, which does a separate read and write access (this might affect a read-triggered register in I/O space; use MOVE instead).
Fails only on 68000
The following programming practices can cause this problem: software delay loops; word or longword access of an odd address, which is illegal on the 68000 (under 2.0 this can happen if you reference IntuiMessage->IAddress as a structure pointer without first determining that the IntuiMessage’s Class is defined as having a structure pointer in its IAddress); use of the assembler CLR instruction on a hardware register that is triggered by any access (the 68000 CLR instruction performs two accesses, a read and a write, while the 68020/30 CLR does a single write access; use MOVE instead); assumptions about the order in which asynchronous tasks will finish; and use of compiler flags that generate inline 68881/68882 math coprocessor instructions or 68020/30-specific code.
Fails only on Older ROMs or Older WB
This can be caused by asking for a library version higher than you need (do not use the #define LIBRARY_VERSION when compiling!). It can also be caused by calling functions or using structures that do not exist in the older version of the operating system. Ask for the lowest version that provides the functions you need (usually 33), and exit gracefully and informatively if OpenLibrary() fails (returns NULL). Alternatively, code conditionally so that new functions and structures are used only if the opened library base’s lib_Version field shows that they are supported.
Fails only on Newer ROMs or Newer WB
This should not happen with proper programming. Possible causes include: running too close to your stack limits or the memory limits of a base machine (newer versions of the operating system may use slightly more stack in system calls, and usually use more free memory); using system functions improperly; not testing function return values; improper register or condition code handling in assembler code (remember that the result, if any, is returned in D0, and that condition codes and D1/A0/A1 are undefined after a system call); using improperly initialized pointers; trashing memory; assuming something (such as a flag) is B if it is not A; failing to initialize formerly reserved structure fields to zero; violating Amiga programming guidelines (for example, depending on or poking private system structures, jumping into ROM, or depending on undocumented or unsupported behaviors); and failing to read the function Autodocs.
See Appendix E, “Release 2 Compatibility”, for more information on 2.0 compatibility problem areas.
Fails only on Chip-RAM-Only Machines
This is caused by specifically asking for or requiring MEMF_FAST memory. If you don’t need Chip RAM, ask for memory type 0L, MEMF_CLEAR, or MEMF_PUBLIC|MEMF_CLEAR as applicable. If Fast memory is available, you will be given Fast memory; if not, you will get Chip RAM. The problem may also be caused by trackdisk-level loading of code or data over important system memory or structures, which may reside in low Chip memory on a Chip-RAM-only machine.
Fails only on Machines with Fast Memory
Data and buffers that will be accessed directly by the custom chips must be in Chip RAM. This includes bitplanes (use OpenScreen() or AllocRaster()), audio samples, trackdisk buffers, and the graphic image data for sprites, pointers, bobs, images, gadgets, etc. Use compiler or linker flags to force Chip RAM loading of any initialized data that needs to be in Chip RAM, or dynamically allocate Chip RAM and copy any initialization data there.
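The two Chip RAM rules above reduce to a simple policy for choosing AllocMem() requirements. This sketch copies the attribute values from <exec/memory.h>; the mem_requirements() helper is hypothetical. MEMF_ANY is defined in the 2.0 includes; under 1.3, use 0L directly.

```c
#include <assert.h>

/* Attribute values as defined in <exec/memory.h>. */
#define MEMF_ANY    0UL
#define MEMF_PUBLIC (1UL << 0)
#define MEMF_CHIP   (1UL << 1)
#define MEMF_FAST   (1UL << 2)
#define MEMF_CLEAR  (1UL << 16)

/* Hypothetical policy helper: request MEMF_CHIP only for data the custom
   chips read directly (bitplanes, audio, sprite/image data, trackdisk
   buffers). MEMF_FAST is never requested; with MEMF_ANY, exec supplies
   Fast memory when present and Chip RAM otherwise. MEMF_PUBLIC is added
   for memory shared with other tasks or the system. */
unsigned long mem_requirements(int chip_access, int shared, int clear)
{
    unsigned long req = chip_access ? MEMF_CHIP : MEMF_ANY;

    if (shared) req |= MEMF_PUBLIC;
    if (clear)  req |= MEMF_CLEAR;
    return req;
}
```

For example, an audio sample buffer would be requested with AllocMem(size, mem_requirements(1, 0, 1)), and a general work buffer with mem_requirements(0, 0, 1).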
Fails only with Enhanced Chips
This is usually caused by writing or reading addresses past the end of the older custom chips’ register sets, writing something other than 0 (zero) to bits that are undefined in older chip registers, or failing to mask out undefined bits when interpreting a value read from a chip register. Note that system copper lists are different under 2.0 when ECS chips are present. See “Fails only on Chip-RAM-Only Machines”.
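Masking on read and writing undefined bits as zero can be sketched as follows. The register layout here is entirely hypothetical (bits 15 and 0 defined, bits 1-14 undefined) and stands in for whatever the hardware documentation specifies for a real register.

```c
#include <assert.h>

/* Hypothetical 16-bit register layout for illustration only:
   bits 15 and 0 are defined; bits 1-14 are undefined on older chips
   and may read back as nonzero on newer ones. */
#define REG_BIT15   0x8000u
#define REG_BIT0    0x0001u
#define REG_DEFINED (REG_BIT15 | REG_BIT0)

/* Interpret a value read from the register: mask off undefined bits
   before comparing or testing. */
unsigned defined_bits(unsigned raw)
{
    return raw & REG_DEFINED;
}

/* Build a value to write: undefined bits are always written as zero. */
unsigned value_to_write(unsigned wanted)
{
    return wanted & REG_DEFINED;
}
```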
Fireworks
A dazzling pyrotechnic video display is caused by trashing or freeing a copper list that is in use, or trashing the pointers to the copper list. If you aren’t messing with copper lists, see “Crashes and Memory Corruption”.
Graphics – Corrupted Images
The bit data for graphic images such as sprites, pointers, bobs, and gadgets must be in Chip RAM. Check your compiler manual for directives or flags that will place your graphic image data in Chip RAM, or dynamically allocate Chip RAM and copy the data there.
Hangs – One Program Only
Program hangs are generally caused by Wait()ing on the wrong signal bits, on the wrong port, on the wrong message, or on some other event that will never occur. This can happen if the event you are waiting for never arrives, or if one task tries to Wait(), WaitPort(), or WaitIO() on a signal, port, or window that was created by a different task. Both WaitIO() and WaitPort() can call Wait(), and you cannot Wait() on another task’s signals. Hangs can also be caused by verify deadlocks. Be sure to turn off all Intuition verify messages (such as MENUVERIFY) before calling AutoRequest() or doing disk access.
Hangs – Whole System
This is generally caused by a Disable() without a corresponding Enable(). It can also be caused by memory corruption, especially corruption of low memory. See “Crashes and Memory Corruption”.
Memory Loss
First determine that your program is actually causing a memory loss. It is important to boot with a standard Workbench, because a number of third-party items, such as some background utilities, shells, and network handlers, dynamically allocate and free pieces of memory. Open a Shell for memory checking, and a Shell or Workbench drawer for starting your program. Arrange windows so that all are accessible, and so that no window rearrangement will be needed to run your program.
In the Shell, type Avail FLUSH<RET> several times (2.0 option). This will flush all non-open disk-loaded fonts, devices, etc., from memory. Note the amount of free memory. Now without rearranging any windows, start your program and use all of your program features. Exit your program, wait a few seconds, then type Avail FLUSH<RET> several times. Note the amount of free memory. If this matches the first value you noted, your program is fine, and is not causing a memory loss.
If memory was actually lost, and your program can be run from CLI or Workbench, then try the above procedure with both methods of starting your program. Note that under 2.0, there will be a slight permanent (until reboot) memory usage of about 672 bytes when the audio.device or narrator.device is first opened. See “Memory Loss – CLI Only” and “Memory Loss – WB Only” if appropriate. If you lose memory from both WB and CLI, then check all of the open/alloc/get/create/lock type calls in your code, and make sure that there is a matching close/free/delete/unlock type call for each of them (note – there are a few system calls that have or require no corresponding free – check the Autodocs). Generally, the close/free/delete/unlock calls should be in opposite order of the allocations.
If you are losing a fixed small amount of memory, look for a structure of that size in the Structure Offsets listing in the Amiga ROM Kernel Reference Manual: Includes and Autodocs. For example, a loss of exactly 24 bytes is probably a Lock() which has not been UnLocked(). If you are using ScrollRaster(), be aware that ScrollRaster() left or right in a Superbitmap window with no TmpRas will lose memory under 1.3 (workaround – attach a TmpRas). If you lose much more memory when started from Workbench, make sure your program is not using Exit(n). This would bypass startup code cleanups and prevent a Workbench-loaded program from being unloaded. Use exit(n) instead.
Memory Loss – CLI Only
Make sure you are testing in a standard environment. Some third-party shells dynamically allocate history buffers or cause other memory fluctuations. Also, if your program executes different code when started from CLI, check that code and its cleanup. And check your startup.asm if you wrote your own.
Memory Loss – Ctrl-C Exit Only
You have Amiga-specific resources opened or allocated, and you have not disabled your compiler’s automatic Ctrl-C handling, which causes all of your program’s cleanup to be skipped. Disable the compiler’s Ctrl-C handling and handle Ctrl-C (SIGBREAKF_CTRL_C) yourself.
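A minimal sketch of handling Ctrl-C yourself: check for the break signal between units of work and return normally, so that the caller’s cleanup still runs. So that the sketch runs on any host, the task’s received signals are modeled by a plain variable; on the Amiga, the test-and-clear would typically be SetSignal(0L, SIGBREAKF_CTRL_C) & SIGBREAKF_CTRL_C. The SIGBREAKF_CTRL_C value is copied from <dos/dos.h>; run_until_break() is hypothetical.

```c
#include <assert.h>

typedef unsigned long ULONG;  /* mirrors <exec/types.h> */

/* Value as defined in <dos/dos.h>. */
#define SIGBREAKF_CTRL_C (1UL << 12)

/* Hypothetical work loop. *pending models the task's received signals.
   The break signal is checked between work items and consumed, and the
   function returns normally so the caller can release its resources.
   Returns 1 if interrupted, 0 if all items were processed. */
int run_until_break(int items, ULONG *pending, int *done)
{
    *done = 0;
    while (*done < items) {
        if (*pending & SIGBREAKF_CTRL_C) {
            *pending &= ~SIGBREAKF_CTRL_C;  /* test-and-clear the break signal */
            return 1;
        }
        (*done)++;                          /* one unit of real work */
    }
    return 0;
}
```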
Memory Loss – During Execution
A continuing memory loss during execution can be caused by failure to keep up with voluminous IDCMP messages such as MOUSEMOVE messages. Intuition cannot re-use IDCMP message blocks until you ReplyMsg() them. If your window’s allotted message blocks are all in use, new sets will be allocated and not freed until the window is closed. Continuing memory losses can also be caused by a program loop containing an allocation-type call without a corresponding free.
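Keeping up with a burst of IDCMP messages means draining the port completely each time it is signaled and replying to each message promptly. In this host-runnable sketch, an array stands in for the window’s message port and a counter stands in for ReplyMsg(); struct Msg is a hypothetical, simplified IntuiMessage, while the IDCMP class values are taken from <intuition/intuition.h>.

```c
#include <assert.h>

/* IDCMP class values as defined in <intuition/intuition.h>. */
#define IDCMP_MOUSEMOVE   0x00000010UL
#define IDCMP_CLOSEWINDOW 0x00000200UL

/* Hypothetical, simplified stand-in for struct IntuiMessage. */
struct Msg {
    unsigned long Class;
    short MouseX, MouseY;
};

struct InputState {
    short last_x, last_y;
    int close_requested;
    int replied;            /* counts the stand-in for ReplyMsg() */
};

/* Drain every queued message: copy the needed fields, reply at once so
   Intuition can reuse the block, then act on the copy. A burst of
   MOUSEMOVE messages collapses into a single final position. */
void drain_messages(const struct Msg *queue, int count, struct InputState *st)
{
    int i;

    for (i = 0; i < count; i++) {           /* models while (msg = GetMsg(port)) */
        unsigned long cls = queue[i].Class;
        short x = queue[i].MouseX;
        short y = queue[i].MouseY;

        st->replied++;                      /* models ReplyMsg(msg) */

        if (cls == IDCMP_MOUSEMOVE) {
            st->last_x = x;
            st->last_y = y;
        } else if (cls == IDCMP_CLOSEWINDOW) {
            st->close_requested = 1;        /* act on this after the loop */
        }
    }
}
```

Closing the window should wait until the loop has finished, since closing it inside the loop would leave the next GetMsg() working on a freed port.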
Memory Loss – WB Only
Commonly, this is caused by a failure of your code to unload after you exit. Make sure that your code is linked with a standard, correct startup module, and do not use the Exit(n) function to exit your program. This function bypasses your startup code’s cleanup, including its ReplyMsg() of the WBStartup message (which signals Workbench to unload your program from memory). You should exit either via exit(n), where n is a valid DOS return code such as RETURN_OK (<dos/dos.h>), or via the final “}” or a return from main(). Assembler programmers using startup code can JMP to _exit with a long return value on the stack, or use the RTS instruction.
Menu Problems
A flickering menu is caused by leaving a pixel or more of space between menu subitems when designing your menu. Crashing after browsing a menu (looking at a menu without selecting any items) is caused by not properly handling MENUNULL select messages. Multiple selection not working is caused by not handling NextSelect properly. See the “Intuition Menus” chapter.
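Correct MENUPICK handling covers both of the last two problems: a code of MENUNULL means nothing was selected, and multiple selections are chained through each item’s NextSelect field. In this sketch, MENUNULL and the MENUNUM()/ITEMNUM()/SUBNUM() macros are equivalent to those in <intuition/intuition.h>, while struct ItemModel and the fixed-size array are hypothetical stand-ins for the menu strip, with array indexing standing in for ItemAddress().

```c
#include <assert.h>

typedef unsigned short UWORD;  /* mirrors <exec/types.h> */

/* Equivalent to the definitions in <intuition/intuition.h>. */
#define MENUNULL   0xFFFF
#define MENUNUM(n) ((n) & 0x1F)
#define ITEMNUM(n) (((n) >> 5) & 0x3F)
#define SUBNUM(n)  (((n) >> 11) & 0x1F)

/* Size of this small, hypothetical menu strip model. */
#define MAX_MENUS 4
#define MAX_ITEMS 8

/* Hypothetical stand-in for struct MenuItem: only NextSelect is modeled. */
struct ItemModel {
    UWORD NextSelect;
};

/* Handle one MENUPICK message. MENUNULL (menu browsed, nothing chosen)
   falls straight through; otherwise each selection is processed and the
   next one is found through NextSelect. On the Amiga, the item would be
   located with ItemAddress(menuStrip, code). Returns the number handled. */
int handle_menupick(UWORD code, struct ItemModel strip[MAX_MENUS][MAX_ITEMS])
{
    int handled = 0;

    while (code != MENUNULL) {
        struct ItemModel *item;

        if (MENUNUM(code) >= MAX_MENUS || ITEMNUM(code) >= MAX_ITEMS)
            break;  /* outside this fixed-size model */

        item = &strip[MENUNUM(code)][ITEMNUM(code)];
        handled++;  /* act on MENUNUM(), ITEMNUM(), SUBNUM() of code here */
        code = item->NextSelect;
    }
    return handled;
}
```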
Out of Sync Response to Input
This is caused by failing to handle all received signals or all possible messages after a Wait() or WaitPort() call. More than one event or message may have caused your program to awaken. Check the signals returned by Wait() and act on every one that is set. At ports that may have more than one message (for instance, a window’s IDCMP port), you must handle the messages in a while(msg=GetMsg(...)) loop.
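The rule of acting on every signal bit can be sketched as below. The signal masks would normally be built from each port’s signal bit (1L << port->mp_SigBit); here they are plain parameters so the example runs on any host. SIGBREAKF_CTRL_C is copied from <dos/dos.h>; the structure and function are hypothetical.

```c
#include <assert.h>

typedef unsigned long ULONG;  /* mirrors <exec/types.h> */

#define SIGBREAKF_CTRL_C (1UL << 12)  /* value from <dos/dos.h> */

/* Hypothetical tally of the events handled. */
struct EventCounts {
    int window;
    int timer;
    int ctrl_c;
};

/* Act on EVERY bit set in the mask returned by Wait(). Chaining these
   tests with "else if" would drop events that arrived together. */
void dispatch_signals(ULONG received, ULONG window_sig, ULONG timer_sig,
                      struct EventCounts *ev)
{
    if (received & window_sig)       ev->window++;  /* then drain the port with a GetMsg() loop */
    if (received & timer_sig)        ev->timer++;
    if (received & SIGBREAKF_CTRL_C) ev->ctrl_c++;
}
```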
Performance Loss in Other Processes
This is often caused by one program doing one or more of the following: busy waiting or polling; running at a higher priority; or doing lengthy Forbid(), Disable(), or interrupt processing.
Performance Loss – On A3000
If your program has “Enforcer hits” (i.e., illegal references to memory caused by improperly initialized pointers), these will cause bus errors. The A3000 bus error handler contains a built-in delay to let the bus settle, so many Enforcer hits can slow your program down substantially.
Trackdisk Data Not Transferred
Make sure your trackdisk buffers are in Chip RAM under 1.3 and lower versions of the operating system.
Windows – Borders Flicker after Resize
Set the NOCAREREFRESH flag. Even SMART_REFRESH windows may generate refresh events if there is a sizing gadget. If you don’t have specific code to handle this, you must set the NOCAREREFRESH flag. If you do have refresh code, be sure to use the BeginRefresh()/EndRefresh() calls. Failure to do one or the other will leave Intuition in an intermediate state and slow down operation for all windows on the screen.
Windows – Visual Problems
Many visual problems in windows can be caused by improper font specification or improper setting of gadget flags. See Appendix E, “Release 2 Compatibility”, for detailed information on common problems.
General Debugging Techniques
Use methodical testing procedures, and debugging messages if necessary, to locate the problem area. Low level code can be debugged using kprintf() serial (or dprintf() parallel) messages. Check the initial values, allocation, use, and freeing of all pointers and structures used in the problem area. Check that all of your system and internal function calls pass correct initialized arguments, and that all possible error returns are checked for and handled.
If errors cannot be found, simplify your code to the smallest possible example that still functions. Often you will find that this smallest example will not have the problem. If so, add back the other features of your code until the problem reappears, then debug that section.
A variety of debugging tools are available to help locate faulty code. These include source-level and other debuggers, crash interceptors, and vital watchdog and memory invalidation tools such as Enforcer and MungWall.
A Final Word About Testing
Test your program with memory watchdog and invalidation tools on a wide variety of systems and configurations. Programs with coding errors may appear to work properly on one or more configurations, but may fail or cause fatal problems on another. Make sure that your code is tested on both a 68000 and a 68020/30, on machines with and without Fast RAM, and on machines with and without enhanced chips. Test all of your program functions on every machine.
Test all error and abort code. A program with missing error checks or unsafe cleanup might work fine when all of the items it opens or allocates are available, but may fail fatally when an error or problem is encountered. Try your code with missing files, filenames with spaces, incorrect filenames, cancelled requesters, Ctrl-C, missing libraries or devices, low memory, missing hardware, etc.
Test all of your text input functions with high-ASCII characters (such as the character produced by pressing Alt-F then “A”). Note that RAWKEY codes can be different keyboard characters on national keyboards (higher levels of keyboard input are automatically translated to the proper characters). If your program will be distributed internationally, support and take advantage of the additional screen lines available on a PAL system. Enhanced Agnus chip machines may be switched to be PAL or NTSC via motherboard jumper J102 in A2000s and jumper J200 in A3000s. Note that a base PAL machine will have less memory free due to the larger display size.
Write good code. Test it. Then make it great.