Copyright (c) Hyperion Entertainment and contributors.

Expat Library


Introduction

Expat is a fast and resource-efficient XML parser written by James Clark. On AmigaOS it is implemented in various flavours (static link library / shared object / shared Amiga library). This documentation specifically focuses on, and provides code examples for, the shared Amiga library version.

XML Parsing Basics

As far as XML document parsing is concerned, there are basically two kinds of parsers:

  • Tree-based parsers, which process the entire XML file and build a tree structure representing the elements and other constructs in the document. An example of a tree-based parser is libxml2, which is also available for AmigaOS.
  • Stream-oriented (event-driven) parsers, which process the XML file as a continuous data stream and produce an event each time the parser encounters an XML construct. Expat is an example of an event-driven parser.

Tree-based parsers are really comfortable to work with: the parser reconstructs the entire document structure and contents for you. You are also provided with functions to search in the document, find data, add or modify the contents etc. Event-driven parsers are, on the other hand, much more basic. They require setup and generally more work on the part of the programmer.

However, tree-based parsers are rather taxing resource-wise. Parsing the document takes longer and uses up a considerable amount of memory. Implementations also tend to be bulky: for example, the current AmigaOS static library implementation of libxml2 is bigger than 5MB – which is a preposterous file size overhead added to your program only to provide it with a parser! Event-driven parsers may offer fewer bells and whistles but they are much smaller (about 300KB in all AmigaOS Expat implementations, static or shared) and faster. In order to keep the spirit of Amiga software, you'll quite naturally want to use Expat as a lightweight and efficient parser, preferably in its shared Amiga library incarnation.

Another point in favour of event-driven parsers is that when working with XML files, reconstructing the complete document tree structure is not always necessary. Quite often you're just interested in particular data that is stored in particular elements (for instance, you may only want to mine text data from the <p> elements of an XHTML document). A parser like Expat can then be used to process (react upon) only the events that concern the parts of the document you're interested in. But even if you do need the entire tree structure, for whatever purpose or merely for the comfort, there is no reason to give up on Expat. As tree-based parsers are typically built on top of event-driven parsers, you can use Expat to build your own XML data representation. It means work but you can tailor the procedure to your own needs, providing a perhaps less sophisticated but still adequate representation, without the extra overhead libxml2 would incur.
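To make the idea of building your own representation more concrete, here is a minimal sketch of two element handlers that assemble a bare element tree (names only, no attributes or text). It is an illustration under stated assumptions rather than a recommended design: struct Node, struct TreeBuilder, tree_start_handler(), tree_end_handler() and tree_free() are hypothetical names, the XML_Char typedef is only a stand-in so that the snippet compiles on its own, and the standard C allocator is used for brevity. The handler signatures themselves are explained in the Handler Functions section below.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in so the sketch compiles on its own (assumption); in a real
   program XML_Char comes from <proto/expat.h>. */
typedef char XML_Char;

/* Hypothetical tree node: one per XML element. */
struct Node
{
    char        *name;
    struct Node *parent;
    struct Node *firstChild;
    struct Node *lastChild;
    struct Node *nextSibling;
};

/* Hypothetical user data: the tree root and the element currently open. */
struct TreeBuilder
{
    struct Node *root;
    struct Node *current;
    int          failed;    /* set when an allocation fails */
};

/* Start tag: create a node and append it to the element currently open. */
void tree_start_handler(void *userData, const XML_Char *name, const XML_Char **attrs)
{
    struct TreeBuilder *tb = userData;
    struct Node *node;
    size_t size = strlen(name) + 1;

    (void)attrs;                        /* attributes are ignored in this sketch */
    if (tb->failed) return;

    node = calloc(1, sizeof(*node));
    if (node) node->name = malloc(size);
    if (!node || !node->name)
    {
        free(node);
        tb->failed = 1;
        return;
    }
    memcpy(node->name, name, size);     /* keep our own copy of the name */

    node->parent = tb->current;
    if (!tb->current)
        tb->root = node;                            /* document element */
    else if (tb->current->lastChild)
        tb->current->lastChild->nextSibling = node; /* append to siblings */
    else
        tb->current->firstChild = node;             /* first child */

    if (tb->current) tb->current->lastChild = node;
    tb->current = node;                 /* descend into the new element */
}

/* End tag: climb back to the parent element. */
void tree_end_handler(void *userData, const XML_Char *name)
{
    struct TreeBuilder *tb = userData;

    (void)name;
    if (!tb->failed && tb->current) tb->current = tb->current->parent;
}

/* Free a node, its children and the siblings that follow it. */
void tree_free(struct Node *node)
{
    while (node)
    {
        struct Node *next = node->nextSibling;

        tree_free(node->firstChild);
        free(node->name);
        free(node);
        node = next;
    }
}
```

Once parsing has finished, the tree hangs off the root field of the builder structure and can be released with tree_free(). A real application would probably also store attributes and text, which is exactly where you can trim the representation down to what you actually need.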

How To Use The Library

Whatever your particular aim and purpose, using the AmigaOS Expat Library entails at least the following steps (they will all be discussed in more detail further on):

  1. Open the library and obtain its interface (see Library Opening Chores below).
  2. Create handlers to deal with the particular types of XML data.
  3. Create a parser instance.
  4. Configure the parser so that it knows of your handlers.
  5. Open the XML file for reading, read data from the XML file into a buffer and call the parsing function (see Parsing XML Files below).

Reading from the file is usually done in a loop. The file data is continuously fed into a fixed-size memory buffer and parsed, until the end of the document is reached. As the document is read in pieces, parsing can start before you have the whole document (unlike with tree-based parsers). This also allows parsing really big documents that won't fit into memory.

Whenever the parser encounters an element's start tag, it will call your start element handler function. Whenever it encounters an element's end tag, it invokes your end element handler function. Whenever text is found, the parser will call your character-data handler function. And so on.

As you can see, the parser itself doesn't do much: it just processes the XML file and calls the respective handler functions to react upon the individual events. All the grunt work is done in the handlers, outside of the library. Programmers design the handler functions to suit their particular needs, and process (or ignore) the received data as they find appropriate.

It should also be noted that the Expat Library provides just the parser, so it cannot be used for writing XML files. Neither can it be used for validating XML documents against a DTD.

Installation

If the library file is not installed in your LIBS: drawer, and/or your SDK is missing the necessary includes, download the Expat package from OS4depot.net and copy the files to their relevant directories:

  • the entire contents of the SDK directory go to SDK:
  • the "Workbench/Libs/expat.library" file goes to LIBS:
  • the "Workbench/SObjs/libexpat.so" file goes to SOBJS: (should you need the shared object version as well).

No further setup is required. To use the shared Amiga library, you must include the following header file at the beginning of your code:

#include <proto/expat.h>

Library Opening Chores

Just like other AmigaOS libraries, the Expat Library must be opened and its interface obtained before you can use it:

struct Library    *ExpatBase = NULL;
struct ExpatIFace *IExpat = NULL;
 
if ( (ExpatBase = IExec->OpenLibrary("expat.library", 53)) )
{
   IExpat = (struct ExpatIFace *) IExec->GetInterface(ExpatBase, "main", 1, NULL);
}
 
if ( !ExpatBase || !IExpat )
{
   /* handle library opening error */
}

Handler Functions

Handler functions are custom callback functions that get invoked automatically whenever the parser encounters a specific XML construct in the stream. Such a situation is called an event. Obvious events include XML elements and character data (handlers for these are needed for any XML parsing) but there are, of course, others: comments, XML document declarations, namespace declarations, CDATA sections, unparsed entities (NDATA) etc. For each type of event you are interested in, a handler function must be written and registered with the parser. How you write the handlers entirely depends on what you intend to do.

Whereas handler function names are arbitrary (you can choose any name), the parameter list must follow a precise definition, which is different for each type of handler (see the "libraries/expat.h" include file for the definitions).

You can pass user data to and between handlers, be it a simple variable or a complex data structure. This is commonly used for tracking the process (parsing state), storing intermediate values, or passing global data to handlers. Never use global variables for that!

Element Handlers

In XML, an element is enclosed between a start tag and an end tag. They are reported as separate events, so you need to provide two separate functions to handle them:

The Start Element Handler

This handler function is invoked when the parser encounters an element's start tag.

void start_handler(void *userData, const XML_Char *name, const XML_Char **attrs)
{
}
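As an illustration of what a start element handler might do, the sketch below looks up an attribute by name. In Expat, the attrs parameter is an array of alternating attribute names and values, terminated by a NULL pointer. Everything else here is an assumption made for the example: find_attribute(), struct ItemProbe and item_start_handler() are hypothetical names, the "item" element and its "id" attribute are invented, and the XML_Char typedef is only a stand-in so that the snippet compiles on its own.

```c
#include <string.h>

/* Stand-in so the sketch compiles on its own (assumption); in a real
   program XML_Char comes from <proto/expat.h>. */
typedef char XML_Char;

/* Hypothetical helper: return the value of the named attribute, or NULL.
   attrs holds name/value pairs and ends with a NULL pointer. */
static const XML_Char *find_attribute(const XML_Char **attrs, const XML_Char *wanted)
{
    int i;

    for (i = 0; attrs && attrs[i]; i += 2)
    {
        if (strcmp(attrs[i], wanted) == 0)
            return attrs[i + 1];
    }
    return NULL;
}

/* Hypothetical user data: remembers the "id" of the last <item> seen. */
struct ItemProbe
{
    char lastId[32];
};

void item_start_handler(void *userData, const XML_Char *name, const XML_Char **attrs)
{
    struct ItemProbe *probe = userData;
    const XML_Char *id;

    if (strcmp(name, "item") != 0)
        return;                         /* not an element we care about */

    id = find_attribute(attrs, "id");
    if (id)
    {
        /* Copy the value instead of keeping the pointer: the strings
           should only be relied upon for the duration of the call. */
        strncpy(probe->lastId, id, sizeof(probe->lastId) - 1);
        probe->lastId[sizeof(probe->lastId) - 1] = '\0';
    }
}
```

Values longer than the buffer are silently truncated in this sketch; a real handler would probably size the buffer to the data it expects or allocate the copy dynamically.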

The End Element Handler

This handler function is invoked when the parser encounters an element's end tag.

void end_handler(void *userData, const XML_Char *name)
{
}
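Because start and end tags arrive as separate events, a matching pair of handlers is a natural place to keep track of nesting depth. The following is a minimal sketch of that idea; struct DepthData and the two handler names are hypothetical, and the XML_Char typedef is only a stand-in so that the snippet compiles on its own.

```c
/* Stand-in so the sketch compiles on its own (assumption); in a real
   program XML_Char comes from <proto/expat.h>. */
typedef char XML_Char;

/* Hypothetical user data for depth tracking. */
struct DepthData
{
    unsigned int depth;     /* number of elements currently open */
    unsigned int maxDepth;  /* deepest nesting level seen so far */
};

/* Start tag: we are one level deeper. */
void depth_start_handler(void *userData, const XML_Char *name, const XML_Char **attrs)
{
    struct DepthData *data = userData;

    (void)name;
    (void)attrs;

    data->depth++;
    if (data->depth > data->maxDepth)
        data->maxDepth = data->depth;
}

/* End tag: we are back at the parent's level. */
void depth_end_handler(void *userData, const XML_Char *name)
{
    struct DepthData *data = userData;

    (void)name;

    if (data->depth > 0)
        data->depth--;
}
```

After a well-formed document has been parsed completely, depth is back at zero and maxDepth holds the deepest nesting level encountered.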

The Character-Data Handler

This handler function is invoked when the parser encounters character data (i.e. a text string that is enclosed within an element but is not itself a tag).

Please note that Expat may produce several events (and thus call the character-data handler several times successively) before it processes all text within a single element. Such a situation arises, for example, when the text contains character entities (a character entity will always produce a separate event), or when the parser reaches the end of the buffer and needs to reload it with new data. You can never assume that element text will be processed in one go, and your code must be ready to handle it!

void chardata_handler(void *userData, const XML_Char *string, int length)
{
}
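One way to cope with text arriving in several pieces is to append every piece to a buffer kept in the user data and only use the result once the element's end tag has been reported. Note that the string passed to the handler is not NUL-terminated, which is why the length parameter exists. The sketch below assumes a fixed-size buffer and simply truncates text that does not fit; struct TextData, TEXT_CAPACITY and collect_chardata() are hypothetical names, and the XML_Char typedef is only a stand-in so that the snippet compiles on its own.

```c
#include <string.h>

/* Stand-in so the sketch compiles on its own (assumption); in a real
   program XML_Char comes from <proto/expat.h>. */
typedef char XML_Char;

#define TEXT_CAPACITY 256

/* Hypothetical user data: text collected for the current element. */
struct TextData
{
    char   text[TEXT_CAPACITY];
    size_t used;                /* characters stored, excluding the NUL */
};

void collect_chardata(void *userData, const XML_Char *string, int length)
{
    struct TextData *data = userData;
    size_t room  = sizeof(data->text) - 1 - data->used;
    size_t count;

    if (length <= 0)
        return;

    count = (size_t)length;
    if (count > room)
        count = room;           /* truncate what does not fit */

    /* The incoming string is not NUL-terminated, so copy exactly
       'count' characters and terminate the buffer ourselves. */
    memcpy(data->text + data->used, string, count);
    data->used += count;
    data->text[data->used] = '\0';
}
```

In this scheme the start element handler would typically reset used to zero, and the end element handler would consume the accumulated text.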

Parsing

Creating The Parser

When you have written all handlers you'll need for your particular application, it's time to create the parser (or a parser instance as we say, because more than one can be used). It takes a single function call:

XML_Parser parser = NULL;
 
parser = IExpat->XML_ParserCreate(NULL);
 
if (!parser)
{
   /* Error: the parser instance couldn't be created for some reason. */
}

The function creates a simple XML parser with no namespace support (please refer to the Namespace Processing section in the Advanced Use chapter below if you need to handle namespaces). The function's only parameter is the name of the character encoding that is used in the document you want to parse. If the parameter is NULL (as in the example above), the parser assumes that the document uses one of the built-in encodings:

  • US-ASCII
  • UTF-8
  • UTF-16
  • ISO-8859-1

Supplying a non-NULL parameter for the encoding (for example, "UTF-8") will override the XML document encoding declaration. This is rarely used – normally the parameter is NULL.

Please note that if you specify a different encoding (for example, "ISO-8859-2", which is used in many Central/Eastern European countries), it will not make Expat support it! The only supported encodings are the four built-in ones above. Refer to the Unknown Encodings section in the Advanced Use chapter below to learn how to handle documents with a non-standard character encoding.

Parser Configuration

Before you can start parsing, the parser instance must be properly configured. This typically entails registering all handlers (so that the parser knows which particular function to call when it encounters an event) and preparing user data we want to carry around.

Handler Registration

A handler function must first be registered if you want Expat to use it during the parsing process. Separate registration functions exist for each handler type. The functions may be called anytime, thus allowing the programmer to modify the handler configuration along the way. However, in most cases you'll probably want to do all registrations just before you send Expat on the job.

The following piece of code will register the handlers we've written above (the two element handlers and the character-data handler) with the parser we've created:

IExpat->XML_SetElementHandler(parser, start_handler, end_handler);
IExpat->XML_SetCharacterDataHandler(parser, chardata_handler);

User Data Setting

When using an event-driven parser you'll need to monitor the whole process yourself – keep track of depth, remember the name of the parent element, etc. You'll also need to pass values and pointers to and between handlers; for example, the character-data handler usually needs access to a global memory buffer to copy text into. In other words, you'll need to carry data around. As global variables are a no-no, Expat can take arbitrary user data and pass it to the individual handlers invoked along the parsing process. This data can then be accessed from within the handlers via the userData pointer that is provided when the handler function gets called.

User data normally comprises more than a single value, so it is recommended to create a dedicated custom structure such as struct MyUserData in the example code snippet below. The parser must then receive the address of the data structure (or any other kind of user data), which is done using the function XML_SetUserData():

/* Define the user data structure. */
struct MyUserData
{
  uint32  depth;         // to monitor parsing depth
  STRPTR  parElement;    // to keep the name of the parent element
  char   *buffer;        // to provide access to the text buffer
};
 
/* Declare the structure we'll be using for parsing. */
struct MyUserData myUserData;
 
/* 
    Here you would initialize the user data
    to default values.
 */
 
/* Finally, provide the user data address. */
IExpat->XML_SetUserData(parser, &myUserData);

Parsing XML Files

Parsing an XML document pretty much boils down to reading the file data into a memory buffer and calling XML_Parse() upon it. If you know that you'll only be working with small files, you can of course allocate a buffer the size of the XML file and then read the entire document into it. However, remember that your software may run on a system with limited memory resources, so it is always advisable to strive for a low memory footprint. The more common technique is, therefore, to provide a smaller buffer and use a loop to read/parse the document in parts. This way you can process even really big XML files that otherwise wouldn't fit into memory.
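The chunked read/parse loop can be sketched as follows. To keep the example self-contained, it reads the file with standard C stdio and hands each chunk to a callback instead of calling the library directly; ChunkFeeder, parse_in_chunks(), struct FeedStats and count_chunks() are hypothetical names introduced for this illustration. With expat.library, the callback would presumably be a thin wrapper around IExpat->XML_Parse(parser, data, length, isFinal), whose last argument tells the parser that no more data will follow and which returns zero (XML_STATUS_ERROR) on failure. The file could just as well be read with dos.library calls.

```c
#include <stdio.h>

#define CHUNK_SIZE 4096

/* Hypothetical callback that passes one chunk to the parser. It returns
   non-zero on success and zero on a parse error. */
typedef int (*ChunkFeeder)(void *context, const char *data, int length, int isFinal);

/* Read the file in fixed-size chunks and feed each chunk to the parser.
   Returns 1 on success, 0 on a read or parse error. */
int parse_in_chunks(FILE *file, ChunkFeeder feed, void *context)
{
    char buffer[CHUNK_SIZE];
    int  isFinal = 0;

    while (!isFinal)
    {
        size_t length = fread(buffer, 1, sizeof(buffer), file);

        if (ferror(file))
            return 0;                   /* read error */

        /* A short read without an error means the end of the file,
           so this is the last chunk (it may even be empty). */
        isFinal = (length < sizeof(buffer));

        if (!feed(context, buffer, (int)length, isFinal))
            return 0;                   /* the parser reported an error */
    }

    return 1;
}

/* Stand-in feeder for trying out the loop without a parser:
   it only counts what it receives. */
struct FeedStats
{
    long bytes;
    int  calls;
    int  finals;
};

int count_chunks(void *context, const char *data, int length, int isFinal)
{
    struct FeedStats *stats = context;

    (void)data;

    stats->bytes += length;
    stats->calls++;
    if (isFinal)
        stats->finals++;

    return 1;
}
```

The buffer size is a trade-off: a larger chunk means fewer calls, a smaller one means a lower memory footprint. The value of 4096 bytes used here is arbitrary.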

Advanced Use

Namespace Processing

Unknown Encodings

Function Reference