Copyright (c) Hyperion Entertainment and contributors.

Pattern Matching

From AmigaOS Documentation Wiki
Revision as of 20:54, 2 October 2015 by Steven Solie (talk | contribs) (→‎Patterns)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Introduction

One of the features included in dos.library is a series of functions to do standard pattern matching. Using a set of standard string matching tokens, any application can use these functions to test if a particular string matches a pattern. The Amiga OS uses these functions for processing file name strings for its new directory scanning functions.

These functions can be used in every circumstance where you would like to enable the user to enter a pattern to indicate more than one target string. Using these functions not only makes it unnecessary to implement your own pattern matching routines, but by using the familiar DOS pattern tokens in your application, it is easier for the user to learn how to use your application.

Patterns

An AmigaDOS pattern matching string is a combination of alphanumeric characters and a series of special token characters. These token characters are part of the ASCII character set and they denote such things as string matching wildcards, string repetitions, and string negation. Pattern matching strings can use parentheses to delimit pattern matching substrings.

 ?
The question mark matches any single character. For example, the pattern matching string "A?B" matches any string that is three letters long, that starts with an "A" and ends with a "B".
#
The number sign matches strings containing zero or more repetitions of the expression that immediately follows the # in the pattern matching string. For example, the pattern matching string "#A" matches any string that consists of one or more of the "A" character. The pattern matching string "#?" matches any non-NULL string. The # can apply to entire substrings delimited by parentheses. For example, the pattern string "#(AB)" matches any string consisting of one or more repetitions of the substring "AB" (AB, ABAB, ABABAB...).
 %
Matches the NULL string.
|
This is the OR symbol. This matches strings that contain the expressions on either side of the OR sign. The expressions and the OR symbol need to be enclosed in parentheses. For example, the pattern matching string "(A|B)" matches the string "A" or the string "B". The pattern matching string A(B|%|C) matches the strings "AB", "A", and "AC".
~
The tilde negates the expression that follows it. All strings that do not match the expression that follows the tilde will match the expression with the tilde. For example, the pattern matching string "~(#?.info)" matches any string that does not match the string "#?.info" (does not end with the substring ".info").
*
The star is provided as an synonym to "#?". This is an option which can be turned on. Note that the star can not be used by itself on all non-file system devices, like a logical device name assigned to a directory on a file system. For example:
Assign A: dh0:tmp
cd a:
list *
will produce an error. DOS preferences can be used to enable or disable this feature.
[]
All characters within brackets indicate a character class. Any character in the character class qualifies. Within a character class, a character range can be indicated by specifing the start and stop character, separated with a minus sign. Note that character classes are case sensitive. If character classes are to be used in a case insensitive form, they should be translated to uppercase. Here are some examples:
[ACF]#?   matches strings starting with `A', `C', or `F'
[A-D]#?   matches strings starting with `A', `B', `C', or `D'
[~ACF]#?  matches strings not starting with `A', `C', or `F'
'
The quote character neutralizes the special meaning of a special character. Here are some examples:
'#'? matches only the literal string "#?"
'?(A|B|C|%)'# matches the literal strings "?#", "?A#", "?B#", "?C#"
" matches '
()
Parentheses group special characters. The expression within the parentheses is a subpattern.

Parsing

When you want to use a string as an AmigaDOS wildcard pattern, the system must first parse it. The system builds a token string which the system uses to match strings. There are two functions in dos.library to parse pattern matching strings:

int32 ParsePattern(CONST_STRPTR SourcePattern, STRPTR MyDestination, int32 DestLength);
 
int32 ParsePatternNoCase(CONST_STRPTR SourcePattern, STRPTR MyDestination, int32 DestLength);

The ParsePattern() function creates a case sensitive token string, whereas the ParsePatternNoCase() functions creates a case insensitive token string. Both functions require a pointer to a destination buffer (MyDestination in the above prototype) to place the tokenized string in. Since every character in the pattern can be expanded to two tokens (3 in one case), this buffer should be at twice as as large as the original pattern plus 2 bytes. As a general rule, allocating a buffer three times the size of the pattern will hold all patterns. The third argument, DestLength, indicates the size of the destination buffer provided. These functions will returns one of three values:

-1 if there is an error (if the buffer is too small to hold all the tokens or the source string contains an invalid pattern)
1 if the wildcard pattern was parsed successfully and the pattern contains one of the special token characters.
0 if the wildcard pattern was parsed successfully and pattern contained only alphanumeric characters (no special token characters)

Matching

Once a pattern is parsed, it can be compared to a string using either the case sensitive MatchPattern() or the case insensitive MatchPatternNoCase() functions. These functions have the following synopsis:

BOOL MatchPattern(CONST_STRPTR mytokenpattern, STRPTR mystring);
BOOL MatchPatternNoCase(CONST_STRPTR mytokenpattern, STRPTR mystring);

These functions will compare the wildcard pattern string, mytokenpattern (created by ParsePattern/NoCase()), to mystring. If mystring matches the pattern in mytokenpattern, these routines return TRUE, otherwise they return FALSE.

Because these routines are recursive, they can use a lot of stack space, although they will not use more than 1500 bytes of stack space. Make sure the stack space is at least 1500 bytes before calling these routines.

If either of these functions detect a stack overflow, they will return 0 and IoErr() will return ERROR_TOO_MANY_LEVELS. If IoErr() returns 0, there was simply no match. The example below shows how to use the parse and match functions and allows you to test and experiment with patterns.

Example

// Pattern.c
//
// Pattern matching example.
 
#include <exec/types.h>
#include <exec/memory.h>
#include <dos/dos.h>
#include <dos/dosasl.h>
#include <dos/rdargs.h>
 
#include <proto/exec.h>
#include <proto/dos.h>
#include <proto/utility.h>
 
UWORD StrLen(UBYTE *);
 
int main()
{
    LONG            vargs[4];
    UBYTE         **strings;
    UBYTE          *pattern, *parsebuffer;
    UWORD           case_sensitive, buffersize;
    LONG            iswild, success;
    COUNT           i;
 
    /* See the DOS Autodocs for more information about ReadArgs() */
    int32 rargs[3] = {0};
    struct RDArgs *readargs = IDOS->ReadArgs("PATTERN/A,CASE/S,STRINGS/A/M", rargs, NULL);
 
    if (readargs != NULL)
    {
        /* The pattern. */
        pattern = (UBYTE *) (rargs[0]);
 
        /*
         * Case sensitive or not? (default not. Note filename matching
         * should ALWAYS be case insensitive).
         */
        case_sensitive = (UWORD) (rargs[1]);
 
        /* Pointer to array of strings to match */
        strings = (UBYTE **) (rargs[2]);
 
        /* Get a buffer big enough to hold all the tokens */
        buffersize = StrLen(pattern) * 3;
 
        if (parsebuffer = IExec->AllocVecTags(buffersize, AVT_ClearWithValue, 0, TAG_END))
        {
            /* Parse the pattern, according to case sensitivity flag */
            if (case_sensitive)
                iswild = IDOS->ParsePattern(pattern, parsebuffer, buffersize);
            else
            {
                /* make pattern uppercase in case of character classes */
                i = 0;
                while (pattern[i])
                    pattern[i] = ToUpper(pattern[i++]);
                iswild = IDOS->ParsePatternNoCase(pattern, parsebuffer, buffersize);
            }
 
            /*
             * -1 if ParsePattern() failed, 0 for no wildcards, 1 for
             * wildcards. For this I don't care if the supplied pattern had
             * wildcards or not.
             */
            if (iswild != -1)
            {
                /* The array of strings is terminated with a NULL */
                while (*strings)
                {
                    /*
                     * MatchPattern() returns 1 for a successful match, 0
                     * for no match
                     */
                    if (case_sensitive)
                        success = IDOS->MatchPattern(parsebuffer, *strings);
                    else
                        success = IDOS->MatchPatternNoCase(parsebuffer, *strings);
 
                    if (success)
                    {
                        vargs[0] = (LONG) * strings;
                        IDOS->VFPrintf(Output(), "Match: %s\n", vargs);
                    }
                    else
                    {
                        if (IDOS->IoErr() != 0)
                        {
                            IDOS->VFPrintf(Output(), "Overflow\n", NULL);
                            break;
                        }
                    }
                    strings++;
                }
            }
            else
                IDOS->PrintFault(ERROR_BAD_TEMPLATE, pattern);
 
            IDOS->FreeMem(parsebuffer, buffersize);
        }
        else
            IDOS->PrintFault(ERROR_NO_FREE_STORE, NULL);
 
        IDOS->FreeArgs(readargs);
    }
    else
        IDOS->PrintFault(IDOS->IoErr(), NULL);
 
    return 0;
}
 
UWORD StrLen(UBYTE * string)
{
    UBYTE          *length = string + 1;
 
    while (*string++ != '\0');
    return ((UWORD) (string - length));
}