Copyright (c) Hyperion Entertainment and contributors.

UTF8 IFF UTF-8 Unicode Text

From AmigaOS Documentation Wiki

Author

Registered by Ilkka Lehtoranta.

Rationale

The CHRS chunk is too limited. A CSET chunk exists that allows a character set to be specified, but it has proven impractical. For example, if a web browser pastes a Unicode string to the clipboard, it must include a CSET chunk. Older applications know nothing about CSET, however, and thus display garbage for code points outside the standard ASCII range; only newer applications supporting Unicode can use that UTF-8 string. CSET is also cumbersome because every new application would have to support character set conversions, and those conversions risk losing information.

Definition

(any).UTF8

Unicode string chunk. This chunk contains a Unicode string in UTF-8 format. It is designed to store Unicode strings in IFF FORMs in such a manner that legacy applications can read data in a compatible format while new Unicode-aware applications can take advantage of full Unicode support.

Applications can write both CHRS and UTF8 chunks to the form. When reading, only one string chunk type is used. Legacy applications continue to read the CHRS chunks, which contain only legacy 8-bit strings in the system character set encoding. New applications read only UTF8 chunks, if present.

Because legacy applications know only the CHRS chunk, they continue to work correctly. New applications that find UTF8 chunks within the form read those UTF8 chunks only and ignore all CHRS chunks. Applications should not expect chunks to be in any particular order, nor should they expect chunks to be consecutive.
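The reading rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `iter_chunks` and `read_text` are hypothetical, the system character set is assumed to be ISO-8859-1 (`latin-1`), and joining several chunks of the same type in file order is also an assumption. The chunk layout (4-byte ID, big-endian 32-bit length, data padded to an even length) follows the general IFF convention.

```python
import struct


def iter_chunks(body: bytes):
    """Yield (chunk_id, data) for every chunk in a FORM body.

    `body` is the FORM contents after the 4-byte form type.
    """
    pos = 0
    while pos + 8 <= len(body):
        chunk_id = body[pos:pos + 4]
        (size,) = struct.unpack(">I", body[pos + 4:pos + 8])
        data = body[pos + 8:pos + 8 + size]
        if len(data) != size:
            raise ValueError("truncated chunk")
        yield chunk_id, data
        # Chunks are padded to an even length; the pad byte is not counted in size.
        pos += 8 + size + (size & 1)


def read_text(form: bytes, legacy_encoding: str = "latin-1") -> str:
    """Return the text of a FORM, preferring UTF8 chunks over CHRS chunks.

    Assumption: `legacy_encoding` stands in for the system character set.
    """
    if len(form) < 12 or form[:4] != b"FORM":
        raise ValueError("not an IFF FORM")
    (form_size,) = struct.unpack(">I", form[4:8])
    body = form[12:8 + form_size]  # skip "FORM", the size, and the form type

    utf8_parts, chrs_parts = [], []
    for chunk_id, data in iter_chunks(body):
        # Chunks may come in any order and need not be consecutive,
        # so collect both kinds and decide afterwards.
        if chunk_id == b"UTF8":
            utf8_parts.append(data.decode("utf-8"))
        elif chunk_id == b"CHRS":
            chrs_parts.append(data.decode(legacy_encoding))
    # If any UTF8 chunk is present, all CHRS chunks are ignored.
    return "".join(utf8_parts) if utf8_parts else "".join(chrs_parts)


def _chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Hypothetical helper used only to build sample data for this sketch."""
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack(">I", len(data)) + data + pad


sample = _chunk(
    b"FORM",
    b"TEXT"
    + _chunk(b"CHRS", b"caf?")
    + _chunk(b"UTF8", "café".encode("utf-8"))
    + _chunk(b"UTF8", " au lait".encode("utf-8")),
)
legacy_only = _chunk(b"FORM", b"TEXT" + _chunk(b"CHRS", b"caf\xe9"))
```

Here `read_text(sample)` uses only the two UTF8 chunks, while `read_text(legacy_only)` falls back to the CHRS chunk because no UTF8 chunk is present.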

Example

    FORM TEXT
    {
        CHRS { "Hello world!" }
        UTF8 { "Hello world!" }
    }
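As a rough illustration of how the example FORM might be serialized, the Python sketch below writes both chunks for the same text. The names `make_chunk` and `make_text_form` are hypothetical, the form type `TEXT` is taken from the example above, and `latin-1` is only an assumed stand-in for the system character set. Characters that the legacy encoding cannot represent are replaced with `?` in the CHRS chunk, which is one possible policy rather than a requirement of the chunk definition.

```python
import struct


def make_chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Build one IFF chunk: 4-byte ID, big-endian 32-bit length, data, pad byte."""
    if len(chunk_id) != 4:
        raise ValueError("chunk ID must be exactly 4 bytes")
    # Odd-length data is followed by one pad byte that is not counted in the length.
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack(">I", len(data)) + data + pad


def make_text_form(text: str,
                   form_type: bytes = b"TEXT",
                   legacy_encoding: str = "latin-1") -> bytes:
    """Build a FORM carrying `text` as both a CHRS and a UTF8 chunk.

    Assumption: `legacy_encoding` stands in for the system character set.
    """
    # Legacy readers get a best-effort 8-bit rendering of the text.
    legacy = text.encode(legacy_encoding, errors="replace")
    # Unicode-aware readers get the full string in UTF-8.
    unicode_data = text.encode("utf-8")
    body = form_type + make_chunk(b"CHRS", legacy) + make_chunk(b"UTF8", unicode_data)
    return make_chunk(b"FORM", body)


form = make_text_form("Hello world!")
```

For pure ASCII text such as "Hello world!", the CHRS and UTF8 chunks carry identical bytes; they differ only once the text contains characters outside the ASCII range.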