Copyright (c) Hyperion Entertainment and contributors.

UTF8 IFF UTF-8 Unicode Text

From AmigaOS Documentation Wiki

Latest revision as of 17:47, 8 June 2012

Author

Registered by Ilkka Lehtoranta.

Rationale

The CHRS chunk is too limited. There is a CSET chunk that allows a character set to be specified, but it has proven impractical. For example, if a web browser pastes a Unicode string to the clipboard, it must include a CSET chunk. However, older applications know nothing about CSET and thus display garbage for code points outside the standard ASCII range. Newer applications that support Unicode could use that UTF-8 string as-is. CSET is also cumbersome because every new application would have to support character set conversions, and those conversions risk losing information.

Definition

(any).UTF8

Unicode string chunk. This chunk contains a Unicode string in UTF-8 format. It is designed to store Unicode strings in IFF FORMs in such a manner that legacy applications can read data in a compatible format while new Unicode-aware applications can take advantage of full Unicode support.

Applications can write both CHRS and UTF8 chunks to the FORM. When reading, an application uses only one of the two chunk types. Legacy applications continue to read the CHRS chunks, which contain only legacy 8-bit strings in the system character set encoding. New applications read only the UTF8 chunks, if present.

Legacy applications know only the CHRS chunk and continue to work correctly. New applications that find UTF8 chunks within the FORM read those chunks only and ignore all CHRS chunks. Applications should not expect chunks to be in any particular order, nor should they expect chunks to be consecutive.
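The reading rule above can be sketched in a few lines of Python. This is only an illustration, not code from the registry entry: the names iter_chunks and read_text are hypothetical, the fallback character set (ISO-8859-1) stands in for "the system character set", and joining multiple chunks of the same type before decoding is an assumption. The sketch relies on the usual IFF layout of a 4-byte chunk ID, a big-endian 32-bit length, and a pad byte after odd-length data.

```python
import struct


def iter_chunks(form: bytes):
    """Yield (chunk_id, payload) pairs from a single, non-nested IFF FORM."""
    if len(form) < 12 or form[:4] != b"FORM":
        raise ValueError("not an IFF FORM")
    (form_size,) = struct.unpack(">I", form[4:8])
    end = min(8 + form_size, len(form))
    pos = 12  # skip "FORM", the size field, and the 4-byte form type
    while pos + 8 <= end:
        chunk_id = form[pos:pos + 4]
        (size,) = struct.unpack(">I", form[pos + 4:pos + 8])
        yield chunk_id, form[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # odd-length chunks carry one pad byte


def read_text(form: bytes, legacy_charset: str = "iso-8859-1") -> str:
    """Prefer UTF8 chunks; fall back to CHRS only when no UTF8 chunk exists."""
    chrs, utf8 = [], []
    # Scan the whole FORM first: chunks may come in any order and need not
    # be consecutive, so the decision cannot be made at the first match.
    for chunk_id, payload in iter_chunks(form):
        if chunk_id == b"UTF8":
            utf8.append(payload)
        elif chunk_id == b"CHRS":
            chrs.append(payload)
    if utf8:
        # Assumption: several UTF8 chunks are concatenated before decoding.
        return b"".join(utf8).decode("utf-8")
    # Assumption: legacy_charset approximates the system character set.
    return b"".join(chrs).decode(legacy_charset)
```

Collecting both chunk types in a single pass and deciding afterwards keeps the reader independent of chunk order, which is the point of the last sentence above.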

Example

    FORM TEXT
    {
        CHRS { "Hello world!" }
        UTF8 { "Hello world!" }
    }
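As a rough illustration of what the FORM above looks like on disk or on the clipboard, the following Python sketch serializes it into bytes. The helper names make_chunk and make_text_form are hypothetical, the form type TEXT is taken from the example as written, and ISO-8859-1 is an assumed stand-in for the system character set. Characters that the legacy character set cannot represent are replaced with "?" in the CHRS copy, which is the kind of information loss the UTF8 chunk avoids.

```python
import struct


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Serialize one IFF chunk: 4-byte ID, big-endian length, data, padding."""
    if len(chunk_id) != 4:
        raise ValueError("chunk IDs are exactly four bytes")
    pad = b"\x00" if len(payload) % 2 else b""  # keep chunks even-aligned
    return chunk_id + struct.pack(">I", len(payload)) + payload + pad


def make_text_form(text: str,
                   form_type: bytes = b"TEXT",
                   legacy_charset: str = "iso-8859-1") -> bytes:
    """Write the same string twice: CHRS for legacy readers, UTF8 for new ones."""
    # Assumption: legacy_charset approximates the system character set;
    # unrepresentable characters degrade to "?" in the CHRS copy only.
    legacy = text.encode(legacy_charset, errors="replace")
    body = (form_type
            + make_chunk(b"CHRS", legacy)
            + make_chunk(b"UTF8", text.encode("utf-8")))
    return b"FORM" + struct.pack(">I", len(body)) + body
```

Calling make_text_form("Hello world!") yields a FORM whose CHRS and UTF8 payloads are byte-for-byte identical, since the string is plain ASCII. The two payloads diverge only once the text contains characters outside the ASCII range.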