Copyright (c) Hyperion Entertainment and contributors.
UTF8 IFF UTF-8 Unicode Text
Latest revision as of 17:47, 8 June 2012
Author
Registered by Ilkka Lehtoranta.
Rationale
The CHRS chunk is too limited. There is a CSET chunk allowing the specification of a character set but it has proven impractical. For example, if a web browser pastes a Unicode string to the clipboard it must include a CSET chunk. However, older applications don't know anything about CSET and thus display garbage for code points outside the standard ASCII range. Newer applications supporting Unicode could indeed use that UTF-8 string. CSET is also quite cumbersome because all new applications would have to support character set conversions and there is a risk of information loss during those conversions.
Definition
(any).UTF8
Unicode string chunk. This chunk contains a Unicode string in UTF-8 format. It is designed to store Unicode strings in IFF FORMs in such a manner that legacy applications can read data in a compatible format while new Unicode-aware applications can take advantage of full Unicode support.
Applications can write both CHRS and UTF8 chunks to the form. When reading, only one string chunk type is used. Legacy applications continue to read the CHRS chunks, which contain only legacy 8-bit strings in the system character set encoding. New applications read only UTF8 chunks, if present.
Legacy applications only know the CHRS chunk and continue to work correctly. New applications, when finding UTF8 chunks within the form, will read UTF8 chunks only and ignore all CHRS chunks. Applications should not expect chunks to be in any particular order nor should they expect chunks to be consecutive.
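The reading rule above can be sketched in a few lines of Python. This is only an illustration, not part of the registered format: the function name `select_text`, the representation of a parsed form as a list of `(chunk_id, data)` pairs, the choice of ISO-8859-1 as the "system character set", and the decision to join multiple chunks of the same type in order are all assumptions made for the example.

```python
def select_text(chunks, legacy_encoding="iso-8859-1"):
    """Return the text a Unicode-aware reader would use.

    `chunks` is a hypothetical, already-parsed list of (chunk_id, data)
    pairs in whatever order they appeared in the form.
    """
    # Collect UTF8 chunks wherever they are; order and adjacency
    # relative to CHRS chunks must not matter.
    utf8_parts = [data for chunk_id, data in chunks if chunk_id == b"UTF8"]
    if utf8_parts:
        # UTF8 chunks are present: use them and ignore every CHRS chunk.
        return b"".join(utf8_parts).decode("utf-8")

    # No UTF8 chunk found: fall back to the legacy 8-bit CHRS data.
    # The encoding here is an assumption standing in for the system charset.
    chrs_parts = [data for chunk_id, data in chunks if chunk_id == b"CHRS"]
    return b"".join(chrs_parts).decode(legacy_encoding)
```

A legacy reader corresponds to the fallback branch alone: it never looks for UTF8 chunks and simply decodes CHRS.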
Example
FORM TEXT {
  CHRS {
    "Hello world!"
  }
  UTF8 {
    "Hello world!"
  }
}
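As a rough sketch of how a writer might serialize the example, the Python below emits both chunks using the general IFF chunk layout (four-byte ID, big-endian 32-bit length, data, and a zero pad byte when the length is odd). The helper names `make_chunk` and `make_text_form` are hypothetical, the form type `TEXT` is copied from the example, and encoding the CHRS copy as ISO-8859-1 with `?` for unrepresentable characters is an assumption, not something the registration prescribes.

```python
import struct


def make_chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Serialize one IFF chunk: ID, big-endian 32-bit size, data, pad."""
    if len(chunk_id) != 4:
        raise ValueError("IFF chunk IDs are exactly four bytes")
    # The size field counts only the data; odd-length data is followed
    # by one zero pad byte so the next chunk starts on an even offset.
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack(">I", len(data)) + data + pad


def make_text_form(text: str, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Build a FORM holding the same string as both CHRS and UTF8."""
    # Legacy copy for old readers; characters outside the legacy
    # charset are replaced with '?' (an assumed fallback policy).
    legacy = text.encode(legacy_encoding, errors="replace")
    body = (
        b"TEXT"                                    # form type from the example
        + make_chunk(b"CHRS", legacy)              # read by legacy applications
        + make_chunk(b"UTF8", text.encode("utf-8"))  # read by new applications
    )
    return make_chunk(b"FORM", body)


form = make_text_form("Hello world!")
```

Writing the CHRS chunk alongside the UTF8 chunk is what keeps older applications working: they find the chunk they already understand and skip the one they do not.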