UTF-EBCDIC
UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using one to five one-byte (8-bit) code units (in contrast to a maximum of four for UTF-8).[1] It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first, creating what the specification calls an I8 sequence. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to the corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As each trailing byte can hold only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+03FF is often longer than its UTF-8 encoding; for example, U+0400 takes three bytes rather than two.
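The first step can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `utf8_mod_encode` is hypothetical, and the lead-byte patterns (110yyyyy, 1110zzzz, 11110www, 1111100v) are assumed from the sequence lengths described above together with the 5-bit trailing-byte format.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode code point as an I8 (UTF-8-Mod) byte sequence.

    Hypothetical helper for illustration; the authoritative definition
    is in Unicode Technical Report #16.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp <= 0x9F:
        # ASCII and the C1 controls are kept as single bytes.
        return bytes([cp])
    # (largest code point, lead-byte marker, number of trailing bytes)
    layouts = (
        (0x3FF, 0xC0, 1),      # 110yyyyy + 1 trailing byte  (10 bits)
        (0x3FFF, 0xE0, 2),     # 1110zzzz + 2 trailing bytes (14 bits)
        (0x3FFFF, 0xF0, 3),    # 11110www + 3 trailing bytes (18 bits)
        (0x10FFFF, 0xF8, 4),   # 1111100v + 4 trailing bytes (21 bits)
    )
    for limit, lead, trail in layouts:
        if cp <= limit:
            out = [lead | (cp >> (5 * trail))]
            # Each trailing byte is 101xxxxx and carries 5 payload bits.
            for i in range(trail - 1, -1, -1):
                out.append(0xA0 | ((cp >> (5 * i)) & 0x1F))
            return bytes(out)
    raise AssertionError("unreachable")


print(utf8_mod_encode(0x85).hex())    # 85 (a C1 control stays one byte)
print(utf8_mod_encode(0xA0).hex())    # c5a0
print(utf8_mod_encode(0x400).hex())   # e1a0a0
```

Comparing `len(utf8_mod_encode(cp))` with `len(chr(cp).encode("utf-8"))` for a few code points shows where the 5-bit trailing bytes cost an extra byte.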
The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, U+0041 "A" is still encoded as 01000001), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A").
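The byte-for-byte lookup can be illustrated with a deliberately partial table. The sketch below covers only ASCII letters and digits, with target positions read off the code page layout table in this article; the names `PARTIAL_I8_TO_EBCDIC` and `map_alnum` are hypothetical, and the complete 256-entry table is the one defined in Unicode Technical Report #16.

```python
# Partial I8 -> UTF-EBCDIC byte map (letters and digits only).
# Each run of characters occupies consecutive EBCDIC positions.
PARTIAL_I8_TO_EBCDIC = {}
for chars, start in (
    ("abcdefghi", 0x81), ("jklmnopqr", 0x91), ("stuvwxyz", 0xA2),
    ("ABCDEFGHI", 0xC1), ("JKLMNOPQR", 0xD1), ("STUVWXYZ", 0xE2),
    ("0123456789", 0xF0),
):
    for offset, ch in enumerate(chars):
        PARTIAL_I8_TO_EBCDIC[ord(ch)] = start + offset

# The mapping is one-to-one, so it can be inverted for decoding.
PARTIAL_EBCDIC_TO_I8 = {v: k for k, v in PARTIAL_I8_TO_EBCDIC.items()}


def map_alnum(i8: bytes) -> bytes:
    """Map I8 bytes to UTF-EBCDIC bytes; raises KeyError outside the partial table."""
    return bytes(PARTIAL_I8_TO_EBCDIC[b] for b in i8)


print(map_alnum(b"A").hex())  # c1
```

Because every I8 byte is mapped independently, multi-byte sequences keep their length; only the byte values change.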
This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, IBM Db2, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.
Codepage layout
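There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). As the table shows, the single-byte portion resembles IBM-1047 rather than IBM-37, which is apparent from the location of the square brackets: UTF-EBCDIC places [ and ] at hex AD and BD respectively, whereas CCSID 37 has them at hex BA and BB. In the table, • marks a trailing byte, and the digits 2 to 5 outside the Fx row mark the lead byte of a sequence of that many bytes.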
UTF-EBCDIC
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
| 0x | NUL | SOH | STX | ETX | ST | HT | SSA | DEL | EPA | RI | SS2 | VT | FF | CR | SO | SI |
| 1x | DLE | DC1 | DC2 | DC3 | OSC | LF | BS | ESA | CAN | EM | PU2 | SS3 | FS | GS | RS | US |
| 2x | PAD | HOP | BPH | NBH | IND | NEL | ETB | ESC | HTS | HTJ | VTS | PLD | PLU | ENQ | ACK | BEL |
| 3x | DCS | PU1 | SYN | STS | CCH | MW | SPA | EOT | SOS | SGCI | SCI | CSI | DC4 | NAK | PM | SUB |
| 4x | SP | • | • | • | • | • | • | • | • | • | • | . | < | ( | + | \| |
| 5x | & | • | • | • | • | • | • | • | • | • | ! | $ | * | ) | ; | ^ |
| 6x | - | / | • | • | • | • | • | • | • | • | • | , | % | _ | > | ? |
| 7x | • | • | • | • | 2 | 2 | 2 | 2 | 2 | ` | : | # | @ | ' | = | " |
| 8x | 2 | a | b | c | d | e | f | g | h | i | 2 | 2 | 2 | 2 | 2 | 2 |
| 9x | 2 | j | k | l | m | n | o | p | q | r | 2 | 2 | 2 | 2 | 2 | 2 |
| Ax | 2 | ~ | s | t | u | v | w | x | y | z | 2 | 2 | 2 | [ | 2 | 2 |
| Bx | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | ] | 3 | 3 |
| Cx | { | A | B | C | D | E | F | G | H | I | 3 | 3 | 3 | 3 | 3 | 3 |
| Dx | } | J | K | L | M | N | O | P | Q | R | 3 | 3 | 4 | 4 | 4 | 4 |
| Ex | \ | 4 | S | T | U | V | W | X | Y | Z | 4 | 4 | 4 | 5 | 5 | |
| Fx | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | | | | | | APC |
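The bracket positions mentioned above can be compared against the `cp037` codec that ships with Python's standard library. This is only a small sanity check: the UTF-EBCDIC positions are copied from the table, and the function name `compare_brackets` is hypothetical.

```python
# Square-bracket positions in UTF-EBCDIC, as read from the table above.
UTF_EBCDIC_BRACKETS = {"[": 0xAD, "]": 0xBD}


def compare_brackets() -> dict:
    """Return {char: (UTF-EBCDIC byte, CCSID 37 byte)} for both brackets."""
    result = {}
    for ch, utf_ebcdic_pos in UTF_EBCDIC_BRACKETS.items():
        # Python's built-in cp037 codec implements CCSID 37.
        cp037_pos = ch.encode("cp037")[0]
        result[ch] = (utf_ebcdic_pos, cp037_pos)
    return result


for ch, (utf_ebcdic_pos, cp037_pos) in compare_brackets().items():
    print(f"{ch}: UTF-EBCDIC {utf_ebcdic_pos:02X}, CCSID 37 {cp037_pos:02X}")
```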
Oracle UTFE
Oracle UTFE is an Oracle Database variation of UTF-8 for Unicode 3.0. Similarly to the CESU-8 variant of UTF-8, it encodes each supplementary character as a pair of characters (here two 4-byte characters) rather than as a single 4- or 5-byte character. It is used only on EBCDIC platforms.[2]
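The size difference can be sketched under one explicit assumption: that each half of the surrogate pair is sized by the same UTF-8-Mod length rules as any other code point in its range. The helper names `i8_length`, `surrogate_pair` and `sizes` are hypothetical, and the sketch does not reproduce Oracle's actual implementation.

```python
def i8_length(cp: int) -> int:
    """Number of bytes UTF-8-Mod needs for a code point (surrogates allowed here)."""
    if cp <= 0x9F:
        return 1
    if cp <= 0x3FF:
        return 2
    if cp <= 0x3FFF:
        return 3
    if cp <= 0x3FFFF:
        return 4
    return 5


def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def sizes(cp: int) -> tuple:
    """Return (bytes as one character, bytes as a surrogate pair)."""
    high, low = surrogate_pair(cp)
    return i8_length(cp), i8_length(high) + i8_length(low)


print(sizes(0x1F600))   # (4, 8)
print(sizes(0x10FFFF))  # (5, 8)
```

Under this assumption every surrogate falls in the 4-byte range U+4000 through U+3FFFF, which matches the "two 4-byte characters" described above.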
Advantages:
- It is the only Unicode character set for EBCDIC platforms.
- The length of SQL CHAR types can be specified in number of characters.
- The binary order of SQL CHAR columns is the same as the binary order of SQL NCHAR columns if the data consists of the same supplementary characters. Consequently, these columns sort identically for identical strings.[2]
Disadvantages:
- Supplementary characters occupy six bytes instead of only four. Consequently, supplementary characters need to be converted.
- UTFE is not a standard Unicode encoding. Clients requiring UTF-8 encoding must convert data on retrieval and storage.[2]
References
- "UTR #16: UTF-EBCDIC". www.unicode.org. Retrieved 2021-02-23. "You need to search at most five bytes (seven bytes, if the full range of 31 bits of ISO/IEC 10646 is considered) backwards"
- Baird, Cathy; Chiba, Dan; Chu, Winson; Fan, Jessica; Ho, Claire; Law, Simon; Lee, Geoff; Linsley, Peter; Matsuda, Keni; Oscroft, Tamzin; Takeda, Shige; Tanaka, Linus; Tozawa, Makoto; Trute, Barry; Tsujimoto, Mayumi; Wu, Ying; Yau, Michael; Yu, Tim; Wang, Chao; Wong, Simon; Zhang, Weiran; Zheng, Lei; Zhu, Yan; Moore, Valarie (2002) [1996]. "Appendix A: Locale Data". Oracle9i Database Globalization Support Guide (PDF) (Release 2 (9.2) ed.). Oracle Corporation. Oracle A96529-01. Archived (PDF) from the original on 2017-02-14. Retrieved 2017-02-14.
External links
- V.S. Umamaheswaran, Unicode Technical Report #16: the definition of UTF-EBCDIC (2002-04-16)