# The New C Standard- P4

Shared by: Thanh Cong | Date: | File type: PDF | Pages: 100



## Text content: The New C Standard- P4

### 5.2.1 Character sets

Table 221.2: Relative frequency (most common to least common, with parentheses used to bracket extremely rare letters) of letter usage in various human languages (the English ranking is based on the British National Corpus). Based on Kelk.[729]

| Language | Letters |
|---|---|
| English | etaoinsrhldcumfpgwybvkxjqz |
| French | esaitnrulodcmpévqfbghjàxèyêzâçîùôûïkëw |
| Norwegian | erntsilakodgmvfupbhøjyåæcwzx(q) |
| Swedish | eantrsildomkgväfhupåöbcyjxwzéq |
| Icelandic | anriestuðlgmkfhvoáþídjóbyæúöpéýcxwzq |
| Hungarian | eatlnskomzrigáéydbvhjőfupöócűíúüxw(q) |

222 The representation of each member of the source and execution basic character sets shall fit in a byte.

Commentary

This is a requirement on the implementation. The definition of character already specifies that it fits in a byte. However, a character constant has type int, which could be thought to imply that the value representation of characters need not fit in a byte. This wording clarifies the situation. The representation of members of the basic execution character set is also required to be a nonnegative value.

C++

A byte is at least large enough to contain any member of the basic execution character set and . . . (1.7p1)

This requirement reverses the dependency given in the C Standard, but the effect is the same.

Common Implementations

On hosts where characters have a width of 16 or 32 bits, that choice has usually been made because of addressability issues (pointers only being able to point at storage on 16- or 32-bit address boundaries). It is not usually necessary to increase the size of a byte because of representational issues to do with the character set.

In the EBCDIC character set, the value of 'a' is 129 (in Ascii it is 97).
If the implementation-defined value of CHAR_BIT is 8, then this character, and some others, will not be representable in the type signed char (in most implementations the representation actually used is the negative value whose least significant eight bits are the same as the corresponding bits of the positive value in the character set). In such implementations the type char will need to have the same representation as the type unsigned char.

The ICL 1900 series used a 6-bit byte. Implementing this requirement on such a host would not have been possible.

Coding Guidelines

A general principle of coding guidelines is to recommend against the use of representation information. In this case the standard is guaranteeing that a character will fit within a given amount of storage. Relying on this requirement might almost be regarded as essential in some cases.

Example

    void f(void)
    {
        char C_1 = 'W';        /* Guaranteed to fit in a char. */
        char C_2 = '$';        /* Not guaranteed to fit in a char. */
        signed char C_3 = 'W'; /* Not guaranteed to fit in a signed char. */
    }

June 24, 2009 v 1.2

223 In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.

Commentary

This is a requirement on the implementation. The Committee realized that a large number of existing programs depended on this statement being true. It is certainly true for the two major character sets used in the English-speaking world, Ascii and EBCDIC, and for all of the human language digit encodings specified in Unicode (see Table 797.1). The Committee thus saw fit to bless this usage.

Not only is it possible to perform relational comparisons on the digit characters (e.g., '0' […]

[…]

Commentary

This is a requirement on the implementation.
The C library makes a distinction between text and binary files. However, there is no requirement that source files exist in either of these forms. The worst-case scenario: in a host environment that did not have a native method of delimiting lines, an implementation would have to provide/define its own convention and supply tools for editing such files. Some integrated development environments do define their own conventions for storing source files and other associated information.

C++

The C++ Standard does not specify this level of detail (although it does refer to end-of-line indicators, 2.1p1n1).

Common Implementations

Unicode Technical Report #13: “Unicode newline guidelines” discusses the issues associated with representing new-lines in files. The ISO 6429 standard also defines NEL (NExt Line, hexadecimal 0x85) as an end-of-line indicator. The Microsoft Windows convention is to indicate end-of-line with a carriage return/line feed pair, \r\n (a convention that goes back through CP/M to DEC RT-11); the Unix convention is to use a single line feed character, \n; the Macintosh convention is to use the carriage return character, \r.

Some mainframes implement a form of text files that mimics punched cards by having fixed-length lines. Each line contains the same number of characters, often 80. The space after the last user-written character is sometimes padded with spaces; other times it is padded with null characters.

225 this International Standard treats such an end-of-line indicator as if it were a single new-line character.

Commentary

The standard is not interested in the details of the byte representation of end-of-line on storage media. It makes use of the concept of end-of-line and uses the conceptual simplification of treating it as if it were a single character.

C++

. . . (introducing new-line characters for end-of-line indicators) . . .
(2.1p1n1)

226 In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line.

Commentary

This is a requirement on the implementation.

These characters form part of the set of 96 execution character set members (counting the null character) defined by the standard, plus new line, which is introduced in translation phase 1. However, these characters are not in the basic source character set, and are represented in it using escape sequences.

Other Languages

Few other languages include the concept of control characters, although many implementations provide semantics for them in source code (they are usually mapped exactly from the source to the execution character set). Java defines the same control characters as C and gives them their equivalent Ascii values. However, it does not define any semantics for these characters.

Common Implementations

ECMA-48 Control Functions for Coded Character Sets, Fifth Edition (available free from their Web site, http://www.ecma-international.ch) was fast-tracked as the third edition of ISO/IEC 6429. This standard defines significantly more control functions than those specified in the C Standard.

227 If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

Commentary

The standard does not prohibit such characters from occurring in a source file outright. The Committee was aware of implementations that used such characters to extend the language. For instance, the use of the @ character in an object definition to specify its address in storage. The list of exceptions is extensive.
The only usage remaining, for such characters, is as a punctuator. Any other character has to be accepted as a preprocessing token. It may subsequently, for instance, be stringized. The undefined behavior occurs in the attempt to convert this preprocessing token into a token.

C90

Support for additional characters in identifiers is new in C99.

C++

Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (2.1p1)

The C++ Standard specifies the behavior and a translator is required to handle source code containing such a character. A C translator is permitted to issue a diagnostic and fail to translate the source code.

Other Languages

Most languages regard the appearance of an unknown character in the source as some form of error. Like C, most language implementations support additional characters in string literals and comments.

Common Implementations

Most implementations generate a diagnostic, either when the preprocessing token containing one of these characters is converted to a token, or as a result of the very likely subsequent syntax violation. Some implementations[728] define the @ character to be a token, its usual use being to provide the syntax for specifying the address at which an object is to be placed in storage. It is generally followed by an integer constant expression.

Coding Guidelines

An occurrence of a character outside of the basic source character set, in one of these contexts, is most likely to be a typing mistake and is very likely to be diagnosed by the translator. The other possibility is that such characters were intended to be used because use is being made of an extension. This issue is discussed elsewhere.

Example

    static int glob @ 0x100; /* Put glob at location 0x100. */

228 A letter is an uppercase letter or a lowercase letter as defined above;

Commentary

This defines the term letter. There is a third kind of case that characters can have, titlecase (a term sometimes applied to words where the first letter is in uppercase, or titlecase, and the other letters are in lowercase). In most instances titlecase is the same as uppercase, but there are a few characters where this is not true; for instance, the titlecase of the Unicode character U+01C9, lj, is U+01C8, Lj, and its uppercase is U+01C7, LJ.

C90

This definition is new in C99.

229 in this International Standard the term does not include other characters that are letters in other alphabets.

Commentary

All implementations are required to support the basic source character set to which this terminology applies. Annex D lists those universal character names that can appear in identifiers. However, they are not referred to as letters (although they may well be regarded as such in their native language).

The term letter assumes that the orthography (writing system) of a language has an alphabet. Some orthographies, for instance Japanese, don't have an alphabet as such (let alone the concept of upper- and lowercase letters). Even when the orthography of a language does include characters that are considered to be matching upper- and lowercase letters by speakers of that language (e.g., æ and Æ, å and Å), the C Standard does not define these characters to be letters.

C++

The definition used in the C++ Standard, 17.3.2.1.3 (the footnote applies to C90 only), implies this is also true in C++.

Coding Guidelines

The term letter has a common usage meaning in a number of different languages. Developers do not often use this term in its C Standard sense. Perhaps the safest approach for coding guideline documents to take is to avoid use of this term completely.
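
As a rough illustration of the narrow sense of letter in sentences 228 and 229, the following sketch tests a character against an explicit list of the 52 letters of the basic character set. The function name `is_basic_letter` is hypothetical (it appears in neither the standard nor the commentary). An explicit list is used because the standard guarantees contiguous values only for the decimal digits, not for the letters.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical helper: returns 1 if c is one of the 52 uppercase or
   lowercase letters of the basic character set, otherwise 0. */
int is_basic_letter(int c)
{
    static const char letters[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz";

    /* Members of the basic character set are positive when stored in a
       char, so a value outside 1..CHAR_MAX cannot be one of them. The
       lower bound also stops strchr from matching the terminating null. */
    if (c <= 0 || c > CHAR_MAX)
        return 0;
    return strchr(letters, c) != NULL;
}
```

Under this definition a value such as 0xE6 (æ in ISO 8859–1) is not a letter, even though speakers of some languages would regard it as one.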

230 The universal character name construct provides a way to name other characters.

Commentary

In theory all characters on planet Earth and beyond. In practice, those defined in ISO 10646.

C90

Support for universal character names is new in C99.

Other Languages

Other language standards are slowly moving to support ISO 10646. Java supports a similar concept.

Common Implementations

Support for these characters is relatively new. It will take time before similarities between implementations become apparent.

231 Forward references: universal character names (6.4.3), character constants (6.4.4.4), preprocessing directives (6.10), string literals (6.4.5), comments (6.4.9), string (7.1.1).

### 5.2.1.1 Trigraph sequences

232 Before any other processing takes place, each occurrence of one of the following sequences of three characters (called trigraph sequences; see footnote 12) is replaced with the corresponding single character.

Commentary

Trigraphs were an invention of the C committee. They are a method of supporting the input (into source files, not executing programs) and the printing of some C source characters in countries whose alphabets, and keyboards, do not include them in their national character set. Digraphs, discussed elsewhere, are another sequence of characters that are replaced by a corresponding single character.

The \? escape sequence was introduced to allow sequences of ?s to occur within string literals.

The wording was changed by the response to DR #309 (the earlier wording began “All occurrences in a source file”).

Other Languages

Until recently many computer languages did not attempt to be as worldly as C, requiring what might be called an Ascii keyboard. Pascal specifies what it calls lexical alternatives for some lexical tokens.
The character sequences making up these lexical alternatives are only recognized in a context where they can form a single, complete token.

Common Implementations

On the Apple Macintosh host, the notation '????' is used to denote the unknown file type. Translators in this environment often disable trigraphs by default to prevent unintended replacements from occurring.

233 Trigraph sequence mappings:

| Trigraph | Replacement |
|---|---|
| `??=` | `#` |
| `??(` | `[` |
| `??/` | `\` |
| `??)` | `]` |
| `??'` | `^` |
| `??<` | `{` |
| `??!` | `\|` |
| `??>` | `}` |
| `??-` | `~` |

Commentary

The above sequences were chosen to minimize the likelihood of breaking any existing, conforming, C source code.

Other Languages

Many languages use a small subset, or none, of these problematic source characters, reducing the potential severity of the problem. The Pascal standard specifies (. and .) as alternative lexical representations of [ and ] respectively.

Common Implementations

Recognizing trigraph sequences entails a check against every character read in by the translator. Performance profiling of translators has shown that a large percentage of time is spent in the lexer. A study by Waite[1469] found that 41% of total translation time was spent in a handcrafted lexer (with little code optimization performed by the translator). An automatically produced lexer (the lex tool was used) consumed 3 to 5 times as much time.

One vendor, Borland, which took pride in, and was known for, the speed at which its translators operated, did not include trigraph processing in the main translator program. A stand-alone utility was provided to perform trigraph processing. Those few programs that used trigraphs needed to be processed by this utility, generating a temporary file that was processed by the main translator program. While using this pre-preprocessor was a large overhead for programs that used trigraphs, performance was not degraded for source code that did not contain them.
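
The per-character check described above can be sketched as a single pass over a buffer. This is an illustrative sketch only, not the algorithm used by any particular translator; the names `trigraph_char` and `replace_trigraphs` are hypothetical. The string literals in the sketch deliberately avoid two adjacent question marks, so the sketch itself is unaffected by trigraph processing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: maps the third character of a candidate trigraph
   to its replacement, or returns '\0' if no trigraph ends with it. */
static char trigraph_char(char third)
{
    static const char from[] = "=()!'<>/-";
    static const char to[]   = "#[]|^{}\\~";
    const char *p;

    if (third == '\0')          /* strchr would match the terminator */
        return '\0';
    p = strchr(from, third);
    return (p != NULL) ? to[p - from] : '\0';
}

/* Replaces every trigraph in the null-terminated buffer s, in place,
   scanning left to right. Returns the new length in bytes. */
size_t replace_trigraphs(char *s)
{
    const char *in = s;
    char *out = s;

    assert(s != NULL);
    while (*in != '\0') {
        char r = '\0';

        /* in[1] and in[2] are read only after the preceding byte is
           known to be a question mark, so the null is never passed. */
        if (in[0] == '?' && in[1] == '?')
            r = trigraph_char(in[2]);
        if (r != '\0') {
            *out++ = r;         /* three source bytes become one */
            in += 3;
        } else {
            *out++ = *in++;     /* any other byte is copied unchanged */
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Because the output never grows, the replacement can be done in place. The cost the commentary mentions is visible in the loop: every byte is tested against a question mark before anything else is done with it.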

Usage

There are insufficient trigraphs in the visible form of the .c files to enable any meaningful analysis of the usage of different trigraphs to be made.

234 No other trigraph sequences exist.

Commentary

The set of characters for which trigraphs were created to provide an alternative spelling is known, and unlikely to be extended.

Coding Guidelines

Although no other trigraph sequences exist, sequences of two adjacent question marks in string literals may lead to confusion. Developers may be unsure about whether they represent a trigraph or not. Using the escape sequence \? on at least one of these question marks can help clarify the intent.

Example

    char *unknown_trigraph = "??++";
    char *cannot_be_trigraph = "?\?--";

Usage

The visible form of the .c files contained 593 (.h 10) instances of two question marks (i.e., ??) in string literals that were not followed by a character that would have created a trigraph sequence.

235 Each ? that does not begin one of the trigraphs listed above is not changed.

Commentary

Two ?s followed by any character other than those listed above is not a trigraph.

Common Implementations

No implementation is known to define any other sequence of ?s to be replaced by other characters.

Coding Guidelines

No other trigraph sequences are defined by the standard, have been notified for future addition to the standard, or are used in known implementations. Placing restrictions on other uses of other sequences of ?s provides no benefit.

236 EXAMPLE 1

    ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)

becomes

    #define arraycheck(a,b) a[b] || b[a]

Commentary

This example was added by the response to DR #310 and is intended to show a common trigraph usage.

237 EXAMPLE 2 The following source line

    printf("Eh???/n");

becomes (after replacement of the trigraph sequence ??/)

    printf("Eh?\n");

Commentary

This illustrates the sometimes surprising consequences of trigraph processing.

### 5.2.1.2 Multibyte characters

238 The source character set may contain multibyte characters, used to represent members of the extended character set.

Commentary

The mapping from physical source file multibyte characters to the source character set occurs in translation phase 1. Whether multibyte characters are mapped to UCNs, single characters (if possible), or remain as multibyte characters depends on the model used by the implementation.

C++

The representations used for multibyte characters, in source code, invariably involve at least one character that is not in the basic source character set:

Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (2.1p1)

The C++ Standard does not discuss the issue of a translator having to process multibyte characters during translation. However, implementations may choose to replace such characters with a corresponding universal-character-name.

Other Languages

Most programming languages do not contain the concept of multibyte characters.

Common Implementations

Support for multibyte characters in identifiers, using a shift state encoding, is sometimes seen as an extension. Support for multibyte characters in this context using UCNs is new in C99. The most common implementations have been created to support the various Japanese character sets.

Coding Guidelines

The standard does not define how multibyte characters are to be represented. Any program that contains them is dependent on a particular implementation to do the right thing.
Converting programs that existed before support for universal character names became available may not be economically viable.

Some coding guideline documents recommend against the use of characters that are not specified in the C Standard. Simply prohibiting multibyte characters because they rely on implementation-defined behavior ignores the cost/benefit issues applicable to the developers who need to read the source. These are complex issues, and your author has insufficient experience to frame any applicable guideline recommendations.

239 The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set.

Commentary

Multibyte characters could be read from a file during program execution, or even created by assigning byte values to contiguous array elements. These multibyte sequences could then be interpreted by various library functions as representing certain (wide) characters.

The execution character set need not be fixed at translation time. A program's locale can be changed at execution time (by a call to the setlocale function). Such a change of locale can alter how multibyte characters are interpreted by a library function.

C++

There is no explicit statement about such behavior being permitted in the C++ Standard. The C header (specified in Amendment 1 to C90) is included by reference and so the support it defines for multibyte characters needs to be provided by C++ implementations.

Other Languages

Most languages do not include library functions for handling multibyte characters.

Coding Guidelines

Use of multibyte characters during program execution is an applications issue that is outside the scope of these coding guidelines.

240 For both character sets, the following shall hold:

Commentary

This is a set of requirements that applies to an implementation. It is the minimum set of guaranteed requirements that a program can rely on.

Coding Guidelines

The set of requirements listed in the following C-sentences is fairly general. Dealing with implementations that do not meet the requirements listed in these sentences is outside the scope of these coding guidelines.

241 - The basic character set shall be present and each character shall be encoded as a single byte.

Commentary

This is a requirement on the implementation. It prevents an implementation from being purely multibyte-based. The members of the basic character set are guaranteed to always be available and fit in a byte.

Common Implementations

An implementation that includes support for an extended character set might choose to define CHAR_BIT to be 16 (most of the commonly used characters in ISO 10646 are representable in 16 bits, each in UTF-16; at least those likely to be encountered outside of academic research and the traditional Chinese written in Hong Kong). Alternatively, an implementation may use an encoding where the members of the basic character set are representable in a byte, but some members of the extended character set require more than one byte for their encoding. One such representation is UTF-8.

242 - The presence, meaning, and representation of any additional members is locale-specific.

Commentary

On program startup the execution locale is the "C" locale. During execution it can be set under program control. The standard is silent on what the translation time locale might be.

Common Implementations

The full Ascii character set is used by a large number of implementations.

Coding Guidelines

It often comes as a surprise to developers to learn what characters the C Standard does not require to be provided by an implementation.
Source code readability could be affected if any of these additional members appear within comments and cannot be meaningfully displayed. Balancing the benefits of using additional members against the likelihood of not being able to display them is a management issue.

The use of any additional members during the execution of a program will be driven by the user requirements of the application. This issue is outside the scope of these coding guidelines.

243 - A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.

Commentary

State-dependent encodings are essentially finite state machines. When a state-dependent encoding, or any multibyte encoding, is being used, the number of characters in a string literal is not the same as the number of bytes encountered before the null character. There is no requirement that the sequence of shift states and characters representing an extended character be unique.

There are situations where the visual appearance of two or more characters is considered to be a single character. For instance (using ISO 10646 as the example encoding), the two characters LATIN SMALL LETTER O (U+006F) followed by COMBINING CIRCUMFLEX ACCENT (U+0302) represent the grapheme cluster (the ISO 10646 term[334] for what might be considered a user character) ô, not the two characters o ^. Some languages use grapheme clusters that require more than one combining character, for instance ô̄. Unicode (not ISO 10646) defines a canonical accent ordering to handle sequences of these combining characters. The so-called combining characters are defined to combine with the character that comes immediately before them in the character stream.

For backwards compatibility with other character encodings, and ease of conversion, the ISO 10646 Standard provides explicit codes for some accented characters; for instance, LATIN SMALL LETTER O WITH CIRCUMFLEX (U+00F4) also denotes ô.

A character that is capable of standing alone, the o above, is known as a base character. A character that modifies a base character, the circumflex accent above, is known as a combining character (the visible forms of some combining characters are called diacritic characters). Most character encodings do not contain any combining characters, and those that do contain them rarely specify whether they should occur before or after the modified base character. Claims that a particular standard requires the combining character to occur before the base character it modifies may be based on a misunderstanding. For instance, ISO/IEC 6937 specifies a single-byte encoding for base characters and a double-byte encoding for some visual combinations of (diacritic + base) Latin letter. These double-byte encodings are precomposed in the sense that they represent a single character; there is no single-byte encoding for the diacritic character, and the representation of the second byte happens to be the same as that of the single-byte representation of the corresponding base character (e.g., 0xC14F represents LATIN CAPITAL LETTER O WITH GRAVE and 0xC16F represents LATIN SMALL LETTER O WITH GRAVE).

C90

The C90 Standard specified implementation-defined shift states rather than locale-specific shift states.

C++

The definition of multibyte character, 1.3.8, says nothing about encoding issues (other than that more than one byte may be used). The definition of multibyte strings, 17.3.2.1.3.2, requires the multibyte characters to begin and end in the initial shift state.
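
The difference between bytes, characters, and what a reader sees as one character can be made concrete with a small sketch. It assumes the strings are encoded in UTF-8 (one of the encodings mentioned under sentence 241); the function name `utf8_code_points` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: counts the ISO 10646 code points in a
   null-terminated string, assuming it holds well-formed UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        /* Continuation bytes have the bit pattern 10xxxxxx; every
           other byte starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Under the UTF-8 assumption, the precomposed ô (U+00F4) is the two bytes `"\xC3\xB4"` and one code point, while o followed by the combining circumflex (U+006F U+0302) is the three bytes `"o\xCC\x82"` and two code points. Both are displayed as the same single user character, so neither the byte count nor the code point count tells a program how many characters a reader will see.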

Common Implementations

Most methods for state-dependent encoding are based on ISO/IEC 2022:1994 (identical to the standard ECMA-35 “Character Code Structure and Extension Techniques”, freely available from their Web site, http://www.ecma.ch). This uses a different structure than that specified in ISO/IEC 10646–1.

The encoding method defined by ISO 2022 supports both 7-bit and 8-bit codes. It divides these codes up into control characters (known as C0 and C1) and graphics characters (known as G0, G1, G2, and G3). In the initial shift state the C0 and G0 characters are in effect.

Table 243.1: Commonly seen ISO 2022 Control Characters. The alternative values for SS2 and SS3 are only available for 8-bit codes.

| Name | Acronym | Code Value | Meaning |
|---|---|---|---|
| Escape | ESC | 0x1b | Escape |
| Shift-In | SI | 0x0f | Shift to the G0 set |
| Shift-Out | SO | 0x0e | Shift to the G1 set |
| Locking-Shift 2 | LS2 | ESC 0x6e | Shift to the G2 set |
| Locking-Shift 3 | LS3 | ESC 0x6f | Shift to the G3 set |
| Single-Shift 2 | SS2 | ESC 0x4e, or 0x8e | Next character only is in G2 |
| Single-Shift 3 | SS3 | ESC 0x4f, or 0x8f | Next character only is in G3 |

Some of the control codes and their values are listed in Table 243.1. The codes SI, SO, LS2, and LS3 are known as locking shifts. They cause a change of state that lasts until the next control code is encountered. A stream that uses locking shifts is said to use stateful encoding.

ISO 2022 specifies an encoding method: it does not specify what the values within the range used for graphic characters represent. This role is filled by other standards, such as ISO 8859. A C implementation that supports a state-dependent encoding chooses which character sets are available in each state that it supports (the C Standard only defines the character set for the initial shift state).

Table 243.2: An implementation where G1 is ISO 8859–1, and G2 is ISO 8859–7 (Greek).

| Encoded values | 0x62 | 0x63 | 0x64 | 0x0e | 0xe6 | 0x1b 0x6e | 0xe1 | 0xe2 | 0xe3 | 0x0f |
|---|---|---|---|---|---|---|---|---|---|---|
| Control character | | | | SO | | LS2 | | | | SI |
| Graphic character | a | b | c | | æ | | α | β | γ | |

Having to rely on implicit knowledge of what character set is intended to be used for G1, G2, and so on, is not always satisfactory. A method of specifying the character sets in the sequence of bytes is needed. The ESC control code provides this functionality by using two or more following bytes to specify the character set (ISO maintains a registry of coded character sets). It is possible to change between character sets without any intervening characters. Table 243.3 lists some of the commonly used Japanese character sets.

C source code written by Japanese developers probably has the highest usage of shift sequences. There are several JIS (Japanese Industrial Standard) documents specifying representations for such sequences. Shift JIS (developed by Microsoft) belies its name and does not involve shift sequences that use a state-dependent encoding.

Table 243.3: ESC codes for some of the character sets used in Japanese.

| Character Set | Byte Encoding | Visible Ascii Representation |
|---|---|---|
| JIS C 6226–1978 | 1B 24 40 | `ESC $ @` |
| JIS X 0208–1983 | 1B 24 42 | `ESC $ B` |
| JIS X 0208–1990 | 1B 26 40 1B 24 42 | `ESC & @ ESC $ B` |
| JIS X 0212–1990 | 1B 24 28 44 | `ESC $ ( D` |
| JIS-Roman | 1B 28 4A | `ESC ( J` |
| Ascii | 1B 28 42 | `ESC ( B` |
| Half width Katakana | 1B 28 49 | `ESC ( I` |

Table 243.4: A JIS encoding of the character sequence かな漢字 (“kana and kanji”).

| Encoded values | 0x1b 0x24 0x42 | 0x242b | 0x244a | 0x3441 | 0x3b7a | 0x1b 0x28 0x4a |
|---|---|---|---|---|---|---|
| Control character | `ESC $ B` | | | | | `ESC ( J` |
| Graphic character | | か | な | 漢 | 字 | |
| Ascii characters | | `$+` | `$J` | `4A` | `;z` | |

Coding Guidelines

Developers do not need to remember the numerical values for extended characters. The editor, or program development environment, used to create the source code invariably looks after the details (generating any escape sequences and the appropriate byte values for the extended character selected by the developer).
How these tools decide to encode multibyte character sequences is outside the scope of these coding guidelines.

It is usually possible to express an extended character in a minimal number of bytes using a particular state-dependent encoding. The extent to which developers might create fixed-length data structures on the assumption that multibyte characters will not contain any redundant shift sequences is outside the scope of this book. The value of the MB_LEN_MAX macro places an upper limit on the number of possible redundant shift sequences.

Example

    #include <stdio.h>

    /* In this listing ^[ denotes the ESC byte (0x1b); ^N and ^O denote SO and SI. */
    char *p1 = "^[$B$3$l$OF|K\8lI=8=^[(J"; /* ^[$BF|K\8lJ8;zNs^[(J */
    char *p2 = "^[$B$3$l$OF|1Q^[(Jmixed^[$BJ8;zNs^[(J"; /* Ascii + ^[$BF|K\8l^[(J */
    char *p3 = "^[$B$3$l$OH>3Q^[(J^N6@6E^O^[$B$H^[(JASCII^[$B:.9g^[(J";

    int main(void)
    {
        printf("%s^[$B$H^[(J%s^[$B$H^[(J%s\n", p1, p2, p3);
    }

244 While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state.

Commentary

The implementation of a stateful encoding has to pick a special character, which is not in the basic character set, to indicate the start of a shift sequence. When not in the initial shift state, it is very unlikely that single bytes will be interpreted the same way as when in the initial shift state.

C++

The C++ Standard does not explicitly specify this requirement.

Common Implementations

The ESC character, 0x1b, is commonly used to indicate the start of a shift sequence.

245 Footnote 12) The trigraph sequences enable the input of characters that are not defined in the Invariant Code Set as described in ISO/IEC 646, which is a subset of the seven-bit US ASCII code set.

Commentary

When trigraphs are used, it is possible to write C source code that contains only those characters that are in the Invariant Code Set of ISO/IEC 646.

C90

The C90 Standard explicitly referred to the 1983 version of the ISO/IEC 646 standard.

246 The interpretation for subsequent bytes in the sequence is a function of the current shift state.

Commentary

This wording is really a suggestion for the design of multibyte shift states (it is effectively describing the processing performed by finite state machines, which is what a shift state encoding is). Being able to interpret a byte independent of the current shift state would indicate that the sequence of bytes that resulted in the current state was redundant.

The specification of the macro MB_LEN_MAX requires that the maximum number of bytes needed to handle a supported multibyte character be provided. It may, or may not, be possible to represent some redundant shift sequence within the available bytes. The standard does not explicitly require or prohibit support for redundant shift sequences.

C++

A set of virtual functions for handling state-dependent encodings, during program execution, is discussed in Clause 22, Localization library. But this requirement is not specified.
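
The idea that the interpretation of a byte is a function of the current shift state can be sketched as a two-state machine. This is a simplified illustration, not a complete ISO 2022 decoder: it recognizes only four of the escape sequences from Table 243.3 (`ESC $ @` and `ESC $ B` select a two-byte set; `ESC ( B` and `ESC ( J` select a one-byte set). The names `shift_state` and `count_shifted_chars` are hypothetical.

```c
#include <stddef.h>

enum shift_state { SINGLE_BYTE, DOUBLE_BYTE };

/* Hypothetical helper: counts the characters in a null-terminated byte
   sequence that starts in the initial (single-byte) shift state.
   Returns (size_t)-1 for an unrecognized escape sequence or a
   truncated two-byte character. */
size_t count_shifted_chars(const unsigned char *s)
{
    enum shift_state state = SINGLE_BYTE;
    size_t count = 0;

    while (*s != 0) {
        if (*s == 0x1b) {
            /* Shift sequence: changes the state, encodes no character.
               s[2] is only read when s[1] is not the null byte. */
            if (s[1] == '$' && (s[2] == '@' || s[2] == 'B'))
                state = DOUBLE_BYTE;
            else if (s[1] == '(' && (s[2] == 'B' || s[2] == 'J'))
                state = SINGLE_BYTE;
            else
                return (size_t)-1;
            s += 3;
        } else if (state == DOUBLE_BYTE) {
            if (s[1] == 0)
                return (size_t)-1;  /* second byte is missing */
            s += 2;                 /* two bytes form one character */
            ++count;
        } else {
            s += 1;                 /* initial state: one byte per character */
            ++count;
        }
    }
    return count;
}
```

Applied to the 14 bytes of Table 243.4, the sketch reports four characters. The same byte value, for example 0x24, is a complete character in the single-byte state and only half of one in the double-byte state.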
Common Implementations
Implementations usually use a simple finite state machine, often automatically generated, to handle the mapping of shift states into their execution character value. The extent to which sequences of redundant shift sequences are supported will depend on the implementation.

Coding Guidelines
The sequence of bytes in a shift sequence is usually generated via some automated process. For this reason a guideline recommending against the use of redundant shift sequences is unlikely to be enforceable, and none is given.

247 - A byte with all bits zero shall be interpreted as a null character independent of shift state.

Commentary
This is a requirement on the implementation. This requirement makes it possible to search for the end of a string without needing any knowledge of the encoding that has been used. For instance, string-handling functions can copy multibyte characters without interpreting their contents.
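As an illustration of this point, the following sketch duplicates a multibyte string using only byte-oriented library functions. The function name copy_multibyte_string is hypothetical; strlen, malloc, and memcpy are the standard library functions. Nothing in it knows which encoding, or which shift state, the bytes are in.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Duplicates a multibyte string by copying raw bytes. It relies only on the
   guarantee that an all-bits-zero byte is the terminating null character. */
char *copy_multibyte_string(const char *src)
{
    assert(src != NULL);
    size_t bytes = strlen(src) + 1;   /* a byte count, not a character count */
    char *dst = malloc(bytes);

    if (dst != NULL)
        memcpy(dst, src, bytes);      /* shift sequences are copied verbatim */
    return dst;                       /* caller frees; NULL if allocation failed */
}
```

Note that strlen returns the number of bytes before the null character. Counting characters, as opposed to bytes, does require knowledge of the encoding.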
C++
". . . , plus a null character (respectively, null wide character), whose representation has all zero bits." (2.2p3)

While the C++ Standard does not rule out the possibility of all bits zero having another interpretation in other contexts, other requirements (17.3.2.1.3.1p1 and 17.3.2.1.3.2p1) restrict these other contexts, as do existing character set encodings.

248 Such a byte shall not occur as part of any other multibyte character. (This replaces the earlier wording: "A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.")

Commentary
This is a requirement on the implementation. The effect of this requirement is that partial multibyte characters cannot be created (otherwise the behavior is undefined). A null character can only exist outside of the sequence of bytes making up a multibyte character. For source files this requirement follows from the requirement to end in the initial shift state. During program execution this requirement means that library functions processing multibyte characters do not need to concern themselves with handling partial multibyte characters at the end of a string.

The wording was changed by the response to DR #278 (it is a requirement on the implementation that forbids a two-byte character from having a first, or any, byte that is zero).

C++
This requirement can be deduced from the definition of null terminated byte strings, 17.3.2.1.3.1p1, and null terminated multibyte strings, 17.3.2.1.3.2p1.

249 For source files, the following shall hold:

Commentary
These C-sentences specify requirements on a program. A program that violates them exhibits undefined behavior. Use of multibyte characters can involve locale-specific and implementation-defined behaviors.
A source file does not affect the conformance status of any program built using it, provided its use of multibyte characters either involves locale-specific behavior or the implementation-defined behavior does not affect program output (e.g., they appear in comments).

Coding Guidelines
The creation of multibyte characters within source files is usually handled by an editor. The developer's involvement in the process is limited to selecting the appropriate character. In such an environment the developer has no control over the byte sequences used. A guideline recommending against such usage is likely to be impractical to implement and none is given.

250 - An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.

Commentary
These are the only tokens that can meaningfully contain a multibyte character. A token containing a multibyte character should not affect the processing of subsequent tokens. Without this requirement a token that did not end in the initial shift state would be likely to affect the processing of subsequent tokens.

C90
Support for multibyte characters in identifiers is new in C99.
C++
In C++ all characters are mapped to the source character set in translation phase 1. Any shift state encoding will not exist after translation phase 1, so the C requirement is not applicable to C++ source files.

Coding Guidelines
The fact that many multibyte sequences are created automatically, by an editor, can make it very difficult for a developer to meet this requirement. A developer is unlikely to intentionally end a preprocessing token, created using a multibyte sequence, in other than the initial state. A coding guideline is unlikely to be of benefit.

251 - An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.

Commentary
What is a valid multibyte character? This decision can only be made by a translator, should it choose to accept multibyte characters. In C90 it was relatively easy to lexically process a source file containing multibyte characters. The context in which these characters occurred often meant that a lexer simply had to look for the character that terminated the kind of token being processed (unless that character occurred as part of a multibyte character). Identifier tokens do not have a single termination character. This means that it is not possible to generalise support for multibyte characters in identifiers across all translators. It is possible that source containing a multibyte character identifier supported by one translator will cause another translator to issue a diagnostic.

C90
Support for multibyte characters in identifiers is new in C99.

C++
In C++ all characters are mapped to the source character set in translation phase 1. Any shift state encoding will not exist after translation phase 1, so the C requirement is not applicable to C++ source files.
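The two properties required of these tokens (a sequence of valid multibyte characters that ends in the initial shift state) can also be checked for a byte string during program execution. The following is a minimal sketch using the standard mbrlen and mbsinit functions from <wchar.h>; the function name is_valid_multibyte_string is hypothetical, and what counts as valid depends entirely on the current locale and on the C library's support for its encoding. A translator performing the equivalent check on source files is free to do so in some other way.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Returns true if s is a sequence of valid multibyte characters, in the
   encoding of the current locale, that ends in the initial shift state. */
bool is_valid_multibyte_string(const char *s)
{
    assert(s != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);  /* zero-valued: an initial conversion state */
    size_t remaining = strlen(s);     /* the terminating null is not converted */

    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);

        if (n == (size_t)-1)
            return false;             /* encoding error */
        if (n == (size_t)-2 || n == 0)
            break;                    /* remaining bytes consumed, or unexpected null */
        s += n;
        remaining -= n;
    }
    /* False if a partial character or a non-initial shift state is pending. */
    return mbsinit(&state) != 0;
}
```

A program would normally call setlocale(LC_CTYPE, "") before using this function, so that mbrlen interprets the bytes according to the user's environment rather than the "C" locale.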
Coding Guidelines
In some cases source files can contain multibyte characters and be translated by translators that have no knowledge of the structure of these multibyte characters. The developer is relying on the translator ignoring them in comments containing their native language, or simply copying the character sequence in a string literal into the program image. In other cases, for instance identifiers, knowledge of the encoding used for the multibyte character set is likely to be needed by a translator. Ensuring the use of a translator capable of handling any multibyte characters occurring in the source is a configuration-management issue that is outside the scope of these coding guidelines.

5.2.2 Character display semantics

Commentary
There is no guarantee that a character display will exist on any hosted implementation. If such a device is supported by an implementation, this clause specifies its attributes.

C++
Clause 18 mentions “display as a wstring” in Notes:. However, there is no other mention of display semantics anywhere in the standard.

Common Implementations
Most Unix-based environments contain a database of terminal capabilities, the so-called termcap database.[1332] This database provides information to the host on a large number of terminal capabilities and characteristics. Knowing the display device currently being used (this usually relies on the user setting an environment variable) enables the database to be queried for device attribute information. This information can then be used by an application to handle its output to display devices. There is a similar database of information on printer characteristics.
252 The active position is that location on a display device where the next character output by the fputc function would appear.

Commentary
This defines the term active position; however, the term current cursor position is more commonly used by developers. The wide character output functions act as if fputc is called.

C++
C++ has no concept of active position. The fputc function appears in "Table 94" as one of the functions supported by C++.

Other Languages
Most languages don’t get involved in such low-level I/O details.

253 The intent of writing a printing character (as defined by the isprint function) to a display device is to display a graphic representation of that character at the active position and then advance the active position to the next position on the current line.

Commentary
The standard specifies an intent, not a requirement. Some devices produce output that cannot be erased later (e.g., printing to paper) while other devices always display the last character output at a given position (e.g., VDUs). The ability of printers to display two or more characters at the same position is sometimes required. For instance, programs wanting to display the ô character on a wide variety of printers might generate the sequence o, backspace, ^ (all of these characters are contained in the invariant subset of ISO 646).

The intended behavior describes the movement of the active position, not the width of the character displayed. There is nothing in this definition to prevent the writing of one character affecting previously written characters (which can occur in Arabic). This specification implies that the positions are a fixed width apart. The graphic representation of a character is known as a glyph.

C++
The C++ Standard does not discuss character display semantics.

Common Implementations
In some oriental languages, character glyphs can usually be organized into two groups, one being twice the width of the other.
Implementations in these environments often use a fixed width for each glyph, creating empty spaces between some glyph pairs. Some orthographies, which use an alphabetic representation, contain single characters that use what appears to be two characters in their visual representation. For instance, the character denoted by the Unicode value U+00C6 is Æ, and the character denoted by the Unicode value U+01C9 is ǉ. Both representations are considered to be a single character (the former is also a single letter, while the latter is two letters).

Coding Guidelines
The concept of active position is useful for describing the basic set of operations supported by the C Standard. An application's requirements for displaying characters may, or may not, be feasible within the functionality provided by the standard; this is a top-level application design issue. How characters appear on a display device is an application user interface issue that is outside the scope of this book.

254 The direction of writing is locale-specific.
Commentary
Although left-to-right is used by many languages, this direction is not the only one used. Arabic uses right-to-left (also Hebrew, Urdu, and Berber). In Japanese it is possible for the direction to be from top to bottom with the lines going right-to-left (mainland Chinese has the columns going from left-to-right, in Taiwan they go right-to-left), or left-to-right with the lines going top to bottom (the same directional conventions as English).

There is no requirement that the direction of writing always be the same; for instance, braille alternates in direction between adjacent lines (known as boustrophedon), as do Egyptian hieroglyphs, Mayan, and Hittite. Some Egyptian hieroglyphic characters can face either to the left or right, information that readers can use to deduce the direction in which a line should be read.

Some applications need to simultaneously handle locales where the direction of writing is different, for instance, a word processor that supports the use of Hebrew and English in the same document. This level of support is outside the scope of the C Standard.

C++
The C++ Standard does not discuss character display semantics.

Coding Guidelines
The direction of writing is an application issue. Any developer who is concerned with the direction of writing will, of necessity, require a deeper involvement with this topic than the material covered by the C Standard or these coding guidelines.

Example
The direction of writing can change during program execution. For instance, in a word processor that handles both English and Arabic or Hebrew, the character sequence ABCdefGHJ (using lowercase to represent English and uppercase to represent Arabic/Hebrew) might appear on the display as JHGdefCBA.

255 If the active position is at the final position of a line (if there is one), the behavior of the display device is unspecified.
Commentary
The Committee recognized that there is no commonality of behavior exhibited by existing display devices when the final position on a line is reached.

C++
The C++ Standard does not discuss character display semantics.

Common Implementations
Some display devices wrap onto the next line, effectively generating an extra new-line character. Other devices write all subsequent characters, up to the next new-line character, at the final position. On some displays, writing to the bottom right corner of a display has an effect other than displaying the character output, for instance, clearing the screen or causing it to scroll. Both termcap and ncurses provide configuration options that specify whether writing to this display location has the desired effect.

Coding Guidelines
Organizing the characters on a display device is an application domain issue. The fact that the C Standard does not provide a defined method of handling the situation described here needs to be dealt with, if applicable, during the design process. This is outside the scope of these coding guidelines.

256 Alphabetic escape sequences representing nongraphic characters in the execution character set are intended to produce actions on display devices as follows:

Commentary
This is the behavior of Ascii terminals enshrined in the C Standard.

Rationale