The New C Standard- P4

5.2.1 Character sets

Table 221.2: Relative frequency (most common to least common, with parentheses used to bracket extremely rare letters) of letter usage in various human languages (the English ranking is based on the British National Corpus). Based on Kelk.[729]

    Language     Letters
    English      etaoinsrhldcumfpgwybvkxjqz
    French       esaitnrulodcmpévqfbghjàxèyêzâçîùôûïkëw
    Norwegian    erntsilakodgmvfupbhøjyåæcwzx(q)
    Swedish      eantrsildomkgväfhupåöbcyjxwzéq
    Icelandic    anriestuðlgmkfhvoáþídjóbyæúöpé` cxwzq y
    Hungarian    ˝ eatlnskomzrigáéydbvhjofupöócuíúüxw(q) ˝

222 The representation of each member of the source and execution basic character sets shall fit in a byte.

Commentary
This is a requirement on the implementation. The definition of character already specifies that it fits in a byte. However, a character constant has type int, which could be thought to imply that the value representation of characters need not fit in a byte. This wording clarifies the situation. The representation of members of the basic execution character set is also required to be a nonnegative value.
See also: 59 character single-byte; 883 character constant type; 478 basic character set positive if stored in char object.

C++
    A byte is at least large enough to contain any member of the basic execution character set and . . . (1.7p1)
This requirement reverses the dependency given in the C Standard, but the effect is the same.

Common Implementations
On hosts where characters have a width of 16 or 32 bits, that choice has usually been made because of addressability issues (pointers only being able to point at storage on 16- or 32-bit address boundaries). It is not usually necessary to increase the size of a byte because of representational issues to do with the character set.
In the EBCDIC character set, the value of 'a' is 129 (in Ascii it is 97). If the implementation-defined value of CHAR_BIT is 8, then this character, and some others, will not be representable in the type signed char (in most implementations the representation actually used is the negative value whose least significant eight bits are the same as the corresponding bits of the positive value in the character set). In such implementations the type char will need to have the same representation as the type unsigned char.
The ICL 1900 series used a 6-bit byte. Implementing this requirement on such a host would not have been possible.
See also: 307 CHAR_BIT macro.

Coding Guidelines
A general principle of coding guidelines is to recommend against the use of representation information. In this case the standard is guaranteeing that a character will fit within a given amount of storage. Relying on this requirement might almost be regarded as essential in some cases.
See also: 569.1 representation information using.

Example
    void f(void)
    {
        char C_1 = 'W';        /* Guaranteed to fit in a char. */
        char C_2 = '$';        /* Not guaranteed to fit in a char. */
        signed char C_3 = 'W'; /* Not guaranteed to fit in a signed char. */
    }

June 24, 2009 v 1.2
223 In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.

Commentary
This is a requirement on the implementation. The Committee realized that a large number of existing programs depended on this statement being true. It is certainly true for the two major character sets used in the English-speaking world, Ascii and EBCDIC, and for all of the human language digit encodings specified in Unicode (see Table 797.1). The Committee thus saw fit to bless this usage.
Not only is it possible to perform relational comparisons on the digit characters (e.g., '0'
Commentary
This is a requirement on the implementation. The C library makes a distinction between text and binary files. However, there is no requirement that source files exist in either of these forms. The worst-case scenario: in a host environment that did not have a native method of delimiting lines, an implementation would have to provide/define its own convention and supply tools for editing such files. Some integrated development environments do define their own conventions for storing source files and other associated information.

C++
The C++ Standard does not specify this level of detail (although it does refer to end-of-line indicators, 2.1p1n1).

Common Implementations
Unicode Technical Report #13, "Unicode newline guidelines", discusses the issues associated with representing new-lines in files. The ISO 6429 standard also defines NEL (NExt Line, hexadecimal 0x85) as an end-of-line indicator.
The Microsoft Windows convention is to indicate end-of-line with a carriage return/line feed pair, \r\n (a convention that goes back through CP/M to DEC RT-11); the Unix convention is to use a single line feed character, \n; the Macintosh convention is to use the carriage return character, \r.
Some mainframes implement a form of text file that mimics punched cards by having fixed-length lines. Each line contains the same number of characters, often 80. The space after the last user-written character is sometimes padded with spaces; at other times it is padded with null characters.

225 this International Standard treats such an end-of-line indicator as if it were a single new-line character.

Commentary
The standard is not interested in the details of the byte representation of end-of-line on storage media. It makes use of the concept of end-of-line and uses the conceptual simplification of treating it as if it were a single character.
See also: 116 translation phase 1.

C++
    . . . (introducing new-line characters for end-of-line indicators) . . . (2.1p1n1)

226 In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line.

Commentary
This is a requirement on the implementation. These characters form part of the set of 96 execution character set members (counting the null character) defined by the standard, plus new line, which is introduced in translation phase 1. However, these characters are not in the basic source character set, and are represented in it using escape sequences.
See also: 221 basic execution character set; 116 translation phase 1; 866 escape sequence syntax.

Other Languages
Few other languages include the concept of control characters, although many implementations provide semantics for them in source code (they are usually mapped exactly from the source to the execution character set). Java defines the same control characters as C and gives them their equivalent Ascii values. However, it does not define any semantics for these characters.

Common Implementations
ECMA-48 Control Functions for Coded Character Sets, Fifth Edition (available free from their Web site, http://www.ecma-international.ch) was fast-tracked as the third edition of ISO/IEC 6429. This standard defines significantly more control functions than those specified in the C Standard.
227 If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

Commentary
The standard does not prohibit such characters from occurring in a source file outright. The Committee was aware of implementations that used such characters to extend the language, for instance, the use of the @ character in an object definition to specify its address in storage.
The list of exceptions is extensive. The only usage remaining for such characters is as a punctuator. Any other character has to be accepted as a preprocessing token. It may subsequently, for instance, be stringized. It is the attempt to convert this preprocessing token into a token where the undefined behavior occurs.
See also: 1950 # operator; 137 preprocessing token converted to token.

C90
Support for additional characters in identifiers is new in C99.

C++
    Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (2.1p1)
The C++ Standard specifies the behavior and a translator is required to handle source code containing such a character. A C translator is permitted to issue a diagnostic and fail to translate the source code.

Other Languages
Most languages regard the appearance of an unknown character in the source as some form of error. Like C, most language implementations support additional characters in string literals and comments.

Common Implementations
Most implementations generate a diagnostic, either when the preprocessing token containing one of these characters is converted to a token, or as a result of the very likely subsequent syntax violation.
Some implementations[728] define the @ character to be a token, its usual use being to provide the syntax for specifying the address at which an object is to be placed in storage. It is generally followed by an integer constant expression.

Coding Guidelines
An occurrence of a character outside of the basic source character set, in one of these contexts, is most likely to be a typing mistake and is very likely to be diagnosed by the translator. The other possibility is that such characters were intended to be used because use is being made of an extension. This issue is discussed elsewhere.
See also: 95.1 extensions cost/benefit.

Example
    static int glob @ 0x100; /* Put glob at location 0x100. */

228 A letter is an uppercase letter or a lowercase letter as defined above;

Commentary
This defines the term letter. There is a third kind of case that characters can have, titlecase (a term sometimes applied to words where the first letter is in uppercase, or titlecase, and the other letters are in lowercase). In most instances titlecase is the same as uppercase, but there are a few characters where this is not true; for instance, the titlecase of the Unicode character U+01C9, lj, is U+01C8, Lj, and its uppercase is U+01C7, LJ.
C90
This definition is new in C99.

229 in this International Standard the term does not include other characters that are letters in other alphabets.

Commentary
All implementations are required to support the basic source character set to which this terminology applies. Annex D lists those universal character names that can appear in identifiers. However, they are not referred to as letters (although they may well be regarded as such in their native language).
The term letter assumes that the orthography (writing system) of a language has an alphabet. Some orthographies, for instance Japanese, don't have an alphabet as such (let alone the concept of upper- and lowercase letters). Even when the orthography of a language does include characters that are considered to be matching upper- and lowercase letters by speakers of that language (e.g., æ and Æ, å and Å), the C Standard does not define these characters to be letters.
See also: 792 orthography.

C++
The definition used in the C++ Standard, 17.3.2.1.3 (the footnote applies to C90 only), implies this is also true in C++.

Coding Guidelines
The term letter has a common usage meaning in a number of different languages. Developers do not often use this term in its C Standard sense. Perhaps the safest approach for coding guideline documents to take is to avoid use of this term completely.

230 The universal character name construct provides a way to name other characters.

Commentary
In theory all characters on planet Earth and beyond. In practice, those defined in ISO 10646.
See also: 28 ISO 10646.

C90
Support for universal character names is new in C99.

Other Languages
Other language standards are slowly moving to support ISO 10646. Java supports a similar concept.

Common Implementations
Support for these characters is relatively new. It will take time before similarities between implementations become apparent.
231 Forward references: universal character names (6.4.3), character constants (6.4.4.4), preprocessing directives (6.10), string literals (6.4.5), comments (6.4.9), string (7.1.1).

5.2.1.1 Trigraph sequences

232 Before any other processing takes place, each occurrence of one of the following sequences of three characters (called trigraph sequences12)) is replaced with the corresponding single character.

Commentary
Trigraphs were an invention of the C committee. They are a method of supporting the input (into source files, not executing programs) and the printing of some C source characters in countries whose alphabets, and keyboards, do not include them in their national character set. Digraphs, discussed elsewhere, are another sequence of characters that are replaced by a corresponding single character.
The \? escape sequence was introduced to allow sequences of ?s to occur within string literals.
The wording was changed by the response to DR #309; the earlier wording began "All occurrences in a source file".
See also: 916 digraphs; 895 string literal syntax.
Other Languages
Until recently many computer languages did not attempt to be as worldly as C, requiring what might be called an Ascii keyboard. Pascal specifies what it calls lexical alternatives for some lexical tokens. The character sequences making up these lexical alternatives are only recognized in a context where they can form a single, complete token.

Common Implementations
On the Apple Macintosh host, the notation '????' is used to denote the unknown file type. Translators in this environment often disable trigraphs by default to prevent unintended replacements from occurring.

233
    ??=  #      ??)  ]      ??!  |
    ??(  [      ??'  ^      ??>  }
    ??/  \      ??<  {      ??-  ~

Commentary
The above sequences were chosen to minimize the likelihood of breaking any existing, conforming, C source code.

Other Languages
Many languages use a small subset, or none, of these problematic source characters, reducing the potential severity of the problem. The Pascal standard specifies (. and .) as alternative lexical representations of [ and ] respectively.

Common Implementations
Recognizing trigraph sequences entails a check against every character read in by the translator. Performance profiling of translators has shown that a large percentage of time is spent in the lexer. A study by Waite[1469] found 41% of total translation time was spent in a handcrafted lexer (with little code optimization performed by the translator). An automatically produced lexer (the lex tool was used) consumed 3 to 5 times as much time.
One vendor, Borland, who used to take pride in, and was known for, the speed at which their translators operated, did not include trigraph processing in the main translator program. A stand-alone utility was provided to perform trigraph processing. Those few programs that used trigraphs needed to be processed by this utility, generating a temporary file that was processed by the main translator program. While using this pre-preprocessor was a large overhead for programs that used trigraphs, performance was not degraded for source code that did not contain them.

Usage
There are insufficient trigraphs in the visible form of the .c files to enable any meaningful analysis of the usage of different trigraphs to be made.

234 No other trigraph sequences exist.

Commentary
The set of characters for which trigraphs were created to provide an alternative spelling is known, and is unlikely to be extended.

Coding Guidelines
Although no other trigraph sequences exist, sequences of two adjacent question marks in string literals may lead to confusion. Developers may be unsure about whether they represent a trigraph or not. Using the escape sequence \? on at least one of these question marks can help clarify the intent.

Example
    char *unknown_trigraph = "??++";
    char *cannot_be_trigraph = "?\?--";
Usage
The visible form of the .c files contained 593 (.h 10) instances of two question marks (i.e., ??) in string literals that were not followed by a character that would have created a trigraph sequence.

235 Each ? that does not begin one of the trigraphs listed above is not changed.

Commentary
Two ?s followed by any character other than those listed above is not a trigraph.

Common Implementations
No implementation is known to define any other sequence of ?s to be replaced by other characters.

Coding Guidelines
No other trigraph sequences are defined by the standard, have been notified for future addition to the standard, or are used in known implementations. Placing restrictions on other uses of other sequences of ?s provides no benefit.

236 EXAMPLE 1
    ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
becomes
    #define arraycheck(a,b) a[b] || b[a]

Commentary
This example was added by the response to DR #310 and is intended to show a common trigraph usage.

237 EXAMPLE 2 The following source line
    printf("Eh???/n");
becomes (after replacement of the trigraph sequence ??/)
    printf("Eh?\n");

Commentary
This illustrates the sometimes surprising consequences of trigraph processing.

5.2.1.2 Multibyte characters

238 The source character set may contain multibyte characters, used to represent members of the extended character set.

Commentary
The mapping from physical source file multibyte characters to the source character set occurs in translation phase 1. Whether multibyte characters are mapped to UCNs, single characters (if possible), or remain as multibyte characters depends on the model used by the implementation.
See also: 60 multibyte character; 116 translation phase 1; 115 UCN models of.

C++
The representations used for multibyte characters, in source code, invariably involve at least one character that is not in the basic source character set:
    Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (2.1p1)
The C++ Standard does not discuss the issue of a translator having to process multibyte characters during translation. However, implementations may choose to replace such characters with a corresponding universal-character-name.
Other Languages
Most programming languages do not contain the concept of multibyte characters.

Common Implementations
Support for multibyte characters in identifiers, using a shift state encoding, is sometimes seen as an extension. Support for multibyte characters in this context using UCNs is new in C99. The most common implementations have been created to support the various Japanese character sets.
See also: 815 universal character name syntax.

Coding Guidelines
The standard does not define how multibyte characters are to be represented. Any program that contains them is dependent on a particular implementation to do the right thing. Converting programs that existed before support for universal character names became available may not be economically viable.
Some coding guideline documents recommend against the use of characters that are not specified in the C Standard. Simply prohibiting multibyte characters because they rely on implementation-defined behavior ignores the cost/benefit issues applicable to the developers who need to read the source. These are complex issues for which your author has insufficient experience with which to frame any applicable guideline recommendations.

239 The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set.

Commentary
Multibyte characters could be read from a file during program execution, or even created by assigning byte values to contiguous array elements. These multibyte sequences could then be interpreted by various library functions as representing certain (wide) characters.
The execution character set need not be fixed at translation time. A program's locale can be changed at execution time (by a call to the setlocale function). Such a change of locale can alter how multibyte characters are interpreted by a library function.
C++
There is no explicit statement about such behavior being permitted in the C++ Standard. The C header (specified in Amendment 1 to C90) is included by reference and so the support it defines for multibyte characters needs to be provided by C++ implementations.

Other Languages
Most languages do not include library functions for handling multibyte characters.

Coding Guidelines
Use of multibyte characters during program execution is an applications issue that is outside the scope of these coding guidelines.

240 For both character sets, the following shall hold:

Commentary
This is a set of requirements that applies to an implementation. It is the minimum set of guaranteed requirements that a program can rely on.

Coding Guidelines
The set of requirements listed in the following C-sentences is fairly general. Dealing with implementations that do not meet the requirements listed in these sentences is outside the scope of these coding guidelines.

241 - The basic character set shall be present and each character shall be encoded as a single byte.
Commentary
This is a requirement on the implementation. It prevents an implementation from being purely multibyte-based. The members of the basic character set are guaranteed to always be available and fit in a byte.
See also: 222 basic character set fit in a byte.

Common Implementations
An implementation that includes support for an extended character set might choose to define CHAR_BIT to be 16 (most of the commonly used characters in ISO 10646 are each representable in 16 bits in UTF-16; at least those likely to be encountered outside of academic research and the traditional Chinese written in Hong Kong). Alternatively, an implementation may use an encoding where the members of the basic character set are representable in a byte, but some members of the extended character set require more than one byte for their encoding. One such representation is UTF-8.
See also: 216 extended character set; 307 CHAR_BIT macro; 28 ISO 10646; 28 UTF-16; 28 UTF-8.

242 - The presence, meaning, and representation of any additional members is locale-specific.

Commentary
On program startup the execution locale is the "C" locale. During execution it can be set under program control. The standard is silent on what the translation time locale might be.

Common Implementations
The full Ascii character set is used by a large number of implementations.

Coding Guidelines
It often comes as a surprise to developers to learn what characters the C Standard does not require to be provided by an implementation. Source code readability could be affected if any of these additional members appear within comments and cannot be meaningfully displayed. Balancing the benefits of using additional members against the likelihood of not being able to display them is a management issue.
The use of any additional members during the execution of a program will be driven by the user requirements of the application. This issue is outside the scope of these coding guidelines.
243 - A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.

Commentary
State-dependent encodings are essentially finite state machines. When a state encoding, or any multibyte encoding, is being used, the number of characters in a string literal need not be the same as the number of bytes encountered before the null character. There is no requirement that the sequence of shift states and characters representing an extended character be unique.
See also: 215 extended characters.
There are situations where the visual appearance of two or more characters is considered to be a single character. For instance (using ISO 10646 as the example encoding), the two characters LATIN SMALL LETTER O (U+006F) followed by COMBINING CIRCUMFLEX ACCENT (U+0302) represent the grapheme cluster (the ISO 10646 term[334] for what might be considered a user character) ô, not the two characters o ^. Some languages use grapheme clusters that require more than one combining character, for instance ō̂. Unicode (not ISO 10646) defines a canonical accent ordering to handle sequences of these combining characters. The so-called combining characters are defined to combine with the character that comes immediately before them in the character stream. For backwards compatibility with other character encodings, and ease of conversion, the ISO 10646 Standard provides explicit codes for some accented characters; for instance, LATIN SMALL LETTER O WITH CIRCUMFLEX (U+00F4) also denotes ô.
A character that is capable of standing alone, the o above, is known as a base character. A character that modifies a base character, the circumflex accent above, is known as a combining character (the visible forms of some combining characters are called diacritic characters). Most character encodings do not contain any combining characters, and those that do contain them rarely specify whether they should occur before or after the modified base
character. Claims that a particular standard requires the combining character to occur before the base character it modifies may be based on a misunderstanding. For instance, ISO/IEC 6937 specifies a single-byte encoding for base characters and a double-byte encoding for some visual combinations of (diacritic + base) Latin letter. These double-byte encodings are precomposed in the sense that they represent a single character; there is no single-byte encoding for the diacritic character, and the representation of the second byte happens to be the same as that of the single-byte representation of the corresponding base character (e.g., 0xC14F represents LATIN CAPITAL LETTER O WITH GRAVE and 0xC16F represents LATIN SMALL LETTER O WITH GRAVE).

C90
The C90 Standard specified implementation-defined shift states rather than locale-specific shift states.

C++
The definition of multibyte character, 1.3.8, says nothing about encoding issues (other than that more than one byte may be used). The definition of multibyte strings, 17.3.2.1.3.2, requires the multibyte characters to begin and end in the initial shift state.

Common Implementations
Most methods for state-dependent encoding are based on ISO/IEC 2022:1994 (identical to the standard ECMA-35 "Character Code Structure and Extension Techniques", freely available from their Web site, http://www.ecma.ch). This uses a different structure than that specified in ISO/IEC 10646–1.
The encoding method defined by ISO 2022 supports both 7-bit and 8-bit codes. It divides these codes up into control characters (known as C0 and C1) and graphics characters (known as G0, G1, G2, and G3). In the initial shift state the C0 and G0 characters are in effect.

Table 243.1: Commonly seen ISO 2022 Control Characters. The alternative values for SS2 and SS3 are only available for 8-bit codes.

    Name             Acronym  Code Value          Meaning
    Escape           ESC      0x1b                Escape
    Shift-In         SI       0x0f                Shift to the G0 set
    Shift-Out        SO       0x0e                Shift to the G1 set
    Locking-Shift 2  LS2      ESC 0x6e            Shift to the G2 set
    Locking-Shift 3  LS3      ESC 0x6f            Shift to the G3 set
    Single-Shift 2   SS2      ESC 0x4e, or 0x8e   Next character only is in G2
    Single-Shift 3   SS3      ESC 0x4f, or 0x8f   Next character only is in G3

Some of the control codes and their values are listed in Table 243.1. The codes SI, SO, LS2, and LS3 are known as locking shifts. They cause a change of state that lasts until the next control code is encountered. A stream that uses locking shifts is said to use stateful encoding.
ISO 2022 specifies an encoding method: it does not specify what the values within the range used for graphic characters represent. This role is filled by other standards, such as ISO 8859. A C implementation that supports a state-dependent encoding chooses which character sets are available in each state that it supports (the C Standard only defines the character set for the initial shift state).
See also: 24 ISO 8859.

Table 243.2: An implementation where G1 is ISO 8859–1, and G2 is ISO 8859–7 (Greek).

    Encoded values     0x62  0x63  0x64  0x0e  0xe6  0x1b 0x6e  0xe1  0xe2  0xe3  0x0f
    Control character                    SO          LS2                          SI
    Graphic character  a     b     c           æ                α     β     γ

Having to rely on implicit knowledge of what character set is intended to be used for G1, G2, and so on, is not always satisfactory. A method of specifying the character sets in the sequence of bytes is needed. The
ESC control code provides this functionality by using two or more following bytes to specify the character set (ISO maintains a registry of coded character sets). It is possible to change between character sets without any intervening characters. Table 243.3 lists some of the commonly used Japanese character sets. C source code written by Japanese developers probably has the highest usage of shift sequences. There are several JIS (Japanese Industrial Standard) documents specifying representations for such sequences. Shift JIS (developed by Microsoft) belies its name and does not involve shift sequences that use a state-dependent encoding.

Table 243.3: ESC codes for some of the character sets used in Japanese.

    Character Set        Byte Encoding      Visible Ascii Representation
    JIS C 6226–1978      1B 24 40           $ @
    JIS X 0208–1983      1B 24 42           $ B
    JIS X 0208–1990      1B 26 40 1B 24 42  & @ $ B
    JIS X 0212–1990      1B 24 28 44        $ ( D
    JIS-Roman            1B 28 4A           ( J
    Ascii                1B 28 42           ( B
    Half width Katakana  1B 28 49           ( I

Table 243.4: A JIS encoding of the character sequence ("kana and kanji").

    Encoded values     0x1b 0x24 0x42  0x242b  0x244a  0x3441  0x3b7a  0x1b 0x28 0x4a
    Control character  $ B                                             ( J
    Graphic character
    Ascii characters                   $+      $J      4A      ;z

Coding Guidelines
Developers do not need to remember the numerical values for extended characters. The editor, or program development environment, used to create the source code invariably looks after the details (generating any escape sequences and the appropriate byte values for the extended character selected by the developer). How these tools decide to encode multibyte character sequences is outside the scope of these coding guidelines.
It is usually possible to express an extended character in a minimal number of bytes using a particular state-dependent encoding. The extent to which developers might create fixed-length data structures on the assumption that multibyte characters will not contain any redundant shift sequences is outside the scope of this book. The value of the MB_LEN_MAX macro places an upper limit on the number of possible redundant shift sequences.
See also: 2017 footnote 152; 313 MB_LEN_MAX.

Example
In the following, ^[ is the visible form of the ESC byte (0x1b), and ^N and ^O are the visible forms of SO (0x0e) and SI (0x0f).

    #include <stdio.h>

    char *p1 = "^[$B$3$l$OF|K\8lI=8=^[(J"; /* ^[$BF|K\8lJ8;zNs^[(J */
    char *p2 = "^[$B$3$l$OF|1Q^[(Jmixed^[$BJ8;zNs^[(J"; /* Ascii + ^[$BF|K\8l^[(J */
    char *p3 = "^[$B$3$l$OH>3Q^[(J^N6@6E^O^[$B$H^[(JASCII^[$B:.9g^[(J";

    int main(void)
    {
        printf("%s^[$B$H^[(J%s^[$B$H^[(J%s\n", p1, p2, p3);
    }

244 While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state.
Commentary
The implementation of a stateful encoding has to pick a special character, which is not in the basic character set, to indicate the start of a shift sequence. When not in the initial shift state, it is very unlikely that single bytes will be interpreted the same way as when in the initial shift state.

C++
The C++ Standard does not explicitly specify this requirement.

Common Implementations
The ESC character, 0x1b, is commonly used to indicate the start of a shift sequence.

245 Footnote 12) The trigraph sequences enable the input of characters that are not defined in the Invariant Code Set as described in ISO/IEC 646, which is a subset of the seven-bit US ASCII code set.

Commentary
When trigraphs are used, it is possible to write C source code that contains only those characters that are in the Invariant Code Set of ISO/IEC 646.

C90
The C90 Standard explicitly referred to the 1983 version of the ISO/IEC 646 standard.

246 The interpretation for subsequent bytes in the sequence is a function of the current shift state.

Commentary
This wording is really a suggestion for the design of multibyte shift states (it is effectively describing the processing performed by finite state machines, which is what a shift state encoding is). Being able to interpret a byte independent of the current shift state would indicate that the sequence of bytes that resulted in the current state was redundant.

The specification of the macro MB_LEN_MAX requires that the maximum number of bytes needed to handle a supported multibyte character be provided (see 313 MB_LEN_MAX). It may, or may not, be possible to represent some redundant shift sequence within the available bytes. The standard does not explicitly require or prohibit support for redundant shift sequences.

C++
A set of virtual functions for handling state-dependent encodings, during program execution, is discussed in Clause 22, Localization library. But, this requirement is not specified.
Common Implementations
Implementations usually use a simple finite state machine, often automatically generated, to handle the mapping of shift states into their execution character value. The extent to which redundant shift sequences are supported will depend on the implementation.

Coding Guidelines
The sequence of bytes in a shift sequence is usually generated via some automated process. For this reason a guideline recommending against the use of redundant shift sequences is unlikely to be enforceable, and none is given.

247 — A byte with all bits zero shall be interpreted as a null character independent of shift state.

Commentary
This is a requirement on the implementation. This requirement makes it possible to search for the end of a string without needing any knowledge of the encoding that has been used. For instance, string-handling functions can copy multibyte characters without interpreting their contents.
C++
    2.2p3: . . . , plus a null character (respectively, null wide character), whose representation has all zero bits.
While the C++ Standard does not rule out the possibility of all bits zero having another interpretation in other contexts, other requirements (17.3.2.1.3.1p1 and 17.3.2.1.3.2p1) restrict these other contexts, as do existing character set encodings.

248 — Such a byte shall not occur as part of any other multibyte character. (This wording replaces the earlier "A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.")

Commentary
This is a requirement on the implementation. The effect of this requirement is that partial multibyte characters cannot be created (otherwise the behavior is undefined). A null character can only exist outside of the sequence of bytes making up a multibyte character. For source files this requirement follows from the requirement to end in the initial shift state (see 250 token shift state). During program execution this requirement means that library functions processing multibyte characters do not need to concern themselves with handling partial multibyte characters at the end of a string.

The wording was changed by the response to DR #278 (it is a requirement on the implementation that forbids a two-byte character from having a first, or any, byte that is zero).

C++
This requirement can be deduced from the definition of null terminated byte strings, 17.3.2.1.3.1p1, and null terminated multibyte strings, 17.3.2.1.3.2p1.

249 For source files, the following shall hold:

Commentary
These C-sentences specify requirements on a program. A program that violates them exhibits undefined behavior.

Use of multibyte characters can involve locale-specific and implementation-defined behaviors (see 44 locale-specific behavior; 42 implementation-defined behavior). A source file does not affect the conformance status of any program built using it, provided its use of multibyte characters either involves locale-specific behavior or the implementation-defined behavior does not affect program output (e.g., they appear in comments).

Coding Guidelines
The creation of multibyte characters within source files is usually handled by an editor. The developer's involvement in the process is limited to selecting the appropriate character. In such an environment the developer has no control over the byte sequences used. A guideline recommending against such usage is likely to be impractical to implement and none is given.

250 — An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.

Commentary
These are the only tokens that can meaningfully contain a multibyte character. A token containing a multibyte character should not affect the processing of subsequent tokens. Without this requirement a token that did not end in the initial shift state would be likely to affect the processing of subsequent tokens.

C90
Support for multibyte characters in identifiers is new in C99.
C++
In C++ all characters are mapped to the source character set in translation phase 1 (see 116 translation phase 1). Any shift state encoding will not exist after translation phase 1, so the C requirement is not applicable to C++ source files.

Coding Guidelines
The fact that many multibyte sequences are created automatically, by an editor, can make it very difficult for a developer to meet this requirement. A developer is unlikely to intentionally end a preprocessing token, created using a multibyte sequence, in other than the initial state. A coding guideline is unlikely to be of benefit.

251 — An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.

Commentary
What is a valid multibyte character? This decision can only be made by a translator, should it choose to accept multibyte characters.

In C90 it was relatively easy to lexically process a source file containing multibyte characters. The context in which these characters occurred often meant that a lexer simply had to look for the character that terminated the kind of token being processed (unless that character occurred as part of a multibyte character). Identifier tokens do not have a single termination character. This means that it is not possible to generalise support for multibyte characters in identifiers across all translators. It is possible that source containing a multibyte character identifier supported by one translator will cause another translator to issue a diagnostic.

C90
Support for multibyte characters in identifiers is new in C99.

C++
In C++ all characters are mapped to the source character set in translation phase 1 (see 116 translation phase 1). Any shift state encoding will not exist after translation phase 1, so the C requirement is not applicable to C++ source files.
Coding Guidelines
In some cases source files can contain multibyte characters and be translated by translators that have no knowledge of the structure of these multibyte characters. The developer is relying on the translator ignoring them in comments containing their native language, or simply copying the character sequence in a string literal into the program image. In other cases, for instance identifiers, knowledge of the encoding used for the multibyte character set is likely to be needed by a translator. Ensuring that a translator capable of handling any multibyte characters occurring in the source is used is a configuration-management issue that is outside the scope of these coding guidelines.

5.2.2 Character display semantics

Commentary
There is no guarantee that a character display will exist on any hosted implementation. If such a device is supported by an implementation, this clause specifies its attributes.

C++
Clause 18 mentions “display as a wstring” in Notes:. But, there is no other mention of display semantics anywhere in the standard.

Common Implementations
Most Unix-based environments contain a database of terminal capabilities, the so-called termcap database.[1332] This database provides information to the host on a large number of terminal capabilities and characteristics. Knowing the display device currently being used (this usually relies on the user setting an environment variable) enables the database to be queried for device attribute information. This information can then be used by an application to handle its output to display devices. There is a similar database of information on printer characteristics.
252 The active position is that location on a display device where the next character output by the fputc function would appear.

Commentary
This defines the term active position; however, the term current cursor position is more commonly used by developers. The wide character output functions act as if fputc is called.

C++
C++ has no concept of active position. The fputc function appears in "Table 94" as one of the functions supported by C++.

Other Languages
Most languages don’t get involved in such low-level I/O details.

253 The intent of writing a printing character (as defined by the isprint function) to a display device is to display a graphic representation of that character at the active position and then advance the active position to the next position on the current line.

Commentary
The standard specifies an intent, not a requirement. Some devices produce output that cannot be erased later (e.g., printing to paper) while other devices always display the last character output at a given position (e.g., VDUs). The ability of printers to display two or more characters at the same position is sometimes required. For instance, programs wanting to display the ô character on a wide variety of printers might generate the sequence o, backspace, ^ (all of these characters are contained in the invariant subset of ISO 646).

The intended behavior describes the movement of the active position, not the width of the character displayed. There is nothing in this definition to prevent the writing of one character affecting previously written characters (which can occur in Arabic). This specification implies that the positions are a fixed width apart.

The graphic representation of a character is known as a glyph (see 58 glyph).

C++
The C++ Standard does not discuss character display semantics.
Common Implementations
In some oriental languages, character glyphs can usually be organized into two groups, one being twice the width of the other. Implementations in these environments often use a fixed width for each glyph, creating empty spaces between some glyph pairs.

Some orthographies, which use an alphabetic representation, contain single characters that use what appears to be two characters in their visual representation. For instance, the character denoted by the Unicode value U+00C6 is Æ, and the character denoted by the Unicode value U+01C9 is ǉ. Both representations are considered to be a single character (the former is also a single letter, while the latter is two letters).

Coding Guidelines
The concept of active position is useful for describing the basic set of operations supported by the C Standard. The applications’ requirements for displaying characters may, or may not, be feasible within the functionality provided by the standard; this is a top-level application design issue. How characters appear on a display device is an application user interface issue that is outside the scope of this book.

254 The direction of writing is locale-specific.
Commentary
Although left-to-right is used by many languages, this direction is not the only one used. Arabic uses right-to-left (also Hebrew, Urdu, and Berber). In Japanese it is possible for the direction to be from top to bottom with the lines going right-to-left (mainland Chinese has the columns going from left-to-right, in Taiwan it goes right-to-left), or left-to-right with the lines going top to bottom (the same directional conventions as English).

There is no requirement that the direction of writing always be the same direction, for instance, braille alternates in direction between adjacent lines (known as boustrophedon), as do Egyptian hieroglyphs, Mayan, and Hittite. Some Egyptian hieroglyphic characters can face either to the left or right, information that readers can use to deduce the direction in which a line should be read.

Some applications need to simultaneously handle locales where the direction of writing is different, for instance, a word processor that supports the use of Hebrew and English in the same document. This level of support is outside the scope of the C Standard.

C++
The C++ Standard does not discuss character display semantics.

Coding Guidelines
The direction of writing is an application issue. Any developer who is concerned with the direction of writing will, of necessity, require a deeper involvement with this topic than the material covered by the C Standard or these coding guidelines.

Example
The direction of writing can change during program execution. For instance, in a word processor that handles both English and Arabic or Hebrew, the character sequence ABCdefGHJ (using lowercase to represent English and uppercase to represent Arabic/Hebrew) might appear on the display as JHGdefCBA.

255 If the active position is at the final position of a line (if there is one), the behavior of the display device is unspecified.
Commentary
The Committee recognized that there is no commonality of behavior exhibited by existing display devices when the final position on a line is reached.

C++
The C++ Standard does not discuss character display semantics.

Common Implementations
Some display devices wrap onto the next line, effectively generating an extra new-line character. Other devices write all subsequent characters, up to the next new-line character, at the final position. On some displays, writing to the bottom right corner of a display has an effect other than displaying the character output, for instance, clearing the screen or causing it to scroll. Both termcap and ncurses provide configuration options that specify whether writing to this display location has the desired effect.

Coding Guidelines
Organizing the characters on a display device is an application domain issue. The fact that the C Standard does not provide a defined method of handling the situation described here needs to be dealt with, if applicable, during the design process. This is outside the scope of these coding guidelines.

256 Alphabetic escape sequences representing nongraphic characters in the execution character set are intended to produce actions on display devices as follows:

Commentary
This is the behavior of Ascii terminals enshrined in the C Standard.

Rationale
    To avoid the issue of whether an implementation conforms if it cannot properly effect vertical tabs (for instance), the Standard emphasizes that the semantics merely describe intent.

These escape sequences can also be output to files. The data values written to a file may depend on whether the stream was opened in text or binary mode.

C++
The C++ Standard does not discuss character display semantics.

Other Languages
Java provides a similar set of functionality to that described here.

Common Implementations
Most display devices are capable of handling most of the functions described here.

Coding Guidelines
A program cannot assume that any of the functionality described will occur when the escape sequence is sent to a display device. The root cause for the variability in support for the intended behaviors is the variability of the display devices. In most cases an implementation’s action is to send the binary representation of the escape sequence to the device. The manufacturers of display devices are aware of their customers’ expectations of behavior when these kinds of values are received. There is little that coding guidelines can recommend to help reduce the dependency on display devices. The design guidelines of creating individual functions to perform specific operations on display devices and isolating variable implementation behaviors in one place are outside the scope of these coding guidelines.

257 \a (alert) Produces an audible or visible alert without changing the active position.

Commentary
The intent of an alert is to draw attention to some important event, such as a warning message that the host is to be shut down, or that some unexpected situation has occurred. A program running as a background process (a concept that is not defined by the C Standard) may not have a display device attached (does a tree falling in a forest with nobody to hear it make a noise?).

C++
Alert appears in Table 5, 2.13.2p3.
There is no other description of this escape sequence, although the C behavior might be implied from the following wording:
    17.4.1.2p3: The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12:

Common Implementations
Most implementations provide an audible alert. On display devices that don’t have a mechanism for producing a sound, a visible alert might be to temporarily blank the screen or to temporarily increase the brightness of the screen.

Coding Guidelines
Programs that produce too many alerts run the risk of having them ignored. The human factors involved in producing alerts are outside the scope of these coding guidelines. Issues such as a display device not being able to produce an audible alert because its speaker is broken are also outside the scope of these coding guidelines.

258 \b (backspace) Moves the active position to the previous position on the current line.

Commentary
The standard specifies that the active position is moved. It says nothing about what might happen to any character displayed prior to the backspace at the new current active position.
Common Implementations
Some devices erase any character displayed at the previous position.

C++
Backspace appears in Table 5, 2.13.2p3. There is no other description of this escape sequence, although the C behavior might be implied from the following wording:
    17.4.1.2p3: The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12:

Example

    #include <stdio.h>

    int main(void)
    {
        printf("h\bHello \b World\n");
    }

259 If the active position is at the initial position of a line, the behavior of the display device is unspecified.

Commentary
Some terminals have input locking states. In such cases an unspecified behavior may put the display device into a state where it no longer displays characters written to it.

C90
    If the active position is at the initial position of a line, the behavior is unspecified.
This wording differs from C99 in that it renders the behavior of the program as unspecified. The program simply writes the character; how the device handles the character is beyond its control.

C++
The C++ Standard does not discuss character display semantics.

Common Implementations
The most common implementation behavior is to ignore the request, leaving the active position unchanged. Some VDUs have the ability to wrap back to the final position on the preceding line.

Coding Guidelines
While it may be technically correct to specify the behavior of the display device as unspecified, it does indirectly affect the output behavior of a program in that subsequent output may not appear on that display device.

260 \f (form feed) Moves the active position to the initial position at the start of the next logical page.

Commentary
Whatever a page, logical or otherwise, is. This concept is primarily applied to printers. The functionality to move to the start of the next page, from anywhere on the current page, is generally provided by printer vendors.
Programs might use this functionality since it frees them from needing to know the number of lines on a page (provided the minimum needed to support the generated output is available).

C++
Form feed appears in Table 5, 2.13.2p3. There is no other description of this escape sequence, although the C behavior might be implied from the following wording (17.4.1.2p3):
    The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12:

Coding Guidelines
Use of this escape sequence could remove the need for a program to be aware of the number of lines on the page of the display device being written. However, it does place a dependency on the characteristics of the display device being known to the host executing the program, or on the device itself, to respond to the data sent to it (see termcap database).

261 \n (new line) Moves the active position to the initial position of the next line.

Commentary
What happens to the preceding lines is not specified. For instance, whether the display device scrolls lines or wraps back to the top of any screen. The standard is silent on the issue of display devices that only support one line. For instance, do the contents of the previous line disappear?

C++
New line appears in Table 5, 2.13.2p3. There is no other description of this escape sequence, although the C behavior might be implied from the following wording:
    17.4.1.2p3: The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12:

Other Languages
Some languages provide a library function that produces the same effect.

Common Implementations
On some hosts the new-line character causes more than one character to be sent to the display device (e.g., carriage return, line feed). A printing device may simply move the media being printed on. A VDU may display characters on some previous line (wrapping to the start of the screen).

On some display devices (usually memory-mapped ones), the start of a new line is usually indicated by an end-of-line character appearing at the end of the previous line (see 224 end-of-line representation). On other display devices, a fixed amount of storage is allocated for the characters that may occur on each line.
In this case the end of line is not stored as a character in the display device.

Coding Guidelines
Issues, such as handling lines that are lost when a new line is written or display devices that contain a single line, are outside the scope of these coding guidelines.

262 \r (carriage return) Moves the active position to the initial position of the current line.

Commentary
The behavior might be viewed as having the same effect as writing the appropriate number of backspace characters. However, the effect of writing a backspace character might be to erase the previous character, while a carriage return does not cause the contents of a line to be erased. Like backspace (see 258 backspace escape sequence), the standard says nothing about the effect of writing characters at the position on a line that has previously been written to.

C++
Carriage return appears in Table 5, 2.13.2p3. There is no other description of this escape sequence, although the C behavior might be implied from the following wording:
    17.4.1.2p3: The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12:
263 \t (horizontal tab) Moves the active position to the next horizontal tabulation position on the current line.

Commentary
Horizontal tabulation positions are provided by vendors of display devices as a convenient method of aligning data, on different lines, into columns. In some cases they can remove the need for a program to count the number of characters that have been written. The C Standard does not provide a method for controlling the location of horizontal tabulation positions. Neither does a program have any method of finding out which positions they occupy.

C++
Horizontal tab appears in Table 5, 2.13.2p3. There is no other description of this escape sequence, although the C behavior might be implied from the following wording:
    17.4.1.2p3: The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12:

Common Implementations
The location of tabulation positions on a line is usually controlled by the display device. There may be a limited number that can be configured on a line. Configuring a horizontal tab position every eight active positions from the start of the line is a common default. Many hosts allow the default setting to be changed, and some users actively make use of this configuration option.

Coding Guidelines
A commonly seen application problem is the assumption, by the developer, of where the horizontal tabulation positions occur on a display device. However, the handling of display devices is outside the scope of these coding guidelines.

264 If the active position is at or past the last defined horizontal tabulation position, the behavior of the display device is unspecified.

Commentary
The standard does not specify how many horizontal tabulation positions must be supported by an implementation, if any.

C90
    If the active position is at or past the last defined horizontal tabulation position, the behavior is unspecified.
Common Implementations
Some implementations do not move the active position when the last defined horizontal tabulation position has been reached; others treat writing such a character as being equivalent to writing a single white-space character at this position. In some cases the behavior is to move the active position to the first horizontal tabulation position on the next line.

265 \v (vertical tab) Moves the active position to the initial position of the next vertical tabulation position.

Commentary
Although the standard recognizes that the direction of writing is locale-specific, it says nothing about the order in which lines are organized. The vertical tab (and new line) escape sequences move the active position in the same line direction. There is no escape sequence for moving the active position in the opposite direction, similar to backspace for movement within a line.

The concept of vertical tabulation implicitly invokes the concept of current page (see 260 page logical). This concept is primarily applied to printers, while the dimensions of a page might be less variable than those of a terminal. Before laser printers were invented, it was very important to ensure that output occurred in a controlled, top-down fashion.