The New C Standard- P9

6.4.2.1 General

Coding Guidelines
The visual similarity of these letters is discussed elsewhere. [792 character visual similarity]

795 There is no specific limit on the maximum length of an identifier.

Commentary
The standard does specify a minimum limit on the number of characters a translator must consider as significant. [282 internal identifier significant characters; 283 external identifier significant characters] Implementations are free to ignore characters once this limit is reached. The ignored characters do not form part of another token. It is as if they did not appear in the source at all.

C90
The C90 Standard does not explicitly state this fact.

Other Languages
Few languages place limits on the maximum length of an identifier that can appear in a source file. Like C, some specify a lower limit on the number of characters that must be considered significant.

Coding Guidelines
Using a large number of characters in an identifier spelling has many potential benefits; for instance, it provides the opportunity to supply a lot of information to readers, or to reduce dependencies on existing reader knowledge by spelling words in full rather than using abbreviations. There are also potential costs; for instance, long identifiers can cause visual layout problems in the source (requiring new-lines within an expression in an attempt to keep the maximum line length within the bounds that can be viewed within a fixed-width window), or increase the cognitive effort needed to visually scan source containing them.
The length of an identifier is not itself directly a coding guideline issue. However, length is indirectly involved in many identifier memorability, confusability, and usability issues, which are discussed elsewhere. [792 identifier syntax]

Usage
The distribution of identifier lengths is given in Figure 792.7.
796 Each universal character name in an identifier shall designate a character whose encoding in ISO/IEC 10646 falls into one of the ranges specified in annex D (footnote 60). [identifier UCN]

Commentary
Using other UCNs results in undefined behavior (in some cases even using these UCNs can be a constraint violation). [816 UCNs not basic character set] These character encodings could be thought of as representing letters in the specified national character set.

C90
Support for universal character names is new in C99.

Other Languages
The ISO/IEC 10646 standard is relatively new and languages are only just starting to include support for the characters it specifies. [28 ISO 10646] Java specifies a similar list of UCNs.

Common Implementations
A collating sequence may not be defined for these universal character names. In practice a lack of a defined collating sequence is not an implementation problem, because a translator only ever needs to compare the spelling of one identifier for equality with another identifier, which involves a simple character-by-character comparison (the issue of the ordering of diacritics is handled by not allowing them to occur in an identifier). Support for this functionality is new and the extent to which implementations are likely to check that UCN values fall within the list given in annex D is not known.

June 24, 2009 v 1.2
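The point that only equality matters can be sketched in a few lines of C. This is an illustration under stated assumptions, not a description of any particular translator: the type ident_spelling and the function same_spelling are hypothetical names, and each UCN is assumed to have been decoded to its ISO 10646 value already.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical representation: an identifier spelling held as an array
 * of ISO 10646 code points (each UCN already decoded to its value). */
typedef struct {
    const unsigned long *cp; /* code points, in source order */
    size_t len;              /* number of code points */
} ident_spelling;

/* Equality is all that is needed: same length and the same code point
 * at every position. No collating sequence (ordering) is consulted. */
bool same_spelling(ident_spelling a, ident_spelling b)
{
    if (a.len != b.len)
        return false;
    for (size_t i = 0; i < a.len; i++)
        if (a.cp[i] != b.cp[i])
            return false;
    return true;
}
```

Because the comparison never asks which of two characters sorts first, the absence of a collating sequence has no effect on it.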
Coding Guidelines
The intended purpose for supporting universal character names in identifiers is to reduce the developer effort needed to comprehend source. Identifiers spelled in the developer's native tongue are more immediately recognizable (because of greater practice with those characters) and also have semantic associations that are more readily brought to mind.
The ISO 10646 Standard does not specify which languages contain the characters it specifies (although it does give names to some sets of characters that correspond to a language that contains them). [28 ISO 10646] The written form of some human languages share common characters; for instance, the characters a through z (and their uppercase forms) appear in many European orthographies. [792 orthography] The following discussion refers to using UCNs from more than one human language. This is to be taken to mean using UCNs that are not part of the written form of the native language of the developer (the case of developers having more than one native language is not considered). For instance, the character a is used in both Swedish and German; the character û is used in Swedish, but not German; the character ß is used in German but not Swedish. Both Swedish and German developers would be familiar with the character a, but the character ß would be considered foreign to a Swedish developer, and the character û foreign to the German.
Some coding guideline documents recommend against the use of UCNs. Their use within identifiers can increase the portability cost of the source. The use of UCNs is an economic issue; the potential cost of not permitting their use in identifiers needs to be compared against the potential portability benefits. (Alternatively, the benefits of using UCNs could be compared against the possible portability costs.)
Given the purpose of using UCNs, is there any rationale for identifiers to contain characters from more than one human language?
As an English speaker, your author can imagine a developer wanting to use an English word, or its common abbreviation, as a prefix or suffix to an identifier name. Perhaps an Urdu speaker can imagine a similar usage with Urdu words. The issue is whether the use of characters in the same identifier from different human languages has meaning to the developers who write and maintain the source.
Identifiers very rarely occur in isolation. Should all the identifiers in the same function, or even source file, only contain UCNs that form the set of characters used by a single human language? Using characters from different human languages, when it is possible to use only characters from a single language, potentially increases the cost of maintenance. Future maintainers are either going to have to be familiar with the orthography and semantics of the two human languages used or spend additional time processing instances of identifiers containing characters they are not familiar with. However, in some cases it might not be possible to enforce a single human language rule. For instance, a third-party library may contain callable functions whose spellings use characters from a human language different from that used in the source code that contains calls to it.
Support for the use of UCNs in identifiers is new in C99 (and other computer languages) and at the time of this writing there is almost no practical experience available on the sort of mistakes that developers make with them.

797 The initial character shall not be a universal character name designating a digit.

Commentary
The terminal identifier-nondigit that appears in the syntax implies that the possible UCNs exclude the digit characters. [792 identifier syntax] Also the list given in annex D does not include the digit characters. This means that an identifier containing a UCN designating a digit in any position results in undefined behavior.
The syntax for constants does not support the use of UCNs. [822 constant syntax]
This sentence, in the standard, reminds implementors that such usage could be supported in the future and that, while they may support UCN digits within an identifier, it would not be a good idea to support them as the initial character.
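The rule in sentence 797 can be sketched as a range check. The following is a minimal illustration rather than anything taken from the standard: it assumes identifier characters have already been decoded to ISO 10646 code points, the names digit_ranges, is_table_digit and initial_char_is_digit are hypothetical, and the ranges are those listed in Table 797.1.

```c
#include <stdbool.h>
#include <stddef.h>

/* Digit ranges copied from Table 797.1 (bounds are inclusive). */
static const struct { unsigned long lo, hi; } digit_ranges[] = {
    {0x0030, 0x0039}, {0x0660, 0x0669}, {0x06F0, 0x06F9},
    {0x0966, 0x096F}, {0x09E6, 0x09EF}, {0x0A66, 0x0A6F},
    {0x0AE6, 0x0AEF}, {0x0B66, 0x0B6F}, {0x0BE7, 0x0BEF},
    {0x0C66, 0x0C6F}, {0x0CE6, 0x0CEF}, {0x0D66, 0x0D6F},
    {0x0E50, 0x0E59}, {0x0ED0, 0x0ED9}, {0xFF10, 0xFF19},
};

/* True if cp falls inside one of the table's digit ranges. */
bool is_table_digit(unsigned long cp)
{
    size_t n = sizeof digit_ranges / sizeof digit_ranges[0];
    for (size_t i = 0; i < n; i++)
        if (cp >= digit_ranges[i].lo && cp <= digit_ranges[i].hi)
            return true;
    return false;
}

/* True if the first character of an identifier is a digit, which is
 * the case sentence 797 rules out. Only this one rule is checked. */
bool initial_char_is_digit(unsigned long first_cp)
{
    return is_table_digit(first_cp);
}
```

A check of this kind flags an identifier whose first character is \u0ae6 (a Gujarati digit in the table) and accepts one whose first character is \u1f00.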
Table 797.1: The Unicode digit encodings.

    Encoding Range  Language               Encoding Range  Language
    0030–0039       ISO Latin-1            0BE7–0BEF       Tamil (has no zero)
    0660–0669       Arabic–Indic           0C66–0C6F       Telugu
    06F0–06F9       Eastern Arabic–Indic   0CE6–0CEF       Kannada
    0966–096F       Devanagari             0D66–0D6F       Malayalam
    09E6–09EF       Bengali                0E50–0E59       Thai
    0A66–0A6F       Gurmukhi               0ED0–0ED9       Lao
    0AE6–0AEF       Gujarati               FF10–FF19       Fullwidth digits
    0B66–0B6F       Oriya

C++
This requirement is implied by the terminal non-name used in the C++ syntax. Annex E of the C++ Standard does not list any UCN digits in the list of supported UCN encodings.

Other Languages
Java has a similar requirement.

Coding Guidelines
The extent to which different cultural conventions support the use of a digit as the first character in an identifier is not known to your author. At some future date the Committee may choose to support the writing of integer constants using UCNs. If this happens, any identifiers that start with a UCN designating a digit are liable to result in syntax violations. There does not appear to be a worthwhile benefit in a guideline recommendation dealing with the case of an identifier beginning with a UCN designating a digit.

Example

    int \u1f00\u0ae6; /* UCN digit, but not as the initial character */
    int \u0ae6;       /* UCN digit as the initial character */

798 An implementation may allow multibyte characters that are not part of the basic source character set to appear in identifiers; [identifier multibyte character in]

Commentary
Prior to C99 there was no standardized method of representing nonbasic source character set characters in the source code. Support for multibyte characters in string literals and constants was specified in C90; some implementations extended this usage to cover identifiers. They are now officially sanctioned to do this.
Support for the ISO 10646 Standard is new in C99. [28 ISO 10646] However, there are a number of existing implementations that use a multibyte encoding scheme and this usage is likely to continue for many years.
The C committee recognized the importance of this usage and did not force developers to go down a UCN-only path.
The standard says nothing about the behavior of the __func__ reserved identifier in the case when a function name is spelled using wide characters. [810 __func__]

C90
This permission is new in C99.

C++
The C++ Standard does not explicitly contain this permission. However, translation phase 1 performs an implementation-defined mapping of the source file characters, and an implementation may choose to support multibyte characters in identifiers via this route. [116 translation phase 1]
Other Languages
While other language standards may not mention multibyte characters, the problem they address is faced by implementations of those languages. For this reason, it is to be expected that some implementations of other languages will contain some form of support for multibyte characters.

Coding Guidelines
UCNs may be the preferred, C Standard, way of representing nonbasic character set characters in identifiers. [815 universal character name syntax] However, developers are at the mercy of editor support for how they enter and view characters that are not in the basic source character set.

799 which characters and their correspondence to universal character names is implementation-defined.

Commentary
Various national bodies have defined standards for representing their national character sets in computer files. While ISO 10646 is intended to provide a unified standard for all characters, it may be some time before existing software is converted to use it. [28 ISO 10646]

Common Implementations
It is common to find translators aimed at the Japanese market supporting JIS, shift-JIS, and EUC encodings (see Table 243.3). These encodings use different numeric values than those given in ISO 10646 to represent the same national character.

800 When preprocessing tokens are converted to tokens during translation phase 7, if a preprocessing token could be converted to either a keyword or an identifier, it is converted to a keyword.

Commentary
The Committee could have created a separate name space for keywords and allowed developers to define identifiers having the same spelling as a keyword. The complexity added to a translator by such a specification would be significant (based on implementation experience for languages that support this functionality), while a developer's inability to define identifiers having these spellings was considered a relatively small inconvenience.
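The keyword-first rule of sentence 800 can be sketched as a table lookup performed before anything is treated as an identifier. This is a simplified illustration, assuming the spelling passed in already has the form of an identifier preprocessing token; the names token_kind and classify_word are hypothetical. The table holds the 37 keywords of C99.

```c
#include <stddef.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENTIFIER } token_kind;

/* The keywords listed in C99 6.4.1. */
static const char *const c99_keywords[] = {
    "auto", "break", "case", "char", "const", "continue", "default",
    "do", "double", "else", "enum", "extern", "float", "for", "goto",
    "if", "inline", "int", "long", "register", "restrict", "return",
    "short", "signed", "sizeof", "static", "struct", "switch",
    "typedef", "union", "unsigned", "void", "volatile", "while",
    "_Bool", "_Complex", "_Imaginary",
};

/* A spelling that matches a keyword is always a keyword; only the
 * remaining spellings become identifiers. Matching is case sensitive. */
token_kind classify_word(const char *spelling)
{
    size_t n = sizeof c99_keywords / sizeof c99_keywords[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(spelling, c99_keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENTIFIER;
}
```

A linear search keeps the sketch short; the choice of search strategy does not affect the rule being shown.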
C90
This wording is a simplification of the convoluted logic needed in the C90 Standard to deduce from a constraint what C99 now says in semantics. The removal of this C90 constraint is not a change of behavior, since it was not possible to write a program that violated it.

C90 6.1.2, Constraints:
    In translation phase 7 and 8, an identifier shall not consist of the same sequence of characters as a keyword.

Other Languages
Some languages allow keywords to be used as variable names (e.g., PL/1), using the context to disambiguate intended use.

801 footnote 60) On systems in which linkers cannot accept extended characters, an encoding of the universal character name may be used in forming valid external identifiers.

Commentary
This is really an implementation tip for translators. The standard defines behavior in terms of an abstract machine that produces external output. The tip given in this footnote does not affect the conformance status of an implementation that chooses to implement this functionality in another way.
The only time such a mapping might be visible is through the use of a symbolic execution-time debugging tool, or by having to link against object files created by other translators.
C90
Extended characters were not available in C90, so the suggestion in this footnote does not apply. [215 extended characters]

Other Languages
Issues involving third-party linkers are common to most language implementations that compile to machine code. Some languages, for instance Java, define the characteristics of an implementation at translation and execution time. The Java language specification goes to the extreme (compared to other languages) of specifying the format of the generated object code file.

Common Implementations
There is a long-standing convention of prefixing externally visible identifier names with an underscore character when information on them is written out to an object file. There is little experience available on implementation issues involving UCNs, but many existing linkers do assume that identifiers are encoded using 8-bit characters.

Coding Guidelines
The encoding of external identifiers only needs to be considered when interfacing to, or from, code written in another language. Cross-language interfacing is outside the scope of these coding guidelines.

802 For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal character name.

Commentary
Some linkers may not support an occurrence of the backslash (\) character in an identifier name. One solution to this problem is to create names that cannot be declared in the source code by the developer; for instance, by deleting the \ characters and prefixing the name with a digit character.

Common Implementations
There are no standards for encoding of universal character names in object files. The requirement to support this form of encoding is too new for it to be possible to say anything about common encodings.

803 Extended characters may produce a long external identifier.

Commentary
Here the word long does not have any special meaning. It simply suggests an identifier containing many characters. [282 internal identifier significant characters]
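The solution mentioned in the commentary on sentence 802 (delete the \ characters and prefix the name with a digit) can be sketched as follows. This is an illustration only: the function name encode_external_name is hypothetical, the choice of '0' as the prefix digit is an assumption, and names without a UCN are passed through unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Writes the encoded form of src into dst (capacity dst_size bytes,
 * including the terminating null). Returns false if dst is too small. */
bool encode_external_name(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return false;
    if (strchr(src, '\\') != NULL) {
        if (dst_size < 2)
            return false;
        dst[out++] = '0';   /* a leading digit cannot start a C identifier */
    }
    for (; *src != '\0'; src++) {
        if (*src == '\\')
            continue;       /* drop the backslash of each \u or \U */
        if (out + 1 >= dst_size)
            return false;   /* no room for this character plus the null */
        dst[out++] = *src;
    }
    dst[out] = '\0';
    return true;
}
```

With this sketch the one-character name \u30CE becomes the six-character name 0u30CE, which also illustrates the growth noted in sentence 803. The scheme is deliberately naive: two different source spellings can map to the same encoded name (a UCN followed by the letters of another UCN's spelling, for instance), so a production encoding would need an unambiguous escape.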
Implementation limits

804 As discussed in 5.2.4.1, an implementation may limit the number of significant initial characters in an identifier; [Implementation limits]

Commentary
This subclause lists a number of minimum translation limits. [276 translation limits]

C90
The C90 Standard does not contain this observation.

C++
C++ 2.10p1:
    All characters are significant.20)

C identifiers that differ after the last significant character will cause a diagnostic to be generated by a C++ translator.
Annex B contains an informative list of possible implementation limits. However, ". . . these quantities are only guidelines and do not determine compliance.".
805 the limit for an external name (an identifier that has external linkage) may be more restrictive than that for an internal name (a macro name or an identifier that does not have external linkage).

Commentary
External identifiers have to be processed by a linker, which may not be under the control of a vendor's C implementations. [283 external identifier significant characters] In theory, any tool that performs the linking process falls within the remit of the C Committee. However, the Committee recognized that, in practice, it is not always possible for translator vendors to supply their own linker. The limitations of existing linkers needed to be factored into the limits specified in the standard.
Internal identifiers only need to be processed by the translator and the standard is in a strong position to specify the behavior. [282 internal identifier significant characters]

Other Languages
Most other language implementations face similar problems with linkers as C does. However, not all language specifications explicitly deal with the issue (by specifying the behavior). The Java standard defines a complete environment that handles all external linkages.

Coding Guidelines
What are the costs associated with a change to the linkage of an identifier during program maintenance, from internal linkage to external linkage? (Experience shows that identifier linkage is rarely changed from external to internal.)
In most cases implementations support a sufficiently large number of significant characters in an external identifier name that a change of identifier linkage makes no difference to its significant characters (i.e., the number of characters it contains falls inside the implementation limit). [283 external identifier significant characters]
In those cases where a change of identifier linkage results in some of its significant characters being ignored, the effect may be benign (there is no other identifier defined with external linkage whose name is the same as the truncated name) or result in undefined behavior (the program defines two identifiers with external linkage with the same name). [792 identifier number of characters; 1818 external linkage exactly one external definition]

806 The number of significant characters in an identifier is implementation-defined.

Commentary
Subject to the minimum requirements specified in the standard. [282 internal identifier significant characters]

C++
C++ 2.10p1:
    All characters are significant.20)

References to the same C identifier, which differs after the last significant character, will cause a diagnostic to be generated by a C++ translator.
There is also an informative annex which states:

C++ Annex Bp2:
    Number of initial characters in an internal identifier or a macro name [1024]
    Number of initial characters in an external identifier [1024]

Other Languages
Some languages require all characters in an identifier to be significant (e.g., Java, Snobol 4), while others don't (e.g., Cobol, Fortran).

Common Implementations
It is rare to find an implementation that does not meet the minimum limits specified in the standard. A few translators treat all characters in an identifier as significant. Most have a limit of between 256 and 2,000 significant characters. The POSIX standard requires that any language that binds to its API needs to support 14 significant characters in an external identifier.
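Whether a change of linkage is benign can be checked mechanically. The following sketch is an illustration under stated assumptions: it models a translator or linker that simply ignores characters after the first nsig, and the names same_within_limit and collides_with_existing are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if a tool that examines only the first nsig characters would
 * treat the two spellings as the same identifier. */
bool same_within_limit(const char *a, const char *b, size_t nsig)
{
    return strncmp(a, b, nsig) == 0;
}

/* True if name, once given external linkage, would be indistinguishable
 * from one of the existing external names under a limit of nsig. */
bool collides_with_existing(const char *name,
                            const char *const *externals, size_t count,
                            size_t nsig)
{
    for (size_t i = 0; i < count; i++)
        if (same_within_limit(name, externals[i], nsig))
            return true;
    return false;
}
```

For example, buffer_size_max and buffer_size_min are distinct under the C99 minimum of 31 significant characters for external identifiers, but collide under the C90 minimum of 6.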
Coding Guidelines
While the C90 minimum limits for the number of significant characters in an identifier might be considered unacceptable by many developers, the C99 limits are sufficiently generous that few developers are likely to complain.
Automatically generated C source sometimes relies on a large number of significant characters in an identifier. This can occur because of the desire to simplify the implementation of the generator. Character sequences in different offsets within an identifier might be reserved for different purposes. A predefined default character sequence is used to pad the identifier spelling where necessary.
As the following example shows, it is possible for a program's behavior to change, both when the number of significant characters is increased and when it is decreased.

    /*
     * Yes, C99 does specify 63 significant characters in an internal
     * identifier. But to keep this example within the page width
     * we have taken some liberties.
     */

    extern float _________1_________2_________3___bb;

    void f(void)
    {
    int _________1_________2_________3___ba;

    /*
     * If there are 34 significant characters, the following operand
     * will resolve to the locally declared object.
     *
     * If there are 35 significant characters, the following operand
     * will resolve to the globally declared object.
     */
    _________1_________2_________3___bb++;
    }

    void g(void)
    {
    int _________1_________2_________3___aa;

    /*
     * If there are 34 significant characters, the following operand
     * will resolve to the globally declared object.
     *
     * If there are 33 significant characters, the following operand
     * will resolve to the locally declared object.
     */
    _________1_________2_________3___bb++;
    }

The following issues need to be addressed:
• All references to the same identifier should use the same character sequence; that is, all characters are intended to be significant.
  References to the same identifiers that differ in nonsignificant characters need to be treated as faults.
• Within how many significant characters should different identifiers differ? Should identifiers be required to differ within the minimum number of significant characters specified by the standard, or can a greater number of characters be considered significant?

Readers do not always carefully check all characters in the spelling of an identifier. The contribution made by characters occurring in different parts of an identifier will depend on the pattern of eye movements employed
by readers, which in turn may be affected by their reasons for reading the source, plus cultural factors (e.g., direction in which they read text in their native language, or the significance of word endings in their native language). [770 reading kinds of] Characters occurring at both ends of an identifier are used by readers (at least native English- and French-speaking ones) when quickly scanning text. [792 identifiers Greek readers; 770 word reading individual]

Figure 806.1: Occurrence of unique identifiers whose significant characters match those of a different identifier (as a percentage of all unique identifiers in a program), for various numbers of significant characters. Based on the visible form of the .c files. [Plot not reproduced. Horizontal axis: significant characters (6 to 50); vertical axis: % identical matches (logarithmic, 0.001 to 100); series: gcc, idsoftware, linux, mozilla.]

Cg 806.1 When performing similarity checks on identifiers, all characters shall be considered significant.

807 Any identifiers that differ in a significant character are different identifiers.

Commentary
In many cases different identifiers also denote different entities. In some cases they denote the same entity (e.g., two different typedef names that are synonyms for the type int).

Other Languages
This statement is common to all languages (but it does not always mean that they necessarily denote different entities).

Coding Guidelines
Identifiers that differ in a single significant character may be considered to be
• different identifiers by a translator, but considered to be the same identifier by some readers of the source (because they fail to notice the difference).
• the same identifiers by a translator (because the difference occurs in a nonsignificant character), but considered to be different identifiers by some readers of the source (because they treat all characters as being significant).
• identifiers by both a translator and some readers of the source.

The possible reasons for readers making mistakes are discussed elsewhere, as are the guideline recommendations for reducing the probability that these developer mistakes become program faults. [0 developer errors; 792 identifier filtering spellings]

Example
    extern int e1;
    extern long el;
    extern int a_longer_more_meaningful_name;
    extern int a_longer_more_meeningful_name;
    extern int a_meaningful_more_longer_name;

808 If two identifiers differ only in nonsignificant characters, the behavior is undefined.

Commentary
While the obvious implementation strategy is to ignore the nonsignificant characters, the standard does not require implementations to use this strategy. To speed up identifier lookup many implementations use a hashed symbol table; the hash value for each identifier is computed from the sequence of characters it contains. Computing this hash value as the characters are read in, to form an identifier, saves a second pass over those same characters later. If nonsignificant characters were included in the original computed hash value, a subsequent occurrence of that identifier in the source, differing in nonsignificant characters, would result in a different hash value being calculated and a strong likelihood that the hash table lookup would fail.
Developers generally expect implementations to ignore nonsignificant characters. An implementation that behaved differently because identifiers differed in nonsignificant characters might not be regarded as being very user friendly. Highlighting misspellings that occur in nonsignificant characters is not always seen in a positive light by some developers.

C++
In C++ all characters are significant, thus this statement does not apply in C++.

Other Languages
Some languages specify that nonsignificant characters are ignored and have no effect on the program, while others are silent on the subject.

Common Implementations
Most implementations simply ignore nonsignificant characters. They play no part in identifier lookup in symbol tables.

Coding Guidelines
The coding guideline issues relating to the number of characters in an identifier that should be considered significant are discussed elsewhere. [792 identifier guideline significant characters]
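The hashing point made in the commentary on sentence 808 can be sketched as follows. This is an illustration, not the hash function of any particular translator: the name ident_hash is hypothetical and a simple multiply-and-add hash is assumed for brevity. The loop stops at the significant-character limit, so characters beyond it never influence the value.

```c
#include <stddef.h>

/* Hash of at most the first nsig characters of spelling. Characters
 * past the limit are never read, so two spellings that differ only
 * in nonsignificant characters hash to the same value. */
unsigned long ident_hash(const char *spelling, size_t nsig)
{
    unsigned long h = 5381; /* arbitrary starting value */

    for (size_t i = 0; i < nsig && spelling[i] != '\0'; i++)
        h = h * 33 + (unsigned char)spelling[i];
    return h;
}
```

Had the loop run over the whole spelling instead, buffer_size_max and buffer_size_min would produce different values even under a limit of 13 significant characters, and a lookup keyed on the hash would be likely to miss the existing entry.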
809 Forward references: universal character names (6.4.3), macro replacement (6.10.3).

6.4.2.2 Predefined identifiers

Semantics

810 The identifier __func__ shall be implicitly declared by the translator as if, immediately following the opening brace of each function definition, the declaration

    static const char __func__[] = "function-name";

appeared, where function-name is the name of the lexically-enclosing function.61) [__func__]

Commentary
Implicitly declaring __func__ immediately after the opening brace in a function definition means that the first, developer-written declaration within that function can access it. Giving __func__ static storage duration enables its address to be referred to outside the lifetime of the function that contains it (e.g., enabling a call history to be displayed at some later stage of program execution). This is not a storage overhead because space needs to be allocated for the string literal denoted by __func__. The const qualifier ensures
that any attempts to modify the value cause undefined behavior. The identifier __func__ has an array type, and is not a string literal, so the string concatenation that occurs in translation phase 6 is not applicable. [135 translation phase 6]
This identifier is useful for providing execution trace information during program testing. Developers who make use of UCNs may need to ensure that the library they use supports the character output required by them:

    #include <stdio.h>

    void \u30CE(void)
    {
    printf("Just entered %s\n", __func__);
    }

The issue of wide characters in identifiers is discussed elsewhere. [798 identifier multibyte character in]
Which function name is used when a function definition contains the inline function specifier? In:

    #include <stdio.h>

    /* static added so the example links even when the call is not inlined */
    static inline void f(void)
    {
    printf("We are in %s\n", __func__);
    }

    int main(void)
    {
    f();
    printf("We are in %s\n", __func__);
    }

the name of the function f is output, even if that function is inlined into main.

C90
Support for the identifier __func__ is new in C99.

C++
Support for the identifier __func__ is new in C99 and is not available in the C++ Standard.

Common Implementations
A translator only needs to declare __func__ if a reference to it occurs within a function. An obvious storage saving optimization is to delay any declaration until such time as it is known to be required. Another optimization is for the storage allocated for __func__ to exactly overlay that allocated to the string literal. Allocating storage for a string literal and copying the characters to the separately allocated object it initializes is not necessary when that object is defined using the const qualifier.
gcc also supports the built-in form __FUNCTION__.

Example
Debugging code in functions can provide useful information. But when there are lots of functions, the quantity of useless information can be overwhelming.
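As a minimal sketch of such unconditional debugging code, the following records the name of every function entered. The names call_history, record_call, TRACE_ENTER, parse_input and run are all hypothetical. Only pointers are stored; they remain usable after the functions return because __func__ has static storage duration.

```c
#include <stddef.h>

#define HISTORY_MAX 16

/* Pointers to the __func__ arrays of the functions entered so far. */
static const char *call_history[HISTORY_MAX];
static size_t call_count;

/* Remember the caller's name; the characters are not copied. */
static void record_call(const char *name)
{
    if (call_count < HISTORY_MAX)
        call_history[call_count++] = name;
}

/* Expands inside a function body, so __func__ names that function. */
#define TRACE_ENTER() record_call(__func__)

void parse_input(void)
{
    TRACE_ENTER();
}

void run(void)
{
    TRACE_ENTER();
    parse_input();
}
```

After a call to run, call_history holds "run" followed by "parse_input". Every traced function is recorded, which is exactly the flood of information the rest of this example sets out to control.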
Controlling which functions are to output debugging information by using conditional compilation requires that code be edited and the program rebuilt. The names of functions can be used to dynamically control which functions are to output debugging information. This control not only reduces the amount of information output, but can also reduce execution time by orders of magnitude (output can be a resource-intense operation).

flookup.h

    #include <stddef.h> /* NULL, used by D_func_trace */

    typedef struct f__rec {
                   const char *func_name;
                   _Bool enabled;
                   struct f__rec *next;
                   } func__list;

    /* The address of the caller's pointer is passed so that it can be set. */
    extern _Bool func_lookup(func__list **, const char *);

    /*
     * Use the name of the function to control whether debugging is
     * switched on/off. func_lookup is only called the first time this code
     * is executed, thereafter the value f___l->enabled can be used.
     */
    #define D_func_trace(func_name, code) { \
            static func__list * f___l = NULL; \
            if (f___l ? f___l->enabled : func_lookup(&f___l, func_name)) \
               {code} \
            }

flookup.c

    #include <stdbool.h>
    #include <stddef.h>

    #include "flookup.h"

    /*
     * A fixed list of functions and their debug mode.
     * We could be more clever and make this a list which
     * could be added to as a program executes.
     */
    static struct {
                  const char *func_name;
                  _Bool enabled;
                  func__list *traces_seen;
                  } lookup_table[] = {
                                     {"abc", true, NULL},
                                     {NULL, false, NULL}
                                     };

    _Bool func_lookup(func__list **f_list, const char *f_name)
    {
    /*
     * Loop through lookup_table looking for a match against f_name.
     * If a match is found, add f_list to the traces_seen list and
     * return the value of enabled for that entry.
     */
    return false; /* placeholder until the lookup described above is written */
    }

    void change_enabled_setting(const char *f_name, _Bool new_enabled)
    {
    /*
     * Loop through lookup_table looking for a match against f_name.
     * If a match is found, loop over its traces_seen list setting
     * the enabled flag to new_enabled.
     *
     * This function can switch on/off the debugging output from
     * any registered function.
     */
    }

811 This name is encoded as if the implicit declaration had been written in the source character set and then translated into the execution character set as indicated in translation phase 5.

Commentary
[133 translation phase 5] Having the name appearing as if in translation phase 5 avoids any potential issues caused by macro names
  12. 814 6.4.2.2 Predefined identifiers defined with the spelling of keywords or the name _ _func_ _. It also enables a translator to have an identifier name and type predefined internally, ready to be used when this reserved identifier is encountered. Translation phase 5 is also where characters get converted to their corresponding members in the execution character set, an essential requirement for spelling a function name. In many implementations the function name written to program 141 image the object file, or program image, is different from the one appearing in the source. This translation phase 5 requirement ensures that it is not any modified name that is used. Example 1 #include 2 3 #define __func__ __CNUF__ 4 #define __CNUF__ "g" 5 6 void f(void) 7 { 8 /* 9 * The implicit declaration does not appear until after preprocessing. 10 * So there is no declaration ’static const char __func__[] = "f";’ 11 * visible to the preprocessor (which would result in __func__ being 12 * mapped to __CNUF__ and "f" rather than "g" being output). 13 */ 14 printf("Name of function is %s\n", __CNUF__); 15 } EXAMPLE Consider the code fragment: 812 #include void myfunc(void) { printf("s\n", __func__); /* ... */ } Each time the function is called, it will print to the standard output stream: myfunc Commentary This assumes that the standard output stream is not closed (in which case the behavior would be undefined). Forward references: function definitions (6.9.1). 813 footnote 61) Since the name _ _func_ _ is reserved for any use by the implementation (7.1.3), if any other identifier is 814 61 explicitly declared using the name _ _func_ _, the behavior is undefined. Commentary The name is reserved because it begins with two underscores. The fact that the standard defines an interpreta- tion for this name in the identifier name space in block scope does not give any license to the developer to use it in other name spaces or at file scope. 
This name is still reserved for use in other name spaces and scopes.

C90

Names beginning with two underscores were specified as reserved for any use by the C90 Standard. The following program is likely to behave differently when translated and executed by a C99 implementation.

    #include <stdio.h>

    int main(void)
    {
        int __func__ = 1;

        printf("%d\n", __func__);
    }

C++

Names beginning with __ are reserved for use by a C++ implementation. This leaves the way open for a C++ implementation to use this name for some purpose.

6.4.3 Universal character names

815
    universal-character-name:
            \u hex-quad
            \U hex-quad hex-quad
    hex-quad:
            hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

Commentary

It is intended that this syntax notation not be visible to the developer when reading or writing source code that contains instances of this construct. That is, a universal-character-name aware editor displays the ISO 10646 glyph representing the numeric value specified by the hex-quad sequence. Without such editor support, the whole rationale for adding these characters to C (allowing developers to read and write identifiers in their own language) is voided.

C90

Support for this syntactic category is new in C99.

Other Languages

Java calls this lexical construct a UnicodeInputCharacter (and does not support the \U form, only the \u one).

Coding Guidelines

It is difficult to imagine developers regularly using UCNs with an editor that does not display UCNs in some graphical form. A guideline recommending the use of such an editor would not be telling developers anything they did not already know.

A number of theories about how people recognize words have been proposed. One of the major issues yet to be resolved is the extent to which readers make use of whole word recognition versus mapping character sequences to sound (phonological coding). Support for UCNs increases the possibility that developers will encounter unfamiliar characters in source code. The issue of developer performance in handling unfamiliar characters is discussed elsewhere.
Example

    #define foo(x)

    void f(void)
    {
        foo("\\u0123") /* Does not contain a UCN. */
        foo(\\u0123); /* Does contain a UCN. */
    }

Constraints

816 A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.62)

Commentary

The ISO 10646 Standard defines the ranges 00 through 1F, and 7F through 9F, as the 8-bit control codes (what it calls C0 and C1). Most of the UCNs with values less than 00A0 represent characters in the basic source character set. The exceptions listed enumerate characters that are in the ASCII character set, but not in the basic source character set. The ranges D800 through DBFF and DC00 through DFFF are known as the surrogate ranges. The purpose of these ranges is to allow representation of rare characters in future versions of the Unicode standard.

This constraint means that source files cannot contain the UCN equivalent for any members of the basic source character set.

Rationale

    UCNs are not permitted to designate characters from the basic source character set in order to permit fast compilation times for C programs. For some real world programs, compilers spend a significant amount of time merely scanning for the characters that end a quoted string, or end a comment, or end some other token. Although, it is trivial for such loops in a compiler to be able to recognize UCNs, this can result in a surprising amount of overhead.

    A UCN is constrained not to specify a character short identifier in the range 0000 through 0020 or 007F through 009F inclusive for the same reason: this avoids allowing a UCN to designate the newline character. Since different implementations use different control characters or sequences of control characters to represent newline, UCNs are prohibited from representing any control character.
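The constraint in sentence 816 can be expressed as a small predicate. The following is a minimal sketch, not code from the standard or from any translator: the function name ucn_value_permitted is hypothetical, and only the ranges named in the constraint are checked (no upper bound on the value is tested).

```c
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if a short identifier value may be
 * written as a UCN under the constraint quoted in sentence 816.
 */
bool ucn_value_permitted(unsigned long value)
{
    /* The surrogate ranges D800 through DFFF are never permitted. */
    if (value >= 0xD800UL && value <= 0xDFFFUL)
        return false;

    /* Below 00A0 only $, @ and ` may be specified. */
    if (value < 0x00A0UL)
        return value == 0x0024UL || value == 0x0040UL || value == 0x0060UL;

    return true;
}
```

Under these assumptions the values 0069, 006E, and 0074 (the letters i, n, and t) are rejected, as are the control codes and the surrogates, while 0024 ($) and any value from 00A0 up to D7FF are accepted.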
C++

    If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F–0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed. (2.2p2)

The range of hexadecimal values that are not permitted in C++ is a subset of those that are not permitted in C. This means that source which has been accepted by a conforming C translator will also be accepted by a conforming C++ translator, but not the other way around.

Other Languages

Java has no such restrictions on the hexadecimal values.

Common Implementations

Support for UCNs is new in C99. It remains to be seen whether translator vendors decide to support any UCN hexadecimal value as an extension.

Example

    \u0069\u006E\u0074 glob; /* Constraint violation. */

Description
817 Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.

Commentary

UCNs may also appear in comments. However, comments do not have a lexical structure to them. Inside a comment, character sequences starting with \u are not treated as UCNs by a translator, although other tools may choose to do so in this context.

The mapping of UCNs in character constants and string literals to the execution character set occurs in translation phase 5.

The constraint on the range of values that a UCN may take prevents them from being used to represent keywords.

C++

The C++ Standard also supports the use of universal character names in these contexts, but does not say in words what it specifies in the syntax (although 2.2p2 comes close for identifiers).

Other Languages

In Java, UnicodeInputCharacters can represent any character and are mapped in lexical translation step 1. It is possible for every character in the source to appear in this form. The mapping only occurs once, so \u005cu005a becomes \u005a, not Z (005c is the Unicode value for \ and 005a is the Unicode value for Z).

Coding Guidelines

UCNs in character constants and string literals are used to represent characters that are output when a program is executed, or in identifiers to provide more readable source code. In the former case it is possible that UCNs from different natural languages will need to be represented. In the latter case it might be surprising if source code contained UCNs from different languages. This usage is a complex one involving issues outside of these coding guidelines (e.g., configuration management and customer requirements) and your author has insufficient experience to know whether any guideline recommendations might be worthwhile.
Some of the coding guideline issues relating to the use of characters outside of the basic execution character set are discussed elsewhere.

Example

    #include <wchar.h>

    int \u0386\u0401;
    wchar_t *hello = L"\u05B0\u0901";

Semantics

818 The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn.63)

Commentary

The standard specifies how UCNs are represented in source code. A development environment may choose to provide developers with a visible representation of the UCN that matches the glyph with the corresponding numeric value in ISO 10646.

The ISO 10646 BNF syntax for short identifiers is:

    { U | u } [ {+}(xxxx | xxxxx | xxxxxx) | {-}xxxxxxxx ]

where x represents a hexadecimal digit.
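To make the relationship between the source spelling of a UCN and its short identifier concrete, the following sketch reads a \u or \U spelling (as given by the syntax in 6.4.3) and produces the numeric value. It is an illustration only, not how any particular translator works; the names hex_digit_value and parse_ucn are hypothetical, and no constraint checking is performed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of a hexadecimal digit, or -1 if c is not one. */
int hex_digit_value(char c)
{
    static const char digits[] = "0123456789abcdef";
    const char *p;

    if (c == '\0')
        return -1;
    p = strchr(digits, tolower((unsigned char)c));
    return p ? (int)(p - digits) : -1;
}

/*
 * Hypothetical helper: if s starts with \u followed by one hex-quad, or \U
 * followed by two hex-quads, store the short identifier in *value and return
 * the number of characters consumed (6 or 10); otherwise return 0.
 */
size_t parse_ucn(const char *s, unsigned long *value)
{
    size_t digits, i;
    unsigned long v = 0;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        digits = 4;
    else if (s[1] == 'U')
        digits = 8;
    else
        return 0;

    for (i = 0; i < digits; i++) {
        int d = hex_digit_value(s[2 + i]);

        if (d < 0)
            return 0; /* Too few hexadecimal digits. */
        v = v * 16 + (unsigned long)d;
    }
    *value = v;
    return 2 + digits;
}
```

With this sketch the spellings \u0123 and \U00000123 both yield the value 0x0123, which matches the statement that the four-digit short identifier nnnn corresponds to the eight-digit short identifier 0000nnnn.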
Other Languages

Java does not support eight-digit universal character names.

Coding Guidelines

This form of UCN counts toward a greater number of significant characters in identifiers with external linkage and therefore is not the preferred representation. However, the developer may not have any control over the method used by an editor to represent UCNs. Given that characters from the majority of human languages can be represented using four-digit short identifiers, eight-digit short identifiers are not likely to be needed. If the development environment offers a choice of representations, use of four-digit short identifiers is likely to result in more significant characters being retained in identifiers having external linkage.

819 Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).

Commentary

It was possible to represent all of the characters specified by versions 1 and 2 of the Unicode-sponsored character set using four-digit short identifiers. Version 3 introduced characters whose representation value requires more than four digits.

Other Languages

Java only supports this form of four-digit universal character names.

820 62) The disallowed characters are the characters in the basic character set and the code positions reserved by ISO/IEC 10646 for control characters, the character DELETE, and the S-zone (reserved for use by UTF-16).

Commentary

Requiring that characters in the basic character set not be represented using UCN notation helps guarantee that existing tools (e.g., editors) continue to be able to process source files.

The control characters may have special meaning for some tools that process source files (e.g., a communications program used for sending source down a serial link).
C++

The C++ Standard does not make this observation.

821 63) Short identifiers for characters were first specified in ISO/IEC 10646-1/AMD9:1997.

Commentary

This amendment appeared eight years after the first publication of the C Standard (which was made by ANSI in 1989).

6.4.4 Constants

822
    constant:
            integer-constant
            floating-constant
            enumeration-constant
            character-constant

Commentary

A constant differs from a constant-expression in that it consists of a single token. The term literal is often used by developers to refer to what the C Standard calls a constant (technically the only literals C contains are string literals). There is a more general usage of the term constant to mean something whose value does not change. What the C Standard calls a constant-expression developers often shorten to constant.
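The single-token distinction can be seen in a few declarations. This is a small sketch with hypothetical names (ROWS, COLUMNS, CELLS, LOWEST, grid, grid_cells, lowest); it only illustrates the terminology and is not taken from the standard.

```c
#include <stddef.h>

enum { ROWS = 6 };             /* ROWS is an enumeration-constant: one token.   */
#define COLUMNS 7              /* 7 is an integer-constant: one token.          */
#define CELLS (ROWS * COLUMNS) /* A constant-expression: several tokens.        */
#define LOWEST (-1)            /* Also a constant-expression: the unary minus   */
                               /* operator applied to the integer-constant 1.   */

int grid[CELLS];               /* An array size can be given by any integer     */
                               /* constant-expression, not only by a constant.  */

size_t grid_cells(void)
{
    return sizeof grid / sizeof grid[0];
}

int lowest(void)
{
    return LOWEST;
}
```

Both CELLS and LOWEST are evaluated by the translator, which is why developers tend to call them constants even though, in the C Standard's terminology, neither is one.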
C++

    21) The term “literal” generally designates, in this International Standard, those tokens that are called “constants” in ISO C. (Footnote 21)

The C++ Standard also includes string-literal and boolean-literal in the list of literals, but it does not include enumeration constants in the list of literals. However:

    The identifiers in an enumerator-list are declared as constants, and can appear wherever constants are required. (7.2p1)

The C++ terminology more closely follows common developer terminology by using literal (a single token) and constant (a sequence of operators and literals whose value can be evaluated at translation time). The value of a literal is explicit in the sequence of characters making up its token. A constant may be made up of more than one token or be an identifier. The operands in a constant have to be evaluated by the translator to obtain its result value. C uses the more easily confused terminology of integer-constant (a single token) and constant-expression (a sequence of operators, integer-constant and floating-constant whose value can be evaluated at translation time).

Other Languages

Languages that support types not supported by C (for instance, sets) sometimes allow constants having these types to be specified (e.g., in Pascal ['a', 'd'] represents a set containing two characters). Fortran supports complex literal constants (e.g., (1.0, 2.0) represents the complex number 1.0 + 2.0i).

Many languages (e.g., Java until version 1.5) do not support any form of enumeration-constant.

Coding Guidelines

Constants are the mechanism by which numeric values are written into source code. The term constant is used because the numeric values do not change during program execution and are known at translation time. In some cases a person reading the source may only know that the value used will be one of a list of possible values, because the definition of a macro may be conditional on the setting of some translation time option (for instance, -D).

The use of constants in source code creates a number of possible maintenance issues, including:

• A constant value, representing some quantity, often needs to occur in multiple locations within source code. Searching for and replacing all occurrences of a particular numeric value in the code is an error-prone process. It is not possible, for instance, to know that all 15s occurring in the source code have the same semantic association, and some may need to remain unchanged. (Your author was once told by a developer, whose source contained lots of 15s, that the UK government would never change value-added tax from 15%; a few years later it changed to 17.5%.)

• On encountering a constant in the source, a reader usually needs to deduce its semantic association (either in the application domain or its internal algorithmic function). While its semantics may be very familiar to the author of the source, the association between value and semantics may not be so readily made by later readers.

• A cognitive switch may need to be made because of the representation used for the constant (e.g., floating point, hexadecimal integer, or character constant).

One solution to these problems is to use an identifier to give a symbolic name [822.1] to the constant, and to use that symbolic name wherever the constant would have appeared in the source. Changes to the value of the constant can then be made by a single modification to the definition of the identifier, and a well-chosen name can help readers make the appropriate semantic association.

[822.1] In some cases the linguistically more correct terminology would be iconic name.

The creation of a symbolic name provides two pieces of information:

1. The property represented by that symbolic name. For instance, the maximum value of a particular type (INT_MAX), whether an implementation supports some feature (__STDC_IEC_559__), a means of specifying some operation (SEEK_SET), or a way to obtain information (FE_OVERFLOW).

2. A method of operating on the symbolic name to access the property it represents. For instance, arithmetic operations (INT_MAX), testing in a conditional preprocessing directive (__STDC_IEC_559__), passing as an argument to a library function (SEEK_SET), or passing as an argument to a library function, possibly in combination with other symbolic names (FE_OVERFLOW).

Operating on symbolic names involves making use of representation information. (Assignment, or argument passing, is the only time that representation might not be an issue.) The extent to which the use of representation information will be considered acceptable will depend on the symbolic name. For instance, FE_OVERFLOW appearing as the operand of a bitwise operator is to be expected, but its appearance as the operand of an arithmetic operator would be suspicious.

Developers rarely see the use of symbolic names as applying to all constants that occur in source code. The following claims are sometimes made:

• The constants are sufficiently well known that there is no need to give them a name.

• The number of occurrences of particular constants is not sufficient to warrant creating a name for them.

• Operations involving some constant values occur so frequently that their semantic associations are obvious to developers; for instance, assigning 0 or adding 1.

It is true that not all numeric values are meaningless to everybody. A few values are likely to be universally known (at least to Earth-based developers). For instance, there are 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day. The value 24 occurring in an expression involving time is likely to represent hours in a day. Many values will only be well known to developers working within a given application domain, such as atomic physics (e.g., the value 6.6261E-34). Between these extremes are other values; for instance, 3.14159 will be instantly recognized by developers with a mathematics background. However, developers without this background may need to think about what it represents. There is the possibility that developers who have grown up surrounded by other mathematically oriented people will be completely unaware that others do not recognize the obvious semantic association for this value.

A constant having a particular semantic association may only occur once in the source. However, the issue is not how many times a constant having a particular semantic association occurs, but how many times the particular constant value occurs. The same constant value can appear because of different semantic associations. A search for a sequence of digits (a constant value) will locate all occurrences, irrespective of semantic association.

While an argument can always be made for certain values being sufficiently well known that there is no benefit in replacing them by identifiers, the time taken in discussions on which values are sufficiently well known to stand on their own, instead of an identifier, is likely to be significantly greater than the sum total of all the extra seconds taken to type the identifier.

The constant values 0 and 1 occur very frequently in source code (see Figure 825.1). Experience suggests that the semantic associations tend to be that of assigning an initial value in the case of 0, and accessing a preceding or following item in the case of 1. The coding guideline issues are discussed in the subsections that deal with the different kinds of constants (e.g., integer, or floating).
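The symbolic-name approach described above might look as follows. This is only a sketch: all of the names (SECONDS_PER_MINUTE, MINUTES_PER_HOUR, HOURS_PER_DAY, VAT_RATE_PER_MILLE, seconds_in_days, vat_due_pence) are hypothetical, and the tax rate is expressed in tenths of a percent purely to keep the arithmetic in integers.

```c
/* Hypothetical symbolic names for widely known time values. */
#define SECONDS_PER_MINUTE 60
#define MINUTES_PER_HOUR   60
#define HOURS_PER_DAY      24

/*
 * Hypothetical symbolic name for a value-added tax rate of 17.5%, held in
 * tenths of a percent. A change of rate needs one edit, here, rather than
 * a search for every 15 (or 175) in the source.
 */
#define VAT_RATE_PER_MILLE 175

long seconds_in_days(long days)
{
    /* The two 60s have different semantic associations; the names say which. */
    return days * HOURS_PER_DAY * MINUTES_PER_HOUR * SECONDS_PER_MINUTE;
}

long vat_due_pence(long net_pence)
{
    return net_pence * VAT_RATE_PER_MILLE / 1000;
}
```

Macro names are used here only because they are the most familiar form; whether a macro, an enumeration constant, or an initialized object is the better choice is a separate question.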
What form of definition should a symbolic name denoting a constant value have? Possibilities include the following:

• Macro names. These are seen by developers as being technically the same as constants in that they are replaced by the numeric value of the constant during translation (there can also be an unvoiced bias toward perceived efficiency here).

• Enumeration constants. The purpose of an enumerated type is to associate a list of constants with each other. This is not to say the definition of an enumerated type containing a single enumeration constant should not occur, but this usage would be unusual. Enumeration constants share the same unvoiced developer bias as macro names: perceived efficiency.

• Objects initialized with the constant. This approach is advocated by some coding guideline documents for C++. The extent to which this is because an object declared with the const qualifier really is constant and a translator need not allocate storage for it, or because use of the preprocessor (often called the C preprocessor, as if it were not also in C++) is frowned on in the C++ community, is left to the reader to decide.

The enumeration constant versus macro name issue is discussed in detail elsewhere.

What name to choose? The constant 6.6261E-34 illustrates another pitfall. Planck’s constant is almost universally represented, within the physics community, using the letter h (a closely related constant is ħ, the reduced Planck constant). A developer might be tempted to make use of this idiom to name the value, perhaps even trying to find a way of using UCNs to obtain the appropriate h. The single letter h probably gives no more information than the value. The name PLANCK_CONSTANT is self-evident. The developer attitude that anybody who does not know what 6.6261E-34 represents has no business reading the source is not very productive or helpful.

Table 822.1: Occurrence of different kinds of constants (as a percentage of all tokens). Based on the visible form of the .c and .h files.

    Kind of Constant      .c files   .h files
    character-constant      0.16       0.06
    integer-constant        6.70      20.79
    floating-constant       0.02       0.20
    string-literal          1.02       0.74

Constraints

823 The value of a constant shall be in the range of representable values for its type.

Commentary

This is something of a circular definition in that a constant’s value is also used to determine its type. The lexical form of a constant is also a factor in determining which of a number of possible types it may take. An unsuffixed constant that is too large to be represented in the type long long, or a suffixed constant that is larger than the type with the greatest rank applicable to that suffix, violates this requirement (unless there is some extended integer type supported by the implementation into whose range the value falls).

It can be argued that all floating constants are in range if the implementation supports ±∞.

There is a similar constraint for enumeration constants.

C++

The C++ Standard has equivalent wording covering integer-literals (2.13.1p3), character-literals (2.13.2p3) and floating-literals (2.13.3p1). For enumeration-literals their type depends on the context in which the question is asked:

    Following the closing brace of an enum-specifier, each enumerator has the type of its enumeration. Prior to the closing brace, the type of each enumerator is the type of its initializing value. (7.2p4)

    The underlying type of an enumeration is an integral type that can represent all the enumerator values defined in the enumeration. (7.2p5)

Other Languages

Most languages have a similar requirement, even those supporting a single integer or floating type.

Common Implementations

Some implementations use the string-to-integer conversions provided by the library, while others prefer the flexibility (and fuller control of error recovery) afforded by specially written code. Parker[1074] describes the minimal functionality required.

Example

    char ch = '\0\0\0\0y';

    float f_1 = 1e99999999999999999999999999999999999999999999999;
    float f_2 = 0e99999999999999999999999999999999999999999999999;
    float f_3 = 1e-99999999999999999999999999999999999999999999999; /* Approximately zero. */
    float f_4 = 0e-99999999999999999999999999999999999999999999999; /* Exact zero. */

    short s_1 = 9999999999999999999999999999999999999999999999999;
    short s_2 = 99999999999999999999999 / 99999999999999999999999;

The integer constant 10000000000000000000L would violate this constraint on an implementation that represented the type long long in 64 bits. The use of an L suffix precludes the constant being given the type unsigned long long.

Semantics

824 Each constant has a type, determined by its form and value, as detailed later. Each constant shall have a type and the value of a constant shall be in the range of representable values for its type.

Commentary

Just as there are different floating and integer object types, the possible types that constants may have are not limited to a single type.

It is a constraint violation for a constant to occur during translation phase 7 without a type.

The requirement that a constant be in the range of representable values for its type is a requirement on the implementation.

The wording was changed by the response to DR #298.

C++

    The type of an integer literal depends on its form, value, and suffix. (2.13.1p2)

    The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. (2.13.3p1)

There are no similar statements for the other kinds of literals, although C++ does support suffixes on the floating types. However, the syntactic form of string literals, character literals, and boolean literals determines their type.
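How form and suffix select the type of a constant in C can be observed with a generic selection. This sketch assumes a C11 (or later) translator, since _Generic is not part of the C99 language discussed here; the enumeration type_code and the macro TYPE_CODE are hypothetical names introduced only for illustration.

```c
/* Hypothetical codes naming the types a constant might be given. */
enum type_code {
    TC_INT, TC_UINT, TC_LONG, TC_ULONG, TC_LLONG, TC_ULLONG,
    TC_FLOAT, TC_DOUBLE, TC_LDOUBLE, TC_OTHER
};

/* Map the type of an expression to one of the codes above (C11 _Generic). */
#define TYPE_CODE(x) _Generic((x),          \
    int:                TC_INT,             \
    unsigned int:       TC_UINT,            \
    long:               TC_LONG,            \
    unsigned long:      TC_ULONG,           \
    long long:          TC_LLONG,           \
    unsigned long long: TC_ULLONG,          \
    float:              TC_FLOAT,           \
    double:             TC_DOUBLE,          \
    long double:        TC_LDOUBLE,         \
    default:            TC_OTHER)

/* Examples of form and suffix determining the type. */
enum type_code code_of_1    = TYPE_CODE(1);    /* Unsuffixed decimal: int here. */
enum type_code code_of_1u   = TYPE_CODE(1U);   /* U suffix: unsigned int.       */
enum type_code code_of_1l   = TYPE_CODE(1L);   /* L suffix: long.               */
enum type_code code_of_1p0  = TYPE_CODE(1.0);  /* Unsuffixed floating: double.  */
enum type_code code_of_1p0f = TYPE_CODE(1.0f); /* f suffix: float.              */
enum type_code code_of_a    = TYPE_CODE('a');  /* Character constant: int in C. */
```

Only small values are used, because the type given to a large unsuffixed integer constant depends on the widths of the integer types in the implementation.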