The New C Standard- P9

Shared by: Thanh Cong | File type: PDF | Pages: 100


6.4.2.1 General

Coding Guidelines

The visual similarity of these letters is discussed elsewhere (sentence 792, character visual similarity).

795 There is no specific limit on the maximum length of an identifier.

Commentary

The standard does specify a minimum limit on the number of characters a translator must consider significant (sentences 282 and 283, internal and external identifier significant characters). Implementations are free to ignore characters once this limit is reached. The ignored characters do not form part of another token. It is as if they did not appear in the source at all.

C90

The C90 Standard does not explicitly state this fact.

Other Languages

Few languages place limits on the maximum length of an identifier that can appear in a source file. Like C, some specify a lower limit on the number of characters that must be considered significant.

Coding Guidelines

Using a large number of characters in an identifier spelling has many potential benefits; for instance, it provides the opportunity to supply a lot of information to readers, or to reduce dependencies on existing reader knowledge by spelling words in full rather than using abbreviations. There are also potential costs; for instance, long identifiers can cause visual layout problems in the source (requiring new-lines within an expression in an attempt to keep the maximum line length within the bounds that can be viewed within a fixed-width window), or increase the cognitive effort needed to visually scan source containing them.

The length of an identifier is not itself directly a coding guideline issue. However, length is indirectly involved in many identifier memorability, confusability, and usability issues, which are discussed elsewhere (sentence 792, identifier syntax).

Usage

The distribution of identifier lengths is given in Figure 792.7.
796 Each universal character name in an identifier shall designate a character whose encoding in ISO/IEC 10646 falls into one of the ranges specified in annex D.60)

Commentary

Using other UCNs results in undefined behavior (in some cases even using these UCNs can be a constraint violation; see sentence 816, UCNs not in the basic character set). These character encodings could be thought of as representing letters in the specified national character set.

C90

Support for universal character names is new in C99.

Other Languages

The ISO/IEC 10646 standard (sentence 28) is relatively new and languages are only just starting to include support for the characters it specifies. Java specifies a similar list of UCNs.

Common Implementations

A collating sequence may not be defined for these universal character names. In practice the lack of a defined collating sequence is not an implementation problem, because a translator only ever needs to compare the spelling of one identifier for equality with another identifier, which involves a simple character-by-character comparison (the issue of the ordering of diacritics is handled by not allowing them to occur in an identifier). Support for this functionality is new and the extent to which implementations are likely to check that UCN values fall within the list given in annex D is not known.

June 24, 2009 v 1.2
Coding Guidelines

The intended purpose for supporting universal character names in identifiers is to reduce the developer effort needed to comprehend source. Identifiers spelled in the developer's native tongue are more immediately recognizable (because of greater practice with those characters) and also have semantic associations that are more readily brought to mind.

The ISO 10646 Standard (sentence 28) does not specify which languages contain the characters it specifies (although it does give names to some sets of characters that correspond to a language that contains them). The written forms of some human languages share common characters; for instance, the characters a through z (and their uppercase forms) appear in many European orthographies (sentence 792, orthography).

The following discussion refers to using UCNs from more than one human language. This is to be taken to mean using UCNs that are not part of the written form of the native language of the developer (the case of developers having more than one native language is not considered). For instance, the character a is used in both Swedish and German; the character û is used in Swedish, but not German; the character ß is used in German but not Swedish. Both Swedish and German developers would be familiar with the character a, but the character ß would be considered foreign to a Swedish developer, and the character û foreign to a German one.

Some coding guideline documents recommend against the use of UCNs. Their use within identifiers can increase the portability cost of the source. The use of UCNs is an economic issue; the potential cost of not permitting their use in identifiers needs to be compared against the potential portability benefits. (Alternatively, the benefits of using UCNs could be compared against the possible portability costs.)

Given the purpose of using UCNs, is there any rationale for identifiers to contain characters from more than one human language?
As an English speaker, your author can imagine a developer wanting to use an English word, or its common abbreviation, as a prefix or suffix to an identifier name. Perhaps an Urdu speaker can imagine a similar usage with Urdu words. The issue is whether the use of characters in the same identifier from different human languages has meaning to the developers who write and maintain the source.

Identifiers very rarely occur in isolation. Should all the identifiers in the same function, or even source file, only contain UCNs that form the set of characters used by a single human language? Using characters from different human languages, when it is possible to use only characters from a single language, potentially increases the cost of maintenance. Future maintainers are either going to have to be familiar with the orthography and semantics of the two human languages used or spend additional time processing instances of identifiers containing characters they are not familiar with.

However, in some cases it might not be possible to enforce a single human language rule. For instance, a third-party library may contain callable functions whose spellings use characters from a human language different from that used in the source code that contains calls to it.

Support for the use of UCNs in identifiers is new in C99 (and other computer languages) and at the time of this writing there is almost no practical experience available on the sort of mistakes that developers make with them.

797 The initial character shall not be a universal character name designating a digit.

Commentary

The terminal identifier-nondigit that appears in the syntax (sentence 792, identifier syntax) implies that the possible UCNs exclude the digit characters. Also the list given in annex D does not include the digit characters. This means that an identifier containing a UCN designating a digit in any position results in undefined behavior.

The syntax for constants (sentence 822, constant syntax) does not support the use of UCNs. This sentence, in the standard, reminds implementors that such usage could be supported in the future and that, while they may support UCN digits within an identifier, it would not be a good idea to support them as the initial character.
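The digit ranges listed in Table 797.1 can be turned into a simple membership test. The following is a minimal sketch, not a description of any particular translator: the function name `is_table_digit` and the table layout are assumptions made for illustration, and only the ranges themselves come from the table.

```c
#include <stdbool.h>
#include <stddef.h>

/* Inclusive code point ranges copied from Table 797.1. */
static const struct {
    unsigned long first;
    unsigned long last;
} digit_ranges[] = {
    {0x0030, 0x0039}, {0x0660, 0x0669}, {0x06F0, 0x06F9},
    {0x0966, 0x096F}, {0x09E6, 0x09EF}, {0x0A66, 0x0A6F},
    {0x0AE6, 0x0AEF}, {0x0B66, 0x0B6F}, {0x0BE7, 0x0BEF},
    {0x0C66, 0x0C6F}, {0x0CE6, 0x0CEF}, {0x0D66, 0x0D6F},
    {0x0E50, 0x0E59}, {0x0ED0, 0x0ED9}, {0xFF10, 0xFF19}
};

/* Hypothetical helper: true if cp falls inside one of the digit ranges. */
bool is_table_digit(unsigned long cp)
{
    for (size_t i = 0; i < sizeof digit_ranges / sizeof digit_ranges[0]; i++)
        if (cp >= digit_ranges[i].first && cp <= digit_ranges[i].last)
            return true;
    return false;
}
```

A translator wanting to diagnose the case covered by this sentence could apply such a test to the value of a UCN appearing as the first character of an identifier; for instance, `is_table_digit(0x0AE6)` is true (a Gujarati digit), while `is_table_digit(0x1F00)` is false.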
Table 797.1: The Unicode digit encodings.

    Encoding Range   Language
    0030–0039        ISO Latin-1
    0660–0669        Arabic–Indic
    06F0–06F9        Eastern Arabic–Indic
    0966–096F        Devanagari
    09E6–09EF        Bengali
    0A66–0A6F        Gurmukhi
    0AE6–0AEF        Gujarati
    0B66–0B6F        Oriya
    0BE7–0BEF        Tamil (has no zero)
    0C66–0C6F        Telugu
    0CE6–0CEF        Kannada
    0D66–0D6F        Malayalam
    0E50–0E59        Thai
    0ED0–0ED9        Lao
    FF10–FF19        Fullwidth digits

C++

This requirement is implied by the terminal non-name used in the C++ syntax. Annex E of the C++ Standard does not list any UCN digits in the list of supported UCN encodings.

Other Languages

Java has a similar requirement.

Coding Guidelines

The extent to which different cultural conventions support the use of a digit as the first character in an identifier is not known to your author. At some future date the Committee may choose to support the writing of integer constants using UCNs. If this happens, any identifiers that start with a UCN designating a digit are liable to result in syntax violations. There does not appear to be a worthwhile benefit in a guideline recommendation dealing with the case of an identifier beginning with a UCN designating a digit.

Example

    int \u1f00\u0ae6;
    int \u0ae6;

798 An implementation may allow multibyte characters that are not part of the basic source character set to appear in identifiers;

Commentary

Prior to C99 there was no standardized method of representing nonbasic source character set characters in the source code. Support for multibyte characters in string literals and constants was specified in C90; some implementations extended this usage to cover identifiers. They are now officially sanctioned to do this. Support for the ISO 10646 Standard (sentence 28) is new in C99. However, there are a number of existing implementations that use a multibyte encoding scheme and this usage is likely to continue for many years.
The C committee recognized the importance of this usage and does not force developers to go down a UCN-only path.

The standard says nothing about the behavior of the __func__ reserved identifier (sentence 810) in the case when a function name is spelled using wide characters.

C90

This permission is new in C99.

C++

The C++ Standard does not explicitly contain this permission. However, translation phase 1 (sentence 116) performs an implementation-defined mapping of the source file characters, and an implementation may choose to support multibyte characters in identifiers via this route.
Other Languages

While other language standards may not mention multibyte characters, the problem they address is faced by implementations of those languages. For this reason, it is to be expected that some implementations of other languages will contain some form of support for multibyte characters.

Coding Guidelines

UCNs may be the preferred, C Standard, way of representing nonbasic character set characters in identifiers (sentence 815, universal character name syntax). However, developers are at the mercy of editor support for how they enter and view characters that are not in the basic source character set.

799 which characters and their correspondence to universal character names is implementation-defined.

Commentary

Various national bodies have defined standards for representing their national character sets in computer files. While ISO 10646 (sentence 28) is intended to provide a unified standard for all characters, it may be some time before existing software is converted to use it.

Common Implementations

It is common to find translators aimed at the Japanese market supporting JIS, shift-JIS, and EUC encodings (see Table 243.3). These encodings use different numeric values than those given in ISO 10646 to represent the same national character.

800 When preprocessing tokens are converted to tokens during translation phase 7, if a preprocessing token could be converted to either a keyword or an identifier, it is converted to a keyword.

Commentary

The Committee could have created a separate name space for keywords and allowed developers to define identifiers having the same spelling as a keyword. The complexity added to a translator by such a specification would be significant (based on implementation experience for languages that support this functionality), while a developer's inability to define identifiers having these spellings was considered a relatively small inconvenience.
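The keyword-first rule of sentence 800 can be sketched as a table lookup performed when a preprocessing token with the lexical form of an identifier is converted to a token. This is only an illustration under stated assumptions: the names `token_kind` and `classify_pp_identifier` are hypothetical, and the keyword table is deliberately abbreviated (a real translator would list every keyword and would probably use a faster lookup).

```c
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_KEYWORD };

/* A handful of C99 keywords; abbreviated for the purposes of the sketch. */
static const char *const keywords[] = {
    "int", "return", "while", "inline", "restrict", "_Bool"
};

/*
 * pp_token is assumed to already have the lexical form of an identifier.
 * If its spelling matches a keyword, the keyword interpretation wins.
 */
enum token_kind classify_pp_identifier(const char *pp_token)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(pp_token, keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENTIFIER;
}
```

Because the comparison is an exact match on the whole spelling, `while` is classified as a keyword but `whilst` and `While` remain identifiers.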
C90

This wording is a simplification of the convoluted logic needed in the C90 Standard to deduce from a constraint what C99 now says in semantics. The removal of this C90 constraint is not a change of behavior, since it was not possible to write a program that violated it.

C90 6.1.2, Constraints:

    In translation phase 7 and 8, an identifier shall not consist of the same sequence of characters as a keyword.

Other Languages

Some languages allow keywords to be used as variable names (e.g., PL/1), using the context to disambiguate intended use.

801 (footnote 60) On systems in which linkers cannot accept extended characters, an encoding of the universal character name may be used in forming valid external identifiers.

Commentary

This is really an implementation tip for translators. The standard defines behavior in terms of an abstract machine that produces external output. The tip given in this footnote does not affect the conformance status of an implementation that chooses to implement this functionality in another way. The only time such a mapping might be visible is through the use of a symbolic execution-time debugging tool, or by having to link against object files created by other translators.
C90

Extended characters (sentence 215) were not available in C90, so the suggestion in this footnote does not apply.

Other Languages

Issues involving third-party linkers are common to most language implementations that compile to machine code. Some languages, for instance Java, define the characteristics of an implementation at translation and execution time. The Java language specification goes to the extreme (compared to other languages) of specifying the format of the generated object code file.

Common Implementations

There is a long-standing convention of prefixing externally visible identifier names with an underscore character when information on them is written out to an object file. There is little experience available on implementation issues involving UCNs, but many existing linkers do assume that identifiers are encoded using 8-bit characters.

Coding Guidelines

The encoding of external identifiers only needs to be considered when interfacing to, or from, code written in another language. Cross-language interfacing is outside the scope of these coding guidelines.

802 For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal character name.

Commentary

Some linkers may not support an occurrence of the backslash (\) character in an identifier name. One solution to this problem is to create names that cannot be declared in the source code by the developer; for instance, by deleting the \ characters and prefixing the name with a digit character.

Common Implementations

There are no standards for encoding of universal character names in object files. The requirement to support this form of encoding is too new for it to be possible to say anything about common encodings.

803 Extended characters may produce a long external identifier.

Commentary

Here the word long does not have any special meaning. It simply suggests an identifier containing many characters (see sentence 282, internal identifier significant characters).
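The scheme mentioned in the commentary on sentence 802 (delete the \ characters and prefix the name with a digit) can be sketched as follows. This is one possible reading of that suggestion, not an encoding used by any known linker or translator; the function name `encode_external_name`, the choice of `0` as the prefix digit, and the error convention are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the identifier spelling src into dst, dropping every backslash.
 * If src contained a UCN (i.e., any backslash) the result is prefixed
 * with '0', giving a name no source identifier can have.
 * Returns 0 on success, -1 if dst_size is too small.
 */
int encode_external_name(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (strchr(src, '\\') != NULL) {
        if (out + 1 >= dst_size)      /* need room for '0' and '\0' */
            return -1;
        dst[out++] = '0';
    }
    for (; *src != '\0'; src++) {
        if (*src == '\\')
            continue;                 /* drop the UCN introducer */
        if (out + 1 >= dst_size)
            return -1;
        dst[out++] = *src;
    }
    if (out >= dst_size)
        return -1;
    dst[out] = '\0';
    return 0;
}
```

Under these assumptions the spelling `\u30CE` would be written to the object file as `0u30CE`, while an identifier containing no UCNs, such as `abc`, is passed through unchanged. Since no identifier in source code can begin with a digit, the encoded name cannot clash with a developer-declared one.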
Implementation limits

804 As discussed in 5.2.4.1, an implementation may limit the number of significant initial characters in an identifier;

Commentary

This subclause lists a number of minimum translation limits (sentence 276, translation limits).

C90

The C90 Standard does not contain this observation.

C++

C++ 2.10p1:

    All characters are significant.20)

C identifiers that differ after the last significant character will cause a diagnostic to be generated by a C++ translator. Annex B contains an informative list of possible implementation limits. However, ". . . these quantities are only guidelines and do not determine compliance."
805 the limit for an external name (an identifier that has external linkage) may be more restrictive than that for an internal name (a macro name or an identifier that does not have external linkage).

Commentary

External identifiers (sentence 283, external identifier significant characters) have to be processed by a linker, which may not be under the control of a vendor's C implementation. In theory, any tool that performs the linking process falls within the remit of the C Committee. However, the Committee recognized that, in practice, it is not always possible for translator vendors to supply their own linker. The limitations of existing linkers needed to be factored into the limits specified in the standard.

Internal identifiers (sentence 282, internal identifier significant characters) only need to be processed by the translator and the standard is in a strong position to specify the behavior.

Other Languages

Most other language implementations face similar problems with linkers as C does. However, not all language specifications explicitly deal with the issue (by specifying the behavior). The Java standard defines a complete environment that handles all external linkages.

Coding Guidelines

What are the costs associated with a change to the linkage of an identifier during program maintenance, from internal linkage to external linkage? (Experience shows that identifier linkage is rarely changed from external to internal.)

In most cases implementations support a sufficiently large number of significant characters in an external name that a change of identifier linkage makes no difference to its significant characters (i.e., the number of characters it contains falls inside the implementation limit; see sentence 792, identifier number of characters). In those cases where a change of identifier linkage results in some of its significant characters being ignored, the effect may be benign (there is no other identifier defined with external linkage whose name is the same as the truncated name) or result in undefined behavior (the program defines two identifiers with external linkage with the same name; see sentence 1818, external linkage exactly one external definition).

806 The number of significant characters in an identifier is implementation-defined.

Commentary

Subject to the minimum requirements specified in the standard (sentence 282, internal identifier significant characters).

C++

C++ 2.10p1:

    All characters are significant.20)

References to the same C identifier, which differs after the last significant character, will cause a diagnostic to be generated by a C++ translator. There is also an informative annex (Annex Bp2) which states:

    Number of initial characters in an internal identifier or a macro name [1024]
    Number of initial characters in an external identifier [1024]

Other Languages

Some languages require all characters in an identifier to be significant (e.g., Java, Snobol 4), while others don't (e.g., Cobol, Fortran).

Common Implementations

It is rare to find an implementation that does not meet the minimum limits specified in the standard. A few translators treat all characters in an identifier as significant. Most have a limit of between 256 and 2,000 significant characters. The POSIX standard requires that any language that binds to its API needs to support 14 significant characters in an external identifier.
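The truncation scenario described in the coding guidelines for sentence 805 can be checked mechanically. The sketch below, written under stated assumptions, counts pairs of distinct spellings that become indistinguishable when only the first `limit` characters are treated as significant; the function names and the idea of feeding it a list of external names are illustrative and not taken from any particular tool.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * True if a and b are different spellings that agree in their first
 * `limit` characters, i.e., a translator or linker that ignores the
 * remaining characters could not tell them apart.
 */
bool collide_when_truncated(const char *a, const char *b, size_t limit)
{
    return strcmp(a, b) != 0 && strncmp(a, b, limit) == 0;
}

/* Count the colliding pairs among names[0] .. names[count - 1]. */
size_t count_truncation_collisions(const char *const names[],
                                   size_t count, size_t limit)
{
    size_t collisions = 0;

    for (size_t i = 0; i < count; i++)
        for (size_t j = i + 1; j < count; j++)
            if (collide_when_truncated(names[i], names[j], limit))
                collisions++;
    return collisions;
}
```

Running such a check over the identifiers that are about to be given external linkage, with `limit` set to the number of significant characters guaranteed for external names, would show whether the change is benign or introduces two external names that may be treated as the same.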
Coding Guidelines

While the C90 minimum limits for the number of significant characters in an identifier might be considered unacceptable by many developers, the C99 limits are sufficiently generous that few developers are likely to complain.

Automatically generated C source sometimes relies on a large number of significant characters in an identifier. This can occur because of the desire to simplify the implementation of the generator. Character sequences at different offsets within an identifier might be reserved for different purposes. A predefined default character sequence is used to pad the identifier spelling where necessary.

As the following example shows, it is possible for a program's behavior to change, both when the number of significant characters is increased and when it is decreased.

    /*
     * Yes, C99 does specify 63 significant characters in an internal
     * identifier. But to keep this example within the page width
     * we have taken some liberties.
     */

    extern float _________1_________2_________3___bb;

    void f(void)
    {
        int _________1_________2_________3___ba;

        /*
         * If there are 34 significant characters, the following operand
         * will resolve to the locally declared object.
         *
         * If there are 35 significant characters, the following operand
         * will resolve to the globally declared object.
         */
        _________1_________2_________3___bb++;
    }

    void g(void)
    {
        int _________1_________2_________3___aa;

        /*
         * If there are 34 significant characters, the following operand
         * will resolve to the globally declared object.
         *
         * If there are 33 significant characters, the following operand
         * will resolve to the locally declared object.
         */
        _________1_________2_________3___bb++;
    }

The following issues need to be addressed:

• All references to the same identifier should use the same character sequence; that is, all characters are intended to be significant. References to the same identifiers that differ in nonsignificant characters need to be treated as faults.

• Within how many significant characters should different identifiers differ? Should identifiers be required to differ within the minimum number of significant characters specified by the standard, or can a greater number of characters be considered significant?

Readers do not always carefully check all characters in the spelling of an identifier. The contribution made by characters occurring in different parts of an identifier will depend on the pattern of eye movements employed by readers, which in turn may be affected by their reasons for reading the source (sentence 770, kinds of reading), plus cultural factors (e.g., direction in which they read text in their native language, or the significance of word endings in their native language; see sentence 792, identifiers Greek readers). Characters occurring at both ends of an identifier are used by readers (at least native English- and French-speaking ones) when quickly scanning text (sentence 770, word reading individual).

[Figure 806.1: log-scale plot of % identical matches (0.001 to 100) against number of significant characters (6 to 50), with one series each for gcc, idsoftware, linux, and mozilla.]

Figure 806.1: Occurrence of unique identifiers whose significant characters match those of a different identifier (as a percentage of all unique identifiers in a program), for various numbers of significant characters. Based on the visible form of the .c files.

Cg 806.1 When performing similarity checks on identifiers, all characters shall be considered significant.

807 Any identifiers that differ in a significant character are different identifiers.

Commentary

In many cases different identifiers also denote different entities. In some cases they denote the same entity (e.g., two different typedef names that are synonyms for the type int).

Other Languages

This statement is common to all languages (but it does not always mean that they necessarily denote different entities).

Coding Guidelines

Identifiers that differ in a single significant character may be considered to be

• different identifiers by a translator, but considered to be the same identifier by some readers of the source (because they fail to notice the difference).

• the same identifiers by a translator (because the difference occurs in a nonsignificant character), but considered to be different identifiers by some readers of the source (because they treat all characters as being significant).

• identifiers by both a translator and some readers of the source.

The possible reasons for readers making mistakes are discussed elsewhere (sentence 0, developer errors), as are the guideline recommendations for reducing the probability that these developer mistakes become program faults (sentence 792, identifier filtering spellings).

Example
    extern int e1;
    extern long el;
    extern int a_longer_more_meaningful_name;
    extern int a_longer_more_meeningful_name;
    extern int a_meaningful_more_longer_name;

808 If two identifiers differ only in nonsignificant characters, the behavior is undefined.

Commentary

While the obvious implementation strategy is to ignore the nonsignificant characters, the standard does not require implementations to use this strategy. To speed up identifier lookup many implementations use a hashed symbol table; the hash value for each identifier is computed from the sequence of characters it contains. Computing this hash value as the characters are read in, to form an identifier, saves a second pass over those same characters later. If nonsignificant characters were included in the original computed hash value, a subsequent occurrence of that identifier in the source, differing in nonsignificant characters, would result in a different hash value being calculated and a strong likelihood that the hash table lookup would fail.

Developers generally expect implementations to ignore nonsignificant characters. An implementation that behaved differently because identifiers differed in nonsignificant characters might not be regarded as being very user friendly. Highlighting misspellings that occur in nonsignificant characters is not always seen in a positive light by some developers.

C++

In C++ all characters are significant, thus this statement does not apply in C++.

Other Languages

Some languages specify that nonsignificant characters are ignored and have no effect on the program, while others are silent on the subject.

Common Implementations

Most implementations simply ignore nonsignificant characters. They play no part in identifier lookup in symbol tables.

Coding Guidelines

The coding guideline issues relating to the number of characters in an identifier that should be considered significant are discussed elsewhere (sentence 792, identifier guideline significant characters).
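The hashing point made in the commentary on sentence 808 can be illustrated with a hash that stops at the significant-character limit. The sketch below uses the 32-bit FNV-1a hash purely as a convenient, well-known function; the choice of hash, the name `hash_significant`, and the way the limit is passed in are assumptions for illustration, not a description of how any particular translator builds its symbol table.

```c
#include <stddef.h>

/*
 * 32-bit FNV-1a computed over at most `limit` leading characters of
 * name. Characters past the limit take no part in the result, so two
 * spellings that differ only in nonsignificant characters land in the
 * same hash bucket.
 */
unsigned long hash_significant(const char *name, size_t limit)
{
    unsigned long h = 2166136261UL;            /* FNV offset basis */

    for (size_t i = 0; i < limit && name[i] != '\0'; i++) {
        h ^= (unsigned char)name[i];
        h = (h * 16777619UL) & 0xFFFFFFFFUL;   /* FNV prime, kept to 32 bits */
    }
    return h;
}
```

With a limit of 13, the spellings `counter_total_a` and `counter_total_b` hash to the same value because they first differ at the fifteenth character. A hash computed over every character read would give them different values, which is the lookup failure the commentary describes.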
809 Forward references: universal character names (6.4.3), macro replacement (6.10.3).

6.4.2.2 Predefined identifiers

Semantics

810 The identifier __func__ shall be implicitly declared by the translator as if, immediately following the opening brace of each function definition, the declaration

    static const char __func__[] = "function-name";

appeared, where function-name is the name of the lexically-enclosing function.61)

Commentary

Implicitly declaring __func__ immediately after the opening brace in a function definition means that the first, developer-written declaration within that function can access it. Giving __func__ static storage duration enables its address to be referred to outside the lifetime of the function that contains it (e.g., enabling a call history to be displayed at some later stage of program execution). This is not a storage overhead because space needs to be allocated for the string literal denoted by __func__. The const qualifier ensures that any attempts to modify the value cause undefined behavior. The identifier __func__ has an array type, and is not a string literal, so the string concatenation that occurs in translation phase 6 (sentence 135) is not applicable.

This identifier is useful for providing execution trace information during program testing. Developers who make use of UCNs may need to ensure that the library they use supports the character output required by them:

    #include <stdio.h>

    void \u30CE(void)
    {
        printf("Just entered %s\n", __func__);
    }

The issue of wide characters in identifiers is discussed elsewhere (sentence 798, identifier multibyte character in).

Which function name is used when a function definition contains the inline function specifier? In:

    #include <stdio.h>

    /* static added so that the example links without a separate
       external definition of f */
    static inline void f(void)
    {
        printf("We are in %s\n", __func__);
    }

    int main(void)
    {
        f();
        printf("We are in %s\n", __func__);
    }

the name of the function f is output, even if that function is inlined into main.

C90

Support for the identifier __func__ is new in C99.

C++

Support for the identifier __func__ is new in C99 and is not available in the C++ Standard.

Common Implementations

A translator only needs to declare __func__ if a reference to it occurs within a function. An obvious storage saving optimization is to delay any declaration until such time as it is known to be required. Another optimization is for the storage allocated for __func__ to exactly overlay that allocated to the string literal. Allocating storage for a string literal and copying the characters to the separately allocated object it initializes is not necessary when that object is defined using the const qualifier.

gcc also supports the built-in form __FUNCTION__.

Example

Debugging code in functions can provide useful information. But when there are lots of functions, the quantity of useless information can be overwhelming.
Controlling which functions are to output debugging information by using conditional compilation requires that code be edited and the program rebuilt. The names of functions can be used to dynamically control which functions are to output debugging information. This control not only reduces the amount of information output, but can also reduce execution time by orders of magnitude (output can be a resource-intense operation).

flookup.h

    #include <stddef.h>   /* NULL */

    typedef struct f__rec {
        const char *func_name;
        _Bool enabled;
        struct f__rec *next;
    } func__list;

    extern _Bool func_lookup(func__list **, const char *);
    extern void change_enabled_setting(const char *, _Bool);

    /*
     * Use the name of the function to control whether debugging is
     * switched on/off. func_lookup is only called the first time this code
     * is executed, thereafter the value f___l->enabled can be used.
     */
    #define D_func_trace(func_name, code) { \
        static func__list *f___l = NULL; \
        if (f___l ? f___l->enabled : func_lookup(&f___l, func_name)) \
        {code} \
    }

flookup.c

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #include "flookup.h"

    /*
     * A fixed list of functions and their debug mode.
     * We could be more clever and make this a list which
     * could be added to as a program executes.
     */
    static struct table_entry {
        const char *func_name;
        _Bool enabled;
        func__list *traces_seen;
    } lookup_table[] = {
        { "abc", true, NULL },
        { NULL, false, NULL }
    };

    /* Return the table entry for f_name, or NULL if it is not listed. */
    static struct table_entry *find_entry(const char *f_name)
    {
        for (struct table_entry *e = lookup_table; e->func_name != NULL; e++)
            if (strcmp(e->func_name, f_name) == 0)
                return e;
        return NULL;
    }

    /*
     * One possible body for the behavior sketched in the original
     * comments: create a record for this call site, add it to the
     * traces_seen list of the matching entry (if any) and return the
     * value of enabled. Unlisted functions get a disabled record.
     */
    _Bool func_lookup(func__list **f_list, const char *f_name)
    {
        struct table_entry *e = find_entry(f_name);
        func__list *node = malloc(sizeof *node);

        if (node == NULL)
            return false;
        node->func_name = f_name;
        node->enabled = (e != NULL) && e->enabled;
        node->next = (e != NULL) ? e->traces_seen : NULL;
        if (e != NULL)
            e->traces_seen = node;
        *f_list = node;
        return node->enabled;
    }

    /*
     * Switch on/off the debugging output from any registered function
     * by setting the enabled flag of every record on its traces_seen list.
     */
    void change_enabled_setting(const char *f_name, _Bool new_enabled)
    {
        struct table_entry *e = find_entry(f_name);

        if (e == NULL)
            return;
        e->enabled = new_enabled;
        for (func__list *t = e->traces_seen; t != NULL; t = t->next)
            t->enabled = new_enabled;
    }

811 This name is encoded as if the implicit declaration had been written in the source character set and then translated into the execution character set as indicated in translation phase 5.

Commentary

Having the name appearing as if in translation phase 5 (sentence 133) avoids any potential issues caused by macro names defined with the spelling of keywords or the name __func__. It also enables a translator to have an identifier name and type predefined internally, ready to be used when this reserved identifier is encountered. Translation phase 5 is also where characters get converted to their corresponding members in the execution character set, an essential requirement for spelling a function name. In many implementations the function name written to the object file, or program image (sentence 141), is different from the one appearing in the source. This translation phase 5 requirement ensures that it is not any modified name that is used.

Example

    #include <stdio.h>

    #define __func__ __CNUF__
    #define __CNUF__ "g"

    void f(void)
    {
        /*
         * The implicit declaration does not appear until after preprocessing.
         * So there is no declaration 'static const char __func__[] = "f";'
         * visible to the preprocessor (which would result in __func__ being
         * mapped to __CNUF__ and "f" rather than "g" being output).
         */
        printf("Name of function is %s\n", __CNUF__);
    }

812 EXAMPLE Consider the code fragment:

    #include <stdio.h>

    void myfunc(void)
    {
        printf("%s\n", __func__);
        /* ... */
    }

Each time the function is called, it will print to the standard output stream:

    myfunc

Commentary

This assumes that the standard output stream is not closed (in which case the behavior would be undefined).

813 Forward references: function definitions (6.9.1).

814 (footnote 61) Since the name __func__ is reserved for any use by the implementation (7.1.3), if any other identifier is explicitly declared using the name __func__, the behavior is undefined.

Commentary

The name is reserved because it begins with two underscores. The fact that the standard defines an interpretation for this name in the identifier name space in block scope does not give any license to the developer to use it in other name spaces or at file scope.
This name is still reserved for use in other name spaces and scopes.

C90

Names beginning with two underscores were specified as reserved for any use by the C90 Standard. The following program is likely to behave differently when translated and executed by a C99 implementation.
    #include <stdio.h>

    int main(void)
    {
        int __func__ = 1;

        printf("%d\n", __func__);
    }

C++

Names beginning with __ are reserved for use by a C++ implementation. This leaves the way open for a C++ implementation to use this name for some purpose.

6.4.3 Universal character names

815
    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad
    hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

Commentary

It is intended that this syntax notation not be visible to the developer when reading or writing source code that contains instances of this construct. That is, a universal-character-name aware editor displays the ISO 10646 glyph (see 58 glyph) representing the numeric value specified by the hex-quad sequence. Without such editor support, the whole rationale for adding these characters to C, allowing developers to read and write identifiers in their own language, is voided.

C90

Support for this syntactic category is new in C99.

Other Languages

Java calls this lexical construct a UnicodeInputCharacter (and does not support the \U form, only the \u one).

Coding Guidelines

It is difficult to imagine developers regularly using UCNs with an editor that does not display UCNs in some graphical form. A guideline recommending the use of such an editor would not be telling developers anything they did not already know.

A number of theories about how people recognize words have been proposed (see 792 word recognition, models of). One of the major issues yet to be resolved is the extent to which readers make use of whole word recognition versus mapping character sequences to sound (phonological coding). Support for UCNs increases the possibility that developers will encounter unfamiliar characters in source code. The issue of developer performance in handling unfamiliar characters is discussed elsewhere (see 792 reading characters unknown to reader).

Example

    #define foo(x)

    void f(void)
    {
        foo("\\u0123") /* Does not contain a UCN. */
        foo(\\u0123);  /* Does contain a UCN. */
    }

Constraints

816 A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.62)

Commentary

The ISO 10646 Standard (see 28 ISO 10646) defines the ranges 00 through 01F, and 07F through 09F, as the 8-bit control codes (what it calls C0 and C1). Most of the UCNs with values less than 00A0 represent characters in the basic source character set. The exceptions listed enumerate characters that are in the Ascii character set, but not in the basic source character set. The ranges 0D800 through 0DBFF and 0DC00 through 0DFFF are known as the surrogate ranges. The purpose of these ranges is to allow representation of rare characters in future versions of the Unicode standard.

This constraint means that source files cannot contain the UCN equivalent for any members of the basic source character set.

Rationale

UCNs are not permitted to designate characters from the basic source character set in order to permit fast compilation times for C programs. For some real world programs, compilers spend a significant amount of time merely scanning for the characters that end a quoted string, or end a comment, or end some other token. Although it is trivial for such loops in a compiler to be able to recognize UCNs, this can result in a surprising amount of overhead.

A UCN is constrained not to specify a character short identifier in the range 0000 through 0020 or 007F through 009F inclusive for the same reason: this avoids allowing a UCN to designate the newline character. Since different implementations use different control characters or sequences of control characters to represent newline, UCNs are prohibited from representing any control character.
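The range checks in constraint 816 can be sketched as a small predicate. This is only an illustration of the wording above, not code from any translator; the function name ucn_value_permitted is hypothetical, and the sketch deliberately checks nothing beyond the two ranges the constraint names.

```c
#include <stdbool.h>

/*
 * Sketch of the C99 constraint on universal character names.
 * The name ucn_value_permitted is an assumption made for this example.
 */
bool ucn_value_permitted(unsigned long short_id)
{
    /* Values below 00A0 are disallowed, apart from $, @ and `. */
    if (short_id < 0x00A0UL)
        return short_id == 0x0024UL ||
               short_id == 0x0040UL ||
               short_id == 0x0060UL;

    /* The surrogate range D800 through DFFF is disallowed. */
    if (short_id >= 0xD800UL && short_id <= 0xDFFFUL)
        return false;

    return true;
}
```

Under this sketch ucn_value_permitted(0x0069) is false, which corresponds to the spelling \u0069 for the letter i being a constraint violation.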
C++

2.2p2: If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed.

The range of hexadecimal values that are not permitted in C++ is a subset of those that are not permitted in C. This means that source which has been accepted by a conforming C translator will also be accepted by a conforming C++ translator, but not the other way around.

Other Languages

Java has no such restrictions on the hexadecimal values.

Common Implementations

Support for UCNs is new in C99. It remains to be seen whether translator vendors decide to support any UCN hexadecimal value as an extension.

Example

    \u0069\u006E\u0074 glob; /* Constraint violation. */

Description
817 Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.

Commentary

UCNs may also appear in comments. However, comments do not have a lexical structure to them. Inside a comment, character sequences starting with \u are not treated as UCNs by a translator, although other tools may choose to do so in this context.

The mapping of UCNs in character constants and string literals to the execution character set occurs in translation phase 5.

The constraint on the range of values that a UCN may take (see 816 UCNs not basic character set) prevents them from being used to represent keywords.

C++

The C++ Standard also supports the use of universal character names in these contexts, but does not say in words what it specifies in the syntax (although 2.2p2 comes close for identifiers).

Other Languages

In Java, UnicodeInputCharacters can represent any character and are mapped in lexical translation step 1. It is possible for every character in the source to appear in this form. The mapping only occurs once, so \u005cu005a becomes \u005a, not Z (005c is the Unicode value for \ and 005a is the Unicode value for Z).

Coding Guidelines

UCNs in character constants and string literals are used to represent characters that are output when a program is executed, or in identifiers to provide more readable source code. In the former case it is possible that UCNs from different natural languages will need to be represented. In the latter case it might be surprising if source code contained UCNs from different languages. This usage is a complex one involving issues outside of these coding guidelines (e.g., configuration management and customer requirements) and your author has insufficient experience to know whether any guideline recommendations might be worthwhile.
Some of the coding guideline issues relating to the use of characters outside of the basic execution character set are discussed elsewhere (see 238 multibyte character source contain).

Example

    #include <wchar.h>

    int \u0386\u0401;
    wchar_t *hello = L"\u05B0\u0901"; /* L prefix: a wide string literal is needed to initialize a wchar_t pointer. */

Semantics

818 The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn.63)

Commentary

The standard specifies how UCNs are represented in source code. A development environment may choose to provide, to developers, a visible representation of the UCN that matches the glyph with the corresponding numeric value in ISO 10646.

The ISO 10646 BNF syntax for short identifiers is:

    { U | u } [ {+}(xxxx | xxxxx | xxxxxx) | {-}xxxxxxxx ]

where x represents a hexadecimal digit.
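A short sketch may help show that the eight-digit and four-digit spellings are two notations for the same character: \unnnn designates the character whose eight-digit short identifier is 0000nnnn. The function name ucn_forms_agree and the choice of character 00E9 are assumptions made for this example, and the sketch assumes a translator that accepts UCNs in string literals.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/*
 * Compare the four-digit and eight-digit spellings of the same
 * character (00E9) in narrow and wide string literals.
 * ucn_forms_agree is a hypothetical name used only for illustration.
 */
bool ucn_forms_agree(void)
{
    const char *four_digit = "\u00E9";
    const char *eight_digit = "\U000000E9";
    const wchar_t *wide_four = L"\u00E9";
    const wchar_t *wide_eight = L"\U000000E9";

    /* Both spellings are mapped to the same execution character(s). */
    return strcmp(four_digit, eight_digit) == 0 &&
           wcscmp(wide_four, wide_eight) == 0;
}
```

The byte sequence actually stored depends on the execution character set in use, which is why the sketch compares the two spellings with each other rather than against a fixed encoding.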
Other Languages

Java does not support eight-digit universal character names.

Coding Guidelines

This form of UCN counts toward a greater number of significant characters in identifiers with external linkage (see 283 external identifier significant characters) and therefore is not the preferred representation. However, the developer may not have any control over the method used by an editor to represent UCNs. Given that characters from the majority of human languages can be represented using four-digit short identifiers, eight-digit short identifiers are not likely to be needed. If the development environment offers a choice of representations, use of four-digit short identifiers is likely to result in more significant characters being retained in identifiers having external linkage.

819 Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).

Commentary

It was possible to represent all of the characters specified by versions 1 and 2 of the Unicode-sponsored character set using four-digit short identifiers. Version 3 introduced characters whose representation value requires more than four digits.

Other Languages

Java only supports this form of four-digit universal character names.

820 62) The disallowed characters are the characters in the basic character set and the code positions reserved by ISO/IEC 10646 for control characters, the character DELETE, and the S-zone (reserved for use by UTF-16).

Commentary

Requiring that characters in the basic character set (see 215 basic character set) not be represented using UCN notation helps guarantee that existing tools (e.g., editors) continue to be able to process source files.

The control characters may have special meaning for some tools that process source files (e.g., a communications program used for sending source down a serial link).

C++

The C++ Standard does not make this observation.
821 63) Short identifiers for characters were first specified in ISO/IEC 10646-1/AMD9:1997.

Commentary

This amendment appeared eight years after the first publication of the C Standard (which was made by ANSI in 1989).

6.4.4 Constants

822
    constant:
        integer-constant
        floating-constant
        enumeration-constant
        character-constant

Commentary

A constant differs from a constant-expression (see 1322 constant expression syntax) in that it consists of a single token. The term literal is often used by developers to refer to what the C Standard calls a constant (technically the only literals C contains are string literals; see 895 string literal syntax). There is a more general usage of the term constant to mean something whose value does not change. What the C Standard calls a constant-expression developers often shorten to constant.
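The distinction between a constant (one token) and a constant-expression (possibly many tokens, evaluated by the translator) can be sketched as follows. The names SCREEN_LINES, SCREEN_COLUMNS, screen, and screen_size are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>

enum { SCREEN_LINES = 24 };   /* SCREEN_LINES is an enumeration-constant */
#define SCREEN_COLUMNS 80     /* expands to the integer-constant 80 */

/*
 * SCREEN_LINES * SCREEN_COLUMNS is a constant-expression: several
 * tokens whose value the translator evaluates, here to size an array.
 */
static char screen[SCREEN_LINES * SCREEN_COLUMNS];

size_t screen_size(void)
{
    return sizeof screen;
}
```

Each of 24 and 80 is a constant in the sense of the syntax above, while the array size is a constant-expression built from an enumeration-constant and an integer-constant.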
C++

Footnote 21: The term "literal" generally designates, in this International Standard, those tokens that are called "constants" in ISO C.

The C++ Standard also includes string-literal and boolean-literal in the list of literals, but it does not include enumeration constants in the list of literals. However:

7.2p1: The identifiers in an enumerator-list are declared as constants, and can appear wherever constants are required.

The C++ terminology more closely follows common developer terminology by using literal (a single token) and constant (a sequence of operators and literals whose value can be evaluated at translation time). The value of a literal is explicit in the sequence of characters making up its token. A constant may be made up of more than one token or be an identifier. The operands in a constant have to be evaluated by the translator to obtain its result value. C uses the more easily confused terminology of integer-constant (a single token) and constant-expression (a sequence of operators, integer-constant and floating-constant whose value can be evaluated at translation time).

Other Languages

Languages that support types not supported by C (for instance, sets) sometimes allow constants having these types to be specified (e.g., in Pascal ['a', 'd'] represents a set containing two characters). Fortran supports complex literal constants (e.g., (1.0, 2.0) represents the complex number 1.0 + 2.0i).

Many languages (e.g., Java until version 1.5) do not support any form of enumeration-constant.

Coding Guidelines

Constants are the mechanism by which numeric values are written into source code.
The term constant is used because the numeric values do not change during program execution (and are known at translation time; although in some cases a person reading the source may only know that the value used will be one of a list of possible values because the definition of a macro may be conditional on the setting of some translation time option, for instance -D; see 1931 macro object-like).

The use of constants in source code creates a number of possible maintenance issues, including:

• A constant value, representing some quantity, often needs to occur in multiple locations within source code. Searching for and replacing all occurrences of a particular numeric value in the code is an error-prone process. It is not possible, for instance, to know that all 15s occurring in the source code have the same semantic association and some may need to remain unchanged. (Your author was once told by a developer, whose source contained lots of 15s, that the UK government would never change value-added tax from 15%; a few years later it changed to 17.5%.)

• On encountering a constant in the source, a reader usually needs to deduce its semantic association (either in the application domain or its internal algorithmic function). While its semantics may be very familiar to the author of the source, the association between value and semantics may not be so readily made by later readers.

• A cognitive switch (see 0 cognitive switch) may need to be made because of the representation used for the constant (e.g., floating point, hexadecimal integer, or character constant).

One solution to these problems is to use an identifier to give a symbolic name [footnote 822.1] to the constant, and to use that symbolic name wherever the constant would have appeared in the source. Changes to the value of the constant can then be made by a single modification to the definition of the identifier and a well-chosen name can help readers make the appropriate semantic association.
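The value-added tax anecdote above can be turned into a minimal sketch of the symbolic-name approach. The names VAT_RATE_PER_MILLE and add_vat, and the choice of expressing the rate in tenths of a percent so that integer arithmetic can be used, are assumptions made for this illustration.

```c
/*
 * Hypothetical symbolic name for the tax rate, in tenths of a percent
 * (175 means 17.5%). A change of rate then needs one edit, here.
 */
#define VAT_RATE_PER_MILLE 175L

/* Return the gross amount, in pence, rounding the tax to the nearest penny. */
long add_vat(long net_pence)
{
    long tax = (net_pence * VAT_RATE_PER_MILLE + 500L) / 1000L;

    return net_pence + tax;
}
```

A reader meeting add_vat(10000) does not need to deduce what 175 stands for, and a search for the rate finds one definition rather than every 175 in the source.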
[Footnote 822.1: In some cases the linguistically more correct terminology would be iconic name.]

The creation of a symbolic name provides two pieces of information:
1. The property represented by that symbolic name. For instance, the maximum value of a particular type (INT_MAX; see 318 INT_MAX), whether an implementation supports some feature (__STDC_IEC_559__; see 2015 __STDC_IEC_559__ macro), a means of specifying some operation (SEEK_SET), or a way to obtain information (FE_OVERFLOW).

2. A method of operating on the symbolic name to access the property it represents. For instance, arithmetic operations (INT_MAX), testing in a conditional preprocessing directive (__STDC_IEC_559__), passing as an argument to a library function (SEEK_SET); passing as an argument to a library function, possibly in combination with other symbolic names (FE_OVERFLOW).

Operating on symbolic names involves making use of representation information. (Assignment, or argument passing, is the only time that representation might not be an issue.) The extent to which the use of representation information will be considered acceptable will depend on the symbolic name. For instance, FE_OVERFLOW appearing as the operand of a bitwise operator is to be expected, but its appearance as the operand of an arithmetic operator would be suspicious.

Developers rarely see the use of symbolic names as applying to all constants that occur in source code. In some cases the following are claimed:

• The constants are sufficiently well known that there is no need to give them a name.

• The number of occurrences of particular constants is not sufficient to warrant creating a name for them.

• Operations involving some constant values occur so frequently that their semantic associations are obvious to developers; for instance, assigning 0 or adding 1.

It is true that not all numeric values are meaningless to everybody. A few values are likely to be universally known (at least to Earth-based developers). For instance, there are 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day.
The value 24 occurring in an expression involving time is likely to represent hours in a day. Many values will only be well known to developers working within a given application domain, such as atomic physics (e.g., the value 6.6261E-34). Between these extremes are other values; for instance, 3.14159 will be instantly recognized by developers with a mathematics background. However, developers without this background may need to think about what it represents. There is the possibility that developers who have grown up surrounded by other mathematically oriented people will be completely unaware that others do not recognize the obvious semantic association for this value.

A constant having a particular semantic association may only occur once in the source. However, the issue is not how many times a constant having a particular semantic association occurs, but how many times the particular constant value occurs. The same constant value can appear because of different semantic associations. A search for a sequence of digits (a constant value) will locate all occurrences, irrespective of semantic association.

While an argument can always be made for certain values being sufficiently well known that there is no benefit in replacing them by identifiers, the effort/time taken in discussions on what values are sufficiently well known to warrant standing on their own, instead of an identifier, is likely to be significantly greater than the sum total of all the extra seconds taken to type the identifier.

The constant values 0 and 1 occur very frequently in source code (see Figure 825.1). Experience suggests that the semantic associations tend to be that of assigning an initial value in the case of 0 and accessing a preceding or following item in the case of 1. The coding guideline issues are discussed in the subsections that deal with the different kinds of constants (e.g., integer, or floating).
What form of definition should a symbolic name denoting a constant value have? Possibilities include the following:

• Macro names. These are seen by developers as being technically the same as constants in that they are replaced by the numeric value of the constant during translation (there can also be an unvoiced bias toward perceived efficiency here).
• Enumeration constants. The purpose of an enumerated type is to associate a list of constants with each other (see 517 enumeration set of named constants). This is not to say the definition of an enumerated type containing a single enumeration constant should not occur, but this usage would be unusual. Enumeration constants share the same unvoiced developer bias as macro names: perceived efficiency.

• Objects initialized with the constant. This approach is advocated by some coding guideline documents for C++. The extent to which this is because an object declared with the const qualifier really is constant and a translator need not allocate storage for it, or because use of the preprocessor (often called the C preprocessor, as if it were not also in C++) is frowned on in the C++ community, is left to the reader to decide.

The enumeration constant versus macro name issue is discussed in detail elsewhere (see 517 enumeration set of named constants).

What name to choose? The constant 6.6261E-34 illustrates another pitfall. Planck's constant is almost universally represented, within the physics community, using the letter h (a closely related constant is ħ, the reduced Planck constant). A developer might be tempted to make use of this idiom to name the value, perhaps even trying to find a way of using UCNs to obtain the appropriate h. The single letter h probably gives no more information than the value. The name PLANCK_CONSTANT is self-evident. The developer attitude that anybody who does not know what 6.6261E-34 represents has no business reading the source is not very productive or helpful.

Table 822.1: Occurrence of different kinds of constants (as a percentage of all tokens). Based on the visible form of the .c and .h files.
    Kind of Constant      .c files   .h files
    character-constant      0.16       0.06
    integer-constant        6.70      20.79
    floating-constant       0.02       0.20
    string-literal          1.02       0.74

Constraints

823 The value of a constant shall be in the range of representable values for its type.

Commentary

This is something of a circular definition in that a constant's value is also used to determine its type (see 824 constant type determined by form and value). The lexical form of a constant is also a factor in determining which of a number of possible types it may take. An unsuffixed constant that is too large to be represented in the type long long, or a suffixed constant that is larger than the type with the greatest rank applicable to that suffix, violates this requirement (unless there is some extended integer type supported by the implementation into whose range the value falls).

It can be argued that all floating constants are in range if the implementation supports ±∞.

There is a similar constraint for enumeration constants (see 1440 enumeration constant representable in int).

C++

The C++ Standard has equivalent wording covering integer-literals (2.13.1p3), character-literals (2.13.2p3) and floating-literals (2.13.3p1). For enumeration-literals their type depends on the context in which the question is asked:

7.2p4: Following the closing brace of an enum-specifier, each enumerator has the type of its enumeration. Prior to the closing brace, the type of each enumerator is the type of its initializing value.

7.2p5: The underlying type of an enumeration is an integral type that can represent all the enumerator values defined in the enumeration.

Other Languages

Most languages have a similar requirement, even those supporting a single integer or floating type.

Common Implementations

Some implementations use the string-to-integer conversions provided by the library, while others prefer the flexibility (and fuller control of error recovery) afforded by specially written code. Parker[1074] describes the minimal functionality required.

Example

    char ch = '\0\0\0\0y';

    float f_1 = 1e99999999999999999999999999999999999999999999999;
    float f_2 = 0e99999999999999999999999999999999999999999999999;
    float f_3 = 1e-99999999999999999999999999999999999999999999999; /* Approximately zero. */
    float f_4 = 0e-99999999999999999999999999999999999999999999999; /* Exact zero. */

    short s_1 = 9999999999999999999999999999999999999999999999999;
    short s_2 = 99999999999999999999999 / 99999999999999999999999;

The integer constant 10000000000000000000L would violate this constraint on an implementation that represented the type long long in 64 bits. The use of an L suffix precludes the constant being given the type unsigned long long.

Semantics

824 Each constant has a type, determined by its form and value, as detailed later. Each constant shall have a type and the value of a constant shall be in the range of representable values for its type.

Commentary

Just as there are different floating and integer object types, the possible types that constants may have are not limited to a single type (see 836 integer constant possible types).

It is a constraint violation for a constant to occur during translation phase 7 without a type (see 136 translation phase 7; 841 integer constant no type).

The requirement that a constant be in the range of representable values for its type is a requirement on the implementation. The wording was changed by the response to DR #298.
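How form (prefix, suffix, presence of a decimal point) and value select the type of a constant can be observed with a small sketch. It relies on _Generic, which was added in C11 and is not part of the C99 language discussed here; the macro name CONSTANT_TYPE and the function name print_constant_types are hypothetical.

```c
#include <stdio.h>

/* Map the type of a constant to a printable name (C11 _Generic). */
#define CONSTANT_TYPE(c) _Generic((c),          \
    int: "int",                                 \
    unsigned int: "unsigned int",               \
    long: "long",                               \
    unsigned long: "unsigned long",             \
    long long: "long long",                     \
    unsigned long long: "unsigned long long",   \
    float: "float",                             \
    double: "double",                           \
    long double: "long double",                 \
    default: "other")

void print_constant_types(void)
{
    /* Types fixed by the form of the constant. */
    printf("1      : %s\n", CONSTANT_TYPE(1));
    printf("1U     : %s\n", CONSTANT_TYPE(1U));
    printf("1L     : %s\n", CONSTANT_TYPE(1L));
    printf("1.0    : %s\n", CONSTANT_TYPE(1.0));
    printf("1.0f   : %s\n", CONSTANT_TYPE(1.0f));
    printf("'a'    : %s\n", CONSTANT_TYPE('a'));

    /* Type depends on the value and on the widths of int and long. */
    printf("4294967296 : %s\n", CONSTANT_TYPE(4294967296));
}
```

The last line is left without an expected result because the type chosen for 4294967296 is the first of int, long, and long long that can represent it, which varies between implementations.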
C++

2.13.1p2: The type of an integer literal depends on its form, value, and suffix.

2.13.3p1: The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.

There are no similar statements for the other kinds of literals, although C++ does support suffixes on the floating types. However, the syntactic form of string literals, character literals, and boolean literals determines their type.