Apress - Pro SQL Server 2008 Relational Database Design and Implementation (2008)04

CHAPTER 1: INTRODUCTION TO DATABASE CONCEPTS

The choice of primary key is largely a matter of convenience and what is easiest to use. We’ll discuss primary keys later in this chapter in the context of relationships. The important thing to remember is that when you have values that should exist only once in the database, you need to protect against duplicates.

Choosing Keys

While keys can consist of any number of columns, it is best to limit the number of columns in a key as much as possible. For example, you may have a Book table with the columns Publisher_Name, Publisher_City, ISBN_Number, Book_Name, and Edition. From these attributes, the following three keys might be defined:

• Publisher_Name, Book_Name, Edition: A publisher will likely publish more than one book. Also, it is safe to assume that book names are not unique across all books. However, it is probably true that the same publisher will not publish two books with the same title and the same edition (at least, we assume that this is true!).
• ISBN_Number: The ISBN number is the unique identification number assigned to a book when it is published.
• Publisher_City, ISBN_Number: Because ISBN_Number is unique, it follows that Publisher_City and ISBN_Number combined is also unique.

The choice of (Publisher_Name, Book_Name, Edition) as a composite candidate key seems valid, but the (Publisher_City, ISBN_Number) key requires more thought. The implication of this key is that in every city, ISBN_Number can be used again, a conclusion that is obviously not appropriate. This is a common problem with composite keys, which are often not thought out properly. In this case, you might choose ISBN_Number as the primary key (PK) and (Publisher_Name, Book_Name, Edition) as the alternate key (AK).

Note: It is important not to confuse unique indexes with keys. There may be valid performance-based reasons to implement the Publisher_City, ISBN_Number index in your SQL Server database. However, this would not be identified as a key of a table.
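
The key choices for the Book table can be sketched in a few lines. This is only an illustration, not the book's implementation: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, the sample rows and the try_insert helper are made up, and the alternate key follows the first candidate key listed above (publisher, title, and edition).

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Book (
        ISBN_Number    TEXT    NOT NULL PRIMARY KEY,   -- primary key (PK)
        Publisher_Name TEXT    NOT NULL,
        Publisher_City TEXT    NOT NULL,
        Book_Name      TEXT    NOT NULL,
        Edition        INTEGER NOT NULL,
        UNIQUE (Publisher_Name, Book_Name, Edition)    -- alternate key (AK)
    )
    """
)

def try_insert(row):
    """Insert one Book row; return True on success, False if a key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO Book VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Hypothetical sample data: (ISBN, publisher, city, title, edition).
results = [
    try_insert(("isbn-0001", "Example Press", "Springfield", "Example Title", 1)),
    # Same ISBN as the first row: rejected by the primary key.
    try_insert(("isbn-0001", "Other Press", "Shelbyville", "Other Title", 1)),
    # New ISBN, but same publisher, title, and edition: rejected by the AK.
    try_insert(("isbn-0002", "Example Press", "Springfield", "Example Title", 1)),
    # New ISBN and a new edition: accepted.
    try_insert(("isbn-0003", "Example Press", "Springfield", "Example Title", 2)),
]
row_count = conn.execute("SELECT COUNT(*) FROM Book").fetchone()[0]
print(results)  # [True, False, False, True]
```

Note that no constraint is declared on (Publisher_City, ISBN_Number). Such a pair could still back a unique index for performance, but it is not declared as a key of the table.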
In Chapter 6, we’ll discuss implementing keys, and in Chapter 8, we’ll cover implementing indexes for data access enhancement.

Having established what keys are, we’ll next discuss the two main types of keys: natural keys (including smart keys) and surrogate keys.

Natural Keys

Wikipedia (http://www.wikipedia.com) defines the term natural key as “a candidate key that has a logical relationship to the attributes within that row” (at least it did when this chapter was written). In other words, it is a “real” attribute of an entity that the user logically uses to uniquely identify each instance of an entity. From our previous examples, all of our candidate keys so far, namely employee number, Social Security number (SSN), ISBN, and the (Publisher_Name, Book_Name, Edition) composite key, have been examples of natural keys.

Some common examples of good natural keys are as follows:

• For people: Driver’s license numbers (including the state of issue), company identification number, or other assigned IDs (e.g., customer numbers or employee numbers).
• For transactional documents (e.g., invoices, bills, and computer-generated notices): These usually have some sort of number assigned when they are printed.
• For products for sale: These could be product numbers (product names are likely not unique).
• For companies that clients deal with: These are commonly assigned a customer/client number for tracking.
• For buildings: This is usually the complete address, including the postal code.
• For mail: These could be the addressee’s name and address and the date the item was sent.

Be careful when choosing a natural key. Ideally, you are looking for something that is stable, that you can control, and that is definitely going to allow you to uniquely identify every row in your database.

One thing of interest here is that what might be considered a natural key in your database is often not actually a natural key in the place where it is defined, for example, the driver’s license number of a person. In the example database, this is a number that every person has (or may need before inclusion in our database, perhaps). However, the value of the driver’s license number is just a series of digits. This number did not actually occur in nature, tattooed on the back of the person’s neck at birth. In the database where that number was created, it was actually more of a surrogate key (which we will define in a later section).

Given that three-part names are common in the United States, it is relatively rare that you’ll have two people working in the same company or attending the same school who have the same three names. (Of course, if you work in a company with 200,000 people, the odds will go up that you will have duplicates.) If you include prefixes and suffixes, it is a bit less likely, but “rare” or even “extremely rare” cannot be implemented in a manner that makes it a safe key. If you happen to hire two people called Sir Lester James Fredingston III, then the second of them probably isn’t going to take kindly to being called Les for short just so your database system isn’t compromised.

One notable profession where names must be unique is acting.
No two actors who have their union cards can have the same name. Some change their names from Archibald Leach to something more pleasant like Cary Grant, but in some cases the person wants to keep his or her name, so in the actors database they add a uniquifier to the name to make it unique. A uniquifier might be some meaningless value added to a column or set of columns to give you a unique key.

For example, five people (up from four, last edition) are listed on the Internet Movie Database site (http://www.imdb.com) with the name Gary Grant (not Cary, but Gary). Each has a different number associated with his name to make him a unique Gary Grant. (Of course, none of these people has hit the big time, but watch out: it could be happening soon!)

Tip: We tend to think of names in most systems as a kind of semiunique natural key. This isn’t good enough for identifying a single row, but it’s great for a human to find a value. The phone book is a good example of this. Say you need to find Ray Janakowski in the phone book. There might be more than one person with this name, but it might be a “good enough” way to look up a person’s phone number. This semiuniqueness is a very interesting attribute of a table and should be documented for later use, but only in rare cases would you use the semiunique values and make a key from them using a uniquifier.

Smart Keys

A commonly occurring type of natural key in computer systems is a smart or intelligent key. Some identifiers will have additional information embedded in them, often as an easy way to build a unique value for helping a human identify some real-world thing. In most cases, the smart key can be disassembled into its parts. In some cases, however, the data will probably not jump out at you. Take the following example of the fictitious product serial number XJV102329392000123:

• X: Type of product (LCD television)
• JV: Subtype of product (32-inch console)
• 1023: Lot that the product was produced in (the 1023rd batch produced)
• 293: Day of year
• 9: Last digit of year
• 2: Color
• 000123: Order of production

The simple-to-use smart key values serve an important purpose to the end user, in that the technician who received the product can decipher the value and see that in fact this product was built in a lot that contained defective whatchamajiggers, and he needs to replace it. The essential thing for us during the logical design phase is to find all the bits of information that make up the smart keys because each of these values is likely going to need to be stored in its own column.

Smart keys, while useful in some cases, often present the database implementor with problems that will occur over time. When at all possible, instead of implementing a single column with all of these values, consider having multiple column values for each of the different pieces of information and calculating the value of the smart key. The end user gets what they need, and you in turn get what you need, a column value that never needs to be broken down into parts to work with.

A big problem with smart keys is that it is possible to run out of unique values for the constituent parts, or some part of the key (e.g., the product type or subtype) may change. It is imperative that you be very careful and plan ahead if you use smart keys to represent multiple pieces of information. When you have to change the format of smart keys, it often becomes a large validation problem to make sure that different values of the smart key are actually valid.

Note: Smart keys are useful tools to communicate a lot of information to the user in a small package. However, all the bits of information that make up the smart key need to be identified, documented, and implemented in a straightforward manner.
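
As a rough sketch of "store the parts, calculate the smart key," the following Python keeps each piece of the fictitious serial number in its own field and derives the serial number on demand. The class name, field names, and the fixed widths (1-2-4-3-1-1-6 characters) are assumptions inferred from the single example value, not a documented format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductSerial:
    """Hypothetical record holding each part of the smart key separately."""
    product_type: str     # "X"  = LCD television
    product_subtype: str  # "JV" = 32-inch console
    lot: int              # production lot, assumed 4 digits
    day_of_year: int      # assumed 3 digits
    year_digit: int       # last digit of year
    color: int            # assumed 1-digit color code
    sequence: int         # order of production, assumed 6 digits

    @property
    def serial_number(self) -> str:
        """Calculate the smart key from the individually stored parts."""
        return (
            f"{self.product_type}{self.product_subtype}"
            f"{self.lot:04d}{self.day_of_year:03d}"
            f"{self.year_digit:d}{self.color:d}{self.sequence:06d}"
        )

def parse_serial(serial: str) -> ProductSerial:
    """Split an existing serial number into parts using the assumed widths."""
    if len(serial) != 18:
        raise ValueError(f"expected 18 characters, got {len(serial)}")
    return ProductSerial(
        product_type=serial[0:1],
        product_subtype=serial[1:3],
        lot=int(serial[3:7]),
        day_of_year=int(serial[7:10]),
        year_digit=int(serial[10:11]),
        color=int(serial[11:12]),
        sequence=int(serial[12:18]),
    )

parts = parse_serial("XJV102329392000123")
print(parts.lot, parts.day_of_year, parts.sequence)
print(parts.serial_number)
```

Decoding happens once, when a value enters the system; after that, queries can work with lot or color directly, and the displayed serial number is always recomputed from the parts.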
Optimum SQL code expects the data to all be stored in individual columns, and as such, it is of great importance that you needn’t ever base computing decisions on decoding the value. We will talk more about the subject of choosing implementation keys in Chapter 5.

Surrogate Keys

Surrogate keys (sometimes described as artificial keys) are kind of the opposite of natural keys. The word surrogate means “something that substitutes for,” and in this case, a surrogate key substitutes for a natural key. Sometimes there may not be a natural key that you think is stable or reliable enough to use, in which case you may decide to use a surrogate key. In reality, many of our examples of natural keys were actually surrogate keys in their original database but were elevated to a natural status by usage in the “real” world.

A surrogate key can uniquely identify each instance of an entity, but it has no actual meaning with regard to that entity other than to represent existence. Surrogate keys are usually maintained by the system. Common methods for creating surrogate key values are using a monotonically increasing number (e.g., an Identity column), some form of hash function, or even a globally unique identifier (GUID), which is a very long identifier that is unique on all machines in the world.

The concept of a surrogate key can be troubling to purists. Since the surrogate key does not describe the row at all, can it really be an attribute of the row? Nevertheless, an exceptionally nice aspect of a surrogate key is that the value of the key should never change. This, coupled with the fact that surrogate keys are always a single column, makes several aspects of implementation far easier.

The only reason for the existence of the surrogate key is to identify a row. The main reason for an artificial key is to provide a key that an end user never has to view and never has to interact with.
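
Two of the generation methods mentioned above can be sketched as follows. This is a loose analogy in Python rather than SQL Server behavior: itertools.count stands in for an Identity column, uuid.uuid4 stands in for a GUID (random, so unique in practice rather than by guarantee), and the customer rows are invented for illustration.

```python
import itertools
import uuid

# Identity-style surrogate: a monotonically increasing integer.
# This module-level counter is a hypothetical stand-in for an Identity column.
_identity = itertools.count(start=1)

def new_identity_key() -> int:
    """Return the next integer surrogate key value."""
    return next(_identity)

def new_guid_key() -> str:
    """Return a random GUID (UUID version 4) as a 36-character string."""
    return str(uuid.uuid4())

# The surrogate is assigned once and says nothing about the row itself.
customer = {"customer_id": new_identity_key(), "name": "Example Customer A"}
other = {"customer_id": new_identity_key(), "name": "Example Customer B"}

# Descriptive data can change without ever touching the surrogate value.
original_id = customer["customer_id"]
customer["name"] = "Example Customer A (renamed)"

guid_a, guid_b = new_guid_key(), new_guid_key()
print(customer["customer_id"], other["customer_id"], guid_a != guid_b)
```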
Think of it like your driver’s license number, an ID number that is given to you when you begin to drive. It may have no other meaning than a number that helps a police officer look up who you are when you’ve been testing to see just how fast you can go in sixth gear (although in the United Kingdom it is a scrambled version of the date of birth). The surrogate key should always have some element that is just randomly chosen, and it should never be based on data that can change. If your driver’s license number were a smart key and decoded to include your hair color, the driver’s license number might change frequently (for some young people, and for those of us whose hair has turned a different color). No, this value is good only for looking you up in a database.

Usually a true surrogate key is never shared with any users. It will be a value generated on the computer system that is hidden from use, while the user directly accesses only the natural keys’ values. Probably the best reason for this definition is that once a user has access to a value, it then may need to be modified. For example, if you were customer 0000013 or customer 00000666, you might request a change.

Note: In some ways, surrogate keys should probably not even be mentioned in the logical design section of this book, but it is important to know of their existence, since they will undoubtedly still crop up in some logical designs. A typical flame war on the newsgroups (and amongst the tech reviewers of this book) concerns whether surrogate keys are a good idea. I’m a proponent of their use (as you will see), but I try to be fairly open in my approach in the book to demonstrate both ways of doing things. Generally speaking, if a value is going to be accessible to the end user, my preference is that it really needs to be modifiable and readable.
You can also have two surrogate keys in a table: one that is the unchanging “address” of a value, the other that is built for user consumption (that is compact, readable, and changeable if it somehow offends your user).

Just as the driver’s license number probably has no meaning to the police officer other than a means to quickly call up and check your records, the surrogate is used to make working with the data programmatically easier. Since the source of the value for the surrogate key does not have any correspondence to something a user might care about, once a value has been associated with a row, there is not ever a reason to change the value. This is an exceptionally nice aspect of surrogate keys. The fact that the value of the key does not change, coupled with the fact that it is always a single column, makes several aspects of implementation far easier. This will be made clearer later in the book when choosing a primary key.

Thinking back to the driver’s license analogy, if the driver’s license card has just a single value (the surrogate key) on it, how would Officer Uberter Sloudoun determine whether you were actually the person identified? He couldn’t, so there are other attributes listed, such as name, birth date, and usually your picture, which is an excellent unique key for a human to deal with (except possibly for identical twins, of course). In this very same way, a table ought to have other keys defined as well, or it is not a proper table.

Consider the earlier example of a product identifier consisting of seven parts:

• X: Type of product (LCD television)
• JV: Subtype of product (32-inch console)
• 1023: Lot that the product was produced in (the 1023rd batch produced)
• 293: Day of year
• 9: Last digit of year
• 2: Color
• 000123: Order of production

A natural key would consist of these seven parts. There is also a product serial number, which is the concatenation of the values, such as XJV102329392000123, to identify the row. Say you also have a surrogate key on the table that has a value of 3384038483. If the only key defined on the rows is the surrogate, the following situation might occur:

SurrogateKey  ProductSerialNumber
------------  -------------------
10            XJV102329392000123
3384038483    XJV102329392000123
3384434222    ZJV104329382043534

The first two rows are not technically duplicates, because their surrogate key values differ. But since the surrogate key values have no real meaning, in essence these are duplicate rows, since the user could not effectively tell them apart.

This sort of problem is common, because most people using surrogate keys do not understand that only having a surrogate key opens them up to having rows with duplicate data in the columns where the data has some logical relationship to each other. A user looking at the preceding table would have no clue which row actually represented the product he or she was after, or if both rows did.

Note: When doing logical design, I tend to model each table with a surrogate key, since during the design process I may not yet know what the final keys will in fact turn out to be. This approach will become obvious throughout the book, especially in the case study presented throughout much of it.

Missing Values (NULLs)

If you look up the definition of a “loaded subject” in a computer dictionary, you will likely find the word NULL. In the database, there must exist some way to say that the value of a given column is not known or that the value is irrelevant. Often, a value outside the legitimate range (sometimes referred to as a sentinel value) is used to denote this. For decades, programmers have used ancient dates in a date column to indicate that a certain value does not matter, a negative value where one does not make sense in the context of a column, or simply a text string of 'UNKNOWN' or 'N/A'.
These approaches are fine, but special coding is required to deal with these values, for example:

IF (value <> 'UNKNOWN') THEN ...

This is OK if it needs to be done only once. The problem, of course, is that this special coding is needed every time a new type of column is added. Instead, it is common to use a value of NULL, which in relational theory means an empty set or a set with no value. Going back to Codd’s rules, the third rule states the following:

“NULL values (distinct from empty character string or a string of blank characters or zero) are supported in the RDBMS for representing missing information in a systematic way, independent of data type.”

There are a couple of properties of NULL that you need to consider:

• Any value concatenated with NULL is NULL. NULL can represent any valid value, so if an unknown value is concatenated with a known value, the result is still an unknown value.
• All math operations with NULL will return NULL, for the very same reason that any value concatenated with NULL returns NULL.
• Logical comparisons can get tricky when NULL is introduced.
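
The three properties above can be observed directly. The sketch below uses Python's built-in sqlite3 module as a stand-in, so the syntax is SQLite's rather than SQL Server's (for instance, SQLite spells string concatenation as || where SQL Server uses +). The scalar helper and the table t are made up for the illustration. SQL NULL is returned to Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server

def scalar(sql):
    """Run a single-value SELECT and return that value (NULL becomes None)."""
    return conn.execute(sql).fetchone()[0]

concat_result = scalar("SELECT 'abc' || NULL")   # concatenation with NULL
math_result = scalar("SELECT 1 + NULL")          # arithmetic with NULL
equals_result = scalar("SELECT NULL = NULL")     # comparison: unknown, not true
is_null_result = scalar("SELECT NULL IS NULL")   # the explicit NULL test

# A filter never matches rows where the comparison is unknown.
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), (None,)])
not_a_count = scalar("SELECT COUNT(*) FROM t WHERE v <> 'a'")

print(concat_result, math_result, equals_result)  # None None None
print(is_null_result, not_a_count)                # 1 1
```

The last query is where comparisons get tricky: three rows are stored and only one equals 'a', yet the "not equal to 'a'" filter returns a single row, because the row holding NULL is neither equal nor unequal to anything.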