Algorithms (Part 18)

Shared by: Tran Anh Phuong | File type: PDF | Pages: 10


exactly after the sort phase is completed.) The best choice between these two alternatives of the lowest reasonable value of P and the highest reasonable value of P obviously depends heavily on many system parameters: both alternatives (and some in between) should be considered.

Polyphase Merging

One problem with balanced multiway merging for tape sorting is that it requires either an excessive number of tape units or excessive copying. For P-way merging either we must use 2P tapes (P for input and P for output) or we must copy almost all of the file from a single output tape to P input tapes between merging passes, which effectively doubles the number of passes to be about 2 log_P(N/2M). Several clever tape-sorting algorithms have been invented which eliminate virtually all of this copying by changing the way in which the small sorted blocks are merged together. The most prominent of these methods is called polyphase merging.

The basic idea behind polyphase merging is to distribute the sorted blocks produced by replacement selection somewhat unevenly among the available tape units (leaving one empty) and then to apply a “merge until empty” strategy, at which point the emptied input tape and the output tape switch roles.

For example, suppose that we have just three tapes, and we start out with the following initial configuration of sorted blocks on the tapes. (This comes from applying replacement selection to our example file with an internal memory that can only hold two records.)

    Tape 1: A O R S T   I N   A G N   D E M R   G I N
    Tape 2: E G X   A M P   E L
    Tape 3:

After three 2-way merges from tapes 1 and 2 to tape 3, the second tape becomes empty and we are left with the configuration:

    Tape 1: D E M R   G I N
    Tape 2:
    Tape 3: A E G O R S T X   A I M N P   A E G L N

Then, after two 2-way merges from tapes 1 and 3 to tape 2, the first tape becomes empty, leaving:

    Tape 1:
    Tape 2: A D E E G M O R R S T X   A G I I M N N P
    Tape 3: A E G L N
The sort is completed in two more steps. First, a two-way merge from tapes 2 and 3 to tape 1 leaves one file on tape 2, one file on tape 1. Then a two-way merge from tapes 1 and 2 to tape 3 leaves the entire sorted file on tape 3.

This “merge until empty” strategy can be extended to work for an arbitrary number of tapes. For example, if we have four tape units T1, T2, T3, and T4 and we start out with T1 being the output tape, T2 having 13 initial runs, T3 having 11 initial runs, and T4 having 7 initial runs, then after running a 3-way “merge until empty,” we have T4 empty, T1 with 7 (long) runs, T2 with 6 runs, and T3 with 4 runs. At this point, we can rewind T1 and make it an input tape, and rewind T4 and make it an output tape. Continuing in this way, we eventually get the whole sorted file onto T1:

    T1  T2  T3  T4
     0  13  11   7
     7   6   4   0
     3   2   0   4
     1   0   2   2
     0   1   1   1
     1   0   0   0

The merge is broken up into many phases which don’t involve all the data, but no direct copying is involved.

The main difficulty in implementing a polyphase merge is to determine how to distribute the initial runs. It is not difficult to see how to build the table above by working backwards: take the largest number on each line, make it zero, and add it to each of the other numbers to get the previous line. This corresponds to defining the highest-order merge for the previous line which could give the present line. This technique works for any number of tapes (at least three): the numbers which arise are “generalized Fibonacci numbers” which have many interesting properties. Of course, the number of initial runs may not be known in advance, and it probably won’t be exactly a generalized Fibonacci number. Thus a number of “dummy” runs must be added to make the number of initial runs exactly what is needed for the table.

The analysis of polyphase merging is complicated, interesting, and yields surprising results.
For example, it turns out that the very best method for distributing dummy runs among the tapes involves using extra phases and more dummy runs than would seem to be needed. The reason for this is that some runs are used in merges much more often than others.
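The “merge until empty” phases and the backward construction of the run-count table can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`merge_until_empty`, `polyphase_phases`, `previous_level`, `perfect_distribution`) are hypothetical, only the number of runs on each tape is tracked (not the records or run lengths), and the starting distribution is assumed to be a perfect one with exactly one empty tape.

```python
def merge_until_empty(counts):
    """Run one polyphase phase on a list of run counts, one entry per tape.

    Assumes exactly one tape is empty; that tape is the output tape.
    Merging continues until the input tape with the fewest runs is empty.
    """
    if counts.count(0) != 1:
        raise ValueError("need exactly one empty tape to act as output")
    out = counts.index(0)
    merges = min(c for i, c in enumerate(counts) if i != out)
    # Each merge consumes one run from every input tape and writes one run.
    return [merges if i == out else c - merges for i, c in enumerate(counts)]


def polyphase_phases(counts):
    """Return the table of run counts, one row per phase, until one run is left."""
    table = [list(counts)]
    while sum(table[-1]) > 1:
        table.append(merge_until_empty(table[-1]))
    return table


def previous_level(counts):
    """Work backwards one phase: zero the (first) largest count, add it to the others."""
    largest = max(counts)
    pos = counts.index(largest)
    return [0 if i == pos else c + largest for i, c in enumerate(counts)]


def perfect_distribution(tapes, phases):
    """Start from a single sorted run on the first tape and work back `phases` lines."""
    counts = [1] + [0] * (tapes - 1)
    for _ in range(phases):
        counts = previous_level(counts)
    return counts


for row in polyphase_phases([0, 13, 11, 7]):
    print(row)
```

Ties for the largest count are broken by taking the first tape; that choice happens to reproduce the four-tape table in the text, but any tie-break yields a valid distribution. Dummy runs are not modeled here.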
There are many other factors to be taken into consideration in implementing a most efficient tape-sorting method. For example, a major factor which we have not considered at all is the time that it takes to rewind a tape. This subject has been studied extensively, and many fascinating methods have been defined. However, as mentioned above, the savings achievable over the simple multiway balanced merge are quite limited. Even polyphase merging is only better than balanced merging for small P, and then not substantially. For P > 8, balanced merging is likely to run faster than polyphase, and for smaller P the effect of polyphase is basically to save two tapes (a balanced merge with two extra tapes will run faster).

An Easier Way

Many modern computer systems provide a large virtual memory capability which should not be overlooked in implementing a method for sorting very large files. In a good virtual memory system, the programmer has the ability to address a very large amount of data, leaving to the system the responsibility of making sure that addressed data is transferred from external to internal storage when needed. This strategy relies on the fact that many programs have a relatively small “locality of reference”: each reference to memory is likely to be to an area of memory that is relatively close to other recently referenced areas. This implies that transfers from external to internal storage are needed infrequently. An internal sorting method with a small locality of reference can work very well on a virtual memory system. (For example, Quicksort has two “localities”: most references are near one of the two partitioning pointers.)
But check with your systems programmer before trying it on a very large file: a method such as radix sorting, which has no locality of reference whatsoever, would be disastrous on a virtual memory system, and even Quicksort could cause problems, depending on how well the available virtual memory system is implemented. On the other hand, the strategy of using a simple internal sorting method for sorting disk files deserves serious consideration in a good virtual memory environment.
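The effect of locality of reference can be illustrated with a toy model of paging. The sketch below is an assumption-laden simulation, not a description of any real system: the function name `count_page_faults`, the page size, the number of resident page frames, and the least-recently-used replacement policy are all chosen for illustration. It counts how often a sequence of memory references would force a transfer from external storage.

```python
from collections import OrderedDict


def count_page_faults(addresses, page_size, frames):
    """Count page faults for a reference sequence under LRU replacement.

    `frames` is the number of pages that fit in internal memory at once
    (a hypothetical parameter of this toy model).
    """
    resident = OrderedDict()  # pages in memory, least recently used first
    faults = 0
    for addr in addresses:
        page = addr // page_size
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1  # page must be brought in from external storage
            resident[page] = True
            if len(resident) > frames:
                resident.popitem(last=False)  # evict least recently used page
    return faults


n, page_size, frames = 1024, 64, 4
pages = n // page_size

# Small locality: each reference is next to the previous one.
sequential = list(range(n))

# No locality: the same n addresses, but each reference lands on a different page.
scattered = [(i % pages) * page_size + i // pages for i in range(n)]

print(count_page_faults(sequential, page_size, frames))
print(count_page_faults(scattered, page_size, frames))
```

Both sequences touch every address exactly once; only the order differs. In this model the sequential order faults once per page, while the scattered order faults on every reference, which is the kind of behavior the warning about methods with no locality refers to.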
Exercises

1. Describe how you would do external selection: find the kth largest in a file of N elements, where N is much too large for the file to fit in main memory.
2. Implement the replacement selection algorithm, then use it to test the claim that the runs produced are about twice the internal memory size.
3. What is the worst that can happen when replacement selection is used to produce initial runs in a file of N records, using a priority queue of size M, with M < N?
4. How would you sort the contents of a disk if no other storage (except main memory) were available for use?
5. How would you sort the contents of a disk if only one tape (and main memory) were available for use?
6. Compare the 4-tape and 6-tape multiway balanced merge to polyphase merge with the same number of tapes, for 31 initial runs.
7. How many phases does 5-tape polyphase merge use when started up with four tapes containing 26, 15, 22, 28 runs?
8. Suppose the 31 initial runs in a 4-tape polyphase merge are each one record long (distributed 0, 13, 11, 7 initially). How many records are there in each of the files involved in the last three-way merge?
9. How should small files be handled in a Quicksort implementation to be run on a very large file within a virtual memory environment?
10. How would you organize an external priority queue? (Specifically, design a way to support the insert and remove operations of Chapter 11, when the number of elements in the priority queue could grow to be much too large for the queue to fit in main memory.)
SOURCES for Sorting

The primary reference for this section is volume three of D. E. Knuth’s series on sorting and searching. Further information on virtually every topic that we’ve touched upon can be found in that book. In particular, the results that we’ve quoted on performance characteristics of the various algorithms are backed up by complete mathematical analyses in Knuth’s book.

There is a vast amount of literature on sorting. Knuth and Rivest’s 1973 bibliography contains hundreds of entries, and this doesn’t include the treatment of sorting in countless books and articles on other subjects (not to mention work since 1973).

For Quicksort, the best reference is Hoare’s original 1962 paper, which suggests all the important variants, including the use for selection discussed in Chapter 12. Many more details on the mathematical analysis and the practical effects of many of the modifications and embellishments which have been suggested over the years may be found in this author’s 1975 Ph.D. thesis.

A good example of an advanced priority queue structure, as mentioned in Chapter 11, is J. Vuillemin’s “binomial queues” as implemented and analyzed by M. R. Brown. This data structure supports all of the priority queue operations in an elegant and efficient manner.

To get an impression of the myriad details of reducing algorithms like those we have discussed to general-purpose practical implementations, a reader would be advised to study the reference material for his particular computer system’s sort utility. Such material necessarily deals primarily with formats of keys, records and files as well as many other details, and it is often interesting to identify how the algorithms themselves are brought into play.

M. R. Brown, “Implementation and analysis of binomial queue algorithms,” SIAM Journal of Computing, 7, 3 (August, 1978).
C. A. R. Hoare, “Quicksort,” Computer Journal, 5, 1 (1962).
D. E. Knuth, The Art of Computer Programming. Volume 3: Sorting and Searching, Addison-Wesley, Reading, MA, second printing, 1975.
R. L. Rivest and D. E. Knuth, “Bibliography 26: Computing Sorting,” Computing Reviews, 13, 6 (June, 1972).
R. Sedgewick, Quicksort, Garland, New York, 1978. (Also appeared as the author’s Ph.D. dissertation, Stanford University, 1975.)
SEARCHING
14. Elementary Searching Methods

A fundamental operation intrinsic to a great many computational tasks is searching: retrieving some particular information from a large amount of previously stored information. Normally we think of the information as divided up into records, each record having a key for use in searching. The goal of the search is to find all records with keys matching a given search key. The purpose of the search is usually to access information within the record (not merely the key) for processing.

Two common terms often used to describe data structures for searching are dictionaries and symbol tables. For example, in an English language dictionary, the “keys” are the words and the “records” the entries associated with the words which contain the definition, pronunciation, and other associated information. (One can prepare for learning and appreciating searching methods by thinking about how one would implement a system allowing access to an English language dictionary.) A symbol table is the dictionary for a program: the “keys” are the symbolic names used in the program, and the “records” contain information describing the object named.

In searching (as in sorting) we have programs which are in widespread use on a very frequent basis, so that it will be worthwhile to study a variety of methods in some detail. As with sorting, we’ll begin by looking at some elementary methods which are very useful for small tables and in other special situations and illustrate fundamental techniques exploited by more advanced methods. We’ll look at methods which store records in arrays which are either searched with key comparisons or indexed by key value, and we’ll look at a fundamental method which builds structures defined by the key values.
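The two array-based approaches just mentioned, searching with key comparisons and indexing by key value, can be contrasted in a minimal sketch. Everything named here is hypothetical: the `Record` type, its `key` and `info` fields, and the two helper functions are illustrative only, and the key-indexed version assumes keys are small non-negative integers with a known upper bound `max_key`.

```python
from collections import namedtuple

# A record carries a key for searching plus the associated information.
Record = namedtuple("Record", ["key", "info"])


def search_by_comparison(records, key):
    """Scan the array, comparing each record's key with the search key.

    Returns all matching records, in their stored order.
    """
    return [r for r in records if r.key == key]


def build_key_indexed(records, max_key):
    """Build a table whose position k holds the records with key k.

    Assumes integer keys in the range 0..max_key.
    """
    table = [[] for _ in range(max_key + 1)]
    for r in records:
        table[r.key].append(r)
    return table


records = [Record(3, "third"), Record(1, "first"), Record(3, "another third")]
table = build_key_indexed(records, max_key=3)

print(search_by_comparison(records, 3))  # examines every record
print(table[3])                          # goes straight to position 3
```

The comparison-based search works for any keys that can be tested for equality but looks at every record; the key-indexed table answers a search in one step but needs space proportional to the range of key values.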
As with priority queues, it is best to think of search algorithms as belonging to packages implementing a variety of generic operations which can be separated from particular implementations, so that alternate implementations could be substituted easily. The operations of interest include:

    Initialize the data structure.
    Search for a record (or records) having a given key.
    Insert a new record.
    Delete a specified record.
    Join two dictionaries to make a large one.
    Sort the dictionary; output all the records in sorted order.

As with priority queues, it is sometimes convenient to combine some of these operations. For example, a search and insert operation is often included for efficiency in situations where records with duplicate keys are not to be kept within the data structure. In many methods, once it has been determined that a key does not appear in the data structure, then the internal state of the search procedure contains precisely the information needed to insert a new record with the given key.

Records with duplicate keys can be handled in one of several ways, depending on the application. First, we could insist that the primary searching data structure contain only records with distinct keys. Then each “record” in this data structure might contain, for example, a link to a list of all records having that key. This is the most convenient arrangement from the point of view of the design of searching algorithms, and it is convenient in some applications since all records with a given search key are returned with one search. The second possibility is to leave records with equal keys in the primary searching data structure and return any record with the given key for a search. This is simpler for applications that process one record at a time, where the order in which records with duplicate keys are processed is not important. It is inconvenient from the algorithm design point of view because some mechanism for retrieving all records with a given key must still be provided. A third possibility is to assume that each record has a unique identifier (apart from the key), and require that a search find the record with a given identifier, given the key.
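The first arrangement for duplicate keys (distinct keys in the primary structure, each linked to a list of the records having that key) can be sketched as a small package of operations. This is only an illustration: the class name `Dictionary` and its method names are hypothetical, and Python’s built-in `dict` stands in for whatever primary searching structure is actually used, so the sketch shows the interface rather than any particular searching method. The join operation is omitted.

```python
class Dictionary:
    """Search package in which each distinct key links to a list of records."""

    def __init__(self):
        # Initialize: the built-in dict is a stand-in for the primary structure.
        self._table = {}

    def search(self, key):
        """Return all records having the given key (empty list if none)."""
        return list(self._table.get(key, []))

    def insert(self, key, record):
        """Insert a new record; duplicates of a key share one list."""
        self._table.setdefault(key, []).append(record)

    def search_and_insert(self, key, record):
        """Insert the record only if the key is absent; return what is stored."""
        bucket = self._table.setdefault(key, [])
        if not bucket:
            bucket.append(record)
        return list(bucket)

    def delete(self, key, record):
        """Delete one specified record; return True if it was present."""
        bucket = self._table.get(key, [])
        if record not in bucket:
            return False
        bucket.remove(record)
        if not bucket:
            del self._table[key]  # keep only keys that still have records
        return True

    def sorted_records(self):
        """Output all (key, record) pairs in sorted key order."""
        return [(k, r) for k in sorted(self._table) for r in self._table[k]]


d = Dictionary()
d.insert("b", "record 1")
d.insert("a", "record 2")
d.insert("b", "record 3")
print(d.search("b"))
print(d.sorted_records())
```

One search returns every record with the key, which is the convenience noted in the text for this arrangement; the cost is the extra level of indirection through the per-key list.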
Or, some more complicated mechanism could be used to distinguish among records with equal keys.

Each of the fundamental operations listed above has important applications, and quite a large number of basic organizations have been suggested to support efficient use of various combinations of the operations. In this and the next few chapters, we’ll concentrate on implementations of the fundamental functions search and insert (and, of course, initialize), with some comment on delete and sort when appropriate. As with priority queues, the join operation normally requires advanced techniques which we won’t be able to consider here.

Sequential Searching

The simplest method for searching is simply to store the records in an array,