High-Performance Parallel Database Processing and Grid Databases- P3

Shared by: Thanh Cong | File type: PDF | Pages: 50


High-Performance Parallel Database Processing and Grid Databases- P3: Parallel databases are database systems that are implemented on parallel computing platforms. Therefore, high-performance query processing focuses on query processing, including database queries and transactions, that makes use of parallelism techniques applied to an underlying parallel computing platform in order to achieve high performance.


Chapter 4 Parallel Sort and GroupBy

may also be used. Apart from these basic functions, most commercial relational database management systems (RDBMS) also include more advanced functions, such as advanced statistical functions. From a query processing point of view, these functions take a set of records (i.e., a table) as their input and produce a single value as the result.

4.1.3 GroupBy

An example of a GroupBy query is "retrieve the number of students for each degree". The student records are grouped by degree, and the records in each group are counted. These counts represent the number of students in each degree program. The SQL and a sample result of this query are given below.

Query 4.5:
    Select Sdegree, COUNT(*)
    From STUDENT
    Group By Sdegree;

It is also worth mentioning that the input table may have been filtered by a Where clause (in both scalar aggregate and GroupBy queries), and that for GroupBy queries the results of the grouping may be further filtered by a Having clause.

4.2 SERIAL EXTERNAL SORTING METHOD

Serial external sorting is external sorting in a uniprocessor environment. The most common serial external sorting algorithm is based on sort-merge. The underlying principle of the sort-merge algorithm is to break the file up into unsorted subfiles, sort the subfiles, and then merge the sorted subfiles into larger and larger sorted subfiles until the entire file is sorted. Note that the first stage sorts the first lot of subfiles, whereas the second stage is the merging phase. In this scenario, it is important to determine the size of the first lot of subfiles that are to be sorted. Normally, each of these subfiles must be small enough to fit into main memory, so that it can be sorted there with any internal sorting technique.
In other words, the size of these subfiles is usually determined by the size of the main-memory buffer used to sort each subfile internally. A typical algorithm for external sorting using B buffers is presented in Figure 4.1.

Algorithm: Serial External Sorting
  // Sort phase: pass 0
  1. Read B pages at a time into memory
  2. Sort them, and write out a subfile
  3. Repeat steps 1-2 until all pages have been processed
  // Merge phase: pass i = 1, 2, ...
  4. While the number of subfiles at the end of the previous pass is > 1
  5.   While there are subfiles to be merged from the previous pass
  6.     Choose B-1 sorted subfiles from the previous pass
  7.     Read each subfile into an input buffer, one page at a time
  8.     Merge these subfiles into one bigger subfile
  9.     Write to the output buffer one page at a time
Figure 4.1 External sorting algorithm based on sort-merge

The algorithm presented in Figure 4.1 is divided into two phases: sort and merge. The merge phase consists of nested loops, and each iteration of the outer loop is called a pass; the merge phase therefore consists of passes i = 1, 2, .... For consistency, the sort phase is named pass 0.

To explain the sort phase, consider the following example. Assume the size of the file to be sorted is 108 pages and we have 5 buffer pages available (B = 5 pages). First read 5 pages from the file, sort them, and write them to disk as one subfile. Then read, sort, and write another 5 pages. In the last run, read, sort, and write 3 pages only. The sort phase thus produces $\lceil 108/B \rceil = 22$ subfiles, where the first 21 subfiles are 5 pages each and the last subfile is only 3 pages long.

Once the sorting of subfiles is completed, the merge phase starts. Continuing the example above, we use B-1 buffers (i.e., 4 buffers) for input and 1 buffer for output. The merging process is as follows. In pass 1, we first read 4 sorted subfiles produced in the sort phase. Then we perform a 4-way merge (because only 4 buffers are used as input). This 4-way merge is a k-way merge with k = 4, since the number of input buffers is 4 (i.e., B-1 = 4 buffers). An algorithm for k-way merging is explained in Figure 4.2. The 4-way merge is repeated until all subfiles (here, the 22 subfiles from pass 0) are processed. This process is called pass 1, and it produces $\lceil 22/4 \rceil = 6$ subfiles of 20 pages each, except for the last one, which is only 8 pages long.

The next pass, pass 2, repeats the 4-way merge on the 6 subfiles produced in pass 1. We first read 4 subfiles of 20 pages each and perform a 4-way merge. 
This results in a subfile 80 pages long. Then we read the last 2 subfiles, one of which is 20 pages long while the other is only 8 pages long, and merge them into the second subfile of this pass (28 pages). As a result, pass 2 produces $\lceil 6/4 \rceil = 2$ subfiles. Finally, pass 3 merges the 2 subfiles produced in pass 2 into one sorted file. The process stops because no subfiles remain to be merged.

Algorithm: k-way merging
  input files f1, f2, ..., fn; output file fo
  /* Merge the sorted files f1, f2, ..., fn on attribute a1 of all files */
  1. Open files f1, f2, ..., fn.
  2. Read a record from each of the files f1, f2, ..., fn.
  3. Find the smallest value among attributes a1 of the records from step 2.
     Store this value in ax and the file in fx (f1 ≤ fx ≤ fn).
  4. Write ax to the output file fo.
  5. Read a record from file fx.
  6. Repeat steps 3-5 until no records remain in any of the files f1, f2, ..., fn.
Figure 4.2 k-Way merging algorithm

In the above example, using a 108-page file and 5 buffer pages, we need 4 passes, where pass 0 is the sort phase and passes 1 to 3 form the merge phase. The number of passes can be calculated as follows. The number of passes needed to sort a file with B buffers available is $\lceil \log_{B-1} \lceil \text{file size}/B \rceil \rceil + 1$, where $\lceil \text{file size}/B \rceil$ is the number of subfiles produced in pass 0 and $\lceil \log_{B-1} \lceil \text{file size}/B \rceil \rceil$ is the number of passes in the merge phase. This can be seen as follows. In general, the number of passes $x$ in the merge phase of $\alpha$ items satisfies the relationship $\alpha/(B-1)^x = 1$, from which we obtain $x = \log_{B-1}(\alpha)$, rounded up when it is not a whole number. For the example, $\lceil \log_4 22 \rceil + 1 = 3 + 1 = 4$.

In each pass, we read and write all the pages (e.g., 108 pages). Therefore, the total I/O cost of the overall serial external sort is 2 × file size × number of passes = 2 × 108 × 4 = 864 pages. More comprehensive cost models for serial external sort are explained below in Section 4.4.

As the above example shows, an important aspect of serial external sorting is the buffer size, since each subfile must fit comfortably into main memory. The bigger the buffer (main memory), the fewer passes are needed to sort a file, which improves performance. Table 4.1 illustrates how performance improves as the number of buffers increases. In terms of total I/O cost, the number of passes is the key determinant. For example, to sort 1 billion pages, using 129 buffers is 6 times more efficient than using 3 buffers (30 passes versus 5, a ratio of 6:1).
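The sort-merge procedure and the pass-count formula can be checked with a small in-memory simulation. The sketch below is only illustrative: the function names `count_passes` and `external_sort` are hypothetical, pages are modeled as Python lists rather than disk blocks, and `heapq.merge` from the standard library stands in for the k-way merge of Figure 4.2.

```python
import heapq


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def count_passes(file_pages: int, buffers: int) -> int:
    """Number of passes of sort-merge, counting the sort phase (pass 0)."""
    if buffers < 3:
        raise ValueError("need at least 3 buffers: 2 for input and 1 for output")
    runs = ceil_div(file_pages, buffers)   # subfiles produced in pass 0
    passes = 1                             # pass 0, the sort phase
    while runs > 1:                        # every merge pass is a (B-1)-way merge
        runs = ceil_div(runs, buffers - 1)
        passes += 1
    return passes


def external_sort(pages, buffers):
    """Simulate sort-merge on a non-empty list of pages (each page is a list of keys).

    Returns the sorted keys and, for every pass, the size in pages of each subfile.
    """
    page_size = len(pages[0])
    # Pass 0: read B pages at a time, sort them in memory, write one subfile.
    runs = [sorted(key for page in pages[i:i + buffers] for key in page)
            for i in range(0, len(pages), buffers)]
    history = [[ceil_div(len(run), page_size) for run in runs]]
    # Merge passes: merge B-1 subfiles at a time until a single subfile is left.
    fan_in = buffers - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        history.append([ceil_div(len(run), page_size) for run in runs])
    return runs[0], history
```

For the 108-page file with B = 5, `external_sort` reports 22, 6, 2, and 1 subfiles after passes 0 to 3, and `count_passes(108, 5)` returns 4, matching the worked example. The loop in `count_passes` uses integer arithmetic instead of a floating-point logarithm to avoid rounding surprises for large files.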
Table 4.1 Number of passes in serial external sorting as the number of buffers increases

R (file size in pages)   B=3   B=5   B=9   B=17   B=129   B=257
100                        7     4     3      2       1       1
1,000                     10     5     4      3       2       2
10,000                    13     7     5      4       2       2
100,000                   17     9     6      5       3       3
1 million                 20    10     7      5       3       3
10 million                23    12     8      6       4       3
100 million               26    14     9      7       4       4
1 billion                 30    15    10      8       5       4

There are a number of variations on the serial external sort-merge explained above, such as using a double buffering technique or a blocked I/O method. As our concern is not with the serial part of external sorting, our assumption of serial external sorting is based on the above sort-merge technique using B buffers.

As stated in the beginning, serial external sort is the basis for parallel external sort. Particularly in a shared-nothing environment, each processor has its own data, and sorting this data locally in each processor is done as per the serial external sort explained above. Therefore, the main concern in parallel external sort is not the local sort itself but when the local sort is carried out (i.e., whether the local sort is done first or later) and how merging is performed. The next section describes different methods of parallel external sort by considering these two factors.

4.3 ALGORITHMS FOR PARALLEL EXTERNAL SORT

In this section, five parallel external sort methods for parallel database systems are explained: (i) parallel merge-all sort, (ii) parallel binary-merge sort, (iii) parallel redistribution binary-merge sort, (iv) parallel redistribution merge-all sort, and (v) parallel partitioned sort. Each of these is described in more detail in the following.

4.3.1 Parallel Merge-All Sort

The parallel merge-all sort method is a traditional approach, which has been adopted as the basis for implementing sorting operations in several database machine prototypes (e.g., Gamma) and some commercial parallel DBMSs. Parallel merge-all sort is composed of two phases: local sort and final merge. The local sort phase is carried out independently in each processor. Local sorting in each processor is performed as per a normal serial external sorting mechanism. Serial external sorting is used because the data to be sorted in each processor is assumed to be too large to fit into main memory, and hence external sorting (as opposed to internal sorting) is required in each processor.

After the local sort phase has been completed, the second phase, the final merge phase, starts. 
In this final merge phase, the results from the local sort phase are transferred to the host for final merging. The final merge phase is carried out by one processor, namely, the host. An algorithm for k-way merging is explained in Figure 4.2.

Figure 4.3 illustrates a parallel merge-all sort process. For simplicity, a list of numbers is used and this list is to be sorted. In the real world, the list of numbers is actually a list of records from very large tables.

[Figure 4.3 Parallel merge-all sort. Diagram stages, bottom to top: records from the child operator; local sort (processors 1 to 4); final merge at the host, producing the sorted list 1 to 16.]

Figure 4.3 shows that parallel merge-all sort is simple, because it is a one-level tree. Load balancing in each processor at the local sort phase is relatively easy to achieve, especially if a round-robin data placement technique is used in the initial data partitioning. It is also easy to predict the outcome of the process, as performance modeling of such a process is relatively straightforward.

Despite its simplicity, the parallel merge-all sort method incurs an obvious problem, particularly in the final merging phase, as merging in one processor is heavy. This is true especially if the number of processors is large and there is a limit to the number of files to be merged (i.e., a limit on the number of files that can be opened). Another factor in merging is the buffer size, as mentioned above in the discussion of serial external sorting.

Another problem with parallel merge-all sort is network contention, as all temporary results from each processor in the local sort phase are passed to the host. The problem of merging by one host is tackled by the next sorting scheme, where merging is not done by one processor but is shared by multiple processors in the form of hierarchical merging.
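The two phases of parallel merge-all sort can be sketched as follows. This is a single-process simulation under stated assumptions: processors are modeled as list slots, the hypothetical helper `round_robin` stands in for the initial data placement, an in-memory `sorted` replaces the local external sort, and `heapq.merge` plays the role of the host's k-way merge.

```python
import heapq


def round_robin(records, n_processors):
    """Initial data placement: record i goes to processor i mod n."""
    return [records[p::n_processors] for p in range(n_processors)]


def parallel_merge_all_sort(records, n_processors):
    """Local sort on every processor, then one k-way merge on the host."""
    partitions = round_robin(records, n_processors)
    # Phase 1: each processor sorts its own partition independently.
    local_runs = [sorted(part) for part in partitions]
    # Phase 2: the host alone merges all n sorted runs (the bottleneck).
    return list(heapq.merge(*local_runs))
```

Calling `parallel_merge_all_sort(records, 4)` on the 16 numbers of Figure 4.3 yields the sorted list 1 to 16. The single `heapq.merge` call makes the drawback visible: the host must hold one open input per processor and handle every record itself.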
4.3.2 Parallel Binary-Merge Sort

The first phase of parallel binary-merge sort is a local sort, as in parallel merge-all sort. The second phase, the merging phase, is pipelined instead of being concentrated on one processor. The merging phase works by taking the results from two processors and merging them in one processor. As this merging technique takes its input from only two processors, it is called "binary merging." The result of merging between two processors is passed on to the next level until one processor (the host) is left. The merging process therefore forms a hierarchy. Figure 4.4 illustrates the process.

[Figure 4.4 Parallel binary-merge sort. Diagram stages, bottom to top: records from the child operator; local sort (processors 1 to 4); two-level hierarchical merging using (N-1) nodes in a pipeline, producing the sorted list 1 to 16.]

The main reason for using parallel binary-merge sort is that the merging workload is spread over a pipeline of processors instead of one processor. It is true, however, that the final merging still has to be done by one processor.

Some of the benefits of parallel binary-merge sort are similar to those of parallel merge-all sort. For instance, balancing in the local sort can be achieved if a round-robin data placement is initially used for the raw data to be sorted. Another benefit, as stated above, is that the merging workload is now shared among processors.

However, problems relating to the heavy merging workload in the host still exist, even though the final merging now merges only a pair of sorted lists and is not a k-way merge like that in parallel merge-all sort. Binary merging can still be time consuming, particularly if the two lists to be merged are very large. Figure 4.5 illustrates binary merging versus k-way merging as carried out by the host.

[Figure 4.5 Binary-merge vs. k-way merge in the merging phase. Panels: parallel merge-all sort (k-way merging); parallel binary-merge sort (binary merging).]

The main difference between k-way merging and binary merging is that k-way merging involves a search: it looks for the smallest value among all values being compared at the same time. In binary merging, this search reduces to a single comparison between two values. Regarding system requirements, k-way merging requires that a sufficient number of files can be open at the same time. This requirement is trivial in binary merging, as at most two files need to be open, which any operating system can satisfy.

The pipeline system, as in binary merging, will certainly produce extra work through the pipe itself. The pipeline mechanism also produces a taller tree, not a one-level tree as with the previous method. However, if there is a limit to the number of open files permitted in k-way merging, parallel merge-all sort will incur merging overheads.

In parallel binary-merge sort, there is still no true parallelism in the merging, because only a subset, not all, of the available processors are used.

In the next three sections, three possible alternatives using the concept of redistribution or repartitioning are described. 
The first approach is a modification of parallel binary-merge sort that incorporates redistribution in the pipeline hierarchy of merging. The second approach is an alteration of parallel merge-all sort, also through the use of redistribution. The third approach differs from the others in that local sorting is delayed until after partitioning is done.

4.3.3 Parallel Redistribution Binary-Merge Sort

Parallel redistribution binary-merge sort is motivated by the goal of parallelism at all levels in the pipeline hierarchy. It is similar to parallel binary-merge sort, because both methods use a hierarchy pipeline for merging local sort results, but differs in terms of the number of processors involved in the pipe. With parallel redistribution binary-merge sort, all processors are used at each level in the hierarchy of merging.

The steps of parallel redistribution binary-merge sort can be described as follows. First, carry out a local sort in each processor, as in the previous sorting methods. Second, redistribute the results of the local sort to the same pool of processors. Third, merge using the same pool of processors. Finally, repeat the second and third steps until the final merging. The final result is the union of all temporary results obtained in each processor. Figure 4.6 illustrates the parallel redistribution binary-merge sort method.

[Figure 4.6 Parallel redistribution binary-merge sort. Diagram stages, bottom to top: records from the child operator; local sort (processors 1 to 4); range redistribution within each pair of processors (ranges 1-10 and 11-20); intermediate merge, sorted among and within files; range redistribution among all processors (ranges 1-5, 6-10, 11-15, 16-20); final merge; sorted list.]
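The two pipelined schemes, binary merging and binary merging with redistribution, can be contrasted in a short simulation. The sketch below makes several assumptions that are not in the text: the function names are hypothetical, keys are integers in a known domain [lo, hi] that is cut into equal-width ranges, the number of processors is a power of two, and processors are modeled as list slots in a single process.

```python
import heapq


def local_sort(records, n):
    """Round-robin placement on n processors, then an independent local sort."""
    return [sorted(records[p::n]) for p in range(n)]


def binary_merge_sort(records, n):
    """Parallel binary-merge sort: sorted runs are merged pairwise, level by level."""
    runs = local_sort(records, n)
    levels = 0
    while len(runs) > 1:
        # One processor merges each pair; an unpaired run moves up unchanged.
        runs = [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        levels += 1
    return runs[0], levels


def range_index(key, lo, hi, parts):
    """Index of the equal-width range of [lo, hi] that contains key."""
    return (key - lo) * parts // (hi - lo + 1)


def redistribution_binary_merge_sort(records, n, lo, hi):
    """Pools of 2, 4, ..., n processors redistribute by range; every member merges."""
    assert n > 0 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    runs = local_sort(records, n)
    pool = 2
    while pool <= n:
        next_runs = []
        for start in range(0, n, pool):
            members = runs[start:start + pool]
            for target in range(pool):
                # Filtering a sorted run keeps it sorted, so a plain merge suffices.
                incoming = [[k for k in run if range_index(k, lo, hi, pool) == target]
                            for run in members]
                next_runs.append(list(heapq.merge(*incoming)))
        runs = next_runs
        pool *= 2
    return runs
```

With the numbers 1 to 16 on 4 processors and a domain of 1 to 20, `binary_merge_sort` needs two merge levels and ends on a single processor, whereas `redistribution_binary_merge_sort` keeps all four processors merging at both levels and leaves the result spread across them as 1-5, 6-10, 11-15, and 16. The last processor holding a single value is an instance of redistribution skew.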
Note from the illustration that in the final merge phase, some of the boxes are empty (i.e., gray boxes). This indicates that they do not receive any values from the designated processors. For example, the first box on the left is gray because there are no values ranging from 1 to 5 from processor 2. In practice, in this example, processor 1 performs the final merging of two lists, because the other two lists are empty.

Also, note that the results produced by the intermediate merging in the above example are sorted within and among processors. This means that, for example, processors 1 and 2 each produce a sorted list, and the union of these results is also sorted, with the results from processor 1 preceding those from processor 2. The same applies to other pairs of processors. Each pair of processors in this case forms a pool of processors. At the next level of merging, two pools of processors use the same strategy as in the previous level. Finally, in the final merging, all processors form one pool, and therefore the results produced in each processor are sorted, and the union of these results is also sorted according to the processor order. In some systems, this is already a final result. If there is a need to place the results in one processor, the results are then transferred.

The apparent benefit of this method is that merging becomes lighter compared with the methods without redistribution, because merging is now shared by multiple processors, not monopolized by just one processor. Parallelism is therefore accomplished at all levels of merging, even though the performance benefits of this mechanism are limited.

One problem of the redistribution method remains: the height of the tree. This is because merging is done in a pipeline format. Another problem raised by the redistribution is skew. 
Although the initial placement on each disk is balanced through the use of round-robin data partitioning, redistribution in the merging process is likely to produce skew, as shown in Figure 4.6. Like the merge-all sort method, final merging in the redistribution method also depends on the maximum number of files that can be opened.

4.3.4 Parallel Redistribution Merge-All Sort

Parallel redistribution merge-all sort is motivated by two goals, namely, reducing the height of the tree while maintaining parallelism at the merging stage. This can be achieved by combining the features of the parallel merge-all and parallel redistribution binary-merge methods. In other words, parallel redistribution merge-all sort is a two-phase method (local sort and final merging) like parallel merge-all sort, but it performs a redistribution based on range partitioning. Figure 4.7 gives an illustration of parallel redistribution merge-all sort.

[Figure 4.7 Parallel redistribution merge-all sort. Diagram stages, bottom to top: records from the child operator; local sort (processors 1 to 4); range redistribution (ranges 1-5, 6-10, 11-15, 16-20); final merge; sorted list.]

As shown in Figure 4.7, parallel redistribution merge-all sort is a two-phase method. In phase one, a local sort is carried out as in the other methods. In phase two, the results of the local sort are redistributed to all processors based on range partitioning, and merging is then performed by each processor.

As in parallel redistribution binary-merge sort, empty (gray) boxes are empty lists that result from data redistribution. In the above example, processor 4 has three empty lists coming from processors 2, 3, and 4, as they do not have values ranging from 16 to 20 as specified by the range partitioning function. Also, note that the final results produced in the final merging phase in each processor are sorted, and that they are also sorted among all processors according to the processor order specified by the range partitioning function.

The advantage of this method is the same as that of parallel redistribution binary-merge sort, including true parallelism in the merging process. However, the tree of parallel redistribution merge-all sort is not a tall tree as in parallel redistribution binary-merge sort. It is, in fact, a one-level tree, the same as in parallel merge-all sort.

Not only do the advantages of parallel redistribution merge-all sort mirror those of parallel merge-all sort and parallel redistribution binary-merge sort, so also do the problems. The skew problems found in parallel redistribution binary-merge sort also exist with this method. Consequently, skew modeling needs some simplifying assumptions as well. Additionally, a bottleneck problem in merging, similar to that of parallel merge-all sort, is also common here, especially if the number of processors is large and exceeds the limit on the number of files that can be opened at once.
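A minimal sketch of the two phases of parallel redistribution merge-all sort follows. It assumes a hypothetical `split_points` parameter giving the upper bound of every range except the last (for example `[5, 10, 15]` for the ranges 1-5, 6-10, 11-15, 16-20), round-robin initial placement, and processors modeled as list slots; none of these names come from the text.

```python
import heapq
from bisect import bisect_right


def redistribution_merge_all_sort(records, split_points):
    """Local sort, one range redistribution among all processors, then local merges.

    split_points must be ascending. Returns the sorted list held by each processor
    and the lists each processor received (empty lists are the gray boxes).
    """
    n = len(split_points) + 1
    # Phase 1: round-robin placement followed by an independent local sort.
    local_runs = [sorted(records[p::n]) for p in range(n)]
    # Phase 2a: every processor cuts its sorted run at the range boundaries
    # and sends piece i to processor i; each piece is itself sorted.
    inbox = [[] for _ in range(n)]
    for run in local_runs:
        cuts = [0] + [bisect_right(run, s) for s in split_points] + [len(run)]
        for target in range(n):
            inbox[target].append(run[cuts[target]:cuts[target + 1]])
    # Phase 2b: every processor merges the n pieces it received.
    merged = [list(heapq.merge(*pieces)) for pieces in inbox]
    return merged, inbox
```

With the numbers 1 to 16 placed round-robin on 4 processors, the last processor receives only the value 16 and three empty lists, which shows both the one-level tree and the skew that range redistribution can introduce. Each processor still performs an n-way merge, so the open-file limit discussed above applies here as well.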
4.3.5 Parallel Partitioned Sort

Parallel partitioned sort is influenced by the techniques used in parallel partitioned join, where the process is split into two stages: partitioning and independent local work. In parallel partitioned sort, we first partition the local data according to the range partitioning used in the operation. Note the difference between this method and the others: here the first phase is not a local sort. Each local processor simply scans its records and redistributes (repartitions) them according to some range partitioning.

After partitioning is done, each processor holds an unsorted list whose values come from various processors. Only then is the local sort carried out. Thus the local sort is carried out after the partitioning, not before. Note also that merging is not needed. The results produced by the local sort are already the final results. Each processor produces a sorted list, and the lists of all processors, taken in the order of the range partitioning used, are also sorted. Figure 4.8 illustrates this method.

[Figure 4.8 Parallel partitioned sort. Diagram stages, bottom to top: records from the child operator; scan only, no local sort (processors 1 to 4); range redistribution (ranges 1-5, 6-10, 11-15, 16-20); local sort; sorted list.]
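The order of operations in parallel partitioned sort (partition first, sort afterwards, no merge) can be sketched as below. The function name and the `split_points` parameter, which lists the upper bound of every range except the last, are assumptions made for illustration; processors are again modeled as list slots.

```python
from bisect import bisect_left


def parallel_partitioned_sort(records, split_points):
    """Range-partition the raw records, then sort locally; no merge is needed."""
    n = len(split_points) + 1
    # Initial round-robin placement; the local data is NOT sorted at this point.
    local_data = [records[p::n] for p in range(n)]
    # Phase 1: each processor scans its records and sends every record to the
    # processor that owns its range (key <= split_points[i] goes to processor i).
    received = [[] for _ in range(n)]
    for part in local_data:
        for key in part:
            received[bisect_left(split_points, key)].append(key)
    # Phase 2: each processor sorts what it received; this is the final result.
    return [sorted(bucket) for bucket in received]
```

Concatenating the returned lists in processor order gives the fully sorted output, because the ranges themselves are ordered. The sketch also shows why skew matters here: the sizes of the lists in `received` depend entirely on how the keys fall into the chosen ranges.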
The main benefit of parallel partitioned sort is that no merging is necessary, and hence the bottleneck in merging is avoided. It also achieves true parallelism, as all processors are used in both phases. Most importantly, it is a one-level tree, which avoids the overheads of a pipeline hierarchy.

Despite these advantages, one problem remains outstanding: the skew produced by the partitioning. This is a common problem even in partitioned join. Load balancing in this situation is often carried out by producing more buckets than there are available processors and then distributing the buckets among the processors so that the workload is even. For example, in Figure 4.9, seven buckets have been created for three processors. The size of each bucket is likely to be different, and after the buckets are created, bucket placement and arrangement are performed to balance the workload of the three processors. For example, buckets A, B, and G go to processor 1, buckets C and F to processor 2, and the rest to processor 3. In this way, the workload of the three processors is balanced.

[Figure 4.9 Bucket tuning load balancing. Diagram: buckets A to G of different sizes, assigned to processors 1 to 3.]

However, bucket tuning in its original form, as shown in Figure 4.9, is not applicable to parallel sort. This is because in parallel sort the order of the processors is important. In the above example, bucket A holds values that are smaller than those in bucket B, values in bucket B are smaller than those in bucket C, and so on. Buckets A to G are therefore in order. The values in each bucket are to be sorted, and once they are sorted, the union of values from each bucket, taken in bucket order, produces a sorted list. Imagine that bucket tuning as shown in Figure 4.9 is applied to parallel partitioned sort. 
Processor 1 will have three sorted lists, from buckets A, B, and G. Processors 2 and 3 will have 2 sorted lists each. However, since the buckets in the three processors are not in the original order (i.e., A to G), the union of sorted lists from processors 1, 2, and 3 will not produce a sorted list, unless a further operation is carried out.
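The conflict between bucket tuning and processor order can be made concrete with a small experiment. The bucket contents below are invented purely for illustration (the text gives no bucket sizes); only the assignment A, B, G / C, F / D, E comes from the example above, and the function names are hypothetical.

```python
def loads(buckets, assignment):
    """Number of records each processor receives under a bucket assignment."""
    return [sum(len(buckets[name]) for name in names) for names in assignment]


def output_in_processor_order(buckets, assignment):
    """Sort locally on each processor, then concatenate in processor order."""
    out = []
    for names in assignment:
        out.extend(sorted(k for name in names for k in buckets[name]))
    return out


# Hypothetical range buckets: every key in A is smaller than every key in B, etc.
buckets = {
    "A": [1, 2, 3], "B": [4, 5], "C": [6, 7, 8, 9], "D": [10, 11, 12, 13, 14],
    "E": [15, 16], "F": [17, 18, 19], "G": [20, 21],
}
tuned = [["A", "B", "G"], ["C", "F"], ["D", "E"]]        # assignment from the text
contiguous = [["A", "B", "C"], ["D", "E"], ["F", "G"]]   # keeps bucket order
```

With these sizes, the tuned assignment gives every processor 7 records, but the concatenated output jumps from 21 back to 6, so it is not sorted. The contiguous assignment preserves the global order at the cost of uneven loads (9, 7, and 5 records), which is the trade-off the paragraph above describes.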
4.4 PARALLEL ALGORITHMS FOR GROUPBY QUERIES

Parallel aggregate processing is very similar to parallel sorting, described in the previous section. Building on the lessons learned from parallel sorting, we focus on three parallel aggregate query algorithms:

• Traditional methods, including merge-all and hierarchical merging,
• Two-phase method, and
• Redistribution method

4.4.1 Traditional Methods (Merge-All and Hierarchical Merging)

The traditional method was first used in Gamma, one of the first parallel database system prototypes. This method consists of two steps, which are explained as follows.

The first step is a local aggregation step. In this step, each node groups local records according to the designated group-by attribute and performs the aggregate function. Using Query 4.5 as an example, one node may produce (Math, 300) and (Science, 500) and another node (Business, 100) and (Science, 100). The numbers indicate the number of students in each degree.

The second step is a global aggregation step, in which all the temporary results obtained in each node are passed to the host for consolidation in order to produce the global aggregate values. Continuing the above example, (Science, 500) from the first node and (Science, 100) from the second are merged into one record, that is, (Science, 600). This global aggregation step can be tricky, depending on the complexity of the aggregate functions used in the query. If, for example, an AVG function were used instead of COUNT in the above query, then calculating an average from temporary averages must take into account the number of raw records involved in each node. Therefore, for these kinds of aggregate functions, the local aggregate must also produce the number of raw records in each node, although this is not specified in the query. This is needed in order for the global aggregation to produce correct values. 
Query 4.6:
    Select Sdegree, AVG(SAge)
    From STUDENT
    Group By Sdegree;

For example, one node may produce (Science, 21.5, 500) and the other (Science, 22, 100). The host calculates the global average by weighting each local average by its count, that is, by dividing the total of the ages by the total number of students: (21.5 × 500 + 22 × 100) / (500 + 100) ≈ 21.58. The total number of students in each degree needs to be determined in each node, although it is not specified in the SQL.
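The handling of AVG in the traditional method can be sketched as a pair of functions, one for the local step and one for the host. The names `local_avg` and `global_avg` are hypothetical, and records are assumed to be simple (degree, age) pairs standing in for STUDENT rows.

```python
from collections import defaultdict


def local_avg(records):
    """Local aggregation on one node: {degree: (average age, record count)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for degree, age in records:
        sums[degree] += age
        counts[degree] += 1
    return {degree: (sums[degree] / counts[degree], counts[degree]) for degree in sums}


def global_avg(partials):
    """Global aggregation on the host: weight every local average by its count."""
    total_age = defaultdict(float)
    total_count = defaultdict(int)
    for partial in partials:
        for degree, (avg, count) in partial.items():
            total_age[degree] += avg * count   # recover the local sum of ages
            total_count[degree] += count
    return {degree: total_age[degree] / total_count[degree] for degree in total_age}
```

For the two partial results (Science, 21.5, 500) and (Science, 22, 100), `global_avg` returns 12950 / 600, approximately 21.58, whereas naively averaging the two local averages would give 21.75. This is why the count has to travel with each local average.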
[Figure 4.10 Traditional method. Diagram stages, bottom to top: records from the child operator; local aggregation (processors 1 to 4); coordinator (host).]

As the host coordinates all temporary results from each node, this method intuitively works well if the number of nodes is small and the number of resulting records is also very small. But as soon as the number of groups becomes moderate, the host starts to become a bottleneck. In general, the use of a single node for global aggregation forms a serial bottleneck at that node. Figure 4.10 shows the traditional parallel aggregate method.

The hierarchical merging method was introduced in order to overcome the bottleneck at the host in the traditional method. Instead of using one node to do the global aggregation, it uses a binary merging scheme to off-load some of the work from the host node. This binary merging scheme can be explained as follows. For each pair of nodes, the local aggregation results of one of the nodes are sent to the other, where a second level of local aggregates is computed. Once all pairs have been processed, the nodes holding the second-level aggregates are processed in the same manner, until only one node is left; this top node produces the final aggregate results. Figure 4.11 shows the hierarchical merging method.

[Figure 4.11 Hierarchical merging method. Diagram stages, bottom to top: records from the child operator; local aggregation (processors 1 to 4); two-level hierarchical merging using (N-1) nodes in a pipeline.]

Like the traditional method, the hierarchical merging method works well with a small number of results. Although it may handle medium-sized results well, its performance declines when the number of records becomes sufficiently large. This is simply because the final merging phase still creates a bottleneck.

4.4.2 Two-Phase Method

As the name states, the two-phase method consists of two phases: local aggregation and global aggregation. The first phase is the local aggregation phase, where each processor calculates its local aggregate values based on the records on the local processor. In this phase, each processor groups local records according to the designated group-by attribute and performs the aggregate function. Using the same query as an example, one processor may produce, for instance, (Math, 300) and (Science, 500) and another processor (Business, 100) and (Science, 100). The numbers indicate the number of students in these degrees.

The second phase is a global aggregation phase, in which all the temporary results obtained in each processor are redistributed to all processors to produce the global aggregate values. Global aggregation works as follows. After the local aggregates are formed in each processor, each processor sends each of its groups to another processor, depending on the adopted distribution function. A possible distribution function is, for example, that degrees beginning with A to G are distributed to processor 1, H to M to processor 2, N to T to processor 3, and the rest to processor 4. With this range distribution function, the processor that produces (Math, 300) and (Science, 500) will send its (Math, 300) to processor 2 and its (Science, 500) to processor 3. This distribution scheme is commonly used in parallel join, where raw records are partitioned into buckets based on an adopted partitioning scheme like the above range partitioning.

Once the distribution of local results based on a particular distribution function has been completed, global aggregation in each processor is done by simply merging all identical degrees into one aggregate value. For example, processor 3 will merge (Science, 500) from one processor and (Science, 100) from the other to produce (Science, 600), which is the final aggregate value for this degree. The global aggregation for different groups is done in parallel by distributing the local aggregates, which avoids the bottleneck produced by the traditional method. Figure 4.12 illustrates this method. 
In Figure 4.12, the circles indicate processors, and the directed arrows show data flow.

Figure 4.12 Two-phase method (processors 1 to 4 perform local aggregation on records from the child operator, distribute local results based on the group-by attribute, and then perform global aggregation)

4.4.3 Redistribution Method

The redistribution method is influenced by the practice of parallel join algorithms, where raw records are first partitioned and allocated to each processor and then each processor performs its operation. In the context of parallel aggregates, the difference between the redistribution method and other methods is that this method does not process local aggregates. The redistribution method is motivated by the fast message passing of multiprocessor systems.

The first phase (i.e., the partitioning phase) of the redistribution method partitions the raw records based on the group-by attribute according to a distribution function. An example of a partitioning function is, as in the previous example, to allocate to each processor the degrees whose first letter falls within a given range of the alphabet. Using the same range partitioning as described in the previous sections, processor 1 will have all records that have degrees from letter A to G. Other processors will follow on the basis of alphabet division, such as processor 2 from H to M.

Once the partitioning has been completed, each processor will have records within certain groups identified by the group-by attribute. Subsequently, the second phase (the aggregation phase), which calculates the aggregate values of each group, can proceed. Aggregation in each processor can be carried out with a sort or a hash function. As a result of the second phase, each processor will have one aggregate value for each group; for example, processor 3 will have (Science, 600). Since each processor has distinct aggregate groups as a result of partitioning of the group-by attribute, the final query result is a union of all subresults produced by each processor. Figure 4.13 illustrates the redistribution method.
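As a rough illustration of the partitioning and aggregation phases, the following Python sketch simulates the redistribution method in a single process. It is an assumption-laden sketch rather than the authors' code: the names `BOUNDS`, `destination`, and `redistribution_count` are hypothetical, the range boundaries are taken from the A–G, H–M, N–T example, and the hash-based aggregation is done with `collections.Counter`.

```python
from collections import Counter

# Range partitioning on the first letter of the group-by attribute.
# Processors 1..3 get these ranges; everything else goes to processor 4.
BOUNDS = [("A", "G"), ("H", "M"), ("N", "T")]


def destination(degree):
    """Return the processor that owns the group of this degree."""
    first = degree[0].upper()
    for proc, (low, high) in enumerate(BOUNDS, start=1):
        if low <= first <= high:
            return proc
    return len(BOUNDS) + 1


def redistribution_count(partitions):
    """Simulate the redistribution method; partitions holds one list per processor."""
    # Phase 1 (partitioning): ship every raw record; no local aggregation.
    buckets = {proc: [] for proc in range(1, len(BOUNDS) + 2)}
    records_shipped = 0
    for records in partitions:
        for degree in records:
            buckets[destination(degree)].append(degree)
            records_shipped += 1

    # Phase 2 (aggregation): hash-based count on each processor.
    per_processor = {proc: dict(Counter(recs)) for proc, recs in buckets.items()}

    # Groups are disjoint across processors, so the final result is a union.
    final = {}
    for sub_result in per_processor.values():
        final.update(sub_result)
    return per_processor, final, records_shipped


partitions = [
    ["Math"] * 300 + ["Science"] * 500,
    ["Business"] * 100 + ["Science"] * 100,
]
per_processor, final, shipped = redistribution_count(partitions)
print(per_processor[3], shipped)
```

On this input all 1,000 raw records are distributed, whereas the two-phase example in the text distributes only four local (group, count) pairs. The sketch makes that trade-off visible: the redistribution method saves the local aggregation work but pays for it in distribution volume.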
Note that partitioning is done to the raw records, and the aggregate operation on each processor is carried out after the partitioning phase. Also, observe that if the number of groups is less than the number of available processors, not all processors can be utilized, thereby reducing the capability of parallelism.

The cost components for the redistribution method are different from those of the two-phase method, particularly in the first phase, in which the redistribution method does not perform a local aggregation. In the first phase of the redistribution method, the raw records are simply distributed to other processors. Hence, the main cost component of the first phase of the redistribution method is the distribution cost.

Figure 4.13 Redistribution method (records from the child operator are distributed on the group-by attribute to processors 1 to 4, which then aggregate)

4.5 COST MODELS FOR PARALLEL SORT

In addition to the cost notations described in Chapter 2, there are a few new cost notations, which are particularly relevant for parallel sort. These are listed in Table 4.2.

Table 4.2 Additional cost notations for parallel sort

Symbol    Description
System parameters
B         Buffer size
Time unit costs
t_m       Time to merge
t_s       Time to compare and swap two keys
t_v       Time to move a record

Before presenting the cost models for each of the five parallel external sorting methods discussed in the previous section, we will first study the cost models for serial external sort, which are the foundation of the cost models for the parallel versions.

4.5.1 Cost Models for Serial External Merge-Sort

There are two main cost components for serial external sort, the costs relating to I/O and those relating to CPU processing. The I/O costs are the disk costs, which consist of load cost and save cost. These I/O costs are as follows.
• Load cost is the cost of loading data from disk to main memory. Data loading from disk is done by pages.

$$\text{Load cost} = \text{Number of pages} \times \text{Number of passes} \times \text{Input/output unit cost}$$

where $\text{Number of pages} = (R/P)$ and

$$\text{Number of passes} = \left(\left\lceil \log_{B-1}(R/P/B) \right\rceil + 1\right) \tag{4.1}$$

Hence, the above load cost becomes:

$$(R/P) \times \left(\left\lceil \log_{B-1}(R/P/B) \right\rceil + 1\right) \times IO$$

• Save cost is the cost of writing data from the main memory back to the disk. The save cost is actually identical to the load cost, since the number of pages loaded from the disk is the same as the number of pages written back to the disk. No filtering of the input file is done during sorting.

The CPU cost components are determined by the costs involved in getting records out of the data page, sorting, merging, and generating results, which are as follows.

• Select cost is the cost of obtaining a record from the data page, which is calculated as the number of records loaded from the disk times the unit costs of reading from and writing to main memory. The number of records loaded from the disk is influenced by the number of passes, and therefore equation 4.1 above is used here to calculate the number of passes.

$$|R| \times \text{Number of passes} \times (t_r + t_w)$$

• Sorting cost is the internal sorting cost, which has $O(N \times \log_2 N)$ complexity. Using the cost notation, the $O(N \times \log_2 N)$ complexity has the following cost.

$$|R| \times \lceil \log_2(|R|) \rceil \times t_s$$

The sorting cost is the cost of processing a record in pass 0 only.

• Merging cost is applied to pass 1 onward. It is calculated based on the number of records being processed, which is also influenced by the number of passes in the algorithm, multiplied by the merging unit cost. The merging unit cost is assumed to involve a k-way merging where searching for the lowest value in the merging is incorporated in the merging unit cost. Also, bear in mind that 1 must be subtracted from the number of passes, as the first pass (i.e., pass 0) is used by sorting.

$$|R| \times (\text{Number of passes} - 1) \times t_m$$

• Generating result cost is the number of records being generated or produced in each pass before they are written to disk multiplied by the writing unit cost.

$$|R| \times \text{Number of passes} \times t_w$$
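To see how the serial cost components combine, the formulas above can be evaluated with a short Python function. This is only a sketch under assumptions: the function and parameter names are hypothetical, $R$ and $P$ are taken to be the table size and page size in the same unit (so that $R/P$ is a page count, as "Number of pages" suggests), and $t_r$ and $t_w$ are taken to be the per-record read and write unit costs from the book's earlier notation. The number of merge passes is computed by repeated multiplication instead of a floating-point logarithm, and is clamped to zero when the file already fits in one run; that clamp is an addition here, not part of equation 4.1.

```python
import math


def merge_passes(runs, fan_in):
    """Smallest k with fan_in**k >= runs, i.e. ceil(log_fan_in(runs)).

    Uses repeated multiplication to avoid floating-point log errors.
    Returns 0 when runs <= 1 (a clamp added in this sketch).
    """
    k, capacity = 0, 1
    while capacity < runs:
        capacity *= fan_in
        k += 1
    return k


def serial_sort_cost(R, n_records, P, B, IO, t_r, t_w, t_s, t_m):
    """Evaluate the serial external merge-sort cost components."""
    if B < 3:
        raise ValueError("B must be at least 3 so that the merge fan-in B-1 exceeds 1")
    pages = R / P
    passes = merge_passes(pages / B, B - 1) + 1           # equation 4.1
    load = pages * passes * IO
    costs = {
        "passes": passes,
        "load": load,
        "save": load,                                      # identical to load cost
        "select": n_records * passes * (t_r + t_w),
        "sorting": n_records * math.ceil(math.log2(n_records)) * t_s,  # pass 0 only
        "merging": n_records * (passes - 1) * t_m,         # pass 1 onward
        "generating": n_records * passes * t_w,
    }
    costs["total"] = sum(v for name, v in costs.items() if name != "passes")
    return costs


# Illustrative values only: 800 pages, 10,000 records, 5 buffers,
# and every unit cost set to 1.0 so results read as operation counts.
costs = serial_sort_cost(R=800 * 4096, n_records=10_000, P=4096, B=5,
                         IO=1.0, t_r=1.0, t_w=1.0, t_s=1.0, t_m=1.0)
print(costs["passes"], costs["total"])
```

With 800 pages and 5 buffers, pass 0 produces 160 sorted runs and the 4-way merge needs four further passes, so the page count is charged five times in both the load and the save cost.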
4.5.2 Cost Models for Parallel Merge-All Sort

The cost models for parallel merge-all sort are divided into two categories: local merge-sort costs and final merging costs. Local merge-sort costs are the costs of local sorting in each processor using a merge-sort technique, whereas the final merging costs are the costs of consolidating temporary results from all processing elements at the host.

The local merge-sort costs are similar to the serial external merge-sort cost models explained in the previous section, except for two major differences. One difference is that for the local merge-sort costs in parallel merge-all sort the fragment size to be sorted in each processor is determined by the values of $R_i$ and $|R_i|$, instead of just $R$ and $|R|$. This is because in parallel merge-all sort the data has been partitioned to all processors, whereas in the serial external merge-sort only one processor is being used. Since we now use $R_i$ and $|R_i|$, these two cost elements may involve data skew. When skew is involved, the values of $R_i$ and $|R_i|$ are calculated not by a straight division by $N$, but by division by a much lower value than $N$, to account for skewness.

The second difference is that the local merge-sort costs of parallel merge-all sort involve communication costs, which do not appear in the original serial external sort cost models. The communication costs are the costs associated with the data transfer from each processor to the host at the end of the local sorting phase.

The local merge-sort costs, consisting of I/O costs, CPU costs, and communication costs, are summarized as follows.
• I/O costs, which consist of load and save costs, are as follows:

$$\text{Save cost} = \text{Load cost} = (R_i/P) \times \text{Number of passes} \times IO \tag{4.2}$$

where $\text{Number of passes} = \left(\left\lceil \log_{B-1}(R_i/P/B) \right\rceil + 1\right)$

• CPU costs, which consist of select cost, sorting cost, merging cost, and generating results cost, are as follows:

$$\text{Select cost} = |R_i| \times \text{Number of passes} \times (t_r + t_w)$$

$$\text{Sorting cost} = |R_i| \times \lceil \log_2(|R_i|) \rceil \times t_s$$

$$\text{Merging cost} = |R_i| \times (\text{Number of passes} - 1) \times t_m$$

$$\text{Generating result cost} = |R_i| \times \text{Number of passes} \times t_w$$

where Number of passes is as shown in equation 4.2 above.

• Communication costs for sending local sorted results to the host are given by the number of pages to be transferred multiplied by the message unit cost, as follows:

$$\text{Communication cost} = (R_i/P) \times (m_p + m_l)$$

The final merging costs involve communication costs, I/O costs, and CPU costs. The communication costs are the costs involved when the host receives data from all other processors. The I/O and CPU costs are the costs associated directly with the merging process at the host. The three cost components for the final merging costs are given as follows.

• Communication cost, which is the receiving record cost from local sorting operators, is calculated by the number of records being received (in this case the total number of records from all processors) multiplied by the message unit cost.

$$\text{Communication cost} = (R/P) \times m_p$$

• I/O cost, which consists of load and save costs, is influenced by two factors, the total number of records being received and processed and the number of passes in the merging of $N$ subfiles. When the data is first received from the local sorting operator, the data has to be written out to the disk in the host. After this, the host starts the k-way merging process by first loading the data from the local host disk, processing them, and saving the results back to the local host disk. As the k-way merging process may be done in a number of passes, data loading and saving are carried out as many times as the number of passes in the merging process. Moreover, the total number of data savings is one more than the total number of data loadings, as the first data saving must be done when the data is first received by the host.

$$\text{Save cost} = (R/P) \times (\text{Number of merging passes} + 1) \times IO$$

$$\text{Load cost} = (R/P) \times \text{Number of merging passes} \times IO \tag{4.3}$$

where $\text{Number of merging passes} = \lceil \log_{B-1}(N) \rceil$

Note that the Number of merging passes is determined by the number of processors $N$ and the number of buffers. The number of processors $N$ serves as the number of streams in the k-way merging, and each stream contains a sorted list of data, which is obtained from the local sorting phase. Since all processors participate in the local sorting phase, the value of $N$ is not influenced by skew. Whether or not there is data skew in the local sorting phase, all processors will have at least one record to work with, and subsequently when these data are transferred to the host, none of the streams is empty.

• CPU cost consists of the select costs, merging costs, and generating results costs only. Sorting costs are not included since the host does not sort data but only merges. CPU costs are determined by the total number of records being merged, the number of merging passes, and the unit cost.

$$\text{Select cost} = |R| \times \text{Number of merging passes} \times (t_r + t_w)$$

$$\text{Merging cost} = |R| \times \text{Number of merging passes} \times t_m$$

$$\text{Generating result cost} = |R| \times \text{Number of merging passes} \times t_w$$

where Number of merging passes is as shown in equation 4.3 above.

There are two things to mention regarding the above final merging costs. First, the host processes all records, and hence $R$ and $|R|$ are used in the cost equations,