Advances in Database Technology- P14


Fig. 7. Impact of cube dimensionality increase on the CUBE File size

We used synthetic data sets that were produced with an OLAP data generator that we have developed. Our aim was to create data sets with a realistic number of dimensions and hierarchy levels. In Table 1, we present the hierarchy configuration for each dimension used in the experimental data sets. The shortest hierarchy consists of 2 levels, while the longest consists of 10 levels. We tried to give each data set a good mixture of hierarchy lengths. Table 2 shows the data set configuration for each series of experiments.

In order to evaluate the adaptation to sparse data spaces, we created cubes that were very sparse. Therefore the number of input tuples was kept between a small and a moderate level. To simulate the cube data distribution, for each cube we created ten hyper-rectangular regions as data point containers. These regions were defined randomly at the most detailed level of the cube and not by combinations of hierarchy values (although this would be more realistic), in order not to particularly favor the CUBE File due to its hierarchical chunking. We then filled each region with uniformly spread data points and tried to maintain the same number of data points in each region.
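The data generation procedure just described can be sketched in a few lines. The following Python fragment is a hypothetical illustration (the dimension cardinalities, region count, and point counts are made-up parameters, not the configurations of Tables 1 and 2), and it ignores measure values:

```python
import random

def generate_sparse_cube(dim_cardinalities, num_regions=10, total_points=100_000, seed=42):
    """Toy generator: pick random hyper-rectangular regions at the grain level of the
    cube and fill each with roughly the same number of uniformly spread data points."""
    rng = random.Random(seed)
    regions = []
    for _ in range(num_regions):
        # One random [lo, hi] interval per dimension, chosen at the most detailed
        # (grain) level, independently of any combination of hierarchy values.
        regions.append([tuple(sorted(rng.sample(range(card), 2))) for card in dim_cardinalities])

    points_per_region = total_points // num_regions
    facts = []
    for region in regions:
        for _ in range(points_per_region):
            # A data point is one grain-level coordinate per dimension, drawn uniformly.
            facts.append(tuple(rng.randint(lo, hi) for lo, hi in region))
    return regions, facts

# Example: a 5-dimensional cube with grain-level cardinalities 1000, 500, 200, 100, 50.
regions, facts = generate_sparse_cube([1000, 500, 200, 100, 50], total_points=10_000)
```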
Fig. 8. Size ratio between the UB-tree and the CUBE File for increasing dimensionality

Fig. 9. Size scalability in the number of input tuples (i.e., stored data points)

4.2 Structure Experiments

Fig. 7 shows the size of the CUBE File as the dimensionality of the cube increases. The vertical axis is in logarithmic scale. We see the cube data space size (i.e., the product of the dimension grain-level cardinalities) "exploding" exponentially as the number of dimensions increases. The CUBE File size remains many orders of magnitude smaller than the data space. Moreover, the CUBE File size is also smaller than the ASCII file containing the input tuples to be loaded into SISYPHUS. This clearly shows that the CUBE File:

1. Adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points.
2. Achieves a compression of the input data, since it does not store the data point coordinates (i.e., the h-surrogate keys of the dimension values) in each cell but only the measure values.

Furthermore, we wish to point out that the current CUBE File implementation ([6]) does not apply any compression to the intermediate nodes (i.e., the directory chunks). Only the data chunks are compressed, by means of a bitmap representing the cell offsets, which, however, is itself stored uncompressed.
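The offset-bitmap idea can be made concrete with a small sketch. The following Python fragment is a simplified illustration in which a data chunk is a fixed-size grid of cells of which only a few are occupied; it is not the SISYPHUS/CUBE File storage layout:

```python
class DataChunk:
    """Simplified data chunk: a bitmap marks which cell offsets are occupied, and
    only the measures of occupied cells are stored, packed in offset order.
    Coordinates are implicit in the offset, so they need not be stored per cell."""

    def __init__(self, num_cells):
        self.bitmap = [0] * num_cells   # one bit per cell offset (kept uncompressed here)
        self.measures = []              # packed measure values of the non-empty cells

    def insert(self, offset, measure):
        # Rank of the offset among occupied cells = position in the packed list.
        rank = sum(self.bitmap[:offset])
        self.bitmap[offset] = 1
        self.measures.insert(rank, measure)

    def get(self, offset):
        if not self.bitmap[offset]:
            return None                 # empty cell: no space was allocated for it
        return self.measures[sum(self.bitmap[:offset])]

chunk = DataChunk(num_cells=64)
chunk.insert(5, 17.0)
chunk.insert(42, 3.5)
assert chunk.get(42) == 3.5 and chunk.get(7) is None
```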
Leaving these structures uncompressed was a deliberate choice, in order to evaluate the compression achieved merely by the "pruning ability" of our chunk-to-bucket allocation scheme, according to which no space is allocated for empty chunk-trees (i.e., empty data space regions). Therefore, regarding the compression achieved, the following could improve the compression ratio even further: (a) compression of directory chunks and (b) compression of offset-bitmaps (e.g., with run-length encoding).

Fig. 8 shows the ratio of the UB-tree size to the CUBE File size for increasing dimensionality. We see that the UB-tree imposes a greater storage overhead than the CUBE File in almost all cases. Indeed, the CUBE File remains 2-3 times smaller in size than the UB-tree/MHC. For eight dimensions both structures have approximately the same size, but for nine dimensions the CUBE File size is four times larger. This is primarily due to the increase of the size of the intermediate nodes in the CUBE File, since for 9 dimensions and 100,000 data points the data space has become extremely sparse. As we noted above, our implementation does not apply any compression to the directory chunks. Therefore, it is reasonable that for such extremely sparse data spaces the overhead from these chunks becomes significant, since a single data point might trigger the allocation of all the cells in the parent nodes. An implementation that also incorporated compression of directory chunks would alleviate this effect substantially.

Fig. 9 depicts the size of the CUBE File as the number of cube data points (i.e., input tuples) scales up, while the cube dimensionality remains constant (five dimensions with a good mixture of hierarchy lengths – see Table 1). In the same graph we show the corresponding size of the UB-tree/MHC and the size of the root-bucket. The CUBE File maintains a lower storage cost for all tuple cardinalities. Moreover, the UB-tree size increases at a faster rate, making the difference between the two larger as the number of tuples increases. The root-bucket size is substantially lower than the CUBE File size and exhibits an almost constant behaviour. Note that in our implementation we store the whole root-directory in the root-bucket, and thus the whole root-directory is kept in main memory during query evaluation. Thus the graph also shows that the root-directory size very quickly becomes negligible compared to the CUBE File size as the number of data points increases. Indeed, for cubes containing more than 1 million tuples the root-directory size is below 5% of the CUBE File size, although the directory chunks are stored uncompressed in our current implementation. Hence it is feasible to keep the whole root-directory in main memory.

4.3 Query Experiments

For the query experiments we ran a total of 5,234 HPP queries, both on the CUBE File and on the UB-tree/MHC. These queries were classified in three classes: (a) 1,593 prefix queries, (b) 1,806 prefix range queries and (c) 1,835 prefix multi-range queries. A prefix query is one in which we access the data points by a specific chunk-id prefix. For example, such a prefix query can be written as a chunk expression that denotes the restriction on each hierarchy of a 3-dimensional cube with 4 chunking depth levels.
This expression represents a chunk-id access pattern, denoting the cells that we need to access in each chunk. A wildcard entry means "any", i.e., no restriction is imposed on this dimension level. The greatest depth containing at least one restriction is called the maximum depth of restrictions; in this example it corresponds to the D domain and thus equals 1. The greater the maximum depth of restrictions, the fewer the returned data points (smaller cube selectivity), and vice versa.

A prefix range query is a prefix query that includes at least one range selection on a hierarchy level, thus resulting in a larger selection hyper-rectangle at the grain level of the cube. Finally, a prefix multi-range query is a prefix range query that includes at least one multiple range restriction on a hierarchy level, of the form {[a-b],[c-d],...}. This results in multiple disjoint selection hyper-rectangles at the grain level. (A small illustrative sketch of such restrictions follows below.)

As mentioned earlier, our goal was to evaluate the hierarchical clustering achieved by means of the I/Os performed for the evaluation of these queries. To this end, we ran two series of experiments: the hot-cache experiments and the cold-cache ones. In the hot-cache experiments we assumed that the root-bucket (containing the whole root-directory) is cached in main memory and counted only the remaining bucket I/Os. For the UB-tree in the hot-cache case, we counted only the page I/Os at the leaf level, omitting the intermediate node accesses altogether. In contrast, in the cold-cache experiments we also counted, for each query on the CUBE File, the size of the whole root-bucket, while for the UB-tree we counted both intermediate and leaf-level page accesses. The root-bucket size equals 295 buckets; the following figures show the sizes of the two structures for the data set used:

UB-tree total number of pages: 15,752
CUBE File total number of buckets: 4,575
Root-bucket number of buckets: 295
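To make the three query classes and the maximum depth of restrictions concrete, here is a small Python sketch. The notation is hypothetical and is not the chunk-expression syntax used by SISYPHUS: a restriction is given per dimension as a list with one entry per chunking depth, where None stands for the "any" wildcard, a single value is a point restriction, and a list of (lo, hi) pairs is a (multi-)range restriction.

```python
ANY = None  # wildcard: no restriction at this depth for this dimension

def max_depth_of_restrictions(restriction):
    """restriction[d][k] is the selection of dimension d at chunking depth k.
    The maximum depth of restrictions is the greatest depth carrying at least
    one non-wildcard entry on some dimension (depths numbered from 0)."""
    depths = [k for dim in restriction for k, sel in enumerate(dim) if sel is not ANY]
    return max(depths) if depths else -1

def classify(restriction):
    """Rough classification mirroring the three query classes of the experiments."""
    selections = [sel for dim in restriction for sel in dim if sel is not ANY]
    ranges = [sel for sel in selections if isinstance(sel, list)]
    if not ranges:
        return "prefix"
    if any(len(r) > 1 for r in ranges):
        return "prefix multi-range"
    return "prefix range"

# 3-dimensional cube, 4 chunking depth levels; only depths 0 and 1 are restricted.
q = [[2, ANY, ANY, ANY],                  # dimension 1: point restriction at depth 0
     [0, [(3, 7)], ANY, ANY],             # dimension 2: single range at depth 1
     [1, [(0, 2), (5, 6)], ANY, ANY]]     # dimension 3: multi-range at depth 1
print(classify(q), max_depth_of_restrictions(q))   # -> prefix multi-range 1
```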
Fig. 11 shows the I/O ratio between the UB-tree and the CUBE File for all three classes of queries in the hot-cache case. This ratio is calculated from the total number of I/Os over all queries of the same maximum depth of restrictions for each data structure. As the maximum depth of restrictions increases, the cube selectivity essentially decreases (i.e., fewer data points are returned in the result set). We see that the UB-tree performs more I/Os for all depths and for all query classes. For small-depth restrictions, where the selection rectangles are very large, the CUBE File performs 3 times fewer I/Os than the UB-tree. Moreover, for more restrictive queries the CUBE File is multiple times better, achieving up to 37 times fewer I/Os. An explanation for this is that the smaller the selection hyper-rectangle, the greater becomes the percentage of accessed UB-tree leaf pages that contain very few (or even none) of the qualifying data points. Thus more I/Os are required on the whole in order to evaluate the restriction, and for large-depth restrictions the UB-tree performs even worse, because it essentially fails to cluster the data with respect to the more detailed hierarchy levels. This behaviour was also observed in [7], where for queries with small cube selectivities the UB-tree performance was worse and the hierarchical clustering effect was reduced. We believe this is due to the way data are clustered into z-regions (i.e., disk pages) along the z-curve [1]. In contrast, the hierarchical chunking applied in the CUBE File creates groups of data (i.e., chunks) that belong to the same "hierarchical family", even for the most detailed levels. This, in combination with the chunk-to-bucket allocation that guarantees that hierarchical families will be clustered together, results in better hierarchical clustering of the cube, even for the most detailed levels of the hierarchies.

Fig. 10. Size ratio between the UB-tree and the CUBE File for increasing tuple cardinality

Fig. 11. I/O ratios for the hot-cache experiments

Note that in two subsets of queries the returned result set was empty (prefix multi-range queries at two of the maximum depths of restrictions). The UB-tree had to descend down to the leaf level and access the corresponding pages, performing I/Os essentially for nothing. In contrast, the CUBE File performed no I/Os, since directly from a root-directory node it could identify an empty subtree and thus terminate the search immediately. Since the denominator was zero, we depict the corresponding ratios for these two cases in Fig. 11 with a zero value.

Fig. 12 shows the I/O ratios for the cold-cache experiments. In this figure we can observe the impact of having to read the whole root-directory into memory for each query on the CUBE File. For queries with small-depth restrictions (large result set), the difference in the performed I/Os for the two structures remains essentially the same as in the hot-cache case. However, for larger-depth restrictions (smaller result set), the overhead imposed by the root-directory reduces the difference between the two, as expected.
Nevertheless, the CUBE File is still multiple times better in all cases, clearly demonstrating better hierarchical clustering. Furthermore, note that even if no cache area is available, in reality there will never be a case where the whole root-directory is accessed for answering a single query. Naturally, only the relevant buckets of the root-directory are accessed for each query.

Fig. 12. I/O ratios for the cold-cache experiments

5 Summary and Conclusions

In this paper we presented the CUBE File, a novel file structure for organizing the most detailed data of an OLAP cube. This structure is primarily aimed at speeding up ad hoc OLAP queries containing restrictions on the hierarchies, which comprise the most typical OLAP workload. The key features of the CUBE File are the following. It is a natively multidimensional data structure. It explicitly supports dimension hierarchies, enabling fast access to cube data via a directory of chunks formed exactly from the hierarchies. It clusters data with respect to the dimension hierarchies, resulting in reduced I/O cost for query evaluation. It imposes a low storage overhead, basically for two reasons: (a) it adapts perfectly to the extensive sparseness of the cube, not allocating space for empty regions, and (b) it does not need to store the dimension values along with the measures of the cube, due to its location-based mechanism for accessing cells. These two result in a significant compression of the data space. Moreover, this compression can be increased even further if compression of intermediate nodes is employed. Finally, it achieves high space utilization, filling the buckets to capacity.

We have verified the aforementioned performance aspects of the CUBE File by running an extensive set of experiments, and we have also shown that the CUBE File outperforms the UB-tree/MHC, the most effective method proposed up to now for hierarchically clustering the cube, in terms of storage cost and number of disk I/Os. Furthermore, the CUBE File fits perfectly into the processing framework for ad hoc OLAP queries over hierarchically clustered fact tables (i.e., cubes) proposed in our previous work [7]. In addition, it directly supports the effective hierarchical pre-grouping transformation [13, 19], since it uses hierarchically encoded surrogate keys. Finally, it can be used as a physical base for implementing a chunk-based caching scheme [3].
Acknowledgements. We wish to thank Transaction Software GmbH for providing us with Transbase Hypercube to run the UB-tree/MHC experiments. This work has been partially funded by the European Union's Information Society Technologies Programme (IST) under project EDITH (IST-1999-20722).

References

1. R. Bayer: The Universal B-Tree for Multi-dimensional Indexing: General Concepts. WWCA 1997.
2. C. Y. Chan, Y. E. Ioannidis: Bitmap Index Design and Evaluation. SIGMOD 1998.
3. P. Deshpande, K. Ramasamy, A. Shukla, J. F. Naughton: Caching Multidimensional Queries Using Chunks. SIGMOD 1998.
4. J. Gray, A. Bosworth, A. Layman, H. Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996.
5. N. Karayannidis: Storage Structures, Query Processing and Implementation of On-Line Analytical Processing Systems. Ph.D. Thesis, National Technical University of Athens, 2003. Available at: http://www.dblab.ece.ntua.gr/~nikos/thesis/PhD_thesis_en.pdf.
6. N. Karayannidis, T. Sellis: SISYPHUS: The Implementation of a Chunk-Based Storage Manager for OLAP Data Cubes. Data and Knowledge Engineering 45(2): 155-188, May 2003.
7. N. Karayannidis et al.: Processing Star Queries on Hierarchically-Clustered Fact Tables. VLDB 2002.
8. L. V. S. Lakshmanan, J. Pei, J. Han: Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
9. V. Markl, F. Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering. IDEAS 1999.
10. P. E. O'Neil, G. Graefe: Multi-Table Joins Through Bitmapped Join Indices. SIGMOD Record 24(3): 8-11 (1995).
11. J. Nievergelt, H. Hinterberger, K. C. Sevcik: The Grid File: An Adaptable, Symmetric Multikey File Structure. TODS 9(1): 38-71 (1984).
12. P. E. O'Neil, D. Quass: Improved Query Performance with Variant Indexes. SIGMOD 1997.
13. R. Pieringer et al.: Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing. ICDE 2003.
14. F. Ramsak et al.: Integrating the UB-Tree into a Database System Kernel. VLDB 2000.
15. S. Sarawagi: Indexing OLAP Data. Data Engineering Bulletin 20(1): 36-43 (1997).
16. Y. Sismanis, A. Deligiannakis, N. Roussopoulos, Y. Kotidis: Dwarf: Shrinking the PetaCube. SIGMOD 2002.
17. S. Sarawagi, M. Stonebraker: Efficient Organization of Large Multidimensional Arrays. ICDE 1994.
18. The Transbase Hypercube® relational database system. http://www.transaction.de.
19. A. Tsois, T. Sellis: The Generalized Pre-Grouping Transformation: Aggregate-Query Optimization in the Presence of Dependencies. VLDB 2003.
20. R. Weber, H.-J. Schek, S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205.
Efficient Schema-Based Revalidation of XML

Mukund Raghavachari (IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA, raghavac@us.ibm.com) and Oded Shmueli (Technion – Israel Institute of Technology, Haifa, Israel, oshmu@cs.technion.ac.il)

E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 639–657, 2004. © Springer-Verlag Berlin Heidelberg 2004

Abstract. As XML schemas evolve over time or as applications are integrated, it is sometimes necessary to validate an XML document known to conform to one schema with respect to another schema. More generally, XML documents known to conform to a schema may be modified and then require validation with respect to another schema. Recently, solutions have been proposed for incremental validation of XML documents. These solutions assume that the initial schema to which a document conforms and the final schema with which it must be validated after modifications are the same. Moreover, they assume that the input document may be preprocessed, which in certain situations may be computationally and memory intensive. In this paper, we describe how knowledge of conformance to an XML Schema (or DTD) may be used to determine conformance to another XML Schema (or DTD) efficiently. We examine both the situation where an XML document is modified before it is to be revalidated and the situation where it is unmodified.

1 Introduction

The ability to validate XML documents with respect to an XML Schema [21] or DTD is central to XML's emergence as a key technology for application integration. As XML data flow between applications, the conformance of the data to either a DTD or an XML Schema provides applications with a guarantee that a common vocabulary is used and that structural and integrity constraints are met. In manipulating XML data, it is sometimes necessary to validate data with respect to more than one schema. For example, as a schema evolves over time, XML data known to conform to older versions of the schema may need to be verified with respect to the new schema. An intra-company schema used by a business might differ slightly from a standard, external schema, and XML data valid with respect to one may need to be checked for conformance to the other.

The validation of an XML document that conforms to one schema with respect to another schema is analogous to the cast operator in programming languages. It is useful, at times, to access data of one type as if it were associated with a different type. For example, XQuery [20] supports a validate operator which converts a value of one type into an instance of another type. The type safety of this conversion cannot always be guaranteed statically.
At runtime, XML fragments known to correspond to one type must be verified with respect to another. As another example, in XJ [9], a language that integrates XML into Java, XML variables of one type may be updated and then cast to another type. A compiler for such a language does not have access to the documents that are to be revalidated, so techniques for revalidation that rely on preprocessing the document [3,17] are not appropriate.

The question we ask is: how can one use knowledge of a document's conformance to one schema to determine whether the document is valid according to another schema? We refer to this problem as the schema cast validation problem. An obvious solution is to revalidate the document with respect to the new schema, but in doing so one disregards useful information. The knowledge of a document's conformance to a schema can help determine its conformance to another schema more efficiently than full validation. The more general situation, which we refer to as schema cast with modifications validation, is where a document conforming to a schema is modified slightly and then verified with respect to a new schema. When the new schema is the same as the one to which the document conformed originally, schema cast with modifications validation addresses the same problem as the incremental validation problem for XML [3,17]. Our solution to this problem has different characteristics, as will be described.

The scenario we consider is that a source schema A and a target schema B are provided and may be preprocessed statically. At runtime, documents valid according to schema A are verified with respect to schema B. In the modification case, inserts, updates, and deletes are performed on a document before it is verified with respect to B. Our approach takes advantage of similarities (and differences) between the schemas A and B to avoid validating portions of a document where possible.

Consider the two XML Schema element declarations for purchaseOrder shown in Figure 1. The only difference between the two is that whereas the billTo element is optional in the schema of Figure 1a, it is required in the schema of Figure 1b. Not all XML documents valid with respect to the first schema are valid with respect to the second; only those with a billTo element would be valid. Given a document valid according to the schema of Figure 1a, an ideal validator would only check the presence of a billTo element and ignore the validation of the other components (they are guaranteed to be valid).

This paper focuses on the validation of XML documents with respect to the structural constraints of XML Schemas. We present algorithms for schema cast validation, with and without modifications, that avoid traversing subtrees of an XML document where possible. We also provide an optimal algorithm for revalidating strings known to conform to a deterministic finite state automaton according to another deterministic finite state automaton; this algorithm is used to revalidate the content models of elements. The fact that the content models of XML Schema types are deterministic [6] can be used to show that our algorithm for XML Schema cast validation is optimal as well. We describe our algorithms in terms of an abstraction of XML Schemas, abstract XML Schemas, which model the structural constraints of XML Schema. In our experiments, our algorithms achieve 30-95% performance improvement over Xerces 2.4.
Fig. 1. Schema fragments defining a purchaseOrder element in (a) the source schema and (b) the target schema

The contributions of this paper are the following:

1. An abstraction of XML Schema, abstract XML Schema, which captures the structural constraints of XML Schema more precisely than specialized DTDs [16] and regular type expressions [11].
2. Efficient algorithms for schema cast validation (with and without updates) of XML documents with respect to XML Schemas. We describe optimizations for the case where the schemas are DTDs. Unlike previous algorithms, our algorithms do not preprocess the documents that are to be revalidated.
3. Efficient algorithms for revalidation of strings, with and without modifications, according to deterministic finite state automata. These algorithms are essential for efficient revalidation of the content models of elements.
4. Experiments validating the utility of our solutions.

Structure of the Paper: We examine related work in Section 2. In Section 3, we introduce abstract XML Schemas and provide an algorithm for XML Schema revalidation. The algorithm relies on an efficient solution to the problem of string revalidation according to finite state automata, which is provided in Section 4. We discuss the optimality of our algorithms in Section 5. We report on experiments in Section 6, and conclude in Section 7.

2 Related Work

Papakonstantinou and Vianu [17] treat incremental validation of XML documents (typed according to specialized DTDs). Their algorithm keeps data structures that encode validation computations with document tree nodes and utilizes these structures to revalidate a document. Barbosa et al. [3] present an algorithm that also encodes validation computations within tree nodes. They take advantage of the 1-unambiguity of content models of DTDs and XML Schemas [6], and of structural properties of a restricted set of DTDs, to revalidate documents. Our algorithm is designed for the case where the schemas can be preprocessed, but the documents to be revalidated are not available a priori to be preprocessed.
Examples include message brokers, programming language and query compilers, etc. In these situations, techniques that preprocess the document and store state information at each node could incur unacceptable memory and computing overhead, especially if the number of updates is small with respect to the document, or the size of the document is large. Moreover, our algorithm handles the case where the document must be revalidated with respect to a different schema. Kane et al. [12] use a technique based on query modification for handling the incremental update problem. Bouchou and Halfeld-Ferrari [5] present an algorithm that validates each update using a technique based on tree automata. Again, both algorithms consider only the case where the schema to which the document must conform after modification is the same as the original schema.

The subsumption of XML Schema types used in our algorithm for schema cast validation is similar to Kuper and Siméon's notion of type subsumption [13]. Their type system is more general than our abstract XML Schema. They assume that a subsumption mapping is provided between types such that if one schema is subsumed by another, and if a value conforming to the subsumed schema is annotated with types, then by applying the subsumption mapping to these type annotations, one obtains an annotation for the subsuming schema. Our solution is more general in that we do not require either schema to be subsumed by the other, but we do handle the case where this occurs. Furthermore, we do not require type annotations on nodes. Finally, we consider the notion of disjoint types, in addition to subsumption, in the revalidation of documents.

One approach to handling XML and XML Schema has been to express them in terms of formal models such as tree automata. For example, Lee et al. describe how XML Schema may be represented in terms of deterministic tree grammars with one lookahead [15]. The formalism for XML Schema and the algorithms in this paper are a more direct solution to the problem, which obviates some of the practical problems of the tree automata approach, such as having to encode unranked XML trees as ranked trees.

Programming languages with XML types [1,4,10,11] define notions of types and subtyping that are enforced statically. XDuce [10] uses tree automata as the base model for representing XML values. One difference between our work and XDuce is that we are interested in dynamic typing (revalidation), where static analysis is used to reduce the amount of needed work. Moreover, unlike XDuce's regular expression types and specialized DTDs [17], our model for XML values captures exactly the structural constraints of XML Schema (and is not equivalent to regular tree automata). As a result, our subtyping algorithm is polynomial rather than exponential in complexity.

3 XML Schema and DTD Conformance

In this section, we present the algorithm for revalidation of documents according to XML Schemas. We first define our abstractions for XML documents (ordered labeled trees) and for XML Schema (abstract XML Schema). Abstract XML Schema captures precisely the structural constraints of XML Schema.
Ordered Labeled Trees. We abstract XML documents as ordered labeled trees. An ordered labeled tree over a finite alphabet of labels is a pair consisting of an ordered tree, with a finite set of nodes N and a set of edges E, and a labeling function that associates a label with each node of N. A distinguished simple-value label, which can only be associated with leaves of the tree, represents XML Schema simple values. We use root(T) to denote the root of a tree T, and we abuse notation slightly by letting the labeling function applied to T denote the label of the root node of the ordered labeled tree T. We also write an ordered tree as a root together with the sequence of its subtrees, where a root with an empty sequence of subtrees denotes a tree whose root has no children, and we refer to the set of all ordered labeled trees over the alphabet.

Abstract XML Schema. XML Schemas, unlike DTDs, permit the decoupling of an element tag from its type; an element may have different types depending on context. XML Schemas are not as powerful as regular tree automata: the XML Schema specification places restrictions on the decoupling of element tags and types. Specifically, in validating a document according to an XML Schema, each element of the document can be assigned a single type, based on the element's label and the type of the element's parent (without considering the content of the element). Furthermore, this type assignment is guaranteed to be unique.

We define an abstraction of XML Schema, an abstract XML Schema, as a 4-tuple consisting of:

- the alphabet of element labels (tags);
- the set of types defined in the schema;
- a set of type declarations, one for each type, where a declaration is either a simple type declaration or a complex type declaration; a complex type declaration consists of a regular expression over element labels, whose language describes the allowed sequences of children labels, together with a function that assigns a type to each element label used in the type declaration. This function abstracts the notion of XML Schema that each child of an element can be assigned a type based on its label, without considering the child's content. It also models the XML Schema constraint that if two children of an element have the same label, they must be assigned the same type;
- a partial function which states which element labels can occur as the root element of a valid tree according to the schema, and the type this root element is assigned.

Consider the XML Schema fragment of Figure 1a. The root-assignment function maps global element declarations to their appropriate types (in Figure 1a, the global purchaseOrder element label is mapped to the type POType1). Table 1 shows the type declaration for POType1 in our formalism.

Abstract XML Schemas do not explicitly represent atomic types, such as xsd:integer. For simplicity of exposition, we have assumed that all XML Schema atomic and simple types are represented by a single simple type.
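To keep the abstraction concrete, the following Python sketch models abstract XML Schemas roughly as defined above. It is a simplified, hypothetical rendering: the class and field names, and the use of Python regular expressions over space-separated label strings as content models, are our own choices, not the paper's notation. The schema fragment is only in the spirit of Figure 1a.

```python
from dataclasses import dataclass, field
import re

@dataclass
class SimpleType:
    name: str                      # stands for the single merged simple type

@dataclass
class ComplexType:
    name: str
    content_model: str             # regular expression over element labels
    child_type: dict = field(default_factory=dict)   # element label -> type name

@dataclass
class AbstractSchema:
    labels: set                    # alphabet of element labels (tags)
    types: dict                    # type name -> SimpleType | ComplexType
    roots: dict                    # root element label -> type name (partial function)

# A fragment in the spirit of Figure 1a: purchaseOrder with an optional billTo child.
source = AbstractSchema(
    labels={"purchaseOrder", "shipTo", "billTo", "items"},
    types={
        "POType1": ComplexType(
            name="POType1",
            content_model=r"shipTo( billTo)? items",          # billTo is optional
            child_type={"shipTo": "xs", "billTo": "xs", "items": "xs"},
        ),
        "xs": SimpleType("xs"),
    },
    roots={"purchaseOrder": "POType1"},
)

def matches_content(schema, type_name, child_labels):
    """Check a sequence of children labels against a complex type's content model."""
    t = schema.types[type_name]
    return bool(re.fullmatch(t.content_model, " ".join(child_labels)))

print(matches_content(source, "POType1", ["shipTo", "items"]))            # True
print(matches_content(source, "POType1", ["shipTo", "billTo", "items"]))  # True
```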
Handling atomic and simple types, restrictions on these types, and relationships between the values denoted by these types is a straightforward extension. We do not address the identity constraints (such as key and keyref constraints) of XML Schema in this paper; this is an area of future work. Other features of XML Schema, such as substitution groups, subtyping, and namespaces, can be integrated into our model. A discussion of these issues is beyond the scope of the paper.

We define the validity of an ordered labeled tree with respect to an abstract XML Schema as follows:

Definition 1. The set of ordered labeled trees that are valid with respect to a type is defined in terms of the least solution to a set of equations, one for each type of the schema. For a simple type, the valid trees are those whose content is a simple value. For a complex type, the valid trees are those whose root has a sequence of children such that the string of children labels belongs to the language of the type's content model, and each child subtree is valid with respect to the type assigned to its label by the type declaration. An ordered labeled tree T is valid with respect to a schema if the root-assignment function is defined for the label of root(T) and T is valid with respect to the assigned type.

If a complex type's content model contains the empty string, the set of trees valid with respect to it contains all trees of height 0 whose root node has a label that is assigned that type; that is, such an element may have an empty content model. We are interested only in productive types, that is, types whose set of valid trees is non-empty. We assume that for a schema all types are productive. Whether a type is productive can be verified easily as follows (a Python sketch of this marking procedure follows the listing):

1. Mark all simple types as productive, since by the definition of validity they contain trees of height 1 whose leaves carry simple values.
2. For each complex type, compute the set of element labels in its declaration that are assigned a type already marked productive.
3. Mark a complex type as productive if its content model contains the empty string, or there is a string in the language of its content model that uses only labels from the set computed in Step 2.
4. Repeat Steps 2 and 3 until no more types can be marked as productive.

This procedure identifies all productive types defined in a schema.
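A minimal Python sketch of the marking procedure follows. Content models are represented here as a tiny regular-expression AST rather than the regex strings of the earlier sketch, because Step 3 needs to reason about which strings a content model can generate, not merely match a given string; the representation and names are our own simplifications.

```python
# Content models as a tiny regex AST over element labels:
#   ("label", a) | ("seq", e1, e2) | ("alt", e1, e2) | ("star", e) | ("eps",)
def nullable(e):
    """True if the content model admits the empty string."""
    tag = e[0]
    if tag == "eps":   return True
    if tag == "label": return False
    if tag == "star":  return True
    if tag == "seq":   return nullable(e[1]) and nullable(e[2])
    if tag == "alt":   return nullable(e[1]) or nullable(e[2])

def derivable(e, allowed):
    """True if some string in the language of e uses only labels in `allowed`."""
    tag = e[0]
    if tag == "eps":   return True
    if tag == "label": return e[1] in allowed
    if tag == "star":  return True                     # the empty string always qualifies
    if tag == "seq":   return derivable(e[1], allowed) and derivable(e[2], allowed)
    if tag == "alt":   return derivable(e[1], allowed) or derivable(e[2], allowed)

def productive_types(types):
    """types: name -> ("simple",) or ("complex", content_model_ast, {label: type_name}).
    Fixpoint marking of productive types, mirroring Steps 1-4 of the text."""
    productive = {n for n, d in types.items() if d[0] == "simple"}        # Step 1
    changed = True
    while changed:                                                         # Step 4
        changed = False
        for name, decl in types.items():
            if name in productive or decl[0] != "complex":
                continue
            _, expr, child_type = decl
            allowed = {lab for lab, t in child_type.items() if t in productive}  # Step 2
            if nullable(expr) or derivable(expr, allowed):                 # Step 3
                productive.add(name)
                changed = True
    return productive

types = {"Text": ("simple",),
         "Para": ("complex", ("star", ("label", "text")), {"text": "Text"})}
print(productive_types(types))   # {'Text', 'Para'}
```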
There is a straightforward algorithm for converting a schema with non-productive types into one that contains only productive types. The basic idea is to modify the declaration of each productive type so that the language of its new regular expression is restricted to those strings of the original content model that use only labels assigned to productive types.

Validation of an ordered labeled tree with respect to an abstract XML Schema proceeds top-down. At each node, a utility method constructstring creates a string from the labels of the root nodes of the node's sequence of child subtrees (it returns the empty string if the sequence is empty); this string is checked against the content model of the node's type, and each child subtree is then validated recursively against the type assigned to its label. Note that if a node has no children, the body of the loop over the children is not executed. (A simplified sketch of this procedure is given after Definition 3 below.)

A DTD can be viewed as an abstract XML Schema where each element label is assigned a unique type, irrespective of the context in which it is used. In other words, for every element label there exists a single type such that every complex type declaration either does not assign a type to the label or assigns it that type, and if the label may occur as a root, it is assigned that same type there as well.

3.1 Algorithm Overview

Given two abstract XML Schemas, S and S', and an ordered labeled tree T that is valid according to S, our algorithm validates T with respect to S and S' in parallel. Suppose that during the validation of T with respect to S' we wish to validate a subtree of T with respect to a type of S', and let the type assigned to that subtree during the validation of T with respect to S be given. If one can assert that every ordered labeled tree that is valid according to the S-type is also valid according to the S'-type, then one can immediately deduce the validity of the subtree according to the S'-type. Conversely, if no ordered labeled tree that is valid according to the S-type is also valid according to the S'-type, then one can stop the validation immediately, since the subtree will not be valid according to the S'-type. We use subsumed type and disjoint type relationships to avoid traversals of subtrees of T where possible:

Definition 2. A type t is subsumed by a type t', denoted t ⊑ t', if every ordered labeled tree that is valid with respect to t is also valid with respect to t'. Note that t and t' can belong to different schemas.

Definition 3. Two types t and t' are disjoint if no ordered labeled tree is valid with respect to both t and t'. Again, note that t and t' can belong to different schemas.
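The following is a simplified Python rendering of the plain top-down validation procedure referred to above, using the AbstractSchema representation of the earlier sketch (content models as regular expressions over space-separated children labels); it is our own illustration, not the paper's pseudocode.

```python
import re

def construct_string(children):
    """Concatenate the labels of the root nodes of a sequence of subtrees."""
    return " ".join(label for label, _ in children)   # "" if the sequence is empty

def validate(schema, tree, type_name):
    """tree is (label, children); children is a list of subtrees (empty for leaves
    carrying simple values). Returns True iff tree is valid w.r.t. type_name."""
    label, children = tree
    decl = schema.types[type_name]
    if isinstance(decl, SimpleType):
        return children == []                          # simple content only
    if not re.fullmatch(decl.content_model, construct_string(children)):
        return False                                   # children labels violate the content model
    # Each child is validated against the type its label is assigned in this declaration.
    return all(validate(schema, child, decl.child_type[child[0]]) for child in children)

def valid_document(schema, tree):
    root_label = tree[0]
    return root_label in schema.roots and validate(schema, tree, schema.roots[root_label])

# Example: a purchaseOrder without billTo is valid w.r.t. the source schema sketched earlier.
doc = ("purchaseOrder", [("shipTo", []), ("items", [])])
print(valid_document(source, doc))   # True
```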
In the following sections, we present algorithms for determining whether an abstract XML Schema type is subsumed by another or is disjoint from another. We then present an algorithm for efficient schema cast validation of an ordered labeled tree, with and without updates. Finally, for the case where the abstract XML Schemas represent DTDs, we describe optimizations that are possible if additional indexing information is available on ordered labeled trees.

3.2 Schema Cast Validation

Our algorithm relies on relations that capture precisely all subsumed type and disjoint type information with respect to the types defined in S and S'. We first describe how these relations are computed, and then present our algorithm for schema cast validation.

Computing the Subsumption Relation

Definition 4. Given two schemas S and S', let ⊑* be the largest relation on their types such that for every pair (t, t') in the relation one of the following two conditions holds:
i. t and t' are both simple types.
ii. t and t' are both complex types, the language of the content model of t is contained in the language of the content model of t', and for every element label to which the declaration of t assigns a type, the declaration of t' also assigns a type to that label and the two assigned types are again related by ⊑*.

As mentioned before, for exposition reasons we have chosen to merge all simple types into one common simple type. It is straightforward to extend the definition above so that the various XML Schema atomic and simple types, and their derivations, are used to bootstrap the definition of the subsumption relationship. Also, observe that ⊑* is a finite relation, since there are finitely many types. The following theorem states that the relation captures precisely the notion of subsumption defined earlier:

Theorem 1. t ⊑* t' if and only if t ⊑ t'.

We now present an algorithm for computing the ⊑* relation (a small sketch follows the listing). The algorithm starts with a subset of the cross product of the two sets of types and refines it successively until ⊑* is obtained:

1. Start with the relation containing all pairs (t, t') such that both t and t' are simple types or both are complex types.
2. For every pair of complex types (t, t') in the relation, if the language of the content model of t is not contained in the language of the content model of t', remove (t, t') from the relation.
3. For each pair of complex types (t, t') still in the relation, if there exists an element label to which the declaration of t assigns a type whose pair with the type assigned by t' is not in the relation (or to which t' assigns no type at all), remove (t, t') from the relation.
4. Repeat Step 3 until no more tuples can be removed from the relation.
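A minimal Python sketch of this refinement follows. To keep it tiny, it assumes each complex type's content model is given as an explicit finite set of children-label sequences, so that the language containment test of Step 2 becomes plain set inclusion; in reality content models are regular languages and containment would be checked on their (deterministic) automata. Names and representation are our own.

```python
def compute_subsumption(types_s, types_t):
    """Fixpoint refinement mirroring Steps 1-4 of the text.
    types_*: name -> ("simple",) or ("complex", set_of_label_tuples, {label: type_name}).
    For this sketch a content model is an explicit finite set of label sequences,
    so 'language containment' is plain set inclusion."""
    rel = {(a, b) for a, da in types_s.items() for b, db in types_t.items()
           if da[0] == db[0]}                                    # Step 1: same kind of type
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            da, db = types_s[a], types_t[b]
            if da[0] != "complex":
                continue
            _, lang_a, child_a = da
            _, lang_b, child_b = db
            if not lang_a <= lang_b:                             # Step 2: containment fails
                rel.discard((a, b)); changed = True; continue
            for label, ta in child_a.items():                    # Step 3: children must be related
                if label not in child_b or (ta, child_b[label]) not in rel:
                    rel.discard((a, b)); changed = True
                    break
    return rel
```

With the finite-language simplification the loop clearly terminates, since every iteration only removes pairs from the relation.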
Computing the Disjointness Relation. Rather than computing the disjointness relation directly, we compute its complement, the non-disjointness relation. Formally:

Definition 5. Given two schemas S and S', let the non-disjointness relation be defined as the smallest relation (least fixpoint) over their types such that a pair (t, t') belongs to it if:
i. t and t' are both simple types, or
ii. t and t' are both complex types and the languages of their content models share a string all of whose labels are assigned, by the two type declarations, types that are themselves non-disjoint.

To compute the relation, the algorithm begins with an empty relation and adds tuples until the non-disjointness relation is obtained:

1. Start with the empty relation.
2. Add all pairs (t, t') such that both t and t' are simple types.
3. For each pair of complex types (t, t'), let A be the set of element labels whose types assigned by t and t' are already known to be non-disjoint. If the languages of the two content models share a string that uses only labels from A, add (t, t') to the relation.
4. Repeat Step 3 until no more tuples can be added to the relation.

Theorem 2. Two types are disjoint if and only if their pair does not belong to the computed non-disjointness relation.

Algorithm for Schema Cast Validation. Given the subsumption and disjointness relations, if at any time a subtree of the document that is valid with respect to a type t from S is being validated with respect to a type t' from S', and t ⊑ t', then the subtree need not be examined (since, by definition, the subtree belongs to the set of trees valid for t'). On the other hand, if t and t' are disjoint, the document can immediately be determined to be invalid with respect to S'. Pseudocode for incremental validation of the document follows this scheme (a simplified sketch is given below); again, constructstring is a utility method that creates a string from the labels of the root nodes of a sequence of trees (returning the empty string if the sequence is empty). We can efficiently verify the content model of a node with respect to S' by using techniques for finite automata schema cast validation, as will be described in Section 4.
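In the same simplified setting, the pruned parallel traversal can be sketched as follows. `subsumed` and `disjoint` stand for the precomputed relations (sets of type-name pairs), content-model checking is plain regular-expression matching rather than the automaton technique of Section 4, and the tree and schema representations are those of the earlier sketches; all names are ours.

```python
import re

def cast_validate(schema_s, schema_t, tree, type_s, type_t, subsumed, disjoint):
    """Validate `tree` (known valid w.r.t. type_s of schema_s) against type_t of schema_t,
    skipping subtrees whenever the precomputed relations allow it."""
    if (type_s, type_t) in subsumed:
        return True                        # every tree valid for type_s is valid for type_t
    if (type_s, type_t) in disjoint:
        return False                       # no tree can be valid for both types
    label, children = tree
    decl_s, decl_t = schema_s.types[type_s], schema_t.types[type_t]
    if isinstance(decl_t, SimpleType):
        return children == []
    child_labels = " ".join(c[0] for c in children)
    if not re.fullmatch(decl_t.content_model, child_labels):
        return False                       # children sequence violates the new content model
    return all(
        cast_validate(schema_s, schema_t, child,
                      decl_s.child_type[child[0]], decl_t.child_type[child[0]],
                      subsumed, disjoint)
        for child in children)
```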
3.3 Schema Cast Validation with Modifications

Given an ordered labeled tree T that is valid with respect to an abstract XML Schema S, and a sequence of insertions and deletions of nodes and modifications of element tags, we discuss how the tree may be validated efficiently with respect to a new abstract XML Schema S'. The updates permitted are the following:

1. Modify the label of a specified node with a new label.
2. Insert a new leaf node before, or after, or as the first child of a node.
3. Delete a specified leaf node.

Given a sequence of updates, we perform the updates on T and, at each step, encode the modifications on T to obtain an annotated tree, by extending the alphabet with special element tags: a node whose tag was modified carries both the original tag from T and the new tag, a newly inserted node carries its new tag together with an insertion mark, and a deleted node carries a deletion mark. Nodes that have not been modified have their labels unchanged. By discarding all nodes marked as deleted and replacing the label of every other marked node with its new tag, one obtains the tree that is the result of performing the modifications on T.

We assume the availability of a function modified on the nodes of the annotated tree that returns, for each node, whether any part of the subtree rooted at that node has been modified. The function modified can be implemented efficiently as follows (see the sketch below). We assume we have the Dewey decimal number of the node (generated dynamically as we process). Whenever a node is updated, we keep it in a trie [7] according to its Dewey decimal number. To determine whether a descendant of a node was modified, the trie is searched according to the Dewey decimal number of the node. Note that we can navigate the trie in parallel to navigating the XML tree.
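A minimal sketch of the modified() test via a trie over Dewey numbers is shown below; Dewey numbers are represented as tuples of child positions, and the class layout and names are our own, not an implementation from the paper.

```python
class DeweyTrie:
    """Trie keyed by Dewey decimal numbers (tuples of child positions).
    Inserting the Dewey number of every updated node lets us answer, for any
    node, whether its subtree contains a modification."""

    def __init__(self):
        self.children = {}
        self.marked = False        # True if exactly this Dewey number was updated

    def insert(self, dewey):
        node = self
        for step in dewey:
            node = node.children.setdefault(step, DeweyTrie())
        node.marked = True

    def modified(self, dewey):
        """True if the node with this Dewey number, or any descendant, was updated."""
        node = self
        for step in dewey:
            node = node.children.get(step)
            if node is None:
                return False       # no update anywhere below this path
        return True                # some update lies at or under this node

updates = DeweyTrie()
updates.insert((1, 0, 2))          # an update somewhere under the root's second child
print(updates.modified((1, 0)))    # True: an update lies in this subtree
print(updates.modified((2,)))      # False
```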
The algorithm for efficient validation of schema casts with modifications validates the annotated tree with respect to S and S' in parallel. While processing a subtree with respect to a type from S and a type from S', one of the following cases applies:

1. If modified is false for the subtree, we can run the algorithm described in the previous subsection on it: since the subtree is unchanged, and we know it is valid with respect to its type from S, we can treat its validation as an instance of the schema cast validation problem (without modifications) described in Section 3.2.
2. Otherwise, if the root of the subtree is marked as deleted, we do not need to validate the subtree with respect to any type of S', since that subtree has been deleted.
3. Otherwise, if the root of the subtree is marked as newly inserted, we have no knowledge of its validity with respect to any other schema; therefore, we must validate the whole subtree explicitly.
4. Otherwise, since elements may have been added to or deleted from the original content model of the node, we must ensure that the content of the node is valid with respect to the type from S'. If that type is a simple type, the content must satisfy the simple-type case of Definition 1. If it is a complex type, one must check that the children, after the encoded modifications are applied, fit into its content model: the string of children labels, with deleted children discarded and modified or inserted children contributing their new tags, must belong to the language of the content model. If the content model check succeeds and the type from S is also a complex type, we continue recursively validating the children with respect to their assigned types from S and from S' (a child marked as deleted does not have to be validated, since it has been deleted). If the type from S is not a complex type, we must validate each child explicitly.

3.4 DTDs

Since the type of an element in an XML Schema may depend on the context in which it appears, in general it is necessary to process the document in a top-down manner to determine the type with which one must validate an element (and its subtree). For DTDs, however, an element label uniquely determines the element's type. As a result, there are optimizations that apply to the DTD case that cannot be applied to the general XML Schema case. If one can access all instances of an element label in an ordered labeled tree directly, one need only visit those elements whose types in S and S' are neither subsumed by nor disjoint from each other, and verify their immediate content models.

4 Finite Automata Conformance

In this section, we examine the schema cast validation problem (with and without modifications) for strings verified with respect to finite automata. The algorithms described in this section support efficient content model checking for DTDs and XML Schemas (for example, the content model check in the validation method of Section 3.2). Since XML Schema content models correspond directly to deterministic finite state automata, we only address that case. Similar techniques can be applied to non-deterministic finite state automata, though the optimality results do not hold. For reasons of space, we omit details regarding non-deterministic finite state automata.

4.1 Definitions

A deterministic finite automaton is a 5-tuple (Q, Σ, q0, F, δ), where Q is a finite set of states, Σ is a finite alphabet of symbols, q0 ∈ Q is the start state, F ⊆ Q is the set of final, or accepting, states, and δ is the transition function, a map from Q × Σ to Q. Without loss of generality, we assume that δ(q, a) is defined for all q ∈ Q and a ∈ Σ.
We use δ(q, a) = q' to denote that δ maps (q, a) to q'. For a string w and a state q, δ(q, w) denotes the state reached by operating on w one symbol at a time. A string w is accepted by a finite state automaton A if δ(q0, w) ∈ F; w is rejected by the automaton if it is not accepted by it. The language accepted (or recognized) by a finite automaton A, denoted L(A), is the set of strings accepted by A. We also define, for each state q, the language of q as the set of strings w such that δ(q, w) ∈ F. Note that, for a finite state automaton, if a string w is in the language of a state q and δ(q0, v) = q, then vw is in L(A). We shall drop the subscript identifying the automaton when it is clear from the context.

A state q is a dead state if either:
1. there is no string w such that δ(q0, w) = q, or
2. there is no string w such that δ(q, w) ∈ F.
In other words, either the state is not reachable from the start state or no final state is reachable from it. We can identify all dead states in a finite state automaton in time linear in the size of the automaton via a simple graph search.

Intersection Automata. Given two automata A1 = (Q1, Σ, q01, F1, δ1) and A2 = (Q2, Σ, q02, F2, δ2), one can derive an intersection automaton that accepts exactly the language L(A1) ∩ L(A2). The intersection automaton evaluates a string on both A1 and A2 in parallel and accepts only if both would. Formally, the intersection automaton is (Q1 × Q2, Σ, (q01, q02), F1 × F2, δ), where δ((p, q), a) = (δ1(p, a), δ2(q, a)). Note that if A1 and A2 are deterministic, the intersection automaton is deterministic as well.

Immediate Decision Automata. We introduce immediate decision automata as modified finite state automata that accept or reject strings as early as possible. Immediate decision automata can accept or reject a string when certain conditions are met, without scanning the entire string. Formally, an immediate decision automaton is a 7-tuple (Q, Σ, q0, F, δ, IA, IR), where IA and IR are disjoint sets and each member of IA and IR is a state of Q. As with ordinary finite state automata, a string w is accepted by the automaton if δ(q0, w) ∈ F. Furthermore, the automaton also accepts w after evaluating a strict prefix u of w if δ(q0, u) ∈ IA, and it rejects w after evaluating a strict prefix u of w if δ(q0, u) ∈ IR. We can derive an immediate decision automaton from a finite state automaton A so that both automata accept the same language.

Definition 6. Let A = (Q, Σ, q0, F, δ) be a finite state automaton. The derived immediate decision automaton is (Q, Σ, q0, F, δ, IA, IR), where IA is the set of states from which every string leads to a final state and IR is the set of states from which no string leads to a final state.
It can be easily shown that A and its derived immediate decision automaton accept the same language. For deterministic automata, we can determine all states that belong to IA and IR efficiently, in time linear in the number of states of the automaton. The members of IR can be derived easily from the dead states of A.

4.2 Schema Cast Validation

The problem that we address is the following: given two deterministic finite state automata A1 and A2, and a string w ∈ L(A1), does w ∈ L(A2)? One could, of course, scan w using A2 to determine acceptance by A2. When many strings that belong to L(A1) are to be validated with respect to A2, it can be more efficient to preprocess A1 and A2 so that the knowledge of acceptance by A1 can be used to determine membership in L(A2). Without loss of generality, we assume that the two automata share the same alphabet.

Our method for the efficient validation of a string in L(A1) with respect to A2 relies on evaluating the string on A1 and A2 in parallel. Assume that after parsing a prefix of w we are in a state p in A1 and a state q in A2. Then, we can:

1. Accept immediately if the language of p in A1 is contained in the language of q in A2, because the remaining suffix is guaranteed to be in the language of p (since A1 accepts w), which implies that it will be in the language of q; by definition, A2 will then accept w.
2. Reject immediately if the languages of p in A1 and of q in A2 are disjoint. Then the remaining suffix is guaranteed not to be in the language of q, and therefore A2 will not accept w.

We construct an immediate decision automaton from the intersection automaton of A1 and A2, with IA and IR based on the two conditions above:

Definition 7. Let A∩ be the intersection automaton derived from two finite state automata A1 and A2. The derived immediate decision automaton extends A∩ with IA = {(p, q) | the language of p in A1 is contained in the language of q in A2} and IR = {(p, q) | (p, q) is a dead state of A∩}.

Theorem 3. For all w ∈ L(A1), the derived immediate decision automaton accepts w if and only if w ∈ L(A2).

The determination of the members of IA can be done efficiently for deterministic finite state automata. The following proposition is useful to this end.

Proposition 1. For any state (p, q) of the intersection automaton, (p, q) does not belong to IA if and only if there exist states p' and q', reached from p and q on a common string, such that p' is a final state of A1 and q' is not a final state of A2.

We now present an alternative, equivalent definition of this set.

Definition 8. The complement of IA is the smallest set of state pairs such that (p, q) belongs to it if p is a final state of A1 and q is not a final state of A2, or if (δ1(p, a), δ2(q, a)) already belongs to the set for some symbol a.
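To close, here is a small Python sketch of the string-revalidation idea for content models: build the product of the two DFAs, precompute the immediate-accept and immediate-reject sets, and then scan each string only as far as necessary. DFAs are given as plain transition dictionaries, and the fixpoints follow the reachable-witness characterizations above; this is our own illustration under those assumptions, not the paper's implementation.

```python
from itertools import product

class DFA:
    def __init__(self, states, alphabet, start, finals, delta):
        self.states, self.alphabet = states, alphabet
        self.start, self.finals, self.delta = start, finals, delta   # delta[(q, a)] -> q'

def immediate_decision_sets(a1, a2):
    """IA: product states (p, q) from which every continuation accepted by a1 is accepted by a2.
    IR: product states from which no continuation is accepted by both automata."""
    pairs = set(product(a1.states, a2.states))
    # Complement of IA: least fixpoint of 'a witness string is reachable'.
    bad = {(p, q) for (p, q) in pairs if p in a1.finals and q not in a2.finals}
    changed = True
    while changed:
        changed = False
        for (p, q) in pairs - bad:
            if any((a1.delta[(p, c)], a2.delta[(q, c)]) in bad for c in a1.alphabet):
                bad.add((p, q)); changed = True
    ia = pairs - bad
    # IR: product states from which no pair with both components final is reachable.
    good = {(p, q) for (p, q) in pairs if p in a1.finals and q in a2.finals}
    changed = True
    while changed:
        changed = False
        for (p, q) in pairs - good:
            if any((a1.delta[(p, c)], a2.delta[(q, c)]) in good for c in a1.alphabet):
                good.add((p, q)); changed = True
    return ia, pairs - good

def revalidate(a1, a2, w, ia, ir):
    """w is known to be accepted by a1; decide membership in L(a2), stopping early when possible."""
    p, q = a1.start, a2.start
    for c in w:
        if (p, q) in ia: return True      # every valid continuation is accepted by a2
        if (p, q) in ir: return False     # no continuation can be accepted by both
        p, q = a1.delta[(p, c)], a2.delta[(q, c)]
    return q in a2.finals
```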