System for Querying Syntactically Annotated Corpora

Petr Pajas and Jan Štěpánek
Charles Univ. in Prague, MFF ÚFAL
Malostranské nám. 25, 118 00 Prague 1, Czech Rep.
pajas@ufal.mff.cuni.cz, stepanek@ufal.mff.cuni.cz

Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, pages 33–36, Suntec, Singapore, 3 August 2009. © 2009 ACL and AFNLP

Abstract

This paper presents a system for querying treebanks. The system consists of a powerful query language with natural support for cross-layer queries, a client interface with a graphical query builder and visualizer of the results, a command-line client interface, and two substitutable query engines: a very efficient engine using a relational database (suitable for large static data), and a slower engine with support for parallel computing that operates on treebank files (suitable for "live" data).

1 Introduction

Syntactically annotated treebanks are a great resource of linguistic information that is hardly available, or not available at all, in flat text corpora. Retrieving this information requires specialized tools. Some of the best-known tools for querying treebanks include TigerSEARCH (Lezius, 2002), TGrep2 (Rohde, 2001), MonaSearch (Maryns and Kepser, 2009), and NetGraph (Mírovský, 2006). All these tools offer great power when querying a single annotation layer with nodes labeled by "flat" feature records.

However, most of the existing systems are poorly equipped for applications on structurally complex treebanks, involving for example multiple interconnected annotation layers, multi-lingual parallel annotations with node-to-node alignments, or annotations where nodes are labeled by attributes with complex values such as lists or nested attribute-value structures. The Prague Dependency Treebank 2.0 (Hajič and others, 2006), PDT 2.0 for short, is a good example of a treebank with multiple annotation layers and richly structured attribute values. NetGraph was the tool traditionally used for querying PDT, but it still does not directly support cross-layer queries, unless the layers are merged together at the cost of losing some structural information.

The presented system attempts to combine and extend features of the existing query tools and resolve the limitations mentioned above. We are grateful to an anonymous referee for pointing us to ANNIS2 (Zeldes and others, 2009), another system that targets annotation on multiple levels.

2 System Overview

Our system, named PML Tree Query (PML-TQ), consists of three main components (discussed further in the following sections):

• an expressive query language that supports cross-layer queries and arbitrary boolean combinations of statements and can query complex data structures. It also includes a sub-language for generating listings and non-trivial statistical reports, which goes far beyond the statistical features of e.g. TigerSEARCH.

• client interfaces: a graphical user interface with a graphical query builder, a customizable visualization of the results, and a command-line interface.

• two interchangeable engines that evaluate queries: a very efficient engine that requires the treebank to be converted into a relational database, and a somewhat slower engine which operates directly on treebank files and is useful especially for data in the process of annotation and experimental data.

The query language applies to a generic data model associated with an XML-based data format called Prague Markup Language or PML (Pajas and Štěpánek, 2006). Although PML was developed in connection with PDT 2.0, it was designed as a universally applicable data format based on abstract data types, completely independent of a
particular annotation schema. It can capture simple linear annotations as well as annotations with one or more richly structured interconnected annotation layers. A concrete PML-based format for a specific annotation is defined by describing the data layout and XML vocabulary in a special file called PML Schema and referring to this schema file from individual data files.

It is relatively easy to convert data from other formats to PML without loss of information. In fact, PML-TQ is implemented within the TrEd framework (Pajas and Štěpánek, 2008), which uses PML as its native data format and already offers all kinds of tools for working with treebanks in several formats using on-the-fly transformation to PML (for XML input via XSLT).

The whole framework is covered by an open-source license and runs on most current platforms. It is also language and script independent (operating internally with Unicode).

The graphical client for PML-TQ is an extension to the tree editor TrEd that already serves as the main annotation tool for treebank projects (including PDT 2.0) in various countries. The client and server communicate over the HTTP protocol, which makes it easy to use the PML-TQ engine as a service for other applications.

3 Query Language

A PML-TQ query consists of a part that selects nodes in the treebank, and an optional part that generates a report from the selected occurrences.

The selective part of the query specifies conditions that a group of nodes must satisfy to match the query. The conditions can be formulated as arbitrary boolean combinations of subqueries and simple statements that can express all kinds of relations between nodes and/or attribute values. This part of the query can be visualized as a graph with vertices representing the matching nodes, connected by various types of edges. The edges (visualized by arrows of different colors and styles) represent various types of relations between the nodes. There are four kinds of these relations:

• topological relations (child, descendant, depth-first-precedes, order-precedes, same-tree-as, same-document-as) and their reversed counterparts (parent, ancestor, depth-first-follows, order-follows)

• inter- or cross-layer ID-based references

• user-implemented relations, i.e. relations whose low-level implementation is provided by the user as an extension to PML-TQ [1] (for example, we define relations eparent and echild for PDT 2.0 to distinguish effective dependency from technical dependency).

• transitive closures of the preceding two types of relations (e.g. if coref_text.rf is a relation representing textual coreference, then coref_text.rf{4,} is a relation representing chains of textual coreference of length at least 4).

[1] In some future version, users will also be able to define new relations as separate PML-TQ queries.

The query can be accompanied by an optional part consisting of a chain of output filters that can be used to extract data from the matching nodes, compute statistics, and/or format and post-process the results of a query.

Let us examine these features on an example of a query over PDT 2.0, which looks for Czech words that have a patient or effect argument in infinitive form:

    t-node $t := [
      child t-node $s := [
        functor in { "PAT", "EFF" },
        a/lex.rf $a ] ];
    a-node $a := [
      m/tag ~ '^Vf',
      0x child a-node [ afun = 'AuxV' ] ];
    >> for $s.functor,$t.t_lemma
       give $1, $2, count()
       sort by $3 desc

The square brackets enclose conditions regarding one node, so t-node $t := [...] is read 't-node $t with ...'. A comma is synonymous with logical and. See Fig. 1 for the graphical representation of the query and one match.

This particular query selects occurrences of a group of three nodes, $t, $s, and $a, with the following properties: $t and $s are both of type t-node, i.e. nodes from a tectogrammatical tree (the types are defined in the PML Schema for PDT 2.0); $s is a child of $t; the functor attribute of $s has either the value PAT or EFF; the node $s points to a node of type a-node, named $a, via an ID-based reference a/lex.rf (this expression in fact retrieves the value of an attribute lex.rf from an attribute-value structure stored in the attribute a of $s); $a has an attribute m carrying an attribute-value structure with the attribute
tag matching the regular expression ^Vf (in the PDT 2.0 tag set this indicates that $a is an infinitive); $a has no child node that is an auxiliary verb (afun = 'AuxV'). This last condition is expressed as a sub-query with zero occurrences (0x).

[Figure 1: Graphical representation of a query (left) and a result spanning two annotation layers. The matched sentence is "Zapomněli jsme dýchat." [We-forgot (aux) to-breathe.]]

The selective part of the query is followed by one output filter (starting with >>). It returns three values for each match: the functor of $s, the tectogrammatical lemma of $t, and, for each distinct pair of these two values, the number of occurrences of this pair counted over the whole matching set. The output is sorted by the third column in descending order. It may look like this:

    PAT  možnost        115
    PAT  schopný        110
    EFF  a               85
    PAT  #Comma          83
    PAT  rozhodnout_se   75

In the PML data model, attributes (like a of $t, m of $a in our example) can carry complex values: attribute-value structures, lists, and sequences of named elements, which in turn may contain other complex values. PML-TQ addresses values nested within complex data types by attribute paths whose notation is somewhat similar to XPath (e.g. m/tag or a/[2]/aux.rf). An attribute path evaluated on a given node may return more than one value. This happens for example when there is a list value on the attribute path: the expression m/w/token='a', where m is a list of attribute-value structures, reads as "some value returned by m/w/token equals 'a'". By prefixing the path with a *, we may write "all values returned by m/w/token equal 'a'" as *m/w/token='a'.

We can also fix one value returned by an attribute path using the member keyword and query it the same way we query a node in the treebank:

    t-node $n := [
      member bridging [
        type = "CONTRAST",
        target.rf t-node [ functor="PAT" ]]]

where bridging is an attribute of t-node containing a list of labeled graph edges (attribute-value structures). We select one that has type CONTRAST and points to a node with functor PAT.

4 Query Editor and Client

[Figure 2: The PML-TQ graphical client in TrEd]

The graphical user interface lets the user build the query graphically or in text form; in both cases it assists the user by offering available node types, applicable relations, attribute paths, and values for enumerated data types. It communicates with the query engine and displays the results (matches, reports, number of occurrences).
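As an aside on the attribute paths of Section 3, the two readings of a comparison (some value matches versus all values match) can be illustrated with a minimal Python sketch. It is not the PML-TQ implementation: the nested dict/list representation of a node and the helper names path_values, some_equal, and all_equal are assumptions made for illustration only, and index steps such as [2] are not handled.

```python
def path_values(value, steps):
    """Collect every value reached by following `steps`, fanning out over lists."""
    if isinstance(value, list):
        # A list on the path contributes the results of each of its members.
        return [v for item in value for v in path_values(item, steps)]
    if not steps:
        return [value]
    if isinstance(value, dict) and steps[0] in value:
        return path_values(value[steps[0]], steps[1:])
    return []  # the path does not exist on this value


def some_equal(node, path, expected):
    """Reading of path='x': at least one returned value equals `expected`."""
    return any(v == expected for v in path_values(node, path.split("/")))


def all_equal(node, path, expected):
    """Reading of *path='x': every returned value equals `expected`.

    Note: Python's all() is True for an empty list; how PML-TQ treats a path
    that returns no values is not stated in the paper.
    """
    return all(v == expected for v in path_values(node, path.split("/")))


# Hypothetical node whose attribute m is a list of attribute-value structures.
node = {"m": [{"w": {"token": "a"}}, {"w": {"token": "b"}}]}

print(path_values(node, ["m", "w", "token"]))  # ['a', 'b']
print(some_equal(node, "m/w/token", "a"))      # True
print(all_equal(node, "m/w/token", "a"))       # False
```

The sketch only shows why a single path expression can yield several values once a list occurs on the path, and how the * prefix changes the quantifier applied to them.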
Colors are used to indicate which node in the query graph corresponds to which node in the result. Matches from different annotation layers are displayed in parallel windows. For each result, the user can browse the complete document for context. Individual results can be saved in the PML format or printed to PostScript, PDF, or SVG. The user can also bookmark any tree from the result set, using the bookmarking features of TrEd. The queries are stored in a local file. [2]

[2] The possibility of storing the queries in a user account on the server is planned.

5 Engines

For practical reasons, we have developed two engines that evaluate PML-TQ queries.

The first one is based on a translator of PML-TQ to SQL. It utilizes the power of modern relational databases [3] and provides excellent performance and scalability (answering typical queries over a 1-million-word treebank in a few seconds). To use this engine, the treebank must be converted into read-only database tables, similarly to (Bird and others, 2006), which makes this engine more suitable for data that do not change too often (e.g. final versions of treebanks).

[3] The system supports Oracle Database (version 10g or newer, the free XE edition is sufficient) and PostgreSQL (version at least 8.4 is required for complete functionality).

For querying working data or data not likely to be queried repeatedly, we have developed an index-less query evaluator written in Perl, which searches arbitrary data files sequentially. Although generally slower than the database implementation (partly due to the cost of parsing the input PML data format), its performance can be boosted using built-in support for parallel execution on a computer cluster.

Both engines are accessible through an identical client interface. Thus, users can run the same query over a treebank stored in a database as well as over their local files of the same type.

While implementing the system, we periodically verified that both engines produce the same results on a large set of test queries. This testing proved invaluable not only for maintaining consistency, but also for discovering bugs in the two implementations and for performance tuning.

6 Conclusion

We have presented a powerful open-source system for querying treebanks that extends an established framework. The current version of the system is available at http://ufal.mff.cuni.cz/~pajas/pmltq.

Acknowledgments

This paper as well as the development of the system is supported by the grant Information Society of GA AV ČR under contract 1ET101120503 and by the grant GAUK No. 22908.

References

Steven Bird et al. 2006. Designing and evaluating an XPath dialect for linguistic queries. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, page 52. IEEE Computer Society.

Jan Hajič et al. 2006. The Prague Dependency Treebank 2.0. CD-ROM. Linguistic Data Consortium (CAT: LDC2006T01).

Wolfgang Lezius. 2002. Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. thesis, IMS, University of Stuttgart, December. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung (AIMS), volume 8, number 4.

Hendrik Maryns and Stephan Kepser. 2009. Monasearch – querying linguistic treebanks with monadic second-order logic. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 2009).

Jiří Mírovský. 2006. Netgraph: A tool for searching in Prague Dependency Treebank 2.0. In Proceedings of the 5th Workshop on Treebanks and Linguistic Theories (TLT 2006), pages 211–222.

Petr Pajas and Jan Štěpánek. 2006. XML-based representation of multi-layered annotation in the PDT 2.0. In Proceedings of the LREC Workshop on Merging and Layering Linguistic Information (LREC 2006), pages 40–47.

Petr Pajas and Jan Štěpánek. 2008. Recent advances in a feature-rich framework for treebank annotation. In The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, volume 2, pages 673–680. The Coling 2008 Organizing Committee.

Douglas L.T. Rohde. 2001. TGrep2 the next-generation search engine for parse trees. http://tedlab.mit.edu/~dr/Tgrep2/.

Amir Zeldes et al. 2009. Information structure in African languages: Corpora and tools. In Proceedings of the Workshop on Language Technologies for African Languages (AFLAT), 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), Athens, Greece, pages 17–24.
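The cross-checking of the two engines mentioned in Section 5 can be pictured with a small Python sketch of differential testing. This is only an illustration under stated assumptions: the paper does not describe how the comparison is done, the engines are represented here by hypothetical callables that map a query string to a list of result rows, and rows are compared as a multiset, which ignores row order (a simplification that would not be appropriate for queries ending in a sort filter).

```python
from collections import Counter


def normalize(rows):
    """Treat a result as a multiset of rows so that row order does not matter."""
    return Counter(tuple(row) for row in rows)


def find_disagreements(queries, engine_a, engine_b):
    """Return the queries for which the two engines give different results."""
    disagreements = []
    for query in queries:
        if normalize(engine_a(query)) != normalize(engine_b(query)):
            disagreements.append(query)
    return disagreements


# Hypothetical stand-ins for the SQL-based and the file-based engine.
sql_results = {
    "q1": [("PAT", "a", 2), ("EFF", "b", 1)],
    "q2": [("PAT", "c", 1)],
}
file_results = {
    "q1": [("EFF", "b", 1), ("PAT", "a", 2)],  # same rows, different order
    "q2": [("PAT", "c", 3)],                   # different count
}

mismatches = find_disagreements(
    ["q1", "q2"],
    lambda q: sql_results[q],
    lambda q: file_results[q],
)
print(mismatches)  # ['q2']
```

Any query reported by such a check points to a bug in at least one of the two implementations, which is the use of the test set described in Section 5.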