Scientific report: "WISDOM: A Web Information Credibility Analysis System"
WISDOM: A Web Information Credibility Analysis System

Susumu Akamine†, Daisuke Kawahara†, Yoshikiyo Kato†, Tetsuji Nakagawa†, Kentaro Inui†, Sadao Kurohashi†‡, Yutaka Kidawara†
† National Institute of Information and Communications Technology
‡ Graduate School of Informatics, Kyoto University
{akamine, dk, ykato, tnaka, inui, kidawara}@nict.go.jp, kuro@i.kyoto-u.ac.jp

Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, pages 1–4, Suntec, Singapore, 3 August 2009. © 2009 ACL and AFNLP

Abstract

We demonstrate an information credibility analysis system called WISDOM. The purpose of WISDOM is to evaluate the credibility of information available on the Web from multiple viewpoints. WISDOM considers the following to be the source of information credibility: information contents, information senders, and information appearances. We aim at analyzing and organizing these measures on the basis of semantics-oriented natural language processing (NLP) techniques.

1. Introduction

As computers and computer networks become increasingly sophisticated, a vast amount of information and knowledge has been accumulated and circulated on the Web. They provide people with options regarding their daily lives and are starting to have a strong influence on governmental policies and business management. However, a crucial problem is that the information available on the Web is not necessarily credible. It is actually very difficult for human beings to judge the credibility of the information and even more difficult for computers. However, computers can be used to develop a system that collects, organizes, and relativises information and helps human beings view information from several viewpoints and judge the credibility of the information.

Information organization is a promising endeavor in the area of next-generation Web search. The search engine Clusty (http://clusty.com/, http://clusty.jp/) provides search result clustering, and Cuil classifies a search result on the basis of query-related terms. The persuasive technology research project at Stanford University discussed how websites can be designed to influence people's perceptions (B. J. Fogg, 2003). However, to our knowledge, no research has been carried out on supporting human judgment of information credibility or on information organization systems for this purpose.

In order to support the judgment of information credibility, it is necessary to extract the background, facts, and various opinions and their distribution for a given topic. For this purpose, syntactic and discourse structures must be analyzed, their types and relations must be extracted, and synonymous and ambiguous expressions should be handled properly.

Furthermore, it is important to determine the identity of the information sender and his/her specialty as criteria for credibility, which require named entity recognition and total analysis of documents.

In this paper, we describe an information credibility analysis system called WISDOM, which automatically analyzes and organizes the above aspects on the basis of semantically oriented NLP techniques. WISDOM currently operates over 100 million Japanese Web pages.

2. Overview of WISDOM

We consider the following three criteria for the judgment of information credibility.

(1) Credibility of information contents,
(2) Credibility of the information sender, and
(3) Credibility estimated from the document style and superficial characteristics.

In order to help people judge the credibility of information from these viewpoints, we have been developing an information analysis system called WISDOM. Figure 1 shows the analysis result of WISDOM on the analysis topic “Is bio-ethanol good for the environment?” Figure 2 shows the system architecture of WISDOM.

Figure 1. An analysis example of the information credibility analysis system WISDOM (query: “Is bio-ethanol good for the environment?”; tabs: Summary, Search Result, Major/Contradictory Expressions, Opinion, Sender).

Figure 2. System architecture of WISDOM.

Given an analysis topic (query), WISDOM sends the query to the search engine TSUBAKI (Shinzato et al., 2008), and TSUBAKI returns a list of the top N relevant Web pages (N is usually set to 1000).

Then, those pages are automatically analyzed, and major and contradictory expressions and evaluative expressions are extracted. Furthermore, the information senders of the Web pages, which were analyzed beforehand, are collected and the distribution is calculated.

The WISDOM analysis results can be viewed from several viewpoints by changing the tabs using a Web browser. The leftmost tab, “Summary,” shows the summary of the analysis, with major phrases and major/contradictory statements first. By referring to these phrases and statements, a user can grasp the important issues related to the topic at a glance. The pie diagram indicates the distribution of the information sender class spread over 1000 pages, such as company, industry group, and government. The names of the information senders of the class can be viewed by placing the cursor over a class region. The last bar chart shows the distribution of positive and negative opinions related to the topic spread over 1000 pages, for all and for each sender class. For example, with regard to “Bio-ethanol,” we can see that the number of positive opinions is greater than that of negative opinions, but it is the opposite in the case of some sender classes. Several display units in the Summary tab are cursor sensitive, providing links to more detailed information (e.g., the page list including a major statement, the page list of a sender class, and the page list containing negative opinions).

The “Search Result” tab shows the search result by TSUBAKI, i.e., ranking the relevant pages according to the TSUBAKI criteria. The “Major/Contradictory Expressions” tab shows the list of major phrases and major/contradictory statements about the given topic and the list of pages containing the specified phrase or statement. The “Opinion” tab shows the analysis result of the evaluative expressions, classified according to for/against, like/dislike, merit/demerit, and others, and it also shows the list of pages containing the specified type of evaluative expressions. The “Sender” tab classifies the pages according to the class of the information sender; for example, a user can view the pages created only by the government.

Furthermore, the superficial characteristics of pages, called information appearance, are analyzed beforehand and can be viewed in WISDOM, such as whether or not the contact address and the privacy policy are shown on the page, the volume of advertisements on the page, the number of images, and the number of in/out links.

As shown thus far, given an analysis topic, WISDOM collects and organizes the relevant information available on the Web and provides users with multi-faceted views. We believe that such a system can considerably support the human judgment of information credibility.

3. Data Infrastructure

We usually utilize 100 million Japanese Web pages as the analysis target. The Web pages have been converted into the standard formatted Web data, an XML format. The format includes metadata such as URLs, crawl dates, titles, and in/out links. A text in a page is automatically segmented into sentences (note that the sentence boundary is not clear in the original HTML file), and the analysis results obtained by a morphological analyzer, parser, and synonym analyzer are also stored in the standard format. Furthermore, the site operator, the page author, and information appearance (e.g., contact address, privacy policy, volume of advertisements, and images) are automatically analyzed and stored in the standard format.

4. Extraction of Major Expressions and Their Contradictions

For the organization of information contents, WISDOM extracts and presents the major expressions and their contradictions on a given analysis topic (Kawahara et al., 2008). Major expressions are defined as expressions occurring at a high frequency in the set of Web pages on the analysis topic. They are classified into two types: noun phrases and predicate-argument structures (statements). Contradictions are the predicate-argument structures that contradict the major expressions. For the Japanese phrase yutori kyouiku (cram-free education), for example, tsumekomi kyouiku (cramming education) and ikiru chikara (life skills) are extracted as the major noun phrases; yutori kyouiku-wo minaosu (reexamine cram-free education) and gakuryoku-ga teika-suru (scholastic ability deteriorates), as the major predicate-argument structures; and gakuryoku-ga koujou-suru (scholastic ability ameliorates), as its contradiction. This kind of summarized information enables a user to grasp the facts and arguments on the analysis topic available on the Web.

We use 1000 Web pages for a topic retrieved from the search engine TSUBAKI. Our method of extracting major expressions and their contradictions consists of the following steps:

  1. Extracting candidates of major expressions:
  The candidates of major expressions are extracted from each Web page in the search result. Approximately 15 sentences relevant to the analysis topic are selected from each Web page, and compound nouns, parenthetical expressions, and predicate-argument structures in these sentences are extracted as the candidates of the major expressions.

  2. Distilling major expressions:
  Simply presenting high-frequency expressions does not always yield information of high quality. This is because scattered synonymous expressions such as karikyuramu (curriculum) and kyouiku katei (course of study) and entailing expressions such as IWC and IWC soukai (IWC plenary session), all of which occur frequently, hamper the understanding process of users. Further, synonymous predicate-argument structures such as gakuryoku-ga teika-suru (scholastic ability deteriorates) and gakuryoku-ga sagaru (scholastic ability lowers) have the same problem. To overcome this problem, we distill major expressions by merging spelling variations with morphological analysis, merging synonymous expressions automatically acquired from an ordinary dictionary and the Web, and merging expressions that can be entailed by another expression.

  3. Extracting contradictory expressions:
  Predicate-argument structures that negate the predicate of a major one, or that replace the predicate of a major one with its antonym, are extracted as contradictions. For example, gakuryoku-ga teika-shi-nai (scholastic ability does not deteriorate) and gakuryoku-ga koujou-suru (scholastic ability ameliorates) are extracted as the contradictions to gakuryoku-ga teika-suru (scholastic ability deteriorates). This process is performed using an antonym lexicon, which consists of approximately 2000 pairs; these pairs are extracted from an ordinary dictionary.

5. Extraction of Evaluative Information

The extraction and classification of evaluative information from texts are important tasks with many applications and they have been actively studied recently (Pang and Lee, 2008). Most previous studies on opinion extraction or sentiment analysis deal with only subjective and explicit expressions. For example, Japanese sentences such as watashi-wa apple-ga sukida (I like apples) and kono seido-ni hantaida (I oppose the system) contain evaluative expressions that are directly expressed with subjective expressions. However, sentences such as kono shokuhin-wa kou-gan-kouka-ga aru (this food has an anti-cancer effect) and kono camera-wa katte 3-ka-de kowareta (this camera was broken 3 days after I bought it) do not contain subjective expressions but contain positive or negative evaluative expressions. From the viewpoint of information credibility, it appears important to deal with a wide variety of evaluative information including such implicit evaluative expressions (Nakagawa et al., 2008).

A corpus annotated with evaluative information was developed for evaluative information analysis studies. Fifty topics such as “Bio-ethanol” and “Pension plan” were chosen. For each topic, 200 sentences containing the topic word were collected from the Web to construct the corpus totaling 10,000 sentences. For each sentence, annotators judged whether or not the sentence contained evaluative expressions. When evaluative expressions were identified, the evaluative expressions, their holders, their sentiment polarities (positive or negative), and their relevance to the topic were annotated.

We developed an automatic analyzer of evaluative information using the corpus. We performed experiments of sentiment polarity classification using Support Vector Machines. Word forms, POS tags, and sentiment polarities from an evaluative word dictionary of all the words in evaluative expressions were used as features, and an accuracy of 83% was obtained. From the error analysis, we found that it was difficult to classify domain-specific evaluative expressions; we are now planning the automatic acquisition of evaluative word dictionaries.

6. Information Sender Analysis

The source of information (or information sender) is one of the important elements when judging the credibility of information. It is rather easy for human beings to identify the information sender of a Web page. When reading a Web page, whether it is deliberate or not, we attribute some characteristics to the information sender and accordingly form our attitudes toward the information. However, the state-of-the-art search engines do not provide facilities to organize a vast amount of information on the basis of the information sender. If we can organize the information on a topic on the basis of who or what type the information sender is, it would enable the user to grasp an overview of the topic or to judge the credibility of relevant information.

WISDOM automatically identifies the site operators of Web pages and classifies them into predefined categories of information sender called information sender class. A site operator of a Web page is the governing body of a website on which the page is published. The information sender class categorizes the information sender on the basis of axes such as individuals vs. organizations and profit vs. nonprofit organizations. The list below shows the categories of information sender class.

  1. Organization
     (a) Profit Organization
         i. Company
         ii. Industry Group
     (b) Nonprofit Organization
         i. Academic Society
         ii. Government
         iii. Political Organization
         iv. Public Service Corp., Nonprofit Organization
         v. University
         vi. Voluntary Association
         vii. Education Institution
     (c) Press
         i. Broadcasting Station
         ii. Newspaper
         iii. Publisher
  2. Individual
     (a) Real Name
     (b) Anonymous, Screen Name

WISDOM allows the user to organize the information on the basis of the information sender class assigned to each Web page. Technical details of the information sender analysis employed in WISDOM can be found in (Kato et al., 2008).

7. Conclusions

This paper has described an information analysis system called WISDOM. As shown in this paper, WISDOM already provides a reasonably well-organized view for a given topic and can serve as a useful tool for handling informational queries and for supporting human judgment of information credibility. WISDOM is freely available at http://wisdom-nict.jp/.

References

B. J. Fogg. 2003. Persuasive Technology: Using Computers to Change What We Think and Do (The Morgan Kaufmann Series in Interactive Technologies). Morgan Kaufmann.

K. Shinzato, T. Shibata, D. Kawahara, C. Hashimoto, and S. Kurohashi. 2008. TSUBAKI: An open search engine infrastructure for developing new information access methodology. In Proceedings of IJCNLP2008.

D. Kawahara, S. Kurohashi, and K. Inui. 2008. Grasping major statements and their contradictions toward information credibility analysis of web contents. In Proceedings of WI'08.

B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, Volume 2, Issue 1-2.

T. Nakagawa, T. Kawada, K. Inui, and S. Kurohashi. 2008. Extracting subjective and objective evaluative expressions from the web. In Proceedings of ISUC2008.

Y. Kato, D. Kawahara, K. Inui, S. Kurohashi, and T. Shibata. 2008. Extracting the author of web pages. In Proceedings of WICOW2008.
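
Illustrative sketch. The distilling and contradiction-extraction steps of Section 4 can be pictured with a toy example. The following is a minimal sketch under stated assumptions, not the WISDOM implementation: the `PAS` type, the function names, and the two tiny lexicons are hypothetical, and the real system relies on morphological analysis, a synonym resource acquired from a dictionary and the Web, and an antonym lexicon of approximately 2000 pairs. Entailment-based merging is omitted here.

```python
from collections import Counter
from typing import NamedTuple


class PAS(NamedTuple):
    """A predicate-argument structure, e.g. gakuryoku-ga teika-suru."""
    argument: str
    predicate: str
    negated: bool = False


# Hypothetical toy lexicons (assumptions for illustration only).
SYNONYMS = {"sagaru": "teika-suru"}         # variant -> canonical predicate
ANTONYMS = {("teika-suru", "koujou-suru")}  # unordered antonym pairs


def canonical(pas: PAS) -> PAS:
    """Distilling: map a synonymous predicate to its canonical form."""
    return pas._replace(predicate=SYNONYMS.get(pas.predicate, pas.predicate))


def are_antonyms(a: str, b: str) -> bool:
    """Look the pair up in either order."""
    return (a, b) in ANTONYMS or (b, a) in ANTONYMS


def contradicts(major: PAS, other: PAS) -> bool:
    """True if `other` negates `major` or uses an antonym predicate."""
    if major.argument != other.argument:
        return False
    if major.predicate == other.predicate:
        return major.negated != other.negated
    return (major.negated == other.negated
            and are_antonyms(major.predicate, other.predicate))


def major_and_contradictions(candidates, top_k=1):
    """Count merged candidates, keep the top_k most frequent as major
    expressions, and pair each with the candidates that contradict it."""
    counts = Counter(canonical(p) for p in candidates)
    majors = [p for p, _ in counts.most_common(top_k)]
    return {m: [p for p in counts if contradicts(m, p)] for m in majors}


# Candidates as they might be extracted from several pages.
candidates = [
    PAS("gakuryoku", "teika-suru"),
    PAS("gakuryoku", "sagaru"),  # synonym, merged into teika-suru
    PAS("gakuryoku", "teika-suru"),
    PAS("gakuryoku", "teika-suru", negated=True),
    PAS("gakuryoku", "koujou-suru"),
]

for major, found in major_and_contradictions(candidates).items():
    print(major, "->", found)
```

With these inputs, gakuryoku-ga teika-suru becomes the single major statement (three occurrences after merging sagaru), and both the negated form and the antonym form koujou-suru are reported as its contradictions, mirroring the example given in Section 4.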