If you own an iPod and, of course, still have free space on it, try installing Wikipedia, the world's encyclopedia, on your device. Having Wikipedia on your iPod lets you look up information quickly and conveniently, and it takes up only 1.7 GB.
Want to be part of the largest group-writing project in human history? Learn how to contribute to Wikipedia, the user-generated online reference for the 21st century. Considered more popular than eBay, Microsoft.com, and Amazon.com, Wikipedia generates approximately 30,000 requests per second, or about 2.5 billion per day. It's become the first point of reference for people the world over who need a fact fast. If you want to jump on board and add to the content, Wikipedia: The Missing Manual is your first-class ticket.
Wikipedia provides a wealth of knowledge, where the first sentence, infobox (and relevant sentences), and even the entire document of a wiki article could be considered as diverse versions of summaries (definitions) of the target topic. We explore how to generate a series of summaries with various lengths based on them. To obtain more reliable associations between sentences, we introduce wiki concepts according to the internal links in Wikipedia.
Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible.
Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection.
In this paper we examine the task of sentence simplification, which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification.
This paper describes the extraction from Wikipedia of lexical reference rules, identifying references to term meanings triggered by other terms. We present extraction methods geared to cover the broad range of the lexical reference relation and analyze them extensively. Most extraction methods yield high precision levels, and our rule-base is shown to perform better than other automatically constructed baselines in a couple of lexical expansion and matching tasks. Our rule-base yields comparable performance to WordNet while providing largely complementary information. ...
This paper presents an unsupervised relation extraction method for discovering and enhancing relations in which a specified concept in Wikipedia participates. Using respective characteristics of Wikipedia articles and Web corpus, we develop a clustering approach based on combinations of patterns: dependency patterns from dependency analysis of texts in Wikipedia, and surface patterns generated from highly redundant information related to the Web. Evaluations of the proposed approach on two different domains demonstrate the superiority of the pattern combination over existing approaches. ...
We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges.
The API computes semantic relatedness by: 1. taking a pair of words as input; 2. retrieving the Wikipedia articles they refer to (via a disambiguation strategy based on the link structure of the articles); 3. computing paths in the Wikipedia categorization graph between the categories the articles are assigned to; 4. returning as output the set of paths found, scored according to some measure definition. The implementation includes path-length (Rada et al., 1989; Wu & Palmer, 1994; Leacock & Chodorow, 1998), information-content (Resnik, 1995; Seco et al.
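The four steps above can be illustrated with a minimal sketch. It is not the API described in the abstract: the category graph, the word-to-category mapping (standing in for the disambiguation of step 2), the function names, and the scoring formula 1 / (1 + number of edges) are all assumptions chosen for illustration. The score is a simple path-length style measure, not any specific one of the cited measures.

```python
from collections import deque

# Toy, hypothetical category graph (undirected adjacency lists); not real Wikipedia data.
CATEGORY_GRAPH = {
    "Felines": ["Mammals"],
    "Canines": ["Mammals"],
    "Mammals": ["Felines", "Canines", "Animals"],
    "Birds": ["Animals"],
    "Animals": ["Mammals", "Birds"],
}

# Hypothetical result of step 2: each word already resolved to its article's categories.
ARTICLE_CATEGORIES = {
    "cat": ["Felines"],
    "dog": ["Canines"],
    "sparrow": ["Birds"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], []):
            if neighbor in seen:
                continue
            if neighbor == goal:
                return path + [neighbor]
            seen.add(neighbor)
            queue.append(path + [neighbor])
    return None


def relatedness(word_a, word_b):
    """Return (score, path) for the best-scoring category path between two words.

    The score is 1 / (1 + edges), an illustrative assumption: shorter paths score higher.
    Returns (0.0, None) when no path connects any pair of categories.
    """
    best = (0.0, None)
    for cat_a in ARTICLE_CATEGORIES[word_a]:
        for cat_b in ARTICLE_CATEGORIES[word_b]:
            path = shortest_path(CATEGORY_GRAPH, cat_a, cat_b)
            if path is None:
                continue
            score = 1.0 / len(path)  # len(path) equals edges + 1
            if score > best[0]:
                best = (score, path)
    return best


if __name__ == "__main__":
    print(relatedness("cat", "dog"))
    print(relatedness("cat", "sparrow"))
```

In this toy graph, "cat" and "dog" are joined through a shared parent category, so they receive a higher score than "cat" and "sparrow", whose categories only meet one level further up.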
We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall.
We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely over-estimated, while the precision of knowledge-based approaches has been under-estimated.
In this paper, we propose an annotation schema for the discourse analysis of Wikipedia Talk pages aimed at the coordination efforts for article improvement. We apply the annotation schema to a corpus of 100 Talk pages from the Simple English Wikipedia and make the resulting dataset freely available for download. Furthermore, we perform automatic dialog act classification on Wikipedia discussions and achieve an average F1-score of 0.82 with our classification pipeline.
In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences.
Wikipedia articles in different languages are connected by interwiki links that are increasingly being recognized as a valuable source of cross-lingual information. Unfortunately, large numbers of links are imprecise or simply wrong. In this paper, techniques to detect such problems are identified. We formalize their removal as an optimization task based on graph repair operations.
Is it possible to use sense inventories to improve the diversity of Web search results for one-word queries? To answer this question, we focus on two broad-coverage lexical resources of a different nature: WordNet, as a de facto standard used in Word Sense Disambiguation experiments; and Wikipedia, as a large-coverage, updated encyclopaedic resource which may have a better coverage of relevant senses in Web pages.
We present an open-source toolkit that makes it possible (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. ...
A well-recognized limitation of research on supervised sentence compression is the dearth of available training data. We propose a new and bountiful resource for such training data, which we obtain by mining the revision history of Wikipedia for sentence compressions and expansions. Using only a fraction of the available Wikipedia data, we have collected a training corpus of over 380,000 sentence pairs, two orders of magnitude larger than the standardly used Ziff-Davis corpus.
This paper describes the online demo of the QuALiM Question Answering system. While the system actually gets answers from the web by querying major search engines, during presentation answers are supplemented with relevant passages from Wikipedia. We believe that this additional information improves a user’s search experience.
We investigate the automatic detection of sentences containing linguistic hedges using corpus statistics and syntactic patterns. We take Wikipedia as an already annotated corpus using its tagged weasel words which mark sentences and phrases as non-factual. We evaluate the quality of Wikipedia as training data for hedge detection, as well as shallow linguistic features.