
Architectural Issues of Web-Enabled Electronic Business, Part 5




Chapter 9: Intelligent Web Search Through Adaptive Learning From Relevance Feedback

Zhixiang Chen, University of Texas-Pan American
Binhai Zhu, Montana State University
Xiannong Meng, Bucknell University

Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Abstract

In this chapter, machine-learning approaches to real-time intelligent Web search are discussed. The goal is to build an intelligent Web search system that can find the user's desired information with as little relevance feedback from the user as possible. The system can achieve a significant search precision increase with a small number of iterations of user relevance feedback. A new machine-learning algorithm is designed as the core of the intelligent search component. This algorithm is applied to three different search engines with different emphases. This chapter presents the algorithm, the architectures, and the performances of these search engines. Future research issues regarding real-time intelligent Web search are also discussed.

Introduction

This chapter presents the authors' approaches to intelligent Web search systems that are built on top of existing search engine design and implementation techniques. An intelligent search engine would use the search results of the general-purpose search engines as its starting search space, from which it would adaptively learn from the user's feedback to boost and enhance the search performance and accuracy. It may use feature extraction, document clustering and filtering, and other methods to help the adaptive learning process. The goal is to design practical and efficient algorithms by exploring the nature of Web search. With these new algorithms, three intelligent Web search engines, WEBSAIL, YARROW, and FEATURES, are built that are able to achieve a significant search precision increase with just four to five iterations of real-time learning from a user's relevance feedback. The characteristics of these three intelligent search engines are reported in this chapter.

Background

Recently, three general approaches have been taken to increase Web search accuracy and performance. One is the development of meta-search engines that forward user queries to multiple search engines at the same time in order to increase the coverage, in the hope of including what the user wants in a short list of top-ranked results.
Examples of such meta-search engines include MetaCrawler (MC), Inference Find (IF), and Dogpile (DP). Another approach is the development of topic-specific search engines that are specialized in particular topics. These topics range from vacation guides (VG) to kids' health (KH). The third approach is to use some group or personal profiles to personalize the Web search. Examples of such efforts include GroupLens (Konstan et al., 1997) and PHOAKS (Terveen, Hill, Amento, McDonald, & Creter, 1997), among others.

The first-generation meta-search engines address the problem of decreasing coverage by simultaneously querying multiple general-purpose engines. These meta-search engines suffer to a certain extent from the inherited problem of information overflow: it is difficult for users to pin down the specific information for which they are searching. Specialized search engines typically contain much more accurate and narrowly focused information. However, it is not easy for a novice user to know where and which specialized engine to use. Most personalized Web search projects reported so far involve collecting users' behavior at a centralized server or a proxy server. While this is effective for the purpose of e-commerce, where vendors can collectively learn consumer behaviors, this approach does present a privacy problem: users of the search engines would have to submit their search habits to some type of server, though most likely the information collected is anonymous. The clustering, user profiling, and other advanced techniques used by these search engines and other projects (Bollacker, Lawrence, & Giles, 1998, 1999) are static in the sense that they are built before the search begins. They cannot be changed dynamically during the real-time search process. Thus, they do not reflect the changing interests of the user at different times, in different locations, or on different subjects. The static nature of the existing search engines makes it very difficult, if not impossible, to support the dynamic changes of the user's search interests. The augmented features of personalization (or customization) certainly help a search engine to increase its search performance; however, their ability is very limited. An intelligent search engine should be built on top of existing search engine design and implementation techniques. It should use the search results of the general-purpose search engines as its starting search space, from which it would adaptively learn in real-time from the user's relevance feedback to boost and enhance the search performance and the relevance accuracy. With the ability to perform real-time adaptive learning from relevance feedback, the search engine is able to learn the user's search interest changes or shifts, and thus provide the user with improved search results.

Relevance feedback is the most popular query reformulation method in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Salton et al., 1975). It is essentially an adaptive learning process from the document examples judged by the user as relevant or irrelevant. It requires a sequence of iterations of relevance feedback to search for the desired documents. As shown in Salton et al. (1975), a single iteration of similarity-based relevance feedback usually produces improvements of 40 to 60 percent in the search precision, evaluated at certain fixed levels of recall and averaged over a number of user queries.
Some people might think that Web search users are not willing to try iterations of relevance feedback to search for their desired documents. However, the authors think otherwise. It is not a question of whether Web search users are willing to try iterations of relevance feedback to perform their search; rather, it is a question of whether an adaptive learning system can be built that supports a high search precision increase with just a few iterations of relevance feedback. Web search users may have no patience to try more than a dozen iterations of relevance feedback. But if a system offers a search precision increase of 20% or so with just about four to five iterations of relevance feedback, are users willing to use such a system? The authors believe that the answer is yes. Intelligent Web search systems that dynamically learn the user's information needs in real-time must be built to advance the state of the art in Web search. Machine-learning techniques can be used to improve Web search, because machine-learning algorithms are able to adjust the search process dynamically so as to satisfy the user's information needs. Unfortunately, the existing machine-learning algorithms (e.g., Angluin, 1987; Littlestone, 1988), including the most popular similarity-based relevance feedback algorithm (Rocchio, 1971), suffer from the large number of iterations required to achieve the search goal. Average users are not willing to go through too many iterations of learning to find what they want.
Web Search and Adaptive Learning

Overview

There have been great research efforts on applications of machine learning to automatic extraction, clustering, and classification of information from the Web. Some earlier research includes WebWatcher (Armstrong, Freitag, Joachims, & Mitchell, 1995), which interactively helps users locate desired information by employing learned knowledge about which hyperlinks are likely to lead to the target information; Syskill and Webert (Pazzani, Muramatsu, & Billus, 1996), a system that uses a Bayesian classifier to learn about interesting Web pages for the user; and NewsWeeder (Lang, 1995), a news-filtering system that allows the users to rate each news article being read and learns a user profile based on those ratings. Some research is aimed at providing adaptive Web service through learning. For example, Ahoy! The Homepage Finder (Shakes, Langheinrich, & Etzioni, 1997) performs dynamic reference sifting; Adaptive Web Sites (Etzioni & Weld, 1995; Perkowitz & Etzioni, 2000) automatically improve their organization and presentation based on user access data; and Adaptive Web Page Recommendation Services (Balabanovi, 1997) recommend potentially interesting Web pages to the users. Since so much work has been done on intelligent Web search and on learning from the Web by many researchers, a comprehensive review is beyond the scope and the limited space of this chapter. Interested readers may find good surveys of the previous research on learning from the Web in Kobayashi and Takeda (2000).

Dynamic Features and Dynamic Vector Space

In spite of the World Wide Web's size and the high dimensionality of Web document index features, the traditional vector space model in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989; Salton et al., 1975) has been used for Web document representation and search. However, to implement real-time adaptive learning with limited computing resources, the traditional vector space model cannot be applied directly. Recall that back in 1998, the AltaVista (AV) system was running on 20 multi-processor machines, all of them having more than 130 gigabytes of RAM and over 500 gigabytes of disk space (Baeza-Yates & Ribeiro-Neto, 1999). A new model is needed that is efficient enough, both in time and space, for Web search implementations with limited computing resources. The new model may also be used to enhance the computing performance of a Web search system even if enough computing resources are available.

Let us now examine indexing in Web search. In the discussion, keywords are used as document index features. Let X denote the set of all index keywords for the whole Web (or, practically, a portion of the whole Web). Given any Web document d, let I(d) denote the set of all index keywords in X that are used to index d with non-zero values. Then, the following two properties hold:

• The size of I(d) is substantially smaller than the size of X. Practically, the size of I(d) can be bounded by a constant. The rationale behind this is that, in the simplest case, only a few of the keywords in d are needed to index it.
• For any search process related to the search query q, let D(q) denote the collection of all the documents that match q; then the set of index keywords relevant to q, denoted by F(q), is F(q) = ∪_{d ∈ D(q)} I(d). Although the size of F(q) varies across queries, it is still substantially smaller than the size of X, and might be bounded by a few hundred or a few thousand in practice.

Definition 1. Given any search query q, F(q), as given above, is defined as the set of dynamic features relevant to the search query q.

Definition 2. Given any search query q, the dynamic vector space V(q) relevant to q is defined as the vector space constructed with all the documents in D(q), such that each of those documents is indexed by the dynamic features in F(q).
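To make the construction concrete, the following Python sketch builds D(q), F(q), and the dynamic vector space V(q) from a toy corpus. The corpus, the query, and all names here are illustrative assumptions, not code from the chapter.

```python
# Sketch: building D(q), F(q), and the dynamic vector space V(q).
def build_dynamic_space(corpus, query_terms):
    """corpus: {doc_id: {keyword: weight}} -- I(d) with non-zero weights."""
    # D(q): documents matching at least one query term
    D_q = {d: kw for d, kw in corpus.items()
           if any(t in kw for t in query_terms)}
    # F(q): union of I(d) over all matching documents
    F_q = sorted(set().union(*(kw.keys() for kw in D_q.values())))
    # V(q): each matching document indexed only by the dynamic features
    V_q = {d: [kw.get(f, 0.0) for f in F_q] for d, kw in D_q.items()}
    return D_q, F_q, V_q

corpus = {
    "d1": {"web": 0.9, "search": 0.7},
    "d2": {"search": 0.5, "learning": 0.8},
    "d3": {"databases": 0.6},
}
D_q, F_q, V_q = build_dynamic_space(corpus, {"search"})
print(F_q)          # ['learning', 'search', 'web']
print(V_q["d1"])    # [0.0, 0.7, 0.9]
```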
The General Setting of Learning

Let S be a Web search system. For any query q, S first finds the set of documents D(q) that match the query q. It finds D(q) with the help of a general-purpose search strategy, through searching its internal database or through an external search engine such as AltaVista (AV) when no matches are found within its internal database. It then finds the set of dynamic features F(q), and later constructs the dynamic vector space V(q). Once D(q), F(q) and V(q) have been found, S starts its adaptive learning process with the help of the learning algorithm presented in the following subsections. More precisely, let K_1, ..., K_n denote the dynamic features (i.e., the index keywords) in F(q). S maintains a common weight vector w = (w_1, ..., w_n) for the dynamic features in F(q). The components of w have non-negative real values. The learning algorithm uses w to extract and learn the most relevant features and to classify documents in D(q) as relevant or irrelevant.

Algorithm TW2

As the authors have investigated (Chen, Meng, & Fowler, 1999; Chen & Meng, 2000; Chen, Meng, Zhu, & Fowler, 2000), intelligent Web search can be modeled as an adaptive learning process, in which the search engine acts as a learner and the user as a teacher. The user sends a query to the engine, and the engine uses the query to search the index database and returns a list of URLs that are ranked according to a ranking function. Then the user provides the engine with relevance feedback, and the engine uses the feedback to improve its next search and returns a refined list of URLs. The learning (or search) process ends when the engine finds the desired documents for the user. Conceptually, a query entered by the user can be understood as the logical expression of the collection of the documents wanted by the user. A list of URLs returned by the engine can be interpreted as an approximation of the collection of the desired documents.

Let us now consider how to use adaptive learning from equivalence queries to approach the problem of Web search. The vector space model (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989; Salton et al., 1975) is used to represent documents. The vector space may consist of Boolean vectors. It may also consist of discretized vectors, for example, the frequency vectors of the index keywords. A target concept is a collection of documents, which is equivalent to the set of vectors of the documents in the collection. The learner is the search engine and the teacher is the user. The goal of the search engine is to find the target concept in real-time with a minimal number of mistakes (or equivalence queries). The authors designed the algorithm TW2, a tailored version of Winnow2 (Littlestone, 1988), which is described in the following. As described in the general setting of learning, for each query q entered by the user, algorithm TW2 uses a common weight vector w and a real-valued threshold θ to classify documents in D(q). Initially, all weights in w have a value of 0. Let α > 1 be the promotion and demotion factor.
Algorithm TW2 classifies documents whose vectors x = (x_1, ..., x_n) satisfy w · x > θ as relevant, and all others as irrelevant. If the user provides a document that contradicts the classification of TW2, then TW2 is said to have made a mistake. When the user responds with a document that may or may not contradict the current classification, TW2 updates the weights through promotion or demotion. It should be noticed that, in contrast to algorithm Winnow2, which sets all initial weights in w to 1, algorithm TW2 sets all initial weights in w to 0 and has a different promotion strategy accordingly. Another substantial difference between TW2 and Winnow2 is that TW2 accepts document examples that may not contradict its current classification to promote or demote its weight vector, while Winnow2 only accepts examples that contradict its current classification to perform promotion or demotion.
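A minimal sketch of a TW2-style update loop is given below. The classification rule w · x > θ and the multiplicative demotion follow Winnow2; since the excerpted text does not spell out how TW2 promotes weights that start at 0, the rule of raising a zero weight to 1 before multiplying by α is an assumption, as are the parameter values.

```python
ALPHA = 2.0   # promotion/demotion factor alpha > 1 (assumed value)
THETA = 1.0   # real-valued classification threshold (assumed value)

def classify(w, x):
    # Relevant iff the weighted sum w . x exceeds the threshold.
    return sum(wi * xi for wi, xi in zip(w, x)) > THETA

def tw2_update(w, x, relevant):
    """One TW2-style feedback step on Boolean document vector x.
    Promotion raises zero weights to 1 and multiplies positive weights
    by ALPHA -- an assumption consistent with starting all weights at 0."""
    for i, xi in enumerate(x):
        if xi == 0:
            continue
        if relevant:                     # promotion
            w[i] = 1.0 if w[i] == 0 else w[i] * ALPHA
        else:                            # demotion
            w[i] /= ALPHA
    return w

w = [0.0, 0.0, 0.0]            # all initial weights are 0, unlike Winnow2
w = tw2_update(w, [1, 1, 0], relevant=True)    # user marks doc relevant
w = tw2_update(w, [0, 1, 1], relevant=False)   # user marks doc irrelevant
print(w, classify(w, [1, 0, 0]))
```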
The rationale behind setting all the initial weights to 0 in algorithm TW2 is to focus attention on the propagation of the influence of the relevant documents, and to use irrelevant documents to adjust the focused search space. Moreover, this approach is computationally feasible because existing effective document-ranking mechanisms can be coupled with the learning process. In contrast to the linear lower bounds proved for Rocchio's similarity-based relevance feedback algorithm (Chen & Zhu, 2002), algorithm TW2 has surprisingly small mistake bounds for learning any collection of documents represented by a disjunction of a small number of relevant features. The mistake bounds are independent of the dimensionality of the index features. For example, one can show that to learn a collection of documents represented by a disjunction of at most k relevant features (or index keywords) over the n-dimensional Boolean vector space, TW2 makes a number of mistakes bounded in terms of k and A alone, where A is the number of dynamic features that occurred in the learning process. The actual implementation of algorithm TW2 requires the help of document ranking and equivalence query simulation, which are addressed later.

Feature Learning Algorithm FEX (Feature EXtraction)

Given any user query q, for any dynamic feature K_i ∈ F(q) with 1 ≤ i ≤ n, define the rank of K_i as h(K_i) = h_0(K_i) + w_i. Here, h_0(K_i) is the initial rank for K_i. Recall that K_i is some index keyword. With the feature ranking function h and the common weight vector w, FEX extracts and learns the most relevant features.
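The feature ranking h(K_i) = h_0(K_i) + w_i translates directly into code; the container shapes and names in the sketch below are illustrative assumptions.

```python
def rank_features(features, h0, w):
    """FEX-style feature ranking: h(K_i) = h0(K_i) + w_i.
    'features' pairs each keyword with its index into w; h0 holds the
    initial ranks for keywords that have one."""
    scored = [(h0.get(k, 0.0) + w[i], k) for k, i in features.items()]
    return [k for _, k in sorted(scored, reverse=True)]

features = {"web": 0, "search": 1, "learning": 2}
print(rank_features(features, {"search": 0.2}, [1.0, 0.5, 0.0]))
# ['web', 'search', 'learning']
```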
Document Ranking

Let g be a ranking function independent of TW2 and FEX. Define the ranking function f for documents in D(q), for any user query q, as follows. For any Web document d ∈ D(q) with vector x = (x_1, ..., x_n) ∈ V(q), define the rank f(d) by combining g(d), the common weight vector w, and two per-document tuning parameters β(d) and γ(d). Here, g remains constant for each document d during the learning process of the learning algorithm. Various strategies can be used to define g, for example, PageRank (Brin & Page, 1998), the classical tf-idf scheme, vector spread, or citation-based rankings (Yuwono & Lee, 1996). The two additional tuning parameters are used to perform individual promotions or demotions of documents that have been judged by the user. Initially, let β(d) = 0 and γ(d) = 1. β(d) and γ(d) can be updated in a similar fashion as the weight value w_i is updated by algorithm TW2.

Equivalence Query Simulation

The system uses the ranking function f defined above to rank the documents in D(q) for each user query q, and for each iteration of learning, it returns the top 10 ranked documents to the user. These top 10 ranked documents represent an approximation to the classification made by the learning algorithm used by the system. The quantity 10 can be replaced by, say, 25 or 50, but it should not be too large, for two reasons: (1) the user may only be interested in a very small number of top-ranked documents, and (2) the display space for visualization is limited. The user can examine the short list of documents and end the search process or, if some documents are judged as misclassified, provide document relevance feedback. Sometimes, in addition to the top 10 ranked documents, the system may also provide the user with a short list of other documents below the top 10. Documents in the second short list may be selected randomly, or the bottom 10 ranked documents can be included. The motivation for the second list is to give the user a better view of the classification made by the learning algorithm.
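The sketch below ties these pieces together. The chapter specifies only that f combines g(d), the learned weights, and the per-document parameters β(d) and γ(d), so the additive/multiplicative form used here is an assumption, as are all names and the toy data.

```python
def f_rank(d, x, w, g, beta, gamma):
    """Assumed form of the document ranking f: the static rank g(d) plus
    the learned score, scaled by gamma(d) and shifted by beta(d)."""
    learned = sum(wi * xi for wi, xi in zip(w, x))
    return gamma.get(d, 1.0) * (g[d] + learned) + beta.get(d, 0.0)

def simulate_equivalence_query(V_q, w, g, beta, gamma, k=10):
    # Return the top-k and bottom-k ranked documents as the engine's
    # approximate "hypothesis" for the user to accept or correct.
    ranked = sorted(V_q, key=lambda d: f_rank(d, V_q[d], w, g, beta, gamma),
                    reverse=True)
    return ranked[:k], ranked[-k:]

V_q = {"d1": [0.0, 0.7, 0.9], "d2": [0.8, 0.5, 0.0]}
g = {"d1": 0.3, "d2": 0.2}
top, bottom = simulate_equivalence_query(V_q, [1.0, 0.5, 0.0], g, {}, {}, k=1)
print(top, bottom)   # ['d2'] ['d1']
```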
The Websail System and the Yarrow System

The WEBSAIL system is a real-time adaptive Web search learner designed and implemented to show that the learning algorithm TW2 not only works in theory but also works in practice. A detailed report of the system can be found in Chen et al. (2000). WEBSAIL employs TW2 as its learning component and is able to help the user search for the desired documents with as little relevance feedback as possible. WEBSAIL has a graphical user interface that allows the user to enter his/her query and to specify the number of top-matched document URLs to be returned. WEBSAIL maintains an internal index database of about 800,000 documents. Each of those documents is indexed with about 300 keywords. It also has a meta-search component to query AltaVista whenever needed. When the user enters a query and starts a search process, WEBSAIL first searches its internal index database. If no relevant documents can be found within its database, then it retrieves a list of top-matched documents externally with the help of its meta-search component. WEBSAIL displays the search result to the user in the format shown in Figure 1.

Figure 1: The display format of WEBSAIL

As shown in Figure 1, WEBSAIL provides at each iteration the top 10 and the bottom 10 ranked document URLs. Each document URL is preceded by two radio buttons for the user to judge whether the document is relevant to the search query or not. The document URLs are clickable for viewing the actual document contents, so that the user can judge more accurately whether a document is relevant or not. After the user clicks a few radio buttons, he/she can click the feedback button to submit the feedback to TW2. WEBSAIL has a function to parse out the feedback provided by the user when the feedback button is clicked. Having received the feedback from the user, TW2 updates its common weight vector w and also performs individual document promotions or demotions. At the end of the current iteration of learning, WEBSAIL re-ranks the documents and displays the top 10 and the bottom 10 document URLs to the user.

At each iteration, the dispatcher of WEBSAIL parses query or relevance feedback information from the interface and decides which of the following components should be invoked to continue the search process: TW2, the Index Database Searcher, or the Meta-Searcher. When meta-search is needed, the Meta-Searcher is called to query AltaVista to receive a list of the top-matched documents. The Meta-Searcher has a parser and an indexer that work in real-time to parse the received documents and to index each of them with at most 64 keywords. The received documents, once indexed, are also cached in the index database.

The following relative Recall and relative Precision are used to measure the performance of WEBSAIL. For any query q, the relative Recall and the relative Precision are

relative Recall = R_m / R,    relative Precision = R_m / m,

where R is the total number of relevant documents among the set of retrieved documents, and R_m is the number of relevant documents ranked among the top m positions in the final search result of the search engine. The authors selected 100 queries to calculate the average relative Recall of WEBSAIL. Each query is represented by a collection of at most five keywords. For each query, WEBSAIL was tested with the returned document number m set to 50, 100, 150, and 200, respectively. For each test, the number of iterations used and the number of documents judged by the user were recorded. The relative Recall and Precision were calculated based on manual examination of the relevance of the returned documents. The experiments reveal that WEBSAIL achieves an average relative Recall of 0.95 and an average relative Precision of 0.46, with an average of 3.72 iterations and an average of 13.46 documents judged as relevance feedback.
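These two measures are straightforward to compute from a ranked list of manual relevance judgments; the toy judgment list below is an illustrative assumption.

```python
def relative_recall_precision(judgments, m):
    """judgments: ranked list of booleans (True = relevant), as judged
    manually over the retrieved set. R counts all relevant retrieved
    documents; R_m counts those in the top m positions."""
    R = sum(judgments)
    R_m = sum(judgments[:m])
    return (R_m / R if R else 0.0), R_m / m

judged = [True, True, False, True, False, False]  # toy judgments
print(relative_recall_precision(judged, m=4))     # (1.0, 0.75)
```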
The Yarrow system (Chen & Meng, 2000) is a multi-threaded program. Its architecture differs from that of WEBSAIL in two aspects: (1) it replaces the meta-searcher of WEBSAIL with a generic Query Constructor and a group of meta-searchers, and (2) it does not maintain its own internal index database. For each search process, it creates a thread and destroys the thread when the search process ends. Because of its light weight, it can be easily converted or ported to run in different environments or platforms. The predominant feature of YARROW, compared with existing meta-search engines, is that it learns from the user's feedback in real-time on the client side. The learning algorithm TW2 used in YARROW has a surprisingly small mistake bound. YARROW may well be used as a plug-in component for Web browsers on the client side. A detailed report of the Yarrow system is given in Chen and Meng (2000).

The Features System

The FEATURES system (Chen, Meng, Fowler, & Zhu, 2001) is also a multi-threaded system, and its architecture is shown in Figure 2. The key difference between FEATURES and WEBSAIL is that FEATURES employs the two learning algorithms, FEX and TW2, to update the common weight vector w concurrently.

Figure 2: The architecture of FEATURES

For each query, FEATURES usually shows the top 10 ranked documents, plus the top 10 ranked features, to the user for him/her to judge document relevance and feature relevance. The format of presenting the top 10 ranked documents together with the top 10 ranked features is shown in Figure 3. In this format, document URLs and features are preceded by radio buttons for the user to indicate whether they are relevant or not.

Figure 3: The display format of FEATURES

If the current task is a learning process from the user's document and feature relevance feedback, the Dispatcher sends the feature relevance feedback information to the feature learner FEX and the document relevance feedback information to the document learner TW2. FEX uses the relevant and irrelevant features, as judged by the user, to promote and demote the related feature weights in the common weight vector w.
TW2 uses the relevant and irrelevant documents judged by the user as positive and negative examples to promote and demote the weight vector. Once FEX and TW2 have finished promotions and demotions, the updated weight vector w is sent to the Query Searcher and to the Feature Ranker. The Feature Ranker re-ranks all the dynamic features, which are then sent to the Html Constructor. The Query Searcher searches the Index Database to find the matched documents, which are then sent to the Document Ranker. The Document Ranker re-ranks the matched documents and then sends them to the Html Constructor, which selects the documents and features to be displayed. Empirical results (Chen et al., 2001) show that FEATURES has substantially better search performance than AltaVista.

Timing Statistics

On December 13 and 14, 2001, the authors conducted experiments to collect timing statistics for WEBSAIL, YARROW, and FEATURES. Thirty query words were used to test each of these meta-search engines. Every time a query was sent, the wall-clock time needed for the meta-search engine to list the sorted result was recorded in the program. Also recorded was the wall-clock time needed to refine the search results based on the user's feedback. Since YARROW supports multiple external search engines, ALTAVISTA and NORTHERN LIGHT were selected as the external search engines when YARROW was tested. The external search engine used by WEBSAIL and FEATURES is ALTAVISTA. The following tables show the statistical results at the 95% confidence level. The initial response time is t_orig, the refining time is t_refine, and C.I. denotes the confidence interval.

Table 1: Response time of WEBSAIL (in seconds)

Table 2: Response time of YARROW (in seconds)

Table 3: Response time of FEATURES (in seconds)

The statistics in the tables indicate that while the standard deviations and the confidence intervals are relatively high, they are in a range that users can accept. It takes WEBSAIL, YARROW, and FEATURES on the order of a few seconds to 20 seconds to respond initially, because they need to get information from external search engines over the network. However, even the initial response time is not long and hence is acceptable to the user.
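A confidence interval of this kind can be computed from the recorded wall-clock times as follows. The timing samples are illustrative, and the normal-approximation constant 1.96 is an assumption that is reasonable for roughly 30 samples per engine.

```python
import statistics

def mean_ci95(samples):
    """Mean and half-width of a 95% confidence interval for the
    response-time samples, using the normal approximation (1.96)."""
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    half_width = 1.96 * sd / n ** 0.5
    return mean, half_width

t_orig = [8.2, 12.5, 9.7, 15.1, 11.0, 7.9]   # illustrative timings (s)
m, ci = mean_ci95(t_orig)
print(f"{m:.2f} +/- {ci:.2f} s")
```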
The Commercial Applications

Intelligent Web search can find many commercial applications. This section concentrates on applications to E-commerce. E-commerce can be viewed as having three major components: the service and goods suppliers, the consumers, and the information intermediaries (or infomediaries). The service and goods suppliers are the producers, or the source, of the e-commerce flow. The consumers are the destination of the flow. Infomediaries, according to Grover and Teng (2001), are an essential part of E-commerce. An enormous amount of information has to be produced, analyzed, and managed in order for e-commerce to succeed. In this context, Web search is a major player among the infomediaries. Other components of the infomediaries include communities of interest (e.g., online purchase), industry magnet sites (e.g., www.amazon.com), e-retailers, and even individual corporate sites (Grover & Teng, 2001).

The machine-learning approaches to Web search studied in this chapter are particularly important in the whole context of E-commerce. The key feature of the machine-learning approach for Web search is interactive learning and narrowing the search results to what the user wants. This feature can be used in many e-commerce applications. The following are a few examples.

Building a partnership: As pointed out in Tewari et al. (2001), building a partnership between the buyers and the seller is extremely important for the success of an e-Business. Tewari et al. used the Multi-Attribute Resource Intermediaries (MARI) infrastructure to approximate buyer and seller preferences. They compare the degree of matching between buyers and sellers by computing a distance between the two vectors. When the interactive learning features explored in this chapter are used in this process, the buyers and the sellers can negotiate the deal in real-time, thus greatly enhancing the capability of the system. A collection of sellers may provide an initial list of items available at certain prices for buyers to choose from. The buyers may also have a list of expectations. According to the model proposed in Tewari et al. (2001), the possibility of a match is computed statically. If a machine-learning approach is taken, the buyers and the sellers may interactively find the best deal, much as in a face-to-face negotiation.

Brokering between buyers and sellers: Brokerage between the producers and the consumers is a critical E-commerce component. Given a large number of producers and a large number of consumers, how can a match be found efficiently between what is offered on the market and what a buyer is looking for? The work described in Meyyappan (2001) and Santos et al. (2001) provides a framework for e-commerce search brokers. A broker here compares price information, product features, the reputation of the producer, and other information for a potential buyer. While in the previous category the seller and the buyer may negotiate interactively, here the buyer interacts only with the broker(s), much as in the real world. The interactive machine-learning and related Web search technology can be applied in this category as well. The machine-learning algorithm uses the collection of potential sellers as a starting space and interactively searches for the optimal seller based on the information collected by the brokerage software.
The machine-learning algorithm discussed in this chapter can be used by a buyer to interact with the broker to get the best that is available on the market. For example, a broker may act as a meta-search engine that collects information from a number of sellers, behaving very much like a general-purpose search engine. A buyer asks her broker to get certain information; the broker, a meta-search engine equipped with TW2 or other learning algorithms, may search, collect, collate, and rank the information returned from seller sources for the buyer. The buyer can interact with the broker, just as in the Web search scenario. The broker refines its list until the buyer finds a satisfactory product and seller.

Interactive catalog: The service providers or the sellers can allow consumers to browse the catalog interactively. While the customer browses, the learning algorithm can pick up the user's interests and supply better information to the customer, much as adaptive Web sites (Perkowitz & Etzioni, 2000) do for their customers. Here the learning can take place in two forms. The seller can explicitly ask how the potential buyers (browsers of the catalog) feel about the usefulness of the catalog.
This can be analogous to interactive learning using algorithms such as TW2. Researchers have reported approaches of this type (though they did not use TW2 explicitly); see Herlocker and Konstan (2001) for an example and pointers to similar projects. In the second approach, the provider of the catalog (the seller) learns the user's interests and behaviors implicitly, as reported in Claypool et al. (2001). A learning algorithm such as TW2 can be embedded in the catalog software. The buyer's interests and intentions can be captured through modified browser software. The learning algorithm can then revise the catalog listings by taking the buyer's Web page clicks as feedback. This is very similar to the Web search situation.

Web commerce infrastructure: Chaudhury, Mallick, and Rao (2001) describe using the Web in e-commerce as various channels. The Web can be used as an advertising channel, an ordering channel, and a customer support channel. All these channels should be supported by an interactive system in which customer feedback can be quickly captured, analyzed, and used to update the e-commerce system.

Future Work

In the future, the authors plan to improve the interface of their systems. Right now, the systems display the URLs of the documents; if the user wants to know the contents of a document, he/she needs to click the URL to view the content. The authors plan to display the URL of a document together with a good preview of its content. The authors also want to highlight the index keywords in the preview and allow them to be clickable for feature extraction and learning. The authors also plan to apply clustering techniques to increase the performance of their system. It is easy to observe that, in most cases, documents that are relevant to a search query can be divided into a few different clusters or groups. The authors believe that document clustering techniques, such as graph spectral partitioning, can be used to reduce the number of iterations of the learning process and to increase the performance of the system.

Acknowledgment

The authors thank the two anonymous referees and the editor, Dr. Nansi Shi, for their valuable comments on the draft of this chapter. The final presentation of the chapter has greatly benefited from their comments.

URL References

(AV) AltaVista: www.altavista.com
(IF) Inference Find: www.infind.com
(KH) Kidshealth.com: www.kidshealth.com
(VG) Vacations.Com: www.vacations.com
(DP) Dogpile: www.dogpile.com
(IS) Infoseek: www.infoseek.com
(MC) MetaCrawler: www.metacrawler.com

References

Angluin, D. (1987). Queries and concept learning. Machine Learning, 2, 319-342.

Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. (1995). WebWatcher: A learning apprentice for the World Wide Web. In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 6-12. AAAI Press.

Balabanovi, M. (1997). An adaptive Web page recommendation service. In Proceedings of the First International Conference on Autonomous Agents, 378-387. New York: ACM Press.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Reading, MA: Addison-Wesley.

Bollacker, K., Lawrence, S., & Giles, C. L. (1998). CiteSeer: An autonomous Web agent for automatic retrieval and identification of interesting publications. In Proceedings of the Second International Conference on Autonomous Agents, 116-123. New York: ACM Press.

Bollacker, K., Lawrence, S., & Giles, C. L. (1999). A system for automatic personalized tracking of scientific literature on the Web. In Proceedings of the Fourth ACM Conference on Digital Libraries, 105-113. New York: ACM Press.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh World Wide Web Conference.

Chaudhury, A., Mallick, D. N., & Rao, H. R. (2001). Web channels in E-commerce. Communications of the ACM, 44(1), 99-103.

Chen, Z., & Meng, X. (2000). Yarrow: A real-time client-site meta-search learner. In Proceedings of the AAAI 2000 Workshop on Artificial Intelligence for Web Search, 12-17 (the full version will appear in Journal of Intelligent Information Systems).

Chen, Z., Meng, X., & Fowler, R. H. (1999). Searching the Web with queries. Knowledge and Information Systems, 1, 369-375.

Chen, Z., Meng, X., Fowler, R. H., & Zhu, B. (2001). FEATURES: Real-time adaptive features and document learning for Web search. Journal of the American Society for Information Science, 52(8), 655-665.

Chen, Z., & Zhu, B. (2002). Some formal analysis of Rocchio's similarity-based relevance feedback algorithm. Information Retrieval, 5(1), 61-86.

Chen, Z., Meng, X., Zhu, B., & Fowler, R. (2000). WebSail: From on-line learning to Web search. In Proceedings of the 2000 International Conference on Web Information Systems Engineering (the full version appeared in Knowledge and Information Systems, 4, 219-227).
Claypool, M., Brown, D., Le, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing, 5(6), 32-39.

Etzioni, O., & Weld, D. (1995). Intelligent agents on the Internet: Fact, fiction and forecast. IEEE Expert, 10(3), 44-49.

Grover, V., & Teng, J. T. C. (2001). E-commerce and the information market. Communications of the ACM, 44(4), 79-86.

Herlocker, J. L., & Konstan, J. A. (2001). Content-independent task-focused recommendation. IEEE Internet Computing, 5(6), 40-47.

Ide, E. (1971). New experiments in relevance feedback. In G. Salton (Ed.), The SMART retrieval system: Experiments in automatic document processing, 337-354. Englewood Cliffs, NJ: Prentice Hall.

Kobayashi, M., & Takeda, K. (2000). Information retrieval on the Web. ACM Computing Surveys, 32(2), 144-173.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., & Riedl, J. (1997). GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77-87.

Lang, K. (1995). NewsWeeder: Learning to filter news. In Proceedings of the Twelfth International Conference on Machine Learning, 331-339.

Lewis, D. (1991). Learning in intelligent information retrieval. In Proceedings of the Eighth International Workshop on Machine Learning, 235-239.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285-318.

Meng, X., & Chen, Z. (1999). Personalize Web search using information on the client's side. In Advances in Computer Science and Technologies, 985-992. Denver, CO: International Academic Publishers.

Meyyappan, A. (2001). Proposing a new multi-routing agent architecture for E-marketplace. In Proceedings of the 2001 International Internet Computing Conference, 275-277.

Pazzani, M., Muramatsu, J., & Billus, D. (1996). Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 54-61.

Perkowitz, M., & Etzioni, O. (2000). Adaptive Web sites: Concept and case study. Artificial Intelligence, 118, 245-275.

Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system: Experiments in automatic document processing, 313-323. Englewood Cliffs, NJ: Prentice Hall.

Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.

Salton, G., Wong, A., & Yang, C. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Santos, S. C., Anglim, S., & Meira, S. R. L. (2001). A framework for Web-commerce search brokers. In Proceedings of the 2001 International Internet Computing Conference, 261-267.
Shakes, J., Langheinrich, M., & Etzioni, O. (1997). Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the Sixth International World Wide Web Conference, 189-200.

Terveen, L., Hill, W., Amento, B., McDonald, D., & Creter, J. (1997). PHOAKS: A system for sharing recommendations. Communications of the ACM, 40(3), 50-62.

Tewari, G., Berkovich, A., Gabovich, V., Liang, S., Ramakrishnan, A., & Maes, P. (2001). Sustaining individual incentives while maximizing aggregate social welfare: A mediated brokering technique for trading agents in next-generation electronic markets. In Proceedings of the 2001 International Internet Computing Conference, 247-253.

Yuwono, B., & Lee, D. (1996). Search and ranking algorithms for locating resources on the World Wide Web. In Proceedings of the International Conference on Data Engineering, 164-171.
Chapter 10: World Wide Web Search Engines

Wen-Chen Hu, University of North Dakota
Jyh-Haw Yeh, Boise State University

Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Abstract

The World Wide Web now holds more than 800 million pages covering almost all subjects. The Web's fast-growing size and lack of structural style present a new challenge for information retrieval. Numerous search technologies have been applied to Web search engines; however, the dominant search method has yet to be identified. This chapter provides an overview of the existing technologies for Web search engines and classifies them into six categories: 1) hyperlink exploration, 2) information retrieval, 3) metasearches, 4) SQL approaches, 5) content-based multimedia searches, and 6) others. At the end of this chapter, a comparative study of major commercial and experimental search engines is presented, and some future research directions for Web search engines are suggested.

Introduction

One of the most common tasks performed on the Web is searching Web pages, which is also one of the most frustrating and problematic. The situation is getting worse because of the Web's fast-growing size and lack of structural style, as well as the inadequacy of existing Web search engine technologies (Lawrence & Giles, 1999a). Traditional search techniques are based on users typing in search keywords, which the search services can then use to locate the desired Web pages. However, this approach normally retrieves too many documents, of which only a small fraction are relevant to the user's needs. Furthermore, the most relevant documents do not necessarily appear at the top of the query output list. A number of corporations and research organizations are taking a variety of approaches to try to solve these problems. These approaches are diverse, and none of them dominates the field. This chapter provides a survey and classification of the available World Wide Web search engine techniques, with an emphasis on non-traditional approaches. Related Web search technology reviews can also be found in Gudivada, Raghavan, Grosky, and Kasanagottu (1997), Lawrence and Giles (1998b, 1999b), and Lu and Feng (1998).

Requirements of Web Search Engines

It is first necessary to examine what kind of features a Web search engine is expected to have in order to conduct effective and efficient Web searches, and what kind of challenges may be faced in the process of developing new Web search techniques. The requirements for a Web search engine are listed below, in order of importance:

1. effective and efficient location and ranking of Web documents;
2. thorough Web coverage;
3. up-to-date Web information;
4. unbiased access to Web pages;
5. an easy-to-use user interface that also allows users to compose any reasonable query;
6. expressive and useful search results; and
7. a system that adapts well to user queries.

Web Search Engine Technologies

Numerous Web search engine technologies have been proposed, and each technology employs a very different approach. This survey classifies the technologies into six categories: i) hyperlink exploration, ii) information retrieval, iii) metasearches, iv) SQL approaches, v) content-based multimedia searches, and vi) others. The chapter is organized as follows: Section 2 introduces the general structure of a search engine, and Sections 3 to 8 introduce each of the six Web search engine technologies in turn. A comparative study of major commercial and experimental search engines is given in Section 9, and the final section gives a summary and suggests future research directions.

Search Engine Structure

Two different approaches are applied to Web search services: genuine search engines and directories. The difference lies in how listings are compiled:

• Search engines, such as Google, create their listings automatically.
• A directory, such as Yahoo!, depends on humans for its listings.

Some search engines, known as hybrid search engines, maintain an associated directory. Search engines traditionally consist of three components: the crawler, the indexing software, and the search and ranking software (Greenberg & Garber, 1999; Yuwono & Lee, 1996). Figure 1 shows the system structure of a typical search engine.

Figure 1: System structure of a Web search engine
Crawler

A crawler is a program that automatically scans various Web sites and collects Web documents from them. Crawlers follow the links on a site to find other relevant pages. Two search algorithms, breadth-first search and depth-first search, are widely used by crawlers to traverse the Web. The crawler views the Web as a graph, with the nodes being the objects located at Uniform Resource Locators (URLs). The objects could be Hypertext Transfer Protocol (HTTP) pages, File Transfer Protocol (FTP) resources, mailto (e-mail), news, telnet, and so on. Crawlers also return to sites periodically to look for changes. To speed up the collection of Web documents, several crawlers are usually sent out to traverse the Web at the same time. Three simple tools can be used to implement an experimental crawler:

• lynx: Lynx is a text browser for Unix systems. For example, the command lynx -source http://www.w3c.org/ downloads the Web page source code at http://www.w3c.org/.
• java.net: The java.net package of the Java language provides plenty of networking utilities. Two classes in the package, java.net.URL and java.net.URLConnection, can be used to download Web pages.
• Comprehensive Perl Archive Network (CPAN): Perl has been used intensively for Web-related applications. Some scripts provided by CPAN at http://www.cpan.org/ are useful for crawler construction.

To construct an efficient and practical crawler, some other networking tools have to be used.
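In the same spirit as the tools above, the sketch below implements an experimental breadth-first crawler using only the Python standard library. The seed URL and page limit are illustrative, and a practical crawler would add robots.txt handling, politeness delays, and parallel fetching.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed, max_pages=10):
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                      # skip unreachable pages
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)    # FIFO queue -> breadth-first
    return pages

pages = crawl("http://www.w3c.org/", max_pages=5)
print(len(pages), "pages fetched")
```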
Indexing Software

Automatic indexing is the process of algorithmically examining information items to build a data structure that can be quickly searched. Filtering (Baeza-Yates, 1992) is one of the most important pre-processes for indexing. Filtering is a typical transformation in information retrieval and is often used to reduce the size of a document and/or standardize it to simplify searching. Traditional search engines utilize the following information, provided by HTML scripts, to locate the desired Web pages:

• Content: Page content provides the most accurate, full-text information. However, it is also the least-used type of information, since content extraction is still far from practical.
• Descriptions: Page descriptions can either be constructed from the metatags or submitted by Web masters or reviewers.
• Hyperlinks: Hyperlinks contain high-quality semantic clues to a page's topic. A hyperlink to a page represents an implicit endorsement of the page to which it points (Chakrabarti et al., 1999).
• Hyperlink text: Hyperlink text is normally a title or brief summary of the target page.
• Keywords: Keywords can be extracted from full-text documents or metatags.
• Page title: The title tag, which is only valid in a head section, defines the title of an HTML document.
• Text with a different font: Emphasized text is usually given a different font to highlight its importance.
• The first sentence: The first sentence of a document is also likely to give crucial information related to the document.

Search and Ranking Software

Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. A user enters a keyword or keywords, along with Boolean modifiers such as "and," "or," or "not," into a search engine, which then scans indexed Web pages for the keywords. To determine in which order to display pages to the user, the engine uses an algorithm to rank pages that contain the keywords (Zhang & Dong, 2000). For example, the engine may count the number of times the keyword appears on a page. To save time and space, the engine may only look for keywords in metatags, which are HTML tags that provide information about a Web page. Unlike most HTML tags, metatags do not affect a document's appearance. Instead, they include such information as a Web page's contents and some relevant keywords. The following six sections give various methods of indexing, searching, and ranking Web pages.

Hyperlink Exploration

Hypermedia documents contain cross references to other related documents by using hyperlinks, which allow the user to move easily from one to the other. Links can be tremendously important sources of information for indexers; the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the page to which it points. This approach is based on identifying two important types of Web pages for a given topic:

• Authorities, which provide the best source of information on the topic, and
• Hubs, which provide collections of links to authorities.

For the example of professional basketball information, the official National Basketball Association site (http://www.nba.com/) is considered to be an authority, while the ESPN site (http://www.espn.com/) is a hub. Authorities and hubs are either given top ranking in the search results or used to find related Web pages (Dean & Henzinger, 1999). Analyzing the interconnections of a series of related pages can identify the authorities and hubs for a particular topic. A simple method to update a non-negative authority weight x_p and a non-negative hub weight y_p is given by Chakrabarti et al. (1999). If a page p is pointed to by many good hubs, its authority weight is updated by using the following formula:

x_p = Σ_{q → p} y_q

where the notation q → p indicates that q links to p. Similarly, if a page p points to many good authorities, its hub weight is updated via

y_p = Σ_{p → q} x_q

Unfortunately, applying the above formulas to the entire Web to find authorities and hubs is impracticable. Ideally, the formulas are applied to a small collection S_σ of pages that contains plenty of relevant documents. The concepts of a root set and a base set have been proposed by Kleinberg (1999) to find S_σ. The root set is usually constructed by collecting the t highest-ranked pages for the query σ from a search engine such as Google or Yahoo!. However, the root set may not contain most of the strongest authorities. A base set is therefore built by including any page pointed to by a page in the root set and any page that points to a page in the root set. Figure 2 shows an example of a root set and a base set. The above formulas can then be applied to a much smaller set, the base set, instead of the entire Web.
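The two update formulas can be run as a simple power iteration over a base set, normalizing after each round so the weights stay bounded. The sketch below and its toy base set are illustrative assumptions.

```python
def hits(links, iterations=20):
    """Iterate x_p = sum of y_q over q -> p (authority update) and
    y_p = sum of x_q over p -> q (hub update) on a small base set.
    'links' maps each page to the pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    x = {p: 1.0 for p in pages}   # authority weights
    y = {p: 1.0 for p in pages}   # hub weights
    for _ in range(iterations):
        x = {p: sum(y[q] for q in pages if p in links.get(q, ()))
             for p in pages}
        y = {p: sum(x[q] for q in links.get(p, ())) for p in pages}
        nx, ny = sum(x.values()) or 1.0, sum(y.values()) or 1.0
        x = {p: v / nx for p, v in x.items()}   # keep weights bounded
        y = {p: v / ny for p, v in y.items()}
    return x, y

base_set = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"],
            "auth1": [], "auth2": []}
auth, hub = hits(base_set)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # auth1 hub1
```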
In addition to the methods used to find authorities and hubs, a number of search methods based on connectivity have been proposed. A comparative study of various hypertext link analysis algorithms is given in Borodin et al. (2001). The most widely used method is the PageRank model (Brin & Page, 1998), which suggests that the reputation of a page on a topic is proportional to the sum of the reputation weights of the pages pointing to it on the same topic. That is, links emanating from pages with high reputations are weighted more heavily. The concepts of authorities and hubs, together with the PageRank model, can also be used to compute the reputation rank of a page; those topics for which the page has a good reputation are then identified (Rafiei & Mendelzon, 2000). Some other ad hoc methods include the Hyperlink Vector Voting (HVV) method (Li, 1998) and a system known as WebQuery (Carriere & Kazman, 1997). The former uses the content of hyperlinks to a document to rank its relevance to the query terms, while the latter studies the structural relationships among the nodes returned in a content-based query and gives the highest ranking to the most highly connected nodes. An improved algorithm obtained by augmenting the link analysis with content analysis is introduced in Bharat and Henzinger (1998).

Figure 2: Expanding the root set into a base set

Information Retrieval (IR)

IR techniques are widely used in Web document searches (Gudivada et al., 1997). Among them, relevance feedback and data clustering are two of the most popular techniques used by search engines. The former method has so far not been applied to any commercial products, because it requires some interaction with users, who normally prefer to use a keyword-only interface. The latter method has achieved more success, since it does not require any interaction with users to achieve acceptable results.

Relevance Feedback

An initial query is usually a wild guess. Retrieved query results are then used to help construct a more precise query or modify the database indexes (Chang & Hsu, 1999). For example, if the query "Which TOYOTA dealer in Atlanta has the lowest price for a Corolla 2002?" is submitted to a search engine, the engine may produce the following list of ranked results:

1. Get the BEST price on a new Toyota, Lexus car or truck. http://www.toyotaforless.com/
2. Toyota of Glendale: Your #1 Toyota dealer. http://www.toyota-of-glendale.com/
3. Leith Toyota: Raleigh, North Carolina. http://www.leithtoyota.com/f_more_about_us.html
4. Atlanta rental cars & auto rentals. http://www.bnm.com/atl2.htm

This list includes three relevant results (1, 2, and 3) and one irrelevant result (4). The following two relevance feedback methods can be used to improve the search results (a sketch of the first method follows the list):

• Query modification: Adjusts the initial query in an attempt to avoid unrelated or less-related query results. For example, the above query could be modified by adding a condition excluding rental cars.
• Indexing modification: Through feedback from the users, system administrators can modify an unrelated document's terms to render it unrelated or less related to such a query. For example, the information concerning rental cars could be removed from the database indexes of car sales and prices.
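The chapter does not give a formula for query modification; a standard way to realize it is Rocchio-style adjustment of the query's term weights, sketched below with assumed parameter values. Terms from results judged relevant are promoted, and terms from irrelevant results (such as "rental" from Result 4) are demoted.

```python
def modify_query(q, relevant, irrelevant, beta=0.75, gamma=0.25):
    """Rocchio-style query modification (a standard IR technique; the
    parameter values and term weights here are assumptions). Term
    weights from relevant results are added to the query vector, and
    weights from irrelevant results are subtracted."""
    terms = set(q) | set(relevant) | set(irrelevant)
    return {t: max(0.0, q.get(t, 0.0)
                   + beta * relevant.get(t, 0.0)
                   - gamma * irrelevant.get(t, 0.0))
            for t in terms}

q = {"toyota": 1.0, "dealer": 1.0, "atlanta": 1.0, "corolla": 1.0}
good = {"toyota": 1.0, "dealer": 0.8, "price": 0.6}   # results 1-3
bad = {"rental": 1.0, "atlanta": 0.5}                 # result 4
print(modify_query(q, good, bad))
```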
For the above example, the search results after modification should not include Result 4.

Data Clustering

Data clustering is used to improve the search results by dividing the whole data set into data clusters. Each data cluster contains objects of high similarity, and clusters are produced that group documents relevant to the user's query separately from irrelevant ones. For example, the formula below gives a similarity measure:

similarity(D_i, D_j) = Σ_k (weight_ik · weight_jk) / (sqrt(Σ_k weight_ik²) · sqrt(Σ_k weight_jk²))

where weight_ik is the weight assigned to term_k in a document D_i (Baeza-Yates, 1992). Clustering should not be based on the whole Web resource, but on smaller, separate query results. In Zamir and Etzioni (1998), a Suffix Tree Clustering (STC) algorithm based on phrases shared between documents is used to create clusters. Besides clustering the search results, a proposed similarity function has been used to cluster similar queries according to their contents as well as user logs (Wen, Nie, & Zhang, 2001). The resulting clusters can provide useful information for Frequently Asked Queries (FAQ) identification. Another Web document clustering algorithm is suggested in Chang and Hsu (1999).

Metasearches

None of the current search engines is able to cover the Web comprehensively. Using an individual search engine may miss some critical information that is provided by other engines. Metasearch engines (Dreilinger & Howe, 1997; Howe & Dreilinger, 1997; Selberg & Etzioni, 1997) conduct a search using several other search engines simultaneously and then present the results in some sort of integrated format. This lets users see at a glance which particular search engine returned the best results for a query without having to search each one individually. They typically do not use their own Web indexes. Figure 3 shows the system structure of a metasearch engine, which consists of three major components (a sketch of the display stage follows the list):

• Dispatch: Determines to which search engines a specific query is sent. The selection is usually based on network and local computational resources, as well as the long-term performance of search engines on specific query terms.
• Interface: Adapts the user's query format to match the format of a particular search engine, which varies from engine to engine.
• Display: Raw results from the selected search engines are integrated for display to the user. Each search engine produces raw results in a format different from the other search engines, and these must be combined to give a uniform format for ease of use.
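The display stage can be sketched as a simple rank-interleaving merge with de-duplication; real metasearch engines use more elaborate score-based merging, so the function below and its toy result lists are illustrative assumptions.

```python
def merge_results(result_lists):
    """Display-stage sketch: interleave ranked result lists returned by
    several engines, de-duplicating by URL."""
    merged, seen = [], set()
    for rank in range(max(map(len, result_lists), default=0)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4"]
print(merge_results([engine_a, engine_b]))  # ['u1', 'u2', 'u4', 'u3']
```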