Wiley - Data Mining with Microsoft SQL Server 2008 (2009)02

Chia sẻ: Hoang Nhan | Ngày: | Loại File: PDF | Số trang:10

0
174
lượt xem
55
download

Wiley - Data Mining with Microsoft SQL Server 2008 (2009)02

Mô tả tài liệu
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

Wiley - Data Mining with Microsoft SQL Server 2008 (2009)02

Chủ đề:
Lưu

Nội dung Text: Wiley - Data Mining with Microsoft SQL Server 2008 (2009)02

  1. Introduction to Data Mining in SQL Server 2008 3 Figure 1-1 Student table In contrast, the data mining approach for this problem is almost the reverse of the query-and-explore method. Instead of guessing a hypothesis and trying it out in different ways, you ask the question in terms of the data that can support many hypotheses, and allow your data mining system to explore them for you. In this case, you indicate that the columns IQ, Gender, ParentIncome, and ParentEncouragement are to be used as hypotheses in determining CollegePlans. As the data mining system passes over the data, it analyzes the influence of each input column on the target column. Figure 1-2 shows the hypothetical result of a decision tree algorithm operat- ing on this data set. In this case, each path from the root node to the leaf node forms a rule about the data. Looking at this tree, you see that students with IQs greater than 100 and who are encouraged by their parents are highly likely to attend college. In this case, you have extracted knowledge from the data. As shown here, data mining applies algorithms such as decision trees, clustering, association, time series, and so on to a data set, and then analyzes its contents. This analysis produces patterns, which can be explored for valuable information. Depending on the underlying algorithm, these patterns can be in the form of trees, rules, clusters, or simply a set of mathematical formulas. The information found in the patterns can be used for reporting (to
  2. 4 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008 guide marketing strategies, for instance) and for prediction. For example, if you could collect data about undecided students, you could select those who are likely to be interested in continued education and preemptively market to that audience. Attend College: 55% Yes 45% No IQ > 100 IQ ≤ 100 Attend College: Attend College: 79% Yes 35% Yes 21% No 65% No Encouragement = Encouragement = Encouraged Not Encouraged Attend College: Attend College: 94% Yes 69% Yes 6% No 31% No Figure 1-2 Decision tree Business Problems for Data Mining Data mining techniques can be used in virtually all business applications, answering various types of businesses questions. In truth, given the software available today, all you need is the motivation and the know-how. In general, data mining can be applied whenever something could be known, but is not. The following examples describe some scenarios: Recommendation generation — What products or services should you offer to your customers? Generating recommendations is an important business challenge for retailers and service providers. Customers who are provided appropriate and timely recommendations are likely to be more valuable (because they purchase more) and more loyal (because they feel a stronger relationship to the vendor). For example, if you go to online stores such as Amazon.com or Barnesandnoble.com to purchase an item, you are provided with recommendations about other items you may be interested in. These recommendations are derived from using data mining to analyze purchase behavior of all of the retailer’s customers, and applying the derived rules to your personal information.
  3. Business Problems for Data Mining 5 Anomaly detection — How do you know whether your data is ‘‘good’’ or not? Data mining can analyze your data and pick out those items that don’t fit with the rest. Credit card companies use data mining–driven anomaly detection to determine if a particular transaction is valid. If the data mining system flags the transaction as anomalous, you get a call to see if it was really you who used your card. Insurance compa- nies also use anomaly detection to determine if claims are fraudulent. Because these companies process thousands of claims a day, it is impos- sible to investigate each case, and data mining can identify which claims are likely to be false. Anomaly detection can even be used to validate data entry — checking to see if the data entered is correct at the point of entry. Churn analysis — Which customers are most likely to switch to a com- petitor? The telecom, banking, and insurance industries face severe com- petition. On average, obtaining a single new mobile phone subscriber costs more than $200. Every business would like to retain as many cus- tomers as possible. Churn analysis can help marketing managers identify the customers who are likely to leave and why, and as a result, they can improve customer relations and retain customers. Risk management — Should a loan be approved for a particular cus- tomer? Since the subprime mortgage meltdown, this is the single most common question in banking. Data mining techniques are used to deter- mine the risk of a loan application, helping the loan officer make appro- priate decisions on the cost and validity of each application. Customer segmentation — How do you think of your customers? Are your customers the indescribable masses, or can you learn more about your customers to have a more intimate and appropriate discussion with them. Customer segmentation determines the behavioral and descriptive profiles for your customers. These profiles are then used to provide per- sonalized marketing programs and strategies that are appropriate for each group. Targeted ads — Web retailers or portal sites like to personalize their content for their Web customers. Using navigation or online purchase patterns, these sites can use data mining solutions to display targeted advertisements to their Web navigators. Forecasting — How many cases of wine will you sell next week in this store? What will the inventory level be in one month? Data mining fore- casting techniques can be used to answer these types of time-related questions.
  4. 6 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008 Data Mining Tasks For each question that can be asked of a data mining system, there are many tasks that may be applied. In some cases, an answer will become obvious with the application of a single task. In others, you will explore and combine multiple tasks to arrive at a solution. The following sections describe the general data mining tasks. Classification Classification is the most common data mining task. Business problems such as churn analysis, risk management, and targeted advertising usually involve classification. Classification is the act of assigning a category to each case. Each case contains a set of attributes, one of which is the class attribute. The task requires finding a model that describes the class attribute as a function of input attributes. In the College Plans data set shown in Figure 1-1, the class is the CollegePlans attribute with two states: Yes and No. A classification model will use the other attributes of a case (the input attributes) to determine patterns about the class (the output attribute). Data mining algorithms that require a target to learn against are considered supervised algorithms. Typical classification algorithms include decision trees, neural network, and Na¨ve Bayes. ı Clustering Clustering is also called segmentation. It is used to identify natural groupings of cases based on a set of attributes. Cases within the same group have more or less similar attribute values. Figure 1-3 shows a very simple customer data set containing two attributes: Age and Income. The clustering algorithm groups the data set into three seg- ments based on these two attributes. Cluster 1 contains a younger population with low income. Cluster 2 contains middle age customers with higher income. Cluster 3 is a group of older individuals with a relatively low income. Clustering is an unsupervised data mining task. There is no single attribute used to guide the training process, so all input attributes are treated equally. Most clustering algorithms build the model through a number of iterations, and stop when the model converges (that is, the boundaries of these segments are stabilized).
  5. Data Mining Tasks 7 Income Cluster 2 Cluster 3 Cluster 1 Age Figure 1-3 Clustering Association Association is also called market basket analysis. A typical association business problem is to analyze a sales transaction table and identify those products often in the same shopping basket. The common usage of association is to identify common sets of items and rules for the purpose of cross-selling, as shown in Figure 1-4. Milk Cake Beer Cheese Wine Donut Coke Pepsi Beef Juice Figure 1-4 Product association In terms of association, each piece of information is considered an item. The association task has two goals: to find those items that appear together frequently, and from that, to determine rules about the associations.
  6. 8 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008 Regression The regression task is similar to classification, except that instead of looking for patterns that describe a class, the goal is to find patterns to determine a numerical value. Simple linear line-fitting techniques are an example of regression, where the result is a function to determine the output based on the values of the inputs. More advanced forms of regression support categorical inputs as well as numerical inputs. The most popular techniques used for regression are linear regression and logistic regression. Other techniques supported by SQL Server Data Mining are regression trees (part of the Microsoft Decision Trees algorithm) and neural networks. Regression is used to solve many business problems — for example, to predict a coupon redemption rate based on the face value, distribution method, distribution volume, and season, or to predict wind velocities based on temperature, air pressure, and humidity. Forecasting Forecasting is yet another important data mining task. What will the stock value of Microsoft Corporation (NASDAQ symbol MSFT) be tomorrow? What will the sales amount of wine be next month? Forecasting can help answer these questions. As input, it takes sequences of numbers indicating a series of values through time, and then it imputes future values of those series using a variety of machine-learning and statistical techniques that deal with seasonality, trending, and noisiness of data. Figure 1-5 shows two curves. The solid line curve is the actual time-series data on Microsoft stock value, and the dotted curve is a time-series model that predicts values based on past values. MSFT 3-year price history 38 36 34 32 30 28 26 24 22 20 Figure 1-5 Time series
  7. Data Mining Project Cycle 9 Sequence Analysis Sequence analysis is used to find patterns in a series of events called a sequence. For example, a DNA sequence is a long series composed of four different states: A, G, C, and T. A click sequence on the Web contains a series of URLs. In certain circumstances, you may model customer purchases as a sequence of data. For example, a customer first buys a computer, and then buys speakers, and finally buys a webcam. Both sequence and time-series data are similar in that they contain adjacent observations that are order-dependent. The difference is that where a time series contains numerical data, a sequence series contains discrete states. Figure 1-6 shows Web click sequences from a news website. Each node is a URL category, and the lines represent transitions between them. Each transition is associated with a weight, representing the probability of the transition between one URL and another. Home 0.2 Weather Page 0.1 0.4 0.2 0.2 0.2 News Sport 0.3 0.1 Business Science 0.3 Figure 1-6 Web navigation sequence Deviation Analysis Deviation analysis is used to find rare cases that behave very differently from the norm. Deviation analysis is widely applicable, the most common usage being credit card fraud detection. Identifying abnormal cases among millions of transactions is a very challenging task. Other applications include network intrusion detection, manufacture error analysis, and so on. There is no standard technique for deviation analysis. Usually, analysts apply decision trees, clustering, or neural network algorithms for this task. Data Mining Project Cycle From the initial business problem formation through to deployment and sustained management, most data mining projects pass through the same phases.
  8. 10 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008 Business Problem Formation What are the problems you are trying to solve? What techniques are you going to apply to solve the problem? How do you know if you will be successful? These are important questions to ask before embarking on any project. You may find that a simple OLAP, reporting, or data integration solution may be sufficient. A predictive or data mining solution involves determining the unknown, relying on a belief that making sense of that unknown will add value. This is a shaky precipice from which to begin any business endeavor. Luckily, successful data mining solutions have been shown to have an average of 150-percent return on investment (ROI), so that makes justification easier. Data Collection Business data is stored in many systems across an enterprise. For example, at Microsoft, there are hundreds of online transaction processing (OLTP) databases and more than 70 data warehouses. The first step is to pull the relevant data into a database or a data mart where the data analysis is applied. For example, if you want to analyze your website’s click stream, the first step is to download the log data from your web servers. Sometimes you might be lucky and find that there is already an existing data warehouse on the subject of your analysis. However, in many cases, the data in the data warehouse is not rich enough and must be supplemented with additional data. For example, the log data from the web servers contains only data about web behavior and little (if any) data about the customers. You may need to gather customer information from other company systems or purchase demographic data to build models that meet your business requirements. Data Cleaning and Transformation Data cleaning and transformation are the most resource-consuming steps in a data mining project. The purpose of data cleaning is to remove noise and irrelevant information from the data set. The purpose of data transformation is to modify the source data in ways that make it useful for mining. Various techniques are applied to clean and transform data, including the following: Numerical transformation — For continuous data such as income and age, a typical transformation is to bin (or discretize) the data into buckets. For example, you may want to bin Age into five predefined age groups. SQL Server Data Mining has automatic discretization methods, but if you have meaningful groupings, they may be more informative both from a business sense and an algorithmic sense. Additionally, continu- ous data is often normalized. Normalization maps all numerical values to
  9. Data Mining Project Cycle 11 a range (such as between 0 and 1) or to have a specific standard deviation (such as 1). Grouping — Discrete data often has more distinct values than are use- ful. You can group these values to reduce the model complexity. For example, the column Profession may have many different types of engi- neers, such as Software Engineer, Telecom Engineer, Mechanical Engi- neer, and so on. You can group all of these professions to the single value Engineer. Aggregation — Aggregation is an important transformation to derive additional value from your data. Suppose you want to group customers based on their phone usage. If the call detail record information is too detailed for the model, you must aggregate all the calls into a few derived attributes such as total number of calls and the average call duration. These derived attributes can later be used in the model. Missing value handling — Most data sets contain missing values. This can be caused by many different things. For example, you may have two customer tables coming from two OLTP databases that, when merged, have missing values because the tables are not aligned. Another example occurs when customers don’t supply data values such as age. Another is when you have stock market values with blanks because the markets are closed on weekends and holidays. Addressing missing values is important, because it is reflected in the business value of your solution. You may need to retain the missing data (for example, customers who refuse to report their age may have other interesting things in common). You may need to discard the entire record (having too many unknowns could pollute your model). Or, you may simply be able to replace missing values with some other value (such as the previous value for time-series data such as stock market val- ues, or the most popular value). For more advanced cases, you can use data mining to predict the most likely value for each missing case. Removing outliers — Outliers are abnormal data and can be real or (as is often the case) errors. Abnormal data has an effect on the quality of your results. The best way to deal with outliers typically is to simply remove them before beginning the analysis. For example, you could remove 0.5 percent of the customers with highest or lowest income to eliminate any situations of people having negative or extremely unlikely incomes. SQL Server Integration Services (SSIS), which is included with Microsoft SQL Server, is an excellent tool for performing data cleaning and transformation tasks.
  10. 12 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008 Model Building Model building is the core of data mining, though it is not as time- and resource-intensive as data transformation. When you understand the shape of the business problem and the type of data mining task, it is relatively easy to pick algorithms that are suitable. Usually, you don’t know which algorithm is the best fit for the problem until you have built the model. The accuracy of an algorithm depends on the nature of the data. For example, a decision tree algorithm is usually a very good choice for any classifications. However, if the relationships among attributes are complicated, a neural network may perform better. A good approach is to build multiple models using different algorithms, and then compare the accuracy of these models. Even with a single algorithm, you can tune the parameter settings to optimize the model accuracy. Model Assessment In the model assessment stage, you use tools to determine the accuracy of the models that were created, and you examine the models to determine the meaning of discovered patterns and how they apply to your business. For example, a model may determine that Relationship = Husband ➪ Gender = Male with 100-percent confidence. Although the rule is valid, it doesn’t contain any business value. It is very important to work with business analysts who have the proper domain knowledge to validate the discoveries. Sometimes, the model doesn’t contain useful patterns. This is generally because the set of variables in the model are not the right ones to solve your business problem. You may need to repeat the data cleaning and transforma- tion steps, or even redefine your problem in order to derive more meaningful variables. Data mining is an exploratory process, and it often takes a few iterations before you find the right model. Reporting and Prediction In many organizations, the goal of data miners is to deliver reports to marketing executives. SQL Server Data Mining is integrated with SQL Server Reporting Services to generate reports directly from data mining results. Reports may contain predictions (such as lists of customers with the highest value potential) or the rules found in the data mining analysis. To provide predictions, you apply the selected model against new cases of data. Consider a banking scenario where you build a model about loan risk prediction. Every day there are thousands of new loan applications. You can
Đồng bộ tài khoản