SQL Server MVP Deep Dives - P19

SQL Server MVP Deep Dives - P19: Each year Microsoft invites all the MVPs from every technology and country to Redmond for an MVP Summit—all top secret—“don’t tweet what you see!” During the MVP Summit, each product team holds a series of presentations where they explain their technologies, share their vision, and listen to some honest feedback.

CHAPTER 53  SQL Server Audit, change tracking, and change data capture

ALTER SERVER AUDIT ServerAudit WITH ( STATE = ON );
GO
USE [HR];
GO
CREATE DATABASE AUDIT SPECIFICATION HRAudit
    FOR SERVER AUDIT ServerAudit
    ADD ( SELECT, INSERT, UPDATE, DELETE ON dbo.Employees BY [public] )
    WITH ( STATE = ON );
GO

When reviewing audit information (whether in a file, or in the event log), there is a variety of information available to you, including the time of the action, the session_id (SPID) of the user that performed the action, the database, server and object that was the target of the action, and whether or not the action succeeded. For a full listing of the columns written to an audit row, see the Books Online topic, “SQL Server Audit Records,” located at http://msdn.microsoft.com/en-us/library/cc280545.aspx.

I was disappointed to see that host name and/or IP address of the session_id is not recorded. This can be important information in some instances, and is difficult to determine after the session has disconnected from the server. For example, if SQL Authentication is enabled, and the sa (or another sysadmin) password is commonly known, then anyone can connect that way via their own machine, and be relatively untraceable.

Another important note here is that the type of action (for example, SELECT or ALTER) is recorded, but in the case of SELECT or DML queries, none of the data involved is included. For example, if you run the statement in listing 3, the event log entry will look like listing 4 (I’ve left out some of the columns).

Listing 3  Updating the Employees table

UPDATE dbo.Employees SET Salary = Salary * 1.8 WHERE EmployeeID = 5;

Listing 4  Event log entry for the UPDATE command in listing 3

Log Name: Application
User:     N/A
Event ID: 33205
Audit event: event_time:2008-10-05 18:14:31.3784745
    action_id: UP
    session_id: 56
    session_server_principal_name: sa
    server_instance_name: SENTINEL\SQL2008
    database_name: HR
    schema_name: dbo
    object_name: Employees
    statement: UPDATE [dbo].[Employees] set [Salary] = [Salary]*@1
               WHERE [EmployeeID]=@2

(Note that you may also see other events in the event log corresponding to the auditing activity itself.)
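The event shown in listing 4 comes from the Windows Application log. If you create the server audit with a TO FILE target instead, the records land in binary .sqlaudit files that you can query with the built-in sys.fn_get_audit_file function. A minimal sketch follows (not from the chapter); the file path is an assumption and must match whatever FILEPATH you specified when creating the audit:

SELECT event_time,
       action_id,
       session_id,
       server_principal_name,
       database_name,
       schema_name,
       object_name,
       statement,
       succeeded
FROM sys.fn_get_audit_file
     (N'C:\AuditFiles\*.sqlaudit', DEFAULT, DEFAULT)  -- hypothetical path
ORDER BY event_time;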
Because the literal values in the statement are replaced by parameter placeholders, and because the previous version of the data is not included, it is going to be difficult to find the entries where salaries were increased. Also, the information in the event log entry does not include the host name and/or IP address of the computer that issued the statement. So, if you are using SQL Authentication and your developers share a single login, it will be difficult with auditing alone to figure out who performed this update.

You can work your way around this by adding the SUCCESSFUL_LOGIN_GROUP to a server audit specification, as shown in listing 5.

Listing 5  Creating a server audit specification with the SUCCESSFUL_LOGIN_GROUP

USE [master];
GO
CREATE SERVER AUDIT SPECIFICATION CaptureLogins
    FOR SERVER AUDIT ServerAudit
    ADD ( SUCCESSFUL_LOGIN_GROUP )
    WITH ( STATE = ON );
GO

Once you do this, you will have login records in the log or file that you can correlate with session_id and event_time to the relevant database audit activity. The successful login entry will have (in addition to session_id and other data observed above) host name information in the following form, under the Additional information field:

Additional information:... local machine / host name / IP

If you are using Windows Authentication, on the other hand, then this seems like a reasonable way to capture exactly who executed the statement (without having to correlate to login events), but not necessarily what values they passed in. Take the case where you find that Employee 5’s salary has been increased from $100,000 to $250,000. Three such events appear in the Application Log, from three different users, with the exact same UPDATE statement. The first could have updated the salary to $250,000, and the other two could have left it that way (by explicitly defining $250,000 in their UPDATE statement, even though it did not ultimately change the data in the table). Or, the increment could have been performed by the second or third person, or each person could have increased the salary by $50,000. There are millions of other possible permutations, and this is a simple case. Imagine trying to unravel this mystery on a busy system with thousands of simultaneous users all affecting the same table.

Before moving on to the next section, if you have created the sample code above, you can remove it using the code in listing 6.
Listing 6  Cleaning up the audit specification

USE [HR];
GO
IF EXISTS ( SELECT 1 FROM sys.database_audit_specifications
            WHERE name = 'HRAudit' )
BEGIN
    ALTER DATABASE AUDIT SPECIFICATION HRAudit WITH (STATE = OFF);
    DROP DATABASE AUDIT SPECIFICATION HRAudit;
END
GO
USE [master];
GO
IF EXISTS ( SELECT 1 FROM sys.server_audit_specifications
            WHERE name = 'CaptureLogins' )
BEGIN
    ALTER SERVER AUDIT SPECIFICATION CaptureLogins WITH (STATE = OFF);
    DROP SERVER AUDIT SPECIFICATION CaptureLogins;
END
GO
IF EXISTS ( SELECT 1 FROM sys.server_audits
            WHERE name = 'ServerAudit' )
BEGIN
    ALTER SERVER AUDIT ServerAudit WITH (STATE = OFF);
    DROP SERVER AUDIT ServerAudit;
END
GO

Change tracking

Change tracking is a feature that adds the ability to determine, at a glance, which rows in a table have changed in a specified period of time. This can be useful for synchronizing data between the primary database and a middle-tier data cache, and for allowing semi-connected applications to detect conflicts when updates have been made on both sides. Change tracking is meant to allow you to identify the rows that changed, but does not keep any information about the values that were changed (for example,
a previous version of the row). Change tracking occurs synchronously, so there is some overhead to the process. In general, the overhead is equivalent to the maintenance costs of adding an additional nonclustered index to the table. The process assumes that you can always get the current version of the row directly from the table, and that you only care about knowing whether or not a row has changed. (Change tracking is described more in-depth in Books Online, starting at the topic, “Change Tracking,” at http://msdn.microsoft.com/en-us/library/cc280462.aspx.)

To set up change tracking on a table, the table must have a primary key, and you must first enable the feature at the database level. (Books Online also suggests that the database must be at least at a compatibility level of 90, and that snapshot isolation is enabled.) Using the HR database and the dbo.Employees table created in the previous section, you can enable change tracking as shown in listing 7.

Listing 7  Enabling change tracking

ALTER DATABASE HR SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE HR SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);
GO
USE HR;
GO
ALTER TABLE dbo.Employees ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = OFF);
GO

At the database level, the option CHANGE_RETENTION indicates how long you keep information about rows that have changed. If the applications last checked for changed data before that period started, then they will need to proceed as if the entire table is brand new (so a caching application, for example, will need to reload the entire table and start from scratch). AUTO_CLEANUP is the option that specifies whether this periodic purging should take place, and it can be disabled for troubleshooting purposes. Although this sounds like something that requires SQL Server Agent, it is handled by an internal background task. It will work on all editions of SQL Server, including Express Edition, with or without SQL Server Agent enabled.

At the table level, the TRACK_COLUMNS_UPDATED option is used to specify whether the system should store information about which columns were changed, or store the fact that the row was changed. The former can be useful for an application that tries to synchronize or cache data from a table that contains both an INT column and a LOB column (for example, VARCHAR(MAX)). Instead of pulling an identical copy of the LOB column for a row that changed, it can ignore that column and keep its local copy if it knows that it was not a part of any update that has happened since it was last loaded.
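Listing 7 sets TRACK_COLUMNS_UPDATED to OFF. If you do turn it on, the per-column information surfaces as the SYS_CHANGE_COLUMNS mask returned by the CHANGETABLE function discussed in a moment, and you can test a specific column with CHANGE_TRACKING_IS_COLUMN_IN_MASK. A minimal sketch (not from the chapter), assuming the option is ON and that @baseline holds the version your application stored after its last load:

DECLARE @baseline BIGINT = 0;  -- hypothetical stored baseline version

SELECT ct.EmployeeID,
       SalaryChanged = CHANGE_TRACKING_IS_COLUMN_IN_MASK
           ( COLUMNPROPERTY(OBJECT_ID('dbo.Employees'), 'Salary', 'ColumnId'),
             ct.SYS_CHANGE_COLUMNS )
FROM CHANGETABLE(CHANGES dbo.Employees, @baseline) AS ct
WHERE ct.SYS_CHANGE_OPERATION = 'U';

With the option OFF, SYS_CHANGE_COLUMNS is simply NULL, so a caching application has to assume that every column may have changed.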
Once change tracking is enabled, what an application can do is connect to the database, and determine the current baseline version of the table. This is a BIGINT value that is returned by calling the new function CHANGE_TRACKING_CURRENT_VERSION() (this represents the most recent committed transaction). Once the application knows this value, it can load all of the data from the table, and then can check for further updates later using the CHANGETABLE() function. This function will return a set of data representing any rows that have changed in the specified table since the baseline version retrieved above. The following is all in T-SQL, but you can envision how an application would use the same logic. Open two new query windows in Management Studio, connected to the HR database, and run the code in listing 8.

Listing 8  Determining (and updating) the baseline version of a table

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

-- Check the current baseline:
SELECT Baseline = CHANGE_TRACKING_CURRENT_VERSION();

-- Load the current version of the table:
SELECT EmployeeID, FirstName, LastName, Salary
FROM dbo.Employees;

-- Now, switch to the second query window, and make some updates to the table:
UPDATE dbo.Employees SET LastName = 'Kinison' WHERE EmployeeID = 2;

DELETE dbo.Employees WHERE EmployeeID = 5;

INSERT dbo.Employees ( EmployeeID, FirstName, LastName, Salary )
SELECT 6, 'Kirby', 'Quigley', 62500;

Listing 9 shows the code to retrieve the changes made to the Employees table. Replace <baseline> with the result from the baseline query in listing 8.

Listing 9  Retrieving changes to the Employees table

SELECT NewBaseLine = CHANGE_TRACKING_CURRENT_VERSION(),
       cv  = SYS_CHANGE_VERSION,
       ccv = SYS_CHANGE_CREATION_VERSION,
       op  = SYS_CHANGE_OPERATION,
       EmployeeID
FROM CHANGETABLE(CHANGES dbo.Employees, <baseline>) AS ChT;

The results should look something like this:

NewBaseLine  cv  ccv   op  EmployeeID
3            3   NULL  U   2
3            2   NULL  D   5
3            1   1     I   6

Now, the application can use this output to determine which of the following it needs to do (a sketch of combining these steps into a single query follows the list):

- Reload the data from the row for EmployeeID 2, because it has been updated since the last time it was loaded.
- Load the new row for EmployeeID 6, because it has been created since the table was last loaded.
- Remove the row for EmployeeID 5 from its local cache, because it has been deleted.
- Record the new baseline value (3) as it will need to use this as the version to check against the next time it polls for changes.
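Rather than issuing one query per changed key, a caching application can pull the current values for every changed row in a single round trip by joining the CHANGETABLE output back to the base table. A minimal sketch (not from the chapter), assuming @baseline holds the value returned by the baseline query in listing 8:

DECLARE @baseline BIGINT = 0;  -- hypothetical; use the value returned in listing 8

SELECT op = ct.SYS_CHANGE_OPERATION,
       ct.EmployeeID,
       e.FirstName,
       e.LastName,
       e.Salary
FROM CHANGETABLE(CHANGES dbo.Employees, @baseline) AS ct
LEFT OUTER JOIN dbo.Employees AS e
       ON e.EmployeeID = ct.EmployeeID;

Deleted rows come back with NULLs from the outer join, so the application knows to purge them from its cache; inserted and updated rows carry the current column values to apply.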
Note that change tracking does not record any information about the user who made the change. It only records the fact that a row changed. But as with SQL Server Audit, there are ways around this limitation. SQL Server 2008 supports new syntax to allow change tracking to add some contextual information to the DML statement, allowing that data to be stored along with other details of the change. This was intended to provide an application the ability to differentiate its own updates from those of other applications, but you can use it for whatever other devious purposes you can dream up. For example, as shown in listing 10, you can easily add information such as host name and user name using an INSTEAD OF UPDATE TRIGGER, by utilizing the new WITH CHANGE_TRACKING_CONTEXT() construct, in order to store information about users performing updates to your table.

Listing 10  Using WITH CHANGE_TRACKING_CONTEXT() in an INSTEAD OF trigger

CREATE TRIGGER dbo.AppendEmployeeUpdates
ON dbo.Employees
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @i VARBINARY(128);
    SET @i = CONVERT(VARBINARY(128), SUSER_SNAME() + '|' + HOST_NAME());
    WITH CHANGE_TRACKING_CONTEXT (@i)
    UPDATE e
        SET e.FirstName = i.FirstName,
            e.LastName  = i.LastName,
            e.Salary    = i.Salary
        FROM dbo.Employees e
        INNER JOIN inserted i
            ON e.EmployeeID = i.EmployeeID;
END
GO
In this case, because you are not tracking individual column updates, you don’t have to worry about only updating those columns that have changed. If you do implement a solution where individual columns matter, you might want more complex logic such that the trigger only touches the base table columns that should now contain different values. And for an even more bulletproof trigger, you would also want to handle the case where the primary key might change (even though, in theory, this should never happen). You could do this in a stored procedure instead, if you can prevent direct updates to the table itself, and enforce all access via stored procedures. That is possible in some environments, but not all.

Once the trigger is in place, you can run the following UPDATE statement:

UPDATE dbo.Employees SET LastName = 'Malone' WHERE EmployeeID = 2;

And now when you call the CHANGETABLE function, as shown in listing 11, you can add a new column that will return that contextual information (assuming the existing baseline was 3 after the above statements).

Listing 11  Calling the CHANGETABLE function

SELECT NewBaseLine = CHANGE_TRACKING_CURRENT_VERSION(),
       [user|host] = CONVERT(NVARCHAR(128), SYS_CHANGE_CONTEXT),
       cv  = SYS_CHANGE_VERSION,
       ccv = SYS_CHANGE_CREATION_VERSION,
       op  = SYS_CHANGE_OPERATION,
       EmployeeID
FROM CHANGETABLE(CHANGES dbo.Employees, 3) AS ChT;

-- results:

NewBaseLine  user|host                cv  ccv   op  EmployeeID
4            SENTINEL\Aaron|SENTINEL  11  NULL  U   2

Arguably, you could also use the trigger to store the old and new values off in a table somewhere for deferred analysis. But that would require you to manually create tables to capture all of that information, and come up with your own cleanup mechanism. And, without spoiling any surprises, you would be duplicating the functionality of another feature added in SQL Server 2008.

Before proceeding, you can disable change tracking on the HR database and the dbo.Employees table using the code in listing 12.

Listing 12  Disabling change tracking

USE [HR];
GO
ALTER TABLE dbo.Employees DISABLE CHANGE_TRACKING;
ALTER DATABASE HR SET CHANGE_TRACKING = OFF;
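If you want to confirm that the cleanup in listing 12 worked (or simply see what is still being tracked elsewhere), the catalog views sys.change_tracking_databases and sys.change_tracking_tables list everything that has change tracking enabled. A quick check along these lines (not from the chapter) might look like this:

-- Databases with change tracking enabled
SELECT d.name, ctd.retention_period, ctd.retention_period_units_desc
FROM sys.change_tracking_databases AS ctd
INNER JOIN sys.databases AS d
    ON d.database_id = ctd.database_id;

-- Tables with change tracking enabled in the current database
SELECT t.name, ctt.is_track_columns_updated_on
FROM sys.change_tracking_tables AS ctt
INNER JOIN sys.tables AS t
    ON t.[object_id] = ctt.[object_id];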
Change data capture

Change data capture (CDC) is similar to change tracking in that it captures information about changes to data. But the information it captures (and how) is significantly different. Instead of capturing the primary key for each row that has changed, it records the data that has changed, for all columns, or for the subset of columns you specify. It records all of the data for INSERTs and DELETEs, and in the case of UPDATEs, it records both the before and after image of the row. And it does this by periodically retrieving data from the SQL transaction log, so the activity does not interfere directly with your OLTP processes. It does require that SQL Server Agent is enabled and running.

The primary motivation for including CDC in SQL Server 2008 was to facilitate an easier process for extract, transform, and load (ETL) applications. Making all of the changed data available separately allows the application to pull only the updated data, without having to go to the base tables for the data (or to verify timestamp columns or perform expensive joins to determine deltas). You can investigate CDC in much more depth starting with the Books Online topic, “Overview of Change Data Capture,” located at http://msdn.microsoft.com/en-us/library/cc627397.aspx.

To set up CDC, you must be running Enterprise or Developer Edition, and you must enable it at the database level first, and then for each table you want to capture. Note that unlike SQL Server Audit and change tracking, CDC features are enabled and disabled via system stored procedure calls. Using the same HR database and dbo.Employees table as in previous sections, listing 13 shows the commands necessary to start capturing data changes.

Listing 13  Enabling a database and table for change data capture

USE HR;
GO
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Employees',
    @supports_net_changes = 1,
    @role_name            = NULL;
GO
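If you want to confirm that the calls took effect, you can check the metadata flags and the jobs that CDC creates. This quick check is not part of the chapter, but might look like the following:

SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = 'HR';

SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE name = 'Employees';

-- CDC also creates a capture job and a cleanup job; this lists their settings
EXEC sys.sp_cdc_help_jobs;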
The first two parameters to the enable_table stored procedure are self-explanatory, but the last two are not. The @supports_net_changes parameter dictates whether the changed data can be retrieved as a data set that includes one row per key value, summarizing all of the changes that took place in the indicated timeframe (in this way, it works similarly to change tracking, but you will also see the data in each column in addition to the primary key value). Note that to support net changes, the source table must have a primary key or a unique index defined. You will still be able to investigate each individual change, but if you look at the net, this will allow you to perform one DML statement on the target instead of multiple, in the case where your extract, transform, and load (ETL) program is replicating changes to another system. The @role_name parameter is used to specify who can access the changed data table. This can be a fixed server role, a database role, or left as NULL (in which case, sysadmin and db_owner have full access, and other users inherit their permissions from the base table).

The sys.sp_cdc_enable_table stored procedure has five other optional parameters. One is called @captured_column_list, which allows you to capture only changes to a specific subset of the columns. For example, you may not want to capture VARCHAR(MAX) or VARBINARY(MAX) contents, when all that has changed is a BIT column. The other is @filegroup_name, which lets you place the captured data on a filegroup other than PRIMARY/DEFAULT. The other three are @capture_instance, which allows you to specify a name for your CDC instance (because you can have multiple captures on the same table); @index_name, allowing you to specify a unique index instead of the primary key; and @allow_partition_switch, which lets you dictate whether partition switches are allowed against the source table.

The @capture_instance parameter can be particularly useful in preventing the system from trying to create conflicting names for the capture instance table. For example, if you have a table called dbo_foo.bar and another table called dbo.foo_bar, enabling both for CDC, without specifying a value for @capture_instance, will fail. This is because CDC tries to name both capture tables “dbo_foo_bar.” Although this is a fairly contrived case, if in doubt, use the @capture_instance parameter to ensure you have unique names.

To retrieve information about data changes to a table, you use the new CDC functions cdc.fn_cdc_get_all_changes_<capture_instance> and, if you have enabled net changes, cdc.fn_cdc_get_net_changes_<capture_instance>. These functions require from and to parameters, but they are not based on time; instead you must determine the range of log sequence numbers (LSNs) that you wish to query. To obtain this information, you can use the function sys.fn_cdc_map_time_to_lsn.

Now that CDC is enabled for the dbo.Employees table (make sure once again that SQL Server Agent is running), you can make some changes to the data, and see how you (or your applications) might query for the individual or net changes. Run the DML statements in listing 14.
Listing 14  Inserting data into the Employees table

SELECT CURRENT_TIMESTAMP;

INSERT dbo.Employees ( EmployeeID, FirstName, LastName, Salary )
SELECT 7, 'Howard', 'Jones', 80000;

UPDATE dbo.Employees SET LastName = 'Donaldson' WHERE EmployeeID = 3;
UPDATE dbo.Employees SET Salary = Salary * 2 WHERE EmployeeID = 4;
DELETE dbo.Employees WHERE EmployeeID = 6;
UPDATE dbo.Employees SET LastName = 'Stern' WHERE EmployeeID = 7;
UPDATE dbo.Employees SET LastName = 'Malone' WHERE EmployeeID = 3;

Be sure to copy the result from the very first line in the query. You will need this to determine the range of LSNs you will need to pull from the CDC table. Now you can run the query in listing 15.

Listing 15  Query against (and results from) a change data capture function

DECLARE @start DATETIME,
        @end   DATETIME,
        @lsn_A BINARY(10),
        @lsn_B BINARY(10);

SELECT @start = '',   -- paste the CURRENT_TIMESTAMP value copied from listing 14 here
       @end   = CURRENT_TIMESTAMP,
       @lsn_A = sys.fn_cdc_map_time_to_lsn('smallest greater than', @start),
       @lsn_B = sys.fn_cdc_map_time_to_lsn('largest less than', @end);

SELECT operation = CASE __$operation
                        WHEN 1 THEN 'D'
                        WHEN 2 THEN 'I'
                        WHEN 4 THEN 'U'
                        ELSE NULL
                   END,
       EmployeeID, FirstName, LastName, Salary
FROM cdc.fn_cdc_get_all_changes_dbo_Employees(@lsn_A, @lsn_B, 'all');

-- result:

operation  EmployeeID  FirstName  LastName   Salary
I          7           Howard     Jones      80000.00
U          3           Don        Donaldson  125000.00
U          4           Teemu      Selanne    227000.00
D          6           Kirby      Quigley    62500.00
U          7           Howard     Stern      80000.00
U          3           Don        Malone     125000.00

This result set does not include the before images of rows affected by UPDATE statements, because it is intended to be used to make a target data source look like the source. Hopefully you can see here that it would be easy to reconstruct all of the DML statements, in order to apply the same changes to another table that looked identical to this one before you made changes. If you change the final SELECT to use the get_net_changes function instead, as shown in listing 16, you can see that the set is compressed. Only the values necessary to make the target table look like the source (with one row per key) are included.
Listing 16  Using the get_net_changes function

SELECT operation = CASE __$operation
                        WHEN 1 THEN 'D'
                        WHEN 2 THEN 'I'
                        WHEN 4 THEN 'U'
                        ELSE NULL
                   END,
       EmployeeID, FirstName, LastName, Salary
FROM cdc.fn_cdc_get_net_changes_dbo_Employees(@lsn_A, @lsn_B, 'all');

-- result:

operation  EmployeeID  FirstName  LastName  Salary
U          4           Teemu      Selanne   227000.00
D          6           Kirby      Quigley   62500.00
I          7           Howard     Stern     80000.00
U          3           Don        Malone    125000.00

And finally, as per listing 17, you can see the before and after image of each key row throughout all updates by looking directly at the CDC table.

Listing 17  Viewing the before and after image of each key row

SELECT [image] = CASE __$operation
                      WHEN 3 THEN 'BEFORE'
                      WHEN 4 THEN 'AFTER'
                      ELSE NULL
                 END,
       EmployeeID, FirstName, LastName, Salary
FROM cdc.dbo_Employees_CT
WHERE __$operation IN (3,4)
ORDER BY __$start_lsn, __$operation;

-- result:

image   EmployeeID  FirstName  LastName   Salary
BEFORE  3           Don        Mattingly  125000.00
AFTER   3           Don        Donaldson  125000.00
BEFORE  4           Teemu      Selanne    113500.00
AFTER   4           Teemu      Selanne    227000.00
BEFORE  7           Howard     Jones      80000.00
AFTER   7           Howard     Stern      80000.00
BEFORE  3           Don        Donaldson  125000.00
AFTER   3           Don        Malone     125000.00

One challenge you might come across is when your schema changes. In this case you will need to disable CDC for the table and re-enable it when the change is complete. CDC will not break without this action, but if you add, remove, or rename columns, your captured data will be incomplete.

Also, because change tracking and SQL Server Audit are synchronous, and CDC polls the transaction log after the fact, it is not so straightforward to capture the user name responsible for the change. If this is an important part of your solution, then you are probably better off sticking to one of the other features discussed in this chapter, or resorting to more traditional means (for example, triggers, log reading utilities).

To clean up the CDC settings, you can use the code in listing 18.
Listing 18  Cleaning up change data capture settings

USE HR;
GO
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'Employees',
    @capture_instance = 'dbo_Employees';
GO
EXEC sys.sp_cdc_disable_db;
GO

Comparison of features

At first glance, these three new features in SQL Server 2008 seem quite similar. As demonstrated here, their functionality may overlap in some cases, but they are clearly different and serve unique purposes. This treatment should help equip you with much of the information you will need to decide which feature you will need to use. To wrap up, table 1 should help you decide whether to use SQL Server Audit, change tracking, or CDC.

Table 1  Comparing SQL Server Audit, change tracking, and change data capture

Criteria                              SQL Server Audit   Change tracking   Change data capture
Capture DML statements                Yes (1)            No                No
Capture result of DML statements      No                 Yes               Yes
Capture before and after values       No                 No                Yes
Capture intermediate values           No                 No                Yes
Capture SELECT statements             Yes                No                No
Capture user name / spid              Yes                Yes (4)           No
Capture host name                     Yes (2)            Yes (4)           No
Should use specific isolation level   No                 Yes (5)           No
Require SQL Server Agent              No                 No                Yes
Available in all SKUs                 No (3)             Yes               No (3)

1. You can see a tokenized copy of the DML statement, but the values in the statement are replaced by parameter placeholders.
2. You can capture host name in a separate login audit event, then correlate it manually with the event in question.
3. This feature is available in Enterprise, Evaluation, and Developer Editions only.
4. You can capture this information using a trigger to affect the context information included with the change tracking data.
5. Using snapshot isolation level can significantly impact tempdb usage and performance. Additionally, this may be a concern if you use distributed transactions, change schema frequently, disable constraints when bulk loading, or take databases offline (for example, detach or auto-close). You should read up on snapshot isolation level in Books Online: http://msdn.microsoft.com/en-us/library/ms177404(SQL.100).aspx
Summary

SQL Server 2008 provides a healthy offering of features that can assist you in tracking and dealing with changes to your data and schema. My goal for this chapter was to provide a useful and practical guide to help you decide how these features might help solve data management issues in your environment. Hopefully this will give you a good starting point on implementing one or more of these features where you need it most.

About the author

Aaron Bertrand is the Senior Data Architect at One to One Interactive, a global marketing agency headquartered in Boston, Massachusetts. At One to One, Aaron is responsible for database design and application architecture. Due to his commitment to the community, shown through blogging at http://www.sqlblog.com, peer-to-peer support on forums and newsgroups, and speaking at user group meetings and code camps, he has been awarded as a Microsoft MVP since 1998. Aaron recently published a technical white paper for Microsoft, detailing how to use the new Resource Governor feature in SQL Server 2008.
CHAPTER 54  Introduction to SSAS 2008 data mining
Dejan Sarka

With SQL Server 2008, you get a complete business intelligence (BI) suite. You can use the SQL Server Database Engine to maintain a data warehouse (DW), SQL Server Reporting Services (RS) to create managed and ad hoc reports, SQL Server Integration Services (SSIS) to build and use extract, transform, and load (ETL) applications, and SQL Server Analysis Services (SSAS) to create Unified Dimensional Model (UDM) cubes.

Probably the easiest step into business intelligence is using reports created with RS. But this simplicity has a price. End users have limited dynamic capabilities when they view a report. You can extend the capabilities of RS with report models, but using report models to build reports is an advanced skill for end users. You also have to consider that the performance is limited; for example, aggregating two years of sales data from a production database could take hours. Therefore, RS reports aren’t useful for analyses of large quantities of data over time directly from production systems.

In order to enable end users to do dynamic analysis—online analytical processing (OLAP)—you can implement a data warehouse and SSAS UDM cubes. In addition to dynamic change of view, end users also get lightning-speed analyses. End users can change the view of information in real time, drilling down to see more details or up to see summary information. But they’re still limited with OLAP analyses. Typically, there are too many possible combinations of drilldown paths, and users don’t have time to examine all possible graphs and pivot tables using all possible attributes and hierarchies. In addition, analysts are limited to searching only for patterns they anticipate. OLAP analysis is also usually limited to basic mathematical operations, such as comparing sums over different groups, operations that end users can solve graphically through client tool GUI.

Data mining (DM) addresses most of these limitations. In short, data mining is data-driven analysis. When you create a DM model, you don’t anticipate results in
advance. You examine data with advanced mathematical methods, using data mining algorithms, and then you examine patterns and rules that your algorithms find. The SSAS data mining engine runs the algorithms automatically after you set up all of the parameters you need; therefore, you can check millions of different pivoting options in a limited time. In this chapter, you're going to learn how to perform data mining analyses with SSAS 2008.

Data mining basics

The first question you may ask yourself is what the term data mining means. In short, data mining enables you to deduce hidden knowledge by examining, or training, your data with data mining algorithms. Algorithms express knowledge found in patterns and rules. Data mining algorithms are based mainly on statistics, although some are based on artificial intelligence and other branches of mathematics and information technology as well.

Nevertheless, the terminology comes mainly from statistics. What you’re examining is called a case, which can be interpreted as one appearance of an entity, or a row in a table. The attributes of a case are called variables. After you find patterns and rules, you can use them to perform predictions.

In SSAS 2008, the DM model is stored in the SSAS database as a kind of a table. It’s not a table in a relational sense, as it can include nested tables in columns. In the model, the information about the variables, algorithms used, and the parameters of the algorithms are stored. Of course, after the training, the extracted knowledge is stored in the model as well. The data used for training isn’t part of the model, but you can enable drillthrough on a model, and use drillthrough queries to browse the source data.

Most of the literature divides DM techniques into two main classes: directed algorithms and undirected algorithms. With a directed approach, you have a target variable that supervises the training in order to explain its values with selected input variables. Then the directed algorithms apply gleaned information to unknown examples to predict the value of the target variable. With the undirected approach, you’re trying to discover new patterns inside the dataset as a whole, without any specific target variable. For example, you use a directed approach to find reasons why users purchased an article and an undirected approach to find out which articles are commonly purchased together.

You can answer many business questions with data mining. Some examples include the following:

- A bank might ask what the credit risk of a customer is.
- A customer relationship management (CRM) application can ask whether there are any interesting groups of customers based on similarity of values of their attributes.
- A retail store might be interested in which products appear in the same market basket.
- A business might be interested in forecasting sales.
- If you maintain a website, you might be interested in usage patterns.
- Credit card issuers would like to find fraudulent transactions.
- Advanced email spam filters use data mining.
- And much, much more, depending on your imagination!

Data mining projects

The Cross-Industry Standard Process for Data Mining (CRISP-DM) defines four main distinct steps of a data mining project. The steps, also shown in figure 1, are as follows:

- Identifying the business problem
- Using DM techniques to transform the data into actionable information
- Acting on the information
- Measuring the result

In the first step, you need to contact business subject matter experts in order to identify business problems. The second step is where you use SQL Server BI suite to prepare the data and train the models on the data. This chapter is focused on the transform step. Acting means using patterns and rules learned in production. You can use data mining models as UDM dimensions; you can use them for advanced SSIS transformations; you can use them in your applications to implement constraints and warnings; you can create RS reports based on mining models and predictions; and more. After deployment in production, you have to measure improvements of your business. You can use UDM cubes with mining model dimensions as a useful measurement tool. As you can see from figure 1, the project doesn’t have to finish here: you can continue it or open a new project with identifying new business problems.

The second step, the transform step, has its own internal cycle. You need to understand your data; you need to make an overview. Then you have to prepare the data for data mining. Then you train your models. If your models don’t give you desired results, you have to return to the data overview phase and learn more about your data, or to the data preparation phase and prepare the data differently.

Figure 1  The CRISP-DM standard process for data mining projects (a cycle: Identify, Transform, Act, Measure)
Data overview and preparation

Overviewing and preparing the data is probably the most exhaustive part of a data mining project. To get a comprehensive overview of your data, you can use many different techniques and tools. You can start with SQL queries and reports, or you can use UDM cubes. In addition, you can use descriptive statistics such as frequency distribution for discrete variables, and the mean value and the spread of the distribution for continuous variables. You can use Data Source View for a quick overview of your variables in table, pivot table, graph, or pivot graph format. Microsoft Office Excel statistical functions, pivot tables, and pivot graphs are useful tools for data overview as well.

After you understand your data, you have to prepare it for data mining. You have to decide what exactly your case is. Sometimes this is a simple task; sometimes it can get quite complex. For example, a bank might decide that a case for analysis is a family, whereas the transaction system tracks data about individual persons only. After you define your case, you prepare a table or a view that encapsulates everything you know about your case. You can also prepare child tables or views and use them as nested tables in a mining model. For example, you can use an “orders header” production table as the case table, and an “order details” table as a nested table if you want to analyze which products are purchased together in a single order. Usually, you also prepare some derived variables. In medicine, for example, the obesity index is much more important for analyses than a person’s bare height and weight. You have to decide what to do with missing values, if there are too many. For example, you can decide to replace them with mean values. You should also check the outliers—rare and far out-of-bounds values—in a column. You can group or discretize a continuous variable in a limited number of bins and thus hide outliers in the first and the last bin.

SSAS 2008 data mining algorithms

SSAS 2008 supports all of the most popular data mining algorithms. In addition, SSIS includes two text mining transformations. Table 1 summarizes the SSAS algorithms and their usage.

Table 1  SSAS 2008 data mining algorithms and usage

Association Rules
    The Association Rules algorithm is used for market basket analysis. It defines an itemset as a combination of items in a single transaction; then it scans the data and counts the number of times the itemsets appear together in transactions. Market basket analysis is useful to detect cross-selling opportunities.

Clustering
    The Clustering algorithm groups cases from a dataset into clusters containing similar characteristics. You can use the Clustering method to group your customers for your CRM application to find distinguishable groups of customers. In addition, you can use it for finding anomalies in your data. If a case doesn’t fit well in any cluster, it’s an exception. For example, this might be a fraudulent transaction.
Table 1  SSAS 2008 data mining algorithms and usage (continued)

Decision Trees
    Decision Trees is the most popular DM algorithm, used to predict discrete and continuous variables. The algorithm uses the discrete input variables to split the tree into nodes in such a way that each node is more pure in terms of target variable—each split leads to nodes where a single state of a target variable is represented better than other states. For continuous predictable variables, you get a piecewise multiple linear regression formula with a separate formula in each node of a tree. A tree that predicts continuous variables is a Regression Tree.

Linear Regression
    Linear Regression predicts continuous variables, using a single multiple linear regression formula. The input variables must be continuous as well. Linear Regression is a simple case of a Regression Tree, a tree with no splits.

Logistic Regression
    As Linear Regression is a simple Regression Tree, a Logistic Regression is a Neural Network without any hidden layers.

Naïve Bayes
    The Naïve Bayes algorithm calculates probabilities for each possible state of the input attribute for every single state of predictable variable. These probabilities are used to predict the target attribute based on the known input attributes of new cases. The Naïve Bayes algorithm is quite simple; it builds the models quickly. Therefore, it’s suitable as a starting point in your prediction project. The Naïve Bayes algorithm doesn’t support continuous attributes.

Neural Network
    The Neural Network algorithm is often associated with artificial intelligence. You can use this algorithm for predictions as well. Neural networks search for nonlinear functional dependencies by performing nonlinear transformations on the data in layers, from the input layer through hidden layers to the output layer. Because of the multiple nonlinear transformations, neural networks are harder to interpret compared to Decision Trees.

Sequence Clustering
    Sequence Clustering searches for clusters based on a model, and not on similarity of cases as Clustering does. The models are defined on sequences of events by using Markov chains. Typical usage of Sequence Clustering would be an analysis of your company’s website usage, although you can use this algorithm on any sequential data.

Time Series
    You can use the Time Series algorithm to forecast continuous variables. Internally, the Time Series uses two different algorithms. For short-term forecasting, the Auto-Regression Trees (ART) algorithm is used. For long-term prediction, the Auto-Regressive Integrated Moving Average (ARIMA) algorithm is used. You can mix the blend of algorithms used by using the mining model parameters.

Creating mining models

After a lengthy introduction, it’s time to start with probably the most exciting part of this chapter—creating mining models. By following the instructions, you can create predictive models. You’re going to use Decision Trees, Naïve Bayes, and Neural Network algorithms on the same mining structure. The scenario for these models is based on the AdventureWorksDW2008 demo database. The fictitious Adventure Works company wants to boost bike sales by using a mailing campaign. The new potential
customers are in the ProspectiveBuyer table. But the company wants to limit sending leaflets only to those potential customers who are likely to buy bikes. The company wants to use existing data to find out which customers tend to buy bikes. This data is already joined together in the vTargetMail view. Therefore, the data preparation task is already finished, and you can start creating mining models following these steps:

1  In Business Intelligence Development Studio (BIDS), create a new Analysis Services project and solution. Name the solution and the project TargetMail.

2  In Solution Explorer, right-click on the Data Sources folder and create a new data source. Use the Native OLE DB\SQL Server Native Client 10.0 provider. Connect to your SQL Server using Windows authentication and select the AdventureWorksDW2008 database. Use the Inherit option for the impersonation information and keep the default name, Adventure Works DW2008, for the data source.

3  Right-click on the Data Source Views folder, and create a new data source view. In the Data Source View Wizard, on the Select a Data Source page, select the data source you just created. In the Select Tables and View pane, select only the vTargetMail view and ProspectiveBuyer table. Keep the default name, Adventure Works DW2008, for the data source view.

4  Right-click the Mining Structures folder and select New Mining Structure. Walk through the wizard using the following options:
   - On the Welcome page of the Data Mining Wizard, click Next.
   - In the Select the Definition Method page, use the existing relational database or data warehouse (leave the first option checked).
   - In the Create the Data Mining Structure window, in the Which Data Mining Technique Do You Want to Use? drop-down list under the Create Mining Structure with a Mining Model option, select the Decision Trees algorithm from the drop-down list (the default).
   - Use Adventure Works DW2008 DSV in the Select Data Source View page.
   - In the Specify Table Types page, select vTargetMail as a case table by clicking the Case check box for this table.

5  By clicking on appropriate check boxes in the Specify the Training Data page, define CustomerKey as a key column (selected by default), BikeBuyer as predictable column, and CommuteDistance, EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region, and TotalChildren as input columns.

6  In the Specify Columns’ Content and Data Type page, click the Detect button. The wizard should detect that all columns, except CustomerKey, have discrete content. Note that if you want to use Age and YearlyIncome attributes in the model, you’d have to discretize them if you don’t want a Regression Tree.

7  In the Create Test Set page, you can specify the percentage of the data or number of cases for the testing set—the holdout data. Use the default splitting,
using 30 percent of data as the test set. You’ll use the test set to evaluate how well different models perform predictions.

8  Type TM as the name of the mining structure and TM_DT as the name of the model in the Completing the Wizard page.

9  Click Finish to leave the wizard and open the Data Mining Designer. Save the project.

You’re going to add two models based on the same structure. In Data Mining Designer, select the Mining Models tab. To add a Naïve Bayes model, right-click on TM_DT, and then select the New Mining Model option. Type TM_NB as the name of the model and select the Microsoft Naive Bayes algorithm. Click OK. To add a Neural Network model, right-click the TM_DT model and again select the New Mining Model option. Type TM_NN as the name of the model and select the Microsoft Neural Network algorithm. Click OK. Save, deploy, and process the complete project. Don’t exit BIDS. Your complete project should look like the one in figure 2.

Figure 2  Predictive models project

Harvesting the results

You created three models, yet you still don’t know what additional information you got, or how you can use it. In this section, we’ll start with examining mining models in BIDS, in the Data Mining Designer, with the help of built-in Data Mining Viewers. The viewers show you patterns and rules in an intuitive way. After the overview of the models, you have to decide which one you’re going to deploy in production. We’ll use the Lift Chart built-in tool to evaluate the models. Finally, we’re going to simulate the