Lecture on Data Mining: Data preprocessing - Trịnh Tấn Đạt
This chapter of the Data Mining lecture series covers data preprocessing: why preprocess the data; descriptive data summarization; data cleaning; data integration and transformation; data reduction; discretization and concept hierarchy generation; ...
- Trịnh Tấn Đạt
  Faculty of Information Technology, Saigon University (Khoa CNTT, Đại Học Sài Gòn)
  Email: trinhtandat@sgu.edu.vn
  Website: https://sites.google.com/site/ttdat88/
- Outline
  - Why preprocess the data?
  - Descriptive data summarization
  - Data cleaning
  - Data integration and transformation
  - Data reduction
  - Discretization and concept hierarchy generation
  - Summary
- Why Data Preprocessing?
  Data in the real world is dirty:
  - Incomplete: lacking attribute values, lacking certain attributes of interest, …
    - e.g., occupation=“ ”
  - Noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - Inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” but Birthday=“03/07/1997”
    - e.g., rating was “1, 2, 3”, now rating is “A, B, C”
    - e.g., discrepancy between duplicate records
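The three kinds of dirtiness listed above can be checked mechanically. The following is a minimal Python sketch, not part of the lecture: the field names (`occupation`, `salary`, `age`, `birthday`), the sample records, and the reference date `TODAY` are all hypothetical, and the rules are deliberately simplistic.

```python
from datetime import date

# Hypothetical records that mirror the slide's examples.
records = [
    {"occupation": "", "salary": 5000, "age": 30, "birthday": date(1994, 3, 7)},
    {"occupation": "clerk", "salary": -10, "age": 28, "birthday": date(1996, 1, 15)},
    {"occupation": "nurse", "salary": 4200, "age": 42, "birthday": date(1997, 3, 7)},
]

def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

def find_issues(record, today):
    """Return a list of data-quality issues found in one record."""
    issues = []
    if not record["occupation"].strip():
        issues.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        issues.append("noisy: negative salary")
    if age_on(record["birthday"], today) != record["age"]:
        issues.append("inconsistent: age does not match birthday")
    return issues

TODAY = date(2024, 6, 1)  # assumed reference date for the age check
report = [find_issues(r, TODAY) for r in records]
```

Each entry of `report` lists the issues for the record at the same position, so an empty list means the record passed all three checks.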
- Why Is Data Dirty?
  - Incomplete data may come from
    - a “Not applicable” data value when collected
    - different considerations between the time the data was collected and the time it is analyzed
    - human, hardware, or software problems
  - Noisy data (incorrect values) may come from
    - faulty data collection instruments
    - human or computer error at data entry
    - errors in data transmission
  - Inconsistent data may come from
    - different data sources
    - functional dependency violations (e.g., modifying some linked data)
  - Duplicate records also need data cleaning
- Why Is Data Preprocessing Important?
  - No quality data, no quality mining results!
    - Quality decisions must be based on quality data, e.g., duplicate or missing data may cause incorrect or even misleading statistics.
    - A data warehouse needs consistent integration of quality data.
  - Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse.
- Multi-Dimensional Measure of Data Quality
  A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Data type
  - Numeric: the most commonly used data type; the stored content is numeric
  - Characters and strings: strings are arrays of characters
  - Boolean: binary data with true and false values
  - Time series data: data with time- or sequence-related properties
    - Sequential data: the data itself has a sequential relationship
    - Time series data: each data value is subject to change over time
- Data type
  - Spatial data: data with spatially related attributes
    - For example, Google Map, Integrated Circuit Design Layout, Wafer Exposure Layout, Global Positioning System (GPS), etc.
  - Text data: paragraph descriptions, including patent reports, diagnostic reports, etc.
    - Structured data: library bibliographic data, credit card data
    - Semi-structured data: email, Extensible Markup Language (XML)
    - Unstructured data: social media data, such as messages on Facebook
  - Multimedia data: pictures, audio, video, etc.; the data volume is massive compared with other data types, so data compression is needed for storage
- Data scale
  - Each variable in the data has a corresponding attribute and scale used to quantify and measure its level
    - natural quantitative scale
    - qualitative scale
  - When it is hard to find the corresponding attribute for a variable, a proxy attribute can be used as the measurement instead
  - Common scales: nominal scale, categorical scale, ordinal scale, interval scale, ratio scale, and absolute scale
  - “A proxy attribute is a variable that is used to represent or stand in for another variable or attribute that is difficult to measure directly. A proxy attribute is typically used in situations where it is not possible or practical to measure the actual attribute of interest. For example, in a study of income, the amount of money a person earns per year may be difficult to determine accurately. In such a case, a proxy attribute, such as education level or occupation, may be used instead.” (ChatGPT)
- Six common scales
  - nominal scale: values are used only as codes and have no meaning for mathematical operations
  - categorical scale: data are grouped according to their characteristics, and each category is marked with a numeric code indicating the category to which it belongs
  - ordinal scale: expresses the ranking and ordering of the data without establishing the degree of variation between them
  - interval scale: also called distance scale; describes numerical differences between values in a meaningful way
  - ratio scale: different values can be compared with each other by ratio
  - absolute scale: the measured numbers have absolute meaning
- Data inspection
  - Goal: inspect the obtained data from different viewpoints to find errors in advance, then correct or remove some of them after discussion with domain experts
  - Data are categorized into quantitative and qualitative aspects
  - Quantitative data
    - Data inspection: number of samples, number of variables or features, and distinct data values
    - Sample size: too few samples may affect the results, while too many samples may affect statistical significance
    - Number of variables: too many variables may make computation slow
  - Qualitative data
    - Inspect central tendency (mean, median, etc.) and variability
    - Inspect data omissions, data noise, etc. in different graphs
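The quantitative checks above (number of samples, number of variables, distinct values, omissions) can be sketched in a few lines of Python. This is an illustrative sketch only: the toy `rows` dataset, the use of `None` to mark a missing value, and the `inspect` helper are assumptions, not something the lecture specifies.

```python
# Hypothetical toy dataset: a list of rows, with None marking a missing value.
rows = [
    {"age": 25, "city": "Hanoi", "income": 1200},
    {"age": 31, "city": None, "income": 1500},
    {"age": 25, "city": "Hue", "income": None},
    {"age": 47, "city": "Hanoi", "income": 900},
]

def inspect(rows):
    """Return sample count, variable count, and per-variable statistics."""
    variables = sorted({key for row in rows for key in row})
    summary = {"n_samples": len(rows), "n_variables": len(variables), "variables": {}}
    for name in variables:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summary["variables"][name] = {
            "missing": len(values) - len(present),  # data omissions
            "distinct": len(set(present)),          # different data values
        }
    return summary

summary = inspect(rows)
```

A summary like this is only a starting point for the discussion with domain experts that the slide calls for; it does not decide which values to correct or remove.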
- Data discovery and visualization
  - Statistical table: a table built according to specific rules after the data has been organized
  - Statistical chart: a graphical representation of various characteristics of statistical data in different graphic styles
  - Data type:
    - Frequency: histogram, bar plot, pie chart
    - Distribution: box plot, Q-Q plot
    - Trends: trend chart
    - Relationships: scatter plot
  - Different data categories have different statistical charts
    - Categorical data: bar chart applicable
    - Continuous data: histogram and pie chart applicable
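Before any chart is drawn, the counts behind it have to be computed. The sketch below, which is an illustration rather than the lecture's method, shows the frequency table behind a bar plot of categorical data and the equal-width bin counts behind a histogram of continuous data. The sample values and the helper names are hypothetical, and `histogram_counts` assumes the data contain at least two distinct values.

```python
from collections import Counter

def category_frequencies(values):
    """Counts behind a bar plot of categorical data."""
    return Counter(values)

def histogram_counts(values, n_bins):
    """Counts behind a histogram: equal-width bins over [min, max].

    Assumes max(values) > min(values), so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

colors = ["red", "blue", "red", "green", "red", "blue"]   # categorical
heights = [150, 152, 160, 165, 170, 171, 180, 190]        # continuous
freq = category_frequencies(colors)
bins = histogram_counts(heights, n_bins=4)
```

A plotting library would normally do this binning itself; computing it by hand makes the difference between the two chart types explicit.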
- Major Tasks in Data Preprocessing
  - Data cleaning
    - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  - Data integration
    - Integration of multiple databases, data cubes, or files
  - Data transformation
    - Normalization and aggregation
  - Data reduction
    - Obtains a reduced representation in volume that produces the same or similar analytical results
  - Data discretization
    - Part of data reduction but with particular importance, especially for numerical data
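Three of the tasks above can be chained on a single numeric attribute. This is a minimal sketch under stated assumptions: mean imputation stands in for cleaning, min-max scaling for transformation, and equal-width binning for discretization. These are common choices, not techniques prescribed at this point in the lecture, and all function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value in [0, 1] to a bin index."""
    return [min(int(v * n_bins), n_bins - 1) for v in values]

raw = [10, None, 30, 50]
cleaned = fill_missing_with_mean(raw)
scaled = min_max_normalize(cleaned)
binned = equal_width_bins(scaled, n_bins=2)
```

The order matters here: the normalization step cannot handle `None`, so cleaning has to run first.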
- Forms of Data Preprocessing
- Descriptive data summarization
- Mining Data Descriptive Characteristics
  - Motivation
    - To better understand the data: central tendency, variation, and spread
  - Data dispersion characteristics
    - median, max, min, quantiles, outliers, variance, etc.
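The dispersion characteristics named above can be computed with Python's standard `statistics` module. The sketch below is illustrative: the sample data are made up, the quartiles use the module's `"inclusive"` interpolation method, and outliers are flagged with the common 1.5 × IQR convention, which is an assumption here rather than a rule stated on the slide.

```python
import statistics

def dispersion_summary(data):
    """Five-number summary, sample variance, and IQR-based outliers."""
    ordered = sorted(data)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # assumed outlier convention
    high_fence = q3 + 1.5 * iqr
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data with one extreme value
summary = dispersion_summary(sample)
```

Different quantile definitions give slightly different quartiles on small samples, so the exact fences depend on the interpolation method chosen.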
- Measuring the Central Tendency
  - Mean (algebraic measure), sample vs. population: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ vs. $\mu = \frac{\sum x}{N}$
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Median: a holistic measure
    - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Mode
    - Value that occurs most frequently in the data
    - Unimodal, bimodal, trimodal
    - Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
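The measures above map directly onto Python's standard `statistics` module, with the weighted mean written out by hand from its formula. This is a small sketch with made-up `values` and `weights`; it is meant only to make the definitions concrete.

```python
import statistics

values = [1, 2, 2, 3, 4, 7, 9]    # hypothetical observations x_i
weights = [1, 1, 1, 1, 1, 2, 2]   # hypothetical weights w_i

# Arithmetic mean: sum of x_i divided by n.
mean = statistics.mean(values)

# Weighted arithmetic mean: sum of w_i * x_i divided by sum of w_i.
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Median: middle value of the sorted data (n is odd here).
median = statistics.median(values)

# Mode(s): multimode returns every most-frequent value, so a bimodal
# or trimodal dataset would yield two or three entries.
modes = statistics.multimode(values)
```

With equal weights the weighted mean reduces to the ordinary mean, which is a quick sanity check on the formula.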
- Symmetric vs. Skewed Data
  - Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
  - Data Mining: Concepts and Techniques
- Four moments of distribution: Mean, Variance, Skewness, and Kurtosis
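The four moments can be computed directly from their definitions. The sketch below makes two assumptions the slide does not state: it uses population (divide by n) central moments, and it reports excess kurtosis (kurtosis minus 3, so a normal distribution scores 0). The helper name and sample data are hypothetical.

```python
def four_moments(data):
    """Return mean, population variance, skewness, and excess kurtosis."""
    n = len(data)
    mean = sum(data) / n
    # Central moments of order 2, 3, and 4 (population form, divide by n).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, m2, skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # long tail toward large values

sym_mean, sym_var, sym_skew, sym_kurt = four_moments(symmetric)
_, _, skew_right, _ = four_moments(right_skewed)
```

For the symmetric sample the skewness is zero, while the sample with a long right tail gives a positive skewness, matching the positively skewed case in the lecture.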