Lecture on Data Mining: Data Preprocessing - Trịnh Tấn Đạt

File type: PDF | Pages: 71


Lecture on Data Mining: Data Preprocessing. This chapter covers: why preprocess the data; descriptive data summarization; data cleaning; data integration and transformation; data reduction; discretization and concept hierarchy generation; and more. See the lecture text for details.


Text content: Lecture on Data Mining: Data Preprocessing - Trịnh Tấn Đạt

  1. Trịnh Tấn Đạt, Faculty of Information Technology, Saigon University. Email: trinhtandat@sgu.edu.vn. Website: https://sites.google.com/site/ttdat88/
  2. Outline
     - Why preprocess the data?
     - Descriptive data summarization
     - Data cleaning
     - Data integration and transformation
     - Data reduction
     - Discretization and concept hierarchy generation
     - Summary
  3. Why Data Preprocessing?
     - Data in the real world is dirty:
       - incomplete: lacking attribute values, lacking certain attributes of interest, … (e.g., occupation=“ ”)
       - noisy: containing errors or outliers (e.g., Salary=“-10”)
       - inconsistent: containing discrepancies in codes or names
         - e.g., Age=“42” Birthday=“03/07/1997”
         - e.g., was rating “1, 2, 3”, now rating “A, B, C”
         - e.g., discrepancy between duplicate records
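
The three kinds of dirtiness listed above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the record layout, the `find_issues` helper, and the reference date are hypothetical, and the birthday “03/07/1997” is assumed to mean 3 July 1997.

```python
from datetime import date

# Hypothetical records modeled on the slide's examples.
records = [
    {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)},
    {"occupation": "engineer", "salary": 5000, "age": 27, "birthday": date(1997, 7, 3)},
]

def find_issues(rec, today):
    """Return a list of data-quality issues found in one record."""
    issues = []
    # Incomplete: the attribute value is missing or blank.
    if not str(rec["occupation"]).strip():
        issues.append("incomplete: occupation is empty")
    # Noisy: the value is outside its plausible range.
    if rec["salary"] < 0:
        issues.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    b = rec["birthday"]
    implied_age = today.year - b.year - ((today.month, today.day) < (b.month, b.day))
    if implied_age != rec["age"]:
        issues.append("inconsistent: age does not match birthday")
    return issues

reference_date = date(2024, 9, 1)  # assumed analysis date
report = [find_issues(r, reference_date) for r in records]
```

Here the first record triggers all three checks and the second triggers none. Real pipelines would derive the plausible ranges and consistency rules from domain knowledge rather than hard-coding them.
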
  4. Why Is Data Dirty?
     - Incomplete data may come from:
       - “Not applicable” data values when collected
       - Different considerations between the time the data was collected and the time it is analyzed
       - Human/hardware/software problems
     - Noisy data (incorrect values) may come from:
       - Faulty data collection instruments
       - Human or computer error at data entry
       - Errors in data transmission
     - Inconsistent data may come from:
       - Different data sources
       - Functional dependency violation (e.g., modifying some linked data)
     - Duplicate records also need data cleaning
  5. Why Is Data Preprocessing Important?
     - No quality data, no quality mining results!
       - Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
       - A data warehouse needs consistent integration of quality data
     - Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
  6. Multi-Dimensional Measure of Data Quality
     - A well-accepted multidimensional view:
       - Accuracy
       - Completeness
       - Consistency
       - Timeliness
       - Believability
       - Value added
       - Interpretability
       - Accessibility
  7. Data type
     - Numeric: the most used data type; the stored content is numeric
     - Characters and strings: strings are arrays of characters
     - Boolean: for binary data with true and false values
     - Time series data: data with time- or sequence-related properties
       - Sequential data: the data itself has a sequential relationship
       - Time series data: each value changes over time
  8. Data type (continued)
     - Spatial data: data with spatially related attributes
       - For example, Google Map, Integrated Circuit Design Layout, Wafer Exposure Layout, Global Positioning System (GPS), etc.
     - Text data: paragraph descriptions, including patent reports, diagnostic reports, etc.
     - Structured data: library bibliographic data, credit card data
     - Semi-structured data: email, extensible markup language (XML)
     - Unstructured data: social media messages, e.g., on Facebook
     - Multimedia data: pictures, audio, video, etc.; much larger in volume than other data types, so compression is needed for storage
  9. Data scale
     - Each variable has a corresponding attribute and scale used to quantify and measure its level:
       - natural quantitative scale
       - qualitative scale
     - When it is hard to find a corresponding attribute for a variable, a proxy attribute can be used instead as the measurement
     - Common scales: nominal scale, categorical scale, ordinal scale, interval scale, ratio scale, and absolute scale
     - “A proxy attribute is a variable that is used to represent or stand in for another variable or attribute that is difficult to measure directly. A proxy attribute is typically used in situations where it is not possible or practical to measure the actual attribute of interest. For example, in a study of income, the amount of money a person earns per year may be difficult to determine accurately. In such a case, a proxy attribute, such as education level or occupation, may be used instead.” (ChatGPT)
  10. Six common scales
     - nominal scale: values are used only as codes and have no meaning for mathematical operations
     - categorical scale: data are grouped according to their characteristics, and each category is marked with a numeric code indicating the category to which an item belongs
     - ordinal scale: expresses the ranking and ordering of the data without establishing the degree of variation between values
     - interval scale: also called distance scale; describes numerical differences between values in a meaningful way
     - ratio scale: different values can be compared to each other by ratio
     - absolute scale: the measured numbers have absolute meaning
  11. Data inspection
     - Goal: inspect the obtained data from different viewpoints to find errors in advance, then correct or remove some of them after discussion with domain experts
     - Data are inspected in quantitative and qualitative aspects
     - Quantitative data
       - Data inspection: number of samples, number of variables or features, and the different data values
       - Sample sizes: too few samples may affect the results, while too many samples may affect statistical significance
       - Variable sizes: too many variables may increase computation time
     - Qualitative data
       - Inspect central tendency (mean, median, etc.) and variability
       - Inspect data omissions, data noise, etc. in different graphs
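
The quantitative checks above (sample count, variable count, distinct values, omissions) can be sketched in a few lines. This is an illustrative sketch under assumptions: the `rows` data and the `inspect` helper are hypothetical, and a missing value is assumed to be stored as `None`.

```python
# Hypothetical data set: one dict per sample, None marks a missing value.
rows = [
    {"age": 25, "city": "Hanoi"},
    {"age": None, "city": "Hue"},
    {"age": 31, "city": "Hanoi"},
]

def inspect(rows):
    """Summarize sample count, variable count, and per-column value counts."""
    columns = sorted({key for row in rows for key in row})
    summary = {"n_samples": len(rows), "n_variables": len(columns), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary["columns"][col] = {
            "missing": len(values) - len(present),   # data omissions
            "distinct": len(set(present)),           # different data values
        }
    return summary

summary = inspect(rows)
```

A summary like this is only a starting point; which errors to correct or remove is still a decision to make with domain experts, as the slide notes.
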
  12. Data discovery and visualization
     - Statistical table: a table built according to specific rules after the data have been organized
     - Statistical chart: graphical representation of various characteristics of statistical data in different graphic styles
     - Charts by purpose:
       - Frequency: histogram, bar plot, pie chart
       - Distribution: box plot, Q-Q plot
       - Trends: trend chart
       - Relationships: scatter plot
     - Different data categories call for different statistical charts
       - Categorical data: bar chart applicable
       - Continuous data: histogram and pie chart applicable
  13. Major Tasks in Data Preprocessing
     - Data cleaning
       - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
     - Data integration
       - Integration of multiple databases, data cubes, or files
     - Data transformation
       - Normalization and aggregation
     - Data reduction
       - Obtains a representation reduced in volume that produces the same or similar analytical results
     - Data discretization
       - Part of data reduction but with particular importance, especially for numerical data
  14. Forms of Data Preprocessing
  15. Descriptive data summarization
  16. Mining Data Descriptive Characteristics
     - Motivation
       - To better understand the data: central tendency, variation and spread
     - Data dispersion characteristics
       - median, max, min, quantiles, outliers, variance, etc.
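
The dispersion characteristics listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up data; the 1.5 × IQR outlier fence is a common convention assumed here, not something the slide specifies.

```python
import statistics

# Hypothetical sample, already sorted for readability.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

spread = {
    "min": min(data),
    "max": max(data),
    "median": q2,
    "variance": statistics.variance(data),  # sample variance (n - 1 denominator)
}
```

With this sample, only the value 110 falls outside the fences.
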
  17. Measuring the Central Tendency
     - Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
       - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
     - Median: a holistic measure
       - Middle value if there is an odd number of values, or the average of the middle two values otherwise
     - Mode
       - Value that occurs most frequently in the data
       - Unimodal, bimodal, trimodal
       - Empirical formula: mean − mode = 3 × (mean − median)
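
As a worked illustration of these measures, the sketch below computes them with Python's standard library. The values and weights are made up for the example. The empirical formula is only an approximation for moderately skewed unimodal data, so both sides are computed separately rather than assumed equal.

```python
import statistics

# Hypothetical sample and weights.
values = [1, 2, 2, 3, 4, 7, 9]
weights = [1, 1, 1, 1, 1, 2, 2]

# Arithmetic mean: sum of the values divided by their count.
mean = statistics.fmean(values)

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Median: middle value (odd count) or average of the middle two (even count).
median = statistics.median(values)

# Mode(s): most frequent value(s); more than one entry means bimodal, trimodal, ...
modes = statistics.multimode(values)

# Both sides of the empirical relation mean - mode ≈ 3 * (mean - median).
lhs = mean - modes[0]
rhs = 3 * (mean - median)
```

For this small sample the two sides of the empirical relation differ (2.0 versus 3.0), which shows that the formula is a rule of thumb rather than an identity.
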
  18. Symmetric vs. Skewed Data
     - [Figure: median, mean and mode of symmetric, positively skewed and negatively skewed data. Source: Data Mining: Concepts and Techniques]
  19. Four moments of distribution: Mean, Variance, Skewness, and Kurtosis
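
The slide only names the four moments, so the following sketch fills in one common set of definitions as an assumption: population (divide-by-n) central moments, with skewness $m_3 / m_2^{3/2}$ and non-excess kurtosis $m_4 / m_2^2$. The `four_moments` helper and the sample data are hypothetical.

```python
def four_moments(xs):
    """Return (mean, variance, skewness, kurtosis) using population moments.

    Assumes xs is non-empty and not constant (variance must be non-zero).
    """
    n = len(xs)
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]
    m2 = sum(d ** 2 for d in deviations) / n  # variance (second central moment)
    m3 = sum(d ** 3 for d in deviations) / n  # third central moment
    m4 = sum(d ** 4 for d in deviations) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5                 # 0 for symmetric data
    kurtosis = m4 / m2 ** 2                   # non-excess; 3 for a normal distribution
    return mean, m2, skewness, kurtosis

symmetric = four_moments([1, 2, 3, 4, 5])
right_skewed = four_moments([1, 1, 2, 2, 3, 10])
```

The symmetric sample has zero skewness, while the sample with a long right tail has positive skewness. Other texts use sample-corrected estimators or excess kurtosis (kurtosis minus 3), so results from libraries may differ from this sketch.
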