Lecture Administration and visualization: Chapter 3.3 - Data lake

File type: PDF | Pages: 45

Lecture "Administration and visualization: Chapter 3.3 - Data lake" provides students with content about: Traditional business analytics process; Architecture for data lake; Software component;... Please refer to the detailed content of the lecture!

Chapter 3 Data lake

Outline
• Definition
• Architecture for data lake
• Software component

Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema ("schema-on-write")
5. Create reports, analyze data
All data not immediately required is discarded or archived.
[Figure: LOB applications → ETL pipeline (dedicated ETL tools, e.g. SSIS) → defined schema → relational queries → results]

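The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `orders` table, the `RAW_ORDERS` records, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse. Note how fields outside the target schema (`coupon`, `note`) are dropped at load time.

```python
import sqlite3

# Hypothetical raw records from a line-of-business application.
RAW_ORDERS = [
    {"order_id": "1", "customer": "alice", "amount": "19.50", "coupon": "SPRING"},
    {"order_id": "2", "customer": "bob", "amount": "5.00", "note": "gift wrap"},
    {"order_id": "3", "customer": "alice", "amount": "12.50"},
]


def run_etl(raw_records):
    """Extract records, transform them to the target schema, and load them."""
    conn = sqlite3.connect(":memory:")
    # Step 2: the schema is fixed before any data is loaded (schema-on-write).
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # Step 4: keep only the required fields and cast them to the target types.
    rows = [(int(r["order_id"]), r["customer"], float(r["amount"])) for r in raw_records]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def revenue_by_customer(conn):
    """Step 5: a report expressed as a relational query."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


if __name__ == "__main__":
    print(revenue_by_customer(run_etl(RAW_ORDERS)))
```

Any question that needs the discarded fields later requires changing the schema and re-running the pipeline, which is the rigidity the following slides contrast with the data lake.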
Two approaches to information management for analytics: top-down and bottom-up
• Top-down (deductive): theory → hypothesis → observation → confirmation
• Bottom-up (inductive): observation → pattern → hypothesis → theory
• Analytics questions, from information to optimization: descriptive analytics (What happened?), diagnostic analytics (Why did it happen?), predictive analytics (What will happen?), prescriptive analytics (How can we make it happen?)

Data warehousing uses a top-down approach
• Understand corporate strategy
• Gather requirements: business requirements, technical requirements
• Implement data warehouse: dimension modeling, physical design, ETL design, ETL development, reporting and analytics design, reporting and analytics development
• Data sources; set up infrastructure; install and tune

The data lake uses a bottom-up approach
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop
[Figure: devices and other sources feed the data lake, which serves batch queries, interactive queries, real-time analytics, machine learning, and the data warehouse]

New big data thinking: All data has value
• All data has potential value
• Data hoarding
• No defined schema: data is stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read)
• Apps and users interpret the data as they see fit
[Figure: gather data from all sources → store indefinitely → analyze → see results → iterate]

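A minimal sketch of schema-on-read, assuming raw events are kept verbatim as JSON lines. The sample records, the `read_with_schema` helper, and both schemas are hypothetical; the point is only that each consumer imposes its own schema when it reads, while the stored data stays untouched.

```python
import json

# Hypothetical raw events stored as-is, one JSON document per line.
RAW_LINES = [
    '{"device": "t-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "t-2", "temp_c": 19, "firmware": "1.2"}',
    '{"user": "alice", "action": "login"}',
]


def read_with_schema(lines, schema):
    """Project raw records onto `schema` (field name -> cast function) at read time.

    Records that lack any of the schema's fields are skipped, not rejected at load.
    """
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two consumers interpret the same stored data as they see fit.
TEMPERATURE_SCHEMA = {"device": str, "temp_c": float}
LOGIN_SCHEMA = {"user": str, "action": str}

if __name__ == "__main__":
    print(list(read_with_schema(RAW_LINES, TEMPERATURE_SCHEMA)))
    print(list(read_with_schema(RAW_LINES, LOGIN_SCHEMA)))
```

The cost of this flexibility is that type casts and missing-field handling run on every read, which is one reason curated data sets are still produced downstream.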
Defining the Data Lake
• A centralized repository that allows you to store all your structured and unstructured data at any scale
• These assets are stored in a near-exact, or even exact, copy of the source format
• The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse) [Gartner IT Glossary]

Traditional approaches: current state of a data warehouse
[Figure: data sources → ETL → data warehouse → BI and analytics, with monitoring and telemetry across the pipeline]
• Data sources (OLTP, ERP, CRM, LOB): well manicured, often relational sources; known and expected data volume and formats; little to no change
• ETL: complex, rigid transformations; required extensive monitoring; transformed historical data into read structures
• Data warehouse: star schemas, views, and other read-optimized structures; flat, canned or multi-dimensional access to historical data
• BI and analytics: emailed, centrally stored Excel reports and dashboards; many reports, multiple versions of the truth; 24 to 48h delay

Traditional approaches: current state of a data warehouse (continued)
[Figure: the same pipeline of data sources → ETL → data warehouse → BI and analytics, annotated with the pressures below]
• Increasing data volume and non-relational data: increase in variety of data sources; increase in data volume; increase in types of data; pressure on the ingestion engine
• Increase in time: complex, rigid transformations can no longer keep pace; monitoring is abandoned; delay in data, inability to transform volumes or react to new sources; repair, adjust and redesign ETL
• Stale reporting: reports become invalid or unusable; delay in preserved reports increases; users begin to "innovate" to relieve starvation

New approach: Data Lake Transformation (ELT not ETL)
[Figure: data sources (OLTP, ERP, CRM, LOB, non-relational data, future data sources) → extract and load → data lake → data refinery process (transform on read) → data warehouse → BI and analytics]
• Data sources: all data sources are considered; leverages the power of on-prem technologies and the cloud for storage and capture; native formats, streaming data, big data
• Extract and load: no/minimal transform; storage of data in near-native format; orchestration becomes possible; streaming data accommodation becomes possible
• Data refinery process: transform relevant data into data sets; refineries transform data on read; produce curated data sets to integrate with traditional warehouses; users discover published data sets/services using familiar tools
• Data warehouse: star schemas, views, and other read-optimized structures
• BI and analytics: discover and consume predictive analytics, data sets and other reports

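The ELT flow above can be sketched as two separate steps: land the data unchanged, then refine on read. This is a toy illustration, not a production design: a local directory stands in for the lake, and the folder layout (`raw/<source>/part-0000.<ext>`), the function names, and the sample CRM and web payloads are all assumptions made for the example.

```python
import csv
import json
import tempfile
from pathlib import Path


def extract_and_load(lake_dir, source_name, payload, extension):
    """Land the payload unchanged in the lake, under a per-source folder."""
    target = Path(lake_dir) / "raw" / source_name
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"part-0000.{extension}"
    path.write_text(payload, encoding="utf-8")
    return path


def refine_sales(lake_dir):
    """Refinery step: read raw CRM (CSV) and web (JSON lines) data on demand
    and produce one curated data set with a common shape."""
    raw = Path(lake_dir) / "raw"
    curated = []
    with open(raw / "crm" / "part-0000.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            curated.append(
                {"customer": row["name"], "amount": float(row["total"]), "channel": "crm"}
            )
    web_text = (raw / "web" / "part-0000.jsonl").read_text(encoding="utf-8")
    for line in web_text.splitlines():
        event = json.loads(line)
        if event.get("type") == "purchase":  # only the relevant data is transformed
            curated.append(
                {"customer": event["user"], "amount": float(event["value"]), "channel": "web"}
            )
    return curated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake:
        extract_and_load(lake, "crm", "name,total\nalice,10.5\n", "csv")
        extract_and_load(
            lake,
            "web",
            '{"type": "view", "user": "bob"}\n'
            '{"type": "purchase", "user": "bob", "value": 4}\n',
            "jsonl",
        )
        print(refine_sales(lake))
```

Because the raw files are never modified, a second refinery with different rules (for example, one that keeps the `view` events) can be added later without re-ingesting anything.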
Data warehouse vs data lake

| Characteristics | Data Warehouse | Data Lake |
| Data | Relational from transactional systems, operational databases, and line of business applications | Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications |
| Schema | Designed prior to the DW implementation (schema-on-write) | Written at the time of analysis (schema-on-read) |
| Price/Performance | Fastest query results using higher cost storage | Query results getting faster using low-cost storage |
| Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e. raw data) |
| Users | Business analysts | Data scientists, data developers, and business analysts (using curated data) |
| Analytics | Batch reporting, BI and visualizations | Machine learning, predictive analytics, data discovery and profiling |

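The data lake and warehouse
[Figure: devices feed the data lake, which serves batch queries, interactive queries, real-time analytics, machine learning, and exploration; metadata, joins, and cooked data flow from the lake to the warehouse. LOB applications feed the warehouse through an ETL pipeline into a defined schema, which serves relational queries, results, dashboards, and reports.]
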
Data Lake + Data Warehouse: Better Together

Challenges involved in implementing a data lake
• Data silos: data spans sources; inefficiency in colocation
• Analytics: open interfaces to data; variety of analytics tools
• Performance and scale: storage bottlenecks; IoT sources (small writes); price-performance; data grows independently
• Security: compliance challenges; effectively control access; corporate policies

Typical data lake architecture on Azure

Introducing Cortana Intelligence Suite
[Figure: component diagram, flowing from Data to Intelligence to Action]
• Data sources: apps, sensors and devices, data
• Information management: Data Factory, Data Catalog, Event Hubs
• Big data stores: Data Lake Store, SQL Data Warehouse
• Machine learning and analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards & visualizations: Power BI
• Consumers: people (web, mobile, bots), apps, automated systems

Where Big Data is a cornerstone
[Figure: the same Cortana Intelligence Suite component diagram, from data sources through information management, big data stores, machine learning and analytics, and intelligence to people, apps, and automated systems]

Bringing Big Data to everybody
• Built for the cloud to accelerate the pace of innovation through a state-of-the-art platform
[Figure: a spectrum from control to ease of use and user adoption]
• IaaS Hadoop: HDP | CDH | MapR (Azure Marketplace); any Hadoop technology
• Managed Hadoop: HDInsight; workload-optimized, managed clusters
• Big Data as-a-service: Azure Data Lake Analytics; specific applications in a multi-tenant form factor
• Storage: Azure Storage, Data Lake Store