Chapter 3 Data lake
Outline
• Definition
• Architecture for data lake
• Software component
Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema (‘schema-on-write’)
5. Create reports, analyze data
[Diagram: relational sources and LOB applications feed a dedicated ETL pipeline (e.g. SSIS) into a defined schema; queries against it produce results. All data not immediately required is discarded or archived.]
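As a rough sketch of step 4, here is a schema-on-write ETL step written in PySpark rather than a dedicated ETL tool such as SSIS; the source file, target schema and JDBC connection details are hypothetical placeholders, not part of the slides.

# Schema-on-write ETL sketch: the target schema is fixed up front and the data
# is transformed to match it before loading (paths/credentials are hypothetical;
# a JDBC driver for the warehouse is assumed to be available).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-schema-on-write").getOrCreate()

# Extract: read the exported source data.
orders = spark.read.option("header", True).csv("/exports/orders.csv")

# Transform: cast and reshape into the pre-defined target schema.
curated = (orders
    .select(
        F.col("order_id").cast("bigint"),
        F.col("customer_id").cast("bigint"),
        F.to_date("order_date", "yyyy-MM-dd").alias("order_date"),
        F.col("amount").cast("decimal(18,2)"))
    .where(F.col("amount").isNotNull()))

# Load: append into the warehouse table that implements the defined schema.
curated.write.jdbc(
    url="jdbc:sqlserver://warehouse.example.com;databaseName=dw",
    table="fact_orders",
    mode="append",
    properties={"user": "etl_user", "password": "***"})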
Two approaches to information management for analytics: Top-down and bottom-up
[Diagram: the four types of analytics, on a spectrum from information to optimization.]
• Top-down (deductive): theory → hypothesis → observation → confirmation
• Bottom-up (inductive): observation → pattern → hypothesis → theory
• Descriptive analytics – What happened?
• Diagnostic analytics – Why did it happen?
• Predictive analytics – What will happen?
• Prescriptive analytics – How can we make it happen?
Data warehousing uses a top-down approach
[Diagram: understand corporate strategy → gather business and technical requirements from the data sources → dimension modeling and physical design, ETL design and development, reporting and analytics design and development → set up infrastructure, install and tune → implement the data warehouse.]
The data lake uses a bottom-up approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
[Diagram: devices and other sources feed the data lake, which serves batch queries, interactive queries, real-time analytics, machine learning and the data warehouse.]
New big data thinking: all data has value
• All data has potential value
• Data hoarding
• No defined schema—stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read)
• Apps and users interpret the data as they see fit
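A minimal schema-on-read sketch in PySpark: the lake keeps raw JSON exactly as it landed, and each consumer imposes its own schema only at query time. The path and field names are hypothetical.

# Schema-on-read: no schema was defined when the files were stored; this
# consumer applies one while reading (path and fields are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# One consumer's view of the raw files: only the fields it cares about.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("page", StringType()),
    StructField("duration_s", DoubleType()),
])

events = (spark.read
    .schema(clickstream_schema)   # schema applied on read, not on write
    .json("/datalake/raw/clickstream/"))

events.groupBy("page").count().show()

Another team can read the same raw files with a completely different schema, which is what “apps and users interpret the data as they see fit” means in practice.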
[Cycle: gather data from all sources → store indefinitely → analyze → see results → iterate.]
Defining the Data Lake
• A centralized repository that allows you to store all your
structured and unstructured data at any scale
• These assets are stored in a near-exact, or even exact, copy
of the source format.
• The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse) [Gartner IT Glossary]
Traditional Approaches
Current state of a data warehouse
[Diagram: DATA SOURCES (OLTP, ERP, CRM, LOB) → ETL → DATA WAREHOUSE → BI AND ANALYTICS, with monitoring and telemetry across the pipeline.]
• Data sources: well manicured, often relational; known and expected data volume and formats; little to no change
• ETL: complex, rigid transformations; transformed historical data into read-optimized structures; 24 to 48 h delay
• Data warehouse: star schemas, views and other read-optimized structures
• BI and analytics: emailed, centrally stored Excel reports and dashboards; flat, canned or multi-dimensional access to historical data; many reports, multiple versions of the truth
• Monitoring and telemetry: required extensive monitoring
Traditional Approaches
Current state of a data warehouse: growing pressures
[Same diagram, annotated with pain points: increasing data volume, non-relational data, increase in time, stale reporting.]
• Data sources: increase in data volume, in the variety of data sources and in the types of data; non-relational data appears
• ETL: complex, rigid transformations can no longer keep pace; delays in data, inability to transform the volumes or to react to new sources; pressure on the ingestion engine; constant repair, adjustment and redesign of the ETL
• BI and analytics: reporting goes stale; reports become invalid or unusable; delays in delivering reports increase; users begin to “innovate” to relieve starvation
• Monitoring and telemetry: monitoring is abandoned
New approach
Data Lake Transformation (ELT not ETL)
[Diagram: DATA SOURCES (OLTP, ERP, CRM, LOB, non-relational data, future data sources) → EXTRACT AND LOAD → DATA LAKE → DATA REFINERY PROCESS (transform on read) → DATA WAREHOUSE (star schemas, views and other read-optimized structures) → BI AND ANALYTICS.]
• Extract and load with no or minimal transform; all data sources are considered
• Storage of data in near-native format: native formats, streaming data, big data
• Refineries transform data on read, turning relevant data into data sets
• Curated data sets are produced to integrate with traditional warehouses
• Orchestration becomes possible; streaming data accommodation becomes possible
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Users discover and consume published data sets, predictive analytics and other reports using familiar tools
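An illustrative refinery step under the ELT model: the data has already been extracted and loaded untransformed, and the transformation happens on read to produce a curated data set for the warehouse or analysts. All paths and column names are hypothetical.

# ELT refinery sketch: transform on read, publish a curated data set
# (all paths and fields are hypothetical placeholders).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("refinery-transform-on-read").getOrCreate()

# Read raw sales events exactly as they were landed (no upfront schema work).
raw_sales = spark.read.json("/datalake/raw/sales/")

# Refine on read: standardize, filter and aggregate into a curated data set.
daily_revenue = (raw_sales
    .withColumn("sale_date", F.to_date("timestamp"))
    .where(F.col("status") == "completed")
    .groupBy("sale_date", "region")
    .agg(F.sum("amount").alias("revenue")))

# Publish where the warehouse (or analysts) can pick it up.
daily_revenue.write.mode("overwrite").parquet("/datalake/curated/daily_revenue/")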
Data warehouse vs data lake
• Data — Data Warehouse: relational, from transactional systems, operational databases and line-of-business applications; Data Lake: non-relational and relational, from IoT devices, web sites, mobile apps, social media and corporate applications
• Schema — Data Warehouse: designed prior to the DW implementation (schema-on-write); Data Lake: written at the time of analysis (schema-on-read)
• Price/Performance — Data Warehouse: fastest query results using higher-cost storage; Data Lake: query results getting faster using low-cost storage
• Data Quality — Data Warehouse: highly curated data that serves as the central version of the truth; Data Lake: any data that may or may not be curated (i.e. raw data)
• Users — Data Warehouse: business analysts; Data Lake: data scientists, data developers and business analysts (using curated data)
• Analytics — Data Warehouse: batch reporting, BI and visualizations; Data Lake: machine learning, predictive analytics, data discovery and profiling
The data lake and warehouse
[Diagram: devices and LOB applications feed the data lake (batch queries, interactive queries, real-time analytics, machine learning) and, through an ETL pipeline with a defined schema, the relational data warehouse; cooked data, metadata and joins connect the two, and results surface in queries, dashboards, reports and exploration.]
Data Lake + Data Warehouse Better Together
Challenges involved in implementing a data lake
• Data silos: data spans many sources, colocation is inefficient, and data grows independently in each silo
• Analytics: open interfaces to the data are needed, along with support for a variety of analytics tools
• Performance and scale: storage bottlenecks, IoT sources producing many small writes, price-performance
• Security: compliance challenges, effectively controlling access, corporate policies
Typical data lake architecture on Azure
Introducing Cortana Intelligence Suite
[Diagram: Data → Intelligence → Action]
• Data sources: apps, web, mobile, sensors and devices
• Information management: Data Factory, Data Catalog, Event Hubs
• Big data stores: Data Lake Store, SQL Data Warehouse
• Machine learning and analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards and visualizations: Power BI
• Consumers (people): apps, bots, automated systems
Where Big Data is a cornerstone
[Same suite diagram as above, with big data as its cornerstone: Data Lake Store, SQL Data Warehouse, Data Lake Analytics, HDInsight (Hadoop and Spark) and Stream Analytics.]
Bringing Big Data to everybody
• Built for the cloud to accelerate the pace of innovation through a state-of-the-art platform
[Spectrum from control to ease of use, with user adoption increasing toward the managed end:]
• IaaS Hadoop — any Hadoop technology: HDP | CDH | MapR (Azure Marketplace)
• Managed Hadoop — HDInsight: workload-optimized, managed clusters
• Big Data as-a-service — Data Lake Analytics: specific applications in a multi-tenant form factor
[Diagram: Azure Data Lake Analytics runs on top of Azure Storage and the Data Lake Store.]
Azure HDInsight
• Fully managed Hadoop and Spark for the cloud
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Managed, monitored and supported by Microsoft with the industry’s best SLA
• Familiar BI tools for analysis, or open-source notebooks for interactive data science
• 63% lower TCO than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure Data Lake Store
• A hyper-scale repository for big data analytics workloads
• Hadoop File System (HDFS) for the cloud
• No limits to scale
• Store any data in its native format
• Enterprise-grade access control, encryption at rest
• Optimized for analytic workload performance
Azure Data Lake Analytics
• Distributed analytics service built on Apache YARN
• Elastic scale per query lets users focus on business
goals—not configuring hardware
• Includes U-SQL—a language that unifies the benefits
of SQL with the expressive power of C#
• Integrates with Visual Studio to develop, debug, and
tune code faster
• Federated query across Azure data sources
• Enterprise-grade role-based access control
Typical data lake architecture on AWS
Kylo data lake platform
Zaloni data lake reference architecture
Data lakehouse
• Combines low-cost data lake storage with data warehouse management capabilities such as (see the sketch below):
• audit history,
• data versioning,
• distributed computing and storage,
• ACID transactions,
• dynamic schema evolution.
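Table formats such as Delta Lake (the topic of the final slide) provide these capabilities on top of ordinary data lake storage. A minimal sketch, assuming the delta-spark package and its jars are available to Spark; the table path and sample rows are made up.

# Lakehouse features with Delta Lake: ACID writes, versioning/time travel,
# schema evolution and an audit history (path and data are hypothetical).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

path = "/datalake/curated/customers_delta"

# Version 0: initial ACID write.
spark.createDataFrame([(1, "Alice")], ["id", "name"]) \
    .write.format("delta").save(path)

# Version 1: append with an extra column; mergeSchema lets the schema evolve.
spark.createDataFrame([(2, "Bob", "PL")], ["id", "name", "country"]) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(path)

# Time travel: read the table as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Audit history of the table's transactions.
DeltaTable.forPath(spark, path).history().select("version", "operation").show()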
Data lake design practices
Data lake key components
• Security – even though you will not expose the data lake to a broad audience, it is still very important to think this aspect through, especially during the initial architecture phase. Unlike relational databases, a data lake does not come with a built-in arsenal of security mechanisms. Be careful and never underestimate this aspect.
• Orchestration + ELT processes – as data is pushed from the raw layer through the cleansed layer to the sandbox and application layers, you need a tool to orchestrate the flow, and you will most likely need to apply transformations along the way. Either choose an orchestration tool capable of doing so, or provision additional resources to execute them.
• Governance – monitoring, logging and lineage operations will become crucial at some point for measuring performance and adjusting the data lake.
• Metadata – data about the data: schemas, reload intervals and descriptions of what each data set is for and how it is meant to be used.
• Master Data – an essential part of serving ready-to-use data. Either store your master data in the data lake or reference it while executing the ELT processes.
• Data Analytics Engine – it should:
• allow concurrent work on the data by multiple users,
• have access-restriction policies for separate users and groups of users,
• be cost-effective while maintaining speed and scalability at the same time.
• BI & AI
Data lake storage
• Storage options
• Consider how frequently the data will be accessed and in how many zones it will be replicated. Designing the data lake architecture around these properties ensures better security and additional cost savings, since infrequently accessed data costs less to store.
Data Lake common areas
• Staging – for data ingestion and movement
• Raw – for storing original copies of data in its raw format, indefinitely
• Curated – for data that is heavily transformed, perhaps into a star schema, for analysis
• Sandbox – to give analysts and scientists the opportunity to work with, collect and transform new data without heavy controls
Data lake repository layering
• Raw
• Standardized
• Cleansed
• Application
• Sandbox
Folders Structure
[Table: folder structure per project — data lake layer, usage, expected volume, path, sub-folder granularity.]
• Raw Files — files without any transformation, stored “as is”, every minute; this is the landing zone of your data lake. Expected volume: ~TBs/day. Path: /project-name/raw-files. Sub-folders: /year/month/day/hour/minute
• Raw Data — all files are now in a queryable format: same time zone and currency, special characters and duplicates removed, and small files merged into bigger ones, which is a best practice for big data workloads. Expected volume: ~GBs/day. Path: /project-name/raw-data. Sub-folders: /year/month/day
• Business Data — raw data plus business rules; here you build the basic aggregations that support all other analysis. It is a good idea to use parallel processing on top of a distributed file system for this heavy workload. This layer can also be used for data ingestion into DWs or DMs. Expected volume: ~MBs/day. Path: /project-name/business-data. Sub-folders: /year/month
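A small illustration of the raw-files landing path described above, building the /year/month/day/hour/minute prefix from the ingestion timestamp; the project name is a hypothetical placeholder.

# Build the per-minute landing prefix of the raw-files layer
# (the project name is a hypothetical placeholder).
from datetime import datetime, timezone

def raw_files_path(project: str, ts: datetime) -> str:
    """Return /project-name/raw-files/year/month/day/hour/minute for a timestamp."""
    return (f"/{project}/raw-files/"
            f"{ts:%Y}/{ts:%m}/{ts:%d}/{ts:%H}/{ts:%M}")

print(raw_files_path("project-name",
                     datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc)))
# -> /project-name/raw-files/2024/05/17/09/30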
Files Format
[Table: file format per data lake layer — format, compression, rationale.]
• Raw Files — format: “as is”; compression: Gzip. Keeping the same format as the original data allows fast ingestion, and Gzip delivers a good compression ratio for most file types.
• Raw Data — format: sequence files; compression: Snappy. Sequence files suit the MapReduce programming paradigm because they can easily be split across data nodes, enabling parallel processing; they also make it easy to merge two or more files into a bigger one, a typical goal of this layer. Snappy does not give the best ratio, but it has excellent compress/decompress performance.
• Business Data — format: Parquet files; compression: Snappy. Suited to interactive queries using Presto, Impala, or an external connection such as PolyBase, which lets you query external data from Azure Synapse Analytics. Snappy is used again; Parquet files are columnar and therefore compact by nature, so data volume is no longer a problem.
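A quick PySpark sketch matching the Business Data row: writing a curated data set as Snappy-compressed Parquet (Snappy is also Spark’s default Parquet codec). The paths are hypothetical.

# Write the business-data layer as Snappy-compressed Parquet files
# (paths are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("business-data-parquet").getOrCreate()

raw = spark.read.json("/project-name/raw-data/2024/05/17/")

raw.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("/project-name/business-data/2024/05/")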
Organization of data in S3
• Raw Data is the section where data from the source is stored as-is.
• Curated Data is the section where all the data has been cleansed using data-quality rules and is available for analysis.
• Intermediate Datasets store the datasets and tables required for transformations to run.
• Linked Datasets are the denormalized or summarized aggregated datasets derived from the original data that are useful for multiple use cases.
• Archived Data is the section into which unnecessary or rarely referenced data is moved. It may be moved to Glacier or a similar storage system that is available at lower cost.
• Export Area is the place where datasets are created for moving data to external targets such as indexes or data marts.
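A hedged boto3 sketch of this layout: creating the top-level prefixes and a lifecycle rule that moves the archived section to Glacier. The bucket name, the exact prefix names and the 90-day threshold are assumptions for illustration, not values from the slides.

# Sketch of the S3 sections above plus a Glacier lifecycle rule for archived data
# (bucket name, prefixes and the 90-day threshold are assumptions).
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"

# Top-level sections of the lake; S3 "folders" are just key prefixes.
for prefix in ["raw/", "curated/", "intermediate/", "linked/", "archived/", "export/"]:
    s3.put_object(Bucket=bucket, Key=prefix)

# Transition objects under archived/ to Glacier after 90 days to lower cost.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": "archived/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)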
5 steps to a data lake architecture
Benefits of the Data Lake
• Enables “productionizing” advanced analytics
• Derives value from unlimited data types (including raw data)
• Reduces long-term cost of ownership across the entire spectrum of data use
• Cost-effective scalability and flexibility
Risks of the Data Lake
• Loss of trust
• Loss of relevance and momentum
• Increased risk
• Pure Hadoop implementation:
• Query performance not as good as a relational database
• Complex query support is poor due to the lack of a query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing
Thank you for your attention!!!
Delta lake