Chapter 3 Data lake
Outline
• Definition
• Architecture for data lake
• Software component
Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema (‘schema-on-write’)
5. Create reports, analyze data
[Diagram: relational sources and LOB applications feed a dedicated ETL pipeline (e.g. SSIS) into a defined schema; queries against it produce results. All data not immediately required is discarded or archived.]
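As a rough sketch of step 4, here is a schema-on-write ETL step written in PySpark rather than a dedicated ETL tool such as SSIS; the source file, target schema and JDBC connection details are hypothetical placeholders, not part of the slides.

# Schema-on-write ETL sketch: the target schema is fixed up front and the data
# is transformed to match it before loading (paths/credentials are hypothetical;
# a JDBC driver for the warehouse is assumed to be available).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-schema-on-write").getOrCreate()

# Extract: read the exported source data.
orders = spark.read.option("header", True).csv("/exports/orders.csv")

# Transform: cast and reshape into the pre-defined target schema.
curated = (orders
    .select(
        F.col("order_id").cast("bigint"),
        F.col("customer_id").cast("bigint"),
        F.to_date("order_date", "yyyy-MM-dd").alias("order_date"),
        F.col("amount").cast("decimal(18,2)"))
    .where(F.col("amount").isNotNull()))

# Load: append into the warehouse table that implements the defined schema.
curated.write.jdbc(
    url="jdbc:sqlserver://warehouse.example.com;databaseName=dw",
    table="fact_orders",
    mode="append",
    properties={"user": "etl_user", "password": "***"})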
Two approaches to information management for analytics: Top-down and bottom-up
[Diagram: the four types of analytics, on a spectrum from information to optimization.]
• Top-down (deductive): theory → hypothesis → observation → confirmation
• Bottom-up (inductive): observation → pattern → hypothesis → theory
• Descriptive analytics – What happened?
• Diagnostic analytics – Why did it happen?
• Predictive analytics – What will happen?
• Prescriptive analytics – How can we make it happen?
Data warehousing uses a top-down approach
[Diagram: understand corporate strategy → gather business and technical requirements from the data sources → dimension modeling and physical design, ETL design and development, reporting and analytics design and development → set up infrastructure, install and tune → implement the data warehouse.]
The data lake uses a bottom-up approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
[Diagram: devices and other sources feed the data lake, which serves batch queries, interactive queries, real-time analytics, machine learning and the data warehouse.]
New big data thinking: all data has value
• All data has potential value
• Data hoarding
• No defined schema—stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read)
• Apps and users interpret the data as they see fit
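A minimal schema-on-read sketch in PySpark: the lake keeps raw JSON exactly as it landed, and each consumer imposes its own schema only at query time. The path and field names are hypothetical.

# Schema-on-read: no schema was defined when the files were stored; this
# consumer applies one while reading (path and fields are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# One consumer's view of the raw files: only the fields it cares about.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("page", StringType()),
    StructField("duration_s", DoubleType()),
])

events = (spark.read
    .schema(clickstream_schema)   # schema applied on read, not on write
    .json("/datalake/raw/clickstream/"))

events.groupBy("page").count().show()

Another team can read the same raw files with a completely different schema, which is what “apps and users interpret the data as they see fit” means in practice.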
[Cycle: gather data from all sources → store indefinitely → analyze → see results → iterate.]
Defining the Data Lake
• A centralized repository that allows you to store all your
structured and unstructured data at any scale
• These assets are stored in a near-exact, or even exact, copy
of the source format.
• The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse) [Gartner IT Glossary]
Traditional Approaches
Current state of a data warehouse
[Diagram: DATA SOURCES (OLTP, ERP, CRM, LOB) → ETL → DATA WAREHOUSE → BI AND ANALYTICS, with monitoring and telemetry across the pipeline.]
• Data sources: well manicured, often relational; known and expected data volume and formats; little to no change
• ETL: complex, rigid transformations; transformed historical data into read-optimized structures; 24 to 48 h delay
• Data warehouse: star schemas, views and other read-optimized structures
• BI and analytics: emailed, centrally stored Excel reports and dashboards; flat, canned or multi-dimensional access to historical data; many reports, multiple versions of the truth
• Monitoring and telemetry: required extensive monitoring
Traditional Approaches
Current state of a data warehouse: growing pressures
[Same diagram, annotated with pain points: increasing data volume, non-relational data, increase in time, stale reporting.]
• Data sources: increase in data volume, in the variety of data sources and in the types of data; non-relational data appears
• ETL: complex, rigid transformations can no longer keep pace; delays in data, inability to transform the volumes or to react to new sources; pressure on the ingestion engine; constant repair, adjustment and redesign of the ETL
• BI and analytics: reporting goes stale; reports become invalid or unusable; delays in delivering reports increase; users begin to “innovate” to relieve starvation
• Monitoring and telemetry: monitoring is abandoned
New approach
Data Lake Transformation (ELT not ETL)
[Diagram: DATA SOURCES (OLTP, ERP, CRM, LOB, non-relational data, future data sources) → EXTRACT AND LOAD → DATA LAKE → DATA REFINERY PROCESS (transform on read) → DATA WAREHOUSE (star schemas, views and other read-optimized structures) → BI AND ANALYTICS.]
• Extract and load with no or minimal transform; all data sources are considered
• Storage of data in near-native format: native formats, streaming data, big data
• Refineries transform data on read, turning relevant data into data sets
• Curated data sets are produced to integrate with traditional warehouses
• Orchestration becomes possible; streaming data accommodation becomes possible
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Users discover and consume published data sets, predictive analytics and other reports using familiar tools
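An illustrative refinery step under the ELT model: the data has already been extracted and loaded untransformed, and the transformation happens on read to produce a curated data set for the warehouse or analysts. All paths and column names are hypothetical.

# ELT refinery sketch: transform on read, publish a curated data set
# (all paths and fields are hypothetical placeholders).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("refinery-transform-on-read").getOrCreate()

# Read raw sales events exactly as they were landed (no upfront schema work).
raw_sales = spark.read.json("/datalake/raw/sales/")

# Refine on read: standardize, filter and aggregate into a curated data set.
daily_revenue = (raw_sales
    .withColumn("sale_date", F.to_date("timestamp"))
    .where(F.col("status") == "completed")
    .groupBy("sale_date", "region")
    .agg(F.sum("amount").alias("revenue")))

# Publish where the warehouse (or analysts) can pick it up.
daily_revenue.write.mode("overwrite").parquet("/datalake/curated/daily_revenue/")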
Data warehouse vs data lake
• Data — Data Warehouse: relational, from transactional systems, operational databases and line-of-business applications; Data Lake: non-relational and relational, from IoT devices, web sites, mobile apps, social media and corporate applications
• Schema — Data Warehouse: designed prior to the DW implementation (schema-on-write); Data Lake: written at the time of analysis (schema-on-read)
• Price/Performance — Data Warehouse: fastest query results using higher-cost storage; Data Lake: query results getting faster using low-cost storage
• Data Quality — Data Warehouse: highly curated data that serves as the central version of the truth; Data Lake: any data that may or may not be curated (i.e. raw data)
• Users — Data Warehouse: business analysts; Data Lake: data scientists, data developers and business analysts (using curated data)
• Analytics — Data Warehouse: batch reporting, BI and visualizations; Data Lake: machine learning, predictive analytics, data discovery and profiling
The data lake and warehouse
[Diagram: devices and LOB applications feed the data lake (batch queries, interactive queries, real-time analytics, machine learning) and, through an ETL pipeline with a defined schema, the relational data warehouse; cooked data, metadata and joins connect the two, and results surface in queries, dashboards, reports and exploration.]
Data Lake + Data Warehouse Better Together
Challenges involved in implementing a data lake
• Data silos: data spans many sources, colocation is inefficient, and data grows independently in each silo
• Analytics: open interfaces to the data are needed, along with support for a variety of analytics tools
• Performance and scale: storage bottlenecks, IoT sources producing many small writes, price-performance
• Security: compliance challenges, effectively controlling access, corporate policies
Typical data lake architecture on Azure
Introducing Cortana Intelligence Suite
[Diagram: Data → Intelligence → Action]
• Data sources: apps, web, mobile, sensors and devices
• Information management: Data Factory, Data Catalog, Event Hubs
• Big data stores: Data Lake Store, SQL Data Warehouse
• Machine learning and analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards and visualizations: Power BI
• Consumers (people): apps, bots, automated systems
Where Big Data is a cornerstone
[Same suite diagram as above, with big data as its cornerstone: Data Lake Store, SQL Data Warehouse, Data Lake Analytics, HDInsight (Hadoop and Spark) and Stream Analytics.]
Bringing Big Data to everybody
• Built for the cloud to accelerate the pace of innovation through a state-of-the-art platform
[Spectrum from control to ease of use, with user adoption increasing toward the managed end:]
• IaaS Hadoop — any Hadoop technology: HDP | CDH | MapR (Azure Marketplace)
• Managed Hadoop — HDInsight: workload-optimized, managed clusters
• Big Data as-a-service — Data Lake Analytics: specific applications in a multi-tenant form factor
[Diagram: Azure Data Lake Analytics runs on top of Azure Storage and the Data Lake Store.]
Azure HDInsight
• Fully managed Hadoop and Spark for the cloud
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Managed, monitored and supported by Microsoft with the industry’s best SLA
• Familiar BI tools for analysis, or open-source notebooks for interactive data science
• 63% lower TCO than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure Data Lake Store
• A hyper-scale repository for big data analytics workloads
• Hadoop File System (HDFS) for the cloud
• No limits to scale
• Store any data in its native format
• Enterprise-grade access control, encryption at rest
• Optimized for analytic workload performance
Azure Data Lake Analytics
• Distributed analytics service built on Apache YARN
• Elastic scale per query lets users focus on business
goals—not configuring hardware
• Includes U-SQL—a language that unifies the benefits
of SQL with the expressive power of C#
• Integrates with Visual Studio to develop, debug, and
tune code faster
• Federated query across Azure data sources
• Enterprise-grade role-based access control
Typical data lake architecture on AWS
Kylo data lake platform
Zaloni data lake reference architecture
Data lakehouse
• Combines low-cost data lake storage with data warehouse management capabilities such as (see the sketch below):
• audit history,
• data versioning,
• distributed computing and storage,
• ACID transactions,
• dynamic schema evolution.
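Table formats such as Delta Lake (the topic of the final slide) provide these capabilities on top of ordinary data lake storage. A minimal sketch, assuming the delta-spark package and its jars are available to Spark; the table path and sample rows are made up.

# Lakehouse features with Delta Lake: ACID writes, versioning/time travel,
# schema evolution and an audit history (path and data are hypothetical).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

path = "/datalake/curated/customers_delta"

# Version 0: initial ACID write.
spark.createDataFrame([(1, "Alice")], ["id", "name"]) \
    .write.format("delta").save(path)

# Version 1: append with an extra column; mergeSchema lets the schema evolve.
spark.createDataFrame([(2, "Bob", "PL")], ["id", "name", "country"]) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(path)

# Time travel: read the table as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Audit history of the table's transactions.
DeltaTable.forPath(spark, path).history().select("version", "operation").show()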
Data lake design practices
Data lake key components
• Security – even though you will not expose the data lake to a broad audience, it is still very important to think this aspect through, especially during the initial architecture phase. Unlike relational databases, a data lake does not come with a built-in arsenal of security mechanisms. Be careful and never underestimate this aspect.
• Orchestration + ELT processes – as data is pushed from the raw layer through the cleansed layer to the sandbox and application layers, you need a tool to orchestrate the flow, and you will most likely need to apply transformations along the way. Either choose an orchestration tool capable of doing so, or provision additional resources to execute them.
• Governance – monitoring, logging and lineage operations will become crucial at some point for measuring performance and adjusting the data lake.
• Metadata – data about the data: schemas, reload intervals and descriptions of what each data set is for and how it is meant to be used.
• Master Data – an essential part of serving ready-to-use data. Either store your master data in the data lake or reference it while executing the ELT processes.
• Data Analytics Engine – it should:
• allow concurrent work on the data by multiple users,
• have access-restriction policies for separate users and groups of users,
• be cost-effective while maintaining speed and scalability at the same time.
• BI & AI
Data lake storage
• Storage options
• Consider how frequently the data will be accessed and in how many zones it will be replicated. Designing the data lake architecture around these properties ensures better security and additional cost savings, since infrequently accessed data costs less to store.
Data Lake common areas
• Staging – for data ingestion and movement
• Raw – for storing original copies of data in its raw format, indefinitely
• Curated – for data that is heavily transformed, perhaps into a star schema, for analysis
• Sandbox – to give analysts and scientists the opportunity to work with, collect and transform new data without heavy controls
Data lake repository layering
• Raw
• Standardized
• Cleansed
• Application
• Sandbox
Folders Structure
[Table: folder structure per project — data lake layer, usage, expected volume, path, sub-folder granularity.]
• Raw Files — files without any transformation, stored “as is”, every minute; this is the landing zone of your data lake. Expected volume: ~TBs/day. Path: /project-name/raw-files. Sub-folders: /year/month/day/hour/minute
• Raw Data — all files are now in a queryable format: same time zone and currency, special characters and duplicates removed, and small files merged into bigger ones, which is a best practice for big data workloads. Expected volume: ~GBs/day. Path: /project-name/raw-data. Sub-folders: /year/month/day
• Business Data — raw data plus business rules; here you build the basic aggregations that support all other analysis. It is a good idea to use parallel processing on top of a distributed file system for this heavy workload. This layer can also be used for data ingestion into DWs or DMs. Expected volume: ~MBs/day. Path: /project-name/business-data. Sub-folders: /year/month
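A small illustration of the raw-files landing path described above, building the /year/month/day/hour/minute prefix from the ingestion timestamp; the project name is a hypothetical placeholder.

# Build the per-minute landing prefix of the raw-files layer
# (the project name is a hypothetical placeholder).
from datetime import datetime, timezone

def raw_files_path(project: str, ts: datetime) -> str:
    """Return /project-name/raw-files/year/month/day/hour/minute for a timestamp."""
    return (f"/{project}/raw-files/"
            f"{ts:%Y}/{ts:%m}/{ts:%d}/{ts:%H}/{ts:%M}")

print(raw_files_path("project-name",
                     datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc)))
# -> /project-name/raw-files/2024/05/17/09/30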
Files Format
[Table: file format per data lake layer — format, compression, rationale.]
• Raw Files — format: “as is”; compression: Gzip. Keeping the same format as the original data allows fast ingestion, and Gzip delivers a good compression ratio for most file types.
• Raw Data — format: sequence files; compression: Snappy. Sequence files suit the MapReduce programming paradigm because they can easily be split across data nodes, enabling parallel processing; they also make it easy to merge two or more files into a bigger one, a typical goal of this layer. Snappy does not give the best ratio, but it has excellent compress/decompress performance.
• Business Data — format: Parquet files; compression: Snappy. Suited to interactive queries using Presto, Impala, or an external connection such as PolyBase, which lets you query external data from Azure Synapse Analytics. Snappy is used again; Parquet files are columnar and therefore compact by nature, so data volume is no longer a problem.
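A quick PySpark sketch matching the Business Data row: writing a curated data set as Snappy-compressed Parquet (Snappy is also Spark’s default Parquet codec). The paths are hypothetical.

# Write the business-data layer as Snappy-compressed Parquet files
# (paths are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("business-data-parquet").getOrCreate()

raw = spark.read.json("/project-name/raw-data/2024/05/17/")

raw.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("/project-name/business-data/2024/05/")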
Organization of data in S3
• Raw Data is the section where data from the source is stored as-is.
• Curated Data is the section where all the data has been cleansed using data-quality rules and is available for analysis.
• Intermediate Datasets store the datasets and tables required for transformations to run.
• Linked Datasets are the denormalized or summarized aggregated datasets derived from the original data that are useful for multiple use cases.
• Archived Data is the section into which unnecessary or rarely referenced data is moved. It may be moved to Glacier or a similar storage system that is available at lower cost.
• Export Area is the place where datasets are created for moving data to external targets such as indexes or data marts.
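A hedged boto3 sketch of this layout: creating the top-level prefixes and a lifecycle rule that moves the archived section to Glacier. The bucket name, the exact prefix names and the 90-day threshold are assumptions for illustration, not values from the slides.

# Sketch of the S3 sections above plus a Glacier lifecycle rule for archived data
# (bucket name, prefixes and the 90-day threshold are assumptions).
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"

# Top-level sections of the lake; S3 "folders" are just key prefixes.
for prefix in ["raw/", "curated/", "intermediate/", "linked/", "archived/", "export/"]:
    s3.put_object(Bucket=bucket, Key=prefix)

# Transition objects under archived/ to Glacier after 90 days to lower cost.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": "archived/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)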
5 steps to a data lake architecture
Benefits of the Data Lake
• Enables “productionizing” advanced analytics
• Derives value from unlimited data types (including raw data)
• Reduces long-term cost of ownership across the entire spectrum of data use
• Cost-effective scalability and flexibility
Risks of the Data Lake
• Loss of trust
• Loss of relevance and momentum
• Increased risk
• Pure Hadoop implementation:
• Query performance not as good as a relational database
• Complex query support is poor due to the lack of a query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing
Thank you for your attention!!!
Delta lake