Copyright 2023, Faulkner Information Services. All
Rights Reserved.
Docid: 00018055
Publication Date: 2303
Publication Type: TUTORIAL
Preview
When the history of data processing is told, the emphasis is usually on the processing part, from the abacus to the mainframe to massively parallel processors to the quantum computer. When data itself is discussed, the conversation normally turns to physical storage, from paper ledgers to disk and tape to CD and USB drives to the cloud. Yet the history of data, its structure and use, is equally fascinating, featuring new types of purpose-built repositories such as data warehouses, data marts, data lakes, and even data lakehouses. This report focuses on data lakes, which allow enterprises to store vast amounts of raw data in its native format without any pre-defined schema or structure.
Executive Summary
When the history of data processing is told, the emphasis is usually on
the processing part, from the abacus to the mainframe to massively
parallel processors to the quantum computer. When data itself is
discussed, the conversation normally turns to physical storage, from paper
ledgers to disk and tape to CD and USB drives to the cloud.
Yet the history of data, its structure and use, is equally
fascinating, featuring new types of purpose-built repositories such as
data warehouses, data marts, data lakes, and even data lakehouses.
Figure 1. A Conceptual Visualization of a Data Lake
Enterprise data fills the lake (shown at the bottom).
Enterprise tools skim the value (shown at the top).
Credit: Amazon Web Services
A data lake, as represented in Figure 1, is a logical and physical
storage medium that enables an enterprise to store vast amounts of raw
data in its native format without any pre-defined schema or structure.
Data lakes are typically employed to store a combination of structured and
unstructured data, the latter originating from modern-day sources such as
social media posts and, increasingly, the Internet of Things (IoT).
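Storing data "without any pre-defined schema" is often described as schema-on-read: the raw object is kept as it arrived, and structure is applied only when a consumer reads it. The Python sketch below illustrates the idea under simple assumptions. The in-memory `raw_lake` dictionary, the object names, and the reader functions are all hypothetical stand-ins for a real object store and real datasets.

```python
import csv
import io
import json

# Hypothetical raw objects, kept exactly as they arrived (native format).
raw_lake = {
    "landing/sensor-0001.json": '{"device": "t-17", "temp_c": "21.5"}',
    "landing/orders.csv": "order_id,amount\n1001,19.99\n1002,5.00\n",
    "landing/post.txt": "Loving the new release!",  # unstructured text
}


def read_sensor(raw: str) -> dict:
    """Apply a schema at read time: parse JSON and cast the fields of interest."""
    record = json.loads(raw)
    return {"device": str(record["device"]), "temp_c": float(record["temp_c"])}


def read_orders(raw: str) -> list[dict]:
    """Apply a different schema to a different raw object (CSV rows to typed dicts)."""
    rows = csv.DictReader(io.StringIO(raw))
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])} for r in rows]


sensor = read_sensor(raw_lake["landing/sensor-0001.json"])
orders = read_orders(raw_lake["landing/orders.csv"])
total = sum(o["amount"] for o in orders)

print(sensor)
print(orders)
```

Nothing in the store itself changes when a new reader is written, which is why the same raw object can serve several different analyses.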
Because it is stored raw, data lake data is typically analyzed with
advanced techniques including predictive analytics, machine learning, and
natural language processing. To understand where data lakes fit in the
overall data storage scheme, consider the other common repositories.
Data Warehouse
Data warehouses store data extracted from transactional databases,
line-of-business applications, and operational databases.1
Developed in the 1980s, data warehouses permit companies and other
commercial interests to examine customer data for buying trends.2
Data Mart
Considered a subset of data warehouses, data marts are usually oriented
toward a specific enterprise team or line of business such as finance or
sales, expediting the data search process by reducing the volume of hay in
the enterprise haystack.3
Data Lakehouse
A combination data lake and data warehouse, a data lakehouse, as imagined
by analyst Martin Heller, “unifies the best of data warehouses and data
lakes in one simple platform to handle all … data, analytics, and AI
uses cases. It’s built on an open and reliable data foundation that
efficiently handles all data types and applies one common security and
governance approach across all … data and cloud platforms.”4
Market
As predicted by Data Bridge Market Research, the global data lake market,
which was valued at $11.71 billion in 2021, is expected to reach $61.07
billion by 2029, registering a robust compound annual growth rate (CAGR)
of 22.93 percent during the 2022 to 2029 forecast period.5
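The forecast figures can be cross-checked with the standard compound-growth formula. The short Python sketch below assumes eight annual compounding periods (2021 to 2029); the function and variable names are illustrative only.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.25 means 25 percent)."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding start_value at the given rate for a number of years."""
    return start_value * (1 + rate) ** years


# Figures quoted from Data Bridge Market Research, in billions of US dollars.
value_2021 = 11.71
value_2029 = 61.07
years = 2029 - 2021  # assumption: eight annual compounding periods

implied_rate = cagr(value_2021, value_2029, years)
projected_2029 = project(value_2021, 0.2293, years)

print(f"Implied CAGR: {implied_rate:.2%}")
print(f"Projected 2029 value at 22.93%: ${projected_2029:.2f} billion")
```

Under that assumption the implied rate and the projected 2029 value both agree with the quoted figures to within rounding.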
Growth Factors
The data lake market is being driven by several factors, principally the
popularity of cloud data storage, with major providers like Amazon,
Microsoft, and Google all offering data lake solutions that are:
- Easy to establish, maintain, manage, and access;
- Scalable to accommodate growth;
- Secure relative to on-premises data stores;
- Economical, with pay-as-you-use cloud pricing.
Data Bridge projects that “Internet of Things or connected devices
technology will truly benefit the market in the long run. Internet of Things
technology allows … data collection through more [means] and improves …
operational efficiency with respect to cost and … data analytics. IOT even
offers [a] greater degree of ease in running analytics. Additionally, the
increasing trend of digitalization … offers numerous growth opportunities
within the market. [A rising] number of data processing activities and
complete digitization of operations will also work in favor of the market.”6
Growth Inhibitors
As with most emerging technologies, data lake growth may be constrained
by:
- Budgetary concerns;
- The absence of technical expertise to build and maintain data lake
infrastructure;
- A lack of data lake awareness, “especially in under-developed
economies.”7
Major Players
The data lake market space is populated by numerous providers, including:
- Microsoft (US)
- IBM (US)
- Cloudera (US)
- Informatica (US)
- Tata Consultancy Services Limited (India)
- Google (US)
- Oracle (US)
- Amazon Web Services (US)
- SAS Institute (US)
- Teradata (US)
- Atos SE (France)8
Applications
Data lakes offer several advantages over warehouses and other data
collectives. Principal among these are:
- The ability to store all data types, both structured and unstructured.
- The corresponding ability to perform expanded data analytics. As analyst
Shubham Sharma reports, “According to an Aberdeen survey, organizations
that implemented a data lake outperformed competitors by 9 percent in
organic revenue growth. These companies were able to perform new types of
analytics on previously unusable and siloed data – log files, data from
click-streams, social media and IoT devices – now centrally stored in the
data lake.”
- The chance to offload previously siloed data, relieving capacity
problems related to small, single-use data stores, while simultaneously
reducing data duplication and the potential for critical data loss.9
Use Cases
As itemized by Microsoft, a major market presence, data lakes can serve a
variety of enterprise use cases:
- “Streaming media – Subscription-based streaming companies collect and
process insights on customer behavior, which they may use to improve their
recommendation algorithm.
- “Finance – Investment firms use the most up-to-date market data, which
is collected and stored in real time, to efficiently manage portfolio
risks.
- “Healthcare – Healthcare organizations rely on big data to improve the
quality of care for patients.
- “Omnichannel retailer – Retailers use data lakes to capture and
consolidate data that’s coming in from multiple touchpoints, including
mobile, social, chat, word-of-mouth, and in person.
- “IoT – Hardware sensors generate enormous amounts of semi-structured to
unstructured data on the surrounding physical world.
- “Digital supply chain – Data lakes help manufacturers consolidate
disparate warehousing data, including EDI systems, XML, and JSONs.
- “Sales – Data scientists and sales engineers often build predictive
models to help determine customer behavior and reduce overall churn.”10
Of these opportunities, healthcare and retail offer the greatest promise,
at least in the short term:
- Healthcare professionals deposit and withdraw increasingly large volumes
of medical data on a daily basis – data vital to patient care, including
preventive medicine. The COVID-19 pandemic accelerated this trend, and
industry commitments to telemedicine and digital transformation will spur
continuing data lake investment.
- In retail and e-commerce, data lake solutions will “enable [industry
stakeholders] to better tackle customer requirements amidst the evolving
digital payments ecosystem.”11
Data Lakes and AI
There exists a natural synergy between data lakes and AI, which should
help advance the adoption of both technologies by forward-thinking
enterprise planners.
Consider, for example, the industrial practice of inspecting and testing
physical materials, components, and systems. As analyst Peter Rosiepen
points out, “Many industries rely heavily on non-destructive testing (NDT)
and inspection data to ensure the safety of their assets and operations.”
With an appetite for both structured and unstructured data at any scale, a
data lake “could be a solution for storing and managing [inspection and
test] data. Having a data lake allows for the centralization of all NDT
data and inspection metadata cost-effectively – the storage of large
amounts of data comes at a fraction of the cost of traditional storage
methods as it eliminates the need for X-ray film, chemicals, paper, or
archive rooms. It also reduces pathways because digital data, unlike
physical data, can be accessed from almost everywhere.
“One of the main benefits of storing and managing NDT data and inspection
metadata in a data lake is that it is a valuable source for artificial
intelligence (AI) projects. The large amount of data stored in a data lake
can be used to train machine learning models, which can then be used to
improve the efficiency of NDT and inspection processes.”12
Observations and Recommendations
Your First Data Lake
For enterprises with minimal large data management experience and
exposure – in particular, small-to-medium-sized enterprises (SMEs) – data
lake offerings from Amazon, Microsoft, and Google provide a convenient and
relatively inexpensive starting place.
With Amazon Web Services (AWS), for example, data lakes are built upon a
number of prominent service “pillars”:
- The popular Amazon Simple Storage Service (S3) forms the data lake
storage foundation.
- AWS Lake Formation constructs the data lake, with features and functions
such as source crawlers, ETL and data prep, data catalog, security
settings, and access control.
- AWS Glue facilitates data movement between the data lake and custom data
and analytics services.
- Amazon Athena provides interactive SQL analytics on data stored in S3.
- Amazon Redshift provides data warehousing and can query data held in the
lake; security and governance are handled chiefly through AWS Lake
Formation.
- Amazon EMR provides a big data platform for running large-scale
distributed data processing jobs, interactive SQL queries, and machine
learning (ML) applications.
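Because S3 stores objects under flat keys, much of a lake's organization comes from how those keys are named. The sketch below shows one common convention, Hive-style `key=value` prefixes, which query engines such as Amazon Athena can treat as partitions. The zone and source names and the `object_key` helper are hypothetical, and the code makes no AWS calls.

```python
from datetime import date


def object_key(zone: str, source: str, day: date, filename: str) -> str:
    """Build an object key with Hive-style key=value partition prefixes."""
    return (
        f"{zone}/source={source}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical example: one day's IoT readings landing in a "raw" zone.
key = object_key("raw", "iot", date(2023, 3, 15), "readings-0001.json")
print(key)  # raw/source=iot/year=2023/month=03/day=15/readings-0001.json
```

Keeping the date in the key lets a query restricted to one day read only that day's objects rather than scanning the whole lake.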
Data Security Best Practices
No matter how data is stored, retrieved, and processed, a common theme is
data security.
To preserve the integrity and confidentiality of data lake data, analyst
Shubham Sharma suggests that all enterprises follow some basic best
practices. Chief among these are:
Identify data goals
“In order to prevent your data lake from
becoming a data swamp, … identify your organization’s data
goals – the business outcomes – and appoint an internal or external data
curator who could assess new sources/datasets and govern what goes into
the data lake based on those goals.”
Document incoming data
“All incoming data should be documented as it
is ingested into the lake. The documentation usually takes the forms of
technical metadata and business metadata, although new forms of
documentation are also emerging.”
Importantly, no data should arrive by accident
or oversight. Metaphorically, this will prevent the lake from overflowing
or becoming contaminated. It will also help control storage costs.
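The documentation practice described above can be sketched as a small ingestion gate that records technical and business metadata and refuses anything that arrives undocumented. This is a minimal illustration, not a product feature: the `CatalogEntry` fields, the `ingest` function, and the sample paths are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CatalogEntry:
    # Technical metadata: derived from the data itself at ingestion.
    path: str
    size_bytes: int
    sha256: str
    ingested_at: str
    # Business metadata: supplied by the data owner.
    owner: str
    description: str


catalog: dict[str, CatalogEntry] = {}


def ingest(path: str, payload: bytes, owner: str, description: str) -> CatalogEntry:
    """Register a dataset in the catalog; refuse anything without business metadata."""
    if not owner.strip() or not description.strip():
        raise ValueError(f"refusing undocumented data: {path}")
    entry = CatalogEntry(
        path=path,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
        owner=owner,
        description=description,
    )
    catalog[path] = entry
    return entry


# A documented dataset is accepted.
entry = ingest(
    "raw/clickstream/2023-03-15.log",
    b"GET /home 200\n",
    owner="web-analytics",
    description="Daily site click-stream log",
)

# An undocumented dataset is rejected and never reaches the catalog.
try:
    ingest("raw/unknown/dump.bin", b"\x00\x01", owner="", description="")
    rejected = False
except ValueError:
    rejected = True

print(entry)
print("undocumented data rejected:", rejected)
```

In practice a data curator, as suggested above, would decide which owners and descriptions are acceptable; the gate only enforces that the documentation exists.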
“Security has to be maintained across all
zones of the data lake, starting from landing to consumption. To ensure
this, connect with your vendors and see what they are doing in these four
areas:
- “User authentication
- “User authorization
- “Data-in-motion encryption
- “Data-at-rest encryption.”13
Web Links
- Amazon Web Services
- Google
- Microsoft
- US National Institute of Standards and Technology
References
1 Martin Heller. “What Is a Data Lake? Massively Scalable
Storage for Big Data Analytics.” InfoWorld | IDG Communications, Inc.
April 29, 2022.
2-3 CFI Team. “Data Warehouse.” CFI Education Inc. December
7, 2022.
4 Martin Heller. “What Is a Data Lake? Massively Scalable
Storage for Big Data Analytics.” InfoWorld | IDG Communications, Inc.
April 29, 2022.
5-8 “Global Data Lake Market, By Component (Solutions,
Services), Deployment Mode (On-Premises, Cloud), Organization Size
(Large Enterprises, Small and Medium-Sized Enterprises), Business
Function (Marketing, Sales, Operations, Finance, Human Resources),
Industry Vertical (Banking, Financial Services and Insurance,
Telecommunication and Information Technology, Retail and E-Commerce,
Healthcare and Life Sciences, Manufacturing, Energy and Utilities, Media
And Entertainment, Government, Others) – Industry Trends and Forecast to
2029.” Data Bridge Market Research. August 2022.
9 Shubham Sharma. “What Is a Data Lake? Definition,
Benefits, Architecture and Best Practices.” VentureBeat. March 10, 2022.
10 “What Is Data Lake.” Microsoft. 2023.
11 “Data Lake Market – Global Industry Analysis (2018 –
2020) – Growth Trends and Market Forecast (2021 – 2026).” Research and
Markets. April 2022.
12 Peter Rosiepen. “Data Lake as a Powerful Tool for
Artificial Intelligence Projects.” Inspectioneering, LLC. February 6,
2023.
13 Shubham Sharma. “What Is a Data Lake? Definition,
Benefits, Architecture and Best Practices.” VentureBeat. March 10, 2022.
About the Author
James G. Barr is a leading business continuity analyst
and business writer with more than 40 years’ IT experience. A member of
“Who’s Who in Finance and Industry,” Mr. Barr has designed, developed, and
deployed business continuity plans for a number of Fortune 500 firms. He
is the author of several books, including How to Succeed in Business
BY Really Trying, a member of Faulkner’s Advisory Panel, and a
senior editor for Faulkner’s Security Management Practices.
Mr. Barr can be reached via e-mail at jgbarr@faulkner.com.