ETL Best Practices
Copyright 2018, Faulkner Information
Services. All Rights Reserved.
Docid: 00021313
Publication Date: 1805
Report Type: TUTORIAL
Preview
ETL, which stands for Extract, Transform, and Load, is a deceptively
simple yet complex concept that defines the essence of data
processing. The three-step process involves: (1) the extraction of
data, called source data, from a file, database, or other data
repository; (2) the transformation of the source data – as guided by
business rules – into a more desired form; and (3) the loading of the
transformed data, called target data, into a target database. Frequently part of a business intelligence process, ETL is commonly employed to
sift through customer transaction data, order and summarize the data,
and then store the data in a data mart or data warehouse for subsequent
analysis.
Report Contents:
- Executive Summary
- Description
- Best Practices
- Outlook
- Recommendations
- References
- Web Links
- Related Reports
Executive Summary
Related Faulkner Reports:
- Data Analysis and Data Mining Tutorial
- Business Intelligence Solutions Market Trends
ETL, which stands for Extract, Transform, and Load,
is a deceptively simple yet complex concept that defines the essence
of data processing.
The three-step process involves:
- The extraction of data, called source data, from a file, database, or other data repository.
- The transformation of the source data – as guided by business rules – into a more desired form.
- The loading of the transformed data, called target data, into a target database.
ETL is commonly employed to sift through customer transaction data,
order and summarize the data, and then store the data in a data mart or
data warehouse, either as a direct source of business intelligence
(depending on the sophistication of the transformation element) or as source data to be subsequently consumed and analyzed by business
intelligence (BI) tools – an analysis which is, itself, an ETL activity.
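To make the three steps concrete, consider the following minimal Python sketch, which extracts order records from a hypothetical flat file, applies a simple summarization rule, and loads the result into a target table. The file name, column names, and business rule are illustrative assumptions only, not features of any particular vendor's ETL product.

# Minimal ETL sketch: extract from a CSV file, transform per a simple
# business rule, load into a target SQLite table. File and column names
# are hypothetical.
import csv
import sqlite3

def extract(path):
    # Extract: read source data from a flat file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: keep completed orders only and summarize revenue by customer.
    totals = {}
    for row in rows:
        if row["status"] != "completed":   # business rule: ignore open orders
            continue
        cust = row["customer_id"]
        totals[cust] = totals.get(cust, 0.0) + float(row["amount"])
    return [(cust, round(total, 2)) for cust, total in totals.items()]

def load(records, db_path="warehouse.db"):
    # Load: write target data into the data mart/warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customer_revenue (customer_id TEXT, revenue REAL)")
    con.executemany("INSERT INTO customer_revenue VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))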
While the mechanics of ETL can be complicated, involving, for
example, the movement of elements A and B in database C to element D in
data warehouse E, the best practices surrounding such activity are
relatively straightforward, and involve:
- Performing ETL operations in manageable chunks (or "iterations").
- Validating business rules and source data – to avoid the "garbage-in, garbage-out" phenomenon.
- Leveraging external expertise in the form of vendor products and services.
- Attending to data security and business continuity demands – owing to the importance of ETL to enterprise productivity and profitability.
Description
ETL Tools
Facilitating the ETL process are a variety of tools. Leading ETL vendors include:
- IBM
- Informatica
- Oracle
- SAP
Not surprisingly, these prominent ETL providers are also
dominant in the data management space, and their ETL
solutions typically integrate with – and/or extend – their
companion data management offerings.
According to analyst Garrett Alley,
modern ETL tools "integrate
well with advanced cloud data warehouses and … support the ever-growing number
of data sources and streams.
"Today’s trends continue to point to the cloud, and moving IT and ETL to the
cloud only makes sense. Cloud-based ETL services are the natural next step. They
support the same batch model as their predecessors, but they are taking ETL to
the next stage, often offering support for real-time data, intelligent schema
detection, and more."1
ETL Process
The ETL process is designed to draw from
multiple, disparate sources, from conventional flat files to
enterprise resource planning and customer relationship
management databases, as illustrated in Figure 1.
Figure 1. Depiction of Typical ETL Process
Source: Scalable Systems
ETL enables a greater understanding of enterprise data by
extracting, connecting, and manipulating multi-source data elements in a manner
that generates operational insights that transcend what might be revealed by inspecting any single
data source.
Data Quality
Despite their potential payback, ETL projects, like other
IT initiatives, can suffer planning and implementation
setbacks. Analysts Al Rago and Rob Wozniak contend that "poor data
quality is the biggest contributor to delays in the ETL
portion of [a business intelligence] project."2
Elaborating on this point, they observe that: "Inevitably, all of the data problems that have been lurking in source systems are exposed to the light of day when an ETL system is built. Common culprits are:
- "Business rules not supported by the data;
- "Missing or incomplete data; and
- "Source data structures that do not comply with designs."3
The best antidote for poor data quality is an Extract,
Transform, and Load process that adheres to industry best
practices – one capable of discovering and highlighting
source data and business rule anomalies.
ETL Variants
In addition to the well-established ETL process, IBM has identified three
distinct ETL variants:
- "ELT (Extract Load Transform)
– The ability to extract
data, load it and then transform it inside a database engine. ELT
is also sometimes referred to as ETL Pushdown. - "TELT (Transform Extract Load Transform)
– The ability
to transform and extract data at the source, load it and then transform it
inside a database engine. - "TETLT (Tranform Extract Transform Load Transform)
– The
ability to transform and extract data at the source, transform it again
within the data integration engine, and then load it and transform it inside
a database engine."4
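The distinction between ETL and ELT "pushdown" can be illustrated with a brief, hypothetical Python sketch: the raw extract is loaded first, and the transformation is then executed by the database engine itself. The table names, sample rows, and use of SQLite are assumptions made purely for illustration.

# ELT ("pushdown") sketch: load raw extracts into the database first, then
# let the database engine perform the transformation in SQL.
import sqlite3

con = sqlite3.connect("warehouse.db")

# Load: stage the raw source rows without transformation.
con.execute("CREATE TABLE IF NOT EXISTS raw_orders (customer_id TEXT, status TEXT, amount REAL)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("C1", "completed", 10.0), ("C1", "open", 5.0), ("C2", "completed", 7.5)],
)

# Transform: push the business rule down to the database engine.
con.execute("""
    CREATE TABLE IF NOT EXISTS customer_revenue AS
    SELECT customer_id, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'completed'
    GROUP BY customer_id
""")
con.commit()
con.close()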
Of these ETL alternatives, analyst Prabodh
Mhalgi believes ELT combined with Hadoop – an open
source distributed processing framework for Big Data applications running in
clustered systems – offers great potential. "Though the ETL process and traditional ETL tools have been
serving … data warehouse needs, the changing nature of data and its
rapidly growing volume have stressed the need to move to Hadoop. Apart
from the obvious benefits of cost effectiveness and scalability of
Hadoop, ELT on Hadoop provides flexibility in [the] data processing
environment."5
Best Practices
1. Perform ETL Projects in Iterative Cycles
When a new, large-scale information system is implemented, a "pilot"
project is often ordered, providing enterprise officials and end users
with the opportunity to "experience" the system on a small scale prior to
production. The purpose of the pilot is to identify and resolve
any problems before they can adversely affect the enterprise at large.
Similarly, when managing an ETL project, the operation
should be conducted on an iterative basis, with each cycle
introducing new source data and new transformation
techniques. In this way, the ETL process can be
continuously analyzed and refined, and even allow for the
evolution of new business rules as enterprise planners
evaluate successive generations of ETL results.
The iterative method reduces enterprise exposure to
poorly conceived ETL schemes by enabling "mid-course
corrections."
2. Design for High Performance ETL Processing
Performance can suffer when ETL operations are "scaled up." Analyst Marc Castillo offers the following ten tips for designing high
performance ETL processes:
- "For batch processing, only pull the minimum data you need from
your source systems. - "Don’t perform cached lookups in your ETL tool against large
tables. - "Avoid using row-by-row processing in your ETL tool. You should
be using the bulk-loading options of your ETL tool, as it is tuned
for processing large data volumes. - "For sources coming from a database, make sure you are offloading
table joins to the database. Chances are that your database can do
the join much more efficiently that your ETL tool can. - "Never use cursors in any ETL code. Cursors have their
use, but they should not be used in any regularly-scheduled ETL
processes. - "Check your ETL scheduling logic and make sure you are utilizing
as many parallel threads as possible. - "Make sure you have a step at the end of your jobs to rebuild
indexes on your reporting tables. - "Check that the Auto Statistics Update option on your target
database is turned on. If your database does not support it or you
have it turned off for other reasons, make sure your statistics are
updated at the end of each ETL run. - "Limit the number of CTEs and joins you are doing with a single
query. - "If you are using stored procedures as part of your ETL
processing, copy all input parameters into variables internal to the
stored procedure."6
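Two of these tips, offloading joins to the source database and favoring bulk, set-based loading over row-by-row processing, are illustrated in the hypothetical Python sketch below. The table names, columns, and date window are illustrative assumptions.

# Hypothetical sketch: offload the join to the source database and
# bulk-load the result set rather than processing row by row.
import sqlite3

src = sqlite3.connect("source.db")
tgt = sqlite3.connect("warehouse.db")

# Let the source database perform the join and pull only the minimum
# columns and rows needed, instead of extracting whole tables.
rows = src.execute("""
    SELECT o.order_id, o.amount, s.region
    FROM orders o
    JOIN stores s ON s.store_id = o.store_id
    WHERE o.order_date >= date('now', '-1 day')
""").fetchall()

# Use a set-based bulk insert (executemany) rather than issuing one
# INSERT per row inside a Python loop.
tgt.execute("CREATE TABLE IF NOT EXISTS daily_orders (order_id TEXT, amount REAL, region TEXT)")
tgt.executemany("INSERT INTO daily_orders VALUES (?, ?, ?)", rows)
tgt.commit()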
3. Verify Business Rules and Data Integrity
As previously mentioned, poor data quality
is the main impediment to a successful ETL operation. According to
analysts Al Rago and Rob Wozniak, the process of verifying business rules and data integrity
(data profiling)
"must begin during requirements gathering so rules for handling data anomalies can be presented to the users in a readable format. For example, there may be records of retail transactions in the source system without a retail location or store specified. In this case the users may want the record filtered out, defaulted to a generic
‘not-specified’ location, or assigned to a specific location. This style of surveying the quality of the data in the underlying source systems and validating it against requirements and business rules is essential for preventing problems downstream in the BI stack. Active data profiling needs to be an ongoing part of all requirements gathering and design phases."7
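The retail-location example above can be expressed as a small, hypothetical profiling routine: anomalous records are first counted and reported to the users, then handled according to the agreed business rule (here, defaulting to a generic "not-specified" location). Field names and sample records are assumptions for illustration.

# Hypothetical data-profiling sketch: survey source records for missing
# store locations and apply the agreed handling rule.
def profile_and_clean(transactions):
    missing = [t for t in transactions if not t.get("store_id")]
    print(f"{len(missing)} of {len(transactions)} records have no store specified")
    # Business rule chosen by the users: default rather than filter or reassign.
    for t in missing:
        t["store_id"] = "not-specified"
    return transactions

transactions = [
    {"txn_id": 1, "store_id": "S01", "amount": 20.0},
    {"txn_id": 2, "store_id": None, "amount": 12.5},   # anomaly surfaced by profiling
]
profile_and_clean(transactions)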
4. Automate the Extract, Transform, Load
Process
Analysts Al Rago and Rob Wozniak agree that: "Automating data validation activities is an essential part of every ETL process. Testing an ETL process means confirming that it did what it was designed to do. If a job was constructed to load a month’s worth of transactions from a point-of-sale system, it needs to confirm that a month of data was loaded; if a job was constructed to summarize those transactions, it needs to confirm that all the relevant transactions were included in the summary.
"Building an automated regression test bed for validating ETL involves one of two proven methods. The first way is to create a QA database of expected results and comparing the those expected results to the Target data in the warehouse under development. The second method uses pairs of queries against both source and target to make sure data has been extracted, transformed and loaded successfully. Both methods are equally effective and the scale of the operation and the complexity of the ETL determine which method is right for the situation."8
5. Extract Only New or Changed Data
Source data is often highly volatile. If the ETL process is conducted over time, only new or changed data should be extracted on the second and subsequent iterations. This will prevent the transformation and loading of redundant data, as well as expedite ETL operations.
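A common way to implement this practice is a "high-water mark": the ETL job records the highest modification timestamp it has loaded and, on the next iteration, extracts only rows changed since then. The following Python sketch is a hypothetical illustration; the table, column, and timestamp format are assumptions.

# Hypothetical change-data sketch: extract only rows added or changed
# since the last successful load.
import sqlite3

def extract_changes(source_db, last_loaded_at):
    src = sqlite3.connect(source_db)
    return src.execute(
        "SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded_at,)).fetchall()

# Each iteration advances the mark, so redundant rows are never re-processed.
last_loaded_at = "2018-05-01 00:00:00"
new_rows = extract_changes("source.db", last_loaded_at)
if new_rows:
    last_loaded_at = max(r[-1] for r in new_rows)  # assumes updated_at is the last column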
6. Utilize ETL Vendors for ETL Development
As befits the complexity of modern ETL projects, enterprise
planners should take advantage of ETL offerings from vendors like IBM,
Informatica, Oracle, and SAP.
Similarly, for large-scale ETL
initiatives, it may be advantageous to engage the services
of a major business consulting firm like Accenture or IBM. In addition to providing project
management expertise, these firms frequently have
relationships with ETL vendors, and have overseen projects
utilizing the vendors’ products.
7. Secure Data Before, During, After ETL
One often overlooked aspect of the ETL process is security. Even
the best ETL design can be subverted by hackers and other miscreants if
the data being extracted, transformed, and loaded is not secure. ETL security involves protecting data while at rest (in source and
target databases), and in transit (while moving from source to target
through the transformation engine).
Since the transformation element of ETL causes changes
in data, it may be difficult to determine if target data has
been transformed in accordance with the ETL process, or
modified by an unauthorized third party.
By conducting regular quality assurance checks – sampling
the target data at random – QA and security officials can
ascertain the likelihood of data compromise, and activate
additional security measures as appropriate.
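Such a sampling check might be scripted along the following lines. In this hypothetical Python sketch, a random sample of target rows is re-derived from the source (via a caller-supplied recompute_from_source function) and compared; the function name and row structure are assumptions for illustration.

# Hypothetical QA-sampling sketch: mismatches between sampled target rows
# and freshly recomputed values suggest tampering or an undocumented change.
import random

def sample_and_verify(target_rows, recompute_from_source, sample_size=25):
    sample = random.sample(target_rows, min(sample_size, len(target_rows)))
    mismatches = [row for row in sample
                  if recompute_from_source(row["key"]) != row["value"]]
    if mismatches:
        print(f"{len(mismatches)} sampled rows differ from the expected transformation")
    return mismatches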
As to the matter of "ETL security," there is no "magic
bullet." A program of continuous enterprise security
improvement affords the best protection.
8. Protect Source Data Repositories for
Recoverability
In addition to security, business continuity is a critical concern. In
particular, enterprise officials should ensure that source data is recoverable
in the event of a disaster affecting ETL information systems.
While target data stores may contain more "actionable"
information, the ability to create – and, if necessary,
recreate – those target stores relies on the availability of
the original source data.
In addition, the same source data may be subjected over
time to multiple ETL operations, especially as the ETL
process is refined. Thus, ensuring the "continued" existence
of primary source data is a business continuity imperative.
Outlook
It is the nature of ETL best practices that they will
emerge over time, especially in response to:
- New data types, like "e-discovery" data.
- New data collections, like "Big Data."
- New data states, like "data virtualization."
- New data environments, like cloud or edge data.
Until the information management industry establishes a new,
more comprehensive set of ETL standards and practices, enterprise officials
should consider how their ETL policies and protocols might evolve to encompass
the following ETL issues.
1. Cloud Integration
As more and more applications migrate to "the cloud," the ETL process
will have to expand to extract, transform, and load both on-premise and off-premise data. Logistics and security will be major
concerns.
2. Legacy Data
The ready availability of cloud-based applications – via the
software-as-a-service (SaaS) model – is responsible for accelerating the
rate at which legacy applications are being retired and replaced by
online systems. One major ETL hurdle will be transitioning legacy
data to the new environment.
3. Big Data
The term "Big Data" refers
to the massive amounts of data being generated on a daily basis by businesses
and consumers alike – data which cannot be processed using conventional data
analysis tools owing to its sheer size and, in many cases, its unstructured
nature.
Big Data poses big ETL challenges:
- New data characteristics, including unstructured and multi-structured data.
- New analytical platforms, like Hadoop, an open-source, distributed programming framework that facilitates Big Data processing.
- Data in motion and data at rest.
- New tools and techniques to analyze Big Data.9
Convinced that Big Data holds the key to improved productivity and
profitability, enterprise planners will be searching for ETL tools capable of
extracting, transforming, and loading Big Data.
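As a hypothetical illustration of the ELT-on-Hadoop pattern discussed earlier, the following PySpark sketch loads semi-structured JSON events as-is and performs the transformation inside the cluster. The paths, field names, and aggregation are assumptions, and a working Spark/Hadoop environment is presumed.

# Hypothetical PySpark sketch: semi-structured events are loaded first and
# transformed inside the cluster (ELT on Hadoop).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-etl").getOrCreate()

events = spark.read.json("hdfs:///raw/clickstream/")   # multi-structured source data
daily = (events
         .groupBy(F.to_date("event_time").alias("day"), "customer_id")
         .agg(F.count("*").alias("events")))
daily.write.mode("overwrite").parquet("hdfs:///warehouse/daily_activity/")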
4. Real-Time Warehousing
Owing to the need for instant business intelligence, the cycle time for
creating and refreshing data warehouses is nearing zero; in other words,
it is being reduced to real-time. As a consequence, the enterprise ETL
process will have to become faster, more automatic and more reliable (i.e., self-correcting).
5. Data Virtualization
Data virtualization is a data management methodology
that enables the aggregation of data from multiple, disparate sources to create
a single virtual view (often as an application "dashboard"). Data virtualization
(DV) is commonly associated with enterprise business applications including sales force automation
(SFA),
customer relationship management (CRM),
enterprise resource planning (ERP),
and business intelligence (BI). Data virtualization is, itself, an ETL process, and enterprise officials will
have to incorporate DV methods into their ETL toolkits.10
6. Electronic Discovery
Electronic discovery (or e-discovery) is the process of finding information of a particular type that is stored in electronic files, databases, or
archives. The principal driver behind the development and adoption of e-discovery
technology is the US Supreme Court. On April 12, 2006, the Court approved and forwarded to
Congress a number of revisions to the Federal Rules of Civil Procedure (FRCP)
that address the preservation and discovery of data in electronic media
(e-discovery). The new rules, which went into effect on December 1,
2006, establish a new form of discoverable data called "electronically
stored information" (ESI). Since failure to extract ESI on
demand can result in civil and criminal penalties. enterprise officials
will have to improve their ETL process or risk sanctions.
7. Code Review
As with any application development initiative, the creation of
effective, efficient, and dependable ETL code relies, in large measure,
on conducting comprehensive code reviews, in which a wide range of
coding experts and application stakeholders perform a thorough
evaluation of programming methods and results. Analyst Jim Egan believes
that the Code Review Team should include the following:
- The developer of the code
- Senior developer(s)
- A facilitator
- A junior developer (optional)
- A technical team lead (optional)
- A QA resource (optional)
- A business analyst (optional)
While the first two representatives – the developer of the code
and senior developer(s) – are obvious, the last five require some
elaboration.
The following explanations are provided by Jim Egan’s "6 Best
Practices for Conducting ETL Code Review":
- Facilitator – "There should be someone assigned the
task of managing the flow of the meeting and if necessary function
as a referee. This could be a Senior Developer or even the Team
Lead. Dual roles are OK." - Junior Developer – "I like to have a junior
developer sit in on a code review so they know what to expect when
their code is being reviewed. This could also be an opportunity to
educate the junior developer on how the project wants code to be
written. Studies have shown that this doesn’t always work out well. So don’t use it as a primary reason to put them on the code review
team." - Technical Team Lead – "If the other developers don’t
have enough experience then the Technical Team Lead may need to sit
in and possibly run the meeting." - QA Resource – "You may want QA in this meeting so
they can get an idea of what they’ll be testing." - Business Analyst – "This person will be reviewing
the code to ensure that it meets the requirements."11
Importantly, the code review process should be conducted at regular intervals during code development. In this way, potential problems can be identified – and resolved – in a timely manner, thereby reducing – or preventing the escalation of – development expenses.
Recommendations
Best practices, whether
ETL or otherwise, are forged through cooperation.
In fashioning a
future ETL strategy, enterprise officials should:
- Survey their industry colleagues to determine how other, similar enterprises are extracting, transforming, and loading vital data.
- Participate in information management user groups; again, to share ETL methods and perspectives.
- Engage a major business consulting firm for information on the latest ETL techniques, especially industry-specific techniques.
- Stay familiar with new advances in ETL tools by joining major vendors’ mailing lists.
- Check common sources of information technology standards and practices for ETL guidance, like the International Organization for Standardization (ISO) or the US National Institute of Standards and Technology (NIST).
- Understand that Big Data repositories will likely contain customers’ personally identifiable information (PII), resulting in the extraction, transformation, and loading of extremely sensitive information. ETL processes must be secure to prevent privacy violations and enterprise liability.
References
- 1 Garrett Alley. "2018 ETL Tools Comparison." Alooma, Inc. May 1, 2018.
- 2 Al Rago and Rob Wozniak. "ETL Projects Don’t Have To Be So Hard." Analytics8. 2011:1.
- 3 Ibid.
- 4 "What is ETL?" IBM Corporation.
- 5 Prabodh Mhalgi. "Traditional ETL vs ELT on Hadoop." Bitwise. July 4, 2017.
- 6 Marc Castillo. "10 Best Practices for High Performance ETL Processing." West Monroe Partners. June 25, 2015.
- 7 Al Rago and Rob Wozniak. "ETL Projects Don’t Have To Be So Hard." Analytics8. 2011:2.
- 8 Ibid. p.8.
- 9 Mike Ferguson. "The Impact of Big Data on ETL." Intelligent Business Strategies/Syncsort. July 2012:6.
- 10 Webopedia.
- 11 Jim Egan. "6 Best Practices for Conducting ETL Code Reviews." Mantis Technology Group, Inc. September 18, 2013.
Web Links
- IBM: http://www.ibm.com/
- Informatica: http://www.informatica.com/
- ISO: http://www.iso.org/
- NIST: http://www.nist.gov/
- Oracle: http://www.oracle.com/
- SAP: http://www.sap.com/
About the Author
James G. Barr is a leading business continuity analyst and
business writer with more than 30 years’ IT experience. A member of
"Who’s Who in Finance and Industry," Mr. Barr has designed,
developed, and deployed business continuity plans for a number of Fortune
500 firms. He is the author of several books, including How to
Succeed in Business BY Really Trying, a member of Faulkner’s Advisory
Panel, and a senior editor for Faulkner’s Security Management
Practices. Mr. Barr can be reached via e-mail at jgbarr@faulkner.com.