ETL Best Practices


by James G. Barr

Docid: 00021313

Publication Date: 1805

Report Type: TUTORIAL



Executive Summary



ETL, which stands for Extract, Transform, and Load,
is a deceptively simple yet complex concept that defines the essence
of data processing.

The three-step process involves:

  1. The extraction of data, called source data,
    from a file, database, or other data repository.
  2. The transformation of the source data – as
    guided by business rules – into a more desired form.
  3. The loading of the transformed data, called target data, into a
    target database.

ETL is commonly employed to sift through customer transaction data,
order and summarize the data, and then store the data in a data mart or
data warehouse, either as a direct source of business intelligence
(depending on the sophistication of the transformation element) or as source data to be subsequently devoured and analyzed by business
intelligence (BI) tools, which is, itself, an ETL activity.

While the mechanics of ETL can be complicated, involving, for
example, the movement of elements A and B in database C to element D in
data warehouse E, the best practices surrounding such activity are
relatively straightforward, and involve:

  • Performing ETL operations in manageable chunks (or "iterations").
  • Validating business rules and source data – to avoid the
    "garbage-in, garbage-out" phenomenon.
  • Leveraging external expertise in the form of vendor products and
    consulting services.
  • Attending to data security and business continuity demands –
    owing to the importance of ETL to enterprise productivity and
    profitability.


ETL Tools

Facilitating the ETL process are a variety of tools. Leading ETL vendors include:

  • IBM
  • Informatica
  • Oracle
  • SAP

Not surprisingly, these prominent ETL providers are also
dominant in the data management space, and their ETL
solutions typically integrate with – and/or extend – their
companion data management offerings.

According to analyst Garrett Alley,
modern ETL tools "integrate
well with advanced cloud data warehouses and … support the ever-growing number
of data sources and streams.

"Today’s trends continue to point to the cloud, and moving IT and ETL to the
cloud only makes sense. Cloud-based ETL services are the natural next step. They
support the same batch model as their predecessors, but they are taking ETL to
the next stage, often offering support for real-time data, intelligent schema
detection, and more."1

ETL Process

The ETL process is designed to draw from
multiple, disparate sources, from conventional flat files to
enterprise resource planning and customer relationship
management databases, as illustrated in Figure 1.

Figure 1. Depiction of Typical ETL Process

Source: Scalable Systems

ETL enables a greater understanding of enterprise data by
extracting, connecting, and manipulating multi-source data elements in a manner
that generates operational insights that transcend what might be revealed by inspecting any single
data source.
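
The three-step process can be sketched in miniature. The following Python sketch is illustrative only, assuming in-memory "databases," invented field names, and a per-customer summarization as the business rule; it is not drawn from any particular ETL tool:

```python
# Minimal illustrative ETL pipeline over in-memory data structures.
# All names here are hypothetical; a real pipeline would read from
# files or databases and write to a warehouse.

source_db = [  # source data: raw customer transactions
    {"customer": "A", "amount": "20.00"},
    {"customer": "B", "amount": "5.00"},
    {"customer": "A", "amount": "7.50"},
]

def extract(rows):
    """Step 1: pull source data from the repository."""
    return list(rows)

def transform(rows):
    """Step 2: apply business rules -- here, summarize spend per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + float(row["amount"])
    return totals

def load(totals, target):
    """Step 3: write the target data into the target store."""
    target.update(totals)

target_db = {}
load(transform(extract(source_db)), target_db)
print(target_db)  # {'A': 27.5, 'B': 5.0}
```

The same shape recurs at any scale: extraction and loading are mechanical, while the transformation step is where business rules live.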

Data Quality

Despite their potential payback, ETL projects, like other
IT initiatives, can suffer planning and implementation
setbacks. Analysts Al Rago and Rob Wozniak contend that "poor data
quality is the biggest contributor to delays in the ETL
portion of [a business intelligence] project."2

Elaborating on this point, they observe that: "Inevitably, all of the data problems that have been lurking in source systems are exposed to the light of day when an ETL system is built. Common culprits are:

  • "Business rules not supported by the data;
  • "Missing or incomplete data; and
  • "Source data structures that do not comply with designs."3

The best antidote for poor data quality is an Extract,
Transform, and Load process that adheres to industry best
practices – one capable of discovering and highlighting
source data and business rule anomalies.
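
A best-practice process surfaces such anomalies rather than silently loading them. As a hedged illustration (the field names and the "amounts are non-negative" rule are invented), a simple profiling pass might separate clean records from the common culprits quoted above:

```python
# Illustrative data-profiling pass: flag source records that violate
# business rules or carry missing data before transformation begins.
# Field names and rules are hypothetical.

def profile(rows):
    """Return (clean_rows, anomalies) for later review."""
    clean, anomalies = [], []
    for row in rows:
        if row.get("amount") is None:
            anomalies.append((row, "missing or incomplete data"))
        elif row["amount"] < 0:
            anomalies.append((row, "business rule not supported by the data"))
        else:
            clean.append(row)
    return clean, anomalies

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},   # missing data
    {"id": 3, "amount": -4.0},   # violates the non-negative-amount rule
]
clean, anomalies = profile(rows)
print(len(clean), len(anomalies))  # 1 2
```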

ETL Variants

In addition to the well-established ETL process, IBM has identified
distinct ETL variants, among them ELT (Extract, Load, Transform), in
which data is loaded into the target store before being transformed
in place.

Of these ETL alternatives, analyst Prabodh
Mhalgi believes ELT combined with Hadoop – an open
source distributed processing framework for Big Data applications running in
clustered systems – provides
great potential. "Though the ETL process and traditional ETL tools have been
serving … data warehouse needs, the changing nature of data and its
rapidly growing volume have stressed the need to move to Hadoop. Apart
from the obvious benefits of cost effectiveness and scalability of
Hadoop, ELT on Hadoop provides flexibility in [the] data processing."

Best Practices


1. Perform ETL Projects in Iterative Cycles

When a new, large-scale information system is implemented, a "pilot"
project is often ordered, providing enterprise officials and end users
with the opportunity to "experience" the system on a small scale prior to
production. The purpose of the pilot is to identify and resolve
any problems before they can adversely affect the enterprise at large.

Similarly, when managing an ETL project, the operation
should be conducted on an iterative basis, with each cycle
introducing new source data and new transformation
techniques. In this way, the ETL process can be
continuously analyzed and refined, and even allow for the
evolution of new business rules as enterprise planners
evaluate successive generations of ETL results.

The iterative method reduces enterprise exposure to
poorly conceived ETL schemes by enabling "mid-course
corrections."

2. Design for High Performance ETL Processing

Performance can suffer when ETL operations are "scaled up." Analyst Marc Castillo offers the following ten tips for designing high
performance ETL processes:

  1. "For batch processing, only pull the minimum data you need from
    your source systems.
  2. "Don’t perform cached lookups in your ETL tool against large
    tables.
  3. "Avoid using row-by-row processing in your ETL tool. You should
    be using the bulk-loading options of your ETL tool, as it is tuned
    for processing large data volumes.
  4. "For sources coming from a database, make sure you are offloading
    table joins to the database. Chances are that your database can do
    the join much more efficiently than your ETL tool can.
  5. "Never use cursors in any ETL code. Cursors have their
    use, but they should not be used in any regularly-scheduled ETL
    jobs.
  6. "Check your ETL scheduling logic and make sure you are utilizing
    as many parallel threads as possible.
  7. "Make sure you have a step at the end of your jobs to rebuild
    indexes on your reporting tables.
  8. "Check that the Auto Statistics Update option on your target
    database is turned on. If your database does not support it or you
    have it turned off for other reasons, make sure your statistics are
    updated at the end of each ETL run.
  9. "Limit the number of CTEs and joins you are doing with a single
    query.
  10. "If you are using stored procedures as part of your ETL
    processing, copy all input parameters into variables internal to the
    stored procedure."6
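
Tip 3 can be demonstrated concretely. The sketch below contrasts row-by-row inserts with a bulk load; Python's sqlite3 module stands in for a real target database, and the table and data are invented for the example:

```python
# Tip 3 in miniature: bulk loading beats row-by-row processing.
# sqlite3 is a stand-in for a real warehouse; executemany() batches
# the inserts the way an ETL tool's bulk-load option would.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")

rows = [("A", 10.0), ("B", 5.0), ("C", 7.5)]

# Row-by-row (discouraged): one statement per record.
# for row in rows:
#     conn.execute("INSERT INTO sales VALUES (?, ?)", row)

# Bulk load (preferred): a single batched statement.
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 3
```

On a real database the difference compounds with volume: batching amortizes round trips and lets the engine optimize for large writes.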

3. Verify Business Rules and Data Integrity

As previously mentioned, poor data quality
is the main impediment to a successful ETL operation. According to
analysts Al Rago and Rob Wozniak, the process of verifying business rules and data integrity
(data profiling)
"must begin during requirements gathering so rules for handling data anomalies can be presented to the users in a readable format. For example, there may be records of retail transactions in the source system without a retail location or store specified. In this case the users may want the record filtered out, defaulted to a generic
‘not-specified’ location, or assigned to a specific location. This style of surveying the quality of the data in the underlying source systems and validating it against requirements and business rules is essential for preventing problems downstream in the BI stack. Active data profiling needs to be an ongoing part of all requirements gathering and design phases."7
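
The retail-location scenario above lends itself to a short sketch. The code below is a hypothetical illustration (the field names, policy flags, and fallback store ID are invented) of the three handling options users might choose:

```python
# The retail-location rule from the example above, sketched in code.
# Record fields, policy names, and the fallback store ID are all
# hypothetical, invented for illustration.

NOT_SPECIFIED = "not-specified"

def resolve_location(record, policy="default"):
    """Apply the agreed business rule for records lacking a store."""
    if record.get("store"):
        return record                               # location present; pass through
    if policy == "filter":
        return None                                 # option 1: filter the record out
    if policy == "default":
        return {**record, "store": NOT_SPECIFIED}   # option 2: generic default
    if policy == "assign":
        return {**record, "store": "store-001"}     # option 3: specific location
    raise ValueError(f"unknown policy: {policy}")

txn = {"id": 42, "amount": 9.99, "store": None}
print(resolve_location(txn)["store"])           # not-specified
print(resolve_location(txn, policy="filter"))   # None
```

Whichever option is chosen, the point of profiling during requirements gathering is that the rule is decided by the users, in advance, rather than improvised by the ETL developer.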

4. Automate the Extract, Transform, Load Validation Process

Analysts Al Rago and Rob Wozniak agree that: "Automating data validation activities is an essential part of every ETL process. Testing an ETL process means confirming that it did what it was designed to do. If a job was constructed to load a month’s worth of transactions from a point-of-sale system, it needs to confirm that a month of data was loaded; if a job was constructed to summarize those transactions, it needs to confirm that all the relevant transactions were included in the summary.

"Building an automated regression test bed for validating ETL involves one of two proven methods. The first way is to create a QA database of expected results and compare those expected results to the target data in the warehouse under development. The second method uses pairs of queries against both source and target to make sure data has been extracted, transformed, and loaded successfully. Both methods are equally effective, and the scale of the operation and the complexity of the ETL determine which method is right for the situation."8
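
The second method, paired queries against source and target, can be sketched as follows. sqlite3 serves as a stand-in for both systems, and the table names and data are invented:

```python
# Automated ETL validation via paired source/target queries.
# sqlite3 stands in for both the point-of-sale source system and the
# warehouse; in practice these would be two different databases.
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE pos_txns (id INTEGER, amount REAL)")
source.executemany("INSERT INTO pos_txns VALUES (?, ?)",
                   [(1, 10.0), (2, 5.0), (3, 7.5)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE warehouse_txns (id INTEGER, amount REAL)")
target.executemany("INSERT INTO warehouse_txns VALUES (?, ?)",
                   [(1, 10.0), (2, 5.0), (3, 7.5)])

# Paired queries: the same aggregate run against both sides.
src_count, src_sum = source.execute(
    "SELECT COUNT(*), SUM(amount) FROM pos_txns").fetchone()
tgt_count, tgt_sum = target.execute(
    "SELECT COUNT(*), SUM(amount) FROM warehouse_txns").fetchone()

assert (src_count, src_sum) == (tgt_count, tgt_sum), "ETL validation failed"
print("validated", tgt_count, "rows")  # validated 3 rows
```

Row counts and control totals are the simplest paired checks; richer validations compare per-key aggregates or checksums.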

5. Extract Only New or Changed Data

Source data is often highly volatile. If the ETL process is
conducted over time, only new or changed data should be extracted upon
the second, third, fourth, etc. iterations. This will prevent the
transformation and loading of redundant data, as well as expedite ETL
processing.
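
One common way to implement this, assuming source rows carry a reliable last-modified timestamp (an assumption, and the field names are invented), is a "high-water mark" that records the newest timestamp already processed:

```python
# Incremental extraction sketch: a high-water mark ensures that later
# iterations pull only rows added or changed since the previous run.
# Field names and timestamp values are illustrative.

source_rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 150},
    {"id": 3, "updated_at": 205},
]

def extract_incremental(rows, watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_mark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_mark

# First run: everything is new.
batch1, mark = extract_incremental(source_rows, watermark=0)
# Second run: nothing has changed since the last run, so nothing moves.
batch2, mark = extract_incremental(source_rows, watermark=mark)
print(len(batch1), len(batch2))  # 3 0
```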

6. Utilize ETL Vendors for ETL Development

As befits the complexity of modern ETL projects, enterprise
planners should take advantage of ETL offerings from vendors like IBM,
Informatica, Oracle, and SAP.

Similarly, for large-scale ETL
initiatives, it may be advantageous to engage the services
of a major business consulting firm like Accenture or IBM. In addition to providing project
management expertise, these firms frequently have
relationships with ETL vendors, and have overseen projects
utilizing the vendors’ products.

7. Secure Data Before, During, and After ETL

One often overlooked aspect of the ETL process is security. Even
the best ETL design can be subverted by hackers and other miscreants if
the data being extracted, transformed, and loaded is not secure.

ETL security involves protecting data while at rest (in source and
target databases), and in transit (while moving from source to target
through the transformation engine).

Since the transformation element of ETL causes changes
in data, it may be difficult to determine if target data has
been transformed in accordance with the ETL process, or
modified by an unauthorized third party.

By conducting regular quality assurance checks – sampling
the target data at random – QA and security officials can
ascertain the likelihood of data compromise, and activate
additional security measures as appropriate.
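
One way such sampling checks might work, purely as an illustration, is to record a fingerprint of each row at load time and re-compute fingerprints on sampled target rows later; any mismatch signals modification outside the ETL process:

```python
# QA sampling sketch: row hashes captured at load time are re-checked
# against target rows later. A mismatch suggests the data was modified
# by something other than the ETL process. All details are illustrative.
import hashlib
import random

def row_hash(row):
    """Stable fingerprint of a row's contents."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

target = [{"id": i, "amount": float(i)} for i in range(100)]
recorded = {r["id"]: row_hash(r) for r in target}  # captured at load time

target[7]["amount"] = 999.0  # simulated unauthorized change

# A real check would sample a subset (e.g., k=10); the full table is
# audited here so the result is deterministic.
sample = random.sample(target, k=len(target))
tampered = [r["id"] for r in sample if row_hash(r) != recorded[r["id"]]]
print(tampered)  # [7]
```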

As to the matter of "ETL security," there is no "magic
bullet." A program of continuous enterprise security
improvement affords the best protection.

8. Protect Source Data Repositories for Business Continuity

In addition to security, business continuity is a critical concern. In
particular, enterprise officials should ensure that source data is recoverable
in the event of a disaster affecting ETL information systems.

While target data stores may contain more "actionable"
information, the ability to create – and, if necessary,
recreate – those target stores relies on the availability of
the original source data.

In addition, the same source data may be subjected over
time to multiple ETL operations, especially as the ETL
process is refined. Thus, ensuring the "continued" existence
of primary source data is a business continuity imperative.


Outlook

It is the nature of ETL best practices that they will
emerge over time, especially in response to:

  • New data types, like "e-discovery" data.
  • New data collections, like "Big Data."
  • New data states, like "data virtualization."
  • New data environments, like cloud or edge data.

Until the information management industry establishes a new,
more comprehensive set of ETL standards and practices, enterprise officials
should consider how their ETL policies and protocols might evolve to encompass
the following ETL issues.

1. Cloud Integration

As more and more applications migrate to "the cloud," the ETL process
will have to expand to extract, transform, and load both on-premises
and off-premises data. Logistics and security will be major concerns.

2. Legacy Data

The ready availability of cloud-based applications – via the
software-as-a-service (SaaS) model – is responsible for accelerating the
rate at which legacy applications are being retired and replaced by
online systems. One major ETL hurdle will be transitioning legacy
data to the new environment.

3. Big Data

The term "Big Data" refers
to the massive amounts of data being generated on a daily basis by businesses
and consumers alike – data which cannot be processed using conventional data
analysis tools owing to its sheer size and, in many cases, its unstructured
nature.

Big Data poses big ETL challenges:

  • New data characteristics, including unstructured and
    multi-structured data.
  • New analytical platforms, like Hadoop, an open-source,
    distributed programming framework that facilitates Big Data
    analytics.
  • Data in motion and data at rest.
  • New tools and techniques to analyze Big Data.9

Convinced that Big Data holds the key to improved productivity and
profitability, enterprise planners will be searching for ETL tools capable of
extracting, transforming, and loading Big Data.

4. Real-Time Warehousing

Owing to the need for instant business intelligence, the cycle time for
creating and refreshing data warehouses is nearing zero; in other words,
it is being reduced to real-time. As a consequence, the enterprise ETL
process will have to become faster, more automatic and more reliable (i.e., self-correcting).

5. Data Virtualization

Data virtualization is a data management methodology
that enables the aggregation of data from multiple, disparate sources to create
a single virtual view (often as an application "dashboard"). Data virtualization
(DV) is commonly associated with enterprise business applications including sales force automation (SFA),
customer relationship management (CRM),
enterprise resource planning (ERP),
and business intelligence (BI). Data virtualization is, itself, an ETL process, and enterprise officials will
have to incorporate DV methods into their ETL toolkits.10

6. Electronic Discovery

Electronic discovery (or e-discovery) is the process of finding information of a particular type that is stored in electronic files, databases, or
archives. The principal driver behind the development and adoption of e-discovery
technology is the US Supreme Court. On April 12, 2006, the Court approved and forwarded to
Congress a number of revisions to the Federal Rules of Civil Procedure (FRCP)
that address the preservation and discovery of data in electronic media
(e-discovery). The new rules, which went into effect on December 1,
2006, establish a new form of discoverable data called "electronically
stored information" (ESI). Since failure to extract ESI on
demand can result in civil and criminal penalties, enterprise officials
will have to improve their ETL process or risk sanctions.

7. Code Review

As with any application development initiative, the creation of
effective, efficient, and dependable ETL code relies, in large measure,
on conducting comprehensive code reviews, in which a wide range of
coding experts and application stakeholders perform a thorough
evaluation of programming methods and results. Analyst Jim Egan believes
that the Code Review Team should include the following:

  • The developer of the code
  • Senior developer(s)
  • A facilitator
  • A junior developer (optional)
  • A technical team lead (optional)
  • A QA resource (optional)
  • A business analyst (optional)

While the first two representatives – the developer of the code
and senior developer(s) – are obvious, the last five require some
explanation.

The following explanations are provided by Jim Egan’s "6 Best
Practices for Conducting ETL Code Review":

  • Facilitator – "There should be someone assigned the
    task of managing the flow of the meeting and if necessary function
    as a referee. This could be a Senior Developer or even the Team
    Lead. Dual roles are OK."
  • Junior Developer – "I like to have a junior
    developer sit in on a code review so they know what to expect when
    their code is being reviewed. This could also be an opportunity to
    educate the junior developer on how the project wants code to be
    written. Studies have shown that this doesn’t always work out well. So don’t use it as a primary reason to put them on the code review
    team."
  • Technical Team Lead – "If the other developers don’t
    have enough experience then the Technical Team Lead may need to sit
    in and possibly run the meeting."
  • QA Resource – "You may want QA in this meeting so
    they can get an idea of what they’ll be testing."
  • Business Analyst – "This person will be reviewing
    the code to ensure that it meets the requirements."11

Importantly, the code review process should be conducted at
regular intervals during code development. In this way,
potential problems can be identified – and resolved – in a timely
manner, thereby cutting – or preventing the inflation of –
development expenses.


Recommendations

Best practices, whether
ETL or otherwise, are forged through cooperation.

In fashioning a
future ETL strategy, enterprise officials should:

  1. Survey their industry
    colleagues to determine how other, similar enterprises
    are extracting, transforming, and loading vital data.

  2. Participate in information
    management user groups; again, to share ETL methods and
    experiences.

  3. Engage a major business
    consulting firm for information on the latest ETL
    techniques, especially industry-specific techniques.

  4. Stay familiar with new
    advances in ETL tools by adding their name to major
    vendors’ mailing lists.

  5. Check common sources of information technology
    standards and practices for ETL guidance, like the International
    Organization for Standardization (ISO) or the US
    National Institute of Standards and Technology (NIST).

  6. Understand that Big Data repositories will likely
    contain customers’ personally identifiable information (PII),
    resulting in the extraction, transformation, and loading
    of extremely sensitive information. ETL processes must
    be secure to prevent privacy violations and enterprise
    liability.



About the Author


James G. Barr is a leading business continuity analyst and
business writer with more than 30 years’ IT experience. A member of
"Who’s Who in Finance and Industry," Mr. Barr has designed,
developed, and deployed business continuity plans for a number of Fortune
500 firms. He is the author of several books, including How to
Succeed in Business BY Really Trying, a member of Faulkner’s Advisory
Panel, and a senior editor for Faulkner’s Security Management
Practices. Mr. Barr can be reached via e-mail at
