Data Analysis and Data Mining

by Faulkner Staff

Docid: 00016442

Publication Date: 2209

Report Type: TUTORIAL

Preview

Some estimates indicate that the amount of new information more than
doubles every two years.1 To deal with such mountains of data,
organizations often store information in a data repository that pulls from
various sources that might include in-house databases, summarized
information from internal systems, and external sources. Properly
designed, implemented, and updated, these repositories, called data
warehouses, allow managers at all levels to extract and examine
information about their company such as its products, operations, and
customers’ buying habits. A data warehouse can bring disparate data together
in a single format, supplemented by metadata, through a set of input
mechanisms known as extraction, transformation, and loading (ETL) tools;
data analysis and data mining then put that data to work. These and other
business intelligence tools enable organizations to make knowledgeable
decisions quickly, based on analysis of the data. This report discusses how
data analysis and data mining fit with other technologies, offers a view of
the current state of the market, and provides guidance to organizations
overseeing the implementation process.

Report Contents:

Executive Summary


With a central repository to keep the massive amounts of data being
constantly accrued, organizations need tools that can help them extract
the most useful information from that data.

Business Intelligence Solutions Market Trends

A data warehouse, combined with data analysis and data mining tools, can
bring disparate pieces of data together in a single format and supplement
them with metadata created through a set of input mechanisms known as
extraction, transformation, and loading (ETL) tools. These technologies are
frequently used in customer relationship management (CRM) to analyze
patterns and query customer databases. Working together, the tools enable
organizations to quickly make knowledgeable business decisions based on
sound analysis of the information.

Analysis of the data includes simple query and reporting functions,
statistical analysis, more complex multidimensional analysis, and data
mining (also known as knowledge discovery in databases, or KDD). Online
analytical processing (OLAP) is most often associated with
multidimensional analysis, which requires powerful data manipulation and
computational capabilities. However, research firm Gartner states that
predictive analytics, along with “other categories of advanced analytics,”
form the fastest-growing analytics market segment.

Definition


Gathering corporate information from different sources into a single
structure, typically an organization’s data warehouse, can facilitate
analysis of different business activities and enhance the understanding of
underlying trends. Data warehouses are usually separate from production
systems, as the production data is added to the data warehouse at
intervals that vary according to business needs and system constraints.
Raw production data must be cleaned and qualified, so it often differs from
the operational data from which it was extracted. The cleaning process may
change field names and data values in a record to make the revised version
compatible with the warehouse data rule set. This is the province of ETL
tools.
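
The sketch below is a minimal illustration of such an ETL step in Python,
with purely hypothetical file, table, and field names: rows are extracted
from an operational export, field names and values are transformed to match
an assumed warehouse rule set, and the result is loaded into a warehouse
table.

    # Minimal ETL sketch (hypothetical file, table, and field names).
    import csv
    import sqlite3

    def extract(path):
        # Extract: read rows from an operational CSV export.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(row):
        # Transform: rename operational fields and normalize values
        # to match the (assumed) warehouse rule set.
        return {
            "customer_id": row["CUST_NO"].strip(),
            "region": row["REGION_CD"].upper(),
            "sales_usd": float(row["SALES_AMT"]),
        }

    def load(rows, db_path="warehouse.db"):
        # Load: append the cleaned rows to the warehouse table.
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS sales "
            "(customer_id TEXT, region TEXT, sales_usd REAL)"
        )
        con.executemany(
            "INSERT INTO sales VALUES (:customer_id, :region, :sales_usd)",
            rows,
        )
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(r) for r in extract("operational_export.csv"))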

A data warehouse also contains metadata (structure and sources of the raw
data; essentially, data about data), the data model, rules for data
aggregation, replication, distribution and exception handling, and any
other information necessary to map the data warehouse, its inputs, and its
outputs. As the complexity of data analysis grows, so does the amount of
data being stored and analyzed, requiring ever more powerful and faster
analysis tools and hardware platforms to maintain the data warehouse.

A successful data warehousing strategy requires a powerful, fast, and
easy way to develop useful information from raw data. Data analysis and
data mining tools use quantitative analysis, cluster analysis, pattern
recognition, correlation discovery, and associations to analyze data with
little or no IT intervention. The resulting information is then presented
to the user in an understandable form. These processes are collectively
known as business intelligence (BI). Managers can choose between several
types of analysis tools including queries and reports, managed query
environments, and OLAP and its variants (ROLAP, MOLAP, and HOLAP). These
are supported by data mining, which develops patterns that may be used for
later analysis.

Business Intelligence Components

The ultimate goal of data warehousing is BI production, but analytic
tools represent only part of this process. Three basic components are used
together to prepare a data warehouse for use and to develop information
from it:

  • ETL tools, used to bring data from diverse sources together in a
    single, accessible structure, and load it into the data mart or data
    warehouse.
  • Data mining tools, which use a variety of techniques including neural
    networks and advanced statistics to locate patterns within the data and
    develop hypotheses.
  • Analytic tools, including querying tools and the OLAP variants, to
    analyze data, determine relationships, and test hypotheses about the
    data.

Analytic tools continue to grow within this framework, with the overall
goal of improving BI, improving decision analysis, and, more recently,
promoting linkages with business process management (BPM, also known as
workflow).

Data Mining

Data mining can be defined as the process of extracting data, analyzing
it from many dimensions or perspectives, then producing a summary of the
information in a useful form that identifies relationships within the
data. There are two types of data mining: descriptive, which gives
information about existing data; and predictive, which makes forecasts
based on the data.
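
As a toy contrast between the two types, the Python sketch below (using
hypothetical monthly sales figures) summarizes existing data in the
descriptive step and fits a simple trend to forecast the next period in the
predictive step.

    # Descriptive vs. predictive mining on hypothetical monthly sales.
    import numpy as np

    sales = np.array([110.0, 120.0, 125.0, 140.0, 150.0, 165.0])  # past months

    # Descriptive: information about the data we already have.
    print("mean:", sales.mean(), "std:", sales.std())

    # Predictive: a least-squares trend used to forecast the next month.
    months = np.arange(len(sales))
    slope, intercept = np.polyfit(months, sales, 1)
    print("forecast for next month:", slope * len(sales) + intercept)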

Table 1 shows common applications for data mining across industries.

Table 1. Common Applications for Data Mining Across Industries
Application                      | What is Predicted?                             | Business Decision Driven
Profiling and Segmentation      | Customer behaviors and needs by segment        | How to better target product/service offers
Cross-Sell and Up-Sell           | What customers are likely to buy               | Which product/service to recommend
Acquisition and Retention        | Customer preferences and purchase patterns     | How to grow and maintain valuable customers
Campaign Management              | The success of customer communications         | How to direct the right offer to the right person at the right time
Profitability and Lifetime Value | Drivers of future value (margin and retention) | Which customers to invest in and how best to appeal to them

Source: SAS

Basic Requirements

A corporate data warehouse or departmental data mart is useless if its data
cannot be put to work. One of the primary goals of all analytic tools is to
develop processes that ordinary employees can use in their jobs, rather than
requiring advanced statistical knowledge. At the same time, the data
warehouse and the information gained from data mining and data analysis need
to be compatible across a wide variety of systems. For this reason, products
in this arena are evolving toward ease of use and interoperability, although
both remain major challenges.

For all analytic tools, it is important to keep business goals in mind,
both in selecting and deploying tools and in using them. In putting these
tools to use, it is helpful to look at where they fit into the
decision-making processes. The five steps in decision-making can be
identified as follows:

  • Develop standard reports
  • Identify exceptions: unusual situations and outcomes that indicate
    potential problems or advantages
  • Identify causes of the exceptions
  • Develop models for possible alternatives
  • Track effectiveness

Standard reports are the results of normal database queries that tell how
the business is performing and provide details of key business factors.
When exceptions occur, the details of the situation must be easily
obtainable. This can be done by data mining, or by developing hypotheses
and testing them using analytic tools such as OLAP. The conclusions can
then be tested using “what-if” scenarios with simple tools such as
spreadsheet applications. When a decision is made and action is taken, the
results must then be traced so that the decision-making process can be
improved.

Although sophisticated data analysis may require the help of specialized
data analysts and IT staff, the true value of these tools lies in the fact
that they are coming closer to the user. The “dashboard” is becoming the
leading user interface, with products such as the Informatica PowerCenter,
Oracle’s Hyperion Essbase, SAS Enterprise Miner, and Microsoft SQL Server
Analysis Services designed to provide easily customizable personal
dashboards.

One of the recurring challenges for data analysis managers is to disabuse
executives and senior managers of the notion that data analysis and data
mining are business panaceas. Even when the technology might promise
valuable information, the cost and the time required to implement it might
be prohibitive.

Current View


The advanced analytics market is growing. IDC expects that revenue for
big data and business analytics software delivered via the public cloud
will grow 32.3 percent and represent more than 44 percent of the total BDA
software opportunity in 2022,2 while MarketsandMarkets predicts
the global advanced analytics market size to grow from USD 33.8 billion in
2021 to USD 89.8 billion by 2026, at a compound annual growth rate (CAGR)
of 21.6 percent.3

The current market comprises two distinct types of data analysis vendors:
traditional BI, where analysis is but one part of the product, and a genre
known as advanced analytics platforms. While enterprise vendors such as
SAS, SAP, IBM, Microsoft, and Oracle have products in both categories,
vendors of standalone products include Qlik and Tibco in the BI arena, and
RapidMiner, KNIME, Alteryx, and Revolution Analytics (acquired in 2015 by
Microsoft) in advanced analytics platforms.

The analytic sector of BI can be further broken down into two general
areas: query and analysis, and data mining. It is important to bear in mind
the distinction, because these areas are often confused. Data analysis
looks at existing data and applies statistical methods and visualization
to test hypotheses about the data and discover exceptions. Data mining
seeks trends within the data, which may be used for later analysis. It is,
therefore, capable of providing new insights into the data, which are
independent of preconceptions.

According to advanced analytic platform vendor Alteryx, the past several
years have seen analytics move from IT teams and coders to data analysts
and business experts. This move has changed the requirements of users,
resulting in increased functionality in products devoted to data analysis
and mining.

Data Analysis

Data analysis is concerned with a variety of different tools and methods
that have been developed to query existing data, discover exceptions, and
verify hypotheses. These include:

Queries and Reports. A query is simply a question
put to a database management system, which then generates a subset of data
in response. Queries can be basic (e.g., show me Q3 sales in Western
Europe) or extremely complex, encompassing information from a number of
data sources, or even a number of databases stored within dissimilar
programs (e.g., a product catalog stored in an Oracle database, and the
product sales stored under Salesforce). A well-written query can extract a
precise piece of information; a sloppy one may produce huge quantities of
worthless or even misleading data.

Queries are often written in structured query language (SQL), a
product-independent command set developed to allow cross-platform access
to relational databases. Queries may be saved and reused to generate
reports, such as monthly sales summaries, through automatic processes, or
simply to assist users in finding what they need. Some products build
dictionaries of queries that allow users to bypass knowledge of both
database structure and SQL by presenting a drag-and-drop query-building
interface. Query results may be aggregated, sorted, or summarized in many
ways. For example, SAP’s BusinessObjects unit offers a number of built-in
business formulas for queries.
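
To make the idea concrete, the hedged sketch below runs an aggregating query
with Python’s built-in sqlite3 module against the hypothetical sales table
from the earlier ETL sketch; the SQL itself is deliberately
product-independent.

    # Hypothetical query: total sales by region from the warehouse table.
    import sqlite3

    con = sqlite3.connect("warehouse.db")
    rows = con.execute(
        """
        SELECT region, SUM(sales_usd) AS total_sales
        FROM sales
        GROUP BY region
        ORDER BY total_sales DESC
        """
    ).fetchall()
    for region, total in rows:
        print(f"{region}: {total:,.2f}")
    con.close()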

The presentation of the data retrieved by the query is the task of the
report. Presentations may encompass tabular or spreadsheet-formatted
information, graphics, cross tabulations, or any combination of these
forms. A rudimentary reporting product might simply show the results in a
comprehensible fashion; more elegant output is usually advanced enough to be
suitable for inclusion in a glossy annual report. Some
products can run queries on a scheduled basis and configure those queries
to distribute the resulting reports to designated users through email.
Reporting products routinely produce HTML output and are often accessible
through a user’s Web browser.

Managed Query Environments. The term managed query
environment has been adopted by the industry to describe a query and
reporting package that allows IT control over users’ access to data and
application facilities in accordance with each user’s level of expertise
and business needs. For example, in some organizations, IT may build a set
of queries and report structures and require that employees use only the
IT-created structures; in other organizations, and perhaps within other
areas of the same organization, employees are permitted to define their
own queries and create custom reports.

A managed report environment (MRE) is a type of managed query
environment. It is a report design, generation, and processing environment
that permits the centralized control of reporting. To users, an MRE
provides an intelligent report viewer that may contain hyperlinks between
relevant parts of a document or allow embedded OLE objects such as Excel
spreadsheets within the report. MREs have familiar desktop interfaces; for
example, SAP’s BusinessObjects tabbed interface allows employees to handle
multiple reports in the same way they would handle multiple spreadsheets
in an Excel workbook.

Some MREs can handle the scheduling and distribution of reports, as well
as their processing. For example, SAP’s Crystal Reports can develop
reports about previously created reports.

Online Analytical Processing (OLAP). The most
popular technology in data analysis is OLAP. OLAP servers organize data
into multidimensional hierarchies, called cubes, for high-speed data
analysis. Data mining algorithms scan databases to uncover relationships
or patterns. OLAP and data mining are complementary, with OLAP providing
top-down data analysis and data mining offering bottom-up discovery. A
simple example is provided in Figure 1.

Figure 1. A Simple OLAP Setup

Source: Wikimedia Commons

OLAP tools allow users to drill down through multiple dimensions to
isolate specific data items. For example, a hypercube (the
multidimensional data structure) may contain sales information categorized
by product, region, salesperson, retail outlet, and time period, in both
units and dollars. Using an OLAP tool, a user need only click on a
dimension to see a breakdown of dollar sales by region, to view an analysis
of units by product, salesperson, and region, or to examine a particular
salesperson’s performance over time.
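
The sketch below is not an OLAP server, but it approximates the same slicing
idea with a pandas pivot table over hypothetical sales data: the multi-level
index plays the role of a small cube, and selecting an index level stands in
for drilling down by region and product.

    # A rough stand-in for OLAP-style slicing (hypothetical data).
    import pandas as pd

    sales = pd.DataFrame({
        "region": ["West", "West", "East", "East"],
        "product": ["A", "B", "A", "B"],
        "salesperson": ["Kim", "Lee", "Kim", "Lee"],
        "units": [120, 80, 95, 60],
        "dollars": [2400.0, 2000.0, 1900.0, 1500.0],
    })

    # The pivot table acts as a small "cube" of units and dollars.
    cube = pd.pivot_table(
        sales,
        values=["units", "dollars"],
        index=["region", "product"],
        aggfunc="sum",
    )
    print(cube)              # dollar and unit sales by region and product
    print(cube.loc["West"])  # "drill down" into the West region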

Information can be presented in tabular or graphical format and
manipulated extensively. Since the information is derived from summarized
data, it is not as flexible as information obtained from an ad hoc query;
most tools offer a way to drill down to the underlying raw data. For
example, IBM Cognos Business Insight provides users with the ability to
query the database for the records in question.

Although each OLAP product handles data structures and manipulation in
its own way, an OLAP API standardizes many important functions and allows
IT to offer the appropriate tool to each of its user groups. The MD-API
specifies how an OLAP server and client connect, and it defines metadata,
data fetch functions, and methods for handling status messages. It also
standardizes filter, sort, and cube functions; compliant clients are able
to communicate with any vendor’s compliant server.

OLAP Variants

OLAP is divided into multidimensional OLAP (MOLAP), relational OLAP
(ROLAP), and hybrid OLAP (HOLAP).

ROLAP can serve as a powerful DSS product in its own right, as well as
aggregate and pre-stage multidimensional data for MOLAP environments. ROLAP
products optimize data for multidimensional analysis using standard
relational structures. The advantage of the MOLAP paradigm is that it can
natively incorporate algebraic expressions to handle complex, matrix-based
analysis. ROLAP, on the other hand, excels at manipulating large data sets
and at data acquisition, but it is limited to SQL-based functions. Since
most organizations require both complex analysis and analysis of large data
sets, it may be necessary to develop an architecture and a set of user
guidelines that enable implementation of both ROLAP and MOLAP where each is
appropriate.

HOLAP is the newest step in the ongoing evolution of OLAP. HOLAP combines
the benefits of both ROLAP and MOLAP by storing only the most often used
data in multidimensional cube format and processing the rest of the
relational data in the standard on-the-fly method. This provides good
performance in browsing aggregate data, but slower performance in
“drilling down” to further detail.

Data Mining

Databases are growing in size to a stage where traditional techniques for
analysis and visualization of the data are breaking down. Data mining and
KDD are concerned with extracting models and patterns of interest from
large databases. Data mining can be regarded as a collection of methods
for drawing inferences from data. The aims of data mining and some of its
methods overlap with those of classical statistics. It should be kept in
mind that neither data mining nor statistics is a business solution in
itself; both are enabling technologies. Additionally, there are still some
philosophical and methodological differences between them.

This field is growing rapidly, due in large part to the increasing
awareness of the potential competitive business advantage of using such
information. Important knowledge has been extracted from massive
scientific data, as well. What counts as useful information depends on the
application. Individual records in a data warehouse are useful for daily
operations, as in online transaction processing and traditional database
queries. Data mining, by contrast, is concerned with extracting more global
information that is generally a property of the data as a whole. Thus, the
diverse goals of data mining algorithms include clustering the data items
into groups of similar items, finding an explanatory or predictive model for
a target attribute in terms of other attributes, and finding frequent
patterns and sub-patterns, as well as trends, deviations, and interesting
correlations between the attributes.

A problem is first defined; then data sources and analytic tools are
selected to decide the best way to approach the data, a step that involves a
wide variety of choices. The model development and deployment process is
diagrammed by SmartDrill, a data mining services provider, in Figure 2.

Figure 2. Typical Data Mining Process for Predictive Modeling

Source: SmartDrill

Decision trees and decision rules are frequently the basis for data
mining. They utilize symbolic and interpretable representations when
developing methods for classification and regression. These methods have
been developed in the fields of pattern recognition, statistics, and
machine learning. Symbolic solutions can provide a high degree of insight
into the decision boundaries that exist in the data and the logic
underlying them. This aspect makes these predictive mining techniques
particularly attractive in commercial and industrial data mining
applications.
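
A brief, hedged illustration using scikit-learn (assumed to be available): a
shallow decision tree is fit to a sample dataset and its rules are printed,
showing the kind of symbolic, interpretable decision boundaries described
above.

    # Fit a shallow, interpretable decision tree and print its rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(iris.data, iris.target)

    # The printed rules expose the decision boundaries in the data.
    print(export_text(model, feature_names=iris.feature_names))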

Applying machine-learning methods to inductively construct models of the
data at hand has also proven successful. Neural networks have been applied
in a wide range of supervised and unsupervised learning applications. They
are less commonly used for routine data mining tasks, however, because they
are among the most likely to produce incomprehensible results and to require
long training times. Some neural-network learning algorithms do exist that
are able to produce good models without excessive training times.
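
For comparison, a small supervised sketch with scikit-learn’s multilayer
perceptron (again assumed to be available): the fitted network can score
well on held-out data, but unlike the decision-tree rules above, its weights
are not directly interpretable.

    # A small neural-network learner on a sample dataset.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Scale the inputs, then fit a small multilayer perceptron.
    net = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0),
    )
    net.fit(X_train, y_train)
    print("held-out accuracy:", round(net.score(X_test, y_test), 3))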

In recent years, significant interest has developed in adapting numerical
and analytic techniques from statistical physics to provide algorithms and
estimates for good approximate solutions to hard optimization problems.
Cluster analysis is an important technique in exploratory data analysis,
because there is no prior knowledge of the distribution of the observed
data. Partitional clustering methods, which divide the data according to
natural classes present in it, have been used in a large variety of
scientific disciplines and engineering applications. The goal is to find a
partition of a given data set into several compact groups. Each group
indicates the presence of a distinct category in the measurements.
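
A minimal partitional-clustering sketch with scikit-learn, using synthetic
two-dimensional data in place of real measurements: k-means divides the
unlabeled observations into two compact groups, each suggesting a distinct
category.

    # Partitional clustering with k-means on synthetic data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two synthetic "natural classes" around different centers.
    data = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
        rng.normal(loc=3.0, scale=0.5, size=(100, 2)),
    ])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    print("cluster centers:\n", km.cluster_centers_)
    print("first ten labels:", km.labels_[:10])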

In all data mining applications, results are considerably subject to
interpretation, since the process is a search for trends and correlations
rather than an examination of hypotheses based on known real-world
information. The potential for spurious results is large, and there are many
cases where the information developed will be of little real value for
business purposes. Nonetheless, when pay dirt is struck, the results can be
extremely useful.

Interest in data mining is growing, and it has recently been spotlighted
by attempts to root out terrorist profiles from data stored in government
computers. In a more mundane, but lucrative application, SAS uses data
mining and analytics to glean insight about influencers on various topics
from postings on social networks such as Twitter, Facebook, and user
forums, and to detect fraud in banking and insurance.

Trends

Regulations

The big news in regulations is the European Union (EU) General Data
Protection Regulation (GDPR). GDPR’s intent is to protect all EU citizens
from privacy and data breaches, but it complicates how organizations
approach data and affects global data use, since it applies to any data that
touches the EU, regardless of where in the world the data originates or is
captured and analyzed. One silver lining is that GDPR can help organizations
improve their data management capabilities; by enforcing how data is handled
across an enterprise, it strengthens their ability to compete.

Data Mining and CRM

CRM is a technology that relies heavily on data mining. Comprising sales,
marketing, and service, CRM applications use data mining techniques to
support their functionality. Combining the two technology segments is
sometimes referred to as “customer data mining.” Proponents claim that
positive results of customer data mining include improvements in prospecting
and market segmentation; increases in customer loyalty, cross-selling, and
up-selling; reduced risk management needs; and the optimization of media
spending on advertising.

Big Data

Data has grown – and continues to grow – exponentially, and so the
catchphrase used to describe a massive volume of data is, not surprisingly,
big data. Some analysts claim that the term is becoming passé and that big
data is now simply data. The term is also used, however, to describe the
tools and processes required to handle and store data that is generally too
large for traditional processes to manage.

Some analysts opine that the increase in big data has resulted in a surge
of cloud services, such as Amazon Web Services. The future of big data may
see more categories emerge, including actionable data and fast data. Its
penetration into the business environment will also give smart BI the
capability to automate decision making.

Algorithms

Some analysts opine that algorithms will become the Next Big Thing in
data analysis. Microsoft defines an algorithm as “a set of heuristics and
calculations that creates a model from data. To create a model, the
algorithm first analyzes the data you provide, looking for specific types
of patterns or trends… [then] uses the results of this analysis over
many iterations to find the optimal parameters for creating the mining
model. These parameters are then applied across the entire data set to
extract actionable patterns and detailed statistics.”

Hadoop and Spark

Apache Hadoop is an open-source software framework. It allows for
distributed processing of big data on large clusters of commodity hardware
and scales from a single server to thousands of machines, providing both
massive data storage and fast processing.

Recently, however, one of its former components has become a popular big
data platform in its own right: Apache Spark. Its originator, Matei Zaharia,
claims that Spark’s data processing speed is much faster than Hadoop’s and
that Spark is “the largest big data open-source project.” Whether Hadoop’s
decline in popularity is due to Spark or to other platforms remains to be
seen.
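
For a rough sense of the programming model, the sketch below assumes PySpark
is installed and uses a hypothetical input file: the same few lines read and
aggregate a CSV whether Spark runs on a laptop or on a large cluster.

    # Minimal PySpark sketch (hypothetical file path and field names).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-rollup").getOrCreate()

    # Spark reads and aggregates the data in parallel across the cluster.
    df = spark.read.csv("sales_events.csv", header=True, inferSchema=True)
    rollup = df.groupBy("region").agg(F.sum("sales_usd").alias("total_sales"))
    rollup.show()

    spark.stop()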

Streaming Analytics

Streaming analytics can be defined as analytic platforms able to generate
insights from data in real time. Just as traditional analytics tools extract
business value from data at rest, these platforms extract value from data in
motion. In addition to platforms available from enterprise vendors such as
Software AG, IBM, and SAP, standalone vendors Tibco, Informatica, and Vitria
offer streaming analytic platforms. Many hosted streaming analytic offerings
are also available, from vendors including Amazon (whose Kinesis was
released in November 2013) and Google (whose Cloud Dataflow was released in
August 2015).
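
As a vendor-neutral sketch of the “data in motion” idea, the Python
generator below maintains a running aggregate over events as they arrive,
rather than querying a table at rest; the event source and field names are
purely illustrative.

    # Rolling aggregate over a stream of events (illustrative only).
    from collections import deque

    def rolling_average(events, window=100):
        recent = deque(maxlen=window)
        for event in events:
            recent.append(event["value"])
            # Each incoming event immediately updates the insight.
            yield sum(recent) / len(recent)

    # Example: a simulated event stream.
    stream = ({"value": v} for v in range(1, 11))
    for avg in rolling_average(stream, window=3):
        print(round(avg, 2))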

Data Lakes

Data lakes are growing in popularity as a big data storage tool. A data
lake can be defined as a storage repository that holds a vast amount of
raw data in its native format, whether structured, semi-structured, or
unstructured. This is different from a data warehouse, where the data is
structured and processed before storing. With a data lake, the data
structure does not need to be defined until the data is used.
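
A rough “schema-on-read” sketch, with hypothetical paths and field names:
raw, semi-structured files sit in the lake untouched, and a structure is
imposed only at the moment the data is analyzed.

    # Schema-on-read: structure is applied at analysis time, not load time.
    import json
    import pandas as pd

    # Raw, semi-structured events stored in their native format.
    with open("lake/clickstream/2024-01-01.json") as f:
        events = [json.loads(line) for line in f]

    # Structure is defined here, when the data is used.
    frame = pd.json_normalize(events)[["user_id", "page", "timestamp"]]
    print(frame.head())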

Data lakes have proven successful in their use as storage for enormous
data quantities; however, some analysts claim that it is more difficult to
gain actionable insights from the data. Rather than letting the data rest
in data lakes, these analysts claim that the analysis should take place in
real time, before the data becomes “stale.”

Artificial Intelligence (AI)

No longer the stuff of sci-fi movies, AI is finding renewed popularity.
Such technologies as facial recognition, medical diagnoses, and even
self-driving automobiles are all based on AI, and most analysts expect
this genre to expand rapidly in the next several years. Conversational AI,
also known as chatbots, has grown in popularity, with many enterprises
developing chatbots to improve their customer service.

Recommendations


The 12 Rules

In 1993, E.F. Codd, S.B. Codd, and C.T. Salley presented a paper entitled
“Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT
Mandate” that offered 12 rules for evaluating analytical processing tools.
These rules are essentially a list of “must haves” in data analysis,
focusing on usability, and they continue to be relevant in evaluating
analytic tools:

  • Multidimensional Conceptual View
  • Transparency
  • Accessibility
  • Consistent Reporting Performance
  • Client/Server Architecture
  • Generic Dimensionality
  • Dynamic Sparse Matrix Handling
  • Multi-User Support
  • Unrestricted Cross-Dimensional Operations
  • Intuitive Data Manipulation
  • Flexible Reporting
  • Unlimited Dimensions and Aggregation Levels

Since data analysis is such a key method for developing knowledge from
the huge amounts of business data collected and stored each day,
enterprises need to select their data analysis tools with care. This will
help ensure that the tools’ strengths match the needs of their business.
Organizations must be aware of how the tools are to be used and their
intended audience. It is also important to consider the Internet, as well
as the needs of mobile users and power users, and to assess the skills and
knowledge of the users and the amount of training that will be needed to
get the most productivity from the tools.

Visual tools are very helpful in representing complex relationships in
formats that are easier to understand than columns of numbers spread
across a screen. Key areas of discovery found with visual tools can then
be highlighted for more detailed analysis to extract useful information.
Visual tools also offer a more natural way for people to analyze
information than does mental interpretation of a spreadsheet.

Organizations should also closely consider the tool interface presented
to users, because an overly complex or cluttered interface will lead to
higher training costs, increased user frustration, and errors. Vendors are
trying to make their tools as friendly as possible, but decision-makers
should also consider user customization issues, because a push-button
interface may not provide the flexibility their business needs. When
considering their data analysis and storage processes, companies need to
determine which approach is best: a multidimensional approach, a relational
approach, or a hybrid of the two.

While data analysis tools are becoming simpler, more sophisticated
techniques will require specialized staff. Data mining, in particular, can
require added expertise because results can be difficult to interpret and
may need to be verified using other methods.

Data analysis and data mining are part of BI, and require a strong data
storage strategy in order to function. This means that attention needs to
be paid to the more mundane aspects of ETL, as well as to advanced
analytic capacity. The final result can only be as good as the data that
feeds the system.

References

1 “The Amount of Data in the World Doubles Every Two Years.”
Medium.com. October 7, 2020.

2 “IDC Forecasts Revenues for Big Data and Business Analytics
Solutions Will Reach $189.1 Billion This Year with Double-Digit Annual
Growth Through 2022.” IDC.com. April 4, 2020.

3 “Advanced Analytics Market by Component (Solutions and
Services), Business Function (Sales & Marketing, Operations), Type
(Big Data Analytics, Risk Analytics), Deployment Mode (On-premises and
Cloud), Vertical and Region – Global Forecast to 2026.” MarketsandMarkets.
February 2022.
