Dark Data


by Geoff Keston

Docid: 00021086

Publication Date: 2007

Report Type: TUTORIAL


The massive volume of data created through e-mails, texts, multimedia,
social media, and other unstructured sources is a significant and growing
problem for enterprises. While this “dark data” might contain useful
insights about customers, for example, it cannot be easily found,
analyzed, or secured by typical tools and processes. New approaches for
managing it are being developed – ones that not only enhance perspectives
on customer opinions and industry trends but also address the threats
presented by unsecured dark data and minimize them to reduce security risks.

Executive Summary

For enterprises, the data created by employees sending text messages,
chatting through customer service apps, and posting to
corporate social media sites presents two problems.

First, the data cannot easily provide business analysis. While the data
might contain useful insight about customer perceptions, for example, it
can’t be easily searched, making that insight hard to uncover. Second, the
data can’t be effectively secured and may reside outside corporate
defenses. Because it isn’t categorized in any way, applying regulatory or
privacy policies to it is difficult.

These problems are already large and widespread, and as enterprises use a
wider range of communications tools, the risks are growing worse. There
are products on the market to help analyze and secure dark data, but no
single tool solves all problems and significant human knowledge is needed
to use them.

With these challenges in mind, enterprises can most effectively manage
dark data as a business process problem. Big Data analysis and artificial
intelligence are part of the answer, but they need to be complemented by
policies that control how employees create data and procedures such as
internal audits that look for information that software tools can’t find.


Corporate information that cannot be easily searched and managed is called
“dark data.” A short list of its sources includes:

  • Text messages
  • Chat transcripts
  • Event videos
  • Social media posts
  • Pictures
  • Audio files
  • Voicemails

Dark data is also called “unstructured” to contrast it with the structured
contents of a database. In a database, information is categorized, for
example, as being a name or address. This structure makes data easy to find.
Administrators can use a database search tool to identify all people over 40
who live in St. Louis or all financial transactions in 2019 over $10,000.
Further, databases are created using specialized software with built-in
security, and the software is installed on hardware that, because of its
importance, is secured with physical door locks and other mechanisms.
Unstructured data, on the other hand, isn’t categorized. Without studying
each piece of data, organizations can’t know key facts about it, such as
what customer it involves or what regulations pertain to its handling.
Often, even the existence of the data may be hard to determine.
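
The contrast between structured and unstructured data can be illustrated with a short sketch. The table, names, and messages below are invented for illustration and are not drawn from any real system. The structured query answers the "over 40 in St. Louis" question in one statement, while a keyword search over free text finds only the message that happens to spell the city out and cannot evaluate age at all.

```python
import sqlite3

# Structured data: every value sits in a named, typed column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Ana Ruiz", 52, "St. Louis"),
        ("Ben Cole", 34, "St. Louis"),
        ("Cy Dunn", 47, "Chicago"),
    ],
)

# One declarative query answers "everyone over 40 who lives in St. Louis".
structured_hits = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM people WHERE age > 40 AND city = ? ORDER BY name",
        ("St. Louis",),
    )
]

# Unstructured data: similar facts buried in free text (invented messages).
messages = [
    "Ana Ruiz called from St. Louis, says she just turned 52.",
    "ben c. (mid-thirties?) pinged us on chat, he's in STL",
]

# A naive keyword search matches only the message that spells the city out,
# and it has no way to apply the age condition.
keyword_hits = [m for m in messages if "St. Louis" in m]

print(structured_hits)
print(len(keyword_hits), "of", len(messages), "messages matched")
```

The second message refers to the same city by an abbreviation, so the keyword search misses it. Gaps of this kind are what keep unstructured data dark.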

There are additional problems caused by this lack of visibility:

  • Ownership of information may be unclear
  • Data is hard to classify and analyze
  • Data may reside outside of normal enterprise security mechanisms
  • Backups may not be performed

Current View

Starting several years ago, as enterprises began using unstructured data
platforms like social media more extensively, the fundamental nature of
corporate knowledge management changed. “This is a different world we’re
living in,” said digital strategy specialist Aashish Chandra in 2013.1
Since then, the problem has persisted. In a 2019 study of
organizations in the United States, Europe, and Asia, 60 percent of
respondents said that at least half of their data could be considered
dark.
And the importance of coping with dark data has grown, in particular
because of the European data handling law called the General Data
Protection Regulation, or GDPR. This regulation puts greater data
management demands on organizations, including those outside of Europe
that interact with European companies.

Commercial interest in tools for managing dark data is emerging gradually,
as can be seen, for example, in Apple’s acquisition of artificial
intelligence software developer Lattice Data, whose products aim to convert
unstructured data into structured data.4 And dark data’s use in
particular fields is also emerging slowly. For example, in law, the presence
of this new source of evidence (or concealed evidence) has yet to be fully
understood. It can “complicate judicial decision-making,” writes Daniel J.
Grimm in the American University Law Review.5 And
organizations can put themselves at legal or regulatory-violation risk
simply by accident, Grimm explains, because of how they handle data.

The problem of dark data is increasing at the same time that artificial
intelligence technology is improving, providing more tools for discovering
and analyzing unstructured information. Whereas conventional data analysis
tools are effective for structured data, such as is entered into defined
fields, AI can process less predictable content, such as human natural
language. But AI cannot process this unstructured data automatically.
Instead, it must “learn” or “be trained” to make sense of this content.6
And this learning process works better when larger amounts of data are
available. Not only is AI helping organizations to analyze dark data, but
conversely, dark data is helping AI technology develop by providing it
with unstructured data from which to learn.7 “Give AI more
information to analyze and it can produce deeper, more accurate insights,”
says a report from Splunk, which sells AI-based technology.8
“Today’s dark data could one day be an accelerant for even greater AI
performance. Thus, the advent of AI and the value of dark data go

But many organizations report not having a good understanding of AI, so they
face a learning curve in taking advantage of these new technologies.9
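
The idea that AI must be "trained" before it can make sense of natural language can be sketched with a very small text classifier. This is a minimal illustration only, assuming a naive Bayes approach and four invented training messages. It is not the method used by any product mentioned in this report, and real systems train on far larger data sets.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split free text into lowercase words."""
    return text.lower().split()


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented, hand-labeled examples standing in for customer messages.
training = [
    ("my order arrived broken and late", "complaint"),
    ("refund please the product is broken", "complaint"),
    ("great service thank you so much", "praise"),
    ("love the product great support", "praise"),
]
model = train(training)
prediction = classify("the item arrived broken", *model)
print(prediction)
```

The classifier knows nothing until it has counted words in labeled examples, and its accuracy depends on how many examples it has seen. That dependency is why larger volumes of dark data can make such tools more effective.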


Dark data will continue to proliferate and become more diverse.
Developments in technology, including the low cost of storage and the
ease with which data is created, will drive this expansion. As a result,
the emerging market for products to manage dark data is likely to grow.
One recent forecast predicted a 21.7 percent compound annual growth rate
between 2020 and 2025.10
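
To put the forecast in perspective, a compound annual growth rate can be converted into an overall multiple. The sketch below assumes five year-over-year compounding periods between 2020 and 2025. The forecast's actual base-year conventions are not stated here, so the result is indicative only.

```python
def compound_growth_multiple(annual_rate, years):
    """Overall growth multiple after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years


rate = 0.217          # 21.7 percent, from the forecast cited above
years = 2025 - 2020   # assumption: five compounding periods
multiple = compound_growth_multiple(rate, years)

print(f"Market size multiple after {years} years: {multiple:.2f}x")
```

Under these assumptions, the market would be roughly 2.7 times its 2020 size by 2025.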

The problem of unstructured data has been largely created by the development
of new technologies, and the solution to better managing the data can also
be found, in part, in technology. There are many new tools to process data.
“Big data requires a lot of computational power,” says IBM.11
“The advent of cloud computing is making it possible to handle large data
sets using several networked devices to create vast computing
grids….Unlike traditional parallel processing, this ‘device mesh’ can
distribute and analyze information across several end points.” Another
technology development is that analytical software has become much more
powerful and feature-rich. IBM specifically mentions such AI technologies as
cognitive computing, natural language processing, and machine learning.

While there are tools, analyzing unstructured data remains hard and often
requires expertise. Enterprises must shoulder much of the burden of
choosing the tools that suit their specific needs and environment, and
problems such as false positives remain common. The future of dark data
will be shaped largely by how fully these analytical tools mature. In
particular, the following issues will be critical:

  • Can tools provide relevant business information?
  • What types of data can they analyze?
  • How much specialized knowledge is required to use them?

Over time, the technical limitations of data analytics tools may be
overcome, at least in part. But it is likely that in most enterprises some
data will remain dark, or at least dim. Enterprises will continue playing
catch up as new forms of dark data are created on new platforms, and the
effort needed to expose this information will be worthwhile only for some

It is quite likely that many enterprises will fall far short of
making effective use of dark data. One analysis by Gartner predicted that
more than 80 percent of enterprises will, by 2021, “fail to develop a
consolidated data security policy across silos, leading to potential
noncompliance, security breaches and financial liabilities.”12 And
another study found that the problem affects both corporations and
government agencies.13


Choose the Dark Data to Expose

Not all dark data is valuable to a business, and not all of it is a security
vulnerability. Instead of trying to bring it all into the light – a
time-consuming, expensive, and probably impossible goal – it is typically
better to focus on:

  • the most valuable data for business analysis,
  • information that presents a security risk.

Determining what data fits into these categories is difficult, however.
Performing spot checks of various types and sources of data can help an
enterprise identify where value might be found and where security holes
lurk. Over time, enterprises can develop formal metrics to categorize such
data and measure the cost-benefit ratio of exposing it versus leaving it
dark. In the process, organizations can also create new policies about
data handling to fix recurring problems, such as private information being
improperly shared. Once spot checks identify where and how dark data is
being created, an enterprise can consider which tools to use.
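
One way to turn spot-check findings into a simple metric is to score each data source and rank the results. The sketch below is hypothetical: the 0-to-5 scales, the sample scores, and the formula (value plus risk, divided by cost) are assumptions chosen for illustration, and each enterprise would need to define its own measures.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    business_value: int  # 0-5, estimated from spot checks (assumed scale)
    security_risk: int   # 0-5, estimated from spot checks (assumed scale)
    exposure_cost: int   # 1-5, effort needed to find and analyze the data

    def priority(self) -> float:
        """Benefit of exposing the data relative to the cost of doing so."""
        return (self.business_value + self.security_risk) / self.exposure_cost


# Invented scores for three of the source types discussed in this report.
sources = [
    DataSource("customer chat transcripts", 5, 3, 2),
    DataSource("event videos", 2, 1, 5),
    DataSource("employee text messages", 2, 5, 3),
]

ranked = sorted(sources, key=lambda s: s.priority(), reverse=True)
for source in ranked:
    print(f"{source.name}: {source.priority():.2f}")
```

Sources near the top of the ranking are candidates for exposure, while those near the bottom may reasonably be left dark.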

Technology writer Will Kelly recommends that organizations “[s]tart small
with a dark data pilot project.”14 “A fast data approach, or the
application of analytics to smaller data sets in the organization’s
application pipeline, enables IT teams to extract real-time, actionable
information. For example, seek out the best dark data sources that can
help analyze a specific organizational bottleneck. Another option is to
tackle one unanalyzed data source at a time.”

“Document all lessons learned,” Kelly advises. “Test and use key
performance indicators to assess results….As you learn more, gradually
scale up dark data analytics efforts.”

Identify the Tools to Use

Dark data comes in many forms, from text messages to video, so managing it
all with one tool or one process is unlikely. Instead, enterprises may need
to choose multiple tools and develop new processes. The tools that are best
for a particular enterprise will depend on many factors, including its own
business needs and the other technology it uses.

A good place to start is to consider the structured data in use, as
there are often potential connections between it and unstructured data.
“In a majority of cases, unstructured data is ultimately related back to
the company’s structured data records,” writes Mary E. Shacklett,
president of research company Transworld Data.15 “As an
example, every x-ray or MRI image for a patient is related back to the
patient’s record in the hospital’s record system. The patient record in
the record system is enriched with unstructured data that is linked to it,
and the doctor gets a more complete picture of the patient.”
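
The link between unstructured files and structured records can be sketched as follows. The patient IDs, file names, and naming convention are all hypothetical. The sketch assumes that an identifier is embedded in each file name, which is often not the case in practice.

```python
import re

# Structured records keyed by a patient ID (invented data).
patients = {
    "P1001": {"name": "J. Doe", "images": []},
    "P1002": {"name": "R. Roe", "images": []},
}

# Unstructured image files; the naming convention is an assumption.
image_files = [
    "xray_P1001_2020-03-14.dcm",
    "mri_P1002_2020-04-02.dcm",
    "scan_unlabeled_0007.dcm",
]

ID_PATTERN = re.compile(r"P\d{4}")

unlinked = []
for filename in image_files:
    match = ID_PATTERN.search(filename)
    if match and match.group() in patients:
        # Enrich the structured record with a pointer to the unstructured file.
        patients[match.group()]["images"].append(filename)
    else:
        # No usable identifier: this file stays dark until someone reviews it.
        unlinked.append(filename)

print(patients["P1001"]["images"])
print(unlinked)
```

Files that cannot be tied back to a record remain dark and are natural candidates for manual review.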

By considering the ways it uses structured data, an organization can
choose tools that focus on the most relevant types of unstructured content
rather than looking blindly for information that might happen to be

Limit and Manage the Creation of Unstructured Data

A major reason dark data creates so many problems is its sheer volume.
The problems of dark data can be reduced – but not eliminated – by
controlling how data is created. Most organizations will find the
technologies that create text messages and Web chats to be too beneficial
to stop using. Yet while dark data therefore can’t be eliminated, its
growth can be slowed and its diversity can be limited. To this end,
some key policies a business might enforce include:

  • what unstructured data sources can be used,
  • what types of information can be shared on unstructured media,
  • what obligations employees have for reporting potential data problems.

Another way to minimize the creation of unstructured information is
to avoid rushing into projects that create large volumes of data
without having a plan for handling it. This can happen, for example, if a
company launches a Big Data project, explains Randy Kerns of Evaluator
Group, which consults with companies about information management.16
He says that the firm has had “a significant number of … clients that
have [the] introduction of some big data projects that came from outside.
IT wasn’t part of it — that just got dumped on them — and so they end up
with some huge spike they have to deal with.”

Develop Policies for Data that Remains Dark

Some data will inevitably remain dark — because it cannot be uncovered
efficiently, affordably, or at all — so enterprises must consider how to
cope with that problem. “Perhaps the biggest challenge when working with
dark data is simply getting access to it, as it’s often stored in siloed
repositories close to where the data is being collected,” says Dan Cech of
data management platform company Grafana Labs.17 “Additionally,
it may be stored in systems and formats that are difficult to query or
have limited analytics capabilities.”

One key step in working with dark data is to routinely research new
analytics tools, which may be able to expose data that previously couldn’t
be found. “[A] regular inventory requires understanding where dark data
resides, how it’s stored, how it’s protected and what kinds of access
controls help maintain its security,” says IT expert Ed Tittel.18
“Most organizations of any size conduct periodic security audits, evaluating
risks, exposures, incident response and policy. Dark data needs to be folded
into this process and visited sufficiently often to manage risks of exposure
as well as potential loss or harm.”
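
A first pass at the kind of inventory described above can be sketched with a directory scan. This is a minimal example covering only where files reside and how much space each file type occupies. The function name and sample files are invented, and a real inventory would also need to record protection and access controls, which this sketch does not attempt.

```python
import os
import tempfile
from collections import defaultdict


def inventory(root):
    """Summarize files under root by extension: file count and total bytes."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += os.path.getsize(path)
    return dict(summary)


# Demonstration on a throwaway directory containing two invented files.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "voicemail.WAV"), "wb") as f:
        f.write(b"\x00" * 2048)
    os.mkdir(os.path.join(root, "exports"))
    with open(os.path.join(root, "exports", "chatlog"), "wb") as f:
        f.write(b"hello")
    report = inventory(root)

for ext, stats in sorted(report.items()):
    print(ext, stats)
```

Running a scan like this on a regular schedule and comparing the results over time can show where new dark data is accumulating.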

Tittel also recommends that organizations encrypt all dark data, even if
they haven’t evaluated its contents. But encryption is just one potential
policy that an enterprise might apply to dark data. Many other processes
are needed too. Dark data will always be a problem, and as new
communication platforms emerge, the problem will take new, unforeseen
forms. Frequent audits can help enterprises to keep pace with these
changes and identify the need for new tools and policies.


About the Author

Geoff Keston is the author of more than 250 articles
that help organizations find opportunities in business trends and
technology. He also works directly with clients to develop communications
strategies that improve processes and customer relationships. Mr. Keston
has worked as a project manager for a major technology consulting and
services company and is a Microsoft Certified Systems Engineer and a
Certified Novell Administrator.
