Data Curation











by James G. Barr

Docid: 00018012

Publication Date: 2201

Publication Type: TUTORIAL

Preview

Data curation is the process of collecting, wrangling, and preserving
enterprise data. The goal is to generate and maintain data sets that
are FAIR, which stands for data that are Findable, Accessible,
Interoperable, and, perhaps most importantly, Reusable, thereby
optimizing data value. Essentially a large-volume data management
strategy, data curation is a response to the massive amounts of data,
so-called big data, that began accumulating in data warehouses in the
1990s with the advent of the Internet and e-commerce.

Executive Summary


The verb “curate,” derived from the noun “curator,” refers to the process of
“carefully choosing just the right assortment of objects.”1 Indeed, when most
people hear the word curate, they immediately think of museums and art
galleries where one or more professionals – curators – select from a wide
variety of paintings, sculptures, and other objects to exhibit in a collection,
often arranged around a theme like French impressionist paintings or early
Roman engravings.

 

Today, the term curate (or curation) is applied to both physical and
digital assets; in the latter case, through a function known as “data
curation,” defined by analyst Hazal Simsek as “the process of collecting,
wrangling, and preserving data.”2

The FAIR Principles

For the modern enterprise, the main objective of data curation is to generate
and maintain data sets containing FAIR data.

Figure 1. The FAIR Principles

Source: Wikimedia Commons

As illustrated in Figure 1, FAIR is an acronym that stands for the four
characteristics of curated data – data that are
Findable, Accessible, Interoperable, and Reusable.


  • Findable involves the use of unique identifiers and metadata
    (data describing data) that permit data to be located quickly and
    efficiently.
  • Accessible describes data that are open, free, and readily
    available for research and discovery efforts.
  • Interoperable implies data that are compatible with a broad
    range of applications and workflows.
  • Reusable means data that can be used and reused for multiple
    purposes.
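
As a rough illustration of how unique identifiers and metadata make data findable, the following Python sketch builds a tiny in-memory catalog. The `DatasetRecord` class, its field names, and the sample entries are hypothetical and are not drawn from any particular curation product.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry: metadata describing one data set."""
    title: str
    keywords: list[str]
    file_format: str  # a widely supported format helps interoperability
    license: str      # explicit reuse terms help reusability
    # A generated unique identifier (an assumption; real catalogs may use DOIs).
    identifier: str = field(default_factory=lambda: str(uuid.uuid4()))


def find(catalog, keyword):
    """Return records whose keywords include the term (case-insensitive)."""
    wanted = keyword.lower()
    return [r for r in catalog if wanted in (k.lower() for k in r.keywords)]


# Illustrative sample entries only.
catalog = [
    DatasetRecord("Quarterly sales", ["sales", "finance"], "csv", "internal"),
    DatasetRecord("Customer survey", ["survey", "customers"], "json", "CC-BY-4.0"),
]
matches = find(catalog, "Sales")
```

Here `matches` holds only the sales record, and each record can be cited or retrieved later by its `identifier`.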

In simplest terms, data curation is the process by which an enterprise
gets the most out of its data.

Optimization Through Curation

As Skylar Hawthorne, a leading data curator at the University of
Michigan’s Inter-University Consortium for Political and Social Research
(ICPSR), observes: “Many … researchers follow a linear workflow; they
collect data, run statistical analyses, then publish their results.
Curators, however, must consider what happens to data, how others can use
it, and what they might use it for. The initial insights gleaned from data
can be groundbreaking, but no single research team can maximize the full
potential of their data. This is why secondary data analysis, in
which researchers use existing data to discover new results, is so
valuable.”

Consider, for example, that “since the release of the US Transgender
Survey in 2015, an additional 54 data-related publications have built upon
it. These include:

  • “The first report on the lives of transgender people in rural America
    (Movement Advancement Project 2019),
  • “State policies and healthcare use among transgender people
    (Goldenberg et al. 2020), and
  • “Even an insight into the association of transphobic discrimination
    and alcohol misuse among transgender adults (Kcomt et al. 2020).

“If the researchers for the US Transgender Survey did not release their
data, then other researchers would not have been able to use their data to
discover novel insights.

“That’s where data curators come in. [They] serve as mediators, translators,
editors, publishers, librarians, and – of course – curators.”3

Metadata vs. Data

The art and science of data curation is “highly focused on maintaining
and managing metadata,”4 or data describing data. The US
Department of Defense defines metadata as “data describing stored data:
that is, data describing the structure, data elements, interrelationships,
and other characteristics of electronic records.”5

In defining metadata, analyst Keith D. Foote offers a useful analogy to a
past technology. “The pre-digital card catalogs used in libraries a
few decades ago provide a good example of metadata. Generally
speaking, metadata supplies the how, when, what, where, and why of
data. Metadata [makes] the data easier to find and track.”6

Curation vs. Governance

The terms “data curation” and “data governance” are often used
interchangeably. While data curation serves as an essential element
in data governance (or data management), it is “chiefly
concerned with optimizing metadata” to facilitate data discovery and
preservation. Data governance, by contrast, is concerned with managing data
throughout its lifecycle.7

Education and Experience

Data curators are skilled IT professionals. In many cases, applicants for
the position must hold a master’s degree in Library or Information Science,
plus at least two years’ experience in digital science, digital repository
management, or another related discipline. As with cybersecurity or any
other rapidly evolving technical specialty, data curators will normally
devote considerable time to improving their knowledge of data structures,
data management, and emerging data curation techniques.

Data Curator Duties


“Companies analyze only 12 percent of their data on average.”8

“By discovering and organizing data sets, data curators make the
knowledge within them accessible [to] all other professionals within the
[enterprise]. And presenting the right kind of knowledge to the
right kind of people has nearly unlimited possibilities of advancing
business goals.”9

While the practice of data curation may differ from enterprise to
enterprise, most data curators share similar duties.

Their generic responsibilities include:


  • Preserving data – “Collecting, storing, and managing data to
    ensure that it doesn’t get lost.”
  • Discovering data – “Gathering data from different databases,
    cataloging, categorizing, and otherwise preparing it for further usage
    and analysis.”
  • Cleaning data – “[Removing] errors and inconsistencies.”
  • Integrating data – “[Combining] data from differently formatted
    databases.”
  • Sharing data – “Making data available for further use by
    interested parties.”10
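
To make the cleaning and integrating duties more concrete, here is a minimal Python sketch. It assumes two hypothetical sources, a CRM export and a billing export, that describe the same customers with different field names. All field names and sample rows are invented for illustration.

```python
def clean(records):
    """Normalize values and drop rows that are incomplete or duplicated."""
    seen = set()
    cleaned = []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        name = " ".join((rec.get("name") or "").split()).title()
        if not email or email in seen:
            continue  # skip rows with no e-mail key or a repeated key
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


def integrate(crm_rows, billing_rows):
    """Map two differently formatted sources onto one shape, then clean."""
    unified = [{"name": r["full_name"], "email": r["email_address"]}
               for r in crm_rows]
    unified += [{"name": r["customer"], "email": r["contact"]}
                for r in billing_rows]
    return clean(unified)


# Hypothetical sample rows.
crm = [{"full_name": "ada  lovelace", "email_address": "Ada@Example.com "}]
billing = [
    {"customer": "Ada Lovelace", "contact": "ada@example.com"},
    {"customer": "Grace Hopper", "contact": "grace@example.com"},
    {"customer": "No Email", "contact": ""},
]
customers = integrate(crm, billing)
```

The result contains two customers: the duplicate Ada Lovelace entry is merged on its normalized e-mail address, and the row with no e-mail address is dropped.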

Among their more technical functions are:


  • “Contextualizing” – Adding metadata to a “data set” (or
    assemblage of related data elements). Metadata, including sources
    and attributions, reveals how and why the subject data was generated,
    like a data genealogy.
  • “Citing the Data” – Enabling third-party users to properly
    attribute the data, acknowledging its source and ownership.
  • “De-Identification” – Removing, masking, or otherwise
    concealing personally identifiable information (PII) as required by
    applicable security and privacy standards and statutes.
  • “Validating and Adding Metadata” – Adding machine-readable
    information about the nature of a data set to facilitate electronic
    search and retrieval operations.
  • “[Validating the] Data” – Invoking the services of a subject
    matter expert, someone with the same credentials and experience as the
    data creator, to review the contents of a data set.11
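
The de-identification step can be sketched in Python as follows. This is a simplified illustration only: the two regular expressions, the salt value, and the record layout are assumptions, and pattern matching of this kind is unlikely to satisfy any privacy standard or statute on its own.

```python
import hashlib
import re

# Assumed patterns for two common identifiers; real PII takes many more forms.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pseudonymize(value, salt):
    """Replace an identifier with a salted hash so rows remain linkable."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def deidentify(record, salt="example-salt"):
    """Return a copy with the name hashed and identifiers in free text masked."""
    notes = EMAIL.sub("[EMAIL]", record["notes"])
    notes = US_SSN.sub("[SSN]", notes)
    return {"subject": pseudonymize(record["name"], salt), "notes": notes}


# Hypothetical sample record.
row = {"name": "Jane Doe", "notes": "Reach jane.doe@example.com, SSN 123-45-6789."}
safe = deidentify(row)
```

After the call, `safe["notes"]` reads "Reach [EMAIL], SSN [SSN]." and the name is replaced by a 12-character pseudonym that is the same for every record belonging to that subject.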

In addition, curators:

Prepare ML/AI Data

Today, many enterprise operations are performed in whole or in part by
computers using machine learning (ML) and artificial intelligence (AI)
applications. These applications learn by absorbing enormous volumes
of training data – data prepared by curators to ensure the information is
reliable, unbiased, and machine-readable. Curation is critical to
helping ML/AI systems make the right decisions and take the right
actions.12

Evaluate Data Feeds

Even as enterprises are often overwhelmed by data – either internal,
self-generated data or data delivered via external sources – new data
feeds that are deemed valuable and reliable are always welcome, especially
as these new feeds might replace compromised or questionable
sources. The ability to evaluate the enterprise’s data landscape –
and make critical recommendations relative to its contents and contours –
is an indispensable part of the data curator’s job.

Finally, a senior data curator may be invited to:

Assess DM Procedures

As a member of the enterprise Data Management team responsible for data
quality, a data curator is well positioned, by virtue of her knowledge and
experience, to assess the quality of all DM policies, protocols, and
procedures, including those related to vital support services, such as:

  • Data backup and recovery
  • Data discovery (for legal purposes)
  • Data disposition

The Data Management manager should leverage this expertise by
incorporating data curators into the regular DM operations review process.

Content Curation Basics


Closely related to the concept of data curation is content
curation. While data curation generally concerns “business” data
created by the enterprise – sales data, financial data, research and
development data, etc. – content curation normally involves Internet data
collected by the enterprise – news, advertising, social media posts,
website data, etc.

Figure 2. The Content Curation Challenge

Source: Wikimedia Commons

Content curation, aimed at gathering relevant and useful information from
the Web,13 is complicated by the almost unimaginable volume of
data – text, audio, and video – generated each day. For example, as
of June 2021:

  • “The internet [held] 44 zettabytes of data … and [was] growing at a
    rate of around 1.7Mb per second per person on earth.
  • “Twitter [produced] 474,000 tweets per minute.
  • “YouTube [had] 400 hours of video content uploaded every minute.
  • “There [were] 67,305,600 Instagram posts per day.
  • “There [were] 3.5 billion Google searches every minute.”14

From an enterprise perspective, the continuing proliferation of e-mail
messages, the extreme volatility of website data, and the frenzied
exchanges of social media users have created a challenge for content
curators charged with separating the proverbial Web wheat from the chaff.

Analyst Kazuki Nakayashiki summarizes the dilemma as follows. “In
the constant scramble to stay relevant, most platforms emphasize the new,
rather than the good, creating a feed architecture that is obsessed with
the present. What is the point in going back and reading an older
issue of anything when you’ll have two new issues in your inbox in the new
few minutes? This creates a content ephemerality issue, where we are
bombarded with a consistent stream of new, but only slightly different
content as creators and platforms struggle to capture the interest of
their users through constant novelty.

Helping combat this issue, “The best content creators absorb huge
amounts of information for us and render the best of it down into
genuinely interesting and entertaining highlights that communicate both
the original content and their take on it.”15

From the standpoint of a content curator, who is seeking to absorb and
process the best information the Internet has to offer, Nakayashiki’s
analysis suggests at least a partial curation strategy:

Step 1. Limit your enterprise data gathering to a small number
of meaningful content categories.

Step 2. For each category, identify two or three trusted
sources. These can be news sites, websites, blogs, etc.,
depending on the content contributor. Trustworthiness can be
determined by multiple factors, including:

  • The contributor’s reputation;
  • The appearance of source citations (indicating an academic approach to
    research and reporting);
  • The general quality of the content provided, the questions asked, the
    arguments advanced, and the conclusions rendered; and
  • The frequency with which the contributor is quoted or otherwise
    referenced by her contemporaries.

Step 3. “Scrape” each trusted source on a weekly basis looking
for relevant, curated content.

Step 4. Each quarter, archive any content with a low “refresh”
incidence, as the content is likely static or no longer relevant.
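
One way the filtering and archiving steps above might be expressed in code is sketched below in Python. The category names, source domains, and the 90-day threshold are assumptions made for illustration. In particular, a low “refresh” incidence is interpreted here as “not refreshed within the last quarter,” and the weekly scraping in Step 3 is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Item:
    source: str
    category: str
    title: str
    last_refreshed: date


# Steps 1 and 2: a few categories, each with two or three trusted sources.
# These category names and domains are placeholders.
TRUSTED = {
    "data-management": {"example-news.test", "example-blog.test"},
    "security": {"example-journal.test", "example-news.test"},
}


def is_trusted(item):
    """Keep only content from a trusted source within a tracked category."""
    return item.source in TRUSTED.get(item.category, set())


def quarterly_review(items, today, max_age_days=90):
    """Step 4: split trusted items into active and archived by refresh date."""
    cutoff = today - timedelta(days=max_age_days)
    active, archived = [], []
    for item in filter(is_trusted, items):
        (active if item.last_refreshed >= cutoff else archived).append(item)
    return active, archived


# Hypothetical sample content.
items = [
    Item("example-news.test", "security", "A", date(2022, 1, 15)),
    Item("example-blog.test", "data-management", "B", date(2021, 6, 1)),
    Item("unknown.test", "security", "C", date(2022, 1, 20)),
]
active, archived = quarterly_review(items, today=date(2022, 1, 31))
```

In this run, item A stays active, item B is archived because it has not been refreshed within 90 days, and item C is ignored because its source is not on the trusted list.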

Data Curation Tools


“A study headed by the University of Texas
has shown that if fortune 1000 companies were to raise the usability
of their data by 10%, it would mean a $2.01 billion increase in total
revenue per year.”16

To help optimize the data curation process, a number of firms offer AI-
and ML-infused data curation platforms. Prominent providers include:

  • Alation
  • Stitch Data (Talend)
  • DQLabs
  • Alteryx17

Other options include solutions from Amazon, Microsoft, and NIST.

AWS Lake Formation

Using Amazon Web Services (AWS) Lake Formation, a client can create a
curated “data lake” by defining where the subject data resides and what
data access and security policies apply. Lake Formation then:

  • Collects and catalogs the data
  • Moves the data into a new Amazon S3 data lake
  • Cleans and classifies the data using machine learning algorithms
  • Secures access to any sensitive data

Azure Data Lake Storage

With Microsoft Azure Data Lake Storage, data are transformed in stages
and deposited in one of three data lakes, as illustrated in Figure 3.
The data are then available for consumption by analytics, data
science, and visualization teams.

Figure 3. Azure Data Lake Storage

Source: Microsoft

NIST Configurable Data Curation System

“The Configurable Data Curation System, also known as the CDCS or Curator,
provides a means for capturing, sharing, and transforming unstructured
data into a structured format based on the Extensible Markup Language
(XML). The CDCS can be viewed as a “loading dock” for scientific
data. It serves as a means to enable the collection and
dissemination of structured scientific data. It can be applied to
any area and is agnostic to the type of data. “Curated” data are
amenable to transformation to other formats such as those used by existing
computational tools. The data are organized using user-selected
community-developed templates encoded in XML Schema used to create data
documents that are saved in a non-relational (NoSQL) document database.”18
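
The general idea of turning unstructured notes into a structured XML document can be sketched with Python’s standard library. This sketch does not use the CDCS software or its interfaces. The `TEMPLATE_FIELDS` list stands in, very loosely, for a community-developed XML Schema template, and the sample measurement values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the fields a community might require for one record.
TEMPLATE_FIELDS = ["material", "temperature_k", "hardness"]


def to_xml(raw_text, root_tag="measurement"):
    """Parse loose 'key: value' lines and emit an XML document string."""
    values = {}
    for line in raw_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            values[key.strip().lower().replace(" ", "_")] = value.strip()

    # Reject input that does not supply every field the template requires.
    missing = [f for f in TEMPLATE_FIELDS if f not in values]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    root = ET.Element(root_tag)
    for name in TEMPLATE_FIELDS:
        ET.SubElement(root, name).text = values[name]
    return ET.tostring(root, encoding="unicode")


# Invented sample notes.
notes = """Material: sample-A
Temperature K: 293
Hardness: 120 HV"""
document = to_xml(notes)
```

The resulting string could then be saved as one document in a document database, and because it follows a fixed template it can later be transformed into other formats.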


Alation: http://www.alation.com/
Alteryx: http://www.alteryx.com/
Amazon Web Services (AWS): http://aws.amazon.com/
DQLabs: http://www.dqlabs.ai/
Microsoft: http://www.microsoft.com/
Stitch Data (Talend): http://www.talend.com/
US National Institute of Standards and Technology (NIST): http://www.nist.gov/

References

1 Merriam-Webster.

2 Hazal Simsek. “All You Need to Know about Data Curation.” AIMultiple. November 3, 2021.

3 Skylar Hawthorne. “An Insider’s Take on Data Curation: Context, Quality, and Efficiency.” Journal of eScience Librarianship (Vol. 10, Issue 3). August 11, 2021:e1200.

4 Keith D. Foote. “So You Want to Be a Data Curator?” Dataversity Digital LLC. August 19, 2021.

5 Jesse Wilkins. “What is Metadata and Why Is It Important?” AIIM. March 9, 2021.

6 Keith D. Foote. “So You Want to Be a Data Curator?” Dataversity Digital LLC. August 19, 2021.

7 Elizabeth Mixson. “Five Things to Know about Data Curation.” AI, Data & Analytics Network. April 14, 2021.

8 Hazal Simsek. “All You Need to Know about Data Curation.” AIMultiple. November 3, 2021.

9 “Understanding Data Curation: Benefits, Goals, and Best Practices.” Coresignal. September 16, 2021.

10 Ibid.

11 Hazal Simsek. “All You Need to Know about Data Curation.” AIMultiple. November 3, 2021.

12 Ibid.

13 Keith D. Foote. “So You Want to Be a Data Curator?” Dataversity Digital LLC. August 19, 2021.

14 Kazuki Nakayashiki. “The Future Is Creation via Curation.” Medium. June 6, 2021.

15 Ibid.

16 “Understanding Data Curation: Benefits, Goals, and Best Practices.” Coresignal. September 16, 2021.

17 Elizabeth Mixson. “Five Things to Know about Data Curation.” AI, Data & Analytics Network. April 14, 2021.

18 US National Institute of Standards and Technology. September 17, 2018.

About the Author


James G. Barr is a leading business continuity analyst
and business writer with more than 40 years’ IT experience. A member of
“Who’s Who in Finance and Industry,” Mr. Barr has designed, developed, and
deployed business continuity plans for a number of Fortune 500 firms. He
is the author of several books, including How to Succeed in Business
BY Really Trying, a member of Faulkner’s Advisory Panel, and a
senior editor for Faulkner’s Security Management Practices.
Mr. Barr can be reached via e-mail at jgbarr@faulkner.com.
