Automating Taxonomy Management
Copyright 2017, Faulkner Information Services. All
Rights Reserved.
Docid: 00011544
Publication Date: 1709
Report Type: TUTORIAL
Preview
By organizing business information for more
efficient retrieval and analysis, a taxonomy can help improve all aspects
of an
enterprise, from forecasting and decision making to sales and customer
service.
Taxonomy management software can reduce the investment of time and
resources
required to implement and maintain an effective taxonomy. Strictly
speaking, a
taxonomy encodes only hierarchical relationships within the data; some
enterprises may benefit from the additional knowledge organization
capabilities
provided by a thesaurus or ontology.
Executive Summary
Despite the business benefits of
better access to information, many companies do not devote sufficient
resources
to catalog the data they accumulate in a logical, consistent manner.
A taxonomy that structures information in a
meaningful way can help both customers and employees to find and make use
of
needed information, but implementing and maintaining this organization
manually
can be time-intensive and prone to errors. By automating much of this
process,
taxonomy management software can improve consistency, and also
significantly
reduce the time and labor involved in creating, applying, and maintaining
a
taxonomy. Other resources include predefined, customizable taxonomies, as
well
as consultants who can help with constructing and maintaining a taxonomy.
Creating and deploying an effective taxonomy
requires a dual focus on both business requirements and technical
execution.
The team managing the project should therefore include experts in both
areas.
This team should evaluate existing resources and business requirements to
determine the scope of the project, investigate available technology, and
then
implement the taxonomy.
Several methods can be used to automate taxonomy
creation and content classification, including rules-based,
linguistic/semantic, and statistical approaches. The current trend in
taxonomy
software is to combine multiple machine methods, as well as manual human
intervention, in order to maximize accuracy and relevancy. As more data is
collected and stored by more organizations, efficient taxonomical methods
are increasingly necessary for organizations to access and utilize data in
strategic business activities.
There is also a trend in knowledge
organization systems toward increasing power and complexity, with the
ability
to represent the complex, specific relationships of an ontology, rather
than
just the limited hierarchical relationships of a taxonomy. Topic maps are
a
standard, computer-understandable language for expressing the relations in
an
ontology. The topic map is a relatively recent tool that is still being
refined
and expanded.
Implementing and maintaining a taxonomy is a
labor-intensive, long-term commitment, even when taxonomy management
software
is used. Nevertheless, research shows that taxonomies are effective and
valued
components of information strategy for businesses that have them.
Description
A taxonomy structures information into
categories and subcategories. For cross-classification, it is often useful
to
overlay multiple independent taxonomies, or facets, to provide different
views
into the same data. For example, a publisher may organize its database of
books
by genre, by year created, by sales representative, by editor, and by
medium
(e.g., paperback, hardcover, e-book, and audio book). Facets also allow
information to be labeled and organized differently to suit the needs of
different groups, such as customers, sales staff, support staff, and
scientists. Figure 1 shows a taxonomy that organizes a set of individuals
according to multiple facets (governors, actors, and presidents).
Figure 1. Taxonomy with Multiple Facets
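As a minimal sketch of how facets give independent views into the same records, the Python fragment below tags each book with one value per facet and then filters on any combination of facets. The book titles, the facet names, and the helpers `filter_by_facets` and `facet_counts` are hypothetical, chosen only to mirror the publisher example above; they are not taken from any taxonomy product.

```python
# Hypothetical catalog: each record carries one value per facet.
BOOKS = [
    {"title": "Book A", "genre": "mystery", "year": 2015, "medium": "paperback"},
    {"title": "Book B", "genre": "mystery", "year": 2016, "medium": "e-book"},
    {"title": "Book C", "genre": "biography", "year": 2016, "medium": "hardcover"},
]


def filter_by_facets(records, **facets):
    """Return the records whose values match every requested facet."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


def facet_counts(records, facet):
    """Count how many records fall under each value of a single facet."""
    counts = {}
    for record in records:
        counts[record[facet]] = counts.get(record[facet], 0) + 1
    return counts


# Two different "views" into the same data.
mysteries_2016 = filter_by_facets(BOOKS, genre="mystery", year=2016)
by_year = facet_counts(BOOKS, "year")
```

Because each facet is stored independently, adding a new view (for example, by editor) would only require a new key on each record, not a change to the existing hierarchy.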
By improving the ability of both customers
and employees to find and make use of needed information, a taxonomy has
the
potential to significantly improve all aspects of a business, from
forecasting
and decision making to sales and customer service. However, many
businesses do
not take the time to catalog the data they accumulate in a consistent,
timely,
comprehensive manner. And even when they do make the effort, it can be
difficult to ensure that all the data is included and categorized
consistently
across business units, in a way that its intended users will find useful.
By automating many of the tasks involved,
taxonomy management software can improve consistency and significantly
reduce
the time and labor involved in creating, applying, and maintaining a
taxonomy.
Below are several steps that can help a business choose and implement a
system
that is appropriate for its needs.
Assemble an Interdisciplinary Team
The process of creating and deploying a
taxonomy requires a dual focus on both business requirements and technical
execution. The taxonomy project team should therefore be
interdisciplinary, with
access to expertise in both areas, in order to best evaluate existing
resources
and business requirements and determine the scope of the project.
Business experts on the team can include
business analysts, subject matter experts, content managers, and content
users
from various departments and business units, including sales, support,
administration, and research. The business experts will address content
coverage, terminology, and labeling, as well as alignment with corporate
culture, goals, and budget.
The technical execution side can include experts in web architecture, site
design, and search engineering, along with information specialists such as
information architects, librarians, knowledge engineers, semantic specialists,
taxonomy and metadata designers, and web developers.
The technical experts will have hands-on
responsibility for implementing the taxonomy, while the business experts
will
have an advisory role. These different roles can be formally reflected by
designating the business experts as a taxonomy interest group, separate
from
the taxonomy development team.
Designation | Roles |
---|---|
Taxonomy Development Team | Information management (information architects, |
Taxonomy Interest Group | Business analysts, subject matter experts, content managers, |
Establish a Scope for the Project
Building a taxonomy is potentially
open-ended. It would be easy for such a project to expand indefinitely as
more
and more data sources and links between them are added. To avoid expensive
and
inefficient project creep, it is imperative that a scope for the taxonomy
project be established early and adhered to. In determining this scope,
the
taxonomy development team and taxonomy interest group should work together
to
consider the purpose of the taxonomy, the needs and abilities of its
intended
users, the type and volume of content to be categorized, and any existing
resources that might be leveraged.
- Purpose: A taxonomy may be created to support one or
several purposes both internal and external to the company, including
sales, research, customer support, and business planning. Determining a
clear and well defined purpose from the start will not only serve as a
boundary on the project over time, but also aid in decisions about
taxonomy structure and document classification.
- Users’ Needs and Abilities: The information needs and
behaviors of various classes of intended users must also be considered,
including their different levels of technical and business expertise. In
some cases, everyone’s needs can be addressed by a single, comprehensive
enterprise taxonomy (perhaps with multiple facets); in other situations
it may be preferable to maintain a separate departmental taxonomy (e.g.,
for a research group). For most businesses, it is a good idea to choose
a taxonomy product with enough flexibility to generate and link multiple
taxonomies, in order to best accommodate future growth and other
changes. If the choice is made to set up a large taxonomy, it is
important to manage the navigation within it so as not to overwhelm
users with too many levels and long navigation paths. Research shows
that users prefer not to navigate more than three or four layers down
within a taxonomy.
- Content: Accumulated business information may reside
in various formats and numerous locations throughout the enterprise. A
necessary step in determining the scope of the taxonomy project is to
locate and characterize all of the available content in order to
determine which of it should be indexed and included in the taxonomy.
This step will require the coordination and cooperation of all
departments within the business. It is important to note the formats of
the content to be included, and to make sure that the software chosen
can manage those formats.
- Existing Resources: It is likely that there is
existing expertise within a business that could be leveraged to
jump-start the taxonomy project. There may also be databases or
classification systems that the company already maintains, such as an
existing records schedule or corporate library. Often the vendor of the
taxonomy management software makes available a selection of predefined
taxonomies that can be synchronized to create a foundation for a single
enterprise-wide taxonomy. Published third-party taxonomies can also be
found for many business, medical, scientific, engineering, and public
policy topic areas. Most taxonomy management software allows users to
import, convert, and modify existing taxonomies.
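The guideline above about keeping navigation to three or four layers can be checked mechanically. The following is a rough sketch, assuming the taxonomy is held as nested Python dictionaries; the category names and the `paths_deeper_than` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical taxonomy: each key is a category, each value holds its subcategories.
TAXONOMY = {
    "Products": {
        "Appliances": {
            "Washing Machines": {
                "Parts": {"Gaskets": {}},
            },
        },
    },
    "Support": {"Manuals": {}},
}


def paths_deeper_than(tree, max_depth, prefix=()):
    """Return every category path that has more than max_depth levels."""
    too_deep = []
    for name, children in tree.items():
        path = prefix + (name,)
        if len(path) > max_depth:
            too_deep.append(" > ".join(path))
        too_deep.extend(paths_deeper_than(children, max_depth, path))
    return too_deep


# Flag any navigation path longer than four levels.
long_paths = paths_deeper_than(TAXONOMY, 4)
```

A report like `long_paths` could be reviewed by the project team to decide whether a deep branch should be flattened or split into a separate facet.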
Create, Implement, and Populate the Taxonomy
Internal and External Structure. Taxonomy
design encompasses two separate aspects: the underlying (unseen) structure
and
the presentation of the taxonomy to various user groups. Having an
underlying
organization that is independent of the presentation to the user enables
multiple views of the same data to suit the needs of different types of
internal and external users. In addition, it is likely that the most
logical or functional underlying organization will not be the best
organization
for the user to view. For example, content may be organized internally by
creator or by level of security clearance.
When content is mapped from the underlying
structure to the external presentation, it is important to make sure that
every
page or node in the presentation contains adequate content. Designers
should
also verify that all information in the underlying structure is mapped
somewhere for presentation (even if it is not accessible to all users).
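The two checks described above (every presentation node has adequate content, and every underlying item is mapped somewhere) can be sketched as a small validation routine. This is an illustration under assumptions: the presentation is modeled as a dictionary from node name to item identifiers, "adequate" is reduced to a minimum item count, and the `check_mapping` name and sample data are hypothetical.

```python
def check_mapping(underlying_items, presentation, min_items=1):
    """Compare the underlying structure against its presentation.

    underlying_items: iterable of item ids held in the underlying structure.
    presentation: dict mapping each presentation node to the item ids shown there.
    Returns (thin_nodes, unmapped_items), both sorted.
    """
    # Nodes that display fewer items than the chosen minimum.
    thin_nodes = sorted(
        node for node, items in presentation.items() if len(items) < min_items
    )
    # Items that never appear under any presentation node.
    shown = {item for items in presentation.values() for item in items}
    unmapped_items = sorted(set(underlying_items) - shown)
    return thin_nodes, unmapped_items


# Hypothetical sample data.
thin, unmapped = check_mapping(
    ["doc1", "doc2", "doc3"],
    {"Sales": ["doc1"], "Support": [], "Research": ["doc1", "doc2"]},
)
```

Here "Support" would be reported as a node without content and "doc3" as an item that is not yet presented anywhere.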
Graphic Presentation. Taxonomy information can be
presented graphically in
a variety of ways, including a series of tabs on a web site or portal, a
site
diagram, nested file folders, nested tree structures, alphabetical
listings of
topics and sub-topics, and more creative approaches, such as heat maps,
tinker
toy diagrams, and voice recognition interfaces. Looped tree structures
allow a
high level of networking: any topic can be referenced by multiple other
topics.
Facets can be presented as a series of linked but independent hierarchies,
each
based on a different top level category and cross-linked to the others, so
there are multiple paths to the same content. Finally, information from
two
separate taxonomies can be presented together in a matrix.
Team Responsibilities. The taxonomy
interest group is responsible for establishing standards and guidelines
relating to content inclusion, clustering, and labeling, for both the
underlying structure and the presentation. The taxonomy development team
is
responsible for the mechanics of implementing both the underlying
structure and
the presentation. Once the taxonomy has been created, each body of text
must
then be analyzed and assigned to a place in the taxonomy by attaching a
metadata tag to it. This is referred to as populating the taxonomy.
Taxonomy
creation and population can each be manual, automated, or a combination of
the
two.
Taxonomy software can function as a
standalone system or as a module of a complete information storage and
retrieval system. Most standalone systems can integrate with or send
output to
content management, portal, and other enterprise management systems.
Testing and Maintenance
Once created and populated, a taxonomy
should be tested to identify errors, ambiguities, and inconsistencies, and
to
refine its organization. Testing can check whether users are able to find
desired information, whether search results are relevant, and how many
clicks
or how long it takes to complete a specific task. The taxonomy should also
be
evaluated for alignment with users’ needs, and for ease of categorizing
new
content.
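The test measures mentioned above (whether users find the desired information and how many clicks it takes) can be summarized with very little code. The sketch below assumes each test session is recorded as a dictionary with a `found` flag and a `clicks` count; that record layout and the `summarize_sessions` helper are hypothetical.

```python
def summarize_sessions(sessions):
    """Summarize usability test sessions.

    sessions: list of dicts with 'found' (bool) and 'clicks' (int).
    Returns (success_rate, average clicks among successful sessions).
    """
    if not sessions:
        return 0.0, None
    successes = [s for s in sessions if s["found"]]
    success_rate = len(successes) / len(sessions)
    if successes:
        avg_clicks = sum(s["clicks"] for s in successes) / len(successes)
    else:
        avg_clicks = None
    return success_rate, avg_clicks


# Hypothetical results from four test sessions.
rate, clicks = summarize_sessions([
    {"found": True, "clicks": 3},
    {"found": True, "clicks": 5},
    {"found": False, "clicks": 9},
    {"found": True, "clicks": 4},
])
```

Tracking these two numbers before and after a reorganization gives a simple way to judge whether a change to the taxonomy helped.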
Designing and deploying a taxonomy is not a
one-time effort but rather an ongoing process that requires a long-term
investment to keep up with changes in business context, content, and
users,
including classifying new content and, at times, reclassifying or deleting
existing content. When designing a taxonomy and choosing a software
package, it
would be very difficult to anticipate the level and type of developments,
changes, and growth that may occur in the company’s future. The taxonomy
and
software should therefore be flexible and adaptable for future changes.
A business should also develop criteria and
policies for how the taxonomy will be extended when necessary, to make it
easier for taxonomy managers to keep it up to date and consistent over
time. These
policies should address governance, clarifying who will have the authority
to
revise it and how frequently, as well as policies for dispute settlement.
Current View
Several different approaches can be used to
automate taxonomy creation and content classification, including
rules-based
methods, linguistic/semantic analysis, and statistical algorithms built on
Bayesian probability, support vector machines, and neural networks.
Taxonomy
software also generally provides an interface for manual adjustments.
Figure 2 below shows
automatic content classification with optional manual adjustments using
Top Quadrant.
Figure 2. Automatic Classification of Content
Rule-Based Classification
In rule-based classification, experts create
a rule for each category of the taxonomy to specify when a document will
be
included in that category. These rules often contain precise and complex
operations and decision trees. Rules can refer to a document’s file type
or
metadata (such as author, date, or keyword) as well as its content. An
advantage of rule-based classification is the ability to very accurately
determine which documents will be classified in each category. The
drawbacks of
the rule-based approach are the labor cost of having experts write and
maintain
the rules, as well as the potential for human error in overlooking or
mischaracterizing a concept or category. Misclassifications can also arise
when
a potential ambiguity or grey area was not anticipated by the experts who
created the rules. Also, this method automates only document
classification; it
does not automate the design or construction of the taxonomy structure
itself.
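A minimal sketch of rule-based classification follows, assuming each document is a dictionary with `file_type`, `author`, and `text` fields. The categories, the rules, and the `classify_by_rules` helper are hypothetical; commercial packages typically offer much richer rule languages.

```python
def is_contract(doc):
    """Rule combining file type (metadata) with a content test."""
    return doc["file_type"] == "pdf" and "agreement" in doc["text"].lower()


def is_press_release(doc):
    """Rule combining an author test with a content test."""
    return (
        doc["author"] == "Communications"
        or "for immediate release" in doc["text"].lower()
    )


# Hypothetical rule set: one expert-written rule per taxonomy category.
RULES = {
    "Contracts": is_contract,
    "Press Releases": is_press_release,
}


def classify_by_rules(doc, rules=RULES):
    """Return every category whose rule accepts the document, in rule order."""
    return [category for category, rule in rules.items() if rule(doc)]


contract = {"file_type": "pdf", "author": "Legal", "text": "Master Services Agreement"}
memo = {"file_type": "docx", "author": "Legal", "text": "Draft agreement notes"}
contract_categories = classify_by_rules(contract)
memo_categories = classify_by_rules(memo)
```

The memo illustrates the grey-area problem described above: it mentions an agreement but is not a PDF, so no rule accepts it and it is left unclassified until an expert revises the rules.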
Statistical Analysis
Several systems rely on statistical analysis
to automate both taxonomy creation and content classification. Statistical
approaches using Bayesian probability, neural networks, or support vector
machines analyze word frequency, word placement, word grouping, and the
distance between words in a document. Packages that use these statistical
approaches require some hands-on preliminary training of the software.
Taxonomy
creation is trained by providing the system with a basic taxonomy, defined
by a
human expert. Content classification is trained by presenting certain
manually
selected documents to the software as examples of what should be
classified
under a given topic. By analyzing the sample documents, the system both
refines
the taxonomy and establishes the rules of classification to be used for
new
documents. An advantage of this approach is that it can work on text in
any
language because it uses pure pattern matching, without semantic analysis.
Some
drawbacks of this approach are the requirement that an expert manually
select
training documents for every category, and the fact that the resulting
taxonomy
is only as good as the training documents that were selected.
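To make the training step concrete, here is a small sketch of one statistical technique named above, a naive Bayes classifier based on word frequency with add-one smoothing. It is written from scratch in standard-library Python; the training documents, category names, and function names are hypothetical, and it does not model word placement or distance, which some packages also use.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (category, text) pairs chosen by a human expert."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of training docs
    for category, text in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify_nb(text, word_counts, doc_counts):
    """Return the category with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category in doc_counts:
        counts = word_counts[category]
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # prior
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training documents, two per category.
model = train([
    ("sales", "quarterly revenue forecast and pricing"),
    ("sales", "customer pricing discount revenue"),
    ("support", "reset password login error"),
    ("support", "printer error troubleshooting steps"),
])
label_a = classify_nb("pricing forecast", *model)
label_b = classify_nb("login error", *model)
```

Nothing in the sketch depends on the meaning of the words, which is why this kind of approach carries over to other languages, and also why its results depend entirely on the training documents selected.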
Semantic and Linguistic Clustering
Some software automatically generates and
populates a taxonomy by using natural language processing and statistical
clustering to analyze the topics and subtopics found in the set of
documents,
without human analysis. Clustering is a process of grouping documents or
text into
subsets of similar documents or text by identifying elements they have in
common. Semantic clustering identifies word meanings, parts of speech,
idioms,
verb chains, and noun phrases, and it uses stemming or lemmatization to reduce
a word to its root form, e.g., reducing ‘brought’ to ‘bring’, to ensure that
all words based on the
same root are clustered together. Linguistic software also analyzes
syntax,
identifying subjects, verbs, objects, and other grammatical roles using
both
rule-based and probabilistic grammar. This approach requires less manual
involvement by experts, and it does not require pre-training. But unlike
the
pure pattern matching of the statistical approaches, the semantic and linguistic
clustering approach is typically language-dependent.
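The clustering idea can be illustrated with a deliberately simplified sketch: words are reduced to a root form through a small lookup table, and documents are grouped when they share enough terms. The lemma table, stopword list, similarity threshold, and greedy grouping strategy are all assumptions made for illustration; production software relies on full natural language processing rather than a lookup table.

```python
# Tiny hypothetical lemma table; real systems use full morphological analysis.
LEMMAS = {"brought": "bring", "brings": "bring", "bringing": "bring"}
STOPWORDS = {"the", "a", "to", "of", "and"}


def terms(text):
    """Lowercase the text, drop stopwords, and map each word to its root form."""
    return {LEMMAS.get(w, w) for w in text.lower().split() if w not in STOPWORDS}


def jaccard(a, b):
    """Fraction of terms that two term sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    clusters = []  # each entry is (seed_terms, [document indexes])
    for index, text in enumerate(docs):
        doc_terms = terms(text)
        for seed_terms, members in clusters:
            if jaccard(seed_terms, doc_terms) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((doc_terms, [index]))
    return [members for _, members in clusters]


groups = cluster([
    "courier brought the invoice",
    "the courier brings invoice copies",
    "reset password to login",
])
```

The first two documents land in the same group only because "brought" and "brings" are both mapped to "bring", which shows why root-form reduction matters and why the approach depends on language-specific resources.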
Manual Creation and Classification
In addition to whatever automatic processing
a system provides, most taxonomy vendors supply a tool to customize,
rename, or
manually create nodes of a taxonomy, as well as to manually classify
documents.
Although it will at times be necessary to intervene manually, it is
generally
neither cost-effective nor wise to rely predominantly on manual
processing. The
main risk of manual classification is inconsistency. It is unlikely that
the
staff members classifying documents would have the same level of
understanding
and make the same decisions in assigning categories as would the experts
who
designed the taxonomy. And different people may categorize the same
concepts in
different ways, which would seriously decrease the utility of the
taxonomy. At
least if an automatic system misclassifies a concept, the relevant
documents
are likely to all be misclassified in the same place, making it easier for
the
problem to be discovered and corrected.
Outlook
Each methodology for taxonomy creation and
document classification has advantages and disadvantages. The current
trend in
taxonomy software is to combine multiple machine methods, as well as
manual
human intervention, in order to maximize accuracy and relevancy in the
creation
and maintenance of taxonomies. When choosing taxonomy software, a business
should understand how each package’s pattern of advantages and
disadvantages will affect performance in its own corporate and data
environment.
There is also a trend in taxonomy management
software toward increasing power and complexity, with the ability to
represent
the complex, specific relationships of an ontology rather than only the
strictly hierarchical relationships of a taxonomy. An ontology enables an
unlimited number of user-defined relationships between concepts. For
example, a
store can be ‘located in’ a city, while a gasket is a ‘component of’ a
washing
machine. Figure 3 shows an ontology. An identity relationship lets terms
in
different ontologies refer to the same subject, so that the ontologies can
use
each other’s information. In this way, an ontology supports knowledge
reuse and
scalable knowledge construction. It becomes a valuable resource that
details
the knowledge in a company or a subject area. Topic maps are a standard,
computer-understandable language for expressing the relations in an
ontology.
The topic map is a relatively recent tool that is still being refined and
expanded by organizations such as OASIS (a vendor consortium promoting
open
standards) and ISO (International Organization for Standardization).
Figure 3. Ontology
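One simple way to picture the user-defined relationships of an ontology is as a list of subject, relation, object triples. The sketch below is an illustration only and is not topic map syntax; the store identifier, the city name, the `same as` identity relation, and the helper functions are hypothetical.

```python
# Hypothetical triples: (subject, relation, object).
TRIPLES = [
    ("store_17", "located in", "Springfield"),
    ("gasket", "component of", "washing machine"),
    ("washing machine", "is a", "appliance"),
    # Identity link: a term from another vocabulary names the same subject.
    ("washer", "same as", "washing machine"),
]


def related(subject, relation, triples=TRIPLES):
    """Return the objects linked to subject by the given relation."""
    return [o for s, r, o in triples if s == subject and r == relation]


def resolve(term, triples=TRIPLES):
    """Follow a 'same as' link, if present, to the shared term."""
    targets = related(term, "same as", triples)
    return targets[0] if targets else term


gasket_parent = related("gasket", "component of")
washer_kind = related(resolve("washer"), "is a")
```

A strict taxonomy would capture only the `is a` rows; the other relations, and the identity link that lets two vocabularies share information, are what the richer ontology model adds.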
Recommendations
Research shows that taxonomies are effective
and valued components of information strategy for businesses that have
them. In
one survey, a browsable taxonomy was seen as a preferred or critical part
of
the information architecture by 88 percent of survey respondents;
automated
taxonomy construction tools were seen as critical or preferred by 89
percent of
respondents; and automated document classification was seen as critical or
preferred by 76 percent of respondents. Another study of e-commerce sites
showed that users find desired information only 34 percent of the time
with a
simple search, but 54 percent of the time using a taxonomy.
Nevertheless,
implementing and maintaining a taxonomy is a labor-intensive, long-term
commitment, and a business should not underestimate the time and effort
involved. It is usually wise to invest in software to help with taxonomy
creation, importing and modifying existing taxonomies, and document
classification. Software vendors that provide taxonomy, thesaurus, or
ontology
management systems include Data Harmony, MultiTes, SAS, Smartlogic
(SchemaLogic), Synaptica, and Wordmap. Other vendors may be found via
TaxoBank,
which maintains a list of software for building and editing thesauri. It
can
also be helpful to enlist the guidance of a consultant such as Wand or
Taxonomy
Strategies, who provide services including workshops, consulting, and
training
to help organizations with project definition, construction, and
maintenance of
taxonomies. Sources for predefined, customizable taxonomies include Wand,
which
licenses taxonomies for more than 150 common enterprise applications, and
Taxonomy
Warehouse, which provides a comprehensive listing of available taxonomies
across a wide selection of areas.
It is important to
choose a system that has the flexibility to grow with the business
and
that will be compatible over time with the trend toward richer
representations
such as ontologies and topic maps. A good way to begin is to implement a
limited taxonomy within one group or department first. Once the benefits
of
this limited taxonomy become apparent, it is easier to get buy-in across
the
company, and from management. The initial taxonomy can also then be used
as a
model for the subsequent larger taxonomy.
Web Links
- The American Society for Indexing: http://www.asindexing.org/
- Boxes and Arrows: http://www.boxesandarrows.com/
- Controlled Vocabulary.com: http://www.controlledvocabulary.com/
- Data Harmony: http://www.dataharmony.com/
- ISO (International Organization for Standardization): http://www.isotopicmaps.org/
- PoolParty: http://www.poolparty.biz/
- Mondeca: http://www.mondeca.com/
- MultiTes: http://www.multites.com/
- Oasis: http://www.oasis-open.org/
- SAS: http://www.sas.com/
- Smartlogic: http://www.smartlogic.com/
- Synaptica: http://www.synaptica.com/
- TaxoBank: http://www.taxobank.org/content/thesauri-and-vocabulary-control-thesaurus-software/
- Taxonomy Warehouse: http://www.taxonomywarehouse.com/
- Taxonomy Strategies: http://www.taxonomystrategies.com/
- Top Quadrant: http://www.topquadrant.com/
- TopicMaps.Org: http://www.topicmaps.org/
- Wand: http://www.wandinc.com/
- Wordmap: http://www.wordmap.com/
- World Wide Web Consortium (W3C): http://www.w3.org/
About the Author
Betsy Walli is a licensed marriage and
family therapist and an independent writer and editor with experience in
academic, technical, and marketing topics. Dr. Walli holds a master’s
degree in
counseling from California State University, Fullerton, and a Ph.D. in
linguistics from the Massachusetts Institute of Technology.