Computational Linguistics


by James G. Barr

Docid: 00021197

Publication Date: 2202

Report Type: TUTORIAL


Imagine a world in which the language barrier does not exist, in which
computer programs could translate one national language into another and
accomplish that feat in near real time. In such a world, a person who
spoke only English could communicate with someone who spoke only French or
German or Russian. That potential is the driving force behind
“computational linguistics,” an area of study that lies at the
intersections of computer science, artificial intelligence, mathematical
logic, linguistics, cognitive science, and psychology. The ultimate goal
of computational linguistics is the development of cognitive machines that
humans can freely talk to in their respective natural languages.

Executive Summary

There are several thousand languages in the world. The exact number is
not known because the difference between a language and a dialect is often
difficult to discern. Most people in the world are monolingual, meaning
they have command only of their native language. Communication among
speakers of different languages is thus difficult, and its fruit, the
person-to-person collaboration that yields great discoveries or
inventions, is hard to attain.

But imagine a world in which the language barrier did not exist, in which
computer programs could translate one national language into another, and
accomplish that feat in near real time. In such a world, a person who only
spoke English could communicate with someone who only spoke French or
German or Russian.

That potential is the driving force behind “computational linguistics,”
an area of study that lies “at the intersections of computer science,
artificial intelligence, mathematical logic, linguistics, cognitive
science, and psychology.”1 The ultimate goal of computational
linguistics is the development of cognitive machines that humans can
freely talk to in their respective natural languages.

Science and Engineering

Computational linguistics (CL) is often referred to as “natural language
processing” (NLP), although CL is generally
preferred within academic and scientific circles.

NLP is actually a subset of CL. Where CL is concerned with the science
surrounding computing and language, NLP, according to Oracle, “is the
engineering discipline concerned with building computational artifacts
that understand, generate, or manipulate human language.”2

Business of Linguistics

According to the Department of Linguistics at Fresno State in California,
computational linguistics is “the most commercially viable branch of
linguistics,” with hundreds of companies from start-ups to tech titans
like Microsoft and IBM pursuing its profit-making potential.

Among the CL tasks that computers “carry out (or try to carry out)” every
day are:

  • “Searching large databases (or the entire internet) for documents
    containing the answer to a query
  • “Listening to voice commands and acting on them correctly
  • “Providing a speaking voice for text (reading out loud)
  • “Translating documents from one language to another
  • “Searching databases of current text to gather information about what
    words mean”3

Today, the most prominent application of computational linguistics is
speech recognition, used primarily for dictation and telephone-based order
entry. As evidence of its continuing evolution, CL was one of the enabling
technologies that allowed IBM’s Watson to defeat a pair of former
champions on the game show Jeopardy!.

CL applications are being used by attorneys to expedite e-discovery
orders, filtering thousands of documents in an effort to extract
information relevant to a particular court proceeding. In some cases,
these programs are taking the jobs of first-year associates and other
junior-level personnel – persons who would normally assume this legal
“grunt” work.

Turning from law to medicine, CL technology is playing a prominent role
in facilitating the production of electronic medical records (EMRs), thus
reducing the cost of healthcare administration.


Italian Roberto Busa is considered the pioneer of computational
linguistics, according to an article published by IBM. “In 1946, [Busa]
proposed a revolutionary idea to [Big Blue]: using computers to study
texts, in particular the collected works of St. Thomas Aquinas.

“In 1949, during a trip to New York, [Busa] had the chance to present his
idea to Thomas Watson, Sr., founder of the IBM Corporation, who decided to
support his project. In 1980, after thirty years’ work, the [56-volume]
‘Index Thomisticus’ [was produced], an imposing work which gathers the
entire production of St. Thomas Aquinas in a format readable and
manageable by computer using the methodology developed by Father Busa.”4

About the time Father Busa was approaching IBM, political leaders, trying
to interpret the impact of the Soviet Union emerging as a
world power, concluded that the prospects for peace could be enhanced if
Russian could be readily translated into English, and vice versa.

What they learned, a lesson that would be repeated over the course of
the next several decades, was that converting one natural language into
another is very hard and demands formidable technology.

Four Essentials to Success

According to analyst Sanjay Srivastava, there are “four essentials to
success with computational linguistics:

  1. Scale – The technology should be [readily
    expandable], whether … working with hundreds of documents or
  2. Speed – [The technology] has to be able to review a
    high volume of documents and extract relevant information fast.
  3. Accuracy – The extracted data has to be accurate,
    especially if [the data] will be used to make business-critical
    decisions. It’s not enough to be at 60-70 percent – there needs to be
    98-99 percent accuracy.
  4. Traceability – Companies subject to audits have to
    be able to track and trace how they arrived at their ultimate decision.
    In commercial lending, for instance, a small footnote could make a
    dramatic impact on the risk score. Banks need to be able to drill down
    and pinpoint key data sources.”5

The Obstacles to Computational Linguistics

Just as dancing is more than steps and patterns, language is more than
vocabulary and grammar. And like dancing, language, especially spoken
language, is a performing art. For computational linguists, this
performance aspect is what renders machine translation (MT) difficult.

Consider, for example, that speakers may have different accents, and may
also speak differently depending upon emotion. Similarly, speakers are
known to use various forms of non-verbal communication, such as nods,
frowns, and sighs, to convey certain attitudes. In addition, the same word
may have different meanings depending upon the speaker’s intention. (“No?”
is different from “No…,” which is different from “No!”).6

Occupational jargon and slang pose a further problem, as do idiomatic
expressions that reflect a particular cultural context and have no
corresponding meaning in another language.

Finally, as analyst Laurie Gerber reminds us, “High-quality output is
rarely achieved when working from linguistically under-specified languages
[Chinese and English, for example] to highly specified … languages such
as Arabic. Going from a grammatically rich language to a grammatically
less-specified language is easy – information can be lost without harming
the translation. But going to grammatically richer languages requires the
translation system to infer or perhaps manufacture information that is not
present in the source language.”7

As a result of these impediments, many computational linguists content
themselves with systems that work with written, rather than spoken, text.
Even interpreting written text, however, has its challenges, such as:

  • Determining the veracity, or factuality, of events (are you sure that
    this really happened?).8
  • Generating numerical approximations (for phrases like “more than a

As long as computational linguistics remains an art rather than a
science, trust will be an ever-present issue. For the most important
applications of machine translation – e-discovery, for example – CL
results will need to be verified by humans, perhaps by validating random
samples of translated text.
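
The random-sample validation described above can be sketched in a few lines
of Python. This is a minimal illustration, not a prescribed procedure: the
function names, the 5 percent default, and the fixed seed are all
assumptions introduced here.

```python
import random


def sample_for_review(segment_ids, fraction=0.05, seed=0):
    """Pick a reproducible random subset of translated segments for human review.

    The fraction and seed defaults are illustrative assumptions.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(segment_ids)
    if not ids:
        return []
    rng = random.Random(seed)  # seeded so the same sample can be re-drawn for an audit
    k = max(1, round(len(ids) * fraction))
    return sorted(rng.sample(ids, k))


def estimated_accuracy(review_results):
    """Share of reviewed segments that a human judged acceptable.

    review_results maps a segment id to True (acceptable) or False.
    """
    return sum(review_results.values()) / len(review_results)


if __name__ == "__main__":
    picked = sample_for_review(range(1000), fraction=0.05, seed=42)
    print(len(picked), "segments selected for human review")
```

Seeding the generator keeps the selection repeatable, which matters when a
reviewer later needs to show which segments were checked.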


Computational Linguistics

As itemized by Heidelberg University, computational linguistics serves a
wide range of existing and potential applications:

  • Support in translating texts from one language to another or
    completely automatic translation (machine translation).
  • Automatic management of large databases containing
    information in language form and the retrieval of information from such
    databases, e.g. via automatic production of summaries and abstracts
    (summarization), or the location of specific information in a large
    number of academic publications (e-science).
  • Location of information in heterogeneous data sources
    (Internet, large structured databases, corporate portals, etc.).
  • Automatic question-answering on the basis of large databases
    or information in language form on the World Wide Web.
  • Language-learning and correction programs for
    foreign-language learners (vocabulary trainers and other practice
    programs) and spelling and grammar correction programs for native
    speakers in text editors.
  • Linguistic interaction with computers or artificial intelligence
    (AI) systems in the field of robotics, virtual worlds or
    computer-aided medical care.”10

Interpreting Violence Threats – As evidence of the potential power
and societal value of computational linguistics, and its diverse
applications, Isabelle Van Der Vegt suggests, in her doctoral thesis, that
CL might be employed to help understand threats of “grievance-fueled
targeted violence.” Although CL is probably “less suited” to the
prediction of violent behavior, it might contribute to explaining such
behavior and, therefore, add to the body of knowledge required to prevent
or reduce violence – a persistent public policy priority and human rights

Bible Content Sourcing – In another fascinating example of CL
utility, researchers have developed an algorithm that could help determine
the different sources that contributed to the individual books of the
Bible. Professor Nachum Dershowitz of Tel Aviv University’s Blavatnik
School of Computer Science worked in collaboration with his son, Bible
scholar Idan Dershowitz of Hebrew University, Professor Moshe Koppel, and
Ph.D. student Navot Akiva of Bar-Ilan University to create a computer
algorithm that recognizes linguistic cues, such as word preference, to
divide texts into probable author groupings. By focusing exclusively on
writing style instead of subject or genre, they claim to have surmounted
several methodological hurdles that hamper conventional Bible scholarship.

To test the validity of their method, the researchers randomly mixed
passages from the two Hebrew books of Jeremiah and Ezekiel, and asked the
computer to separate them. By searching for and categorizing chapters by
synonym preference and then looking at usage of common words, the computer
program was able to separate the passages with 99 percent accuracy.12
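
The idea of grouping text by synonym preference can be illustrated with a
toy sketch. It is not the researchers' algorithm: the English synonym sets
below are hypothetical stand-ins for the Hebrew ones, and the two helper
functions are names invented for this example. The sketch only shows how a
passage's word choices can be turned into a profile and compared.

```python
import re

# Hypothetical English synonym sets standing in for the Hebrew ones in the study.
SYNONYM_SETS = [("big", "large"), ("start", "begin"), ("quick", "fast")]


def preference_profile(passage):
    """For each synonym set, record which variant the passage favors.

    An entry is None when the passage uses none of the variants or uses
    them equally often.
    """
    words = re.findall(r"[a-z]+", passage.lower())
    profile = []
    for variants in SYNONYM_SETS:
        counts = [words.count(v) for v in variants]
        best = max(counts)
        if best == 0 or counts.count(best) > 1:
            profile.append(None)
        else:
            profile.append(variants[counts.index(best)])
    return tuple(profile)


def agreement(profile_a, profile_b):
    """Count the synonym sets on which two passages share a preference."""
    return sum(1 for a, b in zip(profile_a, profile_b) if a is not None and a == b)


if __name__ == "__main__":
    first = preference_profile("The big army will start a quick march.")
    second = preference_profile("The large fleet will begin a fast voyage.")
    print(first, second, agreement(first, second))
```

Passages with high mutual agreement would be candidates for the same
author grouping; a real system would need many more synonym sets and a
proper clustering step.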

Natural Language Processing

As the engineering arm of computational linguistics, natural language
processing helps streamline and expedite business processes by:

  • Automating routine tasks, through chatbots and other digital
  • Improving searches, “disambiguating” word definitions based
    on context (carrier, for example, means something different in
    biomedical than in industrial contexts).
  • Enabling search engine optimization, improving an
    enterprise’s rank – and, thus, visibility – in online searches.
  • Analyzing and organizing large document collections,
    including corporate reports and scientific documents.
  • Advancing social media analytics, evaluating customer reviews
    and social media comments to make better sense of huge volumes of
  • Providing market insights, analyzing the language of
    customers to determine their needs and how to communicate with them.
  • Moderating user or customer content, [maintaining] quality
    and civility by analyzing not only the words, but also the tone and
    intent of comments.13
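
The “carrier” example in the list above can be made concrete with a
simplified overlap-based disambiguator, in the spirit of the Lesk
approach. The sense inventory, keyword lists, and function names are
assumptions made up for this sketch; a production system would rely on a
real lexical resource and a trained model.

```python
import re

# Hypothetical mini sense inventory: each sense is described by a few keywords.
SENSES = {
    "carrier": {
        "biomedical": "person organism harbors pathogen gene transmit disease infection",
        "industrial": "company vehicle transports freight cargo goods shipping logistics",
    }
}


def tokens(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(word, context):
    """Return the sense whose keywords overlap most with the context.

    Ties go to the sense listed first in SENSES.
    """
    context_words = tokens(context) - {word}
    scores = {
        sense: len(tokens(keywords) & context_words)
        for sense, keywords in SENSES[word].items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(disambiguate("carrier", "The patient is an asymptomatic carrier of the gene"))
    print(disambiguate("carrier", "The carrier delivered the freight to the shipping dock"))
```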

NLP Market – According to a June 2021 market research report
prepared by Fortune Business Insights, the global NLP market is
expected to grow from $20.98 billion in 2021 to $127.26 billion in
2028, realizing a remarkable compound annual growth rate (CAGR) of
29.4 percent during the forecast period.
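
The growth rate can be checked from the two endpoint figures. The short
sketch below applies the standard compound annual growth rate formula,
(end / start) ^ (1 / years) - 1, to the numbers quoted above; the function
name is simply a label chosen for this example.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.294 means 29.4 percent)."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # Figures quoted above: $20.98 billion (2021) to $127.26 billion (2028).
    rate = cagr(20.98, 127.26, 2028 - 2021)
    print(f"{rate:.1%}")  # prints 29.4%
```

Seven years of compounding at roughly 29.4 percent multiplies the market
about sixfold, which matches the forecast range.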

The principal driving factors are:

  • The enterprise need to improve customer experience (CX)
    capabilities, by better analyzing customer inquiries and comments;
  • The increasing adoption of smart assistants, such as Amazon Alexa
    or Apple’s Siri, which demand robust NLP skills.


On the frontier of computational linguistics are capabilities like
“natural language generation” and “natural language understanding.”

Natural Language Generation

Natural language generation (NLG), also known as automated narrative
generation (ANG), transforms Big Data into narrative reports by
recognizing and extracting key insights contained within the data and
translating these findings into plain English for ready consumption.
There are now articles generated by respected news outlets like Forbes
and the Associated Press that are actually “penned” by computers. Some
analysts believe that as much as 90 percent of news could be
algorithmically generated by the mid-2020s, much of it without human

While we’ve long known that mechanical processes like automotive
manufacturing can be reduced to repeatable steps – which robots can
assimilate and perform with greater efficiency, fewer defects, and lower
costs than humans – it turns out that certain intellectual activities,
such as translating a company’s quarterly earnings report into an
article for investors, are also “mechanical” and can be accomplished by
computer. In essence, knowledge workers who may have felt
relatively immune to the encroachment of artificial intelligence may
have to seriously reconsider their future job prospects.

This new reality was glimpsed a few years ago when application
developers started to produce programs that could sort through massive
amounts of evidentiary material accumulated through e-discovery orders
and identify items of interest to litigators. Instead of engaging
an army of associates and paralegals to analyze the documents, law
partners could delegate the work to computers, which operate faster and
avoid the type of fatigue-based errors and omissions that might
characterize reviews by humans.

Importantly, NLG outputs can be generated in multiple forms, each
tailored to a specific audience.
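
A template-based sketch shows the basic mechanics of turning structured
data into audience-specific text. Commercial NLG systems are far more
sophisticated; the templates, field names, and the company “Example Corp”
are all hypothetical and serve only to illustrate the idea.

```python
def movement(current, previous):
    """Describe the change between two figures in plain English."""
    if current == previous:
        return "was unchanged"
    pct = abs(current - previous) / previous * 100
    verb = "rose" if current > previous else "fell"
    return f"{verb} {pct:.1f} percent"


# One hypothetical template per audience; unused fields are simply ignored.
TEMPLATES = {
    "investor": (
        "{company} reported {quarter} revenue of ${revenue:.1f} million, "
        "which {movement} from the prior quarter."
    ),
    "general": "{company}'s revenue {movement} in {quarter}.",
}


def earnings_sentence(record, audience="investor"):
    """Render one earnings record as a sentence for the chosen audience."""
    return TEMPLATES[audience].format(
        company=record["company"],
        quarter=record["quarter"],
        revenue=record["revenue_m"],
        movement=movement(record["revenue_m"], record["prior_revenue_m"]),
    )


if __name__ == "__main__":
    record = {
        "company": "Example Corp",
        "quarter": "Q2",
        "revenue_m": 110.0,
        "prior_revenue_m": 100.0,
    }
    print(earnings_sentence(record, "investor"))
    print(earnings_sentence(record, "general"))
```

The same record yields a detailed sentence for investors and a shorter one
for general readers, which is the sense in which output is tailored to an
audience.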

Natural Language Understanding

One of the newly emerging elements of computational linguistics – and
one of the most exciting – is natural language understanding (NLU), also
called “natural language interpretation” (NLI). Through NLU analysis,
“computers manage to interpret … language and define a user’s intent.”
Unlike simple speech recognition, NLU “focuses on the determination of
intent, sentiment, and context.”

According to analyst Bogdan Koretski, “[with] the evolution of NLU,
we’ll see the ubiquitous growth of its applications. Just imagine how it
will be: systems that understand the meaning, provide instant answers,
and even direct the conversation. Directory services that assist people
in finding the desired information, resolving their issues, and more.
Such products will allow companies to dramatically improve customer
service and automate manual operations.

“Due to the deep and correct language interpretation, machine-to-human
communication will reach a completely new level. Business processes like
data collection and analysis, data vetting, and facts checking will be
automated and errors excluded.”14

Outreach Programs

Achieving breakthroughs in computational linguistics will require a
steady flow of students ready to major in this difficult field. Some
colleges and universities, like Cornell, Carnegie Mellon, and MIT, are
participating in outreach programs designed to entice middle and high
school students to sample CL.

One such program is the North American Computational Linguistics Open
Competition (NACLO), a contest designed for high-school students.

NACLO was started in 2006 to promote computational linguistics, and
linguistics in general, in North America. Its founders include Lori
Levin (Carnegie Mellon University, general chair), Dragomir Radev
(University of Michigan, program chair), Tom Payne (University of
Oregon), James Pustejovsky (Brandeis University, sponsorship chair), and
Tanya Korelsky (NSF).


Computational linguistics shares a curious history with nuclear
fusion. Both technologies:

  • Were conceived around the same time.
  • Offered enormous promise (in the case of nuclear fusion an almost
    limitless supply of clean, inexpensive electrical energy).
  • Disappointed their proponents by never reaching their potential.

Throughout its history, computational linguistics – just like nuclear
fusion – has always been 20 years away.

Although the metaphor may be a bit cliché, the US should make a real
“man on the moon”-style commitment to CL research and development, not
just for military applications like those being furthered by DARPA, but
also for civilian applications, replacing today’s man-machine interface,
the computer programming language, with something more fundamental:
the spoken word.

As the Department of Linguistics at Fresno State in California
observes, “Computational linguistics is both highly theoretical and
highly practical. To succeed, those of us involved in the field must
make key advances in understanding how language really works, and we
must work tirelessly on computing techniques which maximize the utility
of what little we do understand about language.”15

As a means of furthering interest in computational linguistics, analyst
Tanmoy Ray observes that “[traditionally], linguistics graduates have
always found jobs within academia, writing, or [the] translation fields.
But, with … technological advancements and increasing digital
transformation, linguistics graduates are finding their skills in
high demand in the technology sector, particularly for … areas related
to artificial intelligence (AI) and machine learning (ML).”16

About the Author

James G. Barr is a leading business continuity analyst
and business writer with more than 30 years’ IT experience. A member of
“Who’s Who in Finance and Industry,” Mr. Barr has designed, developed,
and deployed business continuity plans for a number of Fortune 500
firms. He is the author of several books, including How to Succeed
in Business BY Really Trying, a member of Faulkner’s Advisory
Panel, and a senior editor for Faulkner’s Security Management.
Mr. Barr can be reached via e-mail at
