ARCHIVED REPORT:
Internationalized Domain Names
Copyright 2013, Faulkner Information Services. All Rights Reserved.
Docid: 00011523
Publication Date: 1312
Report Type: TUTORIAL
Preview
Since the Internet became a reality, Internet domain names have been composed from the set of characters used to
represent English and other Latin alphabet-based languages. About 400 million
people worldwide claim English as their primary language. The global population,
however, is more than 7 billion. Clearly, globalizing the audience to which the Internet is fully accessible
in a primary language is a business opportunity unparalleled in human history.
Internationalized Domain Names will help take advantage of this opportunity and make the Web understandable to millions more people in their native languages.
Executive Summary
Internationalized Domain Names (IDNs) contain characters from outside the ASCII character set. Examples include characters from right-to-left scripts like Arabic and Hebrew, and from the non-alphabetic scripts typical of many Asian languages. Various mechanisms have been in place for some time to allow domain name stems (the www.something portion of a domain name) to be displayed in non-ASCII characters. The current push by the Internet Corporation for Assigned Names and Numbers (ICANN) to adopt fully internationalized domain names will make it possible to render them entirely in local characters, including the terminal top-level domain name (e.g., .com, .org). The objective of global deployment of IDNs is to finally and completely internationalize the Internet experience, allowing people full access to information and services in their own languages.
Resolution of IDNs requires a bit of magic on the part of the DNS name resolver infrastructure. Existing ASCII names must still work properly, and new, IDN-style names must translate predictably, consistently, and in a law-abiding way.
It must be understood, however, that the mere ability to locate domains in linguistically unrestricted fashion does not finish the job of internationalizing Internet content and services. In fact, it implies a whole host of issues for serving and handling Web content in an equally unrestricted way. The World Wide Web Consortium (W3C) has been actively engaged in getting ready for the impacts of IDNs. The W3C Internationalization Activity, dubbed i18n (an industry abbreviation for the word internationalization), is tasked with ensuring that the W3C’s formats and protocols are usable worldwide in all languages and in all writing systems.
Description
Why Internationalize Domain Names
Computers handle text by means of logical characters. ASCII, the oldest universally accepted logical character set, encodes characters in 7 bits, which limits ASCII to a total of 128 characters. As it is based solely on the English alphabet, this number is sufficient to encode upper and lower case letters, numbers, punctuation, the space, and a set of non-printing control characters including those for tab and newline. Non-internationalized Internet domain names are rendered using the ASCII character set. Unfortunately, neither ASCII nor its variants provide for encoding languages which don’t use the Latin alphabet. For this reason, in international use, ASCII has largely been superseded by a standard which expands and subsumes it, Unicode. Unicode defines a code space of more than one million code points, of which more than 100,000 are assigned to logical characters, and stores them using the UTF-8, UTF-16, or UTF-32 encoding forms. This is sufficient to encode most of the world’s writing systems, which are mapped into Unicode as script-specific blocks of characters.
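The gap between ASCII and Unicode can be seen directly in a few lines of Python. This is a minimal sketch rather than part of any standard; the helper name `is_ascii` and the sample labels are illustrative assumptions.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character falls in the ASCII range 0-127."""
    return all(ord(ch) < 128 for ch in text)


ascii_label = "example"
idn_label = "b\u00fccher"  # "bücher": contains U+00FC, which ASCII cannot represent

# A plain ASCII label encodes to one byte per character.
ascii_bytes = ascii_label.encode("ascii")

# A label with a non-ASCII character cannot be encoded as ASCII at all.
try:
    idn_label.encode("ascii")
    idn_fits_ascii = True
except UnicodeEncodeError:
    idn_fits_ascii = False

print(is_ascii(ascii_label), is_ascii(idn_label), idn_fits_ascii)
```

The failure on the second label is the limitation that IDNs are designed to work around.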
Internationalized Domain Names resolve characters using Unicode character sets and algorithms. Examples of writing systems so encoded include right-to-left scripts like Arabic and Hebrew, and non-alphabetic scripts typical of many Asian languages. Various ad hoc mechanisms have been in place for some time to allow domain name stems (the www.something portion of a domain name) to be locally displayed in non-ASCII characters. ICANN’s current push to adopt fully internationalized domain names will make it possible to render them entirely in local characters. The objective of global deployment of IDNs is to finally and completely internationalize the Internet experience, allowing people access to information and services in their own languages.
Resolution of IDNs requires a bit of magic on the part of the DNS name resolver infrastructure: existing ASCII names must still work properly, and new, IDN-style names must translate predictably, consistently, and in a law-abiding way. This is accomplished by the Internationalizing Domain Names in Applications (IDNA) infrastructure, which relies on a Unicode-to-ASCII translation facility called (no kidding) Punycode. On a conceptual level, Punycode is an algorithmic tool that uniquely and reversibly translates an IDN string to an ASCII string, marked with the prefix xn--, that the Internet’s root server and name resolver infrastructure can successfully associate with an IP address.
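The translation step can be tried out with Python's standard library, which ships `punycode` and `idna` codecs. The sketch below is illustrative only: the helper names `to_ascii` and `to_unicode` and the sample domain are assumptions, and Python's built-in `idna` codec implements the older IDNA 2003 rules rather than the current IDNA 2008 revision.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to the ASCII form sent to DNS."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


unicode_name = "b\u00fccher.example"  # "bücher.example"
ascii_name = to_ascii(unicode_name)

# The raw Punycode transformation of a single label, without the xn-- prefix.
raw_label = "b\u00fccher".encode("punycode").decode("ascii")

print(ascii_name)              # the name a resolver actually looks up
print(to_unicode(ascii_name))  # round-trips to the original Unicode name
print(to_ascii("example.com")) # plain ASCII names pass through unchanged
```

Because the mapping is reversible, applications can display the Unicode form to users while the DNS continues to carry only ASCII.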
The Challenges of Deploying Internationalized Domain Names
Internationalized Domain Names present a nuanced set of problems for security and semantic resolution, and addressing these issues has been a significant part of the workload in creating an international standard by which all the world’s languages may be used in domain names. To understand how security loopholes arise, it is necessary to take a minor detour through the history of typography. The characters we write by hand are called glyphs. The nature of human anatomy and the evolution of writing tools have jointly produced fairly consistent kinds of glyph-making across the world’s languages. By way of example, ancient runic inscriptions that are clearly linguistically independent in origin and widely dispersed geographically typically contain a preponderance of upright characters that look a good deal like K, H, I and L. Most western languages have glyphs which combine circular and upright shapes. Similarly, Middle Eastern written languages contain characteristic combinations of shapes, as do Southeast Asian and Asian languages. When glyphs are rendered as computer fonts, representations are occasionally identical in visual appearance, though distinct in their Unicode logical character encoding. This creates the potential for internationalized domain name homograph attacks.
An IDN homograph attack is a form of spoofing in which a malicious Web site convinces a user it is legitimate by displaying a domain name that appears to be correct, but in fact includes spurious Unicode characters identical in appearance to the correct ones. To preempt potential attacks of this sort, domain registries maintain lists of the characters that are permitted in the domain names they register.
Semantic difficulties arise when there is more than one character combination that may be used to render a name. So far, this type of problem occurs most frequently where existing domain names are rendered using Latin alphabet character combinations to work around the lack of availability of diacritical marks. Semantic inconsistencies mean that the owner of a domain name might be required to register domains using several different spellings in order to capture all of the potential traffic to their site.
Interoperability is another potential difficulty for IDN implementation. Old-style ASCII domain names may be cut and pasted freely between browser address bars and documents; IDNs survive the same handling only if every application along the way supports Unicode. Most browsers have been updated or have add-on support for rendering Unicode domain names. For example, Microsoft Internet Explorer offers users the opportunity to change address bar settings on their browser if they visit a site with an internationalized domain name.
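A homograph can be illustrated with two strings that look alike but differ in their code points. The sketch below uses a rough heuristic that is an assumption of this example, not a registry's actual policy: it takes the first word of each letter's Unicode character name as a stand-in for its script and flags labels that mix more than one. The function names and the sample label are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the scripts used in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

print(genuine == spoofed)           # False: different code points
print(scripts_in_label(spoofed))    # both LATIN and CYRILLIC appear
print(looks_mixed_script(genuine), looks_mixed_script(spoofed))
```

A check like this catches only mixed-script labels; a spoof written entirely in one look-alike script would need a different test.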
Current View
ICANN’s example.test Web Pages
ICANN engaged Swedish firm Autonomica AB to develop, conduct and report results of closed lab tests of internationalized top-level domains in a simulation of the public root environment. Autonomica’s March 7, 2007 report deemed the results of the laboratory tests successful:
Autonomica AB has, under a contract with ICANN, investigated whether the
addition of top level domains containing encoded internationalized characters
(so called IDNs) to the public root zone for testing purposes has any impact on
the iterative mode resolvers used to look up the information. No impact at all
could be detected. All involved systems behaved exactly as expected.
Following the successful test, ICANN launched live IDN implementation tests. In October 2007, users were provided access to wiki pages with the domain name example.test in 11 test languages: Arabic, Persian, Chinese (simplified and traditional), Russian, Hindi, Greek, Korean, Yiddish, Japanese and Tamil. Six additional demonstration pages have been added since the original test launch to support Amharic, Bengali, Hebrew, Khmer, Thai and Urdu. The wikis allow users to establish subpages with their own names in one of the 17 test languages. Initial implementation languages were chosen based on feedback from communities that have shown the most interest in moving IDNs from concept to reality.
Script | Language
---|---
Arabic | Arabic
Simplified Chinese | Chinese
Traditional Chinese | Chinese
Greek | Greek
Devanagari | Hindi
Kanji, Hiragana, Katakana | Japanese
Hangul | Korean
Perso-Arabic | Persian
Cyrillic | Russian
Tamil | Tamil
Hebrew | Yiddish
Ge’ez | Amharic
Bengali | Bengali
Hebrew | Hebrew
Khmer | Khmer
Thai | Thai
Persian | Urdu
The IDN ccTLD Fast Track Process
ICANN allowed countries and international territories to start applying for internationalized country code top-level domains (IDN ccTLDs) in 2009. Conventional ccTLDs are two-letter ASCII domains assigned based on the ISO 3166-1 standard for coding countries and international territories; an IDN ccTLD represents the name of a country or territory in a local script.
The process by which countries can apply for these domains is called the IDN ccTLD Fast Track. It is designed to allow ICANN to quickly implement new domain names while still complying with IDNA protocols, security standards, and current best practices for ccTLD implementation.
Countries that want to request their own internationalized domain name must participate in a three-step process:
- Preparation. The country must identify the IDN it wants to request, how it will be administered, and who will be responsible for managing it.
- String Evaluation. The country submits a request to ICANN. In addition to the topics covered above, it must submit supporting documentation.
- String Delegation. Assuming the application meets all the string evaluation criteria, the domain name request then goes through the standard ICANN IANA process that is used for ASCII-based ccTLDs. Requests are submitted to IANA for root zone management.
Status of IDN ccTLDs
In May 2010, Egypt, Saudi Arabia, and the United Arab Emirates became the first countries to receive IDN ccTLDs through the Fast Track process. Their domain names were the first to use the Arabic alphabet, which is written right to left. The implementation of these domains was lauded by ICANN CEO Rod Beckstrom as a historic moment for the Internet, since it made Internet content accessible to millions of people in their native language.
The countries/territories listed in
Table 2 have completed Step 2 (String Evaluation) and are free to enter Step 3
(String Delegation).
Country/Territory | Language | Status
---|---|---
Algeria | Arabic | Delegated
Bangladesh | Bangla | Pending Delegation
China | Chinese | Delegated
Egypt | Arabic | Delegated
Georgia | Georgian | Pending Delegation
Hong Kong | Chinese | Delegated
India | Hindi, | Delegated
Iran, Islamic Republic of | Persian | Delegated
Jordan | Arabic | Delegated
Kazakhstan | Kazakh | Delegated
Korea, Republic of | Korean | Delegated
Malaysia | Malay | Delegated
Mongolia | Mongolian | Delegated
Morocco | Arabic | Delegated
Oman | Arabic | Delegated
Pakistan | Urdu | Pending Delegation
Palestinian Territory, Occupied | Arabic | Delegated
Qatar | Arabic | Delegated
Russian Federation | Russian | Delegated
Saudi Arabia | Arabic | Delegated
Serbia | Serbian | Delegated
Singapore | Chinese | Delegated
Sri Lanka | Sinhalese | Delegated
Sudan | Arabic | Pending Delegation
Syrian Arab Republic | Arabic | Delegated
Taiwan | Chinese | Delegated
Thailand | Thai | Delegated
Tunisia | Arabic | Delegated
Ukraine | Ukrainian | Delegated
United Arab Emirates | Arabic | Delegated
Yemen | Arabic | Pending Delegation

Source: ICANN
Outlook
IDN gTLDs
ICANN is in the process of
allocating a new set of top-level domains known as generic TLDs (gTLDs). More than a thousand applications have already been filed for these new domains, including more than a hundred IDN gTLDs.1
IDN Variants
While IDNs can serve as powerful tools for broadening the Internet’s capacity and accessibility, IDN variants may pose problems. A variant is said to exist when a single conceptual character can be identified with two or more different Unicode code points with graphic representations that may be visually similar. For example, a string in traditional Chinese commonly has an equivalent in simplified Chinese.
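The traditional/simplified case can be sketched with a toy variant table. Everything in this example is a hypothetical illustration: the one-entry `VARIANTS` mapping and the function names are assumptions, and real variant handling is governed by the label generation rules that registries and ICANN define, not by a lookup this simple.

```python
# Toy variant table: traditional code point -> simplified code point.
# U+570B (traditional "country") and U+56FD (simplified "country")
# are distinct code points for the same conceptual character.
VARIANTS = {"\u570b": "\u56fd"}


def canonical(label: str) -> str:
    """Map each character to its preferred variant, if the table has one."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


def are_variants(a: str, b: str) -> bool:
    """Treat two labels as variants if they share a canonical form."""
    return canonical(a) == canonical(b)


traditional = "\u4e2d\u570b"  # "China" in traditional characters
simplified = "\u4e2d\u56fd"   # "China" in simplified characters

print(traditional == simplified)              # False: different code points
print(are_variants(traditional, simplified))  # True under the toy table
```

Two labels that compare unequal as strings can therefore still need to be allocated or blocked together.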
To support IDN variants in the root zone, the ICANN community undertook several projects to study and make recommendations on their viability, sustainability, and delegation.
In April 2013, the
ICANN Board adopted a resolution directing staff to implement the Procedure to Develop and Maintain the Label Generation Rules for the Root Zone in Respect of IDNA
Labels.2
IDNs Will Be Essential
The regions representing the highest expected growth in Internet usage over the next decade will directly benefit from IDNs.3
Recommendations
About 400 million people worldwide claim English as their primary language. In 2011,
actuarial estimates put global population at 7 billion. Clearly, no matter how you do the rounding, globalizing the audience to which the
Internet is accessible in a primary language is a business opportunity unparalleled in human history. Given this, one can safely make these two assertions: It is going to happen, and it will be challenging.
Get Up To Speed On Unicode
The complexity of globalizing Web presence has to do with factors which transcend technology and incorporate linguistics, politics and culture. It is for this reason that understanding Unicode as a technology suite is a fundamental literacy and core competency for
Web application architects and managers. Important Unicode concepts include:
- Organization of the code space.
- Allocation of Unicode characters.
- Character encoding formats, and when to use each.
- Code points.
- Use of byte order marks.
- Formatting and special characters.
- Unicode Standard and Unicode Algorithms.
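Several of the concepts listed above (code points, encoding formats, and byte order marks) can be explored with Python's standard library. This is a small sketch for orientation; the sample string is an arbitrary choice for illustration.

```python
import codecs

# 'A', 'é', and MUSICAL SYMBOL G CLEF, which lies outside the Basic Multilingual Plane.
text = "A\u00e9\U0001d11e"

# Code points: the numbers Unicode assigns to each logical character.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Encoding formats: the same code points stored as different byte sequences.
utf8 = text.encode("utf-8")        # 1 + 2 + 4 bytes
utf16 = text.encode("utf-16-be")   # 2 + 2 + 4 bytes (surrogate pair for the clef)
utf32 = text.encode("utf-32-be")   # 4 bytes per code point

# Byte order mark: the generic "utf-16" codec prepends a BOM in native byte order.
with_bom = text.encode("utf-16")
starts_with_bom = with_bom.startswith(codecs.BOM_UTF16)

print(code_points)
print(len(utf8), len(utf16), len(utf32), starts_with_bom)
```

UTF-8 is the most compact form for mostly-ASCII text, while UTF-16 and UTF-32 trade space for simpler indexing, which is one reason the choice of encoding matters.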
Perhaps unsurprisingly, over half of the characters in the Unicode Standard are ideographic, and are meant to represent current and past usage of Chinese, Japanese, Korean and Vietnamese. This subset is known as Unihan, and is of primary interest to anyone globalizing business processes or content that targets greater Asia.
International Components for Unicode (ICU)
Solid, proven open source tools for handling Unicode processing exist and are of great value in internationalizing text handling facilities. The most popular of these, International Components for Unicode (ICU), is a mature toolset, is available as both Java and C/C++ libraries, and is released under an open source license. ICU is broadly portable, widely used and known to produce consistent results across platforms. A link to the project site is provided below.
References
1-3 "Enabling a Multilingual Internet." ICANN. 2013.
Web Links
- Internet Corporation for Assigned Names and Numbers: http://www.icann.org/
- International Components for Unicode: http://www.icu-project.org/
- World Wide Web Consortium: http://www.w3.org/
About the Author
James G. Barr is a leading business continuity analyst and
business writer with more than 30 years’ IT experience. A member
of Who’s Who in Finance and Industry,
Mr. Barr has
designed, developed, and deployed business continuity plans for a
number of Fortune 500 firms. He is the author of several books,
including How to Succeed in Business BY Really Trying, a member
of Faulkner’s Advisory Panel, and a senior editor for Faulkner’s
Security Management Practices. Mr. Barr can be reached via
e-mail at jgbarr@faulkner.com.