Language Codes for the Web

Have you ever noticed that companies like IBM and Microsoft use language codes that vary in length, sometimes contain a hyphen, and sometimes include numbers? They are using a language tagging method defined in RFC 5646.

You can find RFC 5646 tags in several places, such as URL paths and the lang attribute of an HTML page. It is a web standard that matters for both accessibility and SEO.

Why RFC 5646 language tagging is used and important to the web

You may have heard of ISO 639 (the International Organization for Standardization’s codes for the representation of names of languages) and think, well, there you go, we already have a standardized way to identify a language. However, if you stop and think about it, English words are spelled differently depending on whether you are from America or Britain. There are often vocabulary differences depending on where you are from, sometimes even within the same country, like pop vs. soda in America. Then think about European Portuguese and Brazilian Portuguese, which also have spelling and vocabulary differences plus differences in pronouns and verbs. European French and Canadian French differ in their prepositions, in when specifiers are used, and even in whether there should be a space before punctuation.

RFC 5646, titled “Tags for Identifying Languages,” is a standard published by the Internet Engineering Task Force (IETF) that defines the structure and syntax for language tags used in various internet protocols and formats. IETF language tags are crucial for communicating users’ language preferences and for identifying the language of content for accessibility tools such as screen readers and for Search Engine Optimization (SEO). In this article, we will start with a simplified explanation of RFC 5646, covering its purpose, structure, and examples, then look at how and where it is used on the web and where you can find good resources to look up the language tag you need.

Purpose of RFC 5646

The primary objective of RFC 5646 is to establish a standardized system for identifying languages in a consistent and globally understood manner.
RFC 5646 defines a structured language tag format discussed in detail below, allowing for a flexible and comprehensive representation of language-related information.

Language tags play a vital role in facilitating language negotiation between users, applications, and systems. They are utilized in a wide range of applications, including web browsers, email clients, document processing, and more.

By adhering to this standardized language tagging system, internet users can express their language preferences, enabling applications to present content in a way that aligns with the user’s linguistic preferences. This is essential for providing a personalized user experience, which is a fundamental aspect of modern digital communication.

Structure of Language Tags

A language tag defined in RFC 5646 consists of one or more subtags, separated by hyphens. These subtags are used to represent various language-related information, including primary language, region, script, and additional attributes. The structure of a language tag is organized and defined by the following subtags:

  • Language: This is the essential part of the language tag and is the only subtag that is required. It consists of a two- or three-letter code for the primary language (ISO 639-1, ISO 639-2, ISO 639-3, and ISO 639-5).
  • Script: Four-letter code representing the script (ISO 15924).
  • Region: Two-letter code (ISO 3166-1) or three-digit code (UN M.49) for the region.
  • Variant: Specific variant or dialect of the language.
  • Extensions: Additional language-related information. Each extension starts with a single-character subtag other than ‘x’, such as ‘u’ or ‘t’.
  • Private-use: For custom language or locale identifiers. Starts with the letter ‘x’.
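
The subtag order above can be sketched in JavaScript. This is a minimal illustration, not a full RFC 5646 parser: splitLanguageTag is a hypothetical helper that only recognizes the common language, script and region positions and hands back anything else untouched.

```javascript
// Hypothetical helper: handles only the common language[-script][-region]
// shape. Any remaining subtags (variants, extensions, private use) are
// returned as-is in `rest`.
function splitLanguageTag(tag) {
  const parts = tag.split("-");
  const result = { language: parts.shift().toLowerCase() };

  // Script: exactly four letters, conventionally written in title case
  if (parts.length && /^[A-Za-z]{4}$/.test(parts[0])) {
    const script = parts.shift();
    result.script = script[0].toUpperCase() + script.slice(1).toLowerCase();
  }

  // Region: two letters (ISO 3166-1) or three digits (UN M.49)
  if (parts.length && /^([A-Za-z]{2}|[0-9]{3})$/.test(parts[0])) {
    result.region = parts.shift().toUpperCase();
  }

  if (parts.length) {
    result.rest = parts;
  }
  return result;
}

console.log(splitLanguageTag("zh-hant-tw")); // { language: 'zh', script: 'Hant', region: 'TW' }
console.log(splitLanguageTag("es-419"));     // { language: 'es', region: '419' }
```

Tags are case-insensitive, so the helper normalizes casing to the usual convention: lowercase language, title-case script, uppercase region.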

Choose Appropriate Language Tags:

For the web, in most cases your language tag is going to be made up of one or two subtags: always the primary language, and then optionally either a script or a region.

What makes these so useful is that these subtags are defined by documented codes that are shared across the world.

  • ISO 639 is currently composed of five different parts and is what you would use to determine the value of the first subtag, Language.
  • ISO 15924 is what you would use to determine the value for the Script subtag.
  • ISO 3166 is currently composed of three different parts and is what you would use to determine the value for the Region subtag.
  • UN M.49 is what you would use to determine the value of the Region subtag if what you were looking for was not found in the ISO 3166 documents.
    • Standard Country or Area Codes for Statistical Use (Series M, No. 49) is a standard for area codes used by the United Nations.

When it comes to variants, extensions and private use, there are no ISO code lists to look values up in. Variant subtags are registered in the IANA Language Subtag Registry, each extension is defined in its own RFC, and private-use subtags only mean what the parties using them have agreed on. You will rarely need any of them on the web.

You may have seen or heard of es-XL for Latin American Spanish, like in this document from 2004. This uses the language subtag followed by XL, a code from the range that ISO 3166 reserves for private use, in the region position. These days you should instead use the language subtag and then a region subtag taken from UN M.49, which results in es-419 for Spanish as used in Latin America.

Examples of Language tags

The official RFC 5646 provides some examples in its appendix of language tags. Below are some of those along with some that I found useful.

Simple language subtag:

  de (German)

  es (Spanish)

  ja (Japanese)

Language subtag plus Script subtag:

  zh-Hant (Chinese written using the Traditional Chinese script)

  zh-Hans (Chinese written using the Simplified Chinese script)

Language-Region:

  de-DE (German for Germany)

  en-US (English as used in the United States)

  fr-CA (French as used in Canada also known as French Canadian)

  es-419 (Spanish as used in Latin America)
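
Tags like the ones above are case-insensitive, but they are conventionally written with a lowercase language, a title-case script and an uppercase region. In JavaScript, the built-in Intl.getCanonicalLocales can normalize the casing and reject malformed tags. A small sketch; canonicalTag and isWellFormedTag are hypothetical helper names:

```javascript
// Intl.getCanonicalLocales is a standard built-in. It returns the tags in
// canonical form and throws a RangeError for structurally invalid input.
function canonicalTag(tag) {
  return Intl.getCanonicalLocales(tag)[0];
}

function isWellFormedTag(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (error) {
    if (error instanceof RangeError) return false;
    throw error; // anything else is unexpected
  }
}

console.log(canonicalTag("EN-us"));      // "en-US"
console.log(canonicalTag("zh-hant"));    // "zh-Hant"
console.log(isWellFormedTag("es-419"));  // true
console.log(isWellFormedTag("en_US"));   // false (an underscore is not a valid separator)
```

Note that this only checks the shape of the tag. A well-formed tag built from codes that nobody has registered will still pass.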

Easy tools to look up the code you need can be found at
https://www.andiamo.co.uk/resources/iso-language-codes/ 
http://www.lingoes.net/en/translator/langcode.htm
http://webapps-online.com/online-tools/languages-and-locales
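
If you would rather check a tag in code than in a lookup table, modern browsers and Node.js ship Intl.DisplayNames, which turns a tag into a human-readable name. A minimal sketch, with describeTag as a hypothetical helper; the exact wording of the names comes from the runtime's locale data, so it may differ between environments:

```javascript
// Describe language tags in English using the runtime's built-in locale data.
const languageNames = new Intl.DisplayNames(["en"], { type: "language" });

function describeTag(tag) {
  // of() returns a display name for the given tag
  return languageNames.of(tag);
}

for (const tag of ["de", "fr-CA", "es-419", "zh-Hant"]) {
  console.log(`${tag}: ${describeTag(tag)}`);
}
```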

NOTE: Brock will be adding his own easy-to-use language tag lookup soon.

Ways that Language Tags are Utilized

HTML lang Attribute:

  • Use the lang attribute within the <html> tag to specify the primary language of the HTML document. This is required by WCAG and particularly useful for accessibility as some assistive technologies, such as screen readers, can change their dialect and pronunciation depending on the language being specified.
<html lang="fr">
  • Use the lang attribute on HTML elements wrapping content within the document to denote the language of specific content.
  <p lang="en">This is a paragraph in English.</p>
  <p lang="es">Este es un párrafo en español.</p>
  <p>Ceci est un paragraphe en français.</p>

Note: In the above example since we listed the page as <html lang="fr"> we do not need to list the third paragraph as being in French because the whole page has been indicated as being for French readers.

Hreflang Attribute:

  • Use the hreflang attribute on <link> tags in the <head> to point search engines to alternate language versions of the page.
  • Use the hreflang attribute on an <a> tag to indicate the language the linked document is written in. This is helpful for both SEO and accessibility.
<head>
  <link rel="alternate" href="http://example.com" hreflang="am-et" />
</head>
<body>
  <a href="https://www.w3schools.com" hreflang="en">W3Schools</a>
</body>
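
When a page exists in several languages, the alternate links are usually generated rather than written by hand. A minimal sketch of that idea; buildAlternateLinks and escapeAttribute are hypothetical helpers and the URLs are placeholders:

```javascript
// Escape characters that would break out of a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Hypothetical helper: turns a map of language tag -> URL into <link> tags.
function buildAlternateLinks(alternates) {
  return Object.entries(alternates)
    .map(([tag, url]) =>
      `<link rel="alternate" href="${escapeAttribute(url)}" hreflang="${escapeAttribute(tag)}" />`)
    .join("\n");
}

console.log(buildAlternateLinks({
  "en": "https://example.com/en/page",
  "fr": "https://example.com/fr/page",
  "fr-ca": "https://example.com/fr-ca/page",
}));
```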

HTML Email Communications:

  • Use HTML lang Attribute in the HTML Markup on the <html> tag or on markup around content to indicate the language in which the text is written. Just remember you cannot add the lang attribute to <td> tags.
  • Content-Language is an email header that can be added, if your email deployment system allows it, to indicate which language audience the email is intended for.
To: test@example.com
From: test@example.com
Subject: 13 HTML
Message-ID: <8259dd8e-2293-8765-e720-61dfcd10a6f3@example.com>
Date: Sat, 30 Dec 2017 19:12:38 +0100
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101
 Thunderbird/59.0a1
MIME-Version: 1.0
Content-Type: text/html; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-GB

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" lang="en" xml:lang="en">
  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Émile Chartier said <span lang="fr">"Aimer, c'est trouver sa richesse hors de soi."</span> Which means "To love is to find one's wealth outside of oneself."</p>
  </body>
</html>

CSS Attribute Selectors:

While CSS Logical Properties should be used in your CSS to handle many language-based and international layout needs, there may be times where this is not enough. If you are already using the lang attribute in your HTML you can then target that same attribute with CSS and potentially style based on language. For example, you could make all the Spanish paragraphs a red color.

p[lang="es"] { color: red; }
/* only matches when the lang attribute for Spanish is directly on the paragraph tag */
p:lang(es) { color: red; }
/* matches every paragraph whose language is Spanish, whether set directly or inherited from an ancestor */

Depending on how your site’s HTML and CSS are structured, you might be able to affect the sizes of all headlines for a certain language. For example, because German text is often longer than English, you may want the headlines you wrote in English to use smaller text when translated to German.

h2 { font-size: 2.5rem; }
h2:lang(de) { font-size: 2rem; }

That being said, it is always ideal to have your web design be ready to take a mixture of lengths instead of trying to resize fonts. If you do need to resize fonts, try to do it at a high level rather than per page or per element.

Server-Side HTTP Headers:

  • Content-Language is a property that can be added to the HTTP headers your server sends to users when they load your site.

The Content-Language representation header is used to describe the language(s) intended for the audience, so users can differentiate it according to their own preferred language.

For example, if “Content-Language: de-DE” is set, it says that the document is intended for German language speakers. However, it doesn’t indicate that the document is written in German; it might be written in English as part of a language course for German speakers. If you want to indicate which language the document is written in, use the lang attribute instead.

If no Content-Language is specified, the default is that the content is intended for all language audiences. Multiple language tags are also possible, as well as applying the Content-Language header to various media types and not only to textual documents.
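
As a rough sketch of how the header might be set from server-side JavaScript, the handler below uses the setHeader and end methods of Node's standard response object. The handler name and page content are made up for illustration, and in a real app the function would be passed to http.createServer from Node's built-in http module.

```javascript
// Hypothetical request handler for a page aimed at German speakers in Germany.
function handleRequest(request, response) {
  // Content-Language describes the intended audience
  response.setHeader("Content-Language", "de-DE");
  response.setHeader("Content-Type", "text/html; charset=utf-8");

  // The lang attribute states the language the page is actually written in
  response.end(
    '<!DOCTYPE html><html lang="de"><head><title>Hallo</title></head>' +
    "<body><p>Hallo Welt</p></body></html>"
  );
}
```

Here the header and the lang attribute happen to agree, but as described above they answer different questions and are allowed to differ.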

Client-Side HTTP Headers:

Accept-Language Header:

  • The Accept-Language HTTP header is sent from a user’s browser when they access your site. Your site can read this header with server-side code to determine the user’s preferred language.
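
Reading the header takes a little parsing, because a browser can send several tags with quality weights, such as fr-CA,fr;q=0.8,en;q=0.5. The sketch below is one simple way to do it, not a complete implementation of HTTP language negotiation; parseAcceptLanguage and pickLanguage are hypothetical helpers, and the list of supported languages is an assumption.

```javascript
// Parse "fr-CA,fr;q=0.8,en;q=0.5" into tags ordered by preference.
function parseAcceptLanguage(header) {
  return (header || "")
    .split(",")
    .map((part, index) => {
      const [tag, ...params] = part.trim().split(";");
      const qParam = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qParam ? Number(qParam.slice(2)) : 1; // a missing q means 1
      return { tag: tag.trim(), q: Number.isNaN(q) ? 0 : q, index };
    })
    .filter((entry) => entry.tag && entry.tag !== "*" && entry.q > 0)
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.tag);
}

// Pick the first supported language: exact tag first, then its primary language.
function pickLanguage(header, supported, fallback) {
  const lowerSupported = supported.map((tag) => tag.toLowerCase());
  for (const tag of parseAcceptLanguage(header)) {
    const lower = tag.toLowerCase();
    const exact = lowerSupported.indexOf(lower);
    if (exact !== -1) return supported[exact];
    const primary = lowerSupported.indexOf(lower.split("-")[0]);
    if (primary !== -1) return supported[primary];
  }
  return fallback;
}

console.log(pickLanguage("fr-CA,fr;q=0.8,en;q=0.5", ["en", "fr"], "en")); // "fr"
```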

URL Structure for Multilingual Content:

  • Structure URLs to include language subdirectories based on RFC 5646 language tags (e.g., example.com/en/page, example.com/fr/page) for clear navigation and SEO.
example.com/en/page
example.com/fr/page
example.com/fr-ca/page

Make it even clearer for SEO and your users by translating the URL along with having the language tag.

example.com/en/offers
example.com/fr/bonnes-affaires
example.com/fr-ca/offres
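
With URLs structured like this, code on the site can work out which language a page belongs to from its path. A minimal sketch, assuming the language tag is always the first path segment; languageFromPath is a hypothetical helper and the supported list is an assumption:

```javascript
// Hypothetical helper: reads the first path segment and returns the matching
// supported tag, or null when the path carries no known language.
function languageFromPath(pathname, supportedTags) {
  const firstSegment = pathname.split("/").filter(Boolean)[0] || "";
  const match = supportedTags.find(
    (tag) => tag.toLowerCase() === firstSegment.toLowerCase()
  );
  return match || null;
}

const supported = ["en", "fr", "fr-CA"]; // assumed site languages

console.log(languageFromPath("/fr-ca/offres", supported)); // "fr-CA"
console.log(languageFromPath("/en/offers", supported));    // "en"
console.log(languageFromPath("/pricing", supported));      // null
```

The returned tag uses the casing from the supported list, so it can be written straight into a lang attribute.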

JavaScript

if(/^en\b/.test(navigator.language)){
 // Run code meant for any variant of English
}

You could use this to personalize the user’s experience, for example by localizing date and time formats.

const usersPreferredLanguage = window.navigator.language;
const date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0));
// Format in UTC so the results below do not depend on the visitor's time zone
const options = { timeZone: "UTC" };

if(/^en-US\b/.test(usersPreferredLanguage)){

  // US English uses month-day-year order
  console.log(new Intl.DateTimeFormat("en-US", options).format(date));
  // "12/20/2012"

} else if(/^en-GB\b/.test(usersPreferredLanguage)){

  // British English uses day-month-year order
  console.log(new Intl.DateTimeFormat("en-GB", options).format(date));
  // "20/12/2012"

} else if(/^ko-KR\b/.test(usersPreferredLanguage)){
  // Korean uses year-month-day order
  console.log(new Intl.DateTimeFormat("ko-KR", options).format(date));
  // "2012. 12. 20."
}
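
The branches above can also be collapsed by handing the detected tag straight to Intl.DateTimeFormat, which accepts a list of locales and uses the first one it supports. A sketch, assuming en-US as the fallback and UTC as the time zone; formatDateFor is a hypothetical helper:

```javascript
function formatDateFor(languageTag, date) {
  // Locales are tried in order; "en-US" is an assumed fallback
  const formatter = new Intl.DateTimeFormat([languageTag, "en-US"], {
    timeZone: "UTC",
  });
  return formatter.format(date);
}

const date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0));
console.log(formatDateFor("en-GB", date)); // "20/12/2012"
console.log(formatDateFor("en-US", date)); // "12/20/2012"
```

In a browser you would pass navigator.language as the first argument. A malformed tag makes the constructor throw a RangeError, so input from anywhere else is worth checking first.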

Dates and especially times can become very complex, especially when dealing with JavaScript. This older Oracle document goes into some of the differences, but I might write an article myself on the matter.

Remember that you should never make assumptions that users cannot override. So, while you may initially detect the language via the browser, you should store it in a variable that the user can change.
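
One way to sketch that idea is a small object that keeps the detected language as a default and lets an explicit user choice take precedence. createLanguagePreference is a hypothetical helper, and persisting the choice (for example in a cookie) is left out:

```javascript
// Hypothetical helper: the detected language is only a starting point,
// and whatever the user picks wins over it.
function createLanguagePreference(detectedLanguage) {
  let chosenLanguage = null; // set only when the user picks a language

  return {
    get current() {
      return chosenLanguage || detectedLanguage;
    },
    choose(languageTag) {
      chosenLanguage = languageTag;
    },
    reset() {
      chosenLanguage = null;
    },
  };
}

// In a browser the detected value would come from navigator.language
const preference = createLanguagePreference("en-US");
console.log(preference.current); // "en-US"
preference.choose("fr-CA");      // e.g. from a language switcher
console.log(preference.current); // "fr-CA"
```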

Proper Language tagging is a step towards a more inclusive and interconnected digital world

I hope this article has proven informative and aids you in grasping the importance of IETF RFC 5646 language tagging. May it guide you in creating a more engaging and accessible online environment for all.
