Alan Dean

CTO, Developer, Agile Practitioner


Saturday, January 17, 2009

HTTP i18N Patterns

I have been dealing with the problems of internationalization (i18N) in order to make applications world-ready for over a decade now. In fact, many years ago, my first community presentation was “Internationalisation and Globalisation in Visual Basic 6”.

There has recently been a discussion on the [rest-discuss] list titled “Conneg for language selection: Just Say No”. I wrote a lengthy response on the thread and thought that there was value in converting it into a blog entry.

Don’t confuse translation with localization

Translation is the conversion of source text in one language to target text in another language.

Localization (l10N), whilst typically involving translation, is the process of cultural adaptation.

Imagine that you are a book publisher responsible for publishing a work of Leo Tolstoy. In Russia, you would not translate the text (the original being written in Russian). But you would still localize by, for example, putting a price on the cover in roubles. You might then order a translation to French and the imprint would be localized by having a price in euros. Depending on your marketing, the localization may also involve changing the jacket image and will certainly require the book title to be translated as well. However, it seems that there is a section of the market for Tolstoy in France that reads in Russian. In which case, you may decide to have an additional imprint that has no translation but still requires localization of both jacket image and prices.

Resource translation creates a new resource, translation of decoration creates alternate representations

In REST, it is important to distinguish between different resources and different representations. When using HTTP, you can content negotiate (conneg) between alternate representations, but you should not do so between different resources. Instead, you should provide the user with links to follow to view the different resources.

Content negotiation by language or Request-IP address?

Localization conneg can be either language-driven from the Accept-Language HTTP header and/or country-driven from the Request-IP address. Both are valid localizations and both can be used in concert. For example, the prices of goods and services are country-specific localizations rather than language-specific ones (both in terms of the currency used and the tax regime to be applied).
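To make the language-driven half concrete, here is a minimal sketch of choosing a language from an Accept-Language header. The function names and the fallback to "en" are assumptions for illustration only, and the matching is deliberately simplified (an exact tag first, then its primary subtag). The country-driven half would additionally need an IP-to-country lookup, which is not shown.

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs sorted by descending q-value."""
    ranges = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # a range without a q parameter defaults to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges.append((tag.strip().lower(), q))
    # sorted() is stable, so ties keep the order the client sent them in
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


def choose_language(header, available, default="en"):
    """Pick the most preferred language tag that we can actually serve."""
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue  # q=0 means the client explicitly rejects this range
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # e.g. "fr-ca" falls back to "fr"
        if primary in available:
            return primary
    return default


# A French Canadian browser against a site offering generic English and French:
print(choose_language("fr-CA, fr;q=0.8, en;q=0.5", {"en", "fr"}))  # fr
```

A production implementation would probably follow the language-range matching rules of the relevant RFCs rather than this shortcut.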

Health warning: URI structure is orthogonal to REST

Before I go into more detail with my examples, I want to make it clear that what URIs look like is, prima facie, unimportant to REST. This is sometimes referred to as URI opacity or opaqueness. Nonetheless, there are good patterns we can employ for URI structure and human-readable URIs are considered to be ‘a good thing’.

Example: Tolstoy on the Web

Carrying on with the Tolstoy example, here is how I would personally approach the structure and content negotiation of his book “Anna Karenina”:

1. Expose the ‘concept of’ a book called “Anna Karenina” by Leo Tolstoy

http://www.example.com/book/leo-tolstoy/anna-karenina

In this example, I have used the English translations of both names but any language would be acceptable. It is worth noting, however, that regardless of how much we may dislike the fact, American English is the de facto default language of the Web. I would argue that you need a good reason to depart from American English for ‘canonical’ URIs. This is not to say that you cannot have localized URIs as well. In the example below, you can see “book” being translated into “livre” on the URI.

Content negotiate based upon Accept (for media types) and Accept-Language to a list of the various translations of Anna Karenina using the following URI structure:

{canonical-baseURI}.{language-token}.{media-type-token}

A table of example content negotiation URIs from the canonical URI.
URI | Language | Media Type
http://www.example.com/book/leo-tolstoy/anna-karenina | No representation | n/a
http://www.example.com/book/leo-tolstoy/anna-karenina.en.html | Generic English | HTML
http://www.example.com/livre/leo-tolstoy/anna-karenina.fr.html | Generic French | HTML
http://www.example.com/livre/leo-tolstoy/anna-karenina.fr-fr.html | French, as spoken in France | HTML
http://www.example.com/livre/leo-tolstoy/anna-karenina.fr-ca.html | French, as spoken in Canada | HTML
http://www.example.com/book/leo-tolstoy/anna-karenina.en.atom | Generic English | Atom
http://www.example.com/livre/leo-tolstoy/anna-karenina.fr.atom | Generic French | Atom
http://www.example.com/livre/leo-tolstoy/anna-karenina.fr-fr.atom | French, as spoken in France | Atom
http://www.example.com/livre/leo-tolstoy/anna-karenina.fr-ca.atom | French, as spoken in Canada | Atom

Each of these pages should contain a list of links to the available translations and the countries in which they are available for purchase.
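The URI structure above can be sketched as a small function that builds a negotiated URI from the canonical parts. The lookup tables and the function name are hypothetical; they only mirror the tokens used in the example URIs ("book" and "livre", "html" and "atom") and are not a definitive implementation.

```python
# Hypothetical lookup tables; the tokens mirror the example URIs above.
PATH_SEGMENT = {"en": "book", "fr": "livre"}  # localized first path segment
MEDIA_TYPE_TOKEN = {"text/html": "html", "application/atom+xml": "atom"}


def negotiated_uri(host, slug, language, media_type):
    """Build {canonical-baseURI}.{language-token}.{media-type-token}."""
    language = language.lower()
    primary = language.split("-")[0]  # "fr-ca" uses the French path segment
    return "http://{}/{}/{}.{}.{}".format(
        host,
        PATH_SEGMENT[primary],
        slug,
        language,
        MEDIA_TYPE_TOKEN[media_type],
    )


print(negotiated_uri("www.example.com", "leo-tolstoy/anna-karenina",
                     "fr-CA", "application/atom+xml"))
# http://www.example.com/livre/leo-tolstoy/anna-karenina.fr-ca.atom
```

A server could send the client to the resulting URI with a redirect, or serve the representation directly and advertise the URI in a Content-Location header. Which of the two to use is a design choice that this sketch leaves open.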

2. Expose the country-specific links for product purchase

There are a number of different ways of hosting country-specific URIs:

  • Country Code TLDs (ccTLD).
    For example, http://www.example.co.uk or http://www.example.fr
  • Country-specific subdomains.
    For example, http://uk.example.com or http://fr.example.com
  • Country-specific paths.
    For example, http://www.example.com/uk/ or http://www.example.com/fr/
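The three hosting styles can be compared side by side with a short sketch. The function name and the style labels are made up for illustration, and the special case for "uk" only reflects the .co.uk example in the list above. Note that the sketch always appends a trailing slash.

```python
def country_base_uri(country, style, domain="example"):
    """Return the base URI for a country under one of three hosting styles."""
    country = country.lower()
    if style == "cctld":
        # The UK example above uses the second-level domain .co.uk
        suffix = "co.uk" if country == "uk" else country
        return "http://www.{}.{}/".format(domain, suffix)
    if style == "subdomain":
        return "http://{}.{}.com/".format(country, domain)
    if style == "path":
        return "http://www.{}.com/{}/".format(domain, country)
    raise ValueError("unknown hosting style: " + style)


for style in ("cctld", "subdomain", "path"):
    print(country_base_uri("uk", style))
# http://www.example.co.uk/
# http://uk.example.com/
# http://www.example.com/uk/
```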

For my example purposes, I will simply use ccTLDs. If we look at the way in which we have set up our content negotiation for the canonical example above, we see that we can use the same pattern. In many cases, language negotiation isn’t provided for ccTLD domains unless there are multiple official languages. So for http://www.example.fr there might only be localization to French but for http://www.example.ca there would likely be localization to both English and French:

A table of example content negotiation URIs from the ccTLD URI.
URI | Language | Media Type
http://www.example.ca/book/leo-tolstoy/anna-karenina | No representation | n/a
http://www.example.ca/book/leo-tolstoy/anna-karenina.en-ca.html | English, as spoken in Canada | HTML
http://www.example.ca/livre/leo-tolstoy/anna-karenina.fr-ca.html | French, as spoken in Canada | HTML
http://www.example.ca/book/leo-tolstoy/anna-karenina.en-ca.atom | English, as spoken in Canada | Atom
http://www.example.ca/livre/leo-tolstoy/anna-karenina.fr-ca.atom | French, as spoken in Canada | Atom

In some cases, such as Google, the canonical URI step is skipped by sniffing the Request-IP address and then redirecting straight to the ccTLD. Google has the good manners to also provide a link back to the .com domain when it does this.
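A rough sketch of that shortcut follows, assuming the country code has already been resolved from the Request-IP address by some geolocation lookup (not shown). The host map, the function name and the choice of a 302 status are all assumptions for illustration; this is not a description of how Google implements it.

```python
# Hypothetical map from a resolved country code to the site's ccTLD host.
CCTLD_HOSTS = {"CA": "www.example.ca", "FR": "www.example.fr"}


def redirect_to_cctld(country_code, path):
    """Return (status, headers) for a redirect, or None to stay on .com."""
    host = CCTLD_HOSTS.get(country_code.upper())
    if host is None:
        return None  # no country site: serve the canonical .com resource
    return 302, {"Location": "http://{}{}".format(host, path)}


print(redirect_to_cctld("ca", "/book/leo-tolstoy/anna-karenina"))
# (302, {'Location': 'http://www.example.ca/book/leo-tolstoy/anna-karenina'})
```

The page served from the ccTLD would then carry the link back to the .com domain, so that the user can override the server's guess.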

7 comments:

Subbu Allamaraju said...

Nice writeup.

dret said...

nice post about the difficulties how to do proper I18N on the web. it's not easy, and often the specific task is important to decide what's "best". when it comes to patterns of how to identify variants of a resource, http://dret.net/lectures/web-fall08/i18n+l10n#(20) is my attempt to list all of them, i came up with at least 9 variations and there are probably more.

Anonymous said...

Very insightful. Thanks!

Anonymous said...

Did you find a way for Google to index every flavor of the content? SEO people often argue that you can't do conneg, because Google will not index French, English, etc., but only one of them.

Alan Dean said...

As far as any search engine is concerned, each leaf negotiated URI is a separate resource to be indexed. The problem is actually in the opposite direction, whereby you need to use the canonical link tag (rel="canonical") if you want to concentrate link juice. You should also provide as much metadata as possible (via headers such as Content-Language and <meta> tags in HTML) in order that the search engine bot can understand how each resource differs from the others.

When SEO folk talk about not being able to do conneg, they mean when you negotiate the response on the same URI, rather than redirecting to language / content-type specific URIs. The same problem applies to intermediaries such as proxies.

Anonymous said...

yes. I didn't explain very well.

Usability - having one canonical URI, with conneg behind it, is good. When Ms Yoko shares the canonical URI (which she is reading in Japanese) with Mr Tommi (Finland) in a mail and Tommi opens it, he sees Finnish.

SEO - having different URIs for each content, so that when people search for terms, they will find the right content.

There is Transparent Content Negotiation in HTTP, but it doesn't seem to have traction.

Olivier Thereaux had written an article about The missing feature.

Alan Dean said...

I see your point Karl, but most of the issues can be resolved by a good uri-space and usability design.

Surfacing a UI to the user in order that they can switch to a different translation, for example, is considered good practice. BBC News does a decent job of this. That way, if Ms Yoko shares a URI with Mr Tommi then he can select a Finnish translation if one is available. This is preferable to the server delivering different representations to different users in what would appear to be an arbitrary manner, beyond user discretion.

If you want the user to be able to share "conceptual" URIs then you can do so in the same manner that permalinks are customarily surfaced. When a recipient uses such a URI, then they will be negotiated according to their request headers in the normal manner.

The new canonical link tag (and the use of canonical URIs in sitemaps.xml) is really for bots to be able to filter out noise (with the possible side effect of concentrating link-juice).