Imagine flipping through the pages of a book in a language you cannot read. You can look at the pictures, but understand little of the text.

Millions of people have a similar experience connecting to and trying to navigate the Web.

The leading software, apps, operating systems and voice technologies are often developed only for English and other languages with commercial advantages.

People who are passionate about seeing their languages thrive online can join communities to translate, localize, write, type and tweet with purpose, but it’s difficult for an underrepresented language to gain social relevance when majority language versions exist.

For less widely spoken or written languages, this imbalance leads to what some researchers call “digital extinction,” even in affluent countries where most people are online.

Fewer than 400,000 people speak Icelandic, and so Icelanders casually switch to English to give voice commands to their devices. Some see the lack of technology in their native language as a contributing factor to the dominance of English and the decline of Icelandic.

Multilingual technologies and translations aren’t enough. What is needed for a healthy Internet is locally relevant content that genuinely represents the languages and experiences of people who connect. Where locally relevant content is missing, anywhere in the world, it can be a barrier to Internet uptake and a frustrating experience for news and information seekers.

At this point, you are probably wondering what the scale and reach of your own language is on the Internet… but it is surprisingly difficult to assess. First, how would you decide how much content is appropriate for any given population? And how would you track whether language diversity is improving worldwide?

One common method is to compare the estimated share of Internet users who speak a given language with the estimated percentage of websites in that language. This comparison leads to sensational figures. For example, more than 50% of the Web is in English, while only 25% of Internet users speak English. We said this ourselves in the first version of the Internet Health Report. But these numbers are likely to be flawed.
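As a rough illustration of the comparison described above, the short Python sketch below divides a language's share of websites by its share of Internet users. The function name `representation_ratio` is our own label for this example, and the inputs are simply the figures quoted in the text, so the result inherits whatever flaws those figures have.

```python
def representation_ratio(content_share: float, user_share: float) -> float:
    """Return a language's share of websites divided by its share of users.

    A value above 1 suggests the language has more content than its share
    of users would predict; a value below 1 suggests it has less.
    """
    if user_share <= 0:
        raise ValueError("user_share must be positive")
    return content_share / user_share


# Figures quoted in the text for English: more than 50% of websites,
# about 25% of Internet users. Treated here as 0.50 and 0.25.
english_ratio = representation_ratio(content_share=0.50, user_share=0.25)
print(f"English content-to-user ratio: {english_ratio:.1f}")
```

A ratio of about 2 is what makes the headline figure sound so dramatic, and it is only as reliable as the two estimates that go into it.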

One person who has questioned the accuracy of such popular language metrics for the Web is Daniel Pimienta of FUNREDES, a mostly inactive Internet research-action group in the Dominican Republic. In 2009, Pimienta co-authored a paper for UNESCO describing how biases that overstate the dominance of English were normalized through repetition.

For example, many researchers of the Web (including W3Techs, the source of the content percentages cited above) rely on Alexa Internet rankings of the world’s most popular websites. These sites are a tiny fraction of the billions of Web pages that exist. The advantage is that the list is guaranteed to be free of spam, parked domains and other pages of the Web that are irrelevant to humans.

But exactly how Alexa (an online marketing tool owned by Amazon) gathers data is secret. Some websites install an Alexa code to help track visits to their site. In addition, Alexa says it monitors “tens of millions” of Web users via “more than 25,000 browser extensions” but it divulges no information on how many of these users are based in China, for instance.

There are many ongoing efforts to more accurately measure the presence of different languages online.

Pimienta suggests an alternative method of language measurement that draws on dozens of different available indicators to calculate the relative “power” of a language, including Wikipedia pages, software downloads and social media users. His estimate for English on the Web is closer to 30%.
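To give a feel for what combining several indicators could look like, here is a minimal Python sketch that takes an unweighted average of a language's share across a few indicators. This is not Pimienta's actual formula, which the text does not describe. The function name `composite_share`, the equal weighting and every number below are hypothetical placeholders chosen only to show the arithmetic.

```python
from statistics import mean
from typing import Mapping


def composite_share(indicator_shares: Mapping[str, float]) -> float:
    """Average a language's share (0 to 1) across several indicators.

    Equal weighting is an assumption made for this sketch only.
    """
    if not indicator_shares:
        raise ValueError("at least one indicator is required")
    return mean(indicator_shares.values())


# Hypothetical placeholder shares for an unnamed language; not measurements.
example_language = {
    "wikipedia_pages": 0.10,
    "software_downloads": 0.20,
    "social_media_users": 0.30,
}

print(f"Composite share: {composite_share(example_language):.0%}")
```

The point of such a composite is that no single source, such as a ranking of popular websites, dominates the estimate.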

The Inclusive Internet Index assesses the degree of local language content in 86 different countries by surveying citizens about whether there are domestic news publications, e-government services, and health, finance and entertainment websites.

A process within UNESCO to develop new Internet Universality Indicators is also likely to include some measure of whether local language content is relevant (see draft).

This may be the age of ‘big data,’ but collecting accurate and relevant information about language is still challenging, even in a hyper-connected world.

It’s important to acknowledge the flaws of current approaches, but it’s even more vital that we not give up. We need to know how the world’s languages are faring online, to help better assess whether the Internet is fulfilling its promise.

For diverse, accessible and healthy communities – online and offline – it’s critical that we continue the work to understand and support a multilingual Web.

