Bringing lost languages back to life with AI

An algorithm may be able to decipher the ancient languages that have left linguists stumped.

October 25, 2020

Humans started writing more than 5,000 years ago, and the ancient texts that have survived can give us a peek into the lives of our long-dead ancestors — if we can decipher them.

Languages evolve, so by the time an ancient piece of writing makes its way into the hands of modern linguists, there might not be a single person on Earth who knows how to read it.

However, because languages evolve, linguists can look for clues connecting lost languages to living ones, and then work backwards to decipher the writings.

Still, there are at least a dozen written languages, so far discovered, that linguists simply haven’t been able to crack.

Often, this is because the nearest living language is still unknown, or because the writing is not broken into words or lacks punctuation (called “unsegmented” or “undersegmented”), which makes it harder to decode.

Now, researchers at MIT have developed an algorithm to help linguists decipher these lost languages — potentially yielding new insights into humanity’s past.

Recovering Lost Languages

The MIT team first trained their algorithm to understand some of the basic principles of language evolution — a “p” sound is more likely to evolve into a similar-sounding “b” than into a “k”, for example.
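To make that principle concrete, here is a minimal sketch of an edit distance in which swapping one sound for another costs less when the two sounds are phonetically similar. It is not the MIT team's model. The feature table, the cost values, and the function names are all assumptions chosen for illustration.

```python
# Hypothetical, hand-picked feature table: (place of articulation, voiced?).
# A real system would cover a full phonetic inventory.
FEATURES = {
    "p": ("labial", False),
    "b": ("labial", True),
    "t": ("alveolar", False),
    "d": ("alveolar", True),
    "k": ("velar", False),
    "g": ("velar", True),
}


def substitution_cost(a, b):
    """Cost of replacing sound a with sound b (illustrative weights)."""
    if a == b:
        return 0.0
    if a not in FEATURES or b not in FEATURES:
        return 1.0  # unknown sounds: fall back to a full-cost substitution
    place_a, voiced_a = FEATURES[a]
    place_b, voiced_b = FEATURES[b]
    cost = 0.0
    if place_a != place_b:
        cost += 0.6  # assumed penalty for a change in place of articulation
    if voiced_a != voiced_b:
        cost += 0.3  # assumed penalty for a change in voicing
    return cost


def weighted_edit_distance(source, target, indel_cost=1.0):
    """Levenshtein-style distance using the phonetic substitution costs."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # delete a sound
                dist[i][j - 1] + indel_cost,  # insert a sound
                dist[i - 1][j - 1]
                + substitution_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]
```

Under these assumed weights, “p” and “b” differ only in voicing, so `weighted_edit_distance("pat", "bat")` comes out smaller than `weighted_edit_distance("pat", "kat")`, which mirrors the idea that a “p” is more likely to drift toward a “b” than toward a “k”.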

When they then evaluated the algorithm using two already-deciphered ancient languages — Ugaritic and Gothic, an unsegmented language — they found that it was able to correctly identify the languages linguists believe are most closely related to them.


Next, the MIT researchers tested the algorithm using an undeciphered, undersegmented language called Iberian.

Linguists haven’t been able to determine Iberian’s closest known language. Some believe it’s Basque, but most disagree — they suspect that it doesn’t have a still-living relative language.

The AI supported the latter group, determining that while Iberian is more like Basque than several other candidates, it is not similar enough to be considered related.
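The logic of “closest candidate, but still not close enough” can be sketched in a few lines. This is a toy illustration, not the researchers' method. It uses Python's general-purpose `difflib.SequenceMatcher` as a stand-in for a real linguistic similarity measure, and the function names, the 0.8 threshold, and the placeholder word lists in the usage note are all assumptions.

```python
from difflib import SequenceMatcher


def closeness(lost_words, candidate_words):
    """Average, over the lost-language words, of the best string
    similarity to any word in the candidate language."""
    total = 0.0
    for word in lost_words:
        total += max(
            SequenceMatcher(None, word, other).ratio()
            for other in candidate_words
        )
    return total / len(lost_words)


def rank_candidates(lost_words, candidates, threshold=0.8):
    """Return (best candidate, its score, whether it clears the threshold).

    The threshold is an arbitrary illustrative cutoff, not a value
    taken from the research.
    """
    scores = {
        name: closeness(lost_words, words)
        for name, words in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best], scores[best] >= threshold
```

Calling `rank_candidates(["abcd", "efgh"], {"X": ["abcx", "efgy"], "Y": ["zzzz", "qqqq"]})` with these made-up strings picks “X” as the nearest candidate but still reports it as falling below the cutoff. That is the same shape of result as the Iberian finding: a best match exists, yet it is too weak to count as a relative.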

Deciphering Ancient Languages

In its current state, MIT’s algorithm could be a useful tool for linguists, helping them identify the closest living relatives of lost languages. But what if some lost languages — like Iberian — don’t have descendants?

The MIT researchers hope to help with that, too.

Their goal is to train the algorithm to determine the meaning of such ancient documents, even if it can’t outright translate them, when fed just a few thousand words of the lost language.

They might do this by teaching the AI to identify references to people or places within a text. Linguists could then investigate the document within the context of those historical markers.

“These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate,” lead researcher Regina Barzilay told MIT News, “but the key research question is whether the task is feasible without any training data in the ancient language.”
