Job Location : all cities,MA, USA
Lost languages are more than a mere academic curiosity; without them, we miss an entire body of knowledge about the people who spoke them. Unfortunately, most of them have minimal records, making decipherment difficult with machine-translation algorithms like Google Translate. Some lack well-researched related languages, and often they do not have traditional dividers like white space and punctuation. (For example, imagine trying to decipher a foreign language written like this.)
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) recently developed a system capable of automatically deciphering lost languages without prior knowledge of their relation to other languages. The system can also determine relationships between languages and has been used to support the idea that Iberian is not related to Basque.
The goal is for the system to decipher languages that have eluded linguists for decades, using only a few thousand words.
Led by MIT Professor Regina Barzilay, the system incorporates principles from historical linguistics, such as the predictable ways languages evolve. For example, languages rarely add or delete entire sounds, but certain sound substitutions are common—like a p changing to a b, while changing to a k is less likely due to pronunciation differences.
The system uses these linguistic constraints to handle the vast possibilities of language transformation and the limited data available. It embeds sounds into a multidimensional space, where pronunciation differences are reflected in the distance between vectors, capturing relevant patterns of language change. This allows the model to segment words in ancient texts and map them to related language counterparts.
This project builds on previous work where Barzilay and Luo deciphered Ugaritic and Linear B languages, with the latter taking decades for humans to decode. Unlike that work, where the related languages were known, the new system infers relationships between languages, addressing a major challenge in decipherment.
When tested on known languages, the algorithm accurately identifies language families. Applied to Iberian, considering Basque and other language families, it found that Iberian was closer to Basque than other languages but still too different to be considered related.
Future work aims to go beyond linking texts to known languages, focusing on identifying the semantic meaning of words without prior knowledge—a process called cognate-based decipherment. This involves recognizing entities like people or locations within texts, which can then be investigated historically.
As Barzilay explains, For instance, we may identify all references to people or locations in the document, which can then be further investigated in light of historical evidence. These entity recognition methods are highly accurate in modern text processing, but the key question is whether they are feasible without training data in the ancient language.
The project received support from the Intelligence Advanced Research Projects Activity (IARPA).
For inquiries, product placements, sponsorships, and collaborations, contact us at [email protected]. We look forward to hearing from you!#J-18808-Ljbffr