PanLex: Using PostgreSQL to implement a massive word-translation graph

Date: 2017-09-08
Time: 17:30 - 18:20
Room: Cyril Magnin
Level: Intermediate

PanLex is a non-profit project whose mission is to support small-language communities around the world, overcoming language barriers to human rights, information, and opportunities. Enabling this mission is a translation database connecting every word in every language. The PanLex database, developed over the past 10 years, contains over 2,500 dictionaries, 5,700 languages, 25 million words, and 1.3 billion direct pairwise word translations. Direct translations are those explicitly attested in some dictionary (for example, Wiktionary). However, the true power of the database comes from indirect translations, that is, translations inferred by linking two direct translations. For example, to translate the Irish Gaelic word "madra" (dog) into Zulu, the database query generates the correct translation "inja" via hundreds of intermediate languages such as English, Estonian, Welsh, Indonesian, Greek, Turkish, Bengali, Hebrew, Korean, Mandarin, Basque, and Armenian.
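To make the idea of an indirect translation concrete, here is a minimal sketch of linking two direct translations with a self-join. It is not the PanLex schema: the `direct_translation` table, its columns, and the `indirect_translations` helper are hypothetical, the four rows are toy data, and SQLite (through Python's standard `sqlite3` module) stands in for PostgreSQL so the example runs anywhere. The SQL itself uses only a plain join and should carry over to PostgreSQL with its own placeholder syntax.

```python
import sqlite3

# Hypothetical, heavily simplified schema: one row per direct (attested) translation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE direct_translation (
    src_lang TEXT, src_word TEXT,
    dst_lang TEXT, dst_word TEXT
);
-- Toy rows for illustration only (ISO 639-3 language codes).
INSERT INTO direct_translation VALUES
    ('gle', 'madra', 'eng', 'dog'),
    ('gle', 'madra', 'cym', 'ci'),
    ('eng', 'dog',   'zul', 'inja'),
    ('cym', 'ci',    'zul', 'inja');
""")

def indirect_translations(conn, src_lang, src_word, dst_lang):
    """Return (target word, intermediate language, intermediate word) rows."""
    return conn.execute("""
        SELECT b.dst_word, a.dst_lang, a.dst_word
        FROM direct_translation AS a
        JOIN direct_translation AS b
          ON b.src_lang = a.dst_lang AND b.src_word = a.dst_word
        WHERE a.src_lang = ? AND a.src_word = ? AND b.dst_lang = ?
        ORDER BY a.dst_lang
    """, (src_lang, src_word, dst_lang)).fetchall()

rows = indirect_translations(conn, "gle", "madra", "zul")
for target, via_lang, via_word in rows:
    print(f"{target} (via {via_lang}: {via_word})")
```

Each result row is one two-hop path: the first hop `a` leaves the source word, and the second hop `b` starts from wherever `a` landed and ends in the target language.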

This presentation will describe how PanLex uses PostgreSQL to implement its massive word-translation graph. I will summarize the PanLex database schema; show how graph searches for direct and indirect translations are implemented in SQL; describe algorithms for determining translation quality, which matters most for indirect translations, and how those algorithms are implemented; and identify performance challenges and some ways to address them. Database topics will include joins, aggregates, stored functions, and denormalization.
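As an illustration of how a join and an aggregate can combine to rank indirect translations, the sketch below scores each candidate by the number of distinct intermediate languages that lead to it. This scoring rule is an assumption chosen for simplicity, not the quality algorithm the talk describes. The `direct_translation` table, the `ranked_candidates` helper, and all language and word identifiers are synthetic, and SQLite (via Python's `sqlite3`) again stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and synthetic identifiers, for illustration only.
CREATE TABLE direct_translation (
    src_lang TEXT, src_word TEXT,
    dst_lang TEXT, dst_word TEXT
);
INSERT INTO direct_translation VALUES
    ('src', 's',  'l1',  'm1'),
    ('src', 's',  'l2',  'm2'),
    ('src', 's',  'l3',  'm3'),
    ('l1',  'm1', 'dst', 't_good'),
    ('l2',  'm2', 'dst', 't_good'),
    ('l3',  'm3', 'dst', 't_good'),
    ('l3',  'm3', 'dst', 't_noisy');
""")

def ranked_candidates(conn, src_lang, src_word, dst_lang):
    """Rank target words by how many distinct intermediate languages support them.

    Assumed heuristic: more independent intermediate languages means a more
    trustworthy indirect translation.
    """
    return conn.execute("""
        SELECT b.dst_word AS candidate,
               COUNT(DISTINCT a.dst_lang) AS score
        FROM direct_translation AS a
        JOIN direct_translation AS b
          ON b.src_lang = a.dst_lang AND b.src_word = a.dst_word
        WHERE a.src_lang = ? AND a.src_word = ? AND b.dst_lang = ?
        GROUP BY b.dst_word
        ORDER BY score DESC, candidate
    """, (src_lang, src_word, dst_lang)).fetchall()

ranking = ranked_candidates(conn, "src", "s", "dst")
for candidate, score in ranking:
    print(candidate, score)
```

A candidate reached through only one intermediate language ranks below one reached through several. At real scale a query of this shape joins a very large table to itself, which is the kind of cost that indexing or denormalization is meant to reduce.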


David Kamholz