Many NLP models are trained in a single high-resource language and cannot be used directly to make predictions for other languages at inference time. Most of the world's languages are under-resourced and rely on Machine Translation (MT) into English to make use of language models. However, maintaining an MT system for every direction is costly and not the best solution for every NLP task. We propose meta-learning to address this issue. Our algorithm, TreeMAML, extends the meta-learning model MAML [1] by exploiting hierarchical relationships between languages.

MAML adapts the model to each task with a few gradient steps. In our method, TreeMAML, this adaptation follows the hierarchical tree structure: at each step down the tree, gradients are pooled across language clusters (Algorithm 1).
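The pooling step above can be sketched as follows. This is an illustrative simplification with hypothetical names (`pooled_step`, a flat partition of languages into clusters standing in for one tree level), not the full Algorithm 1:

```python
import numpy as np

def pooled_step(params, grads_per_lang, clusters, lr=0.1):
    """One inner step at one tree depth: each language's gradient is
    replaced by the mean gradient of its cluster, then a step is taken.

    params: dict lang -> parameter vector (np.ndarray)
    grads_per_lang: dict lang -> gradient vector
    clusters: list of lists of language codes (one partition level)
    """
    new_params = {}
    for cluster in clusters:
        # Pool gradients across the languages in this cluster.
        pooled = np.mean([grads_per_lang[l] for l in cluster], axis=0)
        for l in cluster:
            new_params[l] = params[l] - lr * pooled
    return new_params

# Example: at a shallow depth, two Romance languages share one cluster.
params = {l: np.zeros(2) for l in ["es", "fr", "de"]}
grads = {"es": np.array([1.0, 0.0]),
         "fr": np.array([0.0, 1.0]),
         "de": np.array([2.0, 2.0])}
params = pooled_step(params, grads, [["es", "fr"], ["de"]])
# es and fr receive the same pooled update; de is updated alone.
```

Descending the tree would repeat this step with progressively finer partitions, so closely related languages share coarse updates before specializing.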

Algorithm 2 is a non-binary modification of the OTD clustering algorithm [2] that generates the language tree without prior knowledge of its structure, allowing us to exploit implicit relationships between the languages.
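To make the clustering idea concrete, here is a deliberately simplified stand-in (not the actual OTD-based Algorithm 2): languages are greedily grouped when their gradient directions are similar under cosine similarity, with a hypothetical threshold parameter:

```python
import numpy as np

def cluster_gradients(grads, threshold=0.9):
    """Greedy one-pass clustering of per-language gradients.

    grads: dict lang -> gradient vector (np.ndarray)
    Returns a list of clusters (lists of language codes).
    """
    clusters = []  # each entry: {"langs": [...], "centroid": vec}
    for lang, g in grads.items():
        placed = False
        for c in clusters:
            # Cosine similarity between this gradient and the cluster centroid.
            cos = g @ c["centroid"] / (
                np.linalg.norm(g) * np.linalg.norm(c["centroid"]) + 1e-12)
            if cos >= threshold:
                # Join the cluster and update its running-mean centroid.
                n = len(c["langs"])
                c["centroid"] = (c["centroid"] * n + g) / (n + 1)
                c["langs"].append(lang)
                placed = True
                break
        if not placed:
            clusters.append({"langs": [lang], "centroid": g.copy()})
    return [c["langs"] for c in clusters]

clusters = cluster_gradients({
    "es": np.array([1.0, 0.1]),
    "fr": np.array([1.0, 0.0]),
    "de": np.array([-1.0, 0.2]),
})
# es and fr point in nearly the same direction; de gets its own cluster.
```

The actual OTD modification builds a full (non-binary) hierarchy online rather than a single flat partition; this sketch only shows the grouping-by-gradient-similarity intuition.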

In our experiments we adapt a high-resource language model, Multi-BERT [3], to a few-shot NLI task as follows:
We use the XNLI dataset [4], a crowd-sourced collection of premise/hypothesis pairs extending the MultiNLI corpus, covering 10 different genres in 15 languages and annotated with textual entailment labels. Each combination of a language and a genre is considered a task.
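A minimal sketch of how such tasks could be sampled, assuming a hypothetical `pool` mapping each (language, genre) pair to its labelled examples (names and structure are illustrative, not the actual experimental code):

```python
import random

def sample_episode(pool, language, genre, k=3, seed=None):
    """Draw a k-shot episode from one (language, genre) task.

    pool: dict (language, genre) -> list of labelled examples,
          where each example is a dict with premise/hypothesis/label.
    """
    rng = random.Random(seed)
    examples = pool[(language, genre)]
    return rng.sample(examples, k)

# Toy pool with placeholder examples; labels 0/1/2 stand for
# entailment / neutral / contradiction.
pool = {("es", "fiction"): [
    {"premise": f"p{i}", "hypothesis": f"h{i}", "label": i % 3}
    for i in range(20)
]}
shots = sample_episode(pool, "es", "fiction", k=3, seed=0)
```

Each meta-training iteration would then draw such episodes across many (language, genre) tasks.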

We perform few-shot meta-learning using three shots for each task during meta-training.
We apply TreeMAML to fine-tune the 12 layers of Multi-BERT. We perform two experiments:
Experiment 1 – FixedTreeMAML: we assume that the language tree structure is known and corresponds to the one in Fig. 1, and apply Algorithm 1.
Experiment 2 – LearnedTreeMAML: the more general case, where the relationships among languages are not known. Algorithm 2 is used in each inner step of Algorithm 1 to cluster the gradients and learn the hierarchy between languages.
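For the fixed-tree case, the known tree can be flattened into one cluster partition per depth, which the pooling step then consumes. The tree below is a plausible hand-written phylogenetic grouping for illustration only (it is not a reproduction of Fig. 1), and the two-level flattening is a hypothetical helper:

```python
# A small hand-specified language tree: families at depth 1,
# branches at depth 2, ISO language codes at the leaves.
tree = {
    "Indo-European": {
        "Romance": ["es", "fr"],
        "Germanic": ["de", "en"],
    },
    "Sino-Tibetan": {
        "Sinitic": ["zh"],
    },
}

def partitions_per_depth(tree):
    """Flatten a two-level tree into one language partition per depth."""
    def leaves(node):
        # Collect all leaf language codes under a subtree.
        if isinstance(node, list):
            return node
        return [l for child in node.values() for l in leaves(child)]

    level1 = [leaves(family) for family in tree.values()]
    level2 = [leaves(branch)
              for family in tree.values()
              for branch in family.values()]
    return [level1, level2]

parts = partitions_per_depth(tree)
# Depth 1 pools gradients within families; depth 2 within branches.
```

In the learned-tree variant, these partitions would instead come from clustering the gradients at each inner step, so the hierarchy adapts to the task rather than being fixed in advance.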

We compare our methods with the baseline (Multi-BERT) and with the current state-of-the-art results (XMAML [5]). Our method shows a ~3% improvement in accuracy; see the figure below.

[1] C. Finn, P. Abbeel, and S. Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv:1703.03400.
[2] A. Menon, A. Rajagopalan, B. Sumengen, G. Citovsky, Q. Cao, and S. Kumar. 2019. Online Hierarchical Clustering Approximations. arXiv:1909.09667
[3] J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
[4] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. arXiv:1809.05053.
[5] F. Nooralahzadeh, G. Bekoulis, J. Bjerva, and I. Augenstein. 2020. Zero-Shot Cross-Lingual Transfer with Meta Learning. arXiv:2003.02739.