Motivations
Large Language Models (LLMs) are artificial intelligence models designed to understand and generate human language. Recent developments in LLMs have showcased remarkable capabilities, with applications spanning translation, summarization, coding assistance, question answering, and more.
Many existing LLMs, however, are trained on massive internet-based datasets, which often have disproportionately large influence from western, industrialized, rich, educated, and democratic (WIRED) societies, as people from non-WIRED societies are less likely to be literate, to use the Internet, and to have their output easily accessed. Such an imbalance in the training data can lead to model outputs that display strong bias in cultural values, political beliefs, and social attitudes.
LLMs trained on predominantly WIRED-centric content risk neglecting the linguistic and cultural diversity inherent in non-WIRED populations. These biases become evident not only in examples like cultural references or local idioms, but also in more critical domains such as decision-making, social attitudes, and moral reasoning, which can vary significantly across global communities. By overlooking these variations, mainstream LLMs may inadvertently perpetuate inaccurate assumptions or exclude large segments of the global population.
Our work in SEA-LION, now part of Singapore’s National Multi-Modal Large Language Model project, aims to address these disparities by creating LLMs that cater to under-represented population groups and low resource languages in the SEA region.
SEA-LION is trained on more content produced in Southeast Asian languages such as Thai, Vietnamese, and Bahasa Indonesia, ensuring better representation in data and alignment compared to Western or Chinese models. SEA-LION models understand nuances in SEA languages and demonstrate greater awareness of cultural context specific to the region.
This lowers the barrier for governments, industries, and academia seeking LLM solutions that fit local languages and reflect local cultural norms, since WIRED-centric models can pose language barriers and misalign with local sensibilities in the SEA region.