SEA-LION v2
Last updated
Last updated
SEA-LION version 2, released in July 2024, has been continued-pretrained on top of the that is 8 billion parameters in size, with context length of 8192 tokens.
Using continued-pretraining let us leverage the powerful capabilities of the Llama3 base model and build a stronger model with far fewer resources than pre-training from scratch. Compared to the 980B tokens used in for SEA-LION v1, approximately 48B tokens across 5 SEA languages (English, Indonesia, Tamil, Thai and Vietnamese) was used for the continued pre-training of SEA-LION v2.
At a glance:
Model type: Decoder
Tokenizer: Default tokenizer used in Llama 3 8B Instruct
Training Data Size: 48B tokens of SEA data
Context Length: 8192
Available Formats:
Base (Llama-SEA-LION-v2-8B)
Instruct (Llama-SEA-LION-v2-8B-IT)
GGUF (Llama-SEA-LION-v2-8B-IT-GGUF)
Supported Languages:
English
Indonesian
Thai
Vietnamese
Tamil
License:
AWS EC2 p5d.24xlarge
8 instances
Nvidia H100 80GB GPU
64
Training Duration
2 days
Configuration
Precision
bfloat16
Optimizer
decoupled_adamw
Scheduler
weight_stable_decay
Learning Rate
1.0e-5
Global Batch Size
512
Micro Batch Size
2
For tokenisation, the model employs the default tokenizer used in Llama 3 8B Instruct.
The Llama-SEA-LION-v2-8B base model was continued pre-trained on 48B tokens of the following data:
Dolma RefinedWeb - English
7.650
1
7.650
15.90
Dolma C4 - English
1.160
1
1.16
9.21
Dolma Reddit - English
1.339
1
1.339
2.42
Dolma Semantic Scholar
0.959
1
0.959
2.79
Dolma arXiv
0.469
1
0.469
1.99
Dolma StarCoder
4.422
1
4.422
0.98
SEA-LION Pile - Indonesian
3.4
2
6.8
14.17
Wiki* - Indonesian
0.3
4
1.2
2.50
SEA-LION Pile - Tamil
5.6
1
5.6
11.67
Wiki* + News - Tamil
0.6
4
2.4
5.00
SEA-LION Pile - Thai
2.28
1
2.28
4.75
WangChanBERTa - Thai
5
1
5
10.42
Wiki* - Thai
0.18
4
0.72
1.50
SEA-LION Pile - Vietnamese
6.76
1
6.76
14.08
Wiki* - Vietnamese
0.31
4
1.24
2.58
Note:
All token counts are counted using Llama3 tokenizer
wiki* sources includes Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
We evaluated Llama-SEA-LION-v2-8B base model on general language capabilities.
The evaluation was done five-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
For more details on Llama-SEA-LION-v2-8B benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/
Llama-SEA-LION-v2-8B-IT is a multilingual instruction-following model which has been fine-tuned with around 100,000 English instruction-completion pairs alongside a smaller pool of around 50,000 instruction-completion pairs from other ASEAN languages, such as Indonesian, Thai and Vietnamese.
These instructions have been carefully curated and rewritten to ensure the model was trained on truly open, commercially permissive and high quality datasets.
The Llama-SEA-LION-v2-8B-IT model was fine-tuned using 8x A100-40GB using parameter efficient fine tuning in the form of LoRA.
Llama-SEA-LION-v2-8B-IT was trained on a wide range of instructions that were manually and stringently verified by our team. A large portion of the effort was dedicated to ensuring that each instruction-completion pair that the model sees is of high quality and any errors were corrected and rewritten by native speakers or else dropped from our mix.
In addition, special care was taken to ensure that the datasets used had commercially permissive licenses through verification with the original data source.
We evaluated Llama-SEA-LION-v2-8B-IT on both general language capabilities and instruction-following capabilities.
Note: BHASA is implemented following a strict answer format, and only spaces and punctuations are cleaned. For tasks where options are provided, the answer should only include one of the pre-defined options, nothing else. If the model continues to generate more tokens (e.g. to explain its answer), it will be considered to be a wrong response. For the F1 score metric (as used in Sentiment Analysis and Toxicity Detection), all answers that do not fall under the pre-defined labels will be treated as a separate label (to mark it as a wrong answer) and included in the calculations so that the model is penalized for not generating one of the pre-defined labels.
The evaluation was done zero-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
As these two datasets were originally in English, the linguists and native speakers in the team worked together to filter, localize and translate the datasets into the respective target languages to ensure that the examples remained reasonable, meaningful and natural.
IFEval
IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
MT-Bench
MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use gpt-4-1106-preview
as the judge model and compare against gpt-3.5-turbo-0125
as the baseline model. The metric used is the weighted win rate against the baseline model (i.e. average win rate across each category (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction)). A tie is given a score of 0.5.
The following quantized GGUF formats of our Llama-SEA-LION-v2-8B-IT model are available:
Llama-SEA-LION-v2-8B-IT-Q2_K
Llama-SEA-LION-v2-8B-IT-Q3_K_M
Llama-SEA-LION-v2-8B-IT-Q4_0
Llama-SEA-LION-v2-8B-IT-Q4_K_M
Llama-SEA-LION-v2-8B-IT-Q5_0
Llama-SEA-LION-v2-8B-IT-Q5_K_M
Llama-SEA-LION-v2-8B-IT-Q6_K
Llama-SEA-LION-v2-8B-IT-Q8_0
SEA-LION v2 models are available for download via the following channels:
Llama-SEA-LION-v2-8B
Llama-SEA-LION-v2-8B-IT
Llama-SEA-LION-v2-8B-IT-GGUF
Llama-SEA-LION-v2-8B-IT can be run using the 🤗 Transformers library
It is important for users to be aware that our models exhibits certain limitations that warrant consideration:
The model can hallucinate and occasionally generates irrelevant content, introducing fictional elements that are not grounded in the provided context. Users should also exercise caution in interpreting and validating the model's responses due to the potential inconsistencies in its reasoning.
The model has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and codes.
Llama-SEA-LION-v2-8B was trained using on the following hardware:
Tamil news is sourced with permission from
For the evaluation of general language capabilities in SEA languages, we employed the across a variety of tasks. These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).
For the evaluation of general language capabilities, we employed the across a variety of tasks. These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).
Since Llama-SEA-LION-v2-8B-IT is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets, and .
Please refer to our section for more details on how to access them.
,