SEA-LION v2

Introduction

SEA-LION version 2, released in July 2024, was continued pre-trained on top of the Llama 3 8B Instruct model, which has 8 billion parameters and a context length of 8192 tokens.

Continued pre-training let us leverage the capabilities of the Llama 3 base model and build a stronger model with far fewer resources than pre-training from scratch. Compared to the 980B tokens used for SEA-LION v1, approximately 48B tokens across 5 SEA languages (English, Indonesian, Tamil, Thai and Vietnamese) were used for the continued pre-training of SEA-LION v2.

At a glance:

Model type: Decoder
Tokenizer: Default tokenizer used in Llama 3 8B Instruct
Training Data Size: 48B tokens of SEA data
Context Length: 8192
Available Formats:
- Base (Llama-SEA-LION-v2-8B)
- Instruct (Llama-SEA-LION-v2-8B-IT)
- GGUF (Llama-SEA-LION-v2-8B-IT-GGUF)
Supported Languages:
1. English
2. Indonesian
3. Thai
4. Vietnamese
5. Tamil
License: Llama3 Community License

Llama-SEA-LION-v2-8B

Training Infrastructure

Llama-SEA-LION-v2-8B was trained using MosaicML Composer on the following hardware:

| Training Details | Llama-SEA-LION-v2-8B |
| --- | --- |
| AWS EC2 p5d.24xlarge | 8 instances |
| Nvidia H100 80GB GPU | |
| Training Duration | 2 days |

Configuration

| Hyperparameter | Llama-SEA-LION-v2-8B |
| --- | --- |
| Precision | bfloat16 |
| Optimizer | decoupled_adamw |
| Scheduler | weight_stable_decay |
| Learning Rate | 1.0e-5 |
| Global Batch Size | 512 |
| Micro Batch Size | |
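To give a feel for what these settings imply, the following minimal sketch estimates the number of tokens consumed per optimizer step and the approximate number of steps needed for the 48B-token budget. It assumes every training sequence is packed to the full 8192-token context length, which the source does not state; the function names are illustrative only.

```python
# Values taken from the configuration and the "At a glance" summary.
GLOBAL_BATCH_SIZE = 512      # sequences per optimizer step
TOTAL_TOKENS = 48e9          # approximate continued pre-training budget

# Assumption (not stated in the source): each sequence is packed to the
# full context length of 8192 tokens.
SEQUENCE_LENGTH = 8192


def tokens_per_step(global_batch_size, sequence_length):
    """Tokens processed in one optimizer step."""
    return global_batch_size * sequence_length


def estimated_steps(total_tokens, global_batch_size, sequence_length):
    """Approximate number of optimizer steps for a given token budget."""
    return round(total_tokens / tokens_per_step(global_batch_size, sequence_length))


print(tokens_per_step(GLOBAL_BATCH_SIZE, SEQUENCE_LENGTH))
print(estimated_steps(TOTAL_TOKENS, GLOBAL_BATCH_SIZE, SEQUENCE_LENGTH))
```

Under this assumption each step covers roughly 4.2M tokens, so the full run would be on the order of eleven thousand steps.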

Tokenizer

For tokenisation, the model employs the default tokenizer used in Llama 3 8B Instruct.

Training Data

The Llama-SEA-LION-v2-8B base model was continued pre-trained on 48B tokens of the following data:

| Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
| --- | --- | --- | --- | --- |
| Dolma RefinedWeb - English | 7.650 | | | 15.90 |
| Dolma C4 - English | 1.160 | | 1.16 | 9.21 |
| Dolma Reddit - English | 1.339 | | | 2.42 |
| Dolma Semantic Scholar | 0.959 | | | 2.79 |
| Dolma arXiv | 0.469 | | | 1.99 |
| Dolma StarCoder | 4.422 | | | 0.98 |
| SEA-LION Pile - Indonesian | 3.4 | | 6.8 | 14.17 |
| Wiki* - Indonesian | 0.3 | | 1.2 | 2.50 |
| SEA-LION Pile - Tamil | 5.6 | | | 11.67 |
| Wiki* + News - Tamil | 0.6 | | 2.4 | 5.00 |
| SEA-LION Pile - Thai | 2.28 | | | 4.75 |
| WangChanBERTa - Thai | | | | 10.42 |
| Wiki* - Thai | 0.18 | | 0.72 | 1.50 |
| SEA-LION Pile - Vietnamese | 6.76 | | | 14.08 |
| Wiki* - Vietnamese | 0.31 | | 1.24 | 2.58 |

Note:

- All token counts were computed using the Llama 3 tokenizer.
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage.
- Tamil news is sourced with permission from Seithi.
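The relationship between the table columns can be checked with a little arithmetic. The sketch below assumes that Total Tokens = Unique Tokens x Multiplier and that Percentage is the share of the 48B-token budget; neither formula is spelled out in the source. It uses only rows where both the unique and total token counts are listed, and the helper names are illustrative.

```python
TOTAL_BUDGET_B = 48.0  # total continued pre-training tokens, in billions

# (unique tokens, total tokens) in billions, copied from rows of the table
# above where both values are listed.
ROWS = {
    "SEA-LION Pile - Indonesian": (3.4, 6.8),
    "Wiki* - Indonesian": (0.3, 1.2),
    "Wiki* + News - Tamil": (0.6, 2.4),
    "Wiki* - Thai": (0.18, 0.72),
    "Wiki* - Vietnamese": (0.31, 1.24),
}


def multiplier(unique_b, total_b):
    """Implied repetition factor, assuming Total = Unique x Multiplier."""
    return round(total_b / unique_b, 2)


def percentage(total_b, budget_b=TOTAL_BUDGET_B):
    """Share of the overall token budget, in percent."""
    return round(100 * total_b / budget_b, 2)


for name, (unique_b, total_b) in ROWS.items():
    print(f"{name}: x{multiplier(unique_b, total_b)}, {percentage(total_b)}%")
```

For these rows the computed percentages agree with the Percentage column, which suggests the smaller, higher-quality Wiki sources were repeated more often than the larger web-scale sources.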

Benchmark Performance

We evaluated the Llama-SEA-LION-v2-8B base model on general language capabilities.

General Language Capabilities

For the evaluation of general language capabilities in SEA languages, we employed the BHASA evaluation benchmark across a variety of tasks. These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

The evaluation was done five-shot with native prompts, and only a sample of 100-1000 instances from each dataset was used, as per the setting described in the paper.

For more details on Llama-SEA-LION-v2-8B benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/

Llama-SEA-LION-v2-8B-IT

Llama-SEA-LION-v2-8B-IT is a multilingual instruction-following model which has been fine-tuned with around 100,000 English instruction-completion pairs alongside a smaller pool of around 50,000 instruction-completion pairs from other ASEAN languages, such as Indonesian, Thai and Vietnamese.

These instructions have been carefully curated and rewritten to ensure the model was trained on truly open, commercially permissive and high quality datasets.

Fine-Tuning Methodology

The Llama-SEA-LION-v2-8B-IT model was fine-tuned on 8x A100-40GB GPUs using parameter-efficient fine-tuning in the form of LoRA.
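To illustrate why LoRA is parameter-efficient: instead of updating a full weight matrix of shape d_out x d_in, it trains two low-rank factors of shapes d_out x r and r x d_in. The sketch below compares the two parameter counts for a single matrix. The rank and the matrix size are hypothetical values chosen for illustration; the source does not state the LoRA hyperparameters that were used.

```python
def full_params(d_in, d_out):
    """Trainable parameters when a d_out x d_in matrix is fully fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical example values, not taken from the source.
D_IN = D_OUT = 4096
RANK = 16

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r={RANK}):   {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

With these illustrative numbers, the LoRA factors amount to less than 1% of the matrix's parameters, which is what makes fine-tuning feasible on smaller GPUs.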

Fine-Tuning Data

Llama-SEA-LION-v2-8B-IT was trained on a wide range of instructions that were manually and stringently verified by our team. A large portion of the effort went into ensuring that every instruction-completion pair the model sees is of high quality: errors were corrected and rewritten by native speakers, and pairs that could not be fixed were dropped from our mix.

In addition, special care was taken to ensure that the datasets used had commercially permissive licenses through verification with the original data source.

Benchmark Performance

We evaluated Llama-SEA-LION-v2-8B-IT on both general language capabilities and instruction-following capabilities.

General Language Capabilities

For the evaluation of general language capabilities, we employed the BHASA evaluation benchmark across a variety of tasks. These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

Note: BHASA is implemented with a strict answer format, and only spaces and punctuation are cleaned. For tasks where options are provided, the answer should include only one of the pre-defined options and nothing else. If the model continues to generate more tokens (e.g. to explain its answer), the response is considered wrong. For the F1 score metric (as used in Sentiment Analysis and Toxicity Detection), all answers that do not fall under the pre-defined labels are treated as a separate label (marking them as wrong answers) and included in the calculations, so that the model is penalized for not generating one of the pre-defined labels.
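The strict-format scoring described in the note can be sketched roughly as follows. This is not the BHASA implementation: the label set, the `<invalid>` marker, the choice to strip only leading and trailing spaces and punctuation, and the use of macro-averaged F1 are all assumptions made for illustration.

```python
import string

LABELS = ("positive", "negative", "neutral")  # hypothetical label set
INVALID = "<invalid>"  # extra label for answers outside the pre-defined set


def normalize(answer):
    """Strip surrounding whitespace and punctuation; nothing else is cleaned."""
    return answer.strip(string.whitespace + string.punctuation)


def to_label(answer, labels=LABELS):
    """Map a raw model answer to a pre-defined label, or INVALID otherwise."""
    cleaned = normalize(answer)
    return cleaned if cleaned in labels else INVALID


def macro_f1(gold, predicted, labels=LABELS):
    """Macro F1 over the pre-defined labels plus INVALID (assumed averaging)."""
    scores = []
    for label in list(labels) + [INVALID]:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        if tp + fp + fn == 0:
            continue  # label appears in neither gold nor predictions
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


gold = ["positive", "negative", "neutral", "negative"]
raw_answers = [" positive. ", "negative", "neutral because the tone is flat", "negative"]
predicted = [to_label(a) for a in raw_answers]
print(predicted)
print(macro_f1(gold, predicted))
```

In this example the third answer contains the right label but continues with an explanation, so it is mapped to the extra label. That both lowers the score for the gold label it missed and adds a zero-scoring class to the average.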

The evaluation was done zero-shot with native prompts, and only a sample of 100-1000 instances from each dataset was used, as per the setting described in the paper.

Instruction-following Capabilities

Since Llama-SEA-LION-v2-8B-IT is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets, IFEval and MT-Bench.

As these two datasets were originally in English, the linguists and native speakers in the team worked together to filter, localize and translate the datasets into the respective target languages to ensure that the examples remained reasonable, meaningful and natural.

IFEval

IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
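A minimal sketch of the language-normalized accuracy described above: a response counts as correct only if it satisfies the constraint and is written in the expected language. The `Result` record and its fields are hypothetical; how the constraint check and the language detection are performed is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Result:
    constraint_met: bool     # did the response satisfy the prompt's constraint?
    expected_language: str   # language the prompt was written in
    response_language: str   # language detected in the response (hypothetical field)


def passed(result):
    """A response passes only if the constraint holds AND the language matches."""
    return result.constraint_met and result.response_language == result.expected_language


def language_normalized_accuracy(results):
    """Fraction of responses that pass both checks."""
    if not results:
        return 0.0
    return sum(passed(r) for r in results) / len(results)


results = [
    Result(True, "id", "id"),   # correct task, correct language -> pass
    Result(True, "id", "en"),   # correct task, wrong language   -> fail
    Result(False, "th", "th"),  # constraint violated            -> fail
    Result(True, "vi", "vi"),   # correct task, correct language -> pass
]
print(language_normalized_accuracy(results))
```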

MT-Bench

MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use gpt-4-1106-preview as the judge model and compare against gpt-3.5-turbo-0125 as the baseline model. The metric used is the weighted win rate against the baseline model, i.e. the average of the per-category win rates across Math, Reasoning, STEM, Humanities, Roleplay, Writing and Extraction. A tie is given a score of 0.5.
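The weighted win rate can be sketched as below: each judgement is scored 1 for a win, 0.5 for a tie and 0 for a loss, win rates are computed per category, and the categories are then averaged with equal weight. The input format (pairs of category and outcome) and the function name are assumptions for illustration, not the evaluation harness's actual interface.

```python
from collections import defaultdict

SCORE = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def weighted_win_rate(judgements):
    """Average of per-category win rates; judgements are (category, outcome) pairs."""
    per_category = defaultdict(list)
    for category, outcome in judgements:
        per_category[category].append(SCORE[outcome])
    if not per_category:
        return 0.0
    category_rates = [sum(s) / len(s) for s in per_category.values()]
    return sum(category_rates) / len(category_rates)


judgements = [
    ("Math", "win"), ("Math", "loss"), ("Math", "loss"), ("Math", "tie"),
    ("Writing", "win"),
]
print(weighted_win_rate(judgements))
```

Averaging per category first means a category with few examples carries the same weight as one with many. In this example the result is 0.6875, whereas pooling all five judgements together would give 0.5.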

Llama-SEA-LION-v2-8B-IT-GGUF

The following quantized GGUF formats of our Llama-SEA-LION-v2-8B-IT model are available:

Llama-SEA-LION-v2-8B-IT-Q2_K
Llama-SEA-LION-v2-8B-IT-Q3_K_M
Llama-SEA-LION-v2-8B-IT-Q4_0
Llama-SEA-LION-v2-8B-IT-Q4_K_M
Llama-SEA-LION-v2-8B-IT-Q5_0
Llama-SEA-LION-v2-8B-IT-Q5_K_M
Llama-SEA-LION-v2-8B-IT-Q6_K
Llama-SEA-LION-v2-8B-IT-Q8_0

Please refer to our Download the Model(s) section for more details on how to access them.
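When choosing between quantization levels, a rough size estimate can help. The sketch below covers only the three legacy block formats in the list (Q4_0, Q5_0, Q8_0), using the assumption that each stores weights in blocks of 32 with one 16-bit scale per block. It treats the model as exactly 8 billion parameters and ignores metadata and any tensors kept at a different precision, so actual file sizes will differ. The K-quant variants mix several block types and are left out.

```python
BLOCK_WEIGHTS = 32  # weights per quantization block (assumed layout)

# Assumed bytes per 32-weight block: a 2-byte scale plus the packed weights.
BLOCK_BYTES = {
    "Q4_0": 2 + 16,      # 4-bit weights
    "Q5_0": 2 + 4 + 16,  # 4 low bits plus a packed fifth bit per weight
    "Q8_0": 2 + 32,      # 8-bit weights
}


def bits_per_weight(fmt):
    """Effective bits stored per weight, including the per-block scale."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_WEIGHTS


def estimated_size_gb(n_params, fmt):
    """Rough weight-only size in GB (1 GB = 1e9 bytes)."""
    return n_params * BLOCK_BYTES[fmt] / BLOCK_WEIGHTS / 1e9


N_PARAMS = 8e9  # "8 billion parameters", treated as an approximation
for fmt in BLOCK_BYTES:
    print(f"{fmt}: {bits_per_weight(fmt)} bits/weight, ~{estimated_size_gb(N_PARAMS, fmt):.1f} GB")
```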

Download the Model(s)

SEA-LION v2 models are available for download via the following channels:

HuggingFace SEA-LION v2 Collection

| Model | Download |
| --- | --- |
| Llama-SEA-LION-v2-8B | HuggingFace |
| Llama-SEA-LION-v2-8B-IT | HuggingFace |
| Llama-SEA-LION-v2-8B-IT-GGUF | HuggingFace, Ollama |

Usage

Llama-SEA-LION-v2-8B-IT can be run using the 🤗 Transformers library:

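# Please use transformers==4.43.2

import transformers
import torch

model_id = "aisingapore/Llama-SEA-LION-v2-8B-IT"

# Build a text-generation pipeline that loads the weights in bfloat16
# and lets device_map="auto" place them on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a single user turn asking (in Indonesian) for the
# sentiment of the sentence "Buku ini sangat membosankan."
messages = [
    {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

# generated_text holds the whole conversation; the last entry is the model's reply.
print(outputs[0]["generated_text"][-1])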

Disclaimer

Users should be aware that our models exhibit certain limitations that warrant consideration:

- The model can hallucinate and occasionally generates irrelevant content, introducing fictional elements that are not grounded in the provided context. Users should also exercise caution in interpreting and validating the model's responses due to potential inconsistencies in its reasoning.
- The model has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and codes.

References

Thai Pre-Training Data Reference

@misc{lowphansirikul2021wangchanberta,
    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
    year={2021},
    eprint={2101.09635},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

