SEA-LION v1
Introduction
SEA-LION version 1, released in December 2023, was our first collection of Large Language Models (LLMs) pretrained and instruct-tuned specifically for the Southeast Asia (SEA) region, marking a significant step forward for Natural Language Processing in the SEA regional context.
SEA-LION v1 comes in two model sizes: one with 3 billion parameters (SEA-LION-v1-3B) and another with 7 billion parameters (SEA-LION-v1-7B). Both variants are built on the robust MPT architecture and utilise a 256K vocabulary with a context length of 2048 tokens.
Our SEA-LION-v1-7B model was then further instruct-tuned to produce SEA-LION-v1-7B-IT.
At a glance:
Model type: Decoder
Tokenizer: Custom SEABPETokenizer
Available Formats:
3B Base (SEA-LION-v1-3B)
7B Base (SEA-LION-v1-7B)
7B Instruct (SEA-LION-v1-7B-IT)
7B GGUF (SEA-LION-v1-7B-IT-GGUF)
Languages:
English
Chinese
Indonesian
Malay
Thai
Vietnamese
Filipino
Tamil
Burmese
Khmer
Lao
License: MIT
SEA-LION-v1-3B / SEA-LION-v1-7B
Model Architecture
SEA-LION-v1-3B and SEA-LION-v1-7B are both decoder models built on the robust MPT architecture:
| Parameter | SEA-LION-v1-3B | SEA-LION-v1-7B |
|---|---|---|
| Layers | 32 | 32 |
| d_model | 2560 | 4096 |
| head_dim | 20 | 32 |
| Vocabulary | 256000 | 256000 |
| Sequence Length | 2048 | 2048 |
Training Infrastructure

| Resource | SEA-LION-v1-3B | SEA-LION-v1-7B |
|---|---|---|
| AWS EC2 p4d.24xlarge | 30 instances | 32 instances |
| Nvidia A100 40GB GPU | 240 | 256 |
| Training Duration | 14 days | 22 days |
Configuration:

| Hyperparameter | SEA-LION-v1-3B | SEA-LION-v1-7B |
|---|---|---|
| Precision | bfloat16 | bfloat16 |
| Optimizer | decoupled_adamw | decoupled_adamw |
| Scheduler | cosine_with_warmup | cosine_with_warmup |
| Learning Rate | 1.6e-4 | 6.0e-5 |
| Global Batch Size | 1200 | 2048 |
| Micro Batch Size | 5 | 4 |
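The optimizer and scheduler names above correspond to MosaicML Composer components, the stack commonly used to train MPT models. Below is a minimal sketch of wiring them up; the stand-in module and the warmup duration are assumptions for illustration, not published values:

```python
import torch
from composer.optim import DecoupledAdamW, CosineAnnealingWithWarmupScheduler

# Stand-in module; the actual runs trained the full MPT model.
model = torch.nn.Linear(4096, 4096)

# decoupled_adamw: AdamW with weight decay decoupled from the learning rate.
optimizer = DecoupledAdamW(model.parameters(), lr=6.0e-5)  # 7B peak LR from the table

# cosine_with_warmup: linear warmup followed by cosine decay.
# "2000ba" (2000 batches) is an assumed warmup length; Composer's Trainer
# consumes this scheduler rather than stepping it manually.
scheduler = CosineAnnealingWithWarmupScheduler(t_warmup="2000ba")
```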
Tokenizer
For tokenization, both SEA-LION-v1-3B and SEA-LION-v1-7B employ our custom SEABPETokenizer, which is tailored to SEA languages to improve tokenization efficiency and downstream model performance.
SEABPETokenizer was trained by sampling 20M lines from the model training data, using the SentencePiece framework. The tokenizer type is Byte-Pair Encoding (BPE).
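As an illustration, BPE tokenizer training with SentencePiece looks roughly like the following; the input path and all options other than the vocabulary size are assumptions, not the exact recipe used for SEABPETokenizer:

```python
import sentencepiece as spm

# Train a BPE tokenizer on sampled lines (path is hypothetical).
spm.SentencePieceTrainer.train(
    input="sampled_20m_lines.txt",   # 20M lines sampled from the training data
    model_prefix="seabpe",           # writes seabpe.model / seabpe.vocab
    model_type="bpe",                # Byte-Pair Encoding, as described above
    vocab_size=256000,               # matches the SEA-LION vocabulary size
    character_coverage=1.0,          # assumed: full coverage for SEA scripts
    byte_fallback=True,              # assumed: byte fallback for rare characters
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="seabpe.model")
print(sp.encode("Selamat pagi, apa kabar?", out_type=str))
```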
Training Data
SEA-LION-v1-3B and SEA-LION-v1-7B were trained on 980B tokens of text data from 11 languages spoken across SEA:
English
Chinese
Indonesian
Malay
Thai
Vietnamese
Filipino
Tamil
Burmese
Khmer
Lao
These 980B tokens were composed of the following data mix:
| Data Source | Unique Tokens | Multiplier | Total Tokens | Percentage |
|---|---|---|---|---|
| RefinedWeb - English | 571.3B | 1 | 571.3B | 58.20% |
| mC4 - Chinese | 91.2B | 1 | 91.2B | 9.29% |
| mC4 - Indonesian | 3.68B | 4 | 14.7B | 1.50% |
| mC4 - Malay | 0.72B | 4 | 2.9B | 0.29% |
| mC4 - Filipino | 1.32B | 4 | 5.3B | 0.54% |
| mC4 - Burmese | 1.2B | 4 | 4.9B | 0.49% |
| mC4 - Vietnamese | 63.4B | 1 | 63.4B | 6.46% |
| mC4 - Thai | 5.8B | 2 | 11.6B | 1.18% |
| WangChanBERTa - Thai | 5B | 2 | 10B | 1.02% |
| mC4 - Lao | 0.27B | 4 | 1.1B | 0.12% |
| mC4 - Khmer | 0.97B | 4 | 3.9B | 0.40% |
| mC4 - Tamil | 2.55B | 4 | 10.2B | 1.04% |
| the Stack - Python | 20.9B | 2 | 41.8B | 4.26% |
| the Stack - Javascript | 55.6B | 1 | 55.6B | 5.66% |
| the Stack - Shell | 1.25B | 2 | 2.5B | 0.26% |
| the Stack - SQL | 6.4B | 2 | 12.8B | 1.31% |
| the Stack - Markdown | 26.6B | 1 | 26.6B | 2.71% |
| RedPajama - StackExchange | 21.2B | 1 | 21.2B | 2.16% |
| RedPajama - ArXiv | 30.6B | 1 | 30.6B | 3.12% |
Performance
The SEA-LION-v1-3B and SEA-LION-v1-7B base models achieved average performance on general English tasks, as measured by Hugging Face's Open LLM Leaderboard:
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Average |
|---|---|---|---|---|---|
| SEA-LION-v1-3B | 36.26 | 64.59 | 24.07 | 36.46 | 40.35 |
| SEA-LION-v1-7B | 39.93 | 68.51 | 26.87 | 35.09 | 42.60 |
SEA-LION-v1-7B-IT
SEA-LION-v1-7B-IT is a multilingual model fine-tuned with thousands of English and Indonesian instruction-completion pairs, alongside a smaller pool of instruction-completion pairs from other ASEAN languages, using our pretrained SEA-LION-v1-7B as the base model.
These instructions were carefully curated and rewritten to ensure the model was trained on truly open, commercially permissive, and high-quality datasets.
Fine-Tuning Methodology
SEA-LION-v1-7B-IT was fine-tuned on 8x A100-40GB GPUs using parameter-efficient fine-tuning (PEFT) in the form of LoRA, along the lines of the sketch below.
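A minimal sketch of a LoRA setup with Hugging Face PEFT follows. The rank, alpha, dropout, and target modules are assumptions (the module names follow MPT's attention projection naming), not the exact configuration used:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the pretrained base model (trust_remote_code for MPT custom code).
base = AutoModelForCausalLM.from_pretrained(
    "aisingapore/sea-lion-7b", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["Wqkv", "out_proj"],  # MPT attention projections (assumed targets)
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the low-rank adapter weights are trainable.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```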
Fine-Tuning Data
SEA-LION-v1-7B-IT was trained on a wide range of instructions that were manually and stringently verified by our team. A large portion of the effort was dedicated to ensuring that each instruction-completion pair the model sees is of high quality; any errors were corrected and rewritten by native speakers, or the pair was dropped from our mix.
In addition, special care was taken to ensure that the datasets used had commercially permissive licenses through verification with the original data source.
Benchmarks
We evaluated SEA-LION v1 using BHASA, which stands out amongst evaluations for SEA languages for its holistic approach: it covers not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering), but also meticulously handcrafted linguistic and cultural diagnostic tests.
The evaluation was done zero-shot with Indonesian prompts, and only a sample of 100-1000 instances per dataset was used, as per the setting described in the BHASA paper.
For Natural Language Understanding (NLU) tasks, we tested the model on Sentiment Analysis (Sentiment) using the NusaX dataset, Question Answering (QA) using the TyDiQA dataset, and Toxicity Detection (Toxicity) using the Indonesian Multi-Label Hate Speech Detection dataset. The metrics used are F1 scores for all three tasks.
For Natural Language Generation (NLG) tasks, we tested the model on Machine Translation from English to Indonesian (Eng>Indo) and from Indonesian to English (Indo>Eng) using the FLORES-200 dataset, and Abstractive Summarization (Summary) using the XLSum dataset. The metrics used for Machine Translation and Abstractive Summarization are ChrF++ and ROUGE-L respectively.
For Natural Language Reasoning (NLR) tasks, we tested the model on Natural Language Inference (NLI) using the IndoNLI lay dataset and on Causal Reasoning (Causal) using the XCOPA dataset. The metrics are based on accuracy for both tasks.
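As a concrete example of the metrics above, chrF++ can be computed with sacrebleu's CHRF metric with word bigrams enabled; the sentences below are illustrative, not drawn from the benchmark:

```python
from sacrebleu.metrics import CHRF

# chrF++ is CHRF with word n-grams enabled (word_order=2).
chrf = CHRF(word_order=2)

hyps = ["Singa laut adalah mamalia laut."]       # system outputs (illustrative)
refs = [["Singa laut merupakan mamalia laut."]]  # one reference stream

score = chrf.corpus_score(hyps, refs)
print(score.score)  # chrF++ on a 0-100 scale
```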
Performance
SEA-LION v1 models achieved better than or competitive performance relative to other models on regional-language tasks at the time of release:
| Model | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) |
|---|---|---|---|---|---|---|---|---|
| SEA-LION-7B-Instruct-Research | 24.86 | 76.13 | 24.45 | 52.50 | 46.82 | 15.44 | 33.20 | 23.80 |
| SEA-LION-7B-Instruct | 68.41 | 91.45 | 17.98 | 57.48 | 58.04 | 17.54 | 53.10 | 60.80 |
| SeaLLM 7B v1 | 30.96 | 56.29 | 22.60 | 62.23 | 41.55 | 14.03 | 26.50 | 56.60 |
| SeaLLM 7B v2 | 44.40 | 80.13 | 55.24 | 64.01 | 63.28 | 17.31 | 43.60 | 82.00 |
| Sailor-7B | 65.43 | 59.48 | 20.48 | 64.27 | 60.68 | 8.69 | 15.10 | 38.40 |
| Llama 2 7B Chat | 11.12 | 52.32 | 0.00 | 44.09 | 57.58 | 9.24 | 0.00 | 0.00 |
| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 | 28.60 | 50.80 |
| GPT-4 | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 |
SEA-LION-7B-Instruct-GGUF
Support for SEA-LION-v1-7B-IT-GGUF was merged into llama.cpp as of 4 April 2024.
Prompt Template:
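The instruct-tuned model expects the same USER/RESPONSE format it was fine-tuned with; per the published model card, the template is as follows, with `{prompt}` standing in for the user's input:

```
### USER:
{prompt}

### RESPONSE:
```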
Recommended llama.cpp command:
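A hedged example invocation is shown below, using flags as found in llama.cpp builds from around April 2024 (when the binary was still named `main`); the model path and sampling settings are illustrative:

```sh
# -e unescapes \n in the prompt so the template renders correctly.
./main -m sea-lion-7b-instruct-Q4_K_M.gguf \
  --temp 0 --repeat-penalty 1.2 -e \
  -p "### USER:\nApa itu singa laut?\n\n### RESPONSE:\n"
```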
To convert & quantize your own SEA-LION model:
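A rough sketch with llama.cpp's conversion script and quantize tool of that era; the paths and the Q4_K_M choice are illustrative:

```sh
# Convert the Hugging Face checkpoint to a GGUF file, then quantize it.
python convert-hf-to-gguf.py path/to/sea-lion-7b-instruct \
  --outfile sea-lion-7b-instruct-f16.gguf
./quantize sea-lion-7b-instruct-f16.gguf sea-lion-7b-instruct-Q4_K_M.gguf Q4_K_M
```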
The following quantized GGUF formats of our SEA-LION-v1-7B-IT model are available:
sea-lion-7b-instruct-Q2_K
sea-lion-7b-instruct-Q3_K_M
sea-lion-7b-instruct-Q4_0
sea-lion-7b-instruct-Q4_K_M
sea-lion-7b-instruct-Q5_0
sea-lion-7b-instruct-Q5_K_M
sea-lion-7b-instruct-Q6_K
sea-lion-7b-instruct-Q8_0
Download the Model(s)
SEA-LION v1 models are available for download via the following channels:
SEA-LION-v1-3B
SEA-LION-v1-7B
SEA-LION-v1-7B-IT
SEA-LION-v1-7B-IT-GGUF
Usage
SEA-LION-v1-7B-IT can be run using the 🤗 Transformers library, for example:
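A minimal sketch follows; the repo ID reflects the model's Hugging Face naming, and the prompt and generation length are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b-instruct"
# trust_remote_code is required for the MPT-based custom model code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# The instruct model expects the USER/RESPONSE template shown earlier.
prompt_template = "### USER:\n{prompt}\n\n### RESPONSE:\n"
full_prompt = prompt_template.format(
    prompt="Apa sentimen dari kalimat berikut: 'Saya suka makanan ini!'"
)

tokens = tokenizer(full_prompt, return_tensors="pt")
output = model.generate(tokens["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```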
Prompting Guide
Disclaimer
It is important for users to be aware that our models exhibit certain limitations that warrant consideration:
The model can hallucinate and occasionally generate irrelevant content, introducing fictional elements not grounded in the provided context. Users should also exercise caution in interpreting and validating the model's responses due to potential inconsistencies in its reasoning.
The model has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and codes.
The model has also not been optimized for multi-turn dialogue interactions, which may reduce its effectiveness in extended conversations.
References
Thai Pre-Training Data Reference