Mar 10, 2025 11:51:00

AMD releases its own visual language model 'Instella-VL-1B,' trained on AMD GPUs to achieve competitive performance

Semiconductor giant AMD has announced its first visual language model (VLM), Instella-VL-1B. Instella-VL-1B is part of the Instella family of language models announced by AMD in March 2025, and is a VLM trained on the AMD Instinct MI300X, AMD's GPU for generative AI.

Instella-VL-1B: First AMD Vision Language Model — ROCm Blogs
https://rocm.blogs.amd.com/artificial-intelligence/Instella-BL-1B-VLM/README.html

Instella-VL-1B is a multi-modal model with 1.5 billion parameters that combines a vision encoder with 300 million parameters and a language model with 1.2 billion parameters.
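
As a rough sanity check on these figures, the sketch below adds the two stated component sizes and estimates the memory needed just to hold the weights. The 16-bit weight format and the helper names are assumptions made for illustration, not details from AMD's announcement, and any connector layers between the two components are ignored.

```python
# Rounded parameter counts as quoted in the article.
VISION_ENCODER_PARAMS = 300_000_000
LANGUAGE_MODEL_PARAMS = 1_200_000_000


def total_params(*component_sizes: int) -> int:
    """Sum the parameter counts of the model's components."""
    return sum(component_sizes)


def weight_memory_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Estimate weight storage in GiB.

    Assumption: 2 bytes per parameter (e.g. FP16 or BF16 weights).
    Activations, caches, and optimizer state are not included.
    """
    return num_params * bytes_per_param / 2**30


total = total_params(VISION_ENCODER_PARAMS, LANGUAGE_MODEL_PARAMS)
print(f"total parameters: {total / 1e9:.1f}B")
print(f"approx. weight memory at 16-bit: {weight_memory_gib(total):.2f} GiB")
```

The two components add up to the 1.5 billion parameters quoted above, which at an assumed 2 bytes per parameter is a little under 3 GiB of weights.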

To build Instella-VL-1B, AMD combined datasets such as LLaVA, Cambrian, and Pixmo, and created new data mixtures for both the pre-training and SFT (supervised fine-tuning) stages. In particular, AMD strengthened the model's document understanding by adding richer document-related datasets such as M-Paper, DocStruct4M, and DocDownstream.

With the new pre-training dataset (7M examples) and SFT dataset (6M examples), Instella-VL-1B significantly outperforms similarly sized open source models (such as LLaVA-OneVision and MiniCPM-V-2) on both general visual language tasks and OCR-related benchmarks. It also outperforms the open-weight model InternVL2-1B on general benchmarks and achieves comparable performance on OCR-related benchmarks.

The table below compares its scores on various benchmarks with those of competing models:

Model name	Visual Encoder	Text Encoder	GQA	SQA	POPE	MM-Bench	SEED-Bench	MMMU	RealWorldQA	MMStar	OCRBench	Text VQA	AI2D	ChartQA	DocVQA	InfoVQA
DeepSeek-VL-1.3B	SigLIP	DeepSeek-LLM-1B	--	64.52	85.80	64.34	65.94	28.67	50.20	38.30	41.40	57.54	51.13	47.40	35.70	20.52
InternVL2-1B	InternViT	Qwen2-0.5B	55.06	89.54	87.40	61.70	65.90	32.40	51.90	46.18	74.40	69.60	62.40	71.52	80.94	46.30
InternVL2.5-1B	InternViT	Qwen2-0.5B-instruct	56.66	93.90	89.95	68.40	71.30	35.60	58.30	47.93	74.20	72.96	67.58	75.76	82.76	53.62
TinyLLaVA-2.4B	SigLIP	Gemma	61.58	64.30	85.66	58.16	63.30	32.11	52.42	37.17	28.90	47.05	49.58	12.96	25.82	21.35
TinyLLaVA-1.5B	SigLIP	TinyLlama	60.28	59.69	84.77	51.28	60.04	29.89	46.67	31.87	34.40	49.54	43.10	15.24	30.38	24.46
LLaVA-OneVision-1B	SigLIP	Qwen2-0.5B	57.95	59.25	87.17	44.60	65.43	30.90	51.63	37.38	43.00	49.54	57.35	61.24	71.22	41.18
MiniCPM-V-2	SigLIP	MiniCPM-2.4B	--	76.10	86.56	70.44	66.90	38.55	55.03	40.93	60.00	74.23	64.40	59.80	69.54	38.24
Instella-VL-1B	CLIP	AMD OLMO 1B SFT	61.52	83.74	86.73	69.17	68.47	29.30	58.82	43.21	67.90	71.23	66.65	72.52	80.30	46.40
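
One way to read the table is to average each model's scores within a benchmark group. The sketch below does this for Instella-VL-1B and InternVL2-1B using the scores from the table. Two things here are assumptions and do not come from the article: treating the first eight score columns (GQA through MMStar) as "general" and the last six (OCRBench through InfoVQA) as "OCR-related", and using an unweighted mean as the summary. The function name is illustrative.

```python
from statistics import mean

# Score columns in the order they appear in the table.
BENCHMARKS = [
    "GQA", "SQA", "POPE", "MM-Bench", "SEED-Bench", "MMMU",
    "RealWorldQA", "MMStar", "OCRBench", "Text VQA", "AI2D",
    "ChartQA", "DocVQA", "InfoVQA",
]

# Assumption: the first eight columns are "general" benchmarks and the
# remaining six are "OCR-related"; the article does not spell out the split.
GENERAL = BENCHMARKS[:8]
OCR_RELATED = BENCHMARKS[8:]

# Scores copied from the table rows for the two models being compared.
SCORES = {
    "InternVL2-1B": [55.06, 89.54, 87.40, 61.70, 65.90, 32.40, 51.90,
                     46.18, 74.40, 69.60, 62.40, 71.52, 80.94, 46.30],
    "Instella-VL-1B": [61.52, 83.74, 86.73, 69.17, 68.47, 29.30, 58.82,
                       43.21, 67.90, 71.23, 66.65, 72.52, 80.30, 46.40],
}


def group_mean(model: str, group: list[str]) -> float:
    """Unweighted mean of one model's scores over a group of benchmarks."""
    row = dict(zip(BENCHMARKS, SCORES[model]))
    return mean(row[name] for name in group)


for model in SCORES:
    general = group_mean(model, GENERAL)
    ocr = group_mean(model, OCR_RELATED)
    print(f"{model}: general {general:.2f}, OCR-related {ocr:.2f}")
```

Under this assumed split, Instella-VL-1B's general average is a little over a point higher than InternVL2-1B's, while the two OCR-related averages differ by only a few hundredths of a point. An unweighted mean is a crude summary, since the benchmarks use different scales and difficulty levels, so it should be read only as a quick cross-check of the table.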

Instella-VL-1B was built by adapting and optimizing the LLaVA codebase for AMD hardware and model architectures, and it was trained exclusively on publicly available datasets. Training was carried out on the AMD Instinct MI300X, and AMD described Instella-VL-1B as 'a testament to AMD's commitment to advancing open source AI technology in multimodal AI.'

In keeping with its open source commitment, AMD is sharing not only the weights of the Instella-VL-1B model, but also detailed training configurations, datasets and code.

GitHub - AMD-AIG-AIMA/InstellaVL
https://github.com/AMD-AIG-AIMA/InstellaVL

Mar 10, 2025 11:51:00 in Software, Posted by logu_ii