AMD releases proprietary visual language model 'Instella-VL-1B,' trained on AMD GPUs to achieve competitive performance



Semiconductor giant AMD has announced its first visual language model (

VLM ) , Instella-VL-1B . Instella-VL-1B is part of the Instella family of language models announced by AMD in March 2025, and is a VLM trained on AMD Instinct MI300X , AMD's GPU for generative AI.

Instella-VL-1B: First AMD Vision Language Model — ROCm Blogs
https://rocm.blogs.amd.com/artificial-intelligence/Instella-BL-1B-VLM/README.html



Instella-VL-1B is a multi-modal model with 1.5 billion parameters that combines a vision encoder with 300 million parameters and a language model with 1.2 billion parameters.



To build Instella-VL-1B, AMD combined datasets such as LLaVA , Cambrian , and Pixmo , and created new data mixtures in both the pre-training and SFT (supervised fine-tuning) stages. Specifically, they enhanced the model's document understanding capabilities by employing richer document-related datasets such as M-Paper , DocStruct4M , and DocDownstream .

With the new pre-training dataset (7M examples) and SFT dataset (6M examples), Instella-VL-1B significantly outperforms similarly sized open source models (such as LLaVa-OneVision and MiniCPM-V2 ) on both general visual language tasks and OCR -related benchmarks. It also outperforms the open weight model InternVL2-1B on general benchmarks and achieves comparable performance on OCR-related benchmarks.

Here's a comparison of its performance in various benchmarks with competing AI models:

Model name Visual Encoder Text Encoder

GQA

SQA

POPE

MM-Bench

SEED-Bench

MMMU

RealWorldQA

MMStar

OCRBench

Text VQA

AI2D

ChartQA

DocVQA

InfoVQA

DeepSeek-VL-1.3B

SigLIP

DeepSeek-LLM-1B

--

64.52

85.80

64.34

65.94

28.67

50.20

38.30

41.40

57.54

51.13

47.40

35.70

20.52

InternVL2-1B

InternViT

Qwen2-0.5B

55.06

89.54

87.40

61.70

65.90

32.40

51.90

46.18

74.40

69.60

62.40

71.52

80.94

46.30

InternVL2.5-1B

InternViT

Qwen2-0.5B-instruct

56.66

93.90

89.95

68.40

71.30

35.60

58.30

47.93

74.20

72.96

67.58

75.76

82.76

53.62

TinyLLaVA-2.4B

SigLIP

Gemma

61.58

64.30

85.66

58.16

63.30

32.11

52.42

37.17

28.90

47.05

49.58

12.96

25.82

21.35

TinyLLaVA-1.5B

SigLIP

TinyLlama

60.28

59.69

84.77

51.28

60.04

29.89

46.67

31.87

34.40

49.54

43.10

15.24

30.38

24.46

LLaVA-OneVision-1B

SigLIP

Qwen2-0.5B

57.95

59.25

87.17

44.60

65.43

30.90

51.63

37.38

43.00

49.54

57.35

61.24

71.22

41.18

MiniCPM-V-2

SigLIP

MiniCPM-2.4B

--

76.10

86.56

70.44

66.90

38.55

55.03

40.93

60.00

74.23

64.40

59.80

69.54

38.24

Instella-VL-1B

CLIP AMD OLMO 1B SFT

61.52

83.74

86.73

69.17

68.47

29.30

58.82

43.21

67.90

71.23

66.65

72.52

80.30

46.40



Instella-VL-1B is an adaptation and optimization of the LLaVA code base for AMD hardware and model architectures, and is trained exclusively using publicly available datasets. It was trained using AMD's generative AI GPU, the AMD MI300X, and AMD described Instella-VL-1B as 'a testament to AMD's commitment to advancing open source AI technology in multimodal AI.'

In keeping with its open source commitment, AMD is sharing not only the weights of the Instella-VL-1B model, but also detailed training configurations, datasets and code.

GitHub - AMD-AIG-AIMA/InstellaVL
https://github.com/AMD-AIG-AIMA/InstellaVL



in Software, Posted by logu_ii