'Constellation', a data library that catalogs more than 10,000 large language models (LLMs) and visualizes their download counts and similarities in an easy-to-understand way, has been released



Since the second half of 2022, countless large language models (LLMs) and AI services such as 'ChatGPT' and 'Bard' have appeared, and users around the world have begun actively using generative AI. Many of these models are hosted on Hugging Face, a repository of machine learning models and datasets, and researchers at Stanford University have now released a new visualization of Hugging Face's data.

[2307.09793] On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models
https://doi.org/10.48550/arXiv.2307.09793

Constellation
https://constellation.sites.stanford.edu/

Access 'Constellation' at the URL above and click 'Access Constellation'.



Next, specify which LLMs you want to display. The upper number is the minimum number of downloads: raise it to show only models that have been downloaded from Hugging Face more than the specified number of times, which narrows the view down to popular LLMs. The lower number is the number of clusters, which simply specifies how many groups the LLMs are divided into; LLMs are grouped by similarity.
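As a point of reference, the same kind of download-based filtering can be reproduced outside Constellation with the official huggingface_hub client. The sketch below is only an illustration: the threshold of 10,000 is an arbitrary example, and it assumes the Hub API populates the downloads and likes fields for listed models.

```python
# A minimal sketch of download-based filtering with the official
# huggingface_hub client (not part of Constellation itself).
from huggingface_hub import list_models

MIN_DOWNLOADS = 10_000  # illustrative threshold, like Constellation's "minimum downloads" field

# Fetch models sorted by download count (descending) and keep only
# those above the threshold.
popular = []
for model in list_models(sort="downloads", direction=-1, limit=500):
    if (model.downloads or 0) < MIN_DOWNLOADS:
        break  # results are sorted, so everything after this is below the threshold
    popular.append((model.id, model.downloads, model.likes))

for model_id, downloads, likes in popular[:10]:
    print(f"{model_id}: {downloads} downloads, {likes} likes")
```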



For this walkthrough, check the box to display the word cloud and click 'Run Clustering'. After a short wait, several graphs are displayed.



The first thing displayed is a tree diagram that organizes all of the LLMs matching the filter. It is hard to read at this scale, but you can zoom in to examine the details.



For example, you can check which model the LLM 'Vicuna-13B', which reportedly achieves quality comparable to ChatGPT and Google's Bard, was derived from, and which language models are similar to it.



This graph divides the LLMs into several communities using the Louvain method. Closely connected LLMs are treated as a community and enclosed in a thin circle. Hovering over each node (LLM) displays the model's name, its download ranking, its number of downloads, the number of 'likes' it has received on Hugging Face, and its number of parameters.
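For readers who want to try this kind of grouping themselves, the Louvain method is available in common graph libraries such as networkx. The toy graph below is hypothetical (the model names and edge weights are made up) and simply shows how closely connected nodes end up in the same community.

```python
# A minimal sketch of Louvain community detection with networkx
# (the model names and edges below are hypothetical, not Constellation's data).
import networkx as nx

G = nx.Graph()
# Edges mean "these models are related"; weights stand in for similarity.
G.add_weighted_edges_from([
    ("llama-7b", "vicuna-13b", 0.9),
    ("llama-7b", "alpaca-7b", 0.8),
    ("vicuna-13b", "alpaca-7b", 0.7),
    ("gpt-neo-125m", "gpt-j-6b", 0.9),
    ("gpt-neo-125m", "gpt-neox-20b", 0.8),
])

# louvain_communities returns a list of sets, one set of nodes per community.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```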



Next comes a list of the top 20 LLMs with the most connections to other LLMs. Models from the open-source, commercially usable 'Falcon' family occupy the top three spots.



Then you'll see a list of LLMs sorted by community size. The largest is the community around falcon-7b-instruct. The next largest is the one around gpt-neo-125m, one of the 'GPT-Neo' models, an open-source project aiming for performance close to 'GPT-3'.



A word cloud is also displayed for each cluster, showing at a glance which model names dominate it.
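A word cloud like this can be generated from the model names in a cluster with the widely used wordcloud package; the sketch below uses a handful of placeholder names, not Constellation's actual clusters.

```python
# A minimal sketch of building a per-cluster word cloud from model names
# using the `wordcloud` package (the names are placeholders).
from wordcloud import WordCloud

cluster_model_names = [
    "falcon-7b-instruct", "falcon-40b", "falcon-7b",
    "gpt-neo-125m", "gpt-neo-1.3B",
]

# Split names on common separators so individual words are counted.
words = " ".join(name.replace("-", " ").replace("_", " ") for name in cluster_model_names)

wc = WordCloud(width=800, height=400, background_color="white").generate(words)
wc.to_file("cluster_wordcloud.png")  # saves an image similar to Constellation's per-cluster view
```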



The last thing shown is a graph plotting the number of downloads against the number of likes. OpenAI's open-source LLM 'GPT-2' boasts an extraordinary 13.6 million downloads.
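This view is essentially a scatter plot of downloads versus likes. The sketch below reproduces the idea with matplotlib; apart from GPT-2's roughly 13.6 million downloads mentioned above, all values and model names are made-up placeholders.

```python
# A minimal sketch of a downloads-vs-likes scatter plot with matplotlib.
# Only gpt2's ~13.6 million downloads comes from the article; every other
# value (including gpt2's likes) is a hypothetical placeholder.
import matplotlib.pyplot as plt

models = ["gpt2", "model-a", "model-b", "model-c"]
downloads = [13_600_000, 2_000_000, 500_000, 50_000]
likes = [1_500, 400, 120, 30]

plt.scatter(downloads, likes)
for name, x, y in zip(models, downloads, likes):
    plt.annotate(name, (x, y))

plt.xscale("log")  # download counts span several orders of magnitude
plt.yscale("log")
plt.xlabel("Downloads")
plt.ylabel("Likes")
plt.title("Hugging Face models: downloads vs. likes")
plt.savefig("downloads_vs_likes.png")
```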



Existing LLMs are often given names containing similar words such as 'GPT' or 'model'. In their paper, the researchers who created this data library list in detail which words appear in LLM names and how often.
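Counting how often words like 'gpt' or 'model' appear in model names is a simple frequency analysis. The sketch below uses only Python's standard library and a small placeholder list of names, not the full Hugging Face dataset.

```python
# A minimal sketch of counting word frequency in LLM names
# (the list of names is a small placeholder, not the full Hugging Face dataset).
import re
from collections import Counter

model_names = [
    "gpt2", "gpt-neo-125m", "distilgpt2", "falcon-7b-instruct",
    "vicuna-13b", "bert-base-uncased", "llama-2-7b-chat",
]

counter = Counter()
for name in model_names:
    # Split on hyphens, underscores, dots and slashes, then count each word.
    words = [w for w in re.split(r"[-_./]", name.lower()) if w and not w.isdigit()]
    counter.update(words)

for word, count in counter.most_common(10):
    print(f"{word}: {count}")
```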


