A Review of Transformer Models
Aug 02 2023
Introduction
Large language models (LLMs) have emerged as a transformative force in natural language processing (NLP). Powered by deep learning techniques, these models have demonstrated remarkable capabilities in understanding and generating human-like text. The development of LLMs gained momentum with OpenAI's GPT (Generative Pre-trained Transformer) series, of which GPT-3 is among the most well-known and influential language models to date. With billions of parameters and extensive pre-training on vast amounts of internet text, ChatGPT, built on the GPT-3 family of models, has become a leading example of LLMs. It showcases extraordinary language comprehension, fluency, and contextual understanding, enabling it to excel across a wide range of NLP tasks. As these models continue to evolve and grow, they hold tremendous potential to revolutionize numerous applications in human-computer interaction, content generation, and information retrieval [@dale2021gpt],[@radford2019language].
The development of LLMs, which have revolutionized NLP, was directly facilitated by the introduction of the Transformer architecture in "Attention Is All You Need" by [@vaswani2017attention], on which they are based. Before Transformers, recurrent neural networks (RNNs) and variants such as LSTMs dominated sequential data processing but struggled with long-range dependencies and parallelization. The Transformer's attention mechanism lets it capture long-range dependencies efficiently, making it highly effective for NLP tasks. Notably, LLMs have been developed both by major research institutions and by commercial organizations. On the one hand, models like GPT-3, with 175 billion parameters, developed by OpenAI, are representative of the most well-known commercial LLMs, demonstrating human-like language understanding and generation capabilities. On the other hand, the research community has contributed several open-source large language models, such as T5 (Text-to-Text Transfer Transformer) by Google [@t5] and LLaMA (Large Language Model Meta AI) by Meta AI [@llama-1]. These models have also made significant impacts in various NLP applications, ranging from sentiment analysis to machine translation. Large language models are commonly pre-trained on vast amounts of internet text and then fine-tuned on specific tasks with labeled data to adapt their knowledge for targeted applications. Consequently, both paywalled and open-source large language models have played crucial roles in advancing the state of NLP, and they continue to pave the way for exciting developments in human-machine interaction and natural language understanding.
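To make the attention mechanism concrete, the sketch below implements single-head scaled dot-product attention, the core operation of the Transformer, in plain NumPy. The function name, toy shapes, and the absence of masking and multiple heads are simplifying assumptions for illustration, not a faithful reproduction of any production implementation.

```python
# Minimal NumPy sketch of single-head scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    # Every position mixes information from all other positions in one step,
    # which is how long-range dependencies are captured without recurrence.
    return weights @ V                                 # (seq_len, d_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # 4 toy tokens, 8-dim representations
print(scaled_dot_product_attention(x, x, x).shape)     # (4, 8)
```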
This article reviews various LLM model series and concludes with a comprehensive comparison of all models released to date.
Google's T5
Google's T5 was one of the earliest models to implement text-to-text generation as a sequence-to-sequence objective across a range of diverse NLP tasks: the model is fed some text for context or conditioning and is then asked to produce some output text. To specify which task the model should perform, a task-specific (text) prefix is added to the original input sequence before feeding it to the model. It turns out that a huge variety of NLP tasks can be cast in this format, including translation, summarization, and even classification and regression. As an example, to ask T5 to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.” For text classification tasks, the model simply predicts a single word corresponding to the target label. For example, on the MNLI benchmark [@mnli] the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. With this preprocessing, the input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word “entailment”. Google's subsequent work FLAN advocated a zero-shot task learning approach via instruction tuning: by using supervision to teach a language model to perform tasks described via instructions, the model learns to follow instructions and can do so even for unseen tasks. Distinguishing between T5 and FLAN prompts, T5 prompts were mostly just a tag for the dataset, which would not work in the zero-shot setting, whereas FLAN prompts resemble what one would use to ask a human to perform the task. And just as the dessert flan can top off any meal, the FLAN recipe can be applied on top of any pretrained language model; specifically, Google applied it to their pretrained LaMDA dialog model and their pretrained T5. The sketch below contrasts the two prompting styles.
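The sketch below runs a T5-style dataset prefix and a FLAN-style natural-language instruction through the publicly released Hugging Face checkpoints; the specific prompts and generation settings are illustrative assumptions rather than the papers' exact setups.

```python
# Contrast between T5 task prefixes and FLAN natural instructions,
# using public Hugging Face checkpoints.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def generate(model_name: str, prompt: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# T5: the task is signalled by a dataset-style prefix.
print(generate("t5-small", "translate English to German: That is good."))

# FLAN-T5: the task is described the way one would instruct a person,
# which is what enables zero-shot generalization to unseen tasks.
print(generate("google/flan-t5-small",
               "Is the following review positive or negative? Review: I loved this film."))
```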
A Catalog of Google's T5 series LLMs
Google's T5 is a versatile text-to-text model that can perform a wide range of language tasks. Their later-introduced FLAN (Fine-tuned LAnguage Net) recipe was designed to enhance the zero-shot performance of pretrained LLMs on unseen tasks through further instruction finetuning, producing instruction-tuned versions of their base models such as FLAN (instruction-tuned LaMDA-PT), Flan-T5, and Flan-PaLM. This comparison showcases this development based on the central characteristics of the models.
has model | T5 | LAMDA | FLAN | PaLM | Flan-T5 | Flan-PaLM |
model family | T5 | LaMDA-PT | LaMDA-PT | PaLM | T5 | PaLM |
date created | 2019-10-01 | 2022-01-01 | 2022-02-08 | 2022-04-01 | 2022-11-01 | 2022-11-01 |
pretraining task | Span Corruption | Causal language modeling | Causal language modeling | |||
pretraining architecture | Encoder/Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder |
fine-tuning task | finetuning on downstream tasks one at a time | based on multi-turn crowdsourced dialog datasets, LaMDA-PT is finetuned in a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context | Instruction Tuning | Instruction Tuning | Instruction Tuning | |
number of parameters | 60M, 220M, 770M, 3B, and 11B | 137B | 137B | 8B, 62B, and 540B | 80M (Flan-T5-Small), 250M (Flan-T5-Base), 780M (FLan-T5-Large), 3B (Flan-T5-XL), and 11B (Flan-T5-XXL). | 8B, 62B, 540B |
optimizer | AdaFactor | AdaFactor | AdaFactor | |||
tokenization | sentencepiece | sentencepiece | sentencepiece | |||
training corpus | Colossal Clean Crawled Corpus | 1.56T words from public dialog data and other public web documents. Overall, it consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words | FLAN is instruction tuned on 25 tasks spanning 62 datasets.<SEP>LaMDA-PT is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary | 780B tokens from multilingual social media conversations (50%), multilingual filtered webpages (27%), books in English (13%), code from Github (5%), multilingual Wikipedia (4%), and news in English (1%). Code includes 24 programming languages. | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT
extension | Same as original Transformer with some additions such as relative positional embeddings like Transformer XL | LaMDA focuses on how to improve safety, quality, and groundedness using different fine-tuning strategies | Zero-shot task learning. The output space for a given task is either one of several classes (classification) or free text (generation). | PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, Shared Input-Output Embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data | instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data | Flan-PaLM is generated by "Flan Finetuning" the PaLM models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data.
hardware used | TPUv3 | TPUv3 | TPUv3 | TPUv4 | ||
hardware information | we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods. TPU Pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. | LaMDA was pretrained on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch | instruction tuning takes around 60 hours on a TPUv3 with 128 cores | PaLM 540B is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism. Each Pod has 3072 TPU v4 chips attached to 768 hosts. | |
organization | Google | Google | Google | Google | Google | Google |
application | Diverse set of downstream tasks including machine translation, question answering, abstractive summarization, and text classification | General language modeling, such as translation, summarization, question and answers | language understanding and generation tasks such as inference, sentiment analysis, paraphrase, closed-book QA, reading comprehension, coreference, summarization, translation, commonsense reasoning, and struct-to-text | PaLM is designed as a general purpose language model with applicability to hundreds of different language tasks | The primary use is to understand how to improve large language models with the right kind of instruction fine-tuning. The focus is research on zero-shot and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models | Same as Flan-T5. The goal is to show Flan finetuning can even improve on the largest Google LMs (+9.4% improvement average across tasks), with improvements to chain of thought, self consistency, multilingual tasks, arithmetic reasoning
has source code | https://github.com/google-research/text-to-text-transfer-transformer<SEP>https://huggingface.co/docs/transformers/model_doc/t5 | https://github.com/google-research/FLAN | https://github.com/lucidrains/PaLM-pytorch | https://github.com/google-research/t5x<SEP>https://huggingface.co/docs/transformers/model_doc/flan-t5 | ||
blog post | https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html | https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html<SEP>https://blog.google/technology/ai/lamda/ | http://rylanschaeffer.github.io/blog_posts/2022-01-20-google-brain-flan.html<SEP>https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html | https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/<SEP>https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html | https://ai.googleblog.com/2023/02/the-flan-collection-advancing-open.html | |
license | Apache 2.0 | closed source | Apache 2.0 | closed source | Apache 2.0 | closed source |
EleutherAI's GPT-J, GPT-NeoX, and Pythia series
With the GPT-NeoX-20B model, EleutherAI sought to offer a strong open-source alternative to GPT-3; in its empirical evaluations it proved an effective model, coming fairly close to GPT-3 performance. At the time of its release in April 2022, it was the largest open-source model available. The earlier GPT-J-6B model had a similar objective: it outperformed GPT-3 Babbage and, in its weakest results, trailed the 175B GPT-3 model by about 10 points, still proving a strong open-source contender.
The Pythia series from EleutherAI focuses on foundational research on LLMs, centered on the question: "How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale?" It is well established that there are regular and predictable patterns in the behavior of trained language models as they scale [@scale1] [@scale2] [@ghorbani2021scaling] [@pu2021scaling] [@mikami2021scaling] [@hernandez2021scaling], but prior work connecting these "Scaling Laws" to the learning dynamics of language models is minimal. One of the driving reasons for this gap is a lack of access to appropriate model suites to test theories. EleutherAI addresses this with the Pythia suite of 8 main models at parameter sizes of 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B, each released in two variants trained on the original and the deduplicated Pile dataset, for 16 models in total. Furthermore, public access to 154 checkpoints for each of the 16 models is provided, alongside tools to download and reconstruct their exact training dataloaders for further study, as in the loading sketch below.
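For researchers who want to study these training dynamics, the checkpoints are hosted on the Hugging Face Hub and can be loaded at a chosen training step via the revision argument. The sketch below follows EleutherAI's published "stepN" naming convention; the particular model size and step are arbitrary illustrative choices.

```python
# Load an intermediate Pythia checkpoint to inspect training dynamics.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/pythia-160m-deduped"   # one of the 16 released models
revision = "step3000"                           # one of the 154 per-model checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_name)   # tokenizer is shared across steps
model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)

inputs = tokenizer("The scaling behaviour of language models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```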
A Catalog of EleutherAI's Large Language Models
Starting in early 2022, EleutherAI released open-sourced high-performing models built after GPT-2/GPT-3 with an intent to open-source strong LLMs. With its Pythia series, its initiative to open-source LLMs continues, which now extends into releasing the largest collection of 16 total open-source models at different parameter sizes to study the effects of scale on LLMs. This comparison showcases EleutherAI's models released so far in terms of their salient properties.
has model | GPT-Neo | GPT-J | GPT-NeoX-20B | Pythia |
model family | GPT | GPT | GPT | Pythia |
date created | 2021-03-01 | 2021-05-01 | 2022-04-01 | 2023-03-13 |
innovation | GPT-Neo by EleutherAI is an open-source alternative to proprietary models like GPT-3. Trained on the diverse "The Pile" dataset, its significance lies in democratizing access to large language models, fostering community-driven development, and promoting ethical considerations in AI research. | GPT-J-6B by EleutherAI stands out for its open-source accessibility, training on the diverse "Pile" dataset, and its development by an independent research organization. The model emphasizes democratization in AI, showcasing that state-of-the-art advancements aren't limited to large corporations. EleutherAI's transparent approach and emphasis on open research further distinguish GPT-J in the LLM landscape. | GPT-NeoX-20B is a 20 billion parameter autoregressive language model with notable architectural differences from GPT-3, including the use of rotary positional embeddings. It excels in few-shot reasoning and is distinctively open-sourced, making it a significant contribution to the public research community. The model was trained on the Pile dataset. | The paper introduces Pythia, a suite of 16 Large Language Models (LLMs) trained on consistent public data, spanning from 70M to 12B parameters. Pythia provides public access to 154 checkpoints for each model, facilitating research into LLM training dynamics, memorization, and bias reduction. This unique setup offers novel insights into LLM behavior and evolution. |
pretraining architecture | Decoder | Decoder | Decoder | Decoder |
pretraining task | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling |
training corpus | Pile - 840 GB open source text dataset that combines 22 preexisting datasets | Pile corpus, a large-scale curated dataset created by EleutherAI | Pile - 840 GB open source text dataset that combines 22 preexisting datasets | Two versions of each of the eight models are released: one trained on the original Pile corpus and another trained on the deduplicated Pile corpus.<SEP>Pile
optimizer | ZeRO<SEP>AdamW | ZeRO<SEP>Adam | ||
tokenization | same set of Byte Pair Encoding (BPE) Tokenizer as GPT-2/GPT-3 | same BPE tokenizer as GPT-2/GPT-3 | train a new BPE tokenizer based on the Pile | BPE tokenizer trained specifically on the Pile, same as used for GPT-NeoX-20B
number of parameters | 125M, 350M, 1.3B, and 2.7B | 6B | 20B | 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B |
maximum number of parameters (in million) | 2700 | 6000 | 20000 | 12000 |
hardware used | TPU v3-256 pod | AMD EPYC 7532 CPU<SEP>A100-SXM4-40GB GPU | A100-40GB GPU | |
hardware information | | | trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs | training the 70M, 160M, and 410M models required 32 A100 GPUs with 40GB RAM; the 1.0B, 1.4B, and 2.8B models used 64 A100-40GB GPUs; the 6.9B model used 128 GPUs; and the 12B model used 256 GPUs.
extension | Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens | GPT-J 6B is a Transformer model trained using Mesh Transformer JAX and same tokenizer as GPT2/3 | Similar to GPT-3 with rotary encoders instead of positional, parallel attention and feed forward layers, different initialization, and all dense layers instead of alternate dense/sparse | Trained with the library GPT-NeoX
organization | EleutherAI | EleutherAI | EleutherAI | EleutherAI |
application | Text generation, but adaptable to many other NLP tasks when fine tuned. | generating text from a prompt, but is advised to be finetuned for more effective performance | range of language-understanding, mathematics and knowledge-based tasks | Research on language models' behavior, functionality, and limitations
has source code | https://github.com/EleutherAI/gpt-neo<SEP>https://huggingface.co/docs/transformers/model_doc/gpt_neo | https://huggingface.co/EleutherAI/gpt-j-6b<SEP>https://github.com/kingoflolz/mesh-transformer-jax | https://github.com/EleutherAI/gpt-neox<SEP>other gpt-neo models<SEP>https://huggingface.co/EleutherAI/gpt-neox-20b | pythia-deduped<SEP>pythia<SEP>https://github.com/EleutherAI/pythia |
blog post | https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api<SEP>https://www.section.io/engineering-education/leveraging-gptneo-to-generate-ai-based-blog-content/ | https://en.wikipedia.org/wiki/GPT-J | https://blog.eleuther.ai/announcing-20b/ | https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling |
license | Open, MIT license | Apache 2.0 | Apache 2.0 | Apache 2.0 |
MosaicML's Mosaic Pretrained Transformer (MPT)
MosaicML, based in San Francisco, California, is a leading generative AI platform that empowers enterprises to build their own AI. On July 19, 2023, Databricks, the data and AI company, announced that it had completed its acquisition of MosaicML. MosaicML is widely known for its state-of-the-art MPT large language models (LLMs).
Its first model, released in early May 2023, was MPT-7B, a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. The base model is accompanied by three finetuned models for practical applications: 1) an instruction-tuned model called MPT-7B-Instruct (licensed for commercial use); 2) a non-commercially licensed MPT-7B-Chat, a conversational version of MPT-7B that has been finetuned using ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct, ensuring it is well equipped for a wide array of conversational tasks and applications (while MPT-7B-Instruct focuses on delivering a more natural and intuitive interface for instruction-following, MPT-7B-Chat aims to provide seamless, engaging multi-turn interactions for users); and 3) MPT-7B-StoryWriter-65k+, “a model designed to read and write stories with super long context lengths”, with a previously unheard-of 65,000-token context length. MPT-7B matches the quality of LLaMA-1-7B and outperforms other open-source 7B-20B models such as Pythia on standard academic tasks. The Falcon, Llama-2, and OpenLLaMA models were not part of the MPT-7B evaluations.
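The MPT-7B checkpoints are distributed through the Hugging Face Hub with custom modelling code, so, at the time of writing, loading them requires trusting remote code; the base model reuses EleutherAI's gpt-neox-20b tokenizer, as noted in the catalog below. The sketch that follows shows this loading pattern; the dtype and generation settings are illustrative assumptions.

```python
# Load MPT-7B from the Hugging Face Hub (custom model code in the repo).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # MPT reuses this tokenizer
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # MPT's architecture is defined in the model repo itself
)

inputs = tokenizer("MosaicML's MPT-7B is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```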
Subsequently, in June, MosaicML expanded the MPT series with the larger sibling of the 7B model, MPT-30B, a new, completely open-source model licensed for commercial use. This model is significantly more powerful than MPT-7B and outperforms GPT-3 on many benchmarks. It has also been released in two finetuned variants, MPT-30B-Instruct and MPT-30B-Chat, licensed for non-commercial use only. All models in this series demonstrate strong code generation abilities, and the MPT-30B models outperform TII's Falcon-40B.
Both MPT-7B and MPT-30B are trained on a large amount of data (1T tokens, like LLaMA, versus 300B for Pythia and 800B for StableLM).
With its models, MosaicML aims to build on and further strengthen the precedent for open-source LLM development set by afore-mentioned efforts such as the LLaMA series from Meta, the Pythia series from EleutherAI, the Falcon model from the Technology Innovation Institute (TII), and the OpenLLaMA model from Berkeley AI Research.
A Catalog of MosaicML's MPT Large Language Models
Thus far the MPT (Mosaic Pretrained Transformer) model series from MosaicML, i.e. MPT-7B and MPT-30B, has proven to be very competitive, with MPT-7B matching LLaMA-7B on many academic tasks and MPT-30B surpassing GPT-3, Llama-1, Falcon-40B, and prior models from EleutherAI on many benchmarks. This comparison showcases MosaicML's models released so far in terms of their salient properties. With increasingly stronger models released for research, the open-sourced development of LLMs offers a promising future trajectory.
has model | MPT-7B | MPT-30B |
model family | MosaicPretrainedTransformer (MPT) models | MosaicPretrainedTransformer (MPT) models |
date created | 2023-05-05 | 2023-06-22 |
pretraining architecture | Decoder | Decoder |
pretraining task | Language Modeling | Language Modeling |
training corpus | Semantic Scholar ORC<SEP>The Stack Dedup<SEP>RedPajama<SEP>C4<SEP>mC4 | RedPajama - StackExchange<SEP>RedPajama - arXiv<SEP>RedPajama - Books<SEP>Semantic Scholar ORC<SEP>The Stack - Markdown<SEP>RedPajama - Wikipedia<SEP>The Stack - Selected Languages<SEP>RedPajama - CommonCrawl<SEP>c4 - English - SemDedup<SEP>mC4 3.1.0 - English |
optimizer | LION | LION |
tokenization | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus |
number of parameters | 7B | 30B |
hardware used | A100-40GB GPU | A100-40GB GPU<SEP>H100-80GB GPU |
hardware information | A100-80GB GPU was used in model finetuning<SEP> took ~9.5 days to train on 440xA100-40GB GPUs, and cost ~$200k | The model was trained in three stages using the MosaicML Platform: (i) First it was trained on 440 A100-40GBs with a batch size of 1760. (ii) Then, on 216 A100-40GBs with a batch size of 1728. (iii) Training was completed on 256 H100-80GBs with a batch size of 512 with 8k context length and 50B tokens. The model was trained with sharded data parallelism using FSDP and used the LION optimizer. |
extension | MPT-7B-StoryWriter-65k+<SEP>MPT-7B-Chat<SEP>MPT-7B-Instruct | MPT-30B 8k Context Window<SEP>MPT-30B-Chat<SEP>MPT-30B-Instruct |
fine-tuning task | Conversations for the Chat model<SEP>Short instructions for the Instruct model<SEP>Excerpts of fiction books for StoryWriter | Finetuning on longer context data using sample subsets of datasets only samples with at least 4096 tokens of instances in base model dataset for the 8K context window model<SEP>A large collection of chat datasets for the Chat model<SEP>Instructions following for the Instruct model |
organization | MosaicML | MosaicML |
application | Multiple tasks similar to LLaMA such as reasoning or code generation | Multiple text generation tasks. This model demonstrates coding ability better than its predecessor's and even than some models specifically finetuned for code.
has source code | https://huggingface.co/mosaicml/mpt-7b | https://huggingface.co/mosaicml/mpt-30b |
blog post | https://www.youtube.com/watch?v=KSlWkrByc0o&t=9s<SEP>https://www.mosaicml.com/blog/mpt-7b | https://www.mosaicml.com/blog/mpt-30b |
license | Apache 2.0 | Apache 2.0 |
DeepMind's compute-optimal Chinchilla and other models
In a seminal 2022 paper called "Training Compute-Optimal Large Language Models", DeepMind carried out a detailed study of the performance of language models of various sizes and quantities of training data. The goal was to find the optimal number of parameters and volume of training data for a given compute budget. This paper is often referred to as the Chinchilla paper. Among its findings, it hints that many of the 100B-parameter LLMs like GPT-3 may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and under-trained, so that they would benefit from seeing more training data. The authors hypothesized that smaller models may be able to achieve the same performance as much larger ones if they are trained on larger datasets.
One important takeaway from the Chinchilla paper is that the optimal training dataset size for a given model is about 20 times larger than the number of parameters in the model; Chinchilla itself was trained to be compute-optimal. For a 70-billion-parameter model, the ideal training dataset contains 1.4 trillion tokens, or 20 times the number of parameters. By this measure, 175B-parameter models like GPT-3, OPT, and BLOOM were trained on datasets that are smaller than the Chinchilla-optimal size and may actually be under-trained. In contrast, LLaMA 65B was trained on 1.4 trillion tokens, close to the Chinchilla-recommended number. Another important result from the paper is that the compute-optimal Chinchilla model outperforms non-compute-optimal models such as GPT-3 on a large range of downstream evaluation tasks. With the results of the Chinchilla paper in hand, teams have recently started to develop smaller models that achieve similar, if not better, results than larger models trained in a non-optimal way. Moving forward, one can probably expect a deviation from the "bigger is always better" trend of the last few years as more teams and developers optimize their model designs. The rule of thumb itself is simple enough to state directly, as shown below.
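The tiny sketch below applies the roughly 20-tokens-per-parameter heuristic quoted above; the function name and the use of 20 as a default constant are the only assumptions, and the printed figures simply restate the numbers already discussed.

```python
# Worked example of the Chinchilla rule of thumb: ~20 training tokens per parameter.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return tokens_per_param * n_params

for name, params in [("Chinchilla-70B", 70e9), ("GPT-3-175B", 175e9)]:
    print(f"{name}: ~{chinchilla_optimal_tokens(params) / 1e12:.1f}T tokens")
# Chinchilla-70B: ~1.4T tokens  (matches the dataset it was actually trained on)
# GPT-3-175B: ~3.5T tokens      (far more than the ~500B tokens GPT-3 actually saw)
```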
A Catalog of DeepMind's LLMs including their seminal Chinchilla model
In their 2022 paper "Training Compute-Optimal Large Language Models", often called the "Chinchilla paper", researchers at DeepMind explored the ideal size and training data volume for language models within a set compute budget. The study suggested that many 100B-parameter models, like GPT-3, might be over-parameterized and under-trained, and proposed that smaller models could match the performance of larger ones if trained on more extensive datasets. The Chinchilla paper highlights that the best training dataset size for a model is roughly 20 times its parameter count; for instance, a 70B-parameter model needs 1.4 trillion tokens for optimal training. Models like GPT-3, OPT, and BLOOM, with 175B parameters, were trained on suboptimal dataset sizes, potentially under-training them. In comparison, LLaMA 65B was trained near Chinchilla's recommended size. Notably, the compute-optimal Chinchilla model surpassed models like GPT-3 in various tasks. Given these findings, there is a shift towards developing efficient smaller models, challenging the previous "bigger is always better" trend. This comparison characteristically describes Chinchilla as well as other models from DeepMind on certain salient properties of LLMs.
has model | AlphaFold | Gopher | Chinchilla | GopherCite | Flamingo | Gato | Sparrow |
model family | SE(3)-Transformer | GPT | GPT | Gopher | Chinchilla | “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | GPT |
date created | 2019-09-01 | 2021-12-01 | 2022-03-01 | 2022-03-01 | 2022-04-01 | 2022-05-01 | 2022-09-01 |
innovation | The main innovation of "Highly accurate protein structure prediction with AlphaFold" is the creation of AlphaFold, a deep learning model that accurately predicts protein structures. This extends the capabilities of Large Language Models by showcasing their potential to solve complex scientific challenges beyond language-related tasks, marking a significant advancement in the intersection of machine learning and biology. | The paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" emphasizes the benefits and limitations of scaling LLMs. It highlights significant performance advances with data quality and scale but notes uneven gains across tasks, especially in mathematical reasoning. The study also delves into the impact of scale on toxicity and bias in model outputs. | The paper "Training Compute-Optimal Large Language Models" introduces a methodology to optimally balance model size and training tokens under a fixed compute budget. The findings emphasize equal scaling of parameters and training tokens, highlighting the importance of high-quality dataset scaling. The research also underscores ethical concerns with training on vast datasets sourced from the web. | GopherCite, in the context of large language models, is designed to support its answers with verified quotes from sources. It employs reinforcement learning with unique training techniques and can decline to answer questions if uncertain about the quality of its response. This approach differentiates it from other models by emphasizing evidence-backed answers and selective response generation. | Flamingo is a Visual Language Model designed to bridge powerful pretrained vision-only and language-only models, allowing it to handle sequences of interleaved visual and textual data. It's trained on large-scale multimodal web corpora with interleaved text and images, enabling rapid adaptation to new tasks using few-shot learning. This approach allows Flamingo to outperform models fine-tuned on significantly more task-specific data. | Gato is a multi-modal, multi-task agent inspired by Large Language Models, capable of handling diverse tasks like playing games, captioning images, and robotics using a single neural network. It employs a unique tokenization approach to process varied data types, from text to images. This innovation allows Gato to generalize across a vast range of tasks, setting a new standard in the realm of LLMs. | Sparrow, developed by DeepMind, introduces innovations in the context of LLMs by utilizing targeted human judgments for alignment. It employs a unique approach of using a single external knowledge fragment for evidence and focuses on breaking down goals into detailed rules, enhancing the model's helpfulness and accuracy in responses. The model also integrates techniques like self-play, search, and fine-grained rules to shape its behavior. |
pretraining architecture | Encoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder |
pretraining task | Protein structure (folding) prediction | Causal language modeling | Causal language modeling | Causal language modeling | Log likelihood of text given some visual input | CLM (where tokens are either text or agent actions) | Causal language modeling
training corpus | 170,000 proteins from a public repository of protein sequences and structures | MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, Github, News, C4, and Wikipedia) | 1.4 trillion training tokens from MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, Github, News, C4, and Wikipedia) | Same as Gopher plus specific dataset generated in the RLHP process | MultiModal MassiveWeb (M3W): 185 million images and 182 GB text + a number of text paired with image datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average) | 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot) | Same as Chinchilla + interactive data gathering with human annotators during the RLHF process
optimizer | Adam optimizer | AdamW | AdaFactor | AdamW | Adam optimizer | AdaFactor | |
tokenization | sentencepiece | a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation | sentencepiece | sentencepiece |||
number of parameters | Base = 12M, Large = 18M, XLarge = 60M | 44M, 117M, 417M, 1.4B, 7.1B, and 280B | 70B | 280B | 80B (largest) | 79M, 364M, and 1.18B | 70B
maximum number of parameters (in million) | 60 | 280000 | 70000 | 280000 | 80000 | 1180 | 70000 |
hardware used | TPUv3 | TPUv4<SEP>TPUv3 | TPUv3 | TPUv4 | TPUv3 | ||
hardware information | We built our training and evaluation codebase with JAX and Haiku. In particular, we use JAX’s pmap transformation to efficiently express both data and model parallelism. We trained and evaluated all models on TPUv3 chips. The half-precision parameters and single-precision Adam state for Gopher occupy 2.5 TiB, which far exceeds the 16 GiB of memory available on each TPUv3 core. To address these memory concerns, we use optimiser state partitioning, model parallelism, and rematerialisation to partition the model state and reduce the activations so that they fit in TPU memory. | All models in this analysis have been trained on TPUv3/TPUv4 with JAX and Haiku. | shard the networks across 128 TPU v3 machines. | Our model and associated infrastructure were implemented using JAX and Haiku. All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. Megatron type sharding is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers were unsharded. ZeRO stage 1 is used to shard the optimizer state. | shard the models across 64 TPU v3 machines | |
extension | The original Alphafold used a BERT-style transformer. The details of Alphafold’s Transformer are not known, but it is believed it is an extension of the SE(3)-Transformer, a 3-D equivariant Transformer (see this blog post). | Same as GPT-2 but use RMSNorm instead of LayerNorm and relative positional encoding rather than absolute | Same as Gopher but with optimizations to reduce model size and therefore training/inference time with equal or superior performance | GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn whether a response is not only plausible but also supported | It uses a frozen textual language model (like Chinchilla) conditioned on the visual representation, which is encoded from a Normalizer-Free ResNet | The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus add position encodings to add spatial information when applicable. | Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning with Human Feedback). It also adds inline evidence a la GopherCite
organization | Deepmind | Deepmind | Deepmind | Deepmind | Deepmind | Deepmind | Deepmind |
application | Protein folding | Mostly Language Modeling and NLU, but also extensible like GPT | Same as Gopher/GPT3 | Dialog systems, Q&A, general language generation tasks | Visual language tasks such as image captioning and visual question answering (image and text to text) | Gato presents a generalizable agent that can be used beyond text to tasks such as playing Atari or controlling a robot arm. | Dialog agents and general language generation applications like Q&A
has source code | https://github.com/deepmind/alphafold | https://github.com/lucidrains/flamingo-pytorch | https://github.com/OrigamiDream/gato | ||||
blog post | https://www.deepmind.com/publications/highly-accurate-protein-structure-prediction-with-alphafold<SEP>https://fabianfuchsml.github.io/alphafold2/ | https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval | https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510<SEP>https://medium.com/mlearning-ai/language-models-need-proper-training-c71484727f00 | https://www.deepmind.com/blog/gophercite-teaching-language-models-to-support-answers-with-verified-quotes | https://medium.com/geekculture/3-overlooked-things-deepminds-flamingo-a-large-model-for-computer-vision-84cd9d2f738c<SEP>https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model | https://www.deepmind.com/blog/a-generalist-agent<SEP>https://www.deepmind.com/publications/a-generalist-agent | https://www.deepmind.com/blog/building-safer-dialogue-agents<SEP>https://medium.com/to-cut-a-long-paper-short/sparrow-improving-alignment-of-dialogue-agents-via-targeted-human-judgments-e0876402d800
license | Apache 2.0 | closed source | closed source | closed source | closed source | closed source | closed source |
MetaAI's LLaMA
In the spirit of building LLMs, Meta AI first introduced LLaMA-1, which was released under a non-commercial license. In subsequent work, Meta AI introduced the open-sourced Llama 2 and Llama 2-Chat. Crucially, both of these models support community development, marking a significant stride towards fostering transparency and promoting the development of more responsible, replicable LLMs. Llama 2 models are trained on 2 trillion tokens and have double the context length of Llama 1, and the Llama-2-Chat models have additionally been trained on over 1 million new human annotations. Furthermore, compared to Llama 1, for Llama 2 Meta performed more robust data cleaning, updated the data mixes, trained on 40% more total tokens, doubled the context length, and leveraged grouped-query attention (GQA) to improve inference scalability. Evaluations of open-source models, including Llama 1, Llama 2 base models, MPT (MosaicML), and Falcon, on standard academic benchmarks for commonsense reasoning, world knowledge, reading comprehension, math, MMLU, BBH, and AGI Eval indicate that Llama 2 outperforms the other models, including Llama 1. A sketch of the GQA mechanism follows, after which the comparison showcases Llama 1 versus 2 on their essential characteristics.
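The NumPy sketch below is a minimal illustration of GQA: several query heads share one key/value head, which is what shrinks the key/value cache that must be kept around during inference. The head counts, shapes, and function names are toy assumptions and do not reflect Llama 2's actual configuration.

```python
# Toy grouped-query attention: n_q query heads, only n_kv stored key/value heads.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q, seq, d); k, v: (n_kv, seq, d) with n_q divisible by n_kv."""
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv
    outputs = []
    for h in range(n_q):
        kv = h // group                                   # several query heads map to one kv head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v[kv])
    return np.stack(outputs)                              # (n_q, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))                           # 8 query heads
k = rng.normal(size=(2, 4, 16))                           # only 2 key/value heads are cached
v = rng.normal(size=(2, 4, 16))
print(grouped_query_attention(q, k, v).shape)             # (8, 4, 16)
```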
A Catalog of Meta's LLaMA Models
In the spirit of building LLMs, Meta AI first introduced LLaMA-1, which was released under a non-commercial license. In subsequent work, Meta AI introduced the open-sourced Llama 2 and Llama 2-Chat. Crucially, both Llama 2 and Llama 2-Chat are released under a license that authorizes commercial use, marking a significant stride towards fostering transparency and promoting the development of more responsible, replicable LLMs. Evaluation comparisons of open-source models, including Llama 1, Llama 2 base models, MPT (MosaicML), and Falcon, on standard academic benchmarks indicate that Llama 2 outperforms the other models, including Llama 1. This comparison showcases Llama 1 versus 2 on their essential characteristics.
has model | LLaMA | Llama 2 |
model family | LLaMa | LLaMa |
date created | 2023-02-27 | 2023-07-18 |
pretraining architecture | Decoder | Decoder |
pretraining task | Language Modeling | Self-supervised Learning
training corpus | Stack Exchange<SEP>ArXiv<SEP>Gutenberg and Books3<SEP>Wikipedia<SEP>Github<SEP>C4<SEP>CommonCrawl<SEP>approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328GB Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%) | No specific information available other than "A new mix of publicly available online data"
optimizer | AdamW | AdamW |
tokenization | the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece; notably, all numbers are split into individual digits, with a fallback to bytes to decompose unknown UTF-8 characters | same as Llama-1, which uses the BPE algorithm
number of parameters | 6.7B, 13.0B, 32.5B, and 65.2B | 7B, 13B, 34B, 70B |
hardware used | A100-80GB GPU | A100-80GB GPU |
hardware information | When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. | models were pretrained on Meta’s Research Super Cluster (RSC) as well as internal production clusters, both of which use NVIDIA A100s. |
fine-tuning task | instructions | both instruction tuning and reinforcement learning from human feedback for Llama-2-Chat |
extension | LLaMA uses a Transformer architecture, with extensions: Pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through efficient implementation of the causal multi-head attention, checkpointing to reduce the amount of activations that are recomputed during the backward pass, model and sequence parallelism to reduce memory usage of the model, and uses 1.4T BPE tokens after tokenization. | Llama-2-Chat
organization | Meta AI | Meta AI |
application | Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning, Code generation, Reading comprehension, and multitask understanding. | Research on LLMs demonstrating multitask abilities such as reading comprehension, commonsense reasoning, math and coding abilities
has source code | https://huggingface.co/docs/transformers/main/model_doc/llama<SEP>https://github.com/facebookresearch/llama | https://github.com/facebookresearch/llama |
blog post | https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ | https://ai.meta.com/blog/llama-2/<SEP>https://ai.meta.com/resources/models-and-libraries/llama/ |
license | Limited, Non-commercial bespoke license | Llama 2 Community License Agreement |
carbon emitted (tco2eq) | 14 for LLaMA-7B, 23 for LLaMA-13B, 90 for LLaMA-33B, and 173 for LLaMA-65B | 31.22 for Llama-2-7B, 62.44 for Llama-2-13B, 153.90 for Llama-2-34B, and 291.42 for Llama-2-70B |
Llama-1, Alpaca, and Vicuna
With major industry players introducing open-source models, the academic research world was not far behind. Before going into further details, note that a general difference between academic and industry players is that, more often than not, academics have access to far less compute than their industry counterparts.
In the spirit of fostering academic research on LLMs, a group of researchers at Stanford University released the first finetuned LLM of its kind, called Alpaca. They identify two important parts to creating an instruction-following model in the spirit of OpenAI's ChatGPT: 1) a strong pretrained base model to finetune, for which they selected LLaMA-1-7B, deemed most effective at the time; and 2) a highly effective finetuning dataset of instruction-following demonstrations, for which they generated 52K demonstrations using the self-instruct method with OpenAI's text-davinci-003. Alpaca is thus a language model fine-tuned with supervised learning from a LLaMA 7B model on 52K instruction-following demonstrations generated from OpenAI's text-davinci-003; a sketch of how such a demonstration is turned into a training example follows.
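The sketch below shows how one such instruction-following record can be turned into a supervised finetuning example. The prompt template is a simplified stand-in for the one used in the Stanford Alpaca repository (not a verbatim copy), and the record itself is a made-up illustration.

```python
# Turn an instruction-following record into a prompt/completion pair for SFT.
def format_example(record: dict) -> dict:
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
    )
    if record.get("input"):                       # some records carry extra context
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": record["output"]}

example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}
print(format_example(example)["prompt"])
```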
Inspired by Llama-1 and Alpaca, a team of students from UC Berkeley, in collaboration with UC San Diego and CMU, released the Vicuna chat model at a training cost of only around 300 dollars. Still based on the Llama models in several parameter sizes, Vicuna was finetuned on user conversations from the ShareGPT platform: the team collected 70K conversations from ShareGPT.com, a website where users can share their ChatGPT conversations. The team also made several improvements to the training recipe, including memory optimizations to enable Vicuna's understanding of long context. They also adjusted the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot's output, as sketched below. After fine-tuning Vicuna on the 70K user-shared ChatGPT conversations, it was found that Vicuna was capable of generating more detailed and well-structured answers compared to Alpaca, with quality on par with ChatGPT.
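The loss-masking idea can be sketched as follows: tokens from the human side of the conversation are assigned an ignore label so that only the assistant's tokens contribute to the cross-entropy loss. The helper names and the character-level toy tokenizer are assumptions for illustration, not Vicuna's actual training code.

```python
# Mask user turns so the finetuning loss is computed only on the chatbot's output.
IGNORE_INDEX = -100   # the label value ignored by PyTorch's cross-entropy loss

def build_labels(turns, tokenize):
    """turns: list of (role, text); returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(text)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # train on the assistant's tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask the human side
    return input_ids, labels

toy_tokenize = lambda s: [ord(c) for c in s]          # stand-in for a real tokenizer
conversation = [("user", "Hi!"), ("assistant", "Hello, how can I help?")]
ids, labels = build_labels(conversation, toy_tokenize)
print(sum(l != IGNORE_INDEX for l in labels), "of", len(labels), "tokens contribute to the loss")
```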
In a nutshell, Meta's Llama models are proving very effective in influencing academic research on optimizing LLMs via finetuning.
A Catalog of LLM's of Camelid Species: Llama-1, Alpaca, and Vicuna
This comparison showcases academically developed Alpaca from Stanford University and Vicuna from Berkeley in collaboration with other universities. Both models are finetuned variants based on Llama-1 developed by Meta AI.
has model | LLaMA | Alpaca | Vicuna |
model family | LLaMa | LLaMa | LLaMa |
date created | 2023-02-27 | 2023-03-01 | 2023-03-30 |
pretraining architecture | Decoder | Decoder | Decoder |
number of parameters | 6.7B, 13.0B, 32.5B, and 65.2B | 7B, 13B, 33B, 65B | 7B, 13B, 33B |
hardware used | A100-80GB GPU | A100-80GB GPU | A100-80GB GPU |
hardware information | When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. | fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. | the training was done with PyTorch FSDP on 8 A100 GPUs in one day |
extension | LLaMA uses a Transformer architecture, with extensions: Pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through efficient implementation of the causal multi-head attention, checkpointing to reduce the amount of activations that are recomputed during the backward pass, model and sequence parallelism to reduce memory usage of the model, and uses 1.4T BPE tokens after tokenization. | Alpaca is fine-tuned from a 7B LLaMA model | stable-vicuna-13b-delta<SEP>Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning.
fine-tuning task | instructions | 52K instruction-following data generated from OpenAI’s text-davinci-003 using self-instruct mechanism, from 175 human-written instruction-output pairs. | 70K user-shared conversations gathered from ShareGPT.com with public APIs |
organization | Meta AI | Stanford University | Large Model Systems Organization |
application | Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning, Code generation, Reading comprehension, and multitask understanding. | Evaluated on a variety of text generation and classification tasks. | chatbot
has source code | https://huggingface.co/docs/transformers/main/model_doc/llama<SEP>https://github.com/facebookresearch/llama | https://github.com/tatsu-lab/stanford_alpaca | https://chat.lmsys.org/<SEP>https://huggingface.co/lmsys/vicuna-33b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-13b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-7b-v1.3<SEP>https://github.com/lm-sys/FastChat |
blog post | https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ | https://medium.com/version-1/stanford-alpaca-a-small-yet-mighty-language-model-for-instruction-following-tasks-af9e92e87d9a | https://lmsys.org/blog/2023-03-30-vicuna/ |
license | Limited, Non-commercial bespoke license | Limited, non-commercial license CC-BY-NC-SA-4.0 | Non-commercial license |
optimizer | AdamW | ||
pretraining task | Language Modeling | ||
tokenization | the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece; notably, all numbers are split into individual digits, with a fallback to bytes to decompose unknown UTF-8 characters | |
training corpus | Stack Exchange<SEP>ArXiv<SEP>Gutenberg and Books3<SEP>Wikipedia<SEP>Github<SEP>C4<SEP>CommonCrawl<SEP>approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328GB Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%)
Berkeley AI Research's OpenLLaMA
The gap between commercial and non-commercial LLMs is rapidly narrowing. This is owing to open-sourced LLM development research efforts from Google with the T5 series, Meta with the LLaMA series, and now OpenLLaMA developed by researchers Xinyang Geng and Hao Liu from Berkeley AI Research.
With Google and Meta's models discussed above, this section presents the OpenLLaMA series engineered as a fully open-source model based on LLaMA-1. OpenLLaMA exhibits comparable performance to the original LLaMA and GPT-J across a majority of tasks, and outperforms them in some tasks.
A Catalog of BerkeleyAI's OpenLLaMA series
This comparison showcases OpenLLaMA, a permissively licensed open-source reproduction of Meta AI’s LLaMA-1. OpenLLaMA is released in two versions, with the second version improving on the first; both exhibit comparable performance to the original LLaMA-1 and GPT-J across a majority of tasks and even outperform them on some.
has model | OpenLLaMA v1 | OpenLLaMA v2 |
model family | LLaMa | LLaMa |
date created | 2023-06-16 | 2023-07-16 |
pretraining task | Causal language modeling | Causal language modeling |
pretraining architecture | Decoder | Decoder |
number of parameters | 3B, 7B, 13B | 3B, 7B |
optimizer | AdamW | AdamW |
training corpus | We train our models on the RedPajama dataset released by Together, which is a reproduction of the LLaMA training dataset containing over 1.2 trillion tokens.<SEP>RedPajama | The v2 models are trained on a mixture of the Falcon refined-web dataset, the StarCoder dataset and the wikipedia, arxiv, book and stackexchange part of the RedPajama dataset.<SEP>Starcoder<SEP>Falcon refined-web |
extension | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. |
hardware used | cloud TPU-v4 | cloud TPU-v4 |
organization | Berkeley AI Research | Berkeley AI Research |
application | same as LLaMA | same as LLaMA |
blog post | https://github.com/openlm-research/open_llama | https://github.com/openlm-research/open_llama |
has source code | https://huggingface.co/openlm-research/open_llama_13b<SEP>https://huggingface.co/openlm-research/open_llama_7b<SEP>https://huggingface.co/openlm-research/open_llama_3b | https://huggingface.co/openlm-research/open_llama_7b_v2<SEP>https://huggingface.co/openlm-research/open_llama_3b_v2 |
license | Apache 2.0 | Apache 2.0 |
OpenAI's GPT
OpenAI's GPT (Generative Pre-trained Transformer) models have undergone remarkable advancements, evolving from GPT-1's foundational contextual text generation to the groundbreaking capabilities of GPT-4. Beginning with GPT-1, these models harnessed the power of the transformer architecture and large-scale pre-training on diverse text sources to generate coherent and contextually relevant human-like text. GPT-2 pushed boundaries by demonstrating more coherent long-form text generation, albeit raising concerns about misuse. GPT-3 further amplified this prowess, showcasing its capacity for a wide array of tasks and its uncanny ability to understand and produce human-like language. GPT-4 represents a culmination of these developments, featuring multimodal input handling, even more refined context comprehension, nuanced responses, and enhanced attention to harm, safety, and helpfulness criteria for LLMs.
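Unlike the open-source models above, the later GPT models are accessed through OpenAI's API rather than downloaded. The sketch below uses the chat completions interface of the openai Python package as it existed at the time of writing (pre-1.0 interface); the API key placeholder, model choice, and prompt are illustrative assumptions.

```python
# Call the chat completions endpoint (openai Python package, pre-1.0 interface).
import openai

openai.api_key = "sk-..."  # replace with a real key from the OpenAI dashboard

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Transformer architecture in one sentence."},
    ],
)
print(response["choices"][0]["message"]["content"])
```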
A Catalog of OpenAI's GPT series
The comparison showcases the GPT series of Large Language Models from GPT-1 to the latest, at the time of creation of this comparison, GPT-4.
has model | GPT-1 | GPT-2 | GPT-3 | InstructGPT | ChatGPT | GPT-4 |
model family | GPT | GPT | GPT | GPT | GPT | GPT |
date created | 2018-06-01 | 2019-02-01 | 2020-05-01 | 2022-01-01 | 2022-11-30 | 2023-03-14 |
pretraining task | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling |
pretraining architecture | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder |
tokenization | byte pair encoding | byte pair encoding | ||||
number of parameters | 117M | 124M, 355M, 774M, 1.5B | 175B | Same as GPT3 | 175B | undisclosed
maximum number of parameters (in million) | 117 | 1500 | 175000 | 175000 | 175000 | undisclosed
training corpus | BookCorpus<SEP>Supervised Finetuning on several task-specific datasets for Natural Language Inference, Question Answering, Sentence similarity, and Classification. | https://github.com/openai/gpt-2-output-dataset<SEP>8 million web pages (40 GB), about 10X the data used for GPT-1. The WebText dataset is created by crawling all links at Reddit with at least 3 Karma points. | ~ 500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B) | Same as GPT3 for pretraining, but finetuned and optimized using labeler data and prompts | Human written prompt and interaction dataset collected through the OpenAI API |
fine-tuning task | Supervised discriminative finetuning | Reinforcement Learning from Human Feedback | Step 3. RLHF using Proximal Policy Optimization<SEP>Step 2. Collect comparison data and train a reward model<SEP>Step 1. Supervised fine-tuning | Reinforcement Learning from Human Feedback<SEP>Rule-Based Reward Model | |
extension | GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.<SEP>Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, or increased context size from 512 to 1024) | Same as GPT-2 with the only addition of alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer | GPTInstruct starts off with a pretrained GPT3 model and adds reward modeling through reinforcement learning after a supervised finetuning | a large-scale, multimodal model which can accept image and text inputs and produce text outputs | |
innovation | It can generate up to 768 words (equivalent to 1.5 pages)<SEP>demonstrate that language models begin to learn tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a task-agnostic, diverse dataset of millions of web-scraped webpages. The work proposes a central research question: do WebText LMs transfer well across domains and datasets? | It can generate up to 1,536 words (equivalent to 3 pages) | Better alignment of LLMs with human expectations using reinforcement learning through human feedback | It can generate up to 3000 words (equivalent to 6 pages)<SEP>Supports input context length of 2048 tokens | It can generate up to 24000 words (equivalent to 48 pages)<SEP>Supports input context length between 8192 and 32,768 tokens depending on the model version |
organization | OpenAI | OpenAI | OpenAI | OpenAI | OpenAI | OpenAI |
application | Text generation, but adaptable to many other NLP tasks when fine tuned. | Text generation, but adaptable to many other NLP tasks when fine tuned. | Initially text generation, but has over time been used for a large range of applications in areas such as code generation, but also image and audio generation | Knowledge-intensive dialog or language tasks | provide human-like conversational interactions and assist users in answering questions, generating text, providing recommendations, and engaging in natural language conversations. | Creating highly realistic and contextually accurate human-like text generation |
blog post | https://medium.com/@hyponymous/paper-summary-improving-language-understanding-by-generative-pre-training-7b77babd7086<SEP>https://www.reddit.com/r/MachineLearning/comments/n36htr/p_gpt1_annotated_paper_paper_summary/ | https://openai.com/research/better-language-models<SEP>https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface | https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122<SEP>https://openai.com/blog/gpt-3-apps | https://sh-tsang.medium.com/review-instructgpt-training-language-models-to-follow-instructions-with-human-feedback-7fce4bf9059a<SEP>https://openai.com/research/instruction-following | https://openai.com/blog/chatgpt | https://openai.com/research/gpt-4 |
has source code | https://github.com/openai/finetune-transformer-lm<SEP>https://huggingface.co/docs/transformers/model_doc/openai-gpt | https://huggingface.co/docs/transformers/model_doc/gpt2 | https://platform.openai.com/docs/models/gpt-3-5<SEP>https://github.com/openai/gpt-3 | https://github.com/openai/following-instructions-human-feedback | ||
license | closed source | closed source | closed source | Closed source, accessible through API | Closed source, accessible through API | Closed source, accessible through API |
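The three-step alignment recipe listed above for InstructGPT-style models (supervised fine-tuning on demonstrations, reward modeling from human comparisons, then reinforcement learning against that reward model) can be summarized in a toy sketch. The snippet below is purely illustrative: every function is a hypothetical stand-in rather than OpenAI code or API calls, and the PPO step of the original recipe is replaced by a simple reranking over candidates to keep the example runnable.

```python
# Toy sketch of the three-step InstructGPT-style alignment recipe listed in the
# table above. All "models" are simple Python stand-ins, not real neural networks.
from typing import Callable, List, Tuple

# Step 1: supervised fine-tuning (SFT) on human-written demonstrations.
def supervised_finetune(demonstrations: List[Tuple[str, str]]) -> Callable[[str], str]:
    lookup = dict(demonstrations)  # toy "policy": memorize the demonstrations
    return lambda prompt: lookup.get(prompt, f"response to: {prompt}")

# Step 2: train a reward model from human preference comparisons.
def train_reward_model(
    comparisons: List[Tuple[str, str, str]]  # (prompt, preferred, rejected)
) -> Callable[[str, str], float]:
    preferred = {(p, good) for p, good, _ in comparisons}
    return lambda prompt, response: 1.0 if (prompt, response) in preferred else 0.0

# Step 3: optimize the policy against the reward model. The original recipe uses
# PPO; here candidate responses are simply reranked to keep the sketch runnable.
def align_policy(policy: Callable[[str], str],
                 reward: Callable[[str, str], float]) -> Callable[[str], str]:
    def aligned(prompt: str) -> str:
        candidates = [policy(prompt), policy(prompt).upper()]  # stand-in for sampling
        return max(candidates, key=lambda r: reward(prompt, r))
    return aligned

demos = [("Explain RLHF briefly.", "RLHF aligns a model with human preferences.")]
policy = supervised_finetune(demos)
reward_model = train_reward_model(
    [("Explain RLHF briefly.",
      "RLHF aligns a model with human preferences.",
      "RLHF ALIGNS A MODEL WITH HUMAN PREFERENCES.")]
)
aligned_policy = align_policy(policy, reward_model)
print(aligned_policy("Explain RLHF briefly."))  # prints the preferred response
```

In the actual InstructGPT and ChatGPT pipelines each of these stand-ins is a large neural model, and the final step uses Proximal Policy Optimization rather than reranking.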
Microsoft's Wizard LLM series
WizardLM is an LLM based on LLaMA, trained on complex instruction data using a new method called Evol-Instruct. By using AI to "evolve" instructions, WizardLM outperforms similar LLaMA-based LLMs trained on simpler instruction data. On July 6, 2023, WizardLM V1.1 was released with significantly improved performance. Beyond the generalist LLM, WizardLM is also released as coding (WizardCoder) and math-solver (WizardMath) variants, which can handle more complex scenarios than previously released specialized models thanks to their fine-tuning on complex instructions (the Evol-Instruct loop is sketched below).
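The core of Evol-Instruct is an iterative loop in which an existing LLM rewrites instructions into progressively harder or more diverse variants, and the evolved instructions are merged back into the fine-tuning pool. The sketch below is a minimal illustration of that loop; the prompt template and the `call_llm` placeholder are assumptions for illustration, not the WizardLM authors' code, and a real implementation would query a model such as ChatGPT and filter out failed evolutions.

```python
# Illustrative sketch of an Evol-Instruct-style loop: each round asks an LLM to
# rewrite existing instructions into harder variants, then merges the evolved
# instructions back into the pool used for supervised fine-tuning.
# `call_llm` is a hypothetical stand-in for an API call (e.g., to ChatGPT).
from typing import List

IN_DEPTH_TEMPLATE = (
    "Rewrite the following instruction so it requires deeper reasoning, "
    "without changing its topic:\n{instruction}"
)

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would query an instruction-following LLM.
    return prompt.split("\n", 1)[-1] + " (explain your reasoning step by step)"

def evolve(instruction: str) -> str:
    return call_llm(IN_DEPTH_TEMPLATE.format(instruction=instruction))

def evol_instruct(seed_instructions: List[str], rounds: int = 3) -> List[str]:
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        frontier = [evolve(inst) for inst in frontier]
        pool.extend(frontier)  # merge evolved data with all previous rounds
    return pool                # the final pool is used for supervised fine-tuning

if __name__ == "__main__":
    seeds = ["Summarize the plot of Hamlet.", "Write a function that reverses a list."]
    for instruction in evol_instruct(seeds, rounds=2):
        print(instruction)
```

WizardCoder and WizardMath apply the same kind of loop to code and math seed instructions, respectively.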
A Catalog of the Wizard LLM series
This comparison juxtaposes all Wizard LLM variants based on their salient properties.
has model | WizardLM | WizardCoder | WizardMath |
model family | LLaMa | StarCoder | LLaMa |
date created | 2023-04-24 | 2023-06-14 | 2023-04-24 |
innovation | The paper introduces the Evol-Instruct methodology, allowing Large Language Models (LLMs) to autonomously generate diverse instructional data. This innovation shifts away from traditional human-generated instructions, leading to more robust models. The resulting model, WizardLM, demonstrated superior performance, highlighting the effectiveness of this approach. | The paper introduces WizardCoder, a model that empowers Code Large Language Models with complex instruction fine-tuning using the Evol-Instruct method tailored for code. This innovation allows WizardCoder to surpass other open-source Code LLMs and even outperform major closed LLMs on key benchmarks. The model fills a gap in the field by emphasizing instruction fine-tuning specifically for the code domain. | The main innovation of WizardMath in the context of Large Language Models (LLMs) is its enhanced mathematical reasoning capabilities. Through a case study, the model demonstrated its ability to solve complex problems using step-by-step reasoning, showcasing its advancement over other LLMs in mathematical tasks. |
pretraining architecture | Decoder | Decoder | Decoder |
fine-tuning task | supervised open-domain complex instruction finetuning | supervised complex code-based instruction finetuning | Reinforcement Learning from Evol-Instruct Feedback |
training corpus | 250K instructions auto-generated via the Evol-Instruct method using the OpenAI ChatGPT API | Initialized with the 20K instruction-following dataset called Code Alpaca, the Evol-Instruct technique is iteratively applied to this dataset to produce evolved data. After each round of data evolution, the evolved data from all previous rounds is merged with the original dataset to finetune StarCoder | Evolve the original math (GSM8k + MATH) instructions over 8 turns, increasing the data size from 15k to 96k<SEP>Few-shot re-generate 15k answers for GSM8k and MATH with an Alpha version of the WizardLM 70B model to produce solutions in a step-by-step format, keep those with a correct answer, and use this data to finetune the base Llama model<SEP>To enhance the model's ability to adhere to natural and diverse instructions, 1.5k open-domain conversations from WizardLM's training data are sampled and merged with the above math corpus as the final supervised finetuning training data |
number of parameters | 7B | 1B, 3B, 7B, 13B, 15B, and 34B | 7B, 13B, and 70B |
maximum number of parameters (in million) | 7000 | 34000 | 70000 |
extension | Evol-Instruct | Code Evol-Instruct | Enhances the mathematical reasoning abilities of the open-source pretrained large language model Llama-2 |
organization | Microsoft | Microsoft | Microsoft |
application | assistant in foundational NLP tasks such as reasoning, multi-domain and multi-genre QA, code generation, etc. | code understanding, code generation, and instruction fine-tuning for code tasks; tested on benchmarks (HumanEval, HumanEval+, MBPP, DS-1000), generating accurate code responses with clear explanations. | surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B, Llama-1 65B, Falcon-40B, MPT-30B, Baichuan-13B Chat and ChatGLM2 12B, on both GSM8k and MATH. |
blog post | https://medium.com/@preangelleo/wizardlm-enhancing-large-language-models-with-ai-evolved-instructions-7fd4425afe80 | https://ollama.ai/blog/wizardmath-examples | |
has source code | https://github.com/nlpxucan/WizardLM | https://github.com/nlpxucan/WizardLM | https://github.com/nlpxucan/WizardLM |
license | Non-commercial license | Non-commercial license | Non-commercial license |
hardware information | Trained on 8 V100 GPUs with DeepSpeed ZeRO-3 for 70 hours over 3 epochs (an illustrative ZeRO-3 configuration is sketched after this table). | |
hardware used | Nvidia V100 GPU | ||
optimizer | Adam optimizer |
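The hardware row above mentions training on 8 V100 GPUs with DeepSpeed ZeRO-3. As a rough illustration of what such a setup involves, the snippet below writes out a minimal ZeRO Stage-3 configuration; all values are placeholders chosen for illustration and are not the configuration used by the WizardLM authors.

```python
# A minimal sketch of a DeepSpeed ZeRO Stage-3 configuration of the kind that could
# back a multi-GPU fine-tuning run like the one described above.
# All values are illustrative placeholders, not the WizardLM authors' settings.
import json

ds_config = {
    "train_batch_size": 64,                 # global batch size across all 8 GPUs
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},              # mixed precision on V100-class hardware
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 2e-5, "betas": [0.9, 0.999], "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 3,                         # partition optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# The resulting JSON file would then be handed to whatever DeepSpeed integration the
# training script uses (launcher flag or Trainer argument; details vary by setup).
```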
A Comprehensive Catalog of Large Language Models (LLMs)
This last section brings together 87 different transformer-based language models in a comprehensive comparison of their characteristic properties.
A Catalog of Transformer Models
In the past few years we have seen the meteoric appearance of dozens of models of the Transformer family. This ORKG comparison contrasts models from different families, including GPT, T5, and LLaMA, on salient aspects of their innovations and differences. The comparison is inspired by the paper "Transformer models: an introduction and catalog" by Xavier Amatriain, while seeking to be an extendable list of both properties and new contributions as transformer models continue to be introduced. A minimal sketch of how such a property-by-model catalog can be represented programmatically is shown below, followed by the full comparison.
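Because the comparison is essentially a property-by-model matrix, it can also be handled programmatically. The sketch below shows one possible representation as typed records (an assumption for illustration, not ORKG's actual data model), using a few values taken from the table that follows.

```python
# A minimal sketch of how the catalog below could be represented and queried in code:
# each model is a record keyed by the same properties used as row headers in the
# comparison. The three example records are abbreviated from the full table.
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class ModelRecord:
    name: str
    family: str
    organization: str
    date_created: date
    pretraining_architecture: str  # "Encoder", "Decoder", or "Encoder/Decoder"

CATALOG: List[ModelRecord] = [
    ModelRecord("GPT-1", "GPT", "OpenAI", date(2018, 6, 1), "Decoder"),
    ModelRecord("BERT", "BERT", "Google", date(2018, 10, 1), "Encoder"),
    ModelRecord("LLaMA", "LLaMa", "Meta AI", date(2023, 2, 27), "Decoder"),
]

# Example query: all decoder-only models released after 2019.
recent_decoders = [
    m.name for m in CATALOG
    if m.pretraining_architecture == "Decoder" and m.date_created.year > 2019
]
print(recent_decoders)  # ['LLaMA']
```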
has model | GPT-1 | BERT | Transformer XL | GPT-2 | XLNet | ERNIE | RoBERTa | ALBERT | CTRL | AlphaFold | BART | DialoGPT | DistilBERT | T5 | XLM-RoBERTa | Pegasus | mBART | ELECTRA | Megatron | GPT-3 | DeBERTa | Big Bird | ViT | DALL-E | Switch | CLIP | GPT-Neo | Swin Transformer | GPT-J | Decision Transformers | Trajectory Transformers | HTLM | Jurassic-1 | Megatron-Turing NLG | Anthropic Assistant | GLaM | GLIDE | Gopher | StableDiffusion | CM3 | LAMDA | InstructGPT | FLAN | Chinchilla | DQ-BART | GopherCite | SeeKer | GLM | T0 | DALL-E 2 | Flamingo | PaLM | GPT-NeoX-20B | Gato | OPT | UL2 | Global Context ViT | Imagen | Minerva | Godel | BLOOM | BlenderBot 3 | AlexaTM 20B | Sparrow | Flan-T5 | Flan-PaLM | Galactica | ChatGPT | E5 | InstructOR | LLaMA | Alpaca | Pythia | GPT-4 | Vicuna | WizardLM | WizardMath | MPT-7B | StarCoderBase<SEP>StarCoder | Falcon | Orca | WizardCoder | OpenLLaMA v1 | MPT-30B | OpenLLaMA v2 | Llama 2 | JAIS |
model family | GPT | BERT | GPT | Transformer XL | BERT | BERT | BERT | SE(3)-Transformer | BERT for encoder, GPT for Decoder | GPT | BERT | T5 | RoBERTa | Transformer | BART | BERT | T5<SEP>BERT<SEP>GPT | GPT | BERT | BERT | BERT | GPT | T5 | Also using Resnet, ViT, and vanilla transformer for text<SEP>CLIP | GPT | ViT | GPT | GPT, Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | GPT, Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | BART | GPT | GPT | Transformer | Transformer | Diffusion models | GPT | Diffusion | HTLM | LaMDA-PT | GPT | LaMDA-PT | GPT | BART | Gopher | GPT (but can extend any family) | GLM (General Language Model) | T5 | GLIDE<SEP>CLIP | Chinchilla | PaLM | GPT | “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | GPT | Transformer | ViT | Diffusion models<SEP>CLIP<SEP>T5 | PaLM | T5<SEP>GPT | GPT | GPT | transformer | GPT | T5 | PaLM | transformer | GPT | BERT | T5 | LLaMa | LLaMa | Pythia | GPT | LLaMa | LLaMa | LLaMa | MosaicPretrainedTransformer (MPT) models | SantaCoder | transformer | LLaMa | StarCoder | LLaMa | MosaicPretrainedTransformer (MPT) models | LLaMa | LLaMa | GPT | ||
date created | 2018-06-01 | 2018-10-01 | 2019-01-01 | 2019-02-01 | 2019-05-01 | 2019-05-01 | 2019-07-01 | 2019-09-01 | 2019-09-01 | 2019-09-01 | 2019-10-01 | 2019-10-01 | 2019-10-01 | 2019-10-01 | 2019-10-01 | 2019-12-01 | 2020-01-01 | 2020-03-01 | 2020-03-01 | 2020-05-01 | 2020-06-01 | 2020-07-01 | 2020-10-01 | 2021-01-01 | 2021-01-01 | 2021-02-01 | 2021-03-01 | 2021-03-01 | 2021-05-01 | 2021-06-01 | 2021-06-01 | 2021-07-01 | 2021-09-01 | 2021-10-01 | 2021-12-01 | 2021-12-01 | 2021-12-01 | 2021-12-01 | 2021-12-01 | 2022-01-01 | 2022-01-01 | 2022-01-01 | 2022-02-08 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-04-01 | 2022-04-01 | 2022-04-01 | 2022-04-01 | 2022-05-01 | 2022-05-01 | 2022-05-01 | 2022-06-01 | 2022-06-01 | 2022-06-01 | 2022-06-01 | 2022-07-01 | 2022-08-01 | 2022-08-01 | 2022-09-01 | 2022-11-01 | 2022-11-01 | 2022-11-01 | 2022-11-30 | 2022-12-01 | 2022-12-01 | 2023-02-27 | 2023-03-01 | 2023-03-13 | 2023-03-14 | 2023-03-30 | 2023-04-24 | 2023-04-24 | 2023-05-05 | 2023-05-09 | 2023-05-25 | 2023-06-05 | 2023-06-14 | 2023-06-16 | 2023-06-22 | 2023-07-16 | 2023-07-18 | 2023-08-30 |
organization | OpenAI | Google | Google<SEP>CMU | OpenAI | Google<SEP>CMU | Pengcheng Lab<SEP>Baidu | Google<SEP>University of Washington | Google | Salesforce | Deepmind | Facebook | Microsoft | Huggingface | Google | Facebook | Google<SEP>Imperial College London | Facebook | Google<SEP>Stanford | NVidia | OpenAI | Microsoft | Google | Google | OpenAI | Google | OpenAI | EleutherAI | Microsoft | EleutherAI | Facebook<SEP>Google<SEP>UC Berkeley | UC Berkeley | Facebook | AI21 | NVidia | Anthropic | Google | OpenAI | Deepmind | EleutherAI<SEP>Stability.ai<SEP>LMU Munich | Facebook | Google | OpenAI | Google | Deepmind | Amazon | Deepmind | Facebook | Tsinghua University | BigScience | OpenAI | Deepmind | Google | EleutherAI | Deepmind | Facebook | Google | NVidia | Google | Google | Microsoft | Huggingface<SEP>Big Science | Facebook | Amazon | Deepmind | Google | Google | Meta | OpenAI | Microsoft | Meta AI<SEP>University of Washington<SEP>University of Hong Kong | Meta AI | Stanford University | EleutherAI | OpenAI | Large Model Systems Organization | Microsoft | Microsoft | MosaicML | BigCode Project | Technology Innovation Institute | Microsoft | Microsoft | Berkeley AI Research | MosaicML | Berkeley AI Research | Meta AI | Cerebras<SEP>Mohamed bin Zayed University of Artificial Intelligence<SEP>Inception |
innovation | The paper introduces a framework for natural language understanding by first using generative pre-training on a diverse corpus and then fine-tuning for specific tasks. This approach improved state-of-the-art results on 9 out of 12 datasets, highlighting the potential of unsupervised learning combined with discriminative tasks. | BERT's primary innovation in Language Model Learning is the "masked language model" (MLM) approach, inspired by the Cloze task. This method masks random tokens in a sentence and trains the model to predict them, enabling bidirectional context understanding. | Transformer-XL introduces a segment-level recurrence mechanism and a novel positional encoding scheme to overcome the fixed-length context limitations of traditional Transformers. This allows it to capture dependencies 80% longer than RNNs and 450% longer than vanilla Transformers, addressing context fragmentation and improving efficiency in language modeling. | It can generate upto 768 words (equivalent to 1 1/2 page)<SEP>demonstrate that language models begin to learn tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a task-agnostic, diverse dataset of millions of web-scraped webpages. The work proposes a central research question: do WebText LMs transfer well across domains and datasets? | XLNet introduces a generalized autoregressive pretraining method that captures bidirectional context by considering all possible permutations of the factorization order. This approach overcomes BERT's limitations related to data corruption and token independence. Additionally, XLNet integrates techniques from Transformer-XL and offers architectural improvements for permutation-based modeling. | ERNIE innovatively incorporates knowledge from knowledge graphs (KGs) into language representation models. It fuses lexical, syntactic, and knowledge information, enabling enhanced performance on knowledge-driven tasks. This approach sets ERNIE apart from traditional models like BERT, which primarily rely on textual context. | The work introduced RoBERTa, an improved version of BERT, by optimizing design choices: extending training duration, removing the next sentence prediction objective, and dynamically altering the masking pattern. These modifications led RoBERTa to achieve state-of-the-art results on benchmarks like GLUE, RACE, and SQuAD, emphasizing the importance of refining pretraining strategies in Large Language Models. | The main innovation of the work is ALBERT, a language model that improves on existing large models like BERT by employing parameter reduction techniques, such as factorized embeddings and cross-layer parameter sharing. This allows ALBERT to achieve better performance and efficiency on natural language understanding tasks by training larger models with fewer parameters. | The main innovation of the work in the context of LLMs appears to involve advancements in model architecture, training techniques, and multitask learning to enhance the performance, efficiency, and ethical considerations of language models. | The main innovation of "Highly accurate protein structure prediction with AlphaFold" is the creation of AlphaFold, a deep learning model that accurately predicts protein structures. 
This extends the capabilities of Large Language Models by showcasing their potential to solve complex scientific challenges beyond language-related tasks, marking a significant advancement in the intersection of machine learning and biology. | The main innovation of the work is the introduction of BART, a pre-training approach that combines bidirectional and auto-regressive modeling. BART excels in generating coherent text, making it particularly effective for tasks like summarization, leveraging denoising tasks for multi-task learning, and achieving state-of-the-art results in text generation, especially abstractive summarization. | The main innovation of this work is DialoGPT, a large language model designed for open-domain conversations. DialoGPT is trained on a Reddit dataset, allowing it to generate coherent responses in multi-turn dialogues and adapt to different conversational domains through fine-tuning. It addresses context handling, offensive content mitigation, and employs human evaluation to assess its performance and variants. | The main innovation of this work is the creation of DistilBERT, a smaller language model derived from BERT using knowledge distillation during pre-training. It retains 97% of BERT's performance while being 40% smaller and 60% faster, making it efficient for on-device tasks and resource-constrained environments, thus addressing challenges related to large-scale language models' deployment. | The main innovation of Google's T5 language model is its "text-to-text" framework, where various tasks are formulated as converting input text to output text. This unified approach allows T5 to achieve state-of-the-art performance on diverse tasks without task-specific modifications, simplifying training and deployment. This innovation enhances efficiency and effectiveness in real-world applications of large language models. | The main innovation of the work is the development of the XLM-R model, trained on data from 100 languages, achieving superior performance on cross-lingual tasks. The research highlights the challenges of scaling multilingual models and suggests that increasing model capacity can address some limitations. This approach is especially beneficial for low-resource languages. | PEGASUS introduces a novel pre-training objective called Gap Sentence Generation (GSG), where "important" sentences are removed from a document and the model is trained to regenerate them. The method of selecting these principal sentences is based on their relevance to the entire document, measured using the ROUGE1-F1 score. This unique approach tailors the model for abstractive summarization, achieving state-of-the-art performance on various tasks. | mBART introduces a multilingual denoising pre-training method for neural machine translation using a sequence-to-sequence auto-encoder model. Unlike other models like XLM and MASS, mBART pre-trains both the encoder and decoder, making it more adaptable for translation tasks. The model's versatility is further showcased by its ability to handle various levels of multilinguality, from monolingual to 25 languages. | ELECTRA introduces a novel pre-training task called "replaced token detection" where, instead of masking tokens like in BERT, input tokens are replaced with plausible alternatives from a generator network. The model then acts as a discriminator, predicting if each token was replaced or original. 
This approach is more computationally efficient than traditional Masked Language Modeling (MLM) and offers superior performance, especially for smaller models. | Megatron-LM introduces an efficient intra-layer model parallelism approach for training large transformer models. Implemented seamlessly in PyTorch without custom modifications, it allows for the training of models with billions of parameters, achieving state-of-the-art results on datasets like WikiText103 and LAMBADA. This innovation pushes the boundaries of Large Language Models, offering a scalable solution for the research community. | GPT-3's primary innovation in the context of Large Language Models is its exceptional few-shot learning capabilities, allowing it to make accurate predictions using just a natural language prompt and a few task demonstrations. The model also introduced prompt-based and in-context learning methodologies. However, its vast size (175B parameters) poses challenges for real-world applications.<SEP>It can generate upto 1,536 words (equivalent to 3 pages) | The DeBERTa model introduces a disentangled attention mechanism, representing each word with separate vectors for content and position. It also incorporates an Enhanced Mask Decoder for better prediction of masked tokens during pre-training. These innovations lead to improved efficiency and performance over models like BERT and RoBERTa. | BigBird introduces a sparse attention mechanism, allowing it to efficiently handle sequences up to 8 times longer than traditional models like BERT. It combines global, sliding window, and random attention patterns to capture both local and long-range dependencies. This innovation enables superior performance on various NLP tasks without sacrificing efficiency. | The Vision Transformer (ViT) applies Transformers, typically used in NLP, directly to image patches without image-specific biases. It excels when pre-trained on larger datasets, outperforming traditional convolutional models like ResNets. This approach challenges the dominance of convolutional architectures in computer vision, mirroring the Transformer's rise in NLP. | The paper introduces a model with remarkable generalization, capable of creatively interpreting and combining unusual textual concepts into images. It also demonstrates combinatorial generalization and zero-shot image-to-image translation controlled by natural language, showcasing advancements in LLMs for text-to-image synthesis. | The Switch Transformer introduces a sparsely-activated model approach, enhancing the Mixture of Experts (MoE) models by simplifying their routing algorithm and reducing computational costs. It enables training large models with lower precision formats like bfloat16 and achieves up to 7x faster pre-training speeds. This innovation pushes LLM boundaries, scaling up to trillion parameter models with significant efficiency gains. | CLIP, in the context of Large Language Models, introduces a novel approach by leveraging natural language supervision with a dataset of 400 million (image, text) pairs. It excels in zero-shot learning, allowing it to classify images using textual descriptions without prior training on specific categories. This integration of vision and language offers a flexible, scalable solution with potential for diverse applications. | GPT-Neo by EleutherAI is an open-source alternative to proprietary models like GPT-3. 
Trained on the diverse "The Pile" dataset, its significance lies in democratizing access to large language models, fostering community-driven development, and promoting ethical considerations in AI research. | The Swin Transformer introduces a hierarchical representation for visual elements of varying scales and achieves linear computational complexity by computing self-attention within non-overlapping, shifted windows. This design efficiently adapts the Transformer architecture for high-resolution images, making it a robust backbone for computer vision tasks. The shifted window approach enhances modeling power while ensuring lower latency. | GPT-J-6B by EleutherAI stands out for its open-source accessibility, training on the diverse "Pile" dataset, and its development by an independent research organization. The model emphasizes democratization in AI, showcasing that state-of-the-art advancements aren't limited to large corporations. EleutherAI's transparent approach and emphasis on open research further distinguish GPT-J in the LLM landscape. | The Decision Transformer reframes Reinforcement Learning (RL) as a sequence modeling task, leveraging the Transformer architecture used in Large Language Models like GPT. It autoregressively models trajectories by conditioning on desired returns, past states, and actions, enabling it to generate optimal future actions. This approach offers a simplified and scalable solution to RL, especially in sparse reward settings. | The paper presents a novel approach to reinforcement learning (RL) by treating it as a sequence modeling problem. Using the Transformer architecture, traditionally employed in natural language processing, they model trajectories in RL tasks. This perspective simplifies design decisions and proves versatile across various RL challenges, including long-horizon tasks. | The HTLM model's primary innovation is its ability to directly model hyper-text (HTML) from web crawls, enabling structured prompting and auto-prompting. This approach leverages the inherent structure of HTML for tasks like zero-shot summarization and offers improved performance and data efficiency compared to traditional Large Language Models. | The Jurassic-1 model's primary innovation in the context of Large Language Models is its enhanced tokenization efficiency, achieved through a larger 256K vocabulary SentencePiece tokenizer. This tokenizer captures a mix of word pieces, whole words, and multi-word expressions, allowing Jurassic-1 to represent text with 28% fewer tokens than GPT-3, leading to faster processing and broader domain prediction capabilities. | The Megatron-Turing NLG 530B (MT-NLG) is the largest monolithic language model trained to date with 530 billion parameters. It was developed using advanced 3D parallelism techniques, a collaboration between NVIDIA Megatron-LM and Microsoft DeepSpeed. The model showcases superior in-context learning capabilities and sets new benchmarks in zero-shot and few-shot learning. | The paper introduces techniques to improve the alignment of Large Language Models (LLMs) with human values, specifically HHH--helpful, honest, and harmless. It delves into methods like context distillation and preference modeling over imitation learning. The goal is to ensure LLMs provide responses that are both relevant and in line with human expectations. | GLaM introduces a sparsely activated mixture-of-experts architecture, allowing it to scale to 1.2 trillion parameters while consuming only 1/3 of GPT-3's training energy. 
Despite its size, it achieves superior performance on 29 NLP tasks and is more energy-efficient than dense models like GPT-3. | The paper introduces two guidance techniques for text-guided image synthesis: CLIP guidance and classifier-free guidance. Of the two, classifier-free guidance produces higher-quality, photorealistic images that align closely with textual descriptions, outperforming previous models like DALL-E in evaluations. | The paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" emphasizes the benefits and limitations of scaling LLMs. It highlights significant performance advances with data quality and scale but notes uneven gains across tasks, especially in mathematical reasoning. The study also delves into the impact of scale on toxicity and bias in model outputs. | The CM3 model introduces a causally masked approach for generative modeling, blending the strengths of causal and masked language models. Trained on large-scale multi-modal documents, it can generate rich structured outputs and offers impressive zero-shot capabilities across text and image tasks. This innovation positions CM3 as a powerful advancement in the realm of Large Language Models. | LaMDA is a specialized dialog model that emphasizes safety and factual grounding. The model's innovation lies in its fine-tuning with annotated data and its ability to consult external knowledge sources. This approach aims to produce more accurate and safer dialog responses compared to traditional LLMs. | Better alignment of LLMs with human expectations using reinforcement learning through human feedback | The primary innovation of FLAN in the context of Large Language Models is instruction tuning, where models are finetuned on datasets described via natural language instructions. This method significantly enhances zero-shot learning abilities, with FLAN outperforming the 175B GPT-3 on numerous tasks. The approach emphasizes human-like prompts over traditional model-specific prompts used in models like GPT-3 and T5. | The paper "Training Compute-Optimal Large Language Models" introduces a methodology to optimally balance model size and training tokens under a fixed compute budget. The findings emphasize equal scaling of parameters and training tokens, highlighting the importance of high-quality dataset scaling. The research also underscores ethical concerns with training on vast datasets sourced from the web. | The paper introduces DQ-BART, which innovatively combines model distillation and quantization to compress sequence-to-sequence models. It uniquely initializes student models by copying specific layers from the teacher and distills both the encoder and decoder. This approach achieves significant model compression with minimal performance loss. | GopherCite, in the context of large language models, is designed to support its answers with verified quotes from sources. It employs reinforcement learning with unique training techniques and can decline to answer questions if uncertain about the quality of its response. This approach differentiates it from other models by emphasizing evidence-backed answers and selective response generation. | The main innovation of the model, SeeKeR, is its modular approach that combines internet search, knowledge generation, and response generation for more factual and up-to-date outputs. This method addresses the limitations of traditional large language models by ensuring accurate and current information in responses. 
SeeKeR outperforms existing models in open-domain knowledge-grounded conversations. | GLM's main innovation lies in its use of gap sentences for pretraining, along with a blank infilling objective during fine-tuning. This approach enhances the model's ability to generate coherent and contextually relevant text. By combining autoregressive decoding and text infilling, GLM achieves competitive performance on a variety of NLU tasks. | The paper introduces a model named T0, based on the T5 architecture, which is trained using a unique method of mapping natural language tasks into prompted forms. This approach, combined with training on multiple prompts and datasets, enables T0 to achieve robust generalization to unseen tasks, often surpassing larger models like GPT-3 in performance. The innovation lies in the effective use of prompts and multitask training to enhance zero-shot learning capabilities. | Flamingo is a Visual Language Model designed to bridge powerful pretrained vision-only and language-only models, allowing it to handle sequences of interleaved visual and textual data. It's trained on large-scale multimodal web corpora with interleaved text and images, enabling rapid adaptation to new tasks using few-shot learning. This approach allows Flamingo to outperform models fine-tuned on significantly more task-specific data. | To demonstrate the first large-scale use of Pathways -- a new\nML system which enables training a single model across thousands or tens of thousands of accelerator chips in a highly efficient manner. With Pathways, they trained a 540B parameter language model on 6144 TPU v4 chips at efficiency levels that could not be reached before for models of this scale. E.g., GPT-3 (175B), Gopher (280B), Megatron-Turing-NLG (530B). | GPT-NeoX-20B is a 20 billion parameter autoregressive language model with notable architectural differences from GPT-3, including the use of rotary positional embeddings. It excels in few-shot reasoning and is distinctively open-sourced, making it a significant contribution to the public research community. The model was trained on the Pile dataset. | Gato is a multi-modal, multi-task agent inspired by Large Language Models, capable of handling diverse tasks like playing games, captioning images, and robotics using a single neural network. It employs a unique tokenization approach to process varied data types, from text to images. This innovation allows Gato to generalize across a vast range of tasks, setting a new standard in the realm of LLMs. | The authors introduced OPT, a collection of auto-regressive language models ranging from 125M to 175B parameters, replicating GPT-3's performance. They applied the latest best practices in data curation and training efficiency and emphasized the importance of community collaboration for responsible LLM guidelines and ethical considerations. | The paper introduces the UL2 model, a unified framework for pre-training in NLP, featuring a novel Mixture-of-Denoisers (MoD) objective. This objective smoothly integrates various pre-training paradigms, such as span corruption and prefix language modeling. Additionally, UL2 introduces dynamic "mode switching" between different denoisers and showcases superior performance across diverse NLP tasks. | The Global Context Vision Transformer (GC ViT) introduces a hierarchical structure optimized for compute and parameter efficiency. 
It employs global query tokens for capturing contextual information and a novel downsampling module with Fused MB-Conv blocks to enhance inter-channel dependencies. These innovations lead to state-of-the-art performance across various computer vision tasks. | The "Imagen" model innovatively merges transformer language models with high-fidelity diffusion techniques to produce photorealistic images from text descriptions. This demonstrates that embeddings from text-only pretrained large language models are highly effective for text-to-image synthesis. | The main innovation of the "Godel" model lies in its novel approach to integrating external knowledge into dialogue generation using a knowledge selection mechanism. This enables the model to produce contextually relevant and accurate responses by incorporating relevant information from external sources, enhancing the capabilities of LLMs in generating grounded and informed dialogues. | BLOOM is an open-access Large Language Model (LLM) collaboratively designed by hundreds of researchers. Trained on the ROOTS corpus with 59 languages, it achieves competitive performance, especially after multitask prompted finetuning. The model and code are publicly released under the Responsible AI License. | BlenderBot 3 enhances its predecessor by grounding conversations with internet-based knowledge retrieval. It emphasizes fine-tuning with diverse datasets and incorporates advanced safety mechanisms, while also focusing on continual learning from public deployments. | The paper introduces AlexaTM 20B, the largest multilingual seq2seq model, trained on denoising and Causal Language Modeling tasks. This model excels at in-context learning with long contexts and showcases efficiency by outperforming much larger models like GPT3 175B in certain benchmarks. Additionally, it emphasizes a new paradigm for one-shot machine translation, especially for low-resource languages. | Sparrow, developed by DeepMind, introduces innovations in the context of LLMs by utilizing targeted human judgments for alignment. It employs a unique approach of using a single external knowledge fragment for evidence and focuses on breaking down goals into detailed rules, enhancing the model's helpfulness and accuracy in responses. The model also integrates techniques like self-play, search, and fine-grained rules to shape its behavior. | this paper explores instruction finetuning\nwith a particular focus on (1) scaling the number of tasks (1.8K fine-tuning tasks), (2) scaling the model size, and (3) finetuning on\nchain-of-thought data. This approach is compatible with various model sizes and architectures, with Flan-T5 models notably outperforming baseline T5 models. | The paper introduced an extended instruction fine-tuning for the Flan-PaLM model, scaling it to a 540B-parameter size and 1.8K fine-tuning tasks. They incorporated chain-of-thought (CoT) data, which enhanced performance across evaluations. This approach is compatible with various model sizes and architectures | Galactica is designed to revolutionize the way we access scientific knowledge, moving beyond the traditional store-and-retrieve approach. It efficiently absorbs technical knowledge, including complex LaTeX equations and chemical reactions, and demonstrates superior context-associative capabilities over traditional search engines. This model's architecture and approach highlight its potential to serve as a new, more efficient interface for scientific inquiry. 
| trained using Reinforcement Learning from Human Feedback (RLHF) to obtain better model alignment<SEP>It can generate upto 3000 words (equivalent to 6 pages)<SEP>Supports input context length of 2048 tokens | The paper introduces E5, a text embedding model trained contrastively using a curated dataset named CCPairs. A unique consistency-based filtering refines this dataset, ensuring high-quality training data. E5's efficiency allows it to match or outperform much larger models in various tasks. | InstructOR is built on the GTR model architecture, initialized from T5 models, and fine-tuned on information search datasets. It uses a single encoder to encode input text and task instructions. The training objective distinguishes between good and bad candidate outputs based on their cosine similarity in embeddings. | The LLaMA models prioritize efficiency, with LLaMA-13B outperforming larger models like GPT-3 using fewer parameters. They exclusively train on publicly available data, emphasizing both inference efficiency and democratizing access to powerful language models. | Alpaca 7B is fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$). | The paper introduces Pythia, a suite of 16 Large Language Models (LLMs) trained on consistent public data, spanning from 70M to 12B parameters. Pythia provides public access to 154 checkpoints for each model, facilitating research into LLM training dynamics, memorization, and bias reduction. This unique setup offers novel insights into LLM behavior and evolution. | It can generate upto 24000 words (equivalent to 48 pages)<SEP>Supports input context length between 8192 and 32,768 tokens depending on the model version | Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. | The paper introduces the Evol-Instruct methodology, allowing Large Language Models (LLMs) to autonomously generate diverse instructional data. This innovation shifts away from traditional human-generated instructions, leading to more robust models. The resulting model, WizardLM, demonstrated superior performance, highlighting the effectiveness of this approach. | The main innovation of WizardMath in the context of Large Language Models (LLMs) is its enhanced mathematical reasoning capabilities. Through a case study, the model demonstrated its ability to solve complex problems using step-by-step reasoning, showcasing its advancement over other LLMs in mathematical tasks. | MosaicML tries to provide a commercially-usable, open-source model that matches (and - in many ways - surpasses) LLaMA-7B. It is trained on a large amount of data (1T tokens like LLaMA). | StarCoder introduces novel features like an 8K context length, Fill-in-the-Middle (FIM) infilling, and Multi-Query-Attention (MQA) for fast inference. It outperforms other open LLMs for code and matches the OpenAI code-cushman-001 model. Additionally, it emphasizes safe open model release with improved PII redaction and an integrated attribution tool. 
| Falcon-7B and Falcon-40B have been trained on 1.5 trillion and 1 trillion tokens respectively, in line with modern models optimising for inference. The key ingredient for the high quality of the Falcon models is their training data, predominantly based (>80%) on RefinedWeb — a novel massive web dataset based on CommonCrawl. Another interesting feature of the Falcon models is their use of multiquery attention. | The main innovation of the Orca model is Explanation Tuning, which leverages detailed explanations rather than just input-response pairs for training. This approach enhances the model's reasoning capabilities, using system instructions from GPT-4 and ChatGPT as a teaching assistant for generating explanations. | The paper introduces WizardCoder, a model that empowers Code Large Language Models with complex instruction fine-tuning using the Evol-Instruct method tailored for code. This innovation allows WizardCoder to surpass other open-source Code LLMs and even outperform major closed LLMs on key benchmarks. The model fills a gap in the field by emphasizing instruction fine-tuning specifically for the code domain. | The only difference between OpenLLaMA setting and the original LLaMA is the dataset used: OpenLLaMA employs open datasets rather than the one utilized by the original LLaMA. | The paper introduces Llama 2-Chat, a fine-tuned Large Language Model optimized for dialogue use cases, ranging from 7 billion to 70 billion parameters. It outperforms most open-source chat models and emphasizes safety through specific data annotation and tuning. <SEP>The primary architectural differences from Llama 1 include increased context length\nand grouped-query attention (GQA). | The paper introduces Jais and Jais-chat, state-of-the-art Arabic-centric LLMs. These models are trained bilingually with Arabic and English, addressing the underrepresentation of Arabic in the LLM space. They combine specialized Arabic text processing with advanced model architectures, offering superior performance in Arabic while remaining competitive in English. | |||||
pretraining architecture | Decoder | Encoder | Decoder | Decoder | Decoder | Encoder | Encoder | Encoder | Decoder | Encoder | Encoder/Decoder | Decoder | Encoder | Encoder/Decoder | Encoder | Encoder/Decoder | Encoder/Decoder | Encoder | Encoder or Decorder, depending on the base model | Decoder | Encoder | Encoder | Encoder | Decoder | Encoder/Decoder | Encoder | Decoder | Encoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Encoder | Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder | Encoder/decoder or decoder only, depending on the base model it’s extending | Encoder/Decoder | Encoder/Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Encoder | T5 (or CLIP or BERT) for frozen text encoder + U-net architecture for cascaded diffusion models for text to image | Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Encoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder |
pretraining task | Causal language modeling | Masked Language Modeling | Causal language modeling | Causal language modeling | Causal language modeling | Masked Language Modeling | Masked Language Modeling | Next Sentence Prediction<SEP>Masked Language Modeling | Causal language modeling | Protein folding prediction of BERT using parameter\nsharing, which is much more efficient given the same number of parameters | denoising autoencoder | Causal language modeling | Masked Language Modeling<SEP>Next Sentence Prediction | Span Corruption | Masked Language Modeling | DAE (more concretely GSG) and MLM | denoising autoencoder | replaced token detection | Same as base model | Causal language modeling | Masked Language Modeling | Masked Language Modeling | image classification | Caption prediction | denoising autoencoder | predict which of the N × N possible (image, text) pairings across a batch actually occurred | Causal language modeling | Same as ViT | Causal language modeling | Next action prediction | predict most likely sequence | denoising autoencoder | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Caption prediction | Causal language modeling | Caption prediction | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | denoising autoencoder | Causal language modeling | LM training, Dialogue training | Auto regressive blank infilling | Masked Language Modeling | Caption prediction | Log likelihood of text given some visual input | Causal language modeling | Causal language modeling | CLM (where tokens are either text or agent actions) | Causal language modeling | Mixture-of-Denoisers, which combines diverse pretraining\nparadigms together | Image Classification | image/text pair prediction | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Optimizes denoising (80%) and Prefix LM (20%) | Causal language modeling | Span Corruption | Causal language modeling | Causal language modeling | Causal language modeling | Contrastive pretraining | Language Modeling | Causal language modeling | Causal language modeling | Language Modeling | Causal language modeling | Causal language modeling | Language Modeling | Causal language modeling | Self-supervized Learning | Causal language modeling | |||||||||
fine-tuning task | Supervized discriminative finetuning | Next Sentence Prediction | finetuning on downstream tasks one at a time | based on multi-turn crowdsourced dialog datasets, LaMDA-PT is finetuned in a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context | Reinforcement Learning from Human Feedback | Instruction Tuning | dialogue-based fine-tuning | Natural language prompts | Supervised fine-tuning (SFT) | Instruction Tuning | Instruction Tuning | Step 3. RLHF using Proximal Policy Optimization<SEP>Step 2. Collect comparison data and train a reward model<SEP>Step 1. Supervized fine-tuning | Supervised fine-tuning (SFT) | Wide variety of instruction based text-to-text tasks | instructions | supervized open-domain instruction finetuning | Reinforcement Learning from Human Feedback<SEP>Rule-Based Reward Model | supervized open-domain instruction finetuning | supervized open-domain complex instruction finetuning | Reinforcement Learning from Evol-Instruct Feedback | Conversations for the Chat model<SEP>Short instructions for the Instruct model<SEP>Excerpts of fiction books for StoryWriter | Python-specific variant | Explanation tuning | supervized complex code-based instruction finetuning | Finetuning on longer context data using sample subsets of datasets only samples with at least 4096 tokens of instances in base model dataset for the 8K context window model<SEP>A large collection of chat datasets for the Chat model<SEP>Instructions following for the Instruct model | both instruction tuning and reinforcement learning from human feedback for Llama-2-Chat | Instruction Tuning | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
training corpus | BookCorpus<SEP>Supervised Finetuning on several task-specific datasets for Natural Language Inference, Question Answering, Sentence similarity, and Classification. | Toronto Book Corpus and Wikipedia (3.3B Tokens) | Different training datasets depending on experiments, but baseline is Wikitext-103 | https://github.com/openai/gpt-2-output-dataset<SEP>8 million web pages (40 GB). 10X GPT . WebText dataset is created by crawling all links at Reddit with at least 3 Karma points. | Same as BERT + Giga5 (16GB text),\r\nand and aggressively filtered ClueWeb 2012-B (19GB), Common Crawl (110 GB) | English Wikipedia + Wikidata for entitites (note that they\ninitialize model to original BERT parameter values | Same as BERT + CC News + OpenWebText + Stories (~33B Tokens) | Same as BERT | 140 GB of text including: Wikipedia (En, De, Es, Fr), Project Gutenberg, 45 subreddits, OpenWebText2, Amazon Reviews, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task3, which includes the Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA , and Natural Questions | 170,000 proteins from a public repository of protein sequences and structures | Same as RoBERTa (160Gb of news, books, stories, and web text) | 140M Reddit conversations | Same as BERT | Colossal Clean Crawled Corpus | Cleaned Common Crawl in 100 languages | C4 (750GB) + HugeNews (3.8 TB) | CC25 Corpus includes 25 monolingual corpuses in different languages. Largest corpuses are English (300 GB) and Russian (280GB) | Same as BERT except for Large with is same as XLNet | Original paper uses an aggregate dataset consisting of Wikipedia), CC-Stories), RealNews, and OpenWebtext | ~ 500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B) | For DeBERTa pre-training, we use Wikipedia (English Wikipedia dump; 12GB), BookCorpus (Zhu et al., 2015) 9 (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB) and STORIES (a subset of CommonCrawl (Trinh & Le, 2018); 31GB). The total data size after\ndata deduplication (Shoeybi et al., 2019) is about 78GB. | Books, CC-News, Stories and Wikipedia | From standard Imagenet to JFT-300M (large inhouse dataset) | 250 million text-images pairs from the internet | Colossal Clean Crawled Corpus | WIT (WebImageText) - 400 million text,image pairs | Pile - 840 GB open source text dataset that combines 22 pre existing datasets | Imagenet and Imagenet-22k | Pile corpus, a large-scale curated dataset created by EleutherAI | Different corpus for different experiments | D4RL dataset and other RL datasets depending on the task at hand | 23TB of simplified HTML extracted from CommonCrawl | 300B tokens (same as GPT-3) | The Pile (800GB dataset) + 2 Common Crawl snapshots | 400B tokens from filtered Common Crawl and Books. They also create several Dialogue Preference datasets for the RLHF training. | 1.6T tokens including web pages filtered by Wikipedia and books for quality | Same as DALL-E | Massive Text (2.35 billion documents, or about 10.5 TB of text including Massive Web, Books, Github, News, C4, and Wikipedia. | LAION-5B, a publicly available dataset derived from Common Crawl | CC-News, English Wikipedia | 1.56T words from public dialog data and other public web documents. 
Overall, it consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T\nwords | Same as GPT3 for pretraining, but finetuned and optimized using labeler data and prompts | FLAN is instruction tuned on 25 tasks spanning 62 datasets.<SEP>LaMDA-PT is is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary | 1.4 trillion training tokens. Massive Text (2.35 billion documents, or about 10.5 TB of text including Massive Web, Books, Github, News, C4, and Wikipedia. | CNN/DM, XSUM, ELI5, WMT16 En-Ro (~1M tokens) | Same as Gopher plus specific dataset generated in the RLHP process | Wizard of the Internet/Wikipedia, PersonaChat, Blended Skill\nTalk, Empatheic Dialogues, Multi-Session Chat, MS MARCO, Natural\nquestions, SQuAD, TriviaQA | Pile, GLM-130B Chinese corpora, P3, DeepStruct finetuning\ndataset | T0 (Multiple-choice QA, Extractive QA, Closed-Book QA,\nStructure-To-Text, Sentiment, Summarization, Topic Classification, Paraphrase\nIdentification. T0p (same as T0, with additional datasets from\nGPT-3’s evaluation suite). T0pp (same as T0p, with additional datasets\nfrom SuperGLUE, excluding NLI sets) | Combination of the DALL-E and CLIP datasets | MultiModal MassiveWeb (M3W): 185 million images and 182 GB text + a number of text paired with image datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average) | 780B tokens from multilingual social media conversations (50%), multilingual filtered webpages (27%), books in English (13%), code from Github (5%), multilingual Wikipedia (4%), and news in English (1%). Code includes 24 programming languages. | Pile — 840 GB open source text dataset that combines 22 preexisting datasets | 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot) | 180B tokens = RoBERTa + the Pile + PushShift.io Reddit | 1 trillion tokens on C4 | Imagenet-1K and other task dependent dataasets | a combination of internal datasets, with ? 460M image-text pairs, and the publicly available Laion dataset, with ? 400M image-text pairs | Same as PaLM + 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats | 147M dialog sessions for a total of 6B tokens from Reddit comment chains\nfor DialoGPT. And grounded dialog corpora like DSTC7 Task 2 corpus, MS MARCO, UnifiedQA, and Schema-Guided Dialog. | https://openreview.net/forum?id=UoEw6KigkUn<SEP>366B tokens (1.5 TB of text data) multilingual dataset (46 natural languages and 13 programming languages) | 180B tokens = RoBERTa + the Pile + PushShift.io Reddit | Wikipedia and mC4 datasets in 12 languages. | Same as Chinchilla + interactive data gathering with human annotators during the RLHF process | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT | Trained on 106 billion tokens of open-access scientific text and\ndata. 
This includes papers, textbooks, scientific websites, encyclopedias,\nreference material, knowledge bases, and more | Human written prompt and interaction dataset collected through the OpenAI API | Finetune on a combination of 3 datasets: NLI 6 (Natural Language Inference), MS-MARCO passage ranking\ndataset, and NQ (Natural Questions) dataset<SEP>CCPairs dataset by combining various semistructured\ndata sources such as CommunityQA, Common Crawl and Scientific papers, and perform\naggressive filtering with a consistency-based filter | Finetuned on MEDI | Stack Exchange<SEP>ArXiv<SEP>Gutenberg and Books3<SEP>Wikipedia<SEP>Github<SEP>C4<SEP>CommonCrawl<SEP>approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328BG Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%) | 52K instruction-following data generated from OpenAI’s text-davinci-003 using self-instruct mechanism, from 175 human-written instruction-output pairs was leveraged to finetune the model | Two version of each of the eight models are released: one trained on the original Pile corpus and another trained on the deduplicated Pile corpus.<SEP>Pile | Vicuna is finetuned with 70K user-shared ChatGPT conversations user-shared conversations collected from ShareGPT.com | 250K instructions generated from OpenAI ChatGPT API auto-generated based on the Evol-Instruct method | To enhance the model’s ability to adhere to the neural and diverse instructions, 1.5k open-domain conversations from WizardLM’s training data are sampled and merged with above math corpus as the final supervized finetuning training data<SEP>few-shot re-generate 15k answers for GSM8k\nand MATH with an Alpha version of WizardLM 70B model to produce solutions in a\nstep-by-step format, then find out those with a correct answer, and use this data to finetune\nbase Llama model<SEP>evolve the original math (GSM8k + MATH) instructions by 8 turns,\nincreasing the data size from 15k to 96k | Semantic Scholar ORC<SEP>The Stack Dedup<SEP>RedPajama<SEP>C4<SEP>mC4 | StarCoderBase was trained using the dataset "The Stack v1.2" (Kocetkov et al., 2022), which exclusively contains data from permissively licensed GitHub repositories. The training set was further refined by selecting 86 languages out of the 358 programming languages present in The Stack, based on criteria like data volume, popularity, and active support. The assignment of data to programming languages was done based on file extensions. | Falcon RefinedWeb | the Flan 2022 Collection with its extensive public assortment of tasks and instructions<SEP>⟨query, response⟩ pairs augmented with detailed responses from\nGPT-4 that explain the reasoning process of the teacher as it generates the response | initialized with the 20K instruction-following dataset called\nCode Alpaca, the Evol-Instruct technique is iteratively applied on this dataset consisting of 20,000 samples to produce evolved data. 
After each round of data evolution, the evolved data is merged from\nall previous rounds with the original dataset to finetune StarCoder | We train our models on the RedPajama dataset released by Together, which is a reproduction of the LLaMA training dataset containing over 1.2 trillion tokens.<SEP>RedPajama | RedPajama - StackExchange<SEP>RedPajama - arXiv<SEP>RedPajama - Books<SEP>Semantic Scholar ORC<SEP>The Stack - Markdown<SEP>RedPajama - Wikipedia<SEP>The Stack - Selected Languages<SEP>RedPajama - CommonCrawl<SEP>c4 - English - SemDedup<SEP>mC4 3.1.0 - English | The v2 models are trained on a mixture of the Falcon refined-web dataset, the StarCoder dataset and the wikipedia, arxiv, book and stackexchange part of the RedPajama dataset.<SEP>Starcoder<SEP>Falcon refined-web | No specific information available other than "A new mix of publicly available online data" | An English dataset comprising 232B tokens from Pile-CC, Books3, ArXiv, PubMed Central, OpenWebText2, Wikipedia (en), FreeLaw, PubMed Abstracts, DeepMind Math, Project Gutenberg, BookCorpus2, EuroParl, PhilPapers, Youtube Subtitles, NIH Grant Abstracts, Enron Emails. And 46B tokens from its Github subset.<SEP>3B tokens from English Wikipedia and 15B tokens from the Books3 corpus translated to Arabic.<SEP>An Arabic dataset comprising 55B tokens from Abu El-Khair, Aranews, ArabicText 2022, ARabic subset of C4, Arabic Wikipedia, ArabicNews 202, Maktabah, UN Meeting transcripts, and other sources. | |
optimizer | Adam optimizer | Adam optimizer | Adam weight decay optimizer | Adam optimizer | Adam optimizer | LAMB optimizer | Adagrad optimizer | AdaFactor | Adam optimizer | Adam optimizer | Adam optimizer | Adam optimizer | Adam optimizer | Adam optimizer | AdamW | AdamW | Adam optimizer | Adam optimizer | AdaFactor | Adam optimizer | Adam optimizer | AdaFactor | AdamW | AdaFactor | Adam optimizer | Adam optimizer | AdamW | AdaFactor | ZeRO<SEP>AdamW | Adam optimizer | AdamW | AdaFactor | AdamW | AdaFactor | Adam optimizer | Adam optimizer | Adam optimizer | AdaFactor | AdaFactor | AdaFactor | AdamW | AdamW | AdamW | AdamW | ZeRO<SEP>Adam | Adam optimizer | LION | Adam optimizer | AdamW | AdamW | LION | AdamW | AdamW | AdamW | |||||||||||||||||||||||||||||||||
tokenization (a short tokenizer sketch follows the table) | byte pair encoding | WordPiece | byte pair encoding | byte pair encoding | sentencepiece | fastBPE | sentencepiece | sentencepiece | sentencepiece | sentencepiece | byte pair encoding | byte pair encoding | BPE encoding | same set of Byte Pair Encoding (BPE) Tokenizer as GPT-2/GPT-3 | self-trained sentencepiece | sentencepiece | sentencepiece | sentencepiece | sentencepiece | a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation | sentencepiece | uncased wordpiece | sentencepiece | train a new BPE tokenizer based on the Pile | sentencepiece | GPT-2 byte level BPE tokenizer | sentencepiece | Byte level BPE tokenizer | byte pair encoding | the byte pair encoding (BPE) algorithm, using the implementation from SentencePiece; notably, all numbers are split into individual digits, with a fallback to bytes to decompose unknown UTF-8 characters | BPE tokenizer trained specifically on the Pile, same as used for GPT-NeoX-20B | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus | the Hugging Face Tokenizers library (Moi et al., 2022) is used to train a byte-level Byte-Pair-Encoding with a vocabulary size of 49,152 tokens, including the sentinel tokens | LLaMA Byte Pair Encoding (BPE) | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus | sentencepiece | Jais tokenizer |
number of parameters | 117M | Base = 110M, Large = 340M | 151M | 124M, 355M, 774M, 1.5B | Base=117M, Large=360M | Ernie-ViLG 2.0 = 10B, Ernie 3.0 Titan = 260B | 125M Base, and 356M Large | Base = 12M, Large = 18M, XLarge = 60M | 1.63B | b12M, Large = 18M, XLarge = 60M | Base = 140M, Large = 400M. In general, roughly 10%\nlarger than BART for equivalent architectures | 117M, 345M and 762M | 66M | 60M, 220M, 770M, 3B, and 11B | Base = 270M Large = 550M | Base = 223M Large = 568M | Same as BART | Small = 14M, Base = 110M, Large = 335M | 8.3B (GPT-like), 3.9B (BERT-like) | 125M, 350M, 774M, 1.3B, 2.7B, 6.7B, 13B, 175B | 100M (Base), 350M (Large), 700M (X-large), and 1.5B (XX-large) | Depends on the overall architecture | 86M(Base) to 632M (Huge) | 12B | 1T | 125M, 350M, 1.3B, and 2.7B | Swin-Tiny (29M), Swin-Small (50M), Swin-Base (88M), and Swin-Large (197M) | 6B | Same as GPT | Smaller architecture than GPT | 400M | 178B (Jumbo), 17B (Grande), 7.5B (Large) | 530B | 13M, 42M, 197M, 810M, 2.7B, 13B, and 52B | 1.2T across 64 experts, but only 96B get activated for inference | 3.5B diffusion model (2.3B for visual encoding, 1.2B for textual) + 1.5B for model for upsampling | 44M, 117M, 417M, 1.4B, 7.1B, and 280B | 890M (although there are different, smaller, variants) | 125M (small), 800M (small), 2.7B (medium) and 13B (large) | 137B | Same as GPT3 | 137B | 70B | Up to 30x reduction in parameters compared to standard BART | 280B | SeeKeR Dialogue: 400M, 3B; SeeKeR LM: 365M, 762M,\n1.5B, R2C2 BlenderBot: 400M, 3B | Base = 110M, Large = 335M, and also 2B, 10B, 130B | T0-3B: 3 billion, T0, T0p, T0pp: 11 billion | 3.5B | 80B (largest) | 8B, 62B, and 540B | 20B | 79M, 364M, and 1.18B | 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, and 175B | 20B | 90M | 2B | 540B | 220M (base), 770M (large), and 175B (XL) | 560m, 1.1B, 1.7B, 3B, 7.1B, and 176B | 3B, 30B and 175B | 20B | 70B | 80M (Flan-T5-Small), 250M (Flan-T5-Base), 780M (FLan-T5-Large), 3B (Flan-T5-XL), and 11B (Flan-T5-XXL). | 8B, 62B, 540B | mini: 125M, base: 1.3B, standard: 6.7B, large: 30B,\nhuge: 120B | 175B | 300M | 330M | 6.7B, 13.0B, 32.5B, and 65.2B | 7B, 13B, 33B, 65B | 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B | 170T | 7B, 13B, 33B | 7B | 7B, 13B, and 70B | 7B | 15.5B | 7B, 40B | 13B | 1B, 3B, 7B, 13B, 15B, and 34B | 3B, 7B, 13B | 30B | 3B, 7B | 7B, 13B, 34B, 70B | 1.3B, 6.7B, and 13B | |
maximum number of parameters (in million) | 117 | 340 | 151 | 1500 | 360 | 260000 | 356 | 60 | 1630 | 60 | 400 | 762 | 66 | 11000 | 550 | 568 | 335 | 8300 | 175000 | 1500 | 632 | 12000 | 1000000 | 2700 | 197 | 6000 | 400 | 178000 | 530000 | 52000 | 1200000 | 3500 | 280000 | 890 | 13000 | 137000 | 137000 | 70000 | 280000 | 130000 | 11000 | 3500 | 80000 | 540000 | 20000 | 1180 | 175000 | 20000 | 90 | 2000 | 540000 | 175000 | 176000 | 175000 | 20000 | 70000 | 11000 | 540000 | 120000 | 175000 | 300 | 330 | 65200 | 65000 | 12000 | 170000000 | 33000 | 7000 | 70000 | 7000 | 15500 | 40000 | 13000 | 34000 | 13000 | 30000 | 7000 | 70000 | 13000 | ||||||||
hardware used | TPUv3 | Cloud TPUv3 | Cloud TPUv3 | Nvidia V100 GPU | Nvidia V100 GPU | TPUv3 | NVIDIA V100 (32GB) GPUs | NVIDIA V100 (32GB) GPUs | TPUv3<SEP>Nvidia V100 GPU | Tesla V100 (32GB) GPU | Nvidia V100 GPU | Nvidia V100 GPU | Cloud TPUv3 | NVIDIA V100 (16GB) GPU | TPUv3 | Nvidia V100 GPU | Nvidia V100 GPU | TPU v3-256 pod | graphics processing unit | A100-80GB GPU | cloud TPU-v4 | TPUv3 | NVIDIA A100 GPU<SEP>Nvidia V100 GPU | TPUv3 | TPUv3 | TPUv4<SEP>TPUv3 | A100 GPU | TPUv3 | Nvidia V100 GPU | Nvidia V100 GPU | TPUv3 | TPUv4 | TPUv4 | AMD EPYC 7532 CPU<SEP>A100-SXM4-40GB GPU | A100-80GB GPU | TPUv4 | NVIDIA A40 GPU<SEP>NVIDIA A100 GPU | TPUv4 | Nvidia V100 GPU | A100-80GB GPU | NVIDIA V100 (32GB) GPUs<SEP>A100-40GB GPU | NVIDIA A100 GPU | TPUv3 | TPUv3 | TPUv4 | A100-80GB GPU | Nvidia V100 GPU | A100-40GB GPU | A100-80GB GPU | A100-80GB GPU | A100-40GB GPU | A100-80GB GPU | Nvidia V100 GPU | A100-40GB GPU | A100-80GB GPU | A100-40GB GPU | A100-80GB GPU | cloud TPU-v4 | A100-40GB GPU<SEP>H100-80GB GPU | cloud TPU-v4 | A100-80GB GPU | Condor Galaxy 1 AI supercomputer | |||||||||||||||||||||||||
hardware information | state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster | train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days | trained on 16 Nvidia V100 machines with NVLink. | DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours. | we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods. TPU pods are are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. | train the XLM-R model for\n1.5 Million updates on five-hundred 32GB Nvidia\nV100 GPUs with a batch size of 8192. | The full model (including 25\nlanguages) is trained on 256 Nvidia V100 GPUs (32GB) for 500K steps. The total batch size is around 128K tokens per GPU, matching the BART configuration. | ELECTRA-SMALL trained for 4days on 1 V100 GPU. ELECTRA-Base trained for 4d on 16 TPUv3s. | Their experiments use up to 32 DGX-2H servers (a total of 512 Tesla V100 SXM3 32GB GPUs). Our infrastructure is optimized for multi-node deep learning applications, with 300 GB/sec bandwidth between GPUs inside a server via NVSwitch and 100 GB/sec of interconnect bandwidth between servers using 8 InfiniBand adapters per server. | All models were trained on V100 GPU’s on part of a high-bandwidth cluster provided by Microsoft. | The base model of DeBERTa was trained on 4 DGX-2 machines equipped with 64 V100 GPUs. The training took 10 days to complete 1M training steps with a batch size of 2048. For the larger version of DeBERTa with 1.5 billion parameters (DeBERTa1.5B), the model was trained on a pre-training dataset of 160G. The training was conducted on a DGX-2 machine with 16 V100 GPUs. | the ViT-L/16 model\npre-trained on the public ImageNet-21k dataset could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days. | All models are trained with the same amount of computation (32 cores) and on the same hardware (TPUv3). | The largest ResNet model,\nRN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. | We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192. | Model training is done with mixed precision using 16-bit bfloat on NVIDIA’s Selene supercomputer with 560 DGX A100 nodes. Each cluster node has 8 NVIDIA 80-GB A100 GPUs, connected to each other by NVLink and NVSwitch. Each node has eight NVIDIA Mellanox 200Gbps HDR Infiniband HCAs for application communication, with an additional two HCAs per node for dedicated storage. The nodes are connected in a three-level (leaf, spine, core) fat-tree topology with 850 switches. The cluster uses an all-NVME shared parallel filesystem for high-performance data access and storage. | the GLaM (64B/64E)\ntraining after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach similar (and slightly exceeded) scores as GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh or 1/6 of the GPT-3\nenergy cost. | We built our training and evaluation codebase with JAX and Haiku. In particular, we use JAX’s pmap transformation to efficiently express both data and\nmodel parallelism. 
We trained and evaluated all models on TPUv3 chips.\nThe half-precision parameters and single-precision Adam state for Gopher occupy 2.5 TiB, which far exceeds the 16 GiB of memory available on each TPUv3 core. To address these memory concerns,\nwe use optimiser state partitioning, model parallelism, and rematerialisation to partition the model state and reduce\nthe activations so that they fit in TPU memory. | HTLM-Medium was trained on 240 V100 GPU for 28 days, while HTLM-Large was trained on 384 A100 GPU for 24 days. | LaMDA was pretrained on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch | instruction tuning takes around 60 hours on a TPUv3 with 128 cores | All models in this analysis have been trained on TPUv3/TPUv4 with JAX and Haiku. | shard the networks across 128 TPU v3\nmachines. | The SeeKeR language models were fine-tuned on all of the search, knowledge, and response tasks\nsimultaneously, with training occurring on 32 V100 GPUs for around 17, 21, and 31 hours for the XL,\nLarge, and Medium models, respectively.<SEP>The SeeKeR 2.7B R2C2 model was fine-tuned on all of the search, knowledge, and dialogue response\ntasks simultaneously, with training occurring on 64 V100 GPUs for around 20 hours<SEP>SeeKeR 2.7B R2C2 was pre-trained on 128 V100 GPUs for approximately 25 days | The models\nare trained on 64 V100 GPUs for 200K steps with\nbatch size of 1024 and maximum sequence length\nof 512, which takes about 2.5 days for GLMLarge. | training runs corresponded to about 270 total hours of training on a v3-512 Cloud TPU device. Further, T5 was trained in Google’s Taiwan datacenter, whereas we\ntrained in the europe-west4-a Cloud region. The gCO2eq/kWh published by Google for these\ndatacenters are 540 and 410 respectively | Our model and associated infrastructure were implemented using JAX and Haiku. All training and evaluation was performed on TPUv4 instances. The\nlargest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. Megatron type sharding is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers\nwere unsharded. ZeRO stage 1 is used to shard the optimizer state. | PaLM 540B is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism. Each Pod has 3072 TPU v4 chips attached to 768 hosts. | trained GPT-NeoX-20B on twelve Supermicro\nAS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs | training OPT-175B\non 992 80GB A100 GPUs, reaching 147 TFLOP/s\nutilization per GPU. From this implementation, and\nfrom using the latest generation of NVIDIA hardware,\nwe are able to develop OPT-175B using only\n1/7th the carbon footprint of GPT-3. | We use a batch size of 1024 and 512 TPUv4 chips\nfor pretraining this model. UL20B is trained with Jax and T5X infrastructure. 
We release and open source\nT5X-based model checkpoints of this 20B model | Object detection\nand instance segmentation models as well as semantic segmentation models were trained using one computational node with\n8 NVIDIA A40 GPUs using a total batch size of 16, hence a batch size of 2 per GPU.<SEP>For image classification, GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs | use 256 TPU-v4 chips for our base 64 x 64 model, and 128 TPU-v4 chips for both super-resolution models | GODEL_B and GODEL_L were trained on 16 Nvidia\nV100 machines, and GODEL_XL was trained with\n128 Nvidia V100 GPUs. | Training was conducted on 48 nodes, each\nhaving 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs); due to possible hardware\nfailures during training, we also maintained a reserve of 4 spare nodes. The nodes were\nequipped with 2x AMD EPYC 7543 32-Core CPUs and 512 GB of RAM, while the storage\nwas handled by mix of full flash and hard disk drives using a SpectrumScale (GPFS) parallel\nfile system shared between all nodes and users of the supercomputer. 4 NVLink GPU-to-\nGPU interconnects per node enabled intra-node communications while 4 Omni-Path 100\nGbps links per node, arranged in an enhanced hypercube 8D global topology, were used for\ninter-node communications. | The 30B and 175B parameter BlenderBot 3 models were each trained for one epoch of the training data\non 64 (30B) or 128 (175B) x 40gb A100 GPU<SEP>The 3B parameter BlenderBot 3 model was trained on 64 x 32gb V100 GPUs for 27k updates with a batch\nsize of 64 | trained AlexaTM 20B for 120 days on 128 A100 GPUs for the total of 500k updates with the accumulated\nbatch size of 2 million tokens (total of 1 trillion token updates) | shard the models across 64\nTPU v3 machines | use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4\nTPU chips for 37 hours) | We use the metaseq library for training the models, built by the NextSys team at Meta AI. For training the largest 120B model, we use 128 NVIDIA A100 80GB nodes. For inference Galactica 120B\nrequires a single A100 node. | trained on an Azure AI supercomputing infrastructure | takes {16; 32; 64} V100 GPUs and {1; 1; 2} days for the {small, base, large} models. To improve training efficiency and reduce GPU memory usage, we adopt mixed precision training and gradient checkpointing. | When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. | fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. | training the 70M, 160M, and 410M models required 32 A100 GPUs with 40GB RAM, 1.0B, 1.4B, and 2.8B models used 64 A100-40GB GPUs. 6.9B model used 128 GPUs. And 12B model used 256 GPUs. | The training is done with 8x A100 GPUs. The longest single training run takes around 2 days. They utilized\nSkyPilot managed spot instances for saving training costs and FlashAttention for memory optimizations. | train our model on\n8 V100 GPUs with Deepspeed Zero-3 for 70 hours on 3 epochs. 
| A100-80GB GPU was used in model finetuning<SEP> took ~9.5 days to train on 440xA100-40GB GPUs, and cost ~$200k | trained our model on a GPU cluster with 512 A100 80 GB GPUs distributed across 64 nodes. We partitioned the model with a 3D-parallel layout that shards the model with both tensor and pipeline parallelism rank 4, requiring 16 GPUs (two nodes) for one replica. | Falcon-40B-Instruct was trained on AWS SageMaker, on 64 A100 40GB GPUs in P4d instances. <SEP>Falcon-7B-Instruct was trained on AWS SageMaker, on 32 A100 40GB GPUs in P4d instances. <SEP>Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.<SEP>Falcon-7B was trained on 384 A100 40GB GPUs, using a 2D parallelism strategy (PP=2, DP=192) combined with ZeRO. | trained Orca on 20 NVIDIA A100 GPUs with 80GB memory. It took 160\nhours to train Orca on FLAN-5M (ChatGPT augmentations) for 4 epochs, and 40 hours to\ncontinue training on FLAN-1M (GPT-4 augmentations) for the same number of epochs. | The models are trained on cloud TPU-v4s using EasyLM, a JAX based training pipeline we developed for training and fine-tuning large language models. We employ a combination of normal data parallelism and fully sharded data parallelism (also know as ZeRO stage 3) to balance the training throughput and memory usage. | The model was trained in three stages using the MosaicML Platform: (i) First it was trained on 440 A100-40GBs with a batch size of 1760. (ii) Then, on 216 A100-40GBs with a batch size of 1728. (iii) Training was completed on 256 H100-80GBs with a batch size of 512 with 8k context length and 50B tokens. The model was trained with sharded data parallelism using FSDP and used the LION optimizer. | models were pretrained on Meta’s Research Super Cluster (RSC) as well as internal production clusters, both of which use NVIDIA A100s. | All training, hyper-parameter tuning, and instruction-tuning experiments were executed on the Condor Galaxy\n1 (CG-1) AI supercomputer from Cerebras, built in partnership with G42. The final training and fine-tuning runs for Jais were performed on 16 CS-2 systems within CG-1. CG-1 is a Cerebras Wafer-Scale Cluster composed of Cerebras CS-2 systems, MemoryX, SwarmX, management, and input worker nodes. | ||||||||||||||||||||||||||||||||
extension | Relative positioned embeddings enable longer-context attention when compared to vanilla Transformer model | GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.<SEP>Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, or increased context size from 512 to 1024) | This model basically adapts Transformer XL architecture to permutation-based LM | Uses BERT for Encoder architecture, but stacks and aggregates\ntwo of them for text and entities. This architecture could be\nunderstood as BERT for text + knowledge graphs | Extension of BERT with optimized training procedure and more data | Compressed version of BERT using parameter sharing, which is much more efficient given the same number of parameters | model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior | The original Alphafold used a BERT-style transformer. The details of Alphafold’s Transformer are not known, but it is believed it is an extension of the SE(3)-Tranformer, a 3-D equivariant Transformer (see this blog post). | It can be seen as a generalization of BERT and GPT in that it combines ideas from both in the encoder and decoder | GPT-2 architecture trained on dialog data | Compressed version of BERT using distillation, which is much more efficient given the same number of parameters | Same as original Transformer with some additions such as relative positional embeddings like Transformer XL | An extension of RoBERTa that introduces small parameter tuning insights in the context of multilingual applications | PEGASUS introduces a novel pre-training objective called Gap Sentence Generation (GSG), extending vanilla Transformers, where "important" sentences are removed from a document and the model is trained to regenerate them. The method of selecting these principal sentences is based on their relevance to the entire document, measured using the ROUGE1-F1 score. This unique approach tailors the model for abstractive summarization, achieving state-of-the-art performance on various tasks. | mBART introduces a multilingual denoising pre-training method for neural machine translation using a sequence-to-sequence auto-encoder model. Unlike other models like XLM and MASS, mBART pre-trains both the encoder and decoder, making it more adaptable for translation tasks. The model's versatility is further showcased by its ability to handle various levels of multilinguality, from monolingual to 25 languages. | Applied new training techniques including Replaced Token\nDetection | Megatron is a family of models that extend previously known architectures (namely GPT-2 and BERT originally, but also T5 more recently) by introducing model parallelism primitives. In the case of BERT, the authors also replace the next sentence prediction head with sentence order prediction and use whole word n-gram masking. 
| Same as GPT-2 with the only addition of alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer | Separate positional embedding vector independent from the\ncontent embedding using disentangled attention matrices for contents and\nrelative positions | Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that elminates the quadratic dependency thus making it more suitable for longer sequences | Extension of BERT architecture to train on patches of images | A differential variational auto-encoder is used to learn the visual codebook. The transformer is a variation of GPT-3 | Goal to increase parameter count while keeping FLOP operations constant by using efficient routing of MoE (Mixture of Experts) | Combines Resnet and ViT for the visual encoding with Transformer for the Textual encoder | Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens | Extends ViT by replacing the standard multi-head self attention (MSA) module by a module based on shifted windows (Swin) allowing ViT-like architectures to generalize to higher resolution images | GPT-J 6B is a Transformer model trained using Mesh Transformer\nJAX and same tokenizer as GPT2/3 | Decision transformers use a GPT architecture and extend it by encoding trajectories in a way that they can be learned by an auto-regressive task | Similarly to the Decision transformers, the main extension introduced by Trajectory Transformers is a way to encode a trajectory (state, actions, rewards) | As opposed to BART, they don’t do sentence shuffling | Very similar to GPT-3, but far more parameters and improved training efficiency mostly because of the improved tokenizer. Also, different ratio of depth to breadth | Uses parallelization similar to Megatron to train a LM double the size of GPT-3 | These models do not introduce novelties at the architecture/pretraining level and they are based on GPT-3 but rather focuses on how to improve alignment through fine-tuning and prompting. Note that the Anthropic Assistant includes several models optimized for different tasks. Latest versions of this work focus on the benefits of RLHF. | GLaM introduces a Mixture of 64 Experts to increase parameter count and generalization properties in a somewhat standard decoder-only. Transformer architecture. Only two experts get activated at a time per token, which makes the model also more efficient in training and inference. | GLIDE can be seen as an extension of the ADM (Ablated Diffusion Model) by the same authors. However, ADM is not per se a transformer architecture although it does resemble one in some of the configurations the authors use. Given that ADM is by the same authors and was quickly followed up by GLIDE, I think it is fair to consider GLIDE as the first of its kind. | Same as GPT-2 but use RSNorm instead of LayerNorm and relative positional encoding rather than absolute | Stable diffusion is basically the Latent Diffusion model developed by LMU Munich researchers + some learnings on conditional diffusion from DALL-e and Imagen | This is somewhat similar to HTML in its use of structured\ntraining data. However, it is a different architecture and uses causal\nmasking, which makes the model predict, at the end of the sequence,\nan entire missing span of text. It also includes image input via Vector\nQuantized Variational Autoencoding (VQ-VAE) tokens. 
| LAMDA focuses on how to improve safety, quality, and groundeness using different fine-tuning strategies | GPTInstruct starts off with a pretrained GPT3 model and adds reward modeling through reinforcement learning after a supervised finetuning | Zero-shot task learning. The output space for a given task is either one of several classes (classification) or free text (generation). | Same as Gopher but with optimizations to reduce model size and therefore training/inference time with equal or superior performance | Adds quantization and distillation to a BART model to improve performance and model size | GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn whether not only a response is plausible but also supported | SeeKer is an extension that can be applied to any Transformer architecture by introducing “search”, “knowledge”, and “response” modules that are introduced during pretraining | GLM has a bidirectional encoder and a unidirectional decoder in a unified model | T0 stands for "T5 for Zero Shot", obtained by fine-tuning\nthe T5 model on multitask mixture covering many different NLP tasks.\nCompared with T0, T0p and T0pp were fine-tuned with more datasets.\nT0pp is recommended as it leads (on average) to the best performances on\na variety of NLP tasks. | Combines CLIP encoder and Diffusion decoder similar to GLIDE | It uses a frozen textual language model (like Chinchilla) conditioned on the visual representation, which is encoded from a Normalizer-Free ResNet | PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, Shared Input-Output Embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data | Similar to GPT-3 with rotary encoders instead of positional, parallel attention and feed forward layers, different initialization, and all dense layers instead of alternate dense/sparse | The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus add position encodings to add spatial information when applicable. | Basically same architecture as GPT-3 but with some training improvements introduced in Megatron-LM | UL2-20B (Unifying Language Learning) can be interpreted as\na model that is quite similar to T5 but trained with a different objective\nand slightly different scaling knobs. | hierarchical ViT architecture consisting of local and global self-attention modules | Imagen adds a few extensions to the U-net diffusion architecture (pooled embedding vector, cross attention over text embeddings, and Layer Normalizations) | Extends PaLM by fine-tuning on the mathematical dataset | In contrast\nwith earlier models such as DialoGPT,\nGODEL leverages a new phase of grounded\npre-training designed to better support adapting\nGODEL to a wide range of downstream\ndialog tasks that require information external\nto the current conversation (e.g., a database\nor document) to produce good responses. | Main difference to GPT-3 is that it uses full attention instead of sparse attention | BlenderBot 3 is based on a pre-trained OPT. It adds features needed for a dialog agent such as long-term memory or the ability to search the internet. It is also fine-tuned for some specific tasks given human feedback on them. | Derived from BART and layernorms located exactly at the beginning of each layer. 
Encoder initialized with internal 10B pre-trained\nencoder. | Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning with Human Feedback). It also adds inline evidence a la GopherCite | instruction finetuning\nwith a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on\nchain-of-thought data | Flan-PaLM is generated by "Flan Finetuning" the PaLM\nmodels: (1) scaling the number of tasks to 1,836, (2) scaling the model\nsize, and (3) finetuning on chain-of-thought data. | Transformer based architecture in a decoder-only setup with\na few modifications. Data extensions include special tokens for working\nmemory, citations, genetic data, and a few other biology related tasks. | Fine-tunes BERT-based models to create text string embeddings\noptimized for semantic relatedness | Fine-tunes T5 explicitly to optimize encoder to produce a\ngeneral purpose text string embedding useful for many NLU tasks. | LLaMA uses a Transformer architecture, and with extensions:\nPre-normalization, SwiGLU activations, RoPE embeddings, reduced\nmemory usage and runtime through efficient implementation of the causal\nmulti-head attention, checkpointing to reduce the amount of activations\nthat are recomputed during the backward pass, model and sequence parallelism\nto reduce memory usage of the model, and uses 1.4T BPE tokens\nafter tokenization. | Alpaca is fine-tuned from a 7B LLaMA model | Trained with the library GPT-NeoX | a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs | stable-vicuna-13b-delta<SEP>Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning. | Evol-Instruct | it enhances the mathematical reasoning abilities for open-source pretrained large language model Llama-2 | MPT-7B-StoryWriter-65k+<SEP>MPT-7B-Chat<SEP>MPT-7B-Instruct | Falcon-Instruct | finetuning on complex explanation traces obtained from GPT-4 | Code Evol-Instruct | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. | MPT-30B 8k Context Window<SEP>MPT-30B-Chat<SEP>MPT-30B-Instruct | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. | Llama-2-Chat | Jais-chat | ||||
application | Text generation, but adaptable to many other NLP tasks when fine tuned. | General Language Understanding and Question Answering. Many other language applications followed | General language tasks | Text generation, but adaptable to many other NLP tasks when fine tuned. | General language tasks | Knowledge intensive related tasks that might benefit from\nknowledge graphs or entities such as entity recognition | Same as BERT | Same as BERT | Controllable text generation | Protein folding | Mostly text generation but also some text understanding tasks | Text generation in dialog settings | Same as BERT | Diverse set of downstream tasks including machine translation,\nquestion answering, abstractive summarization, and text classification | Translation and other cross-lingual language tasks | abstractive text summarization | Translation | Same as BERT | Same as base model | Initially text generation, but has over time been used for a large range of applications in areas such as code generation, but also image and audio generation | Same as BERT | Particularly well suited for longer sequences, not only in text but also e.g. in genomics | image classification | Text to image | General language tasks (e.g. question answering) | Image/Object classification | Text generation, but adaptable to many other NLP tasks when fine tuned. | Image (object detection, image classification..) | generating text from a prompt, but is advised to be finetuned for more effective performance | General RL (reinforcement learning tasks) | General RL (reinforcement learning tasks) | General purpose language model that allows structured HTML prompting | Similar to GPT-3 | Language generation and others (similar to GPT-3) | Different models with different applications from general dialog to code assistant. | General language modeling - tested across 29 NLP tasks | Text to image | Mostly Language Modeling and NLU, but also extensible like GPT | Text to image | Multimodal language model with the ability to do structured\nprompting, zero-shot captioning, image generation, and entity linking (via\ntarget text prediction of hyperlinks) | General language modeling, such as translation, summarization,\nquestion and answers | Knowledge-intensive dialog or language tasks | language understanding and generation tasks such as inference, sentiment analysis, paraphrase, closed-book QA, reading comprehension, coreference, summarization, translation, commonsense reasoning, and struct-to-text | Same as Gopher/GPT3 | Text generation and understanding | Dialog systems, Q&A, general language generation tasks | Same as base models | a General Language Model pretrained with an autoregressive\nblank-filling objective and can be finetuned on various natural language\nunderstanding and generation tasks. | Perform zero-shot inference tasks by specifying the query\nin natural language, and the models will generate a prediction. | Text to image | Text to image | PaLM is designed as a general purpose language model with\napplicability to hundreds of different language tasks | range of language-understanding, mathematics and knowledge-based tasks | Gato presents a generalizable agent that can be used beyond text to tasks such as playing Atari or controlling a robot arm. | Same as GPT-3 | A unified framework for pre-training models that are universally\neffective across datasets and setups. 
| image generation | Text to image | Mathematical reasoning | open-domain goal-directed dialog tasks such as knowledge-grounded response generation, task-oriented dialog, and conversational QA | Same as GPT-3 | same as GPT-3 | Summarization, multi-lingual machine translation and NLU tasks | Dialog agents and general language generation applications like Q&A | The primary use is to underestand how to improve large\nlanguage models with the right kind of instruction fine-tuning. The focus is\nresearch on zero-shot and in-context few-shot learning NLP tasks, such as\nreasoning, and question answering; advancing fairness and safety research,\nand understanding limitations of current large language models | Same as Flan-T5. The goal is to show Flan finetuning can\neven improve on the largest Google LMs (+9.4% improvement average\nacross tasks), with improvements to chain of thought, self consistency,\nmultilingual tasks, arithmetic reasoning | The models are designed to perform scientific tasks, including\nbut not limited to citation prediction, scientific QA, mathematical\nreasoning, summarization, document generation, molecular property prediction\nand entity extraction. | provide human-like conversational interactions and assist users in answering questions, generating text, providing recommendations, and engaging in natural language conversations. | Text embeddings for semantic relatedness tasks such as text\nclustering or search retrieval | Any NLU task requiring a single text string embedding. As\nof April 2023 InstructOR is the top-ranked system on the Massive Text\nEmbedding Benchmark (MTEB). | Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning,\nCode generation, Reading comprehension, and multitask understanding. | Evaluated on a variety of text generation and classification\ntasks. | Research on language model’s behavior, functionality, and\nlimitations | Creating highly realistic and contextually accurate human-like text generation | chatbot | assistant in foundational NLP tasks such as reasoning or multi-domain and multi-genre QA, code generation, etc. | surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B, Llama-1 65B, Falcon-40B, MPT-30B8, Baichuan-13B Chat and ChatGLM2 12B on both GSM8k and MATH. | Multiple tasks similar to LLaMA such as reasoning or code generation | potential in generating code across multiple programming languages, and acting as a virtual technical assistant without the need for instruction-tuning or RLHF | Research on large language models; as a foundation for further specialization and finetuning for specific usecases (e.g., summarization, text generation, chatbot, etc.) | various NLP tasks including Bio Olympiad, Forming Inequalities, Counterfactual Question Answering, Compound Interest Problems, Spatial Reasoning, Commonsense Question Answering, Hallucination, Quadratic Equation Solving, Meeting Transcript Processing | code understanding, code generation, instruction fine-tuning for code tasks, tested on benchmarks (HumanEval, HumanEval+, MBPP, DS-1000), generating accurate code responses with clear explanations. | same as LLaMA | Multiple text generation tasks. This model demonstrates good coding ability better than its predecessor as well as specifically finetuned code writers. 
| same as LLaMA | Research on LLMs demonstrating multitask abilities such as reading comprehension, commonsense reasoning, math and coding abilities | various NLP tasks encompassing world knowledge and commonsense reasoning |
has source code | https://github.com/openai/finetune-transformer-lm<SEP>https://huggingface.co/docs/transformers/model_doc/openai-gpt | https://huggingface.co/docs/transformers/model_doc/bert | https://github.com/chiayewken/transformer_xl<SEP>https://huggingface.co/docs/transformers/model_doc/transfo-xl | https://huggingface.co/docs/transformers/model_doc/gpt2 | https://huggingface.co/docs/transformers/model_doc/xlnet | https://github.com/thunlp/ERNIE | https://github.com/facebookresearch/fairseq/tree/main/examples/roberta<SEP>https://huggingface.co/docs/transformers/model_doc/roberta | https://github.com/google-research/albert<SEP>https://huggingface.co/docs/transformers/model_doc/albert | https://github.com/salesforce/ctrl<SEP>https://huggingface.co/docs/transformers/model_doc/ctrl | https://github.com/deepmind/alphafold | https://huggingface.co/docs/transformers/model_doc/bart | https://github.com/microsoft/DialoGPT<SEP>https://huggingface.co/docs/transformers/model_doc/dialogpt | https://huggingface.co/docs/transformers/model_doc/distilbert | https://github.com/google-research/text-to-text-transfer-transformer<SEP>https://huggingface.co/docs/transformers/model_doc/t5 | https://huggingface.co/docs/transformers/model_doc/xlm-roberta | https://github.com/google-research/pegasus<SEP>https://huggingface.co/docs/transformers/model_doc/pegasus | https://github.com/facebookresearch/fairseq/tree/main/examples/mbart<SEP>https://huggingface.co/docs/transformers/model_doc/mbart | https://github.com/google-research/electra<SEP>https://huggingface.co/docs/transformers/model_doc/electra | https://github.com/NVIDIA/Megatron-LM | https://platform.openai.com/docs/models/gpt-3-5<SEP>https://github.com/openai/gpt-3 | https://github.com/microsoft/DeBERTa<SEP>https://huggingface.co/microsoft/deberta-v2-xxlarge<SEP>https://huggingface.co/microsoft/deberta-v2-xlarge<SEP>https://huggingface.co/microsoft/deberta-xlarge<SEP>https://huggingface.co/microsoft/deberta-large | https://github.com/google-research/bigbird<SEP>https://huggingface.co/docs/transformers/model_doc/big_bird | https://github.com/google-research/vision_transformer<SEP>https://huggingface.co/docs/transformers/model_doc/vit | https://github.com/openai/DALL-E<SEP>https://github.com/borisdayma/dalle-mini | https://github.com/google-research/t5x<SEP>https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py | https://github.com/openai/CLIP<SEP>https://huggingface.co/docs/transformers/model_doc/clip | https://github.com/EleutherAI/gpt-neo<SEP>https://huggingface.co/docs/transformers/model_doc/gpt_neo | https://github.com/microsoft/Swin-Transformer | https://huggingface.co/EleutherAI/gpt-j-6b<SEP>https://github.com/kingoflolz/mesh-transformer-jax | https://github.com/kzl/decision-transformer<SEP>https://huggingface.co/docs/transformers/main/en/model_doc/decision_transformer | https://trajectory-transformer.github.io/<SEP>https://github.com/JannerM/trajectory-transformer | https://github.com/ai21labs/lm-evaluation | https://github.com/openai/glide-text2im | https://github.com/CompVis/latent-diffusion<SEP>https://huggingface.co/CompVis/stable-diffusion<SEP>https://huggingface.co/spaces/stabilityai/stable-diffusion<SEP>https://github.com/Stability-AI/stablediffusion | https://github.com/conceptofmind/LaMDA-rlhf-pytorch | https://github.com/openai/following-instructions-human-feedback | https://github.com/google-research/FLAN | https://github.com/amazon-science/dq-bart | https://parl.ai/projects/seeker/ | 
https://github.com/THUDM/GLM-130B | https://huggingface.co/bigscience/T0 | https://github.com/lucidrains/flamingo-pytorch | https://github.com/lucidrains/PaLM-pytorch | https://github.com/EleutherAI/gpt-neox<SEP>other gpt-neo models<SEP>https://huggingface.co/EleutherAI/gpt-neox-20b | https://github.com/OrigamiDream/gato | https://github.com/facebookresearch/metaseq<SEP>https://huggingface.co/facebook/opt-350m | https://github.com/google-research/google-research/tree/master/ul2 | https://github.com/NVlabs/GCVit | https://huggingface.co/microsoft/GODEL-v1_1-large-seq2seq?text=Hey+my+name+is+Mariama%21+How+are+you%3F<SEP>https://huggingface.co/microsoft/GODEL-v1_1-base-seq2seq?text=Hey+my+name+is+Julien%21+How+are+you%3F<SEP>https://github.com/microsoft/GODEL | https://huggingface.co/docs/transformers/model_doc/bloom | https://parl.ai/projects/bb3/<SEP>https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/model_card.md<SEP>https://github.com/facebookresearch/ParlAI/blob/main/projects/bb3/agents/README.md | https://github.com/amazon-science/alexa-teacher-models | https://github.com/google-research/t5x<SEP>https://huggingface.co/docs/transformers/model_doc/flan-t5 | https://huggingface.co/intfloat/e5-large-v2<SEP>https://huggingface.co/intfloat/e5-base-v2<SEP>https://huggingface.co/intfloat/e5-small-v2<SEP>https://github.com/microsoft/unilm/tree/master/e5 | https://huggingface.co/hkunlp/instructor-xl | https://huggingface.co/docs/transformers/main/model_doc/llama<SEP>https://github.com/facebookresearch/llama | https://github.com/tatsu-lab/stanford_alpaca | pythia-deduped<SEP>pythia<SEP>https://github.com/EleutherAI/pythia | https://chat.lmsys.org/<SEP>https://huggingface.co/lmsys/vicuna-33b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-13b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-7b-v1.3<SEP>https://github.com/lm-sys/FastChat | https://github.com/nlpxucan/WizardLM | https://github.com/nlpxucan/WizardLM | https://huggingface.co/mosaicml/mpt-7b | https://github.com/bigcode-project/starcoder | https://huggingface.co/tiiuae/falcon-7b<SEP>https://huggingface.co/tiiuae/falcon-40b | https://github.com/nlpxucan/WizardLM | https://github.com/openlm-research/open_llama<SEP>https://huggingface.co/openlm-research/open_llama_13b<SEP>https://huggingface.co/openlm-research/open_llama_7b<SEP>https://huggingface.co/openlm-research/open_llama_3b | https://huggingface.co/mosaicml/mpt-30b | https://huggingface.co/openlm-research/open_llama_7b_v2<SEP>https://huggingface.co/openlm-research/open_llama_3b_v2 | https://github.com/facebookresearch/llama | https://huggingface.co/inception-mbzuai/jais-13b | |||||||||||||||||
blog post | https://medium.com/@hyponymous/paper-summary-improving-language-understanding-by-generative-pre-training-7b77babd7086<SEP>https://www.reddit.com/r/MachineLearning/comments/n36htr/p_gpt1_annotated_paper_paper_summary/ | https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb<SEP>https://www.philschmid.de/bert-text-classification-in-a-different-language | https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html | https://openai.com/research/better-language-models<SEP>https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface | https://towardsdatascience.com/xlnet-explained-in-simple-terms-255b9fb2c97c | http://research.baidu.com/Blog/index-view?id=160 | https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ | https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html | https://blog.salesforceairesearch.com/introducing-a-conditional-transformer-language-model-for-controllable-generation/ | https://www.deepmind.com/publications/highly-accurate-protein-structure-prediction-with-alphafold<SEP>https://fabianfuchsml.github.io/alphafold2/ | https://huggingface.co/microsoft/DialoGPT-medium?text=Hey+my+name+is+Mariama%21+How+are+you%3F | https://medium.com/huggingface/distilbert-8cf3380435b5 | https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html | https://ai.meta.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/ | https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html | https://medium.com/syncedreview/facebook-ai-mbart-the-tower-of-babels-silicon-solution-610dfb494f98 | https://sh-tsang.medium.com/brief-review-electra-pre-training-text-encoders-as-discriminators-rather-than-generators-9568050d3a86 | https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/<SEP>https://huggingface.co/blog/megatron-training | https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122<SEP>https://openai.com/blog/gpt-3-apps | https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/ | https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html<SEP>https://huggingface.co/blog/big-bird | https://www.v7labs.com/blog/vision-transformer-guide | https://openai.com/blog/dall-e/<SEP>https://ml.berkeley.edu/blog/posts/dalle2/ | https://www.alexanderthamm.com/en/blog/switch-transformer-upscaling-to-over-a-billion-parameters/ | https://openai.com/research/clip<SEP>https://medium.com/axinc-ai/clip-learning-transferable-visual-models-from-natural-language-supervision-4508b3f0ea46 | https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api<SEP>https://www.section.io/engineering-education/leveraging-gptneo-to-generate-ai-based-blog-content/ | https://www.section.io/engineering-education/an-overview-of-swin-transformer/ | https://en.wikipedia.org/wiki/GPT-J | https://sites.google.com/berkeley.edu/decision-transformer | https://bair.berkeley.edu/blog/2021/11/19/trajectory-transformer/ | https://www.ai21.com/blog/ai21-studio-use-cases<SEP>https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1 | 
https://developer.nvidia.com/megatron-turing-natural-language-generation<SEP>https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ | https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html | https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval | https://stability.ai/blog/stable-diffusion-public-release | https://lilianweng.github.io/posts/2022-06-09-vlm/ | https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html<SEP>https://blog.google/technology/ai/lamda/ | https://sh-tsang.medium.com/review-instructgpt-training-language-models-to-follow-instructions-with-human-feedback-7fce4bf9059a<SEP>https://openai.com/research/instruction-following | http://rylanschaeffer.github.io/blog_posts/2022-01-20-google-brain-flan.html<SEP>https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html | https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510<SEP>https://medium.com/mlearning-ai/language-models-need-proper-training-c71484727f00 | https://www.amazon.science/publications/dq-bart-efficient-sequence-to-sequence-model-via-joint-distillation-and-quantization | https://www.deepmind.com/blog/gophercite-teaching-language-models-to-support-answers-with-verified-quotes | http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/ | https://openai.com/product/dall-e-2<SEP>https://labs.openai.com/ | https://medium.com/geekculture/3-overlooked-things-deepminds-flamingo-a-large-model-for-computer-vision-84cd9d2f738c<SEP>https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model | https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/<SEP>https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html | https://blog.eleuther.ai/announcing-20b/ | https://www.deepmind.com/blog/a-generalist-agent<SEP>https://www.deepmind.com/publications/a-generalist-agent | https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ | https://blog.research.google/2022/10/ul2-20b-open-source-unified-language.html | https://towardsdatascience.com/global-context-vision-transformers-nvidias-new-sota-image-model-2923bdaf438e | https://imagen.research.google/ | https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html | https://www.microsoft.com/en-us/research/blog/godel-combining-goal-oriented-dialog-with-real-world-conversations/ | https://huggingface.co/blog/bloom-megatron-deepspeed<SEP>https://huggingface.co/blog/bloom-inference-pytorch-scripts<SEP>https://huggingface.co/blog/bloom-inference-optimization | https://ai.facebook.com/blog/blenderbot-3-a-175b-parameter-publicly-available-chatbot-that-improves-its-skills-and-safety-over-time/ | https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning | https://www.deepmind.com/blog/building-safer-dialogue-agents\n<SEP>https://medium.com/to-cut-a-long-paper-short/sparrow-improving-alignment-of-dialogue-agents-via-targeted-human-judgments-e0876402d800 | https://ai.googleblog.com/2023/02/the-flan-collection-advancing-open.html | https://galactica.org/ | https://openai.com/blog/chatgpt | https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/ | https://pub.towardsai.net/paper-review-instructor-one-embedder-any-task-6a846b0d3ba | 
https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ | https://medium.com/version-1/stanford-alpaca-a-small-yet-mighty-language-model-for-instruction-following-tasks-af9e92e87d9a | https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling | https://openai.com/research/gpt-4 | https://lmsys.org/blog/2023-03-30-vicuna/ | https://medium.com/@preangelleo/wizardlm-enhancing-large-language-models-with-ai-evolved-instructions-7fd4425afe80 | https://ollama.ai/blog/wizardmath-examples | https://www.youtube.com/watch?v=KSlWkrByc0o&t=9s<SEP>https://www.mosaicml.com/blog/mpt-7b | https://huggingface.co/blog/starcoder\n | https://huggingface.co/blog/falcon | https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/ | https://github.com/openlm-research/open_llama | https://www.mosaicml.com/blog/mpt-30b | https://github.com/openlm-research/open_llama | https://ai.meta.com/blog/llama-2/<SEP>https://ai.meta.com/resources/models-and-libraries/llama/ | https://inceptioniai.org/jais/ | ||||||||
license | closed source | Open, Apache 2.0 | N/A | closed source | Open, MIT license | closed source | N/A | Open, Apache 2.0 | Open, BSD-3-Clause license | Apache 2.0 | Open, Apache 2.0 | Open, MIT license | Open, Apache 2.0 | Apache 2.0 | N/A | Open, MIT license | Open, Apache 2.0 | Limited, Non-commercial usage | closed source | Open, MIT license | Open, Apache 2.0 | N/A | N/A | Open, Apache 2.0 | Open, MIT license | Open, MIT license | Open, MIT license | Apache 2.0 | Open, MIT license | Open, MIT license | N/A | Closed source, accessible through API | Limited, Non-commercial usage | N/A | closed source | Open, MIT license | closed source | open, CreativeML Open RAIL++-M License | N/A | closed source | Closed source, accessible through API | Apache 2.0 | closed source | Open, Apache 2.0 | closed source | the code is open sourced | Open, MIT license | Open, Apache 2.0 | Closed source, accessible through API | closed source | closed source | Apache 2.0 | closed source | MIT license | Open, Apache 2.0 | Limited, non-commercial license CC-BY-NC-SA-4.0 | closed source | closed source | MIT License | Open, but need to follow restrictions in Attachment A, BigScience\nRAIL License v1.0 | Limited, non-commercial, research only | Limited, non-commercial | closed source | Apache 2.0 | closed source | Limited, non-commerical CC BY-NC 4.0 license | Closed source, accessible through API | Open, MIT license | Open, Apache 2.0 | Limited, Non-commercial bespoke license | Limited, non-commercial license CC-BY-NC-SA-4.0 | Apache 2.0 | Closed source, accessible through API | Non-commercial license | Non-commercial license | Non-commercial license | Apache 2.0 | OpenRAIL-M license | Apache 2.0 | Non-commercial license | Non-commercial license | Apache 2.0 | Apache 2.0 | Apache 2.0 | Llama 2 Community License Agreement | Apache 2.0 | |
research problem | Large Language Models (LLMs)<SEP>transformer model (this entry is identical for every model in the catalog) |
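To make the tokenization row of the catalog concrete (see the note in that row), the sketch below contrasts the two schemes that dominate it: byte-level BPE, as used by the GPT family, and SentencePiece, as used by models such as T5, PaLM, and LLaMA. It is a minimal illustration, assuming the Hugging Face transformers and sentencepiece packages and the public gpt2 and t5-small checkpoints; the example sentence and checkpoint choices are illustrative and are not taken from any of the reviewed papers.

```python
# Minimal sketch (assumes `pip install transformers sentencepiece` and network
# access to the public checkpoints): compare a byte-level BPE tokenizer (GPT-2)
# with a SentencePiece-based tokenizer (T5) on the same input.
from transformers import AutoTokenizer

text = "Transformers tokenize text into subword units."

bpe_tok = AutoTokenizer.from_pretrained("gpt2")     # byte-level BPE
sp_tok = AutoTokenizer.from_pretrained("t5-small")  # SentencePiece

print("GPT-2 byte-level BPE:", bpe_tok.tokenize(text))
print("T5 SentencePiece    :", sp_tok.tokenize(text))  # note the leading '▁' word markers
print("GPT-2 input ids     :", bpe_tok(text)["input_ids"])
print("T5 input ids        :", sp_tok(text)["input_ids"])
```

A byte-level vocabulary guarantees that any UTF-8 string can be encoded without out-of-vocabulary tokens, which is why several of the catalogued models (e.g., LLaMA) add a byte fallback to their SentencePiece vocabularies.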
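The optimizer and parameter-count rows can be illustrated in the same spirit. The short sketch below, assuming PyTorch and the transformers library with the public gpt2 checkpoint, loads a small causal language model, counts its parameters (the quantity reported in millions in the table), and instantiates AdamW, the decoupled-weight-decay variant of Adam that most of the catalogued models report using. The learning rate and weight decay shown are generic fine-tuning defaults, not settings from any specific paper.

```python
# Minimal sketch: count a model's parameters and set up AdamW.
# Assumes `pip install torch transformers` and access to the public gpt2 checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # the 124M-parameter GPT-2

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total parameters     : {total / 1e6:.0f}M")
print(f"trainable parameters : {trainable / 1e6:.0f}M")

# AdamW decouples weight decay from the gradient update; the hyperparameters
# below are common fine-tuning defaults, not values from any reviewed paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
print(optimizer)
```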
Conclusion
Let's collaborate on keeping this review up-to-date, either directly here on the ORKG or via the following GitHub repository: https://github.com/jd-coderepos/llms.