A Review of Transformer Models
Aug 02 2023
Introduction
Large language models (LLMs) have emerged as a transformative force in natural language processing (NLP). Powered by deep learning techniques, these models have demonstrated remarkable capabilities in understanding and generating human-like text. The development of LLMs gained momentum with OpenAI's GPT (Generative Pre-trained Transformer) series, of which GPT-3 is among the most well-known and influential language models to date. With billions of parameters and extensive pre-training on vast amounts of internet text, ChatGPT, built on the GPT-3 family of models, has become a leading example of LLMs. It showcases extraordinary language comprehension, fluency, and contextual understanding, enabling it to excel across a wide range of NLP tasks. As these models continue to evolve and grow, they hold tremendous potential to revolutionize numerous applications in human-computer interaction, content generation, and information retrieval [@dale2021gpt],[@radford2019language].
The development of LLMs, which have revolutionized NLP, was directly facilitated by the introduction of the Transformer architecture in "Attention Is All You Need" by [@vaswani2017attention], on which they are based. Before Transformers, recurrent neural networks (RNNs) and variants such as LSTMs dominated sequential data processing but struggled with long-range dependencies and parallelization. The Transformer's attention mechanism lets it capture long-range dependencies efficiently, making it highly effective for NLP tasks. Notably, LLMs have been developed both by major research institutions and by commercial organizations. On the one hand, models like GPT-3, with 175 billion parameters, developed by OpenAI, are representative of the most well-known commercial LLMs, demonstrating human-like language understanding and generation capabilities. On the other hand, the research community has contributed several open-source large language models, such as T5 (Text-to-Text Transfer Transformer) by Google [@t5] and LLaMA (Large Language Model Meta AI) by Meta AI [@llama-1]. These models have also made significant impacts in various NLP applications, ranging from sentiment analysis to machine translation. Large language models are commonly pre-trained on vast amounts of internet text and then fine-tuned on specific tasks with labeled data to adapt their knowledge for targeted applications. Consequently, both paywalled and open-source large language models have played crucial roles in advancing the state of NLP, and they continue to pave the way for exciting developments in human-machine interaction and natural language understanding.
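To make the attention mechanism concrete, the sketch below implements single-head scaled dot-product attention, the core operation of the Transformer, in plain NumPy. The function name, toy shapes, and the absence of masking and multiple heads are simplifying assumptions for illustration, not a faithful reproduction of any production implementation.

```python
# Minimal NumPy sketch of single-head scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    # Every position mixes information from all other positions in one step,
    # which is how long-range dependencies are captured without recurrence.
    return weights @ V                                 # (seq_len, d_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # 4 toy tokens, 8-dim representations
print(scaled_dot_product_attention(x, x, x).shape)     # (4, 8)
```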
This article reviews various LLM model series and concludes with a comprehensive comparison of all models released to date.
Google's T5
Google's T5 was one of the earliest models to implement text-to-text generation as a sequence-to-sequence objective across a range of diverse NLP tasks: the model is fed some text for context or conditioning and is then asked to produce some output text. To specify which task the model should perform, a task-specific (text) prefix is added to the original input sequence before feeding it to the model. It turns out that a huge variety of NLP tasks can be cast in this format, including translation, summarization, and even classification and regression. As an example, to ask T5 to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.” For text classification tasks, the model simply predicts a single word corresponding to the target label. For example, on the MNLI benchmark [@mnli] the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. With this preprocessing, the input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word “entailment”. Google's subsequent work FLAN advocated a zero-shot task learning approach via instruction tuning: by using supervision to teach a language model to perform tasks described via instructions, the model learns to follow instructions and can do so even for unseen tasks. Distinguishing between T5 and FLAN prompts, T5 prompts were mostly just a tag for the dataset, which would not work in the zero-shot setting, whereas FLAN prompts resemble what one would use to ask a human to perform the task. And just as the dessert flan can top off any meal, the FLAN recipe can be applied on top of any pretrained language model; specifically, Google applied it to their pretrained LaMDA dialog model and their pretrained T5. The sketch below contrasts the two prompting styles.
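The sketch below runs a T5-style dataset prefix and a FLAN-style natural-language instruction through the publicly released Hugging Face checkpoints; the specific prompts and generation settings are illustrative assumptions rather than the papers' exact setups.

```python
# Contrast between T5 task prefixes and FLAN natural instructions,
# using public Hugging Face checkpoints.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def generate(model_name: str, prompt: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# T5: the task is signalled by a dataset-style prefix.
print(generate("t5-small", "translate English to German: That is good."))

# FLAN-T5: the task is described the way one would instruct a person,
# which is what enables zero-shot generalization to unseen tasks.
print(generate("google/flan-t5-small",
               "Is the following review positive or negative? Review: I loved this film."))
```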
A Catalog of Google's T5 series LLMs
Google's T5 is a versatile text-to-text model that can perform a wide range of language tasks. Their later-introduced FLAN (Fine-tuned LAnguage Net) recipe was designed to enhance the zero-shot performance of pretrained LLMs on unseen tasks through further instruction finetuning, producing instruction-tuned versions of their base models such as FLAN (instruction-tuned LaMDA-PT), Flan-T5, and Flan-PaLM. This comparison showcases this development based on the central characteristics of the models.
has model | T5 | LAMDA | FLAN | PaLM | Flan-T5 | Flan-PaLM |
model family | T5 | LaMDA-PT | LaMDA-PT | PaLM | T5 | PaLM |
date created | 2019-10-01 | 2022-01-01 | 2022-02-08 | 2022-04-01 | 2022-11-01 | 2022-11-01 |
pretraining task | Span Corruption | Causal language modeling | Causal language modeling | |||
pretraining architecture | Encoder/Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder |
fine-tuning task | finetuning on downstream tasks one at a time | based on multi-turn crowdsourced dialog datasets, LaMDA-PT is finetuned in a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context | Instruction Tuning | Instruction Tuning | Instruction Tuning | |
number of parameters | 60M, 220M, 770M, 3B, and 11B | 137B | 137B | 8B, 62B, and 540B | 80M (Flan-T5-Small), 250M (Flan-T5-Base), 780M (FLan-T5-Large), 3B (Flan-T5-XL), and 11B (Flan-T5-XXL). | 8B, 62B, 540B |
optimizer | AdaFactor | AdaFactor | AdaFactor | |||
tokenization | sentencepiece | sentencepiece | sentencepiece | |||
training corpus | Colossal Clean Crawled Corpus | 1.56T words from public dialog data and other public web documents. Overall, it consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words | FLAN is instruction tuned on 25 tasks spanning 62 datasets.<SEP>LaMDA-PT is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary | 780B tokens from multilingual social media conversations (50%), multilingual filtered webpages (27%), books in English (13%), code from Github (5%), multilingual Wikipedia (4%), and news in English (1%). Code includes 24 programming languages. | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT
extension | Same as original Transformer with some additions such as relative positional embeddings like Transformer XL | LaMDA focuses on how to improve safety, quality, and groundedness using different fine-tuning strategies | Zero-shot task learning. The output space for a given task is either one of several classes (classification) or free text (generation). | PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, Shared Input-Output Embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data | instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data | Flan-PaLM is generated by "Flan Finetuning" the PaLM models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data.
hardware used | TPUv3 | TPUv3 | TPUv3 | TPUv4 | ||
hardware information | we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods. TPU Pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. | LaMDA was pretrained on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch | instruction tuning takes around 60 hours on a TPUv3 with 128 cores | PaLM 540B is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism. Each Pod has 3072 TPU v4 chips attached to 768 hosts. | |
organization | Google | Google | Google | Google | Google | Google |
application | Diverse set of downstream tasks including machine translation, question answering, abstractive summarization, and text classification | General language modeling, such as translation, summarization, question and answers | language understanding and generation tasks such as inference, sentiment analysis, paraphrase, closed-book QA, reading comprehension, coreference, summarization, translation, commonsense reasoning, and struct-to-text | PaLM is designed as a general purpose language model with applicability to hundreds of different language tasks | The primary use is to understand how to improve large language models with the right kind of instruction fine-tuning. The focus is research on zero-shot and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models | Same as Flan-T5. The goal is to show Flan finetuning can even improve on the largest Google LMs (+9.4% improvement average across tasks), with improvements to chain of thought, self consistency, multilingual tasks, arithmetic reasoning
has source code | https://github.com/google-research/text-to-text-transfer-transformer<SEP>https://huggingface.co/docs/transformers/model_doc/t5 | https://github.com/google-research/FLAN | https://github.com/lucidrains/PaLM-pytorch | https://github.com/google-research/t5x<SEP>https://huggingface.co/docs/transformers/model_doc/flan-t5 | ||
blog post | https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html | https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html<SEP>https://blog.google/technology/ai/lamda/ | http://rylanschaeffer.github.io/blog_posts/2022-01-20-google-brain-flan.html<SEP>https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html | https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/<SEP>https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html | https://ai.googleblog.com/2023/02/the-flan-collection-advancing-open.html | |
license | Apache 2.0 | closed source | Apache 2.0 | closed source | Apache 2.0 | closed source |
EleutherAI's GPT-J, GPT-NeoX, and Pythia series
With the GPT-NeoX-20B model, EleutherAI sought to offer a strong open-source alternative to GPT-3; in its empirical evaluations it proved an effective model, coming fairly close to GPT-3 performance. At the time of its release in April 2022, it was the largest open-source model available. The earlier GPT-J-6B model had a similar objective: it outperformed GPT-3 Babbage and, in its weakest results, trailed the 175B GPT-3 model by about 10 points, still proving a strong open-source contender.
The Pythia series from EleutherAI focuses on foundational research on LLMs, centered on the question: "How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale?" It is well established that there are regular and predictable patterns in the behavior of trained language models as they scale [@scale1] [@scale2] [@ghorbani2021scaling] [@pu2021scaling] [@mikami2021scaling] [@hernandez2021scaling], but prior work connecting these "Scaling Laws" to the learning dynamics of language models is minimal. One of the driving reasons for this gap is a lack of access to appropriate model suites to test theories. EleutherAI addresses this with the Pythia suite of 8 main models at parameter sizes of 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B, each released in two variants trained on the original and the deduplicated Pile dataset, for 16 models in total. Furthermore, public access to 154 checkpoints for each of the 16 models is provided, alongside tools to download and reconstruct their exact training dataloaders for further study, as in the loading sketch below.
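For researchers who want to study these training dynamics, the checkpoints are hosted on the Hugging Face Hub and can be loaded at a chosen training step via the revision argument. The sketch below follows EleutherAI's published "stepN" naming convention; the particular model size and step are arbitrary illustrative choices.

```python
# Load an intermediate Pythia checkpoint to inspect training dynamics.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/pythia-160m-deduped"   # one of the 16 released models
revision = "step3000"                           # one of the 154 per-model checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_name)   # tokenizer is shared across steps
model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)

inputs = tokenizer("The scaling behaviour of language models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```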
A Catalog of EleutherAI's Large Language Models
Starting in early 2022, EleutherAI released open-sourced high-performing models built after GPT-2/GPT-3 with an intent to open-source strong LLMs. With its Pythia series, its initiative to open-source LLMs continues, which now extends into releasing the largest collection of 16 total open-source models at different parameter sizes to study the effects of scale on LLMs. This comparison showcases EleutherAI's models released so far in terms of their salient properties.
has model | GPT-Neo | GPT-J | GPT-NeoX-20B | Pythia |
model family | GPT | GPT | GPT | Pythia |
date created | 2021-03-01 | 2021-05-01 | 2022-04-01 | 2023-03-13 |
innovation | GPT-Neo by EleutherAI is an open-source alternative to proprietary models like GPT-3. Trained on the diverse "The Pile" dataset, its significance lies in democratizing access to large language models, fostering community-driven development, and promoting ethical considerations in AI research. | GPT-J-6B by EleutherAI stands out for its open-source accessibility, training on the diverse "Pile" dataset, and its development by an independent research organization. The model emphasizes democratization in AI, showcasing that state-of-the-art advancements aren't limited to large corporations. EleutherAI's transparent approach and emphasis on open research further distinguish GPT-J in the LLM landscape. | GPT-NeoX-20B is a 20 billion parameter autoregressive language model with notable architectural differences from GPT-3, including the use of rotary positional embeddings. It excels in few-shot reasoning and is distinctively open-sourced, making it a significant contribution to the public research community. The model was trained on the Pile dataset. | The paper introduces Pythia, a suite of 16 Large Language Models (LLMs) trained on consistent public data, spanning from 70M to 12B parameters. Pythia provides public access to 154 checkpoints for each model, facilitating research into LLM training dynamics, memorization, and bias reduction. This unique setup offers novel insights into LLM behavior and evolution. |
pretraining architecture | Decoder | Decoder | Decoder | Decoder |
pretraining task | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling |
training corpus | Pile - 840 GB open source text dataset that combines 22 preexisting datasets | Pile corpus, a large-scale curated dataset created by EleutherAI | Pile - 840 GB open source text dataset that combines 22 preexisting datasets | Two versions of each of the eight models are released: one trained on the original Pile corpus and another trained on the deduplicated Pile corpus.<SEP>Pile
optimizer | ZeRO<SEP>AdamW | ZeRO<SEP>Adam | ||
tokenization | same set of Byte Pair Encoding (BPE) Tokenizer as GPT-2/GPT-3 | same BPE tokenizer as GPT-2/GPT-3 | train a new BPE tokenizer based on the Pile | BPE tokenizer trained specifically on the Pile, same as used for GPT-NeoX-20B
number of parameters | 125M, 350M, 1.3B, and 2.7B | 6B | 20B | 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B |
maximum number of parameters (in million) | 2700 | 6000 | 20000 | 12000 |
hardware used | TPU v3-256 pod | AMD EPYC 7532 CPU<SEP>A100-SXM4-40GB GPU | A100-40GB GPU | |
hardware information | | | trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs | training the 70M, 160M, and 410M models required 32 A100 GPUs with 40GB RAM; the 1.0B, 1.4B, and 2.8B models used 64 A100-40GB GPUs; the 6.9B model used 128 GPUs; and the 12B model used 256 GPUs.
extension | Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens | GPT-J 6B is a Transformer model trained using Mesh Transformer JAX and same tokenizer as GPT2/3 | Similar to GPT-3 with rotary encoders instead of positional, parallel attention and feed forward layers, different initialization, and all dense layers instead of alternate dense/sparse | Trained with the library GPT-NeoX
organization | EleutherAI | EleutherAI | EleutherAI | EleutherAI |
application | Text generation, but adaptable to many other NLP tasks when fine tuned. | generating text from a prompt, but is advised to be finetuned for more effective performance | range of language-understanding, mathematics and knowledge-based tasks | Research on language models' behavior, functionality, and limitations
has source code | https://github.com/EleutherAI/gpt-neo<SEP>https://huggingface.co/docs/transformers/model_doc/gpt_neo | https://huggingface.co/EleutherAI/gpt-j-6b<SEP>https://github.com/kingoflolz/mesh-transformer-jax | https://github.com/EleutherAI/gpt-neox<SEP>other gpt-neo models<SEP>https://huggingface.co/EleutherAI/gpt-neox-20b | pythia-deduped<SEP>pythia<SEP>https://github.com/EleutherAI/pythia |
blog post | https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api<SEP>https://www.section.io/engineering-education/leveraging-gptneo-to-generate-ai-based-blog-content/ | https://en.wikipedia.org/wiki/GPT-J | https://blog.eleuther.ai/announcing-20b/ | https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling |
license | Open, MIT license | Apache 2.0 | Apache 2.0 | Apache 2.0 |
MosaicML's Mosaic Pretrained Transformer (MPT)
MosaicML, based in San Francisco, California, is a leading generative AI platform that empowers enterprises to build their own AI. On July 19, 2023, Databricks, the data and AI company, announced that it had completed its acquisition of MosaicML. MosaicML is widely known for its state-of-the-art MPT large language models (LLMs).
Its first model, released in early May 2023, was MPT-7B, a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. The base model is accompanied by three finetuned models for practical applications: 1) an instruction-tuned model called MPT-7B-Instruct (licensed for commercial use); 2) a non-commercially licensed MPT-7B-Chat, a conversational version of MPT-7B that has been finetuned using ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct, ensuring it is well equipped for a wide array of conversational tasks and applications (while MPT-7B-Instruct focuses on delivering a more natural and intuitive interface for instruction-following, MPT-7B-Chat aims to provide seamless, engaging multi-turn interactions for users); and 3) MPT-7B-StoryWriter-65k+, “a model designed to read and write stories with super long context lengths”, with a previously unheard-of 65,000-token context length. MPT-7B matches the quality of LLaMA-1-7B and outperforms other open-source 7B-20B models such as Pythia on standard academic tasks. The Falcon, Llama-2, and OpenLLaMA models were not part of the MPT-7B evaluations.
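The MPT-7B checkpoints are distributed through the Hugging Face Hub with custom modelling code, so, at the time of writing, loading them requires trusting remote code; the base model reuses EleutherAI's gpt-neox-20b tokenizer, as noted in the catalog below. The sketch that follows shows this loading pattern; the dtype and generation settings are illustrative assumptions.

```python
# Load MPT-7B from the Hugging Face Hub (custom model code in the repo).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # MPT reuses this tokenizer
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # MPT's architecture is defined in the model repo itself
)

inputs = tokenizer("MosaicML's MPT-7B is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```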
Subsequently, in June, MosaicML expanded the MPT series with the larger sibling of the 7B model, MPT-30B, a new, completely open-source model licensed for commercial use. This model is significantly more powerful than MPT-7B and outperforms GPT-3 on many benchmarks. It has also been released in two finetuned variants, MPT-30B-Instruct and MPT-30B-Chat, licensed for non-commercial use only. All models in this series demonstrate strong code generation abilities, and the MPT-30B models outperform TII's Falcon-40B.
Both MPT-7B and MPT-30B are trained on a large amount of data (1T tokens, like LLaMA, versus 300B for Pythia and 800B for StableLM).
With its models, MosaicML aims to build on and further strengthen the precedent for open-source LLM development set by afore-mentioned efforts such as the LLaMA series from Meta, the Pythia series from EleutherAI, the Falcon model from the Technology Innovation Institute (TII), and the OpenLLaMA model from Berkeley AI Research.
A Catalog of MosaicML's MPT Large Language Models
Thus far the MPT (Mosaic Pretrained Transformer) model series from MosaicML, i.e. MPT-7B and MPT-30B, has proven to be very competitive, with MPT-7B matching LLaMA-7B on many academic tasks and MPT-30B surpassing GPT-3, Llama-1, Falcon-40B, and prior models from EleutherAI on many benchmarks. This comparison showcases MosaicML's models released so far in terms of their salient properties. With increasingly stronger models released for research, the open-sourced development of LLMs offers a promising future trajectory.
has model | MPT-7B | MPT-30B |
model family | MosaicPretrainedTransformer (MPT) models | MosaicPretrainedTransformer (MPT) models |
date created | 2023-05-05 | 2023-06-22 |
pretraining architecture | Decoder | Decoder |
pretraining task | Language Modeling | Language Modeling |
training corpus | Semantic Scholar ORC<SEP>The Stack Dedup<SEP>RedPajama<SEP>C4<SEP>mC4 | RedPajama - StackExchange<SEP>RedPajama - arXiv<SEP>RedPajama - Books<SEP>Semantic Scholar ORC<SEP>The Stack - Markdown<SEP>RedPajama - Wikipedia<SEP>The Stack - Selected Languages<SEP>RedPajama - CommonCrawl<SEP>c4 - English - SemDedup<SEP>mC4 3.1.0 - English |
optimizer | LION | LION |
tokenization | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus |
number of parameters | 7B | 30B |
hardware used | A100-40GB GPU | A100-40GB GPU<SEP>H100-80GB GPU |
hardware information | A100-80GB GPU was used in model finetuning<SEP> took ~9.5 days to train on 440xA100-40GB GPUs, and cost ~$200k | The model was trained in three stages using the MosaicML Platform: (i) First it was trained on 440 A100-40GBs with a batch size of 1760. (ii) Then, on 216 A100-40GBs with a batch size of 1728. (iii) Training was completed on 256 H100-80GBs with a batch size of 512 with 8k context length and 50B tokens. The model was trained with sharded data parallelism using FSDP and used the LION optimizer. |
extension | MPT-7B-StoryWriter-65k+<SEP>MPT-7B-Chat<SEP>MPT-7B-Instruct | MPT-30B 8k Context Window<SEP>MPT-30B-Chat<SEP>MPT-30B-Instruct |
fine-tuning task | Conversations for the Chat model<SEP>Short instructions for the Instruct model<SEP>Excerpts of fiction books for StoryWriter | Finetuning on longer context data using sample subsets of datasets only samples with at least 4096 tokens of instances in base model dataset for the 8K context window model<SEP>A large collection of chat datasets for the Chat model<SEP>Instructions following for the Instruct model |
organization | MosaicML | MosaicML |
application | Multiple tasks similar to LLaMA such as reasoning or code generation | Multiple text generation tasks. This model demonstrates coding ability better than its predecessor's and even than some models specifically finetuned for code.
has source code | https://huggingface.co/mosaicml/mpt-7b | https://huggingface.co/mosaicml/mpt-30b |
blog post | https://www.youtube.com/watch?v=KSlWkrByc0o&t=9s<SEP>https://www.mosaicml.com/blog/mpt-7b | https://www.mosaicml.com/blog/mpt-30b |
license | Apache 2.0 | Apache 2.0 |
DeepMind's compute-optimal Chinchilla and other models
In a seminal 2022 paper called "Training Compute-Optimal Large Language Models", DeepMind carried out a detailed study of the performance of language models of various sizes and quantities of training data. The goal was to find the optimal number of parameters and volume of training data for a given compute budget. This paper is often referred to as the Chinchilla paper. Among its findings, it hints that many of the 100B-parameter LLMs like GPT-3 may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and under-trained, so that they would benefit from seeing more training data. The authors hypothesized that smaller models may be able to achieve the same performance as much larger ones if they are trained on larger datasets.
One important takeaway from the Chinchilla paper is that the optimal training dataset size for a given model is about 20 times larger than the number of parameters in the model; Chinchilla itself was trained to be compute-optimal. For a 70-billion-parameter model, the ideal training dataset contains 1.4 trillion tokens, or 20 times the number of parameters. By this measure, 175B-parameter models like GPT-3, OPT, and BLOOM were trained on datasets that are smaller than the Chinchilla-optimal size and may actually be under-trained. In contrast, LLaMA 65B was trained on 1.4 trillion tokens, close to the Chinchilla-recommended number. Another important result from the paper is that the compute-optimal Chinchilla model outperforms non-compute-optimal models such as GPT-3 on a large range of downstream evaluation tasks. With the results of the Chinchilla paper in hand, teams have recently started to develop smaller models that achieve similar, if not better, results than larger models trained in a non-optimal way. Moving forward, one can probably expect a deviation from the "bigger is always better" trend of the last few years as more teams and developers optimize their model designs. The rule of thumb itself is simple enough to state directly, as shown below.
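The tiny sketch below applies the roughly 20-tokens-per-parameter heuristic quoted above; the function name and the use of 20 as a default constant are the only assumptions, and the printed figures simply restate the numbers already discussed.

```python
# Worked example of the Chinchilla rule of thumb: ~20 training tokens per parameter.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return tokens_per_param * n_params

for name, params in [("Chinchilla-70B", 70e9), ("GPT-3-175B", 175e9)]:
    print(f"{name}: ~{chinchilla_optimal_tokens(params) / 1e12:.1f}T tokens")
# Chinchilla-70B: ~1.4T tokens  (matches the dataset it was actually trained on)
# GPT-3-175B: ~3.5T tokens      (far more than the ~500B tokens GPT-3 actually saw)
```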
A Catalog of DeepMind's LLMs including their seminal Chinchilla model
In their 2022 paper "Training Compute-Optimal Large Language Models", often called the "Chinchilla paper", researchers at DeepMind explored the ideal size and training data volume for language models within a set compute budget. The study suggested that many 100B-parameter models, like GPT-3, might be over-parameterized and under-trained, and proposed that smaller models could match the performance of larger ones if trained on more extensive datasets. The Chinchilla paper highlights that the best training dataset size for a model is roughly 20 times its parameter count; for instance, a 70B-parameter model needs 1.4 trillion tokens for optimal training. Models like GPT-3, OPT, and BLOOM, with 175B parameters, were trained on suboptimal dataset sizes, potentially under-training them. In comparison, LLaMA 65B was trained near Chinchilla's recommended size. Notably, the compute-optimal Chinchilla model surpassed models like GPT-3 in various tasks. Given these findings, there is a shift towards developing efficient smaller models, challenging the previous "bigger is always better" trend. This comparison characteristically describes Chinchilla as well as other models from DeepMind on certain salient properties of LLMs.
has model | AlphaFold | Gopher | Chinchilla | GopherCite | Flamingo | Gato | Sparrow |
model family | SE(3)-Transformer | GPT | GPT | Gopher | Chinchilla | “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | GPT |
date created | 2019-09-01 | 2021-12-01 | 2022-03-01 | 2022-03-01 | 2022-04-01 | 2022-05-01 | 2022-09-01 |
innovation | The main innovation of "Highly accurate protein structure prediction with AlphaFold" is the creation of AlphaFold, a deep learning model that accurately predicts protein structures. This extends the capabilities of Large Language Models by showcasing their potential to solve complex scientific challenges beyond language-related tasks, marking a significant advancement in the intersection of machine learning and biology. | The paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" emphasizes the benefits and limitations of scaling LLMs. It highlights significant performance advances with data quality and scale but notes uneven gains across tasks, especially in mathematical reasoning. The study also delves into the impact of scale on toxicity and bias in model outputs. | The paper "Training Compute-Optimal Large Language Models" introduces a methodology to optimally balance model size and training tokens under a fixed compute budget. The findings emphasize equal scaling of parameters and training tokens, highlighting the importance of high-quality dataset scaling. The research also underscores ethical concerns with training on vast datasets sourced from the web. | GopherCite, in the context of large language models, is designed to support its answers with verified quotes from sources. It employs reinforcement learning with unique training techniques and can decline to answer questions if uncertain about the quality of its response. This approach differentiates it from other models by emphasizing evidence-backed answers and selective response generation. | Flamingo is a Visual Language Model designed to bridge powerful pretrained vision-only and language-only models, allowing it to handle sequences of interleaved visual and textual data. It's trained on large-scale multimodal web corpora with interleaved text and images, enabling rapid adaptation to new tasks using few-shot learning. This approach allows Flamingo to outperform models fine-tuned on significantly more task-specific data. | Gato is a multi-modal, multi-task agent inspired by Large Language Models, capable of handling diverse tasks like playing games, captioning images, and robotics using a single neural network. It employs a unique tokenization approach to process varied data types, from text to images. This innovation allows Gato to generalize across a vast range of tasks, setting a new standard in the realm of LLMs. | Sparrow, developed by DeepMind, introduces innovations in the context of LLMs by utilizing targeted human judgments for alignment. It employs a unique approach of using a single external knowledge fragment for evidence and focuses on breaking down goals into detailed rules, enhancing the model's helpfulness and accuracy in responses. The model also integrates techniques like self-play, search, and fine-grained rules to shape its behavior. |
pretraining architecture | Encoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder |
pretraining task | Protein structure (folding) prediction | Causal language modeling | Causal language modeling | Causal language modeling | Log likelihood of text given some visual input | CLM (where tokens are either text or agent actions) | Causal language modeling
training corpus | 170,000 proteins from a public repository of protein sequences and structures | MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, Github, News, C4, and Wikipedia) | 1.4 trillion training tokens from MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, Github, News, C4, and Wikipedia) | Same as Gopher plus specific dataset generated in the RLHP process | MultiModal MassiveWeb (M3W): 185 million images and 182 GB text + a number of text paired with image datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average) | 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot) | Same as Chinchilla + interactive data gathering with human annotators during the RLHF process
optimizer | Adam optimizer | AdamW | AdaFactor | AdamW | Adam optimizer | AdaFactor | |
tokenization | sentencepiece | a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation | sentencepiece | sentencepiece |||
number of parameters | Base = 12M, Large = 18M, XLarge = 60M | 44M, 117M, 417M, 1.4B, 7.1B, and 280B | 70B | 280B | 80B (largest) | 79M, 364M, and 1.18B | 70B
maximum number of parameters (in million) | 60 | 280000 | 70000 | 280000 | 80000 | 1180 | 70000 |
hardware used | TPUv3 | TPUv4<SEP>TPUv3 | TPUv3 | TPUv4 | TPUv3 | ||
hardware information | We built our training and evaluation codebase with JAX and Haiku. In particular, we use JAX’s pmap transformation to efficiently express both data and model parallelism. We trained and evaluated all models on TPUv3 chips. The half-precision parameters and single-precision Adam state for Gopher occupy 2.5 TiB, which far exceeds the 16 GiB of memory available on each TPUv3 core. To address these memory concerns, we use optimiser state partitioning, model parallelism, and rematerialisation to partition the model state and reduce the activations so that they fit in TPU memory. | All models in this analysis have been trained on TPUv3/TPUv4 with JAX and Haiku. | shard the networks across 128 TPU v3 machines. | Our model and associated infrastructure were implemented using JAX and Haiku. All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. Megatron type sharding is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers were unsharded. ZeRO stage 1 is used to shard the optimizer state. | shard the models across 64 TPU v3 machines | |
extension | The original Alphafold used a BERT-style transformer. The details of Alphafold’s Transformer are not known, but it is believed it is an extension of the SE(3)-Transformer, a 3-D equivariant Transformer (see this blog post). | Same as GPT-2 but use RMSNorm instead of LayerNorm and relative positional encoding rather than absolute | Same as Gopher but with optimizations to reduce model size and therefore training/inference time with equal or superior performance | GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn whether a response is not only plausible but also supported | It uses a frozen textual language model (like Chinchilla) conditioned on the visual representation, which is encoded from a Normalizer-Free ResNet | The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus add position encodings to add spatial information when applicable. | Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning with Human Feedback). It also adds inline evidence a la GopherCite
organization | Deepmind | Deepmind | Deepmind | Deepmind | Deepmind | Deepmind | Deepmind |
application | Protein folding | Mostly Language Modeling and NLU, but also extensible like GPT | Same as Gopher/GPT3 | Dialog systems, Q&A, general language generation tasks | Visual language tasks such as image captioning and visual question answering (image and text to text) | Gato presents a generalizable agent that can be used beyond text to tasks such as playing Atari or controlling a robot arm. | Dialog agents and general language generation applications like Q&A
has source code | https://github.com/deepmind/alphafold | https://github.com/lucidrains/flamingo-pytorch | https://github.com/OrigamiDream/gato | ||||
blog post | https://www.deepmind.com/publications/highly-accurate-protein-structure-prediction-with-alphafold<SEP>https://fabianfuchsml.github.io/alphafold2/ | https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval | https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510<SEP>https://medium.com/mlearning-ai/language-models-need-proper-training-c71484727f00 | https://www.deepmind.com/blog/gophercite-teaching-language-models-to-support-answers-with-verified-quotes | https://medium.com/geekculture/3-overlooked-things-deepminds-flamingo-a-large-model-for-computer-vision-84cd9d2f738c<SEP>https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model | https://www.deepmind.com/blog/a-generalist-agent<SEP>https://www.deepmind.com/publications/a-generalist-agent | https://www.deepmind.com/blog/building-safer-dialogue-agents<SEP>https://medium.com/to-cut-a-long-paper-short/sparrow-improving-alignment-of-dialogue-agents-via-targeted-human-judgments-e0876402d800
license | Apache 2.0 | closed source | closed source | closed source | closed source | closed source | closed source |
MetaAI's LLaMA
In the spirit of building LLMs, Meta AI first introduced LLaMA-1, which was released under a non-commercial license. In subsequent work, Meta AI introduced the open-sourced Llama 2 and Llama 2-Chat. Crucially, both of these models support community development, marking a significant stride towards fostering transparency and promoting the development of more responsible, replicable LLMs. Llama 2 models are trained on 2 trillion tokens and have double the context length of Llama 1, and the Llama-2-Chat models have additionally been trained on over 1 million new human annotations. Furthermore, compared to Llama 1, for Llama 2 Meta performed more robust data cleaning, updated the data mixes, trained on 40% more total tokens, doubled the context length, and leveraged grouped-query attention (GQA) to improve inference scalability. Evaluations of open-source models, including Llama 1, Llama 2 base models, MPT (MosaicML), and Falcon, on standard academic benchmarks for commonsense reasoning, world knowledge, reading comprehension, math, MMLU, BBH, and AGI Eval indicate that Llama 2 outperforms the other models, including Llama 1. A sketch of the GQA mechanism follows, after which the comparison showcases Llama 1 versus 2 on their essential characteristics.
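The NumPy sketch below is a minimal illustration of GQA: several query heads share one key/value head, which is what shrinks the key/value cache that must be kept around during inference. The head counts, shapes, and function names are toy assumptions and do not reflect Llama 2's actual configuration.

```python
# Toy grouped-query attention: n_q query heads, only n_kv stored key/value heads.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q, seq, d); k, v: (n_kv, seq, d) with n_q divisible by n_kv."""
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv
    outputs = []
    for h in range(n_q):
        kv = h // group                                   # several query heads map to one kv head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v[kv])
    return np.stack(outputs)                              # (n_q, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))                           # 8 query heads
k = rng.normal(size=(2, 4, 16))                           # only 2 key/value heads are cached
v = rng.normal(size=(2, 4, 16))
print(grouped_query_attention(q, k, v).shape)             # (8, 4, 16)
```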
A Catalog of Meta's LLaMA Models
In the spirit of building LLMs, Meta AI first introduced LLaMA-1, which was released under a non-commercial license. In subsequent work, Meta AI introduced the open-sourced Llama 2 and Llama 2-Chat. Crucially, both Llama 2 and Llama 2-Chat are released under a license that authorizes commercial use, marking a significant stride towards fostering transparency and promoting the development of more responsible, replicable LLMs. Evaluation comparisons of open-source models, including Llama 1, Llama 2 base models, MPT (MosaicML), and Falcon, on standard academic benchmarks indicate that Llama 2 outperforms the other models, including Llama 1. This comparison showcases Llama 1 versus 2 on their essential characteristics.
has model | LLaMA | Llama 2 |
model family | LLaMa | LLaMa |
date created | 2023-02-27 | 2023-07-18 |
pretraining architecture | Decoder | Decoder |
pretraining task | Language Modeling | Self-supervised Learning
training corpus | Stack Exchange<SEP>ArXiv<SEP>Gutenberg and Books3<SEP>Wikipedia<SEP>Github<SEP>C4<SEP>CommonCrawl<SEP>approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328GB Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%) | No specific information available other than "A new mix of publicly available online data"
optimizer | AdamW | AdamW |
tokenization | the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece; notably, all numbers are split into individual digits, with a fallback to bytes to decompose unknown UTF-8 characters | same as Llama-1, which uses the BPE algorithm
number of parameters | 6.7B, 13.0B, 32.5B, and 65.2B | 7B, 13B, 34B, 70B |
hardware used | A100-80GB GPU | A100-80GB GPU |
hardware information | When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. | models were pretrained on Meta’s Research Super Cluster (RSC) as well as internal production clusters, both of which use NVIDIA A100s. |
fine-tuning task | instructions | both instruction tuning and reinforcement learning from human feedback for Llama-2-Chat |
extension | LLaMA uses a Transformer architecture, with extensions: Pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through efficient implementation of the causal multi-head attention, checkpointing to reduce the amount of activations that are recomputed during the backward pass, model and sequence parallelism to reduce memory usage of the model, and uses 1.4T BPE tokens after tokenization. | Llama-2-Chat
organization | Meta AI | Meta AI |
application | Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning, Code generation, Reading comprehension, and multitask understanding. | Research on LLMs demonstrating multitask abilities such as reading comprehension, commonsense reasoning, math and coding abilities
has source code | https://huggingface.co/docs/transformers/main/model_doc/llama<SEP>https://github.com/facebookresearch/llama | https://github.com/facebookresearch/llama |
blog post | https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ | https://ai.meta.com/blog/llama-2/<SEP>https://ai.meta.com/resources/models-and-libraries/llama/ |
license | Limited, Non-commercial bespoke license | Llama 2 Community License Agreement |
carbon emitted (tco2eq) | 14 for LLaMA-7B, 23 for LLaMA-13B, 90 for LLaMA-33B, and 173 for LLaMA-65B | 31.22 for Llama-2-7B, 62.44 for Llama-2-13B, 153.90 for Llama-2-34B, and 291.42 for Llama-2-70B |
Llama-1, Alpaca, and Vicuna
With major industry players introducing open-source models, the academic research world was not far behind. Before going into further details, note that a general difference between academic and industry players is that, more often than not, academics have access to far less compute than their industry counterparts.
In the spirit of fostering academic research on LLMs, a group of researchers at Stanford University released the first finetuned LLM of its kind, called Alpaca. They identify two important parts to creating an instruction-following model in the spirit of OpenAI's ChatGPT: 1) a strong pretrained base model to finetune, for which they selected LLaMA-1-7B, deemed most effective at the time; and 2) a highly effective finetuning dataset of instruction-following demonstrations, for which they generated 52K demonstrations using the self-instruct method with OpenAI's text-davinci-003. Alpaca is thus a language model fine-tuned with supervised learning from a LLaMA 7B model on 52K instruction-following demonstrations generated from OpenAI's text-davinci-003; a sketch of how such a demonstration is turned into a training example follows.
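The sketch below shows how one such instruction-following record can be turned into a supervised finetuning example. The prompt template is a simplified stand-in for the one used in the Stanford Alpaca repository (not a verbatim copy), and the record itself is a made-up illustration.

```python
# Turn an instruction-following record into a prompt/completion pair for SFT.
def format_example(record: dict) -> dict:
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
    )
    if record.get("input"):                       # some records carry extra context
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": record["output"]}

example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}
print(format_example(example)["prompt"])
```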
Inspired by Llama-1 and Alpaca, a team of students from UC Berkeley, in collaboration with UC San Diego and CMU, released the Vicuna chat model at a training cost of only around 300 dollars. Still based on the Llama models in several parameter sizes, Vicuna was finetuned on user conversations from the ShareGPT platform: the team collected 70K conversations from ShareGPT.com, a website where users can share their ChatGPT conversations. The team also made several improvements to the training recipe, including memory optimizations to enable Vicuna's understanding of long context. They also adjusted the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot's output, as sketched below. After fine-tuning Vicuna on the 70K user-shared ChatGPT conversations, it was found that Vicuna was capable of generating more detailed and well-structured answers compared to Alpaca, with quality on par with ChatGPT.
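The loss-masking idea can be sketched as follows: tokens from the human side of the conversation are assigned an ignore label so that only the assistant's tokens contribute to the cross-entropy loss. The helper names and the character-level toy tokenizer are assumptions for illustration, not Vicuna's actual training code.

```python
# Mask user turns so the finetuning loss is computed only on the chatbot's output.
IGNORE_INDEX = -100   # the label value ignored by PyTorch's cross-entropy loss

def build_labels(turns, tokenize):
    """turns: list of (role, text); returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(text)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # train on the assistant's tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask the human side
    return input_ids, labels

toy_tokenize = lambda s: [ord(c) for c in s]          # stand-in for a real tokenizer
conversation = [("user", "Hi!"), ("assistant", "Hello, how can I help?")]
ids, labels = build_labels(conversation, toy_tokenize)
print(sum(l != IGNORE_INDEX for l in labels), "of", len(labels), "tokens contribute to the loss")
```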
In a nutshell, Meta's Llama models are proving very effective in influencing academic research on optimizing LLMs via finetuning.
A Catalog of LLM's of Camelid Species: Llama-1, Alpaca, and Vicuna
This comparison showcases academically developed Alpaca from Stanford University and Vicuna from Berkeley in collaboration with other universities. Both models are finetuned variants based on Llama-1 developed by Meta AI.
has model | LLaMA | Alpaca | Vicuna |
model family | LLaMa | LLaMa | LLaMa |
date created | 2023-02-27 | 2023-03-01 | 2023-03-30 |
pretraining architecture | Decoder | Decoder | Decoder |
number of parameters | 6.7B, 13.0B, 32.5B, and 65.2B | 7B, 13B, 33B, 65B | 7B, 13B, 33B |
hardware used | A100-80GB GPU | A100-80GB GPU | A100-80GB GPU |
hardware information | When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. | fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. | the training was done with PyTorch FSDP on 8 A100 GPUs in one day |
extension | LLaMA uses a Transformer architecture, with extensions: Pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through efficient implementation of the causal multi-head attention, checkpointing to reduce the amount of activations that are recomputed during the backward pass, model and sequence parallelism to reduce memory usage of the model, and uses 1.4T BPE tokens after tokenization. | Alpaca is fine-tuned from a 7B LLaMA model | stable-vicuna-13b-delta<SEP>Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning.
fine-tuning task | instructions | 52K instruction-following data generated from OpenAI’s text-davinci-003 using self-instruct mechanism, from 175 human-written instruction-output pairs. | 70K user-shared conversations gathered from ShareGPT.com with public APIs |
organization | Meta AI | Stanford University | Large Model Systems Organization |
application | Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning, Code generation, Reading comprehension, and multitask understanding. | Evaluated on a variety of text generation and classification tasks. | chatbot
has source code | https://huggingface.co/docs/transformers/main/model_doc/llama<SEP>https://github.com/facebookresearch/llama | https://github.com/tatsu-lab/stanford_alpaca | https://chat.lmsys.org/<SEP>https://huggingface.co/lmsys/vicuna-33b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-13b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-7b-v1.3<SEP>https://github.com/lm-sys/FastChat |
blog post | https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ | https://medium.com/version-1/stanford-alpaca-a-small-yet-mighty-language-model-for-instruction-following-tasks-af9e92e87d9a | https://lmsys.org/blog/2023-03-30-vicuna/ |
license | Limited, Non-commercial bespoke license | Limited, non-commercial license CC-BY-NC-SA-4.0 | Non-commercial license |
optimizer | AdamW | ||
pretraining task | Language Modeling | ||
tokenization | the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece; notably, all numbers are split into individual digits, with a fallback to bytes to decompose unknown UTF-8 characters | |
training corpus | Stack Exchange<SEP>ArXiv<SEP>Gutenberg and Books3<SEP>Wikipedia<SEP>Github<SEP>C4<SEP>CommonCrawl<SEP>approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328GB Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%)
Berkeley AI Research's OpenLLaMA
The gap between commercial and non-commercial LLMs is rapidly narrowing. This is owing to open-sourced LLM development research efforts from Google with the T5 series, Meta with the LLaMA series, and now OpenLLaMA developed by researchers Xinyang Geng and Hao Liu from Berkeley AI Research.
With Google and Meta's models discussed above, this section presents the OpenLLaMA series engineered as a fully open-source model based on LLaMA-1. OpenLLaMA exhibits comparable performance to the original LLaMA and GPT-J across a majority of tasks, and outperforms them in some tasks.
A Catalog of BerkeleyAI's OpenLLaMA series
This comparison showcases OpenLLaMA, a permissively licensed open-source reproduction of Meta AI’s LLaMA-1. OpenLLaMA is released in two versions, with the second version improving on the first; both exhibit comparable performance to the original LLaMA-1 and GPT-J across a majority of tasks and even outperform them on some.
has model | OpenLLaMA v1 | OpenLLaMA v2 |
model family | LLaMa | LLaMa |
date created | 2023-06-16 | 2023-07-16 |
pretraining task | Causal language modeling | Causal language modeling |
pretraining architecture | Decoder | Decoder |
number of parameters | 3B, 7B, 13B | 3B, 7B |
optimizer | AdamW | AdamW |
training corpus | We train our models on the RedPajama dataset released by Together, which is a reproduction of the LLaMA training dataset containing over 1.2 trillion tokens.<SEP>RedPajama | The v2 models are trained on a mixture of the Falcon refined-web dataset, the StarCoder dataset and the wikipedia, arxiv, book and stackexchange part of the RedPajama dataset.<SEP>Starcoder<SEP>Falcon refined-web |
extension | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. |
hardware used | cloud TPU-v4 | cloud TPU-v4 |
organization | Berkeley AI Research | Berkeley AI Research |
application | same as LLaMA | same as LLaMA |
blog post | https://github.com/openlm-research/open_llama | https://github.com/openlm-research/open_llama |
has source code | https://huggingface.co/openlm-research/open_llama_13b<SEP>https://huggingface.co/openlm-research/open_llama_7b<SEP>https://huggingface.co/openlm-research/open_llama_3b | https://huggingface.co/openlm-research/open_llama_7b_v2<SEP>https://huggingface.co/openlm-research/open_llama_3b_v2 |
license | Apache 2.0 | Apache 2.0 |
OpenAI's GPT
OpenAI's GPT (Generative Pre-trained Transformer) models have undergone remarkable advancements, evolving from GPT-1's foundational contextual text generation to the groundbreaking capabilities of GPT-4. Beginning with GPT-1, these models harnessed the power of the transformer architecture and large-scale pre-training on diverse text sources to generate coherent and contextually relevant human-like text. GPT-2 pushed boundaries by demonstrating more coherent long-form text generation, albeit raising concerns about misuse. GPT-3 further amplified this prowess, showcasing its capacity for a wide array of tasks and its uncanny ability to understand and produce human-like language. GPT-4 represents a culmination of these developments, featuring multimodal input handling, even more refined context comprehension, nuanced responses, and enhanced attention to harm, safety, and helpfulness criteria for LLMs.
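Unlike the open-source models above, the later GPT models are accessed through OpenAI's API rather than downloaded. The sketch below uses the chat completions interface of the openai Python package as it existed at the time of writing (pre-1.0 interface); the API key placeholder, model choice, and prompt are illustrative assumptions.

```python
# Call the chat completions endpoint (openai Python package, pre-1.0 interface).
import openai

openai.api_key = "sk-..."  # replace with a real key from the OpenAI dashboard

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Transformer architecture in one sentence."},
    ],
)
print(response["choices"][0]["message"]["content"])
```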
A Catalog of OpenAI's GPT series
The comparison showcases the GPT series of Large Language Models from GPT-1 to the latest, at the time of creation of this comparison, GPT-4.
has model | GPT-1 | GPT-2 | GPT-3 | InstructGPT | ChatGPT | GPT-4 |
model family | GPT | GPT | GPT | GPT | GPT | GPT |
date created | 2018-06-01 | 2019-02-01 | 2020-05-01 | 2022-01-01 | 2022-11-30 | 2023-03-14 |
pretraining task | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling |
pretraining architecture | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder |
tokenization | byte pair encoding | byte pair encoding | ||||
number of parameters | 117M | 124M, 355M, 774M, 1.5B | 175B | Same as GPT3 | 175B | undisclosed
maximum number of parameters (in million) | 117 | 1500 | 175000 | 175000 | 175000 | undisclosed
training corpus | BookCorpus<SEP>Supervised Finetuning on several task-specific datasets for Natural Language Inference, Question Answering, Sentence similarity, and Classification. | https://github.com/openai/gpt-2-output-dataset<SEP>8 million web pages (40 GB), about 10X the data used for GPT-1. The WebText dataset is created by crawling all links at Reddit with at least 3 Karma points. | ~ 500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B) | Same as GPT3 for pretraining, but finetuned and optimized using labeler data and prompts | Human written prompt and interaction dataset collected through the OpenAI API |
fine-tuning task | Supervised discriminative finetuning | Reinforcement Learning from Human Feedback | Step 3. RLHF using Proximal Policy Optimization<SEP>Step 2. Collect comparison data and train a reward model<SEP>Step 1. Supervised fine-tuning | Reinforcement Learning from Human Feedback<SEP>Rule-Based Reward Model | |
extension | GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.<SEP>Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, or increased context size from 512 to 1024) | Same as GPT-2 with the only addition of alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer | GPTInstruct starts off with a pretrained GPT3 model and adds reward modeling through reinforcement learning after a supervised finetuning | a large-scale, multimodal model which can accept image and text inputs and produce text outputs | |
innovation | It can generate up to 768 words (equivalent to 1.5 pages)<SEP>demonstrate that language models begin to learn tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a task-agnostic, diverse dataset of millions of web-scraped webpages. The work proposes a central research question: do WebText LMs transfer well across domains and datasets? | It can generate up to 1,536 words (equivalent to 3 pages) | Better alignment of LLMs with human expectations using reinforcement learning through human feedback | It can generate up to 3000 words (equivalent to 6 pages)<SEP>Supports input context length of 2048 tokens | It can generate up to 24000 words (equivalent to 48 pages)<SEP>Supports input context length between 8192 and 32,768 tokens depending on the model version |
organization | OpenAI | OpenAI | OpenAI | OpenAI | OpenAI | OpenAI |
application | Text generation, but adaptable to many other NLP tasks when fine tuned. | Text generation, but adaptable to many other NLP tasks when fine tuned. | Initially text generation, but has over time been used for a large range of applications in areas such as code generation, but also image and audio generation | Knowledge-intensive dialog or language tasks | provide human-like conversational interactions and assist users in answering questions, generating text, providing recommendations, and engaging in natural language conversations. | Creating highly realistic and contextually accurate human-like text generation |
blog post | https://medium.com/@hyponymous/paper-summary-improving-language-understanding-by-generative-pre-training-7b77babd7086<SEP>https://www.reddit.com/r/MachineLearning/comments/n36htr/p_gpt1_annotated_paper_paper_summary/ | https://openai.com/research/better-language-models<SEP>https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface | https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122<SEP>https://openai.com/blog/gpt-3-apps | https://sh-tsang.medium.com/review-instructgpt-training-language-models-to-follow-instructions-with-human-feedback-7fce4bf9059a<SEP>https://openai.com/research/instruction-following | https://openai.com/blog/chatgpt | https://openai.com/research/gpt-4 |
has source code | https://github.com/openai/finetune-transformer-lm<SEP>https://huggingface.co/docs/transformers/model_doc/openai-gpt | https://huggingface.co/docs/transformers/model_doc/gpt2 | https://platform.openai.com/docs/models/gpt-3-5<SEP>https://github.com/openai/gpt-3 | https://github.com/openai/following-instructions-human-feedback | ||
license | closed source | closed source | closed source | Closed source, accessible through API | Closed source, accessible through API | Closed source, accessible through API |
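The three-step alignment recipe listed above for InstructGPT-style models (supervised fine-tuning on demonstrations, reward modeling from human comparisons, then reinforcement learning against that reward model) can be summarized in a toy sketch. The snippet below is purely illustrative: every function is a hypothetical stand-in rather than OpenAI code or API calls, and the PPO step of the original recipe is replaced by a simple reranking over candidates to keep the example runnable.

```python
# Toy sketch of the three-step InstructGPT-style alignment recipe listed in the
# table above. All "models" are simple Python stand-ins, not real neural networks.
from typing import Callable, List, Tuple

# Step 1: supervised fine-tuning (SFT) on human-written demonstrations.
def supervised_finetune(demonstrations: List[Tuple[str, str]]) -> Callable[[str], str]:
    lookup = dict(demonstrations)  # toy "policy": memorize the demonstrations
    return lambda prompt: lookup.get(prompt, f"response to: {prompt}")

# Step 2: train a reward model from human preference comparisons.
def train_reward_model(
    comparisons: List[Tuple[str, str, str]]  # (prompt, preferred, rejected)
) -> Callable[[str, str], float]:
    preferred = {(p, good) for p, good, _ in comparisons}
    return lambda prompt, response: 1.0 if (prompt, response) in preferred else 0.0

# Step 3: optimize the policy against the reward model. The original recipe uses
# PPO; here candidate responses are simply reranked to keep the sketch runnable.
def align_policy(policy: Callable[[str], str],
                 reward: Callable[[str, str], float]) -> Callable[[str], str]:
    def aligned(prompt: str) -> str:
        candidates = [policy(prompt), policy(prompt).upper()]  # stand-in for sampling
        return max(candidates, key=lambda r: reward(prompt, r))
    return aligned

demos = [("Explain RLHF briefly.", "RLHF aligns a model with human preferences.")]
policy = supervised_finetune(demos)
reward_model = train_reward_model(
    [("Explain RLHF briefly.",
      "RLHF aligns a model with human preferences.",
      "RLHF ALIGNS A MODEL WITH HUMAN PREFERENCES.")]
)
aligned_policy = align_policy(policy, reward_model)
print(aligned_policy("Explain RLHF briefly."))  # prints the preferred response
```

In the actual InstructGPT and ChatGPT pipelines each of these stand-ins is a large neural model, and the final step uses Proximal Policy Optimization rather than reranking.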
Microsoft's Wizard LLM series
WizardLM is an LLM based on LLaMA, trained on complex instruction data using a new method called Evol-Instruct. By using AI to "evolve" instructions, WizardLM outperforms similar LLaMA-based LLMs trained on simpler instruction data. On July 6, 2023, WizardLM V1.1 was released with significantly improved performance. Beyond the generalist LLM, WizardLM is also released as coding (WizardCoder) and math-solver (WizardMath) variants, which can handle more complex scenarios than previously released specialized models thanks to their fine-tuning on complex instructions (the Evol-Instruct loop is sketched below).
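The core of Evol-Instruct is an iterative loop in which an existing LLM rewrites instructions into progressively harder or more diverse variants, and the evolved instructions are merged back into the fine-tuning pool. The sketch below is a minimal illustration of that loop; the prompt template and the `call_llm` placeholder are assumptions for illustration, not the WizardLM authors' code, and a real implementation would query a model such as ChatGPT and filter out failed evolutions.

```python
# Illustrative sketch of an Evol-Instruct-style loop: each round asks an LLM to
# rewrite existing instructions into harder variants, then merges the evolved
# instructions back into the pool used for supervised fine-tuning.
# `call_llm` is a hypothetical stand-in for an API call (e.g., to ChatGPT).
from typing import List

IN_DEPTH_TEMPLATE = (
    "Rewrite the following instruction so it requires deeper reasoning, "
    "without changing its topic:\n{instruction}"
)

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would query an instruction-following LLM.
    return prompt.split("\n", 1)[-1] + " (explain your reasoning step by step)"

def evolve(instruction: str) -> str:
    return call_llm(IN_DEPTH_TEMPLATE.format(instruction=instruction))

def evol_instruct(seed_instructions: List[str], rounds: int = 3) -> List[str]:
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        frontier = [evolve(inst) for inst in frontier]
        pool.extend(frontier)  # merge evolved data with all previous rounds
    return pool                # the final pool is used for supervised fine-tuning

if __name__ == "__main__":
    seeds = ["Summarize the plot of Hamlet.", "Write a function that reverses a list."]
    for instruction in evol_instruct(seeds, rounds=2):
        print(instruction)
```

WizardCoder and WizardMath apply the same kind of loop to code and math seed instructions, respectively.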
A Catalog of the Wizard LLM series
This comparison juxtaposes all Wizard LLM variants based on their salient properties.
has model | WizardLM | WizardCoder | WizardMath |
model family | LLaMa | StarCoder | LLaMa |
date created | 2023-04-24 | 2023-06-14 | 2023-04-24 |
innovation | The paper introduces the Evol-Instruct methodology, allowing Large Language Models (LLMs) to autonomously generate diverse instructional data. This innovation shifts away from traditional human-generated instructions, leading to more robust models. The resulting model, WizardLM, demonstrated superior performance, highlighting the effectiveness of this approach. | The paper introduces WizardCoder, a model that empowers Code Large Language Models with complex instruction fine-tuning using the Evol-Instruct method tailored for code. This innovation allows WizardCoder to surpass other open-source Code LLMs and even outperform major closed LLMs on key benchmarks. The model fills a gap in the field by emphasizing instruction fine-tuning specifically for the code domain. | The main innovation of WizardMath in the context of Large Language Models (LLMs) is its enhanced mathematical reasoning capabilities. Through a case study, the model demonstrated its ability to solve complex problems using step-by-step reasoning, showcasing its advancement over other LLMs in mathematical tasks. |
pretraining architecture | Decoder | Decoder | Decoder |
fine-tuning task | supervised open-domain complex instruction finetuning | supervised complex code-based instruction finetuning | Reinforcement Learning from Evol-Instruct Feedback |
training corpus | 250K instructions auto-generated via the Evol-Instruct method using the OpenAI ChatGPT API | Initialized with the 20K instruction-following dataset called Code Alpaca, the Evol-Instruct technique is iteratively applied to this dataset to produce evolved data. After each round of data evolution, the evolved data from all previous rounds is merged with the original dataset to finetune StarCoder | Evolve the original math (GSM8k + MATH) instructions over 8 turns, increasing the data size from 15k to 96k<SEP>Few-shot re-generate 15k answers for GSM8k and MATH with an Alpha version of the WizardLM 70B model to produce solutions in a step-by-step format, keep those with a correct answer, and use this data to finetune the base Llama model<SEP>To enhance the model's ability to adhere to natural and diverse instructions, 1.5k open-domain conversations from WizardLM's training data are sampled and merged with the above math corpus as the final supervised finetuning training data |
number of parameters | 7B | 1B, 3B, 7B, 13B, 15B, and 34B | 7B, 13B, and 70B |
maximum number of parameters (in million) | 7000 | 34000 | 70000 |
extension | Evol-Instruct | Code Evol-Instruct | Enhances the mathematical reasoning abilities of the open-source pretrained large language model Llama-2 |
organization | Microsoft | Microsoft | Microsoft |
application | assistant in foundational NLP tasks such as reasoning, multi-domain and multi-genre QA, code generation, etc. | code understanding, code generation, and instruction fine-tuning for code tasks; tested on benchmarks (HumanEval, HumanEval+, MBPP, DS-1000), generating accurate code responses with clear explanations. | surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B, Llama-1 65B, Falcon-40B, MPT-30B, Baichuan-13B Chat and ChatGLM2 12B, on both GSM8k and MATH. |
blog post | https://medium.com/@preangelleo/wizardlm-enhancing-large-language-models-with-ai-evolved-instructions-7fd4425afe80 | https://ollama.ai/blog/wizardmath-examples | |
has source code | https://github.com/nlpxucan/WizardLM | https://github.com/nlpxucan/WizardLM | https://github.com/nlpxucan/WizardLM |
license | Non-commercial license | Non-commercial license | Non-commercial license |
hardware information | Trained on 8 V100 GPUs with DeepSpeed ZeRO-3 for 70 hours over 3 epochs (an illustrative ZeRO-3 configuration is sketched after this table). | |
hardware used | Nvidia V100 GPU | ||
optimizer | Adam optimizer |
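The hardware row above mentions training on 8 V100 GPUs with DeepSpeed ZeRO-3. As a rough illustration of what such a setup involves, the snippet below writes out a minimal ZeRO Stage-3 configuration; all values are placeholders chosen for illustration and are not the configuration used by the WizardLM authors.

```python
# A minimal sketch of a DeepSpeed ZeRO Stage-3 configuration of the kind that could
# back a multi-GPU fine-tuning run like the one described above.
# All values are illustrative placeholders, not the WizardLM authors' settings.
import json

ds_config = {
    "train_batch_size": 64,                 # global batch size across all 8 GPUs
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},              # mixed precision on V100-class hardware
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 2e-5, "betas": [0.9, 0.999], "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 3,                         # partition optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# The resulting JSON file would then be handed to whatever DeepSpeed integration the
# training script uses (launcher flag or Trainer argument; details vary by setup).
```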
A Comprehensive Catalog of Large Language Models (LLMs)
This last section brings together 87 different transformer-based language models in a comprehensive comparison of their characteristic properties.
A Catalog of Transformer Models
In the past few years we have seen the meteoric appearance of dozens of models of the Transformer family. This ORKG comparison contrasts models from different families, including GPT, T5, and LLaMA, on salient aspects of their innovations and differences. The comparison is inspired by the paper "Transformer models: an introduction and catalog" by Xavier Amatriain, while seeking to be an extendable list of both properties and new contributions as transformer models continue to be introduced. A minimal sketch of how such a property-by-model catalog can be represented programmatically is shown below, followed by the full comparison.
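Because the comparison is essentially a property-by-model matrix, it can also be handled programmatically. The sketch below shows one possible representation as typed records (an assumption for illustration, not ORKG's actual data model), using a few values taken from the table that follows.

```python
# A minimal sketch of how the catalog below could be represented and queried in code:
# each model is a record keyed by the same properties used as row headers in the
# comparison. The three example records are abbreviated from the full table.
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class ModelRecord:
    name: str
    family: str
    organization: str
    date_created: date
    pretraining_architecture: str  # "Encoder", "Decoder", or "Encoder/Decoder"

CATALOG: List[ModelRecord] = [
    ModelRecord("GPT-1", "GPT", "OpenAI", date(2018, 6, 1), "Decoder"),
    ModelRecord("BERT", "BERT", "Google", date(2018, 10, 1), "Encoder"),
    ModelRecord("LLaMA", "LLaMa", "Meta AI", date(2023, 2, 27), "Decoder"),
]

# Example query: all decoder-only models released after 2019.
recent_decoders = [
    m.name for m in CATALOG
    if m.pretraining_architecture == "Decoder" and m.date_created.year > 2019
]
print(recent_decoders)  # ['LLaMA']
```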
has model | GPT-1 | BERT | Transformer XL | GPT-2 | XLNet | ERNIE | RoBERTa | ALBERT | CTRL | AlphaFold | BART | DialoGPT | DistilBERT | T5 | XLM-RoBERTa | Pegasus | mBART | ELECTRA | Megatron | GPT-3 | DeBERTa | Big Bird | ViT | DALL-E | Switch | CLIP | GPT-Neo | Swin Transformer | GPT-J | Decision Transformers | Trajectory Transformers | HTLM | Jurassic-1 | Megatron-Turing NLG | Anthropic Assistant | GLaM | GLIDE | Gopher | StableDiffusion | CM3 | LAMDA | InstructGPT | FLAN | Chinchilla | DQ-BART | GopherCite | SeeKer | GLM | T0 | DALL-E 2 | Flamingo | PaLM | GPT-NeoX-20B | Gato | OPT | UL2 | Global Context ViT | Imagen | Minerva | Godel | BLOOM | BlenderBot 3 | AlexaTM 20B | Sparrow | Flan-T5 | Flan-PaLM | Galactica | ChatGPT | E5 | InstructOR | LLaMA | Alpaca | Pythia | GPT-4 | Vicuna | WizardLM | WizardMath | MPT-7B | StarCoderBase<SEP>StarCoder | Falcon | Orca | WizardCoder | OpenLLaMA v1 | MPT-30B | OpenLLaMA v2 | Llama 2 | JAIS |
model family | GPT | BERT | GPT | Transformer XL | BERT | BERT | BERT | SE(3)-Transformer | BERT for encoder, GPT for Decoder | GPT | BERT | T5 | RoBERTa | Transformer | BART | BERT | T5<SEP>BERT<SEP>GPT | GPT | BERT | BERT | BERT | GPT | T5 | Also using Resnet, ViT, and vanilla transformer for text<SEP>CLIP | GPT | ViT | GPT | GPT, Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | GPT, Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | BART | GPT | GPT | Transformer | Transformer | Diffusion models | GPT | Diffusion | HTLM | LaMDA-PT | GPT | LaMDA-PT | GPT | BART | Gopher | GPT (but can extend any family) | GLM (General Language Model) | T5 | GLIDE<SEP>CLIP | Chinchilla | PaLM | GPT | “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | GPT | Transformer | ViT | Diffusion models<SEP>CLIP<SEP>T5 | PaLM | T5<SEP>GPT | GPT | GPT | transformer | GPT | T5 | PaLM | transformer | GPT | BERT | T5 | LLaMa | LLaMa | Pythia | GPT | LLaMa | LLaMa | LLaMa | MosaicPretrainedTransformer (MPT) models | SantaCoder | transformer | LLaMa | StarCoder | LLaMa | MosaicPretrainedTransformer (MPT) models | LLaMa | LLaMa | GPT | ||
date created | 2018-06-01 | 2018-10-01 | 2019-01-01 | 2019-02-01 | 2019-05-01 | 2019-05-01 | 2019-07-01 | 2019-09-01 | 2019-09-01 | 2019-09-01 | 2019-10-01 | 2019-10-01 | 2019-10-01 | 2019-10-01 | 2019-10-01 | 2019-12-01 | 2020-01-01 | 2020-03-01 | 2020-03-01 | 2020-05-01 | 2020-06-01 | 2020-07-01 | 2020-10-01 | 2021-01-01 | 2021-01-01 | 2021-02-01 | 2021-03-01 | 2021-03-01 | 2021-05-01 | 2021-06-01 | 2021-06-01 | 2021-07-01 | 2021-09-01 | 2021-10-01 | 2021-12-01 | 2021-12-01 | 2021-12-01 | 2021-12-01 | 2021-12-01 | 2022-01-01 | 2022-01-01 | 2022-01-01 | 2022-02-08 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-03-01 | 2022-04-01 | 2022-04-01 | 2022-04-01 | 2022-04-01 | 2022-05-01 | 2022-05-01 | 2022-05-01 | 2022-06-01 | 2022-06-01 | 2022-06-01 | 2022-06-01 | 2022-07-01 | 2022-08-01 | 2022-08-01 | 2022-09-01 | 2022-11-01 | 2022-11-01 | 2022-11-01 | 2022-11-30 | 2022-12-01 | 2022-12-01 | 2023-02-27 | 2023-03-01 | 2023-03-13 | 2023-03-14 | 2023-03-30 | 2023-04-24 | 2023-04-24 | 2023-05-05 | 2023-05-09 | 2023-05-25 | 2023-06-05 | 2023-06-14 | 2023-06-16 | 2023-06-22 | 2023-07-16 | 2023-07-18 | 2023-08-30 |
organization | OpenAI | Google | Google<SEP>CMU | OpenAI | Google<SEP>CMU | Pengcheng Lab<SEP>Baidu | Google<SEP>University of Washington | Google | Salesforce | Deepmind | Facebook | Microsoft | Huggingface | Google | Facebook | Google<SEP>Imperial College London | Facebook | Google<SEP>Stanford | NVidia | OpenAI | Microsoft | Google | Google | OpenAI | Google | OpenAI | EleutherAI | Microsoft | EleutherAI | Facebook<SEP>Google<SEP>UC Berkeley | UC Berkeley | Facebook | AI21 | NVidia | Anthropic | Google | OpenAI | Deepmind | EleutherAI<SEP>Stability.ai<SEP>LMU Munich | Facebook | Google | OpenAI | Google | Deepmind | Amazon | Deepmind | Facebook | Tsinghua University | BigScience | OpenAI | Deepmind | Google | EleutherAI | Deepmind | Facebook | Google | NVidia | Google | Google | Microsoft | Huggingface<SEP>Big Science | Facebook | Amazon | Deepmind | Google | Google | Meta | OpenAI | Microsoft | Meta AI<SEP>University of Washington<SEP>University of Hong Kong | Meta AI | Stanford University | EleutherAI | OpenAI | Large Model Systems Organization | Microsoft | Microsoft | MosaicML | BigCode Project | Technology Innovation Institute | Microsoft | Microsoft | Berkeley AI Research | MosaicML | Berkeley AI Research | Meta AI | Cerebras<SEP>Mohamed bin Zayed University of Artificial Intelligence<SEP>Inception |
innovation | The paper introduces a framework for natural language understanding by first using generative pre-training on a diverse corpus and then fine-tuning for specific tasks. This approach improved state-of-the-art results on 9 out of 12 datasets, highlighting the potential of unsupervised learning combined with discriminative tasks. | BERT's primary innovation in Language Model Learning is the "masked language model" (MLM) approach, inspired by the Cloze task. This method masks random tokens in a sentence and trains the model to predict them, enabling bidirectional context understanding. | Transformer-XL introduces a segment-level recurrence mechanism and a novel positional encoding scheme to overcome the fixed-length context limitations of traditional Transformers. This allows it to capture dependencies 80% longer than RNNs and 450% longer than vanilla Transformers, addressing context fragmentation and improving efficiency in language modeling. | It can generate upto 768 words (equivalent to 1 1/2 page)<SEP>demonstrate that language models begin to learn tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a task-agnostic, diverse dataset of millions of web-scraped webpages. The work proposes a central research question: do WebText LMs transfer well across domains and datasets? | XLNet introduces a generalized autoregressive pretraining method that captures bidirectional context by considering all possible permutations of the factorization order. This approach overcomes BERT's limitations related to data corruption and token independence. Additionally, XLNet integrates techniques from Transformer-XL and offers architectural improvements for permutation-based modeling. | ERNIE innovatively incorporates knowledge from knowledge graphs (KGs) into language representation models. It fuses lexical, syntactic, and knowledge information, enabling enhanced performance on knowledge-driven tasks. This approach sets ERNIE apart from traditional models like BERT, which primarily rely on textual context. | The work introduced RoBERTa, an improved version of BERT, by optimizing design choices: extending training duration, removing the next sentence prediction objective, and dynamically altering the masking pattern. These modifications led RoBERTa to achieve state-of-the-art results on benchmarks like GLUE, RACE, and SQuAD, emphasizing the importance of refining pretraining strategies in Large Language Models. | The main innovation of the work is ALBERT, a language model that improves on existing large models like BERT by employing parameter reduction techniques, such as factorized embeddings and cross-layer parameter sharing. This allows ALBERT to achieve better performance and efficiency on natural language understanding tasks by training larger models with fewer parameters. | The main innovation of the work in the context of LLMs appears to involve advancements in model architecture, training techniques, and multitask learning to enhance the performance, efficiency, and ethical considerations of language models. | The main innovation of "Highly accurate protein structure prediction with AlphaFold" is the creation of AlphaFold, a deep learning model that accurately predicts protein structures. 
This extends the capabilities of Large Language Models by showcasing their potential to solve complex scientific challenges beyond language-related tasks, marking a significant advancement in the intersection of machine learning and biology. | The main innovation of the work is the introduction of BART, a pre-training approach that combines bidirectional and auto-regressive modeling. BART excels in generating coherent text, making it particularly effective for tasks like summarization, leveraging denoising tasks for multi-task learning, and achieving state-of-the-art results in text generation, especially abstractive summarization. | The main innovation of this work is DialoGPT, a large language model designed for open-domain conversations. DialoGPT is trained on a Reddit dataset, allowing it to generate coherent responses in multi-turn dialogues and adapt to different conversational domains through fine-tuning. It addresses context handling, offensive content mitigation, and employs human evaluation to assess its performance and variants. | The main innovation of this work is the creation of DistilBERT, a smaller language model derived from BERT using knowledge distillation during pre-training. It retains 97% of BERT's performance while being 40% smaller and 60% faster, making it efficient for on-device tasks and resource-constrained environments, thus addressing challenges related to large-scale language models' deployment. | The main innovation of Google's T5 language model is its "text-to-text" framework, where various tasks are formulated as converting input text to output text. This unified approach allows T5 to achieve state-of-the-art performance on diverse tasks without task-specific modifications, simplifying training and deployment. This innovation enhances efficiency and effectiveness in real-world applications of large language models. | The main innovation of the work is the development of the XLM-R model, trained on data from 100 languages, achieving superior performance on cross-lingual tasks. The research highlights the challenges of scaling multilingual models and suggests that increasing model capacity can address some limitations. This approach is especially beneficial for low-resource languages. | PEGASUS introduces a novel pre-training objective called Gap Sentence Generation (GSG), where "important" sentences are removed from a document and the model is trained to regenerate them. The method of selecting these principal sentences is based on their relevance to the entire document, measured using the ROUGE1-F1 score. This unique approach tailors the model for abstractive summarization, achieving state-of-the-art performance on various tasks. | mBART introduces a multilingual denoising pre-training method for neural machine translation using a sequence-to-sequence auto-encoder model. Unlike other models like XLM and MASS, mBART pre-trains both the encoder and decoder, making it more adaptable for translation tasks. The model's versatility is further showcased by its ability to handle various levels of multilinguality, from monolingual to 25 languages. | ELECTRA introduces a novel pre-training task called "replaced token detection" where, instead of masking tokens like in BERT, input tokens are replaced with plausible alternatives from a generator network. The model then acts as a discriminator, predicting if each token was replaced or original. 
This approach is more computationally efficient than traditional Masked Language Modeling (MLM) and offers superior performance, especially for smaller models. | Megatron-LM introduces an efficient intra-layer model parallelism approach for training large transformer models. Implemented seamlessly in PyTorch without custom modifications, it allows for the training of models with billions of parameters, achieving state-of-the-art results on datasets like WikiText103 and LAMBADA. This innovation pushes the boundaries of Large Language Models, offering a scalable solution for the research community. | GPT-3's primary innovation in the context of Large Language Models is its exceptional few-shot learning capabilities, allowing it to make accurate predictions using just a natural language prompt and a few task demonstrations. The model also introduced prompt-based and in-context learning methodologies. However, its vast size (175B parameters) poses challenges for real-world applications.<SEP>It can generate upto 1,536 words (equivalent to 3 pages) | The DeBERTa model introduces a disentangled attention mechanism, representing each word with separate vectors for content and position. It also incorporates an Enhanced Mask Decoder for better prediction of masked tokens during pre-training. These innovations lead to improved efficiency and performance over models like BERT and RoBERTa. | BigBird introduces a sparse attention mechanism, allowing it to efficiently handle sequences up to 8 times longer than traditional models like BERT. It combines global, sliding window, and random attention patterns to capture both local and long-range dependencies. This innovation enables superior performance on various NLP tasks without sacrificing efficiency. | The Vision Transformer (ViT) applies Transformers, typically used in NLP, directly to image patches without image-specific biases. It excels when pre-trained on larger datasets, outperforming traditional convolutional models like ResNets. This approach challenges the dominance of convolutional architectures in computer vision, mirroring the Transformer's rise in NLP. | The paper introduces a model with remarkable generalization, capable of creatively interpreting and combining unusual textual concepts into images. It also demonstrates combinatorial generalization and zero-shot image-to-image translation controlled by natural language, showcasing advancements in LLMs for text-to-image synthesis. | The Switch Transformer introduces a sparsely-activated model approach, enhancing the Mixture of Experts (MoE) models by simplifying their routing algorithm and reducing computational costs. It enables training large models with lower precision formats like bfloat16 and achieves up to 7x faster pre-training speeds. This innovation pushes LLM boundaries, scaling up to trillion parameter models with significant efficiency gains. | CLIP, in the context of Large Language Models, introduces a novel approach by leveraging natural language supervision with a dataset of 400 million (image, text) pairs. It excels in zero-shot learning, allowing it to classify images using textual descriptions without prior training on specific categories. This integration of vision and language offers a flexible, scalable solution with potential for diverse applications. | GPT-Neo by EleutherAI is an open-source alternative to proprietary models like GPT-3. 
Trained on the diverse "The Pile" dataset, its significance lies in democratizing access to large language models, fostering community-driven development, and promoting ethical considerations in AI research. | The Swin Transformer introduces a hierarchical representation for visual elements of varying scales and achieves linear computational complexity by computing self-attention within non-overlapping, shifted windows. This design efficiently adapts the Transformer architecture for high-resolution images, making it a robust backbone for computer vision tasks. The shifted window approach enhances modeling power while ensuring lower latency. | GPT-J-6B by EleutherAI stands out for its open-source accessibility, training on the diverse "Pile" dataset, and its development by an independent research organization. The model emphasizes democratization in AI, showcasing that state-of-the-art advancements aren't limited to large corporations. EleutherAI's transparent approach and emphasis on open research further distinguish GPT-J in the LLM landscape. | The Decision Transformer reframes Reinforcement Learning (RL) as a sequence modeling task, leveraging the Transformer architecture used in Large Language Models like GPT. It autoregressively models trajectories by conditioning on desired returns, past states, and actions, enabling it to generate optimal future actions. This approach offers a simplified and scalable solution to RL, especially in sparse reward settings. | The paper presents a novel approach to reinforcement learning (RL) by treating it as a sequence modeling problem. Using the Transformer architecture, traditionally employed in natural language processing, they model trajectories in RL tasks. This perspective simplifies design decisions and proves versatile across various RL challenges, including long-horizon tasks. | The HTLM model's primary innovation is its ability to directly model hyper-text (HTML) from web crawls, enabling structured prompting and auto-prompting. This approach leverages the inherent structure of HTML for tasks like zero-shot summarization and offers improved performance and data efficiency compared to traditional Large Language Models. | The Jurassic-1 model's primary innovation in the context of Large Language Models is its enhanced tokenization efficiency, achieved through a larger 256K vocabulary SentencePiece tokenizer. This tokenizer captures a mix of word pieces, whole words, and multi-word expressions, allowing Jurassic-1 to represent text with 28% fewer tokens than GPT-3, leading to faster processing and broader domain prediction capabilities. | The Megatron-Turing NLG 530B (MT-NLG) is the largest monolithic language model trained to date with 530 billion parameters. It was developed using advanced 3D parallelism techniques, a collaboration between NVIDIA Megatron-LM and Microsoft DeepSpeed. The model showcases superior in-context learning capabilities and sets new benchmarks in zero-shot and few-shot learning. | The paper introduces techniques to improve the alignment of Large Language Models (LLMs) with human values, specifically HHH--helpful, honest, and harmless. It delves into methods like context distillation and preference modeling over imitation learning. The goal is to ensure LLMs provide responses that are both relevant and in line with human expectations. | GLaM introduces a sparsely activated mixture-of-experts architecture, allowing it to scale to 1.2 trillion parameters while consuming only 1/3 of GPT-3's training energy. 
Despite its size, it achieves superior performance on 29 NLP tasks and is more energy-efficient than dense models like GPT-3. | The paper introduces two guidance techniques for text-guided image synthesis: CLIP guidance and classifier-free guidance. Of the two, classifier-free guidance produces higher-quality, photorealistic images that align closely with textual descriptions, outperforming previous models like DALL-E in evaluations. | The paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" emphasizes the benefits and limitations of scaling LLMs. It highlights significant performance advances with data quality and scale but notes uneven gains across tasks, especially in mathematical reasoning. The study also delves into the impact of scale on toxicity and bias in model outputs. | The CM3 model introduces a causally masked approach for generative modeling, blending the strengths of causal and masked language models. Trained on large-scale multi-modal documents, it can generate rich structured outputs and offers impressive zero-shot capabilities across text and image tasks. This innovation positions CM3 as a powerful advancement in the realm of Large Language Models. | LaMDA is a specialized dialog model that emphasizes safety and factual grounding. The model's innovation lies in its fine-tuning with annotated data and its ability to consult external knowledge sources. This approach aims to produce more accurate and safer dialog responses compared to traditional LLMs. | Better alignment of LLMs with human expectations using reinforcement learning through human feedback | The primary innovation of FLAN in the context of Large Language Models is instruction tuning, where models are finetuned on datasets described via natural language instructions. This method significantly enhances zero-shot learning abilities, with FLAN outperforming the 175B GPT-3 on numerous tasks. The approach emphasizes human-like prompts over traditional model-specific prompts used in models like GPT-3 and T5. | The paper "Training Compute-Optimal Large Language Models" introduces a methodology to optimally balance model size and training tokens under a fixed compute budget. The findings emphasize equal scaling of parameters and training tokens, highlighting the importance of high-quality dataset scaling. The research also underscores ethical concerns with training on vast datasets sourced from the web. | The paper introduces DQ-BART, which innovatively combines model distillation and quantization to compress sequence-to-sequence models. It uniquely initializes student models by copying specific layers from the teacher and distills both the encoder and decoder. This approach achieves significant model compression with minimal performance loss. | GopherCite, in the context of large language models, is designed to support its answers with verified quotes from sources. It employs reinforcement learning with unique training techniques and can decline to answer questions if uncertain about the quality of its response. This approach differentiates it from other models by emphasizing evidence-backed answers and selective response generation. | The main innovation of the model, SeeKeR, is its modular approach that combines internet search, knowledge generation, and response generation for more factual and up-to-date outputs. This method addresses the limitations of traditional large language models by ensuring accurate and current information in responses. 
SeeKeR outperforms existing models in open-domain knowledge-grounded conversations. | GLM's main innovation lies in its use of gap sentences for pretraining, along with a blank infilling objective during fine-tuning. This approach enhances the model's ability to generate coherent and contextually relevant text. By combining autoregressive decoding and text infilling, GLM achieves competitive performance on a variety of NLU tasks. | The paper introduces a model named T0, based on the T5 architecture, which is trained using a unique method of mapping natural language tasks into prompted forms. This approach, combined with training on multiple prompts and datasets, enables T0 to achieve robust generalization to unseen tasks, often surpassing larger models like GPT-3 in performance. The innovation lies in the effective use of prompts and multitask training to enhance zero-shot learning capabilities. | Flamingo is a Visual Language Model designed to bridge powerful pretrained vision-only and language-only models, allowing it to handle sequences of interleaved visual and textual data. It's trained on large-scale multimodal web corpora with interleaved text and images, enabling rapid adaptation to new tasks using few-shot learning. This approach allows Flamingo to outperform models fine-tuned on significantly more task-specific data. | To demonstrate the first large-scale use of Pathways -- a new\nML system which enables training a single model across thousands or tens of thousands of accelerator chips in a highly efficient manner. With Pathways, they trained a 540B parameter language model on 6144 TPU v4 chips at efficiency levels that could not be reached before for models of this scale. E.g., GPT-3 (175B), Gopher (280B), Megatron-Turing-NLG (530B). | GPT-NeoX-20B is a 20 billion parameter autoregressive language model with notable architectural differences from GPT-3, including the use of rotary positional embeddings. It excels in few-shot reasoning and is distinctively open-sourced, making it a significant contribution to the public research community. The model was trained on the Pile dataset. | Gato is a multi-modal, multi-task agent inspired by Large Language Models, capable of handling diverse tasks like playing games, captioning images, and robotics using a single neural network. It employs a unique tokenization approach to process varied data types, from text to images. This innovation allows Gato to generalize across a vast range of tasks, setting a new standard in the realm of LLMs. | The authors introduced OPT, a collection of auto-regressive language models ranging from 125M to 175B parameters, replicating GPT-3's performance. They applied the latest best practices in data curation and training efficiency and emphasized the importance of community collaboration for responsible LLM guidelines and ethical considerations. | The paper introduces the UL2 model, a unified framework for pre-training in NLP, featuring a novel Mixture-of-Denoisers (MoD) objective. This objective smoothly integrates various pre-training paradigms, such as span corruption and prefix language modeling. Additionally, UL2 introduces dynamic "mode switching" between different denoisers and showcases superior performance across diverse NLP tasks. | The Global Context Vision Transformer (GC ViT) introduces a hierarchical structure optimized for compute and parameter efficiency. 
It employs global query tokens for capturing contextual information and a novel downsampling module with Fused MB-Conv blocks to enhance inter-channel dependencies. These innovations lead to state-of-the-art performance across various computer vision tasks. | The "Imagen" model innovatively merges transformer language models with high-fidelity diffusion techniques to produce photorealistic images from text descriptions. This demonstrates that embeddings from text-only pretrained large language models are highly effective for text-to-image synthesis. | The main innovation of the "Godel" model lies in its novel approach to integrating external knowledge into dialogue generation using a knowledge selection mechanism. This enables the model to produce contextually relevant and accurate responses by incorporating relevant information from external sources, enhancing the capabilities of LLMs in generating grounded and informed dialogues. | BLOOM is an open-access Large Language Model (LLM) collaboratively designed by hundreds of researchers. Trained on the ROOTS corpus with 59 languages, it achieves competitive performance, especially after multitask prompted finetuning. The model and code are publicly released under the Responsible AI License. | BlenderBot 3 enhances its predecessor by grounding conversations with internet-based knowledge retrieval. It emphasizes fine-tuning with diverse datasets and incorporates advanced safety mechanisms, while also focusing on continual learning from public deployments. | The paper introduces AlexaTM 20B, the largest multilingual seq2seq model, trained on denoising and Causal Language Modeling tasks. This model excels at in-context learning with long contexts and showcases efficiency by outperforming much larger models like GPT3 175B in certain benchmarks. Additionally, it emphasizes a new paradigm for one-shot machine translation, especially for low-resource languages. | Sparrow, developed by DeepMind, introduces innovations in the context of LLMs by utilizing targeted human judgments for alignment. It employs a unique approach of using a single external knowledge fragment for evidence and focuses on breaking down goals into detailed rules, enhancing the model's helpfulness and accuracy in responses. The model also integrates techniques like self-play, search, and fine-grained rules to shape its behavior. | this paper explores instruction finetuning\nwith a particular focus on (1) scaling the number of tasks (1.8K fine-tuning tasks), (2) scaling the model size, and (3) finetuning on\nchain-of-thought data. This approach is compatible with various model sizes and architectures, with Flan-T5 models notably outperforming baseline T5 models. | The paper introduced an extended instruction fine-tuning for the Flan-PaLM model, scaling it to a 540B-parameter size and 1.8K fine-tuning tasks. They incorporated chain-of-thought (CoT) data, which enhanced performance across evaluations. This approach is compatible with various model sizes and architectures | Galactica is designed to revolutionize the way we access scientific knowledge, moving beyond the traditional store-and-retrieve approach. It efficiently absorbs technical knowledge, including complex LaTeX equations and chemical reactions, and demonstrates superior context-associative capabilities over traditional search engines. This model's architecture and approach highlight its potential to serve as a new, more efficient interface for scientific inquiry. 
| trained using Reinforcement Learning from Human Feedback (RLHF) to obtain better model alignment<SEP>It can generate upto 3000 words (equivalent to 6 pages)<SEP>Supports input context length of 2048 tokens | The paper introduces E5, a text embedding model trained contrastively using a curated dataset named CCPairs. A unique consistency-based filtering refines this dataset, ensuring high-quality training data. E5's efficiency allows it to match or outperform much larger models in various tasks. | InstructOR is built on the GTR model architecture, initialized from T5 models, and fine-tuned on information search datasets. It uses a single encoder to encode input text and task instructions. The training objective distinguishes between good and bad candidate outputs based on their cosine similarity in embeddings. | The LLaMA models prioritize efficiency, with LLaMA-13B outperforming larger models like GPT-3 using fewer parameters. They exclusively train on publicly available data, emphasizing both inference efficiency and democratizing access to powerful language models. | Alpaca 7B is fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$). | The paper introduces Pythia, a suite of 16 Large Language Models (LLMs) trained on consistent public data, spanning from 70M to 12B parameters. Pythia provides public access to 154 checkpoints for each model, facilitating research into LLM training dynamics, memorization, and bias reduction. This unique setup offers novel insights into LLM behavior and evolution. | It can generate upto 24000 words (equivalent to 48 pages)<SEP>Supports input context length between 8192 and 32,768 tokens depending on the model version | Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. | The paper introduces the Evol-Instruct methodology, allowing Large Language Models (LLMs) to autonomously generate diverse instructional data. This innovation shifts away from traditional human-generated instructions, leading to more robust models. The resulting model, WizardLM, demonstrated superior performance, highlighting the effectiveness of this approach. | The main innovation of WizardMath in the context of Large Language Models (LLMs) is its enhanced mathematical reasoning capabilities. Through a case study, the model demonstrated its ability to solve complex problems using step-by-step reasoning, showcasing its advancement over other LLMs in mathematical tasks. | MosaicML tries to provide a commercially-usable, open-source model that matches (and - in many ways - surpasses) LLaMA-7B. It is trained on a large amount of data (1T tokens like LLaMA). | StarCoder introduces novel features like an 8K context length, Fill-in-the-Middle (FIM) infilling, and Multi-Query-Attention (MQA) for fast inference. It outperforms other open LLMs for code and matches the OpenAI code-cushman-001 model. Additionally, it emphasizes safe open model release with improved PII redaction and an integrated attribution tool. 
| Falcon-7B and Falcon-40B have been trained on 1.5 trillion and 1 trillion tokens respectively, in line with modern models optimising for inference. The key ingredient for the high quality of the Falcon models is their training data, predominantly based (>80%) on RefinedWeb — a novel massive web dataset based on CommonCrawl. Another interesting feature of the Falcon models is their use of multiquery attention. | The main innovation of the Orca model is Explanation Tuning, which leverages detailed explanations rather than just input-response pairs for training. This approach enhances the model's reasoning capabilities, using system instructions from GPT-4 and ChatGPT as a teaching assistant for generating explanations. | The paper introduces WizardCoder, a model that empowers Code Large Language Models with complex instruction fine-tuning using the Evol-Instruct method tailored for code. This innovation allows WizardCoder to surpass other open-source Code LLMs and even outperform major closed LLMs on key benchmarks. The model fills a gap in the field by emphasizing instruction fine-tuning specifically for the code domain. | The only difference between OpenLLaMA setting and the original LLaMA is the dataset used: OpenLLaMA employs open datasets rather than the one utilized by the original LLaMA. | The paper introduces Llama 2-Chat, a fine-tuned Large Language Model optimized for dialogue use cases, ranging from 7 billion to 70 billion parameters. It outperforms most open-source chat models and emphasizes safety through specific data annotation and tuning. <SEP>The primary architectural differences from Llama 1 include increased context length\nand grouped-query attention (GQA). | The paper introduces Jais and Jais-chat, state-of-the-art Arabic-centric LLMs. These models are trained bilingually with Arabic and English, addressing the underrepresentation of Arabic in the LLM space. They combine specialized Arabic text processing with advanced model architectures, offering superior performance in Arabic while remaining competitive in English. | |||||
pretraining architecture | Decoder | Encoder | Decoder | Decoder | Decoder | Encoder | Encoder | Encoder | Decoder | Encoder | Encoder/Decoder | Decoder | Encoder | Encoder/Decoder | Encoder | Encoder/Decoder | Encoder/Decoder | Encoder | Encoder or Decorder, depending on the base model | Decoder | Encoder | Encoder | Encoder | Decoder | Encoder/Decoder | Encoder | Decoder | Encoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Encoder | Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder | Encoder/decoder or decoder only, depending on the base model it’s extending | Encoder/Decoder | Encoder/Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Encoder | T5 (or CLIP or BERT) for frozen text encoder + U-net architecture for cascaded diffusion models for text to image | Decoder | Decoder | Decoder | Decoder | Encoder/Decoder | Decoder | Encoder/Decoder | Decoder | Decoder | Decoder | Encoder | Encoder/Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder | Decoder |
pretraining task | Causal language modeling | Masked Language Modeling | Causal language modeling | Causal language modeling | Causal language modeling | Masked Language Modeling | Masked Language Modeling | Next Sentence Prediction<SEP>Masked Language Modeling | Causal language modeling | Protein folding prediction of BERT using parameter\nsharing, which is much more efficient given the same number of parameters | denoising autoencoder | Causal language modeling | Masked Language Modeling<SEP>Next Sentence Prediction | Span Corruption | Masked Language Modeling | DAE (more concretely GSG) and MLM | denoising autoencoder | replaced token detection | Same as base model | Causal language modeling | Masked Language Modeling | Masked Language Modeling | image classification | Caption prediction | denoising autoencoder | predict which of the N × N possible (image, text) pairings across a batch actually occurred | Causal language modeling | Same as ViT | Causal language modeling | Next action prediction | predict most likely sequence | denoising autoencoder | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Caption prediction | Causal language modeling | Caption prediction | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | denoising autoencoder | Causal language modeling | LM training, Dialogue training | Auto regressive blank infilling | Masked Language Modeling | Caption prediction | Log likelihood of text given some visual input | Causal language modeling | Causal language modeling | CLM (where tokens are either text or agent actions) | Causal language modeling | Mixture-of-Denoisers, which combines diverse pretraining\nparadigms together | Image Classification | image/text pair prediction | Causal language modeling | Causal language modeling | Causal language modeling | Causal language modeling | Optimizes denoising (80%) and Prefix LM (20%) | Causal language modeling | Span Corruption | Causal language modeling | Causal language modeling | Causal language modeling | Contrastive pretraining | Language Modeling | Causal language modeling | Causal language modeling | Language Modeling | Causal language modeling | Causal language modeling | Language Modeling | Causal language modeling | Self-supervized Learning | Causal language modeling | |||||||||
fine-tuning task | Supervized discriminative finetuning | Next Sentence Prediction | finetuning on downstream tasks one at a time | based on multi-turn crowdsourced dialog datasets, LaMDA-PT is finetuned in a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context | Reinforcement Learning from Human Feedback | Instruction Tuning | dialogue-based fine-tuning | Natural language prompts | Supervised fine-tuning (SFT) | Instruction Tuning | Instruction Tuning | Step 3. RLHF using Proximal Policy Optimization<SEP>Step 2. Collect comparison data and train a reward model<SEP>Step 1. Supervized fine-tuning | Supervised fine-tuning (SFT) | Wide variety of instruction based text-to-text tasks | instructions | supervized open-domain instruction finetuning | Reinforcement Learning from Human Feedback<SEP>Rule-Based Reward Model | supervized open-domain instruction finetuning | supervized open-domain complex instruction finetuning | Reinforcement Learning from Evol-Instruct Feedback | Conversations for the Chat model<SEP>Short instructions for the Instruct model<SEP>Excerpts of fiction books for StoryWriter | Python-specific variant | Explanation tuning | supervized complex code-based instruction finetuning | Finetuning on longer context data using sample subsets of datasets only samples with at least 4096 tokens of instances in base model dataset for the 8K context window model<SEP>A large collection of chat datasets for the Chat model<SEP>Instructions following for the Instruct model | both instruction tuning and reinforcement learning from human feedback for Llama-2-Chat | Instruction Tuning | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
training corpus | BookCorpus<SEP>Supervised Finetuning on several task-specific datasets for Natural Language Inference, Question Answering, Sentence similarity, and Classification. | Toronto Book Corpus and Wikipedia (3.3B Tokens) | Different training datasets depending on experiments, but baseline is Wikitext-103 | https://github.com/openai/gpt-2-output-dataset<SEP>8 million web pages (40 GB). 10X GPT . WebText dataset is created by crawling all links at Reddit with at least 3 Karma points. | Same as BERT + Giga5 (16GB text),\r\nand and aggressively filtered ClueWeb 2012-B (19GB), Common Crawl (110 GB) | English Wikipedia + Wikidata for entitites (note that they\ninitialize model to original BERT parameter values | Same as BERT + CC News + OpenWebText + Stories (~33B Tokens) | Same as BERT | 140 GB of text including: Wikipedia (En, De, Es, Fr), Project Gutenberg, 45 subreddits, OpenWebText2, Amazon Reviews, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task3, which includes the Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA , and Natural Questions | 170,000 proteins from a public repository of protein sequences and structures | Same as RoBERTa (160Gb of news, books, stories, and web text) | 140M Reddit conversations | Same as BERT | Colossal Clean Crawled Corpus | Cleaned Common Crawl in 100 languages | C4 (750GB) + HugeNews (3.8 TB) | CC25 Corpus includes 25 monolingual corpuses in different languages. Largest corpuses are English (300 GB) and Russian (280GB) | Same as BERT except for Large with is same as XLNet | Original paper uses an aggregate dataset consisting of Wikipedia), CC-Stories), RealNews, and OpenWebtext | ~ 500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B) | For DeBERTa pre-training, we use Wikipedia (English Wikipedia dump; 12GB), BookCorpus (Zhu et al., 2015) 9 (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB) and STORIES (a subset of CommonCrawl (Trinh & Le, 2018); 31GB). The total data size after\ndata deduplication (Shoeybi et al., 2019) is about 78GB. | Books, CC-News, Stories and Wikipedia | From standard Imagenet to JFT-300M (large inhouse dataset) | 250 million text-images pairs from the internet | Colossal Clean Crawled Corpus | WIT (WebImageText) - 400 million text,image pairs | Pile - 840 GB open source text dataset that combines 22 pre existing datasets | Imagenet and Imagenet-22k | Pile corpus, a large-scale curated dataset created by EleutherAI | Different corpus for different experiments | D4RL dataset and other RL datasets depending on the task at hand | 23TB of simplified HTML extracted from CommonCrawl | 300B tokens (same as GPT-3) | The Pile (800GB dataset) + 2 Common Crawl snapshots | 400B tokens from filtered Common Crawl and Books. They also create several Dialogue Preference datasets for the RLHF training. | 1.6T tokens including web pages filtered by Wikipedia and books for quality | Same as DALL-E | Massive Text (2.35 billion documents, or about 10.5 TB of text including Massive Web, Books, Github, News, C4, and Wikipedia. | LAION-5B, a publicly available dataset derived from Common Crawl | CC-News, English Wikipedia | 1.56T words from public dialog data and other public web documents. 
Overall, it consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T\nwords | Same as GPT3 for pretraining, but finetuned and optimized using labeler data and prompts | FLAN is instruction tuned on 25 tasks spanning 62 datasets.<SEP>LaMDA-PT is is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary | 1.4 trillion training tokens. Massive Text (2.35 billion documents, or about 10.5 TB of text including Massive Web, Books, Github, News, C4, and Wikipedia. | CNN/DM, XSUM, ELI5, WMT16 En-Ro (~1M tokens) | Same as Gopher plus specific dataset generated in the RLHP process | Wizard of the Internet/Wikipedia, PersonaChat, Blended Skill\nTalk, Empatheic Dialogues, Multi-Session Chat, MS MARCO, Natural\nquestions, SQuAD, TriviaQA | Pile, GLM-130B Chinese corpora, P3, DeepStruct finetuning\ndataset | T0 (Multiple-choice QA, Extractive QA, Closed-Book QA,\nStructure-To-Text, Sentiment, Summarization, Topic Classification, Paraphrase\nIdentification. T0p (same as T0, with additional datasets from\nGPT-3’s evaluation suite). T0pp (same as T0p, with additional datasets\nfrom SuperGLUE, excluding NLI sets) | Combination of the DALL-E and CLIP datasets | MultiModal MassiveWeb (M3W): 185 million images and 182 GB text + a number of text paired with image datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average) | 780B tokens from multilingual social media conversations (50%), multilingual filtered webpages (27%), books in English (13%), code from Github (5%), multilingual Wikipedia (4%), and news in English (1%). Code includes 24 programming languages. | Pile — 840 GB open source text dataset that combines 22 preexisting datasets | 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot) | 180B tokens = RoBERTa + the Pile + PushShift.io Reddit | 1 trillion tokens on C4 | Imagenet-1K and other task dependent dataasets | a combination of internal datasets, with ? 460M image-text pairs, and the publicly available Laion dataset, with ? 400M image-text pairs | Same as PaLM + 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats | 147M dialog sessions for a total of 6B tokens from Reddit comment chains\nfor DialoGPT. And grounded dialog corpora like DSTC7 Task 2 corpus, MS MARCO, UnifiedQA, and Schema-Guided Dialog. | https://openreview.net/forum?id=UoEw6KigkUn<SEP>366B tokens (1.5 TB of text data) multilingual dataset (46 natural languages and 13 programming languages) | 180B tokens = RoBERTa + the Pile + PushShift.io Reddit | Wikipedia and mC4 datasets in 12 languages. | Same as Chinchilla + interactive data gathering with human annotators during the RLHF process | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT | Trained on 106 billion tokens of open-access scientific text and\ndata. 
This includes papers, textbooks, scientific websites, encyclopedias,\nreference material, knowledge bases, and more | Human written prompt and interaction dataset collected through the OpenAI API | Finetune on a combination of 3 datasets: NLI 6 (Natural Language Inference), MS-MARCO passage ranking\ndataset, and NQ (Natural Questions) dataset<SEP>CCPairs dataset by combining various semistructured\ndata sources such as CommunityQA, Common Crawl and Scientific papers, and perform\naggressive filtering with a consistency-based filter | Finetuned on MEDI | Stack Exchange<SEP>ArXiv<SEP>Gutenberg and Books3<SEP>Wikipedia<SEP>Github<SEP>C4<SEP>CommonCrawl<SEP>approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328BG Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%) | 52K instruction-following data generated from OpenAI’s text-davinci-003 using self-instruct mechanism, from 175 human-written instruction-output pairs was leveraged to finetune the model | Two version of each of the eight models are released: one trained on the original Pile corpus and another trained on the deduplicated Pile corpus.<SEP>Pile | Vicuna is finetuned with 70K user-shared ChatGPT conversations user-shared conversations collected from ShareGPT.com | 250K instructions generated from OpenAI ChatGPT API auto-generated based on the Evol-Instruct method | To enhance the model’s ability to adhere to the neural and diverse instructions, 1.5k open-domain conversations from WizardLM’s training data are sampled and merged with above math corpus as the final supervized finetuning training data<SEP>few-shot re-generate 15k answers for GSM8k\nand MATH with an Alpha version of WizardLM 70B model to produce solutions in a\nstep-by-step format, then find out those with a correct answer, and use this data to finetune\nbase Llama model<SEP>evolve the original math (GSM8k + MATH) instructions by 8 turns,\nincreasing the data size from 15k to 96k | Semantic Scholar ORC<SEP>The Stack Dedup<SEP>RedPajama<SEP>C4<SEP>mC4 | StarCoderBase was trained using the dataset "The Stack v1.2" (Kocetkov et al., 2022), which exclusively contains data from permissively licensed GitHub repositories. The training set was further refined by selecting 86 languages out of the 358 programming languages present in The Stack, based on criteria like data volume, popularity, and active support. The assignment of data to programming languages was done based on file extensions. | Falcon RefinedWeb | the Flan 2022 Collection with its extensive public assortment of tasks and instructions<SEP>⟨query, response⟩ pairs augmented with detailed responses from\nGPT-4 that explain the reasoning process of the teacher as it generates the response | initialized with the 20K instruction-following dataset called\nCode Alpaca, the Evol-Instruct technique is iteratively applied on this dataset consisting of 20,000 samples to produce evolved data. 
After each round of data evolution, the evolved data is merged from\nall previous rounds with the original dataset to finetune StarCoder | We train our models on the RedPajama dataset released by Together, which is a reproduction of the LLaMA training dataset containing over 1.2 trillion tokens.<SEP>RedPajama | RedPajama - StackExchange<SEP>RedPajama - arXiv<SEP>RedPajama - Books<SEP>Semantic Scholar ORC<SEP>The Stack - Markdown<SEP>RedPajama - Wikipedia<SEP>The Stack - Selected Languages<SEP>RedPajama - CommonCrawl<SEP>c4 - English - SemDedup<SEP>mC4 3.1.0 - English | The v2 models are trained on a mixture of the Falcon refined-web dataset, the StarCoder dataset and the wikipedia, arxiv, book and stackexchange part of the RedPajama dataset.<SEP>Starcoder<SEP>Falcon refined-web | No specific information available other than "A new mix of publicly available online data" | An English dataset comprising 232B tokens from Pile-CC, Books3, ArXiv, PubMed Central, OpenWebText2, Wikipedia (en), FreeLaw, PubMed Abstracts, DeepMind Math, Project Gutenberg, BookCorpus2, EuroParl, PhilPapers, Youtube Subtitles, NIH Grant Abstracts, Enron Emails. And 46B tokens from its Github subset.<SEP>3B tokens from English Wikipedia and 15B tokens from the Books3 corpus translated to Arabic.<SEP>An Arabic dataset comprising 55B tokens from Abu El-Khair, Aranews, ArabicText 2022, ARabic subset of C4, Arabic Wikipedia, ArabicNews 202, Maktabah, UN Meeting transcripts, and other sources. | |
optimizer | Adam optimizer | Adam optimizer | Adam weight decay optimizer | Adam optimizer | Adam optimizer | LAMB optimizer | Adagrad optimizer | AdaFactor | Adam optimizer | Adam optimizer | Adam optimizer | Adam optimizer | Adam optimizer | Adam optimizer | AdamW | AdamW | Adam optimizer | Adam optimizer | AdaFactor | Adam optimizer | Adam optimizer | AdaFactor | AdamW | AdaFactor | Adam optimizer | Adam optimizer | AdamW | AdaFactor | ZeRO<SEP>AdamW | Adam optimizer | AdamW | AdaFactor | AdamW | AdaFactor | Adam optimizer | Adam optimizer | Adam optimizer | AdaFactor | AdaFactor | AdaFactor | AdamW | AdamW | AdamW | AdamW | ZeRO<SEP>Adam | Adam optimizer | LION | Adam optimizer | AdamW | AdamW | LION | AdamW | AdamW | AdamW | |||||||||||||||||||||||||||||||||
tokenization (a short tokenizer sketch follows the table) | byte pair encoding | WordPiece | byte pair encoding | byte pair encoding | sentencepiece | fastBPE | sentencepiece | sentencepiece | sentencepiece | sentencepiece | byte pair encoding | byte pair encoding | BPE encoding | same set of Byte Pair Encoding (BPE) Tokenizer as GPT-2/GPT-3 | self-trained sentencepiece | sentencepiece | sentencepiece | sentencepiece | sentencepiece | a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation | sentencepiece | uncased wordpiece | sentencepiece | train a new BPE tokenizer based on the Pile | sentencepiece | GPT-2 byte level BPE tokenizer | sentencepiece | Byte level BPE tokenizer | byte pair encoding | the byte pair encoding (BPE) algorithm, using the implementation from SentencePiece; notably, all numbers are split into individual digits, with a fallback to bytes to decompose unknown UTF-8 characters | BPE tokenizer trained specifically on the Pile, same as used for GPT-NeoX-20B | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus | the Hugging Face Tokenizers library (Moi et al., 2022) is used to train a byte-level Byte-Pair-Encoding with a vocabulary size of 49,152 tokens, including the sentinel tokens | LLaMA Byte Pair Encoding (BPE) | EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus | sentencepiece | Jais tokenizer |
number of parameters | 117M | Base = 110M, Large = 340M | 151M | 124M, 355M, 774M, 1.5B | Base=117M, Large=360M | Ernie-ViLG 2.0 = 10B, Ernie 3.0 Titan = 260B | 125M Base, and 356M Large | Base = 12M, Large = 18M, XLarge = 60M | 1.63B | b12M, Large = 18M, XLarge = 60M | Base = 140M, Large = 400M. In general, roughly 10%\nlarger than BART for equivalent architectures | 117M, 345M and 762M | 66M | 60M, 220M, 770M, 3B, and 11B | Base = 270M Large = 550M | Base = 223M Large = 568M | Same as BART | Small = 14M, Base = 110M, Large = 335M | 8.3B (GPT-like), 3.9B (BERT-like) | 125M, 350M, 774M, 1.3B, 2.7B, 6.7B, 13B, 175B | 100M (Base), 350M (Large), 700M (X-large), and 1.5B (XX-large) | Depends on the overall architecture | 86M(Base) to 632M (Huge) | 12B | 1T | 125M, 350M, 1.3B, and 2.7B | Swin-Tiny (29M), Swin-Small (50M), Swin-Base (88M), and Swin-Large (197M) | 6B | Same as GPT | Smaller architecture than GPT | 400M | 178B (Jumbo), 17B (Grande), 7.5B (Large) | 530B | 13M, 42M, 197M, 810M, 2.7B, 13B, and 52B | 1.2T across 64 experts, but only 96B get activated for inference | 3.5B diffusion model (2.3B for visual encoding, 1.2B for textual) + 1.5B for model for upsampling | 44M, 117M, 417M, 1.4B, 7.1B, and 280B | 890M (although there are different, smaller, variants) | 125M (small), 800M (small), 2.7B (medium) and 13B (large) | 137B | Same as GPT3 | 137B | 70B | Up to 30x reduction in parameters compared to standard BART | 280B | SeeKeR Dialogue: 400M, 3B; SeeKeR LM: 365M, 762M,\n1.5B, R2C2 BlenderBot: 400M, 3B | Base = 110M, Large = 335M, and also 2B, 10B, 130B | T0-3B: 3 billion, T0, T0p, T0pp: 11 billion | 3.5B | 80B (largest) | 8B, 62B, and 540B | 20B | 79M, 364M, and 1.18B | 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, and 175B | 20B | 90M | 2B | 540B | 220M (base), 770M (large), and 175B (XL) | 560m, 1.1B, 1.7B, 3B, 7.1B, and 176B | 3B, 30B and 175B | 20B | 70B | 80M (Flan-T5-Small), 250M (Flan-T5-Base), 780M (FLan-T5-Large), 3B (Flan-T5-XL), and 11B (Flan-T5-XXL). | 8B, 62B, 540B | mini: 125M, base: 1.3B, standard: 6.7B, large: 30B,\nhuge: 120B | 175B | 300M | 330M | 6.7B, 13.0B, 32.5B, and 65.2B | 7B, 13B, 33B, 65B | 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B | 170T | 7B, 13B, 33B | 7B | 7B, 13B, and 70B | 7B | 15.5B | 7B, 40B | 13B | 1B, 3B, 7B, 13B, 15B, and 34B | 3B, 7B, 13B | 30B | 3B, 7B | 7B, 13B, 34B, 70B | 1.3B, 6.7B, and 13B | |
maximum number of parameters (in million) | 117 | 340 | 151 | 1500 | 360 | 260000 | 356 | 60 | 1630 | 60 | 400 | 762 | 66 | 11000 | 550 | 568 | 335 | 8300 | 175000 | 1500 | 632 | 12000 | 1000000 | 2700 | 197 | 6000 | 400 | 178000 | 530000 | 52000 | 1200000 | 3500 | 280000 | 890 | 13000 | 137000 | 137000 | 70000 | 280000 | 130000 | 11000 | 3500 | 80000 | 540000 | 20000 | 1180 | 175000 | 20000 | 90 | 2000 | 540000 | 175000 | 176000 | 175000 | 20000 | 70000 | 11000 | 540000 | 120000 | 175000 | 300 | 330 | 65200 | 65000 | 12000 | 170000000 | 33000 | 7000 | 70000 | 7000 | 15500 | 40000 | 13000 | 34000 | 13000 | 30000 | 7000 | 70000 | 13000 | ||||||||
hardware used | TPUv3 | Cloud TPUv3 | Cloud TPUv3 | Nvidia V100 GPU | Nvidia V100 GPU | TPUv3 | NVIDIA V100 (32GB) GPUs | NVIDIA V100 (32GB) GPUs | TPUv3<SEP>Nvidia V100 GPU | Tesla V100 (32GB) GPU | Nvidia V100 GPU | Nvidia V100 GPU | Cloud TPUv3 | NVIDIA V100 (16GB) GPU | TPUv3 | Nvidia V100 GPU | Nvidia V100 GPU | TPU v3-256 pod | graphics processing unit | A100-80GB GPU | cloud TPU-v4 | TPUv3 | NVIDIA A100 GPU<SEP>Nvidia V100 GPU | TPUv3 | TPUv3 | TPUv4<SEP>TPUv3 | A100 GPU | TPUv3 | Nvidia V100 GPU | Nvidia V100 GPU | TPUv3 | TPUv4 | TPUv4 | AMD EPYC 7532 CPU<SEP>A100-SXM4-40GB GPU | A100-80GB GPU | TPUv4 | NVIDIA A40 GPU<SEP>NVIDIA A100 GPU | TPUv4 | Nvidia V100 GPU | A100-80GB GPU | NVIDIA V100 (32GB) GPUs<SEP>A100-40GB GPU | NVIDIA A100 GPU | TPUv3 | TPUv3 | TPUv4 | A100-80GB GPU | Nvidia V100 GPU | A100-40GB GPU | A100-80GB GPU | A100-80GB GPU | A100-40GB GPU | A100-80GB GPU | Nvidia V100 GPU | A100-40GB GPU | A100-80GB GPU | A100-40GB GPU | A100-80GB GPU | cloud TPU-v4 | A100-40GB GPU<SEP>H100-80GB GPU | cloud TPU-v4 | A100-80GB GPU | Condor Galaxy 1 AI supercomputer | |||||||||||||||||||||||||
hardware information | state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster | train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days | trained on 16 Nvidia V100 machines with NVLink. | DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours. | we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods. TPU pods are are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. | train the XLM-R model for\n1.5 Million updates on five-hundred 32GB Nvidia\nV100 GPUs with a batch size of 8192. | The full model (including 25\nlanguages) is trained on 256 Nvidia V100 GPUs (32GB) for 500K steps. The total batch size is around 128K tokens per GPU, matching the BART configuration. | ELECTRA-SMALL trained for 4days on 1 V100 GPU. ELECTRA-Base trained for 4d on 16 TPUv3s. | Their experiments use up to 32 DGX-2H servers (a total of 512 Tesla V100 SXM3 32GB GPUs). Our infrastructure is optimized for multi-node deep learning applications, with 300 GB/sec bandwidth between GPUs inside a server via NVSwitch and 100 GB/sec of interconnect bandwidth between servers using 8 InfiniBand adapters per server. | All models were trained on V100 GPU’s on part of a high-bandwidth cluster provided by Microsoft. | The base model of DeBERTa was trained on 4 DGX-2 machines equipped with 64 V100 GPUs. The training took 10 days to complete 1M training steps with a batch size of 2048. For the larger version of DeBERTa with 1.5 billion parameters (DeBERTa1.5B), the model was trained on a pre-training dataset of 160G. The training was conducted on a DGX-2 machine with 16 V100 GPUs. | the ViT-L/16 model\npre-trained on the public ImageNet-21k dataset could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days. | All models are trained with the same amount of computation (32 cores) and on the same hardware (TPUv3). | The largest ResNet model,\nRN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. | We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192. | Model training is done with mixed precision using 16-bit bfloat on NVIDIA’s Selene supercomputer with 560 DGX A100 nodes. Each cluster node has 8 NVIDIA 80-GB A100 GPUs, connected to each other by NVLink and NVSwitch. Each node has eight NVIDIA Mellanox 200Gbps HDR Infiniband HCAs for application communication, with an additional two HCAs per node for dedicated storage. The nodes are connected in a three-level (leaf, spine, core) fat-tree topology with 850 switches. The cluster uses an all-NVME shared parallel filesystem for high-performance data access and storage. | the GLaM (64B/64E)\ntraining after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach similar (and slightly exceeded) scores as GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh or 1/6 of the GPT-3\nenergy cost. | We built our training and evaluation codebase with JAX and Haiku. In particular, we use JAX’s pmap transformation to efficiently express both data and\nmodel parallelism. 
We trained and evaluated all models on TPUv3 chips.\nThe half-precision parameters and single-precision Adam state for Gopher occupy 2.5 TiB, which far exceeds the 16 GiB of memory available on each TPUv3 core. To address these memory concerns,\nwe use optimiser state partitioning, model parallelism, and rematerialisation to partition the model state and reduce\nthe activations so that they fit in TPU memory. | HTLM-Medium was trained on 240 V100 GPU for 28 days, while HTLM-Large was trained on 384 A100 GPU for 24 days. | LaMDA was pretrained on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch | instruction tuning takes around 60 hours on a TPUv3 with 128 cores | All models in this analysis have been trained on TPUv3/TPUv4 with JAX and Haiku. | shard the networks across 128 TPU v3\nmachines. | The SeeKeR language models were fine-tuned on all of the search, knowledge, and response tasks\nsimultaneously, with training occurring on 32 V100 GPUs for around 17, 21, and 31 hours for the XL,\nLarge, and Medium models, respectively.<SEP>The SeeKeR 2.7B R2C2 model was fine-tuned on all of the search, knowledge, and dialogue response\ntasks simultaneously, with training occurring on 64 V100 GPUs for around 20 hours<SEP>SeeKeR 2.7B R2C2 was pre-trained on 128 V100 GPUs for approximately 25 days | The models\nare trained on 64 V100 GPUs for 200K steps with\nbatch size of 1024 and maximum sequence length\nof 512, which takes about 2.5 days for GLMLarge. | training runs corresponded to about 270 total hours of training on a v3-512 Cloud TPU device. Further, T5 was trained in Google’s Taiwan datacenter, whereas we\ntrained in the europe-west4-a Cloud region. The gCO2eq/kWh published by Google for these\ndatacenters are 540 and 410 respectively | Our model and associated infrastructure were implemented using JAX and Haiku. All training and evaluation was performed on TPUv4 instances. The\nlargest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. Megatron type sharding is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers\nwere unsharded. ZeRO stage 1 is used to shard the optimizer state. | PaLM 540B is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism. Each Pod has 3072 TPU v4 chips attached to 768 hosts. | trained GPT-NeoX-20B on twelve Supermicro\nAS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs | training OPT-175B\non 992 80GB A100 GPUs, reaching 147 TFLOP/s\nutilization per GPU. From this implementation, and\nfrom using the latest generation of NVIDIA hardware,\nwe are able to develop OPT-175B using only\n1/7th the carbon footprint of GPT-3. | We use a batch size of 1024 and 512 TPUv4 chips\nfor pretraining this model. UL20B is trained with Jax and T5X infrastructure. 
We release and open source\nT5X-based model checkpoints of this 20B model | Object detection\nand instance segmentation models as well as semantic segmentation models were trained using one computational node with\n8 NVIDIA A40 GPUs using a total batch size of 16, hence a batch size of 2 per GPU.<SEP>For image classification, GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs | use 256 TPU-v4 chips for our base 64 x 64 model, and 128 TPU-v4 chips for both super-resolution models | GODEL_B and GODEL_L were trained on 16 Nvidia\nV100 machines, and GODEL_XL was trained with\n128 Nvidia V100 GPUs. | Training was conducted on 48 nodes, each\nhaving 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs); due to possible hardware\nfailures during training, we also maintained a reserve of 4 spare nodes. The nodes were\nequipped with 2x AMD EPYC 7543 32-Core CPUs and 512 GB of RAM, while the storage\nwas handled by mix of full flash and hard disk drives using a SpectrumScale (GPFS) parallel\nfile system shared between all nodes and users of the supercomputer. 4 NVLink GPU-to-\nGPU interconnects per node enabled intra-node communications while 4 Omni-Path 100\nGbps links per node, arranged in an enhanced hypercube 8D global topology, were used for\ninter-node communications. | The 30B and 175B parameter BlenderBot 3 models were each trained for one epoch of the training data\non 64 (30B) or 128 (175B) x 40gb A100 GPU<SEP>The 3B parameter BlenderBot 3 model was trained on 64 x 32gb V100 GPUs for 27k updates with a batch\nsize of 64 | trained AlexaTM 20B for 120 days on 128 A100 GPUs for the total of 500k updates with the accumulated\nbatch size of 2 million tokens (total of 1 trillion token updates) | shard the models across 64\nTPU v3 machines | use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4\nTPU chips for 37 hours) | We use the metaseq library for training the models, built by the NextSys team at Meta AI. For training the largest 120B model, we use 128 NVIDIA A100 80GB nodes. For inference Galactica 120B\nrequires a single A100 node. | trained on an Azure AI supercomputing infrastructure | takes {16; 32; 64} V100 GPUs and {1; 1; 2} days for the {small, base, large} models. To improve training efficiency and reduce GPU memory usage, we adopt mixed precision training and gradient checkpointing. | When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. | fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. | training the 70M, 160M, and 410M models required 32 A100 GPUs with 40GB RAM, 1.0B, 1.4B, and 2.8B models used 64 A100-40GB GPUs. 6.9B model used 128 GPUs. And 12B model used 256 GPUs. | The training is done with 8x A100 GPUs. The longest single training run takes around 2 days. They utilized\nSkyPilot managed spot instances for saving training costs and FlashAttention for memory optimizations. | train our model on\n8 V100 GPUs with Deepspeed Zero-3 for 70 hours on 3 epochs. 
| A100-80GB GPU was used in model finetuning<SEP> took ~9.5 days to train on 440xA100-40GB GPUs, and cost ~$200k | trained our model on a GPU cluster with 512 A100 80 GB GPUs distributed across 64 nodes. We partitioned the model with a 3D-parallel layout that shards the model with both tensor and pipeline parallelism rank 4, requiring 16 GPUs (two nodes) for one replica. | Falcon-40B-Instruct was trained on AWS SageMaker, on 64 A100 40GB GPUs in P4d instances. <SEP>Falcon-7B-Instruct was trained on AWS SageMaker, on 32 A100 40GB GPUs in P4d instances. <SEP>Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.<SEP>Falcon-7B was trained on 384 A100 40GB GPUs, using a 2D parallelism strategy (PP=2, DP=192) combined with ZeRO. | trained Orca on 20 NVIDIA A100 GPUs with 80GB memory. It took 160\nhours to train Orca on FLAN-5M (ChatGPT augmentations) for 4 epochs, and 40 hours to\ncontinue training on FLAN-1M (GPT-4 augmentations) for the same number of epochs. | The models are trained on cloud TPU-v4s using EasyLM, a JAX based training pipeline we developed for training and fine-tuning large language models. We employ a combination of normal data parallelism and fully sharded data parallelism (also know as ZeRO stage 3) to balance the training throughput and memory usage. | The model was trained in three stages using the MosaicML Platform: (i) First it was trained on 440 A100-40GBs with a batch size of 1760. (ii) Then, on 216 A100-40GBs with a batch size of 1728. (iii) Training was completed on 256 H100-80GBs with a batch size of 512 with 8k context length and 50B tokens. The model was trained with sharded data parallelism using FSDP and used the LION optimizer. | models were pretrained on Meta’s Research Super Cluster (RSC) as well as internal production clusters, both of which use NVIDIA A100s. | All training, hyper-parameter tuning, and instruction-tuning experiments were executed on the Condor Galaxy\n1 (CG-1) AI supercomputer from Cerebras, built in partnership with G42. The final training and fine-tuning runs for Jais were performed on 16 CS-2 systems within CG-1. CG-1 is a Cerebras Wafer-Scale Cluster composed of Cerebras CS-2 systems, MemoryX, SwarmX, management, and input worker nodes. | ||||||||||||||||||||||||||||||||
extension | Relative positioned embeddings enable longer-context attention when compared to vanilla Transformer model | GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.<SEP>Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, or increased context size from 512 to 1024) | This model basically adapts Transformer XL architecture to permutation-based LM | Uses BERT for Encoder architecture, but stacks and aggregates\ntwo of them for text and entities. This architecture could be\nunderstood as BERT for text + knowledge graphs | Extension of BERT with optimized training procedure and more data | Compressed version of BERT using parameter sharing, which is much more efficient given the same number of parameters | model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior | The original Alphafold used a BERT-style transformer. The details of Alphafold’s Transformer are not known, but it is believed it is an extension of the SE(3)-Tranformer, a 3-D equivariant Transformer (see this blog post). | It can be seen as a generalization of BERT and GPT in that it combines ideas from both in the encoder and decoder | GPT-2 architecture trained on dialog data | Compressed version of BERT using distillation, which is much more efficient given the same number of parameters | Same as original Transformer with some additions such as relative positional embeddings like Transformer XL | An extension of RoBERTa that introduces small parameter tuning insights in the context of multilingual applications | PEGASUS introduces a novel pre-training objective called Gap Sentence Generation (GSG), extending vanilla Transformers, where "important" sentences are removed from a document and the model is trained to regenerate them. The method of selecting these principal sentences is based on their relevance to the entire document, measured using the ROUGE1-F1 score. This unique approach tailors the model for abstractive summarization, achieving state-of-the-art performance on various tasks. | mBART introduces a multilingual denoising pre-training method for neural machine translation using a sequence-to-sequence auto-encoder model. Unlike other models like XLM and MASS, mBART pre-trains both the encoder and decoder, making it more adaptable for translation tasks. The model's versatility is further showcased by its ability to handle various levels of multilinguality, from monolingual to 25 languages. | Applied new training techniques including Replaced Token\nDetection | Megatron is a family of models that extend previously known architectures (namely GPT-2 and BERT originally, but also T5 more recently) by introducing model parallelism primitives. In the case of BERT, the authors also replace the next sentence prediction head with sentence order prediction and use whole word n-gram masking. 
| Same as GPT-2 with the only addition of alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer | Separate positional embedding vector independent from the\ncontent embedding using disentangled attention matrices for contents and\nrelative positions | Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that elminates the quadratic dependency thus making it more suitable for longer sequences | Extension of BERT architecture to train on patches of images | A differential variational auto-encoder is used to learn the visual codebook. The transformer is a variation of GPT-3 | Goal to increase parameter count while keeping FLOP operations constant by using efficient routing of MoE (Mixture of Experts) | Combines Resnet and ViT for the visual encoding with Transformer for the Textual encoder | Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens | Extends ViT by replacing the standard multi-head self attention (MSA) module by a module based on shifted windows (Swin) allowing ViT-like architectures to generalize to higher resolution images | GPT-J 6B is a Transformer model trained using Mesh Transformer\nJAX and same tokenizer as GPT2/3 | Decision transformers use a GPT architecture and extend it by encoding trajectories in a way that they can be learned by an auto-regressive task | Similarly to the Decision transformers, the main extension introduced by Trajectory Transformers is a way to encode a trajectory (state, actions, rewards) | As opposed to BART, they don’t do sentence shuffling | Very similar to GPT-3, but far more parameters and improved training efficiency mostly because of the improved tokenizer. Also, different ratio of depth to breadth | Uses parallelization similar to Megatron to train a LM double the size of GPT-3 | These models do not introduce novelties at the architecture/pretraining level and they are based on GPT-3 but rather focuses on how to improve alignment through fine-tuning and prompting. Note that the Anthropic Assistant includes several models optimized for different tasks. Latest versions of this work focus on the benefits of RLHF. | GLaM introduces a Mixture of 64 Experts to increase parameter count and generalization properties in a somewhat standard decoder-only. Transformer architecture. Only two experts get activated at a time per token, which makes the model also more efficient in training and inference. | GLIDE can be seen as an extension of the ADM (Ablated Diffusion Model) by the same authors. However, ADM is not per se a transformer architecture although it does resemble one in some of the configurations the authors use. Given that ADM is by the same authors and was quickly followed up by GLIDE, I think it is fair to consider GLIDE as the first of its kind. | Same as GPT-2 but use RSNorm instead of LayerNorm and relative positional encoding rather than absolute | Stable diffusion is basically the Latent Diffusion model developed by LMU Munich researchers + some learnings on conditional diffusion from DALL-e and Imagen | This is somewhat similar to HTML in its use of structured\ntraining data. However, it is a different architecture and uses causal\nmasking, which makes the model predict, at the end of the sequence,\nan entire missing span of text. It also includes image input via Vector\nQuantized Variational Autoencoding (VQ-VAE) tokens. 
| LAMDA focuses on how to improve safety, quality, and groundeness using different fine-tuning strategies | GPTInstruct starts off with a pretrained GPT3 model and adds reward modeling through reinforcement learning after a supervised finetuning | Zero-shot task learning. The output space for a given task is either one of several classes (classification) or free text (generation). | Same as Gopher but with optimizations to reduce model size and therefore training/inference time with equal or superior performance | Adds quantization and distillation to a BART model to improve performance and model size | GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn whether not only a response is plausible but also supported | SeeKer is an extension that can be applied to any Transformer architecture by introducing “search”, “knowledge”, and “response” modules that are introduced during pretraining | GLM has a bidirectional encoder and a unidirectional decoder in a unified model | T0 stands for "T5 for Zero Shot", obtained by fine-tuning\nthe T5 model on multitask mixture covering many different NLP tasks.\nCompared with T0, T0p and T0pp were fine-tuned with more datasets.\nT0pp is recommended as it leads (on average) to the best performances on\na variety of NLP tasks. | Combines CLIP encoder and Diffusion decoder similar to GLIDE | It uses a frozen textual language model (like Chinchilla) conditioned on the visual representation, which is encoded from a Normalizer-Free ResNet | PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, Shared Input-Output Embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data | Similar to GPT-3 with rotary encoders instead of positional, parallel attention and feed forward layers, different initialization, and all dense layers instead of alternate dense/sparse | The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus add position encodings to add spatial information when applicable. | Basically same architecture as GPT-3 but with some training improvements introduced in Megatron-LM | UL2-20B (Unifying Language Learning) can be interpreted as\na model that is quite similar to T5 but trained with a different objective\nand slightly different scaling knobs. | hierarchical ViT architecture consisting of local and global self-attention modules | Imagen adds a few extensions to the U-net diffusion architecture (pooled embedding vector, cross attention over text embeddings, and Layer Normalizations) | Extends PaLM by fine-tuning on the mathematical dataset | In contrast\nwith earlier models such as DialoGPT,\nGODEL leverages a new phase of grounded\npre-training designed to better support adapting\nGODEL to a wide range of downstream\ndialog tasks that require information external\nto the current conversation (e.g., a database\nor document) to produce good responses. | Main difference to GPT-3 is that it uses full attention instead of sparse attention | BlenderBot 3 is based on a pre-trained OPT. It adds features needed for a dialog agent such as long-term memory or the ability to search the internet. It is also fine-tuned for some specific tasks given human feedback on them. | Derived from BART and layernorms located exactly at the beginning of each layer. 
Encoder initialized with internal 10B pre-trained\nencoder. | Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning with Human Feedback). It also adds inline evidence a la GopherCite | instruction finetuning\nwith a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on\nchain-of-thought data | Flan-PaLM is generated by "Flan Finetuning" the PaLM\nmodels: (1) scaling the number of tasks to 1,836, (2) scaling the model\nsize, and (3) finetuning on chain-of-thought data. | Transformer based architecture in a decoder-only setup with\na few modifications. Data extensions include special tokens for working\nmemory, citations, genetic data, and a few other biology related tasks. | Fine-tunes BERT-based models to create text string embeddings\noptimized for semantic relatedness | Fine-tunes T5 explicitly to optimize encoder to produce a\ngeneral purpose text string embedding useful for many NLU tasks. | LLaMA uses a Transformer architecture, and with extensions:\nPre-normalization, SwiGLU activations, RoPE embeddings, reduced\nmemory usage and runtime through efficient implementation of the causal\nmulti-head attention, checkpointing to reduce the amount of activations\nthat are recomputed during the backward pass, model and sequence parallelism\nto reduce memory usage of the model, and uses 1.4T BPE tokens\nafter tokenization. | Alpaca is fine-tuned from a 7B LLaMA model | Trained with the library GPT-NeoX | a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs | stable-vicuna-13b-delta<SEP>Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning. | Evol-Instruct | it enhances the mathematical reasoning abilities for open-source pretrained large language model Llama-2 | MPT-7B-StoryWriter-65k+<SEP>MPT-7B-Chat<SEP>MPT-7B-Instruct | Falcon-Instruct | finetuning on complex explanation traces obtained from GPT-4 | Code Evol-Instruct | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. | MPT-30B 8k Context Window<SEP>MPT-30B-Chat<SEP>MPT-30B-Instruct | public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. | Llama-2-Chat | Jais-chat | ||||
application | Text generation, but adaptable to many other NLP tasks when fine tuned. | General Language Understanding and Question Answering. Many other language applications followed | General language tasks | Text generation, but adaptable to many other NLP tasks when fine tuned. | General language tasks | Knowledge intensive related tasks that might benefit from\nknowledge graphs or entities such as entity recognition | Same as BERT | Same as BERT | Controllable text generation | Protein folding | Mostly text generation but also some text understanding tasks | Text generation in dialog settings | Same as BERT | Diverse set of downstream tasks including machine translation,\nquestion answering, abstractive summarization, and text classification | Translation and other cross-lingual language tasks | abstractive text summarization | Translation | Same as BERT | Same as base model | Initially text generation, but has over time been used for a large range of applications in areas such as code generation, but also image and audio generation | Same as BERT | Particularly well suited for longer sequences, not only in text but also e.g. in genomics | image classification | Text to image | General language tasks (e.g. question answering) | Image/Object classification | Text generation, but adaptable to many other NLP tasks when fine tuned. | Image (object detection, image classification..) | generating text from a prompt, but is advised to be finetuned for more effective performance | General RL (reinforcement learning tasks) | General RL (reinforcement learning tasks) | General purpose language model that allows structured HTML prompting | Similar to GPT-3 | Language generation and others (similar to GPT-3) | Different models with different applications from general dialog to code assistant. | General language modeling - tested across 29 NLP tasks | Text to image | Mostly Language Modeling and NLU, but also extensible like GPT | Text to image | Multimodal language model with the ability to do structured\nprompting, zero-shot captioning, image generation, and entity linking (via\ntarget text prediction of hyperlinks) | General language modeling, such as translation, summarization,\nquestion and answers | Knowledge-intensive dialog or language tasks | language understanding and generation tasks such as inference, sentiment analysis, paraphrase, closed-book QA, reading comprehension, coreference, summarization, translation, commonsense reasoning, and struct-to-text | Same as Gopher/GPT3 | Text generation and understanding | Dialog systems, Q&A, general language generation tasks | Same as base models | a General Language Model pretrained with an autoregressive\nblank-filling objective and can be finetuned on various natural language\nunderstanding and generation tasks. | Perform zero-shot inference tasks by specifying the query\nin natural language, and the models will generate a prediction. | Text to image | Text to image | PaLM is designed as a general purpose language model with\napplicability to hundreds of different language tasks | range of language-understanding, mathematics and knowledge-based tasks | Gato presents a generalizable agent that can be used beyond text to tasks such as playing Atari or controlling a robot arm. | Same as GPT-3 | A unified framework for pre-training models that are universally\neffective across datasets and setups. 
| image generation | Text to image | Mathematical reasoning | open-domain goal-directed dialog tasks such as knowledge-grounded response generation, task-oriented dialog, and conversational QA | Same as GPT-3 | same as GPT-3 | Summarization, multi-lingual machine translation and NLU tasks | Dialog agents and general language generation applications like Q&A | The primary use is to underestand how to improve large\nlanguage models with the right kind of instruction fine-tuning. The focus is\nresearch on zero-shot and in-context few-shot learning NLP tasks, such as\nreasoning, and question answering; advancing fairness and safety research,\nand understanding limitations of current large language models | Same as Flan-T5. The goal is to show Flan finetuning can\neven improve on the largest Google LMs (+9.4% improvement average\nacross tasks), with improvements to chain of thought, self consistency,\nmultilingual tasks, arithmetic reasoning | The models are designed to perform scientific tasks, including\nbut not limited to citation prediction, scientific QA, mathematical\nreasoning, summarization, document generation, molecular property prediction\nand entity extraction. | provide human-like conversational interactions and assist users in answering questions, generating text, providing recommendations, and engaging in natural language conversations. | Text embeddings for semantic relatedness tasks such as text\nclustering or search retrieval | Any NLU task requiring a single text string embedding. As\nof April 2023 InstructOR is the top-ranked system on the Massive Text\nEmbedding Benchmark (MTEB). | Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning,\nCode generation, Reading comprehension, and multitask understanding. | Evaluated on a variety of text generation and classification\ntasks. | Research on language model’s behavior, functionality, and\nlimitations | Creating highly realistic and contextually accurate human-like text generation | chatbot | assistant in foundational NLP tasks such as reasoning or multi-domain and multi-genre QA, code generation, etc. | surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B, Llama-1 65B, Falcon-40B, MPT-30B8, Baichuan-13B Chat and ChatGLM2 12B on both GSM8k and MATH. | Multiple tasks similar to LLaMA such as reasoning or code generation | potential in generating code across multiple programming languages, and acting as a virtual technical assistant without the need for instruction-tuning or RLHF | Research on large language models; as a foundation for further specialization and finetuning for specific usecases (e.g., summarization, text generation, chatbot, etc.) | various NLP tasks including Bio Olympiad, Forming Inequalities, Counterfactual Question Answering, Compound Interest Problems, Spatial Reasoning, Commonsense Question Answering, Hallucination, Quadratic Equation Solving, Meeting Transcript Processing | code understanding, code generation, instruction fine-tuning for code tasks, tested on benchmarks (HumanEval, HumanEval+, MBPP, DS-1000), generating accurate code responses with clear explanations. | same as LLaMA | Multiple text generation tasks. This model demonstrates good coding ability better than its predecessor as well as specifically finetuned code writers. 
| same as LLaMA | Research on LLMs demonstrating multitask abilities such as reading comprehension, commonsense reasoning, math and coding abilities | various NLP tasks encompassing world knowledge and commonsense reasoning |
has source code | https://github.com/openai/finetune-transformer-lm<SEP>https://huggingface.co/docs/transformers/model_doc/openai-gpt | https://huggingface.co/docs/transformers/model_doc/bert | https://github.com/chiayewken/transformer_xl<SEP>https://huggingface.co/docs/transformers/model_doc/transfo-xl | https://huggingface.co/docs/transformers/model_doc/gpt2 | https://huggingface.co/docs/transformers/model_doc/xlnet | https://github.com/thunlp/ERNIE | https://github.com/facebookresearch/fairseq/tree/main/examples/roberta<SEP>https://huggingface.co/docs/transformers/model_doc/roberta | https://github.com/google-research/albert<SEP>https://huggingface.co/docs/transformers/model_doc/albert | https://github.com/salesforce/ctrl<SEP>https://huggingface.co/docs/transformers/model_doc/ctrl | https://github.com/deepmind/alphafold | https://huggingface.co/docs/transformers/model_doc/bart | https://github.com/microsoft/DialoGPT<SEP>https://huggingface.co/docs/transformers/model_doc/dialogpt | https://huggingface.co/docs/transformers/model_doc/distilbert | https://github.com/google-research/text-to-text-transfer-transformer<SEP>https://huggingface.co/docs/transformers/model_doc/t5 | https://huggingface.co/docs/transformers/model_doc/xlm-roberta | https://github.com/google-research/pegasus<SEP>https://huggingface.co/docs/transformers/model_doc/pegasus | https://github.com/facebookresearch/fairseq/tree/main/examples/mbart<SEP>https://huggingface.co/docs/transformers/model_doc/mbart | https://github.com/google-research/electra<SEP>https://huggingface.co/docs/transformers/model_doc/electra | https://github.com/NVIDIA/Megatron-LM | https://platform.openai.com/docs/models/gpt-3-5<SEP>https://github.com/openai/gpt-3 | https://github.com/microsoft/DeBERTa<SEP>https://huggingface.co/microsoft/deberta-v2-xxlarge<SEP>https://huggingface.co/microsoft/deberta-v2-xlarge<SEP>https://huggingface.co/microsoft/deberta-xlarge<SEP>https://huggingface.co/microsoft/deberta-large | https://github.com/google-research/bigbird<SEP>https://huggingface.co/docs/transformers/model_doc/big_bird | https://github.com/google-research/vision_transformer<SEP>https://huggingface.co/docs/transformers/model_doc/vit | https://github.com/openai/DALL-E<SEP>https://github.com/borisdayma/dalle-mini | https://github.com/google-research/t5x<SEP>https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py | https://github.com/openai/CLIP<SEP>https://huggingface.co/docs/transformers/model_doc/clip | https://github.com/EleutherAI/gpt-neo<SEP>https://huggingface.co/docs/transformers/model_doc/gpt_neo | https://github.com/microsoft/Swin-Transformer | https://huggingface.co/EleutherAI/gpt-j-6b<SEP>https://github.com/kingoflolz/mesh-transformer-jax | https://github.com/kzl/decision-transformer<SEP>https://huggingface.co/docs/transformers/main/en/model_doc/decision_transformer | https://trajectory-transformer.github.io/<SEP>https://github.com/JannerM/trajectory-transformer | https://github.com/ai21labs/lm-evaluation | https://github.com/openai/glide-text2im | https://github.com/CompVis/latent-diffusion<SEP>https://huggingface.co/CompVis/stable-diffusion<SEP>https://huggingface.co/spaces/stabilityai/stable-diffusion<SEP>https://github.com/Stability-AI/stablediffusion | https://github.com/conceptofmind/LaMDA-rlhf-pytorch | https://github.com/openai/following-instructions-human-feedback | https://github.com/google-research/FLAN | https://github.com/amazon-science/dq-bart | https://parl.ai/projects/seeker/ | 
https://github.com/THUDM/GLM-130B | https://huggingface.co/bigscience/T0 | https://github.com/lucidrains/flamingo-pytorch | https://github.com/lucidrains/PaLM-pytorch | https://github.com/EleutherAI/gpt-neox<SEP>other gpt-neo models<SEP>https://huggingface.co/EleutherAI/gpt-neox-20b | https://github.com/OrigamiDream/gato | https://github.com/facebookresearch/metaseq<SEP>https://huggingface.co/facebook/opt-350m | https://github.com/google-research/google-research/tree/master/ul2 | https://github.com/NVlabs/GCVit | https://huggingface.co/microsoft/GODEL-v1_1-large-seq2seq?text=Hey+my+name+is+Mariama%21+How+are+you%3F<SEP>https://huggingface.co/microsoft/GODEL-v1_1-base-seq2seq?text=Hey+my+name+is+Julien%21+How+are+you%3F<SEP>https://github.com/microsoft/GODEL | https://huggingface.co/docs/transformers/model_doc/bloom | https://parl.ai/projects/bb3/<SEP>https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/model_card.md<SEP>https://github.com/facebookresearch/ParlAI/blob/main/projects/bb3/agents/README.md | https://github.com/amazon-science/alexa-teacher-models | https://github.com/google-research/t5x<SEP>https://huggingface.co/docs/transformers/model_doc/flan-t5 | https://huggingface.co/intfloat/e5-large-v2<SEP>https://huggingface.co/intfloat/e5-base-v2<SEP>https://huggingface.co/intfloat/e5-small-v2<SEP>https://github.com/microsoft/unilm/tree/master/e5 | https://huggingface.co/hkunlp/instructor-xl | https://huggingface.co/docs/transformers/main/model_doc/llama<SEP>https://github.com/facebookresearch/llama | https://github.com/tatsu-lab/stanford_alpaca | pythia-deduped<SEP>pythia<SEP>https://github.com/EleutherAI/pythia | https://chat.lmsys.org/<SEP>https://huggingface.co/lmsys/vicuna-33b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-13b-v1.3<SEP>https://huggingface.co/lmsys/vicuna-7b-v1.3<SEP>https://github.com/lm-sys/FastChat | https://github.com/nlpxucan/WizardLM | https://github.com/nlpxucan/WizardLM | https://huggingface.co/mosaicml/mpt-7b | https://github.com/bigcode-project/starcoder | https://huggingface.co/tiiuae/falcon-7b<SEP>https://huggingface.co/tiiuae/falcon-40b | https://github.com/nlpxucan/WizardLM | https://github.com/openlm-research/open_llama<SEP>https://huggingface.co/openlm-research/open_llama_13b<SEP>https://huggingface.co/openlm-research/open_llama_7b<SEP>https://huggingface.co/openlm-research/open_llama_3b | https://huggingface.co/mosaicml/mpt-30b | https://huggingface.co/openlm-research/open_llama_7b_v2<SEP>https://huggingface.co/openlm-research/open_llama_3b_v2 | https://github.com/facebookresearch/llama | https://huggingface.co/inception-mbzuai/jais-13b | |||||||||||||||||
blog post | https://medium.com/@hyponymous/paper-summary-improving-language-understanding-by-generative-pre-training-7b77babd7086<SEP>https://www.reddit.com/r/MachineLearning/comments/n36htr/p_gpt1_annotated_paper_paper_summary/ | https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb<SEP>https://www.philschmid.de/bert-text-classification-in-a-different-language | https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html | https://openai.com/research/better-language-models<SEP>https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface | https://towardsdatascience.com/xlnet-explained-in-simple-terms-255b9fb2c97c | http://research.baidu.com/Blog/index-view?id=160 | https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ | https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html | https://blog.salesforceairesearch.com/introducing-a-conditional-transformer-language-model-for-controllable-generation/ | https://www.deepmind.com/publications/highly-accurate-protein-structure-prediction-with-alphafold<SEP>https://fabianfuchsml.github.io/alphafold2/ | https://huggingface.co/microsoft/DialoGPT-medium?text=Hey+my+name+is+Mariama%21+How+are+you%3F | https://medium.com/huggingface/distilbert-8cf3380435b5 | https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html | https://ai.meta.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/ | https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html | https://medium.com/syncedreview/facebook-ai-mbart-the-tower-of-babels-silicon-solution-610dfb494f98 | https://sh-tsang.medium.com/brief-review-electra-pre-training-text-encoders-as-discriminators-rather-than-generators-9568050d3a86 | https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/<SEP>https://huggingface.co/blog/megatron-training | https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122<SEP>https://openai.com/blog/gpt-3-apps | https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/ | https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html<SEP>https://huggingface.co/blog/big-bird | https://www.v7labs.com/blog/vision-transformer-guide | https://openai.com/blog/dall-e/<SEP>https://ml.berkeley.edu/blog/posts/dalle2/ | https://www.alexanderthamm.com/en/blog/switch-transformer-upscaling-to-over-a-billion-parameters/ | https://openai.com/research/clip<SEP>https://medium.com/axinc-ai/clip-learning-transferable-visual-models-from-natural-language-supervision-4508b3f0ea46 | https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api<SEP>https://www.section.io/engineering-education/leveraging-gptneo-to-generate-ai-based-blog-content/ | https://www.section.io/engineering-education/an-overview-of-swin-transformer/ | https://en.wikipedia.org/wiki/GPT-J | https://sites.google.com/berkeley.edu/decision-transformer | https://bair.berkeley.edu/blog/2021/11/19/trajectory-transformer/ | https://www.ai21.com/blog/ai21-studio-use-cases<SEP>https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1 | 
https://developer.nvidia.com/megatron-turing-natural-language-generation<SEP>https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ | https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html | https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval | https://stability.ai/blog/stable-diffusion-public-release | https://lilianweng.github.io/posts/2022-06-09-vlm/ | https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html<SEP>https://blog.google/technology/ai/lamda/ | https://sh-tsang.medium.com/review-instructgpt-training-language-models-to-follow-instructions-with-human-feedback-7fce4bf9059a<SEP>https://openai.com/research/instruction-following | http://rylanschaeffer.github.io/blog_posts/2022-01-20-google-brain-flan.html<SEP>https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html | https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510<SEP>https://medium.com/mlearning-ai/language-models-need-proper-training-c71484727f00 | https://www.amazon.science/publications/dq-bart-efficient-sequence-to-sequence-model-via-joint-distillation-and-quantization | https://www.deepmind.com/blog/gophercite-teaching-language-models-to-support-answers-with-verified-quotes | http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/ | https://openai.com/product/dall-e-2<SEP>https://labs.openai.com/ | https://medium.com/geekculture/3-overlooked-things-deepminds-flamingo-a-large-model-for-computer-vision-84cd9d2f738c<SEP>https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model | https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/<SEP>https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html | https://blog.eleuther.ai/announcing-20b/ | https://www.deepmind.com/blog/a-generalist-agent<SEP>https://www.deepmind.com/publications/a-generalist-agent | https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ | https://blog.research.google/2022/10/ul2-20b-open-source-unified-language.html | https://towardsdatascience.com/global-context-vision-transformers-nvidias-new-sota-image-model-2923bdaf438e | https://imagen.research.google/ | https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html | https://www.microsoft.com/en-us/research/blog/godel-combining-goal-oriented-dialog-with-real-world-conversations/ | https://huggingface.co/blog/bloom-megatron-deepspeed<SEP>https://huggingface.co/blog/bloom-inference-pytorch-scripts<SEP>https://huggingface.co/blog/bloom-inference-optimization | https://ai.facebook.com/blog/blenderbot-3-a-175b-parameter-publicly-available-chatbot-that-improves-its-skills-and-safety-over-time/ | https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning | https://www.deepmind.com/blog/building-safer-dialogue-agents\n<SEP>https://medium.com/to-cut-a-long-paper-short/sparrow-improving-alignment-of-dialogue-agents-via-targeted-human-judgments-e0876402d800 | https://ai.googleblog.com/2023/02/the-flan-collection-advancing-open.html | https://galactica.org/ | https://openai.com/blog/chatgpt | https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/ | https://pub.towardsai.net/paper-review-instructor-one-embedder-any-task-6a846b0d3ba | 
https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ | https://medium.com/version-1/stanford-alpaca-a-small-yet-mighty-language-model-for-instruction-following-tasks-af9e92e87d9a | https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling | https://openai.com/research/gpt-4 | https://lmsys.org/blog/2023-03-30-vicuna/ | https://medium.com/@preangelleo/wizardlm-enhancing-large-language-models-with-ai-evolved-instructions-7fd4425afe80 | https://ollama.ai/blog/wizardmath-examples | https://www.youtube.com/watch?v=KSlWkrByc0o&t=9s<SEP>https://www.mosaicml.com/blog/mpt-7b | https://huggingface.co/blog/starcoder\n | https://huggingface.co/blog/falcon | https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/ | https://github.com/openlm-research/open_llama | https://www.mosaicml.com/blog/mpt-30b | https://github.com/openlm-research/open_llama | https://ai.meta.com/blog/llama-2/<SEP>https://ai.meta.com/resources/models-and-libraries/llama/ | https://inceptioniai.org/jais/ | ||||||||
license | closed source | Open, Apache 2.0 | N/A | closed source | Open, MIT license | closed source | N/A | Open, Apache 2.0 | Open, BSD-3-Clause license | Apache 2.0 | Open, Apache 2.0 | Open, MIT license | Open, Apache 2.0 | Apache 2.0 | N/A | Open, MIT license | Open, Apache 2.0 | Limited, Non-commercial usage | closed source | Open, MIT license | Open, Apache 2.0 | N/A | N/A | Open, Apache 2.0 | Open, MIT license | Open, MIT license | Open, MIT license | Apache 2.0 | Open, MIT license | Open, MIT license | N/A | Closed source, accessible through API | Limited, Non-commercial usage | N/A | closed source | Open, MIT license | closed source | open, CreativeML Open RAIL++-M License | N/A | closed source | Closed source, accessible through API | Apache 2.0 | closed source | Open, Apache 2.0 | closed source | the code is open sourced | Open, MIT license | Open, Apache 2.0 | Closed source, accessible through API | closed source | closed source | Apache 2.0 | closed source | MIT license | Open, Apache 2.0 | Limited, non-commercial license CC-BY-NC-SA-4.0 | closed source | closed source | MIT License | Open, but need to follow restrictions in Attachment A, BigScience\nRAIL License v1.0 | Limited, non-commercial, research only | Limited, non-commercial | closed source | Apache 2.0 | closed source | Limited, non-commerical CC BY-NC 4.0 license | Closed source, accessible through API | Open, MIT license | Open, Apache 2.0 | Limited, Non-commercial bespoke license | Limited, non-commercial license CC-BY-NC-SA-4.0 | Apache 2.0 | Closed source, accessible through API | Non-commercial license | Non-commercial license | Non-commercial license | Apache 2.0 | OpenRAIL-M license | Apache 2.0 | Non-commercial license | Non-commercial license | Apache 2.0 | Apache 2.0 | Apache 2.0 | Llama 2 Community License Agreement | Apache 2.0 | |
research problem | Large Language Models (LLMs)<SEP>transformer model (this entry is identical for every model in the catalog) |
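To make the tokenization row of the catalog concrete (see the note in that row), the sketch below contrasts the two schemes that dominate it: byte-level BPE, as used by the GPT family, and SentencePiece, as used by models such as T5, PaLM, and LLaMA. It is a minimal illustration, assuming the Hugging Face transformers and sentencepiece packages and the public gpt2 and t5-small checkpoints; the example sentence and checkpoint choices are illustrative and are not taken from any of the reviewed papers.

```python
# Minimal sketch (assumes `pip install transformers sentencepiece` and network
# access to the public checkpoints): compare a byte-level BPE tokenizer (GPT-2)
# with a SentencePiece-based tokenizer (T5) on the same input.
from transformers import AutoTokenizer

text = "Transformers tokenize text into subword units."

bpe_tok = AutoTokenizer.from_pretrained("gpt2")     # byte-level BPE
sp_tok = AutoTokenizer.from_pretrained("t5-small")  # SentencePiece

print("GPT-2 byte-level BPE:", bpe_tok.tokenize(text))
print("T5 SentencePiece    :", sp_tok.tokenize(text))  # note the leading '▁' word markers
print("GPT-2 input ids     :", bpe_tok(text)["input_ids"])
print("T5 input ids        :", sp_tok(text)["input_ids"])
```

A byte-level vocabulary guarantees that any UTF-8 string can be encoded without out-of-vocabulary tokens, which is why several of the catalogued models (e.g., LLaMA) add a byte fallback to their SentencePiece vocabularies.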
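The optimizer and parameter-count rows can be illustrated in the same spirit. The short sketch below, assuming PyTorch and the transformers library with the public gpt2 checkpoint, loads a small causal language model, counts its parameters (the quantity reported in millions in the table), and instantiates AdamW, the decoupled-weight-decay variant of Adam that most of the catalogued models report using. The learning rate and weight decay shown are generic fine-tuning defaults, not settings from any specific paper.

```python
# Minimal sketch: count a model's parameters and set up AdamW.
# Assumes `pip install torch transformers` and access to the public gpt2 checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # the 124M-parameter GPT-2

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total parameters     : {total / 1e6:.0f}M")
print(f"trainable parameters : {trainable / 1e6:.0f}M")

# AdamW decouples weight decay from the gradient update; the hyperparameters
# below are common fine-tuning defaults, not values from any reviewed paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
print(optimizer)
```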
Conclusion
Let's collaborate on keeping this review up-to-date, either directly here on the ORKG or via the following GitHub repository: https://github.com/jd-coderepos/llms.