nlpfromscratch/nlp-llms-resources

Master NLP and LLM Resource List

This is the master resource list for NLP from scratch. This is a living document and will continually be updated and so should always be considered a work in progress. If you find any dead links or other issues, feel free to submit an issue.

This document is quite large, so you may wish to use the Table of Contents automatically generated by GitHub to find what you are looking for.

Thanks, and enjoy!

Traditional NLP

Datasets

  • nlp-datasets: Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
  • awesome-public-datasets - Natural Language: Natural language section of the awesome public datasets github page
  • SMS Spam Dataset: The “Hello World” of NLP datasets, ~5.5K SMS messages labelled spam/not spam for binary classification. Hosted on the UC Irvine Machine Learning Repository.
  • IMDB dataset: The other “Hello World” of datasets for NLP, 50K “highly polar” movie reviews scraped from IMDB and compiled by Andrew Maas of Stanford.
  • Twitter Airline Sentiment: Tweets from February of 2015 and associated sentiment labels at major US airlines - hosted on Kaggle (~3.5MB)
  • CivilComments: Dataset from the Civil Comments platform, which shut down in 2017. 2M public comments with labels for toxicity, obscenity, threat, insult, etc.
  • Cornell Movie Dialog: ~220K conversations from 10K pairs of characters across 617 popular movies, compiled by Cristian Danescu-Niculescu-Mizil of Cornell. Tabular compiled format available on Hugging Face.
  • CNN Daily Mail: “Hello World” dataset for summarization, consisting of articles from CNN and Daily Mail and accompanying summaries. Also available through Tensorflow and via Hugging Face.
  • Entity Recognition Datasets: Very large list of named entity recognition (NER) datasets (on Github).
  • WikiNER: 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Polish, Italian, Spanish, Dutch, Portuguese and Russian.
  • OntoNotes: Large corpus comprising various genres of text in three languages with structural information and shallow semantic information.
  • Flores-101 - Multilingual, multi-task dataset from Meta for machine translation research, focusing on “low resource” languages. Associated Github repo.
  • CulturaX: Open dataset covering 167 languages with over 6T tokens, billed as the largest multilingual dataset released at the time.
  • Amazon Review Datasets: Massive datasets of reviews from Amazon.com, compiled by Julian McAuley of University of California San Diego
  • Yelp Open Dataset: 7M reviews, 210K businesses, and 200K images released by Yelp. Note the educational license.
  • Google Books N-grams: Very large dataset (2.2TB) of all the n-grams from Google Books. Also available hosted in an S3 bucket by AWS.
  • Sentiment Analysis @ Stanford NLP: Includes a link to the dataset of movie reviews used for Stanford Sentiment Treebank 2 (SST2). Also available on Hugging Face.
  • CoNLL-2003: Language-independent entity recognition dataset from the Conference on Computational Natural Language Learning (CoNLL-2003) shared task. A foundational dataset for named entity recognition (NER).
  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset: Large-scale dataset of 1M real-world conversations with LLMs, collected from the Chatbot Arena website.
  • TabLib: Largest publicly available dataset of tabular tokens (627M tables, 867B tokens), to encourage the community to build Large Data Models that better understand tabular data
  • LAION 5B: Massive dataset of images and captions from Large-scale Artificial Intelligence Open Network (LAION), used to train Stable Diffusion.
  • Databricks Dolly 15K: Instruction dataset compiled internally by Databricks, used to train the Dolly models based on the Pythia LLMs.
  • Conceptual Captions: Large image & caption pair dataset from Google research.
  • Instruction Tuning Volume 1: List of popular instruction-tuning datasets from Sebastian Ruder
  • Objaverse: Massive dataset of annotated 3D objects (with associated text labels) from Allen Institute. Comes in two sizes: 1.0 (800K objects) and XL (~10M objects).
  • Gretel Synthetic Text to SQL Dataset: Open dataset of synthetically generated natural language and SQL query pairs for LLM training, from Gretel AI.
  • Fineweb: 15T token dataset of cleaned and deduplicated data from CommonCrawl by Hugging Face.

Data Acquisition

Libraries

  • Natural Language Toolkit (NLTK): Core and essential NLP python library put together for teaching purposes by University of Pennsylvania, now fundamental to NLP work.
  • spaCy: Fundamental python NLP library for “industrial-strength natural language processing”, focused on building production systems.
  • Gensim: open-source python library with a focus on topic modeling, semantic similarity, and embeddings. Also contains implementations of word2vec and doc2vec.
  • fastText: Open-source, free, lightweight library that allows users to learn text representations (embeddings) and text classifiers. Includes pre-trained word vectors from Wikipedia and Common Crawl. From Meta’s FAIR Group.
  • KerasNLP: Natural language processing with deep learning and LLMs in Keras using Tensorflow, Pytorch, or JAX. Includes models such as BERT, GPT, and OPT.
  • Tensorflow Text: Lower level than KerasNLP, text manipulation built into Tensorflow.
  • Stanford CoreNLP: Java-based NLP library from Stanford, still important and in use
  • TextBlob: Easy to use NLP library in Python, including simple sentiment scoring and part-of-speech (POS) tagging.
  • Scikit-learn (sklearn): The essential library for machine learning in python, listed here specifically for its tools for working with text data.
  • SparkNLP: Essential Big Data library for NLP work from John Snow Labs. Take a look at their extensive model repo. Github repo with lots of resources here. Medium post here on using the T5 model for classification with SparkNLP.

Neural Networks / Deep Learning

Sentiment Analysis

Optical Character Recognition (OCR)

Information Extraction and NERD

  • RAKE: Rapid Automatic Keyword Extraction, a domain independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
  • YAKE: Yet Another Keyword Extractor is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text.
  • Pytextrank: Python implementation of TextRank and associated algorithms as a spaCy pipeline extension, for information extraction and extractive summarization.
  • PKE (Python Keyphrase Extraction): open source python-based keyphrase extraction toolkit, implementing a variety of algorithms. Uses spaCy.
  • KeyBERT: Keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
  • UniversalNER: Targeted distillation model for named entity recognition from Microsoft Research and USC, based on data generated by ChatGPT.
  • SpanMarker: Framework for NER models based on transformers such as BERT, RoBERTa and ELECTRA using Hugging Face Transformers (HF page)
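The RAKE entry above describes scoring key phrases from word frequency and co-occurrence. A minimal sketch of that idea in plain Python follows. The tiny stopword list and the function names `candidate_phrases` and `rake_scores` are illustrative assumptions, not the API of any library listed here.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE uses a much longer one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "is", "are",
             "to", "on", "with", "by", "which", "its", "other"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


text = ("Keyword extraction is the task of finding key phrases. "
        "Rapid automatic keyword extraction is fast.")
for phrase, score in rake_scores(text)[:3]:
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that tend to appear in long phrases score highest, which is why multi-word terms usually rise to the top.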

Semantics and Syntax

  • Treebank: Definition at Wikipedia
  • Universal Dependencies: Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.
  • UDPipe: UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files.

Topic Modeling & Embedding

Multilingual NLP and Machine Translation

  • fastText language identification models: Language identification models for use with fastText
  • SeamlessM4T: Multimodal translation and transcription model based on the transformer architecture from Meta research.
  • Helsinki NLP Translation Models: Well-known and used translation models in Hugging Face from the University of Helsinki Language Technology Research Group, based on the OPUS neural machine translation framework.
  • ACL 2023 Multilingual Models Tutorial: Microsoft’s presentations from ACL 2023 - a lot of dense content here on low resource languages, benchmarks, prompting, and bias.
  • ROUGE: Wikipedia page for ROUGE score for summarization and translation tasks.
  • BLEU: Wikipedia page for the BLEU score for machine translation tasks.
  • sacreBLEU: Python library for hassle-free and reproducible BLEU scores
  • XTREME: Comprehensive benchmark for cross-lingual transfer learning on a diverse set of languages and tasks from researchers at Google and Carnegie Mellon
  • Belebele: Multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants from Meta, based upon the Flores dataset
  • OpenNMT: Open neural machine translation models in Pytorch and Tensorflow. Documentation for python here.
  • FinGPT-3: GPT model trained in Finnish, from a research group at the University of Turku, Finland.
  • Jais 13-B: Bilingual Arabic/English model based on GPT-3 architecture, from Inception AI / Core42 group in UAE.
  • Evo-LLM-JP: Japanese LLM from AI startup Sakana.ai created using evolutionary model merging. There is a chat model, a vision model, and a stable diffusion model all of which can be prompted and converse in Japanese. On Hugging Face here.
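For the BLEU entry in the list above, a simplified sentence-level sketch can make the metric concrete. It combines modified n-gram precision for n = 1..4 with a brevity penalty, against a single reference and with no smoothing. The function names are illustrative assumptions; for reportable scores, use a maintained implementation such as sacreBLEU.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU on whitespace tokens (a sketch)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on a mat", "the cat sat on the mat"))
```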

Natural Language Inference (NLI) and Natural Language Understanding (NLU)

  • Adversarial NLI: Benchmark for NLI from Meta research and associated dataset.

Interviewing

Large Language Models (LLMs) and Gen AI

Introductory LLMs

Foundation Models

Text Generation

Web-based Chat Clients

  • ChatGPT: Obviously. From OpenAI. Free, but requires an account.
  • Perplexity Labs: Free, web-based LLM chat client, no account required. Includes popular models such as versions of LLaMA and Mistral as well as Perplexity’s own pplx model.
  • HuggingChat: Chat client from HuggingFace, includes LLaMA and Mistral clients as well as OpenChat. Free for short conversations (in guest mode), account required for longer use.
  • DeepInfra Chat: Includes LLaMA and Mistral, even Mixtral 8x7B! Free to use.
  • Pi: Conversational LLM from Inflection. No account required.
  • Poe: AI assistant from Quora, allows interacting with OpenAI, Anthropic, LLaMA and Google models. Account required.
  • Copilot: Or is it Bing Chat? The lines are blurry. Backed by GPT, allows using GPT-4 on mobile (iOS, Android) for free! Requires a Microsoft account.

Summarization

Fine-tuning LLMs

Model Quantization

Data Labeling

  • Label Studio: Open source python library / framework for data labelling

Code Examples and Cookbooks

  • OpenAI Cookbook: Recipes and tutorial posts for working and building with OpenAI, all in one place. Example code in the Github repo.
  • Cohere Guides: Example notebooks for working with Cohere for various LLM usage cases.

Local LLM Development

  • GPT4All: Locally-hosted LLM from Nomic for offline development.
  • LM Studio: Software framework for local LLM development and usage.
  • Jan: Offline GUI for working with LLMs. Mobile app under development.
  • Open WebUI: Self-hosted WebUI for LLMs to operate entirely offline - formerly Ollama Web UI.
  • TransformerLab: Open source project for GUI interface for working with LLMs locally.
  • SuperWhisper: Local usage of Whisper model on Mac OS, allows you to speak commands to your machine and have them transcribed (all locally).
  • Cursor: Locally installable code editor with autocomplete, chat, etc. backed by OpenAI GPT3.5/4.
  • llama.cpp: Inference from Meta’s LLaMA model in pure C/C++. Python integration through llama-cpp-python.
  • Ollama: Host LLMs locally, includes models like LLaMA, Mistral, Zephyr, Falcon, etc.
  • Exploring Ollama for On-Device AI: Comprehensive tutorial on Ollama from PyImageSearch
  • llamafile: Framework for distributing LLMs as single executable files for local execution and development work. Examples of one-liners from its creator are in Bash One-Liners for LLMs.
  • PowerInfer: CPU/GPU LLM inference engine leveraging activation locality for fast on-device generation and serving of results from LLMs locally.
  • MLC LLM: Native deployment of LLMs with native APIs with compiler acceleration. Includes WebLLM for serving LLMs through the browser and examples of locally developed Android and iPhone LLM apps.
  • DSPy: Framework for algorithmically optimizing LLM prompts and weights from Stanford NLP.
  • AnythingLLM: Docker-based framework for offline LLM usage with RAG.

Multimodal LLMs

Images

Audio

  • wav2vec 2.0 And w2v-BERT: Explanations of the technical details behind these multimodal models from Meta’s FAIR group and Google Brain, by Mohamed Anwar
  • Musenet: Older research from OpenAI, Musenet applied the GPT architecture to MIDI files to compose music.
  • AudioCraft: Multiple models from Meta research, for music (MusicGen), sound effect (AudioGen), and a codec and diffusion model for recovering compressed audio (EnCodec and Multi-band Diffusion). Demo also available in a Hugging Face space, and a sample Colab notebook here.
  • Audiobox: Text-to-audio and speech prompt to audio from Meta. Interactive demo site here.
  • StableAudio: Diffusion-based music generation model from Stability AI. Blog post with technical details.
  • SALMONN: Speech Audio Language Music Open Neural Network from researchers at Tsinghua University and ByteDance. Allows for things like inquiring about the content of audio files, multilingual speech recognition & translation and audio-speech co-reasoning.
  • Real-time translation and lip-synching: https://blog.invgate.com/video-translator
  • HeyGen: Startup creating AI generated avatars and multimedia content, e.g. for instructional videos. Video demo of lip-synching (dubbing) and translation.
  • Whisper: OpenAI’s open source multilingual speech-to-text transcription model. Official Github repo with lots of details.
  • whisper_real_time: Example of real-time audio transcription using Whisper
  • whisper.cpp: High-performance plain C/C++ implementation of inference using OpenAI's Whisper without dependencies
  • Deepgram: Audio AI company with enterprise offerings for speech-to-text (STT), including their own Nova-2 model as well as Whisper or custom models.
  • AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios: Model for realistic audio generation (text-to-speech / TTS) from researchers at Microsoft.
  • Project Gutenberg Audio Collection Project: Thousands of free audiobooks generated using AdaSpeech4, brought to you by Project Gutenberg, MIT, and Microsoft
  • ElevenLabs: Well-known American software company with AI voice cloning and translation products.
  • Projects: Create High-Quality Audiobooks in Minutes: Tool for creating high-quality audiobooks via TTS from ElevenLabs.
  • Brain2Music: Research from Google for using fMRI scans to reconstruct audio perceived by the listener.
  • WavJourney: Compositional Audio Creation with Large Language Models: An approach for generating audio combining generative text for scriptwriting plus audio generation models.
  • XTTS: Voice cloning model specifically designed with game creators in mind from coqui.ai. Available in a Hugging Face space here.
  • The Future of Music - How Generative AI Is Transforming the Music Industry: Blog post from Andreessen Horowitz covering a lot of recent developments in the intersection of the music industry and GenAI tools.
  • StyleTTS2: Diffusion and adversarial model for realistic speech synthesis (TTS). Audio samples and comparisons with previous models are here.
  • Qwen-Audio: Multimodal audio understanding LLM from Alibaba Group
  • Audio Diffusion Pytorch: A fully featured audio diffusion library in PyTorch, from researchers at ElevenLabs.
  • MARS5-TTS: English TTS model from Camb.ai. With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime and more.
  • IMS Toucan: IMS Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the Institute for Natural Language Processing (IMS), University of Stuttgart, Germany.

Video and Animation

  • Generative Image Dynamics: Model from researchers at Google for creating looping images or interactive images from still ones.
  • IDEFICS: Open multimodal text and image model from Hugging Face based on Flamingo, similar to GPT4-V. Updated version IDEFICS 2 released 04/2024 with demo here.
  • NeRF: Neural Radiance Fields synthesize novel views of a scene from a set of input images.
  • ZipNeRF: Building on NeRF with more advanced techniques and impressive results, generating drone-style “fly-by” videos from still images of settings.
  • Pegasus-1: Multimodal model from TwelveLabs for describing videos and video-to-text generation.
  • Gen-2 by RunwayML: Video-generating multimodal model from Runway ML that takes text or images as input.
  • Replay: Video (animated picture) generating model from Genmo AI
  • Hotshot XL: Text to animated GIF generator based on Stable Diffusion XL. Github and Hugging Face model page.
  • ModelScope: Open model for text-to-video generation from Alibaba research
  • Stable Video Diffusion: Generative video diffusion model from Stability AI.
  • VideoPoet: Synthetic video generation from Google Research, taking a variety of inputs (text, image, video).
  • Pika Labs: AI startup for video creation with $55 million in backing.
  • Assistive Video: Video generation from text from AI startup Assistive
  • Haiper: Text-to-video for short clips (2-4s) from Google Deepmind alumni. Free to use with an account.
  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation. Text-to-video model from ByteDance research.
  • Video-LLaVA: Open model for visual question answering in images, video, and between video and image data.

3D Model Generation

  • Stable Zero123: 3D image generation model from Stability AI building on the Zero123-XL model. Weights available for non-commercial use on HF here.
  • DreamBooth3D: Approach for generating high-quality custom 3D models from source images.
  • MVDream: 3D model generation from Diffusion from researchers at ByteDance.
  • TADA! Text to Animatable Digital Avatars: Research on models for synthetic generation of 3D avatars from text prompts, from researchers in China and Germany
  • TripoSR: Image to 3D generative model jointly developed by Tripo AI & Stability AI
  • Microdreamer: Github repo for implementation of Zero-shot 3D Generation in ~20 Seconds from researchers at Renmin University of China

Powerpoint and Presentation Creation

  • Tome: Startup for AI-generated slides (Powerpoint). Free to signup.
  • Decktopus: “World’s #1 AI-Powered Presentation Generator”. Paid signup
  • Beautiful.ai: Another AI-based slide deck generator (paid)

Domain-specific LLMs

Code

  • Github Copilot: Github’s AI coding assistant, based on OpenAI’s Codex model.
  • GitHub Copilot Fundamentals - Understand the AI pair programmer: Introductory online training / short course on Copilot from Microsoft.
  • Gemini Code Assist: Code assistant from Google based on Gemini. Available in Google Cloud or in local IDEs via a plugin (requires subscription).
  • CodeCompose: (TechCrunch article): Meta’s internal coding LLM / answer to Copilot
  • CodeInterpreter: Experimental ChatGPT plugin that provides it with access to executing python code.
  • StableCode: Stability AI’s generative LLM coding model. Hugging Face collection here. Github here.
  • Starcoder: Coding LLM from Hugging Face. Github is here. Update: Starcoder 2 has been released as of Feb 2024!
  • CodeQwen-1.5: Code-specific version of Alibaba’s Qwen model.
  • Codestral: 22B coding model from Mistral AI, supports 80+ languages.
  • Ghostwriter: an AI-powered programming assistant from Replit AI.
  • DeciCoder 1B: Code completion LLM from Deci AI, trained on Starcoder dataset.
  • SQLCoder: Open text-to-SQL query models fine-tuned on Starcoder, from Defog AI. Demo is here.
  • CodeLLaMA: Fine-tuned version of LLaMA 2 for coding tasks, from Meta.
  • Refact Code LLM: 1.6B coding LLM with fill-in-the-middle (fim) capability, trained by Refact AI.
  • Tabby: Open source, locally-hosted coding assistant framework. Can use Starcoder or CodeLLaMA.
  • DuetAI for Developers: Coding assistance based on PaLM as part of Google’s DuetAI offering.
  • Gorilla LLM: LLM model from researchers at UC Berkeley trained to generate API calls across many different platforms and tools.
  • Deepseek Coder: Series of bilingual English/Chinese coding LLMs from DeepSeek AI, trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language.
  • Codestral Mamba: Open coding model from Mistral based on the Mamba architecture.
  • Phind 70B: Code generation model purported to rival GPT-4 from AI startup Phind.
  • Granite: Open-sourced family of code-specific LLMs from IBM Research. On Hugging Face here.

Mathematics

Finance

  • BloombergGPT: LLM trained by Bloomberg from scratch based on code / approaches from BLOOM
  • FinGPT: Finance-specific family of models trained with RLHF, fine-tuned from various base foundation models.
  • DocLLM: Layout-aware large language model from JPMorgan

Science and Health

  • Galactica: (MIT Blog Post) Learnings from Meta’s Galactica LLM, trained on scientific research papers.
  • BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, open LLM from Microsoft Research trained on PubMed papers.
  • MedPALM: A large language model from Google Research, designed for the medical domain. Google has continued this work with MedLM.
  • Meditron: Fine-tuned LLaMAs on medical data from Swiss university EPFL. HuggingFace space here. Github here. Llama3 version released 2024/04/19.
  • MedicalLLM: Evaluation benchmark for medical LLMs from Hugging Face including leaderboard.

Law

  • SaulLM-7B: Legal LLM from researchers at Equall.ai and other universities. A fine-tune of Mistral-7B trained on a legal corpus of over 30B tokens.

Time Series

  • TimeGPT: Transformer-based time series prediction models from NIXTLA. Requires using their service / an API token.
  • Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. Open-source foundation model for time series forecasting based on the transformer architecture.
  • Granite: Time-series versions of open-sourced family of LLMs from IBM Research. On Hugging Face here.

Vector Databases and Frameworks

  • Docarray: python library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, and so on.
  • Faiss: Library for efficient similarity search and clustering of dense vectors from Meta Research.
  • Pinecone: Managed vector database that offers high-performance search and similarity matching.
  • Weaviate: Open-source vector database to store data objects and vector embeddings from your favorite ML-models.
  • Chroma: Open-source vector store used for storing and retrieving vector embeddings and metadata for use with large language models.
  • Milvus: Vector database built for scalable similarity search.
  • AstraDB: Datastax’s vector database offering built atop Apache Cassandra.
  • Activeloop: Database for AI powered by a unique storage format optimized for deep-learning and Large Language Model (LLM) based applications.
  • OSS Chat: Demo of RAG from Zilliz, allowing chat with OSS documentation.
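Most of the entries above revolve around one operation: given a query vector, find the stored vectors most similar to it. The sketch below shows that operation as a brute-force cosine-similarity search in plain Python. The 3-dimensional "embeddings", document ids, and function names are hypothetical. Real systems such as Faiss or Milvus use approximate indexes to avoid scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical toy embeddings; real ones come from an embedding model.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_finance": [0.0, 0.2, 0.9],
}
for doc_id, score in top_k([1.0, 0.2, 0.0], index, k=2):
    print(doc_id, round(score, 3))
```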

Evaluation

  • The Stanford Natural Language Inference (SNLI) Corpus: Foundational dataset for NLI-based evaluation, 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral.
  • GLUE: General Language Understanding Evaluation Benchmark from NYU, University of Washington, and Google - model evaluation using Natural Language Inference (NLI) tasks.
  • SuperGLUE: The Super General Language Understanding Evaluation, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
  • SQuAD (Stanford Question Answering Dataset): Reading comprehension question answering dataset for LLM evaluation.
  • BigBench: The Beyond the Imitation Game Benchmark (BIG-bench) from Google, a collaborative benchmark with over 200 tasks.
  • BigBench Hard: Subset of BigBench tasks considered to be the most challenging, with associated paper.
  • MMLU: Massive Multitask Language Understanding is a benchmark developed by researchers at UC Berkeley and others to specifically measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.
  • HELM: Holistic Evaluation of Language Models, a “living” benchmark designed to be comprehensive, from the Center for Research on Foundation Models (CRFM) at Stanford.
  • HellaSwag: a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
  • Dynabench: A “platform for dynamic data collection and benchmarking”. Sort of a Kaggle / collaborative site for benchmarks and data collaboration, an effort of researchers from Meta and American universities.
  • LMSys Chatbot Arena: Leaderboard from the LMSys group based upon human evaluation and Elo score. The only evaluation that Andrej Karpathy trusts.
  • Hugging Face Open LLM Leaderboard: Leaderboard from H4 (alignment) Group at Hugging Face. Largely open and fine-tuned models, though this can be filtered.
  • AlpacaEval Leaderboard: AlpacaEval is an LLM-based automatic evaluation based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions.
  • OpenCompass: Leaderboard for Chinese LLMs.
  • Evaluating LLMs is a minefield: Popular deck from researchers at Princeton (and authors of AI Snake Oil) on the pitfalls and intricacies of evaluating LLMs.
  • LM Contamination Index: The LM Contamination Index is a manually created database of contamination of LLM evaluation benchmarks.
  • The Curious Case of LLM Evaluation: In depth blog post, examining some of the finer nuances and sticking points of evaluating LLMs.
  • LLM Benchmarks: Dynamic dataset of crowd-sourced prompts that changes weekly for more realistic LLM evaluation.
  • Language Model Evaluation Harness: EleutherAI’s language model evaluation harness, a unified framework to test generative language models on over 200 different evaluation tasks.
  • PromptBench: Unified framework for LLM evaluation from Microsoft.
  • HarmBench: Standardized evaluation framework for automated red teaming for mitigating risks associated with malicious use of LLMs. Paper on arxiv.
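Several leaderboards above rank models from pairwise human votes using an Elo score. The sketch below shows the classical Elo update for a single comparison. The K-factor of 32 and the 400-point scale are the conventional chess defaults, used here as assumptions; the exact computation any given leaderboard uses may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


# Two hypothetical models start level; a human voter prefers model A.
print(update_elo(1000, 1000, score_a=1.0))  # (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves both ratings further than an expected result does, so ratings converge as votes accumulate.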

Agents

  • AutoGPT: One of the most popular frameworks for using LLM agents, using the OpenAI API / GPT4.
  • ThinkGPT: python library for implementing Chain of Thoughts for LLMs, prompting the model to think, reason, and to create generative agents.
  • AutoGen: Multi-agent LLM framework for building applications from Microsoft.
  • XAgent: Open-source experimental agent, designed to be general-purpose and applicable to a wide range of tasks. From students at Tsinghua University.
  • Thought Cloning: Github repo for implementation of Thought Cloning (TC), an imitation learning framework that trains agents to think like humans.
  • Demonstrate-Search-Predict (DSP): framework for solving advanced tasks with language models (LMs) and retrieval models (RMs).
  • ReAct Framework: Prompting method that includes examples with actions, the observations gained by taking those actions, and transcribed thoughts (reasoning), so that LLMs can take complex actions and reason through or solve problems.
  • Tree of Thoughts (ToT): LLM reasoning process as a tree, where each node is an intermediate "thought" or coherent piece of reasoning that serves as a step towards the final solution.
  • GPT Engineer: Python framework for attempting to get GPT to write code and build software.
  • MetaGPT - The Multi-Agent Framework: Agent framework where different assigned roles (product managers, architects, project managers, engineers) are used for building different products (user stories, competitive analysis, requirements, data structures, etc.) given a requirement.
  • OpenGPTs: Open source effort from Langchain to create a similar experience to OpenAI's GPTs with greater flexibility and choice.
  • Devin: “AI software engineer” from startup Cognition Labs.
  • SWE-Agent: Open source software engineering agent framework from researchers at Princeton.
  • GATO: Generalist agent from Google Deepmind research for many tasks and media types
  • WebLLaMa: Fine-tuned version of LLaMa 3 from McGill University, optimized for web browsing tasks.

Application Frameworks

  • LlamaIndex: LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Used for RAG and building LLM applications working with stored data.
  • LangChain: LangChain is a framework for developing applications powered by language models.
  • Chainlit: Chainlit is an open-source Python package that makes it incredibly fast to build ChatGPT-like applications with your own business logic and data.

LLM Training, Training Frameworks, Training at Scale

  • Deepspeed: Deep learning optimization software suite that enables unprecedented scale and speed for DL Training and Inference from Microsoft.
  • Megatron-LM: From NVIDIA, Megatron-LM enables training large transformer language models with efficient tensor, pipeline and sequence-based model parallelism for pre-training transformer based language models.
  • GPT-NeoX: Eleuther AI’s library for large scale GPU training of LLMs, based on Megatron.
  • TRL (Transformer Reinforcement Learning): Library for Reinforcement Learning of Transformer and Stable Diffusion models built atop the transformers library.
  • Autotrain Advanced: In development offering and python library from Hugging Face for easy and fast auto-training of LLMs and Stable Diffusion models.
  • Transformer Math: Detailed blog post from Eleuther AI on the mathematics of compute requirements for training LLMs
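The Transformer Math post listed above deals with estimating training compute. A widely used rule of thumb for dense transformers is C ≈ 6 · P · D floating point operations, where P is the parameter count and D the number of training tokens. The sketch below applies it. The function names, the hardware figures, and the 40% utilization are hypothetical values chosen only for illustration.

```python
def training_flops(params, tokens):
    """Approximate total training compute: C ~= 6 * P * D."""
    return 6 * params * tokens


def training_days(params, tokens, gpus, flops_per_gpu, utilization=0.4):
    """Rough wall-clock estimate given sustained per-GPU throughput."""
    effective_flops_per_second = gpus * flops_per_gpu * utilization
    seconds = training_flops(params, tokens) / effective_flops_per_second
    return seconds / 86400  # seconds per day


# Hypothetical run: 7B parameters, 1T tokens, 256 GPUs at a nominal
# 300 TFLOP/s peak each, with an assumed 40% utilization.
flops = training_flops(7e9, 1e12)
days = training_days(7e9, 1e12, gpus=256, flops_per_gpu=300e12)
print(f"{flops:.2e} FLOPs, roughly {days:.1f} days")
```

The estimate is only as good as the utilization figure, which varies widely with model size, parallelism strategy, and interconnect.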

Reinforcement Learning from Human Feedback (RLHF)

Embeddings

LLM Serving

Preprocessing and Tokenization

  • Tiktoken: OpenAI’s BPE-based tokenizer
  • SentencePiece: Unsupervised text tokenizer and detokenizer for text generation systems from Google (but not an official product).

Open LLMs

  • LLaMa 2: Incredibly popular open weights (with license) model from Meta AI which spawned a generation of offspring and fine-tunes. Comes in 7, 13, and 70B versions.
  • Mistral 7B: Popular open model from French startup Mistral with no fine-tuning (only pretraining). See also the mixture-of-experts successors Mixtral 8x7B and Mixtral 8x22B.
  • Mistral NeMo: Open 12B-parameter model from Mistral with a 128k context window, trained in partnership with NVIDIA and using a new tokenizer (Tekken). Model on Hugging Face.
  • Gemma: Lightweight open models from Google based on the same architecture as Gemini. Comes in 2B and 7B base and instruction-tuned versions.
  • GPT-J and GPT Neo-X: Open models trained from scratch by Eleuther AI.
  • Falcon 40B: Open text generation LLM from UAE’s Technology Innovation Institute (TII). Available on Hugging Face here.
  • Falcon 2 11B: Second set of models in the series from TII, released May 2024, including a multimodal model. On Hugging Face here.
  • StableLM: Open language model from Stability AI. Succeeded by StableLM 2, in 1.6B (Jan 2024) and 12B versions (April 2024, try live demo here)
  • OLMo: Open Language Models from the Allen Institute for AI (AI2)
  • DCLM-7B: 7 billion parameter language model from Apple designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.
  • Snowflake Arctic: Open LLM from Snowflake, released April 2024. Github here and on Hugging Face here.
  • Minotaur 15B: Fine-tuned version of Starcoder on open code datasets from the OpenAccess AI Collective
  • MPT: Family of open models free for commercial use from MosaicML. Includes MPT Storywriter which has a 65K context window.
  • DBRX: Family of mixture-of-experts (MoE) large language models trained from scratch by Databricks Mosaic Research. Try it out in the Hugging Face playground here.
  • Qwen: Open LLMs from Alibaba Cloud in 7B and 14B sizes, including chat versions. Model family 1.5 released Feb 2024 and Qwen1.5-MoE Mixture of Experts model released March 28, 2024.
  • Command-R / Command-R+: Open LLM from Cohere for AI for long-context tasks such as retrieval augmented generation (RAG) and tool use. Available on HuggingFace Command-R, Command-R+
  • Aya: Massively multilingual models from Cohere for AI. Aya 101 and Aya 23 support 101 and 23 languages respectively. Aya 23 comes in 8B and 35B versions.
  • Grok-1: xAI’s LLM, an MoE with 314B parameters, weights available via torrent. This is the (pre-trained) base model only, and not fine-tuned for chat.
  • SmolLM: Family of small language models (SLMs) from Hugging Face in 135M, 360M, and 1.7B parameter sizes. On Hugging Face here.
  • Jamba: Hybrid SSM-Transformer model from AI21 Labs - “world’s first production grade Mamba based model”. Weights on Hugging Face here.
  • Fuyu-8B: Open multimodal model from Adept AI, a smaller version of the model that powers their commercial product.
  • Yi: Bilingual open LLM from Chinese startup 01.AI founded by Kai-Fu Lee, with two versions Yi-34B & 6B. Also Yi-9B open-sourced in March 2024.
  • OpenHermes: Popular series of open (and uncensored) LLMs from Nous Research, fine-tunes of models such as LLaMA, Mixtral, Yi, and SOLAR.
  • Poro 34B: Fully open-source bilingual Finnish & English model trained in collaboration between Finnish startup Silo AI and the TurkuNLP group of the University of Turku.
  • Nemotron-3 8B: Family of “semi-open” (requires accepting a license) LLMs from NVIDIA, optimized for their NeMo framework. Find them all on the collections page on HF.
  • ML Foundations: Github repo for Ludwig Schmidt from University of Washington, includes open versions of multimodal models Flamingo & CLIP

Visualization

Prompt Engineering

Ethics, Bias, and Legal

Costing

Books, Courses and other Resources

Communities

  • MLOps Community: Community of machine learning operations (MLOps) practitioners, but lately very much focused on LLMs.
  • LLMOps Space: Global community for LLM practitioners & enthusiasts, focused on topics related to deploying LLMs into production.
  • Aggregate Intellect Socratic Circles (AISC): Online community of ML and AI practitioners based in Toronto, with Slack server, journal club, and free talks
  • /r/LanguageTechnology: Reddit community on Natural Language Processing and LLMs with over 40K members
  • /r/LocalLLaMA: Subreddit to discuss training Llama and development around it, though also contains a lot of good general LLM discussion.

MOOCS and Courses

Books

Surveys

Aggregators and Online Resources

Newsletters

These are not referral links.

  • GPTRoad: Daily no-nonsense newsletter covering developments in the AI / LLM space. They also have a site following the HackerNews template.
  • TLDR AI: Daily newsletter with little fluff, covering developments in AI news.
  • AI Tool Report: Newsletter from Respell, with AI headlines and jobs.
  • The Memo from Lifearchitect.ai: Bi-weekly newsletter with future-focused updates on developments in the LLM space.
  • AI Breakfast: Curated weekly analysis of the latest AI projects, products, and news
  • The Rundown AI: Another daily AI newsletter (400K+ readers)
  • Interconnects: LLM / AI newsletter for more technical readers.
  • The Neuron: Another AI newsletter with cutesy and light tone.

Papers (WIP)

Conferences and Societies
