Today, there are many Generative AI models to choose from: Large Language Models (LLMs), Small Language Models (SLMs), proprietary models, open models, models in the cloud, models on devices, text models, code models, image models, and multimodal models. In this session, we'll provide guidance to help you choose the right model to fit your needs, and tools you can use to evaluate and compare models for your specific tasks.
There are many Generative AI models: LLMs, SLMs, proprietary, open, cloud-based, on-device, text, code, image, and multimodal. Learn how to choose the right model for your needs and about tools to evaluate and compare models for specific tasks.
45 minutes
Attendees should leave with a clear understanding of what generative AI models are, how they work, and their significance in various applications.
Participants will learn how to choose the right model for specific tasks based on factors such as task requirements, model capabilities, and performance metrics.
We’ll explore different types of generative models, including Large Language Models (LLMs) like GPT-3, Small Language Models (SLMs), and multimodal models that handle text, code, and images.
Attendees will see how fine-tuning can improve model performance and adapt it to specific use cases.
We’ll discuss industry-standard evaluation metrics and tools like Azure AI Studio for accurate model measurement.
Examples of successful model applications will provide insights into practical deployment scenarios.
We’ll wrap up with thoughts on where this field is headed and its potential impact.
- AI Studio
- Azure Model Catalog
- GitHub Model Catalog
- Large Language Models - GPT-3.5, GPT-4/4v/4o
- Small Language Models - Phi-3.5
- ONNXRuntime
- OLIVE
- Windows AI PC SDK
- Brief overview of Generative AI models
- Importance of choosing the right model for specific tasks
- Multimodal and GPT Prompts vs DALL-E Outcomes
- Large Language Models (LLMs)
- Small Language Models (SLMs)
- Proprietary vs. Open Models
- Cloud-based vs. On-device Models
- Text, Code, Image, and Multimodal Models
DEMO - Inference with Phi-3-mini-instruct & Phi-3-vision (3 min): comparing SLM and LLM inference on text and vision while building a cross-platform solution
- Notebooks
- Model Chat Demo Video: This demo takes a PNG image and converts it to code, comparing a locally hosted Phi-3 ONNX model with GPT-4o (cloud hosted on Azure/GitHub Models). The generated code is Python that recreates the image with Matplotlib.
DEMO - Chainlit RAG Chat App using GitHub Models and the Inference API (2 min): comparing SLM and LLM inference on text from a RAG chat web application
- Model Chat Demo Video: This is a RAG chat demo over text files that compares LLM and SLM outputs. You can select either the LLM or the SLM and compare the results.
- The opportunity of SLMs and LLMs
DEMO - ONNXRuntime WebGL + AI PC (5 min)
- AIPC Sample Source Code
- AI PC Video Demo for non AI PC users
- WebGPU RAG Chat Demo
- Video WebGPU RAG Chat Demo
- Video AI PC Demo
- Task requirements and model capabilities
- Performance metrics and evaluation methods
- Iterative refinement and validation processes
- Fine-tuning options for model improvement
  Cloud-based fine-tuning using Azure AI compute and local fine-tuning using Microsoft Olive
- Azure Machine Learning for model accuracy measurement
- Industry-standard evaluation metrics for different model types
- Experimentation and validation techniques for model selection
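One way to make the experimentation step concrete is a small side-by-side harness that runs the same prompts through each candidate model and records a quality metric and latency. The sketch below is only an illustration: `fake_slm` and `fake_llm` are hypothetical stand-ins for real model clients, and exact match is just one possible metric.

```python
import time


def compare_models(models, dataset):
    """Run every model over the same prompts and report accuracy and latency.

    models:  dict mapping a display name to a callable(prompt) -> str
    dataset: list of (prompt, expected_answer) pairs
    """
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    report = {}
    for name, generate in models.items():
        hits = 0
        started = time.perf_counter()
        for prompt, expected in dataset:
            answer = generate(prompt)
            # Exact match after trimming whitespace and ignoring case.
            if answer.strip().lower() == expected.strip().lower():
                hits += 1
        elapsed = time.perf_counter() - started
        report[name] = {
            "exact_match": hits / len(dataset),
            "avg_latency_s": elapsed / len(dataset),
        }
    return report


# Hypothetical stand-ins for real clients (e.g. a local SLM and a cloud LLM).
def fake_slm(prompt):
    return {"Capital of France?": "paris"}.get(prompt, "unknown")


def fake_llm(prompt):
    return {"Capital of France?": "Paris", "2 + 2 = ?": "4"}.get(prompt, "unknown")


dataset = [("Capital of France?", "Paris"), ("2 + 2 = ?", "4")]
report = compare_models({"slm": fake_slm, "llm": fake_llm}, dataset)
for name, metrics in report.items():
    print(name, metrics)
```

In practice you would swap the stubs for real inference calls and replace exact match with a metric suited to the task, such as a rubric score or a similarity measure.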
- Examples of successful model applications
- Lessons learned from model deployment and usage
-
DEMO - Cloud Native Distributed Application using Phi-3 & .NET Aspire to undertake RAG (5 min)
RAG Aspire demo (deployment of Phi-3 as Models as a Service, using .NET Aspire to create a cloud-native distributed application)
The RAG Aspire demo showcases the deployment of Phi-3 as a service and the use of .NET Aspire to create a cloud-native distributed chat application. It shows how models like Phi-3 can be integrated and deployed within the Azure ecosystem, and how .NET Aspire supports building scalable, cloud-native chat applications.
- Recap of key points
- Final thoughts on the future of Generative AI models
- Open floor for questions and discussion
Resources | Links | Description |
---|---|---|
Phi-3 CookBook | Phi-3CookBook | Learn more about Phi-3 samples, models and deployments |
Windows AI PC and Copilot SDK | Windows Copilot SDK | Use powerful AI APIs with Windows Copilot Library |
ONNX Runtime | ONNXRuntime | Production-grade AI engine to speed up training and inferencing in your existing technology stack, including on mobile devices. |
AI Studio | AI Studio | Azure AI Studio is a platform for building, evaluating, and deploying generative AI solutions and custom copilots. |
AI Studio Model Catalog | AI Studio Model Catalog | Find the right model to build your custom AI solution |
AI Toolkit for VS Code | AI Toolkit | AI Toolkit for VS Code streamlines generative AI app development by integrating tools and models from Azure AI Studio and Hugging Face. Browse and download public models, fine-tune, test, and use them in your Windows applications. |
GitHub Model Catalog | GitHub Model Catalog | Try, test, and deploy from a wide range of model types, sizes, and specializations. |
Requirement | Start with | Why? |
---|---|---|
Steer model with a few examples | Prompt engineering | Easy to craft and quick experimentation, very low barrier to entry |
Simple & quick implementation | Prompt engineering, RAG | Easy tooling with Azure OpenAI on Your Data, PromptFlow, LangChain |
Improve model relevancy | RAG | Retrieve relevant information from your own datasets to insert into prompts |
Up-to-date information | RAG | Query up-to-date information from your own databases, search engines, etc. to insert into prompts |
Factual grounding | RAG | Ability to reference & inspect retrieved data |
Optimize for specific tasks | Fine tuning | Fine tuning is great at steering your model for specific tasks like summarizing data in a specific format |
Instructions won't fit in a prompt | Fine tuning | Fine tuning moves few-shot examples into the training step, but increases the number of examples needed to train. |
Lower costs | It depends | |
Complex, novel data or domains | Prompt engineering + RAG + fine tuning | |
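To illustrate the RAG rows above ("retrieve relevant information from your own datasets to insert into prompts"), here is a minimal sketch in plain Python. It assumes a toy keyword-overlap retriever in place of a real search index or vector store, and the function names (`tokenize`, `retrieve`, `build_prompt`) and the prompt template are hypothetical, not part of any specific library.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by the number of tokens they share with the question."""
    question_tokens = tokenize(question)
    scored = [(len(question_tokens & tokenize(doc)), doc) for doc in documents]
    # Drop documents with no overlap, then keep the best matches first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


def build_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Insert the retrieved passages into the prompt as grounding context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


documents = [
    "Phi-3 is a family of small language models.",
    "ONNX Runtime is an inference engine.",
    "Bananas are yellow.",
]
prompt = build_prompt("What is ONNX Runtime?", documents)
print(prompt)
```

Because the retrieved passages are visible in the prompt, you can inspect exactly what the model was grounded on, which is the "factual grounding" benefit listed in the table.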
Benchmarks | Description | Reference URL |
---|---|---|
GLUE Benchmark | GLUE (General Language Understanding Evaluation) benchmark provides a standardized set of diverse NLP tasks to evaluate the effectiveness of different language models | https://gluebenchmark.com/ |
SuperGLUE Benchmark | A successor to GLUE with more challenging and diverse tasks and comprehensive human baselines | https://super.gluebenchmark.com/ |
HellaSwag | Evaluates commonsense reasoning by asking an LLM to pick the most plausible completion of a sentence | https://rowanzellers.com/hellaswag/ |
TruthfulQA | Measures truthfulness of model responses | https://github.com/sylinrl/TruthfulQA |
MMLU | MMLU (Massive Multitask Language Understanding) evaluates an LLM's knowledge and problem solving across 57 subjects using multiple-choice questions | https://github.com/hendrycks/test |
KILT | Benchmark and library for knowledge-intensive language tasks | https://github.com/facebookresearch/KILT |
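Several of these benchmarks, MMLU among them, are scored as multiple-choice accuracy. The sketch below shows how such a score could be computed per subject and overall. The item layout (`subject`, `question`, `choices`, `answer`) and the `predict` callable are assumptions made for illustration, not the official harness of any benchmark.

```python
from collections import defaultdict


def score_multiple_choice(items, predict):
    """Score a model on multiple-choice items.

    items:   list of dicts with keys 'subject', 'question', 'choices', 'answer'
             where 'answer' is the index of the correct choice
    predict: callable(question, choices) -> index of the chosen answer
    """
    if not items:
        raise ValueError("items must contain at least one question")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["subject"]] += 1
        if predict(item["question"], item["choices"]) == item["answer"]:
            correct[item["subject"]] += 1
    per_subject = {subject: correct[subject] / total[subject] for subject in total}
    return {
        "per_subject": per_subject,
        # Macro average: every subject counts equally, regardless of size.
        "macro_accuracy": sum(per_subject.values()) / len(per_subject),
        # Micro average: every question counts equally.
        "micro_accuracy": sum(correct.values()) / sum(total.values()),
    }


# Toy data and a trivial baseline that always picks the first choice.
items = [
    {"subject": "math", "question": "1 + 1?", "choices": ["2", "3"], "answer": 0},
    {"subject": "math", "question": "2 + 2?", "choices": ["5", "4"], "answer": 1},
    {"subject": "math", "question": "3 + 3?", "choices": ["6", "7"], "answer": 0},
    {"subject": "history", "question": "Year of the moon landing?", "choices": ["1969", "1979"], "answer": 0},
]
scores = score_multiple_choice(items, lambda question, choices: 0)
print(scores)
```

Reporting both averages is useful because a model can look strong overall while performing poorly on a small subject.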
Frameworks / Platforms | Description | Tutorials/lessons | Reference |
---|---|---|---|
Azure AI Studio Evaluation (Microsoft) | Azure AI Studio is an all-in-one AI platform for building, evaluating, and deploying generative AI solutions and custom copilots. Technical landscape: no-code (model catalog in AzureML studio & AI Studio), low-code (CLI), pro-code (azureml-metrics SDK) | Tutorials | Link |
Prompt Flow (Microsoft) | A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, and evaluation to production, deployment, and monitoring. | Tutorials | Link |
Weights & Biases (Weights & Biases) | A Machine Learning platform to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues. | Tutorials, DeepLearning.AI Lesson | Link |
ONNX Runtime (Open Neural Network Exchange) | GGUF (GGML Unified Format) |
---|---|
Purpose: ONNX Runtime is an inference engine designed to run models in the ONNX (Open Neural Network Exchange) format. It supports multiple backends and optimizations, making it versatile for various hardware and deployment scenarios. | Purpose: GGUF is part of the GGML ecosystem, which includes tools like llama.cpp for running inference on large language models (LLMs). It is optimized for local inference. |
Optimization: ONNX Runtime includes features for model optimization, such as quantization and hardware acceleration. | Optimization: GGUF supports 4-bit quantization and is designed to work efficiently with specific hardware setups, such as M-series GPUs on Apple Silicon. |
Compatibility: It supports a wide range of frameworks like PyTorch, TensorFlow, and others, making it versatile for various backend ML frameworks. | Compatibility: GGUF (the successor to the GGML file format) is designed to be highly compatible with various environments and tools, particularly for efficient inference of large language models. |
Inference: ONNX models can be run using ONNX Runtime, which supports multiple backends and optimizations. | Inference: GGUF is tightly integrated with GGML and related tools, providing a streamlined experience for inference. |
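Both columns mention quantization. As a rough illustration of the idea, the sketch below applies symmetric absmax quantization to signed 4-bit integers in plain Python. This is a simplified teaching example under our own assumptions: it is not the on-disk layout of GGUF or the quantizer shipped with ONNX Runtime, both of which use more elaborate schemes.

```python
def quantize_4bit(values):
    """Map floats to signed 4-bit integers in [-7, 7] with one shared scale."""
    absmax = max(abs(v) for v in values)
    # Avoid dividing by zero when every value is 0.
    scale = absmax / 7 if absmax > 0 else 1.0
    quantized = [max(-7, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [7.0, -3.0, 0.0, 1.2]
quantized, scale = quantize_4bit(weights)
restored = dequantize(quantized, scale)

# Each weight now needs 4 bits instead of 16 or 32, at the cost of rounding error.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why real formats quantize weights in small blocks, each with its own scale.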
Memory Optimization | Efficiency Optimization |
---|---|
Low-Rank Adaptation (LoRA): LoRA is a technique to fine-tune large language models. It uses low-rank approximation methods to reduce the computational and financial costs of adapting models with billions of parameters, such as GPT-3, to specific tasks or domains. | DeepSpeed: a deep learning software library that accelerates the training of large language models. It includes ZeRO (Zero Redundancy Optimizer), a memory-efficient approach for distributed training. DeepSpeed can automatically optimize fine-tuning jobs that use Hugging Face’s Trainer API, and offers a drop-in replacement script to run existing fine-tuning scripts. |
Quantized LoRA (QLoRA): QLoRA is an efficient fine-tuning approach for large language models (LLMs) that significantly reduces memory usage while maintaining the performance of full 16-bit fine-tuning. It achieves this by backpropagating gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters. | ZeRO: ZeRO is a set of memory optimization techniques for training large models, such as Turing-NLG 17B, and is designed to scale to models with trillions of parameters. A key appeal of ZeRO is that no model code modifications are required. It’s a memory-efficient form of data parallelism that gives you access to the aggregate GPU memory of all the GPU devices available to you, without the inefficiency caused by data replication in standard data parallelism. |
Parameter-Efficient Fine-Tuning (PEFT): PEFT adapts pre-trained models by training only a small number of parameters for a specific task, and delivers performance comparable to full fine-tuning across modalities like image classification and Stable Diffusion. It's a valuable approach for high performance with minimal trainable parameters. | DoRA: Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA first decomposes the pre-trained weight into its magnitude and directional components and fine-tunes both of them. Because the directional component has many parameters, it is further decomposed with LoRA for efficient fine-tuning. |
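To show why LoRA is cheap, the sketch below compares the number of trainable parameters in a full weight update with a rank-`r` update, and builds the adapted weight `W + (alpha / r) * B @ A` for a tiny matrix. It is a plain-Python illustration under our own naming (`lora_param_counts`, `lora_effective_weight`); real fine-tuning would use a framework such as PyTorch with a PEFT library.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full update vs. low-rank factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(W, B, A, alpha, rank):
    """Return W + (alpha / rank) * B @ A; W stays frozen, only B and A are trained."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full}, LoRA: {lora}, reduction: {full // lora}x")

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pre-trained weight
B = [[1.0], [2.0]]            # trainable, shape 2 x 1
A = [[3.0, 4.0]]              # trainable, shape 1 x 2
adapted = lora_effective_weight(W, B, A, alpha=2, rank=1)
print(adapted)
```

QLoRA keeps the same low-rank factors but stores the frozen weight in 4-bit form, which is where its additional memory savings come from.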
Lee Stott 📢 |
Kinfey Lo 📢 |
Microsoft is committed to helping our customers use our AI products responsibly, sharing our learnings, and building trust-based partnerships through tools like Transparency Notes and Impact Assessments. Many of these resources can be found at https://aka.ms/RAI. Microsoft’s approach to responsible AI is grounded in our AI principles of fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.
Large-scale natural language, image, and speech models - like the ones used in this sample - can potentially behave in ways that are unfair, unreliable, or offensive, in turn causing harms. Please consult the Azure OpenAI service Transparency note to be informed about risks and limitations.
The recommended approach to mitigating these risks is to include a safety system in your architecture that can detect and prevent harmful behavior. Azure AI Content Safety provides an independent layer of protection, able to detect harmful user-generated and AI-generated content in applications and services. Azure AI Content Safety includes text and image APIs that allow you to detect material that is harmful. Within Azure AI Studio, the Content Safety service allows you to view, explore and try out sample code for detecting harmful content across different modalities. The following quickstart documentation guides you through making requests to the service.
Another aspect to take into account is the overall application performance. With multimodal and multi-model applications, we consider performance to mean that the system performs as you and your users expect, including not generating harmful outputs. It's important to assess the performance of your overall application using Performance and Quality and Risk and Safety evaluators. You also have the ability to create and evaluate with custom evaluators.
You can evaluate your AI application in your development environment using the Azure AI Evaluation SDK. Given either a test dataset or a target, your application's generations are quantitatively measured with built-in evaluators or custom evaluators of your choice. To get started with the Azure AI Evaluation SDK, you can follow the quickstart guide. Once you execute an evaluation run, you can visualize the results in Azure AI Studio.
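As a rough idea of what a custom evaluator can look like, the sketch below defines a plain Python callable that takes a response and returns a dictionary of metrics, then runs it over a few rows locally. The class name, the metric names, and the `run_evaluator` helper are hypothetical; this sketch does not call the Azure AI Evaluation SDK, so check the quickstart guide for how custom evaluators are registered and run there.

```python
class AnswerLengthEvaluator:
    """Hypothetical custom evaluator: checks that a response stays within a word budget."""

    def __init__(self, max_words=50):
        self.max_words = max_words

    def __call__(self, *, response, **kwargs):
        # Extra keyword arguments (query, context, ...) are accepted and ignored.
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "within_budget": word_count <= self.max_words,
        }


def run_evaluator(evaluator, rows):
    """Apply an evaluator to every row and aggregate a simple pass rate."""
    if not rows:
        raise ValueError("rows must contain at least one example")
    results = [evaluator(**row) for row in rows]
    pass_rate = sum(r["within_budget"] for r in results) / len(results)
    return {"rows": results, "pass_rate": pass_rate}


rows = [
    {"query": "What is an SLM?", "response": "A small language model."},
    {"query": "Summarize Phi-3.", "response": "word " * 60},
]
summary = run_evaluator(AnswerLengthEvaluator(max_words=50), rows)
print(summary["pass_rate"])
```

Keeping evaluators as small callables that return named metrics makes them easy to unit test before you plug them into a larger evaluation run.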