
Databricks has launched a public preview of GPU and LLM optimization support for Databricks Model Serving. The new feature enables the deployment of a wide range of AI models, including LLMs and vision models, on the Lakehouse Platform.
Databricks Model Serving offers automatic optimization for LLM serving, delivering high-performance results without the need for manual configuration. According to Databricks, it is the first serverless GPU serving product built on a unified data and AI platform, allowing users to create and deploy GenAI applications seamlessly within a single platform, covering everything from data ingestion to model deployment and monitoring.
Databricks Model Serving simplifies the deployment of AI models, making it easy even for users without deep infrastructure knowledge. Users can deploy a wide range of models, including natural language, vision, audio, tabular, or custom models, regardless of how they were trained (from scratch, open source, or fine-tuned with proprietary data).
Simply log your model with MLflow, and Databricks Model Serving will automatically prepare a production-ready container with GPU libraries such as CUDA and deploy it to serverless GPUs. This fully managed service handles everything from managing instances and maintaining version compatibility to patching versions. It also automatically adjusts instance scaling to match traffic patterns, saving on infrastructure costs while optimizing performance and latency.
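As a rough sketch of what this workflow looks like after a model has been logged with MLflow, the snippet below builds the kind of JSON configuration you would send to the Databricks serving-endpoints REST API to request a GPU-backed endpoint. The endpoint name, model name, and workload sizes here are illustrative assumptions, not values from the announcement; check your workspace's API reference before relying on them.

```python
import json

# Hypothetical sketch: a configuration body for creating a serving endpoint
# (POST /api/2.0/serving-endpoints) that serves an MLflow-registered model
# on serverless GPU compute. All names and sizes below are illustrative.
endpoint_config = {
    "name": "my-vision-endpoint",  # assumed endpoint name
    "config": {
        "served_models": [
            {
                "model_name": "my_registered_model",  # MLflow registry name
                "model_version": "1",
                "workload_type": "GPU_SMALL",   # request GPU compute
                "workload_size": "Small",
                "scale_to_zero_enabled": True,  # scale down when idle
            }
        ]
    },
}

# Serialize for the HTTP request body.
payload = json.dumps(endpoint_config)
print(payload)
```

In practice you would send this payload with an authenticated HTTP client; the key point is that GPU serving is requested declaratively, with no container or driver setup on your side.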
Databricks Model Serving has also introduced optimizations for serving large language models (LLMs) more efficiently, resulting in up to a 3-5x reduction in latency and cost. To use Optimized LLM Serving, you simply provide the model and its weights, and Databricks takes care of the rest, ensuring your model performs optimally.
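Once an optimized LLM endpoint is running, querying it is an ordinary REST call. The sketch below builds a completions-style request body of the kind such an endpoint typically accepts; the field names (`prompt`, `max_tokens`, `temperature`) follow the common completions convention and the URL in the comment is a placeholder assumption, not a documented path from this announcement.

```python
import json

# Hypothetical request body for a completions-style LLM serving endpoint.
# Field names follow the common prompt/max_tokens/temperature convention;
# verify your endpoint's actual schema before relying on them.
request_body = {
    "prompt": "Summarize the benefits of serverless GPU model serving.",
    "max_tokens": 128,   # cap on the number of generated tokens
    "temperature": 0.2,  # low temperature for more deterministic output
}

# In practice you would POST this (with a bearer token) to something like:
#   https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations
print(json.dumps(request_body))
```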
This streamlines the process, allowing you to focus on integrating the LLM into your application rather than dealing with low-level model optimization. Currently, Databricks Model Serving automatically optimizes MPT and Llama2 models, with plans to support additional models in the future.