AI Inference Optimization Engineering: Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment (Production AI Engineering Series) - Softcover

Book 6 of 20: Production AI Engineering Series

Team, ChatVariety

9798199720021: AI Inference Optimization Engineering: Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment (Production AI Engineering Series)

Softcover

ISBN 13: 9798199720021

Publisher: Independently published, 2026

View all copies of this ISBN edition

0 Used

4 New

From US$ 15.00

Slash LLM Deployment Costs and Latency

Deploying Large Language Models (LLMs) in production is a massive economic and engineering hurdle. AI Inference Optimization Engineering is your comprehensive, hands-on guide to mastering the full stack of modern LLM optimization techniques. From memory-bandwidth solutions to hardware-specific compilation, this book bridges the gap between research-level models and enterprise-grade execution.

What you will master inside this book:

Hardware-Aware Optimization: Dive deep into KV cache mechanics, autoregressive decoding, and GPU memory hierarchies to eliminate latency bottlenecks.
State-of-the-Art Quantization: Apply GPTQ, AWQ, and GGUF compression algorithms to scale down massive neural networks without sacrificing model accuracy.
Advanced Acceleration Methods: Implement speculative decoding with draft models (like Medusa and Eagle), PagedAttention, and FlashAttention to boost throughput by 2-3x.
Production-Grade Serving: Build ultra-low-latency deployment infrastructures using vLLM, Triton Inference Server, and continuous batching.
Cross-Platform Deployment: Optimize models for specific target hardware, including NVIDIA H100 (TensorRT-LLM), Apple Silicon (llama.cpp/Metal), and Qualcomm mobile/edge accelerators.

Whether you are an ML infrastructure engineer, an AI platform architect, or a technical leader looking to scale LLMs cost-effectively, this book provides the production-ready code, equations, and architectural patterns you need to build hyper-efficient AI pipelines.

"synopsis" may belong to another edition of this title.

Publisher: Independently published
Publication date: 2026
Language: English
ISBN 13: 9798199720021
Binding: Paperback
Number of pages: 95

Search results for AI Inference Optimization Engineering: Quantization,...

Stock Image

AI Inference Optimization Engineering: Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment

Team, ChatVariety

Published by Independently published, 2026

ISBN 13: 9798199720021

New Softcover

Print on Demand

Seller: California Books, Miami, FL, U.S.A.

Seller rating 4 out of 5 stars

Condition: New. Print on Demand. Seller Inventory # I-9798199720021

Contact seller

Buy New

US$ 15.00

Free Shipping
Ships within U.S.A.

Quantity: Over 20 available

Add to basket

Stock Image

AI Inference Optimization Engineering

Team, Chatvariety

Published by Independently published, 2026

ISBN 13: 9798199720021

New PAP

Seller: PBShop.store UK, Fairford, GLOS, United Kingdom

Seller rating 5 out of 5 stars

PAP. Condition: New. New Book. Shipped from UK. Established seller since 2000. Seller Inventory # L2-9798199720021

Contact seller

Buy New

US$ 15.47

US$ 4.43 shipping
Ships from United Kingdom to U.S.A.

Quantity: Over 20 available

Add to basket

Stock Image

AI Inference Optimization Engineering (Paperback)

Chatvariety Team

Published by Independently Published, 2026

ISBN 13: 9798199720021

New Paperback

Print on Demand

Seller: CitiRetail, Stevenage, United Kingdom

Seller rating 5 out of 5 stars

Paperback. Condition: new. Paperback. Slash LLM Deployment Costs and LatencyDeploying Large Language Models (LLMs) in production is a massive economic and engineering hurdle. AI Inference Optimization Engineering is your comprehensive, hands-on guide to mastering the full stack of modern LLM optimization techniques. From memory-bandwidth solutions to hardware-specific compilation, this book bridges the gap between research-level models and enterprise-grade execution.What you will master inside this book: Hardware-Aware Optimization: Dive deep into KV cache mechanics, autoregressive decoding, and GPU memory hierarchies to eliminate latency bottlenecks.State-of-the-Art Quantization: Apply GPTQ, AWQ, and GGUF compression algorithms to scale down massive neural networks without sacrificing model accuracy.Advanced Acceleration Methods: Implement speculative decoding with draft models (like Medusa and Eagle), PagedAttention, and FlashAttention to boost throughput by 2-3x.Production-Grade Serving: Build ultra-low-latency deployment infrastructures using vLLM, Triton Inference Server, and continuous batching.Cross-Platform Deployment: Optimize models for specific target hardware, including NVIDIA H100 (TensorRT-LLM), Apple Silicon (llama.cpp/Metal), and Qualcomm mobile/edge accelerators.Whether you are an ML infrastructure engineer, an AI platform architect, or a technical leader looking to scale LLMs cost-effectively, this book provides the production-ready code, equations, and architectural patterns you need to build hyper-efficient AI pipelines. This item is printed on demand. Shipping may be from our UK warehouse or from our Australian or US warehouses, depending on stock availability. Seller Inventory # 9798199720021

Contact seller

Buy New

US$ 19.42

US$ 49.87 shipping
Ships from United Kingdom to U.S.A.

Quantity: 1 available

Add to basket

Stock Image

AI Inference Optimization Engineering : Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment

Chatvariety Team

Published by Independently Published Jun 2026, 2026

ISBN 13: 9798199720021

New Taschenbuch

Seller: AHA-BUCH GmbH, Einbeck, Germany

Seller rating 5 out of 5 stars

Taschenbuch. Condition: Neu. Neuware - Slash LLM Deployment Costs and LatencyDeploying Large Language Models (LLMs) in production is a massive economic and engineering hurdle. AI Inference Optimization Engineering is your comprehensive, hands-on guide to mastering the full stack of modern LLM optimization techniques. From memory-bandwidth solutions to hardware-specific compilation, this book bridges the gap between research-level models and enterprise-grade execution.What you will master inside this book: - Hardware-Aware Optimization: Dive deep into KV cache mechanics, autoregressive decoding, and GPU memory hierarchies to eliminate latency bottlenecks.- State-of-the-Art Quantization: Apply GPTQ, AWQ, and GGUF compression algorithms to scale down massive neural networks without sacrificing model accuracy.- Advanced Acceleration Methods: Implement speculative decoding with draft models (like Medusa and Eagle), PagedAttention, and FlashAttention to boost throughput by 2-3x.- Production-Grade Serving: Build ultra-low-latency deployment infrastructures using vLLM, Triton Inference Server, and continuous batching.- Cross-Platform Deployment: Optimize models for specific target hardware, including NVIDIA H100 (TensorRT-LLM), Apple Silicon (llama.cpp/Metal), and Qualcomm mobile/edge accelerators.Whether you are an ML infrastructure engineer, an AI platform architect, or a technical leader looking to scale LLMs cost-effectively, this book provides the production-ready code, equations, and architectural patterns you need to build hyper-efficient AI pipelines. Seller Inventory # 9798199720021

Contact seller

Buy New

US$ 15.32

US$ 69.48 shipping
Ships from Germany to U.S.A.

Quantity: 2 available

Add to basket

AI Inference Optimization Engineering: Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment (Production AI Engineering Series) - Softcover

Team, ChatVariety

Synopsis

Search results for AI Inference Optimization Engineering: Quantization,...

AI Inference Optimization Engineering: Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment

Buy New

AI Inference Optimization Engineering

Buy New

AI Inference Optimization Engineering (Paperback)

Buy New

AI Inference Optimization Engineering : Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment

Buy New