arXiv:2505.03763

Splitwiser: Efficient LM inference with constrained resources

Published: 11:14 AM

Splitwiser introduces a method to split LLM inference phases on a single GPU using multiprocessing and NVIDIA MPS, achieving modest latency reductions (up to 18.2%) and throughput improvements (up to 1.42x) on Huggingface and vLLM pipelines, though constrained by overheads and scalability issues.

Large Language Model, Efficiency, Multimodality, Pre-training, Generative AI

Asad Aali, Adney Cardoza, Melissa Capo

Stanford University, University of Texas at Austin

Generated by grok-3

Background Problem

The widespread adoption of Large Language Models (LLMs) has highlighted significant challenges in inference efficiency, particularly due to the high computational and memory demands on expensive GPUs. LLM inference consists of two distinct phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase. Existing batching and scheduling techniques often fail to fully utilize compute resources during token generation compared to prompt computation. Prior work, such as Splitwise, addressed this by splitting inference phases across multiple GPUs to leverage heterogeneous hardware, but this approach is inaccessible to operators with limited GPU resources. Splitwiser aims to solve this problem by optimizing split-phase inference on a single GPU, reducing overheads and improving resource utilization for those without access to large GPU clusters.

Method

Splitwiser proposes a methodology to split the two phases of LLM inference—prompt computation and token generation—onto the same GPU using multiprocessing and NVIDIA’s Multi-Process Service (MPS). The core idea is to run two inference serving instances in parallel, each handling a subset of requests, to overlap the compute-intensive prompt phase with the memory-intensive token generation phase, thereby improving GPU utilization and reducing latency. The implementation involves:

  1. Basic Design: Two LLM inference instances are started on a single GPU using MPS, splitting a batch of requests (e.g., n*2 requests into two sets of n) to process phases concurrently.
  2. Huggingface Pipeline: Inference is divided into prompt processing (using AutoTokenizer) and token generation (using AutoModelForCausalLM), with multiprocessing used to run the phases in parallel across sub-datasets, enhanced by MPS for GPU resource sharing; a minimal sketch of this setup appears after the list.
  3. vLLM Pipeline: Two approaches are tested: running two vLLM instances with shared model memory (coarse-grained multiprocessing), and modifying the vLLM scheduler to spawn processes that execute the prompt and token phases simultaneously (fine-grained). Overheads include GPU memory duplication (in naive setups) and synchronization challenges between processes for KV-cache sharing; a sketch of the naive coarse-grained variant also follows the list.
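
As a concrete illustration of the basic design and the Huggingface pipeline (items 1 and 2), the sketch below splits a batch across two processes that share a single GPU under MPS. This is a minimal sketch rather than the authors' implementation: it assumes MPS has already been started (e.g. via the `nvidia-cuda-mps-control` daemon), and the model name, prompts, and generation length are illustrative.

```python
import torch.multiprocessing as mp
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "facebook/opt-125m"  # small model used in the paper's experiments

def worker(prompts, max_new_tokens=20):
    # Each MPS client process runs both phases on its sub-batch:
    # prompt processing (tokenization + prefill) and token generation.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).to("cuda")
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(f"worker finished {len(texts)} requests")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when child processes use CUDA
    requests = [f"Radiology report {i}: the patient presents with" for i in range(20)]
    half = len(requests) // 2
    # Split n*2 requests into two sets of n, one per process; with MPS
    # enabled, both processes can occupy the GPU's SMs concurrently.
    procs = [mp.Process(target=worker, args=(chunk,))
             for chunk in (requests[:half], requests[half:])]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Note that this coarse split overlaps the prompt and token phases only opportunistically as the two sub-batches proceed; the fine-grained vLLM variant described in item 3 schedules the two phases explicitly.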

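For the vLLM pipeline, the naive coarse-grained variant can be sketched the same way: two independent vLLM engines on one GPU, each loading its own copy of the weights, which is exactly the memory-duplication overhead the authors call out. The per-engine memory cap of 0.45 is an assumption chosen so both engines fit on the device; the fine-grained variant requires modifying vLLM's scheduler internals and is not shown.

```python
import multiprocessing as mp
from vllm import LLM, SamplingParams

def serve(prompts):
    # Each process builds its own engine, duplicating the model weights
    # (the naive setup's main overhead); 0.45 is an assumed memory cap
    # so that two engines can coexist on one GPU.
    llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.45)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=20))
    print(f"engine served {len(outputs)} requests")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # CUDA requires the spawn start method
    prompts = [f"Summarize finding {i}:" for i in range(20)]
    half = len(prompts) // 2
    procs = [mp.Process(target=serve, args=(chunk,))
             for chunk in (prompts[:half], prompts[half:])]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```
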
Experiment

The experiments evaluate Splitwiser across the Huggingface and vLLM frameworks using the OPT-125m model on NVIDIA A10 and A100 GPUs, with input sizes of 512-1024 tokens, output sizes of 20-1024 tokens, and batch sizes ranging from 10 to 160. The setup measures latency (end-to-end time), throughput (iterations per second), and resource utilization (SM and memory throughput, KV-cache usage), using datasets such as MIMIC-III radiology reports for the Huggingface tests. Profiling with NVIDIA Nsight Compute and vLLM metrics confirms that the prompt phase is compute-intensive and the token phase memory-intensive, and that batching increases memory usage without fully utilizing compute resources. Results show modest gains: latency reductions of up to 18.2% and throughput improvements of up to 1.42x over the baseline pipelines, though memory duplication and inter-process synchronization overheads limit scalability.
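
For orientation, a minimal measurement loop of the kind behind these numbers is sketched below; `run_batch` is a hypothetical stand-in for whichever pipeline (baseline or split-phase) is under test, the batch-size grid is assumed within the paper's 10-160 range, and throughput is simplified to requests per second rather than the paper's iterations-per-second metric.

```python
import time

def run_batch(batch_size: int) -> None:
    """Hypothetical stand-in: submit `batch_size` requests to the serving
    pipeline under test and block until all responses are complete."""
    raise NotImplementedError

def benchmark(batch_sizes=(10, 20, 40, 80, 160)):
    # Sweep batch sizes and report end-to-end latency plus a simple
    # requests-per-second throughput figure for each setting.
    for bs in batch_sizes:
        start = time.perf_counter()
        run_batch(bs)
        elapsed = time.perf_counter() - start
        print(f"batch={bs:4d}  latency={elapsed:7.2f}s  "
              f"throughput={bs / elapsed:6.2f} req/s")
```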

Further Thoughts

The Splitwiser approach, while innovative in targeting single-GPU environments, prompts deeper reflection on its broader implications and potential connections to other areas. One critical aspect is the trade-off between latency and throughput observed in the experiments, which mirrors challenges in distributed systems where resource contention often negates parallelization gains. This suggests a potential alignment with research on GPU scheduling and resource allocation in cloud computing, where dynamic workload balancing could mitigate some of Splitwiser's overheads. Additionally, the reliance on MPS raises questions about portability across different hardware or software stacks: could similar benefits be achieved with alternative multiprocessing frameworks or custom CUDA implementations?

Another avenue for exploration is the intersection with memory-efficient inference techniques like ZeRO-Inference, referenced in the paper. Combining Splitwiser's phase-splitting with ZeRO's heterogeneous memory usage might offer a more robust solution for resource-constrained environments, potentially democratizing LLM inference further.

Finally, the paper's focus on small models like OPT-125m overlooks the real-world demand for larger models (e.g., LLaMA-70B), where memory contention and phase imbalance could exacerbate overheads; future work should stress-test Splitwiser under such conditions to validate its practical utility. These thoughts highlight the need for a holistic approach integrating hardware, software, and workload considerations to truly advance efficient LLM inference.
