arXiv:2505.03763

Splitwiser: Efficient LM inference with constrained resources

Published: 11:14 AM

Splitwiser introduces a method to split LLM inference phases on a single GPU using multiprocessing and NVIDIA MPS, achieving modest latency reductions (up to 18.2%) and throughput improvements (up to 1.42x) on Huggingface and vLLM pipelines, though constrained by overheads and scalability issues.

Large Language Model, Efficiency, Multimodality, Pre-training, Generative AI

Asad Aali, Adney Cardoza, Melissa Capo

Stanford University, University of Texas at Austin

Generated by grok-3

Background Problem

The widespread adoption of Large Language Models (LLMs) has highlighted significant challenges in inference efficiency, particularly due to the high computational and memory demands on expensive GPUs. LLM inference consists of two distinct phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase. Existing batching and scheduling techniques often fail to fully utilize compute resources during token generation compared to prompt computation. Prior work, such as Splitwise, addressed this by splitting inference phases across multiple GPUs to leverage heterogeneous hardware, but this approach is inaccessible to operators with limited GPU resources. Splitwiser aims to solve this problem by optimizing split-phase inference on a single GPU, reducing overheads and improving resource utilization for those without access to large GPU clusters.

Method

Splitwiser proposes a methodology to split the two phases of LLM inference—prompt computation and token generation—onto the same GPU using multiprocessing and NVIDIA’s Multi-Process Service (MPS). The core idea is to run two inference serving instances in parallel, each handling a subset of requests, to overlap the compute-intensive prompt phase with the memory-intensive token generation phase, thereby improving GPU utilization and reducing latency. The implementation involves:

  1. Basic Design: Two LLM inference instances are started on a single GPU using MPS, splitting a batch of requests (e.g., n*2 requests into two sets of n) to process phases concurrently.
  2. Huggingface Pipeline: Inference is divided into prompt processing (using AutoTokenizer) and token generation (using AutoModelForCausalLM), with multiprocessing used to run the phases in parallel across sub-datasets, enhanced by MPS for GPU resource sharing; a minimal sketch of this setup appears after the list.
  3. vLLM Pipeline: Two approaches are tested: running two vLLM instances with shared model memory (coarse-grained multiprocessing), and modifying the vLLM scheduler to spawn processes that execute the prompt and token phases simultaneously (fine-grained). Overheads include GPU memory duplication (in naive setups) and synchronization challenges between processes for KV-cache sharing; a sketch of the naive coarse-grained variant also follows the list.
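
As a concrete illustration of the basic design and the Huggingface pipeline (items 1 and 2), the sketch below splits a batch across two processes that share a single GPU under MPS. This is a minimal sketch rather than the authors' implementation: it assumes MPS has already been started (e.g. via the `nvidia-cuda-mps-control` daemon), and the model name, prompts, and generation length are illustrative.

```python
import torch.multiprocessing as mp
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "facebook/opt-125m"  # small model used in the paper's experiments

def worker(prompts, max_new_tokens=20):
    # Each MPS client process runs both phases on its sub-batch:
    # prompt processing (tokenization + prefill) and token generation.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).to("cuda")
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(f"worker finished {len(texts)} requests")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when child processes use CUDA
    requests = [f"Radiology report {i}: the patient presents with" for i in range(20)]
    half = len(requests) // 2
    # Split n*2 requests into two sets of n, one per process; with MPS
    # enabled, both processes can occupy the GPU's SMs concurrently.
    procs = [mp.Process(target=worker, args=(chunk,))
             for chunk in (requests[:half], requests[half:])]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Note that this coarse split overlaps the prompt and token phases only opportunistically as the two sub-batches proceed; the fine-grained vLLM variant described in item 3 schedules the two phases explicitly.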

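For the vLLM pipeline, the naive coarse-grained variant can be sketched the same way: two independent vLLM engines on one GPU, each loading its own copy of the weights, which is exactly the memory-duplication overhead the authors call out. The per-engine memory cap of 0.45 is an assumption chosen so both engines fit on the device; the fine-grained variant requires modifying vLLM's scheduler internals and is not shown.

```python
import multiprocessing as mp
from vllm import LLM, SamplingParams

def serve(prompts):
    # Each process builds its own engine, duplicating the model weights
    # (the naive setup's main overhead); 0.45 is an assumed memory cap
    # so that two engines can coexist on one GPU.
    llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.45)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=20))
    print(f"engine served {len(outputs)} requests")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # CUDA requires the spawn start method
    prompts = [f"Summarize finding {i}:" for i in range(20)]
    half = len(prompts) // 2
    procs = [mp.Process(target=serve, args=(chunk,))
             for chunk in (prompts[:half], prompts[half:])]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```
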
Experiment

The experiments evaluate Splitwiser across the Huggingface and vLLM frameworks using the OPT-125m model on NVIDIA A10 and A100 GPUs, with input sizes of 512-1024 tokens, output sizes of 20-1024 tokens, and batch sizes ranging from 10 to 160. The setup measures latency (end-to-end time), throughput (iterations per second), and resource utilization (SM and memory throughput, KV-cache usage), using datasets such as MIMIC-III radiology reports for the Huggingface tests. Profiling with NVIDIA Nsight Compute and vLLM metrics confirms that the prompt phase is compute-intensive and the token phase memory-intensive, and that batching increases memory usage without fully utilizing compute resources. Results show modest gains: latency reductions of up to 18.2% and throughput improvements of up to 1.42x over the baseline pipelines, though memory duplication and inter-process synchronization overheads limit scalability.
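
For orientation, a minimal measurement loop of the kind behind these numbers is sketched below; `run_batch` is a hypothetical stand-in for whichever pipeline (baseline or split-phase) is under test, the batch-size grid is assumed within the paper's 10-160 range, and throughput is simplified to requests per second rather than the paper's iterations-per-second metric.

```python
import time

def run_batch(batch_size: int) -> None:
    """Hypothetical stand-in: submit `batch_size` requests to the serving
    pipeline under test and block until all responses are complete."""
    raise NotImplementedError

def benchmark(batch_sizes=(10, 20, 40, 80, 160)):
    # Sweep batch sizes and report end-to-end latency plus a simple
    # requests-per-second throughput figure for each setting.
    for bs in batch_sizes:
        start = time.perf_counter()
        run_batch(bs)
        elapsed = time.perf_counter() - start
        print(f"batch={bs:4d}  latency={elapsed:7.2f}s  "
              f"throughput={bs / elapsed:6.2f} req/s")
```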

Further Thoughts

The Splitwiser approach, while innovative in targeting single-GPU environments, prompts deeper reflection on its broader implications and potential connections to other areas. One critical aspect is the trade-off between latency and throughput observed in the experiments, which mirrors challenges in distributed systems where resource contention often negates parallelization gains. This suggests a potential alignment with research on GPU scheduling and resource allocation in cloud computing, where dynamic workload balancing could mitigate some of Splitwiser's overheads. Additionally, the reliance on MPS raises questions about portability across different hardware or software stacks: could similar benefits be achieved with alternative multiprocessing frameworks or custom CUDA implementations?

Another avenue for exploration is the intersection with memory-efficient inference techniques like ZeRO-Inference, referenced in the paper. Combining Splitwiser's phase-splitting with ZeRO's heterogeneous memory usage might offer a more robust solution for resource-constrained environments, potentially democratizing LLM inference further.

Finally, the paper's focus on small models like OPT-125m overlooks the real-world demand for larger models (e.g., LLaMA-70B), where memory contention and phase imbalance could exacerbate overheads; future work should stress-test Splitwiser under such conditions to validate its practical utility. These thoughts highlight the need for a holistic approach integrating hardware, software, and workload considerations to truly advance efficient LLM inference.
