Research

We focus on persistent agent systems, spanning long-term memory, long-context understanding, semantic routing, multi-agent collaboration, inference infrastructure, and evaluation.

The publications below cover both foundational directions and systems work shaped by real-world applications.

Publications

Work across persistent agents, long-context understanding, semantic routing, model-serving systems, and evaluation.

2026

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Authors: vLLM Semantic Router Team

Venue: arXiv Technical Report

Signal-driven decision routing for Mixture-of-Modality deployments across cost, privacy, latency, and safety constraints.

Paper

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

Venue: arXiv Technical Report

A synthesis of routing, fleet, multimodal, and governance results into the Workload-Router-Pool architecture.

Paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

A security treatment of perception failures in computer-using agents with a dual-channel guardrail for click targets and action reasoning.

Paper

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Authors: Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

OATS (Outcome-Aware Tool Selection) improves semantic-router tool ranking under single-digit millisecond CPU budgets without serving-time model inference.

Paper

Adaptive Vision-Language Model Routing for Computer Use Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

Adaptive VLM Routing estimates action difficulty and routes each computer-use step to the cheapest model that meets a reliability target.

Paper

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

Flash Attention, prompt compression, and near-streaming body processing reduce routing latency from seconds to tens of milliseconds.

Paper

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

A queueing-theory-grounded fleet planner and simulator for sizing multi-pool GPU fleets against P99 TTFT targets.

Paper

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

An analytical method for deriving minimum-cost two-pool fleets directly from workload CDFs and P99 TTFT targets.

Paper

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

An analytical result showing context-length routing topology can matter more than pure GPU generation upgrades for tokens-per-watt.

Paper

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Authors: Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

Venue: arXiv Technical Report

A framework for conflict detection and prevention when probabilistic ML predicates can silently co-fire in routing policy languages.

Paper

From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification

Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu

Venue: arXiv Technical Report

A cross-layer extension of the Semantic Router DSL from stateless request routing into multi-step agent workflows.

Paper

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

Conversational memory and retrieval-grounded routing recover most of a 235B model’s performance while cutting effective inference cost by 96%.

Paper

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

Venue: SIGIR 2026 Industry Track

A real-time verification component for long-document RAG that preserves grounding checks without falling back to truncated validation.

Paper

2025

When to Reason: Semantic Router for vLLM

Authors: Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

Venue: NeurIPS 2025 - MLForSys

A semantic router that classifies queries by reasoning need and selectively applies reasoning only when beneficial.

Paper

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Authors: Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

A category-aware semantic caching architecture where similarity thresholds, TTLs, and quotas vary by workload class.

Paper