AI

Hugging Face Adds One-Command vLLM Servers to HF Jobs

Hugging Face Adds One-Command vLLM Servers to HF Jobs

Logo: Victor (Hugging Face Staff) — Public domain, via Wikimedia Commons

Hugging Face has added native one-command deployment for vLLM inference servers to its HF Jobs infrastructure, letting users spin up private, OpenAI-compatible LLM endpoints without provisioning servers or managing Kubernetes. The feature, announced in a June 26, 2026 Hugging Face blog post, bills per-second based on the GPU hardware used for the running job Hugging Face vLLM Jobs Guide.

The workflow requires only three components: the Hugging Face Hub CLI v1.20.0 or later, an authenticated HF account with a valid payment method or prepaid credit balance, and a single hf jobs run command pointing to the official vllm/vllm-openai Docker image and target model Hugging Face vLLM Jobs Guide.

For example, deploying the 4B-parameter Qwen/Qwen3-4B model on an a10g-large GPU flavor takes one line of code, with the service reachable via a public HF jobs proxy URL within minutes of model weights downloading to the hosted GPU. The command automatically handles container orchestration, network routing, and port exposure, removing the need for users to manage underlying infrastructure.

How the vLLM Server Deployment Works

Port Exposure and Access Control

The --expose 8000 flag routes vLLM’s default port through HF’s public jobs proxy, generating a unique, namespace-scoped URL for the running job Hugging Face vLLM Jobs Guide. All requests to the endpoint require a bearer token tied to the user’s HF account with read access to the job’s namespace, so the service is private by default rather than publicly accessible.

OpenAI API Compatibility

The vLLM server implements the full OpenAI API specification, so existing OpenAI client libraries work without modification. Users can point the official OpenAI Python SDK at the proxy URL and pass their HF token as the api_key to send chat completion requests identical to those sent to OpenAI’s own endpoints. A simple curl request to the /v1/chat/completions endpoint returns standard OpenAI-formatted JSON with the model’s response, and a quick health check via the /v1/models endpoint confirms the service is live before sending generation requests.

Scaling and Cost for Larger Models

Scaling to Large Models

The same single-command workflow scales to much larger models by matching the --flavor flag to the required GPU count and adding a --tensor-parallel-size parameter equal to the number of GPUs in the selected flavor Hugging Face vLLM Jobs Guide. For example, the 122B-parameter Qwen3.5 mixture-of-experts model can be deployed on 2x H200 GPUs with two additional flags: --max-model-len 32768 to cap context length, and --max-num-seqs 256 to limit concurrent sequences to fit within GPU memory constraints. This configuration is required because the model’s default 256K-token context exceeds default vLLM batch memory limits.

Cost Structure

HF Jobs bills per-second of runtime, with the a10g-large GPU flavor priced at $1.50 per hour Hugging Face vLLM Jobs Guide. Users can set a --timeout safety net to auto-stop jobs after a set period, while explicit cancellation via the hf jobs cancel <job_id> command stops charges entirely for idle runtime.

Hugging Face Adds One-Command Vllm Servers: Target Use Cases

Supported Use Cases

The feature is built for temporary, ad-hoc inference workloads, including testing model behavior, running evaluation suites, and generating batch outputs Hugging Face vLLM Jobs Guide. Users do not need to reserve dedicated infrastructure for full days to run these tasks. The one-command workflow cuts setup time for these use cases from hours of manual infrastructure configuration to minutes, with costs limited to only the GPU time actually used.

Bottom line: Hugging Face’s one-command vLLM deployment on HF Jobs removes manual infrastructure management for temporary LLM inference workloads, letting users spin up private, OpenAI-compatible endpoints in minutes with per-second billing tied to GPU usage, ideal for ad-hoc model testing, evaluation runs, and batch output generation.

We may earn commission from affiliate links at no extra cost to you. Last updated: Jun 29, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.