2025/12/28

How to Set Up Z-Image Turbo in ComfyUI: Complete Workflow Guide

Step-by-step instructions for installing and configuring Z-Image Turbo in ComfyUI, including model downloads, directory structure, node setup, and optimization tips.

This guide covers how to set up Z-Image Turbo in ComfyUI for local image generation. All instructions are based on the official ComfyUI documentation and Hugging Face model repositories.

No GPU? Use Z-Image Turbo online — generate images directly in your browser without local installation.

Prerequisites

Before starting, ensure you have:

ComfyUI installed (nightly version recommended)
GPU with at least 16GB VRAM for the bf16 model (RTX 4090, RTX 3090, or similar); cards with less VRAM can use the FP8 model described under Memory Optimization
Python 3.10 or later
Sufficient disk space (~15GB for model files)
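
The checklist above can be verified with a short script. This is a minimal sketch, not an official tool: the function names are illustrative, the thresholds come from the list above, and the VRAM check assumes PyTorch is installed in the same environment as ComfyUI (it returns None otherwise).

```python
import shutil
import sys


def python_ok(version=sys.version_info, minimum=(3, 10)):
    """Return True if the interpreter meets the minimum version."""
    return tuple(version[:2]) >= minimum


def disk_ok(path=".", needed_gb=15):
    """Return True if the drive holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


def vram_gb():
    """Return total VRAM of GPU 0 in GB, or None if it cannot be queried."""
    try:
        import torch  # optional: normally present in a ComfyUI environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("15 GB free disk:", disk_ok())
    print("VRAM (GB):", vram_gb())
```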

Required Model Files

Z-Image Turbo requires three model files to function in ComfyUI:

1. Text Encoder

Download qwen_3_4b.safetensors from the Comfy-Org repository.

This is the Qwen3 4B language model (Qwen 3 family, 4 billion parameters), which processes your text prompts. It enables the strong prompt understanding that Z-Image Turbo is known for.

2. Diffusion Model

Download z_image_turbo_bf16.safetensors from the Comfy-Org Z-Image-Turbo repository.

This is the main 6B parameter diffusion model that generates images.

3. VAE (Variational Autoencoder)

Download ae.safetensors, the FLUX VAE, which Z-Image Turbo also uses.

The VAE handles encoding and decoding between latent space and pixel space.

Directory Structure

Place downloaded files in your ComfyUI installation:

ComfyUI/
└── models/
    ├── text_encoders/
    │   └── qwen_3_4b.safetensors
    ├── diffusion_models/
    │   └── z_image_turbo_bf16.safetensors
    └── vae/
        └── ae.safetensors

Create any missing directories before copying files.
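
The layout above can be created and checked from Python. This is a sketch under the assumption that the three filenames are exactly those shown in the tree; the function names are illustrative, and the root path you pass in is whatever folder holds your ComfyUI installation.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/vae/ae.safetensors",
]


def prepare_model_dirs(comfy_root):
    """Create any missing model directories under the ComfyUI root."""
    for rel in EXPECTED_FILES:
        (Path(comfy_root) / rel).parent.mkdir(parents=True, exist_ok=True)


def missing_models(comfy_root):
    """Return the expected model files that are not present yet."""
    return [rel for rel in EXPECTED_FILES if not (Path(comfy_root) / rel).is_file()]
```

Calling prepare_model_dirs("ComfyUI") followed by missing_models("ComfyUI") lists which downloads still need to be copied into place.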

Loading the Workflow

ComfyUI provides official workflow templates for Z-Image Turbo:

Open ComfyUI in your browser
Navigate to Workflow Templates in the menu
Search for "Z-Image" or "Z-Image-Turbo"
Load the text-to-image workflow

Alternatively, download workflow JSON files from the ComfyUI examples repository.
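
Workflow files are plain JSON, so a downloaded file can be inspected before loading it. The sketch below assumes one of the two layouts ComfyUI commonly exports: the editor format (a "nodes" list whose entries have a "type" field) or the API format (a mapping from node id to an object with "class_type" and "inputs"). The helper names are illustrative.

```python
import json
from collections import Counter


def load_workflow(path):
    """Read a workflow JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def node_types(workflow):
    """Count node types in a ComfyUI workflow dict.

    Handles the editor export (a "nodes" list with a "type" field) and the
    API export (node id -> {"class_type": ..., "inputs": ...}).
    """
    if isinstance(workflow.get("nodes"), list):
        return Counter(node["type"] for node in workflow["nodes"])
    return Counter(
        node["class_type"]
        for node in workflow.values()
        if isinstance(node, dict) and "class_type" in node
    )
```

Running node_types(load_workflow(path)) on a downloaded file shows at a glance whether it contains the loader and sampler nodes you expect.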

Node Configuration

The basic Z-Image Turbo workflow uses these nodes:

Load Text Encoder Node

Select qwen_3_4b.safetensors
This loads the language model for prompt processing

Load Diffusion Model Node

Select z_image_turbo_bf16.safetensors
Model type: Diffusion Transformer

Load VAE Node

Select ae.safetensors
Used for final image decoding

Sampler Settings

Z-Image Turbo uses specific sampler parameters:

Parameter	Value
Steps	9
CFG Scale	0.0
Sampler	euler
Scheduler	simple

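Z-Image Turbo is a distilled model that does not require classifier-free guidance, so guidance is switched off. The 0.0 value matches the guidance scale used in the Hugging Face pipeline. ComfyUI's KSampler computes guidance differently: there, a CFG of 1.0 is the value that disables classifier-free guidance, so if images generated at 0.0 ignore your prompt, set CFG to 1.0.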
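
When driving ComfyUI from a script, the sampler parameters can be written into an API-format workflow before it is queued. This is a sketch, assuming the workflow uses the standard KSampler node and that the server listens on ComfyUI's default local address; the helper names are illustrative and the values are copied from the table as written.

```python
import copy
import json
import urllib.request

# Values from the table above; pass a different dict to override them.
TURBO_SAMPLER = {"steps": 9, "cfg": 0.0, "sampler_name": "euler", "scheduler": "simple"}


def apply_sampler_settings(workflow, settings=TURBO_SAMPLER):
    """Return a copy of an API-format workflow with every KSampler updated."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") == "KSampler":
            node["inputs"].update(settings)
    return patched


def queue_prompt(workflow, server="http://127.0.0.1:8188"):
    """Submit the workflow to a running ComfyUI server via POST /prompt."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The patch function works on a copy, so the workflow loaded from disk stays unchanged and can be reused with other settings.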

Resolution Settings

Z-Image Turbo supports various resolutions. Recommended options:

1024 × 1024 — Standard square format
1280 × 720 — 16:9 landscape
720 × 1280 — 9:16 portrait
1024 × 576 — Cinematic widescreen

Higher resolutions require more VRAM. If you encounter out-of-memory errors, reduce dimensions or enable memory optimizations.
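
To judge how much lighter a smaller resolution is, compare pixel counts. The sketch below treats pixel count as a rough proxy for memory use, which is a simplification. The snap helper assumes dimensions should be multiples of 16, a common latent-grid granularity that the source does not state; all four recommended sizes already satisfy it.

```python
# Resolutions listed above, as (width, height).
RESOLUTIONS = [(1024, 1024), (1280, 720), (720, 1280), (1024, 576)]


def relative_cost(width, height, base=(1024, 1024)):
    """Pixel count relative to the base resolution (a rough proxy for memory)."""
    return (width * height) / (base[0] * base[1])


def snap(value, multiple=16):
    """Round a dimension down to the nearest multiple (assumed granularity)."""
    return max(multiple, (value // multiple) * multiple)


for w, h in RESOLUTIONS:
    print(f"{w}x{h}: {relative_cost(w, h):.2f}x the pixels of 1024x1024")
```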

Memory Optimization

For GPUs with limited VRAM, apply these optimizations:

Enable Attention Slicing

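Attention slicing computes attention in smaller chunks, which reduces peak memory usage at the cost of slightly slower generation. In ComfyUI this is typically selected at launch, for example with the --use-split-cross-attention flag, rather than from the settings panel.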

Use FP8 Quantized Model

If 16GB VRAM is insufficient, use the FP8 quantized version:

Download z_image_turbo_fp8.safetensors instead of the bf16 version
Reduces memory usage to ~10GB
Minor quality reduction
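
The saving follows from bytes per parameter. The estimate below counts only the diffusion model's weights, using the 6B parameter figure given earlier. The text encoder, VAE, and activations add to this, which is presumably why the total quoted above is higher than the FP8 weight size alone.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(params, dtype):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


params = 6e9  # the 6B diffusion model mentioned above
print("bf16 weights:", weight_gb(params, "bf16"), "GB")
print("fp8 weights:", weight_gb(params, "fp8"), "GB")
```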

CPU Offloading

ComfyUI moves inactive model components to CPU RAM when VRAM runs short. Starting ComfyUI with the --lowvram flag makes this offloading more aggressive, at the cost of slower generation.
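
Memory-related options are usually passed when ComfyUI is started. The helper below is an illustrative sketch that builds such a command line; it assumes ComfyUI is launched through main.py and uses the --lowvram, --use-split-cross-attention, and --cpu flags from ComfyUI's command-line options. Run python main.py --help to confirm what your version supports.

```python
import sys


def launch_command(low_vram=False, split_attention=False, cpu_only=False):
    """Build the argument list for starting ComfyUI with memory-saving flags."""
    command = [sys.executable, "main.py"]
    if low_vram:
        command.append("--lowvram")  # offload model parts to system RAM more aggressively
    if split_attention:
        command.append("--use-split-cross-attention")  # compute attention in chunks
    if cpu_only:
        command.append("--cpu")  # run everything on the CPU (very slow)
    return command


print(" ".join(launch_command(low_vram=True)))
```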

ControlNet Integration (Optional)

Z-Image Turbo supports ControlNet for guided generation:

Download ControlNet Model

Get Z-Image-Turbo-Fun-Controlnet-Union.safetensors and place it in:

ComfyUI/models/controlnet/

Workflow with ControlNet

Add a "Load ControlNet Model" node
Select the Z-Image ControlNet Union model
Connect your reference image through a preprocessor (Canny, Depth, Pose)
Connect ControlNet output to the sampler

Common ControlNet modes:

Canny — Edge detection for structural guidance
Depth — Depth maps for spatial composition
DWPose — Human pose estimation

Performance Expectations

Based on user reports and official documentation:

GPU	Resolution	Generation Time
RTX 4090	1024×1024	~5 seconds
RTX 3090	1024×1024	~13 seconds
RTX 4070 Ti	1024×1024	~20 seconds
RTX 3080 (10GB)	1024×1024	~30 seconds (with FP8)

Times are approximate and vary based on system configuration.

Troubleshooting

"Model not found" Error

Verify file paths match exactly. Check that:

Files are in correct directories
Filenames match what ComfyUI expects
No extra extensions or typos
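
Near-miss filenames, such as a doubled extension added by a browser, are easy to overlook. The sketch below lists files in a model folder whose names are close to the expected one without matching it exactly; the function name and the similarity cutoff are illustrative choices.

```python
import difflib
from pathlib import Path


def near_misses(directory, expected_name):
    """List files whose names are close to, but not exactly, the expected name."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    if expected_name in names:
        return []  # exact match present, nothing to report
    return difflib.get_close_matches(expected_name, names, n=3, cutoff=0.6)
```

For example, near_misses("ComfyUI/models/vae", "ae.safetensors") would report a file saved as ae.safetensors.txt.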

Out of Memory Error

Reduce resolution
Use FP8 quantized model
Enable attention slicing
Close other GPU applications

Slow First Generation

The first generation after loading models is slower because the model weights must be read from disk and moved to the GPU, and CUDA kernels are initialized on first use. Subsequent generations run at normal speed.
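
To confirm that only the first run is slow, time it separately from the runs that follow. This is a generic sketch: generate stands for any callable that produces one image (for example a function that queues a prompt and waits for the result) and is not part of ComfyUI.

```python
import time


def time_runs(generate, runs=4):
    """Call `generate` several times; return (first_run_s, mean_of_later_runs_s)."""
    if runs < 2:
        raise ValueError("need at least two runs to compare warm-up and steady state")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    later = durations[1:]
    return durations[0], sum(later) / len(later)
```

If the mean of the later runs is also slow, the cause is more likely memory pressure or offloading than warm-up.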

Black or Corrupted Output

Ensure VAE is loaded correctly
Check that sampler settings match (steps: 9, CFG: 0.0)
Verify bfloat16 is supported by your GPU
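
The bfloat16 check can be done from the Python environment that runs ComfyUI. This sketch assumes PyTorch with CUDA and uses torch.cuda.is_bf16_supported(); it returns None when PyTorch or a CUDA device is unavailable, so the result is not mistaken for a definite answer.

```python
def bf16_supported():
    """Return True/False for GPU bfloat16 support, or None if it cannot be checked."""
    try:
        import torch  # optional: normally present in a ComfyUI environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return bool(torch.cuda.is_bf16_supported())


print("bfloat16 supported:", bf16_supported())
```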

Next Steps

Once your basic workflow runs correctly:

Experiment with different prompts to test text rendering
Try ControlNet for consistent character poses
Train custom LoRAs for specific styles

