Unlock the Power of Text-to-Video Generation with Aliyun's Wan2.1 Model
Generate dynamic videos with text prompts using Aliyun's Wan2.1 model! Learn how to utilize this Text-to-Video workflow with Chinese support, customizable frame rates, and resolutions. Discover the core models, key nodes, and workflow structure.
Required Models
- Wan2.1
Setup Notes
- Install the required models before opening the workflow template.
- Recommended hardware: Low VRAM (≤8GB).
1. Workflow Overview

This workflow utilizes Aliyun's Wan2.1 model for Text-to-Video (T2V) generation. It integrates text encoding, video diffusion, and VAE decoding to produce dynamic video content. Key features:
- Supports Chinese prompts (e.g., "滑雪的男人", meaning "a man skiing")
- Configurable frame rate (default: 16fps) and resolution (480x768)
- Includes negative prompts for quality filtering
2. Core Models
| Model Name | Function | Installation |
|---|---|---|
| Wan2.1-T2V-1.3B | Video diffusion backbone | Manual download ( |
| umt5-xxl-enc | Chinese text encoder | Place in |
| Wan2.1_VAE | Latent space decoder | Manual download |
3. Key Nodes
LoadWanVideoT5TextEncoder
Loads the Chinese text encoder (umt5-xxl-enc). Use bf16 precision to save VRAM.
WanVideoTextEncode
Processes positive/negative prompts. Example negative prompts filter low-quality content.
WanVideoModelLoader
Loads the main video model with options for fp32/fp16 and VRAM optimization.
WanVideoSampler
Core sampler parameters:
- steps: 10 (lower for faster video generation)
- cfg_scale: 6 (lower for creative freedom)
- sampler: dpm++
VHS_VideoCombine
Combines frames into an MP4 video with configurable:
- Frame rate (16fps)
- Output format (H.264, CRF=19)
- Filename prefix (WanVideo2_1_T2V)
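To make the VHS_VideoCombine settings more concrete, the sketch below builds an ffmpeg command line with the same frame rate, codec, and CRF value. It is only a rough illustration: it assumes the decoded frames were saved as numbered PNG files (the `frame_%05d.png` pattern and the output filename are hypothetical), and it does not claim to reproduce what VideoHelperSuite does internally. The 81-frame count in the usage lines is also an assumed value, not one taken from the workflow.

```python
def build_ffmpeg_command(frame_pattern, output_path, fps=16, crf=19):
    """Return an ffmpeg argument list that encodes numbered frames to H.264 MP4."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input frame rate (workflow default: 16fps)
        "-i", frame_pattern,      # hypothetical numbered-frame pattern
        "-c:v", "libx264",        # H.264 encoder
        "-crf", str(crf),         # constant rate factor; lower means higher quality
        "-pix_fmt", "yuv420p",    # widely compatible pixel format
        output_path,
    ]


def clip_duration_seconds(num_frames, fps=16):
    """Playback length of a clip with num_frames frames at the given frame rate."""
    return num_frames / fps


if __name__ == "__main__":
    # Filenames and the frame count below are assumptions for illustration.
    cmd = build_ffmpeg_command("frame_%05d.png", "WanVideo2_1_T2V_00001.mp4")
    print(" ".join(cmd))
    print(clip_duration_seconds(81))
```

Because the frame rate fixes how long each frame is shown, raising it without generating more frames shortens the clip rather than smoothing it.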
4. Workflow Structure
Group 1: Text Processing
- Input: Chinese prompt
- Output: Text embeddings
- Key nodes: LoadWanVideoT5TextEncoder → WanVideoTextEncode
Group 2: Video Generation
- Input: Text embeds + empty image embeds (480x768)
- Output: Latent video data
- Key nodes: WanVideoSampler
Group 3: Video Export
- Input: Decoded image sequence
- Output: MP4 file
- Key nodes: WanVideoDecode → VHS_VideoCombine
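The three groups form a small dependency graph. As a minimal sketch, the node names from this page can be written as a graph and sorted into one valid execution order with Python's standard `graphlib` module. The edges are inferred from the group descriptions here, not read from the workflow file, and the VAE loader node is left out because this page does not name it.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose output it consumes.
# Edges are an assumption inferred from the group descriptions on this page.
DEPENDENCIES = {
    "LoadWanVideoT5TextEncoder": set(),
    "WanVideoModelLoader": set(),
    "WanVideoEmptyEmbeds": set(),
    "WanVideoTextEncode": {"LoadWanVideoT5TextEncoder"},
    "WanVideoSampler": {
        "WanVideoTextEncode",
        "WanVideoModelLoader",
        "WanVideoEmptyEmbeds",
    },
    "WanVideoDecode": {"WanVideoSampler"},
    "VHS_VideoCombine": {"WanVideoDecode"},
}


def execution_order(dependencies):
    """Return one order in which every node runs after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, node in enumerate(execution_order(DEPENDENCIES), start=1):
        print(step, node)
```

The loader nodes have no inputs, so their relative order can vary between runs; only the constraint that each node follows its inputs is guaranteed.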
5. I/O Specifications
Input Parameters:
- Resolution: 480x768 (set in WanVideoEmptyEmbeds)
- Seed: Fixed/Random (example: 1057359483639287)
- Prompts: Natural Chinese language (avoid complex syntax)
Output:
- MP4 video (saved to ComfyUI output folder)
- Includes generation metadata
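The difference between a fixed and a random seed can be illustrated with a small container for the inputs listed above. This is a hypothetical helper, not part of ComfyUI or the wrapper nodes: the class name, field names, and the 64-bit seed range are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_SEED = 2**64 - 1  # assumed upper bound for illustration


@dataclass(frozen=True)
class T2VInputs:
    """Hypothetical bundle of the workflow's user-facing inputs."""
    prompt: str
    negative_prompt: str = ""
    resolution: Tuple[int, int] = (480, 768)  # as listed on this page
    seed: Optional[int] = None                # None means "pick a random seed"

    def resolved_seed(self) -> int:
        """Return the fixed seed if one was given, otherwise draw a random one."""
        if self.seed is not None:
            return self.seed
        return random.randint(0, MAX_SEED)


if __name__ == "__main__":
    fixed = T2VInputs(prompt="滑雪的男人", seed=1057359483639287)
    print(fixed.resolved_seed())   # same value on every call
    varied = T2VInputs(prompt="滑雪的男人")
    print(varied.resolved_seed())  # differs between calls
```

Keeping the seed fixed while changing only the prompt or the sampler settings makes it easier to see what each change does to the result.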
6. Notes
⚠️ VRAM Requirements
- Minimum 12GB (16GB recommended)
- Enable offload_device for optimization
⚠️ Model Installation
- Download Wan2.1 models manually from official sources
- Text encoder path: models/wan_t5/umt5-xxl-enc-bf16.safetensors
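Since the models must be in place before the workflow is opened, a quick check for missing files can save a failed run. The sketch below is a hypothetical helper that only knows the text encoder path given above; locations for the diffusion model and VAE are not stated on this page, so they would need to be added by hand. The ComfyUI root directory in the usage lines is an assumption.

```python
from pathlib import Path

# Paths are relative to the ComfyUI root. Only the text encoder path comes
# from this page; add the diffusion model and VAE paths for your own install.
REQUIRED_FILES = [
    "models/wan_t5/umt5-xxl-enc-bf16.safetensors",
]


def missing_model_files(comfyui_root, required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under comfyui_root."""
    root = Path(comfyui_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed install directory for illustration.
    missing = missing_model_files("ComfyUI")
    if missing:
        print("Missing model files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed model files are present.")
```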
⚠️ Dependencies
- Requires ComfyUI-WanVideoWrapper & VideoHelperSuite
- Install via ComfyUI Manager