From Pose to Playback: Mastering Video Generation with Tongyi Wanxiang's Fun-ControlNet
Tongyi Wanxiang-WAN2.1-Fun ControlNet Video Generation: Create dynamic videos with pose/depth control & style control. Learn how this workflow generates videos, controls content, and upscales resolution.
- Use Case
- Video
- Best For
- Video
- Models
- Wan2.1, ControlNet
- Key Nodes
- ControlNet, Upscaler
- VRAM
- Low VRAM (≤8GB)
- Reading Time
- 4 min
Workflow Overview
Content type: Workflow
Required Models
- Wan2.1
- ControlNet
Required Nodes
- ControlNet
- Upscaler
Setup Notes
- Install the required models before opening the workflow template.
- Recommended hardware: Low VRAM (≤8GB).
1. Workflow Overview

This workflow, titled "Tongyi Wanxiang-WAN2.1-Fun ControlNet Video Generation [Pose/Depth Control]", is designed for:
Video Generation: Creates dynamic videos from input control signals (e.g., pose/depth maps).
Style Control: Uses Fun-ControlNet for precise content control (e.g., character motion).
Post-Processing: Includes video upscaling, frame interpolation, and final rendering.
2. Core Models
WAN2.1-Fun-ControlNet: Main video generation model with multi-modal control.
Meta-Llama-3.1-8B: Language model used by Joy_caption_two to generate captions for reference images.
FILM VFI: Frame interpolation model for smoother motion.
4x_foolhardy_Remacri: Upscales video resolution.
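A model such as 4x_foolhardy_Remacri always multiplies each side by a fixed factor, so the output size and the memory needed to hold the upscaled frames follow directly from the input size. The sketch below assumes a 4x model, an optional downscale to a smaller net factor afterwards (a common pattern, not necessarily what this workflow does), and float32 RGB frames, which is how ComfyUI image tensors are typically stored. `upscale_plan` and `frames_memory_bytes` are hypothetical helpers.

```python
from typing import Optional, Tuple

Size = Tuple[int, int]  # (width, height)


def upscale_plan(width: int, height: int, model_scale: int = 4,
                 target_scale: Optional[float] = None) -> Tuple[Size, Size]:
    """Return (model_output_size, final_size) for a fixed-factor upscaler.

    If target_scale is given, the model output is resized afterwards so the
    net factor relative to the input equals target_scale.
    """
    model_out = (width * model_scale, height * model_scale)
    if target_scale is None:
        return model_out, model_out
    resize = target_scale / model_scale  # e.g. 2 / 4 = 0.5
    final = (round(model_out[0] * resize), round(model_out[1] * resize))
    return model_out, final


def frames_memory_bytes(width: int, height: int, frames: int,
                        channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold `frames` uncompressed frames (float32 by default)."""
    return width * height * channels * bytes_per_value * frames


print(upscale_plan(480, 832))                  # ((1920, 3328), (1920, 3328))
print(upscale_plan(480, 832, target_scale=2))  # ((1920, 3328), (960, 1664))
print(frames_memory_bytes(1920, 3328, 49) / 1e9)  # about 3.76 (GB)
```

Under these assumptions, 49 frames at the default 480x832 grow to roughly 3.8 GB of float32 data after a 4x pass, before the video is encoded.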
3. Key Nodes
Video Generation
WanVideoModelLoader: Loads the WAN2.1-Fun-ControlNet model.
WanVideoSampler: Generates video frames with configurable parameters (steps, CFG scale).
WanVideoDecode: Decodes latent frames to images.
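The CFG scale passed to WanVideoSampler controls classifier-free guidance: at each step the model makes one prediction with the prompt and one without, and the two are combined by extrapolating from the unconditional prediction toward the conditional one. The sketch below shows only that combination rule on plain lists of floats. It is not WanVideoSampler's code, and `cfg_combine` is a hypothetical name.

```python
from typing import List, Sequence


def cfg_combine(uncond: Sequence[float], cond: Sequence[float],
                cfg_scale: float) -> List[float]:
    """Classifier-free guidance: uncond + cfg_scale * (cond - uncond)."""
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same shape")
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


print(cfg_combine([0.0, 1.0], [1.0, 1.0], 1.0))  # [1.0, 1.0] (equals cond)
print(cfg_combine([0.0, 1.0], [1.0, 1.0], 6.0))  # [6.0, 1.0] (pushed along cond - uncond)
```

With a scale of 1.0 the result equals the conditional prediction, which is why some samplers skip the unconditional pass at that setting; higher values follow the prompt more strongly but need two model evaluations per step.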
Control Signal Processing
AIO_Preprocessor: Preprocesses control maps (e.g., pose/depth).
WanVideoControlEmbeds: Encodes control signals.
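Depth estimators usually return relative depth values, so raw maps have to be mapped to a fixed range before they can serve as control images. One option, sketched below under that assumption, is to normalize over the whole clip rather than frame by frame, so a static background keeps the same brightness and the control signal does not flicker. `normalize_depth_clip` is hypothetical, and AIO_Preprocessor may normalize differently.

```python
from typing import List, Sequence


def normalize_depth_clip(frames: Sequence[Sequence[float]]) -> List[List[int]]:
    """Map raw depth values of every frame to 0..255 using one clip-wide min/max."""
    lo = min(min(f) for f in frames)
    hi = max(max(f) for f in frames)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant clip
    return [[round(255 * (v - lo) / span) for v in f] for f in frames]


raw = [[0.0, 1.0], [5.0, 2.0]]  # two tiny "frames" of two pixels each
print(normalize_depth_clip(raw))  # [[0, 51], [255, 102]]
```

Per-frame normalization would stretch the first frame's value of 1.0 to 255, while the clip-wide version keeps it at 51, consistent with the second frame.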
Post-Processing
FILM VFI: Interpolates frames for smoother playback.
ImageUpscaleWithModel: Enhances video resolution.
VHS_VideoCombine: Renders final video (supports audio merging).
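Frame interpolation changes both the frame count and the playback rate. FILM VFI predicts in-between frames with a neural network; the sketch below substitutes a plain linear cross-fade so that only the bookkeeping is visible. It assumes the interpolator inserts `multiplier - 1` frames between each consecutive pair and keeps the last frame, giving `(n - 1) * multiplier + 1` frames. `interpolate_linear` is a hypothetical stand-in, not the FILM VFI node.

```python
from typing import List, Sequence

Frame = List[float]


def interpolate_linear(frames: Sequence[Sequence[float]], multiplier: int) -> List[Frame]:
    """Naive cross-fade interpolation (a stand-in for FILM, not FILM itself)."""
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    if not frames:
        return []
    out: List[Frame] = []
    for a, b in zip(frames, frames[1:]):
        for k in range(multiplier):
            t = k / multiplier  # t = 0 copies frame a; t never reaches 1
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(list(frames[-1]))  # keep the final source frame
    return out


print(interpolate_linear([[0.0], [1.0]], 2))     # [[0.0], [0.5], [1.0]]
print(len(interpolate_linear([[0.0]] * 49, 2)))  # 97
```

To keep the clip's duration unchanged, the output frame rate should be the source rate times `multiplier`.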
Utilities
Joy_caption_two: Generates text prompts from reference images.
easy cleanGpuUsed: Clears GPU memory to prevent overflow.
4. Workflow Structure (Groups)
Input Control Video Group
Input: Uploaded video or control images (e.g., pose maps).
Key Nodes:
VHS_LoadVideo, ImageResizeKJ (resizes input).
Fun-Control Group
Input: Control signals, prompts, model parameters.
Key Nodes:
WanVideoSampler, WanVideoControlEmbeds.
Reference Image Captioning Group
Input: Reference image.
Key Node:
Joy_caption_two (generates descriptive text).
Post-Processing Group
Input: Raw generated frames.
Key Nodes:
FILM VFI (interpolation), VHS_VideoCombine (final render).
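The four groups form a small dependency graph: generation needs the resized control video and the caption-derived prompt, and post-processing needs the generated frames. Assuming those edges (the group names come from this page; the graph is an illustration, not ComfyUI's scheduler), Python's standard graphlib module yields a valid execution order.

```python
from graphlib import TopologicalSorter

# Each group maps to the groups whose outputs it consumes (assumed edges).
groups = {
    "Input Control Video": set(),
    "Reference Image Captioning": set(),
    "Fun-Control": {"Input Control Video", "Reference Image Captioning"},
    "Post-Processing": {"Fun-Control"},
}

order = list(TopologicalSorter(groups).static_order())
print(order)
```

The two input groups have no edge between them, so either may come first; Fun-Control always follows both, and Post-Processing always comes last.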
5. Inputs & Outputs
Input Parameters:
Control video, resolution (default: 480x832), prompts, frame limit (default: 49).
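The default frame limit of 49 has the form 4k + 1. Wan2.1's video VAE is commonly described as compressing time by a factor of 4 while keeping the first frame, so frame counts of that form map cleanly onto latent frames. Assuming that stride, the helpers below (hypothetical names) snap an arbitrary limit down to a valid count and report how many latent frames the sampler would process.

```python
def snap_frames(limit: int, stride: int = 4) -> int:
    """Largest frame count <= limit of the form stride * k + 1."""
    if limit < 1:
        raise ValueError("limit must be >= 1")
    return (limit - 1) // stride * stride + 1


def latent_frames(num_frames: int, stride: int = 4) -> int:
    """Latent frames for a (stride * k + 1)-frame clip: the first frame plus k."""
    return (num_frames - 1) // stride + 1


for limit in (49, 50, 52, 53):
    n = snap_frames(limit)
    print(limit, "->", n, "frames,", latent_frames(n), "latent frames")
```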
Output:
Final video (MP4), optionally upscaled and interpolated.
6. Notes & Tips
VRAM Requirement: Recommended GPU with 16GB+ VRAM (e.g., RTX 3090).
Dependencies: Install ComfyUI-WanVideoWrapper and ComfyUI-VideoHelperSuite manually.
Common Issues:
Missing model files: Ensure Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors is downloaded.
Resolution mismatch: Align input video and control map dimensions.
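A quick way to catch a resolution mismatch before sampling is to compare the sizes of the resized input video and the control maps, and to check that both are divisible by the model's spatial stride. The sketch below assumes a stride of 16 (8x VAE downsampling times 2x2 patches, as commonly reported for Wan2.1; the default 480x832 satisfies it). `snap_to_multiple` and `check_alignment` are hypothetical helpers, not part of ComfyUI-WanVideoWrapper.

```python
from typing import List, Tuple

Size = Tuple[int, int]  # (width, height)


def snap_to_multiple(width: int, height: int, multiple: int = 16) -> Size:
    """Round each side down to a multiple of `multiple` (never below one multiple)."""
    return (max(multiple, width // multiple * multiple),
            max(multiple, height // multiple * multiple))


def check_alignment(video: Size, control: Size, multiple: int = 16) -> List[str]:
    """Return human-readable problems; an empty list means the inputs line up."""
    problems = []
    if video != control:
        problems.append(f"size mismatch: video {video} vs control {control}")
    for name, (w, h) in (("video", video), ("control", control)):
        if w % multiple or h % multiple:
            problems.append(f"{name} {w}x{h} is not divisible by {multiple}")
    return problems


print(snap_to_multiple(481, 840))               # (480, 832)
print(check_alignment((480, 832), (480, 832)))  # []
print(check_alignment((480, 832), (482, 832)))  # two problems reported
```

Feeding the same snapped size to the resize step for both the input video and the control maps keeps them aligned; rounding down rather than up trades a few pixels of crop or rescale for avoiding padding.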