Unlock Lip-Synced Cartoon Avatar Videos with This AI-Powered Workflow
Generate lip-synced cartoon avatar videos with ease! Learn how to create 10s videos in minutes using SVD XT 1.1, SONIC UNet, and VHS Video plugin. Discover the workflow and key nodes to produce high-quality results. Get started now!
- Use Case: Video
- Best For: Video
- VRAM: Low VRAM (≤8GB)
- Reading Time: 3 min
Content type: Workflow
Primary intent: Download
Setup Notes
- Install the required models before opening the workflow template.
- Hardware tag for this page: Low VRAM (≤8GB). The notes in section 6 recommend an NVIDIA RTX 4090 with ≥16GB VRAM.
1. Workflow Overview

This workflow generates lip-synced cartoon avatar videos (e.g., Sonic) at any resolution. It syncs mouth movements to the input audio and produces clips of about 10 s, which take about 8 minutes to generate on an RTX 4090.
2. Core Models
| Model/Plugin | Function | Source/Installation |
|---|---|---|
| SVD XT 1.1 | Base video generation model | Download |
| SONIC UNet | Lip-sync specialized UNet | Load |
| VHS Video | Video synthesis plugin | Install via ComfyUI Manager |
3. Key Nodes
| Node Name | Function | Installation | Dependencies |
|---|---|---|---|
|  | Load base model | Built-in | SVD XT 1.1 model |
|  | Load lip-sync UNet | Manual SONIC plugin install |  |
| SONIC_PreData | Preprocess audio/image data | SONIC plugin | CLIP vision encoder |
| VHS_VideoCombine | Merge video/audio | Install | FFmpeg required |
4. Workflow Groups
Group 1: Data Loading
- Inputs: Image (e.g., 45b437ee...png) and audio (e.g., 10s-aijuxi.wav)
- Outputs: Preprocessed data
- Key Nodes: LoadImage, LoadAudio, SONIC_PreData
Group 2: Lip-Sync Generation
- Inputs: Preprocessed data + model
- Outputs: Frames with mouth movements
- Key Node: SONICSampler (controls FPS/seed)
Group 3: Video Export
- Inputs: Frames + original audio
- Outputs: MP4 (H.264 encoded)
- Key Node: VHS_VideoCombine
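The Key Nodes table lists FFmpeg as a dependency of the merge step. As a rough illustration of what combining frames with the original audio into an H.264 MP4 involves, the sketch below builds an equivalent standalone ffmpeg command. It is not how VHS_VideoCombine is implemented internally; the function name `build_mux_command`, the frame file pattern, and the output file name are assumptions made for this example.

```python
def build_mux_command(frames_pattern: str, audio_path: str,
                      output_path: str, fps: int = 25) -> list[str]:
    """Build an ffmpeg argument list that muxes numbered frames with audio.

    Hypothetical helper for illustration only; the workflow itself uses
    the VHS_VideoCombine node for this step.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-framerate", str(fps),  # frame rate of the input image sequence
        "-i", frames_pattern,    # numbered frames (assumed naming scheme)
        "-i", audio_path,        # original audio track
        "-c:v", "libx264",       # encode video as H.264
        "-pix_fmt", "yuv420p",   # pixel format most players accept
        "-c:a", "aac",           # encode audio as AAC for the MP4 container
        "-shortest",             # stop when the shorter stream ends
        output_path,
    ]


if __name__ == "__main__":
    # Frame pattern and output name are placeholders, not workflow defaults.
    cmd = build_mux_command("frames/%05d.png", "10s-aijuxi.wav",
                            "output/Sonic/example.mp4")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is on the PATH.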
5. Inputs & Outputs
Input Parameters:
- Image: 1080x1920 PNG (clear mouth area required)
- Audio: 10s WAV file
- Frame Rate: Default 25 FPS (adjustable)
- Seed: Random or fixed (e.g., 837794266)
Output: MP4 video (e.g., output/Sonic/aijuxi_xxxx.mp4)
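Before queuing a run, it can help to check the inputs against the parameters listed above. The following is a minimal sketch using only the Python standard library: it reads the WAV duration, reads the PNG dimensions from the file header, and estimates how many frames a clip implies at a given frame rate. The function names are hypothetical and not part of the workflow, and the frame estimate is plain duration × FPS arithmetic, not a statement about what SONICSampler produces internally.

```python
import struct
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def png_size(path: str) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", width, height.
    if (len(header) < 24
            or header[:8] != b"\x89PNG\r\n\x1a\n"
            or header[12:16] != b"IHDR"):
        raise ValueError(f"{path} is not a valid PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def expected_frame_count(duration_s: float, fps: int = 25) -> int:
    """Estimate the frame count as duration times frame rate."""
    return round(duration_s * fps)
```

With the defaults on this page, a 10 s clip at 25 FPS works out to `expected_frame_count(10.0)`, i.e. 250 frames, which is one reason shorter audio shortens generation time.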
6. Notes
⚠️ Hardware: NVIDIA GPU (RTX 4090 recommended, ≥16GB VRAM)
⚠️ Model Prep:
- Place svd_xt_1_1 in models/checkpoints
- unet.pth must be in the SONIC plugin directory
✅ Optimization:
- Shorter audio reduces generation time
- Set weight_dtype to fp16 in SONICSampler
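The model placement described under Model Prep can be verified with a short script before opening the workflow. This is a minimal sketch under stated assumptions: the helper name `find_missing_models` is hypothetical, both directories are supplied by the caller because this page does not give the ComfyUI root or the plugin folder name, and the checkpoint is matched by its base name only since no file extension is stated.

```python
from pathlib import Path


def find_missing_models(comfy_root: str, sonic_plugin_dir: str) -> list[str]:
    """Return descriptions of expected model files that were not found.

    Hypothetical helper; both paths are supplied by the caller.
    """
    missing = []

    checkpoints = Path(comfy_root) / "models" / "checkpoints"
    # Match any extension: the page gives only the base name svd_xt_1_1.
    if not any(checkpoints.glob("svd_xt_1_1*")):
        missing.append(f"svd_xt_1_1 checkpoint in {checkpoints}")

    # Search recursively: the exact subfolder inside the plugin is not stated.
    if not any(Path(sonic_plugin_dir).rglob("unet.pth")):
        missing.append(f"unet.pth under {sonic_plugin_dir}")

    return missing
```

An empty list means both files were found; otherwise each entry names a file to download or move.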