Unlock the Power of Lip-Synced Talking Avatars with Sonic Digital Human Workflow
Create Lip-Synced Talking Avatars with Sonic Digital Human Workflow | Generate MP4 videos with synchronized facial animations using Stable Video Diffusion (SVD) framework and audio input.
- VRAM: Low VRAM (≤8GB)
- Reading time: 3 min
Setup Notes
- Install the required models before opening the workflow template.
- Recommended hardware: Low VRAM (≤8GB).
1. Workflow Overview

The "Sonic Digital Human" workflow generates lip-synced talking avatar videos by combining an input image (e.g. a portrait) with an audio track (e.g. speech). Built on the Stable Video Diffusion (SVD) framework, it outputs MP4 videos with facial animation synchronized to the audio.
2. Core Models
| Model/Component | Function | Source |
|---|---|---|
| svd_xt_1_1 | Base video diffusion model | Download to |
| Sonic model (unet.pth) | Lip-sync control | Quark/Baidu links in workflow |
| CLIP Vision | Image feature extraction | Built-in |
3. Key Nodes
| Node | Purpose | Installation |
|---|---|---|
| SONICTLoader | Load Sonic adapter | Install |
| SONIC_PreData | Fuse audio/image data | Same as above |
| VHS_VideoCombine | Video compositing | |
| LoadAudio | Audio file loader | Built-in |
4. Pipeline Structure
Input Group
- Image: LoadImage (e.g. image.png)
- Audio: LoadAudio (e.g. April28.MP3)
Processing Group
- Data fusion: SONIC_PreData encodes temporal data
- Config: image size 768x768, audio weight = 0.5
Generation Group
- SONICSampler: 25 steps, 25 fps
- Output: 8 fps H.264 video (CRF=19)
5. I/O Specifications
Input Requirements:
- Image: 1139x1151 PNG recommended
- Audio: MP3/WAV with clear speech
Output:
- Video: ComfyUI/output/AnimateDiff_xxxx-audio.mp4
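
To pick up the rendered file after a run, a small helper can search the output folder for the newest matching video. This is a minimal sketch: the function name `latest_output` is hypothetical, and the glob pattern assumes that the `xxxx` part of the file name is a counter that varies between runs.

```python
from pathlib import Path


def latest_output(output_dir="ComfyUI/output", pattern="AnimateDiff_*-audio.mp4"):
    """Return the most recently modified file matching the pattern, or None."""
    # The pattern is an assumption derived from the example file name above.
    candidates = list(Path(output_dir).glob(pattern))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    newest = latest_output()
    print(newest if newest is not None else "No matching output video found.")
```

If your ComfyUI install lives elsewhere, pass its output folder as `output_dir`.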
6. Critical Notes
Model Setup:
- Download the Sonic model from the cloud links provided in the workflow
- Verify the svd_xt_1_1 model path
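
One way to verify the model paths before opening the workflow is a short existence check. This is only a sketch under stated assumptions: the folder layout and file names in `EXPECTED_MODELS` are hypothetical placeholders, not paths confirmed by this page, so replace them with the locations your installation actually uses.

```python
from pathlib import Path

# Hypothetical layout: these relative paths are assumptions for illustration.
# Replace them with the locations your ComfyUI install and custom nodes expect.
EXPECTED_MODELS = {
    "svd_xt_1_1": Path("models/checkpoints/svd_xt_1_1.safetensors"),
    "Sonic unet": Path("models/sonic/unet.pth"),
}


def find_missing_models(comfy_root, expected=EXPECTED_MODELS):
    """Return the names of expected model files that are not on disk."""
    root = Path(comfy_root)
    return [name for name, rel in expected.items() if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_models("ComfyUI")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All expected model files found.")
```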
Performance:
- VRAM ≥16GB required
- Reduce FPS to 8 for lower resource usage
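
A rough way to see why a lower frame rate helps is to count the frames needed to cover the audio. The sketch below assumes one generated frame per output frame, which may not match how the sampler schedules its work internally. The 10-second clip length is illustrative and not a value from the workflow.

```python
import math


def frames_needed(audio_seconds, fps):
    """Number of video frames required to cover the audio at a given frame rate."""
    return math.ceil(audio_seconds * fps)


clip_seconds = 10  # illustrative clip length, not taken from the workflow
for fps in (25, 8):
    print(f"{fps} fps -> {frames_needed(clip_seconds, fps)} frames")
# 25 fps -> 250 frames
# 8 fps -> 80 frames
```

Under that assumption, dropping from 25 fps to 8 fps cuts the number of frames for the same clip by roughly two thirds.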
Troubleshooting:
- Lips out of sync: check the audio sample rate (44.1kHz)
- Choppy video: adjust CRF (18-23)
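
For the sample-rate check, Python's standard `wave` module can read the rate from a WAV header. This is a minimal sketch with hypothetical helper names. It only handles uncompressed WAV files, so an MP3 input would need to be inspected with a separate tool.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def is_expected_rate(path, expected_hz=44100):
    """True if the WAV file uses the expected sample rate (44.1 kHz by default)."""
    return wav_sample_rate(path) == expected_hz
```

For example, `is_expected_rate("speech.wav")` returns `False` for a 48 kHz recording, which suggests resampling the audio before loading it into the workflow.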