Unlock Lip-Synced Cartoon Avatar Videos with This AI-Powered Workflow

ComfyUI.org
2025-04-02 10:33:01

Generate lip-synced cartoon avatar videos: this workflow produces ~10s clips in minutes using SVD XT 1.1, the SONIC UNet, and the VHS Video plugin. The sections below cover the workflow and the key nodes it relies on.

Setup Notes

  • Install the required models before opening the workflow template.
  • Hardware: this workflow is tagged Low VRAM (≤8GB), although the Notes section recommends an NVIDIA GPU with ≥16GB VRAM.

1. Workflow Overview

This workflow generates lip-synced cartoon avatar videos (e.g., Sonic) at any resolution. It syncs the avatar's mouth movements with the input audio and produces clips of about 10s, each taking roughly 8 minutes to generate on an RTX 4090.
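To put the timing in perspective, here is a rough back-of-the-envelope calculation based on the figures quoted in this article (a clip of about 10s, about 8 minutes on an RTX 4090, and the default 25 FPS listed under Inputs & Outputs). The assumption that generation time scales linearly with audio length is ours and has not been measured.

```python
# Rough throughput estimate from the figures quoted in this article.
CLIP_SECONDS = 10        # approximate clip length
GENERATION_MINUTES = 8   # approximate generation time on an RTX 4090
FPS = 25                 # default frame rate of the workflow


def estimate_minutes(audio_seconds: float) -> float:
    """Estimate generation time, ASSUMING it scales linearly with audio length."""
    return GENERATION_MINUTES * audio_seconds / CLIP_SECONDS


total_frames = CLIP_SECONDS * FPS                            # frames in a 10s clip
seconds_per_frame = GENERATION_MINUTES * 60 / total_frames   # wall-clock cost per frame

print(f"{total_frames} frames, ~{seconds_per_frame:.2f} s of compute per frame")
print(f"5 s clip: ~{estimate_minutes(5):.1f} min (linear-scaling assumption)")
```

Under these assumptions, each generated frame costs a little under two seconds of compute, which is why trimming the audio is the most direct way to shorten a run.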


2. Core Models

| Model/Plugin | Function | Source/Installation |
|---|---|---|
| SVD XT 1.1 | Base video generation model | Download svd_xt_1_1 checkpoint |
| SONIC UNet | Lip-sync specialized UNet | Load unet.pth |
| VHS Video | Video synthesis plugin | Install via ComfyUI Manager |


3. Key Nodes

| Node Name | Function | Installation | Dependencies |
|---|---|---|---|
| ImageOnlyCheckpointLoader | Load base model | Built-in | SVD XT 1.1 model |
| SONICTLoader | Load lip-sync UNet | Manual SONIC plugin install | unet.pth file |
| SONIC_PreData | Preprocess audio/image data | SONIC plugin | CLIP vision encoder |
| VHS_VideoCombine | Merge video/audio | Install ComfyUI-VideoHelperSuite | FFmpeg required |
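Since VHS_VideoCombine depends on FFmpeg, it can help to confirm that an `ffmpeg` executable is reachable before running the workflow. The sketch below only checks the system PATH using the Python standard library; the helper name `ffmpeg_version` is hypothetical, and how the plugin itself locates FFmpeg is not covered here.

```python
import shutil
import subprocess


def ffmpeg_version(executable: str = "ffmpeg"):
    """Return the first line of `<executable> -version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = ffmpeg_version()
if version is None:
    print("ffmpeg not found on PATH; install it before exporting video")
else:
    print(version)
```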


4. Workflow Groups

  • Group 1: Data Loading

    • Inputs:

      • Image (e.g., 45b437ee...png)

      • Audio (e.g., 10s-aijuxi.wav)

    • Outputs: Preprocessed data

    • Key Nodes: LoadImage, LoadAudio, SONIC_PreData

  • Group 2: Lip-Sync Generation

    • Inputs: Preprocessed data + model

    • Outputs: Frames with mouth movements

    • Key Node: SONICSampler (controls FPS/seed)

  • Group 3: Video Export

    • Inputs: Frames + original audio

    • Outputs: MP4 (H.264 encoded)

    • Key Node: VHS_VideoCombine
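The three groups form a simple pipeline in which each group consumes what the previous ones produce. The sketch below models that data flow in plain Python and checks that every group's inputs are available when it runs. It is not ComfyUI code: labels such as `preprocessed_data` and `frames` are illustrative names, not actual node socket types, and the loaded model is treated as an external source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str
    nodes: tuple
    inputs: frozenset
    outputs: frozenset


# Data supplied from outside the three groups (user files and the model loaders).
SOURCES = frozenset({"image", "audio", "model"})

GROUPS = [
    Group("Data Loading", ("LoadImage", "LoadAudio", "SONIC_PreData"),
          frozenset({"image", "audio"}), frozenset({"preprocessed_data"})),
    Group("Lip-Sync Generation", ("SONICSampler",),
          frozenset({"preprocessed_data", "model"}), frozenset({"frames"})),
    Group("Video Export", ("VHS_VideoCombine",),
          frozenset({"frames", "audio"}), frozenset({"mp4"})),
]


def check_order(groups, sources):
    """Walk the groups in order and fail if one needs data nobody has produced yet."""
    available = set(sources)
    for group in groups:
        missing = group.inputs - available
        if missing:
            raise ValueError(f"{group.name} is missing inputs: {sorted(missing)}")
        available |= group.outputs
    return available


print(sorted(check_order(GROUPS, SOURCES)))
```

Note that the original audio feeds both the first and the last group: it drives the lip-sync preprocessing and is later muxed into the exported MP4.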


5. Inputs & Outputs

  • Input Parameters:

    • Image: 1080x1920 PNG (clear mouth area required)

    • Audio: 10s WAV file

    • Frame Rate: Default 25 FPS (adjustable)

    • Seed: Random or fixed (e.g., 837794266)

  • Output: MP4 video (e.g., output/Sonic/aijuxi_xxxx.mp4)
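As a quick sanity check on the inputs, the audio duration and the frame rate together determine roughly how many frames have to be generated. The sketch below reads a WAV file's duration with Python's standard `wave` module and multiplies it by the FPS. Rounding up with `ceil` is an assumption; the SONIC nodes may compute the frame count differently. The helper names are hypothetical.

```python
import math
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def expected_frame_count(duration_s: float, fps: int = 25) -> int:
    """Approximate number of video frames (ASSUMES rounding up to a whole frame)."""
    return math.ceil(duration_s * fps)


def write_silent_wav(path: str, seconds: int, sample_rate: int = 16000) -> None:
    """Write a mono 16-bit silent WAV, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silent_wav(demo_path, seconds=10)
duration = wav_duration_seconds(demo_path)
print(f"{duration:.1f} s of audio -> {expected_frame_count(duration)} frames at 25 FPS")
```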


6. Notes

  • ⚠️ Hardware: NVIDIA GPU with ≥16GB VRAM (RTX 4090 recommended)

  • ⚠️ Model Prep:

    • Place svd_xt_1_1 in models/checkpoints

    • Place unet.pth in the SONIC plugin directory

  • Optimization:

    • Shorter audio reduces generation time

    • Set weight_dtype to fp16 in SONICSampler
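The model placement rules above can be verified with a short script before loading the workflow. This is a minimal sketch with several assumptions: the checkpoint filename `svd_xt_1_1.safetensors` is a guess at the file extension, the exact location of the SONIC plugin directory depends on your installation and must be passed in, and the helper name `missing_model_files` is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_model_files(comfy_root, sonic_dir,
                        checkpoint_name="svd_xt_1_1.safetensors",  # assumed filename
                        unet_name="unet.pth"):
    """Return the expected model files that are not present on disk."""
    expected = [
        Path(comfy_root) / "models" / "checkpoints" / checkpoint_name,
        Path(sonic_dir) / unet_name,
    ]
    return [path for path in expected if not path.is_file()]


# Demo against an empty throwaway directory, so both files are reported missing.
with tempfile.TemporaryDirectory() as root:
    sonic = Path(root) / "sonic_plugin"  # hypothetical plugin location
    for path in missing_model_files(root, sonic):
        print(f"missing: {path.name}")
```

For a real check, pass your ComfyUI root folder and the SONIC plugin directory instead of the temporary paths.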

FAQ