
How lip sync works

Lip sync takes a source video and a new audio track, then updates the speaker’s mouth movements so they better match the translated speech.

What changes and what stays the same

VoiceCheap focuses on the speaking face region. In practice:
  • the mouth and lower-face motion are updated
  • the target audio drives the new mouth movement
  • the rest of the frame stays as intact as possible

The high-level pipeline

1. Detect the face

VoiceCheap identifies the speaking face and tracks the relevant facial landmarks around the mouth and lower face.
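
VoiceCheap's own detector is internal, but the idea can be sketched with off-the-shelf tools. The snippet below is a minimal sketch using OpenCV and MediaPipe Face Mesh to track a few mouth landmarks per frame; the input file name is an assumption, and the landmark indices are MediaPipe's, not VoiceCheap's:

```python
import cv2
import mediapipe as mp

# Illustrative only: VoiceCheap's detector is proprietary. This sketch uses
# MediaPipe Face Mesh to track landmarks around the mouth in each frame.
MOUTH_IDS = [61, 291, 13, 14]  # mouth corners, upper/lower inner lip

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                            max_num_faces=1)
cap = cv2.VideoCapture("source_video.mp4")  # hypothetical input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        continue  # no face this frame; a real system would interpolate
    h, w = frame.shape[:2]
    lm = results.multi_face_landmarks[0].landmark
    mouth_pts = [(int(lm[i].x * w), int(lm[i].y * h)) for i in MOUTH_IDS]
    # mouth_pts now gives pixel coordinates framing the speaking region
cap.release()
```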
2. Analyze the translated audio

The system listens to the target audio and computes the mouth shapes needed to match the speech.
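
Production systems typically feed the dubbed audio through a learned encoder. As a rough stand-in, the sketch below uses librosa to compute mel-spectrogram slices aligned one-to-one with video frames, plus a crude energy-based "mouth openness" proxy; the file name, sample rate, and frame rate are assumptions:

```python
import librosa
import numpy as np

# A minimal sketch: one mel-spectrogram column per video frame, so each
# frame has an audio feature slice to condition the mouth shape on.
FPS = 25                                     # assumed video frame rate
audio, sr = librosa.load("dubbed_audio.wav", sr=16000)

hop = sr // FPS                              # samples per video frame
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80,
                                     hop_length=hop)
log_mel = librosa.power_to_db(mel)           # shape: (80, n_audio_frames)

# Crude proxy for mouth openness: normalized overall energy per frame.
openness = log_mel.mean(axis=0)
openness = (openness - openness.min()) / (np.ptp(openness) + 1e-8)
```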
3. Generate new lip movement

The speaking region is regenerated frame by frame to match the translated audio more closely.
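
The generator in a real system is a trained neural model conditioned on the audio features, which is beyond a docs snippet. Purely to illustrate the per-frame conditioning, the stand-in below stretches the mouth crop in proportion to the openness value from the previous sketch; every name here is hypothetical:

```python
import cv2
import numpy as np

# Stand-in for a learned generator: stretch the mouth crop vertically in
# proportion to the audio-derived "openness" value for that frame.
def generate_mouth(crop: np.ndarray, openness: float) -> np.ndarray:
    h, w = crop.shape[:2]
    new_h = max(1, int(h * (0.8 + 0.4 * openness)))
    stretched = cv2.resize(crop, (w, new_h))
    out = np.zeros_like(crop)          # pad/trim back to the crop size
    hh = min(h, stretched.shape[0])
    out[:hh] = stretched[:hh]
    return out

# One generated mouth crop per (video frame, audio feature) pair.
# frames, mouth_boxes, openness_per_frame come from the earlier steps.
def regenerate(frames, mouth_boxes, openness_per_frame):
    outputs = []
    for frame, (x, y, w, h), o in zip(frames, mouth_boxes,
                                      openness_per_frame):
        crop = frame[y:y + h, x:x + w]
        outputs.append(generate_mouth(crop, float(o)))
    return outputs
```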
4. Blend the result back into the video

The generated mouth region is composited back into the original frame so the result stays visually coherent.
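
Compositing is the easiest step to show concretely. A common approach, sketched below under the assumption that the mouth box and generated crop come from the earlier snippets, is alpha blending with a feathered mask so the regenerated region fades into the untouched frame:

```python
import cv2
import numpy as np

# Paste the generated mouth crop back with a feathered (Gaussian-blurred)
# mask so the seam between new and original pixels stays invisible.
def blend_mouth(frame: np.ndarray, generated: np.ndarray,
                box: tuple[int, int, int, int]) -> np.ndarray:
    x, y, w, h = box
    mask = np.zeros((h, w), dtype=np.float32)
    cv2.ellipse(mask, (w // 2, h // 2), (w // 2 - 2, h // 2 - 2),
                0, 0, 360, 1.0, thickness=-1)   # soft oval over the mouth
    mask = cv2.GaussianBlur(mask, (21, 21), 0)[..., None]  # feathered edge
    out = frame.copy()
    region = out[y:y + h, x:x + w].astype(np.float32)
    out[y:y + h, x:x + w] = (mask * generated.astype(np.float32)
                             + (1 - mask) * region).astype(np.uint8)
    return out
```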

Where lip sync fits in the workflow

Lip sync is not the first quality step. The usual order is:
  1. get the transcript right
  2. choose the voice strategy
  3. make sure the dubbed audio sounds natural
  4. add lip sync if the project benefits from it
If the translated voice or timing still sounds wrong, fix that first. Lip sync works best when the audio is already in good shape.

Limitations to expect

  • strong profile views are harder than front-facing shots
  • heavy obstructions around the mouth reduce quality
  • shaky footage reduces face-tracking reliability
  • noisy or overlapping audio can reduce the naturalness of the final result