
From Phonemes to Mouth Shapes: Animating VRM Models

When a 3D avatar speaks, its mouth needs to move in sync with the audio. Not approximately — frame by frame, syllable by syllable. The technology that makes this work bridges two worlds: phonemes (the sounds of speech) and visemes (the visual mouth shapes that produce those sounds).

This is how it actually works.

Phonemes: The Atoms of Speech

A phoneme is the smallest unit of sound that distinguishes one word from another. English has about 44 phonemes. The word "cat" has three: /k/, /æ/, /t/. Change the middle phoneme and you get "cut" (/k/, /ʌ/, /t/) — a completely different word.

When a text-to-speech engine synthesizes audio, it does not just produce a waveform. It produces a timeline — a sequence of phonemes, each tagged with a start time and end time in the audio stream. For the word "hello":

h  →  0.00s – 0.08s
e  →  0.08s – 0.16s
l  →  0.16s – 0.28s
l  →  0.28s – 0.36s
o  →  0.36s – 0.48s

This timeline is the foundation of lip sync. Without it, you are guessing.
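In code, the timeline is just an array of timed segments. A minimal sketch (field names are illustrative, not from any particular TTS API):

```typescript
// One entry per phoneme in the synthesized audio.
interface PhonemeEvent {
  phoneme: string; // e.g. "h", "e", "l", "o"
  start: number;   // seconds from the start of the audio
  end: number;     // seconds
}

// The "hello" example above as data:
const timeline: PhonemeEvent[] = [
  { phoneme: "h", start: 0.0,  end: 0.08 },
  { phoneme: "e", start: 0.08, end: 0.16 },
  { phoneme: "l", start: 0.16, end: 0.28 },
  { phoneme: "l", start: 0.28, end: 0.36 },
  { phoneme: "o", start: 0.36, end: 0.48 },
];
```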

Visemes: What the Mouth Actually Looks Like

Here is the key insight: the mouth makes far fewer shapes than the voice makes sounds. English has about 44 phonemes, but the mouth takes only about five or six distinct positions when speaking. Many phonemes that sound completely different look identical on the lips.

Say "b", "m", and "p" out loud. Watch your mouth in a mirror. All three look the same — lips pressed together, then released. The difference is in the voicing and nasal airflow, not the lip position.

These visual mouth positions are called visemes. The VRM avatar standard (used for 3D characters in virtual environments) defines five core viseme blend shapes:

Blend Shape   Mouth Position               Example Sounds
aa            Wide open                    "ah" as in "father"
ee            Lips spread horizontally     "ee" as in "see"
ih            Slightly open, relaxed       "ih" as in "sit"
oh            Lips rounded, medium open    "oh" as in "go"
ou            Lips tightly rounded         "oo" as in "moon"

These are implemented as blend shapes (also called morph targets) — deformations of the 3D mesh that can be mixed at varying intensities. Setting aa to 100% opens the mouth wide. Setting oh to 60% and aa to 30% produces a partially rounded, partially open shape. By mixing these five targets, you can approximate any mouth position in human speech.
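A mouth pose, then, is just a weight record over those five targets. A minimal sketch (the names mirror the VRM presets; the helper is illustrative):

```typescript
// The five VRM mouth blend shapes as a weight record, each in [0, 1].
type VisemeWeights = { aa: number; ee: number; ih: number; oh: number; ou: number };

// All weights zero: mouth at rest.
const REST: VisemeWeights = { aa: 0, ee: 0, ih: 0, oh: 0, ou: 0 };

// Build a pose by overriding the rest pose with partial weights,
// e.g. 60% "oh" + 30% "aa" → partially rounded, partially open.
function pose(partial: Partial<VisemeWeights>): VisemeWeights {
  return { ...REST, ...partial };
}

const roundedOpen = pose({ oh: 0.6, aa: 0.3 });
```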

The Mapping: Phoneme → Viseme

The bridge between sounds and shapes is a lookup table. Each phoneme maps to one or more viseme targets with weights:

Pure vowels map cleanly:

"a" → { aa: 1.0 } // wide open
"e" → { ee: 1.0 } // spread
"o" → { oh: 1.0 } // rounded
"u" → { ou: 1.0 } // tight round

Diphthongs and reduced vowels are blends:

"æ" → { aa: 0.6, ee: 0.4 } // "cat" — open but slightly spread
"ə" → { aa: 0.4, ih: 0.3 } // schwa — lazy, central
"ʊ" → { ou: 0.8, oh: 0.2 } // "book" — mostly rounded

Consonants are where it gets interesting. Most consonants are produced inside the mouth (tongue, teeth, palate) and barely affect the lips:

"t", "d", "n" → { } // tongue tip — lips neutral
"k", "g" → { } // back of tongue — lips neutral
"s", "z" → { ih: 0.3 } // slight spread from airflow

But some consonants are deeply visual:

"m", "b", "p" → { } // bilabial — mouth closes completely
"f", "v" → { ou: 0.2 } // labiodental — teeth on lower lip
"w" → { ou: 0.8 } // strong lip rounding
"r" → { ou: 0.4 } // moderate rounding

And punctuation produces silence:

" ", ".", "," → { } // mouth returns to rest

This mapping is not arbitrary. It is grounded in articulatory phonetics — the physical mechanics of how humans produce speech sounds.
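Collected into one structure, the examples above might look like this sketch (the weights are the illustrative values from the text; a production table covers the full phoneme inventory):

```typescript
type VisemeWeights = Partial<Record<"aa" | "ee" | "ih" | "oh" | "ou", number>>;

const PHONEME_TO_VISEME: Record<string, VisemeWeights> = {
  // Pure vowels
  a: { aa: 1.0 },
  e: { ee: 1.0 },
  o: { oh: 1.0 },
  u: { ou: 1.0 },
  // Diphthongs and reduced vowels
  "æ": { aa: 0.6, ee: 0.4 },
  "ə": { aa: 0.4, ih: 0.3 },
  "ʊ": { ou: 0.8, oh: 0.2 },
  // Consonants with little lip involvement
  t: {}, d: {}, n: {}, k: {}, g: {},
  s: { ih: 0.3 }, z: { ih: 0.3 },
  // Visually prominent consonants
  m: {}, b: {}, p: {},            // bilabial: all weights zero → closed mouth
  f: { ou: 0.2 }, v: { ou: 0.2 }, // labiodental
  w: { ou: 0.8 },
  r: { ou: 0.4 },
  // Silence
  " ": {}, ".": {}, ",": {},
};

// Unknown phonemes fall back to a neutral mouth.
function visemeFor(phoneme: string): VisemeWeights {
  return PHONEME_TO_VISEME[phoneme] ?? {};
}
```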

The Animation Loop

With the phoneme timeline and the viseme mapping table, the real-time animation works like this:

1. Decode and play the audio. The synthesized speech (typically MP3) is decoded into a Web Audio buffer and routed through an AudioContext. The AudioContext provides a high-precision clock — audioContext.currentTime — that stays perfectly synchronized with the audio output, even if the browser tab is backgrounded or the frame rate drops.

2. On every animation frame, find the current phoneme. At 60 frames per second, the engine checks the current playback time against the phoneme timeline. A backward linear search finds the active phoneme segment:

elapsed = 0.22s → active phoneme is "l" (0.16s – 0.28s)
target blend shapes: { }  (lips neutral for "l")

3. Interpolate toward the target. Rather than snapping between mouth positions — which would look robotic — the blend shape weights are smoothly interpolated using an exponential lerp:

lerpFactor = 1 - e^(-speed × deltaTime)
current.aa += (target.aa - current.aa) × lerpFactor

With a lerp speed of 12 units per second, the mouth transitions feel natural. Fast enough to keep up with rapid speech, smooth enough to avoid jarring pops. The exponential curve means large differences close quickly while small refinements happen gradually — exactly how real muscles behave.

4. Apply to the 3D model. Each frame, the interpolated weights are pushed to the VRM model's expression manager. The 3D engine deforms the mesh vertices according to the blend shape weights. Five floating-point numbers — updated 60 times per second — drive the entire illusion of speech.
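Steps 2 and 3 can be condensed into one per-frame update. A sketch under illustrative names (`visemeFor` stands in for the lookup table from the mapping section; pushing `current` to the model's expression manager is step 4):

```typescript
interface PhonemeEvent { phoneme: string; start: number; end: number }
type Weights = Record<"aa" | "ee" | "ih" | "oh" | "ou", number>;

const LERP_SPEED = 12; // units per second, as in the text

// Step 2: backward linear search. During playback the active segment is
// near the end of the range already played, so scanning from the back
// usually terminates after a few comparisons.
function activePhoneme(timeline: PhonemeEvent[], elapsed: number): PhonemeEvent | null {
  for (let i = timeline.length - 1; i >= 0; i--) {
    if (elapsed >= timeline[i].start && elapsed < timeline[i].end) return timeline[i];
  }
  return null;
}

// Step 3: frame-rate-independent exponential approach toward the target.
function expLerp(current: number, target: number, deltaTime: number): number {
  const lerpFactor = 1 - Math.exp(-LERP_SPEED * deltaTime);
  return current + (target - current) * lerpFactor;
}

// One animation-frame update of the mouth weights, in place.
function updateMouth(
  current: Weights,
  timeline: PhonemeEvent[],
  visemeFor: (p: string) => Partial<Weights>,
  elapsed: number,   // audioContext.currentTime minus playback start
  deltaTime: number, // seconds since the previous frame
): void {
  const seg = activePhoneme(timeline, elapsed);
  const target = seg ? visemeFor(seg.phoneme) : {}; // no segment → rest pose
  for (const key of Object.keys(current) as (keyof Weights)[]) {
    current[key] = expLerp(current[key], target[key] ?? 0, deltaTime);
  }
}
```

Because the lerp factor is derived from `deltaTime`, the mouth converges at the same real-time rate whether the page renders at 30 or 144 fps.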

The Graceful Close

When the audio ends, you cannot just zero out all blend shapes. That would snap the mouth shut unnaturally. Instead, the weights decay geometrically over roughly 18 frames (about 0.3 seconds at 60 fps), each frame multiplying the previous value by 0.7:

Frame 1: aa = 0.50 → 0.35
Frame 2: aa = 0.35 → 0.245
Frame 3: aa = 0.245 → 0.172
...
Frame 10: aa ≈ 0.020 → 0.014
...
Frame 18: aa < 0.001 → clamped to 0.0

This produces a natural "settling" motion, like a real person closing their mouth after finishing a sentence. Values below 0.001 are clamped to zero to avoid floating-point drift.
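A sketch of that decay step, using the 0.7 multiplier and 0.001 clamp from the text:

```typescript
// Multiply every weight by 0.7; clamp tiny values to exactly zero.
// Returns false once every weight has settled, so the caller knows
// when the closing animation is finished.
function decayStep(weights: Record<string, number>): boolean {
  let stillMoving = false;
  for (const key of Object.keys(weights)) {
    weights[key] *= 0.7;
    if (weights[key] < 0.001) weights[key] = 0;
    else stillMoving = true;
  }
  return stillMoving;
}
```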

The Emotion Problem

3D avatars do not just talk — they emote. A happy character smiles. A surprised character opens their mouth. These emotion-driven facial expressions also affect the mouth area, which creates a conflict: should the mouth follow the phonemes ("say hello") or the emotion ("look happy")?

The solution is a priority system. During speech, a suppression flag disables emotion-based mouth blending. The phoneme animation has exclusive control of the five mouth blend shapes. Meanwhile, everything above the mouth — eyebrow raises, eye squints, cheek movements — continues to follow the emotion system normally.

When the audio ends and the mouth closing animation completes, the flag is released, and emotion-based mouth expressions resume. The result: characters can look happy while talking (smiling eyes, raised cheeks) without their mouth fighting between "say the word" and "smile."
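The suppression flag itself can be as small as this sketch (class and method names are illustrative):

```typescript
// Gate that decides which system may drive the five mouth blend shapes.
class MouthPriority {
  private speaking = false;

  beginSpeech() { this.speaking = true; }  // lip sync takes the mouth
  endSpeech()   { this.speaking = false; } // called after the closing decay

  // The emotion system checks this before touching any mouth shape.
  // Brows, eyes, and cheeks are never gated.
  emotionMayDriveMouth(): boolean {
    return !this.speaking;
  }
}
```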

When Phonemes Are Not Available

Not every TTS provider returns phoneme timing. When the data is missing, there is a fallback: real-time frequency analysis.

The audio stream is run through an FFT (Fast Fourier Transform) on every frame, splitting it into frequency bands. Different parts of the frequency spectrum correlate with different mouth shapes:

  • Low frequencies (80–500 Hz) → oh blend shape. Bass frequencies dominate in rounded vowels like "o" and "u", where the vocal tract is long and open.
  • Mid frequencies (500–1500 Hz) → aa blend shape. The first formant of open vowels like "ah" lives here.
  • High frequencies (1500–4000 Hz) → ih and ee blend shapes. Sibilants ("s", "sh") and front vowels produce energy in the upper range.

The energy distribution across these bands drives the blend shape weights in real time. It is less precise than phoneme-level control — you lose the consonant detail and the exact timing of transitions — but it produces surprisingly convincing results. The mouth opens wider on loud, open vowels and narrows on quiet fricatives, which is the most important visual cue for speech perception.
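The band-to-viseme step can be sketched as a pure function over per-bin magnitudes. In a browser the magnitudes would come from a Web Audio `AnalyserNode` read each frame; that wiring is omitted here, and the sketch assumes values already normalized to [0, 1]. The band edges are the ones listed above; the split of the high band between `ih` and `ee` is an illustrative choice:

```typescript
// `magnitudes` holds one value in [0, 1] per FFT bin.
// `binHz` is the width of one bin: sampleRate / fftSize (e.g. 44100 / 2048).
function bandsToVisemes(magnitudes: Float32Array, binHz: number) {
  // Average magnitude across the bins covering [loHz, hiHz].
  const avg = (loHz: number, hiHz: number): number => {
    const lo = Math.max(0, Math.floor(loHz / binHz));
    const hi = Math.min(magnitudes.length - 1, Math.ceil(hiHz / binHz));
    let sum = 0;
    for (let i = lo; i <= hi; i++) sum += magnitudes[i];
    return sum / (hi - lo + 1);
  };
  return {
    oh: avg(80, 500),          // bass energy → rounded vowels
    aa: avg(500, 1500),        // first formant of open vowels
    ih: avg(1500, 4000),       // sibilants and front vowels
    ee: avg(1500, 4000) * 0.5, // share the high band, weighted down (illustrative)
  };
}
```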

Why It Works

Human speech perception is remarkably forgiving of lip sync imprecision. Research in audiovisual speech perception (the McGurk effect and related work) shows that viewers primarily need three things to perceive natural speech:

  1. Temporal alignment — the mouth opens when sound starts, closes when it stops
  2. Amplitude correlation — louder sounds produce more mouth movement
  3. Vowel distinction — open vowels look different from closed ones

Consonant precision matters less than you might expect. We get most of our consonant information from audio, not vision. This is why even the approximate phoneme timing approach — dividing audio duration equally across characters — produces acceptable results. The vowels land close enough to their correct positions, and the consonants pass too quickly for the eye to catch the imprecision.

The five VRM blend shapes are not a limitation. They are a design insight: five is enough, because human lip-reading resolution is low enough that finer distinctions are lost anyway. What matters is smooth transitions, correct timing, and getting the vowels right.