Release notes: voice, eyes, breath, and everything we shipped this week
This past week was unglamorous-but-real work on the parts of Primeta you actually see and hear every turn: the face, the voice, and the microseconds between them. Here is what shipped.
The avatar is more alive
Most Primeta models come from VRoid Studio with a rigging setup that technically supports a lot of motion but doesn't actually do any of it by default. We noticed. A few changes close that gap.
Eyes move now. Subtle saccades every 1-3 seconds — small ±30 degree yaw darts that your brain reads as "there is somebody in there." Before this, the gaze was locked forward because the default VRM setup expects a rig with eye bones mapped to the humanoid, and most library models don't have that. We detect the mismatch at load and swap in an expression-based look-at driver, which the same models do have blendshapes for. Effect: eye motion is now visible on hundreds of models it wasn't visible on before.
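In sketch form, the load-time decision and the saccade timing look something like this. The function names, the humanoid-bone strings, and the blendshape names are illustrative stand-ins, not the real loader API:

```javascript
// Decide at load time which gaze driver this model can support.
// Bone/blendshape names here are hypothetical placeholders.
function pickLookAtDriver(humanoidBones, blendshapes) {
  const hasEyeBones =
    humanoidBones.includes("leftEye") && humanoidBones.includes("rightEye");
  if (hasEyeBones) return "bone"; // drive the eye bones directly
  const hasLookShapes = ["lookLeft", "lookRight", "lookUp", "lookDown"]
    .every((n) => blendshapes.includes(n));
  return hasLookShapes ? "expression" : "none"; // blendshape fallback
}

// Schedule the next saccade: a small yaw dart, 1-3 seconds out.
function nextSaccade(rand = Math.random) {
  const delayMs = 1000 + rand() * 2000;  // next dart in 1-3 s
  const yawDeg = (rand() * 2 - 1) * 30;  // dart within the ±30 degree range
  return { delayMs, yawDeg };
}
```

The point of the two-step check is that it degrades gracefully: models with proper eye bones keep the higher-fidelity path, and everything else still gets motion.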
Breathing. A gentle additive sine wave on the spine and chest, running on top of whatever scripted animation is playing. If you watch closely for a second, the chest rises. If you're not watching closely, you don't think "this character is frozen."
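The additive part is the whole trick: the offset is summed onto whatever pose the current animation produced, so it layers rather than overrides. A minimal sketch, with the period and amplitude as made-up placeholder values (the real numbers are tuned per rig):

```javascript
// Additive breathing offset, layered on top of the playing animation.
// period/amplitude are illustrative defaults, not the shipped tuning.
function breathingOffset(timeSec, { period = 4.0, amplitude = 0.01 } = {}) {
  // One full breath cycle every `period` seconds; the result is a small
  // delta added to the spine/chest bone transforms each frame.
  return amplitude * Math.sin((2 * Math.PI * timeSec) / period);
}
```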
Blinks stop happening mid-syllable. We now defer a blink if the mouth is actively forming a viseme, pushing it into the natural pause between phonemes. Small thing. Noticeable once you stop seeing the wrong thing.
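The gate itself is tiny. A hedged sketch of the idea (the threshold value and names are illustrative): a blink that has come due only fires once the mouth is near rest, so it lands in the pause instead of mid-viseme.

```javascript
// Blink gating sketch: a due blink is held back while a viseme is active
// and released in the next inter-phoneme pause. Threshold is illustrative.
function shouldBlinkNow(blinkDue, visemeWeight, threshold = 0.1) {
  return blinkDue && visemeWeight < threshold;
}
```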
Hair and secondary bones animate further down the chain. The typical VRoid export configures only the first joint of a chain as a spring bone — so hair pivoted from the root but the tip stayed static. We extend those chains at load using the author's own settings, so the full length of the hair/tail/etc. responds to motion. Models that shipped with bust bones get the same treatment on bust chains.
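Conceptually the extension is a walk down the chain, copying the root joint's spring settings onto every unconfigured descendant. A sketch using plain objects as stand-ins for the real VRM spring-bone data:

```javascript
// Sketch: extend spring-bone coverage from a configured chain root down to
// the tip, reusing the author's own tuning. Joint objects here are plain
// stand-ins for the real spring-bone structures.
function extendSpringChain(rootJoint) {
  const settings = rootJoint.springSettings; // the author's tuning on the root
  const visited = [];
  let joint = rootJoint;
  while (joint) {
    if (!joint.springSettings) joint.springSettings = { ...settings };
    visited.push(joint.name);
    joint = joint.child || null; // walk toward the chain tip
  }
  return visited;
}
```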
Lipsync got smarter (and safer)
We tried a full rewrite using real-time formant analysis of the playing audio — in theory more natural. In practice, on continuous speech, the mouth just held open because the audio energy never dropped. We reverted. The phoneme-timeline approach is back and the mouth articulates shapes through the sentence rather than locking open.
What did stick: the mouth output is scoped to the visemes the specific VRM actually has, with graceful fallback behavior when it doesn't have them all.
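The scoping step can be sketched as a lookup with a coarse fallback chain. The fallback table and the "aa" last resort below are illustrative, not the shipped mapping; the real set follows the model's own blendshapes:

```javascript
// Viseme scoping sketch: map the timeline's target viseme onto a shape this
// particular VRM actually exposes. FALLBACKS is a hypothetical coarsening
// table, not the real one.
const FALLBACKS = { ee: "ih", ou: "oh", oh: "aa" };

function resolveViseme(target, available) {
  let v = target;
  const seen = new Set();
  while (v && !available.includes(v) && !seen.has(v)) {
    seen.add(v);
    v = FALLBACKS[v]; // step down to a coarser mouth shape
  }
  // Assume "aa" (generic open mouth) exists as a last resort.
  return available.includes(v) ? v : "aa";
}
```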
Emotion tags are finally coherent
The LLM emits things like [happy] or [curious] at the start of a reply. Two problems we fixed:
The prompt now tells the LLM only what the current avatar can render. We parse each VRM's blendshape set at upload time and store the canonical emotion subset. Instead of always listing ten emotions, the prompt now lists the three-to-five the face can actually do. Fewer tokens, and the LLM doesn't learn to emit tags that produce no visible change.
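The subset is just a filter of the canonical list against the parsed blendshape names. A sketch, with a made-up ten-emotion list standing in for the real vocabulary:

```javascript
// Sketch: derive the prompt's emotion list from the VRM's blendshapes,
// parsed once at upload time. CANONICAL is an illustrative stand-in for
// the product's real ten-emotion vocabulary.
const CANONICAL = ["happy", "sad", "angry", "surprised", "curious",
                   "relaxed", "fearful", "disgusted", "bored", "excited"];

function renderableEmotions(blendshapeNames) {
  // Keep only the emotions this face can actually show.
  return CANONICAL.filter((e) => blendshapeNames.includes(e));
}
```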
Voice emotions match face emotions per persona. Our TTS provider supports about sixty tonal tags ((shouting), (whispering), (determined), etc.), but we deliberately hold voice expressiveness to the persona's canonical set. Rationale: Primeta is a 3D-avatar-first product. A voice delivering an emotion the face can't mirror would feel off-brand. The invariant is face emotions ⊇ voice emotions per persona, not the other way around.
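Enforcing the invariant is a set intersection: whatever tonal tags the provider offers, the voice only gets the ones the face can mirror. A sketch with illustrative names:

```javascript
// Enforce face ⊇ voice: the voice may only use emotion tags the persona's
// face can render. Inputs and names are illustrative.
function voiceEmotions(faceEmotions, providerTags) {
  // Anything the face can't mirror is excluded from the voice as well.
  return providerTags.filter((t) => faceEmotions.includes(t));
}
```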
TTS is faster
Four overlapping improvements on the audio pipeline. Each is small individually; collectively, you notice:
- Fish parameters tuned for low latency. We switched from the "balanced" to "low" latency tier, dropped chunk size, dropped mp3 bitrate to 64 kbps, and turned off context conditioning between chunks. Smaller payload, faster time-to-first-byte.
- Keep-alive HTTP connection pool. We were opening a fresh TLS connection per /tts_proxy request. With sentence-level firing that meant 3-5 handshakes per reply. Now it's one persistent connection per Puma thread, reused.
- Real TTFB metric. The "time to first byte" number we record in admin was actually the total synthesis wall-time before. Now it's the real first-byte timestamp from Fish, which makes the "slow request" alerts fire on actual slowness.
- Cleaner emotion-tag conversion. Tags get whitelisted against Fish's vocabulary. Unknown tags ([mysterious], etc.) are dropped rather than passed through as literal parenthetical speech.
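The conversion in that last bullet amounts to: bracket tags from the LLM become Fish's parenthetical tags only if whitelisted, otherwise they vanish. A sketch, with a three-tag whitelist standing in for the provider's roughly sixty:

```javascript
// Tag conversion sketch: [tag] -> (tag) when whitelisted, dropped otherwise,
// so unknown tags never leak into speech as literal text. FISH_TAGS is a
// small illustrative subset of the provider's vocabulary.
const FISH_TAGS = new Set(["shouting", "whispering", "determined"]);

function convertEmotionTags(text) {
  return text.replace(/\[([a-z]+)\]\s*/g, (_, tag) =>
    FISH_TAGS.has(tag) ? `(${tag}) ` : ""
  );
}
```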
Small quality-of-life
A handful of things that don't fit a theme.
- Idle animation "none" option. Some personas don't need a canned idle loop. You can now choose "None" in the wizard and the character just breathes procedurally rather than playing a default clip.
- Admin cover-image generation. Every post auto-generates an AI cover on create (via gpt-image-2). Manual regenerate is available on the edit page if you don't like it. The cover panel updates live via Turbo Streams while the job runs, with clear feedback on success or failure.
- TTS emotion-tag stripping consolidated. All the server-side regex copies were replaced with a single EmotionTag module. If we ever change the tag syntax, we change it in one place.
- Dead code removed. Between the /tts endpoint, the old lip_sync.js bundle, the broken Stream Actions, and the unused approximate_phonemes helper, about 500 lines of code left the tree.
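On the EmotionTag consolidation in the list above: the real module is server-side Ruby, but the shape of the idea translates directly. This JS sketch shows the single-pattern design; changing the tag syntax means changing one regex:

```javascript
// Single source of truth for tag stripping, analogous to consolidating the
// scattered server-side regex copies into one EmotionTag module (the real
// module is Ruby; this is an illustrative JS mirror of the design).
const TAG_PATTERN = /\[[a-z]+\]\s*/g;

function stripEmotionTags(text) {
  return text.replace(TAG_PATTERN, "");
}
```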
Whats next
The biggest item on the board is audio-reactive head micro-motion during speech — the head nodding and swaying slightly in time with the voice, which is one of the last big "this character is pretending to be alive" signals a VRM character can give. Also upgrading from template-based image prompts to LLM-generated image prompts for cover art, since the current prompts produce same-y results across posts.
If you notice anything strange in the avatar or voice, tell us. Most of the above came from a single "the eyes never move" report that unrolled into a week of follow-ups.