VibeVoice Gurudev Clone

Microsoft VibeVoice-1.5B · zero-shot voice cloning from 45s reference · RTX 5070 Ti (rog-beast) · SDPA · bf16

Reference input

45 seconds · Bhakti Sutras Ep 01, 3:00–3:45 · Gurudev speaking English

English clone (sanity check)

12.8s · RTF 2.17x

"Life is a journey. Every experience, every moment, is a gift. When you accept what is, with a smile, joy blooms in the heart."

German cross-lingual clone (the real test)

15.2s · RTF 2.30x

"Das Leben ist eine Reise. Jede Erfahrung, jeder Augenblick, ist ein Geschenk. Wenn du das Gegenwärtige mit einem Lächeln annimmst, erblüht Freude im Herzen."

How to reproduce this on rog-beast

1 Hardware & prereqs

NVIDIA GPU with ≥8 GB VRAM (tested on RTX 5070 Ti, 16 GB, Blackwell sm_120)
Driver ≥570, CUDA 12.8+ (13.2 works)
Python 3.12 + uv + git
Windows native works fine. WSL2 hangs under OpenSSH non-interactive sessions — avoid.
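
The VRAM and driver requirements above can be checked from a script. The following is a minimal sketch, assuming nvidia-smi is on PATH; the helper names (parse_gpu_line, meets_prereqs, query_gpus) are illustrative and the VRAM threshold is deliberately a little under 8192 MiB because cards sold as 8 GB can report slightly less.

```python
import subprocess

MIN_VRAM_MIB = 8000      # ">=8 GB VRAM"; loosened slightly (assumption, see lead-in)
MIN_DRIVER_MAJOR = 570   # "Driver >=570"


def parse_gpu_line(line):
    """Parse one CSV row of the form: name, memory.total (MiB), driver_version."""
    name, mem_mib, driver = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "vram_mib": int(mem_mib),
        "driver_major": int(driver.split(".")[0]),
    }


def meets_prereqs(gpu):
    """True if the GPU satisfies both the VRAM and the driver requirement."""
    return gpu["vram_mib"] >= MIN_VRAM_MIB and gpu["driver_major"] >= MIN_DRIVER_MAJOR


def query_gpus():
    """Ask nvidia-smi for every GPU; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,driver_version",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Typical use would be `for gpu in query_gpus(): print(gpu["name"], meets_prereqs(gpu))`. The check does not cover the CUDA toolkit version or Python itself.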

2 Clone the community mirror

Microsoft removed the VibeVoice TTS code from GitHub in Sept 2025. The weights are still live on Hugging Face (MIT license) and the community preserved the code.

cd G:\Krishi\VibeVoice
git clone https://github.com/shijincai/VibeVoice.git src
cd src

3 Create venv & install

uv venv --python 3.12

# PyTorch with CUDA 12.8 wheels (Blackwell-compatible)
uv pip install --python .venv\Scripts\python.exe torch torchaudio --index-url https://download.pytorch.org/whl/cu128

# VibeVoice deps (transformers 4.51.3 is hard-pinned — don't bump it)
uv pip install --python .venv\Scripts\python.exe ^
  "transformers==4.51.3" "accelerate==1.6.0" "diffusers" ^
  "librosa" "scipy" "numpy" "tqdm" "numba>=0.57" "llvmlite>=0.40" ^
  "ml-collections" "absl-py" "soundfile" "huggingface_hub"

# The vibevoice package itself (editable, no-deps to respect the pin)
uv pip install --python .venv\Scripts\python.exe --no-deps -e .
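
After installing, it may be worth confirming that the pinned versions actually landed in the venv. This is a small sketch using only the standard library; the function name check_pins and the PINS table are illustrative, with the version numbers taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install step; extend as needed.
PINS = {"transformers": "4.51.3", "accelerate": "1.6.0"}


def check_pins(pins=PINS, get_version=version):
    """Return a list of human-readable problems; an empty list means all pins match."""
    problems = []
    for package, wanted in pins.items():
        try:
            found = get_version(package)
        except PackageNotFoundError:
            problems.append(f"{package}: not installed (want {wanted})")
            continue
        if found != wanted:
            problems.append(f"{package}: found {found}, want {wanted}")
    return problems
```

Run it with the venv interpreter (.venv\Scripts\python.exe) so it inspects the right environment, for example `print(check_pins() or "pins OK")`.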

4 Download weights (5.4 GB)

.venv\Scripts\python.exe -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/VibeVoice-1.5B', local_dir=r'G:\Krishi\VibeVoice\models\VibeVoice-1.5B')"
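
A quick way to see whether the download finished is to total the file sizes in the target folder and compare against the roughly 5.4 GB quoted above. This is a rough sketch only: the helper name dir_size_gb is illustrative, and it assumes the quoted figure uses decimal gigabytes, so treat a small mismatch as normal.

```python
from pathlib import Path


def dir_size_gb(root):
    """Sum the sizes of all regular files under root, in decimal gigabytes."""
    total_bytes = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total_bytes / 1e9
```

For example, `print(round(dir_size_gb(r"G:\Krishi\VibeVoice\models\VibeVoice-1.5B"), 2))`. A result far below the expected size suggests an interrupted download; rerunning the snapshot_download command normally resumes it.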

5 Add a voice reference

Drop a clean 30-60 s WAV of the target voice into demo\voices\. Naming: <lang>-<Name>_<gender>.wav (e.g. en-Gurudev_man.wav). The VoiceMapper splits on _ and -, so the speaker name becomes Gurudev, which is the value later passed to --speaker_names. Mono, 24 kHz, dense speech works best.
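
The naming rule and the format recommendations above can be checked before running inference. The sketch below approximates the described name mapping (it is not the VoiceMapper code itself) and uses the standard-library wave module, which only reads PCM WAV files; the function names speaker_name and inspect_reference are illustrative.

```python
import wave
from pathlib import Path


def speaker_name(filename):
    """Approximate the mapping described above: 'en-Gurudev_man.wav' -> 'Gurudev'."""
    stem = Path(filename).stem           # 'en-Gurudev_man'
    before_gender = stem.split("_")[0]   # 'en-Gurudev'
    return before_gender.split("-")[-1]  # 'Gurudev'


def inspect_reference(path):
    """Report channels, sample rate and duration, plus deviations from the advice above."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        seconds = wav.getnframes() / sample_rate

    issues = []
    if channels != 1:
        issues.append(f"expected mono, found {channels} channels")
    if sample_rate != 24000:
        issues.append(f"expected 24000 Hz, found {sample_rate} Hz")
    if not 30 <= seconds <= 60:
        issues.append(f"expected 30-60 s, found {seconds:.1f} s")

    return {
        "speaker": speaker_name(path),
        "channels": channels,
        "sample_rate": sample_rate,
        "seconds": seconds,
        "issues": issues,
    }
```

An empty issues list only means the file matches the stated format; it says nothing about how clean or dense the speech is, which still has to be judged by ear.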

6 Write the script

VibeVoice uses a Speaker N: text format. One line per utterance.

Speaker 1: Das Leben ist eine Reise. Jede Erfahrung, jeder Augenblick, ist ein Geschenk. Wenn du das Gegenwärtige mit einem Lächeln annimmst, erblüht Freude im Herzen.

Save as e.g. demo\text_examples\test_de.txt.
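
To catch formatting slips before a long generation run, the script file can be validated against the Speaker N: text format. This is a strict sketch written for this guide; the real parser in the demo code may be more lenient, and the names LINE_RE and parse_script are illustrative.

```python
import re

# One utterance per line: "Speaker <number>: <text>"
LINE_RE = re.compile(r"^Speaker\s+(\d+):\s*(.+)$")


def parse_script(text):
    """Return [(speaker_number, utterance), ...]; raise ValueError on a malformed line."""
    utterances = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored in this sketch
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Speaker N: text', got {line!r}")
        utterances.append((int(match.group(1)), match.group(2).strip()))
    return utterances
```

For example, `parse_script(open(r"demo\text_examples\test_de.txt", encoding="utf-8").read())` should return a single tuple for the German test above.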

7 Run inference

.venv\Scripts\python.exe demo\inference_from_file.py ^
  --model_path G:\Krishi\VibeVoice\models\VibeVoice-1.5B ^
  --txt_path demo\text_examples\test_de.txt ^
  --speaker_names Gurudev ^
  --output_dir G:\Krishi\VibeVoice\outputs

Output is a 24 kHz mono WAV at outputs\test_de_generated.wav. Expect an RTF of about 2.2× on a 5070 Ti with SDPA fallback (~5 GB VRAM for 1.5B at bf16).
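
For planning longer runs, the RTF figures can be turned into wall-clock estimates. The sketch below assumes RTF means generation time divided by audio duration, so values above 1 are slower than real time; that convention is an assumption here and worth confirming against the script's own log output. The function names are illustrative.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF under the assumed convention: generation time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds, rtf):
    """Invert the ratio: how long generating this much audio should take."""
    return audio_seconds * rtf
```

Under that assumption, the 12.8 s English sample at RTF 2.17 corresponds to roughly 28 s of generation, and the 15.2 s German sample at RTF 2.30 to roughly 35 s.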

8 Known gotchas

Flash-attention 2 not installed — the script prints an error but falls back to SDPA automatically. The model warns "may result in lower audio quality." Installing flash-attn on Windows requires a manual wheel build.

Em-dashes (—) break PowerShell .ps1 files uploaded via scp. Use ASCII hyphens only in PowerShell scripts or you'll get cryptic "string missing terminator" parse errors.

WSL2 hangs under SSH on Windows 11 Pro with OpenSSH in non-interactive admin sessions (UAC split token). Don't fight it — use native Windows Python.

Windows SSH default shell is cmd.exe, not PowerShell. Chain with & not ;. For anything non-trivial, write a .ps1 to disk and invoke powershell -NoProfile -ExecutionPolicy Bypass -File ….

tqdm progress in background output files doesn't flush until newline, so .output files look stuck during generation. Monitor disk growth or nvidia-smi utilization instead of tailing logs.

transformers 4.51.3 is a hard pin. Newer versions break the VibeVoiceForConditionalGenerationInference class. Install the vibevoice package with --no-deps after the pinned transformers to keep it locked.
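
For the PowerShell em-dash gotcha above, a pre-upload scan for non-ASCII characters catches the problem before scp does. This is a minimal sketch; the function name find_non_ascii is illustrative, and it flags every non-ASCII character, not just em-dashes.

```python
def find_non_ascii(text):
    """Return (line, column, character) for every non-ASCII character, 1-indexed."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits
```

To check a script on disk, read it as UTF-8 first, for example `find_non_ascii(open("run.ps1", encoding="utf-8").read())` where run.ps1 is a placeholder name; an empty list means the file is safe to upload as-is.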

9 Lessons from 8 iterations on meditation voice

The Bhakti Sutras (BS) 2019 reference works best. A clean, dense 45 s of Gurudev speaking English in discourse register gives the best voice character. It was the only reference that worked.
Meditation-derived references failed. Demucs-isolated vocals leave phase artifacts; sparse breath-heavy references confuse the speaker characterization; over-compacted references break the model's stopping criterion.
Longer German text produces better clones. One-sentence imperatives come out clipped (4.5 s, feels rushed). Three-sentence contemplative passages reach Test 1 quality (12-15 s). Give VibeVoice room to settle into the voice.
Cross-lingual works out of the box. The 1.5B model card lists only en/zh, but German delivery via the BS reference is solid. The model generalizes beyond its stated languages.
Pace cannot be controlled via text. Ellipses and punctuation only slow delivery by ~20%. Reference pace dominates. If you need meditation pace, the model fights you.

Generated on rog-beast · Model: microsoft/VibeVoice-1.5B · Code: shijincai/VibeVoice (community mirror)