VibeVoice Gurudev Clone

Microsoft VibeVoice-1.5B · zero-shot voice cloning from 45s reference · RTX 5070 Ti (rog-beast) · SDPA · bf16

Reference input

45 seconds · Bhakti Sutras Ep 01, 3:00–3:45 · Gurudev speaking English

English clone (sanity check)

12.8s · RTF 2.17x
"Life is a journey. Every experience, every moment, is a gift. When you accept what is, with a smile, joy blooms in the heart."

German cross-lingual clone (the real test)

15.2s · RTF 2.30x
"Das Leben ist eine Reise. Jede Erfahrung, jeder Augenblick, ist ein Geschenk. Wenn du das Gegenwärtige mit einem Lächeln annimmst, erblüht Freude im Herzen."

How to reproduce this on rog-beast

1 Hardware & prereqs

2 Clone the community mirror

Microsoft removed the VibeVoice TTS code from GitHub in Sept 2025. The weights are still live on Hugging Face (MIT license) and the community preserved the code.

cd /d G:\Krishi\VibeVoice
git clone https://github.com/shijincai/VibeVoice.git src
cd src

3 Create venv & install

uv venv --python 3.12

REM PyTorch with CUDA 12.8 wheels (Blackwell-compatible)
uv pip install --python .venv\Scripts\python.exe torch torchaudio --index-url https://download.pytorch.org/whl/cu128

REM VibeVoice deps (transformers 4.51.3 is hard-pinned; don't bump it)
uv pip install --python .venv\Scripts\python.exe ^
  "transformers==4.51.3" "accelerate==1.6.0" "diffusers" ^
  "librosa" "scipy" "numpy" "tqdm" "numba>=0.57" "llvmlite>=0.40" ^
  "ml-collections" "absl-py" "soundfile" "huggingface_hub"

REM The vibevoice package itself (editable, no-deps to respect the pin)
uv pip install --python .venv\Scripts\python.exe --no-deps -e .
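
Since the transformers pin matters, it can be worth confirming what actually landed in the venv before downloading the weights. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `pin_ok` are illustrative and not part of VibeVoice or uv.

```python
from importlib import metadata

PINNED_TRANSFORMERS = "4.51.3"  # the version this guide pins


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_ok(installed, pinned=PINNED_TRANSFORMERS):
    """True only when the installed version matches the pin exactly."""
    return installed == pinned


if __name__ == "__main__":
    found = installed_version("transformers")
    status = "OK" if pin_ok(found) else "MISMATCH"
    print(f"transformers: installed={found} pinned={PINNED_TRANSFORMERS} -> {status}")
```

Saved under a name of your choice (for example check_pin.py, a hypothetical filename), it would be run with .venv\Scripts\python.exe so that it inspects the venv rather than the system Python.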

4 Download weights (5.4 GB)

.venv\Scripts\python.exe -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/VibeVoice-1.5B', local_dir=r'G:\Krishi\VibeVoice\models\VibeVoice-1.5B')"

5 Add a voice reference

Drop a clean 30-60 s WAV of the target voice into demo\voices\. Name it <lang>-<Name>_<gender>.wav (e.g. en-Gurudev_man.wav). The VoiceMapper splits on _ and -, so the speaker name becomes Gurudev; that is the value to pass to --speaker_names later. Mono, 24 kHz, dense speech works best.
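
To make the naming rule concrete, here is a small sketch of the splitting behavior described above. It is an illustration of the rule, not the VoiceMapper code from the repository, and `speaker_name` is a made-up helper name.

```python
from pathlib import PureWindowsPath


def speaker_name(filename):
    """Derive the speaker name from '<lang>-<Name>_<gender>.wav'."""
    stem = PureWindowsPath(filename).stem  # 'en-Gurudev_man'
    stem = stem.split("_")[0]              # drop '_<gender>' -> 'en-Gurudev'
    return stem.split("-")[-1]             # drop '<lang>-'   -> 'Gurudev'


print(speaker_name(r"demo\voices\en-Gurudev_man.wav"))
```

Under this rule a hyphen or underscore inside the name itself would also be cut (en-Jean-Luc_man.wav would yield Luc in this sketch), so a single-word name is the safe choice.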

6 Write the script

VibeVoice uses a Speaker N: text format. One line per utterance.

Speaker 1: Das Leben ist eine Reise. Jede Erfahrung, jeder Augenblick, ist ein Geschenk. Wenn du das Gegenwärtige mit einem Lächeln annimmst, erblüht Freude im Herzen.

Save as e.g. demo\text_examples\test_de.txt.
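
If the script text comes from another program, generating the file avoids hand-editing mistakes such as an utterance wrapped over two lines. The sketch below renders the Speaker N: text format and writes it as UTF-8, on the assumption that the inference script reads the file as UTF-8; `format_script` and `write_script` are hypothetical helper names.

```python
from pathlib import Path


def format_script(utterances):
    """Render (speaker_number, text) pairs as 'Speaker N: text' lines."""
    lines = []
    for speaker, text in utterances:
        # Collapse internal newlines so each utterance stays on one line.
        one_line = " ".join(text.split())
        lines.append(f"Speaker {speaker}: {one_line}")
    return "\n".join(lines) + "\n"


def write_script(path, utterances):
    """Write the script as UTF-8 so umlauts survive the round trip."""
    Path(path).write_text(format_script(utterances), encoding="utf-8")


if __name__ == "__main__":
    print(format_script([(1, "Das Leben ist eine Reise.")]), end="")
```

Calling write_script with the path demo\text_examples\test_de.txt and a list of (1, text) pairs would produce a file in the same shape as the example above.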

7 Run inference

.venv\Scripts\python.exe demo\inference_from_file.py ^
  --model_path G:\Krishi\VibeVoice\models\VibeVoice-1.5B ^
  --txt_path demo\text_examples\test_de.txt ^
  --speaker_names Gurudev ^
  --output_dir G:\Krishi\VibeVoice\outputs

Output is a 24 kHz mono WAV at outputs\test_de_generated.wav. Expect an RTF of about 2.2× on a 5070 Ti with the SDPA fallback, in line with the two samples at the top of this page, and roughly 5 GB of VRAM for the 1.5B model at bf16.
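
For planning longer scripts, the RTF figure can be turned into a wall-clock estimate. This sketch rests on two assumptions not stated in the figures themselves: that RTF means generation time divided by audio duration, and that the 12.8 s and 15.2 s values reported for the samples are audio durations. The function name is illustrative.

```python
def estimated_generation_seconds(audio_seconds, rtf):
    """Wall-clock estimate, assuming RTF = generation time / audio duration."""
    return audio_seconds * rtf


# Figures reported for the two samples in this guide.
samples = {"English": (12.8, 2.17), "German": (15.2, 2.30)}
for name, (audio_s, rtf) in samples.items():
    wall = estimated_generation_seconds(audio_s, rtf)
    print(f"{name}: {audio_s:.1f} s of audio -> about {wall:.0f} s to generate")
```

Under those assumptions the two samples took roughly 28 s and 35 s to generate, so a minute of output would need a little over two minutes on this card.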

8 Known gotchas

Flash-attention 2 not installed — the script prints an error but falls back to SDPA automatically. The model warns "may result in lower audio quality." Installing flash-attn on Windows requires a manual wheel build.
Em-dashes (—) break PowerShell .ps1 files uploaded via scp. Use ASCII hyphens only in PowerShell scripts or you'll get cryptic "string missing terminator" parse errors.
WSL2 hangs under SSH on Windows 11 Pro with OpenSSH in non-interactive admin sessions (UAC split token). Don't fight it — use native Windows Python.
Windows SSH default shell is cmd.exe, not PowerShell. Chain with & not ;. For anything non-trivial, write a .ps1 to disk and invoke powershell -NoProfile -ExecutionPolicy Bypass -File ….
tqdm progress in background output files doesn't flush until newline, so .output files look stuck during generation. Monitor disk growth or nvidia-smi utilization instead of tailing logs.
transformers 4.51.3 is a hard pin. Newer versions break the VibeVoiceForConditionalGenerationInference class. Install the vibevoice package with --no-deps after the pinned transformers to keep it locked.
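
The em-dash gotcha is easy to catch before uploading. A likely explanation, not confirmed by the runs described here, is that Windows PowerShell 5.1 reads a UTF-8 file without a BOM using the ANSI code page, so the bytes of U+2014 decode into characters that include a curly quote, which the parser treats as a string delimiter. The sketch below lists every non-ASCII character in a script so it can be replaced first; `find_non_ascii` and `check_file` are hypothetical helper names.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


def check_file(path):
    """Scan a script on disk, assuming it was saved as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return find_non_ascii(handle.read())


if __name__ == "__main__":
    sample = 'Write-Host "a \u2014 b"\nexit 0'
    for line_no, col, ch in find_non_ascii(sample):
        print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```

Running check_file on a .ps1 before scp and fixing anything it reports keeps the script pure ASCII, which sidesteps the encoding question entirely.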

9 Lessons from 8 iterations on meditation voice