Microsoft VibeVoice-1.5B · zero-shot voice cloning from 45s reference · RTX 5070 Ti (rog-beast) · SDPA · bf16
uv + git
Microsoft removed the VibeVoice TTS code from GitHub in September 2025. The weights are still live on Hugging Face (MIT license), and the community preserved the code.
cd G:\Krishi\VibeVoice
git clone https://github.com/shijincai/VibeVoice.git src
cd src
uv venv --python 3.12

REM PyTorch with CUDA 12.8 wheels (Blackwell-compatible)
uv pip install --python .venv\Scripts\python.exe torch torchaudio --index-url https://download.pytorch.org/whl/cu128

REM VibeVoice deps (transformers 4.51.3 is hard-pinned; don't bump it)
uv pip install --python .venv\Scripts\python.exe ^
  "transformers==4.51.3" "accelerate==1.6.0" "diffusers" ^
  "librosa" "scipy" "numpy" "tqdm" "numba>=0.57" "llvmlite>=0.40" ^
  "ml-collections" "absl-py" "soundfile" "huggingface_hub"

REM The vibevoice package itself (editable, no-deps to respect the pin)
uv pip install --python .venv\Scripts\python.exe --no-deps -e .
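Since the transformers pin matters, it can be worth confirming what actually landed in the venv before downloading weights. A minimal sketch using only the standard library; the helper name check_pins and the EXPECTED table are illustrative and not part of VibeVoice. Run it with .venv\Scripts\python.exe.

```python
from importlib import metadata

# Expected versions, copied from the install commands in this guide.
EXPECTED = {"transformers": "4.51.3", "accelerate": "1.6.0"}


def check_pins(expected):
    """Hypothetical helper: map package -> (installed_or_None, expected, ok)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, wanted, found == wanted)
    return report


if __name__ == "__main__":
    for name, (found, wanted, ok) in check_pins(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={found} expected={wanted} {status}")
```

A MISMATCH on transformers usually means a later install pulled in a newer version, which is what the --no-deps flag on the editable install is meant to prevent.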
.venv\Scripts\python.exe -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/VibeVoice-1.5B', local_dir=r'G:\Krishi\VibeVoice\models\VibeVoice-1.5B')"
Drop a clean 30-60 s WAV of the target voice into demo\voices\. Naming: <lang>-<Name>_<gender>.wav (e.g. en-Gurudev_man.wav). The VoiceMapper splits on _ and -, so the speaker name becomes Gurudev. Mono, 24 kHz, dense speech works best.
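The naming rule above can be sketched in a few lines. This is an illustration of the described behavior, not the VoiceMapper source: it assumes the part before the first _ is kept and then the last - separated token is taken. The function name speaker_name_from_filename is hypothetical.

```python
from pathlib import Path


def speaker_name_from_filename(filename):
    """Sketch of the <lang>-<Name>_<gender>.wav rule (assumed, not the real code)."""
    stem = Path(filename).stem           # "en-Gurudev_man"
    if "_" in stem:
        stem = stem.split("_")[0]        # drop the gender suffix -> "en-Gurudev"
    if "-" in stem:
        stem = stem.split("-")[-1]       # drop the language prefix -> "Gurudev"
    return stem


print(speaker_name_from_filename("en-Gurudev_man.wav"))  # Gurudev
```

Under this assumed rule a name that itself contains a hyphen (say en-Mary-Ann_woman.wav) would come out as Ann, so plain single-token names are the safer choice.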
VibeVoice uses a Speaker N: text format. One line per utterance.
Speaker 1: Das Leben ist eine Reise. Jede Erfahrung, jeder Augenblick, ist ein Geschenk. Wenn du das Gegenwärtige mit einem Lächeln annimmst, erblüht Freude im Herzen.
Save as e.g. demo\text_examples\test_de.txt.
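Before a long generation run it may help to check that every line of the script really follows the Speaker N: text shape. The sketch below is a stricter pre-flight check than the inference script necessarily applies; parse_script and its regex are assumptions for illustration, not VibeVoice's own parser.

```python
import re

# One utterance per line: "Speaker <number>: <text>"
LINE_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(text):
    """Return [(speaker_number, utterance)]; raise ValueError on a malformed line."""
    turns = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped in this sketch
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(
                f"line {lineno}: expected 'Speaker N: text', got {line!r}"
            )
        turns.append((int(match.group(1)), match.group(2).strip()))
    return turns
```

When reading the file, open it with encoding="utf-8" so umlauts such as those in the German sample survive, for example parse_script(open(path, encoding="utf-8").read()).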
.venv\Scripts\python.exe demo\inference_from_file.py ^
  --model_path G:\Krishi\VibeVoice\models\VibeVoice-1.5B ^
  --txt_path demo\text_examples\test_de.txt ^
  --speaker_names Gurudev ^
  --output_dir G:\Krishi\VibeVoice\outputs
Output is a 24 kHz mono WAV at outputs\test_de_generated.wav. Expect ~2.2× real-time on a 5070 Ti with SDPA fallback (~5 GB VRAM for 1.5B at bf16).
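To confirm the output format and put a number on speed, something like the following may help. It is a sketch under two assumptions: the WAV is PCM-encoded (the standard-library wave module rejects float WAVs), and real-time factor is defined as wall-clock seconds divided by audio seconds, which is one common convention and may not be the one behind the figure quoted above. Both function names are illustrative.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / float(rate)
    return rate, channels, duration


def real_time_factor(wall_seconds, audio_seconds):
    """Assumed definition: generation time divided by generated audio length."""
    return wall_seconds / audio_seconds
```

For the run above, wav_info(r"G:\Krishi\VibeVoice\outputs\test_de_generated.wav") should report 24000 and 1 for rate and channels; the same check works on a reference clip in demo\voices\ before cloning.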
scp. Use ASCII hyphens only in PowerShell scripts or you'll get cryptic "string missing terminator" parse errors.
& not ;. For anything non-trivial, write a .ps1 to disk and invoke powershell -NoProfile -ExecutionPolicy Bypass -File ….
output files look stuck during generation. Monitor disk growth or nvidia-smi utilization instead of tailing logs.
VibeVoiceForConditionalGenerationInference class. Install the vibevoice package with --no-deps after the pinned transformers to keep it locked.
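The ASCII-hyphen advice is easy to check mechanically before a .ps1 goes to the remote machine. The sketch below flags every non-ASCII character, which is broader than hyphens alone (it will also flag legitimate characters such as umlauts inside strings), so treat hits as things to review. find_non_ascii is a hypothetical helper, not part of any tool mentioned here.

```python
def find_non_ascii(text):
    """Return [(line_number, column, char)] for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits


# Example: an en dash (U+2013) pasted in place of a plain hyphen.
sample = "powershell \u2013NoProfile"
for lineno, col, ch in find_non_ascii(sample):
    print(f"line {lineno}, column {col}: U+{ord(ch):04X}")  # line 1, column 12: U+2013
```

To scan a script on disk, read it with open(path, encoding="utf-8").read() and pass the result to find_non_ascii.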