her-voice
Give your agent a voice.
Setup & Installation
Install command
clawhub install matusvojtek/her-voiceIf the CLI is not installed:
Install command
npx clawhub@latest install matusvojtek/her-voiceOr install with OpenClaw CLI:
Install command
openclaw skills install matusvojtek/her-voiceor paste the repo link into your assistant's chat
Install command
https://github.com/openclaw/skills/tree/main/skills/matusvojtek/her-voiceWhat This Skill Does
Her Voice adds spoken audio responses to an agent using Kokoro TTS, a local 82M-parameter neural model. Audio streams on-device with no API keys or cloud dependencies. Supports a real-time LED visualizer, a background daemon for low latency, and voice blending between multiple voices.
Runs entirely on-device with no API keys or cloud costs, while still achieving sub-second first-audio latency via streaming and an optional warm-model daemon.
When to Use It
- Hearing agent replies read aloud while working hands-free
- Listening to long code summaries without reading the screen
- Using persist mode as a standalone voice station for pasted text
- Saving spoken responses as WAV files for later playback
- Getting instant audio feedback when the agent starts responding
View original SKILL.md file
# Her Voice ποΈ
**Give your agent a voice.** Audio responses powered by Kokoro TTS β a compact, naturally expressive model running entirely on-device.
### β¨ Features
Highly optimized response time thanks to on-the-fly audio streaming technology. 100% free, no API keys required. Inspired by Samantha and Sky.
- **β‘ On-the-fly Streaming** β Audio plays as it generates, very low latency
- **π The Voice of an angel** β Cutting-edge local text-to-speech model Kokoro TTS
- **π§ TTS Daemon** β Keep the model warm in RAM for instant responses (can be disabled to save RAM)
- **π₯οΈ Persist Mode** β Drag & drop audio, paste text, use as a voice station
- **π§ Fully Configurable** β Voice, speed, visualizer, notification sounds
- **π MLX + PyTorch** β Native Metal acceleration on Apple Silicon, PyTorch fallback everywhere else
- **π¨ Real-time Visualizer** β Floating 60fps LED bars that react to speech (macOS only)
## First-Run Setup
```bash
python3 SKILL_DIR/scripts/setup.py
```
> **Note:** `SKILL_DIR` is the root directory of this skill β the agent resolves it automatically when running commands.
The setup wizard will:
1. Detect platform and select TTS engine (MLX on Apple Silicon, PyTorch elsewhere)
2. Find or install the appropriate TTS backend (mlx-audio or kokoro)
3. Install `espeak-ng` (Homebrew on macOS, apt on Linux)
4. Patch espeak loader if needed (macOS compatibility)
5. Compile the native visualizer binary (macOS only)
6. Download the Kokoro model
7. Create config at `~/.her-voice/config.json`
Check status anytime:
```bash
python3 SKILL_DIR/scripts/setup.py status
```
### Post-Setup: Names & Pronunciation
After setup, configure the agent and user names:
```bash
python3 SKILL_DIR/scripts/config.py set agent_name "Jackie"
python3 SKILL_DIR/scripts/config.py set user_name "MatΓΊΕ‘"
python3 SKILL_DIR/scripts/config.py set user_name_tts "Mah-toosh"
```
**TTS pronunciation tip:** If the user's name is non-English, figure out a phonetic English spelling that Kokoro will pronounce correctly. Store it in `user_name_tts` and use that spelling whenever speaking the name aloud. The real name stays in `user_name` for display purposes.
## Speaking Text
```bash
# Basic usage
python3 SKILL_DIR/scripts/speak.py "Hello, world!"
# Skip visualizer for this call
python3 SKILL_DIR/scripts/speak.py --no-viz "Quick note"
# Save to file instead of playing
python3 SKILL_DIR/scripts/speak.py --save /tmp/output.wav "Save this"
# Override voice or speed
python3 SKILL_DIR/scripts/speak.py --voice af_bella --speed 1.2 "Faster!"
# Pipe text from stdin
echo "Piped text" | python3 SKILL_DIR/scripts/speak.py
```
### Options
| Flag | Description |
|------|-------------|
| `--no-viz` | Skip the visualizer for this call |
| `--persist` | Keep visualizer open after playback ends |
| `--save PATH` | Save audio to WAV file instead of playing |
| `--voice NAME` | Override the configured voice |
| `--speed N` | Override the configured speed multiplier |
| `--mode MODE` | Override visualizer mode (`v2` or `classic`) |
## Agent Workflow
When the user wants voice responses:
1. **Check voice mode** β is voice enabled or did the user ask for it?
2. **Play notification sound** (instant feedback while TTS generates):
```bash
afplay /System/Library/Sounds/Blow.aiff &
```
3. **Speak the response:**
```bash
python3 SKILL_DIR/scripts/speak.py "Response text here"
```
4. **Always provide text alongside voice** β accessibility matters.
### Notification Sound
The notification sound plays instantly (~0.1s) while TTS generates (~0.3-3s). This gives the user immediate feedback that the agent is responding.
Configure in `~/.her-voice/config.json`:
```json
{
"notification_sound": {
"enabled": true,
"sound": "Blow"
}
}
```
Available macOS sounds: `Blow`, `Bottle`, `Frog`, `Funk`, `Glass`, `Hero`, `Morse`, `Ping`, `Pop`, `Purr`, `Sosumi`, `Submarine`, `Tink`. Located in `/System/Library/Sounds/`.
## TTS Daemon
The daemon keeps the Kokoro model warm in RAM, eliminating ~1.1s of startup overhead per call.
The daemon auto-resolves the mlx-audio venv β no need to find the venv Python manually.
```bash
# Start (persists in background)
nohup python3 SKILL_DIR/scripts/daemon.py start > /tmp/her-voice-daemon.log 2>&1 & disown
# Status
python3 SKILL_DIR/scripts/daemon.py status
# Stop
python3 SKILL_DIR/scripts/daemon.py stop
# Restart
python3 SKILL_DIR/scripts/daemon.py restart
```
`speak.py` auto-detects the daemon: uses it if available, falls back to direct model loading.
**The daemon is optional.** Without it, speech still works β just ~1s slower per call as the model loads each time. Skip the daemon to save ~2.3GB RAM.
**Note:** The daemon writes its PID file and socket after the model is fully loaded and ready to accept connections. They live in `~/.her-voice/` with restricted permissions (owner-only access). The daemon won't survive a reboot β start it again after restart if needed.
## Visualizer
A floating overlay with three animated LED bars that react to speech in real-time. 60fps, native macOS (Cocoa + AVFoundation). **macOS only** β on other platforms, audio plays without the visualizer.
### Modes
- **v2** (default) β Three-tier pure red, center raw amplitude, sides with lag
- **classic** β Original smooth gradient look
### Controls
| Key | Action |
|-----|--------|
| ESC | Quit |
| Space | Pause/Resume (file mode) |
| β β | Seek Β±5s (file mode) |
| βV | Paste text to speak (persist mode) |
### Persist Mode
Keep the visualizer on screen between playbacks. Use as a standalone voice station:
```bash
# Launch in persist mode (stays open, idle breathing animation)
~/.her-voice/bin/her-voice-viz --persist
# Stream mode + persist (stays open after speech ends)
python3 SKILL_DIR/scripts/speak.py --persist "Hello!"
```
In persist mode:
- **Drag & drop** audio files (.wav, .mp3, .aiff, .m4a) onto the visualizer to play them
- **βV** pastes clipboard text β streams directly from TTS daemon with full visualizer animation
- **Idle breathing** β subtle center bar pulse when waiting for input
### Standalone Usage
```bash
# Play a file with visualizer
~/.her-voice/bin/her-voice-viz --audio /path/to/file.wav
# Demo mode (simulated audio)
~/.her-voice/bin/her-voice-viz --demo
# Stream raw PCM
cat audio.raw | ~/.her-voice/bin/her-voice-viz --stream --sample-rate 24000
```
### Disable Visualizer
```bash
python3 SKILL_DIR/scripts/config.py set visualizer.enabled false
```
## Configuration
Config file: `~/.her-voice/config.json`
```bash
# View all settings
python3 SKILL_DIR/scripts/config.py status
# Get a value
python3 SKILL_DIR/scripts/config.py get voice
# Set a value (dot notation for nested keys)
python3 SKILL_DIR/scripts/config.py set speed 1.1
python3 SKILL_DIR/scripts/config.py set visualizer.mode classic
```
### Key Settings
| Key | Default | Description |
|-----|---------|-------------|
| `agent_name` | `""` | Agent's name (e.g. "Jackie") |
| `user_name` | `""` | User's real name |
| `user_name_tts` | `""` | Phonetic spelling for TTS (e.g. "Mah-toosh" for MatΓΊΕ‘) |
| `voice` | `af_heart` | Base voice name |
| `voice_blend` | `{af_heart: 0.6, af_sky: 0.4}` | Voice blend weights |
| `speed` | `1.05` | Speech speed multiplier |
| `language` | `en` | Language code |
| `tts_engine` | `auto` | TTS engine: `auto`, `mlx`, or `pytorch` |
| `model` | `mlx-community/Kokoro-82M-bf16` | Model identifier (MLX) |
| `visualizer.enabled` | `true` | Show visualizer overlay |
| `visualizer.mode` | `v2` | Animation mode (v2/classic) |
| `visualizer.remember_position` | `true` | Save window position between sessions |
| `notification_sound.enabled` | `true` | Play sound before speaking |
| `notification_sound.sound` | `Blow` | macOS system sound name |
| `daemon.auto_start` | `true` | Advisory flag only β the daemon never self-starts. When `true`, the agent should start it on first voice use (saves ~1s/call, costs ~2.3GB RAM) |
| `daemon.socket_path` | `~/.her-voice/tts.sock` | Unix socket path |
## Voice Selection
### Voice Blending
Mix multiple voices for a unique sound. Configure `voice_blend` in config:
```json
{
"voice_blend": {"af_heart": 0.6, "af_sky": 0.4}
}
```
The blended voice is stored as a `.safetensors` file in the model's voices directory (e.g., `af_heart_60_af_sky_40.safetensors`). Create it by running TTS once β `speak.py` looks for the pre-blended file automatically.
## Error Handling
| Error | Cause | Fix |
|-------|-------|-----|
| "mlx-audio not found" | Venv missing or broken | Run `setup.py` |
| "espeak-ng not found" | Phonemizer missing | `brew install espeak-ng` |
| Compilation failed | Xcode tools missing | `xcode-select --install` |
| "Model not found" | First run, no download | Run `setup.py` or speak once |
| Daemon "not running" | Crashed or rebooted | Start daemon again |
| No sound output | macOS audio permissions | Check System Settings β Sound β Output |
| Visualizer not showing | Binary not compiled | Run `setup.py` |
| "kokoro not found" | PyTorch venv missing | Run `setup.py` |
| PyTorch CUDA error | GPU driver mismatch | `pip install torch --force-reinstall` in kokoro venv |
| "soundfile not found" | Missing dependency | `pip install soundfile` in kokoro venv |
## Requirements
- **macOS + Apple Silicon** recommended for best experience (MLX engine + visualizer + notification sounds)
- **Linux/Intel Mac** supported via PyTorch Kokoro engine (no visualizer)
- **Windows** is not supported
- Xcode Command Line Tools for visualizer on macOS (`xcode-select --install`)
- `espeak-ng` for phonemization (`brew install espeak-ng` on macOS, `apt install espeak-ng` on Linux)
- ~500MB disk (model + venv)
- ~2.3GB RAM when daemon is running
## Uninstall
Remove all Her Voice data (config, venvs, compiled binary, daemon state):
```bash
python3 SKILL_DIR/scripts/daemon.py stop
rm -rf ~/.her-voice
```
## How It Works
1. **Kokoro 82M** β A compact neural TTS model with two backends: **MLX** (Apple's framework for native Metal GPU acceleration on Apple Silicon) and **PyTorch** (works everywhere). The engine is auto-detected based on platform, or can be forced via the `tts_engine` config option (`auto`, `mlx`, or `pytorch`)
2. **Streaming** β Audio generates and plays simultaneously. First sound in ~0.3s (with daemon) vs ~3s batch
3. **Visualizer** β Native macOS app (Swift/Cocoa) reads raw PCM from stdin, plays via AVAudioEngine with real-time amplitude metering
4. **Daemon** β Unix socket server holding the model in RAM. Eliminates Python import + model load overhead on every call
Example Workflow
Here's how your AI assistant might use this skill in practice.
User asks: Hearing agent replies read aloud while working hands-free
- 1Hearing agent replies read aloud while working hands-free
- 2Listening to long code summaries without reading the screen
- 3Using persist mode as a standalone voice station for pasted text
- 4Saving spoken responses as WAV files for later playback
- 5Getting instant audio feedback when the agent starts responding
Give your agent a voice.
Security Audits
These signals reflect official OpenClaw status values. A Suspicious status means the skill should be used with extra caution.
Similar Skills
VIEW ALLcult-of-carcinization
Give your agent a voice β and ears.
freshbooks-cli
FreshBooks CLI for managing invoices, clients, and billing.
eternal-haven-lore-pack
Eternal Haven Chronicles lore + mythic persona pack.
briefing-room
Daily news briefing generator β produces a conversational radio-host-style audio briefing + DOCX document covering.