squawk
Push-to-talk voice client for a local llama.cpp multimodal model. Hold a key, speak, release — your audio goes straight to the model (no separate STT; the model's audio encoder handles speech) and the reply streams back into a terminal UI, optionally spoken aloud (streaming TTS with barge-in; see below).
Built as a Cargo workspace so the core logic can be reused by future frontends (GUI, web).
🎙 PUSH-TO-TALK — hold [SPACE] to talk, release to send
● REC 2.3s
🤖 The capital of France is Paris.
[SPACE] talk [m] swap mode [d] device [q] quit
Workspace layout
squawk/
├── crates/
│ ├── squawk-core/ # UI-agnostic: config, conversation, audio capture, streaming client
│ └── squawk-tui/ # terminal frontend (clap + crossterm + ratatui)
| squawk-core module | responsibility |
|---|---|
| config | layered TOML config (file + env + CLI): model, endpoint, audio, logging, prompt |
| conversation | turn history + partial-response carry-through |
| model_client | async, cancellable SSE streaming; splits content vs reasoning |
| recorder | cpal continuous capture → raw mono buffer (pre-roll lead-in) |
| processing | trait-based audio pipeline: one Stage per file, run + resample → WAV |
| analysis | live syllable-rate (speech-speed) estimation from the audio envelope |
| tts | text-to-speech backends behind a Tts trait: tone placeholder + kokoro (feature) |
| speech | streaming, cancellable playback sink with a position readout (karaoke + barge-in) |
| responder | output interpreter: strips [mood:…] → style, sentence-chunked TTS, heard-boundary mapping |
| player | replay a WAV on the default output device (cpal) with progress |

| squawk-tui module | responsibility |
|---|---|
| input | crossterm key mapping; Kitty enhancement flags for key-release |
| app | state machine (idle → recording → streaming) + ratatui UI |
| main | CLI (device listing/selection), current-thread runtime |

Adding a squawk-gui or squawk-web crate later means depending on
squawk-core and implementing a new frontend against the same Recorder,
ModelClient, and Conversation types.
Requirements
- A llama.cpp router (or server) at http://localhost:8080 serving an audio-capable model. Defaults target unsloth/gemma-4-12b-it-GGUF:Q4_K_M with its mmproj loaded. Override with --base-url / --model.
- A terminal that speaks the Kitty keyboard protocol for true hold-to-talk (e.g. ghostty, kitty, foot). Otherwise squawk auto-falls back to toggle mode.
- System libs: ALSA (Linux) / the platform audio backend cpal targets.
Run
cargo run --bin squawk # launch the TUI
cargo run --bin squawk -- --list-devices
cargo run --bin squawk -- --device 6 --model some/other-model
Configuration
squawk reads a single TOML file. Generate a commented starter at the user config
path (~/.config/squawk/config.toml), then edit it:
cargo run --bin squawk -- --init-config # write the default config
cargo run --bin squawk -- --print-config # show the effective config (api key redacted)
You only set the keys you want to change; everything else falls back to the
built-in defaults (crates/squawk-core/config.default.toml, embedded at compile
time). Values resolve from several layers — highest priority first:
- CLI flags: --model, --device, --base-url
- environment: SQUAWK_* (SQUAWK_API_KEY, SQUAWK_BASE_URL, SQUAWK_MODEL, SQUAWK_LOG_DIR, SQUAWK_LOG, SQUAWK_LOG_LEVEL, SQUAWK_LOG_AUDIO)
- --config <path>: an explicitly chosen file
- ./squawk.toml: project-local (handy for dev)
- ~/.config/squawk/config.toml: user config (honours XDG_CONFIG_HOME)
- compiled defaults
Files merge (a project file overrides only the keys it sets). For a hosted
OpenAI-compatible endpoint, set endpoint.api_key — but prefer the
SQUAWK_API_KEY env var over committing a secret to a file; it is never written
to logs or --print-config.
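
The layering above amounts to "first layer that sets a key wins". A minimal sketch of that rule for a single key, assuming each layer has already been parsed into an optional string; `resolve` and `resolve_base_url` are hypothetical names for illustration, not squawk's actual config API:

```rust
/// Hypothetical helper: return the value from the first layer that sets it.
/// `layers` is ordered highest priority first; `None` means "not set in this layer".
fn resolve<'a>(layers: &[Option<&'a str>], default: &'a str) -> &'a str {
    layers.iter().flatten().copied().next().unwrap_or(default)
}

/// Resolve the endpoint URL in the documented order:
/// CLI flag, environment, --config file, ./squawk.toml, user config, compiled default.
fn resolve_base_url<'a>(
    cli: Option<&'a str>,
    env: Option<&'a str>,
    explicit_file: Option<&'a str>,
    project_file: Option<&'a str>,
    user_file: Option<&'a str>,
) -> &'a str {
    resolve(
        &[cli, env, explicit_file, project_file, user_file],
        "http://localhost:8080", // the documented default endpoint
    )
}
```

Because every layer is optional, a project file that sets only one key leaves all other keys to the layers beneath it, which is the merge behaviour described above.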
Controls
| key | action |
|---|---|
| SPACE | hold to talk, release to send (or press/press in toggle mode) |
| p | replay the last recording, with processing applied (validate by ear) |
| t | toggle discussion on/off (draft mode, see below) |
| 1–5 | toggle an audio-processing stage live (see Audio processing) |
| [ / ] | speak the reply slower / faster (TTS speed, applies to the next reply) |
| m | swap capture mode (hold ↔ toggle); only when the terminal supports key-release |
| ↑/↓, PgUp/PgDn | scroll the conversation (auto-follows the latest reply) |
| d | open the input-device picker (j/k move, Enter select, d/Esc cancel) |
| q / Ctrl-C | quit |

While a reply is streaming, pressing SPACE again interrupts it (keeps the
partial answer) and starts a new recording. When replies are spoken (see
below), the same press also stops playback and records how much you actually
heard — so the next turn's context reflects what reached your ears, not the
full generated text (TTS synthesis runs ahead of playback). Assistant replies
render markdown (code fences, headings, bold, inline code). While
recording, the header shows a live speaking-rate meter (syllables/sec) —
very fast speech is harder for the audio encoder to understand.
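
To illustrate the "records how much you actually heard" step, here is a rough sketch of mapping a playback position back onto text. It assumes each synthesized sentence knows its length in samples and that text advances linearly with audio inside a sentence; `SpokenChunk` and `heard_text` are hypothetical names, and the real heard-boundary mapping in squawk-core may work differently:

```rust
/// One synthesized sentence and the number of audio samples it occupies (hypothetical type).
struct SpokenChunk {
    text: String,
    samples: usize,
}

/// Estimate the text that reached the speaker, given how many samples were played.
fn heard_text(chunks: &[SpokenChunk], played_samples: usize) -> String {
    let mut remaining = played_samples;
    let mut heard = String::new();
    for chunk in chunks {
        if remaining >= chunk.samples {
            // This sentence was played in full.
            heard.push_str(&chunk.text);
            remaining -= chunk.samples;
            continue;
        }
        // Interrupted mid-sentence: keep a proportional number of whole words.
        // (chunk.samples > remaining >= 0 here, so the division is safe.)
        let words: Vec<&str> = chunk.text.split_whitespace().collect();
        let keep = words.len() * remaining / chunk.samples;
        heard.push_str(&words[..keep].join(" "));
        break;
    }
    heard.trim_end().to_string()
}
```

Cutting at a word boundary keeps the stored partial answer readable when it is fed back as context for the next turn.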
Audio processing (1–5)
Each recording runs through a pipeline of Stages before being sent (and before
replay, so p plays exactly what the model receives). Stages toggle live via the
number keys; the header's fx line shows what's on (green) or off (dim):
| key | stage | what |
|---|---|---|
| 1 | dc | DC-offset removal |
| 2 | hp | 80 Hz high-pass (rumble / handling noise) |
| 3 | denoise | RNNoise denoiser (nnnoiseless); off by default (heavier) |
| 4 | trim | drop leading/trailing silence (fixes the startup blank) |
| 5 | norm | level normalization with headroom: lift toward −20 dBFS RMS, gain-capped, then peak-limited to −3 dBFS (for speech the −3 ceiling usually binds, so this is effectively peak-normalize-with-headroom, never clipping) |

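
The arithmetic behind the norm stage can be sketched as follows. This is an illustration of the description in the table, not the actual implementation: the function names are hypothetical, the gain cap is passed in because its real value is not stated here, and this version will also attenuate a recording that is already louder than the target.

```rust
/// Convert decibels (relative to full scale) to a linear amplitude factor.
fn db_to_linear(db: f32) -> f32 {
    10f32.powf(db / 20.0)
}

/// Gain that lifts the RMS toward -20 dBFS, capped at `max_gain_db`,
/// and then limited so the peak never exceeds -3 dBFS.
fn normalization_gain(samples: &[f32], max_gain_db: f32) -> f32 {
    if samples.is_empty() {
        return 1.0;
    }
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if rms <= f32::EPSILON || peak <= f32::EPSILON {
        return 1.0; // silence: nothing to normalize
    }
    let wanted = db_to_linear(-20.0) / rms; // RMS target
    let capped = wanted.min(db_to_linear(max_gain_db)); // gain cap (value is an assumption)
    let ceiling = db_to_linear(-3.0) / peak; // peak limit
    capped.min(ceiling)
}
```

Speech has a high peak-to-RMS ratio, so the peak ceiling is usually the smallest of the three terms, which matches the note in the table that the −3 dBFS limit is the one that typically binds.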
Adding a treatment is just a new file in processing/ implementing Stage plus
one line in Pipeline::default_voice. Pre-emphasis is intentionally not
included — it's a classical-ASR front-end that tends to hurt a neural audio encoder.
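
As a rough picture of what "a new file implementing Stage plus one line in Pipeline::default_voice" could look like, here is a self-contained sketch. The trait and method signatures are assumptions made for illustration; only the names Stage, Pipeline and default_voice come from the text above.

```rust
/// Assumed shape of the trait: a named, in-place transform over mono samples.
trait Stage {
    fn name(&self) -> &'static str;
    fn process(&mut self, samples: &mut Vec<f32>, sample_rate: u32);
}

/// Example stage: remove the DC offset by subtracting the mean.
struct DcRemove;

impl Stage for DcRemove {
    fn name(&self) -> &'static str {
        "dc"
    }
    fn process(&mut self, samples: &mut Vec<f32>, _sample_rate: u32) {
        if samples.is_empty() {
            return;
        }
        let mean = samples.iter().sum::<f32>() / samples.len() as f32;
        for s in samples.iter_mut() {
            *s -= mean;
        }
    }
}

/// Ordered stages, each with an on/off flag that can be flipped live.
struct Pipeline {
    stages: Vec<(Box<dyn Stage>, bool)>,
}

impl Pipeline {
    /// A new treatment would add one line here.
    fn default_voice() -> Self {
        Pipeline {
            stages: vec![(Box::new(DcRemove) as Box<dyn Stage>, true)],
        }
    }

    /// Flip a stage by name; returns the new state (false if the name is unknown).
    fn toggle(&mut self, name: &str) -> bool {
        for (stage, enabled) in self.stages.iter_mut() {
            if stage.name() == name {
                *enabled = !*enabled;
                return *enabled;
            }
        }
        false
    }

    /// Run every enabled stage in order.
    fn run(&mut self, samples: &mut Vec<f32>, sample_rate: u32) {
        for (stage, enabled) in self.stages.iter_mut() {
            if *enabled {
                stage.process(samples, sample_rate);
            }
        }
    }
}
```

Keeping the enabled flag next to each boxed stage is one simple way to get live toggling without rebuilding the pipeline.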
Draft mode (t)
Turning discussion off puts squawk in draft mode: recordings are held
instead of sent to the model. Record, replay with p, and re-record as many
times as you like — each capture replaces the previous draft. Turning discussion
back on sends the held draft. This is the "decide after listening, re-record
if you don't like it" workflow.
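
The draft workflow boils down to a flag plus one held buffer. A minimal sketch under that assumption; `Session` and `Outcome` are hypothetical types for illustration, not the TUI's actual state machine:

```rust
/// What happens to a finished recording (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Sent(Vec<f32>),
    Held,
    Nothing,
}

struct Session {
    discussion: bool,
    draft: Option<Vec<f32>>,
}

impl Session {
    /// A recording just finished: send it, or hold it as the (single) draft.
    fn on_recording(&mut self, audio: Vec<f32>) -> Outcome {
        if self.discussion {
            Outcome::Sent(audio)
        } else {
            self.draft = Some(audio); // replaces any previous draft
            Outcome::Held
        }
    }

    /// Flip discussion; turning it back on releases the held draft, if any.
    fn toggle_discussion(&mut self) -> Outcome {
        self.discussion = !self.discussion;
        if self.discussion {
            match self.draft.take() {
                Some(audio) => Outcome::Sent(audio),
                None => Outcome::Nothing,
            }
        } else {
            Outcome::Nothing
        }
    }
}
```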
Spoken replies (TTS)
The model is audio-native on the way in but emits text out, so spoken
replies are a half-cascade: model text → TTS → speaker. Turn it on under
[tts] in the config:
[tts]
enabled = true
engine = "tone" # built-in placeholder; "kokoro" for the neural voice
moods = true # let the model pick its tone of voice
speed = 1.0
- tone engine needs no model: it plays a faint tone of the right duration for each sentence. It exists to validate the whole path by ear (streaming, barge-in, the karaoke highlight, and mood-driven pacing) before committing to a real voice.
- kokoro is the neural backend (requires a build with the kokoro feature plus model files; see below).
- Streaming: replies are synthesized a sentence at a time as they stream, so speech starts almost immediately instead of waiting for the full answer.
- Barge-in: press SPACE to talk over a reply. Playback stops at once and the turn records the heard boundary, so the model knows what you did and didn't hear and can pick up from there.
- Karaoke: the spoken portion of the reply stays bright while the rest dims, and a 🔊 speaking NN% gauge tracks playback.
- Speed: set a baseline with tts.speed (1.0 = natural), or nudge it live with [ / ]. It feeds Kokoro's speed input and applies to the next reply.
- Moods: with moods = true, the model may begin a reply with a tag like [mood: cheerful] (one of neutral / cheerful / excited / serious / sad / calm / urgent). The tag is stripped before you see or hear it and maps onto the delivery (currently speaking rate, the knob every backend can honour).

If speech is enabled but no output device is available, squawk logs a warning and falls back to text-only.
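
The mood-tag handling described in the list above can be pictured with a small parser. This is a sketch that assumes the tag, when present, is the very first thing in the reply; `split_mood` is a hypothetical name, and how the real responder treats unknown or malformed tags is not specified here:

```rust
/// The moods listed above.
const MOODS: [&str; 7] = [
    "neutral", "cheerful", "excited", "serious", "sad", "calm", "urgent",
];

/// Split a leading `[mood: x]` tag off a reply.
/// Returns the mood (if it is one of the known values) and the text to show and speak.
/// Unknown or malformed tags are left in place (a choice made for this sketch).
fn split_mood(reply: &str) -> (Option<&str>, &str) {
    let trimmed = reply.trim_start();
    let rest = match trimmed.strip_prefix("[mood:") {
        Some(rest) => rest,
        None => return (None, reply),
    };
    let end = match rest.find(']') {
        Some(end) => end,
        None => return (None, reply),
    };
    let mood = rest[..end].trim();
    if MOODS.contains(&mood) {
        (Some(mood), rest[end + 1..].trim_start())
    } else {
        (None, reply)
    }
}
```

For example, `split_mood("[mood: cheerful] Hello there.")` yields `(Some("cheerful"), "Hello there.")`, and a reply without a tag comes back unchanged with `None`.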
Kokoro (neural voice)
The real voice is Kokoro-82M,
run locally via onnxruntime. It's behind the kokoro cargo feature (which pulls
onnxruntime — fetched at build time, no system lib needed) and needs espeak-ng
installed for grapheme-to-phoneme.
# 1. one-time model download (model + per-voice style files + tokenizer)
hf download onnx-community/Kokoro-82M-v1.0-ONNX
# 2. point the config at it
# [tts] engine = "kokoro", and model_path / voices_path (see squawk.toml)
# 3. run with the feature
cargo run --features kokoro
Phonemes come from espeak-ng; the phoneme→id vocab is read from the model's
own tokenizer.json, and each voice is a voices/<name>.bin style tensor — so
everything matches the loaded model. Pick a voice with tts.voice (e.g.
af_heart, bf_emma, ff_siwis); the first letter selects the language.
Notes / roadmap
- Resampling (recorder::resample_linear) applies a windowed-sinc low-pass before decimation (anti-aliased), then linear interpolation for the fractional step. Good enough for the encoder; a full polyphase resampler is overkill here.
- Not yet implemented: voice-activity detection, wake-word activation, and an adaptive speaking-rate time-stretch (the live rate meter is the first piece of this).
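
For readers unfamiliar with the resampling approach named in the first note, here is a compact sketch of "windowed-sinc low-pass, then linear interpolation". It is not the code in recorder::resample_linear: the Hann window, the 31-tap length, the edge clamping and the function names are all choices made for this illustration.

```rust
use std::f32::consts::PI;

/// Hann-windowed sinc low-pass taps. `cutoff` is a fraction of the input
/// sample rate (0.0..0.5); `len` should be odd and at least 3.
fn lowpass_taps(cutoff: f32, len: usize) -> Vec<f32> {
    let mid = (len - 1) as f32 / 2.0;
    let mut taps: Vec<f32> = (0..len)
        .map(|i| {
            let x = i as f32 - mid;
            let sinc = if x.abs() < 1e-6 {
                2.0 * cutoff
            } else {
                (2.0 * PI * cutoff * x).sin() / (PI * x)
            };
            let hann = 0.5 - 0.5 * (2.0 * PI * i as f32 / (len - 1) as f32).cos();
            sinc * hann
        })
        .collect();
    let sum: f32 = taps.iter().sum();
    for t in taps.iter_mut() {
        *t /= sum; // normalize to unity gain at DC
    }
    taps
}

/// Resample mono audio: low-pass first when downsampling, then interpolate linearly.
fn resample_sketch(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let filtered: Vec<f32> = if to_hz < from_hz {
        // Cut at the target Nyquist frequency to avoid aliasing.
        let taps = lowpass_taps(0.5 * to_hz as f32 / from_hz as f32, 31);
        let half = taps.len() / 2;
        (0..input.len())
            .map(|n| {
                taps.iter()
                    .enumerate()
                    .map(|(k, t)| {
                        // Repeat the edge sample outside the buffer.
                        let idx = (n + k).saturating_sub(half).min(input.len() - 1);
                        t * input[idx]
                    })
                    .sum::<f32>()
            })
            .collect()
    } else {
        input.to_vec()
    };
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let lo = pos.floor() as usize;
            let hi = (lo + 1).min(filtered.len() - 1);
            let frac = (pos - lo as f64) as f32;
            filtered[lo] * (1.0 - frac) + filtered[hi] * frac
        })
        .collect()
}
```

The low-pass is what makes decimation safe; linear interpolation alone would fold content above the new Nyquist frequency back into the speech band.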