Research project: a speech-to-speech assistant running on my RTX 3090

Rust 97.1%
Python 1.8%
Shell 1.1%


Vincent S. f24d2d5ab9 [ADD] Initial commit		2026-06-28 00:14:23 +02:00


squawk

Push-to-talk voice client for a local llama.cpp multimodal model. Hold a key, speak, release — your audio goes straight to the model (no separate STT; the model's audio encoder handles speech) and the reply streams back into a terminal UI, optionally spoken aloud (streaming TTS with barge-in; see below).

Built as a Cargo workspace so the core logic can be reused by future frontends (GUI, web).

🎙  PUSH-TO-TALK — hold [SPACE] to talk, release to send
● REC  2.3s
🤖 The capital of France is Paris.
[SPACE] talk    [m] swap mode    [d] device    [q] quit

Workspace layout

squawk/
├── crates/
│   ├── squawk-core/   # UI-agnostic: config, conversation, audio capture, streaming client
│   └── squawk-tui/    # terminal frontend (clap + crossterm + ratatui)

`squawk-core` module	responsibility
`config`	layered TOML config (file + env + CLI): model, endpoint, audio, logging, prompt
`conversation`	turn history + partial-response carry-through
`model_client`	async, cancellable SSE streaming; splits content vs reasoning
`recorder`	`cpal` continuous capture → raw mono buffer (pre-roll lead-in)
`processing`	trait-based audio pipeline: one `Stage` per file, run + resample → WAV
`analysis`	live syllable-rate (speech-speed) estimation from the audio envelope
`tts`	text-to-speech backends behind a `Tts` trait: `tone` placeholder + `kokoro` (feature)
`speech`	streaming, cancellable playback sink with a position readout (karaoke + barge-in)
`responder`	output interpreter: strips `[mood:…]` → style, sentence-chunked TTS, heard-boundary mapping
`player`	replay a WAV on the default output device (`cpal`) with progress
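
As an illustration of what the `analysis` row describes, here is a minimal sketch of envelope-based syllable-rate estimation. The method (10 ms RMS frames, then counting bursts with two hysteresis thresholds) and the `syllable_rate` name are assumptions for this example, not necessarily what squawk-core does.

```rust
/// Estimate syllables per second by counting envelope bursts (an assumed method).
pub fn syllable_rate(samples: &[f32], sample_rate: u32) -> f32 {
    if samples.is_empty() || sample_rate == 0 {
        return 0.0;
    }
    // 10 ms RMS frames form a coarse amplitude envelope.
    let frame = (sample_rate as usize / 100).max(1);
    let envelope: Vec<f32> = samples
        .chunks(frame)
        .map(|c| (c.iter().map(|s| s * s).sum::<f32>() / c.len() as f32).sqrt())
        .collect();
    let peak = envelope.iter().fold(0.0f32, |m, e| m.max(*e));
    if peak <= 0.0 {
        return 0.0;
    }
    // Hysteresis: a burst starts above `high` and ends below `low`,
    // so small ripples near one threshold are not counted twice.
    let (high, low) = (0.5 * peak, 0.25 * peak);
    let mut in_burst = false;
    let mut bursts = 0u32;
    for e in &envelope {
        if !in_burst && *e > high {
            in_burst = true;
            bursts += 1;
        } else if in_burst && *e < low {
            in_burst = false;
        }
    }
    let seconds = samples.len() as f32 / sample_rate as f32;
    bursts as f32 / seconds
}
```

A live meter would presumably run this over a sliding window of the most recent audio rather than the whole recording.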

`squawk-tui` module	responsibility
`input`	crossterm key mapping; Kitty enhancement flags for key-release
`app`	state machine (idle → recording → streaming) + ratatui UI
`main`	CLI (device listing/selection), current-thread runtime

Adding a squawk-gui or squawk-web crate later means depending on squawk-core and implementing a new frontend against the same Recorder, ModelClient, and Conversation types.
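
The `app` state machine (idle → recording → streaming) can be pictured as a small transition function that any frontend could share. This is a sketch only: the `State`, `Event`, and `next` names are hypothetical and do not come from the crate.

```rust
/// Hypothetical app states, named after "idle → recording → streaming".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Recording,
    Streaming,
}

/// Hypothetical inputs that drive the state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    TalkPressed,
    TalkReleased,
    ReplyFinished,
}

/// Pure transition function: easy to unit-test and independent of any UI.
pub fn next(state: State, event: Event) -> State {
    match (state, event) {
        (State::Idle, Event::TalkPressed) => State::Recording,
        // Barge-in: pressing talk mid-reply interrupts it and records again.
        (State::Streaming, Event::TalkPressed) => State::Recording,
        (State::Recording, Event::TalkReleased) => State::Streaming,
        (State::Streaming, Event::ReplyFinished) => State::Idle,
        // Every other combination leaves the state unchanged.
        (unchanged, _) => unchanged,
    }
}
```

Keeping the transitions free of I/O is one way to let a GUI or web frontend reuse the same behaviour.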

Requirements

A llama.cpp router (or server) at http://localhost:8080 serving an audio-capable model. Defaults target unsloth/gemma-4-12b-it-GGUF:Q4_K_M with its mmproj loaded. Override with --base-url / --model.
A terminal that speaks the Kitty keyboard protocol for true hold-to-talk (e.g. ghostty, kitty, foot). Otherwise squawk auto-falls back to toggle mode.
System libs: ALSA on Linux, or whichever platform audio backend cpal targets elsewhere.

Run

cargo run --bin squawk                 # launch the TUI
cargo run --bin squawk -- --list-devices
cargo run --bin squawk -- --device 6 --model some/other-model

Configuration

squawk is configured with TOML. Generate a commented starter at the user config path (~/.config/squawk/config.toml), then edit it:

cargo run --bin squawk -- --init-config     # write the default config
cargo run --bin squawk -- --print-config    # show the effective config (api key redacted)

You only set the keys you want to change; everything else falls back to the built-in defaults (crates/squawk-core/config.default.toml, embedded at compile time). Values resolve from several layers — highest priority first:

CLI flags — --model, --device, --base-url
environment — SQUAWK_* (SQUAWK_API_KEY, SQUAWK_BASE_URL, SQUAWK_MODEL, SQUAWK_LOG_DIR, SQUAWK_LOG, SQUAWK_LOG_LEVEL, SQUAWK_LOG_AUDIO)
--config <path> — an explicitly chosen file
./squawk.toml — project-local (handy for dev)
~/.config/squawk/config.toml — user config (honours XDG_CONFIG_HOME)
compiled defaults

Files merge (a project file overrides only the keys it sets). For a hosted OpenAI-compatible endpoint, set endpoint.api_key — but prefer the SQUAWK_API_KEY env var over committing a secret to a file; it is never written to logs or --print-config.
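
The layering above boils down to "first layer that sets a key wins". A minimal sketch, assuming each layer is a struct of optional fields; `Layer`, `Effective`, and `resolve` are hypothetical names, and only two keys are shown.

```rust
/// Hypothetical partial config: `None` means "this layer does not set the key".
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Layer {
    pub base_url: Option<String>,
    pub model: Option<String>,
}

/// Fully resolved values after all layers are applied.
#[derive(Debug, Clone, PartialEq)]
pub struct Effective {
    pub base_url: String,
    pub model: String,
}

/// `layers` is ordered highest priority first (CLI, env, files);
/// `defaults` plays the role of the compiled-in defaults.
pub fn resolve(layers: &[Layer], defaults: &Effective) -> Effective {
    Effective {
        base_url: layers
            .iter()
            .find_map(|l| l.base_url.clone())
            .unwrap_or_else(|| defaults.base_url.clone()),
        model: layers
            .iter()
            .find_map(|l| l.model.clone())
            .unwrap_or_else(|| defaults.model.clone()),
    }
}
```

With a CLI layer that sets only `model` and a project file that sets both keys, the model comes from the CLI and the base URL from the project file.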

Controls

key	action
`SPACE`	hold to talk, release to send (in toggle mode: press once to start, press again to send)
`p`	replay the last recording — with processing applied (validate by ear)
`t`	toggle discussion on/off (draft mode — see below)
`1`–`5`	toggle an audio-processing stage live (see Audio processing)
`[` / `]`	speak the reply slower / faster (TTS speed, applies to the next reply)
`m`	swap capture mode (hold ↔ toggle) — only when the terminal supports key-release
`↑`/`↓`, `PgUp`/`PgDn`	scroll the conversation (auto-follows the latest reply)
`d`	open the input-device picker (`j`/`k` move, `Enter` select, `d`/`Esc` cancel)
`q` / `Ctrl-C`	quit

While a reply is streaming, pressing SPACE again interrupts it (keeps the partial answer) and starts a new recording. When replies are spoken (see below), the same press also stops playback and records how much you actually heard — so the next turn's context reflects what reached your ears, not the full generated text (TTS synthesis runs ahead of playback). Assistant replies render markdown (code fences, headings, bold, inline code). While recording, the header shows a live speaking-rate meter (syllables/sec) — very fast speech is harder for the audio encoder to understand.
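
To make the "what reached your ears" idea concrete, here is a sketch of mapping a playback position back to text. It assumes each synthesized chunk remembers its text and sample count, and it cuts a partially played chunk proportionally by character count. `Chunk` and `heard_text` are hypothetical names, and the real mapping may be finer-grained.

```rust
/// One synthesized chunk: its text and how many audio samples it produced
/// (hypothetical type for this sketch).
pub struct Chunk {
    pub text: String,
    pub samples: usize,
}

/// Return the prefix of the reply that had been played once `played`
/// samples reached the output device.
pub fn heard_text(chunks: &[Chunk], played: usize) -> String {
    let mut heard = String::new();
    let mut remaining = played;
    for chunk in chunks {
        if remaining >= chunk.samples {
            // Chunk fully played: keep all of its text.
            heard.push_str(&chunk.text);
            remaining -= chunk.samples;
        } else {
            // Partially played: keep a proportional share of its characters.
            let total_chars = chunk.text.chars().count();
            let keep = total_chars * remaining / chunk.samples;
            heard.extend(chunk.text.chars().take(keep));
            break;
        }
    }
    heard
}
```

The truncated string is what would go into the conversation history for the interrupted turn.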

Audio processing (`1`–`5`)

Each recording runs through a pipeline of `Stage`s before being sent (and before replay, so `p` plays exactly what the model receives). Stages toggle live via the number keys; the header's fx line shows what's on (green) or off (dim):

key	stage	what
`1`	`dc`	DC-offset removal
`2`	`hp`	80 Hz high-pass (rumble / handling noise)
`3`	`denoise`	RNNoise denoiser (`nnnoiseless`); off by default — heavier
`4`	`trim`	drop leading/trailing silence (fixes the startup blank)
`5`	`norm`	level normalization with headroom: lift toward −20 dBFS RMS, gain-capped, then peak-limited to −3 dBFS (for speech the −3 ceiling usually binds — effectively peak-normalize-with-headroom, never clipping)

Adding a processing stage takes a new file in processing/ implementing Stage, plus one line in Pipeline::default_voice. Pre-emphasis is intentionally not included: it's a classical-ASR front-end step that tends to hurt a neural audio encoder.
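
For orientation, here is a sketch of what a trait-based pipeline with the `dc` and `norm` stages might look like. The trait shape, the field names, and the 20 dB gain cap are assumptions for illustration; only the stage names, the −20 dBFS RMS target, the −3 dBFS ceiling, and the `Pipeline::default_voice` name come from this README.

```rust
/// Assumed shape of the trait; the real `Stage` in `processing/` may differ.
pub trait Stage {
    fn name(&self) -> &'static str;
    fn process(&self, samples: &mut Vec<f32>);
}

/// `dc`: subtract the mean so the waveform is centred on zero.
pub struct DcRemove;

impl Stage for DcRemove {
    fn name(&self) -> &'static str { "dc" }
    fn process(&self, samples: &mut Vec<f32>) {
        if samples.is_empty() { return; }
        let mean = samples.iter().sum::<f32>() / samples.len() as f32;
        for s in samples.iter_mut() { *s -= mean; }
    }
}

/// `norm`: lift toward a target RMS, cap the gain, keep the peak under a ceiling.
pub struct Normalize {
    pub target_rms_db: f32,
    pub max_gain_db: f32, // hypothetical cap; the README does not state the value
    pub peak_ceiling_db: f32,
}

fn db_to_linear(db: f32) -> f32 { 10f32.powf(db / 20.0) }

impl Stage for Normalize {
    fn name(&self) -> &'static str { "norm" }
    fn process(&self, samples: &mut Vec<f32>) {
        if samples.is_empty() { return; }
        let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
        let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
        if rms <= 0.0 || peak <= 0.0 { return; }
        // Smallest of three gains: reach the RMS target, respect the cap,
        // and keep the loudest sample at or below the ceiling.
        let gain = (db_to_linear(self.target_rms_db) / rms)
            .min(db_to_linear(self.max_gain_db))
            .min(db_to_linear(self.peak_ceiling_db) / peak);
        for s in samples.iter_mut() { *s *= gain; }
    }
}

/// Ordered stages, each with an on/off flag that a key press could flip.
pub struct Pipeline {
    stages: Vec<(Box<dyn Stage>, bool)>,
}

impl Pipeline {
    pub fn default_voice() -> Self {
        let norm = Normalize { target_rms_db: -20.0, max_gain_db: 20.0, peak_ceiling_db: -3.0 };
        Pipeline {
            stages: vec![
                (Box::new(DcRemove) as Box<dyn Stage>, true),
                (Box::new(norm) as Box<dyn Stage>, true),
            ],
        }
    }

    pub fn toggle(&mut self, name: &str) {
        for entry in self.stages.iter_mut() {
            if entry.0.name() == name { entry.1 = !entry.1; }
        }
    }

    pub fn run(&self, samples: &mut Vec<f32>) {
        for (stage, enabled) in &self.stages {
            if *enabled { stage.process(samples); }
        }
    }
}
```

In this sketch the peak "limit" is a single static gain, which matches the peak-normalize-with-headroom behaviour described in the table rather than a dynamic limiter.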

Draft mode (`t`)

Turning discussion off puts squawk in draft mode: recordings are held instead of sent to the model. Record, replay with p, and re-record as many times as you like — each capture replaces the previous draft. Turning discussion back on sends the held draft. This is the "decide after listening, re-record if you don't like it" workflow.
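
The hold-and-replace behaviour is small enough to sketch directly. `Session` and its methods are hypothetical names used only for this example.

```rust
/// Hypothetical holder for draft mode: with discussion off, each capture
/// replaces the held draft instead of being sent.
pub struct Session {
    discussion: bool,
    draft: Option<Vec<f32>>,
}

impl Session {
    pub fn new() -> Self {
        Session { discussion: true, draft: None }
    }

    /// Called when a recording finishes. Returns audio to send now, if any.
    pub fn on_capture(&mut self, audio: Vec<f32>) -> Option<Vec<f32>> {
        if self.discussion {
            Some(audio)
        } else {
            self.draft = Some(audio); // overwrite the previous draft
            None
        }
    }

    /// Toggle discussion (`t`). Turning it back on releases the held draft.
    pub fn toggle_discussion(&mut self) -> Option<Vec<f32>> {
        self.discussion = !self.discussion;
        if self.discussion { self.draft.take() } else { None }
    }
}
```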

Spoken replies (TTS)

The model is audio-native on the way in but emits text out, so spoken replies are a half-cascade: model text → TTS → speaker. Turn it on under [tts] in the config:

[tts]
enabled = true
engine  = "tone"   # built-in placeholder; "kokoro" for the neural voice
moods   = true     # let the model pick its tone of voice
speed   = 1.0

tone engine needs no model: it plays a faint tone of the right duration for each sentence. It exists to validate the whole path by ear — streaming, barge-in, the karaoke highlight, and mood-driven pacing — before committing to a real voice. kokoro is the neural backend (requires a build with the kokoro feature plus model files; see below).
Streaming: replies are synthesized a sentence at a time as they stream, so speech starts almost immediately instead of waiting for the full answer.
Barge-in: press SPACE to talk over a reply — playback stops at once and the turn records the heard boundary, so the model knows what you did and didn't hear and can pick up from there.
Karaoke: the spoken portion of the reply stays bright while the rest dims, and a 🔊 speaking NN% gauge tracks playback.
Speed: set a baseline with tts.speed (1.0 = natural), or nudge it live with [ / ] — it feeds Kokoro's speed input and applies to the next reply.
Moods: with moods = true, the model may begin a reply with a tag like [mood: cheerful] (one of neutral / cheerful / excited / serious / sad / calm / urgent). The tag is stripped before you see or hear it and maps onto the delivery (currently speaking rate — the knob every backend can honour).
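
Two of the points above, mood-tag stripping and sentence-at-a-time synthesis, can be sketched with plain string handling. This is an illustration under assumptions: `strip_mood` and `SentenceChunker` are hypothetical names, and the sentence rule used here (a `.`, `!` or `?` followed by whitespace) is a simplification that the real `responder` may refine.

```rust
const MOODS: [&str; 7] = ["neutral", "cheerful", "excited", "serious", "sad", "calm", "urgent"];

/// Split a leading `[mood: name]` tag off a reply.
/// Returns the mood (if it is one of the known names) and the remaining text.
pub fn strip_mood(reply: &str) -> (Option<&str>, &str) {
    let trimmed = reply.trim_start();
    if let Some(rest) = trimmed.strip_prefix("[mood:") {
        if let Some(end) = rest.find(']') {
            let name = rest[..end].trim();
            if MOODS.iter().any(|m| *m == name) {
                return (Some(name), rest[end + 1..].trim_start());
            }
        }
    }
    (None, reply)
}

/// Buffers streamed text and hands out complete sentences as they appear.
pub struct SentenceChunker {
    buf: String,
}

impl SentenceChunker {
    pub fn new() -> Self {
        SentenceChunker { buf: String::new() }
    }

    /// Feed a streamed text delta; returns any sentences it completed.
    pub fn push(&mut self, delta: &str) -> Vec<String> {
        self.buf.push_str(delta);
        let mut out = Vec::new();
        loop {
            // Byte offset just past the first terminator followed by whitespace.
            let cut = {
                let mut chars = self.buf.char_indices().peekable();
                let mut found = None;
                while let Some((i, c)) = chars.next() {
                    if matches!(c, '.' | '!' | '?') {
                        if let Some(&(_, following)) = chars.peek() {
                            if following.is_whitespace() {
                                found = Some(i + c.len_utf8());
                                break;
                            }
                        }
                    }
                }
                found
            };
            match cut {
                Some(end) => {
                    let rest = self.buf.split_off(end);
                    out.push(self.buf.trim().to_string());
                    self.buf = rest;
                }
                None => break,
            }
        }
        out
    }

    /// Flush whatever is left when the stream ends.
    pub fn finish(&mut self) -> Option<String> {
        let rest = self.buf.trim().to_string();
        self.buf.clear();
        if rest.is_empty() { None } else { Some(rest) }
    }
}
```

A sentence that ends exactly at the end of a delta is held back until more text arrives or `finish` is called, which avoids splitting on a decimal point such as "3.14" that happens to straddle two deltas.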

If speech is enabled but no output device is available, squawk logs a warning and falls back to text-only.

Kokoro (neural voice)

The real voice is Kokoro-82M, run locally via onnxruntime. It's behind the kokoro cargo feature (which pulls in onnxruntime, fetched at build time, so no system library is needed) and needs espeak-ng installed for grapheme-to-phoneme conversion.

# 1. one-time model download (model + per-voice style files + tokenizer)
hf download onnx-community/Kokoro-82M-v1.0-ONNX
# 2. point the config at it
#    [tts] engine = "kokoro", and model_path / voices_path (see squawk.toml)
# 3. run with the feature
cargo run --features kokoro

Phonemes come from espeak-ng; the phoneme→id vocab is read from the model's own tokenizer.json, and each voice is a voices/<name>.bin style tensor — so everything matches the loaded model. Pick a voice with tts.voice (e.g. af_heart, bf_emma, ff_siwis); the first letter selects the language.
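
As a small illustration of "the first letter selects the language", here is a sketch covering only the three prefixes used in the examples above. The `voice_language` name is hypothetical, and the returned espeak-ng language codes are an assumption about what gets passed to the phonemizer; other prefixes are deliberately left out.

```rust
/// Map a Kokoro voice name to a phonemizer language code by its first letter.
/// Only the prefixes from the README's examples are covered in this sketch.
pub fn voice_language(voice: &str) -> Option<&'static str> {
    match voice.chars().next()? {
        'a' => Some("en-us"), // e.g. af_heart
        'b' => Some("en-gb"), // e.g. bf_emma
        'f' => Some("fr-fr"), // e.g. ff_siwis
        _ => None,
    }
}
```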

Notes / roadmap

Resampling (recorder::resample_linear) applies a windowed-sinc low-pass before decimation (anti-aliased), then linear interpolation for the fractional step. Good enough for the encoder; a full polyphase resampler is overkill here.
Not yet implemented: voice-activity detection, wake-word activation, and an adaptive speaking-rate time-stretch (the live rate meter is the first piece of this).
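
For readers curious about the resampling note, here is a sketch of the general technique: a windowed-sinc low-pass followed by linear interpolation. The Hann window, the 63-tap length, the 0.45 cutoff factor, and the function names are all assumptions for this example, not the values used by `recorder::resample_linear`.

```rust
use std::f32::consts::PI;

/// Hann-windowed sinc low-pass. `taps` must be odd and at least 3;
/// `cutoff` is a fraction of the input sample rate (0.0..0.5).
fn lowpass_kernel(taps: usize, cutoff: f32) -> Vec<f32> {
    let mid = (taps - 1) as f32 / 2.0;
    let mut kernel: Vec<f32> = (0..taps)
        .map(|i| {
            let x = i as f32 - mid;
            let sinc = if x == 0.0 {
                2.0 * cutoff
            } else {
                (2.0 * PI * cutoff * x).sin() / (PI * x)
            };
            let hann = 0.5 - 0.5 * (2.0 * PI * (i as f32) / ((taps - 1) as f32)).cos();
            sinc * hann
        })
        .collect();
    let sum: f32 = kernel.iter().sum();
    for c in kernel.iter_mut() {
        *c /= sum; // unity gain at DC
    }
    kernel
}

/// Apply the (symmetric) kernel, clamping indices at the buffer edges.
fn filter(input: &[f32], kernel: &[f32]) -> Vec<f32> {
    let half = (kernel.len() / 2) as isize;
    let last = input.len() as isize - 1;
    (0..input.len() as isize)
        .map(|n| {
            kernel
                .iter()
                .enumerate()
                .map(|(j, c)| {
                    let idx = (n + j as isize - half).max(0).min(last);
                    c * input[idx as usize]
                })
                .sum::<f32>()
        })
        .collect()
}

/// Low-pass when decimating, then linearly interpolate at the new rate.
pub fn resample(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz || from_hz == 0 || to_hz == 0 {
        return input.to_vec();
    }
    let filtered = if to_hz < from_hz {
        // Cutoff just under the new Nyquist frequency.
        let cutoff = 0.45 * (to_hz as f32) / (from_hz as f32);
        filter(input, &lowpass_kernel(63, cutoff))
    } else {
        input.to_vec()
    };
    let step = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / step).floor() as usize;
    let last = filtered.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let i0 = (pos.floor() as usize).min(last);
            let i1 = (i0 + 1).min(last);
            let frac = (pos - i0 as f64) as f32;
            filtered[i0] * (1.0 - frac) + filtered[i1] * frac
        })
        .collect()
}
```

Going from 48 kHz to 16 kHz, a constant input stays constant while a tone at the old Nyquist frequency is strongly attenuated instead of aliasing into the speech band.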
