Scripture Listener (Spec)

Status: Finalized (v1)
Last updated: 2026-03-03

Goal

Build a lightweight service that listens to live speech, detects explicit Bible references, and instantly outputs:

  • canonical Route Bible link
  • QR code
  • embeddable snippet

Product outcome

Church teams, speakers, and media operators can generate accurate, shareable scripture links from spoken content in real time without manually typing references.

Non-goals (v1)

  • Paraphrase / semantic verse matching
  • Full sermon transcript storage/search product
  • Multi-speaker diarization
  • Automatic slide deck generation
  • AI-generated theological commentary

Future scope (post-v1)

  • Exedra-powered paraphrase detection and ranked semantic candidates
  • Candidate confirmation flow for ambiguous paraphrase matches

Locked Decisions (v1)

  • Product domain: cue.selah.tools
  • Scope: direct-reference parsing only (no paraphrase triggers in v1)
  • Default confidence threshold: 0.70
  • UI model: start-first screen -> live slide screen with explicit stop
  • Slide requirement: verse text + verse reference + small Route Bible QR on every slide
  • Transport: WebSocket streaming for live session

Backend

Architecture overview

Use an event-driven backend with three main components:

  1. ASR ingress service (Rust)
    • Receives live audio stream (WebSocket)
    • Runs VAD/chunking
    • Sends utterance chunks to ASR worker
    • Maintains per-session state
  2. ASR worker (Parakeet runtime)
    • Returns incremental transcript chunks and final utterance text
  3. Scripture match + action pipeline (Rust)
    • Resolves direct references first
    • Applies direct-reference confidence policy
    • Calls Route Bible parse/QR APIs
    • Emits match events

Reuse of existing Selah infrastructure

  • Route Bible outputs
    • Reuse canonical parsing endpoint (/parse)
    • Reuse QR generation endpoint (/qr)
    • Reuse Route Bible canonical URL conventions
  • Reference parsing
    • Reuse grab-bcv parsing rules and canonicalization behavior
    • Use local grab-bcv parse as the primary v1 match detector for explicit references

Deferred Exedra integration

Paraphrase trigger support is intentionally deferred to post-v1:

  • reuse Exedra hybrid retrieval (resolve_query/search)
  • reuse semantic query candidate expansion strategy
  • reuse existing paraphrase fixtures for quality gates

Trigger policy (when resolution runs)

Run direct-reference resolution on utterance boundaries and controlled rolling windows:

  1. Trigger when silence is detected for 500-800ms
  2. Also trigger every 1.5-2s during long uninterrupted speech
  3. Skip if transcript delta is low (near-duplicate suppression)
  4. Require minimum query signal:
    • at least 6 words, or
    • at least 40 characters

Resolution order:

  1. Direct reference parse attempt
  2. If parse fails, no match is emitted in v1
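
The trigger rules above can be sketched as one decision function. This is a minimal Python illustration only (the ingress service itself is specified in Rust). The names TriggerConfig and should_trigger, the values picked inside the spec's ranges (600 ms silence, 1.5 s rolling interval), and the 0.9 similarity cutoff used for near-duplicate suppression are all assumptions, not part of the spec.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class TriggerConfig:
    silence_ms: int = 600        # assumed value inside the spec's 500-800 ms range
    rolling_ms: int = 1500       # assumed value inside the spec's 1.5-2 s range
    min_words: int = 6           # from the spec
    min_chars: int = 40          # from the spec
    max_similarity: float = 0.9  # assumption: at or above this, the delta counts as low


def has_min_signal(text: str, cfg: TriggerConfig) -> bool:
    """Minimum query signal: at least 6 words OR at least 40 characters."""
    stripped = text.strip()
    return len(stripped.split()) >= cfg.min_words or len(stripped) >= cfg.min_chars


def is_near_duplicate(text: str, last_text: str, cfg: TriggerConfig) -> bool:
    """Near-duplicate suppression via a simple similarity ratio (assumed heuristic)."""
    if not last_text:
        return False
    return SequenceMatcher(None, text, last_text).ratio() >= cfg.max_similarity


def should_trigger(text, last_text, silence_ms, ms_since_last_run, cfg=TriggerConfig()):
    """Return True when direct-reference resolution should run on this window."""
    boundary = silence_ms >= cfg.silence_ms        # rule 1: utterance boundary
    rolling = ms_since_last_run >= cfg.rolling_ms  # rule 2: long uninterrupted speech
    if not (boundary or rolling):
        return False
    if is_near_duplicate(text, last_text, cfg):    # rule 3: low transcript delta
        return False
    return has_min_signal(text, cfg)               # rule 4: minimum query signal
```

Note that under rule 4 a bare two-word utterance such as "John 3:16" does not trigger on its own; it is only evaluated as part of a longer window.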

Confidence and match policy

Target behavior: only emit final matches when direct parse confidence is high enough to avoid false triggers.

Scoring model

Compute match_confidence in [0, 1] from direct-reference signals:

  • parse validity (canonical parse succeeds)
  • parser certainty (single unambiguous parse result)
  • ASR confidence for relevant token span (if available)
  • short-window stability (same canonical ref in 2 of last 3 windows)
  • transcript quality guard (minimum token quality, no severe truncation)

Decision thresholds

  • >= 0.70: emit final match with QR/snippet
  • < 0.70: suppress match
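
The spec lists the scoring signals but not how they are combined. The Python sketch below assumes parse validity is a hard gate and that the remaining signals are merged as a weighted average, renormalized when the ASR score is missing. The weights, the Signals type, and the function names are hypothetical and would need tuning against fixtures.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 0.70  # default confidence threshold from the spec

# Assumed weights; the spec does not define how signals are combined.
WEIGHTS = {"parser_certainty": 0.35, "asr": 0.25, "stability": 0.25, "quality": 0.15}


@dataclass(frozen=True)
class Signals:
    parse_valid: bool                # canonical parse succeeded
    parser_certainty: float          # 1.0 for a single unambiguous parse result
    asr_confidence: Optional[float]  # None when the ASR worker reports no score
    stability: float                 # e.g. 1.0 if the same ref was seen in 2 of last 3 windows
    quality: float                   # transcript quality guard, in [0, 1]


def match_confidence(s: Signals) -> float:
    """Combine direct-reference signals into a score in [0, 1]."""
    if not s.parse_valid:
        return 0.0  # assumption: an invalid parse can never produce a match
    parts = {
        "parser_certainty": s.parser_certainty,
        "stability": s.stability,
        "quality": s.quality,
    }
    if s.asr_confidence is not None:
        parts["asr"] = s.asr_confidence
    total_weight = sum(WEIGHTS[name] for name in parts)
    score = sum(WEIGHTS[name] * value for name, value in parts.items()) / total_weight
    return max(0.0, min(1.0, score))


def decide(s: Signals) -> str:
    """Map the score onto the two v1 outcomes."""
    return "match.final" if match_confidence(s) >= THRESHOLD else "match.suppressed"
```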

Anti-noise guardrails

  • session cooldown: suppress duplicate canonical result for 20-30s
  • stability guard: require consistent canonical result across short window
  • max output rate: at most one auto-match per N seconds (configurable)
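
A minimal Python sketch of the three guardrails as per-session state. The class name, the default values (25 s cooldown, 5 s minimum gap between auto-matches), and the "rate_limited" reason code are assumptions. The "2 of last 3 windows" stability rule comes from the scoring model above.

```python
from collections import deque
from typing import Optional


class Guardrails:
    """Per-session anti-noise state; all defaults are assumptions."""

    def __init__(self, cooldown_s=25.0, min_gap_s=5.0, window=3, required=2):
        self.cooldown_s = cooldown_s   # assumed value inside the spec's 20-30 s range
        self.min_gap_s = min_gap_s     # the spec's configurable "N seconds"
        self.required = required       # ref must appear `required` times in last `window`
        self.recent = deque(maxlen=window)
        self.last_emit_by_ref = {}
        self.last_emit_at = None

    def check(self, canonical: str, now: float) -> Optional[str]:
        """Return None to allow the match, or a suppression reason code."""
        self.recent.append(canonical)
        if sum(1 for ref in self.recent if ref == canonical) < self.required:
            return "unstable"
        last_same = self.last_emit_by_ref.get(canonical)
        if last_same is not None and now - last_same < self.cooldown_s:
            return "duplicate"
        if self.last_emit_at is not None and now - self.last_emit_at < self.min_gap_s:
            return "rate_limited"
        # Allowed: record the emission so later checks see it.
        self.last_emit_by_ref[canonical] = now
        self.last_emit_at = now
        return None
```

Passing the clock value in as `now` keeps the class deterministic in tests. A deployment backed by Redis would store the same two timestamps per session.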

APIs

Ingress API (new service)

  • WS /v1/listen
    • client sends audio frames + control events
    • server emits transcript and match events

Example server events:

  • transcript.partial
  • transcript.final
  • match.final
  • match.suppressed

Match output payload

```json
{
  "session_id": "abc123",
  "utterance_id": "utt_0042",
  "mode": "direct_reference",
  "confidence": 0.81,
  "canonical": "JHN.3.16",
  "display": "John 3:16",
  "route_url": "https://route.bible/jhn.3.16?src=listener",
  "qr_url": "https://route.bible/qr?passage=JHN.3.16&format=svg&download=false",
  "snippet_html": "<a href=\"https://route.bible/jhn.3.16\">John 3:16</a>",
  "needs_confirmation": false,
  "reason": "direct_parse"
}
```
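
A sketch of how the payload above could be assembled from a canonical reference, in Python for illustration. The URL shapes are inferred from the single example payload (lowercased canonical as the path, src=listener on the route link, passage/format/download on the QR link) and should be treated as assumptions until checked against the Route Bible conventions. build_match_payload is a hypothetical helper name.

```python
from html import escape
from urllib.parse import urlencode

ROUTE_BASE = "https://route.bible"  # host taken from the example payload


def build_match_payload(session_id, utterance_id, canonical, display, confidence):
    """Assemble a match.final payload shaped like the example above."""
    link = f"{ROUTE_BASE}/{canonical.lower()}"  # example maps JHN.3.16 to /jhn.3.16
    qr_query = urlencode({"passage": canonical, "format": "svg", "download": "false"})
    return {
        "session_id": session_id,
        "utterance_id": utterance_id,
        "mode": "direct_reference",
        "confidence": round(confidence, 2),
        "canonical": canonical,
        "display": display,
        "route_url": f"{link}?src=listener",
        "qr_url": f"{ROUTE_BASE}/qr?{qr_query}",
        # Escape both the attribute value and the label before embedding in HTML.
        "snippet_html": f'<a href="{escape(link, quote=True)}">{escape(display)}</a>',
        "needs_confirmation": False,  # always false for direct references in v1
        "reason": "direct_parse",
    }
```

Calling build_match_payload("abc123", "utt_0042", "JHN.3.16", "John 3:16", 0.81) reproduces the URLs and snippet in the example payload.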

Route Bible integration contract

For final canonical match:

  1. Parse transcript text with grab-bcv (local) to produce canonical passage (for example JHN.3.16)
  2. POST /parse (or GET /parse?q=...) for route-normalized target and compatibility validation
  3. GET/POST /qr for QR asset generation
  4. Build snippet variants:
    • plain anchor snippet
    • dynamic badge snippet

Parsing source of truth (v1)

  • Listener-side explicit reference detection: grab-bcv
  • Route construction + downstream share format: Route Bible canonical conventions and APIs
  • In v1, if grab-bcv parse fails, no semantic/paraphrase fallback is attempted

Data model

Core entities:

  • Session: stream metadata, language, source
  • Utterance: transcript text, timestamps, ASR confidence
  • DirectMatch: parsed canonical passage + confidence diagnostics
  • MatchEvent: emitted artifact payload and suppression reason if blocked
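
One possible shape for these entities, as plain Python dataclasses. Field names beyond those listed above (for example start_ms, end_ms, diagnostics, kind) and all default values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    session_id: str
    language: str = "en"         # assumed default; multilingual support is an open question
    source: str = "browser-mic"  # hypothetical source label


@dataclass
class Utterance:
    utterance_id: str
    session_id: str
    text: str
    start_ms: int
    end_ms: int
    asr_confidence: Optional[float] = None  # not every ASR runtime reports one


@dataclass
class DirectMatch:
    utterance_id: str
    canonical: str  # e.g. "JHN.3.16"
    confidence: float
    diagnostics: dict = field(default_factory=dict)  # per-signal confidence breakdown


@dataclass
class MatchEvent:
    kind: str  # "match.final" or "match.suppressed"
    match: DirectMatch
    payload: Optional[dict] = None           # emitted artifact payload, if any
    suppression_reason: Optional[str] = None  # set only when the match was blocked
```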

Storage strategy (v1):

  • keep in-memory session state
  • optional short-lived event log (24h) for debugging/QA
  • no long-term raw audio retention by default

Observability

Required metrics:

  • ASR latency p50/p95
  • match latency p50/p95 (utterance end -> emitted match)
  • direct-parse hit rate
  • parse failure rate
  • false positive rate from operator feedback
  • dedupe suppression count
  • threshold bucket distribution (>=0.70, <0.70)

Structured logs must include:

  • session id, utterance id
  • parsed canonical ref (if any)
  • confidence breakdown
  • final decision reason
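
A minimal sketch of one structured log record carrying the required fields, serialized as a JSON line. The key names and the decision_log_line helper are assumptions. The transcript text is deliberately left out, in line with the PII minimization rule.

```python
import json
from typing import Optional


def decision_log_line(
    session_id: str,
    utterance_id: str,
    canonical: Optional[str],
    confidence_breakdown: dict,
    decision_reason: str,
) -> str:
    """Serialize one match decision as a single JSON log line."""
    record = {
        "session_id": session_id,
        "utterance_id": utterance_id,
        "canonical": canonical,  # None when no reference was parsed
        "confidence_breakdown": confidence_breakdown,
        "decision_reason": decision_reason,
    }
    # sort_keys keeps the key order stable, which makes log lines easier to diff.
    return json.dumps(record, sort_keys=True)
```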

Security and privacy

  • TLS only
  • authenticated ingest keys for non-local deployments
  • PII minimization: do not persist full transcripts by default
  • configurable retention for diagnostics
  • explicit user disclosure that microphone input is processed

Deployment plan

  • Public app host: https://cue.selah.tools
  • Deploy Rust ingress/matcher service as long-running instances (recommended: Fly.io in US-East; equivalent container platform acceptable)
  • Run Parakeet ASR as a separate GPU-backed service (for example Runpod/Modal/Lambda Labs class infrastructure)
  • Use managed Redis for short-lived session/cooldown state
  • Depend on existing Route Bible public endpoints (or internal mirror) for parse/QR generation
  • Keep ingress and ASR services independently scalable

UI/UX

v1 surface

UI uses two explicit states:

  1. Pre-start state (default)
    • Nearly the entire screen is a single primary CTA: Start listening
    • No dense controls shown before session start
  2. Live state (after start)
    • Real-time sermon slide view auto-follows confirmed scripture matches
    • Persistent Stop button is visible and immediately ends the live connection

Core live behavior:

  • full-screen slide presentation mode
  • each confirmed match becomes one rendered slide
  • slide updates in real time as new references are detected

Session controls and connection lifecycle

  • Start listening:
    • opens the live session UI
    • establishes the server stream connection (WS /v1/listen)
  • Stop:
    • explicitly terminates the active stream connection
    • halts further transcript/match events
    • returns UI to pre-start state

Connection states to render:

  • idle (pre-start)
  • connecting
  • live
  • stopping
  • disconnected (unexpected loss, with retry/start action)
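
The connection states can be modeled as a small transition table. The sketch below is in Python for illustration; the event names (start, opened, error, stop, lost, closed, retry) are hypothetical and only the five state names come from the spec.

```python
# (current state, event) -> next state. Event names are assumptions.
TRANSITIONS = {
    ("idle", "start"): "connecting",
    ("connecting", "opened"): "live",
    ("connecting", "error"): "disconnected",
    ("live", "stop"): "stopping",
    ("live", "lost"): "disconnected",      # unexpected loss of the stream
    ("stopping", "closed"): "idle",        # Stop returns the UI to pre-start
    ("disconnected", "retry"): "connecting",
    ("disconnected", "start"): "connecting",
}


def next_state(state: str, event: str) -> str:
    """Return the next connection state; events invalid for a state are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Ignoring invalid events (rather than raising) means a stray Stop press outside the live state leaves the UI unchanged.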

Slide composition (required)

Every slide must include:

  • scripture reference label (for example John 3:16) in a clear, readable position
  • verse text as the visual focus (large type, high contrast, sermon-readable)
  • a small Route Bible QR code in a corner on every slide

Layout constraints:

  • verse text occupies primary visual area
  • reference stays visible even for long verse text
  • QR is present but non-dominant (~64-96px target size on 1080p output)
  • safe-area padding so projector/stream crop does not hide text or QR

Slide transition behavior

  • new confirmed match triggers slide update with subtle transition (no flashy animation)
  • duplicate canonical passage inside cooldown window does not create a new slide
  • if no new confirmed match, current slide remains pinned

Secondary controls (operator)

Use a minimal control strip or panel for:

  • Stop (in live state only)
  • live transcript pane
  • recent match log (Confirmed / Suppressed)
  • copy link, copy snippet, open QR actions for latest match

Interaction design rules

  • Do not interrupt operator flow with modal dialogs
  • Prioritize slide readability over control density
  • Keep controls visually secondary to the slide canvas
  • Pre-start screen should feel intentionally sparse, with Start listening as the dominant action
  • Keep match log compact and timestamped
  • Highlight confidence and mode (Direct) in control view
  • Show clear reason on suppression (duplicate, low confidence, unstable)

Error states

  • ASR unavailable
  • Route Bible QR generation failure
  • Degraded mode: if QR generation fails, still show the canonical text match

Acceptance

Functional acceptance criteria

  1. Direct spoken references are detected and resolved to canonical passage links.
  2. Every confirmed match renders a sermon-style slide containing verse text, verse reference, and a small Route Bible QR in the corner.
  3. Initial screen presents Start listening as the dominant, near-only UI action.
  4. Pressing Start listening transitions to live slide UI and opens server stream connection.
  5. Pressing Stop closes the active stream connection and returns to pre-start UI.
  6. Matches at or above configured confidence floor generate Route Bible link + QR + snippet.
  7. Duplicate suppression prevents repeated firing for the same canonical passage within the cooldown window.
  8. Suppressed matches include clear reason codes for operator debugging.
  9. Service remains responsive under continuous speech sessions.

Quality gates

Use direct-reference transcript fixtures as baseline (clear references, abbreviated references, noisy-ASR references).

Initial target (v1):

  • Direct parse precision: >= 0.95
  • Direct parse recall on clean references: >= 0.90
  • Auto-fire precision (confidence >= 0.70): >= 0.95
  • End-to-end match latency p95: <= 1200ms after utterance boundary
  • Slide update latency p95 (confirmed match -> rendered slide): <= 300ms
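
A sketch of how these targets could be checked against fixture runs. The fixture shape (pairs of expected and emitted canonical references, with None meaning no reference) and the function names are assumptions. The p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
def precision_recall(results):
    """results: list of (expected, emitted) canonical refs; None means no reference."""
    tp = sum(1 for exp, got in results if got is not None and got == exp)
    fp = sum(1 for exp, got in results if got is not None and got != exp)
    fn = sum(1 for exp, got in results if exp is not None and got != exp)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def meets_gates(results, match_latencies_ms, slide_latencies_ms):
    """Check the v1 targets: precision, clean recall, and both latency p95s."""
    precision, recall = precision_recall(results)
    return (
        precision >= 0.95
        and recall >= 0.90
        and p95(match_latencies_ms) <= 1200
        and p95(slide_latencies_ms) <= 300
    )
```

An emission with the wrong canonical reference counts as both a false positive and a false negative here, so it lowers precision and recall together.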

Rollout plan

  1. Phase 1 (v1): direct references only (auto-fire, no paraphrase)
  2. Phase 2: improve direct reference robustness (abbreviations, partial chapter/verse wording)
  3. Phase 3: add paraphrase suggestions using Exedra retrieval (needs_confirmation=true)
  4. Phase 4: optional paraphrase auto-fire for high-confidence band

Open Questions

  1. Should v1 support one language (en) only, or include multilingual ASR/parsing?
  2. Should confidence thresholds be global or configurable per organization/session?
  3. Should QR payload be returned as URL only, or inline SVG/PNG bytes for low-latency clients?
  4. What retention policy is required for transcripts/audio in production deployments?
  5. Do we need a fallback when Parakeet/GPU is unavailable (alternate ASR provider)?
  6. For long passages, should v1 render one slide per verse or a single condensed slide block?
  7. When paraphrase mode is added later, should it always require explicit operator confirmation?