
Audio Transcription

screenpipe continuously records and transcribes audio from all your devices — microphones, speakers, and system audio — with automatic speaker identification and smart deduplication.

How It Works

1. Audio Capture: Records 30-second segments with 2-second overlap from all active devices.
2. Transcription: A Whisper model processes each segment, with automatic language detection and hallucination prevention.
3. Speaker Identification: An ONNX embedding model identifies and clusters speakers based on voice characteristics.
4. Deduplication: Smart overlap deduplication removes repeated text across segments and devices.

Recording Pipeline

30-Second Segments

All audio is recorded in 30-second chunks with 2-second overlap, regardless of batch or realtime mode.
pub async fn run_record_and_transcribe(
    audio_stream: Arc<AudioStream>,
    duration: Duration,  // Always 30s
    whisper_sender: Arc<Sender<AudioInput>>,
) -> Result<()> {
    const OVERLAP_SECONDS: usize = 2;
    // `sample_rate`, `receiver`, `recv_audio_chunk`, and `flush_audio`
    // come from the surrounding module; this listing is abbreviated.
    let overlap_samples = OVERLAP_SECONDS * sample_rate;

    let audio_samples_len = sample_rate * duration.as_secs() as usize;
    let max_samples = audio_samples_len + overlap_samples;

    // Collect audio until we have 30s + 2s overlap
    let mut collected_audio: Vec<f32> = Vec::with_capacity(max_samples);
    while collected_audio.len() < max_samples {
        match recv_audio_chunk(&mut receiver).await? {
            Some(chunk) => collected_audio.extend(chunk),
            None => continue,
        }
    }

    // Send the segment to transcription; the trailing overlap is
    // carried into the next segment
    flush_audio(&mut collected_audio, overlap_samples).await?;
    Ok(())
}
Overlap Purpose: The 2-second overlap ensures words spoken at segment boundaries aren’t cut in half. Deduplication removes the repeated text later.
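For context, here is a plausible sketch of what flush_audio does, inferred from the description above; send_to_whisper is a hypothetical stand-in for the real channel send:

async fn flush_audio(collected_audio: &mut Vec<f32>, overlap_samples: usize) -> Result<()> {
    // Ship the full buffer (30s segment + 2s overlap) for transcription.
    // `send_to_whisper` is hypothetical; the real code sends AudioInput
    // through whisper_sender.
    send_to_whisper(collected_audio.clone()).await?;

    // Keep only the trailing 2s so boundary words also appear
    // at the start of the next segment
    let keep_from = collected_audio.len().saturating_sub(overlap_samples);
    collected_audio.drain(..keep_from);
    Ok(())
}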

Audio Stream Timeout

If no audio is received for 30 seconds, the stream is considered hijacked (e.g., another app taking over the microphone).
const AUDIO_RECEIVE_TIMEOUT_SECS: u64 = 30;

match tokio::time::timeout(
    Duration::from_secs(AUDIO_RECEIVE_TIMEOUT_SECS),
    receiver.recv(),
).await {
    Ok(_chunk) => {
        // Normal path: chunk handling elided
    }
    Err(_timeout) => {
        // Stream hijacked — trigger reconnect
        audio_stream.is_disconnected.store(true, Ordering::Relaxed);
    }
}

Whisper Transcription

Model Configuration

screenpipe uses whisper.cpp with Rust bindings for fast, local transcription.
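In practice the call looks roughly like the sketch below, using the whisper-rs bindings; the model path and parameter choices are illustrative rather than screenpipe's exact configuration, and the whisper-rs API varies slightly between versions:

use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

fn transcribe(samples: &[f32]) -> Result<String, Box<dyn std::error::Error>> {
    // Load a local GGML model (path is illustrative)
    let ctx = WhisperContext::new_with_params(
        "models/ggml-base.bin",
        WhisperContextParameters::default(),
    )?;
    let mut state = ctx.create_state()?;

    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    params.set_language(Some("auto")); // auto-detect, as in the config below
    params.set_n_threads(2);           // matches the 2-thread note under Performance

    // whisper.cpp expects 16 kHz mono f32 samples
    state.full(params, samples)?;

    let mut text = String::new();
    for i in 0..state.full_n_segments()? {
        text.push_str(&state.full_get_segment_text(i)?);
    }
    Ok(text)
}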
Whisper hallucinates on silence/near-silence (e.g., “Thank you.”, “So, let’s go.”). screenpipe filters this out:
const MIN_RMS_ENERGY: f32 = 0.015;

let rms = (audio.iter().map(|s| s * s).sum::<f32>() 
          / audio.len() as f32).sqrt();
          
if rms < MIN_RMS_ENERGY {
    return Ok(String::new());  // Skip transcription
}
Why 0.015?
  • Silence: RMS = 0.0
  • Ambient noise (0.01 amplitude): RMS ≈ 0.007
  • White noise (0.1 amplitude): RMS ≈ 0.071
  • Normal speech: RMS ≈ 0.05-0.3
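For a sine wave of amplitude A, RMS = A/√2, which is where the ambient-noise figure above comes from (0.01/√2 ≈ 0.007). A quick way to sanity-check the gate on synthetic audio:

fn rms(samples: &[f32]) -> f32 {
    (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt()
}

fn main() {
    // One second of a 440 Hz tone at amplitude 0.01, sampled at 16 kHz
    let sample_rate = 16_000.0_f32;
    let tone: Vec<f32> = (0..16_000)
        .map(|i| 0.01 * (2.0 * std::f32::consts::PI * 440.0 * i as f32 / sample_rate).sin())
        .collect();
    // Prints ~0.0071: below MIN_RMS_ENERGY (0.015), so transcription is skipped
    println!("rms = {:.4}", rms(&tone));
}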

Custom Vocabulary

Bias Whisper toward specific words (names, jargon, acronyms):
[
  {
    "word": "screenpipe",
    "replacement": "screenpipe"
  },
  {
    "word": "llama",
    "replacement": "Llama"
  },
  {
    "word": "gpt",
    "replacement": "GPT"
  }
]
Whisper’s initial_prompt parameter biases the model toward these words without forcing them. It’s not a find-and-replace — it improves recognition accuracy.
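A sketch of how such a list could be folded into the prompt; build_initial_prompt is a hypothetical helper, and the exact prompt format screenpipe uses may differ:

// Hypothetical helper: turn vocabulary entries into an initial prompt.
// Whisper conditions its decoder on this text, nudging it toward these
// spellings without forcing them.
fn build_initial_prompt(vocabulary: &[(&str, &str)]) -> String {
    let words: Vec<&str> = vocabulary
        .iter()
        .map(|(_, replacement)| *replacement)
        .collect();
    format!("Vocabulary: {}.", words.join(", "))
}

// build_initial_prompt(&[("screenpipe", "screenpipe"), ("llama", "Llama"), ("gpt", "GPT")])
// => "Vocabulary: screenpipe, Llama, GPT."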

Speaker Identification

screenpipe uses speaker embedding models (ONNX) to identify and cluster speakers based on voice characteristics.

How It Works

1. Audio Segmentation: Split the audio into voice activity segments (pauses indicate likely speaker changes).
2. Embedding Extraction: Compute voice embeddings (numerical fingerprints) for each segment.
3. Speaker Clustering: Group similar embeddings into speaker clusters using cosine similarity.
4. Speaker Assignment: Assign each segment to a speaker ID (Speaker 0, Speaker 1, etc.).
pub struct EmbeddingExtractor {
    session: Session,
}

impl EmbeddingExtractor {
    pub fn compute(&mut self, samples: &[f32]) -> Result<Vec<f32>> {
        // Compute fbank features (mel-frequency filterbank)
        let features: Array2<f32> = knf_rs::compute_fbank(samples)?;
        // Add a batch dimension: (frames, bins) -> (1, frames, bins)
        let features = features.insert_axis(ndarray::Axis(0));

        // Run the ONNX speaker-embedding model
        let inputs = ort::inputs!["feats" => features.view()]?;
        let ort_outs = self.session.run(inputs)?;

        // Extract the embedding vector from the "embs" output tensor
        let embeddings = ort_outs
            .get("embs")
            .ok_or_else(|| anyhow::anyhow!("missing 'embs' output"))?
            .try_extract_tensor::<f32>()?;

        Ok(embeddings.iter().copied().collect())
    }
}
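Clustering (step 3) compares these embeddings with cosine similarity. A minimal sketch of greedy speaker assignment; the 0.5 threshold is illustrative, not screenpipe's actual value:

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

// Assign an embedding to the most similar known speaker,
// or create a new speaker if nothing clears the threshold.
fn assign_speaker(embedding: &[f32], speakers: &mut Vec<Vec<f32>>, threshold: f32) -> usize {
    if let Some((id, sim)) = speakers
        .iter()
        .enumerate()
        .map(|(i, s)| (i, cosine_similarity(embedding, s)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
    {
        if sim >= threshold {
            return id; // existing speaker
        }
    }
    speakers.push(embedding.to_vec()); // new speaker
    speakers.len() - 1
}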
Speaker IDs are consistent within a session but may change across restarts. For persistent speaker names, use the speaker management API to label speakers.

Audio Device Management

Multi-Device Recording

screenpipe records from all devices simultaneously:
  • Built-in microphone
  • External USB microphones
  • Bluetooth headsets
  • Virtual audio inputs (Loopback, BlackHole)
Output device recording captures what your computer is playing (YouTube, Zoom calls, music). On macOS, this requires a virtual audio device like BlackHole.

Device Monitoring

screenpipe automatically detects when devices are added or removed:
// The device monitor polls for changes every 5 seconds
let mut interval = tokio::time::interval(Duration::from_secs(5));
loop {
    interval.tick().await;
    let device_changes = detect_audio_device_changes().await;

    if !device_changes.is_empty() {
        // Restart streams for new/removed devices
        reconcile_audio_streams(device_changes).await?;
    }
}
Stream hijacking: If another app takes over a microphone (e.g., Wispr Flow), screenpipe detects the timeout and automatically reconnects when the device becomes available again.

Deduplication

Why Deduplication?

  1. Segment overlap: 2-second overlap between 30s chunks creates duplicate text
  2. Multi-device recording: Same audio captured by mic + speakers
  3. Echo/feedback: Speaker playback captured by microphone

Deduplication Algorithm

// Remove duplicate text created by the segment overlap
fn deduplicate_overlap(
    prev_segment: &str,
    curr_segment: &str,
    overlap_duration_secs: usize,
) -> String {
    // Find the longest suffix of the previous segment that is
    // also a prefix of the current one
    let overlap_text = find_overlap(prev_segment, curr_segment);

    // Drop that overlap from the start of the current segment
    curr_segment
        .strip_prefix(overlap_text)
        .unwrap_or(curr_segment)
        .to_string()
}
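A character-level sketch of find_overlap; the real implementation may match at the token or word level instead:

// Longest prefix of `curr` that is also a suffix of `prev`
fn find_overlap<'a>(prev: &str, curr: &'a str) -> &'a str {
    let max = prev.len().min(curr.len());
    for len in (1..=max).rev() {
        if curr.is_char_boundary(len) && prev.ends_with(&curr[..len]) {
            return &curr[..len];
        }
    }
    ""
}

// find_overlap("we should ship it", "ship it tomorrow") => "ship it"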

Performance

Transcription Speed

  • tiny: ~2x realtime (30s audio in 15s)
  • base: ~1x realtime (30s audio in 30s)
  • small: ~0.5x realtime (30s audio in 60s)
Whisper transcription runs on 2 threads to avoid blocking the main process.

Resource Usage

Audio Recording:    <1% CPU, ~50MB RAM per device
Whisper (tiny):     10-20% CPU, ~200MB RAM
Whisper (base):     20-30% CPU, ~400MB RAM
Speaker ID:         <5% CPU, ~100MB RAM

Configuration

{
  "audio": {
    "model": "base",           // tiny, base, small, medium, large
    "language": "auto",        // auto-detect or specify (en, es, fr, etc.)
    "devices": [
      "MacBook Pro Microphone",
      "BlackHole 2ch"          // Virtual device for system audio
    ],
    "speaker_identification": true,
    "deduplication": true
  }
}

Reference

Source files:
  • Recording pipeline: crates/screenpipe-audio/src/core/run_record_and_transcribe.rs
  • Whisper transcription: crates/screenpipe-audio/src/transcription/whisper/batch.rs
  • Speaker embedding: crates/screenpipe-audio/src/speaker/embedding.rs
  • Device management: crates/screenpipe-audio/src/device/device_manager.rs
  • Audio manager: crates/screenpipe-audio/src/audio_manager/manager.rs