Performance Tuning

This guide covers techniques to maximize RavenRustRAG throughput and minimize latency.

Embedding Cache

RavenRustRAG uses a concurrent in-memory embedding cache (DashMap, a sharded concurrent hash map) that avoids redundant embedding API calls. The cache is especially effective when:

  • Re-indexing documents with overlapping content
  • Running multiple queries with similar phrasing
  • Using the watch command with incremental updates

Cache stats are available via:

ravenrag info --verbose
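
To make the caching idea concrete, here is a minimal Python sketch of a content-addressed embedding cache with hit/miss counters. It is not RavenRustRAG's implementation: the `EmbeddingCache` class, the SHA-256 keying scheme, and the `fake_embed` stand-in are all assumptions made for illustration. With this keying scheme, only byte-identical text produces a hit.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Illustrative cache keyed by a hash of the text (hypothetical design)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn  # stand-in for the real embedding backend
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Identical text always maps to the same key.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: call the backend once and remember the result.
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = vector
        return vector

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_embed(text: str) -> List[float]:
    # Placeholder for an embedding API call; returns a toy 2-d vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = EmbeddingCache(fake_embed)
for chunk in ["alpha", "beta", "alpha", "alpha"]:
    cache.get(chunk)
print(cache.hits, cache.misses, cache.hit_rate())
```

Re-indexing a document whose chunks are mostly unchanged would, under these assumptions, reach the embedding backend only for the chunks that actually differ.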

SQLite Optimizations

The following PRAGMAs are applied automatically:

| PRAGMA | Value | Effect |
|---|---|---|
| journal_mode | WAL | Concurrent reads during writes |
| synchronous | NORMAL | Faster writes (safe with WAL) |
| cache_size | -64000 | 64 MB page cache |
| mmap_size | 268435456 | 256 MB memory-mapped I/O |
| busy_timeout | 5000 | 5 s retry on lock contention |
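
RavenRustRAG applies these settings itself, so no action is needed. If you want to reproduce them on a scratch database to check what your SQLite build accepts, the following sketch uses Python's standard `sqlite3` module. The file name and the `open_tuned` and `read_pragma` helpers are hypothetical. Some builds cap or disable `mmap_size`, so the value read back may differ from the value requested.

```python
import os
import sqlite3
import tempfile

# Values taken from the PRAGMA list above.
PRAGMAS = {
    "journal_mode": "WAL",
    "synchronous": "NORMAL",
    "cache_size": -64000,    # negative values are interpreted as KiB
    "mmap_size": 268435456,  # bytes
    "busy_timeout": 5000,    # milliseconds
}


def open_tuned(path: str) -> sqlite3.Connection:
    """Open a database and apply each PRAGMA in turn."""
    conn = sqlite3.connect(path)
    for name, value in PRAGMAS.items():
        conn.execute(f"PRAGMA {name} = {value}").fetchall()
    return conn


def read_pragma(conn: sqlite3.Connection, name: str):
    """Return the value SQLite reports for a PRAGMA, or None if it returns no row."""
    row = conn.execute(f"PRAGMA {name}").fetchone()
    return row[0] if row else None


# WAL needs a real file; an in-memory database reports journal_mode "memory".
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = open_tuned(db_path)
effective = {name: read_pragma(conn, name) for name in PRAGMAS}
conn.close()
print(effective)
```

Note that `synchronous` is reported as an integer (1 for NORMAL) and `journal_mode` as a lowercase string.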

WAL Mode

Write-Ahead Logging lets readers proceed while a write is in progress: each reader sees the most recently committed snapshot, and only one writer runs at a time. This is critical for the server, where queries and indexing happen concurrently.
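
The behavior can be observed with two connections to the same file. This is a standalone sketch using Python's `sqlite3` module, not RavenRustRAG code; the `chunks` table and file name are made up for the demonstration.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal_demo.db")


def count_rows(conn: sqlite3.Connection) -> int:
    # fetchall() drains the cursor so the read transaction ends right away.
    return conn.execute("SELECT COUNT(*) FROM chunks").fetchall()[0][0]


# isolation_level=None gives autocommit mode, so transactions are
# controlled explicitly with BEGIN and COMMIT.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode = WAL").fetchall()
writer.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT)")
writer.execute("INSERT INTO chunks (body) VALUES ('first')")

reader = sqlite3.connect(path, isolation_level=None)

# Start a write transaction and leave it open.
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO chunks (body) VALUES ('second')")

# The reader is not blocked and sees only committed data.
during_write = count_rows(reader)

writer.execute("COMMIT")
after_commit = count_rows(reader)

print(during_write, after_commit)
reader.close()
writer.close()
```

The read issued while the write transaction is open returns immediately with the pre-commit row count; the new row becomes visible once the writer commits.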

Memory-Mapped I/O

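With mmap enabled, SQLite accesses database pages through a memory mapping of the file instead of copying them into its own userspace page cache with read() calls, so hot data is served straight from the OS page cache. This avoids a syscall and a memory copy per page and is especially beneficial for large indexes. Only the first mmap_size bytes of the database file (256 MB here) are mapped; pages beyond that limit fall back to regular reads.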

Batch Sizes

Embedding and storage operations are batched. Tune batch sizes based on your hardware:

  • Embedding batch: Number of texts sent per API call (default varies by backend)
  • Concurrent batches: Controlled by semaphore (default: 4 concurrent embedding calls)

Larger batches reduce API overhead but increase memory usage and latency per batch.
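
The interaction between batch size and the concurrency limit can be sketched as follows. This is an illustrative Python model, not RavenRustRAG's code: `make_batches`, `embed_all`, and the simulated API call are hypothetical, and only the default of 4 concurrent calls comes from the text above.

```python
import asyncio
from typing import List, Sequence, Tuple


def make_batches(texts: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split texts into consecutive batches of at most batch_size items."""
    return [list(texts[i:i + batch_size]) for i in range(0, len(texts), batch_size)]


async def embed_all(
    texts: Sequence[str], batch_size: int = 32, max_concurrent: int = 4
) -> Tuple[List[List[float]], int]:
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def embed_batch(batch: List[str]) -> List[List[float]]:
        nonlocal in_flight, peak
        async with semaphore:  # at most max_concurrent batches run at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for one embedding API call
            in_flight -= 1
            return [[float(len(t))] for t in batch]

    batches = make_batches(texts, batch_size)
    # gather() returns results in submission order, so output order matches input.
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    vectors = [vector for batch in results for vector in batch]
    return vectors, peak


texts = [f"chunk {i}" for i in range(100)]
vectors, peak_concurrency = asyncio.run(
    embed_all(texts, batch_size=10, max_concurrent=4)
)
print(len(vectors), peak_concurrency)
```

In this model, peak memory scales roughly with batch size times the number of concurrent batches, which is why raising both at once can increase memory use quickly.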

Chunk Size Tuning

| Chunk Size | Pros | Cons |
|---|---|---|
| 256 chars | Fine-grained retrieval, precise matches | More chunks to store/search, fragmented context |
| 512 chars (default) | Good balance of precision and context | |
| 1024 chars | More context per result, fewer chunks | Less precise matches, may include irrelevant text |

For technical documentation, 512 characters is usually the best starting point. For narrative content, 1024 characters may work better.
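
To see how chunk size affects the number of chunks to embed and store, the sketch below splits a synthetic document at the three sizes discussed above. It assumes plain fixed-width character chunking with no overlap, which may differ from RavenRustRAG's actual splitter; `chunk_text` is a hypothetical helper.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A synthetic 10,000-character document.
document = "word " * 2000

counts: Dict[int, int] = {
    size: len(chunk_text(document, size)) for size in (256, 512, 1024)
}
print(counts)  # {256: 40, 512: 20, 1024: 10}
```

Halving the chunk size roughly doubles the number of embedding calls and stored vectors for the same corpus.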

Hybrid Search Performance

BM25 search adds minimal overhead since terms are pre-computed during indexing. The RRF fusion step scores results in O(n) time, where n is the number of candidate results, followed by a sort of the fused list. For most use cases, hybrid search adds less than 1 ms to query time.
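
For reference, Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) in every ranking it appears in and sums those scores. The sketch below is a generic Python version, not RavenRustRAG's implementation. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as are the `rrf_fuse` name and the sample document IDs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # One pass over each list: linear in the total number of results.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # The final ordering needs a sort over the fused candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical BM25 ranking

fused = rrf_fuse([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above those that appear in only one, regardless of how the underlying scores were scaled.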

Hardware Recommendations

| Component | Minimum | Recommended |
|---|---|---|
| RAM | 512 MB | 2+ GB (for large indexes with mmap) |
| Storage | HDD works | SSD strongly recommended |
| CPU | Any x86_64/arm64 | Multi-core for concurrent embedding |

Benchmarking

Use the built-in benchmark command to measure your system's performance:

ravenrag benchmark --num-docs 1000 --iterations 100

This generates synthetic documents, indexes them, and measures query latency percentiles.
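
If you want to measure latency percentiles around your own workload, the same idea can be sketched in a few lines of Python. This is not how `ravenrag benchmark` is implemented; the `measure` and `percentile` helpers and the dummy workload are hypothetical, and the nearest-rank percentile method is just one common choice.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure(fn: Callable[[], object], iterations: int) -> List[float]:
    """Run fn repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Dummy workload standing in for a query against the index.
latencies = measure(lambda: sum(range(1000)), iterations=100)
summary: Dict[int, float] = {p: percentile(latencies, p) for p in (50, 95, 99)}
for p, value in summary.items():
    print(f"p{p}: {value:.4f} ms")
```

Tail percentiles such as p95 and p99 are usually more informative than the mean, because a few slow queries can hide behind a good average.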

Monitoring

The HTTP server exposes /metrics in Prometheus format with:

  • Request count by endpoint
  • Latency histograms
  • Active connections
  • Cache hit/miss rates