TL;DR
- Real-time audio capture and spectral analysis driving GLSL fragment shaders
- Configurable FFT pipeline with 6 interpolation and 3 collation strategies for log-frequency output
- Persistent mapped SSBOs for zero-copy CPU-to-GPU data transfer
- Double-buffered feedback system for ping-pong shader effects
- Hot-reload of shaders and config files with on-screen error display
- Expression parser for dynamic buffer sizing based on runtime variables
- 66 MB private memory, flat under sustained hot-reload stress testing
- Runs on desktop Linux (x86_64) and Raspberry Pi 4 (aarch64, VideoCore VI)
Introduction
This project started from frustration with existing audio visualizers. Most are either locked into a fixed visual style, require a full browser runtime, or treat the audio data as an afterthought by handing the shader author a vague "volume" value and calling it done. I wanted a tool where the shader author has direct access to properly processed spectral data, peak and RMS metering, and persistent GPU memory, all configurable per-shader through a simple text file, with changes visible instantly on save.
The result is a standalone audio visualization engine that captures system audio, performs real-time FFT analysis with configurable output shaping, and drives fragment shaders at display refresh rate. Shader authors write GLSL and a config file. The engine handles everything else: audio capture, spectral analysis, temporal smoothing, peak tracking, GPU buffer management, and hot-reload.
Project Goals
- Give shader authors direct, configurable access to real audio analysis data
- Support rapid iteration with instant hot-reload on save
- Keep the audio and rendering pipeline lean enough for a Raspberry Pi 4
- Make the audio backend reusable and modular for other projects
- Provide sensible defaults so a minimal shader works immediately
- Handle errors gracefully with on-screen feedback instead of crashes
System Architecture
The engine is structured as a pipeline with clear boundaries between audio capture, analysis, formatting, and rendering:
Audio Thread (miniaudio callback)
|
Ring Buffer (lock-free, power-of-two masked)
|
Main Thread: Accumulate hops, run FFT, measure Peak/RMS
|
AVBridge: Format, interpolate/collate, smooth, track holds
|
GPU Buffers: Persistent mapped SSBOs (6 bindings)
|
Fragment Shader: User GLSL with injected header
Audio capture runs on miniaudio's dedicated callback thread, writing into a lock-free ring buffer. The main thread consumes hop-sized windows from the ring buffer, runs FFT analysis, and formats the output through the AVBridge layer. Formatted data is written directly into persistent mapped GPU buffers with no intermediate copies, and the shader runs every frame at vsync.
Audio Capture and Ring Buffer
System audio is captured through miniaudio's loopback/monitor device at the system's native sample rate and channel count. The capture callback writes interleaved float samples into a power-of-two ring buffer using atomic index management.
The ring buffer uses bitwise masking rather than modulo for index wrapping, and stores channel count internally so all external interfaces operate in frame counts rather than sample counts. The main thread tracks accumulated frames with an atomic counter and consumes hop-sized windows (fftSize / 4) for overlapped analysis.
A mono-summed window extraction method reads from the ring buffer and averages across channels in a single pass, avoiding a separate mixing step before FFT input.
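The core of this can be sketched as follows. This is a simplified single-producer/single-consumer illustration, not the engine's actual code: the class and method names are invented, and it omits overrun protection and the hop-accumulation counter described above.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Power-of-two ring buffer storing interleaved float frames. Indices wrap
// with a bitwise mask instead of modulo; the channel count is stored
// internally so every public method speaks in frames, not samples.
class RingBuffer {
public:
    RingBuffer(size_t framesPow2, size_t channels)
        : mask_(framesPow2 - 1), channels_(channels),
          data_(framesPow2 * channels, 0.0f) {}

    // Called from the audio thread: write interleaved frames.
    void write(const float* frames, size_t frameCount) {
        size_t w = writeFrame_.load(std::memory_order_relaxed);
        for (size_t f = 0; f < frameCount; ++f) {
            size_t base = ((w + f) & mask_) * channels_;
            for (size_t c = 0; c < channels_; ++c)
                data_[base + c] = frames[f * channels_ + c];
        }
        writeFrame_.store(w + frameCount, std::memory_order_release);
    }

    // Called from the main thread: read the most recent `frameCount`
    // frames, mono-summed across channels in a single pass.
    void readMonoLatest(float* out, size_t frameCount) {
        size_t w = writeFrame_.load(std::memory_order_acquire);
        size_t start = w - frameCount; // unsigned wrap is fine with masking
        for (size_t f = 0; f < frameCount; ++f) {
            size_t base = ((start + f) & mask_) * channels_;
            float sum = 0.0f;
            for (size_t c = 0; c < channels_; ++c) sum += data_[base + c];
            out[f] = sum / static_cast<float>(channels_);
        }
    }

    size_t framesWritten() const {
        return writeFrame_.load(std::memory_order_acquire);
    }

private:
    size_t mask_, channels_;
    std::vector<float> data_;
    std::atomic<size_t> writeFrame_{0};
};
```

The masking trick requires the capacity to be a power of two; in exchange, every index wrap is a single AND instead of a division.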
FFT and Spectral Analysis
The FFT uses FFTW3 (float precision) with a real-to-complex plan. The engine runs at a fixed order of 13 (8192 samples) with 4x overlap (2048-sample hops), giving roughly 23 spectral frames per second at 48 kHz.
Before the transform, an optional Hann window with energy normalization is applied. After the transform, output is converted to power, magnitude, or decibels depending on the shader's spec configuration. An optional perceptual frequency tilt is applied through a cached scalar table, supporting common slopes like 3.0 dB/octave (pink noise reference) or 4.5 dB/octave (music-focused, similar to FabFilter Pro-Q's analyzer).
Peak and RMS measurements are computed per-channel or mono-summed from the same audio window used for FFT, keeping time alignment between spectral and level data.
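The cached tilt table described above can be sketched like this. The pivot frequency (1 kHz here) and the magnitude-domain application are assumptions for illustration; the write-up does not specify the engine's actual choices.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-bin tilt scalars, computed once and cached. A slope of +3.0 dB/octave
// boosts each octave above the pivot by 3 dB and cuts symmetrically below it.
std::vector<float> makeTiltTable(size_t bins, double sampleRate,
                                 size_t fftSize, double dbPerOctave,
                                 double pivotHz = 1000.0) {
    std::vector<float> table(bins, 1.0f);
    const double binWidth = sampleRate / static_cast<double>(fftSize);
    for (size_t i = 1; i < bins; ++i) { // skip DC: log2(0) is undefined
        double freq = i * binWidth;
        double gainDb = dbPerOctave * std::log2(freq / pivotHz);
        table[i] = static_cast<float>(std::pow(10.0, gainDb / 20.0));
    }
    return table;
}
```

Because the table depends only on the sample rate, FFT size, and slope, it is recomputed on config changes rather than per frame.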
Custom-Size FFT Output
The most involved part of the audio pipeline is the custom-size output mode, which remaps raw FFT bins onto an arbitrary number of output indices using logarithmic frequency spacing. This is the feature that lets shader authors write customFFTSize = WINDOW_WIDTH in their config and get one FFT value per pixel column, without ever thinking about frequency-to-bin math.
The output is split into two regions based on bin density. Below a computed crossover frequency, FFT bins are sparser than output indices, so the engine interpolates between bins. Above the crossover, multiple bins map to each output index, so the engine collates them.
Six interpolation strategies are available for the low end:
- Linear: simple two-point interpolation
- PCHIP: Piecewise Cubic Hermite with Fritsch-Carlson slopes for monotone output
- Lanczos: 4-lobe windowed sinc for sharp reconstruction
- Gaussian: weighted average with configurable sigma
- Cubic B-Spline: very smooth, slightly blurs peaks
- Akima: local cubic resistant to outlier distortion
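Of these, PCHIP is the least obvious to implement. A minimal sketch for uniformly spaced bins follows; it uses the Fritsch-Carlson tangent rule simplified for unit spacing, and is illustrative only, since the engine's actual code may differ in detail:

```cpp
#include <cstddef>
#include <vector>

// Monotone PCHIP interpolation over bin values y at fractional position x.
// Fritsch-Carlson tangents guarantee the curve never overshoots the data.
float pchip(const std::vector<float>& y, float x) {
    size_t n = y.size();
    if (x <= 0.0f) return y.front();
    if (x >= static_cast<float>(n - 1)) return y.back();
    size_t i = static_cast<size_t>(x);
    float t = x - static_cast<float>(i);

    auto slope = [&](size_t k) -> float { // Fritsch-Carlson tangent at knot k
        if (k == 0) return y[1] - y[0];
        if (k == n - 1) return y[n - 1] - y[n - 2];
        float dl = y[k] - y[k - 1], dr = y[k + 1] - y[k];
        if (dl * dr <= 0.0f) return 0.0f;  // local extremum: flat tangent
        return 2.0f * dl * dr / (dl + dr); // harmonic mean keeps monotonicity
    };

    float m0 = slope(i), m1 = slope(i + 1);
    // Cubic Hermite basis on [0, 1] with unit knot spacing.
    float t2 = t * t, t3 = t2 * t;
    return y[i] * (2 * t3 - 3 * t2 + 1) + m0 * (t3 - 2 * t2 + t)
         + y[i + 1] * (-2 * t3 + 3 * t2) + m1 * (t3 - t2);
}
```

The flat-tangent rule at local extrema is what makes PCHIP attractive for spectral data: a narrow peak between two quiet bins stays a peak instead of ringing into its neighbors, which Lanczos will happily do.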
Three collation strategies handle the high end:
- RMS: root mean square of bins per output range
- Peak: maximum value per range
- Power-Weighted Mean: emphasizes louder components in each range
The crossover point is computed dynamically based on where bin density exceeds index density, so the transition adapts automatically to different output sizes and sample rates.
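The crossover computation can be sketched as follows. The function names, the minimum frequency, and the exact comparison are assumptions; the idea is simply to find the first log-spaced output index whose spacing to its neighbor exceeds one FFT bin width:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Log-spaced target frequencies for `count` output indices.
std::vector<double> logFrequencies(size_t count, double fMin, double fMax) {
    std::vector<double> f(count);
    double ratio = std::log(fMax / fMin) / static_cast<double>(count - 1);
    for (size_t i = 0; i < count; ++i)
        f[i] = fMin * std::exp(ratio * static_cast<double>(i));
    return f;
}

// First output index where FFT bins become denser than output indices:
// interpolate below this index, collate at and above it.
size_t findCrossover(const std::vector<double>& freqs,
                     double sampleRate, size_t fftSize) {
    const double binWidth = sampleRate / static_cast<double>(fftSize);
    for (size_t i = 0; i + 1 < freqs.size(); ++i)
        if (freqs[i + 1] - freqs[i] >= binWidth)
            return i;
    return freqs.size(); // output everywhere denser than bins: all interpolated
}
```

Because log spacing between consecutive output frequencies grows monotonically, a single forward scan finds the crossover, and changing the output size or sample rate just moves where that scan stops.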
Temporal Smoothing and Peak Holds
Raw spectral data changes dramatically between analysis frames. To produce visually stable output, the engine applies asymmetric temporal smoothing with separate attack and release times.
Smoothing is implemented as a Structure-of-Arrays (SoA) layout: parallel vectors for current values, target values, per-element increments, and remaining step counts. When a new target is set, the step count and increment are calculated once based on whether the value is rising (attack) or falling (release). During steady state, elements that have reached their target skip all smoothing math entirely.
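A minimal sketch of that SoA smoother, with invented names and attack/release expressed directly as frame counts:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Structure-of-Arrays smoother: parallel vectors for current values,
// per-element increments, and remaining step counts. The increment is
// computed once per new target; settled elements skip all per-tick math.
struct Smoother {
    std::vector<float> current, increment;
    std::vector<int>   steps;
    int attackFrames, releaseFrames;

    Smoother(size_t n, int attack, int release)
        : current(n, 0.0f), increment(n, 0.0f), steps(n, 0),
          attackFrames(attack), releaseFrames(release) {}

    void setTargets(const std::vector<float>& targets) {
        for (size_t i = 0; i < current.size(); ++i) {
            float delta = targets[i] - current[i];
            int n = (delta > 0.0f) ? attackFrames : releaseFrames;
            if (n <= 0 || delta == 0.0f) { // nothing to ramp: settle at once
                current[i] = targets[i];
                steps[i] = 0;
                continue;
            }
            increment[i] = delta / static_cast<float>(n);
            steps[i] = n;
        }
    }

    void tick() { // called once per render frame
        for (size_t i = 0; i < current.size(); ++i) {
            if (steps[i] == 0) continue; // settled: skip all smoothing math
            current[i] += increment[i];
            --steps[i];
        }
    }
};
```

The SoA layout keeps the hot `tick()` loop walking three flat arrays in order, which is far friendlier to the cache than an array of per-element structs, especially at per-pixel FFT sizes.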
Peak hold tracking runs on top of the smoothed output. Each element maintains a countdown timer. While the timer is active, the hold value stays locked. After it expires, the value decays exponentially by a configurable scalar per frame. If the current smoothed value exceeds the hold at any point, the hold resets to the new peak and the countdown restarts.
Both systems operate on the FFT output array and the peak/RMS measurements independently, with all timing parameters configurable per-shader through the spec file.
GPU Buffer Architecture
Data reaches the GPU through six shader storage buffer objects (SSBOs) using persistent mapped memory:
- Binding 0: Peak/RMS data (2-4 floats depending on channel mode)
- Binding 1: FFT output array (size varies by config)
- Binding 2: Peak/RMS hold values
- Binding 3: FFT hold values
- Bindings 4-5: Double-buffered feedback (ping-pong)
Buffers are allocated with GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT, giving the CPU a pointer that remains valid for the buffer's lifetime. Each frame, the engine writes directly through this pointer with a single memcpy per buffer, avoiding the overhead of glBufferSubData or repeated map/unmap cycles.
The feedback buffer pair implements a GPU-side ping-pong: each frame the engine flips which SSBO is bound as readonly (binding 4) and which is writeonly (binding 5). The shader reads last frame's output from feedbackIn[] and writes this frame's state to feedbackOut[]. This enables spectrograms, particle systems, trails, and any effect that needs frame-to-frame persistence, all without compute shaders or render-to-texture passes.
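The pattern is easiest to see with the GPU stripped away. In this CPU-side illustration, two plain vectors stand in for the two SSBOs, and a decaying-trail rule stands in for the shader; the engine itself only flips the binding indices:

```cpp
#include <cstddef>
#include <vector>

// CPU-side illustration of the ping-pong pattern. Each frame reads last
// frame's state from the `front` buffer and writes new state to the other,
// then the roles swap, exactly as the SSBO bindings flip on the GPU.
struct PingPong {
    std::vector<float> buf[2];
    int front = 0; // index of this frame's "read" buffer

    explicit PingPong(size_t n) {
        buf[0].assign(n, 0.0f);
        buf[1].assign(n, 0.0f);
    }

    // Example per-frame rule: a decaying trail fed by the new signal,
    // standing in for whatever the user's shader computes.
    void frame(const std::vector<float>& signal, float decay) {
        const std::vector<float>& in = buf[front]; // feedbackIn
        std::vector<float>& out = buf[1 - front];  // feedbackOut
        for (size_t i = 0; i < in.size(); ++i)
            out[i] = in[i] * decay + signal[i];
        front = 1 - front; // flip bindings for the next frame
    }

    const std::vector<float>& state() const { return buf[front]; }
};
```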
On preset swap or resize, a single glFinish() drains the pipeline before all six buffers are freed and reallocated in batch, rather than stalling the pipeline per-buffer.
Shader System and Hot-Reload
Each shader preset is a directory containing a frag.glsl and an optional spec.cfg. On startup, the engine loads all preset directories, compiles each shader with an injected header providing uniforms, SSBOs, coordinate helpers, and a built-in bitmap font. Invalid presets are flagged and displayed using a built-in error shader that renders the error message directly on screen.
Hot-reload works by polling file modification timestamps each frame. When a change is detected, the engine reloads the fragment source and spec, attempts to compile the new shader, and if successful, swaps the audio configuration, resizes GPU buffers, and begins rendering the updated shader. If compilation or parsing fails, the error is displayed and the previous working state is preserved.
The spec parser supports a key-value format with comments, enum values by name or index, and math expressions for buffer sizes. Expressions can reference runtime variables like WINDOW_WIDTH, SAMPLE_RATE, and FFT_SIZE, allowing buffer sizes to adapt dynamically to the display and audio environment. The expression evaluator handles addition, subtraction (clamped to zero), multiplication, division (with zero-check), and parenthesized grouping, with all intermediate math in double precision before truncating the final result to an unsigned integer.
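A recursive-descent sketch of that evaluator follows. The grammar mirrors the description above (addition, zero-clamped subtraction, multiplication, zero-checked division, parentheses, numbers, named variables, double-precision intermediates, final truncation); the class name and error handling are illustrative, and the engine's actual parser may differ:

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

class ExprEval {
public:
    ExprEval(std::string src, std::map<std::string, double> vars)
        : s_(std::move(src)), vars_(std::move(vars)) {}

    unsigned evaluate() {
        pos_ = 0;
        double v = expr();
        return v < 0.0 ? 0u : static_cast<unsigned>(v); // truncate at the end
    }

private:
    // expr := term (('+' | '-') term)*    subtraction clamps to zero
    double expr() {
        double v = term();
        while (peek() == '+' || peek() == '-') {
            char op = s_[pos_++];
            double rhs = term();
            v = (op == '+') ? v + rhs : std::max(0.0, v - rhs);
        }
        return v;
    }
    // term := factor (('*' | '/') factor)*    division is zero-checked
    double term() {
        double v = factor();
        while (peek() == '*' || peek() == '/') {
            char op = s_[pos_++];
            double rhs = factor();
            if (op == '/') {
                if (rhs == 0.0) throw std::runtime_error("division by zero");
                v /= rhs;
            } else v *= rhs;
        }
        return v;
    }
    // factor := number | variable | '(' expr ')'
    double factor() {
        char c = peek();
        if (c == '(') {
            ++pos_;
            double v = expr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;
            return v;
        }
        if (std::isdigit(static_cast<unsigned char>(c)) || c == '.') {
            size_t len = 0;
            double v = std::stod(s_.substr(pos_), &len);
            pos_ += len;
            return v;
        }
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            size_t start = pos_;
            while (pos_ < s_.size() &&
                   (std::isalnum(static_cast<unsigned char>(s_[pos_])) ||
                    s_[pos_] == '_'))
                ++pos_;
            auto it = vars_.find(s_.substr(start, pos_ - start));
            if (it == vars_.end()) throw std::runtime_error("unknown variable");
            return it->second;
        }
        throw std::runtime_error("unexpected character");
    }
    char peek() { skipWs(); return pos_ < s_.size() ? s_[pos_] : '\0'; }
    void skipWs() { while (pos_ < s_.size() && s_[pos_] == ' ') ++pos_; }

    std::string s_;
    std::map<std::string, double> vars_;
    size_t pos_ = 0;
};
```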
Error Handling Philosophy
A core design goal was that nothing the user does in their shader or config file should crash the engine. Shader compile errors, spec parse errors, missing files, deleted directories, and invalid buffer sizes all result in on-screen error messages rather than segfaults or silent failures. The error shader renders text using the built-in CP437 bitmap font, with the character array and length cached so the GPU upload only happens when the message changes.
Buffer sizes are clamped against both a hardcoded maximum and the GPU's reported GL_MAX_SHADER_STORAGE_BLOCK_SIZE. SSBO binding count is checked at startup against the engine's requirement of 6 bindings. Texture paths are validated to reject directory traversal before any filesystem access.
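The traversal check can be as simple as rejecting any `..` path component before the filesystem is touched. This sketch is an assumption about the policy, since the write-up only states that traversal is rejected; absolute paths are also refused here, and symlink handling is out of scope:

```cpp
#include <cstddef>
#include <string>

// Reject absolute paths and any path containing a ".." component,
// before any filesystem access happens.
bool isSafeRelativePath(const std::string& path) {
    if (path.empty() || path[0] == '/') return false; // relative paths only
    size_t start = 0;
    while (start <= path.size()) {
        size_t end = path.find('/', start);
        if (end == std::string::npos) end = path.size();
        if (path.substr(start, end - start) == "..") return false;
        start = end + 1;
    }
    return true;
}
```

Checking components rather than substrings matters: a filename like `my..texture.png` is legitimate, while `textures/../../secret` is not.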
Results From Profiling
The engine's total resident memory (RSS) is approximately 150 MB, but the proportional set size (PSS) is around 83 MB, with most of the shared portion being common libraries (libc, libGL, etc.) that would be loaded regardless.
The meaningful number is 66 MB of private anonymous memory: heap allocations, stack, audio buffers, FFT working memory, smoothing arrays, and GPU driver state. This stayed completely flat across extended hot-reload stress testing with rapid shader and config swaps, confirming no leaks in the reload path.
For context, a comparable shader tool running in a browser tab typically consumes 150-300+ MB due to browser overhead. At 66 MB private for a standalone engine with real-time audio analysis, multiple interpolation strategies, and persistent GPU buffers, the footprint is lean.
The engine runs comfortably on a Raspberry Pi 4 (VideoCore VI, OpenGL ES 3.1) at 1080p for most shaders. Feedback shaders using full-resolution buffers (W * H * 4 floats) can exceed the Pi's SSBO limits at higher resolutions, which the engine handles gracefully by clamping to the GPU's reported maximum.
What I Learned
- Designing a multi-stage audio analysis pipeline with clean boundaries between capture, analysis, formatting, and rendering
- Lock-free ring buffer design with atomic index management for audio thread safety
- Log-frequency remapping with adaptive interpolation/collation crossover
- Practical tradeoffs between interpolation strategies (PCHIP vs Lanczos vs Gaussian) for spectral data
- Asymmetric temporal smoothing with SoA layout for cache-friendly bulk operations
- Persistent mapped GPU buffers for zero-copy CPU-to-GPU data transfer
- SSBO-based double buffering as a lightweight alternative to render-to-texture feedback
- Building a hot-reload system with graceful error recovery
- Expression parsing and evaluation for user-facing runtime configuration
- Cross-architecture deployment targeting both x86_64 and ARM (Raspberry Pi 4)
- Memory profiling with /proc/smaps to distinguish private vs shared memory costs
- Security considerations for user-provided file paths in shader tooling
- Writing technical documentation aimed at creative users rather than engineers
Summary
audio_vis is a real-time audio visualization engine built around giving shader authors direct, configurable access to spectral analysis data. The project spans system audio capture, real-time DSP, GPU buffer management, shader compilation, file watching, expression parsing, and cross-architecture deployment. It prioritizes correctness, graceful error handling, and fast iteration, making it both a creative tool for writing audio-reactive shaders and a technical demonstration of how these systems connect end to end.