Supertonic 3 TTS: Lightning Fast On-Device Text-to-Speech System

Supertonic TTS is an on-device text-to-speech system built for fast, lightweight speech synthesis. It prioritizes speed, privacy, and simplicity while aiming to keep quality comparable to larger systems.

What is Supertonic TTS?

Supertonic TTS runs entirely on-device, so text never leaves the machine and no network connection is required. The system consists of three main components: a speech autoencoder that produces a continuous latent representation, a text-to-latent module that maps text to latents with flow matching, and an utterance-level duration predictor.
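To make the flow-matching step more concrete, here is a minimal sketch of how a flow-matching sampler typically turns noise into a latent by integrating a velocity field with a fixed number of Euler steps. It is not Supertonic's implementation: `toy_velocity`, `TARGET`, and `euler_sample` are hypothetical names, and the closed-form velocity stands in for the learned text-to-latent network, which in the real system would also be conditioned on the text and on the predicted duration.

```python
import random

def euler_sample(velocity, z0, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

# Hypothetical stand-in for the latent the model should produce.
TARGET = [0.5, -1.0, 2.0]

def toy_velocity(z, t):
    """Toy velocity field that points straight at TARGET (assumption, not a trained model)."""
    return [(x1 - zi) / (1.0 - t) for x1, zi in zip(TARGET, z)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]  # start from Gaussian noise
latent = euler_sample(toy_velocity, noise, num_steps=5)
print(latent)
```

In a real model the velocity comes from a neural network, so the number of steps trades speed against accuracy. That trade-off is what a configurable inference-step setting exposes.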

To enable a lightweight architecture, the system employs a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. The TTS pipeline is simplified by operating directly on raw character-level text and using cross-attention for text-speech alignment, eliminating the need for grapheme-to-phoneme modules and external aligners.
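The two ideas in this paragraph can be sketched in a few lines. The sketch rests on assumptions: `build_vocab`, `encode`, and `compress_latents` are hypothetical helpers, the vocabulary is built from a single example sentence, and concatenating neighbouring frames is only one common way to compress latents in time. The actual model may do either step differently.

```python
def build_vocab(texts):
    """Map every character seen in `texts` to an integer ID (0 is reserved for unknown)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab):
    """Turn raw text into character IDs, with no grapheme-to-phoneme step."""
    return [vocab.get(ch, 0) for ch in text]

def compress_latents(frames, factor):
    """Concatenate each group of `factor` consecutive frames into one longer frame.

    Trailing frames that do not fill a whole group are dropped in this sketch.
    """
    usable = len(frames) // factor * factor
    return [sum(frames[i:i + factor], []) for i in range(0, usable, factor)]

vocab = build_vocab(["Pay $5 by 3pm."])
ids = encode("Pay $5", vocab)
print(ids)

frames = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 frames of dimension 2
print(compress_latents(frames, factor=2))   # 2 frames of dimension 4
```

Digits and symbols such as `$` and `5` are ordinary vocabulary entries here, so the model sees them directly instead of relying on a separate phoneme front end. Compressing in time shortens the sequence the text-to-latent module has to generate.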

Training uses context-sharing batch expansion, which accelerates loss convergence and stabilizes text-speech alignment with minimal memory and I/O overhead. Experimental results show that the research version of Supertonic TTS, with only 44M parameters, delivers performance comparable to contemporary zero-shot TTS models while significantly reducing architectural complexity and computational cost.

Key Features

High-speed performance: Generates speech up to 167 times faster than real-time on consumer hardware
Lightweight design: Only 66M parameters (the research version has 44M), optimized for efficient on-device performance
Complete privacy: All processing happens locally on your device
Natural text handling: Processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
Configurable parameters: Adjust inference steps, batch processing, and other settings to match your needs
Multi-platform support: Works across servers, browsers, and edge devices with multiple runtime backends

Research and Development

Supertonic TTS is based on research conducted by a team at Supertone, Inc. The system builds on several research papers that describe the core technologies, including the main architecture, length-aware rotary position embedding for text-speech alignment, and self-purification techniques for training flow matching models.

The research demonstrates that Supertonic TTS can achieve performance comparable to larger, more complex systems while maintaining a much smaller footprint and faster inference speed. This makes it practical for deployment in resource-constrained environments and applications where speed and privacy are priorities.

Performance Characteristics

Supertonic TTS has been evaluated across different hardware configurations and text lengths. On consumer hardware like the M4 Pro, the system achieves 912 to 1,263 characters per second with CPU processing and 996 to 2,509 characters per second with WebGPU acceleration. On high-end GPUs like the RTX 4090, performance reaches 2,615 to 12,164 characters per second.
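As a rough illustration, the throughput figures above can be turned into expected synthesis times. The sketch assumes time scales linearly with character count, which is a simplification because the reported ranges span different text lengths. `synthesis_seconds` and `estimate_range` are hypothetical helper names.

```python
# Characters-per-second ranges as reported above: (low, high).
REPORTED_CPS = {
    "M4 Pro, CPU": (912, 1263),
    "M4 Pro, WebGPU": (996, 2509),
    "RTX 4090": (2615, 12164),
}

def synthesis_seconds(num_chars, chars_per_second):
    """Time to synthesize `num_chars` characters at a given throughput."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return num_chars / chars_per_second

def estimate_range(num_chars):
    """Return (best_case, worst_case) seconds per configuration."""
    return {
        name: (synthesis_seconds(num_chars, high), synthesis_seconds(num_chars, low))
        for name, (low, high) in REPORTED_CPS.items()
    }

for name, (best, worst) in estimate_range(1000).items():
    print(f"{name}: {best:.2f} s to {worst:.2f} s for 1,000 characters")
```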

The real-time factor (generation time divided by the duration of the generated audio, so lower is faster) ranges from 0.015 for CPU processing to 0.006 for WebGPU on consumer hardware. These figures make Supertonic TTS suitable for real-time applications and other scenarios where fast generation is essential.
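The relationship between real-time factor and speed can be checked with a short calculation. The helper names `speedup_from_rtf` and `generation_seconds` are hypothetical, and the 60-second clip is only an example length.

```python
def speedup_from_rtf(rtf):
    """How many times faster than real time a given real-time factor is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf

def generation_seconds(audio_seconds, rtf):
    """Compute time needed to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf

for label, rtf in [("CPU", 0.015), ("WebGPU", 0.006)]:
    print(
        f"{label}: about {speedup_from_rtf(rtf):.0f}x real time, "
        f"{generation_seconds(60, rtf):.2f} s for 60 s of audio"
    )
```

The WebGPU real-time factor of 0.006 works out to roughly 167 times faster than real time, which matches the headline figure in the feature list.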

Note: This is an educational demo website about Supertonic TTS. For the most accurate and up-to-date information, please refer to the official documentation and research papers.
