Major Feature: Dia Text-to-Speech Integration
Overview
NeuralCodecs 0.4.0 introduces full support for Dia, Nari Labs' 1.6B-parameter text-to-speech model, enabling highly realistic dialogue generation directly from transcripts.
Key Capabilities
Advanced Dialogue Generation
- Speaker-aware dialogue with [S1] and [S2] tags for natural conversation flow
- Emotion and tone control for expressive speech synthesis
- Direct transcript-to-speech generation without intermediate processing steps
- Non-verbal tag support including laughter, coughing, throat clearing, and more
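To make the speaker-tag format concrete, here is a minimal sketch that splits a transcript into per-speaker turns. `SplitTurns` is a hypothetical helper written for illustration and is not part of the NeuralCodecs API; it only assumes the [S1]/[S2] tag convention described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of NeuralCodecs): splits a transcript into
// (speaker, text) turns at each [S1] / [S2] tag.
static List<(string Speaker, string Text)> SplitTurns(string transcript)
{
    var turns = new List<(string Speaker, string Text)>();
    // Capture a speaker tag, then lazily take text up to the next tag or the end.
    var pattern = @"\[(S[12])\]\s*(.*?)(?=\[S[12]\]|$)";
    foreach (Match m in Regex.Matches(transcript, pattern, RegexOptions.Singleline))
    {
        turns.Add((m.Groups[1].Value, m.Groups[2].Value.Trim()));
    }
    return turns;
}

var sample = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!";
foreach (var (speaker, text) in SplitTurns(sample))
{
    Console.WriteLine($"{speaker}: {text}");
}
```

A check like this can help catch transcripts with missing or misspelled speaker tags before they are passed to the model.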
Voice Cloning & Style Transfer
- Audio-conditioned generation for voice cloning using reference audio
- Style transfer capabilities to adapt speech characteristics
- Batch generation support for processing multiple texts efficiently
Advanced Speed Control System
This release includes a custom dynamic speed control system that compensates for Dia's tendency to speed up speech on longer inputs.
Speed Correction Methods:
- None: No correction, for fastest processing
- TorchSharp: TorchSharp-based linear interpolation
- Hybrid: Combined TorchSharp and NAudio methods (recommended for best quality)
- NAudioResampling: NAudio-based resampling correction
- All: Generates outputs using all methods for testing and comparison
Slowdown Modes:
- Static: Fixed slowdown factor
- Dynamic: Adaptive slowdown based on text length (recommended)
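As a rough illustration of what slowing audio down by linear interpolation involves, the sketch below stretches a sample buffer by a given factor in plain C#. It is not the library's implementation: it uses neither TorchSharp nor NAudio, `StretchLinear` is a hypothetical name, and how NeuralCodecs chooses the slowdown factor in Static or Dynamic mode is not shown here.

```csharp
using System;

// Illustrative only: lengthens audio by `slowdown` (> 1 means slower/longer)
// using linear interpolation between neighboring input samples.
static float[] StretchLinear(float[] samples, double slowdown)
{
    if (samples.Length == 0 || slowdown <= 0) return Array.Empty<float>();

    int outLength = Math.Max(1, (int)Math.Round(samples.Length * slowdown));
    var output = new float[outLength];
    for (int i = 0; i < outLength; i++)
    {
        // Position of output sample i on the input timeline.
        double pos = i / slowdown;
        int left = Math.Min((int)pos, samples.Length - 1);
        int right = Math.Min(left + 1, samples.Length - 1);
        double frac = pos - left;
        // Weighted average of the two nearest input samples.
        output[i] = (float)(samples[left] * (1 - frac) + samples[right] * frac);
    }
    return output;
}

var stretched = StretchLinear(new float[] { 0f, 1f, 0f, -1f }, 1.5);
Console.WriteLine(stretched.Length); // prints 6
```

Note that plain resampling of this kind lowers pitch along with speed, which is one reason a correction system may combine more than one method.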
Usage Examples
// Load Dia model with DAC codec integration
var diaConfig = new DiaConfig
{
    LoadDACModel = true,
    SampleRate = 44100,
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid,
    SlowdownMode = AudioSlowdownMode.Dynamic
};
var diaModel = await NeuralCodecs.CreateDiaAsync("model.pt", diaConfig);

// Generate realistic dialogue
var dialogue = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking! ...";
var audioOutput = diaModel.Generate(
    text: dialogue,
    maxTokens: 1000,
    cfgScale: 3.0f,
    temperature: 1.2f,
    topP: 0.95f);

// Voice cloning with reference audio
var clonedAudio = diaModel.Generate(
    text: "[S1] This is my cloned voice speaking new words. ...",
    audioPromptPath: "reference_voice.wav",
    maxTokens: 1000);

// Add non-verbal expressions
var expressiveText = "[S1] I can't believe it! (gasps) [S2] That's amazing! (laughs) ...";
var expressiveAudio = diaModel.Generate(expressiveText);

Non-Verbal Communication Support
Dia supports several non-verbal expressions:
(laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
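Since only the tags listed above are documented as supported, it can be useful to check a transcript for unrecognized ones before generation. The following is a minimal sketch under that assumption; `FindUnsupportedTags` is a hypothetical helper, not part of NeuralCodecs, and it treats every parenthesized phrase as a tag, so ordinary parenthetical text would also be flagged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Tag names copied from the supported list above.
var supportedTags = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "laughs", "clears throat", "sighs", "gasps", "coughs", "singing", "sings",
    "mumbles", "beep", "groans", "sniffs", "claps", "screams", "inhales",
    "exhales", "applause", "burps", "humming", "sneezes", "chuckle", "whistles"
};

// Hypothetical helper: returns every parenthesized phrase that is not a supported tag.
List<string> FindUnsupportedTags(string transcript) =>
    Regex.Matches(transcript, @"\(([^()]+)\)")
         .Cast<Match>()
         .Select(m => m.Groups[1].Value.Trim())
         .Where(tag => !supportedTags.Contains(tag))
         .ToList();

var unknown = FindUnsupportedTags("[S1] I can't believe it! (gasps) [S2] That's amazing! (giggles)");
Console.WriteLine(string.Join(", ", unknown)); // prints: giggles
```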
Performance & Requirements
- Memory Usage: ~10-11 GB of GPU memory (similar to the Python implementation)
- DAC Codec Integration: Seamless integration with DAC for full audio generation pipeline
- Optimized Processing: Built-in speed correction maintains audio quality while handling generation speed issues
License Change: MIT to Apache 2.0
Starting with version 0.4.0, NeuralCodecs transitions from the MIT License to the Apache License 2.0.
Most new codecs and TTS libraries are released under Apache 2.0, so adopting it makes them easier to integrate into the project without tracking mixed Apache 2.0 and MIT licensing requirements. MIT license requirements also seem to be increasingly ignored in the field, and the move to Apache 2.0 will hopefully help discourage that.
This change provides:
- Enhanced Patent Protection: Better protection against patent litigation
- Clearer Contribution Guidelines: More explicit terms for code contributions
- Continued Open Source: Maintains the open-source nature with additional protections
- Industry Alignment: Consistent licensing with most modern audio/ML libraries
What This Means for Users
- Existing users: Can continue using previous versions under MIT
- New projects: Should use Apache 2.0 license terms
- Contributors: New contributions will be under Apache 2.0
- Commercial use: Continues to be permitted with additional patent protections
Other Changes
- Improved Snake1d implementation
- Improved inference using torch's inference_mode
- Default codec factory pattern allows the user to load models without a config instance
- Improved example project, including Dia TTS support