Release v0.4.0 · DillionLowry/NeuralCodecs

Major Feature: Dia Text-to-Speech Integration

Overview

NeuralCodecs 0.4.0 introduces full support for Nari Labs' Dia, a 1.6B-parameter text-to-speech model that generates highly realistic dialogue directly from transcripts.

Key Capabilities

Advanced Dialogue Generation

Speaker-aware dialogue with [S1] and [S2] tags for natural conversation flow
Emotion and tone control for expressive speech synthesis
Direct transcript-to-speech generation without intermediate processing steps
Non-verbal tag support including laughter, coughing, throat clearing, and more
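
To illustrate the speaker-tag format, the sketch below splits a tagged transcript into per-speaker turns. `SplitTurns` is a hypothetical helper written for this example and is not part of the NeuralCodecs API; Dia itself takes the tagged transcript as a single string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Splits a transcript such as "[S1] Hi. [S2] Hello!" into (speaker, text) turns.
// Hypothetical helper for illustration only; not part of NeuralCodecs.
static List<(string Speaker, string Text)> SplitTurns(string transcript)
{
    var result = new List<(string Speaker, string Text)>();
    // Match a speaker tag, then lazily consume text up to the next tag or the end.
    var pattern = new Regex(@"\[(S[12])\]\s*(.*?)(?=\[S[12]\]|$)", RegexOptions.Singleline);
    foreach (Match m in pattern.Matches(transcript))
    {
        result.Add((m.Groups[1].Value, m.Groups[2].Value.Trim()));
    }
    return result;
}

var parsed = SplitTurns("[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!");
foreach (var (speaker, text) in parsed)
{
    Console.WriteLine($"{speaker}: {text}");
}
```

A check like this can be handy for confirming that a transcript alternates speakers as intended before it is passed to the model.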

Voice Cloning & Style Transfer

Audio-conditioned generation for voice cloning using reference audio
Style transfer capabilities to adapt speech characteristics
Batch generation support for processing multiple texts efficiently

Advanced Speed Control System

NeuralCodecs includes a custom dynamic speed control system that compensates for Dia's tendency to speed up speech on longer inputs:

Speed Correction Methods:

None: No correction (fastest processing)
TorchSharp: TorchSharp-based linear interpolation
Hybrid: Combined TorchSharp and NAudio methods (recommended for best quality)
NAudioResampling: NAudio-based resampling correction
All: Generates outputs using all methods for testing and comparison

Slowdown Modes:

Static: Fixed slowdown factor
Dynamic: Adaptive slowdown based on text length (recommended)
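
The general idea behind a length-based slowdown combined with linear-interpolation resampling can be sketched as follows. This is a minimal illustration and not the library's implementation: `DynamicSpeedFactor`, `StretchLinear`, and every threshold and factor value are assumptions chosen for the example, and the code uses plain arrays rather than TorchSharp or NAudio.

```csharp
using System;

// Hypothetical dynamic factor: no slowdown up to `threshold` characters, then the
// factor falls linearly to `minFactor` at `maxLength` characters. The numbers are
// illustrative assumptions, not the values NeuralCodecs uses.
static double DynamicSpeedFactor(int textLength, int threshold = 200, int maxLength = 1000, double minFactor = 0.8)
{
    if (textLength <= threshold) return 1.0;
    double t = Math.Min(1.0, (textLength - threshold) / (double)(maxLength - threshold));
    return 1.0 - t * (1.0 - minFactor);
}

// Resamples `samples` by linear interpolation so that playback at the original
// sample rate runs at `speedFactor` times the original speed (0.8 = 20% slower).
// Plain resampling like this also lowers the pitch by the same ratio.
static float[] StretchLinear(float[] samples, double speedFactor)
{
    if (speedFactor <= 0) throw new ArgumentOutOfRangeException(nameof(speedFactor));
    if (samples.Length == 0) return samples;
    int outLength = Math.Max(1, (int)Math.Round(samples.Length / speedFactor));
    var output = new float[outLength];
    for (int i = 0; i < outLength; i++)
    {
        double pos = i * speedFactor;                      // fractional read position in the input
        int i0 = Math.Min((int)pos, samples.Length - 1);   // sample at or before the position
        int i1 = Math.Min(i0 + 1, samples.Length - 1);     // next sample, clamped at the end
        float frac = (float)(pos - i0);
        output[i] = samples[i0] + (samples[i1] - samples[i0]) * frac;
    }
    return output;
}

double factor = DynamicSpeedFactor(textLength: 600);
float[] oneSecond = new float[44100];
float[] slowed = StretchLinear(oneSecond, factor);
Console.WriteLine($"factor={factor:F2}, samples: {oneSecond.Length} -> {slowed.Length}");
```

In this sketch a static mode would pass a fixed factor straight to `StretchLinear`, while a dynamic mode would derive the factor from the input text length first.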

Usage Examples

// Load Dia model with DAC codec integration
var diaConfig = new DiaConfig 
{ 
    LoadDACModel = true,
    SampleRate = 44100,
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid,
    SlowdownMode = AudioSlowdownMode.Dynamic
};
var diaModel = await NeuralCodecs.CreateDiaAsync("model.pt", diaConfig);

// Generate realistic dialogue
var dialogue = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking! ...";
var audioOutput = diaModel.Generate(
    text: dialogue,
    maxTokens: 1000,
    cfgScale: 3.0f,
    temperature: 1.2f,
    topP: 0.95f);

// Voice cloning with reference audio
var clonedAudio = diaModel.Generate(
    text: "[S1] This is my cloned voice speaking new words. ...",
    audioPromptPath: "reference_voice.wav",
    maxTokens: 1000);

// Add non-verbal expressions
var expressiveText = "[S1] I can't believe it! (gasps) [S2] That's amazing! (laughs) ...";
var expressiveAudio = diaModel.Generate(expressiveText);

Non-Verbal Communication Support

Dia supports several non-verbal expressions:
(laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
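
As a hedged illustration, a transcript can be checked against this list before generation. `FindUnsupportedTags` is a hypothetical helper, not a NeuralCodecs function, and it assumes exact, case-sensitive tag matching. It treats any parenthesized text as a tag, so ordinary parenthetical remarks would also be reported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Returns the parenthesized tags in `transcript` that are not in the supported list.
// Hypothetical helper for illustration only; matching is exact and case-sensitive.
static List<string> FindUnsupportedTags(string transcript)
{
    var supported = new HashSet<string>
    {
        "laughs", "clears throat", "sighs", "gasps", "coughs", "singing", "sings",
        "mumbles", "beep", "groans", "sniffs", "claps", "screams", "inhales",
        "exhales", "applause", "burps", "humming", "sneezes", "chuckle", "whistles"
    };
    return Regex.Matches(transcript, @"\(([^()]+)\)")
        .Cast<Match>()
        .Select(m => m.Groups[1].Value.Trim())
        .Where(tag => !supported.Contains(tag))
        .Distinct()
        .ToList();
}

var unknown = FindUnsupportedTags("[S1] I can't believe it! (gasps) [S2] That's amazing! (yawns)");
Console.WriteLine(unknown.Count == 0 ? "All tags supported" : "Unsupported: " + string.Join(", ", unknown));
```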

Performance & Requirements

Memory Usage: Approximately 10-11 GB of GPU memory (similar to the Python implementation)
DAC Codec Integration: Seamless integration with DAC for full audio generation pipeline
Optimized Processing: Built-in speed correction compensates for Dia's speed-up on longer inputs while preserving audio quality

License Change: MIT to Apache 2.0

Starting with version 0.4.0, NeuralCodecs transitions from the MIT License to the Apache License 2.0.
Most new codecs and TTS libraries are released under Apache 2.0, so this change makes it easier to integrate them into the project without tracking mixed Apache 2.0 and MIT licensing requirements. MIT license requirements also appear to be frequently ignored in the field, and the move to Apache 2.0 will hopefully help prevent that.

This change provides:

Enhanced Patent Protection: Better protection against patent litigation
Clearer Contribution Guidelines: More explicit terms for code contributions
Continued Open Source: Maintains the open-source nature with additional protections
Industry Alignment: Consistent licensing with most modern audio/ML libraries

What This Means for Users

Existing users: Can continue using versions prior to 0.4.0 under the MIT License
New projects: Should follow the Apache 2.0 license terms when using version 0.4.0 or later
Contributors: New contributions will be under Apache 2.0
Commercial use: Continues to be permitted with additional patent protections

Other changes

Improved Snake1d implementation
Improved inference using torch's inference_mode
A default codec factory pattern lets users load models without a config instance
Improved example project, including Dia TTS support
