v0.4.0

@DillionLowry released this 10 Jun 23:58

Major Feature: Dia Text-to-Speech Integration

Overview

NeuralCodecs 0.4.0 introduces full support for Dia, Nari Labs' 1.6B-parameter text-to-speech model, which generates highly realistic dialogue directly from transcripts.

Key Capabilities

Advanced Dialogue Generation

  • Speaker-aware dialogue with [S1] and [S2] tags for natural conversation flow
  • Emotion and tone control for expressive speech synthesis
  • Direct transcript-to-speech generation without intermediate processing steps
  • Non-verbal tag support including laughter, coughing, throat clearing, and more

Voice Cloning & Style Transfer

  • Audio-conditioned generation for voice cloning using reference audio
  • Style transfer capabilities to adapt speech characteristics
  • Batch generation support for processing multiple texts efficiently

Advanced Speed Control System

NeuralCodecs includes a custom speed control system that compensates for Dia's tendency to speed up speech on longer inputs:

Speed Correction Methods:

  • None: No correction (fastest processing)
  • TorchSharp: TorchSharp-based linear interpolation
  • Hybrid: Combined TorchSharp and NAudio methods (recommended for best quality)
  • NAudioResampling: NAudio-based resampling correction
  • All: Generates outputs using all methods for testing and comparison

Slowdown Modes:

  • Static: Fixed slowdown factor
  • Dynamic: Adaptive slowdown based on text length (recommended)
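
As a rough illustration of the idea behind Dynamic slowdown combined with linear-interpolation correction, the sketch below picks a slowdown factor that decreases as the text gets longer and then stretches a sample buffer by that factor. The names SpeedCorrectionSketch, DynamicSlowdownFactor, and StretchLinear, along with the length thresholds and the 0.8 minimum factor, are hypothetical and are not taken from the NeuralCodecs API; the library's own correction is built on TorchSharp and NAudio and may work differently.

```csharp
using System;

// Hypothetical illustration only; not part of the NeuralCodecs API.
public static class SpeedCorrectionSketch
{
    // Map text length to a playback-speed factor in [minFactor, 1.0].
    // Texts up to shortLength get 1.0 (no slowdown); texts at or beyond
    // longLength get minFactor. All thresholds are assumed values.
    public static double DynamicSlowdownFactor(
        int textLength, int shortLength = 100, int longLength = 1000, double minFactor = 0.8)
    {
        if (textLength <= shortLength) return 1.0;
        if (textLength >= longLength) return minFactor;
        double t = (double)(textLength - shortLength) / (longLength - shortLength);
        return 1.0 + t * (minFactor - 1.0);
    }

    // Resample so the audio plays at `speed` times the original rate
    // (speed < 1 slows it down) using linear interpolation between samples.
    public static float[] StretchLinear(float[] samples, double speed)
    {
        if (samples == null) throw new ArgumentNullException(nameof(samples));
        if (speed <= 0) throw new ArgumentOutOfRangeException(nameof(speed));
        if (samples.Length == 0) return Array.Empty<float>();

        int outLength = Math.Max(1, (int)Math.Round(samples.Length / speed));
        var output = new float[outLength];
        for (int i = 0; i < outLength; i++)
        {
            double pos = i * speed;                          // fractional source index
            int i0 = Math.Min((int)pos, samples.Length - 1); // left neighbour
            int i1 = Math.Min(i0 + 1, samples.Length - 1);   // right neighbour (clamped)
            double frac = pos - i0;
            output[i] = (float)(samples[i0] * (1.0 - frac) + samples[i1] * frac);
        }
        return output;
    }
}
```

Note that plain resampling like this changes pitch along with duration, which is one reason a real pipeline might combine several methods, as the Hybrid option does.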

Usage Examples

// Load Dia model with DAC codec integration
var diaConfig = new DiaConfig 
{ 
    LoadDACModel = true,
    SampleRate = 44100,
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid,
    SlowdownMode = AudioSlowdownMode.Dynamic
};
var diaModel = await NeuralCodecs.CreateDiaAsync("model.pt", diaConfig);

// Generate realistic dialogue
var dialogue = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking! ...";
var audioOutput = diaModel.Generate(
    text: dialogue,
    maxTokens: 1000,
    cfgScale: 3.0f,
    temperature: 1.2f,
    topP: 0.95f);

// Voice cloning with reference audio
var clonedAudio = diaModel.Generate(
    text: "[S1] This is my cloned voice speaking new words. ...",
    audioPromptPath: "reference_voice.wav",
    maxTokens: 1000);

// Add non-verbal expressions
var expressiveText = "[S1] I can't believe it! (gasps) [S2] That's amazing! (laughs) ...";
var expressiveAudio = diaModel.Generate(expressiveText);

Non-Verbal Communication Support

Dia supports several non-verbal expressions:
(laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
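
Since only the tags listed above are supported, it can help to check a transcript before generation. The following is a minimal sketch of such a check, assuming that every parenthesized span in the transcript is meant as a non-verbal tag; NonVerbalTagSketch and FindUnsupportedTags are hypothetical helper names and are not part of NeuralCodecs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the NeuralCodecs API.
public static class NonVerbalTagSketch
{
    // Tags copied from the list in these release notes.
    private static readonly HashSet<string> Supported =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "laughs", "clears throat", "sighs", "gasps", "coughs", "singing",
            "sings", "mumbles", "beep", "groans", "sniffs", "claps", "screams",
            "inhales", "exhales", "applause", "burps", "humming", "sneezes",
            "chuckle", "whistles"
        };

    // Return every parenthesized span that is not in the supported list.
    // Assumes all "(...)" spans are intended as non-verbal tags.
    public static List<string> FindUnsupportedTags(string transcript)
    {
        var unsupported = new List<string>();
        foreach (Match m in Regex.Matches(transcript, @"\(([^()]+)\)"))
        {
            string tag = m.Groups[1].Value.Trim();
            if (!Supported.Contains(tag)) unsupported.Add(tag);
        }
        return unsupported;
    }
}
```

For example, FindUnsupportedTags("[S1] Hi! (laughs) [S2] Sorry. (yawns)") returns a list containing only "yawns".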

Performance & Requirements

  • Memory Usage: ~10-11 GB of GPU memory (similar to the Python implementation)
  • DAC Codec Integration: Seamless integration with DAC for full audio generation pipeline
  • Optimized Processing: Built-in speed correction maintains audio quality while handling generation speed issues

License Change: MIT to Apache 2.0

Starting with version 0.4.0, NeuralCodecs transitions from the MIT License to the Apache License 2.0.
Most new codecs and TTS libraries are released under Apache 2.0, so adopting it makes it easier to integrate them into the project without tracking mixed Apache 2.0 and MIT licensing requirements. MIT license requirements also seem to be frequently ignored in the field, and the move to Apache 2.0 will hopefully help prevent that.

This change provides:

  • Enhanced Patent Protection: Better protection against patent litigation
  • Clearer Contribution Guidelines: More explicit terms for code contributions
  • Continued Open Source: Maintains the open-source nature with additional protections
  • Industry Alignment: Consistent licensing with most modern audio/ML libraries

What This Means for Users

  • Existing users: Can continue using previous versions under MIT
  • New projects: Should use Apache 2.0 license terms
  • Contributors: New contributions will be under Apache 2.0
  • Commercial use: Continues to be permitted with additional patent protections

Other changes

  • Improved Snake1d implementation
  • Improved inference using torch's inference_mode
  • Default codec factory pattern allows the user to load models without a config instance
  • Improved example project, including Dia TTS support