NeuralCodecs is a .NET library for neural audio codec implementations and TTS models written purely in C#. It includes implementations of SNAC, DAC, Encodec, and Dia, along with advanced audio processing tools.
- SNAC: Multi-Scale Neural Audio Codec
  - Support for multiple sampling rates: 24kHz, 32kHz, and 44.1kHz
  - Attention mechanisms with adjustable window sizes for improved quality
  - Automatic resampling for input flexibility
- Automatic resampling for input flexibility
 
- DAC: Descript Audio Codec
  - Supports multiple sampling rates: 16kHz, 24kHz, and 44.1kHz
  - Configurable encoder/decoder architecture with variable rates
  - Flexible bitrate configurations from 8kbps to 16kbps
 
- Encodec: Meta's Encodec neural audio compression
  - Supports stereo audio at 24kHz and 48kHz sample rates
  - Variable bitrate compression (1.5-24 kbps)
  - Neural language model for enhanced compression quality
  - Direct file compression to .ecdc format
 
- Dia: Nari Labs' Dia text-to-speech model
  - 1.6B parameter text-to-speech model for highly realistic dialogue generation
  - Direct transcript-to-speech generation with emotion and tone control
  - Audio-conditioned generation for voice cloning and style transfer
  - Support for non-verbal communications (laughter, coughing, throat clearing, etc.)
  - Speaker-aware dialogue generation with [S1] and [S2] tags
  - Custom dynamic speed control to handle Dia's issue with automatic speed-up on long inputs
 
- AudioTools: Advanced audio processing utilities
  - Based on Descript's audiotools Python package
  - Extended with .NET-specific optimizations and additional features
  - Audio filtering, transformation, and effects processing
  - Works with Descript's AudioSignal or Tensors
 
- Audio Visualization: Example project includes spectrogram generation and comparison tools
Requirements:
- .NET 8.0 or later
- TorchSharp or libTorch compatible with your platform
- NAudio (for audio processing)
- SkiaSharp (for visualization features)
Install the main package from NuGet:

```bash
dotnet add package NeuralCodecs
```

Or via the Package Manager Console:

```powershell
Install-Package NeuralCodecs
```

Models will be automatically downloaded given the Hugging Face user/model name, or can be downloaded separately:
- SNAC Models - Available from hubertsiuzdak's HuggingFace
- DAC Models - Available from Descript's HuggingFace
- Encodec Models - Available from Meta's HuggingFace
- Dia Model - Available from Nari Labs' HuggingFace
  - Requires both Dia model weights and DAC codec for full audio generation
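Since models can be fetched automatically from a Hugging Face user/model name, the loader can be pointed at a repo id instead of a local file path. A minimal sketch, assuming the id string is accepted anywhere a model path is (the repo name shown is illustrative, not guaranteed):

```csharp
// Sketch: download-and-load by Hugging Face user/model id
// (repo id below is illustrative; check the model pages listed above)
var snac = await NeuralCodecs.CreateSNACAsync("hubertsiuzdak/snac_24khz");
```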
Here's a simple example to get you started:

```csharp
using NeuralCodecs;

// Load a SNAC model
var model = await NeuralCodecs.CreateSNACAsync("path/to/model.pt");

// Process audio
float[] audioData = LoadAudioFile("input.wav");
var compressed = model.ProcessAudio(audioData, sampleRate: 24000);

// Save the result
SaveAudioFile("output.wav", compressed);
```

For more detailed examples, see the examples section below.
There are several ways to load a model:

```csharp
// Load SNAC model with the static method provided for built-in models
var model = await NeuralCodecs.CreateSNACAsync("model.pt");
```

- SNACConfig provides premade configurations for 24kHz, 32kHz, and 44.1kHz sampling rates:

```csharp
var model = await NeuralCodecs.CreateSNACAsync(modelPath, SNACConfig.SNAC24Khz);
```

- Allows the use of custom loader implementations:

```csharp
// Load model with default config from IModelLoader instance
var torchLoader = NeuralCodecs.CreateTorchLoader();
var model = await torchLoader.LoadModelAsync<SNAC, SNACConfig>("model.pt");

// For Encodec with custom bandwidth and settings
var encodecConfig = new EncodecConfig
{
    SampleRate = 48000,
    Bandwidth = 12.0f,
    Channels = 2,  // Stereo audio
    Normalize = true
};
var encodecModel = await torchLoader.LoadModelAsync<Encodec, EncodecConfig>("encodec_model.pt", encodecConfig);
```

- Allows the use of custom model implementations with built-in or custom loaders:

```csharp
// Load custom model with factory method
var model = await torchLoader.LoadModelAsync<CustomModel, CustomConfig>(
    "model.pt",
    config => new CustomModel(config, ...),
    config);
```

Models can be loaded in PyTorch or Safetensors format.
The AudioTools namespace provides extensive audio processing capabilities:

```csharp
var audio = new Tensor(...); // Load or create audio tensor

// Apply effects
var processedAudio = AudioEffects.ApplyCompressor(
    audio,
    sampleRate: 48000,
    threshold: -20f,
    ratio: 4.0f);

// Compute spectrograms and transforms
var spectrogram = DSP.MelSpectrogram(audio, sampleRate);
var stft = DSP.STFT(audio, windowSize: 1024, hopSize: 512, windowType: "hann");
```

There are two main ways to process audio:
- Using the simplified ProcessAudio method:

```csharp
// Compress audio in one step
var processedAudio = model.ProcessAudio(audioData, sampleRate);
```

- Using separate encode and decode steps:

```csharp
// Encode audio to compressed format
var codes = model.Encode(buffer);

// Decode back to audio
var processedAudio = model.Decode(codes);
```

- Saving the processed audio: use your preferred method to save WAV files.

```csharp
// using NAudio
await using var writer = new WaveFileWriter(
    outputPath,
    new WaveFormat(model.Config.SamplingRate, channels: model.Channels)
);
writer.WriteSamples(processedAudio, 0, processedAudio.Length);
```

Encodec provides additional capabilities:
```csharp
// Set target bandwidth for compression (supported values depend on model)
encodecModel.SetTargetBandwidth(12.0f); // 12 kbps

// Get available bandwidth options
var availableBandwidths = encodecModel.TargetBandwidths; // e.g. [1.5, 3, 6, 12, 24]

// Use language model for enhanced compression quality
var lm = await encodecModel.GetLanguageModel();
// Apply LM during encoding/decoding for better quality

// Direct file compression
await EncodecCompressor.CompressToFileAsync(encodecModel, audioTensor, "audio.ecdc", useLm: true);

// Decompress from file
var (decompressedAudio, sampleRate) = await EncodecCompressor.DecompressFromFileAsync("audio.ecdc");
```

Dia is a 1.6B parameter text-to-speech model that generates highly realistic dialogue directly from transcripts:
```csharp
// Load Dia model with optional DAC codec
var diaConfig = new DiaConfig
{
    LoadDACModel = true,
    SampleRate = 44100
};
var diaModel = await NeuralCodecs.CreateDiaAsync("model.pt", diaConfig);

// or use LoadDACModel = false in the config and manually load DAC:
diaModel.LoadDacModel("dac_model.pt");

// Basic text-to-speech generation
var text = "[S1] Hello, how are you today? [S2] I'm doing great, thanks for asking!";
var audioOutput = diaModel.Generate(
    text: text,
    maxTokens: 1000,
    cfgScale: 3.0f,
    temperature: 1.2f,
    topP: 0.95f);

// Voice cloning with audio prompt
var audioPromptPath = "reference_voice.wav";
var clonedAudio = diaModel.Generate(
    text: "[S1] This is my cloned voice speaking new words.",
    audioPromptPath: audioPromptPath,
    maxTokens: 1000);

// Batch generation for multiple texts
var texts = new List<string>
{
    "[S1] First dialogue line.",
    "[S2] Second dialogue line with (laughs) non-verbal."
};
var batchResults = diaModel.Generate(texts, maxTokens: 800);

// Save generated audio
Dia.SaveAudio("output.wav", audioOutput);
```

Audio Speed Correction: Dia includes built-in speed correction to handle the automatic speed-up issue on longer inputs:
```csharp
var diaConfig = new DiaConfig
{
    LoadDACModel = true,
    SampleRate = 44100,
    // Configure speed correction method
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid, // Default: best quality
    // Configure slowdown mode
    SlowdownMode = AudioSlowdownMode.Dynamic // Default: adapts to text length
};
```

Speed correction methods:
- None: No speed correction applied
- TorchSharp: TorchSharp-based linear interpolation
- Hybrid: Combines TorchSharp and NAudio methods (recommended)
- NAudioResampling: Uses NAudio resampling for speed correction
- All: Creates separate outputs using all methods (for testing/comparison)

Slowdown modes:
- Static: Uses a fixed slowdown factor
- Dynamic: Adjusts slowdown based on text length (recommended)
Speed Correction Examples:

```csharp
// For highest quality output (default)
var highQualityConfig = new DiaConfig
{
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.Hybrid,
    SlowdownMode = AudioSlowdownMode.Dynamic
};

// For testing multiple correction methods
var testConfig = new DiaConfig
{
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.All // Generates multiple output variants
};

// For no speed correction (fastest processing)
var fastConfig = new DiaConfig
{
    SpeedCorrectionMethod = AudioSpeedCorrectionMethod.None
};
```

Text Format Requirements:
- Always begin input text with the [S1] speaker tag
- Alternate between [S1] and [S2] for dialogue (repeating the same speaker tag consecutively may impact generation)
- Keep input text a moderate length (roughly 10-20 seconds of corresponding audio)
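Putting these rules together, a minimal well-formed input looks like the following (the dialogue text itself is illustrative):

```csharp
// Starts with [S1], alternates speakers, and stays a moderate length
var transcript =
    "[S1] Welcome back to the show. " +
    "[S2] Thanks, it's great to be here. " +
    "[S1] Let's dive right in.";
```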
Non-Verbal Communications: Dia supports various non-verbal tags. Some work more consistently than others (laughs, chuckles), but be prepared for occasional unexpected output from some tags (sneezes, applause, coughs, ...).

```csharp
var textWithNonVerbals = "[S1] I can't believe it! (gasps) [S2] That's amazing! (laughs)";
```

Supported non-verbals: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Voice Cloning Best Practices:
- Provide 5-10 seconds of reference audio for optimal results
- Include the transcript of the reference audio before your generation text
- Use correct speaker tags in the reference transcript
- Estimate duration at approximately 86 tokens per second of audio
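The 86-tokens-per-second rule of thumb can be used to size the maxTokens parameter for a target duration. A small sketch (the helper name is hypothetical, not part of the library):

```csharp
// Rough duration budgeting: Dia produces ~86 tokens per second of audio
static int EstimateMaxTokens(double targetSeconds) =>
    (int)Math.Ceiling(targetSeconds * 86);

var maxTokens = EstimateMaxTokens(10.0); // ~860 tokens for about 10 seconds
```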
```csharp
// Voice cloning example with transcript
var referenceTranscript = "[S1] This is the reference voice speaking clearly.";
var newText = "[S1] Now I will say something completely different.";
var clonedOutput = diaModel.Generate(
    text: referenceTranscript + " " + newText,
    audioPromptPath: "reference.wav");
```

Memory Usage: As with the Python implementation, ~10-11GB of GPU memory is required for the Dia model with the DAC codec.
Speed Comparison (RTX 3090): In limited testing (Windows, no torch.compile), the C# implementation shows a slight performance improvement over the original Python version:
- Python (original): ~35 tokens/second (without torch.compile)
- C# (NeuralCodecs): ~40 tokens/second
Performance Notes:
- TorchSharp currently lacks torch.compile support, which limits potential speed gains compared to PyTorch
- Dia's performance is reduced on Windows machines compared to Linux environments
- Actual performance will vary based on hardware configuration, text length, and generation parameters
Check out the Example project for a complete implementation, including:
- Model loading and configuration
- Audio processing workflows
- Command-line interface implementation
- Audio Visualization
The example includes tools for visualizing and comparing audio spectrograms:
Audio before and after compression with DAC Codec 24kHz

- SNAC - hubertsiuzdak's original Python implementation
- Descript Audio Codec - Descript's original Python implementation
- Encodec - Meta's original Python implementation
- Dia - Nari Labs' original Python implementation
Suggestions and contributions are welcome! Here's how you can help:
- Bug Reports: Submit issues with reproduction steps
- Feature Requests: Propose new codec implementations or features
- Code Contributions: Submit pull requests with improvements
- Documentation: Help improve examples and documentation
- Testing: Test with different models and platforms
This project is licensed under the Apache-2.0 License, see the LICENSE file for more information.
This project uses libraries under several different licenses, see THIRD-PARTY-NOTICES for more information.