Open
Description
Problem
Missing: low-level kernel optimizations for critical operations, which are needed to enable hardware-specific acceleration.
Missing Implementations
Custom Kernels (HIGH):
- Optimized matrix multiplication (GEMM)
- Fused attention kernels
- Depthwise separable convolution
- Group convolution
- Custom activation functions
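As a rough illustration of what a fused custom kernel means here, the sketch below applies a bias add and a ReLU activation in a single pass, so the intermediate tensor is never written to memory. The function name `bias_relu_fused` and the row-major layout are assumptions for the example, not part of this project's API.

```cpp
#include <cstddef>

// Hypothetical fused kernel (illustrative only):
//   out[r][c] = max(0, in[r][c] + bias[c])
// `in` and `out` are row-major, rows x cols; `bias` has `cols` entries.
// Fusing the two steps avoids a second sweep over the data.
void bias_relu_fused(const float* in, const float* bias, float* out,
                     std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float v = in[r * cols + c] + bias[c];  // bias add
            out[r * cols + c] = v > 0.0f ? v : 0.0f;     // ReLU, same pass
        }
    }
}
```

The same pattern (compute several logical ops per element before storing) is what a fused attention kernel does at a larger scale.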
SIMD/Vectorization (HIGH):
- AVX2/AVX-512 on x86
- NEON on ARM
- Explicit vectorization
- Auto-vectorization hints
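A minimal sketch of explicit vectorization with a scalar tail, assuming an x86 build where the compiler defines `__AVX2__` (for example with `-mavx2`). The helper name `saxpy` is hypothetical. The float intrinsics used here are available from AVX onward; on other targets the function compiles down to the scalar loop only.

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: y[i] += a * x[i] for i in [0, n).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256 va = _mm256_set1_ps(a);            // broadcast a to 8 lanes
    for (; i + 8 <= n; i += 8) {
        const __m256 vx = _mm256_loadu_ps(x + i);   // unaligned loads
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    // Scalar tail; also the whole loop when AVX2 is not enabled at compile time.
    for (; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

A NEON variant would follow the same shape with 4-lane `float32x4_t` vectors behind a `__ARM_NEON` guard.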
GPU Optimization (HIGH):
- CUDA kernel implementations
- Shared memory utilization
- Warp-level primitives
- Tensor Core usage (Ampere+)
CPU Optimization (MEDIUM):
- Cache-aware algorithms
- Loop tiling
- Prefetching
- Parallel decomposition
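Loop tiling for a cache-aware GEMM could look roughly like the sketch below. The name `gemm_tiled` and the default tile size of 64 are assumptions; a real kernel would tune the tile size per cache level and add vectorized inner loops.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch: C (M x N) += A (M x K) * B (K x N), all row-major.
// `tile` is a hypothetical tuning parameter and must be greater than zero.
// Each (i0, k0, j0) block touches a small working set that can stay in cache.
void gemm_tiled(const float* A, const float* B, float* C,
                std::size_t M, std::size_t N, std::size_t K,
                std::size_t tile = 64) {
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        const std::size_t i1 = std::min(i0 + tile, M);
        for (std::size_t k0 = 0; k0 < K; k0 += tile) {
            const std::size_t k1 = std::min(k0 + tile, K);
            for (std::size_t j0 = 0; j0 < N; j0 += tile) {
                const std::size_t j1 = std::min(j0 + tile, N);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];   // reused across the j loop
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];  // unit-stride access
                        }
                    }
                }
            }
        }
    }
}
```

The outer `i0` loop is also a natural unit for parallel decomposition, since distinct row blocks of `C` are written independently.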
Integration Points
- Custom operator registration
- Fallback to reference implementation
- Platform detection
- Performance profiling
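One possible shape for operator registration with platform detection and fallback is sketched below. Every name here (`KernelRegistry`, `KernelEntry`, `ReluFn`, `cpu_has_avx2`, `relu_reference`) is hypothetical and only illustrates the idea; the runtime check uses the GCC/Clang builtin `__builtin_cpu_supports` and reports `false` on other compilers or architectures.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical kernel signature: element-wise op over n floats.
using ReluFn = std::function<void(const float*, float*, std::size_t)>;

struct KernelEntry {
    int priority;                     // higher value wins
    std::function<bool()> supported;  // runtime platform check
    ReluFn fn;
};

class KernelRegistry {
public:
    void add(const std::string& op, KernelEntry e) {
        table_[op].push_back(std::move(e));
    }
    // Returns the highest-priority kernel whose platform check passes,
    // or an empty function if the op is unknown or nothing is supported.
    ReluFn resolve(const std::string& op) const {
        const auto it = table_.find(op);
        if (it == table_.end()) return ReluFn{};
        const KernelEntry* best = nullptr;
        for (const auto& e : it->second) {
            if (e.supported() && (!best || e.priority > best->priority)) best = &e;
        }
        return best ? best->fn : ReluFn{};
    }
private:
    std::map<std::string, std::vector<KernelEntry>> table_;
};

// Runtime detection (GCC/Clang on x86 only; otherwise assume unsupported).
inline bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Reference implementation, registered with the lowest priority as the fallback.
inline void relu_reference(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}
```

In this sketch an optimized kernel would be registered with `cpu_has_avx2` as its `supported` check and a higher priority, while `relu_reference` is registered with a check that always returns `true`, so `resolve` degrades gracefully on unsupported hardware.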
Architecture
Success Criteria
- 2-5x speedup on critical operations relative to the reference implementation
- Hardware-specific optimization
- Graceful fallback
- Benchmarks against MKL and cuBLAS
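To check a speedup target like the one above, a small timing harness is enough to start with. The sketch below is an assumption about how that could be measured, not an existing utility: `median_seconds` and `speedup` are hypothetical names, and comparisons against MKL or cuBLAS would wrap those library calls in the same callable.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `fn` once untimed as a warm-up, then `repeats` timed runs,
// and returns the median wall-clock time in seconds.
// `repeats` must be at least 1.
double median_seconds(const std::function<void()>& fn, int repeats = 9) {
    fn();  // warm-up: populate caches, trigger lazy initialization
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}

// Speedup of a candidate kernel over a baseline; values above 1 mean faster.
inline double speedup(double baseline_s, double candidate_s) {
    return baseline_s / candidate_s;
}
```

The median is used instead of the mean so that a single slow run (for example from a context switch) does not skew the result.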