
AI-powered browser-based vulnerability scanner using UniXcoder embeddings and RAG with LLM to detect security flaws across 9 languages.


butlerem/vulnerability-scanner-UniXcoder-RAG


Sylint

Sylint is an AI-powered, browser-based static vulnerability scanner for 9 programming languages. It combines deep code embeddings (UniXcoder) with retrieval-augmented generation (RAG) via LLMs to detect, explain, and debug security issues, aiming for greater precision than traditional static analyzers or basic LLM-based tools. Sylint goes beyond conventional static analysis by understanding code semantically and grounding AI explanations in real-world vulnerability patterns.

Code Understanding with UniXcoder

Built on UniXcoder (microsoft/unixcoder-base) by Microsoft, a model trained across source code, comments, and ASTs.

Supports 9 languages: Python, JavaScript, Java, C, C++, PHP, Ruby, Go, and TypeScript.

Encodes code into 768-dimensional embeddings that capture logic, not just syntax.

Can detect vulnerabilities even when code is superficially altered (e.g., renamed variables, reordered structures), because matching is done on embeddings rather than on surface text.

Vulnerability-Driven Dataset (CVEfixes)

Uses a filtered version of the CVEfixes dataset, focused solely on vulnerable samples from the NVD.

Associates each sample with language, CVE IDs, and CWE vulnerability classes.

Prioritizes matching real-world exploit patterns rather than safe or patched code.

Retrieval-Augmented Generation (RAG) for Explanations

On code submission, Sylint retrieves similar vulnerabilities based on embeddings.

These examples inform the LLM's explanation, grounding it in real-world evidence.

This grounding is intended to make vulnerability analysis more accurate and consistent than prompting the LLM with the submitted code alone.
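The retrieval step described above can be sketched as a cosine-similarity top-k search. This is a minimal in-memory illustration, not the project's Pinecone-backed implementation: the `Snippet` class, the sample IDs, and the toy 3-dimensional vectors are all hypothetical stand-ins for real 768-dimensional UniXcoder embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class Snippet:
    """A known-vulnerable sample with its embedding (hypothetical shape)."""
    sample_id: str
    cwe: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], corpus: list[Snippet], k: int = 3) -> list[tuple[float, Snippet]]:
    """Return the k snippets most similar to the query, best first."""
    scored = [(cosine(query, s.vector), s) for s in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors: illustrative only, not taken from any real dataset.
corpus = [
    Snippet("sample-a", "CWE-89", [0.9, 0.1, 0.0]),
    Snippet("sample-b", "CWE-79", [0.0, 1.0, 0.0]),
    Snippet("sample-c", "CWE-22", [0.1, 0.0, 0.9]),
]
matches = top_k([1.0, 0.2, 0.0], corpus, k=2)
for score, snippet in matches:
    print(f"{snippet.sample_id} ({snippet.cwe}): {score:.3f}")
```

In a deployment like the one described here, the vector database performs this ranking; the sketch only shows what "similar" means in embedding space.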

Tech Stack

Frontend: Next.js (React + TypeScript) + Tailwind CSS

Backend:

  • Convex — Realtime database and serverless backend
  • FastAPI — Python service for AI model interaction (embeddings + LLM calls)

Authentication: Clerk.dev

AI Models:

  • UniXcoder (microsoft/unixcoder-base): Code embedding model
  • Groq API (llama-3.3-70b-versatile, a Llama 3.3 70B model): Vulnerability explanation model

Vector Database:

  • Pinecone

Key Features

  • Monaco-based code editor with multi-language support
  • Syntax highlighting and language-aware defaults
  • Upload code and trigger vulnerability scans
  • Code embedding generation using UniXcoder
  • Similarity search against known vulnerable codebases
  • Retrieval-augmented vulnerability explanations
  • Save and view scan history (Convex backend)
  • Clerk authentication with Pro subscription gating
  • Webhook integration with Lemon Squeezy
  • Full HTTPS (SSL-secured) frontend/backend communication
  • Automatic CWE/CVE tagging based on LLM analysis
  • Exportable vulnerability reports (PDF/Markdown)

Getting Started

  1. Clone and set up the project:
git clone https://github.com/your-username/sylint.git
cd sylint
npm install
  2. Run the frontend (Next.js):
npm run dev
  3. Run the Convex backend:
npx convex dev
  4. Run the FastAPI service (embeddings and explanations):
cd ai-service
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

Make sure your .env file contains:

  • GROQ_API_KEY
  • Clerk keys (frontend/backend)

AI Service Endpoints

  • POST /embed: Generate a 768-dimensional code embedding via UniXcoder
  • POST /explain: Generate a vulnerability explanation via Groq (llama-3.3-70b-versatile)
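As a rough sketch of how a client might call these endpoints, the helper below builds a JSON POST request using only the Python standard library. The request body fields (`code`, `language`) are assumptions for illustration; the README does not document the actual request or response schema, so check the FastAPI service before relying on them.

```python
import json
import urllib.request

# Matches the uvicorn port used in Getting Started.
AI_SERVICE_URL = "http://localhost:8000"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the AI service (does not send it)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        AI_SERVICE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (requires a running service)."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical payload shape: field names are assumed, not documented.
embed_req = build_request("/embed", {"code": "print(input())", "language": "python"})
print(embed_req.get_method(), embed_req.full_url)
```

With the service running, `post_json("/embed", {...})` would send the request; the shape of the response is likewise not specified here.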

Vector Database Setup

Store embeddings of vulnerable code samples (CWE/CVE dataset).

Enable semantic similarity search on user-submitted code.

Power retrieval-augmented LLM explanations.

Sylint can detect vulnerabilities even when the submitted code looks different from the original vulnerable example.
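To make the storage step concrete, here is a sketch of shaping one vulnerable sample into a vector record with `id`, `values`, and `metadata` keys, which is the general record layout Pinecone accepts on upsert. The metadata field names, the `"vulnerable_code"` type value, and the placeholder IDs are assumptions chosen to mirror the dataset attributes mentioned earlier (language, CVE IDs, CWE class).

```python
EMBEDDING_DIM = 768  # UniXcoder embedding size stated in this README


def to_record(sample_id: str, vector: list[float], language: str,
              cve_ids: list[str], cwe: str) -> dict:
    """Shape one vulnerable sample as an id/values/metadata record."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return {
        "id": sample_id,
        "values": vector,
        "metadata": {
            "type": "vulnerable_code",  # assumed label, not from the source
            "language": language,
            "cve_ids": cve_ids,
            "cwe": cwe,
        },
    }


# Placeholder values only: not a real sample or CVE.
record = to_record("sample-0001", [0.0] * EMBEDDING_DIM, "python",
                   ["CVE-0000-0000"], "CWE-89")
print(record["id"], record["metadata"]["cwe"], len(record["values"]))
```

Validating the dimension before upload is a cheap guard, since an index created for 768-dimensional vectors will reject vectors of any other length.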

LLM Integration Details

Model: Llama 3.3 70B Versatile (llama-3.3-70b-versatile) via Groq API

Responsibilities:

  • Deep vulnerability reasoning
  • CWE/CVE tagging
  • Auto-fix patch suggestions
  • Exportable full vulnerability reports

Sylint Notes on Architecture, RAG Usage, and Next Steps

Current Architecture (RAG-Based Vulnerability Scanner)

  • User submits source code (JavaScript, Python, etc.)
  • Code is embedded using UniXcoder
  • Query sent to Pinecone (vector DB) containing ~4,000 vulnerable code snippets
  • Retrieve top-k most semantically similar code examples
  • Forward original code + top-k results to LLM (Groq’s Mixtral, llama-3, etc.)
  • LLM returns:
    • Explanation of vulnerability
    • CWE classification
    • (Optionally) suggested fix

This approach is a form of Retrieval-Augmented Generation (RAG).
It uses semantic similarity to ground the LLM’s response with real-world examples, enabling detection of non-pattern-matching, obfuscated, or fuzzy vulnerabilities.
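The hand-off to the LLM in the pipeline above, where the original code and the top-k results are forwarded together, might look roughly like the following prompt builder. The prompt wording, the `build_prompt` name, and the match dictionary keys (`cwe`, `score`, `code`) are hypothetical; the project's real prompt is not shown in this README.

```python
def build_prompt(user_code: str, language: str, matches: list[dict]) -> str:
    """Combine the submitted code and retrieved examples into one LLM prompt."""
    parts = [
        "You are a security reviewer. Explain any vulnerability in the submitted code,",
        "give the most likely CWE, and optionally suggest a fix.",
        "",
        f"Submitted {language} code:",
        user_code,
        "",
        "Similar known-vulnerable examples (retrieved by embedding similarity):",
    ]
    for i, match in enumerate(matches, start=1):
        parts.append(f"Example {i} ({match['cwe']}, similarity {match['score']:.2f}):")
        parts.append(match["code"])
    return "\n".join(parts)


# Illustrative retrieval result: the snippet and score are made up.
retrieved = [
    {"cwe": "CWE-89", "score": 0.93,
     "code": 'cur.execute("SELECT * FROM t WHERE id=" + uid)'},
]
prompt = build_prompt('db.query("SELECT * FROM users WHERE name=" + name)',
                      "javascript", retrieved)
print(prompt)
```

Including the similarity score and CWE label with each example gives the model a hint about how much weight a retrieved snippet deserves.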

Planned Improvements: Compliance-Aware RAG

Add a second Pinecone index (or namespace) for compliance rules:

  • Sources: NIST SP 800-53, PCI DSS v4.0, HIPAA, OWASP ASVS
  • Each rule is embedded as text (same model or a text-focused one)

Dual Retrieval on Submission:

  • Step 1: User submits code
  • Step 2: Embed the code
  • Step 3: Run two vector searches:
    1. Against vulnerable code examples
    2. Against compliance rule embeddings
  • Step 4: Combine both sets of results + user code and send to LLM

LLM returns:

  • Vulnerability summary
  • CWE mapping
  • Matched compliance rules (e.g., “Violates PCI 6.2.4”, “Conflicts with NIST SI-10”)
  • (Optional) Fix recommendation contextualized by regulation
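The dual-retrieval flow could be sketched as below, with the two vector searches passed in as plain functions so the example runs without a vector database. The function names, result keys, and stub searches are hypothetical. One assumption worth flagging: querying both indexes with the same vector only makes sense if the compliance rules are embedded in the same space as the code; if a separate text-focused model is used for the rules, the query would need its own embedding from that model.

```python
from typing import Callable

# A search function takes (query_vector, top_k) and returns a list of match dicts.
Search = Callable[[list[float], int], list[dict]]


def dual_retrieve(vector: list[float], search_vulns: Search,
                  search_rules: Search, k: int = 5) -> dict:
    """Run both searches with the same query vector and bundle the results."""
    return {
        "vulnerable_examples": search_vulns(vector, k),
        "compliance_rules": search_rules(vector, k),
    }


# Stub searches standing in for the two vector-database queries.
def fake_vuln_search(vector: list[float], k: int) -> list[dict]:
    return [{"id": "sample-a", "cwe": "CWE-79"}][:k]


def fake_rule_search(vector: list[float], k: int) -> list[dict]:
    return [{"id": "PCI 6.2.4", "source": "PCI"}][:k]


context = dual_retrieve([0.0] * 768, fake_vuln_search, fake_rule_search, k=5)
print(sorted(context))
```

The bundled `context` would then be combined with the user's code and sent to the LLM, as in Step 4.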

Parameterized Retrieval

Users can optionally select:

  • A compliance mode (e.g., “PCI only”, “HIPAA only”)
  • A vulnerability scan scope (e.g., “Common CWEs only”, “Critical CWEs”)

This is handled by adding metadata to Pinecone records:

{
  "type": "compliance_rule",
  "source": "PCI",
  "id": "PCI 6.2.4",
  "cwe": "CWE-79"
}

Then filter retrieval using Pinecone’s metadata filtering:

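// Assumes `index` is a Pinecone index handle, e.g. pinecone.index("<index-name>")
const results = await index.query({
  vector: embeddedCode,
  topK: 5,
  filter: { source: { $eq: "PCI" } },
  includeMetadata: true,
});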

RAG still applies — you’re just narrowing retrieval to user-defined categories.
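On the Python side, the user's selections could be translated into a metadata filter with a small helper like this one. It is a sketch under assumptions: the `build_filter` name and its parameters are hypothetical, and the metadata keys (`source`, `cwe`) follow the example record above. The `$eq` and `$in` operators are part of Pinecone's metadata filter syntax, and multiple top-level keys are combined with a logical AND.

```python
from typing import Optional


def build_filter(compliance_source: Optional[str] = None,
                 cwes: Optional[list[str]] = None) -> dict:
    """Translate optional user selections into a metadata filter dict."""
    flt: dict = {}
    if compliance_source:
        # e.g. "PCI only" mode
        flt["source"] = {"$eq": compliance_source}
    if cwes:
        # e.g. a "Critical CWEs" scope expressed as an explicit list
        flt["cwe"] = {"$in": list(cwes)}
    return flt


print(build_filter("PCI"))
print(build_filter("HIPAA", ["CWE-79", "CWE-89"]))
```

When nothing is selected the helper returns an empty dict; in that case the caller would simply omit the filter argument from the query.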

Static Analysis Tools Comparison

Examples:

  • Semgrep
  • Bandit
  • ESLint
  • SonarQube
  • Fortify

Strengths:

  • Fast and deterministic
  • Easy CI/CD integration
  • Rule-based, low compute cost
  • No hallucination

Limitations:

  • Rigid pattern matching (can’t generalize)
  • Can’t reason about complex logic or control flow
  • High false positive rate if rules aren’t tightly scoped
  • Weak cross-language/generalization ability
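To illustrate the first limitation, here is a deliberately naive, single-regex rule that flags direct `eval(` calls. It is a toy written for this example; the tools listed above use considerably richer matching than one regular expression. The point is only that a purely textual rule misses a trivially aliased form of the same call.

```python
import re

# Toy rule: flag any direct call written as `eval(`.
EVAL_RULE = re.compile(r"\beval\s*\(")


def flags(code: str) -> bool:
    """Return True if the toy rule matches anywhere in the code."""
    return EVAL_RULE.search(code) is not None


direct = "result = eval(user_input)"
aliased = "run = eval\nresult = run(user_input)"

print(flags(direct))
print(flags(aliased))
```

The aliased version behaves identically at runtime but never contains the text `eval(`, so the rule does not fire. Embedding-based matching is meant to be less sensitive to this kind of surface change.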

What Makes Sylint Unique

  • Uses semantic similarity instead of regex/patterns
  • Can detect non-exact, obfuscated, or stylistic vulnerabilities
  • Explanations in plain English (good for junior devs or audits)
  • Compliance-aware output (planned, see above), not just raw CVEs or rule flags
  • Supports multi-language input via embedding model
  • Easily extensible with new rules, frameworks, or CWE mappings
