Sylint

Sylint is an AI-powered, browser-based static vulnerability scanner for nine programming languages. It combines code embeddings from UniXcoder with retrieval-augmented generation (RAG) over an LLM to detect, explain, and help debug security issues. Rather than relying on pattern rules alone, Sylint compares code semantically against known vulnerable samples and grounds its explanations in those real-world vulnerability patterns, aiming for greater precision than traditional static analyzers or basic LLM-based tools.

Code Understanding with UniXcoder

Built on UniXcoder (microsoft/unixcoder-base) by Microsoft, a model trained across source code, comments, and ASTs.

Supports 9 languages: Python, JavaScript, Java, C, C++, PHP, Ruby, Go, and TypeScript.

Encodes code into 768-dimensional embeddings that capture logic, not just syntax.

Can flag vulnerable code even after surface-level changes (e.g., renamed variables, reordered structures), because matching is based on embeddings rather than exact text.
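
A minimal sketch of how a snippet could be turned into a 768-dimensional vector, assuming the checkpoint is loaded through Hugging Face `transformers` and token states are mean-pooled. The names `embed_code` and `l2_normalize` are illustrative, and the pooling strategy is an assumption; the project may pool differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def embed_code(code, model_name="microsoft/unixcoder-base"):
    """Return a unit-length embedding for a code snippet.

    Assumption: mean pooling over the last hidden state. Heavy imports are
    deferred so the module can be imported without torch installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, 768)

    mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding positions
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return l2_normalize(pooled[0].tolist())
```

Normalizing at embedding time is a design choice: it lets the vector store use a plain dot product as the similarity score.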

Vulnerability-Driven Dataset (CVEfixes)

Uses a filtered version of the CVEfixes dataset, focused solely on vulnerable samples from the NVD.

Associates each sample with language, CVE IDs, and CWE vulnerability classes.

Prioritizes matching real-world exploit patterns rather than safe or patched code.
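
The filtering step might look roughly like the sketch below. The `Sample` fields and the `is_vulnerable` flag are hypothetical, since the actual schema of the filtered CVEfixes export is not described here.

```python
from dataclasses import dataclass

# Languages listed in this README.
SUPPORTED = ["Python", "JavaScript", "Java", "C", "C++", "PHP", "Ruby", "Go", "TypeScript"]


@dataclass
class Sample:
    code: str
    language: str
    cve_id: str
    cwe_id: str
    is_vulnerable: bool  # hypothetical flag: True for the pre-fix version


def select_vulnerable(samples, languages=SUPPORTED):
    """Keep only vulnerable samples written in a supported language."""
    allowed = {lang.lower() for lang in languages}
    return [s for s in samples if s.is_vulnerable and s.language.lower() in allowed]
```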

Retrieval-Augmented Generation (RAG) for Explanations

On code submission, Sylint retrieves similar vulnerabilities based on embeddings.

These examples inform the LLM's explanation, grounding it in real-world evidence.

Grounding the explanation in retrieved examples is intended to make the analysis more accurate and consistent than an ungrounded LLM answer.
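
As an illustration of the grounding step, the sketch below assembles a prompt from retrieved matches. The dictionary keys (`cwe`, `cve`, `score`, `code`) and the prompt wording are assumptions for this example, not the project's actual prompt.

```python
def build_prompt(user_code, matches, max_examples=3):
    """Assemble an LLM prompt that cites retrieved vulnerable examples."""
    lines = [
        "You are a security reviewer. Use the reference examples as evidence.",
        "",
        "Reference examples:",
    ]
    for i, m in enumerate(matches[:max_examples], start=1):
        lines.append(f"[{i}] {m['cwe']} ({m['cve']}), similarity {m['score']:.2f}")
        lines.append(m["code"])
        lines.append("")
    lines.append("Code under review:")
    lines.append(user_code)
    lines.append("")
    lines.append("Explain any vulnerability, name the closest CWE, and suggest a fix.")
    return "\n".join(lines)
```

Capping the number of examples keeps the prompt within the model's context window and limits the influence of weak matches.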

Tech Stack

Frontend: Next.js (React + TypeScript) + Tailwind CSS

Backend:

Convex — Realtime database and serverless backend
FastAPI — Python service for AI model interaction (embeddings + LLM calls)

Authentication: Clerk.dev

AI Models:

UniXcoder (microsoft/unixcoder-base) — Code embedding model
Groq API (llama-3.3-70b-versatile) — Vulnerability explanation model

Vector Database:

Pinecone

Key Features

Monaco-based code editor with multi-language support
Syntax highlighting and language-aware defaults
Upload code and trigger vulnerability scans
Code embedding generation using UniXcoder
Similarity search against known vulnerable codebases
Retrieval-augmented vulnerability explanations
Save and view scan history (Convex backend)
Clerk authentication with Pro subscription gating
Webhook integration with Lemon Squeezy
Full HTTPS (SSL-secured) frontend/backend communication
Automatic CWE/CVE tagging based on LLM analysis
Exportable vulnerability reports (PDF/Markdown)

Getting Started

Clone and set up the project

git clone https://github.com/your-username/sylint.git
cd sylint
npm install

Run Frontend (Next.js):

npm run dev

Run Convex Backend:

npx convex dev

Run FastAPI service (embeddings and explanations):

cd ai-service
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

Make sure your .env file contains:

GROQ_API_KEY
Clerk keys (frontend/backend)
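
A small startup check along these lines could catch a missing key early. `missing_settings` is a hypothetical helper; it only inspects variables that are already in the process environment and does not parse the .env file itself.

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(["GROQ_API_KEY"])
    if missing:
        raise SystemExit(f"Missing required settings: {', '.join(missing)}")
```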

AI Service Endpoints

POST /embed — Generate 768-dimension code embedding via UniXcoder
POST /explain — Generate vulnerability explanation via Groq (llama-3.3-70b-versatile)
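
A client call to the embedding endpoint could be sketched as follows, using only the standard library. The request body shape (`{"code": ...}`) is an assumption, since the endpoint schema is not documented here; the port matches the uvicorn command above.

```python
import json
import urllib.request


def build_embed_request(code, base_url="http://localhost:8000"):
    """Build a POST /embed request. The JSON body shape is an assumption."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(request, timeout=30):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `post_json(build_embed_request("print(1)"))` would return whatever JSON the endpoint produces.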

Vector Database Setup

Store embeddings of vulnerable code samples (CWE/CVE dataset).

Enable semantic similarity search on user-submitted code.

Power retrieval-augmented LLM explanations.

Sylint can detect vulnerabilities even when the submitted code looks different from the original vulnerable example.
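
The similarity search itself can be illustrated with an in-memory stand-in for the vector database. This is a sketch of cosine-similarity top-k retrieval, not the Pinecone client; `top_k` and the record layout are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, records, k=5):
    """records: list of (record_id, vector). Returns [(record_id, score)], best first."""
    scored = [(rid, cosine(query, vec)) for rid, vec in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because the score depends on vector direction rather than exact tokens, two snippets with different identifiers can still rank as close neighbors.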

LLM Integration Details

Model: Llama 3.3 70B Versatile (llama-3.3-70b-versatile) via Groq API

Responsibilities:

Deep vulnerability reasoning
CWE/CVE tagging
Auto-fix patch suggestions
Exportable full vulnerability reports

Sylint Notes on Architecture, RAG Usage, and Next Steps

Current Architecture (RAG-Based Vulnerability Scanner)

User submits source code (JavaScript, Python, etc.)
Code is embedded using UniXcoder
Query sent to Pinecone (vector DB) containing ~4,000 vulnerable code snippets
Retrieve top-k most semantically similar code examples
Forward original code + top-k results to LLM (Groq’s Mixtral, llama-3, etc.)
LLM returns:
- Explanation of vulnerability
- CWE classification
- (Optionally) suggested fix

This approach is a form of Retrieval-Augmented Generation (RAG).
It uses semantic similarity to ground the LLM’s response with real-world examples, enabling detection of non-pattern-matching, obfuscated, or fuzzy vulnerabilities.
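
The flow above can be sketched as one function with the three stages injected as callables. The names `scan`, `embed`, `search`, and `explain` are placeholders for this example; in the real service they would wrap UniXcoder, Pinecone, and the Groq call.

```python
def scan(code, embed, search, explain, k=5):
    """Run embed -> retrieve -> explain and return matches plus the report."""
    vector = embed(code)                 # code -> embedding
    matches = search(vector, k)          # embedding -> top-k similar samples
    report = explain(code, matches)      # code + evidence -> LLM explanation
    return {"matches": matches, "report": report}
```

Injecting the stages keeps the pipeline testable with stubs, without loading a model or calling external APIs.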

Planned Improvements: Compliance-Aware RAG

Add a second Pinecone index (or namespace) for compliance rules:

Sources: NIST SP 800-53, PCI DSS v4.0, HIPAA, OWASP ASVS
Each rule is embedded as text (same model or a text-focused one)

Dual Retrieval on Submission:

Step 1: User submits code
Step 2: Embed the code
Step 3: Run two vector searches:
1. Against vulnerable code examples
2. Against compliance rule embeddings
Step 4: Combine both sets of results + user code and send to LLM
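
A possible shape for the dual retrieval step is sketched below. The namespace names and the `search(namespace, vector, k)` signature are hypothetical. The sketch takes two embedding functions because, if compliance rules are embedded with a text-focused model, the query must be embedded with that same model for the rule search.

```python
def dual_retrieve(code, embed_code, embed_for_rules, search, k=5):
    """Query the vulnerability and compliance stores and combine the results."""
    vuln_hits = search("vulnerabilities", embed_code(code), k)
    rule_hits = search("compliance", embed_for_rules(code), k)
    return {"vulnerabilities": vuln_hits, "compliance_rules": rule_hits}
```

When both stores use the same embedding model, the same function can be passed for `embed_code` and `embed_for_rules`.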

LLM returns:

Vulnerability summary
CWE mapping
Matched compliance rules (e.g., “Violates PCI 6.2.4”, “Conflicts with NIST SI-10”)
(Optional) Fix recommendation contextualized by regulation
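
If the LLM is instructed to answer in JSON, the fields above could be validated before display. The JSON keys and the `Finding` type below are assumptions made for illustration; the actual response format is not specified here.

```python
import json
import re
from dataclasses import dataclass, field
from typing import List, Optional

CWE_PATTERN = re.compile(r"^CWE-\d+$")


@dataclass
class Finding:
    summary: str
    cwe: str
    compliance: List[str] = field(default_factory=list)
    fix: Optional[str] = None


def parse_finding(raw):
    """Parse an LLM JSON reply and reject malformed CWE labels."""
    data = json.loads(raw)
    cwe = data.get("cwe", "")
    if not CWE_PATTERN.match(cwe):
        raise ValueError(f"unexpected CWE label: {cwe!r}")
    return Finding(
        summary=data["summary"],
        cwe=cwe,
        compliance=list(data.get("compliance", [])),
        fix=data.get("fix"),
    )
```

Validating the label format is a cheap guard against hallucinated or free-form classifications reaching the report.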

Parameterized Retrieval

Users can optionally select:

A compliance mode (e.g., “PCI only”, “HIPAA only”)
A vulnerability scan scope (e.g., “Common CWEs only”, “Critical CWEs”)

This is handled by adding metadata to Pinecone records:

{
  "type": "compliance_rule",
  "source": "PCI",
  "id": "PCI 6.2.4",
  "cwe": "CWE-79"
}

Then filter retrieval using Pinecone’s metadata filtering:

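// Assumes `index` is a handle to the Pinecone index that holds these records.
const results = await index.query({
  vector: embeddedCode,
  topK: 5,
  includeMetadata: true,
  filter: { source: { $eq: "PCI" } }
});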

RAG still applies — you’re just narrowing retrieval to user-defined categories.
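
On the Python side, user selections could be translated into a metadata filter with a helper like the one below. `build_filter` is a hypothetical name; the `$in` operator and the implicit AND across keys follow Pinecone's metadata filter syntax, and the `source` and `cwe` keys come from the record example above.

```python
def build_filter(sources=None, cwes=None):
    """Translate optional user selections into a Pinecone-style metadata filter.

    Returns None when nothing is selected, meaning "search everything".
    """
    flt = {}
    if sources:
        flt["source"] = {"$in": list(sources)}
    if cwes:
        flt["cwe"] = {"$in": list(cwes)}
    return flt or None
```

The result can then be passed as the `filter` argument of the index query, for example `index.query(vector=vec, top_k=5, filter=build_filter(sources=["PCI"]), include_metadata=True)`.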

Static Analysis Tools Comparison

Examples:

Semgrep
Bandit
ESLint
SonarQube
Fortify

Strengths:

Fast and deterministic
Easy CI/CD integration
Rule-based, low compute cost
No hallucination

Limitations:

Rigid pattern matching (can’t generalize)
Limited ability to reason about complex logic or control flow, especially in purely pattern-based tools
High false positive rate if rules aren’t tightly scoped
Weak cross-language/generalization ability

What Makes Sylint Unique

Uses semantic similarity instead of regex/patterns
Can detect non-exact, obfuscated, or stylistic vulnerabilities
Explanations in plain English (good for junior devs or audits)
Compliance-aware output, not just raw CVEs or rule flags
Supports multi-language input via embedding model
Easily extensible with new rules, frameworks, or CWE mappings


Repository: butlerem/vulnerability-scanner-UniXcoder-RAG
