Potential ReDoS bug in response_cleaner.py #162

@ShangzhiXu

Description

Describe the bug

Hi team, thanks for your great work! I think I found a small bug that might lead to a denial of service (DoS) in the system.
At line 77 in response_cleaner.py,
the regex r"\*\*Code Summary:\*\*\s*(.*?)\s*provides functions to" is vulnerable to ReDoS when it is used in
text = re.sub(

How to Reproduce

I have a test file that simulates an LLM response:

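import re
import time

# Same pattern as the one used in response_cleaner.py
_regex = re.compile(r'\*\*Code Summary:\*\*\s*(.*?)\s*provides functions to')

# Time the match for 0, 2000, ..., 8000 tabs after the prefix
for i in range(0, 8001, 2000):
    # The prefix matches, but "provides functions to" never appears
    attack_string = "**Code Summary:**" + "\t" * i
    start_time = time.time()
    match = _regex.match(attack_string)
    end_time = time.time()
    print(f"i: {i}, Time taken: {end_time - start_time} seconds")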

The output looks like this:

i: 0, Time taken: 0.0014803409576416016 seconds
i: 2000, Time taken: 10.000782012939453 seconds
i: 4000, Time taken: 73.86766386032104 seconds
i: 6000, Time taken: 231.146071434021 seconds
i: 8000, Time taken: 547.5873472690582 seconds

As the timings show, an input of around 6,000 characters makes the match hang for almost 4 minutes, and around 8,000 characters for about 9 minutes. Doubling the input from 2,000 to 4,000 tabs increases the time roughly 7x, so the cost grows about cubically with the string length.
If readme-ai is used in a server setup (e.g. readme-ai.streamlit), this bug may lead to high CPU usage or DoS if users submit malicious or resource-intensive repositories.

Expected behavior
I think we could add a limit, for example by replacing .*? with .{0,200}?. That might help with the backtracking problem.

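The core of the problem lies within \s*(.*?)\s*. All three quantifiers can match the same run of whitespace (. matches tabs and spaces too), so when the trailing "provides functions to" is missing, as in a malicious input, the engine backtracks through every way of splitting that run among them before it gives up.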
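To make the growth rate concrete, here is a rough counting model, not a trace of the actual `re` engine. It assumes the input is the prefix followed by `n` tabs, and counts how many ways `\s*`, `.*?` and the second `\s*` can divide those tabs before the literal check fails. The function name `split_count` and its `first_cap` parameter are hypothetical, introduced only for this illustration.

```python
from math import comb


def split_count(n, first_cap=None):
    """Count the ways three adjacent quantifiers can split a run of n tabs.

    Rough model (an assumption, not the real engine): the first \\s* takes
    a characters, .*? takes b, the second \\s* takes c, with a + b + c <= n.
    first_cap models a bounded first quantifier such as \\s{0,200}.
    """
    max_a = n if first_cap is None else min(first_cap, n)
    # For a fixed a, the pairs (b, c) with b + c <= n - a number C(n - a + 2, 2).
    return sum(comb(n - a + 2, 2) for a in range(max_a + 1))


if __name__ == "__main__":
    # Unbounded pattern: doubling n multiplies the count by about 8 (cubic).
    print(split_count(4000) / split_count(2000))
    # First quantifier capped at 200: doubling n gives about 4x (quadratic).
    print(split_count(8000, 200) / split_count(4000, 200))
```

Under this model the unbounded pattern grows cubically, which is in line with the roughly 7x jump between 2,000 and 4,000 tabs reported above, while capping one quantifier only reduces the growth to quadratic.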

I tested a modification of the regex, and the performance improved significantly: the time for 8,000 tabs dropped from about 548 seconds to about 40 seconds.

# after changing the regex to r'\*\*Code Summary:\*\*\s{0,200}(.*?)\s*provides functions to'
i: 100, Time taken: 0.001453399658203125 seconds
i: 1100, Time taken: 0.6828515529632568 seconds
i: 2100, Time taken: 2.861234188079834 seconds
i: 3100, Time taken: 5.831449031829834 seconds
i: 4100, Time taken: 10.390498399734497 seconds
i: 6000, Time taken: 23.510437726974487 seconds
i: 8000, Time taken: 40.42485165596008 seconds
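Even with the bounded quantifier, 8,000 tabs still take about 40 seconds. One possible alternative, offered only as a sketch, is to locate the two literal markers with `str.find`, which cannot backtrack. The helper name `extract_summary` and the constants below are hypothetical and are not part of readme-ai. The sketch only covers extracting the captured text, since the full `re.sub` call is not shown above, and it is not an exact equivalent of the regex (for example, it lets the captured text span newlines, which `.` would not).

```python
from typing import Optional

# Literal markers taken from the regex in this issue.
PREFIX = "**Code Summary:**"
SUFFIX = "provides functions to"


def extract_summary(text: str) -> Optional[str]:
    """Return the text between PREFIX and SUFFIX, stripped of whitespace.

    Hypothetical helper: returns None when either marker is missing.
    Each str.find call is a plain substring search, so there is no
    regex backtracking.
    """
    start = text.find(PREFIX)
    if start == -1:
        return None
    start += len(PREFIX)
    end = text.find(SUFFIX, start)
    if end == -1:
        return None
    return text[start:end].strip()


if __name__ == "__main__":
    print(extract_summary("**Code Summary:** The module provides functions to parse files"))
    # The attack string from this issue has no suffix, so the result is None.
    print(extract_summary("**Code Summary:**" + "\t" * 8000))
```

If the regex has to stay, capping the length of the text before it reaches `re.sub` would be another way to bound the worst case.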
