Common segfault in restore_og when running SWOOLE_PROCESS server. #5761

cjavad opened this issue May 9, 2025 · 3 comments

Comments

@cjavad
Contributor

cjavad commented May 9, 2025

Hi, over the last couple of days I've seen a sharp increase in worker processes dying from the same segmentation fault. So far this has only happened in our production systems, not in testing, but I've managed to capture 2 core dumps within 1 hour using Swoole compiled with debug symbols. We are running Swoole 6.0.2 with PHP 8.4.7 on Linux 6.8.0-58-generic under Docker in Alpine 3.21, and I've provided a Dockerfile that matches the application's exact environment.

Here is an excerpt of the issue:

Image

It appears that the PHPContext output_ptr gets corrupted somehow. I am not able to run more intrusive debugging tools such as Xdebug or Valgrind since this is production, but any ideas are welcome. This happens consistently: we are not using any native PHP modules that should affect this, and at times we have seen up to 20 of these faults in a single day.

To interactively inspect the core dumps, use the following Dockerfile as a base environment:

FROM php:8.4-cli-alpine AS build

# Install system dependencies
RUN apk --no-cache add \
    postgresql-dev \
    libzip-dev \
    autoconf \
    gcc \
    g++ \
    make \
    curl \
    libpng-dev \
    libjpeg-turbo-dev \
    freetype-dev \
    libxml2-dev \
    openssl-dev \
    curl-dev \
    linux-headers \
    sqlite-dev

RUN docker-php-ext-configure gd --with-freetype --with-jpeg \
    && docker-php-ext-install curl pdo_mysql pdo_sqlite gd sockets soap zip bcmath \
    && pecl install redis \
    && docker-php-ext-enable redis opcache \
    && pecl install --configureoptions 'enable-sockets="yes" enable-openssl="yes" enable-mysqlnd="yes" enable-swoole-curl="yes" enable-cares="no" enable-brotli="no" enable-zstd="no" enable-swoole-pgsql="no" with-swoole-odbc="no" with-swoole-oracle="no" enable-swoole-sqlite="no" enable-swoole-thread="no" enable-iouring="no" enable-debug="yes"' swoole-6.0.2 \
    && docker-php-ext-enable swoole


FROM php:8.4-cli-alpine

RUN apk --no-cache add \
    libzip \
    libpng \
    libjpeg-turbo \
    freetype \
    libxml2 \
    openssl \
    curl \
    libstdc++ \
    sqlite \
    gdb \
    binutils

COPY --from=build /usr/local/etc/php/conf.d /usr/local/etc/php/conf.d 
COPY --from=build /usr/local/lib/php/extensions /usr/local/lib/php/extensions

With GDB inside this container, open the core dump with gdb -c [COREDUMPFILE] and use info sharedlibrary to get the start offset of swoole.so (it will be something like 0x000073bf19013900).

With this offset, load the provided 20240924-debug-swoole.so file with add-symbol-file 20240924-debug-swoole.so [OFFSET] and answer y to read the symbol file. You now have the correct debug symbols to inspect the call stack etc.

A quick set of commands to get testing fast:

mkdir -p /tmp/dumps
cd /tmp/dumps
curl -O https://my-files.javad.sh/dumps/20240924-debug-swoole.so.gz
curl -O [COREDUMPFILE1]
curl -O [COREDUMPFILE2]
gunzip *
nano Dockerfile # Write dockerfile
docker build . -t gdb-test 
docker run --rm -it -u root -v /tmp/dumps:/tmp/dumps --entrypoint ash gdb-test
gdb -c /tmp/dumps/[COREDUMPFILE1]
> info sharedlibrary
> add-symbol-file /tmp/dumps/20240924-debug-swoole.so [OFFSET]
> bt

Since all the core dumps are from the same process (including the 10 new ones I got today), the offset is always 0x000073bf19013900, so you can skip Docker and just use that offset directly.

I am hosting the associated .so and coredump files.

  • 20240924-debug-swoole.so: https://my-files.javad.sh/dumps/20240924-debug-swoole.so.gz

The core dump files have had all sensitive environment variables etc. redacted, but I would still prefer to share them privately or GPG-encrypted. For convenience I've encrypted them for the Swoole maintainers who have public GPG keys; otherwise, please provide an email address or a GPG key you'd like to receive them with.

info.gpg.txt

@matyhtf
Member

matyhtf commented May 10, 2025

Please try to trace it with Valgrind

USE_ZEND_ALLOC=0 valgrind php your_code.php

@cjavad
Contributor Author

cjavad commented May 10, 2025

Please try to trace it with Valgrind

USE_ZEND_ALLOC=0 valgrind php your_code.php

Yes, I would love to be able to do this, but it only happens in production. I am thinking of doing a custom build with very specific logging to minimize the performance impact; if there are specific things you want tracked or enabled, I can do that.
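
One low-overhead shape such logging could take is a fixed-size in-memory ring of lifecycle events that is only read back after a crash. This is a minimal sketch, not Swoole code: `Event`, `Record`, `trace` and `last_event_for` are hypothetical names, and reading the ring out of a core dump assumes the dump includes the process's static data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lifecycle events worth recording per context.
enum class Event : std::uint8_t { Create = 1, Close = 2, RestoreOg = 3 };

struct Record {
    const void *ctx;  // address of the context the event refers to
    Event event;
};

constexpr std::size_t kCapacity = 1024;  // must be a power of two
static_assert((kCapacity & (kCapacity - 1)) == 0, "capacity must be a power of two");

static Record g_ring[kCapacity];  // one ring per worker process
static std::size_t g_next = 0;    // total number of events recorded so far

// Record one event: a single store and an increment, no I/O or allocation.
inline void trace(const void *ctx, Event ev) {
    assert(ctx != nullptr);
    g_ring[g_next & (kCapacity - 1)] = Record{ctx, ev};
    ++g_next;
}

// Return the most recent event still in the ring for ctx, or 0 if none.
inline int last_event_for(const void *ctx) {
    const std::size_t n = g_next < kCapacity ? g_next : kCapacity;
    for (std::size_t i = 0; i < n; ++i) {
        const Record &r = g_ring[(g_next - 1 - i) & (kCapacity - 1)];
        if (r.ctx == ctx) return static_cast<int>(r.event);
    }
    return 0;
}
```

Calling `trace(ctx, Event::Close)` where a context is freed and `trace(ctx, Event::RestoreOg)` where its output state is restored would make a Close followed by a RestoreOg on the same address stand out in the ring.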

At a high level, it appears that the origin_ctx is invalid (maybe a double free) when destroy_context is called on a child context. In case you want additional information, I've emailed you the files as well.

I'll try to create a minimal repro of this issue. Do you know of any way to see which PHP functions the coroutines are based on? That might help me do that.

On a side note, I am running all our test instances with Valgrind as you suggested above; I'll see what I can do to replicate the issue.

@cjavad
Contributor Author

cjavad commented May 15, 2025

I have been able to confirm that the issue is a use-after-free: Coroutine::on_close -> PHPContext::on_close -> efree(ctx) is called before restore_og runs on the PHPContext, perhaps a race condition of sorts.

I did this by explicitly overwriting the struct with 0xAA bytes before freeing it, so the execution path is more or less correct; see https://github.com/SimplyPrint/swoole-src/compare/d5cd7935442a62cd6856ab58029889b8d15efbd3..v6.0.2-output-ptr-track.
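
The poisoning technique can be sketched in isolation as follows. This is an illustration under assumptions, not the patch linked above: `FakeContext`, `ctx_alloc`, `ctx_free_poisoned` and `looks_poisoned` are hypothetical names, and a one-slot static pool stands in for the real allocator so the "freed" bytes can be read back without undefined behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical stand-in for the real context struct; only its shape matters.
struct FakeContext {
    void *output_ptr;
    std::uint64_t id;
};

constexpr unsigned char kPoison = 0xAA;

// One-slot pool: memory stays mapped after "free", so the demo can inspect it.
// With a real allocator, any read after free would be undefined behaviour.
alignas(FakeContext) static unsigned char slot[sizeof(FakeContext)];

FakeContext *ctx_alloc() {
    return new (slot) FakeContext{nullptr, 0};
}

// Overwrite every byte with the poison pattern, then release the object.
void ctx_free_poisoned(FakeContext *ctx) {
    assert(ctx != nullptr);
    std::memset(static_cast<void *>(ctx), kPoison, sizeof(FakeContext));
    // A real build would hand the memory back to the allocator here.
}

// True when all n bytes at p carry the poison pattern.
bool looks_poisoned(const void *p, std::size_t n) {
    const unsigned char *bytes = static_cast<const unsigned char *>(p);
    for (std::size_t i = 0; i < n; ++i) {
        if (bytes[i] != kPoison) return false;
    }
    return true;
}
```

On a 64-bit build, a pointer field read from memory poisoned this way shows up as 0xaaaaaaaaaaaaaaaa, which is easy to recognise in a backtrace or core dump.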

Image

I am still open to ideas on how to trace this in a minimally invasive way, as the issue has still only occurred in our production systems.

Edit:

Some more context: most likely the issue occurs when a coroutine (A) points to an origin (B) that has been closed with Coroutine::close from check_end. When A exits from main_func, it reschedules the origin, triggering the fault. Inspecting more core dumps, it appears that A->co->origin (get origin) is 0x0, so it returns main_context, which somehow is invalid? But I am not able to inspect the .bss regions in my core dumps, so I cannot see the values of Swoole::coroutines or main_context.
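
The fallback described here can be modelled in a few lines. This is a simplified sketch of the behaviour as described in this comment, not Swoole's actual implementation: `Co`, `Ctx`, `main_ctx` and `origin_context_of` are hypothetical names.

```cpp
#include <cassert>

struct Ctx;

// Hypothetical coroutine: knows its origin and the context bound to it.
struct Co {
    Co *origin;  // coroutine that resumed this one, or nullptr
    Ctx *task;   // context bound to this coroutine
};

// Hypothetical per-coroutine context.
struct Ctx {
    Co *co;
    const char *label;
};

// Process-wide context used when a coroutine has no origin.
static Ctx main_ctx{nullptr, "main"};

// Resolve the context to restore when `task` finishes: the origin's context
// if there is an origin, otherwise the process-wide main context.
Ctx *origin_context_of(const Ctx *task) {
    assert(task != nullptr && task->co != nullptr);
    Co *origin = task->co->origin;
    return origin ? origin->task : &main_ctx;
}
```

In this model, a null origin on A means whatever is restored comes from the main context rather than from B's context.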
