Skip to content

Conversation

iddm
Copy link
Contributor

@iddm iddm commented Jun 20, 2025

Description

Adds the profiling module for the compute code and exposes a compute endpoint "/profile/cpu" with three methods: GET, POST and DELETE, for checking the status of, starting and stopping the profiling.

The added profilers supported are perf and bcc-profile. perf is less stable than bcc-profile.

Problem

We need to allow for continuous profiling of compute (postgres).

Summary of changes

@iddm iddm self-assigned this Jun 20, 2025
@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch from ac9d2d3 to 193cd7b Compare June 20, 2025 11:12
@iddm iddm requested a review from myrrc June 20, 2025 11:13
Copy link

github-actions bot commented Jun 20, 2025

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results
8dbf5a8 at 2025-07-15T11:00:25.949Z :recycle:

@iddm
Copy link
Contributor Author

iddm commented Jun 20, 2025

@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch from 0d7c759 to 680156b Compare June 26, 2025 12:23
@iddm iddm marked this pull request as ready for review June 26, 2025 12:23
@iddm iddm requested a review from a team as a code owner June 26, 2025 12:23
@iddm iddm requested review from dimitri, thesuhas and myrrc June 26, 2025 12:23
Copy link
Contributor

@myrrc myrrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far, but I'd like one more review

@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch 5 times, most recently from b98d2ef to bdb009a Compare July 1, 2025 14:33
@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch from bdb009a to 28bc14c Compare July 3, 2025 19:14
@iddm
Copy link
Contributor Author

iddm commented Jul 3, 2025

Need to figure out the path to bcc-profile for the CI tests to pass.

@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch from 47a7048 to eaf610e Compare July 8, 2025 09:46
@iddm iddm requested review from a team as code owners July 8, 2025 12:22
@iddm iddm requested a review from gssbzn July 8, 2025 12:22
@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch 5 times, most recently from 86cc306 to d2fb3d9 Compare July 9, 2025 10:46
@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch 3 times, most recently from 2f712c9 to ef8df9b Compare July 11, 2025 10:43
Exposes an endpoint "/profile/cpu" for profiling the postgres
processes (currently spawned and the new ones) using "perf".

Adds the corresponding python test to test the added endpoint
and confirm the output expected is the profiling data in the
expected format.

Add "perf" binary to the sudo list.

Fix python poetry ruff

Address the clippy lints

Document the code

Format python code

Address code review

Prettify

Embed profile_pb2.py and small code/test fixes.

Make the code slightly better.

1. Makes optional the sampling_frequency parameter for profiling.
2. Avoids using unsafe code when killing a child.

Better code, better tests

More tests

Separate start and stop of profiling

Correctly check for the exceptions

Address clippy lint

Final fixes.

1. Allows the perf to be found in $PATH instead of having the path
hardcoded.
2. Changes the path to perf in the sudoers file so that the compute
can run it properly.
3. Changes the way perf is invoked, now it is with sudo and the path
from $PATH.
4. Removes the authentication requirement from the /profile/cpu/
endpoint.

hakari thing

Python fixes

Fix python formatting

More python fixes

Update poetry lock

Fix ruff

Address the review comments

Fix the tests

Try fixing the flaky test for pg17?

Try fixing the flaky test for pg17?

PYTHON

Fix the tests

Remove the PROGRESS parameter

Remove unused

Increase the timeout due to concurrency

Increase the timeout to 60

Increase the profiling window timeout

Try this

Lets see the error

Just log all the errors

Add perf into the build environment

uijdfghjdf

Update tempfile to 3.20

Snapshot

Use bbc-profile

Update tempfile to 3.20

Provide bpfcc-tools in debian

Properly respond with status

Python check

Fix build-tools dockerfile

Add path probation for the bcc profile

Try err printing

Refactor

Add bpfcc-tools to the final image

Add error context

sudo not found?

Print more errors for verbosity

Remove procfs and use libproc

Update hakari

Debug sudo in CI

Rebase and adjust hakari

remove leftover

Add archiving support

Correct the paths to the perf binary

Try hardcoded sudo path

Add sudo into build-tools dockerfile

Minor cleanup

Print out the sudoers file from github

Stop the tests earlier

Add the sudoers entry for nonroot, install kmod for modprobe for bcc-profile

Try hacking the kernel headers for bcc-profile

Redeclare the kernel version argument

Try using the kernel of the runner

Try another way

Check bpfcc-tools
@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch from ef8df9b to d8b0c08 Compare July 11, 2025 10:53
@iddm
Copy link
Contributor Author

iddm commented Jul 11, 2025

Squashed the 71 commits into one to make the rebases easier the next time.

@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch 9 times, most recently from 18d4d6f to 2ac828a Compare July 11, 2025 14:12
@iddm iddm force-pushed the iddm/postgres-continuous-profiling-endpoint branch from 2ac828a to 463429a Compare July 11, 2025 14:20
@iddm
Copy link
Contributor Author

iddm commented Aug 2, 2025

I am about to leave for a couple of weeks, so this is where I stopped.

  1. The spec shell code which we specify is NOT parsed correctly by the spec runner, as the multi-line shell code is completely ignored. So I had to collapse all the lines into a single line of shell code for enabling the kernel headers and the kernel modules.
  2. Some of the kernel paths are not automatically passed through to the VM, so to make sure those are always there, I had to provide yet another step to mount the bpf pseudo-fs.

This is almost a completely working code, apart from the autoscaling issue I mention in this PR: neondatabase/autoscaling#1401 which didn't allow me to test further until I ran out of time.

I, however, do not expect any other compute-related code changes anymore.

@bayandin bayandin requested review from bayandin and removed request for bayandin August 15, 2025 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants