Skip to content

CBG-4712 /_sgcollect_info endpoint to work with non default port #7601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jun 24, 2025

Conversation

torcolvin
Copy link
Collaborator

@torcolvin torcolvin commented Jun 23, 2025

CBG-4712 /_sgcollect_info endpoint to work with non default port

This fix was driven by the desire to have integration testing from go, and doing integration testing in go requires listening on ports that are not :4985 - it uses :0 and picks a port randomly.

Fixes made:

  • pass --sync-gateway-url from /_sgcollect_info
  • change the way URL is picked for SG (--sync-gateway-url, http://127.0.0.1:4985, https://127.0.0.1:4985)
  • always use --sync-gateway-executable first if present, rather than picking up default locations
  • remove get_absolute_path function since this would always throw an exception since upgrade to python3 (found with parsing log output for errors, then using python typing). This function mixes up bytes and str.
  • Close the io.Pipe for stderr/stdout, without this, each time you run sgcollect 2 goroutines are leaked.

Testing only improvements:

  • scope sgCollect instance to the ServerContext. In practice, this is the same for a sync gateway executable but for testing it means not worrying about cleaning up globals since we can modify the ServerContext and it is torn down between tests. This also allows sgCollectPaths to use a ctx with logging information on it, for filtering.
  • Create sgCollectOutputStream struct to make handling output a bit easier. This needs to be closed if the command doesn't start (e.g. path not found). For testing, we need to wait for the output to be closed before reading it. Because it is buffered output, it lasts a bit longer than cmd.Wait() and the race detector is tripped.
  • Add sgCollect.stderr / sgCollect.stdout to monitor errors from sgcollect output.
  • Return sgCollectBinary as a []string{} so it can be triggered via python sgcollect_info. This is only needed for testing, in normal production this will be a single executable that self extracts a python installation.

I think I'd like to set a build flag or environment variable so this test doesn't run as a part of standard dev runs, it is just too slow.

Pre-review checklist

  • Removed debug logging (fmt.Print, log.Print, ...)
  • Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
  • Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Integration Tests

@torcolvin torcolvin requested a review from bbrks June 23, 2025 16:49
@bbrks bbrks requested a review from Copilot June 23, 2025 16:53
Copilot

This comment was marked as outdated.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the /_sgcollect_info endpoint to support non‐default ports and random port assignment during Go integration tests by centralizing URL resolution, passing the admin URL to the Python script, always honoring a user‐specified executable, cleaning up leaked goroutines, and improving test isolation.

  • Refactor Python sgcollect.py to:
    • Introduce get_sg_url and can_connect_to_sg_url for robust URL selection
    • Always use --sync-gateway-executable first, remove broken get_absolute_path
    • Close stdout/stderr pipes to avoid goroutine leaks
  • Enhance Go rest package to:
    • Replace global singleton with per‐ServerContext sgCollect instance
    • Add sgCollectOutputStream to manage and close pipes cleanly
    • Pass the resolved admin URL into sgcollect_info and wire through options
  • Improve tests to expect SystemExit(0), scope instances to contexts, and cover new code paths

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tools/sgcollect.py Added URL‐selection helpers, removed path helper, tightened auth
tools-tests/upload_test.py Updated tests to expect SystemExit(0)
tools-tests/sgcollect_info_test.py Expanded coverage for new functions and edge cases
rest/sgcollect.go Introduced sgCollectOutputStream, per‐context instance, cleanup
rest/sgcollect_test.go Added long‐running integration test, imported missing packages
rest/admin_api.go Wired through adminURL, switched to per‐context sgcollect
rest/server_context.go Initialized sgcollect in ServerContext
Comments suppressed due to low confidence (2)

tools/sgcollect.py:786

  • The docstring says this function returns None on failure, but the code actually returns an empty string. Update the documentation to reflect that it returns an empty string when no path is found.
    Return the path to the sync gateway binary, returns None if the path is not found.

rest/sgcollect.go:365

  • [nitpick] This comment is unclear and contains a fragment (where is it ,). Consider rewriting to clearly explain that sgCollectPaths returns the SG binary, a slice of strings for the script path, and an error if not found.
// The sgcollect_info return value is allowed to be a list of strings for testing, where is it , or an error if not.

Copy link
Member

@bbrks bbrks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggested change to log output from a /_sgcollect_info REST request, otherwise looks good and tested fine

Co-authored-by: Ben Brooks <ben.brooks@couchbase.com>
@@ -222,3 +226,62 @@ func TestSgcollectOptionsArgs(t *testing.T) {
})
}
}

func TestSGCollectIntegration(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've probably already seen - but something is a bit screwed up on Windows in this test. It eventually failed after 420 seconds

2025-06-24T14:11:44.0631070Z 2025-06-24T14:10:15.092Z [INF] HTTP: c:#5360 GET /_sgcollect_info (as ADMIN)
2025-06-24T14:11:44.0631219Z 2025-06-24T14:10:17.092Z [INF] HTTP: c:#5361 GET /_sgcollect_info (as ADMIN)
2025-06-24T14:11:44.0631335Z     sgcollect_test.go:272: 
2025-06-24T14:11:44.0631645Z         	Error Trace:	D:/a/sync_gateway/sync_gateway/rest/sgcollect_test.go:276
2025-06-24T14:11:44.0632218Z         	            				C:/hostedtoolcache/windows/go/1.24.4/x64/src/runtime/asm_amd64.s:1700
2025-06-24T14:11:44.0632339Z         	Error:      	Not equal: 
2025-06-24T14:11:44.0632530Z         	            	expected: "stopped"
2025-06-24T14:11:44.0632720Z         	            	actual  : "running"
2025-06-24T14:11:44.0632832Z         	            	
2025-06-24T14:11:44.0632999Z         	            	Diff:
2025-06-24T14:11:44.0633170Z         	            	--- Expected
2025-06-24T14:11:44.0633314Z         	            	+++ Actual
2025-06-24T14:11:44.0633471Z         	            	@@ -1 +1 @@
2025-06-24T14:11:44.0633613Z         	            	-stopped
2025-06-24T14:11:44.0633746Z         	            	+running
2025-06-24T14:11:44.0633860Z     sgcollect_test.go:272: 
2025-06-24T14:11:44.0634159Z         	Error Trace:	D:/a/sync_gateway/sync_gateway/rest/sgcollect_test.go:272
2025-06-24T14:11:44.0634329Z         	Error:      	Condition never satisfied
2025-06-24T14:11:44.0634514Z         	Test:       	TestSGCollectIntegration
2025-06-24T14:11:44.0634736Z         	Messages:   	sgcollect_info did not stop running in time
2025-06-24T14:11:44.0634907Z 2025-06-24T14:10:19.091Z [INF] HTTP: c:#5362 GET /_sgcollect_info (as ADMIN)
2025-06-24T14:11:44.0635072Z 2025-06-24T14:10:19.092Z [INF] HTTP: c:#5363 DELETE /_sgcollect_info (as ADMIN)
2025-06-24T14:11:44.0635318Z 2025-06-24T14:10:19.092Z [INF] HTTP: t:TestSGCollectIntegration Closing HTTP Server: 127.0.0.1:54226
2025-06-24T14:11:44.0635555Z 2025-06-24T14:10:19.093Z [INF] HTTP: t:TestSGCollectIntegration Closing HTTP Server: 127.0.0.1:54227
2025-06-24T14:11:44.0635792Z 2025-06-24T14:10:19.093Z [INF] HTTP: t:TestSGCollectIntegration Closing HTTP Server: 127.0.0.1:54228
2025-06-24T14:11:44.0636042Z 2025-06-24T14:10:19.093Z [INF] HTTP: t:TestSGCollectIntegration Closing HTTP Server: 127.0.0.1:54225
2025-06-24T14:11:44.0636144Z --- FAIL: TestSGCollectIntegration (420.12s)

@bbrks bbrks assigned torcolvin and unassigned bbrks Jun 24, 2025
tools/tasks.py Outdated
@@ -178,7 +178,7 @@ def execute(self, fp):
p = subprocess.Popen(
self.command,
bufsize=-1,
stdin=subprocess.PIPE,
stdin=subprocess.DEVNULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised to see this is not a change already made in cbcollect_info if it's actually required, and I am also not sure quite why this change?

Surely sgcollect_info was not completely broken on Windows up until this point?

The commit message isn't really indicative of the change being made. There's nothing Windows or CI specific about this change, and we're not closing stdin.

Will this change impact interactive password prompts in future? I guess maybe not since that's not stdin, and it's instead just a prompt from the program.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We close stdin right after launching this, but this makes sure it is already closed and is more correct. I wondered if there was a race condition. sgcollect isn't designed to collect any data from stdin for any diagnostic programs it runs.

https://github.com/couchbase/sync_gateway/actions/runs/15852522483/job/44689423355?pr=7601

Shows wmic running at:

2025-06-24T14:03:48.830Z [INF] c:SGCollect-5382 sgcollect_info: Computer system (wmic computersystem) - OK

And then the test runs until:

2025-06-24T14:10:39.220Z [INF] HTTP: c:#5592 GET /_sgcollect_info (as ADMIN)

which is 7 minutes. I guessed on two reasons for this:

  1. wmic is slow - it would have been running WindowsTask("Installed software", "wmic product get name, version"), for 7 minutes, which seems weird, because that seems like it would normally be fast. That's why I thought it could be asking for something from stdin.
  2. wmic is also known to be slow, when getting the event log. That's the most useful behavior it can do, and also the slowest. So this is expensive for the last 30 days, but on a github actions machine it doesn't stay running for very long. This is the equivalent of the systemd/dmesg logs for windows.

wmic is in fact deprecated and going to be removed - https://learn.microsoft.com/en-us/windows/win32/wmisdk/wmic

That said, I'm going to have another commit for moving the tests to a separate package so it doesn't bang up against the 10 minute timeout for the rest package. When running the tests via github, this will also increase parallelism since packages are run simultaneously.

@torcolvin torcolvin merged commit 910c3d3 into main Jun 24, 2025
46 checks passed
@torcolvin torcolvin deleted the CBG-4712 branch June 24, 2025 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants