Description
The licensedcode cache is not always used when multiple processes run in parallel. We noticed this while stress-testing scancode: when multiple tests were started in separate processes at the same time, each process built its own cache instead of using the existing one. This had a considerable performance cost and eventually led to a LockTimeout.
The root cause is in licensedcode/cache.py: after a process obtains the lock, it should check whether another process has already built the cache while it was waiting, but it does not.
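A minimal sketch of the missing check, assuming a pickle-based cache file. This is not scancode's actual code: `file_lock`, `load_or_build`, `build`, and the simplified `LockTimeout` here are hypothetical stand-ins written only to illustrate re-checking for the cache after the lock is acquired.

```python
import os
import pickle
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock cannot be acquired in time (stand-in class)."""


@contextmanager
def file_lock(lock_path, timeout=10.0, poll=0.05):
    # Hypothetical lock: atomically create the lock file, retry until timeout.
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(timeout)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def _load(cache_path):
    with open(cache_path, "rb") as f:
        return pickle.load(f)


def load_or_build(cache_path, lock_path, build, timeout=10.0):
    # Fast path: the cache already exists, so no lock is needed.
    if os.path.exists(cache_path):
        return _load(cache_path)

    with file_lock(lock_path, timeout=timeout):
        # Re-check under the lock: another process may have built the
        # cache while this one was waiting.
        if os.path.exists(cache_path):
            return _load(cache_path)

        data = build()
        # Write to a temporary file, then rename, so that readers on the
        # fast path never see a partially written cache.
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(data, f)
        os.replace(tmp_path, cache_path)
        return data
```

With this pattern, only the first process to take the lock pays the build cost. Every later process loads the file as soon as it gets the lock, so lock hold times stay short and a timeout becomes much less likely.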
How To Reproduce
This was noticed when stress-testing a local test:
    with NamedTemporaryFile() as test_file:
        test_contents = bytes(MIT_LICENSE_TEXT.encode("utf-8"))
        test_file.write(test_contents)
        test_file.seek(0)
        results = get_licenses(test_file.name)  # slow
        license_expression = results["detected_license_expression"]
        self.assertEqual(license_expression, "mit")
We ran this in 100 processes in parallel and got the following traceback:
    Traceback (most recent call last):
      File "scancode/api.py", line 200, in get_licenses
        for detection in detections:
      File "licensedcode/detection.py", line 1947, in detect_licenses
        index = cache.get_index()
      File "licensedcode/cache.py", line 459, in get_index
        return get_cache(
      File "licensedcode/cache.py", line 399, in get_cache
        return populate_cache(
      File "licensedcode/cache.py", line 419, in populate_cache
        _LICENSE_CACHE = LicenseCache.load_or_build(
      File "licensedcode/cache.py", line 136, in load_or_build
        with lockfile.FileLock(lock_file).locked(timeout=timeout):
      File "runtime/lib/python3.10/contextlib.py", line 135, in __enter__
        return next(self.gen)
      File "scancode/lockfile.py", line 29, in locked
        raise LockTimeout(timeout)
    scancode.lockfile.LockTimeout: 360
System configuration
- OS: Linux
- What version of scancode-toolkit was used to generate the scan file? scancode-toolkit-mini 32.3.2
- What installation method was used to install/run scancode? pip