Skip to content

WebSocketClosedError on write_message (correlation with memory accumulation) #699

@dwsutherland

Description

@dwsutherland

Description

Cause not known, but it would appear a message is sent before closure of the websocket could complete or known about:

Task exception was never retrieved
future: <Task finished name='Task-10268090' coro=<TornadoSubscriptionServer.on_start() done, defined at /opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py:129> exception=WebSocketClosedError()>
Traceback (most recent call last):
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 148, in on_start
    await self.send_execution_result(
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 173, in send_execution_result
    await BaseSubscriptionServer.send_execution_result(
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/graphql_ws/base_async.py", line 181, in send_message
    return await connection_context.send(message)
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 49, in send
    await self.ws.write_message(data)
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/tornado/websocket.py", line 360, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 151, in on_start
    await self.send_error(connection_context, op_id, e)
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/graphql_ws/base_async.py", line 181, in send_message
    return await connection_context.send(message)
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 49, in send
    await self.ws.write_message(data)
  File "/opt/niwa/share/cylc/8.4.3/lib/python3.9/site-packages/tornado/websocket.py", line 360, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError

Note: This code has changed location content since 1.6.x/8.4.x, however the send mechanism is likely similar so a fix should be against 1.7.0 once understood

Also, this was found in NIWA operations, and there are indications this may have a large contribution to memory accumulation of the UI Server...

Servers usually start with about ~3-4Gb memory used on their designated VMs and currently, after a similar load and run duration, the oper and test systems differ in their memory used:
Oper:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         16G         13G         57M        664M         13G
Swap:            0B          0B          0B

Test:

              total        used        free      shared  buff/cache   available
Mem:            15G        3.8G        592M         61M         11G         10G
Swap:            0B          0B          0B

The only difference found was oper has 36 of these errors in the log (test none):

[ecox_oper@w-ec-admin02 ~]$ grep -c 'During handling of the above exception' nohup.out 
36

where oper UIS appears to have a compounding memory build..

Perhaps a coincidence, however, if subscriptions aren't cleaned up properly their queues could be accumulating deltas.. (which would explain the apparent compounding effect as the error count increases)

Reproducible Example

Unknown.

Expected Behaviour

Send attempt on closed web-socket is handled gracefully, ending all associated subscriptions (etc...) like on_close

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions