Skip to content

Implement remote table registration calls #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 20, 2025

Conversation

DavidStirling
Copy link
Member

This PR adds the second stage of remote table registration intended for use with the omero-plus API.

Changes as follows:

  • Remote table registration can now be performed with just omero2pandas. Usage docs included.
  • Registration has been broken up into multiple functions to permit just creating the tiledb and not registering as has been done with some test workflows. This is now accessible at omero2pandas.remote.create_tiledb (was register_table)
  • Had to fix Refactor handling of existing connection objects #21 to get the new system to play nicely with existing omero client objects
  • omero-py already requires requests so no new dependencies were needed
  • Remote registration needed a CSRF token to interact with the API, so we first make a call to get the token and then send our registration request. This is a bit more fragile than I'd like it to be, so I'd be happy to hear of alternative approaches.

@DavidStirling DavidStirling requested review from sbesson, erindiel, kkoz and mabruce and removed request for sbesson April 11, 2025 11:37
Copy link
Member

@kkoz kkoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the README, we should decide on whether or not we're capitalizing TileDB

README.md Outdated
Comment on lines 351 to 354
validation process to ensure that the table seen by the server is the same one that
the user has requested the API to register. This is achieved by writing a "SecretToken"
key to the tiledb array metadata. The tiledb file seen by the server must have a key
matching the one provided in the registration call managed by omero2pandas.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of the SecretToken is not really validation - it's to ensure the client attempting to register a file has read access to that file.

# path.as_uri() exists but mangles any spaces in the path!
write_path = str(write_path)
output_path = str(output_path)
# Use a default chunk size if not set
chunk_size = chunk_size or 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the default chunk_size in the function definition sufficient here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions above this one permit explicitly supplying chunk_size=None which is interpreted as "determine automatically". For most modes this process involves inspecting the input table to figure out how many rows the API can process at once, but this is largely irrelevant for local TileDB conversion.

It'd be possible to resolve this in the calls to create_tiledb by not passing chunk_size at all if it was None, but having to unpack this created more mess than just having a fallback here.

content = _check_response(result)
ann_id = content["data"]["file_annotation"]
LOGGER.info(f"Registered table successfully as FileAnnotation {ann_id}")
return ann_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we return the annotation ID instead of the Original File ID? In most places we use Original File ID for table identification.

Copy link
Contributor

@mabruce mabruce Apr 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the annotation ID in omero table register

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also do this elsewhere in the library. This was chosen on the basis that getting an OriginalFile given the FileAnnotation is easier than figuring out the FileAnnotation starting from the OriginalFile.

@kkoz
Copy link
Member

kkoz commented Apr 17, 2025

Testing:

First I attempted to run the following:

    client = omero.client('localhost', 4064)
    df = pd.read_csv(csv_path)
    res = omero2pandas.upload_table(
        df, 'o2p Test Table',
        local_path='/home/kevin/testdata/tables/o2ptest.tiledb',
        omero_connector=client,
        username='kevin', password='mypass')
    print(res)

And got

Generating TileDB file...:   0%|                                                                                                                      | 6/? rows, 00:00 
Traceback (most recent call last):
  File "/home/kevin/code/scripts/o2ptest.py", line 27, in <module>
    o2ptest()
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/home/kevin/code/scripts/o2ptest.py", line 19, in testclient
    res = omero2pandas.upload_table(
  File "/home/kevin/code/omero2pandas/omero2pandas/__init__.py", line 245, in upload_table
    ann_id = create_remote_table(source, table_name, local_path,
  File "/home/kevin/code/omero2pandas/omero2pandas/remote.py", line 47, in create_remote_table
    ann_id = register_table(connector, remote_path, table_name, links, token,
  File "/home/kevin/code/omero2pandas/omero2pandas/remote.py", line 100, in register_table
    raise ValueError("Unknown server? This should never happen!")
ValueError: Unknown server? This should never happen!

Seems supplying your own omero.client is not supported here - should it be?

I then attempted the following:

And got the following error:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='localhost', port=443): Max retries exceeded with url: /api/v0/token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x74dca0518e50>: Failed to establish a new connection: [Errno 111] Connection refused'))

This is almost certainly because my local install is not configured for https. Do we want to have an option to support this or are we comfortable only supporting https-enabled servers?

I proceeded with my testing by replacing https with http in line 101 or remote.py

However, I then got the following error:

ERROR:omero2pandas.remote:Request returned HTTP code 400: Badly formed request body
Traceback (most recent call last):
  File "/home/kevin/code/scripts/o2ptest.py", line 39, in <module>
    o2ptest()
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/home/kevin/code/scripts/o2ptest.py", line 31, in testcreds
    res = omero2pandas.upload_table(
  File "/home/kevin/code/omero2pandas/omero2pandas/__init__.py", line 245, in upload_table
    ann_id = create_remote_table(source, table_name, local_path,
  File "/home/kevin/code/omero2pandas/omero2pandas/remote.py", line 47, in create_remote_table
    ann_id = register_table(connector, remote_path, table_name, links, token,
  File "/home/kevin/code/omero2pandas/omero2pandas/remote.py", line 131, in register_table
    content = _check_response(result)
  File "/home/kevin/code/omero2pandas/omero2pandas/remote.py", line 148, in _check_response
    response.raise_for_status()
  File "/home/kevin/omero/server/omerovenv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost/omero_plus/api/v0/table?bsession=e5b745c0-aed6-4c5a-b73a-f609e888c7f7

Not sure what's happening here - I'll look into it a bit more.

There was also some unexpected naming behavior I encountered during testing.
The first time you run upload_table as I did above, the TileDB is created at local_path. However, running the exact same command again will result in another TileDB getting written inside the original one, since write_path.is_dir() now evaluates to true.
https://github.com/glencoesoftware/omero2pandas/pull/28/files#diff-66280cb8c4d95664d7c51ed98c0b88b22f42efa4244f1abfe2123a5e3c61f704R36
This is probably not the behavior clients will be expecting - I would guess either the operation should fail if anything is found at local_path or the TileDB should be overwritten.

@kkoz
Copy link
Member

kkoz commented Apr 17, 2025

It looks like the BadRequest error was because there were no targets specified in the request body. There should probably be a better error message from the API, but in the meantime we should require links here.

@kkoz
Copy link
Member

kkoz commented Apr 17, 2025

With a link provided the registration works, however the table headers are not what I was expecting:
image

@kkoz
Copy link
Member

kkoz commented Apr 17, 2025

With a link provided the registration works, however the table headers are not what I was expecting:

This may be an issue with the csv being formatted for import with omero-metadata. I'll check

@kkoz
Copy link
Member

kkoz commented Apr 17, 2025

Confirmed this was the issue. I don't think most users would expect this tool to work with the omero-metadata stuff in the csv so don't worry about this.

@DavidStirling
Copy link
Member Author

Seems supplying your own omero.client is not supported here - should it be?

Should be fixed now, handling the different client objects we may get is tricky so I'm glad I put a check in there 😄

This is almost certainly because my local install is not configured for https. Do we want to have an option to support this or are we comfortable only supporting https-enabled servers?

Path parsing in Python without adding dependencies is more clunky than I'd like, so for now I'd hard-coded in https as a requirement. If we want to support http we could stick in an extra argument or something? Happy to hear ideas, especially those that don't involve urllib.

There was also some unexpected naming behavior I encountered during testing.
The first time you run upload_table as I did above, the TileDB is created at local_path. However, running the exact same command again will result in another TileDB getting written inside the original one, since write_path.is_dir() now evaluates to true.

Nice catch, I missed that TileDBs are seen as directories by the os. For convenience I'd wanted to let the user supply folder names instead of full paths to dump the TileDB based on the table name instead. Pushed a tweak that should catch the TileDB nesting issue, but if that's not sufficient we can require absolute paths to .tiledb files as the inputs.

Support for an overwrite arg is a good idea. IMO this would probably be worth it's own PR as the download methods would benefit here, it's not exclusive to remote mode.

It looks like the BadRequest error was because there were no targets specified in the request body. There should probably be a better error message from the API, but in the meantime we should require links here.

Fixed

Confirmed this was the issue. I don't think most users would expect this tool to work with the omero-metadata stuff in the csv so don't worry about this.

👍

Copy link
Member

@kkoz kkoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling and functionality working as expected.

@mabruce
Copy link
Contributor

mabruce commented Apr 30, 2025

I'm getting that same Bad Request for url ... error, but can't seem to find a workable combination of parameters.

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
omero2pandas.upload_table(df, "Table", parent_type='Image', parent_id=36445,
                          local_path="table.tiledb",
                          remote_path="/data/marc/tables/table1.tiledb")
ERROR:omero2pandas.remote:Request returned HTTP code 400: Failed to open file at given URI
Traceback (most recent call last):
  File "C:\Users\Marc\Desktop\remote_table.py", line 5, in <module>
    omero2pandas.upload_table(df, "Table", parent_type='Image', parent_id=36445,
  File "C:\Users\Marc\Desktop\remote_csv.venv\lib\site-packages\omero2pandas\__init__.py", line 245, in upload_table
    ann_id = create_remote_table(source, table_name, local_path,
  File "C:\Users\Marc\Desktop\remote_csv.venv\lib\site-packages\omero2pandas\remote.py", line 48, in create_remote_table
    ann_id = register_table(connector, remote_path, table_name, links, token,
  File "C:\Users\Marc\Desktop\remote_csv.venv\lib\site-packages\omero2pandas\remote.py", line 132, in register_table
    content = _check_response(result)
  File "C:\Users\Marc\Desktop\remote_csv.venv\lib\site-packages\omero2pandas\remote.py", line 149, in _check_response
    response.raise_for_status()
  File "C:\Users\Marc\Desktop\remote_csv.venv\lib\site-packages\requests\models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://[url]/omero_plus/api/v0/table?bsession=[uuid]

I've tried adding links=[('Image', 36445)] but get the same error.

Auth here is through omero-user-token.

@DavidStirling
Copy link
Member Author

I'm getting that same Bad Request for url ... error, but can't seem to find a workable combination of parameters.

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
omero2pandas.upload_table(df, "Table", parent_type='Image', parent_id=36445,
                          local_path="table.tiledb",
                          remote_path="/data/marc/tables/table1.tiledb")

@mabruce It looks like you're trying to save to table.tiledb locally but register table1.tiledb on the server side?

@mabruce
Copy link
Contributor

mabruce commented Apr 30, 2025

Yes, my assumption was that this would work like with ROI_Converter_NGFF --server_directory where you could run the conversion on your local machine (telling OMERO where to find it later) and then move the data to its final location. Would this approach be expected to work with remote.create_tiledb + remote.register_table?

@DavidStirling
Copy link
Member Author

DavidStirling commented Apr 30, 2025

Yes, you'd have to perform the steps individually to achieve that workflow. The wrapper is primarily intended for scenarios where storage is mounted on both the client and server machines.

Copy link
Contributor

@mabruce mabruce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Successfully registered a table "remotely" via symlinks to destination folder on server.

@sbesson sbesson merged commit ef09991 into glencoesoftware:main May 20, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor handling of existing connection objects
4 participants