Add remote csv reading functions #29
Conversation
```python
reporter.reset(total=size)

def __enter__(self):
    self.open()
```
If `open()` is called in `__init__`, I don't think it needs to be called here as well.
```python
if file_id == self._file_id:
    # We only close if the current file is the active one
```
It's not clear to me how this check will ever fail.
Thanks @kkoz

Much of this is down to odd behaviour in BlitzGateway. As far as I can tell, the user can only ever create a single `RawFileStore` object; subsequent calls modify the existing instance. Therefore, if they were to open and access another file while this object exists, any subsequent reads would interact with whichever file was accessed last rather than the file our reader was made for.

To solve this I've made the reader defensively check that the target file hasn't changed by calling `self.open()` before any reads. In most scenarios the active file ID will still be correct and nothing happens, but we don't want to risk reading the contents of a totally irrelevant file.
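The defensive pattern can be sketched against a toy stand-in for the shared store. `SharedStore` and `DefensiveReader` here are hypothetical stubs for illustration, not the PR's actual classes:

```python
class SharedStore:
    """Stand-in for BlitzGateway's single shared RawFileStore (stub).

    Every reader ends up talking to this one object, and setFileId()
    retargets it for everyone.
    """
    def __init__(self, files):
        self._files = files
        self._file_id = None

    def setFileId(self, file_id):
        self._file_id = file_id

    def read(self, position, length):
        return self._files[self._file_id][position:position + length]


class DefensiveReader:
    """Reader that re-asserts its target file before every read."""
    def __init__(self, store, file_id):
        self._store = store
        self._file_id = file_id
        self.open()

    def open(self):
        # Cheap when the id is unchanged; if another reader retargeted
        # the shared store, this points it back at our file.
        self._store.setFileId(self._file_id)

    def read(self, position, length):
        self.open()  # defensive check described above
        return self._store.read(position, length)


store = SharedStore({1: b"first file", 2: b"second file"})
a = DefensiveReader(store, 1)
b = DefensiveReader(store, 2)  # retargets the shared store to file 2
print(b.read(0, 6))  # b"second"
print(a.read(0, 5))  # b"first" - still correct despite b's read
```

Without the `self.open()` call inside `read()`, the final read would return bytes from file 2.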
Hmm - so is the idea that if the client hands you a BlitzGateway and you attempt to use the `RawFileStore` associated with it, they may also be using that same `RawFileStore` and you'll be clobbering each other's `setFileId` calls? I would say that if that happens (i.e. the client calls `setFileId` after omero2pandas does) we should throw, not just overwrite again. Otherwise we'll screw up whatever the client was trying to do.
For the most part we can probably expect users not to need to return to `RawFileStore` objects after the fact, but a challenge with this library is that we also want to support Jupyter notebook usage. In that scenario (and indeed during testing) it was possible to create an `OriginalFileIO` reader, load in a second file, and then encounter errors when trying to use the first reader again. From a user perspective, if I create two file readers I'd expect them both to work seamlessly, so I opted for checking and updating the file id as reads are requested.

The most notable use case I can imagine would be a user wanting to merge two large CSV files stored as FileAnnotations. To do this chunk-wise you might use two readers simultaneously, so throwing when another reader is touched would break this. Admittedly this is all rather messy and niche to begin with, but perhaps we should show a warning if the file id was changed instead of throwing?
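That chunk-wise merge scenario can be sketched with in-memory file-like objects standing in for two `OriginalFileIO` readers (hypothetical stand-ins; the real readers wrap OMERO files):

```python
import io

import pandas as pd

# Two "readers" being consumed in lockstep: both must stay usable for
# the whole loop, so invalidating one when the other is read would
# break this pattern.
left = io.StringIO("id,x\n1,10\n2,20\n3,30\n")
right = io.StringIO("id,y\n1,0.1\n2,0.2\n3,0.3\n")

merged = pd.concat(
    (a.merge(b, on="id")
     for a, b in zip(pd.read_csv(left, chunksize=1),
                     pd.read_csv(right, chunksize=1))),
    ignore_index=True)
print(merged.columns.tolist())  # ['id', 'x', 'y']
```

Each iteration pulls one chunk from each reader, so the reads interleave throughout.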
My understanding is that the short answer here is that you can't really re-use the reader (or create multiple readers) at all. Once `close()` is called on a `RawFileStore`, it becomes unusable. See https://docs.openmicroscopy.org/omero-common/5.6.1/javadoc/ome/api/StatefulServiceInterface.html#close--

So if a user attempts to read two files at the same time, once `__exit__()` is called by the first file, reads on the second file will begin to fail. I agree that that's not what users are expecting, but that's just how `RawFileStore` works, so I don't think there's much we can do about it right now. Maybe @chris-allan would have some ideas about how to make this work as desired.
Testing:

- Read CSV by FileAnnotation ID: (result screenshot removed) But I'm not sure if we can/need to do anything about it. The result appeared in all tests, but I won't bother restating the issue there.
- Read CSV by OriginalFile ID:
- Read Gzip by FileAnnotation:
- Read Gzip by OriginalFile ID:
- Download CSV by FileAnnotation ID:
- Download Gzip by FileAnnotation ID:
- Read CSV no mimetype:
- Read CSV incorrect mimetype:
- Non-existent FileAnnotation ID:
- Non-existent OriginalFile ID:
- Non-CSV file content: Not sure this is really a bug though; I suppose in some sense it is a valid CSV. When I try to do the same with an image instead of a text file, I get (output removed). I'm probably fine with this handling of the error.
- Small chunk size:
- Large chunk size (bigger than file):
I was able to successfully read and download a raw and gzipped csv, with varying chunk sizes, and with all or a subset of columns.
```python
object_id, object_type = _validate_requested_object(
    file_id=file_id, annotation_id=annotation_id)

assert not os.path.exists(target_path), \
```
I think an optional `overwrite` would be a useful parameter here.
I agree, but I'm debating whether to do that as a distinct PR, since we'd also want to apply it to the `download_table` function.
This is something of a known issue due to pandas sacrificing precision for speed when loading in CSVs. There's actually an option to supply
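For reference, pandas' `read_csv` exposes a `float_precision` argument, and `"round_trip"` selects a slower parser that exactly round-trips Python's own float formatting. This may be the option alluded to above, though I'm inferring that from context:

```python
import io

import pandas as pd

value = 1.2345678901234567  # near the limit of double precision
csv = io.StringIO(f"x\n{value!r}\n")

# The default parser trades some precision for speed; "round_trip"
# guarantees the parsed float matches Python's float() of the text.
df = pd.read_csv(csv, float_precision="round_trip")
print(df["x"][0] == value)  # True
```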
I think this is expected: if it looks like a CSV we'll attempt to feed it to pandas and let that library interpret what was sent in.
Per discussion, it turns out the RawFileStore instance shenanigans are a BlitzGateway quirk, so by using the client object we can create a dedicated RawFileStore for our reader. I've revised this PR with that in mind. In terms of managing these, we should no longer need to check whether the store is open when reading. In terms of cleanup, context manager mode will automatically close the reader.
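The cleanup contract can be sketched with a standard context-manager pattern. This is an illustrative stub, not the PR's actual `OriginalFileIO` implementation; the `closed` flag stands in for releasing the dedicated store:

```python
class OriginalFileIO:
    """Stub showing the cleanup contract: the dedicated store is
    released either via the context manager or an explicit close()."""
    def __init__(self):
        self.closed = False  # stands in for the dedicated RawFileStore

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()  # context-manager mode closes automatically


with OriginalFileIO() as reader:
    print(reader.closed)  # False while inside the block
print(reader.closed)      # True after __exit__ has run
```

Callers who don't use `with` can call `close()` themselves.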
A few users have encountered situations where they want to get a dataframe with omero2pandas but they've uploaded their table as a normal FileAnnotation CSV rather than an OMERO.tables object. Particularly with large files this is frustrating, so I've tried to extend omero2pandas to support reading these OriginalFile objects.
To achieve this we construct a file-like interface around the Python API (`OriginalFileIO`). This is compatible with pandas' standard CSV reader and can be loaded directly as a table in a chunk-wise fashion. I've also supported gzipped CSVs, since those should be compatible with the interface.

New top-level convenience functions are as follows:

- `omero2pandas.read_csv` loads a FileAnnotation/OriginalFile into a pandas dataframe.
- `omero2pandas.download_csv` saves a FileAnnotation/OriginalFile directly to disk.

These are designed to have a similar API to the `read_table` and `download_table` methods, but naturally don't support the full spectrum of features such as running queries.

You could technically use the `download_csv` function to download any OriginalFile object that has its size correctly defined in the model, so I did include a `check_type` argument that can be disabled to allow it to be a bit more flexible. I imagine this could be useful if the mimetype field is incorrect or something.