Add remote csv reading functions #29
Conversation
```python
reporter.reset(total=size)

def __enter__(self):
    self.open()
```
If `open()` is called in `__init__`, I don't think it needs to be called here as well.
```python
if file_id == self._file_id:
    # We only close if the current file is the active one
```
It's not clear to me how this check will ever fail.
Thanks @kkoz

Much of this is down to odd behaviour in BlitzGateway. As far as I can tell, the user can only ever create a single `RawFileStore` object; subsequent calls modify the existing instance. Therefore, if they were to open and access another file while this object exists, any subsequent reads would interact with whichever file was accessed last rather than the file our reader was made for.

To solve this I've made the reader defensively check that the target file hasn't changed by calling `self.open()` before any reads. In most scenarios the active file ID will still be correct and nothing happens, but we don't want to risk reading the contents of a totally irrelevant file.
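The defensive pattern can be sketched against a toy stand-in for the shared store. `SharedStore` and `DefensiveReader` here are hypothetical stubs for illustration, not the PR's actual classes:

```python
class SharedStore:
    """Stand-in for BlitzGateway's single shared RawFileStore (stub).

    Every reader ends up talking to this one object, and setFileId()
    retargets it for everyone.
    """
    def __init__(self, files):
        self._files = files
        self._file_id = None

    def setFileId(self, file_id):
        self._file_id = file_id

    def read(self, position, length):
        return self._files[self._file_id][position:position + length]


class DefensiveReader:
    """Reader that re-asserts its target file before every read."""
    def __init__(self, store, file_id):
        self._store = store
        self._file_id = file_id
        self.open()

    def open(self):
        # Cheap when the id is unchanged; if another reader retargeted
        # the shared store, this points it back at our file.
        self._store.setFileId(self._file_id)

    def read(self, position, length):
        self.open()  # defensive check described above
        return self._store.read(position, length)


store = SharedStore({1: b"first file", 2: b"second file"})
a = DefensiveReader(store, 1)
b = DefensiveReader(store, 2)  # retargets the shared store to file 2
print(b.read(0, 6))  # b"second"
print(a.read(0, 5))  # b"first" - still correct despite b's read
```

Without the `self.open()` call inside `read()`, the final read would return bytes from file 2.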
Hmm - so is the idea that if the client hands you a BlitzGateway and you attempt to use the `RawFileStore` associated with it, they may also be using that same `RawFileStore` and you'll be clobbering each other's `setFileId` calls? I would say that if that happens (i.e. the client calls `setFileId` after omero2pandas does) we should throw, not just overwrite again. Otherwise we'll screw up whatever the client was trying to do.
For the most part we can probably expect users not to need to return to `RawFileStore` objects after the fact, but a challenge with this library is that we also want to support Jupyter notebook usage. In that scenario (and indeed during testing) it was possible to create an `OriginalFileIO` reader, load in a second file, and then encounter errors when trying to use the first reader again. From a user perspective, if I create two file readers I'd expect them both to work seamlessly, so I opted for checking and updating the file id as reads are requested.

The most notable use case I can imagine would be a user wanting to merge two large CSV files stored as FileAnnotations. To do this chunk-wise you might use two readers simultaneously, so throwing when another reader is touched would break this. Admittedly this is all rather messy and niche to begin with, but perhaps we should show a warning if the file id was changed instead of throwing?
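That chunk-wise merge scenario can be sketched with in-memory file-like objects standing in for two `OriginalFileIO` readers (hypothetical stand-ins; the real readers wrap OMERO files):

```python
import io

import pandas as pd

# Two "readers" being consumed in lockstep: both must stay usable for
# the whole loop, so invalidating one when the other is read would
# break this pattern.
left = io.StringIO("id,x\n1,10\n2,20\n3,30\n")
right = io.StringIO("id,y\n1,0.1\n2,0.2\n3,0.3\n")

merged = pd.concat(
    (a.merge(b, on="id")
     for a, b in zip(pd.read_csv(left, chunksize=1),
                     pd.read_csv(right, chunksize=1))),
    ignore_index=True)
print(merged.columns.tolist())  # ['id', 'x', 'y']
```

Each iteration pulls one chunk from each reader, so the reads interleave throughout.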
My understanding is that the short answer here is that you can't really re-use the reader (or create multiple readers) at all. Once `close()` is called on a `RawFileStore`, it becomes unusable. See https://docs.openmicroscopy.org/omero-common/5.6.1/javadoc/ome/api/StatefulServiceInterface.html#close--

So if a user attempts to read two files at the same time, once `__exit__()` is called by the first file, reads on the second file will begin to fail. I agree that that's not what users are expecting, but that's just how `RawFileStore` works, so I don't think there's much we can do about it right now. Maybe @chris-allan would have some ideas about how to make this work as desired.
Testing:

- Read CSV by FileAnnotation ID: (result screenshot removed) But I'm not sure if we can/need to do anything about it. The result appeared in all tests, but I won't bother restating the issue there.
- Read CSV by OriginalFile ID:
- Read Gzip by FileAnnotation:
- Read Gzip by OriginalFile ID:
- Download CSV by FileAnnotation ID:
- Download Gzip by FileAnnotation ID:
- Read CSV no mimetype:
- Read CSV incorrect mimetype:
- Non-existent FileAnnotation ID:
- Non-existent OriginalFile ID:
- Non-CSV file content: Not sure this is really a bug though; I suppose in some sense it is a valid CSV. When I try to do the same with an image instead of a text file, I get (output removed). I'm probably fine with this handling of the error.
- Small chunk size:
- Large chunk size (bigger than file):
I was able to successfully read and download a raw and gzipped csv, with varying chunk sizes, and with all or a subset of columns.
```python
object_id, object_type = _validate_requested_object(
    file_id=file_id, annotation_id=annotation_id)

assert not os.path.exists(target_path), \
```
I think an optional `overwrite` would be a useful parameter here.
I agree, but I'm debating whether to do that as a distinct PR, since we'd also want to apply it to the `download_table` function.
This is something of a known issue due to pandas sacrificing precision for speed when loading in CSVs. There's actually an option to supply
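For reference, pandas' `read_csv` exposes a `float_precision` argument, and `"round_trip"` selects a slower parser that exactly round-trips Python's own float formatting. This may be the option alluded to above, though I'm inferring that from context:

```python
import io

import pandas as pd

value = 1.2345678901234567  # near the limit of double precision
csv = io.StringIO(f"x\n{value!r}\n")

# The default parser trades some precision for speed; "round_trip"
# guarantees the parsed float matches Python's float() of the text.
df = pd.read_csv(csv, float_precision="round_trip")
print(df["x"][0] == value)  # True
```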
I think this is expected: if it looks like a CSV we'll attempt to feed it to pandas and let that library interpret what was sent in.
Per discussion, it turns out the RawFileStore instance shenanigans are a BlitzGateway quirk, so by using the client object we can create a dedicated RawFileStore for our reader. I've revised this PR with that in mind. In terms of managing these, we should no longer need to check whether the store is open when reading. In terms of cleanup, context manager mode will automatically close the reader.
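The cleanup contract can be sketched with a standard context-manager pattern. This is an illustrative stub, not the PR's actual `OriginalFileIO` implementation; the `closed` flag stands in for releasing the dedicated store:

```python
class OriginalFileIO:
    """Stub showing the cleanup contract: the dedicated store is
    released either via the context manager or an explicit close()."""
    def __init__(self):
        self.closed = False  # stands in for the dedicated RawFileStore

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()  # context-manager mode closes automatically


with OriginalFileIO() as reader:
    print(reader.closed)  # False while inside the block
print(reader.closed)      # True after __exit__ has run
```

Callers who don't use `with` can call `close()` themselves.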
A few users have encountered situations where they want to get a dataframe with omero2pandas but they've uploaded their table as a normal FileAnnotation CSV rather than an OMERO.tables object. Particularly with large files this is frustrating, so I've tried to extend omero2pandas to support reading these OriginalFile objects.
To achieve this we construct a file-like interface around the Python API (`OriginalFileIO`). This is compatible with pandas' standard CSV reader and can be loaded directly as a table in a chunk-wise fashion. I've also supported gzipped CSVs, since those should be compatible with the interface.

New top-level convenience functions are as follows:

- `omero2pandas.read_csv` loads a FileAnnotation/OriginalFile into a pandas dataframe.
- `omero2pandas.download_csv` saves a FileAnnotation/OriginalFile directly to disk.

These are designed to have a similar API to the `read_table` and `download_table` methods, but naturally don't support the full spectrum of features such as running queries.

You could technically use the `download_csv` function to download any OriginalFile object that has its size correctly defined in the model, so I did include a `check_type` argument that can be disabled to allow it to be a bit more flexible. I imagine this could be useful if the mimetype field is incorrect or something.