New table API endpoints for lower level table access #564

Merged
merged 22 commits into from
Jul 29, 2024

knabar
Member

@knabar knabar commented Jun 21, 2024

This PR adds two new endpoints for lower level table access:

  • For querying: /webgateway/table/123/rows/ takes a query argument in the usual format and returns the matching row numbers. No actual data is returned. Paging support is limited: each call returns at most MAX_TABLE_SLICE_SIZE rows and takes a start argument giving the row at which to begin the search. rowCount and end are returned so the client can easily detect the end of the results or the need for additional calls (with the new start set to the previous end).
  • For data retrieval: /webgateway/table/123/slice/ takes numeric lists of rows and columns and performs a slice. Data is returned in columnar format (vs. rows in the current table API). The number of requested rows times columns cannot exceed MAX_TABLE_SLICE_SIZE, which defaults to 1 million. Both GET and POST are supported to allow for large numbers of rows and columns (exceeding the possible query string length). For retrieval of consecutive rows or columns, ranges can be specified as e.g. 1-5 instead of 1,2,3,4,5.

Examples:

/webgateway/table/123/rows/?query=object%3E100000&start=8470
{"rows":[8470,8471,8472,8473,8474,8475,8476],"meta":{"rowCount":8477,"start":8470,"end":8477}}
/webgateway/table/123/slice/?rows=8005-8010&columns=0-1
{"columns":[[114391,114392,114393,114394,114395,114396],[NaN,NaN,NaN,NaN,NaN,NaN]],"meta":{"columns":["object","name"],"rowCount":8477}}
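Because /slice/ returns data column by column, a client that wants row-oriented records has to transpose the payload. A minimal sketch, assuming the response shape shown in the example above; `columnar_to_rows` is a hypothetical client-side helper and not part of this PR.

```python
import math


def columnar_to_rows(payload):
    """Transpose a /slice/ style payload into a list of per-row dicts.

    Assumes payload["columns"] holds one list per column and
    payload["meta"]["columns"] holds the matching column names,
    as in the example response above.
    """
    names = payload["meta"]["columns"]
    columns = payload["columns"]
    # zip(*columns) walks the column lists in parallel, one row at a time.
    return [dict(zip(names, values)) for values in zip(*columns)]


# Payload based on the example above, shortened to three rows for brevity.
payload = {
    "columns": [[114391, 114392, 114393], [math.nan, math.nan, math.nan]],
    "meta": {"columns": ["object", "name"], "rowCount": 8477},
}
rows = columnar_to_rows(payload)
print(rows[0]["object"])  # 114391
```

Keeping the data columnar on the wire avoids repeating the column names for every row; the transpose is cheap to do on the client when needed.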

Notes:

  • Added a way to supply additional formatting options to json.dumps, allowing e.g. for removing whitespace from the output
  • Unrelated to this PR, since NaNs are allowed in the output but not supported by plain JSON, clients can use JSON5 for parsing
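For Python clients specifically, the standard `json` module already accepts the bare `NaN` token when parsing, and `separators` is one `json.dumps` option that removes whitespace. The sketch below illustrates both points on an abridged copy of the example response. Which formatting options the endpoints actually pass to `json.dumps` is not shown in this thread, so the `separators` value here is an assumption.

```python
import json
import math

# Response body abridged from the /slice/ example; note the bare NaN tokens,
# which strict JSON parsers reject.
body = (
    '{"columns":[[114391,114392],[NaN,NaN]],'
    '"meta":{"columns":["object","name"],"rowCount":8477}}'
)

# Python's json.loads accepts NaN, Infinity and -Infinity by default.
data = json.loads(body)
print(math.isnan(data["columns"][1][0]))  # True

# The default separators are (", ", ": "); the compact pair drops the spaces.
default_text = json.dumps(data["meta"])
compact_text = json.dumps(data["meta"], separators=(",", ":"))
print(compact_text)  # {"columns":["object","name"],"rowCount":8477}
```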

@knabar knabar requested review from chris-allan and will-moore June 21, 2024 10:10
@knabar knabar force-pushed the feature-row-queries branch from 0fa91ba to db88511 Compare June 21, 2024 10:24
@will-moore
Member

It looks like rowCount and end are based on the whole table rather than the results of the query:
e.g. on merge-ci with webgateway/table/22909/rows/?query=(Count>12)%26(Count<23) I'm seeing:

{
"rows": [
  1,
  3
],
"meta": {
  "rowCount": 13,
  "start": 0,
  "end": 13
}
}

Is that expected?

@knabar
Member Author

knabar commented Jun 28, 2024

@will-moore Yes, that is expected.

The total number of results is not available, since the query is not run on the whole table, but only from start to start+MAX_TABLE_SLICE_SIZE.

Figuring out the total is left to the client, by adding up the number of returned rows for each call. For tables with fewer than MAX_TABLE_SLICE_SIZE rows, it'll be just one call anyway.

end is not actually the number of rows of the table, but the row where querying ended, so the client can easily continue querying by passing in the end of the first query as the start of the second. end==rowCount indicates that there is nothing else to query.

start is there to indicate at which row the querying started, which is redundant, since the client would know anyway, but perhaps there are situations where it is useful.
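The paging contract described above (pass the previous end as the next start, stop when end equals rowCount) can be sketched as a client loop. This is an illustration only: `find_all_rows`, `fetch_page` and `fake_fetch` are hypothetical names, and the fake server stands in for a real HTTP request.

```python
def find_all_rows(fetch_page):
    """Collect matching row numbers across repeated /rows/ calls.

    fetch_page(start) is a caller-supplied function (hypothetical here) that
    performs one request and returns the decoded JSON response as a dict.
    """
    matches = []
    start = 0
    while True:
        page = fetch_page(start)
        matches.extend(page["rows"])
        meta = page["meta"]
        # end == rowCount means the whole table has been searched.
        if meta["end"] >= meta["rowCount"]:
            return matches
        # Guard against a response that makes no progress.
        if meta["end"] <= start:
            raise RuntimeError("server did not advance past start")
        start = meta["end"]


# Stand-in for the server: a 10-row table searched 4 rows per call,
# where the even-numbered rows match the query.
def fake_fetch(start, row_count=10, window=4):
    end = min(start + window, row_count)
    rows = [r for r in range(start, end) if r % 2 == 0]
    meta = {"rowCount": row_count, "start": start, "end": end}
    return {"rows": rows, "meta": meta}


all_rows = find_all_rows(fake_fetch)
print(all_rows)       # [0, 2, 4, 6, 8]
print(len(all_rows))  # total number of matches, summed on the client
```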

@will-moore
Member

Everything is working just fine for /slice when I use the query correctly, so I tried playing about a bit (Not all these need to be fixed/changed)...
If I make a mistake, the error handling is varied:

E.g. If I make an invalid query like this:
/slice/?rows=3-1&columns=oops
I get a nice "error": "Need to specify comma-separated list of rows and columns".

If I forget what needs to be in the query e.g.
/slice/?rows=3-1
I get a less friendly:

"message": "'NoneType' object has no attribute 'split'",
"stacktrace": "Traceback (most recent call last):\n  File \"/home/omero/workspace/OMERO-web/.venv3/lib64/python3.9/site-packages/omeroweb/webgateway/views.py\", line 1447, in wrap\n    rv = f(request, *args, **kwargs)\n  File \"/home/omero/workspace/OMERO-web/.venv3/lib64/python3.9/site-packages/omeroweb/webgateway/views.py\", line 3586, in perform_slice\n    for item in source.get(\"columns\").split(\",\")\nAttributeError: 'NoneType' object has no attribute 'split'\n"

If I use the range the wrong way around I get the whole table (regardless of the number of rows and columns):

?rows=3-1&columns=4-3

If I go out of range,

/slice/?rows=3-1&columns=1-100

I get "error": "Error slicing table". Don't know how hard/expensive it is to check this before trying the slicing to give a more helpful message?
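The failure modes listed above (a missing parameter, a reversed range, an out-of-range index) can all be caught cheaply before the slice is attempted. The following is only a sketch of such checks, not the code that was added to the PR; `parse_indices` and its error messages are hypothetical.

```python
def parse_indices(spec, limit, name):
    """Parse a spec such as "1,4-6" into a list of indices below limit.

    Raises ValueError with a readable message instead of failing later
    with an AttributeError or a generic slicing error.
    """
    if not spec:
        raise ValueError(f"Missing '{name}' parameter")
    indices = []
    for item in spec.split(","):
        first, sep, last = item.partition("-")
        last = last if sep else first
        if not (first.isdecimal() and last.isdecimal()):
            raise ValueError(f"Invalid {name} item '{item}'")
        lo, hi = int(first), int(last)
        if lo > hi:
            raise ValueError(f"Reversed range '{item}' in {name}")
        if hi >= limit:
            raise ValueError(f"{name} index {hi} out of range (limit {limit})")
        # Ranges are inclusive at both ends.
        indices.extend(range(lo, hi + 1))
    return indices


print(parse_indices("1,4-6", limit=10, name="rows"))  # [1, 4, 5, 6]
for bad in (None, "3-1", "1-100", "oops"):
    try:
        parse_indices(bad, limit=10, name="columns")
    except ValueError as exc:
        print(exc)
```

All checks work on the request string and the known table dimensions, so none of them needs to touch the table data.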

@knabar
Member Author

knabar commented Jul 1, 2024

@will-moore added better error handling and checks

@knabar
Member Author

knabar commented Jul 2, 2024

Made some more convenience changes:

  • Added a collapse query string argument for /webgateway/table/123/rows/ that collapses sequential row numbers into the same format that is supported by the /slice/ call, so the results can be passed back in easily while significantly reducing the amount of data transferred:
{
  "rows": [
    2,
    3,
    "5-7",
    9,
    10,
    "12-20",
    "22-27",
    "33-35"
  ],
  "meta": {}
}
  • Added additional items to the returned metadata, including columnCount and maxCells to give more information to the client on how to range check requests before submitting them, and partialCount, which is the number of matching rows returned from the /rows/ call. Note that this is not necessarily the number of matches in the whole table, which is why I called it partialCount, but open to suggestions for better names.
"meta": {
  "partialCount": 25,
  "rowCount": 14336,
  "columnCount": 105,
  "start": 0,
  "end": 14336,
  "maxCells": 1000000
}
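With columnCount and maxCells in the metadata, a client can size its /slice/ requests so that rows times columns stays within the limit. A rough sketch, assuming the limit is enforced as rows times columns not exceeding maxCells and that all columns are requested; `row_batches` is a hypothetical client-side helper.

```python
def row_batches(rows, column_count, max_cells):
    """Split rows so that len(batch) * column_count <= max_cells."""
    if column_count < 1:
        raise ValueError("column_count must be at least 1")
    rows_per_request = max_cells // column_count
    if rows_per_request < 1:
        raise ValueError("a single row already exceeds max_cells")
    return [
        rows[i:i + rows_per_request]
        for i in range(0, len(rows), rows_per_request)
    ]


# Values taken from the metadata example above.
meta = {"rowCount": 14336, "columnCount": 105, "maxCells": 1000000}
all_row_numbers = list(range(meta["rowCount"]))
batches = row_batches(all_row_numbers, meta["columnCount"], meta["maxCells"])
print([len(b) for b in batches])  # [9523, 4813]
```

For this table, fetching every column of every row therefore takes two /slice/ calls.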

@chris-allan
Member

chris-allan commented Jul 2, 2024

I don't think having collapse is a good precedent to set. Looping over the results in pure Python is going to perform wildly differently depending on the result set so knowing whether to use collapse is not something that is easy to do.

Edit: Furthermore, I think it's a usability downgrade. The client then would need to decompress the result set of getWhereList() in order to know which rows actually match or use it with slice(). If you want contiguous slices then exposing read() is better anyway as it just takes the column numbers and a start stop.
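The decompression step mentioned here would look roughly like the following on the client. This is a sketch that expands the collapsed format from the earlier example; `expand_rows` is a hypothetical helper name.

```python
def expand_rows(collapsed):
    """Expand a mix of ints and "a-b" range strings into plain row numbers."""
    rows = []
    for item in collapsed:
        if isinstance(item, int):
            rows.append(item)
        else:
            first, last = item.split("-")
            # Ranges are inclusive at both ends, e.g. "5-7" -> 5, 6, 7.
            rows.extend(range(int(first), int(last) + 1))
    return rows


# Collapsed result copied from the example in the previous comment.
collapsed = [2, 3, "5-7", 9, 10, "12-20", "22-27", "33-35"]
rows = expand_rows(collapsed)
print(len(rows))  # 25
```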

Member

@will-moore will-moore left a comment

Error handling is improved, thanks.
Looks good.

@knabar
Member Author

knabar commented Jul 3, 2024

Removed the collapse option after some performance testing that did not show a worthwhile improvement.

@knabar knabar added this to the 5.27.0 milestone Jul 5, 2024
@chris-allan
Member

Before this goes in we definitely need to expand the ome/openmicroscopy integration tests to cover these new endpoints. Specifically components/tools/OmeroWeb/test/integration/test_table.py.

omeroweb/settings.py (review thread outdated, resolved)
omeroweb/webgateway/urls.py (review thread resolved)
@knabar knabar merged commit b071a89 into ome:master Jul 29, 2024
10 checks passed
@knabar knabar deleted the feature-row-queries branch July 29, 2024 08:50