Feedback on CDX Server requirements page #305
Thanks @ikreymer for the valuable input.

First of all, this is a work in progress, so there are probably a lot of use-cases still to be written, including use-cases for processing by a Map-Reduce framework. To get started, I began to write up some use-cases based on the end-user experience in OpenWayback. As you correctly mention, several of these use-cases can be supported by the same set of functionality from the CDX-Server. I just wanted the list to be as complete as possible.

While working with OWB and the CDX-Server, it has been a challenge for me to understand all the options supported by the CDX-Server, both because several parameters are not documented and because I do not understand the use-case for every parameter. I might be wrong, but my impression is that sometimes functionality has been implemented to work around shortcomings which preferably should have been fixed elsewhere. I think the 3.0.0 version could be a good time to clean up, even if it breaks backward compatibility. After all, the current CDX-Server is still marked as BETA.

I was also thinking of including a comment for each use-case saying whether the use-case is valid or doesn't make sense. This way we would keep track of ideas that have been rejected. The main goal of this document is to make it easier for new people (like me) to participate in coding. The use-cases written so far only reflect my understanding. I would love it if you (or anybody else) would add, remove or improve use-cases.

One thought I have is whether it would be better to split the API into more than one path and reduce the number of parameters. For example, the from/to parameters don't make sense combined with sort=closest; instead the closest parameter is used for the timestamp. I don't think wildcard queries should be combined with closest either (at least I have not found the use-case). Maybe we could split the API into something like a /query which allows a timestamp parameter, and a /search which allows from/to parameters and wildcards and/or matchType. This way the API would be more explicit, and less code for checking combinations of parameters would be needed on the server side.

I would also like to know the use-cases for parallel queries with the paging API. If I'm not mistaken, the current implementation could give different results for the paging API than for the regular API if you use collapsing. This is because the paging API will not know whether the starting point for a page, except for the first one, is in the middle of several lines to be collapsed. Maybe a separate path (e.g. /bulk) should be added which does not allow manipulation like collapse?

For revisits, I wonder if that is a task for the CDX-server to resolve. I see the efficiency of one pass as you mention, but also the problem with url agnostic revisits. Maybe that should either go into a separate index keyed on digest, or be added as extra fields when creating CDX-files instead of adding them dynamically.
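To make the proposed split concrete, it might look like this (hypothetical host, paths and parameters, purely illustrative of the idea, not an existing API):

```
# /query: timegate-style lookup, results sorted closest to a timestamp
curl "http://cdxserver.example.org/mycoll/query?url=http://example.com/&timestamp=20130101000000"

# /search: range scans with from/to, wildcards and/or matchType, no closest-sorting
curl "http://cdxserver.example.org/mycoll/search?url=example.com/*&from=2012&to=2014"
```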
Thanks @johnerikhalse It sounds like you'd prefer editing in the wiki directly. I can move some of my comments to the page and add additional use cases. Another option might be to have an issue for each proposed cdx server change or suggestion, so that folks can comment on additional suggestions as well, and then at the end, modify the wiki based on the suggestions.
Hm, I think that from/to can still make sense with sort=closest, if you wanted to restrict the results to a certain date range (not certain if that works currently). The closest sort is definitely a special case.
Yes, the collapsing would just happen on that specific page, as that is all that is read. I can see the benefits of a separate 'bulk' api in some sense, though removing functionality may not be the best approach. I think the cdx server APIs should be thought of as a very low level api, designed to be used by other systems and perhaps advanced users. As such, there could be combinations that don't quite make sense. There could be higher level APIs, such as a 'calendar page api' or 'closest replay api', which prevent invalid combinations. Similarly, there could be a 'bulk query api' which helps the user run bulk queries.
Yeah, I suppose this can be removed and it would be up to the caller to resolve the revisit. Or, better yet, there could be a separate 'record lookup API' which queries the cdx API one or more times to find the record and the original (if revisit).
I agree that we should have an issue for each change in the cdx server. I also agree that the cdx server is a pretty low level API. The document I started writing is meant as background for proposing changes. It will hopefully be easier to see why a change is necessary when you can relate it to one or more use cases. It wasn't my intention to describe the cdx server in that document; that should go into separate documentation of the API. Usually, fulfilling the requirements of a use case will be a combination of functionality in the cdx server and the replay engine.
Even if the API is low level, I think it's important that users get predictable results. For example, I think it's a bad idea if a function returns different results depending on whether you are using paging or not. Another example would be if a function returns different results for a configuration with one cdx-file than for a configuration with multiple cdx-files. I have a question about match type: I can see the justification for
One comment on how low level the cdx server api is. I agree that it is too low level for most end users, but there is built-in logic assuming certain use cases which, in my opinion, doesn't qualify it as very low level. For example, given the query parameters, filtering is done before collapsing (which makes sense). For a real low level api we should implement a kind of query language, but I think that would be overcomplicating things. I think it is better to try to document real world use cases and create enough functionality to support them. It is important though to document those built-in assumptions. Makes sense?
Yes, a full query language may be a bit too much at this point. If collapsing is the main issue, perhaps it can be taken out, except in a very specific use case. For example, I believe the collapseTime option exists mainly as an optimization. As a reference, in the pywb cdx server API, I have implemented CDX features only as they became needed (https://github.com/ikreymer/pywb/wiki/CDX-Server-API), resulting in a smaller subset of the current openwayback cdx server API. Collapsing has not really been needed, and it is something that I think may best be done on the client. For example, a calendar display may offer dynamic grouping as needed as the user switches between zoom levels, without having to make a server-side request each time. Also, taking a look at http://iipc.github.io/openwayback/api/cdxserver-api.html, I would strongly advise against going in this direction. The proposed split into separate endpoints would make things more complicated. I think that most of the CDX server features can be combined (collapsing perhaps being an exception), and when they can't, that can be documented.
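For reference, collapsing in the current API is driven by the collapse parameter: adjacent result lines that match on the given field (or on a prefix of it) are merged into one. A minimal example against IA's public endpoint, with an illustrative url:

```
# at most one capture per day: collapse on the first 8 digits (yyyyMMdd) of the timestamp
curl "http://web.archive.org/cdx/search/cdx?url=example.com&collapse=timestamp:8"
```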
Thanks for the comments @ikreymer
If it is just an optimization not altering the results compared to the general collapse function, then I think such optimizations should be done by analyzing the query and not by adding more parameters.
That's true if by 'client' we understand the browser. If the client is OpenWayback, then I think a new roundtrip to the Cdx Server is better, to avoid keeping state in OWB.
I agree that the names 'search' and 'query' are not good names (suggestions welcome), but I do think the split makes sense. The query path is meant to support the Memento TimeGate use case, and the search path supports the Memento TimeMap, as described in your first posting on this issue.
No, they are not the same.
The optimization does alter the results; I believe it allows for skipping secondary-index zipnum blocks when the data is too dense (e.g. 3000+ captures of the same url within one minute/hour/day, etc.)
Well, you could use timegate and timemap, but the cdx server does a lot more than memento, so that may only add to confusion. The difference between these is 'list all urls' and 'list all urls sorted by date closest to x', so in this case the distinction is really just the sort. Even if the names were different, switching between two endpoints for such closely related queries seems unnecessary. The separate bulk endpoint might make sense, if there is a good reason that the existing pagination api can't cover it.
That is a very dubious reason for such a big change. Modern load-balancers can spread the load automatically amongst worker machines; this is like arguing for going back to the era of dedicating separate hosts to separate query types. Also, when is a query considered 'bulk'? Suppose a user wants to get all results for a single url: there could be 1 result or there could be, say, 50,000. Should they use the bulk query or the 'non-bulk' query? In the current system, the same query handles both. Again, I would suggest ways to improve the current API rather than creating distinct (and often conflicting) endpoints that only add more user decisions (bulk vs non-bulk, search vs query, etc.) and do not actually add any new features.
I'm not sure if I understood this correctly, but I wondered whether the result is altered compared to using ordinary collapse, or if collapsetime is just an optimisation for collapse=time. In the latter case I think the server should be smart enough to optimize just by looking at what fields you are collapsing on.
Looking at http://iipc.github.io/openwayback/api/cdxserver-api.html, I realized that it might not be obvious that the methods (the blue get boxes) are clickable to get a detailed description of the properties. I added a label above the column to make it clearer. I also updated the descriptions somewhat. The reason I mention this is that there are other differences than just the sort, which hopefully should be clear from the documentation and not need to be repeated here. If that is not the case, then feedback is valuable to enable me to enhance it.
Well, I might have been a little too fast when stating that as the primary reason; it should be more like a possibility. Another possible use could be to hand bulk requests over to dedicated servers for that purpose. Anyway, the primary reason was to get closer to the guidelines for REST. REST advocates the use of absolute urls for references. That leaves less work on the client and eases the evolution of the api. Even though the Cdx Server is not restful, I think following the guidelines where possible is reasonable.
The bulk api is not meant to be used by OpenWayback, but by processing tools, and yes, it definitely supports paging, which the other apis in the proposal don't. To avoid discussing everything in one issue, I created a separate issue for the bulk api (#309) where I try to describe it in more detail.
In response to the CDX Server API, I think I understand the motives behind having both a search and a get endpoint. Also, if I understand, you are saying get is essentially just a special case of search?
Not having Regarding
Many RESTful systems use pagination. If the next and previous pages are discoverable from a page, then it does not violate REST. Previous and next links can be given in HTTP Link headers and/or in the body of the response (which would look relatively normal in a JSON response); the total number of results could also be returned in the headers or the response body. What I see currently in the proposal for /{collection}/bulk looks similar to the sort of request you would have if you just used pagination on the regular search endpoint.
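A sketch of what such discoverable pagination could look like (hypothetical host, endpoint and header values; the Link relations follow RFC 5988, and X-Total-Results is an invented header name):

```
curl -i "http://cdxserver.example.org/mycoll/search?url=example.com&page=2"

HTTP/1.1 200 OK
Link: <http://cdxserver.example.org/mycoll/search?url=example.com&page=3>; rel="next",
      <http://cdxserver.example.org/mycoll/search?url=example.com&page=1>; rel="prev"
X-Total-Results: 464023
```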
I agree with everything @ldko stated. Most of the options can interact with all the other options, including pagination. My point was that if the API is split into separate endpoints, the user has to decide up front which one applies. There can definitely be improvements to the pagination api. An example of it is seen here: for IA, the query http://web.archive.org/cdx/search/cdx?url=*.com&showNumPages=true returns the number of pages for the whole query. A lot of thought went into making the original API as it is, and there are already a few tools that work with the current api (for bulk querying), so I think any significant changes should have a clear positive benefit.
Response to @ldko
No, /get is for following a single resource. /search is for calendar pages.
In the current implementation you get the number of pages by adding the showNumPages parameter, which does not give any discoverable uris. Instead you need to know how to build the query and what combinations are meaningful. You are right that it doesn't violate REST if a header leads you to the next page, but then processing in parallel is not possible because you need to request the pages in sequence.
Yes, the Cdx Server is not RESTful, and will probably never be, because there is no way to reliably address resources. On the other hand, I think using concepts from REST where possible could ease the usage of the api. /get is as close as we get to addressing a resource. The main difference between /search and /get is that the latter needs to sort results closest to a certain timestamp. That causes a lot of work on the server for big result sets. Because of that, I have tried to design it to avoid working with such big result sets as much as possible.

Response to @ikreymer
That is true if the current way of implementing paged queries is to be used. As stated in #309, it is up to the server to choose how many batches to split the result into. My intention was to not return a bigger set of batches than is needed for a good distribution in a processing framework. That's why each batch might have resumption keys, since each batch might be quite big.
I do not question that, and I really like the modularity the Cdx Server brings to OpenWayback. To me it looks like the api was thought through, but also has evolved over time. Some functions seem to have been added to solve a particular requirement without altering existing functions. This is of course reasonable to avoid breaking backwards compatibility. But since we now want to change the status of the Cdx Server from an optional part of deployment to being the default, we also want to remove the beta label. That is a good time to look into the api once more to see if it still serves the requirements and can also meet the requirements of the foreseeable future.
Since this discussion has turned out to be about the justification for one or many entry points, I think I should recap and try to explain why I came up with the suggestion of more than one entry point. By just looking at the current api, I can understand why there are some objections. Using different entry points is not a goal in itself; it just felt like a natural way of solving some issues I came across while playing with the api and reading the code. I'll start with some examples from the current api, first following up on @ikreymer's example. I tried different sortings:
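The two queries were presumably along these lines (reconstructed from the page loop further down):

```
curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*"
curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&sort=closest&closest=20130101000000"
```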
Since none of these queries returned a resume key, I expected the results to be complete and to differ only in order. That was not the case: the first query returned 464023 captures while the second returned 423210. So maybe the resume key isn't implemented in IA's deployment. But using paging should give me the complete list. I start by getting the number of pages:
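Presumably with showNumPages, along these lines (a reconstruction of the same two queries):

```
curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&showNumPages=true"
curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&sort=closest&closest=20130101000000&showNumPages=true"
```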
Both queries return 36 pages, which is fine. Then I created a loop to get all the pages like this:

```
for i in `seq 0 36`; do
  curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&page=${i}" > cdxh-asc-p${i}.txt
  curl "http://web.archive.org/cdx/search/cdx?url=hydro.com/*&sort=closest&closest=20130101000000&page=${i}" > cdxh-closest-p${i}.txt
done
```

Even though none of the queries were executed in parallel, I got several failures caused by stressing the server too much. Of course I don't know what other queries were being executed on the server, but it seems like it already had too much to do. I concatenated the page queries and expected the result from the non-paged queries to be the same as that from the concatenated, paged queries. At least I only expected differences at the end of the result, if the non-paged result was cut off by a server limit. Unfortunately that was not the case. Even though the paged queries returned more results (as was expected if the non-paged query was cut off), they also missed a lot of captures which were present in the non-paged result. I also noticed that there were duplicates in the responses for the normal sort (both paged and non-paged) and the result wasn't sorted in ascending order, which could be seen from the following table (numbers are the line count in each result):
Documentation says that paged results might not be as up to date as the non-paged ones. That can explain the differences. The duplicates could perhaps be due to errors in the creation of the cdxes. Even though I'm not happy, I accept that for now. So I looked at the results for the closest sort:
The duplicates are gone 👍, but I expected the results to be the same size as the deduplicated non-sorted results, which was not the case. I have played around with other parameters as well, and there are lots of examples where the results are not what I would expect.

Conclusion

My first thought was to "fix the bugs". I started to guess what the right outcome for the different parameters should be. That was not easy, and it is the reason I started writing the use-cases document. I found the api itself to be illogical and redundant. For example, there are no fewer than three different ways to achieve some form of paging or reducing the number of captures in each response: the limit parameter combined with a resume key, the paging api, and collapsing. Even with those things put aside, I found the earlier mentioned bugs hard or impossible to fix without causing too much work on the server. For example, to get what I would expect from the paged queries with sort=closest, the server would have to sort the complete result set before splitting it into pages.

Since the Cdx Server until now has been an optional part of an OpenWayback deployment and the api is labeled beta in the documentation, I thought this was the time if we ever should rethink it from the ground up. With that in mind I made a list for myself of things I would like to achieve with a redesign:

1. Avoid redundant functionality
2. Disallow combinations of functionality which cause too much work on the server
3. Avoid letting parameters alter the format of the response
4. Try to implement the same set of functionality independently of the backing store
5. Avoid hidden, server side limits which alter the result in any way other than returning partial results
6. Let it be possible to use the Cdx Server as a general frontend to cdx-data
7. Allow for distributed processing and aggregation of results
8. Avoid keeping state on the server
9. Illegal queries should be impossible
10. Testability
11. Maintainability
12. Use the HTTP protocol where possible

The above points are guidelines and maybe not possible to achieve in their entirety, but I think it is possible to come pretty close. Based on this I ended up proposing three different paths in the api. To be more RESTful we could have one entry point with references to the other paths. The path names could definitely be better and are certainly not finalized.

/search is for all functions which need to process cdx-data sequentially. It is assumed that all backing-stores can do that pretty efficiently, either because the cdx-data is presorted, or because there exist sorted indexes on keys if the backing-store is a database. This allows for the richest set of parameters, but does not allow sorting, which might require a full table scan, to use db-terminology.

/get is for getting the closest match for a url/timestamp pair. This implies sorting, which might be costly. The set of functionality is therefore somewhat limited. Ideally it should only return one result, but due to the nature of web archives it might be necessary to return a few results close to the best one. This path will not allow scanning through a lot of results, so no resume key is used. It is not allowed to use urls with wildcards, both because of the potentially huge number of responses and because it is not well defined what a closest match is to a fuzzy query.

/bulk is primarily to support number 6 above. I can't see any uses for this in a replay engine like OpenWayback, but it might be really useful as a standardized api for browsing and processing cdx-data. One requirement is to allow parallel execution of batches. That makes it impossible to support sorting and comparing of captures without putting too much work on the server. The server should split the response into enough batches to support a reasonably big map-reduce environment, but each batch could be further limited into parts (which are requested in sequence for each batch) to overcome network limitations.
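To make the three proposed paths concrete, they might be exercised like this (hypothetical host and parameters; the path names are, as said, not finalized):

```
# /search: sequential scans, rich filtering, resume keys; no closest-sorting
curl "http://cdxserver.example.org/mycoll/search?url=example.com&matchType=domain&from=2013&to=2014"

# /get: closest match for a url/timestamp pair; no wildcards, no resume key
curl "http://cdxserver.example.org/mycoll/get?url=http://example.com/&timestamp=20130101000000"

# /bulk: parallel batches for processing frameworks; no collapse or sort
curl "http://cdxserver.example.org/mycoll/bulk?url=example.com&matchType=domain"
```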
Quick comment on https://iipc.github.io/openwayback/api/cdxserver-api.html. Not sure if that's meant to be normative or what, but imho it was an unfortunate choice for wayback to drop the trailing comma in the host part. In heritrix the surt keeps the trailing comma (see the comparison below). (It might be too late to change this. I think Ilya tried with pywb but it gave him too many headaches. It really bugs me though.)
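The difference between the two forms, on an illustrative url:

```
# Heritrix SURT: trailing comma after the host part
http://(com,example,)/path
# Wayback/CDX-server url key: trailing comma dropped
com,example)/path
```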
@nlevitt Yeah, after some consideration, the change did not seem worth the extra effort, as we would have to support both with and without the comma. My reasoning was that, between the two forms, the practical difference was too small to justify the migration.
@johnerikhalse Thanks for the detailed analysis and testing of pagination, and for explaining the endpoints again. I understand your point but still disagree, especially with regards to the separate endpoints. The cdx api is a low level, non-end-user api, so I think it's best to show what is actually going on. I like the idea of returning a
I think you're right in that the sort option, which sorts by timestamp, only really makes sense for exact matches. It can be costly with a large result set, but can be quite fast when the result set is small. I think that addresses the sorting issues. Now, with pagination, unfortunately, it's a bit complicated. In addition to requiring zipnum, the pagination query can only be run on a single zipnum cluster at a time, or with multiple clusters if the split points are identical. IA uses several zipnum clusters and the results are merged, so the paged results may differ from the non-paged ones. In an ideal world, the pagination API would support both regular cdxs and multiple zipnum clusters; however, this is a hard problem and (to my knowledge) no one is working on a solution to this. My recommendation would be to keep the pagination API in beta and put any cycles into solving the hard problem of supporting bulk querying across multiple cdxs and multiple zipnum clusters, if this is a priority. While I can see the value of a separate bulk endpoint, the above suggested restriction on
No, it's not meant to be normative. I wasn't aware of the differences from Heritrix. Anyway, the Cdx Server only serves what's in the cdx files, so this is probably more of a cdx-format concern than a Cdx Server one. Preferably the normalization of urls should be the same at harvest time as at search time, so I totally agree with your concern. I don't know what the problem with changing this is. Is it hard to implement, or is it just that you need to regenerate the cdxes?
I don't think whether the api is low level or end user is important. What is important is whether the api is public or not. By public I mean that it is exposed to other uses than internal ones in OpenWayback. You mention that other tools are using the api, which in my opinion makes it public. IMHO public apis have most of the same requirements as end user apis, even if they are meant for expert use. That includes clear definitions of both input and output, but also taking care not to unnecessarily break backwards compatibility, which seems to be your main concern. But if we are going to break compatibility, I think this is the right time.
Sure, but that is high level. A service like that should use the Cdx Server underneath.
I'm aware of this and think this alone justifies a separate endpoint. API consumers would expect a paged query to return the same results as the corresponding non-paged query. The reason I suggest dropping the current paging api is exactly that kind of inconsistency.
So, I would like to discuss another point here, which is more related to the response serialization than to the lookup process and filtering. A CDX index in general has (and should have) keys for lookup/filtering, data to locate the content, and some other metadata that may be useful for tools. However, the method of locating data might differ depending on the data store, for example local warc files, data stored in isolated files (such as downloaded by wget [without the warc flag]), stored in cloud storage services, and whatnot. Keeping this in mind, I think bringing some uniformity to how the data location is described should be considered. The CDX file format had pre-defined fields, but as we are moving away from that and adopting a more flexible serialization format such as CDXJ, we can play with our options. I would propose that CDX fields that have no purpose other than locating the data in combination, such as the file name, offset, and length, should be consolidated into a URI scheme such as:
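Presumably something like the following (hypothetical; a 'file' URI consolidating filename, offset and length, with the byte range carried in the fragment):

```
file:///data/warcs/crawl-20160101.warc.gz#offset=1234,length=5678
```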
Note: "file" URI scheme does not talk about the fragment though. If needed URN can be used instead to offload the resolution of the location to a different level without polluting the standards. A URN can also be useful when the actual location is determined with the help of a path-index file. This type of URI scheme based identification of the content will allow us to merge data from various sources in a uniform way in a single CDX response. Additionally, the same data might be available in different locations (this is not something that was considered before I guess), so the CDX server should provide a way to list more than one places where the data can be found. The client may choose which place it wants to grab the data from or fallback to other resources if the primary one is not available. To put this together, we can introduce a key in CDXJ named
@ibnesayeed While this may seem like an interesting idea at first glance, I don't think it provides much of a practical benefit. A regular sort-merge will merge the same cdx lines together; there is no need for a custom merge just for the filename field. E.g., if querying multiple archives, one could get:
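Presumably identical lines differing only in the filename field, along these lines (illustrative 11-field CDX values):

```
com,example)/ 20160101000000 http://example.com/ text/html 200 EXAMPLEDIGEST - - 2156 34567 archive-a/crawl-1.warc.gz
com,example)/ 20160101000000 http://example.com/ text/html 200 EXAMPLEDIGEST - - 2156 48120 archive-b/crawl-7.warc.gz
```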
This allows for easily determining duplicate copies of the same url from different sources. If the intent is to load the best resource, the sources will be tried in succession (though the filenames may be in arbitrary order after the merge) until one succeeds, with fallback to the next resource, and the next-best match, etc. No special case is needed here. Also, traditionally, a separate data source (a path index or prefix list) has been used to resolve a WARC filename to an absolute path internal to a single archive. This has many benefits, including support for changing WARC server locations, keeping the cdx small by avoiding absolute paths, and avoiding exposing private/internal paths. For example, an archive could be configured with 3 prefixes (see the sketch below), and an API responsible for doing the loading would be configured with these internal paths and would then check each prefix in turn. There is little benefit in baking these paths into an index response, as that would add extraneous data and potentially expose private paths that aren't accessible anyway.
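A sketch of the prefix-list idea (hypothetical paths; the loader tries each prefix in order until the WARC file is found):

```
# path prefixes, tried in order for a filename like crawl-1.warc.gz
/local/warc-storage/
http://internal-warc-server.example.org/warcs/
http://backup-mirror.example.org/warcs/
```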
In general, I think that this approach is probably trying to solve the following problem: the need to query an index, and the need to load a single resource from the index. The current cdx server only addresses the former. I would like to suggest the following solution, consisting of just two API endpoints that are almost identical.
The resource API is currently outside the scope of the CDX server as it's currently defined, but I think it's relevant to think of the two together. For example, if a client wants to examine the list of the 5 best matches for http://example.com/ at 20160101000000, the query would be:
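Presumably along these lines (hypothetical endpoint name and host):

```
curl "http://cdxserver.example.org/mycoll/index?url=http://example.com/&closest=20160101000000&limit=5"
```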
but if the user wants to load the first available resource of the 5 best matches, the query would be:
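Again a sketch, with only the endpoint changed:

```
curl "http://cdxserver.example.org/mycoll/resource?url=http://example.com/&closest=20160101000000&limit=5"
```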
There is only one API to learn, and the end-user has full flexibility about what sort of data they want to get in a single call. Query features, such as filtering, are also available for both endpoints, allowing for more sophisticated retrieval options. Of course, the
"4. Get the best match for embedded resources" seems to be a scenario with multiple URIs and one shared timestamp, presumably the timestamp of the page referencing the resources. As I understand the current API, that requires 1 lookup/resource. That means 1 request-parse-serialize-response overhead per resource. If the API allowed looking up multiple URIs with the same as-near-to-this-as-possible-timestamp, that overhead could be reduced considerably. |
My colleague @thomasegense is experimenting with a light page-render for WARC-based web archives. Instead of doing |
Hi, I wanted to give feedback on the CDX Server requirements wiki page.
There's not really a good way to comment on the page though, so rather than just editing the wiki page, I thought it'd be easier to start a conversation as an issue. Feedback follows as comments.
I think that's a great idea, especially as this API can be shared across multiple implementations, not just OpenWayback.
The intent was to keep it separate (and there is support for different output formats, e.g. JSON lines). The zipnum cluster does provide extra APIs, such as pagination, but that is mostly because pagination is otherwise technically difficult without a secondary index; nothing ties it to the zipnum cluster implementation in particular. The 'secondary index' is presented as a separate concept and perhaps could be abstracted out further.
Sure, the wildcard query was added as a 'shortcut' in place of the matchType query, 'syntactic sugar', but if people feel strongly about removing one or the other, I don't think it's a big deal.
The CDX Server API was not just designed for GUI access in OpenWayback, but a more general API for querying web archives. The interactions from a GUI in OpenWayback should be thought of as a subset of the functionality that the API provides. Everything that was in the API had a specific use case at one point or another.
As a starting point, the CDX Server API provides two APIs that are defined by Memento: the TimeGate (closest match) lookup and the TimeMap (full capture list) query.
The closest match functionality is designed to provide an easy way to get the next closest fallback if replay of the first memento fails, allowing for trying the next best, and so forth.
Another use case was better support for the prefix query, where the result is a list of unique urls per prefix, followed by the starting date, end date and count. The query can then be continued to get more results from where the previous query ended.
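A rough illustration using the current API's documented parameters (the url is illustrative, and the exact output fields for this use case are an assumption):

```
# unique urls under a prefix, collapsed on the canonicalized url key,
# with a resume key for continuing the query later
curl "http://web.archive.org/cdx/search/cdx?url=example.com/&matchType=prefix&collapse=urlkey&showResumeKey=true"
```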
Another important use case is parallel bulk querying, which can be used for data extraction.
For example, a user may wish to extract all captures by host, prefix, or domain across a very large archive. The user can create a MapReduce job to query the CDX server in parallel, where each map task sets the page value. (Implementations of this use case already exist in several forms.)
The difference between the bulk query and the regular prefix query, is that the pagination api allows you to query a large dataset in parallel, instead of continuing from where the previous query left off.
But this requires pagination support, which requires the zipnum cluster; it would in theory be possible to support without it (it just requires a lot more work to sample the cdx to determine the page distribution).
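A minimal shell version of the parallel pattern (the parameters are the documented ones; the url and the concurrency level are illustrative):

```
# discover the number of pages, then fetch them concurrently
pages=$(curl -s "http://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&showNumPages=true")
seq 0 $((pages - 1)) | xargs -P 8 -I{} \
  curl -s "http://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&page={}" -o "page-{}.cdx"
```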
Another use case was resolving revisit records, if the original was the same url, in a single pass, to avoid having to do a second lookup. This is done by appending the original record as extra fields.
This may not be as useful if most deduplication is 'url agnostic'.
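For reference, this is exposed via the resolveRevisits parameter in the current server (illustrative invocation; the host is hypothetical):

```
# revisit rows get the original record's fields appended in the same pass
curl "http://cdxserver.example.org/mycoll/cdx?url=example.com/page.html&resolveRevisits=true"
```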
This is more of a replay system option than a cdx query option.
What happens if the exact url doesn't exist? There is no way to guarantee an exact match just by url and timestamp; you would also need the digest. You can filter by url, timestamp and digest with the cdx server, but not with a replay (archival url) format.
I think this is not at all the same as above, but closest capture/timegate behavior. An option could be added to remove closest match and only do exact match, but again, this is a replay system option, not a cdx server option.
It seems that these all fall under the 'closest match' / Memento TimeGate use case
Yep, this is the Memento TimeMap use case.
These are all different examples of the prefix query use case.
This is already possible with the timemap query, right?
But, could also add an "only after" or "only before" query, to support navigating in one direction explicitly.
This seems more like a replay api, as cdx server is not aware of embeds or relationships between different urls.