Requirements
These requirements were gathered from different Invenio-based services, namely:
- EUDAT B2SHARE
- CERN Document Server
- Zenodo
- Archivematica/SIP Store integration
- CERN Open Data
- CERN Analysis Preservation
These requirements cover all the statistics-related features that were mentioned. The goal is not to implement everything listed here, but rather to keep a record of all this input so that different teams can contribute to this project without losing any past knowledge.
This document will evolve with time as more input comes in.
Some features will be marked as "out of scope", which means that they will not be implemented in this module. They might however be implemented in another module.
Features should be prioritized as well as possible. For each feature, the services expecting it should be listed with their deadlines. When no deadline is given, the priority is "nice to have".
Aggregated statistics such as counters, "Top N" lists and maps. These high-level statistics need to be computed from low-level events. This processing can be done either at run time or with partial preprocessing.
Examples:
Dashboard | Statistic | CDS | B2SHARE | Zenodo | CAP | COD |
---|---|---|---|---|---|---|
Record | # page views of a record page | ASAP | ASAP | ASAP | High | High |
Record | # downloads of a record (counting all file downloads as 1 download) | Nice | | | | |
Record | # downloads of a record's metadata for each output format (MARC21, DC...) | Medium | | | | |
File | # downloads per file in the record | ASAP | ASAP | Medium | High | High |
Collection | # new records for the entire collection | Medium | | | | |
Collection | # new submissions for the entire collection | Medium | | | | |
Collection | Usage of particular features (e.g. # of comments, # of alerts, usage of recommendations, etc.) | Medium | | | | |
Collection | # record views for the entire collection | Medium | Medium | Medium | | |
Collection | # file downloads for the entire collection | Medium | Medium | Medium | | |
Collection | Top uploaders in collection | ASAP | | | | |
Collection | Open access vs closed access | ASAP | | | | |
Collection | World map showing where users are coming from | Medium | | | | |
User | # record views for all the user's records | Medium | | | | |
Circulation | # renewals, loans, overdue... | Medium | | | | |
UI | # views of every page | High | | | | |
Database | # different fields in all bibliographic records (detecting deprecated fields) | Nice | | | | |
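To illustrate how such high-level statistics could be derived from low-level events, here is a minimal Python sketch that computes per-record views, per-file downloads, record-level downloads and a "Top N" list from an in-memory list of events. The event fields (`type`, `record_id`, `file_key`, `visitor`) are assumptions made for the example, as is the reading of "counting all file downloads as 1 download" as one download per record and visitor. None of this is a format defined by this module.

```python
from collections import Counter

# Hypothetical low-level events; the field names are assumptions for this sketch.
EVENTS = [
    {"type": "record-view", "record_id": "1", "visitor": "a"},
    {"type": "record-view", "record_id": "1", "visitor": "b"},
    {"type": "record-view", "record_id": "2", "visitor": "a"},
    {"type": "file-download", "record_id": "1", "file_key": "data.csv", "visitor": "a"},
    {"type": "file-download", "record_id": "1", "file_key": "paper.pdf", "visitor": "a"},
    {"type": "file-download", "record_id": "2", "file_key": "paper.pdf", "visitor": "b"},
]


def record_views(events):
    """Count page views per record."""
    return Counter(e["record_id"] for e in events if e["type"] == "record-view")


def file_downloads(events):
    """Count downloads per (record, file) pair."""
    return Counter(
        (e["record_id"], e["file_key"])
        for e in events
        if e["type"] == "file-download"
    )


def record_downloads(events):
    """Count record-level downloads.

    All files fetched from one record by the same visitor count as a single
    download (an assumed interpretation of the requirement).
    """
    pairs = {
        (e["record_id"], e["visitor"])
        for e in events
        if e["type"] == "file-download"
    }
    return Counter(record_id for record_id, _ in pairs)


def top_n(counter, n):
    """Return the n most frequent keys with their counts."""
    return counter.most_common(n)


if __name__ == "__main__":
    print(record_views(EVENTS))
    print(file_downloads(EVENTS))
    print(record_downloads(EVENTS))
    print(top_n(record_views(EVENTS), 1))
```

Computing everything at query time like this is the "run time" option; the same functions could equally be run periodically to precompute the counters.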
SIP store/Archivematica specific
Dashboard | Statistic | Archivematica integration/SIP Store |
---|---|---|
SIP STORE | #/% of SIP packages created, globally or within a ‘collection’ | Low |
SIP STORE | #/% of records/files sent to Archivematica store | Low |
SIP STORE | #/% of records with failures reported by Archivematica | Low |
SIP STORE | Average delay for the processing of single SIP | Low |
SIP STORE | Total used size of the AM store for one Invenio instance / multiple collections; a history of these numbers would be interesting to preserve in order to see the evolution of the Archivematica store. | Low |
Note: Collection ~= community
Note: CAP and COD are still based on Invenio 2, so statistics are a nice to have. For them, ASAP has been replaced with High.
Some services would need a history of the statistic. Example: # file downloads per day/month over the past year, possibly since the beginning.
It should be possible to query and filter this history so that only a given range is displayed.
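A history of this kind could be sketched as follows. The example builds a per-day download history from timestamped events, restricts it to a date range and rolls it up per month. The `timestamp` field and the function names are hypothetical and only serve the illustration.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical download events carrying ISO 8601 timestamps.
DOWNLOAD_EVENTS = [
    {"timestamp": "2017-05-01T10:00:00"},
    {"timestamp": "2017-05-01T18:30:00"},
    {"timestamp": "2017-05-02T09:15:00"},
    {"timestamp": "2017-06-10T12:00:00"},
]


def downloads_per_day(events):
    """Build a {date: count} history from the events."""
    return Counter(datetime.fromisoformat(e["timestamp"]).date() for e in events)


def filter_range(history, start, end):
    """Keep only the days in [start, end], sorted chronologically."""
    return [(day, n) for day, n in sorted(history.items()) if start <= day <= end]


def per_month(history):
    """Roll a daily history up into {(year, month): count}."""
    monthly = Counter()
    for day, n in history.items():
        monthly[(day.year, day.month)] += n
    return monthly


if __name__ == "__main__":
    history = downloads_per_day(DOWNLOAD_EVENTS)
    print(filter_range(history, date(2017, 5, 1), date(2017, 5, 31)))
    print(per_month(history))
```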
Comments by | Comment |
---|---|
Archivematica | The first three SIP Store statistics can be derived directly from the Archivematica-Invenio table. Delay and total size would probably need more logging. |
For most services, new statistics would be added from time to time, but this won't happen very often. The format of the statistics and event should change very rarely.
For CERN Open Data and CERN Analysis Preservation, new statistics could be added at any time, but for now Kibana is good enough. The goal would be to answer requests on service usage from different administrations.
Creating a dedicated UI presenting fine grained events is OUT OF SCOPE. The querying could be done by another module or directly via Kibana.
The goal is to help administrators investigate low level events.
Example of data: One record view event with information about the user who performed it, the time and date, geoip, etc...
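As an illustration only, such an event could be represented as shown below. All field names are hypothetical; the actual event schema is not defined in this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RecordViewEvent:
    """One low-level "record view" event (all field names are hypothetical)."""

    record_id: str
    timestamp: str  # ISO 8601, e.g. "2017-05-01T10:00:00"
    user_id: Optional[str]  # None for anonymous visitors
    ip_country: Optional[str]  # country code resolved via GeoIP, if any
    user_agent: str


def to_document(event):
    """Serialize the event as a JSON document ready for indexing."""
    return json.dumps(asdict(event), sort_keys=True)


SAMPLE_EVENT = RecordViewEvent(
    record_id="123",
    timestamp="2017-05-01T10:00:00",
    user_id=None,
    ip_country="CH",
    user_agent="Mozilla/5.0",
)

if __name__ == "__main__":
    print(to_document(SAMPLE_EVENT))
```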
The current version of the CDS server (based on Invenio 1) pushes all events into a "statistics/log" Elasticsearch server. These events can later be investigated. The resulting cluster holds about 750GB of data.
Priority | Medium |
---|---|
Zenodo mentioned a need for an audit log enabling the following kind of queries:
- List events that a user has performed.
- List events that are performed on an organisation.
Priority | OUT OF SCOPE |
---|---|
Use the statistics to change the ranking of records.
Needed by | Zenodo |
---|---|
Priority | Low |
Register data which could be used by a module correcting invalid records (see invenio-checkers).
This is OUT OF SCOPE. Of course the registered events could be used by another module.
Needed by | CAP (Nice to have) |
---|---|
Priority | OUT OF SCOPE |
Send alerts (e.g. emails) when a statistic changes in a predefined way.
Needed by | CAP (nice to have) |
---|---|
Priority | LOW |
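One possible reading of "changes in a predefined way" is a relative-change threshold. The sketch below assumes that interpretation; the function name and the threshold rule are hypothetical, and delivering the alert (e.g. by email) is left out.

```python
def check_alert(name, previous, current, max_relative_change):
    """Return an alert message when a statistic moved more than allowed.

    The rule (relative change above a threshold) is an assumption made for
    this sketch; other predefined conditions are equally possible.
    """
    if previous == 0:
        # Any move away from zero is treated as a change worth reporting.
        changed = current != 0
    else:
        changed = abs(current - previous) / previous > max_relative_change
    if not changed:
        return None
    return f"{name} changed from {previous} to {current}"


if __name__ == "__main__":
    print(check_alert("record views", 100, 160, 0.5))
    print(check_alert("record views", 100, 120, 0.5))
```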
Possibility to define what is a "page view"
Needed by | COD |
---|---|
Priority | Low |
Widgets showing multiple statistics from the point of view of a Record, a Community, a User... These widgets would be placed on a dedicated page or added to existing pages such as the "Community page" or the "Record page".
Needed by | CDS, Zenodo, Archivematica integration |
---|---|
Priority | Blocking. |
AJAX queries to the REST API returning statistics. This would make statistics filtering more dynamic from the UI point of view.
Needed by | Zenodo, B2SHARE and CDS |
---|---|
Priority | Important but not blocking |
Statistics could be added when returning resources on existing endpoints. Example:
    $ curl -XGET /api/records/123
    {
      "metadata": {...},
      "links": {...},
      "stats": {
        "views": 42,
        "downloads": 10
      }
    }
As a first approach, this would be done by custom B2SHARE code and is thus not part of this module.
Priority | OUT OF SCOPE |
---|---|
Statistics would be accessed via Kibana. This is only for administrators.
For CAP and COD, an admin or curator will access the statistics and then send the data to whoever asked for them.
Needed by | CDS, CAP, COD |
---|---|
Priority | Nothing needs to be done as long as the statistics are in Elasticsearch |
Access to some statistics could be restricted to some users.
CAP and COD: each curator has access to their community's stats.
Needed by | CDS, CAP, COD |
---|---|
Priority | Blocking for CDS, CAP and COD |
Comments | Zenodo and B2SHARE don't need access control for now. Every statistic is public. |
Comments | Warning from Tibor: openly publishing the statistics encourages users to do SEO. |
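A per-community restriction like the one described for CAP and COD could look roughly like the sketch below. The user representation (`is_admin`, `curated_communities`) and the notion of a set of public communities are assumptions for the example, not part of any existing Invenio API.

```python
def can_view_stats(user, community_id, public_communities=frozenset()):
    """Decide whether a user may see the statistics of a community.

    `user` is None for anonymous visitors, otherwise a dict with the
    hypothetical keys "is_admin" and "curated_communities".
    """
    if community_id in public_communities:
        # Services where every statistic is public skip the checks below.
        return True
    if user is None:
        return False
    if user.get("is_admin", False):
        return True
    return community_id in user.get("curated_communities", ())


if __name__ == "__main__":
    curator = {"is_admin": False, "curated_communities": ["physics"]}
    print(can_view_stats(curator, "physics"))
    print(can_view_stats(curator, "biology"))
```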
Automatic removal of old events or statistics.
Not needed for CDS, CAP and COD: old data is sent to backup storage.
Needed by | B2SHARE, Zenodo |
---|---|
Priority | Low. The data can be deleted manually in the meantime. |
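A minimal sketch of such a retention rule is shown below, assuming events carry an ISO 8601 `timestamp` field and that the retention period is expressed in days. Both are assumptions for the example; how expired events are actually deleted from storage is left out.

```python
from datetime import datetime, timedelta


def split_expired(events, now, max_age_days):
    """Split events into (kept, expired) according to their age."""
    cutoff = now - timedelta(days=max_age_days)
    kept, expired = [], []
    for event in events:
        when = datetime.fromisoformat(event["timestamp"])
        (expired if when < cutoff else kept).append(event)
    return kept, expired


if __name__ == "__main__":
    stored = [
        {"timestamp": "2016-01-01T00:00:00"},
        {"timestamp": "2017-05-20T00:00:00"},
    ]
    kept, expired = split_expired(stored, datetime(2017, 6, 1), 365)
    print(len(kept), len(expired))
```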
Automatic aggregation of low level statistics and events into higher level of statistics.
Example: aggregate record view events into "record views per day" documents.
This would improve performance, since old events can be removed more easily once they have been aggregated.
Needed by | B2SHARE, Zenodo |
---|---|
Priority | Medium |
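The aggregation step described above could be sketched as follows: raw record view events are grouped by record and day into one small document each. The document shape (`record_id`, `date`, `count`) is an assumption for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events; the field names are assumptions for this sketch.
VIEW_EVENTS = [
    {"record_id": "1", "timestamp": "2017-05-01T10:00:00"},
    {"record_id": "1", "timestamp": "2017-05-01T11:00:00"},
    {"record_id": "2", "timestamp": "2017-05-01T12:00:00"},
    {"record_id": "1", "timestamp": "2017-05-02T08:00:00"},
]


def aggregate_views_per_day(events):
    """Turn raw view events into one document per (record, day)."""
    counts = Counter(
        (e["record_id"], datetime.fromisoformat(e["timestamp"]).date().isoformat())
        for e in events
    )
    return [
        {"record_id": record_id, "date": day, "count": n}
        for (record_id, day), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    for doc in aggregate_views_per_day(VIEW_EVENTS):
        print(doc)
```

Once these per-day documents are stored, the raw events they were computed from are no longer needed to answer "views per day" queries.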
Be compatible with COUNTER (see https://www.projectcounter.org/guides/)
Needed by | Zenodo |
---|---|
Priority | Low |
Needed by | Zenodo |
---|---|
Priority | Low |
By downtime we mean that the statistics are not available but Invenio is still up. This can happen for long processing or migrations.
For all services a long downtime seems acceptable.
Amount of data which will be stored:
CDS | Our current ES cluster holds 750GB of data / 2’414 Mil objects (including Apache logs). The statistics alone should amount to ~100GB of data. |
---|---|
Zenodo | Small - it’s in Piwik |
CAP & COD | <10GB |
Here is the list of deadlines per project:
CAP and COD are still on Invenio 2 so invenio-stats is only a nice to have.
What | CDS | Zenodo | B2SHARE | Archivematica | CAP | COD |
---|---|---|---|---|---|---|
invenio-stats alpha | July, a student working on Frontend for statistics for 10 weeks | End of June | ||||
invenio-stats integrated in service | Q3 | July | Only a nice to Have | Nice to have | Nice to Have |