Requirements

Nicolas Harraudeau edited this page Jun 13, 2017 · 11 revisions

Introduction

Requirement Gathering

These requirements were gathered from different Invenio-based services, i.e.:

  • EUDAT B2SHARE
  • CERN Document Server
  • Zenodo
  • Archivematica/SIP Store integration
  • CERN Open Data
  • CERN Analysis Preservation

Goal

These requirements cover all the statistics-related features that were mentioned. The goal is not to implement everything listed here, but rather to keep a record of all this input so that different teams can contribute to this project without losing any past knowledge.

This document will evolve with time as more input comes in.

Some features will be marked as "out of scope", which means that they will not be implemented in this module. They might however be implemented in another module.

Rules

Features should be prioritized as well as possible. For each feature, the services expecting it should be listed with their deadlines. When no deadline is given, the priority will be "nice to have".

Requirements

Required Statistics

Aggregated statistics such as counters, "Top N" lists and maps. High-level statistics need to be computed from low-level events. This processing can be done either at run time or with partial preprocessing.

Examples:

Dashboard Statistic CDS B2SHARE Zenodo CAP COD
Record # page views of a record page ASAP ASAP ASAP High High
Record # download of a record (counting all files download as 1 download) Nice
Record # download of a record's metadata for each output format (MARC21, DC...) Medium
File # downloads per file in the record. ASAP ASAP Medium High High
Collection # new records for the entire collection. Medium
Collection # new submissions for the entire collection. Medium
Collection Particular features usage (e.g. # of comments, # of alerts, usage of recommendations, etc.) Medium
Collection # record views for the entire collection Medium Medium Medium
Collection # file download for the entire collection Medium Medium Medium
Collection Top uploaders in collection ASAP
Collection Open access vs closed access ASAP
Collection World map with where users are coming from Medium
User # record views for all the users records Medium
Circulation # renewals, loans, overdue... Medium
UI #views of every page High
Database #different fields in all bibliographic records (detecting deprecated fields) Nice
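
As an illustration of how high-level statistics such as counters and "Top N" lists can be derived from low-level events, here is a minimal Python sketch. The event structure and field names (`type`, `record_id`, `file`, `collection`) and the helper `count_by` are assumptions made for this example; they are not a format defined by this module.

```python
from collections import Counter

# Hypothetical low-level events; the field names are illustrative assumptions.
events = [
    {"type": "record-view", "record_id": "1", "collection": "physics"},
    {"type": "record-view", "record_id": "2", "collection": "physics"},
    {"type": "record-view", "record_id": "1", "collection": "physics"},
    {"type": "file-download", "record_id": "1", "file": "data.csv",
     "collection": "physics"},
    {"type": "file-download", "record_id": "2", "file": "paper.pdf",
     "collection": "math"},
    {"type": "file-download", "record_id": "1", "file": "data.csv",
     "collection": "physics"},
]


def count_by(events, event_type, key):
    """Count events of one type, grouped by the value returned by `key`."""
    return Counter(key(e) for e in events if e["type"] == event_type)


# Record dashboard: number of page views per record.
views_per_record = count_by(events, "record-view", lambda e: e["record_id"])

# File dashboard: number of downloads per file in each record.
downloads_per_file = count_by(
    events, "file-download", lambda e: (e["record_id"], e["file"])
)

# Collection dashboard: number of file downloads for the entire collection.
downloads_per_collection = count_by(
    events, "file-download", lambda e: e["collection"]
)

# "Top N" list: the N most viewed records (here N = 1).
top_records = views_per_record.most_common(1)
```

Computing everything at run time like this is only practical for small event volumes; larger services would likely rely on partial preprocessing.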

SIP store/Archivematica specific

Dashboard Statistic Archivematica integration/SIP Store
SIP STORE #/% of SIP packages created, globally or within a ‘collection’ Low
SIP STORE #/% of records/files sent to Archivematica store Low
SIP STORE #/% of records with failures reported by Archivematica Low
SIP STORE Average delay for the processing of single SIP Low
SIP STORE Total used size of the AM store for one Invenio instance / multiple collections ; history of the line numbers interesting to preserve to see the evolution of the Archivematica store. Low

Note: Collection ~= community

Note: CAP and COD are still based on Invenio 2, so statistics are only a nice to have. For them ASAP has been replaced with High.

Some services would need a history of the statistic. Example: # file downloads per day/month, over the past year, possibly since the beginning.

It should be possible to query and filter this history so that only a given range is displayed.
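
A minimal sketch of such a history query, assuming the history is already available as one counter per day. The data and the helper names `history_in_range` and `per_month` are hypothetical.

```python
from datetime import date

# Hypothetical pre-aggregated history: file downloads per day.
history = {
    date(2017, 6, 1): 12,
    date(2017, 6, 2): 7,
    date(2017, 6, 3): 0,
    date(2017, 7, 1): 5,
}


def history_in_range(history, start, end):
    """Return (day, count) pairs with start <= day <= end, sorted by day."""
    return [
        (day, count)
        for day, count in sorted(history.items())
        if start <= day <= end
    ]


def per_month(history):
    """Roll the daily counts up into (year, month) totals."""
    totals = {}
    for day, count in history.items():
        key = (day.year, day.month)
        totals[key] = totals.get(key, 0) + count
    return totals


# Only the requested range is returned for display.
june_tail = history_in_range(history, date(2017, 6, 2), date(2017, 6, 30))
monthly = per_month(history)
```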

Comments by Comment
Archivematica The first three SIP Store statistics can be derived directly from the Archivematica-Invenio table. Delay and total size would probably need more logging.

New statistics and formats

For most services, new statistics would be added from time to time, but this won't happen very often. The format of the statistics and events should change very rarely.

For CERN Open Data and CERN Analysis Preservation new statistics could be added at any time but for now at least Kibana is good enough. The goal would be to answer requests on service usage from different administrations.

Use cases based on fine grained events

Creating a dedicated UI presenting fine grained events is OUT OF SCOPE. The querying could be done by another module or directly via Kibana.

Admin investigation

The goal is to help administrators investigate low level events.

Example of data: One record view event with information about the user who performed it, the time and date, geoip, etc...

The current version of the CDS server (based on Invenio 1) pushes all events into a "statistics/log" Elasticsearch server. These events can later be investigated. The resulting cluster has about 750GB of data.

Priority Medium

Audit Log

Zenodo mentioned a need for an audit log enabling the following kind of queries:

  • List events that a user has performed.
  • List events that are performed on an organisation.
Priority OUT OF SCOPE

Use case: record ranking

Use the statistics to change the ranking of records.

Needed by Zenodo
Priority Low

Use case: use stats to correct records

Register data which could be used by a module correcting invalid records (see invenio-checkers).

This is OUT OF SCOPE. Of course the registered events could be used by another module.

Needed by CAP (Nice to have)
Priority OUT OF SCOPE

Use case: alerts based on statistics

Send alerts (e.g. emails) when a statistic changes in a predefined way.

Needed by CAP (nice to have)
Priority LOW
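
One possible reading of "changes in a predefined way" is a relative drop beyond a threshold. The sketch below only builds the alert message; the function name `check_alert`, the threshold and the message format are assumptions, and delivery (for example by email) is left out.

```python
def check_alert(previous, current, max_drop_ratio=0.5):
    """Return an alert message if `current` fell by more than
    `max_drop_ratio` relative to `previous`, otherwise None.

    The rule itself is an assumption made for this example.
    """
    if previous <= 0:
        # No meaningful baseline to compare against.
        return None
    drop = (previous - current) / previous
    if drop > max_drop_ratio:
        return "statistic dropped by {:.0%} ({} -> {})".format(
            drop, previous, current
        )
    return None


# Downloads fell from 100 to 40 between two periods: an alert is produced.
message = check_alert(100, 40)
```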

Use case: custom "page view" definition

Possibility to define what counts as a "page view".

Needed by COD
Priority Low
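
A custom definition could be expressed as a pluggable predicate deciding whether an event counts. The rules below (ignore robots, ignore repeated hits from the same visitor on the same page within a time window) are purely illustrative assumptions, as are the event fields `visitor`, `path`, `timestamp` and `is_robot`.

```python
def is_page_view(event, seen, window_seconds=30):
    """Decide whether `event` counts as a page view.

    `seen` maps (visitor, path) to the timestamp of the last hit and is
    updated on every non-robot hit, so the window slides with each hit.
    """
    if event.get("is_robot"):
        return False
    key = (event["visitor"], event["path"])
    last = seen.get(key)
    seen[key] = event["timestamp"]
    return last is None or event["timestamp"] - last >= window_seconds


def count_page_views(events, predicate=is_page_view):
    """Count the events accepted by the given page view definition."""
    seen = {}
    return sum(1 for event in events if predicate(event, seen))


# Hypothetical events with timestamps in seconds.
events = [
    {"visitor": "a", "path": "/record/1", "timestamp": 0},
    {"visitor": "a", "path": "/record/1", "timestamp": 10},  # repeated hit
    {"visitor": "a", "path": "/record/1", "timestamp": 50},
    {"visitor": "bot", "path": "/record/1", "timestamp": 51, "is_robot": True},
    {"visitor": "b", "path": "/record/1", "timestamp": 5},
]

default_count = count_page_views(events)
# A service can plug in its own definition, e.g. "every hit counts".
raw_count = count_page_views(events, predicate=lambda event, seen: True)
```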

Presentation/Access

Dashboard

Widgets showing multiple statistics from the point of view of a Record, a Community, a User... Those widgets would be on a dedicated page or added on existing pages like "Community page", "Record page"

Needed by CDS, Zenodo, Archivematica integration
Priority Blocking.

Dedicated REST API via AJAX queries

AJAX queries to the REST API returning statistics. This would make statistics filtering more dynamic from the UI point of view.

Needed by Zenodo, B2SHARE and CDS
Priority Important but not blocking

Statistics added to existing endpoints

Statistics could be added when returning resources on existing endpoints. Example:

$ curl -XGET /api/records/123
{
  "metadata": {...},
  "links": {...},
  "stats": {
    "views": 42,
    "downloads": 10
  }
}

This would be done as a first approach by custom B2SHARE code, thus not part of this module.

Priority OUT OF SCOPE

Kibana/Elasticsearch

Statistics would be accessed via Kibana. This is only for administrators.

For CAP and COD, an admin or curator will access the statistics and then send the data to whoever asked for them.

Needed by CDS, CAP, COD
Priority Nothing needs to be done as long as the statistics are in elasticsearch

Features

Access control

Access to some statistics could be restricted to some users.

CAP and COD: each curator has access to their community’s stats.

Needed by CDS, CAP, COD
Priority Blocking for CDS, CAP and COD
Comments Zenodo and B2SHARE don't need access control for now. Every statistic is public.
Comments Warning from Tibor: openly publishing the statistics encourages users to do SEO.
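
A minimal sketch of such a restriction, assuming a policy that maps each community either to a "public" marker or to the set of curators allowed to see its statistics. The names `can_view_stats`, `PUBLIC` and the policy layout are hypothetical, not an existing Invenio API.

```python
PUBLIC = "public"


def can_view_stats(user, community, policy):
    """Return True if `user` may see the statistics of `community`.

    `policy` maps a community name to PUBLIC or to a set of curator ids.
    Communities without a rule are denied by default.
    """
    rule = policy.get(community)
    if rule is None:
        return False
    if rule == PUBLIC:
        return True
    return user in rule


# Hypothetical policy: one restricted community, one public community.
policy = {
    "cap-community": {"curator-1"},
    "open-community": PUBLIC,
}
```

Denying by default keeps a newly created community's statistics private until a rule is added for it.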

Removing old data

Automatic removal of old events or statistics.

No need for CDS, CAP and COD. Old data is sent to backup storage

Needed by B2SHARE, Zenodo
Priority Low. The data can be deleted manually in the meantime.
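
Automatic removal could be driven by a simple retention period. This sketch only partitions in-memory events; the function name `split_by_age` and the `timestamp` field are assumptions, and a real deployment would delete from the actual event store instead.

```python
from datetime import datetime, timedelta


def split_by_age(events, now, max_age):
    """Partition events into (kept, expired) using their timestamp."""
    cutoff = now - max_age
    kept = [e for e in events if e["timestamp"] >= cutoff]
    expired = [e for e in events if e["timestamp"] < cutoff]
    return kept, expired


# Hypothetical events and a one-year retention period.
events = [
    {"id": 1, "timestamp": datetime(2015, 1, 10)},
    {"id": 2, "timestamp": datetime(2017, 5, 1)},
]
kept, expired = split_by_age(
    events, now=datetime(2017, 6, 13), max_age=timedelta(days=365)
)
```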

Preprocessing and aggregation of statistics.

Automatic aggregation of low-level statistics and events into higher-level statistics.

Example: aggregate record view events into "record views per day" documents.

This would improve the performance as old events can be removed more easily.

Needed by B2SHARE, Zenodo
Priority Medium
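
The "record views per day" example can be sketched as follows. The event fields and the shape of the resulting documents are assumptions made for this illustration.

```python
from collections import Counter
from datetime import datetime


def aggregate_views_per_day(events):
    """Aggregate record view events into one document per record and day."""
    counts = Counter(
        (e["record_id"], e["timestamp"].date().isoformat())
        for e in events
        if e["type"] == "record-view"
    )
    return [
        {"record_id": record_id, "date": day, "count": count}
        for (record_id, day), count in sorted(counts.items())
    ]


# Hypothetical low-level events.
events = [
    {"type": "record-view", "record_id": "1",
     "timestamp": datetime(2017, 6, 13, 9, 0)},
    {"type": "record-view", "record_id": "1",
     "timestamp": datetime(2017, 6, 13, 17, 30)},
    {"type": "record-view", "record_id": "1",
     "timestamp": datetime(2017, 6, 14, 8, 0)},
    {"type": "file-download", "record_id": "1",
     "timestamp": datetime(2017, 6, 14, 8, 5)},
]

documents = aggregate_views_per_day(events)
```

Once the per-day documents exist, the underlying events are no longer needed to answer "views per day" queries.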

COUNTER compatibility

Be compatible with COUNTER (see https://www.projectcounter.org/guides/)

Needed by Zenodo
Priority Low

Sending anonymized events to OpenAIRE piwik HTTP API instance

Needed by Zenodo
Priority Low

Constraints

Downtime

By downtime we mean that the statistics are not available but Invenio is still up. This can happen during long processing jobs or migrations.

For all services a long downtime seems acceptable.

Performance and resources

Amount of data which will be stored:

CDS Our current ES cluster holds 750GB of data / 2’414 Mil objects (including apache logs). Just the statistics should be close to ~100GB of data.
Zenodo Small - it’s in Piwik
CAP & COD <10GB

Deadlines

Here is the list of deadlines per project:

CAP and COD are still on Invenio 2 so invenio-stats is only a nice to have.

What CDS Zenodo B2SHARE Archivematica CAP COD
invenio-stats alpha July, a student working on Frontend for statistics for 10 weeks End of June
invenio-stats integrated in service Q3 July Only a nice to Have Nice to have Nice to Have