For the ADS Expansion Project, we need quantitative measures for the disciplines we are expanding into:
- Heliophysics (HP)
- Planetary Science (PS)
- Earth Sciences (ES)
- Biological and Physical Sciences, in the NASA context (BPS)
For context, these measures will also be provided for Astrophysics (AST). The measures covered by this environment are:
- Publication completeness
- Full text completeness
- Level of reference matching
- Summary of general statistics
The reporting format depends on the intended audience. For example, the ADS curation team will need to know coverage levels by data source.
The general pattern for generating reports is the following:

```
python3 run.py --collection <COLLECTION> --format <FORMAT> --subject <SUBJECT>
```
The `collection` parameter corresponds to the discipline; it accepts the discipline abbreviations listed above (HP, PS, ES, BPS, AST). The `format` parameter determines what will be included in a report, e.g. general full text coverage versus full text coverage split up by source ("publisher" and "arXiv"); it accepts either NASA or CURATORS as values. Finally, the `subject` parameter determines the subject of the report; the acceptable values are RECORDS, FULLTEXT, REFERENCES, or SUMMARY.
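For example, a full text report for Planetary Science, formatted for the curation team, would be generated with something like:

```
python3 run.py --collection PS --format CURATORS --subject FULLTEXT
```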
The `collection` parameter determines which publications will be used for the reporting. Besides a collection of journals (via their journal abbreviations, i.e. bibstems), collections may also have queries associated with them. These queries are meant to be representative of the discipline and to incorporate content that goes beyond the core discipline journals. More details can be found in the content selection section, below.
Having coverage/completeness measures for core discipline journals is of course very important. However, we also need measures for the growth of holdings outside of the core collection, i.e. the "outer rings" in the ADS curation model (see e.g. Kurtz et al. (2021)). These will need to be based on content queries that incorporate publications beyond the core journals, and these queries will have to be determined largely by heuristics.

One approach can be called a "network approach": since reference sections are the best summaries of scholarly publications, and citations are a good measure of how research is being used, it makes sense to start with the core journals and construct a set of publications by augmenting this collection with the references and citations of the papers in it. Of course, this will not include anything that is not connected in this network, i.e. works that neither cite nor are cited by articles in the core journals. It can be argued, however, that this network provides the most support for and representation of current, active research in any discipline.
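As a rough illustration of this network approach (not the actual implementation), the sketch below expands a core-journal query with the ADS second-order operators `references()` and `citations()` and collects the union of bibcodes. It assumes an API token in the `ADS_API_TOKEN` environment variable; the bibstems in the query are just example planetary science journals, and no cursor paging is done.

```python
import os
import requests

ADS_API = "https://api.adsabs.harvard.edu/v1/search/query"
TOKEN = os.environ["ADS_API_TOKEN"]  # assumes a token is available

def get_bibcodes(query, rows=2000):
    """Retrieve bibcodes for a query (simplified: no cursor paging)."""
    params = {"q": query, "fl": "bibcode", "rows": rows}
    resp = requests.get(ADS_API, params=params,
                        headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    return {d["bibcode"] for d in resp.json()["response"]["docs"]}

# Hypothetical core query for a discipline: a couple of core journal bibstems
core_query = 'bibstem:("Icar" OR "P&SS") year:2020-2022'

core = get_bibcodes(core_query)
# Augment the core set with everything it cites and everything citing it,
# using the ADS second-order operators references() and citations()
cited = get_bibcodes(f"references({core_query})")
citing = get_bibcodes(f"citations({core_query})")

network = core | cited | citing
print(f"core: {len(core)}, network: {len(network)}")
```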
This section describes the means of retrieving the necessary data for each of the reports (specified via the `subject` parameter).
If the full text origin is not important, coverage can be established completely via API calls. For any given query, the number of records returned by that query that have full text indexed can be retrieved by adding the filter `fulltext_mtime:["1000-01-01t00:00:00.000Z" TO *]` to that original query.
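A minimal sketch of this, assuming an API token and a hypothetical collection query, counts records with and without that filter (using `rows=0` so only the hit counts are transferred):

```python
import os
import requests

ADS_API = "https://api.adsabs.harvard.edu/v1/search/query"
HEADERS = {"Authorization": f"Bearer {os.environ['ADS_API_TOKEN']}"}

def hit_count(query, fq=None):
    """Return only the number of hits for a query (rows=0: no records)."""
    params = {"q": query, "rows": 0}
    if fq:
        params["fq"] = fq
    resp = requests.get(ADS_API, params=params, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

collection_query = 'bibstem:"Icar" year:2021'  # hypothetical collection query
total = hit_count(collection_query)
with_fulltext = hit_count(collection_query,
                          fq='fulltext_mtime:["1000-01-01t00:00:00.000Z" TO *]')
print(f"full text coverage: {with_fulltext}/{total} = {with_fulltext/total:.1%}")
```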
Currently (05/12/2022), the origin of full text is not indexed in Solr. So, if we want to distinguish "publisher" versus "arXiv" as the full text origin, we have to use the Classic index file for full text. This means that the process generating this report needs access to the appropriate partition in the Classic back office.
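The layout of the Classic full text index is not described here; purely as an illustration, the sketch below assumes a whitespace-delimited file (at a hypothetical path) with a bibcode, a path to the full text file, and an origin per line, and tallies records per origin.

```python
from collections import Counter

# Hypothetical location and layout of the Classic full text index:
# one line per record: <bibcode> <path-to-fulltext> <origin>
INDEX_FILE = "/proj/ads/fulltext/index.all"  # hypothetical path

origin_counts = Counter()
with open(INDEX_FILE) as fh:
    for line in fh:
        fields = line.split()
        if len(fields) < 3:
            continue
        origin = fields[2]
        origin_counts[origin.lower()] += 1

print(origin_counts.most_common())
```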
This report generates matching levels for journals, per volume. This means that for each journal volume, we determine the overall total of references and how many of those were successfully matched to existing ADS records. Currently (05/12/2022), this can only be done using data generated by the Classic reference resolver, stored on the Classic back office partition.
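The format of the resolver output on the Classic partition is not specified here, so the sketch below only illustrates the aggregation step: given per-article totals and matched counts (the three records are made-up values, purely for illustration), it rolls them up per volume using the volume field of the bibcode.

```python
from collections import defaultdict

# Per-article matching counts, however they were obtained from the resolver
# output: (bibcode, total_references, matched_references). Made-up examples.
article_counts = [
    ("2021Icar..35514234K", 45, 44),
    ("2021Icar..35514255L", 60, 57),
    ("2021Icar..36614700M", 38, 30),
]

per_volume = defaultdict(lambda: [0, 0])  # volume -> [total, matched]
for bibcode, total, matched in article_counts:
    volume = bibcode[9:13].lstrip(".")  # volume field of the bibcode
    per_volume[volume][0] += total
    per_volume[volume][1] += matched

for volume, (total, matched) in sorted(per_volume.items()):
    print(f"volume {volume}: {matched}/{total} matched ({matched/total:.1%})")
```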
Generating this report means establishing how many articles were published in any given volume of the journals being analyzed, and checking how many of those have been indexed in the ADS holdings. For the first part, the number of articles actually published, we need an external source; we will use Crossref for this. In practice this means we will use the ADS Journals Database to generate this report.
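A minimal sketch of the comparison step, with the published-per-volume counts treated as input from the ADS Journals Database (ultimately sourced from Crossref) and the indexed counts taken from a faceted ADS query; the numbers below are made up for illustration.

```python
import pandas as pd

# Articles published per volume, as obtained from the ADS Journals Database
# (ultimately Crossref); made-up numbers for illustration.
published = {"355": 210, "356": 198, "357": 205}
# Articles per volume actually indexed in ADS (e.g. from a faceted ADS query)
indexed = {"355": 208, "356": 190, "357": 205}

df = pd.DataFrame({"published": pd.Series(published),
                   "indexed": pd.Series(indexed)})
df["completeness"] = df["indexed"] / df["published"]
print(df.to_string(float_format="{:.1%}".format))
```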
The general design pattern is very similar to that of other ADS applications that use `run.py` to start processes. The Python framework is based on Python 3.8. Some of the data processing and all of the report generation uses the Python Pandas library. From a programmatic point of view, each report is a Python class that inherits from a generic Report class. The reports on coverage levels make use of conditional formatting to color cells in the spreadsheets, depending on their values. This is done using functionality within Pandas, which relies on the Jinja2 templating tool. The choice was made to export to Excel spreadsheets, which makes use of the openpyxl module. Further specifics can be found in the comments throughout the code.
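As an architectural sketch only (the class and method names here are hypothetical, not the ones in the repository), a coverage report could inherit from a generic Report base class and use the Pandas Styler, which relies on Jinja2 under the hood, to color cells before exporting to Excel via openpyxl:

```python
import pandas as pd

class Report:
    """Hypothetical generic base class: holds the data and writes the file."""
    def __init__(self, data):
        self.df = pd.DataFrame(data)

    def styled(self):
        return self.df.style  # no conditional formatting by default

    def save(self, filename):
        self.styled().to_excel(filename, engine="openpyxl")


class FullTextCoverageReport(Report):
    """Hypothetical coverage report with conditional cell coloring."""
    def styled(self):
        def color(value):
            # green for good coverage, orange for mediocre, red for poor
            if value >= 0.9:
                return "background-color: #c6efce"
            if value >= 0.5:
                return "background-color: #ffeb9c"
            return "background-color: #ffc7ce"
        # Styler.applymap applies the CSS function cell by cell (rendered via Jinja2)
        return self.df.style.applymap(color)


report = FullTextCoverageReport({"publisher": [0.95, 0.42], "arXiv": [0.60, 0.88]})
report.save("fulltext_coverage.xlsx")
```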
The following general rule was applied to ADS API calls: avoid retrieving individual records as much as possible, even when this means retrieving just bibcodes. For large sets of records this is simply not efficient. For this reason, whenever possible, data is retrieved from Solr using facet and pivot queries.
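A sketch of that pattern, assuming an API token and a hypothetical journal query: a single request with a pivot facet on `bibstem` and `volume` returns the per-volume counts needed for reporting, instead of paging through every individual record.

```python
import os
import requests

ADS_API = "https://api.adsabs.harvard.edu/v1/search/query"
HEADERS = {"Authorization": f"Bearer {os.environ['ADS_API_TOKEN']}"}

# One request returns per-volume record counts for a set of journals,
# instead of retrieving (even just the bibcodes of) every individual record.
params = {
    "q": 'bibstem:("Icar" OR "P&SS") year:2020-2022',  # hypothetical query
    "rows": 0,                         # we only want the facet counts
    "facet": "true",
    "facet.pivot": "bibstem,volume",
    "facet.limit": -1,
}
resp = requests.get(ADS_API, params=params, headers=HEADERS)
resp.raise_for_status()
pivots = resp.json()["facet_counts"]["facet_pivot"]["bibstem,volume"]
for journal in pivots:
    for vol in journal.get("pivot", []):
        print(journal["value"], vol["value"], vol["count"])
```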
- Find better ways to report on usage
- Augment the Solr schema to better support origin-specific queries (e.g. retrieve records with full text from the publisher)
- Report the number of frequent users that have read/downloaded publications within a given collection