adsabs/ExpansionReporting

ADS Expansion Project Reporting Environment

Introduction

For the ADS Expansion Project, we need quantitative measures for the disciplines we are expanding into:

  • Heliophysics (HP)
  • Planetary Science (PS)
  • Earth Sciences (ES)
  • Biological and Physical Sciences in the NASA context (BPS)

For context, these measures will also be provided for Astrophysics (AST). The measures covered by this environment are:

  • Publication completeness
  • Full text completeness
  • Level of reference matching
  • Summary of general statistics

The reporting format depends on the intended audience. For example, the ADS curation team will need to know coverage levels by data source.

Usage and logic

The general pattern for generating reports is the following:

python3 run.py --collection <COLLECTION> --format <FORMAT> --subject <SUBJECT>

The collection parameter corresponds to the discipline and accepts the discipline abbreviations (HP, PS, ES, BPS, AST). The format parameter determines what will be included in a report, e.g. general full text coverage versus full text coverage split up by source ("publisher" and "arXiv"); it accepts either NASA or CURATORS. Finally, the subject parameter determines the subject of the report. The acceptable values are RECORDS, FULLTEXT, REFERENCES or SUMMARY.
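As an illustration of the three parameters described above, a command-line parser could look like the sketch below. This is not the code in run.py. The function name `build_parser` is hypothetical, and marking all three options as required is an assumption (the text does not say whether defaults exist). Only the option names and accepted values come from this README.

```python
import argparse

# Accepted values as listed in the text above.
COLLECTIONS = ["HP", "PS", "ES", "BPS", "AST"]
FORMATS = ["NASA", "CURATORS"]
SUBJECTS = ["RECORDS", "FULLTEXT", "REFERENCES", "SUMMARY"]


def build_parser():
    """Build a parser for --collection, --format and --subject (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Generate an expansion report")
    # required=True is an assumption; the real script may define defaults.
    parser.add_argument("--collection", choices=COLLECTIONS, required=True,
                        help="discipline abbreviation")
    parser.add_argument("--format", choices=FORMATS, required=True,
                        help="intended audience of the report")
    parser.add_argument("--subject", choices=SUBJECTS, required=True,
                        help="subject of the report")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
example_args = build_parser().parse_args(
    ["--collection", "HP", "--format", "CURATORS", "--subject", "FULLTEXT"]
)
```

Restricting each option with `choices` means an unsupported value is rejected before any report code runs.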

The collection parameter determines which publications will be used for the reporting. Besides a collection of journals (identified by their journal abbreviations, i.e. bibstems), collections may also have queries associated with them. These queries are supposed to be representative of the discipline and incorporate content that goes beyond the core discipline journals. More details can be found in the content selection section, below.

Content selection

Having coverage/completeness measures for core discipline journals is of course very important. However, we also need measures for the increase in holdings outside of the core collection, i.e. the "outer rings" in the ADS curation model (see e.g. Kurtz et al. (2021)). These will need to be based on content queries that incorporate publications beyond the core journals, and these content queries will have to be determined largely by heuristics. One approach can be called a "network approach": since reference sections are the best summaries of scholarly publications and citations are a good measure of how research is being used, it makes sense to start with the core journals and construct a set of publications by augmenting this collection with the references and citations for the papers in it. Of course, this will not include anything that is not connected in this network, i.e. works that neither cite nor are cited by articles in the core journals. It can be argued, however, that this network provides the most support for and representation of current, active research in any discipline.
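The "network approach" amounts to a one-step expansion of the core set along reference and citation links. The sketch below shows the idea on in-memory data. The function name and the dictionary-based inputs are assumptions made for illustration, and say nothing about how the actual content queries are built.

```python
def expand_collection(core, references, citations):
    """Return the core set plus everything it cites or is cited by.

    core:       set of identifiers for papers in the core journals
    references: dict mapping a paper to the papers it cites
    citations:  dict mapping a paper to the papers that cite it
    """
    expanded = set(core)
    for paper in core:
        # Add the works this core paper cites ...
        expanded.update(references.get(paper, ()))
        # ... and the works that cite this core paper.
        expanded.update(citations.get(paper, ()))
    return expanded
```

Works that are only linked to non-core papers are left out, which matches the limitation noted above.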

Data sources

This section describes the means of retrieving the necessary data for each of the reports (specified via the subject parameter).

Data sources for full text report

If the full text origin is not important, coverage can be established completely via API calls. For any given query, the number of records returned by that query that have full text indexed can be retrieved by adding the filter fulltext_mtime:["1000-01-01t00:00:00.000Z" TO *] to that original query.
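A minimal sketch of this counting approach is given below. It only builds the request parameters and computes a percentage; no request is sent. Passing the filter as a Solr `fq` parameter and reading the hit count from `numFound` with `rows` set to 0 are assumptions based on standard Solr conventions, and the function names are hypothetical.

```python
# Filter string copied verbatim from the text above.
FULLTEXT_FILTER = 'fulltext_mtime:["1000-01-01t00:00:00.000Z" TO *]'


def fulltext_count_params(query):
    """Return two parameter sets: all records, and records with full text."""
    # rows=0 because only the hit count is needed, not the records themselves.
    all_records = {"q": query, "rows": 0}
    # Assumption: the filter is sent as a filter query (fq) next to the original q.
    with_fulltext = {"q": query, "fq": FULLTEXT_FILTER, "rows": 0}
    return all_records, with_fulltext


def fulltext_coverage(num_with_fulltext, num_total):
    """Percentage of records with full text, or None for an empty result set."""
    if num_total == 0:
        return None
    return 100.0 * num_with_fulltext / num_total
```

The two hit counts returned for these parameter sets are then enough to report coverage for any query.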

Currently (05/12/2022), the origin of full text is not indexed in Solr. To distinguish between publisher and arXiv as the full text origin, we therefore have to use the Classic index file for full text, which means that the process generating this report needs access to the appropriate partition in the Classic back office.

Data sources for reference data

This report generates matching levels for journals, per volume. This means that for each journal volume, we determine the overall total of references and how many of those were successfully matched to existing ADS records. Currently (05/12/2022), this can only be done using data generated by the Classic reference resolver, stored on the Classic back office partition.
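The per-volume aggregation described above can be sketched as follows. The input layout (one tuple per article with its reference counts) is an assumption made for illustration. The actual format of the Classic reference resolver output is not described here.

```python
from collections import defaultdict


def matching_levels(records):
    """Aggregate reference matching per journal volume.

    records: iterable of (journal, volume, total_refs, matched_refs) tuples,
             one per article (hypothetical layout).
    Returns a dict mapping (journal, volume) to the matched fraction,
    or None when a volume has no references at all.
    """
    totals = defaultdict(lambda: [0, 0])
    for journal, volume, total, matched in records:
        totals[(journal, volume)][0] += total
        totals[(journal, volume)][1] += matched
    return {
        key: (matched / total if total else None)
        for key, (total, matched) in totals.items()
    }
```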

Data sources for general record coverage

Generating this report means establishing how many articles were published in any given volume of the journals being analyzed, and checking how many of those have been indexed in the ADS holdings. We need an external source for the first part, the number of articles that were actually published, and we will use Crossref for this. In practice this means we will use the ADS Journals Database to generate this report.
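Once both counts are available, the comparison itself is simple. The sketch below assumes the published and indexed counts have already been retrieved and are held in dictionaries keyed by volume. Both the function name and this layout are hypothetical.

```python
def record_coverage(published, indexed):
    """Compare published article counts with indexed counts, per volume.

    published: dict mapping volume to the number of articles published
    indexed:   dict mapping volume to the number of articles in the holdings
    Returns a dict mapping volume to the coverage percentage (one decimal),
    or None when no published count is available for that volume.
    """
    report = {}
    for volume, n_published in published.items():
        n_indexed = indexed.get(volume, 0)
        if n_published:
            report[volume] = round(100.0 * n_indexed / n_published, 1)
        else:
            report[volume] = None
    return report
```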

Implementation specifics

The general design pattern is very similar to that of other ADS applications that use run.py to start processes. The code is based on Python 3.8. Some of the data processing and all of the report generation use the Python pandas library. From a programmatic point of view, each report is a Python class that inherits from a generic Report class. The reports on coverage levels make use of conditional formatting to color cells in the spreadsheets, depending on their values. This is done using functionality within pandas, which in turn relies on Jinja2. The choice was made to export to Excel spreadsheets, which makes use of the openpyxl module. Further specifics can be found in the comments throughout the code.
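The class layout and the value-dependent coloring can be illustrated with the sketch below. It uses only the standard library. Apart from the name `Report`, which the text mentions, every name, method and threshold is hypothetical, and the row in `get_data` is made-up placeholder data. A function that maps a cell value to a CSS declaration, like `coverage_color` here, is the kind of callable that the pandas Styler accepts for conditional formatting.

```python
class Report:
    """Generic report: subclasses supply the data, the base class assembles it."""

    subject = None  # set by each subclass

    def __init__(self, collection, fmt):
        self.collection = collection
        self.fmt = fmt

    def get_data(self):
        raise NotImplementedError("subclasses must implement get_data")

    def make_report(self):
        return {
            "collection": self.collection,
            "format": self.fmt,
            "subject": self.subject,
            "rows": self.get_data(),
        }


class FullTextReport(Report):
    """Hypothetical subclass for the FULLTEXT subject."""

    subject = "FULLTEXT"

    def get_data(self):
        # Made-up placeholder row, not real coverage data.
        return [{"journal": "XXXX", "coverage": 75.0}]


def coverage_color(value, low=50.0, high=90.0):
    """Return a CSS declaration for a coverage percentage (thresholds are assumptions)."""
    if value >= high:
        return "background-color: lightgreen"
    if value >= low:
        return "background-color: khaki"
    return "background-color: salmon"
```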

The following general rule was applied to ADS API calls: avoid retrieving individual records as much as possible, even when this means retrieving just bibcodes. For large sets of records, retrieving individual records is not efficient. For this reason, whenever possible, data is retrieved from Solr using facet and pivot queries.
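To illustrate why facet queries avoid record retrieval, the sketch below builds standard Solr facet parameters and parses the flat value/count list that Solr returns for a facet field. The helper names are hypothetical, the field name in the usage example is only an illustration, and whether the ADS API accepts every parameter shown (for example an unlimited `facet.limit`) is an assumption.

```python
def facet_params(query, field, limit=-1):
    """Build parameters for a facet query that returns counts but no records."""
    return {
        "q": query,
        "rows": 0,              # no individual records are retrieved
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,   # -1 means no limit in standard Solr
        "facet.mincount": 1,    # skip facet values with zero hits
    }


def parse_facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Example with a hand-written response in the standard Solr layout.
sample = {"facet_counts": {"facet_fields": {"volume": ["1", 120, "2", 95]}}}
counts = parse_facet_counts(sample, "volume")
```

A single request of this kind yields counts for every facet value, where retrieving bibcodes would require paging through the full result set.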

Future additions

  • Find better ways to report on usage
  • Augment the Solr schema to better support origin-specific queries (e.g. retrieve records with full text from the publisher)
  • Report the number of frequent users that have read/downloaded publications within a given collection