Commit d1d0410 (1 parent: a3feec6), showing 1 changed file with 280 additions and 1 deletion.
# publicationclassificationlabeling

This Java package can be used to obtain labels for clusters of scientific publications. These clusters can be created using the publicationclassification package.

Labels are obtained based on the titles of a sample of publications in each cluster. The package uses OpenAI GPT language models. It supports the [GPT-3.5 and Updated GPT-3.5 Turbo models](https://platform.openai.com/docs/models/gpt-3-5) as well as the [GPT-4 and GPT-4 Turbo models](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo).

The publicationclassificationlabeling package was developed by [Nees Jan van Eck](https://orcid.org/0000-0001-8448-4521) at the [Centre for Science and Technology Studies (CWTS)](https://www.cwts.nl) at [Leiden University](https://www.universiteitleiden.nl/en).

## Documentation

Documentation of the source code of publicationclassificationlabeling is provided in the code in `javadoc` format. The documentation is also available in a [compiled format](https://CWTSLeiden.github.io/publicationclassificationlabeling).

## Installation

### Maven

```
<dependency>
    <groupId>nl.cwts</groupId>
    <artifactId>publicationclassificationlabeling</artifactId>
    <version>1.0.0</version>
</dependency>
```

### Gradle

```
implementation group: 'nl.cwts', name: 'publicationclassificationlabeling', version: '1.0.0'
```

## Usage

The publicationclassificationlabeling package requires Java 8 or higher. The latest version of the package is available as a pre-compiled `jar` file on [Maven Central](https://central.sonatype.com/artifact/nl.cwts/publicationclassificationlabeling) and [GitHub Packages](https://github.com/CWTSLeiden/publicationclassificationlabeling/packages).
Instructions for compiling the source code of the package are provided [below](#development-and-deployment).

Use the command-line tool `PublicationClassificationLabelingCreator` to obtain cluster labels. The tool can be run as follows:

```
java -cp publicationclassificationlabeling-1.0.0.jar nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator
```

If no further arguments are provided, the following usage notice will be displayed:

```
PublicationClassificationLabelingCreator version 1.0.0
By Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
Usage: PublicationClassificationLabelingCreator
<pub_titles_file> <label_file>
<api_key> <gpt_model> <print_labeling>
(to create a publication classification labeling based on data in text files)
or PublicationClassificationLabelingCreator
<server> <database> <pub_titles_table> <label_table>
<api_key> <gpt_model> <print_labeling>
(to create a publication classification labeling based on data in an SQL Server database)
Arguments:
<pub_titles_file>
Name of the publication titles input file. This text file must contain two tab-separated
columns (without a header line): a column of cluster numbers and a column of publication
titles. The cluster numbers in the first column must be integers starting at zero. The
publication titles in the second column (e.g., the titles of a sample of 100 publications)
must be concatenated into a single string. The lines in the file must be sorted by the
cluster numbers in the first column.
<label_file>
Name of the labels output file. This text file will contain six tab-separated columns
(without a header line): a column of cluster numbers, a column of short labels, a column
of long labels, a column of keywords, a column of descriptions, and a column of Wikipedia
page links. Cluster numbers are integers starting at zero.
<server>
SQL Server server name. A connection will be made using integrated authentication.
<database>
Database name.
<pub_titles_table>
Name of the publication titles input table. This table must have two columns: cluster_no
and pub_titles. The cluster numbers in the first column must be integers starting at zero.
The publication titles in the second column (e.g., the titles of a sample of 100
publications) must be concatenated into a single string.
<label_table>
Name of the labels output table. This table will have six columns: cluster_no,
short_label, long_label, keywords, summary, and wikipedia_url. Cluster numbers are
integers starting at zero.
<api_key>
OpenAI API key.
<gpt_model>
OpenAI GPT model. The models supported are: 'gpt-4-1106-preview', 'gpt-4',
'gpt-3.5-turbo-1106', and 'gpt-3.5-turbo'.
<print_labeling>
Boolean indicating whether the generated publication classification labeling should be
printed to the standard output or not.
```
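
The `<pub_titles_file>` format described in the usage notice can be produced with a few lines of Java. The following is a minimal sketch, not part of the package: the class name `PubTitlesFileWriter` and the method `buildLines` are hypothetical, and joining the titles of a cluster with a single space is an assumption (the usage notice only requires that they are concatenated into a single string).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of publicationclassificationlabeling.
public class PubTitlesFileWriter {

    // Builds one line per cluster: "<cluster_no>\t<concatenated titles>".
    // Copying into a TreeMap sorts the lines by cluster number.
    static List<String> buildLines(Map<Integer, List<String>> titlesByCluster) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : new TreeMap<>(titlesByCluster).entrySet()) {
            StringBuilder titles = new StringBuilder();
            for (String title : entry.getValue()) {
                // Tabs and line breaks inside a title would break the two-column format.
                String cleaned = title.replaceAll("[\\t\\r\\n]+", " ").trim();
                if (titles.length() > 0) {
                    titles.append(' '); // assumed separator between titles
                }
                titles.append(cleaned);
            }
            lines.add(entry.getKey() + "\t" + titles);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Toy data; in practice each cluster would hold a sample of titles.
        Map<Integer, List<String>> titlesByCluster = new HashMap<>();
        titlesByCluster.put(1, Arrays.asList("Title C", "Title D"));
        titlesByCluster.put(0, Arrays.asList("Title A", "Title B"));

        List<String> lines = buildLines(titlesByCluster);
        if (args.length == 1) {
            // Write the file without a header line, as the tool expects.
            Files.write(Paths.get(args[0]), lines, StandardCharsets.UTF_8);
        } else {
            for (String line : lines) {
                System.out.println(line);
            }
        }
    }
}
```

Run with a file name as the only argument to write the file, or without arguments to print the lines for inspection.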

### Example

The following example illustrates the use of the `PublicationClassificationLabelingCreator` tool. Suppose you have a text file `cluster_pub_titles.txt`:

```
0 The link-prediction problem for social networks Twitter Power: Tweets as Electronic Word of Mout...
1 The journal coverage of Web of Science and Scopus: a comparative analysis What do citation count...
2 Social network analysis: a powerful strategy, also for the information sciences Google Scholar, ...
3 The sharing economy: Why people participate in collaborative consumption How open is innovation?...
4 Academic engagement and commercialisation: A review of the literature on university-industry rel...
5 Growth rates of modern science: A bibliometric analysis based on the number of publications and ...
6 The determinants of national innovative capacity Citations, family size, opposition and the valu...
7 Developing a framework for responsible innovation Technologies of humility: Citizen participatio...
8 Theory and practise of the g-index An approach for detecting, quantifying, and visualizing the e...
9 Technological transitions as evolutionary reconfiguration processes: a multi-level perspective a...
10 Software survey: VOSviewer, a computer program for bibliometric mapping CiteSpace II: Detecting ...
```

The `PublicationClassificationLabelingCreator` tool can then be run as follows:

```
java -cp publicationclassificationlabeling-1.0.0.jar nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator cluster_pub_titles.txt label.txt <your OpenAI API key> gpt-3.5-turbo-1106 true
```
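
If you prefer not to type the API key on the command line, the tool can also be started from a small Java launcher. The sketch below is an illustration under assumptions, not part of the package: the class `LabelingCreatorLauncher`, the method `buildCommand`, and the environment variable name `OPENAI_API_KEY` are hypothetical. The list of supported models is copied from the usage notice, and the jar path and main class name are passed in as arguments rather than hard-coded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher, not part of publicationclassificationlabeling.
public class LabelingCreatorLauncher {

    // Models listed in the tool's usage notice.
    static final List<String> SUPPORTED_MODELS = Arrays.asList(
            "gpt-4-1106-preview", "gpt-4", "gpt-3.5-turbo-1106", "gpt-3.5-turbo");

    // Assembles the command line for the text-file mode of the tool.
    static List<String> buildCommand(String jar, String mainClass, String pubTitlesFile,
            String labelFile, String apiKey, String gptModel, boolean printLabeling) {
        if (apiKey == null || apiKey.isEmpty()) {
            throw new IllegalArgumentException("Missing OpenAI API key");
        }
        if (!SUPPORTED_MODELS.contains(gptModel)) {
            throw new IllegalArgumentException("Unsupported GPT model: " + gptModel);
        }
        return new ArrayList<>(Arrays.asList("java", "-cp", jar, mainClass, pubTitlesFile,
                labelFile, apiKey, gptModel, String.valueOf(printLabeling)));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: java LabelingCreatorLauncher <jar> <main_class>");
            return;
        }
        // Assumed environment variable name; the tool itself only accepts the key as an argument.
        String apiKey = System.getenv("OPENAI_API_KEY");
        if (apiKey == null || apiKey.isEmpty()) {
            System.err.println("Set the OPENAI_API_KEY environment variable first.");
            return;
        }
        List<String> command = buildCommand(args[0], args[1], "cluster_pub_titles.txt",
                "label.txt", apiKey, "gpt-3.5-turbo-1106", true);
        // Forward the tool's output to this console and wait for it to finish.
        int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
        System.out.println("Tool exited with code " + exitCode);
    }
}
```

Checking the model name before starting the process avoids launching the tool with a value it does not list as supported. Note that the key is still passed to the tool as a command-line argument, so it may remain visible in process listings on shared machines.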

The cluster labels obtained using the tool can be found in the text file `label.txt`:

```
0 Information Retrieval Information Retrieval and Knowledge Management ...
1 Bibliometric Analysis Bibliometric Analysis and Research Evaluation ...
2 Scientific Collaboration Patterns and Impact of Scientific Collaboration ...
3 Open Innovation Open Innovation and Collaborative Knowledge Sharing ...
4 University-Industry Relations University-Industry Relations and Technology Transfer ...
5 Scholarly Communication Scholarly Communication in the Digital Age ...
6 Innovation Studies Determinants of National Innovative Capacity and Patent Analysis ...
7 Research Impact Assessment Assessing the Societal Impact of Research ...
8 Bibliometric Analysis Bibliometric Analysis in Scholarly Communication ...
9 Technological Transitions Technological Transitions as Evolutionary Reconfiguration Processes ...
10 Bibliometric Mapping Bibliometric Mapping and Interdisciplinary Research Analysis ...
```
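
To use the labels in your own code, the output file can be read back column by column. The following is a minimal sketch, assuming the six tab-separated columns described in the usage notice (cluster number, short label, long label, keywords, summary, Wikipedia link); the class `LabelFileReader`, the nested `ClusterLabel` type, and the method `parseLine` are hypothetical names, not part of the package.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical reader, not part of publicationclassificationlabeling.
public class LabelFileReader {

    // One row of the labels output file.
    static final class ClusterLabel {
        final int clusterNo;
        final String shortLabel;
        final String longLabel;
        final String keywords;
        final String summary;
        final String wikipediaUrl;

        ClusterLabel(String[] cols) {
            clusterNo = Integer.parseInt(cols[0].trim());
            shortLabel = cols[1];
            longLabel = cols[2];
            keywords = cols[3];
            summary = cols[4];
            wikipediaUrl = cols[5];
        }
    }

    // Splits a line on tabs; the limit -1 keeps empty trailing columns.
    static ClusterLabel parseLine(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 6) {
            throw new IllegalArgumentException(
                    "Expected 6 tab-separated columns but found " + cols.length);
        }
        return new ClusterLabel(cols);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java LabelFileReader <label_file>");
            return;
        }
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            if (line.isEmpty()) {
                continue; // skip a possible empty trailing line
            }
            ClusterLabel label = parseLine(line);
            System.out.println(label.clusterNo + ": " + label.shortLabel
                    + " (" + label.wikipediaUrl + ")");
        }
    }
}
```

For example, `java LabelFileReader label.txt` prints one line per cluster with its number, short label, and Wikipedia link.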

The tool displays the following output:

```
PublicationClassificationLabelingCreator version 1.0.0
By Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
Reading publication titles from file... Finished!
Reading publication titles from file took 0h 0m 0s.
Creating labeling for each cluster...
Creating labeling cluster 0... Finished!
Labeling:
Short label: Information Retrieval
Long label: Information Retrieval and Knowledge Management
Keywords: Information Retrieval; Knowledge Management; Social Networks; Sentiment Analysis; User Engagement; Web Searching; Information Literacy; Data Mining; Online Communities; Credibility Assessment
Summary: This cluster of papers focuses on information retrieval, knowledge management, and related topics such as social networks, sentiment analysis, user engagement, web searching, information literacy, data mining, online communities, and credibility assessment.
Wikipedia: https://en.wikipedia.org/wiki/Information_retrieval
Creating labeling cluster 1... Finished!
Labeling:
Short label: Bibliometric Analysis
Long label: Bibliometric Analysis and Research Evaluation
Keywords: Bibliometric Analysis; Research Evaluation; Citation Impact Indicators; Journal Rankings; University Research Funding; Scientific Performance Measurement; Peer Review Bias; Interdisciplinary Research; Altmetrics; Publication Delay
Summary: This cluster of papers focuses on bibliometric analysis, research evaluation, and the use of citation impact indicators in assessing scientific performance. It covers topics such as journal rankings, university research funding systems, peer review bias, interdisciplinary research, altmetrics, and publication delay. The papers also delve into the challenges and implications of using various metrics to measure research productivity and impact.
Wikipedia: https://en.wikipedia.org/wiki/Bibliometrics
Creating labeling cluster 2... Finished!
Labeling:
Short label: Scientific Collaboration
Long label: Patterns and Impact of Scientific Collaboration
Keywords: Scientific Collaboration; Bibliometrics; Research Impact; International Collaboration; Co-authorship Networks; Research Productivity; Knowledge Production; Citation Analysis; Gender Differences; Technology Innovation
Summary: This cluster of papers focuses on the patterns and impact of scientific collaboration, bibliometrics, research productivity, and knowledge production. It explores topics such as international collaboration, co-authorship networks, research impact, gender differences in research productivity, and technology innovation. The papers analyze the relationship between innovation and subjective wellbeing, the growth of indexed journals in Latin America and the Caribbean, and the feasibility of text mining techniques to detect similarity between patent documents and scientific publications.
Wikipedia: https://en.wikipedia.org/wiki/Scientific_collaboration
Creating labeling cluster 3... Finished!
Labeling:
Short label: Open Innovation
Long label: Open Innovation and Collaborative Knowledge Sharing
Keywords: Open Innovation; Collaborative Consumption; Knowledge Sharing; R&D Cooperation; Innovation Performance; Environmental Innovation; SMEs; Crowdsourcing; Absorptive Capacity; User Innovations
Summary: This cluster of papers explores the concept of open innovation, collaborative consumption, and knowledge sharing in the context of R&D cooperation, innovation performance, environmental innovation, and SMEs. It delves into the dynamics of crowdsourcing, absorptive capacity, and user innovations, emphasizing the importance of collaborative networks for driving innovation.
Wikipedia: https://en.wikipedia.org/wiki/Open_innovation
Creating labeling cluster 4... Finished!
Labeling:
Short label: University-Industry Relations
Long label: University-Industry Relations and Technology Transfer
Keywords: University-Industry Relations; Technology Transfer; Entrepreneurial University; Innovation; Academic Entrepreneurship; Incubator; Spin-off Companies; Knowledge Transfer; Venture Capital; Science Parks
Summary: This cluster of papers explores the dynamics of university-industry relations, technology transfer, and the entrepreneurial activities of academic institutions. It delves into topics such as the impact of organizational practices on technology transfer, factors influencing university-industry collaboration, the role of academic entrepreneurship, and the effectiveness of incubators in fostering innovation and new venture creation.
Wikipedia: https://en.wikipedia.org/wiki/University-industry_collaboration
Creating labeling cluster 5... Finished!
Labeling:
Short label: Scholarly Communication
Long label: Scholarly Communication in the Digital Age
Keywords: Altmetrics; Open Access; Social Media; Bibliometrics; Research Impact; Scientific Collaboration; Academic Networking; Citation Analysis; Webometrics; Research Data Management
Summary: This cluster of papers explores the impact of digital technologies on scholarly communication, including the use of altmetrics, open access publishing, social media, and research data management. It also delves into topics such as citation analysis, scientific collaboration, academic networking, and webometrics.
Wikipedia: https://en.wikipedia.org/wiki/Scholarly_communication
Creating labeling cluster 6... Finished!
Labeling:
Short label: Innovation Studies
Long label: Determinants of National Innovative Capacity and Patent Analysis
Keywords: Innovation; Patent; Technology; Knowledge Flow; National Innovation System; Entrepreneurship; R&D Spillovers; Intellectual Property Rights; Science-Technology Linkage; Innovation Policy
Summary: This cluster of papers explores the determinants of national innovative capacity, patent analysis, technology as a complex adaptive system, knowledge flow, entrepreneurship, R&D spillovers, and the impact of intellectual property rights on innovation. It delves into the interplay between science and technology, innovation policy, and the role of national innovation systems in economic development.
Wikipedia: https://en.wikipedia.org/wiki/Innovation
Creating labeling cluster 7... Finished!
Labeling:
Short label: Research Impact Assessment
Long label: Assessing the Societal Impact of Research
Keywords: Research Impact Assessment; Scientific Collaboration; Innovation Policy; Interdisciplinary Research; Knowledge Transfer; Academic Entrepreneurship; Science Policy Interfaces; University-Industry Collaboration; Bibliometric Analysis; Societal Relevance
Summary: This cluster of papers focuses on assessing the societal impact of research, including topics such as research impact assessment, scientific collaboration, innovation policy, interdisciplinary research, knowledge transfer, academic entrepreneurship, science policy interfaces, university-industry collaboration, and bibliometric analysis. The papers explore the influence of funding agencies, international collaboration, gender differences in research collaboration, and the public understanding of science. They also discuss the challenges and opportunities in evaluating the effectiveness of science-policy interfaces and highlight the importance of societal relevance in research.
Wikipedia: https://en.wikipedia.org/wiki/Research_impact_assessment
Creating labeling cluster 8... Finished!
Labeling:
Short label: Bibliometric Analysis
Long label: Bibliometric Analysis in Scholarly Communication
Keywords: h-index; citation analysis; bibliometric indicators; research impact; co-authorship networks; Google Scholar; Scopus; scientific evaluation; publication output; academic collaboration
Summary: This cluster of papers focuses on the analysis of bibliometric indicators, such as the h-index, citation counts, and co-authorship networks, to evaluate research impact and scholarly communication. It compares data sources like Google Scholar and Scopus, explores the influence of self-citation, and discusses the challenges and benefits of using various metrics for scientific evaluation.
Wikipedia: https://en.wikipedia.org/wiki/Bibliometrics
Creating labeling cluster 9... Finished!
Labeling:
Short label: Technological Transitions
Long label: Technological Transitions as Evolutionary Reconfiguration Processes
Keywords: Sustainability Transitions; Innovation Systems; Multi-level Perspective; Intermediaries; Knowledge Diffusion; Policy Mixes; Business Models; Regional Innovation Systems; Socio-technical Regimes; Demand-side Policies
Summary: This cluster of papers explores technological transitions as evolutionary reconfiguration processes, focusing on sustainability transitions, innovation systems, multi-level perspective, intermediaries, knowledge diffusion, policy mixes, business models, regional innovation systems, socio-technical regimes, and demand-side policies.
Wikipedia: https://en.wikipedia.org/wiki/Technological_transition
Creating labeling cluster 10... Finished!
Labeling:
Short label: Bibliometric Mapping
Long label: Bibliometric Mapping and Interdisciplinary Research Analysis
Keywords: Bibliometric Mapping; Interdisciplinary Research; Scientific Literature; Citation Analysis; Co-citation Networks; Knowledge Structure; Science Mapping Software; Research Fronts; Author Cocitation Analysis; Topic Modeling
Summary: This cluster of papers focuses on the analysis and visualization of scientific literature through bibliometric mapping, citation analysis, and co-citation networks. It explores interdisciplinary research, knowledge structure, and the use of various software tools for science mapping. The papers also delve into author cocitation analysis, research fronts, and topic modeling to understand the evolution and connections within different research fields.
Wikipedia: https://en.wikipedia.org/wiki/Bibliometrics
Creating labeling for each cluster took 0h 0m 37s.
Writing labeling to file... Finished!
Writing labeling to file took 0h 0m 0s.
```

## License

The publicationclassificationlabeling package is distributed under the [MIT license](LICENSE).

## Issues

If you encounter any issues, please report them using the [issue tracker](https://github.com/CWTSLeiden/publicationclassificationlabeling/issues) on GitHub.

## Contribution

You are welcome to contribute to the development of the publicationclassificationlabeling package. Please follow the typical GitHub workflow: fork this repository and open a pull request to submit your changes.
Make sure that your pull request has a clear description and that the code has been properly tested.

## Development and deployment

The latest stable version of the source code is available in the [`main`](https://github.com/CWTSLeiden/publicationclassificationlabeling/tree/main) branch on GitHub. The most recent version of the source code, which may be under development, is available in the [`develop`](https://github.com/CWTSLeiden/publicationclassificationlabeling/tree/develop) branch.

### Compilation

To compile the source code of the publicationclassificationlabeling package, a [Java Development Kit](https://jdk.java.net) (version 8 or higher) needs to be installed on your system. Having [Gradle](https://www.gradle.org) installed is optional, as the [Gradle Wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.html) is included in this repository.

On Windows systems, the source code can be compiled as follows:

```
gradlew build
```

On Linux and macOS systems, use the following command:

```
./gradlew build
```

The compiled `class` files can be found in the directory `build/classes`.
The compiled `jar` file can be found in the directory `build/libs`.
The compiled `javadoc` files can be found in the directory `build/docs`.

The class `nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator` has a `main` method. After compiling the source code, the `PublicationClassificationLabelingCreator` tool can be run as follows:

```
java -cp build/libs/publicationclassificationlabeling-<version>.jar nl.cwts.publicationclassificationlabeling.run.PublicationClassificationLabelingCreator
```