A simple application to collect Github repositories as input datasets of HiGitClass.
These instructions will set the project up for use or development.
- Java Development Kit 11+
- Java Runtime Environment 11+
At the root of the project directory, run the following command:
./gradlew shadowJar
A packaged JAR will be generated in build/libs
.
java -jar build/libs/HiGitClass-DatasetCollector.jar
# or
cd build/libs && java -jar HiGitClass-DatasetCollector.jar
Run java -jar HiGitClass-DatasetCollector.jar --help
for explanation of options, or read below.
There are three subcommands supported by this application.
This subcommand fetches a list of repositories from GitHub based on a given query.
Required Arguments:
NUM_TO_FETCH
: Number of repositories to fetch
Optional Arguments:
--token
: Specifies the Github Personal Access Token, if any.- This may be useful if fetching from a large amount of repositories, as it raises the limit of available requests
--query
: Specifies a custom query to execute.- Do not use this option to override the sort field or order!
--output
Specifies a file to output the JSON dataset to.--parallel
: Download README and repo topics in parallel--pretty
: Output pretty JSON
This subcommand transforms the output.json
emitted by the fetch
command to the format accepted by HiGitClass, which
is a list of JSON objects delimited by newlines. In addition, this subcommand also performs text preprocessing and emits
dataset.txt
required by ESim for embedding discovery.
This step can be run immediately after the data has been collected by fetch
, or deferred until all entries in the
dataset has been labeled.
Required Arguments:
[INPUT]
: The input file emitted byfetch
Optional Arguments:
--output-dir
: The directory to outputHier.json
anddataset.txt
This subcommand outputs labels.txt
based on the given transformed dataset.
This step should be run only after all entries in the dataset has been labeled.
Required Arguments:
[INPUT]
: The path to the input transformed JSON- Note that the file must be transformed (by running through
transform-dataset
)
- Note that the file must be transformed (by running through
Optional Arguments:
--output
: The directory to outputlabels.txt