
# πŸš€ Introduction

**NovelQA** is a benchmark for evaluating the long-text understanding and retrieval abilities of LLMs. The dataset was constructed by manually collecting questions and answers about English novels that are above 50,000 words. Moreover, most of the questions are designed either to focus on minor details in the novel or to require information spanning multiple chapters, both of which are inherently challenging for LLMs. We welcome submissions from any LLM with long-context abilities!

The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs.

We create this [πŸ† Leaderboard Website](https://novelqa.github.io/) for presenting the models top scores on NovelQA. We encourage participants to refer to this leaderboard for your ranking.
We create this [πŸ† Leaderboard Website](https://novelqa.github.io/) to present the model's top scores on NovelQA. We encourage participants to refer to this leaderboard for your ranking.


# πŸ“ Dataset

# πŸ† Evaluation & Submission

Due to confidentiality considerations, the submission procedure involves multiple steps across several platforms. An overview of the submission process is shown in the following flowchart.

```mermaid
graph LR
A[[πŸ€— Huggingface]] ----> B[Model Output]
B ----> C[[βš–οΈ Codabench]]
C ----> D[[πŸ—³οΈ Google Form]]
D ----> E[[πŸ†Leaderboard Website]]
```

Our input data (including the novel, question, and options) is open-sourced on the [πŸ€— Huggingface]() platform. Participants who wish to evaluate their model should first download the data through Huggingface. You may either run the generative subtask, which takes only the novel and question as input, or the multiple-choice subtask, which additionally takes the options. Warning: the input data are for internal evaluation use only. Please do not distribute the input data publicly online. The competition hosts are not responsible for any violation of novel copyright caused by participants distributing the input data publicly online.
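As a rough sketch of this step, the snippet below loads the data and builds a prompt for each subtask. The dataset ID and the field names (`book`, `question`, `options`) are assumptions for illustration; the Huggingface page above documents the actual schema.

```python
# A minimal sketch, assuming a hypothetical dataset ID and schema;
# consult the Huggingface page for the real ones.
from datasets import load_dataset

data = load_dataset("NovelQA/NovelQA", split="test")  # hypothetical ID

for example in data:
    novel = example["book"]          # full novel text (assumed field)
    question = example["question"]   # question string (assumed field)

    # Generative subtask: the model sees only the novel and the question.
    gen_prompt = f"{novel}\n\nQuestion: {question}\nAnswer:"

    # Multiple-choice subtask: the options are appended to the input.
    options = "\n".join(example["options"])  # assumed list of strings
    mc_prompt = (
        f"{novel}\n\nQuestion: {question}\n"
        f"Options:\n{options}\nAnswer with the letter of one option:"
    )
```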

After running your model on the input data and obtaining its output, you are expected to submit that output to the [βš–οΈ Codabench]() platform for evaluation. This procedure preserves the confidentiality of the gold answers. The Codabench platform automatically evaluates your results and generates the accuracy score, on average within 5 minutes. If your submission fails, or if your score is markedly above average, you may email us your results and we will run the evaluation manually for you. For details about the Codabench platform and the evaluation procedure, see the instructions on our Codabench page.
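For illustration, here is a minimal sketch of packaging model outputs for upload. The file name and the `{question_id: answer}` JSON layout are assumptions; the Codabench page specifies the required format.

```python
# A minimal sketch, assuming a one-file JSON submission keyed by
# question ID; the Codabench page documents the actual format.
import json

# In practice these answers come from running your LLM over the
# prompts built in the previous step. The IDs shown here are made up.
model_outputs = {
    "Q0001": "B",
    "Q0002": "The letter was hidden in the piano bench.",
}

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(model_outputs, f, ensure_ascii=False, indent=2)
```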

If you evaluate your results through Codabench, you are further expected to submit your accuracy score to us through the [πŸ—³οΈ Google Form]() so that we can update it on our [πŸ† Leaderboard](). The leaderboard presents the top 7 models on each of the two subtasks separately.

# πŸ“œ License
