TriviaHG is an extensive dataset crafted specifically for hint generation in question answering. Unlike conventional datasets, TriviaHG provides up to 10 hints per question rather than direct answers. This approach encourages users to engage in critical thinking and reasoning to derive the solution. Covering diverse question types across varying difficulty levels, the dataset is partitioned into training, validation, and test sets. These subsets facilitate the training and fine-tuning of large language models for generating high-quality hints.
TriviaHG comprises ⬇️Training, ⬇️Validation, and ⬇️Test subsets. You can access and download each subset by clicking on its respective link.
The dataset is distributed as JSON files: training.json, validation.json, and test.json for the training, validation, and test splits, respectively. Each record has the following structure:
[
{
"Q_ID": "",
"Question": "",
"Hints": [ ],
"Hints_Sources": [ ],
"Snippet": "",
"Snippet_Sources": [ ],
"ExactAnswer": [ ],
"MajorType": "",
"MinorType": "",
"Candidates_Answers": [ ],
"Q_Popularity": { },
"Exact_Answer_Popularity": { },
"H_Popularity": [ ],
"Scores": [ ],
"Convergence": [ ],
"Familiarity": [ ]
}
]
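For a quick look at the data, a minimal loading sketch is shown below; it assumes training.json sits in the working directory and that each hint entry can be printed as text.

```python
import json

# Load the training split; field names follow the schema above.
with open("training.json", "r", encoding="utf-8") as f:
    records = json.load(f)

sample = records[0]
print("Question:    ", sample["Question"])
print("Exact answer:", sample["ExactAnswer"])
print("Num. hints:  ", len(sample["Hints"]))
for hint in sample["Hints"][:3]:
    print(" -", hint)
```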
| | Training | Validation | Test |
|---|---|---|---|
| Num. of Questions | 14,645 | 1,000 | 1,000 |
| Num. of Hints | 140,973 | 9,638 | 9,619 |
The Framework directory houses the essential files for the hint generation framework. Notably, you will find Framework.ipynb, a Jupyter Notebook tailored for executing and exploring the framework's code. Utilize 🌐Google Colab to seamlessly run this notebook and delve into the hint generation process.
We have finetuned several large language models, including LLaMA 7b, LLaMA 13b, and LLaMA 70b, on the TriviaHG dataset. These models are not available for direct download but can be accessed via API functions provided by AnyScale.com. Below are the IDs for the finetuned models:
- LLaMA 7b Finetuned: meta-llama/Llama-2-7b-chat-hf:Hint_Generator:X6odC0D
- LLaMA 13b Finetuned: meta-llama/Llama-2-13b-chat-hf:Hint_Generator:ajid9Dr
- LLaMA 70b Finetuned: meta-llama/Llama-2-70b-chat-hf:Hint_Generator:NispySP
Using CURL:
export ENDPOINTS_AUTH_TOKEN=YOUR_API_KEY
curl "https://api.endpoints.anyscale.com/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ENDPOINTS_AUTH_TOKEN" \
-d '{
"model": "meta-llama/Llama-2-70b-chat-hf:Hint_Generator:NispySP",
"messages": [
{"role": "user", "content": "Generate 10 hints for the following question. Question: Which country has the highest population?"}
],
"temperature": 0.0
}'
Or using Python:
import os
import requests
s = requests.Session()
api_base = "https://api.endpoints.anyscale.com/v1"
# Read the API key from the environment (as set in the cURL example above);
# replace with long-lived credentials for production.
token = os.environ["ENDPOINTS_AUTH_TOKEN"]
url = f"{api_base}/chat/completions"
body = {
"model": "meta-llama/Llama-2-70b-chat-hf:Hint_Generator:NispySP",
"messages": [
{"role": "user", "content": "Generate 10 hints for the following question. Question: Which country has the highest population?"}
],
"temperature": 0.0
}
with s.post(url, headers={"Authorization": f"Bearer {token}"}, json=body) as resp:
print(resp.json())
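Assuming the endpoint returns the standard OpenAI-compatible chat completion format, the generated hints should appear as text under resp.json()["choices"][0]["message"]["content"].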
The Human Evaluation - Answering folder contains the Excel files used to gather responses from six human participants. Each participant was assigned ten distinct Excel files, each containing a set of ten questions. The table below outlines the types of questions included in the Excel files, along with corresponding statistics collected from participants. Its column headers follow the format {Difficulty}-{Model}, where B, F, and V represent Bing, LLaMA 7b Finetuned, and LLaMA 7b Vanilla, respectively.
Question Type | Hard-B | Hard-F | Hard-V | Medium-B | Medium-F | Medium-V | Easy-B | Easy-F | Easy-V |
---|---|---|---|---|---|---|---|---|---|
ENTITY | 5 / 9 | 5 / 9 | 4 / 9 | 8 / 8 | 6 / 8 | 4 / 8 | 8 / 8 | 8 / 8 | 6 / 8 |
HUMAN | 2 / 9 | 0 / 9 | 0 / 9 | 5 / 8 | 1 / 8 | 0 / 8 | 6 / 8 | 6 / 8 | 4 / 8 |
LOCATION | 0 / 9 | 0 / 9 | 0 / 9 | 7 / 8 | 5 / 8 | 2 / 8 | 7 / 8 | 6 / 8 | 4 / 8 |
OTHER | 3 / 9 | 2 / 9 | 0 / 9 | 5 / 8 | 2 / 8 | 0 / 8 | 8 / 8 | 7 / 8 | 7 / 8 |
The Human Evaluation - Quality folder encompasses ten Excel files, each containing human annotation values assigned to 2,791 hints across quality attributes such as relevance, readability, ambiguity, convergence, and familiarity. These attributes are essential markers in assessing the overall quality and effectiveness of the generated hints. The table below summarizes the average score attained for each quality attribute, offering insight into the perceived quality of the hints as judged by human participants.
Method | Match | Readability | Ambiguity | Convergence | Familiarity |
---|---|---|---|---|---|
Copilot | 4.09 | 4.67 | 1.51 | 2.23 | 2.47 |
LLaMA 7b - Finetuned | 4.01 | 4.70 | 1.56 | 2.20 | 2.41 |
LLaMA 7b - Vanilla | 3.64 | 4.47 | 1.87 | 2.12 | 2.02 |
The Model Performance folder provides the generated hints along with their evaluation values for the convergence (HICOS) and familiarity (HIFAS) quality attributes. The table below presents a comparative analysis of the results obtained from various models, shedding light on their respective performance in terms of HICOS and HIFAS. This comparison is a useful reference for gauging the effectiveness of each model's hint generation capabilities and for informing further refinements of the generation process.
Model | HICOS | HIFAS |
---|---|---|
LLaMA_7b_Vanilla | 0.307 | 0.833 |
LLaMA_13b_Vanilla | 0.350 | 0.929 |
LLaMA_7b_Finetuned | 0.400 | 0.890 |
LLaMA_13b_Finetuned | 0.410 | 0.881 |
LLaMA_70b_Vanilla | 0.425 | 0.941 |
GPT_3.5 | 0.438 | 0.911 |
WizardLM_70b | 0.446 | 0.942 |
Gemini | 0.455 | 0.911 |
LLaMA_70b_Finetuned | 0.494 | 0.862 |
GPT_4_turbo | 0.525 | 0.875 |
Copilot | 0.540 | 0.946 |
The Entities folder contains a JSON file with 50,000 entities used by the Interquartile Range (IQR) method to determine Q1 and Q3 for normalization. Computing these quartiles over such a large entity set makes the normalization robust to variation and outliers, improving the reliability of the resulting scores.
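For intuition only, the sketch below shows one way Q1 and Q3 computed over entity values could be used to normalize a score into [0, 1]; the file name, the per-entity popularity field, and the clipping formula are assumptions for illustration, not the repository's actual implementation.

```python
import json
import statistics

# Hypothetical file and field names, for illustration only.
with open("Entities/entities.json", "r", encoding="utf-8") as f:
    entities = json.load(f)

# Assume each entity record carries a numeric popularity value.
values = sorted(e["popularity"] for e in entities)

# Quartile boundaries of the value distribution (Q1, median, Q3).
q1, _, q3 = statistics.quantiles(values, n=4)

def iqr_normalize(x: float) -> float:
    """Scale x into [0, 1] using Q1/Q3, clipping values outside the range."""
    if q3 == q1:
        return 0.0
    return min(max((x - q1) / (q3 - q1), 0.0), 1.0)

print(f"Q1={q1:.2f}, Q3={q3:.2f}, normalized median={iqr_normalize(values[len(values) // 2]):.2f}")
```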
Jamshid Mozafari, Anubhav Jangra, and Adam Jatowt. 2024. TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24). Association for Computing Machinery, New York, NY, USA, 2060–2070. https://doi.org/10.1145/3626772.3657855
@inproceedings{10.1145/3626772.3657855,
author = {Mozafari, Jamshid and Jangra, Anubhav and Jatowt, Adam},
title = {TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657855},
doi = {10.1145/3626772.3657855},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2060–2070},
numpages = {11},
keywords = {hint generation, large language models, question answering},
location = {Washington DC, USA},
series = {SIGIR '24}
}