Skip to content

August-murr/Harmlessness_Self_Evaluator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Mistral Harmlessness Evaluator

This module is a part of the Mistral Self-Alignment project, aimed at aligning the Mistral model to be harmless and prevent dangerous and harmful responses. For more details, explore related GitHub repositories like The Lab.

Overview

The Mistral Harmlessness Evaluator is designed to test the harmlessness of the trained Mistral 7b Peft Adapters. The evaluation is based on a one-shot prompt test available here, resulting in an 82% agreement with labeled data.

Usage

Installation

Clone the repository and install the necessary dependencies:

pip install requirements.txt

Please note that the requirements.txt file was written in the context of running the module in a Kaggle notebook and may not contain all the packages necessary for a local notebook.

Running the Script

Use the following command to run the script:

python harmlessness_self_evaluator.py \
    --model_path "path/to/mistral" \
    --peft_path "path/to/peft_adapter" \
    --num_of_eval_prompts 200

In the example above:

  • --model_path should be the file path or the Hugging Face Hub's ID of Mistral 7b.
  • --peft_path should be the file path or the Hugging Face Hub's ID of the Peft Adapter.
  • --num_of_eval_prompts is the number of red team prompts in the train and test dataset for the model to be evaluated on. Keep in mind that the number of red team prompts is limited.

Note

This module was initially developed for personal use. If you find it useful, feel free to clone, modify, and adapt it to your use case.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages