
Blue_team_MechanisticInterpretability

The goal of mechanistic interpretability is to take a trained model and reverse-engineer, from its weights, the algorithms the model learned during training. We do not know how these algorithms work, nor could we write them ourselves.
Our project investigates agent behavior under different reinforcement learning methods, optimizing the agent's strategy to gain as much reward as possible.
This notebook presents methods for pruning different reinforcement learning algorithms that produce different agent behavior in different models, to facilitate research into understanding these strategies. We implement and visualize Dynamic Programming, Monte Carlo (MC), and Temporal Difference (TD) algorithms and compare their results (the core MC and TD update rules are sketched below). The pruning algorithm takes a given dataset with two cases and finds the shortest path, depending on how the agent learns from its behavior under the given transition probabilities.
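For orientation, the tabular update rules behind MC and TD(0) differ in what they use as the target: MC moves V(s) toward the full observed return, while TD(0) bootstraps from the current estimate of the next state. This is an illustrative sketch with assumed step size and discount values, not code from the notebook.

```python
# Illustrative tabular value updates; alpha and gamma are assumed values.
alpha, gamma = 0.1, 0.99

def mc_update(V, state, G):
    """Monte Carlo: move V(state) toward the full observed return G."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state):
    """TD(0): bootstrap from the current estimate of the next state."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

V = [0.0] * 5                       # value table for a 5-state example
mc_update(V, 0, G=-3.0)             # after observing a full-episode return
td0_update(V, 1, reward=-1.0, next_state=2)
```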

Algorithms: model-based and model-free
- 1: Policy & Value Iteration (see the sketch after this list)
- 2: Monte Carlo
- 3: Temporal Difference
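As a concrete reference for item 1, here is a minimal value-iteration sketch on a hypothetical four-state chain MDP; the transition function, per-step reward of -1, and terminal state are illustrative assumptions, not the environment used in the notebook.

```python
import numpy as np

# Hypothetical toy MDP: 4 states in a line, actions left/right,
# reward -1 per step, state 3 is terminal. Not the notebook's dataset.
n_states, n_actions, gamma, theta = 4, 2, 0.99, 1e-8

def step(s, a):
    """Deterministic transition: a=0 moves left, a=1 moves right."""
    if s == 3:                       # terminal state absorbs with 0 reward
        return s, 0.0
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, 3)
    return s2, -1.0

V = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        # Bellman optimality backup: best action value from state s
        q = [r + gamma * V[s2] for s2, r in (step(s, a) for a in range(n_actions))]
        v_new = max(q)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:                # converged when updates are tiny
        break

# Greedy policy extracted from the converged value function
policy = [int(np.argmax([step(s, a)[1] + gamma * V[step(s, a)[0]]
                         for a in range(n_actions)])) for s in range(n_states)]
print("V:", V, "policy:", policy)
```

On this chain the greedy policy moves right toward the terminal state, which is the "shortest path" behavior the notebook compares across algorithms.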

How to use it?

Download the notebook and run it with Jupyter or Google Colab.

Dataset

Set the correct path before loading the dataset. The data was generated randomly; you can choose any row by its ID (Neptune). Each row has two cases, so feel free to play with it and see how the agent behaves.
data
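A minimal sketch of loading the dataset and selecting a row by ID; the file path and the `ID` column name are assumptions, so adjust them to the actual schema of the data folder.

```python
import pandas as pd

# Hypothetical file name and column name; adjust to your local path
# and the actual dataset schema.
df = pd.read_csv("data/dataset.csv")

neptune_id = 42                      # pick any ID present in the data
row = df[df["ID"] == neptune_id]     # assumes the ID column is named "ID"
print(row)                           # inspect the two cases for this row
```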
