This repository collects projects that use games as environments to operationalize and explore AI safety risks and scenarios.
Currently a work in progress.
The first project uses the card game Cheat! (aka I Doubt It!) as a toy model of deception. The plan is to train a decision-transformer-based player and then interpret it, looking for evidence of circuits that control "deceptive behavior" (i.e. the decision about when to play a cheating card), opponent modeling, or other behaviors of interest.
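To make "playing a cheating card" concrete, here is a minimal sketch of how a single move in Cheat! might be labeled as honest or deceptive. It assumes the common rule that cards are played face down together with an announced rank, and that a play counts as a cheat if any played card differs from that rank. The names `Play`, `is_cheat`, and `challenge_succeeds` are hypothetical and chosen for illustration; they are not taken from this repository's code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Play:
    """One move: the cards actually put down, plus the rank announced (hypothetical structure)."""
    cards: Tuple[str, ...]  # ranks of the cards actually played face down, e.g. ("7", "K")
    claimed_rank: str       # rank the player announces to the table, e.g. "7"


def is_cheat(play: Play) -> bool:
    """Return True if any played card differs from the claimed rank."""
    return any(card != play.claimed_rank for card in play.cards)


def challenge_succeeds(play: Play) -> bool:
    """Calling "Cheat!" succeeds exactly when the challenged play was a cheat."""
    return is_cheat(play)


# Example moves: one honest play and one bluff.
honest = Play(cards=("7", "7"), claimed_rank="7")
bluff = Play(cards=("7", "K"), claimed_rank="7")
```

A per-move label like `is_cheat(play)` is one possible target when probing a trained player for the decision to deceive; how the project actually encodes moves and labels may differ.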