GPT-4 exhibits self-enhancement bias throughout the process, whereas our PR correlates well with human pairwise comparisons.
We compare our PR evaluation with GPT-4-based evaluation and the Chatbot Arena leaderboard. Our evaluation correlates better with Arena win rate, especially on weaker models.
Final-round Peer Rank weights of each reviewer, as assigned by PR All (weighted).
@article{li2023prd,
title={PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations},
author={Li, Ruosen and Patel, Teerth and Du, Xinya},
journal={arXiv preprint arXiv:2307.02762},
year={2023}
}