PRD: Peer Rank and Discussion
Improve Large Language Model based Evaluations

Ruosen Li         Teerth Patel         Xinya Du*
Department of Computer Science, University of Texas at Dallas

Peer Rank Overview

Peer Discussion Overview

Abstract

Nowadays, the quality of responses generated by different modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs as a reference-free metric for open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias.
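
As a rough illustration of this baseline setup, the sketch below builds a generic pairwise-comparison prompt for an LLM judge and prepares it in both answer orders, a common way to probe the positional bias mentioned above; the wording and function name are hypothetical, not the template used in the paper.

    # Illustrative pairwise review prompt for an LLM judge (generic wording,
    # not the paper's exact template).
    def build_pairwise_prompt(question, answer_1, answer_2):
        return (
            "You are a helpful and impartial judge.\n"
            f"Question: {question}\n\n"
            f"Answer 1: {answer_1}\n\n"
            f"Answer 2: {answer_2}\n\n"
            "Which answer is better? Reply with '1', '2', or 'tie'."
        )

    # Querying the judge twice with the answers swapped helps expose
    # positional bias (favoring whichever answer appears first).
    prompt_ab = build_pairwise_prompt("What causes tides?", "<answer from model A>", "<answer from model B>")
    prompt_ba = build_pairwise_prompt("What causes tides?", "<answer from model B>", "<answer from model A>")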

We draw insights and lessons from the educational domain to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preference between two answers.
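
A minimal sketch of the peer-rank idea in Python, aggregating pairwise preferences from several peer reviewers into a ranking by win rate; the reviewer names, votes, and unweighted aggregation below are illustrative assumptions, not the paper's exact formulation.

    from collections import defaultdict

    # Hypothetical preferences: preferences[reviewer][(model_a, model_b)] = winner
    preferences = {
        "reviewer_1": {("model_A", "model_B"): "model_A",
                       ("model_A", "model_C"): "model_C",
                       ("model_B", "model_C"): "model_C"},
        "reviewer_2": {("model_A", "model_B"): "model_B",
                       ("model_A", "model_C"): "model_C",
                       ("model_B", "model_C"): "model_C"},
    }

    def peer_rank(preferences):
        wins = defaultdict(int)
        comparisons = defaultdict(int)
        for prefs in preferences.values():
            for (a, b), winner in prefs.items():
                wins[winner] += 1
                comparisons[a] += 1
                comparisons[b] += 1
        # Rank models by the fraction of their comparisons that they win.
        win_rate = {m: wins[m] / comparisons[m] for m in comparisons}
        return sorted(win_rate, key=win_rate.get, reverse=True)

    print(peer_rank(preferences))  # strongest model first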

Interestingly, PR can induce a relatively accurate ranking of models under the anonymous setting, where each model's name is not revealed. Our work provides space to explore evaluating models that are hard for humans to compare.

View Discussions (between two LLM reviewers)

Discussion Template

Peer Rank Elo Scores

GPT-4 exhibits self-enhancement bias throughout the process, while our PR correlates well with human pairwise comparisons.
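
For reference, the snippet below applies a standard Elo update to a hypothetical log of pairwise battles; the exact Elo configuration in the paper (initial ratings, K-factor, battle ordering, tie handling) may differ from this sketch.

    def elo_update(rating_a, rating_b, a_wins, k=32):
        # Standard Elo: expected score from the rating gap, then a K-weighted correction.
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        score_a = 1.0 if a_wins else 0.0
        rating_a += k * (score_a - expected_a)
        rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
        return rating_a, rating_b

    # Hypothetical battle log: (model_a, model_b, winner)
    battles = [("model_A", "model_B", "model_A"),
               ("model_B", "model_C", "model_B"),
               ("model_A", "model_C", "model_A")]

    ratings = {"model_A": 1000.0, "model_B": 1000.0, "model_C": 1000.0}
    for a, b, winner in battles:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], winner == a)
    print(ratings)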

Pairwise win rate heatmaps

We compare our PR evaluation with GPT-4-based evaluation and the Chatbot Arena leaderboard. Our evaluation correlates better with the Arena win rates, especially for weaker models.

Reviewer Weights Pie Chart

Peer rank: final-round weights assigned to each reviewer by PR All (weighted).
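
As a rough illustration of how such reviewer weights could arise, the sketch below alternates between scoring models by weighted win count and re-deriving reviewer weights from the normalized scores; this simplified update rule is an assumption for illustration, not the exact PR All (weighted) algorithm.

    def weighted_peer_rank(votes, models, rounds=10):
        # votes[reviewer][(model_a, model_b)] = winner; reviewers are themselves candidate models.
        weights = {m: 1.0 / len(models) for m in models}
        scores = {m: 0.0 for m in models}
        for _ in range(rounds):
            # Score each model by its weighted win count across all reviewers.
            scores = {m: 0.0 for m in models}
            for reviewer, prefs in votes.items():
                for pair, winner in prefs.items():
                    scores[winner] += weights[reviewer]
            # Re-derive reviewer weights from the normalized scores,
            # so stronger models get a larger say in the next round.
            total = sum(scores.values()) or 1.0
            weights = {m: scores[m] / total for m in models}
        return scores, weights

    models = ["model_A", "model_B", "model_C"]
    votes = {
        "model_A": {("model_A", "model_B"): "model_A",
                    ("model_A", "model_C"): "model_A",
                    ("model_B", "model_C"): "model_C"},
        "model_B": {("model_A", "model_B"): "model_A",
                    ("model_A", "model_C"): "model_C",
                    ("model_B", "model_C"): "model_C"},
        "model_C": {("model_A", "model_B"): "model_A",
                    ("model_A", "model_C"): "model_C",
                    ("model_B", "model_C"): "model_C"},
    }
    scores, weights = weighted_peer_rank(votes, models)
    print(sorted(models, key=scores.get, reverse=True))
    print(weights)  # final-round reviewer weights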


BibTeX


        @article{li2023prd,
          title={PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations},
          author={Li, Ruosen and Patel, Teerth and Du, Xinya},
          journal={arXiv preprint arXiv:2307.02762},
          year={2023}
        }