Exploring Psychometric Evaluation of LLMs: Our CHI 2025 HEAL Workshop Paper

Reflections on presenting our research on using psychological scales to evaluate LLM experiences

Last week at the CHI 2025 HEAL (Human-centered Evaluations of LLMs) workshop in Yokohama, I presented our work-in-progress paper exploring whether psychometric scales can detect experiential differences between LLM versions. Our research investigates a question I’ve been grappling with: can established psychological measurement tools adequately capture how people experience different versions of LLMs, or do they miss the nuanced ways humans interact with these increasingly sophisticated systems?

For this study, we created a journaling application called Journal Kernel, powered by two different Claude versions (Claude 2.0 and Claude 3.5 Sonnet). Thirty-nine participants used the system, and we measured their experiences with scales for reflection quality, emotional awareness, motivation, and basic psychological needs.

Our initial findings showed small but consistent differences favoring the newer model, with the most pronounced gains in interpersonal dimensions like relatedness and autonomy. These differences didn’t reach conventional thresholds for statistical significance, so they should be read cautiously, but the pattern suggests that psychometric scales may be able to detect meaningful experiential differences between LLM versions.
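For readers curious how a comparison like this can be quantified, here is a minimal sketch in Python. It is not the analysis from our paper. It assumes a between-subjects comparison of one scale score per participant, and the function names and the scores in the usage example are made-up placeholders, not study data. It reports an effect size (Cohen's d) alongside a permutation-test p-value, because a small sample can show a consistent difference without crossing a significance threshold.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        diff = abs(mean(pooled[len(a):]) - mean(pooled[:len(a)]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Placeholder scores for illustration only (NOT data from the study).
    older_model = [4.0, 5.0, 3.5, 4.5, 5.0, 4.0]
    newer_model = [4.5, 5.5, 4.0, 5.0, 5.0, 4.5]
    print("Cohen's d:", round(cohens_d(older_model, newer_model), 2))
    print("p-value:", round(permutation_p_value(older_model, newer_model), 3))
```

Reporting the effect size next to the p-value makes it easier to tell a difference that is small from one that is merely underpowered.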

Discussing these results with fellow experts at the workshop was invaluable. We explored how to develop more sensitive measurement tools specifically for human-LLM interaction, how to balance context-specific and generalized evaluation frameworks, and how these approaches might integrate into actual LLM development cycles.

I’m excited to continue this research with a more rigorous methodology: a mixed-methods design, tighter experimental controls, and tests of generalizability across different LLM architectures and domains. As LLMs become increasingly embedded in our daily lives, it becomes crucial to develop robust methods for evaluating not just their technical capabilities but also their impact on human experience. I believe psychological perspectives can offer valuable insights for this emerging field.

The position paper is available here.