Abstract
Despite initial research on the biases and perceptions of Large Language Models (LLMs), we lack evidence on how LLMs evaluate occupations, especially in comparison to human evaluators. In this paper, we present a systematic comparison of occupational evaluations by GPT-4 with those from a recent, in-depth, high-quality survey of human respondents in the United Kingdom. Covering the full ISCO-08 occupational landscape, with 580 occupations and two distinct metrics (prestige and social value), our analysis finds that GPT-4 and human scores are highly correlated across all ISCO-08 major groups. In absolute terms, GPT-4 scores are more generous than those of the human respondents. At the same time, GPT-4 substantially under- or overestimates the occupational prestige and social value of many occupations, particularly emerging digital and stigmatized occupations.