This study evaluates whether large language models can substitute for human survey respondents. I replicate analyses from a representative household survey (the Italian Survey of Consumer Expectations, ISCE) across three domains: behavioral reactions to information treatments, the formation of economic expectations, and the prediction of persistent household traits. Using gpt-4o-mini together with survey data collected after the model's training cutoff to mitigate contamination bias, I find that the model reproduces certain aggregate patterns but systematically diverges from observed human behavior. It fails to respond appropriately to information treatments, does not capture demographic heterogeneity in risk perceptions, and does not exhibit prudence. Incorporating demographic embeddings further reduces alignment, indicating that the model struggles to simulate human decision processes. However, the model attains 74% accuracy in predicting income categories and 72% in predicting consumption levels, suggesting potential as an auxiliary tool for imputing persistent traits rather than as a replacement for human respondents.
Keywords: GPT; large language models; survey experiments
JEL Classification: C81, C83, C91, D84, O33