Study finds that GPT-4 outperforms human analysts at predicting future earnings changes from financial statements



OpenAI's large language model (LLM) GPT-4 has already been shown to exceed human capabilities in some areas, such as outperforming human college students on morality tests and exploiting real vulnerabilities by reading security advisories. A new study now reports that GPT-4 can analyze financial statements at least as accurately as a professional analyst.

Financial Statement Analysis with Large Language Models by Alex Kim, Maximilian Muhn, Valeri V. Nikolaev :: SSRN
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311



The future of financial analysis: How GPT-4 is disrupting the industry, according to new research | VentureBeat
https://venturebeat.com/ai/the-future-of-financial-analysis-how-gpt-4-is-disrupting-the-industry-according-to-new-research/

Research shows OpenAI's GPT-4 'outperforms' humans in financial statement analysis, but skeptics aren't convinced - SiliconANGLE
https://siliconangle.com/2024/05/26/research-shows-openais-gpt-4-outperforms-humans-financial-statement-analysis-skeptics-arent-convinced/

A research group at the University of Chicago Booth School of Business conducted a study to test how well an LLM can analyze financial statements. In the study, only corporate financial statements were given to the LLM, which was asked to predict the direction of future earnings. Even when it was given only anonymized balance sheets and income statements with no accompanying context, GPT-4 predicted more accurately than human analysts.

The research group said, 'We found that the predictive accuracy of the LLM is on par with the performance of a narrowly trained state-of-the-art machine learning (ML) model,' and 'The LLM's predictions do not stem from its training memory. Instead, we found that the LLM generates useful narrative insights about a company's future performance.' They praised the LLM's ability to read financial statements.

A graph from the study shows the accuracy of corporate earnings forecasts on the left and the F-score of corporate earnings forecasts on the right. In the University of Chicago study, GPT-4 outperformed human analysts on both accuracy and F-score.
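To make the two metrics concrete, here is a minimal sketch of how accuracy and F-score can be computed for up/down earnings predictions. The function name and the toy labels are illustrative assumptions, not data or code from the study.

```python
def accuracy_and_f1(actual, predicted):
    """Return (accuracy, F1) for binary labels: 1 = earnings increase, 0 = decrease."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly predicted increases
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted increase, was decrease
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed increases
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, f1


# Toy example (made-up labels, not results from the paper).
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(actual, predicted)
```

Accuracy counts every correct call, while the F-score balances precision and recall on the "increase" class, which is why the two are reported side by side.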



A key point in this study is the use of so-called Chain of Thought (CoT) prompts, which elicit complex reasoning by having the model work through intermediate reasoning steps.

With CoT prompts, GPT-4 can emulate the analytical process of a financial analyst: identifying trends, calculating ratios, and integrating the information to form a prediction. The graph shows that the score with CoT prompts, 'GPT (with CoT)', is higher than the score without them, 'GPT (without CoT)'. With CoT prompts, GPT-4's prediction accuracy is about 60%, compared with 53-57% for human analysts, so the CoT-prompted model predicts the direction of earnings more accurately than the analysts do.
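As a rough illustration of the "identifying trends, calculating ratios" steps, the sketch below computes a year-over-year change and a few common ratios from statement line items. The field names, the choice of ratios, and the numbers are assumptions made for this example; the article does not say which ratios GPT-4 actually calculated.

```python
def pct_change(current, previous):
    """Year-over-year change as a fraction (0.2 means +20%)."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / abs(previous)


def basic_ratios(stmt):
    """Compute a few textbook ratios from a dict of line items (hypothetical keys)."""
    return {
        "operating_margin": stmt["operating_income"] / stmt["revenue"],
        "current_ratio": stmt["current_assets"] / stmt["current_liabilities"],
        "asset_turnover": stmt["revenue"] / stmt["total_assets"],
    }


# Made-up figures for an anonymized company in period t.
period_t = {
    "revenue": 1200.0,
    "operating_income": 180.0,
    "current_assets": 500.0,
    "current_liabilities": 250.0,
    "total_assets": 2000.0,
}
revenue_growth = pct_change(period_t["revenue"], 1000.0)  # prior-period revenue of 1000
ratios = basic_ratios(period_t)
```

An analyst, or a CoT-prompted model, would then weigh figures like these together rather than rely on any single one.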

The research group concluded, 'Taken together, our findings suggest that LLMs may play a central role in decision-making.' The advantages of LLMs, the research group noted, are their 'vast knowledge base' and 'their ability to recognize patterns and business concepts, which allows them to perform intuitive reasoning even with incomplete information.'

A diagram from the study shows the process of having the LLM predict a company's earnings. The balance sheet and income statement are fed into GPT-4 Turbo, which makes its prediction under a CoT prompt. Specifically, the prompt walks the model through trend analysis, ratio analysis, and theoretical interpretation.
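The pipeline described above can be sketched as a small prompt builder. This is only an illustration under stated assumptions: the wording of the steps, the `t` / `t-1` period labels, and the function names are hypothetical and are not the prompt used in the paper. The code only assembles the text and does not call any model API.

```python
# Hypothetical step wording; the study's actual prompt is not reproduced here.
STEPS = [
    "Identify notable trends in the line items across periods.",
    "Compute key financial ratios, stating each formula before calculating.",
    "Interpret the trends and ratios in economic terms.",
    "State whether earnings will increase or decrease next period, with a brief rationale.",
]


def format_statement(title, rows):
    """Render line items as plain text with generic period labels (no company name or dates)."""
    lines = [title, "item | t | t-1"]
    for item, (current, previous) in rows.items():
        lines.append(f"{item} | {current} | {previous}")
    return "\n".join(lines)


def build_cot_prompt(balance_sheet, income_statement):
    """Assemble a chain-of-thought style prompt from two anonymized statements."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, 1))
    return "\n\n".join([
        "You are a financial analyst. The statements below are anonymized.",
        format_statement("Balance sheet", balance_sheet),
        format_statement("Income statement", income_statement),
        "Work through these steps in order:\n" + numbered,
    ])


# Made-up numbers for illustration only.
prompt = build_cot_prompt(
    {"Total assets": (2000, 1800)},
    {"Revenue": (1200, 1000)},
)
```

Listing the steps explicitly is what distinguishes this from a plain "predict the direction" prompt: the model is asked to write out its intermediate analysis before committing to an answer.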



Analyzing numbers has long been one of the major challenges for language models, which is why technology media outlet VentureBeat noted that 'these research results are noteworthy.'

Alex Kim, one of the authors of the paper, said, 'Numerical domains are one of the most challenging domains for language models. In this domain, language models must perform calculations, interpret like humans, and make complex judgments. LLMs are known to be effective in text tasks, but they have been thought to lack the deep numerical reasoning and flexibility of the human mind when it comes to understanding numbers.' He emphasized that GPT-4's CoT-prompted earnings predictions outperformed those of conventional LLMs.

VentureBeat noted, 'The ability of a general-purpose language model to match the performance of a domain-specific ML model and exceed human experts shows the disruptive potential of LLMs in the financial sector.' Human expertise and judgment are unlikely to be replaced by AI any time soon, but powerful tools like GPT-4 could significantly enhance and streamline the work of analysts, and the field of financial statement analysis may change considerably over the next few years.

in Software, Posted by logu_ii