Summary: Researchers found that ChatGPT could assess social interactions in videos and images almost as accurately as humans. The AI's evaluations of social features like cooperation, hostility, and body movements were even more consistent than those of a single person.
Using AI instead of human raters saved over 10,000 work hours, offering a cost-effective and efficient solution for large-scale neuroscience studies. Beyond research, the technology could transform healthcare monitoring, marketing analysis, and security systems by automating social evaluation tasks.
People constantly make quick evaluations of each other's behaviour and interactions.
The latest AI models, such as the large language model ChatGPT developed by OpenAI, can describe what is happening in images or videos.
However, it has been unclear whether AI is limited to recognising easily identifiable details or whether it can also interpret complex social information.
Researchers at the Turku PET Centre in Finland studied how accurately the popular language model ChatGPT can assess social interaction.
The model was asked to evaluate 138 different social features from videos and pictures. The features ranged from facial expressions and body movements to characteristics of the interaction itself, such as cooperation or hostility.
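As a rough illustration of what such an annotation query can look like in practice, the Python sketch below asks a GPT-4 vision model, via the OpenAI API, to rate a single image on a handful of example features. The prompt wording, the rating scale, the three example features, the model identifier, and the file name are assumptions for illustration only, not the study's actual protocol.

```python
# Illustrative sketch: rate one image on a few example social features with a
# GPT-4 vision model. Prompt, scale, feature list, and model name are assumptions.
import base64
from openai import OpenAI  # OpenAI Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXAMPLE_FEATURES = ["cooperation", "hostility", "body movement"]  # 3 of the 138 features

def rate_image(path: str) -> str:
    """Ask the model for a 0-100 rating of each example feature in one image."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Rate how strongly each of the following social features is present in the "
        "image, on a scale from 0 (not at all) to 100 (very strongly). "
        "Answer with one 'feature: score' line per feature.\n" + "\n".join(EXAMPLE_FEATURES)
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder vision-capable model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=200,
    )
    return response.choices[0].message.content

print(rate_image("movie_frame.jpg"))  # hypothetical file name
```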
The researchers compared the evaluations made by AI with more than 2,000 similar evaluations made by humans.
The results showed that ChatGPT's evaluations were very close to those made by humans. The AI's evaluations were even more consistent than those of a single person.
"Since ChatGPT's assessment of social features were on average more consistent than those of an individual participant, its evaluations could be trusted even more than those made by a single person. However, the evaluations of several people together are still more accurate than those of artificial intelligence," says Postdoctoral Researcher Severi Santavirta from the University of Turku.
In the second phase of the study, the researchers used the AI's and the human participants' evaluations of social situations to model the brain networks of social perception with functional brain imaging.
Before researchers can look at what happens in the human brain when people watch videos or pictures, the social situations they depict need to be assessed. This is where AI proved to be a useful tool.
"The results were strikingly similar when we mapped the brain networks of social perception based on either ChatGPT or people's social evaluations," says Santavirta.
Researchers say this suggests that AI can be a practical tool for large-scale and laborious neuroscience experiments, where, for example, interpreting video footage during brain imaging would require significant human effort. AI can automate this process, thereby reducing the cost of data processing and significantly speeding up research.
"Collecting human evaluations required the efforts of more than 2,000 participants and a total of more than 10,000 work hours, while ChatGPT produced the same evaluations in just a few hours," Santavirta summarises.
While the researchers focused on the benefits of AI for brain imaging research, the results suggest that AI could also be used for a wide range of other practical applications.
Automatic AI evaluation of social situations from video footage could, for example, help doctors and nurses monitor patients' well-being. Furthermore, AI could evaluate how audiovisual marketing is likely to be received by its target audience, or detect abnormal situations in security camera footage.
"The AI does not get tired like a human, but can monitor situations around the clock. In the future, the monitoring of increasingly complex situations can probably be left to artificial intelligence, allowing humans to focus on confirming the most important observations," Santavirta says.
GPT-4V shows human-like social perceptual capabilities at phenomenological and neural levels
Humans navigate the social world by rapidly perceiving social features from other people and their interaction.
Recently, large language models (LLMs) have achieved high-level visual capabilities for detailed recognition and description of object and scene content.
This raises the question of whether LLMs can infer complex social information from images and videos, and whether the high-dimensional structure of their feature annotations aligns with that of human annotations.
We collected evaluations of 138 social features from GPT-4V for images (N = 468) and videos (N = 234) derived from social movie scenes.
These evaluations were compared with human evaluations (N = 2,254).
The comparisons established that GPT-4V can annotate individual social features at a human-like level.
The GPT-4V social feature annotations also show a representational structure similar to that of human social perception (i.e., a similar correlation matrix over all social feature annotations).
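One way to make this structural comparison concrete is to build the feature-by-feature correlation matrix separately from the human and the GPT-4V annotations and then correlate their off-diagonal elements. The sketch below uses simulated ratings and is not the paper's exact analysis.

```python
# Sketch (simulated data): compare the representational structure of human and
# GPT-4V annotations via their feature-by-feature correlation matrices.
import numpy as np

rng = np.random.default_rng(1)
n_stimuli, n_features = 468, 138
latent = rng.normal(size=(n_stimuli, n_features))                   # shared structure
human_ratings = latent + rng.normal(scale=0.5, size=latent.shape)   # stand-in human means
gpt_ratings = latent + rng.normal(scale=0.5, size=latent.shape)     # stand-in GPT-4V ratings

human_corr = np.corrcoef(human_ratings, rowvar=False)   # 138 x 138 feature correlations
gpt_corr = np.corrcoef(gpt_ratings, rowvar=False)

# Correlate the unique off-diagonal elements of the two matrices
iu = np.triu_indices(n_features, k=1)
structure_r = np.corrcoef(human_corr[iu], gpt_corr[iu])[0, 1]
print(f"similarity of correlation structures: r = {structure_r:.2f}")
```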
Finally, we modeled hemodynamic responses (N = 97) to viewing socioemotional movie clips with feature annotations by human observers and GPT-4V.
These results demonstrated that stimulus models based on GPT-4V annotations reveal a social perceptual network in the human brain that is highly similar to the one revealed by stimulus models based on human annotations.
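The sketch below illustrates, with simulated data, the kind of stimulus model implied here: a per-volume feature annotation time course is convolved with a canonical hemodynamic response function and regressed against a voxel's BOLD signal. Running the same regression with human-derived and with GPT-4V-derived annotations and comparing the resulting maps corresponds to the comparison reported; the TR, scan length, and HRF shape below are illustrative assumptions, not the study's exact analysis settings.

```python
# Simulated sketch of an HRF-convolved stimulus model for one feature and one voxel.
import numpy as np
from scipy.stats import gamma

tr, n_vols = 2.0, 300                            # assumed repetition time and scan length
t = np.arange(0, 32, tr)
hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)  # simple double-gamma canonical HRF
hrf /= hrf.sum()

rng = np.random.default_rng(2)
feature = rng.uniform(0, 1, n_vols)              # per-volume feature annotation (e.g. cooperation)
regressor = np.convolve(feature, hrf)[:n_vols]   # HRF-convolved stimulus regressor

X = np.column_stack([regressor, np.ones(n_vols)])    # design matrix: feature + intercept
bold = 1.5 * regressor + rng.normal(0, 1, n_vols)    # one simulated voxel's BOLD time series

beta, *_ = np.linalg.lstsq(X, bold, rcond=None)      # ordinary least-squares GLM fit
print(f"estimated feature effect: beta = {beta[0]:.2f}")
```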
These human-like annotation capabilities of LLMs could support a wide range of real-life applications, from health care to business, and would open exciting new avenues for psychological and neuroscientific research.