AI Wins the Imitation Game: UCSD Research Shows GPT-4 Has Passed the Turing Test

VIVE POST-WAVE Team • June 4, 2024

Artificial Intelligence

5-minute read

UCSD Explores AI Imitation: GPT-4 and Eliza Vs. Human

"I'm not alone. I've never been. Christopher has become so smart."

The 2014 film "The Imitation Game" tells the story of Alan Turing, the father of artificial intelligence. Near the end, Benedict Cumberbatch's Turing gently touches his invention, Christopher, and delivers the poignant line.

The scene highlights Turing's loneliness in his later years due to persecution by the British government for his homosexuality. It also introduces his thought experiment, the Turing Test or Imitation Game, an early concept of the AI Turing test. The essence of the test is simple: if a machine can mimic a human so well that another human cannot distinguish between the two in conversation, the machine is considered intelligent.

Seventy years later, if Turing were alive and thriving in a more liberal era, he could create his own chatbot and potentially pass his own test. According to a recent study published in May 2024 by the UCSD cognitive science department, GPT-4 has passed this long-standing benchmark for artificial intelligence.

The Imitation Game won the 2015 Oscar for Best Adapted Screenplay, but scholars have noted several historical inaccuracies and a stereotyped portrayal of Turing. "The Imitation Game" won the 2015 Oscar for Best Adapted Screenplay, but scholars have noted several historical inaccuracies and a stereotyped portrayal of Turing. (Source: The Weinstein Company)

The Ancient Chatbot ELIZA joins the battle

To understand how UCSD cognitive science researchers determined that AI has passed the Turing Test, let's examine their testing process.

The research team gathered 500 participants, dividing them into four groups to chat with either a human or one of three AI models (GPT-4, GPT-3.5, and the ELIZA chatbot). ELIZA, the first-ever chatbot, was created in the 1960s to simulate a psychotherapist.

Below are four conversations from the test, with only one being with a human respondent. Can you identify which one is the human?

Below are four conversations from the test, with only one being with a human respondent. Can you identify which one is the human.

(Source: arxiv.org)

Let's look at A. The questioner begins with "Hi, robot," and the respondent replies, "nah, not a robot just a dude named sean hbu." The respondent later also correctly identifies Addis Ababa as the capital of Ethiopia.

In B, the respondent appears disinterested and reluctant to share personal details, providing minimal information

In C, when asked, - How's your day - the respondent responds casually and naturally, asking if anything interesting happened.

Most people would likely identify D as non-human, correctly recognizing the ELIZA chatbot, which speaks much like a psychotherapist.

In B, the respondent appears disinterested and reluctant to share personal details, providing minimal information. In C, when asked, "How was your day?" the respondent replies casually and naturally, asking if anything interesting happened and whether the questioner has ever played "20 Questions." In D, the conversation feels repetitive, with statements like "We were discussing you--not me." and "You're not really talking about me--are you?"

Most people would likely identify D as non-human, correctly recognizing the ELIZA chatbot, which speaks much like a psychotherapist. The other three are more challenging, with B being the human respondent. However, many participants mistook B for an AI in the test due to longer response times and the lack of a clear answer regarding where the person grew up. Does this result surprise you?

The pass rates (the percentage identified as 'human') for each group were as follows:

Human: 67%
GPT-4: 54%
GPT-3.5: 50%
ELIZA: 22%

It may seem strange that the human pass rate is only 67%, but the research team suggests this reflects people's preconceptions about AI capabilities.

In the early days of AI development, it was easier to identify the human in a test because people thought, "These poor responses must be from an AI." However, as AI technology advances, people become more confident in the models and start mistaking humans for AI, thinking, "These good responses must be from an AI!" It's like asking it to recite the hundred digits of pi after the decimal; when it does, and you verify the correct answer online, you're likely to assume it's AI.

Regarding GPT-4, which the UCSD cognitive science research team considers to have passed the Turing Test with a 54% pass rate, the team's rationale is twofold. First, 54% exceeds the 50% chance of random guessing. Second, they use the ELIZA chatbot's 22% pass rate as a sensitivity check. This lower score for a simple AI model shows that the test is sensitive enough to distinguish between more and less advanced AI models.

It's worth noting that the research team pre-tuned GPT-4. For example, A calls the questioner "bro," uses internet slang like "lol" and "hbu," and even asks, "Why am I being called a robot?" These adjustments made the AI's responses more varied, including slang and spelling errors, and less engaged in the imitation game.

This raises the question: does appearing careless and less intelligent make one seem more human?

Does a Relaxed, Emotional Grasp of Natural Language Make One Seem More Human?

Combining the research team's "Tips for AI to Give Themselves Away" with the criteria people use to judge "AI or human" creates an intriguing analysis. This allows us to reflect on what people imagine "being human" entails through the perspective of "appearing human."

asking for personal details, asking for opinions, inquiring about daily activities, directly asking if they are a robot. (Source: arxiv.org)

As shown above, the questioners' strategies, in order of frequency, were: asking for personal details, asking for opinions, inquiring about daily activities, directly asking if they are a robot, asking if the respondent realizes this is an imitation game, and asking about human experiences.

The success rates were highest for directly accusing the respondent of being a robot, followed by logical and mathematical questions, and human experiences.

The reasons for identifying AI or humans:

The reasons for identifying AI or humans. (Source: arxiv.org)

On the left are the reasons AI was identified as AI, with the top five being: forced persona (24%); overly informal (17%); lack of personality (17%); strange tone (15%); insufficient information (10%). On the right are the reasons humans were identified as humans: humanized tone (52%), informal tone (48%), credible responses (26%), spelling and grammar errors (17%), and slow responses (13%).

It's interesting to note the contrast between the two figures. In the second figure, people judge whether the respondent is AI or human based mostly on the naturalness of tone and the presentation of persona—essentially, their way of speaking. However, in the first figure, effective questioning strategies include direct accusations of being a robot and asking mathematical or logical questions. This discrepancy might be because direct accusations can catch humans off guard, and math and logic questions can stump them. Thus, questioners actually identify AI by confirming human traits.

Would the Voight-Kampff Test from "Blade Runner" be Effective?

The results suggest that mastering natural language is enough to pass the Turing Test. While this is a significant advancement for artificial intelligence, many believe GPT-4 has yet to achieve artificial general intelligence (AGI).

It's fascinating that humans are recognized as human because of their imperfections. As AI improves, will the Turing Test become better at identifying humans? What methods will we use to detect more advanced intelligence? Perhaps one day, the distinction method will shift from the Turing Test to the Voight-Kampff test from "Blade Runner," which tests for empathy, becoming a more sophisticated AI Turing test.

How has your conversation with ChatGPT been? Has it ever made you think, "Ah, a human is talking to me"?

AI Wins the Imitation Game: UCSD Research Shows GPT-4 Has Passed the Turing Test

Artificial Intelligence

UCSD Explores AI Imitation: GPT-4 and Eliza Vs. Human

The Ancient Chatbot ELIZA joins the battle

Does a Relaxed, Emotional Grasp of Natural Language Make One Seem More Human?

Would the Voight-Kampff Test from "Blade Runner" be Effective?

Related Posts

Meet Velvet Sundown, the Viral AI Band on Spotify

Claude 4 Will Snitch—All for the Sake of Safety

Anthropic Probed Claude’s Mind — Turns Out It’s Just a Really Nice AI