Monday, 25 November 2024

ChatGPT, Scholarly Trust and Inconsistency


Generative artificial intelligence (GAI) tools like ChatGPT are set to transform academic work in many ways. Among their many capabilities, the one that particularly interests me is their potential to act as a research assistant by compiling secondary sources, a time-consuming yet critical part of scholarly research. Curious about its performance, I recently conducted an experiment. The results left me impressed and, ultimately, perplexed. In this post, I’ll explain the experiment, share what went right (and wrong), and reflect on what the outcomes reveal about ChatGPT’s usefulness – and its limitations – for academic work.


The Experiment

I prompted ChatGPT (I used the free version and the Scholar GPT option) to compile lists of secondary sources for three very different topics: AI governance, the philosophy of human agency, and the Prince in Shakespeare’s Romeo and Juliet. My prompts were simple but specific, requiring bibliographic information, direct quotations, and page references.
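
To give a sense of the format, the prompts ran roughly along these lines (the wording below is illustrative rather than a verbatim transcript):

“Compile a list of secondary sources on the Prince in Shakespeare’s Romeo and Juliet. For each source, provide full bibliographic information, one direct quotation, and the page reference for that quotation.”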

For the first topic, AI governance, ChatGPT performed quite well. Most of the sources it provided were real, and many of the quotations were accurate. However, a small part of its output was mere hallucination, with non-existent articles and invented citations. Still, I found its accuracy in this domain surprisingly high.

The second topic, the philosophy of human agency, was where ChatGPT truly excelled. Not only were the sources relevant, but the quotations and even the page references were accurate, with a single exception. I double-checked them – which was easy, as most of the books were on my bookshelf – and found that, apart from that one slip, everything aligned with the original texts.

As for the third topic, the Prince in Romeo and Juliet, ChatGPT performed as expected. While it managed to name actual books and authors, the quotations attributed to these works were nothing but invention. Worse, the journal articles it cited – along with their supposed quotations – were entirely hallucinated.

I wasn’t shocked that ChatGPT hallucinated information. This is a well-documented weakness of GAI tools, and anyone using them for research should be prepared to verify everything. What really bothers me is the inconsistency. For one topic, the tool delivered a mixed bag: mostly correct, with some inaccuracies. For another, it achieved almost absolute precision, as if it had memorised entire books. And for the third, it produced nonsense that only looked plausible at first glance.

What’s maddening about this inconsistency is that it raises an uncomfortable question: why is ChatGPT able to excel in one area and fail so catastrophically in another? Is it a matter of training data? Some topics, like foundational philosophical texts, are likely well represented in its corpus, whereas niche areas, like specific aspects of Shakespearean studies, are probably underrepresented. Even so, the drastic difference in accuracy is hard to accept. It is also worth mentioning, though, that when I pointed out the hallucinations in the Romeo and Juliet list and prompted ChatGPT to provide five verified sources, it returned journal articles with links – all accurate and relevant.



Can We Trust GAI?

This experiment left me with mixed feelings about ChatGPT’s potential as a research assistant. On the one hand, it’s capable of producing accurate and useful results—sometimes even impressively so. On the other, its tendency to hallucinate and its painful inconsistency make it unreliable.

As GAI tools become more embedded in academic workflows, this inconsistency presents a serious challenge. If ChatGPT can sometimes be astonishingly accurate, how are we to distinguish these moments from the ones where it confidently presents mere hallucinations? And more importantly, how much of our time will be spent trying to sort the real from the hallucinated? Is this a matter of excellence in prompting, of critical thinking on the user’s side, or of filters introduced on the machine end?