Generative artificial intelligence (GAI) tools like ChatGPT are poised to transform academic work in many ways. Among their many capabilities, the one that particularly interests me is their potential to act as a research assistant and compile secondary sources, a time-consuming yet critical part of scholarly research. Curious about its performance, I recently conducted an experiment. The results left me impressed and, ultimately, perplexed. In this post, I’ll explain the experiment, share what went right (and wrong), and reflect on what these outcomes reveal about ChatGPT’s usefulness, and its limitations, for academic work.
The Experiment
I prompted ChatGPT (I used the free version and the Scholar GPT option) to compile lists of secondary sources for three very different topics: AI governance, the philosophy of human agency, and the Prince in Shakespeare’s Romeo and Juliet. My prompts were simple but specific, requiring bibliographic information, direct quotations, and page references.
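For readers who would rather run this kind of test programmatically than through the chat interface, the same sort of prompt can be sent via the OpenAI Python SDK. The snippet below is a minimal sketch, not what I actually ran: I used the free web version with the Scholar GPT option, and the model name and exact wording here are illustrative assumptions only.

```python
# Illustrative sketch only: I actually used the free ChatGPT web interface
# with the Scholar GPT option, not the API. The model name and the prompt
# wording are assumptions, not my original experiment.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = (
    "Compile a list of ten secondary sources on AI governance. For each "
    "source, give full bibliographic information, one direct quotation, "
    "and the page reference for that quotation."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```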
For the first topic, AI governance, ChatGPT performed quite well. Most of the sources it provided were real, and many of the quotations were accurate. However, a small portion of its output was pure hallucination: non-existent articles and invented citations. Still, I found its accuracy in this domain surprisingly high.
The second topic, the philosophy of human agency, was where ChatGPT truly excelled. Not only were the sources relevant, but the quotations and even the page references were accurate, with a single exception. I double-checked them – this was easy, as most of the books were on my bookshelf – and found that everything else aligned with the original texts.
As for the third topic, the Prince in Romeo and Juliet, ChatGPT performed much as I had expected. While it managed to name actual books and authors, the quotations attributed to these works were pure invention. Worse, the journal articles it cited, along with their supposed quotations, were entirely hallucinated.
I wasn’t shocked that ChatGPT hallucinated information. This is a well-documented weakness of GAI tools, and anyone using them for research should be prepared to verify everything. What really bothers me is the inconsistency. For one topic, the tool delivered a mixed bag: mostly correct, with some inaccuracies. For another, it achieved almost absolute precision, as if it had memorised entire books. And for the third, it produced nonsense that only looked plausible at first glance.
What’s maddening about this inconsistency is that it raises an uncomfortable question: why is ChatGPT able to excel in one area and fail so catastrophically in another? Is it a matter of training data? Some topics, like foundational philosophical texts, are likely well-represented in its corpus, whereas niche areas, like specific aspects of Shakespearean studies, are probably underrepresented. But even so, the drastic difference in accuracy is hard to let go of. It is also worth mentioning that when I pointed out the hallucinations in the Romeo and Juliet results and prompted ChatGPT to come up with five verified sources, it provided journal articles with links – all accurate and relevant.
Can We Trust GAI?
This experiment left me with mixed feelings about ChatGPT’s potential as a research assistant. On the one hand, it’s capable of producing accurate and useful results, sometimes even impressively so. On the other, its tendency to hallucinate and its painful inconsistency make it unreliable.
As GAI tools become more embedded in academic workflows, this inconsistency presents a serious challenge. If ChatGPT can sometimes be astonishingly accurate, how are we to distinguish those moments from the ones where it confidently presents hallucinations? And, more importantly, how much of our time will be spent trying to sort the real from the hallucinated? Should the answer lie in better prompting, in critical thinking on the researcher’s part, or in filters built in on the machine end?
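Some of that sorting can at least be partly automated on the researcher’s end, for journal articles in particular. The sketch below, which is not something I used in the experiment, checks a chatbot-supplied citation against the public Crossref REST API; the candidate title is a placeholder to be filled in with whatever the chatbot returned.

```python
# Rough sketch: look up a chatbot-supplied citation in the public Crossref
# REST API before trusting it. The candidate title below is a placeholder,
# not a real citation from my experiment.
import requests

def crossref_lookup(title, author=None):
    """Return Crossref's best match for a citation, or None if nothing is found."""
    params = {"query.bibliographic": title, "rows": 1}
    if author:
        params["query.author"] = author
    resp = requests.get("https://api.crossref.org/works", params=params, timeout=10)
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0] if items else None

candidate_title = "PASTE A CITATION TITLE RETURNED BY THE CHATBOT HERE"
match = crossref_lookup(candidate_title)
if match:
    title = (match.get("title") or ["<untitled>"])[0]
    print("Possible match:", title, "| DOI:", match.get("DOI"))
else:
    print("No match found; treat the citation as unverified.")
```

Even then, a fuzzy title match only confirms that a similarly titled work exists; it cannot verify that the attributed quotation or the page reference is genuine, so the human checking the original texts is still not out of a job.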