Generative Artificial Intelligence (GAI) tools like ChatGPT are poised to transform academic work in many ways. Among their many capabilities, the one that particularly interests me is their potential to act as a research assistant in compiling secondary sources—a time-consuming yet critical part of scholarly research. Curious about ChatGPT’s performance, I recently conducted an experiment. The results left me impressed and, ultimately, perplexed. In this post, I’ll explain the experiment, share what went right (and wrong), and reflect on what these outcomes reveal about ChatGPT’s usefulness—and limitations—for academic work.
The Experiment
I prompted ChatGPT (I used the free version and
the Scholar GPT option) to compile lists of secondary sources for three very
different topics: AI governance, the philosophy of human agency, and the Prince
in Shakespeare’s Romeo and Juliet. My prompts were simple but specific,
requiring bibliographic information, direct quotations, and page references.
For the first topic, AI governance, ChatGPT
performed quite well. Most of the sources it provided were real, and many of
the quotations were accurate. However, a small part of its output was mere hallucination,
with non-existent articles and invented citations. Still, I found its accuracy
in this domain surprisingly high.
The second topic, the philosophy of human
agency, was where ChatGPT truly excelled. Not only were the sources relevant,
but the quotations and even the page references were accurate, with a single
exception. I double-checked them – an easy task, as most of the books were on my
bookshelf – and found that, apart from that one slip, everything aligned with the
original texts.
As for the third topic, the
Prince in Romeo and Juliet, ChatGPT performed as I had expected. While it
managed to name actual books and authors, the quotations attributed to these
works were nothing but invention. Worse, the journal articles it cited—along
with their supposed quotations—were entirely hallucinated.
I wasn’t shocked that ChatGPT hallucinated
information. This is a well-documented weakness of GAI tools, and anyone using
them for research should be prepared to verify everything. What really bothers
me is the inconsistency. For one topic, the tool delivered a mixed bag: mostly
correct, with some inaccuracies. For another, it achieved almost absolute
precision, as if it had memorised entire books. And for the third, it produced
nonsense that only looked plausible at first glance.
What’s maddening about this inconsistency is
that it raises an uncomfortable question: why is ChatGPT able to excel in one
area and fail so catastrophically in another? Is it a matter of training data?
Some topics, like foundational philosophical texts, are likely well-represented
in its corpus, whereas niche areas, like specific aspects of Shakespearean
studies, are probably underrepresented. But even so, the drastic difference in
accuracy is hard to explain away. It is also worth mentioning, though, that when I
told ChatGPT about the hallucinations concerning R&J and prompted
it to come up with five verified sources, it provided journal articles with
links – all of them accurate and relevant.
Can We Trust GAI?
This experiment left me with mixed feelings
about ChatGPT’s potential as a research assistant. On the one hand, it’s
capable of producing accurate and useful results—sometimes even impressively
so. On the other, its tendency to hallucinate and its painful inconsistency
make it unreliable.
As GAI tools become more embedded in academic
workflows, this inconsistency presents a serious challenge. If ChatGPT can
sometimes be astonishingly accurate, how are we to distinguish these moments
from the ones where it confidently presents mere hallucinations? And more
importantly, how much of our time will be spent trying to sort the real from
the hallucinated? Should the remedy lie in better prompting, in sharper critical
thinking, or in filters introduced on the machine’s end?
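On that last point, a small part of the filtering can already be pushed to the machine’s end. The snippet below is only a sketch of what I have in mind, not something I used in the experiment: it assumes the chatbot has been asked to attach a DOI to every citation, and it simply asks the public Crossref API whether each DOI corresponds to a real record. The DOIs listed are illustrative placeholders.

```python
# A minimal sketch (not part of the experiment above): check whether DOIs
# returned by a chatbot actually resolve to records in Crossref.
# Assumes the `requests` package is installed and that each citation was
# requested together with its DOI; citations without a DOI still need a
# manual check.
import requests

def doi_exists(doi: str, timeout: float = 10.0) -> bool:
    """Return True if the public Crossref API knows about this DOI."""
    url = f"https://api.crossref.org/works/{doi}"
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return False  # network problem: treat as unverified, not as fake
    return response.status_code == 200

# Placeholder DOIs: replace with whatever the chatbot actually returned.
candidate_dois = [
    "10.1234/placeholder-citation-one",
    "10.1234/placeholder-citation-two",
]

for doi in candidate_dois:
    verdict = "found in Crossref" if doi_exists(doi) else "NOT found - check by hand"
    print(f"{doi}: {verdict}")
```

Even a crude check like this would have flagged the invented Romeo and Juliet articles straight away; what it cannot catch are real articles paired with invented quotations or page numbers, which is exactly where the critical reading still falls to us.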
Hi Zsolt,
Your reflection is very interesting and inspiring. However, I’m not sure I see your experience as something to be overly concerned about. First of all, I don’t perceive much inconsistency in it; rather, it seems to be a consequence of the “training data,” as you also pointed out. ChatGPT cooks with the “ingredients” it finds.
Yes, it’s true that ChatGPT (especially the free version) makes mistakes, sometimes even ridiculous ones. For instance, I recently asked it for a very simple date, and it got it wrong. When I pointed out the error and urged it to reconsider, it found the correct answer. Curious, I asked it how such mistakes can occur when it sometimes displays surprising accuracy in other cases. Its response was that during the initial query it retrieves information from a basic database layer—the “first catch” with its net, so to speak.
Honestly, I feel much more comfortable with a ChatGPT that makes mistakes and requires double-checking. For me, it would be much more unsettling if it were flawless. This fallibility reassures me that AI systems won’t replace scholars and their critical, systematic pursuit of truth. Instead, they remain tools that can make researchers’ work easier—much like the earlier transitions brought by digitalization, the internet, Google, Wikipedia, and so on.
Indeed, the critical use of these tools must not be abandoned by serious scholars or students. This is, and will always be, the defining difference between rigorous, thoughtful users and those who view these platforms as a shortcut to appearing knowledgeable without actually being so. Thanks again for this stimulating reflection! Best regards, Berényi Laci
Dear Laci,
Thank you so much for your thoughtful comment. I agree with you on many points. You’re absolutely right: critical thinking is indispensable, double-checking is essential, GenAI should never replace scholars in the pursuit of truth, and, indeed, there is something intriguingly human about the fact that GenAI is not flawless. In essence, I find myself in agreement with much of what you have written.
That said, I would like to approach these ideas from a slightly different perspective. As a scholar, when I address a research question and formulate a hypothesis, a useful and necessary step is gathering the insights of other scholars. This process is not just a precaution to avoid replicating existing arguments but also an opportunity to refine my own ideas by engaging in a dialogue with my colleagues. Such fieldwork is undeniably valuable, yet it is also slow, often haphazard, and invariably leaves a lingering doubt: have I overlooked an important study, or a key thinker?
Over the years, various solutions to this challenge have emerged. In some institutions, PhD students or other students are tasked with this legwork, while wealthier departments hire dedicated research assistants. For me, however, neither option is available—or, in the first case, ethically justifiable. This is where I see potential for AI. The announcements of agentic AI, which promises to solve real-world problems autonomously, suggest that this issue could in theory be addressed. But my recent experiences with GenAI remind me that we are not there yet.
This brings me to a fundamental frustration: if I use GenAI, I don’t want to waste my critical faculties verifying whether my “assistant” is hallucinating or misleading me. Imagine employing a secretary who occasionally excels but frequently fabricates plausible-sounding data. You hired them to lighten your workload, to free up your time for tasks only you can perform. But if you must constantly double- or triple-check their work, what have you gained? In such a scenario, most of us would quickly fire them. Sadly, this is how GenAI feels to me at present.
Moreover, while I wholeheartedly agree that scholars cannot and should not be replaced by AI, I would prefer to use my critical skills to engage deeply with my sources—not to investigate whether those sources exist at all. My role as a scholar involves formulating hypotheses, analysing data (literary, theatrical, cultural, or philosophical, in my case), and drawing conclusions. A trustworthy AI agent could assist with peripheral tasks, but only if it genuinely lends a hand, not if it shifts the burden of fact-checking back onto me.
Finally, I agree entirely with your point about the superficial appearance of expertise that these tools can grant. GenAI can enable people to look smarter and more knowledgeable without truly being so. However, this appearance of intelligence introduces additional work for those of us committed to genuine scholarship. I used to be content if a student referenced relevant secondary sources or if I found an article supported by solid citations to researchers whose expertise lay beyond my own. Now, I must verify every citation because it might well be a figment of GenAI’s imagination.
In sum, while GenAI holds promise, its current limitations undermine its potential. Until it becomes truly reliable, I remain cautious—not about the machine’s capabilities, but about its impact on the integrity and efficiency of our work as scholars.