Scientists have expected a lot from Large Language Models (LLMs), especially in terms of creativity and ideation. A recent study, however, raises questions about whether their ability to generate innovative research ideas truly surpasses human creativity or remains constrained by practical limitations.
Published on arXiv on September 6, the large-scale human study by Chenglei Si, Diyi Yang, and Tatsunori Hashimoto set out to evaluate the potential of LLMs in the ideation process. The experiment recruited over 100 natural language processing (NLP) researchers to generate and review research ideas, enabling a direct comparison of human- and LLM-generated outputs.
While LLM-generated ideas were judged more novel at a statistically significant level (p < 0.05), the study also revealed notable weaknesses, particularly regarding the feasibility and practical application of these ideas. The novelty advantage was one of the most striking results: LLM-generated ideas received an average novelty score of 5.64, rising slightly to 5.81 when reranked by a human expert, compared with 4.84 for human-generated ideas.
The models, using techniques like retrieval-augmented generation (RAG), were able to sift through massive quantities of research papers, generating ideas with a level of novelty that human experts found difficult to match.
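To make that pipeline concrete, here is a minimal sketch of what RAG-grounded ideation can look like. Everything below is illustrative rather than the study's actual code: the toy keyword retriever stands in for a real paper-retrieval step, and call_llm is a hypothetical stub for whatever chat API is used.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    abstract: str

def retrieve(papers: list[Paper], topic: str, k: int = 3) -> list[Paper]:
    """Toy lexical retrieval: rank papers by word overlap with the topic.
    A real system would use a search API or embedding similarity."""
    topic_words = set(topic.lower().split())
    scored = sorted(
        papers,
        key=lambda p: len(topic_words & set(p.abstract.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_ideation_prompt(topic: str, retrieved: list[Paper]) -> str:
    """Inject retrieved abstracts into the prompt so ideation is grounded
    in the literature rather than in the model's memory alone."""
    context = "\n\n".join(f"{p.title}: {p.abstract}" for p in retrieved)
    return (
        f"Related work:\n{context}\n\n"
        f"Propose a novel, feasible research idea on the topic: {topic}"
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stub: replace with a real chat-completion call.
    raise NotImplementedError
```

The essential pattern is that retrieved abstracts are placed directly in the prompt, so the model ideates against recent literature instead of relying purely on what it memorized during training.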
But while novelty is a key element of creative research, feasibility remains just as important, if not more so, when assessing the practicality and real-world application of research ideas. In this regard, LLMs faltered: feasibility scores for LLM-generated ideas were consistently lower than those for human-generated ones, revealing a fundamental gap between LLMs' ability to ideate and their capacity to envision practically implementable solutions. The tendency of LLMs to propose resource-intensive projects, such as fine-tuning large models like BLOOM, illustrates how AI-generated ideas, however creative, can face significant hurdles in real-world execution.
This tension between novelty and feasibility poses a critical question about the role LLMs should play in research. If LLMs can consistently outperform humans in ideation, should they be integrated into the early stages of research development, leaving humans to refine and implement their ideas? Or do their feasibility shortcomings limit them to being mere ideation assistants?
Another major challenge identified in the study is self-evaluation. Despite their ability to generate novel ideas, LLMs struggle to reliably evaluate their own outputs. Various evaluation methods, including pairwise ranking, showed that LLMs are less consistent in judging ideas than human reviewers are.
The best-performing model, Claude-3.5, achieved an accuracy of just 53.3% in evaluating ideas, lower than human inter-reviewer consistency at 56.1%. This exposes the inherent difficulties in using LLMs as autonomous research agents capable of both generating and critically assessing ideas.
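The pairwise setup behind those numbers is easy to picture in code. The sketch below illustrates the general technique rather than the study's implementation: judge is a hypothetical LLM wrapper, and each pair is assumed to come with a known label (for instance, an accepted versus a rejected submission) so accuracy can be scored.

```python
import random

def judge(idea_a: str, idea_b: str) -> str:
    """Hypothetical LLM judge: returns 'A' or 'B'.
    Replace with a real model call; this placeholder guesses at chance."""
    return random.choice(["A", "B"])

def pairwise_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Each pair is (better_idea, worse_idea); presentation order is
    shuffled per trial so the judge cannot exploit position bias."""
    correct = 0
    for better, worse in pairs:
        if random.random() < 0.5:
            correct += judge(better, worse) == "A"
        else:
            correct += judge(worse, better) == "B"
    return correct / len(pairs)
```

Randomizing which idea appears first matters because LLM judges are known to favor one position, which would otherwise inflate or deflate the measured accuracy.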
Moreover, the issue of diversity in LLM-generated ideas cannot be overlooked. Although LLMs can generate a large number of ideas, up to 4,000 seed ideas per topic in the study, only a small fraction were unique; most were duplicates, pointing to a bottleneck in idea diversity.
This lack of diversity could narrow perspectives across the research sector, a problem that could undermine the benefits of AI-generated creativity in the long term. As the authors rightly suggest, refining the models or the ideation methods will be necessary to ensure more diverse thinking in future AI-generated research.
The study also underscored the importance of human supervision in the LLM-driven research process. In the reranking of ideas, human reviewers consistently improved the outcomes, particularly in terms of novelty. This suggests that while LLMs are capable of generating novel ideas, their outputs can be significantly enhanced when combined with human expertise.
This raises a broader question about the future of research: rather than asking whether LLMs can generate better research ideas than humans, perhaps we should be exploring how human-AI collaboration can elevate the research process to new heights.
It is essential to recognize that while LLMs demonstrate remarkable capabilities, they are not yet capable of fully autonomous research. According to the findings, even when LLMs are integrated into the research pipeline—from paper retrieval to idea generation and evaluation—human intervention remains crucial at multiple stages.
Expert researchers provided critical input by reranking ideas and conducting qualitative reviews, flagging shortcomings such as the misuse of datasets and unrealistic assumptions in LLM-generated proposals. Without this human input, the feasibility and practicality of AI-generated research ideas would be severely compromised.
Review bias and the subjectivity of idea evaluation
An intriguing aspect of the study is the subjectivity involved in idea evaluation. Reviewing research ideas, especially those that are not yet fully developed into papers, presents inherent challenges.
The study reported an inter-reviewer agreement of just 56.1%, lower than the 66% found in NeurIPS 2021 reviewer consistency experiments and the 71.9% in ICLR 2024 submissions. This low level of agreement underscores the subjective nature of evaluating raw research ideas, as opposed to fully executed projects.
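For intuition on what an agreement figure like this measures, here is one simple way to compute it. This is an illustrative metric, not necessarily the paper's exact protocol: the fraction of idea pairs on which two reviewers' score orderings coincide, with ties skipped.

```python
from itertools import combinations

def pairwise_agreement(scores_a: dict[str, float],
                       scores_b: dict[str, float]) -> float:
    """Fraction of idea pairs on which both reviewers order the two ideas
    the same way; pairs tied by either reviewer are ignored."""
    shared = sorted(set(scores_a) & set(scores_b))
    agree = total = 0
    for x, y in combinations(shared, 2):
        da = scores_a[x] - scores_a[y]
        db = scores_b[x] - scores_b[y]
        if da == 0 or db == 0:
            continue  # skip ties
        total += 1
        agree += (da > 0) == (db > 0)
    return agree / total if total else float("nan")

# Example: two reviewers agree on 1 of 3 orderable pairs (1/3 agreement):
# pairwise_agreement({"i1": 6, "i2": 4, "i3": 5},
#                    {"i1": 5, "i2": 6, "i3": 4})
```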
The subjective biases of human reviewers also raise concerns about the validity of the evaluation process. While AI-generated ideas were rated as more novel, the reviewers’ own expectations and preferences likely influenced these outcomes.
Furthermore, the novelty of an idea does not guarantee its effectiveness or impact, especially in fields like NLP, where the execution of ideas is paramount. This subjectivity, coupled with the known biases of LLMs in evaluation tasks, suggests that neither humans nor AI is fully equipped to handle the complexities of research ideation independently.
Scaling LLM capabilities
The study’s attempt to scale LLM capabilities using the over-generate-and-rank method revealed sharply diminishing returns. Of the roughly 4,000 seed ideas generated per research topic, only around 200 were unique, and as the number of generated ideas grew, the percentage of non-duplicates kept dropping.
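The deduplication step behind those numbers is straightforward to sketch. The study reportedly measures duplication via sentence-embedding similarity; in the toy version below, a Jaccard word-overlap score stands in for embedding cosine similarity, and the 0.8 threshold is an assumption chosen for illustration.

```python
def similarity(a: str, b: str) -> float:
    """Toy Jaccard word overlap, standing in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def deduplicate(ideas: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy dedup: keep an idea only if it is sufficiently different
    from every idea already kept."""
    unique: list[str] = []
    for idea in ideas:
        if all(similarity(idea, kept) < threshold for kept in unique):
            unique.append(idea)
    return unique
```

As the pool of kept ideas grows, each new candidate is more likely to collide with an existing one, which is exactly the diminishing-returns pattern the study observes.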
This bottleneck challenges the assumption that simply scaling LLM outputs will yield better or more creative research ideas. Instead, it indicates that beyond a certain point, LLMs struggle to sustain creativity and diversity in their outputs, further underscoring the need for human input to guide and refine the ideation process.
The ethical dilemma and impact on human creativity
The increasing use of LLMs in research raises important ethical questions. As AI-generated ideas become more prevalent, there is growing concern about the flood of low-quality academic submissions that may result from this trend.
The ambiguity surrounding intellectual credit and the potential for AI-generated ideas to be misused for harmful applications add another layer of complexity to this issue. Transparency in the role of AI in research will be crucial to maintaining academic integrity, as will continued safety research to mitigate potential risks.
Another ethical concern is the potential homogenization of ideas. As LLMs become more widespread, there is a risk that research outputs may become less diverse, with LLMs generating a narrower range of perspectives. Addressing this issue will require careful refinement of AI models and ideation methods to encourage more diverse and innovative thinking.
Perhaps the most profound question raised by the study is the potential impact of LLMs on human creativity and collaboration. If AI-generated ideas are consistently rated as more novel, does this signal a decline in human creativity, or does it simply reflect a new paradigm in which AI augments human capabilities?
The authors of the study argue that over-reliance on AI could stifle human creativity and social collaboration, both of which have long been central to the research process.
However, rather than viewing AI as a replacement for human researchers, it may be more productive to see LLMs as tools that can enhance human creativity. By automating certain aspects of research ideation and evaluation, LLMs can free up time for researchers to focus on more complex and subtle tasks.
Ultimately, only by balancing the use of AI's strengths with the preservation of the uniquely human elements of creativity and collaboration will scientific discovery truly benefit.