Can LLMs Provide Useful Feedback on Papers?

https://cbarkinozer.medium.com/llmler-makaleler-hakk%C4%B1nda-yararl%C4%B1-geri-bildirim-sa%C4%9Flayabilir-mi-303351e58078

A summary of the article “Can large language models provide useful feedback on research papers? A large-scale empirical analysis.”

Abstract

Feedback from experts forms the basis of rigorous research. However, the rapid growth of scientific output and the increasing specialization of knowledge are straining traditional scientific feedback mechanisms. High-quality peer reviews are becoming increasingly difficult to obtain, and researchers who are more junior or based in under-resourced settings have a particularly hard time receiving timely feedback.

With the emergence of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scholarly feedback on research papers. However, the utility of LLM-generated feedback has not been systematically investigated. To address this gap, we created an automated pipeline using GPT-4 to comment on full PDFs of scientific articles.

We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4-generated feedback with feedback from human peer reviewers at 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in points raised by GPT-4 and by human reviewers (average overlap of 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap of 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is greater for weaker articles (i.e., rejected ICLR articles; average overlap of 43.80%).

We then conducted a prospective user study with 308 researchers from 110 US institutions in artificial intelligence and computational biology to understand how researchers perceived the feedback our GPT-4 system generated on their own papers. Overall, more than half (57.4%) of users found the GPT-4-generated feedback helpful or very helpful, and 82.4% found it more helpful than feedback from at least some human reviewers.

Although our findings suggest that LLM-generated feedback can be helpful to researchers, we also identify some limitations. For example, GPT-4 tends to focus on certain aspects of scientific feedback (e.g. ‘adding experiments on more datasets’) and often struggles to provide an in-depth critique of method design.

Our results show that LLM and human feedback can complement each other. While human expert review is and should continue to be the foundation of the rigorous scientific process, LLM feedback can benefit researchers, especially when timely expert feedback is not available and in the early stages of manuscript preparation before peer review.

Images

Figure 1. Characterization of the ability of the large language model to provide useful feedback to researchers. a, Pipeline for generating LLM scientific feedback using GPT-4. Given a PDF, we parse and extract the title, abstract, figure and table captions, and main text of the article to construct the prompt. We then ask GPT-4 to follow the feedback structure of leading interdisciplinary journals and conferences, providing structured comments in four sections: significance and novelty, potential reasons for acceptance, potential reasons for rejection, and suggestions for improvement. b, Retrospective analysis of LLM feedback on 3,096 Nature family articles and 1,709 ICLR articles. We systematically compare LLM feedback with human feedback using a two-stage comment matching pipeline. The pipeline performs extractive text summarization to pull out the comment points appearing in the LLM and human-written feedback, respectively, and then performs semantic text matching to identify the comment points shared between them. c, Prospective user study of 308 researchers from 110 US institutions in artificial intelligence and computational biology. Each researcher uploaded a paper they had written and completed a survey about the LLM feedback generated for it.
Figure 2. Retrospective analysis of LLM and human scientific feedback. a, Retrospective analysis of the overlap between LLM feedback and feedback from individual human reviewers on papers submitted to Nature family journals. Approximately one-third (30.85%) of the comments raised by GPT-4 coincide with the comments of an individual reviewer (hit rate). "GPT-4 (shuffled)" indicates GPT-4 feedback generated for another randomly selected article from the same journal and category. As a null model, if the LLM mostly produced generic feedback applicable to many papers, the pairwise overlap between LLM feedback and each reviewer's comments would decrease only slightly after shuffling. Instead, the hit rate drops sharply from 57.55% to 1.13% after shuffling, indicating that the LLM feedback is paper specific. b, At the International Conference on Learning Representations (ICLR), more than a third (39.23%) of the comments raised by GPT-4 coincide with the comments of an individual reviewer. The shuffling experiment shows a similar result, again suggesting that LLM feedback is paper specific. c-d, The overlap between LLM feedback and human feedback appears comparable to the overlap observed between two human reviewers, across Nature family journals (c) (r = 0.80, P = 3.69 × 10−4) and across ICLR decision outcomes (d) (r = 0.98, P = 3.28 × 10−3). e-f, Comments raised by multiple human reviewers are disproportionately more likely to also be identified by GPT-4, in Nature family journals (e) and ICLR (f). The x-axis shows the number of reviewers raising the comment; the y-axis shows the probability that a human reviewer comment matches a GPT-4 comment (GPT-4 recall rate). g-h, Comments appearing early in a human review are more likely to be identified by GPT-4, in Nature family journals (g) and ICLR (h). The x-axis shows a comment's position within the sequence of comments written by the human reviewer. Error bars represent 95% confidence intervals. *P < 0.05, **P < 0.01, ***P < 0.001 and ****P < 0.0001.
Figure 3. LLM-based feedback emphasizes certain aspects more than humans do. The LLM comments on the implications of the research 7.27 times more often than human reviewers. Conversely, the LLM is 10.69 times less likely to comment on novelty than human reviewers. Although both the LLM and humans frequently recommend additional experiments, their focus differs: human reviewers are 6.71 times more likely than the LLM to request additional ablation experiments, whereas the LLM is 2.19 times more likely than humans to request experiments on more datasets. Circle size indicates the prevalence of each aspect in human feedback.
Figure 4. Prospective user study of LLM-generated and human review feedback (n = 308). a-b, LLM-generated feedback is generally considered helpful and overlaps substantially with actual feedback from human reviewers. c-d, Compared with human feedback, LLM feedback is perceived as slightly less helpful and less specific. e-f, Users generally believe that an LLM feedback system can improve the accuracy and comprehensiveness of reviews and reduce reviewer workload. g, Most users intend to use, or would consider using, the LLM feedback system again. h, Users believe the LLM feedback system mostly helps authors, followed by reviewers and editors/area chairs. Numbers are percentages (%).
LLM-based scientific feedback is considered useful by participants with different levels of publishing experience.
LLM-based scientific feedback is considered useful by participants of different professional statuses.
Article text, including figure captions, is extracted from the article PDF and assembled into a prompt for the LLM (GPT-4), which then generates the feedback. The generated feedback provides structured comments in four sections: significance and novelty, potential reasons for acceptance, potential reasons for rejection, and suggestions for improvement. In the example, GPT-4 commented that the article reports the modality gap phenomenon but does not suggest methods to close the gap or demonstrate the benefits of doing so.
Workflow of the retrospective comment matching pipeline for scientific feedback texts. a, This two-stage pipeline compares the comment points expressed in LLM-generated feedback with those from human reviewers. b, Extraction: leveraging the LLM's information extraction capabilities, the key comment points are extracted from both the LLM-generated and the human-written reviews. c, Matching: the LLM is used for semantic similarity analysis, matching comment points between the LLM and human feedback. A similarity rating and a justification are produced for each matched pair. The similarity threshold was set to ≥ 7 to filter out weakly matched comments; this threshold was chosen based on human verification of the matching stage.
Prompt template used with GPT-4 to generate scientific feedback on articles in the Nature journal family dataset. <Paper_content> denotes the text extracted from the paper, including the abstract, figure and table captions, and the other main text sections. For clarity and brevity, GPT-4 was instructed to produce a structured outline of the scientific feedback with four sections: significance and novelty, potential reasons for acceptance, potential reasons for rejection, and suggestions for improvement. The feedback was generated by GPT-4 in a single pass.
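To make the generation stage concrete, here is a minimal Python sketch of it. It assumes the paper text has already been extracted from the PDF (the summary below mentions the ScienceBeam parser and a 6,500-token limit), uses tiktoken only to approximate that truncation, and paraphrases the prompt; the helper names are illustrative, not the authors' code.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def truncate_to_tokens(text: str, max_tokens: int = 6500) -> str:
    """Keep roughly the first 6,500 tokens of the extracted paper text."""
    enc = tiktoken.encoding_for_model("gpt-4")
    return enc.decode(enc.encode(text)[:max_tokens])

def generate_feedback(paper_content: str) -> str:
    """Ask GPT-4 for structured feedback in the four sections described above."""
    prompt = (
        "You are given the text of a scientific paper. Write structured review "
        "feedback with four sections: significance and novelty, potential reasons "
        "for acceptance, potential reasons for rejection, and suggestions for "
        "improvement.\n\n"
        f"Paper content:\n{truncate_to_tokens(paper_content)}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```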
Prompt template used with GPT-4 for extractive text summarization of the comment points in LLM and human feedback. The output is structured in JSON (JavaScript Object Notation) format, where each JSON key assigns an ID to a specific comment point and the corresponding value contains the content of that point.
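A minimal sketch of this extraction step, assuming the ID-to-text JSON schema described above and that the model returns bare JSON; the prompt wording is paraphrased rather than the authors' exact template.

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_comment_points(review_text: str) -> dict:
    """Ask GPT-4 to list the distinct comment points of a review as ID -> text JSON."""
    prompt = (
        "Extract the distinct comment points raised in the review below. "
        "Return a JSON object mapping a numeric ID to the text of each point.\n\n"
        f"Review:\n{review_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model complies and returns bare JSON; a robust version would
    # strip code fences and retry on parse errors.
    return json.loads(response.choices[0].message.content)
```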
Prompt template used with GPT-4 for semantic text matching, used to identify the comment points shared between two pieces of feedback. The input consists of two JSON lists of comment points obtained from the previous step. GPT-4 is then directed to identify shared points between the two lists and produce a new JSON object, where each key corresponds to a pair of matched point IDs and the corresponding value provides the rationale for the match.
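And a corresponding sketch of the matching stage, including the similarity ≥ 7 filter mentioned earlier; the paired-key output schema (an "llmID-humanID" key with a similarity score and a rationale) is an assumption for illustration.

```python
import json
from openai import OpenAI

client = OpenAI()

def match_comment_points(llm_points: dict, human_points: dict, threshold: int = 7) -> dict:
    """Ask GPT-4 to pair semantically similar comment points and keep strong matches."""
    prompt = (
        "Below are two JSON lists of review comment points. Identify pairs that make "
        "the same point. Return a JSON object whose keys are 'llmID-humanID' and whose "
        "values contain a 'similarity' score from 1 to 10 and a short 'rationale'.\n\n"
        f"LLM comments:\n{json.dumps(llm_points)}\n\n"
        f"Human comments:\n{json.dumps(human_points)}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    matches = json.loads(response.choices[0].message.content)
    # Keep only pairs at or above the similarity threshold (>= 7 in the study).
    return {pair: info for pair, info in matches.items() if info.get("similarity", 0) >= threshold}
```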

Summary

  • Large language models (LLMs) can potentially provide feedback on research papers.
  • Due to the increase in scientific production and specialization, traditional scientific feedback mechanisms are becoming more challenging.
  • LLM-generated feedback was evaluated through two large-scale studies comparing it to human peer reviewer feedback.
  • The overlap between the points raised by GPT-4 and those raised by human reviewers is comparable to the overlap between two human reviewers.
  • In a user study, more than half of researchers found GPT-4 feedback to be helpful.
  • Limitations of LLM-generated feedback include focusing on specific aspects and struggling with in-depth criticism.
  • LLMs and human feedback can complement each other.
  • There is an urgent need for scalable and efficient feedback mechanisms in scientific research.
  • LLMs have great potential but their use for scientific feedback remains largely unknown.
  • This study provides the first large-scale analysis of the use of LLMs to generate scholarly feedback.
  • A GPT-4-based pipeline was developed to generate structured feedback on various aspects of research articles.
  • An automated pipeline was developed using GPT-4 to generate feedback on scientific articles.
  • Two datasets (Nature Family Journals and ICLR) were used to assess the quality of LLMs' feedback.
  • We conducted a retrospective evaluation comparing LLM feedback to human feedback.
  • Extractive text summarization and semantic text matching were applied to identify shared comment points between LLM and human feedback.
  • The analysis found substantial overlap between LLM feedback and human-generated feedback.
  • The overlap between LLM feedback and human feedback was comparable to the overlap between two human reviewers.
  • The results were consistent across journals and across paper decision outcomes; the overlap between LLM and human feedback comments was further analyzed by decision outcome in the ICLR dataset.
  • There was an average of 30.63% overlap between LLM feedback and human feedback comments on papers accepted through oral presentations.
  • The average overlap increased to 32.12% for papers accepted through spotlight presentation and to 47.09% for rejected papers.
  • Similar trends were observed in the overlap between two human reviewers.
  • Rejected papers may have more specific problems or flaws that both human reviewers and the LLM can consistently identify.
  • LLM feedback can be constructive for papers that require significant revisions.
  • LLM feedback is article-specific, not general.
  • LLMs are more likely to identify common problems recognized by multiple human reviewers.
  • LLMs are aligned with human perspectives on big or important issues.
  • Earlier comments in human feedback are more likely to coincide with LLM comments.
  • LLM feedback emphasizes certain aspects more than humans, such as the implications of the research and requesting experiments on more data sets.
  • Human-AI collaboration can provide benefits by combining the aspects emphasized by the LLM with those highlighted by human reviewers.
  • A survey study was conducted on researchers to evaluate the utility and performance of LLM-generated scientific feedback. The approach is subject to self-selection biases.
  • The data provides researchers with valuable information and subjective perspectives.
  • User research results show a significant overlap between LLM feedback and human feedback.
  • The feedback generated by the LLM is considered useful by the majority of participants.
  • LLM feedback is rated as less specific than feedback from some human reviewers but more specific than feedback from others.
  • Perceptions of overlap and helpfulness are consistent across various demographic groups.
  • Participants express their desire to reuse the system and believe in its potential for improvement.
  • LLMs can generate novel feedback points that human reviewers do not mention.
  • Limitations of LLM feedback include difficulty generating specific and actionable feedback.
  • LLM feedback can be a valuable resource for authors looking for constructive feedback and suggestions.
  • Feedback from LLMs can be particularly useful for researchers who do not have access to timely quality feedback mechanisms.
  • The developed framework can be used by authors to check and improve their own work in a timely manner.
  • LLM feedback is useful for people with different educational backgrounds and publishing experiences.
  • Expert human feedback will remain essential to rigorous scientific evaluation.
  • Feedback from LLMs has limitations and can feel generic to authors; it should primarily be used by researchers to identify areas for improvement in their papers before formal submission.
  • Expert human reviewers should engage with papers in-depth and provide independent evaluation without relying on LLM's feedback.
  • Automatically generating reviews without thoroughly reading the manuscript undermines the rigorous evaluation process.
  • LLMs and generative AI have the potential to increase productivity and creativity and to facilitate scientific discovery when implemented responsibly.
  • The results of the research are based on a specific example of scientific feedback using the GPT-4 model.
  • The system only leverages GPT-4's zero-shot capability, without fine-tuning on additional datasets.
  • Future work could explore other LLMs, conduct more sophisticated prompt engineering, and use labelled datasets for fine-tuning.
  • The study used Nature family data and ICLR data, but future studies need to evaluate the framework more broadly.
  • User research is limited in scope and suffers from a self-selection problem.
  • The version of GPT-4 used does not interpret visual data such as tables, graphs, and figures.
  • Future studies could investigate integrating visual LLMs or dedicated modules for comprehensive scientific feedback.
  • Future studies could investigate to what extent the proposed approach can help identify and correct errors in scientific articles.
  • It is crucial to understand the limitations and challenges associated with error detection and correction by LLM.
  • The scope of scientific articles evaluated may be expanded to include articles in languages other than English or for non-native English speakers.
  • The dataset includes papers from 15 Nature family journals and papers from the International Conference on Learning Representations (ICLR).
  • The Nature dataset contains 3,096 accepted articles and 8,745 reviews, while the ICLR dataset contains 1,709 articles and 6,506 reviews.
  • ICLR PDFs and the associated reviews were retrieved using the OpenReview API.
  • A pipeline was prototyped to generate scientific feedback using OpenAI’s GPT-4.
  • The input to the system was an academic article in PDF format, parsed using the ScienceBeam PDF parser.
  • The first 6,500 tokens of the article were used to construct the GPT-4 prompt.
  • Specific instructions were provided for creating four feedback sections: significance and novelty, potential reasons for acceptance, potential reasons for rejection, and suggestions for improvement.
  • A two-stage comment matching pipeline was developed to evaluate the overlap between LLM feedback and human feedback.
  • Extractive text summarization was used to extract comment points from the feedback.
  • Semantic text matching was performed to match comment points from the LLM feedback with those from human feedback.
  • Matches rated “relevant” or above were retained for subsequent analysis.
  • The accuracy of the extractive summarization stage was verified through human verification.
  • The semantic text-matching phase showed good inter-annotator agreement and reliability.
  • The paper-specificity of LLM feedback was evaluated by comparing against shuffled LLM feedback generated for other papers.
  • Pairwise overlap between LLM feedback and human feedback, and between two human reviewers, was evaluated using the hit rate (a sketch of this metric follows this list).
  • The results showed similar hit rates for both comparisons, suggesting that LLM feedback is generally not generic.
  • The study examines the robustness of the results using different set-overlap metrics.
  • A compiled annotation scheme of 11 key aspects is used to analyze comment aspects in human and LLM feedback.
  • A random sample of 500 articles from the ICLR dataset is selected for annotation.
  • To ensure reliability, two researchers carry out the annotations.
  • A prospective user study and survey are conducted to verify the effectiveness of leveraging LLM for scientific feedback.
  • Users upload a research article, receive the generated review, and fill out a survey.
  • Participants are recruited through relevant institute mailing lists and by contacting authors of preprints in computer science and computational biology.
  • The study was approved by Stanford University's Institutional Review Board.
  • The example article used in the demonstration presents the modality gap phenomenon and discusses multimodal models.
  • The review draft includes sections on significance and novelty, possible reasons for acceptance and rejection, and suggestions for improvement.
  • Artificial intelligence tools have been developed for various tasks in the scientific publication process.
  • Previous studies have investigated the effectiveness of ChatGPT and GPT-4 in peer review and analysis of published articles.
  • The article presents a new approach to multimodal contrastive representation learning.
  • It introduces the concept of the modality gap, a geometric phenomenon in the representation space.
  • The authors propose a multimodal model that maps inputs from different data modalities into a common representation space, providing a new approach to addressing the modality gap.
  • The paper provides empirical evidence of the modality gap phenomenon, supported by histograms of cosine similarity between embeddings.
  • The proposed multimodal model shows promising results in reducing the modality gap and improving representation learning across different modalities.
  • The study contributes to a better understanding of the underlying mechanisms by providing insights into the effects of nonlinear activation functions on the cone effect.
  • The paper lacks a comprehensive comparison with existing methods in multimodal contrastive representation learning, limiting the assessment of its superiority.
  • The paper does not provide adequate experimental setup and reproducibility details, making it difficult for other researchers to replicate and confirm the findings.
  • The ethical implications of the research, such as privacy and data security concerns, are not adequately discussed, raising potential concerns about the societal impact of the proposed approach.
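As noted above, pairwise overlap is summarized with a hit rate. A minimal sketch, assuming the hit rate is the fraction of comment points in one piece of feedback that have at least one retained match (using the matched-pair format from the matching sketch earlier):

```python
def hit_rate(source_points: dict, matched_pairs: dict) -> float:
    """Fraction of source comment IDs that appear in at least one retained match."""
    matched_ids = {pair_key.split("-")[0] for pair_key in matched_pairs}
    return len(matched_ids & set(source_points)) / max(len(source_points), 1)
```

For example, under this assumed definition, if GPT-4 raised 10 comment points and 3 of them were matched to a reviewer's comments, the hit rate would be 0.3.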

Resources

[1] “Can large language models provide useful feedback on research papers? A large-scale empirical analysis.” arXiv:2310.01783.
