Self-Refine: Iterative Improvement with Self-Feedback


https://cbarkinozer.medium.com/self-refine-kendi-kendine-geri-bildirimle-yinelemeli-i%CC%87yile%C5%9Ftirme-fff721e79722


Abstract

Like humans, large language models (LLMs) do not always produce the best output on the first try. Motivated by the way people revise their own writing, we introduce SELF-REFINE, an approach for improving initial LLM outputs through iterative feedback and refinement. The main idea is to first generate an initial output with an LLM; the same LLM then provides feedback on its own output and uses that feedback to refine the output iteratively. SELF-REFINE requires no supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as generator, refiner, and feedback provider. We evaluate SELF-REFINE on 7 different tasks, ranging from dialogue response generation to mathematical reasoning, using state-of-the-art LLMs (GPT-3.5 and GPT-4). Across all evaluated tasks, outputs generated with SELF-REFINE are preferred by both humans and automatic metrics over outputs generated with the same LLM using conventional one-step generation, improving task performance by ∼20% absolute on average. Our work shows that even state-of-the-art LLMs such as GPT-4 can be further improved at test time using this simple, standalone approach.
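
To make the loop concrete, here is a minimal Python sketch of the generate → feedback → refine cycle described above. The `llm` callable, the prompt strings, and the stopping convention are illustrative assumptions, not the authors' actual prompts or API.

```python
def self_refine(llm, task_input, max_iters=4):
    """Minimal sketch of the SELF-REFINE loop: a single LLM acts as generator,
    feedback provider, and refiner. `llm` is assumed to be a callable that maps
    a prompt string to a completion string (hypothetical interface)."""
    # Step 0: initial generation (prompt wording is a placeholder).
    output = llm(f"Task: {task_input}\nWrite an initial answer:")

    for _ in range(max_iters):
        # Step 1: the same model critiques its own output.
        fb = llm(
            f"Task: {task_input}\nAnswer: {output}\n"
            "Give specific, actionable feedback on how to improve this answer, "
            "or say 'no further improvement needed' if it is already good:"
        )
        # Stopping condition: the feedback itself can signal that refinement
        # should stop (an assumed convention, not the paper's exact mechanism).
        if "no further improvement" in fb.lower():
            break
        # Step 2: the same model refines the output using its own feedback.
        output = llm(
            f"Task: {task_input}\nAnswer: {output}\nFeedback: {fb}\n"
            "Rewrite the answer, applying the feedback:"
        )
    return output
```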

Images

Figure 1: Given an input (0), SELF-REFINE starts by generating an output and passing it back to the same model M to receive feedback (1). The feedback is passed back to M, which refines the previously generated output (2). Steps (1) and (2) are repeated until a stopping condition is met. SELF-REFINE is instantiated with a language model such as GPT-3.5 and requires no human assistance.
Figure 2: SELF-REFINE examples: an initial output is generated by the base LLM and then passed back to the same LLM to obtain feedback and refine the output. The top row illustrates this for dialogue response generation, where applying the feedback turns the initial response into a more engaging one that also acknowledges the user. The bottom row illustrates it for code optimization, where applying the feedback makes the code more efficient.
Figure 3: The SELF-REFINE algorithm.
Table 1: SELF-REFINE results on various tasks using GPT-3.5, ChatGPT, and GPT-4 as the base LLMs. SELF-REFINE consistently improves over the base LLM.
Table 2: Asking the model to generate generic feedback (or having it generate no feedback at all) lowers the scores, demonstrating the importance of SELF-REFINE's feedback step. These experiments were carried out with ChatGPT (Code Optimization and Sentiment Reversal) and GPT-3.5 (Acronym Generation).
Figure 4: Left: score improvement as a function of iteration. Early iterations improve output quality significantly, and scores generally keep improving with further iterations. Right: performance improvement over SELF-REFINE iterations. Most of the gains (∆) come in the first iterations for both Code Optimization and Sentiment Reversal. Numbers are averaged over ChatGPT, GPT-3.5, and GPT-4.
Figure 5: Comparison of code generated by the baseline of Madaan et al. (2023) (left) and the output after applying SELF-REFINE (right). The baseline's initial code, nearly identical to the slower input program, fails to improve efficiency and merely changes how the input is read. SELF-REFINE first generates feedback diagnosing that "this code is slow because it uses six nested loops to iterate through all possible combinations of coins to pay the amount" and suggesting that "a more efficient approach would be…". SELF-REFINE then uses this feedback to generate the revised code (right), reducing the time complexity to O(amount × coins). A sketch of such a dynamic-programming approach follows these captions.
Table 3: Comparison of SELF-REFINE with prior closely related refinement approaches.
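
As a companion to the Figure 5 caption, the following is a small sketch of a dynamic-programming coin-change counter with the O(amount × coins) time complexity mentioned there. It is not the code from the paper, only an illustration of the kind of refinement the feedback suggests: replacing nested loops over all coin combinations with a single DP table.

```python
def count_ways(amount: int, coins: list[int]) -> int:
    """Count the ways to pay `amount` with the given coin denominations.
    One DP table replaces the nested loops over all combinations, giving
    O(amount * len(coins)) time and O(amount) space."""
    ways = [0] * (amount + 1)
    ways[0] = 1  # exactly one way to pay 0: use no coins
    for coin in coins:
        for total in range(coin, amount + 1):
            ways[total] += ways[total - coin]
    return ways[amount]


# Example: 4 ways to pay 5 with coins {1, 2, 5}: 1+1+1+1+1, 1+1+1+2, 1+2+2, 5
print(count_ways(5, [1, 2, 5]))  # -> 4
```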

Summary

  • We describe a method called SELF-REFINE for improving the output of large language models (LLMs) through iterative feedback and refinement. The approach generates an initial output with an LLM, then uses the same LLM to provide feedback on that output and refine it iteratively.
  • This process requires no additional training or supervision and can be applied to a variety of tasks.
  • The authors evaluate SELF-REFINE on 7 different tasks and show that it improves task performance by an average of ∼20% over conventional single-step generation. The results show that even state-of-the-art LLMs can be further improved with this simple approach.
  • The text also discusses the benefits of SELF-REFINE when generating sentences that must contain a given set of concepts (constrained generation). SELF-REFINE yields large gains on preference-based tasks such as Conversational Response Generation, Sentiment Reversal, and Acronym Generation.
  • The quality of feedback plays a crucial role in SELF-REFINE: specific, actionable feedback yields better results than generic feedback (see the prompt sketch after this list). Multiple iterations of feedback and refinement improve output quality, with diminishing gains in later iterations. SELF-REFINE also outperforms simply generating multiple initial outputs without refinement. Weaker models may struggle with the refinement step and fail to follow the instructions. Qualitative analysis demonstrates the effectiveness of SELF-REFINE in producing improved outputs.
  • The researchers analyzed a total of 70 examples, focusing on Code Optimization and Mathematical Reasoning tasks.
  • They found that the feedback provided was mostly actionable and helped identify problematic aspects of the original generation.
  • When SELF-REFINE failed to improve an output, this was usually due to faulty feedback rather than faulty refinements. Accurate and helpful feedback played a vital role in successful cases, leading to precise corrections during refinement. The refiner was able to correct problems even when the feedback was partially inaccurate, demonstrating resilience to suboptimal feedback. The study also demonstrated the potential of SELF-REFINE in real-world tasks such as website creation.
  • SELF-REFINE has been demonstrated on a variety of tasks, including sentiment reversal, dialogue response generation, code optimization, mathematical reasoning, acronym generation, and constrained generation.
  • The results show that SELF-REFINE is effective at improving performance on these tasks, with statistically significant gains observed across datasets. The experiments include human annotators, who evaluate output quality and compare SELF-REFINE's outputs to the baselines.
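
To illustrate the feedback-quality point from the bullets above, here is a small sketch contrasting a generic feedback prompt with a specific, actionable one. The templates are hypothetical and are not taken from the paper's prompts; the point is only that the specific variant names what to look for and asks how to fix it.

```python
# Hypothetical prompt templates; the paper's actual prompts differ per task.
GENERIC_FEEDBACK_PROMPT = (
    "Here is an answer:\n{output}\n"
    "Is this good? Give some feedback."
)

SPECIFIC_FEEDBACK_PROMPT = (
    "Here is a program:\n{output}\n"
    "Point out concrete inefficiencies (e.g. redundant loops, repeated work, "
    "avoidable quadratic passes) and, for each one, state exactly how to fix it."
)


def make_feedback_prompt(output: str, specific: bool = True) -> str:
    """Build a feedback prompt; the ablation summarized above found specific,
    actionable feedback more effective than generic feedback."""
    template = SPECIFIC_FEEDBACK_PROMPT if specific else GENERIC_FEEDBACK_PROMPT
    return template.format(output=output)
```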

Conclusion

We present SELF-REFINE, a new approach that allows large language models to iteratively provide self-feedback and refine their own outputs. SELF-REFINE operates within a single LLM and requires neither additional training data nor reinforcement learning. We demonstrate the simplicity and ease of use of SELF-REFINE across a wide variety of tasks. By demonstrating its potential on these tasks, our research contributes to the ongoing development of large language models and to reducing the cost of human creative processes in real-world settings. We hope our iterative approach will encourage further research in this area. To this end, we make all our code, data, and prompts available anonymously at https://selfrefine.info/.

Resources

[1] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark (25 May 2023). Self-Refine: Iterative Refinement with Self-Feedback. https://doi.org/10.48550/arXiv.2303.17651 (https://arxiv.org/abs/2303.17651)

