AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

https://cbarkinozer.medium.com/autogenin-%C3%A7ok-ajanl%C4%B1-konu%C5%9Fma-yap%C4%B1s%C4%B1yla-yeni-nesil-llm-uygulamalar%C4%B1-1695793b5a92 

This article looks at the “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation” paper.


With this paper, Microsoft Research introduced new ideas about multi-agent communication patterns. This is one of the most significant recent papers on autonomous agents.

Abstract

AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviours. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

Summary

  • AutoGen is an open-source framework that enables developers to build LLM (large language model) applications via multiple agents that can converse with each other to accomplish tasks.
  • AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools.
  • Developers can flexibly define agent interaction behaviours using natural language or computer code for programming conversation patterns.
  • AutoGen serves as a generic framework for building diverse applications of various complexities and LLM capacities.
  • Empirical studies demonstrate the effectiveness of the framework across many example applications, in domains including mathematics, coding, question answering, operations research, online decision-making, and entertainment.

Auto-Reply Mechanism in AutoGen

  • The auto-reply mechanism in AutoGen is a decentralized, modular, and unified way to define workflows, allowing the conversation to proceed automatically (Eleti et al., 2023).
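The auto-reply idea can be illustrated with a minimal, library-free sketch. All names below are illustrative stand-ins, not the actual AutoGen API: each agent owns a reply function that fires automatically on every received message, so the chat advances without central orchestration.

```python
# Minimal sketch of the auto-reply mechanism (illustrative names only):
# each agent replies automatically to incoming messages until a
# termination keyword appears in a reply.

class Agent:
    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn  # invoked automatically on each received message

    def generate_reply(self, message):
        return self.reply_fn(message)

def run_chat(sender, receiver, message, max_turns=8):
    """Alternate auto-replies between two agents until one says TERMINATE."""
    transcript = [(sender.name, message)]
    for _ in range(max_turns):
        reply = receiver.generate_reply(message)
        transcript.append((receiver.name, reply))
        if "TERMINATE" in reply:
            break
        sender, receiver, message = receiver, sender, reply
    return transcript

# Toy "assistant" that counts down and terminates at zero; the "user"
# proxy simply echoes, keeping the loop alive.
def assistant_reply(msg):
    n = int(msg.split()[-1])
    return "TERMINATE" if n <= 0 else f"count {n - 1}"

assistant = Agent("assistant", assistant_reply)
user = Agent("user", lambda msg: msg)

log = run_chat(user, assistant, "count 3")
print(log[-1])  # → ('assistant', 'TERMINATE')
```

The point of the sketch is the decentralization: `run_chat` knows nothing about what the agents do, and the workflow emerges from each agent's own reply behavior.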

Control in AutoGen

  • AutoGen allows control via programming and natural language, using LLMs (large language models) to control the conversation flow with natural language prompts.
  • Programming-language control is also possible in AutoGen, with Python code used to specify termination conditions, human input mode, and tool execution logic.
  • AutoGen supports flexible control transition between natural and programming language, with the transition from code to natural-language control achieved via LLM inference and the transition from natural language to code control via LLM-proposed function calls.
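These two control modes can be sketched in miniature. The names and the mocked function-call JSON below are assumptions for illustration, not AutoGen's real API: code-side control is an ordinary Python predicate, and the natural-language-to-code transition is a named function call the LLM proposes and the framework dispatches.

```python
# Illustrative sketch (not the real AutoGen API) of the two control modes:
# programming-language control via a plain Python termination predicate,
# and the natural-language-to-code transition via an LLM-proposed
# function call dispatched to a registered tool.

def is_termination_msg(message: str) -> bool:
    # Code-side control: ordinary Python decides when the chat stops.
    return message.rstrip().endswith("TERMINATE")

# Tools the LLM is allowed to invoke by name.
TOOLS = {"sqrt": lambda x: x ** 0.5}

def dispatch(function_call: dict):
    """Execute a proposed call such as {"name": "sqrt", "arguments": {"x": 9.0}}."""
    fn = TOOLS[function_call["name"]]
    return fn(**function_call["arguments"])

# In practice this JSON comes from the LLM's function-calling output;
# here it is mocked.
proposed = {"name": "sqrt", "arguments": {"x": 9.0}}
result = dispatch(proposed)
print(result)                                     # → 3.0
print(is_termination_msg("All done. TERMINATE"))  # → True
```

Transitioning the other way, from code back to natural language, amounts to handing the tool's return value back to the LLM as a message for further inference.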

Conversation Programming Paradigm

  • AutoGen supports multi-agent conversations of diverse patterns, including static conversation with predefined flow and dynamic conversation flows with multiple agents.
  • Customized generate-reply functions and function calls are used to achieve dynamic conversation flows; AutoGen also supports more complex dynamic group chat via the built-in GroupChatManager.

Applications of AutoGen

  • AutoGen has been used to develop high-performance multi-agent applications, including math problem solving, retrieval-augmented code generation and question answering, decision-making in real-world environments, and multi-agent coding.

Math Problem Solving with AutoGen

  • AutoGen has been used to build a system for autonomous math problem solving, with built-in agents yielding better performance compared to alternative approaches, including open-source methods and commercial products.
  • AutoGen has also been used to demonstrate a human-in-the-loop problem-solving process, with the system effectively incorporating human inputs to solve challenging problems that cannot be solved without humans.

Retrieval-Augmented Code Generation and Question Answering

  • AutoGen has been used to build a Retrieval-Augmented Generation (RAG) system named Retrieval-augmented Chat, which consists of a Retrieval-augmented User Proxy agent and a Retrieval-augmented Assistant agent.
  • Retrieval-augmented Chat has been evaluated in both question-answering and code-generation scenarios, with the interactive retrieval mechanism playing a non-trivial role in the process.

Decision-Making in Text World Environments

  • AutoGen has been used to develop effective applications that involve interactive or online decision-making, with a two-agent system implemented to solve tasks from the ALFWorld benchmark.
  • A grounding agent has been introduced to supply crucial commonsense knowledge, significantly enhancing the system’s ability to avoid getting entangled in error loops.

Multi-Agent Coding

  • AutoGen has been used to build a multi-agent coding system based on OptiGuide, with the core workflow code reduced from over 430 lines to 100 lines, a significant productivity improvement.
  • The multi-agent design has been shown to boost the F-1 score in identifying unsafe code by 8% and 35% in the paper’s two evaluation settings.
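The Commander-Writer-Safeguard division of labor can be sketched as a tiny pipeline. This is our own simplification with made-up heuristics, not OptiGuide's or AutoGen's actual code: the Writer proposes code, the Safeguard screens it, and the Commander executes only approved code.

```python
# Toy sketch of the Commander-Writer-Safeguard loop (invented heuristics,
# for illustration only): unsafe code is rejected before execution.
import contextlib
import io

def writer(question: str) -> str:
    # Stand-in for an LLM that turns a question into code.
    return "import os; os.remove('db')" if "delete" in question else "print(2 + 2)"

def safeguard(code: str) -> bool:
    # Stand-in safety check: a real Safeguard uses an LLM plus policies.
    banned = ("os.remove", "subprocess", "eval(")
    return not any(token in code for token in banned)

def commander(question: str) -> str:
    code = writer(question)
    if not safeguard(code):
        return "DANGER: code rejected"
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

print(commander("what is 2 + 2?"))       # → 4
print(commander("delete the database"))  # → DANGER: code rejected
```

Separating the writing and vetting roles is what makes the F-1 gains on unsafe-code detection possible: the Safeguard judges code it did not author.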

Role-play Prompts in Task Consideration

  • In a pilot study of 12 manually crafted complex tasks, utilizing a role-play prompt led to more effective consideration of conversation context and role alignment, resulting in a higher success rate and fewer LLM calls (Hong et al., 2023).

Conversational Chess Using AutoGen

  • Conversational Chess is a natural language interface game featuring built-in agents for players and a board agent to provide information and validate moves based on standard rules.
  • AutoGen enables two essential features in Conversational Chess: natural, flexible, and engaging game dynamics, and grounding to maintain game integrity by checking each proposed move for legality.
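The board agent's grounding role boils down to vetting every proposed move before it takes effect. A deliberately tiny sketch, covering only single-step white-pawn pushes with an invented move notation (a real board agent implements the full rules of chess):

```python
# Toy legality check in the spirit of the board agent: a proposed move
# only enters the game if it passes validation. Simplified to single-step
# white pawn moves; notation like "e2-e3" is our own convention.

def legal_pawn_move(move: str, occupied: set) -> bool:
    src, dst = move.split("-")                   # e.g. "e2-e3"
    same_file = src[0] == dst[0]                 # pawns push straight ahead
    one_step = int(dst[1]) - int(src[1]) == 1    # exactly one rank forward
    return same_file and one_step and dst not in occupied

board = {"e4"}  # squares already occupied
print(legal_pawn_move("e2-e3", board))  # → True
print(legal_pawn_move("e3-e4", board))  # → False (square occupied)
print(legal_pawn_move("e2-f3", board))  # → False (changes file)
```

Grounding every move this way is what keeps an LLM player from drifting into illegal but plausible-sounding positions.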

AutoGen: A Unified Conversation Interface

  • AutoGen is an open-source library that incorporates the paradigms of conversable agents and conversation programming, utilizing capable agents well-suited for multi-agent cooperation.
  • It features a unified conversation interface among the agents, along with an auto-reply mechanism. Together, these establish an agent-interaction interface that capitalizes on the strengths of chat-optimized LLMs while accommodating a wide range of applications.

Recommendations for Using Agents in AutoGen

  • Consider using built-in agents first, such as AssistantAgent and UserProxyAgent, and customize them based on human input mode, termination condition, code execution configuration, and LLM configuration.
  • Start with a simple conversation topology, such as the two-agent chat or group chat setup, and try to reuse built-in reply methods based on LLM, tool, or human before implementing a custom reply method.
  • When developing a new application with UserProxyAgent, start with humans always in the loop, and consider using other libraries/packages when necessary for specific applications or tasks.

Future Work

  • Designing optimal multi-agent workflows, creating highly capable agents, and enabling scale, safety, and human agency are important research directions for AutoGen.
  • Building fail-safes against cascading failures, mitigating reward hacking, out-of-control and undesired behaviours, maintaining effective human oversight, and understanding the appropriate level and pattern of human involvement are crucial for the safe and ethical use of AutoGen agents.

Instructions for Solving Tasks

  • Follow the language skill or code-based approach to solve tasks, depending on the requirement.
  • If using code, provide the full code instead of partial code or code changes.
  • Include the script type in the code block and use the ‘print’ function for the output.
  • Check the execution result and fix any errors before re-outputting the code.
  • Analyze the problem and revisit assumptions if the task is not solved even after the code is executed successfully.
  • Include verifiable evidence in the response if possible.
  • Reply “TERMINATE” at the end when everything is done.
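The execute-and-check steps above can be sketched as a minimal executor. This is a simplification for illustration; AutoGen's actual executor runs code in a separate process or container, not via in-process `exec`:

```python
# Sketch of the executor side of these instructions: pull the fenced code
# block out of a reply, run it, and capture the printed output so it can
# be checked (and fed back to the LLM on error).
import contextlib
import io
import re

def extract_code(reply: str) -> str:
    """Return the body of the first fenced code block in a reply."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else ""

def run_code(code: str):
    """Execute code, capturing stdout; return (exit_status, output)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return 0, buf.getvalue()
    except Exception as e:
        return 1, f"{type(e).__name__}: {e}"

reply = "Here is the full script:\n```python\nprint(6 * 7)\n```"
status, output = run_code(extract_code(reply))
print(status, output)  # status → 0, output → "42\n"
```

A nonzero status with the error text is exactly what gets sent back so the assistant can fix the code and re-output it.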

Evaluation of Math Problem-Solving Systems

  • AutoGen achieves the highest problem-solving success rate among the compared methods.
  • ChatGPT+Code Interpreter and ChatGPT+Plugin struggle to solve certain problems.
  • AutoGPT fails on problems due to code execution issues.
  • LangChain agent fails on problems, producing incorrect answers in all trials.

Human-in-the-loop Problem Solving

  • AutoGen consistently solved the problem across all three trials.
  • ChatGPT+Code Interpreter and ChatGPT+Plugin managed to solve the problem in two out of three trials.
  • AutoGPT was unable to yield a correct solution in any of the trials.

Multi-User Problem Solving

  • AutoGen can be used to construct a system involving multiple real users for collectively solving a problem with the assistance of LLMs.
  • A student interacts with an LLM assistant to address problems, and the LLM automatically resorts to the expert when necessary.
  • The expert is supposed to respond to the problem statement or the request to verify the solution to a problem.
  • After the conversation between the expert and the expert’s assistant, the final message is sent back to the student assistant as the response to the initial message.

Retrieval-Augmented Chat Workflow

  • The Retrieval-Augmented User Proxy retrieves document chunks based on the embedding similarity and sends them along with the question to the Retrieval-Augmented Assistant.
  • The Retrieval-Augmented Assistant generates code or text as answers based on the question and context provided.
  • If the LLM is unable to produce a satisfactory response, it replies with “Update Context” to the Retrieval-Augmented User Proxy.
  • If the response includes code blocks, the Retrieval-Augmented User Proxy executes the code and sends the output as feedback.
  • If human input solicitation is enabled, individuals can proactively send any feedback, including “Update Context”, to the Retrieval-Augmented Assistant.
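The retrieve, answer, “Update Context” loop can be mimicked with a word-overlap toy. Everything here is illustrative: the documents and query terms are made up, and word overlap stands in for the embedding similarity and vector database the real system uses.

```python
# Toy sketch of the Retrieval-Augmented Chat loop: rank chunks by a crude
# similarity score, try to answer from the top-k chunks, and widen the
# retrieval window whenever the assistant replies "UPDATE CONTEXT".

DOCS = [
    "flaml performs classification tasks",
    "spark enables parallel training",
    "autogen agents converse to solve tasks",
]

QUERY_TERMS = {"flaml", "classification", "spark", "parallel", "training"}

def score(doc):
    # Word-overlap stand-in for embedding cosine similarity.
    return len(QUERY_TERMS & set(doc.split()))

def retrieve(k):
    return sorted(DOCS, key=score, reverse=True)[:k]

def answer(chunks):
    # Stand-in assistant: answers only if the chunks cover every query term.
    covered = set().union(*(set(c.split()) for c in chunks))
    if QUERY_TERMS <= covered:
        return "answer grounded in: " + "; ".join(chunks)
    return "UPDATE CONTEXT"

k, attempts = 1, []
reply = answer(retrieve(k))
attempts.append(reply)
while reply == "UPDATE CONTEXT":
    k += 1                      # interactive retrieval: fetch more context
    reply = answer(retrieve(k))
    attempts.append(reply)
print(attempts)
```

The first retrieval misses part of the needed context, triggers “UPDATE CONTEXT”, and the second, wider retrieval succeeds, which is the interactive-retrieval behavior the evaluation credits.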

Limitations of Math Problem-Solving Systems

  • BabyAGI, CAMEL, and MetaGPT are not suitable choices for solving math problems out of the box.
  • MetaGPT begins developing software to address the problem, but most of the time it does not actually solve it.
  • The LLM produces code without a print statement, so the result is never printed.
  • The return from Wolfram Alpha contains two simplified results, including the correct answer, but GPT-4 always chooses the wrong one.
  • LangChain gives three different wrong answers due to calculation errors.

Scenario 1: Evaluation of Natural Questions QA Dataset

  • Retrieval-Augmented Chat’s end-to-end question-answering performance is evaluated using the Natural Questions dataset (Kwiatkowski et al, 2019).
  • 5,332 non-redundant context documents and 6,775 queries are collected from HuggingFace.
  • A document collection is created based on the entire context corpus and stored in the vector database.
  • The system utilizes Retrieval-Augmented Chat to answer the questions.
  • The advantages of the interactive retrieval feature are demonstrated through an example from the NQ dataset.
  • The LLM assistant (GPT-3.5-turbo) replies “Sorry, I cannot find any information about who carried the USA flag in the opening ceremony. UPDATE CONTEXT.” when it can’t answer the question.
  • The agent can generate the correct answer to the question after the context is updated.
  • An experiment using the same prompt as in Adlakha et al. (2023) investigates the performance of AutoGen without (W/O) interactive retrieval.
  • The F1 score and Recall for the first 500 questions are 23.40% and 62.60%, respectively.
  • Approximately 19.4% of questions in the Natural Questions dataset trigger an “Update Context” operation, resulting in additional LLM calls.
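The F1 and Recall figures above are token-overlap metrics standard in extractive QA evaluation (as in SQuAD-style Natural Questions scoring). A minimal sketch of that computation, with a made-up prediction/gold pair:

```python
# Token-level F1 and Recall as used in extractive QA evaluation:
# count overlapping tokens between prediction and gold answer.
from collections import Counter

def token_f1_recall(prediction: str, gold: str):
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall

f1, recall = token_f1_recall("the team captain", "team captain")
print(round(f1, 2), round(recall, 2))  # → 0.8 1.0
```

Recall can exceed F1, as in the reported 62.60% vs. 23.40%, when predictions contain the gold tokens plus many extra tokens that hurt precision.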

Scenario 2: Code Generation Leveraging Latest APIs from the Codebase

  • The question is “How can I use FLAML to perform a classification task and use Spark for parallel training? Train for 30 seconds and force cancel jobs if the time limit is reached.”
  • The original GPT-4 model is unable to generate the correct code due to its lack of knowledge regarding Spark-related APIs.
  • With Retrieval-Augmented Chat, the latest reference documents are provided as context.
  • GPT-4 generates the correct code blocks by setting use_spark and force_cancel to True.

Observation: ALFWorld (Shridhar et al, 2021)

  • ALFWorld is a synthetic language-based interactive decision-making task that simulates real-world household scenes.
  • The agent needs to extract patterns from the few-shot examples provided and combine them with the agent’s general knowledge of household environments to fully understand task rules.

Figures

Figure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left) AutoGen agents are conversable, customizable, and can be based on LLMs, tools, humans, or even a combination of them. (Top-middle) Agents can converse to solve tasks. (Right) They can form a chat, potentially with humans in the loop. (Bottom-middle) The framework supports flexible conversation patterns.
Figure 2: Illustration of how to use AutoGen to program a multi-agent conversation. The top subfigure illustrates the built-in agents provided by AutoGen, which have unified conversation interfaces and can be customized. The middle sub-figure shows an example of using AutoGen to develop a two-agent system with a custom reply function. The bottom sub-figure illustrates the resulting automated agent chat from the two-agent system during program execution.
Figure 3: Six examples of diverse applications built using AutoGen. Their conversation patterns show AutoGen’s flexibility and power.
Figure 4: Performance on four applications A1-A4. (a) shows that AutoGen agents can be used out of the box to achieve the most competitive performance on math problem-solving tasks; (b) shows that AutoGen can be used to realize effective retrieval augmentation and realize a novel interactive retrieval feature to boost performance on Q&A tasks; (c) shows that AutoGen can be used to introduce a three-agent system with a grounding agent to improve performance on ALFWorld; (d) shows that a multi-agent design helps boost performance in coding tasks that need safeguards.
Figure 5: Default system message for the built-in assistant agent in AutoGen (v0.1.1). This is an example of conversation programming via natural language. It contains instructions of different types, including role play, control flow, output confinement, automation facilitation, and grounding.
Table 2: Qualitative evaluation of two math problems from the MATH dataset within the autonomous problem-solving scenario. Each LLM-based system is tested three times on each of the problems. This table reports the problem-solving correctness and summarizes the reasons for failure.
Figure 6: Examples of three settings utilized to solve math problems using AutoGen: (Gray) Enables a workflow where a student collaborates with an assistant agent to solve problems, either autonomously or in a human-in-the-loop mode. (Gray + Orange) Facilitates a more sophisticated workflow wherein the assistant, on the fly, can engage another user termed “expert”, who is in the loop with their assistant agent, to aid in problem-solving if its solutions are not satisfactory.
Figure 7: Overview of Retrieval-augmented Chat, which involves two agents: a Retrieval-augmented User Proxy and a Retrieval-augmented Assistant. Given a set of documents, the Retrieval-augmented User Proxy first automatically processes them: it splits and chunks the documents and stores them in a vector database. Then, for a given user input, it retrieves relevant chunks as context and sends them to the Retrieval-augmented Assistant, which uses an LLM to generate code or text to answer the question. The agents converse until they find a satisfactory answer.
Figure 8: Retrieval-augmented Chat without (W/O) and with (W/) interactive retrieval.
Figure 9: We use AutoGen to solve tasks in the ALFWorld benchmark, which contains household tasks described in natural language. We propose two designs: a two-agent design where the assistant agent suggests the next step and the Executor executes actions and provides feedback, and a three-agent design that adds a grounding agent supplying commonsense facts to the executor when needed.
Figure 10: Comparison of results from two designs: (a) Two-agent design which consists of an assistant and an executor, (b) Three-agent design which adds a grounding agent that serves as a knowledge source. For simplicity, we omit the in-context examples and part of the exploration trajectory and only show parts contributing to the failure/success of the attempt.
Table 3: Comparisons between ReAct and the two variants of ALFChat on the ALFWorld benchmark. For each task, we report the success rate out of 3 attempts. Success rate denotes the number of tasks completed by the agent divided by the total number of tasks. The results show that adding a grounding agent significantly improves the task success rate in ALFChat.
Figure 11: Our re-implementation of OptiGuide with AutoGen streamlining agents’ interactions. The Commander receives user questions (e.g., What if we prohibit shipping from supplier 1 to roastery 2?) and coordinates with the Writer and Safeguard. The Writer crafts the code and interpretation, the Safeguard ensures safety (e.g., not leaking information, no malicious code), and the Commander executes the code. If issues arise, the process can repeat until resolved. Shaded circles represent steps that may be repeated multiple times.
Figure 12: A5: Dynamic Group Chat: Overview of how AutoGen enables dynamic group chats to solve tasks. The Manager agent, which is an instance of the GroupChatManager class, performs the following three steps: select a single speaker (in this case Bob), ask the speaker to respond, and broadcast the selected speaker’s message to all other agents.
Figure 13: Comparison of two-agent chat (a) and group chat (b) on a given task. The group chat resolves the task successfully with a smoother conversation, while the two-agent chat fails on the same task and ends with a repeated conversation.
Figure 14: A6: Conversational Chess: Our conversational chess application can support various scenarios, as each player can be an LLM-empowered AI, a human, or a hybrid of the two. Here, the board agent maintains the rules of the game and supports the players with information about the board. Players and the board agent all use natural language for communication.
Figure 15: Example conversations during a game involving two AI player agents and a board agent.
Figure 16: Comparison of two designs–(a) without a board agent, and (b) with a board agent–in Conversational Chess.
Figure 17: We use AutoGen to build MiniWobChat, which solves tasks in the MiniWob++ benchmark. MiniWobChat consists of two agents: an assistant agent and an executor agent. The assistant agent suggests actions to manipulate the browser while the executor executes the suggested actions and returns rewards/feedback. The assistant agent records the feedback and continues until the feedback indicates task success or failure.

Reference

https://arxiv.org/abs/2308.08155
