Achieving Self-Improvement in Agentic Systems with Skill Harvesting

December 3, 2024
May 14, 2024
Tamer Abuelsaad

Prasenjit Dey

Ravi Kokki

Co-Founder & CTO

Deepak Akkil

Introduction

Intelligent agents are transforming interactive software by significantly improving task management across diverse digital environments. Agent-driven (“agentic”) systems are typically designed with a set of predefined skills, each of which performs a specific task within a digital environment.

In this blog post, we seek to enhance these systems by introducing an approach called 'skill harvesting.' Skill harvesting allows agentic systems to self-reflect and autonomously develop more specialized skills.

We will dive into skill harvesting as a transformative technique that not only enhances the capabilities of existing agent frameworks but also reduces their operational demands and optimizes their resource utilization.

Understanding Skills-Driven Agent Systems

Agent-driven systems, such as those designed for browser navigation, start with a set of basic, or “primitive,” skills. These skills enable agents to perform actions fundamental to their goals. For browser navigation, these are the actions required to interact with web content: clicking buttons, entering keystrokes in text boxes, fetching URLs, and retrieving DOM elements based on content type. We introduced this primitive-skill-based browser navigation functionality in our open source project called Agent-E [1, 2].

Agent-E is built on top of a multi-agent conversational framework called AutoGen [3]. Agent-E's architecture leverages the interplay between skills and agents (see figure below).

Skills-driven Agent-E architecture.

The architecture includes a Browser Navigation Agent and a User-Proxy Agent that coordinate to compose and execute a variety of web workflow automations using multiple primitive skills. This approach allows for the representation of complex web workflows using a small number of skills in a Skills Library.

We next ask the question: how can we improve this Skills Library so that future executions are even more effective?

Self-Reflection and Skill Harvesting

An active area of research in agent-driven systems is agents' self-reflection on completed tasks in order to improve future performance. This improvement could take the form of lower cost or faster response times.

We approach self-reflection in a way that mirrors human cognitive development, where rapid, instinctual responses (System 1) and more considered, analytical thinking (System 2) interact to get the best cost-accuracy-delay tradeoff [4, 5].

In our approach, agents analyze their performance and 'harvest' new skills from this reflection, similar to how humans develop expertise through practice and self-reflection. The analysis could be intra-agent (an agent self-reflecting to improve itself) or inter-agent (multiple agents helping each other to reflect and improve in a system). We bootstrapped the reflection process in Agent-E with chat logs from our benchmarking runs.

Below you can see a comparison of the steps recorded in the chat logs before and after skill harvesting. The user gave the command: find nothing phone 2 on Amazon and sort the results by best seller.

For the detailed chat logs, see the Appendix section.

Agent-E steps to fulfill user request before and after skill harvesting showing 5 interactions with the LLM reduced to 1.
Visualization of the elements interacted with in the website to fulfill the example user’s command (mmid is an attribute that we inject in every HTML element to uniquely represent it to the LLM).

In the course of Agent-E’s execution of a task, most of the time and resources are consumed by the LLM as it analyzes the HTML DOM of each web page and identifies the DOM element on which to perform the next step. This role of the LLM is referred to as assistant ("role": "assistant") in the chat log.
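To make the assistant role concrete, the sketch below counts how many assistant turns in a chat log request a tool call, which is one rough proxy for LLM interactions. It assumes the log has already been loaded as a list of message dictionaries shaped like those in the Appendix. The function name count_tool_call_turns and the abbreviated sample log are illustrative and are not part of Agent-E.

```python
from typing import Any


def count_tool_call_turns(messages: list[dict[str, Any]]) -> int:
    """Count assistant messages that request at least one tool call."""
    return sum(
        1
        for message in messages
        if message.get("role") == "assistant" and message.get("tool_calls")
    )


# Abbreviated, illustrative log: two assistant turns with tool calls,
# one tool response, and a final assistant summary without a tool call.
sample_log = [
    {"role": "user", "content": "find soccer on ESPN"},
    {"role": "assistant", "content": None, "tool_calls": [{"function": {"name": "openurl"}}]},
    {"role": "tool", "content": "Page loaded"},
    {"role": "assistant", "content": None, "tool_calls": [{"function": {"name": "click"}}]},
    {"role": "assistant", "content": "Done. ##TERMINATE##"},
]

print(count_tool_call_turns(sample_log))  # 2
```

The final summary message is also produced by the LLM, so whether to count it depends on what is being measured; this sketch counts only turns that trigger a skill.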

To make element selection concise, we inject a sequential numeric attribute mmid into every HTML element. The LLM is asked to identify the actionable elements using their corresponding mmids. Because web pages are dynamic, mmids are ephemeral and strictly single-use: an mmid assigned during one run cannot be relied on in a later one. Based on the plan provided by the assistant, each skill is executed by an actor whose role is defined as a tool ("role": "tool").
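The idea of sequential mmid injection can be illustrated offline on a static HTML string. The sketch below is not Agent-E's implementation: it uses only Python's standard html.parser, and the names MmidInjector and inject_mmids are hypothetical. It re-emits the markup with an mmid attribute appended to every start tag, numbered in document order.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class MmidInjector(HTMLParser):
    """Re-emit HTML, adding a sequential numeric mmid to every start tag.

    Simplification: comments and doctype declarations are dropped.
    """

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.parts: List[str] = []
        self.next_id = 1

    def _emit_open_tag(self, tag: str, attrs: List[Tuple[str, Optional[str]]], closer: str) -> None:
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name != "mmid"  # discard any mmid left over from an earlier pass
        )
        self.parts.append(f'<{tag}{rendered} mmid="{self.next_id}"{closer}')
        self.next_id += 1

    def handle_starttag(self, tag, attrs):
        self._emit_open_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_open_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def inject_mmids(html: str) -> str:
    parser = MmidInjector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(inject_mmids('<div><input id="q"><button>Go</button></div>'))
# <div mmid="1"><input id="q" mmid="2"><button mmid="3">Go</button></div>
```

Because numbering restarts on every pass, the same element can receive a different mmid whenever the page changes, which is why mmids cannot be stored inside a reusable skill.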

We made the skills (tools) chatty in their responses. Rather than returning true/false to indicate success, they return natural language details about the HTML DOM element on which they performed an operation. For example, the click skill may return: Select menu option "exact-aware-popularity-rank" selected. The select element's outer HTML is: <option value="exact-aware-popularity-rank">. This verbose response is essential to the success of skill harvesting, because it exposes attributes of the element, such as id or value, that a harvested skill can use in place of the ephemeral mmid.
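One way a harvester might turn such a chatty response into a reusable selector is sketched below. This is an assumption about how the step could work, not a description of Agent-E's harvester: the function stable_selectors, the regular expressions, and the attribute priority (id, then name, then value) are all hypothetical. The sample responses are abbreviated from the chat log in the Appendix.

```python
import re
from typing import List, Optional

# Assumed priority of attributes considered stable enough for a selector.
ATTR_PREFERENCE = ("id", "name", "value")

# Matches an opening tag whose attributes are all double-quoted.
TAG_RE = re.compile(r'<(\w+)((?:\s+[\w-]+="[^"]*")*)\s*/?>')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def stable_selectors(tool_response: str) -> List[Optional[str]]:
    """Return one CSS attribute selector per outer-HTML snippet in the response.

    An entry is None when the element has none of the preferred attributes.
    """
    selectors: List[Optional[str]] = []
    for _tag, raw_attrs in TAG_RE.findall(tool_response):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        chosen = next((name for name in ATTR_PREFERENCE if name in attrs), None)
        selectors.append(f"[{chosen}='{attrs[chosen]}']" if chosen else None)
    return selectors


search_response = (
    "Text set in the element with selector [mmid='174'] and outer HTML: "
    '<input id="twotabsearchtextbox" name="field-keywords" type="text">. '
    "Element with selector [mmid='179'] clicked. The clicked element's outer HTML is: "
    '<input id="nav-search-submit-button" type="submit" value="Go">.'
)
sort_response = (
    'Select menu option "exact-aware-popularity-rank" selected. '
    "The select element's outer HTML is: "
    '<option value="exact-aware-popularity-rank">.'
)

print(stable_selectors(search_response))
# ["[id='twotabsearchtextbox']", "[id='nav-search-submit-button']"]
print(stable_selectors(sort_response))
# ["[value='exact-aware-popularity-rank']"]
```

A boolean return value would carry none of this information, so the harvester would have nothing to substitute for the mmid.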

When the offline process of self-reflection takes place, it assesses whether a step can be done without an LLM reasoning over the DOM of the page, since that reasoning is where the cost and latency come from. The harvesting process continues through the available chat logs until it encounters a step that requires reasoning over the DOM content. A harvested skill is composed of sequential calls to primitive (existing) skills that were observed in the chat logs, without any call to an LLM. In some cases, a harvested skill might skip one or more chat log steps if it can achieve them in a more efficient way, for example by navigating directly to a search URL.
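The stopping rule described above can be sketched as taking the longest leading run of logged actions that need no DOM reasoning. Everything here is a simplifying assumption: LoggedStep, its replayable flag, and harvestable_prefix are hypothetical names, and deciding whether a step is replayable is the hard part that this sketch takes as given. The skill names come from the chat log in the Appendix.

```python
from dataclasses import dataclass
from itertools import takewhile
from typing import List

# Skills that only fetch DOM content for the LLM to read.
DOM_READ_SKILLS = frozenset({"get_dom_with_content_type"})


@dataclass(frozen=True)
class LoggedStep:
    skill: str               # primitive skill name as it appears in the chat log
    replayable: bool = True  # hypothetical flag: False if an LLM must still read the DOM


def harvestable_prefix(steps: List[LoggedStep]) -> List[LoggedStep]:
    """Keep actions up to, but not including, the first one that needs DOM reasoning."""
    # DOM reads exist only to feed the LLM, so a replayed skill does not need them.
    actions = [step for step in steps if step.skill not in DOM_READ_SKILLS]
    return list(takewhile(lambda step: step.replayable, actions))


logged = [
    LoggedStep("openurl"),
    LoggedStep("get_dom_with_content_type"),
    LoggedStep("enter_text_and_click"),
    LoggedStep("get_dom_with_content_type"),
    LoggedStep("click"),
]
print([step.skill for step in harvestable_prefix(logged)])
# ['openurl', 'enter_text_and_click', 'click']
```

For this log the result lists the same three primitive skills, in the same order, that the harvested Amazon skill calls before it reads back the final URL.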

Below is an example of a skill harvested from the chat log of this run (in Python); the full log is in the Appendix:

from typing import Annotated
from ae.core.skills.click_using_selector import click as click_element
from ae.core.skills.enter_text_and_click import enter_text_and_click
from ae.core.skills.get_url import geturl
from ae.core.skills.open_url import openurl


async def search_amazon_and_sort_by_best_seller(search_term: Annotated[str, "The search term to use on Amazon"]) -> str:
    """
    Searches for a product on Amazon and sorts the results by best seller rank.
    Parameters:
    - search_term: The search term to use on Amazon
    Returns:
    - A message indicating the search was performed and results sorted by best seller, along with the final URL.
    """
    await openurl("https://www.amazon.com")
    await enter_text_and_click("[id='twotabsearchtextbox']", search_term, "[id='nav-search-submit-button']")
    await click_element("[value='exact-aware-popularity-rank']")
    url = await geturl()
    return f"The search for \"{search_term}\" on Amazon has been successfully performed and sorted by best seller. Final URL: {url}"

The harvested skill, search_amazon_and_sort_by_best_seller, is a specialized, higher-order skill tailored to searching and sorting on Amazon's website. We did not have to write this skill; the harvester discovered it and added it to the agent’s Skills Library. This new skill significantly reduces the need for costly and delay-prone interactions with the underlying LLM(s). The transition from pre-harvest to post-harvest skills showcases the evolution of the agent’s capabilities and the effectiveness of self-reflection in real-world applications. The browser navigation agent now has more capabilities than those with which it began.

Benefits

The introduction of harvested skills has led to remarkable improvements in system performance. Not only do these skills reduce dependency on LLM calls, but they also expedite task completion, leading to greater efficiency.

Here are five user commands, taken from our GitHub repo’s benchmark, that highlight the agent’s practical application and the range of tasks it can perform efficiently with the integration of harvested skills:

  • find nothing phone 2 on amazon and sort the results by best seller
  • find nothing phone 2 on amazon and sort the results by price highest first
  • find soccer on ESPN
  • go to MIT and navigate to the Alumni website
  • put the video in full-screen

These commands showcase the diverse capabilities of the agent, from simple task execution to complex interactions, demonstrating the agent's ability to adapt and respond more efficiently after skill harvesting.

Performance Improvement

An illustration of the number of LLM calls and processing delays before and after skill harvesting.

Task Id   Task
0         find nothing phone 2 on amazon and sort the results by best seller
1         find nothing phone 2 on amazon and sort the results by price highest first
2         find soccer on ESPN
8         Go to MIT and navigate to Alumni website
14        Put the video in full screen (this video was playing)

Our metrics show up to an 80% reduction in time taken to complete tasks, and up to a 70% reduction in LLM interactions, underscoring the effectiveness of skill harvesting in enhancing system responsiveness and cost efficiency.

Presenting the user commands alongside the measurements lets readers see how the system operates before and after skill harvesting. These concrete examples ground the discussion of performance improvements and help both technical and non-technical audiences understand the value of the approach.

Drawbacks

While harvested skills are advantageous in many ways, the process is not without drawbacks. The main one is that harvested DOM selectors may become stale over time, which happens when a website’s creator changes the HTML element attributes on which a harvested skill relies. When this happens, the task orchestrator should be able to backtrack and fall back on its available primitive skills, and the failed harvested skill should be marked for examination. As a preventative measure, higher-order skills with DOM selector dependencies need to be checked periodically and re-harvested as necessary. This does not guarantee that every harvested skill in an agent’s Skills Library will always work, since a site can change between checks, but together with the fallback it keeps a stale skill from blocking a task.
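A minimal sketch of the backtracking behavior described above is shown here. It assumes that a stale selector surfaces as an exception, which may not match how a given skill reports failure, and every name in it (run_with_fallback, flagged_for_review, and the two stub skills) is hypothetical rather than taken from Agent-E.

```python
import asyncio
from typing import Awaitable, Callable

Skill = Callable[[str], Awaitable[str]]

# Names of harvested skills that failed and await examination (hypothetical registry).
flagged_for_review: set[str] = set()


async def run_with_fallback(harvested: Skill, primitive_path: Skill, argument: str) -> str:
    """Try the harvested skill first; on failure, flag it and fall back."""
    if harvested.__name__ not in flagged_for_review:
        try:
            return await harvested(argument)
        except Exception:
            # Assumption: a stale selector raises instead of returning an error string.
            flagged_for_review.add(harvested.__name__)
    return await primitive_path(argument)


async def stale_harvested_skill(search_term: str) -> str:
    # Stub standing in for a harvested skill whose selector no longer matches.
    raise LookupError("selector [id='twotabsearchtextbox'] not found")


async def llm_guided_primitives(search_term: str) -> str:
    # Stub standing in for the slower, LLM-driven path over primitive skills.
    return f"Completed '{search_term}' using primitive skills"


result = asyncio.run(
    run_with_fallback(stale_harvested_skill, llm_guided_primitives, "nothing phone 2")
)
print(result)              # Completed 'nothing phone 2' using primitive skills
print(flagged_for_review)  # {'stale_harvested_skill'}
```

Once a skill is flagged, later calls skip it and go straight to the primitive path until it has been examined and re-harvested.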

Conclusion

The practice of skill harvesting not only streamlines the task execution process but also contributes to the evolution of agent-driven systems, making them smarter and more autonomous. This development mirrors the natural learning processes in humans, showcasing the potential of AI systems to adapt and improve over time. Self-reflection and self-improvement using skill harvesting is one of many techniques for improving agent performance. Other agentic patterns, such as agents improving each other in proposer-verifier loops or generating datasets through self-reflection to improve themselves, can also be used to build robust multi-agent systems. We will discuss some of those techniques in subsequent blog posts.

Appendix

Chat log of Agent-E performing user task with just the primitive skills in the library:

{
    "<autogen.agentchat.user_proxy_agent.UserProxyAgent object at 0x16e4d22d0>": [
        {
            "content": "Execute the user task \"find nothing phone 2 on amazon and sort the results by best seller\" using the appropriate agent. Current URL: https://www.google.com/",
            "role": "user"
        },
        {
            "tool_calls": [
                {
                    "id": "call_AxqHwGafK4qNjVnyWeCkyCkG",
                    "function": {
                        "arguments": "{\"url\":\"https://www.amazon.com\"}",
                        "name": "openurl"
                    },
                    "type": "function"
                }
            ],
            "content": null,
            "role": "assistant"
        },
        {
            "content": "Page loaded: https://www.amazon.com/",
            "tool_responses": [
                {
                    "tool_call_id": "call_AxqHwGafK4qNjVnyWeCkyCkG",
                    "role": "tool",
                    "content": "Page loaded: https://www.amazon.com/"
                }
            ],
            "role": "tool"
        },
        {
            "tool_calls": [
                {
                    "id": "call_Jiz2TLHUqo8c7SDlYCwCXAZu",
                    "function": {
                        "arguments": "{\"content_type\":\"input_fields\"}",
                        "name": "get_dom_with_content_type"
                    },
                    "type": "function"
                }
            ],
            "content": null,
            "role": "assistant"
        },
        {
            "content": "DOM content redacted for brevity",
            "tool_responses": [
                {
                    "tool_call_id": "call_Jiz2TLHUqo8c7SDlYCwCXAZu",
                    "role": "tool",
                    "content": "DOM content redacted for brevity"
                }
            ],
            "role": "tool"
        },
        {
            "tool_calls": [
                {
                    "id": "call_ERgKryVU2ZnXhLjJLaI3atPy",
                    "function": {
                        "arguments": "{\"click_selector\":\"[mmid='179']\",\"text_selector\":\"[mmid='174']\",\"text_to_enter\":\"nothing phone 2\"}",
                        "name": "enter_text_and_click"
                    },
                    "type": "function"
                }
            ],
            "content": null,
            "role": "assistant"
        },
        {
            "content": "Success. Text \"nothing phone 2\" set successfully in the element with selector [mmid='174'] and outer HTML: <input id=\"twotabsearchtextbox\" name=\"field-keywords\" aria-label=\"Search Amazon\" placeholder=\"Search Amazon\" type=\"text\">. Element with selector: \"[mmid='179']\" clicked. The clicked element's outer HTML is: <input id=\"nav-search-submit-button\" type=\"submit\" value=\"Go\">.",
            "tool_responses": [
                {
                    "tool_call_id": "call_ERgKryVU2ZnXhLjJLaI3atPy",
                    "role": "tool",
                    "content": "Success. Text \"nothing phone 2\" set successfully in the element with selector [mmid='174'] and outer HTML: <input id=\"twotabsearchtextbox\" name=\"field-keywords\" aria-label=\"Search Amazon\" placeholder=\"Search Amazon\" type=\"text\">. Element with selector: \"[mmid='179']\" clicked. The clicked element's outer HTML is: <input id=\"nav-search-submit-button\" type=\"submit\" value=\"Go\">."
                }
            ],
            "role": "tool"
        },
        {
            "tool_calls": [
                {
                    "id": "call_WX2jmNA0P7WT2LVhvgyN449E",
                    "function": {
                        "arguments": "{\"content_type\":\"all_fields\"}",
                        "name": "get_dom_with_content_type"
                    },
                    "type": "function"
                }
            ],
            "content": null,
            "role": "assistant"
        },
        {
            "content": "DOM content redacted for brevity",
            "tool_responses": [
                {
                    "tool_call_id": "call_WX2jmNA0P7WT2LVhvgyN449E",
                    "role": "tool",
                    "content": "DOM content redacted for brevity"
                }
            ],
            "role": "tool"
        },
        {
            "tool_calls": [
                {
                    "id": "call_vugjihhZXUE6XnupE8phDhHe",
                    "function": {
                        "arguments": "{\"selector\":\"[mmid='485']\"}",
                        "name": "click"
                    },
                    "type": "function"
                }
            ],
            "content": null,
            "role": "assistant"
        },
        {
            "content": "Select menu option \"exact-aware-popularity-rank\" selected. The select element's outer HTML is: <option value=\"exact-aware-popularity-rank\">.",
            "tool_responses": [
                {
                    "tool_call_id": "call_vugjihhZXUE6XnupE8phDhHe",
                    "role": "tool",
                    "content": "Select menu option \"exact-aware-popularity-rank\" selected. The select element's outer HTML is: <option value=\"exact-aware-popularity-rank\">."
                }
            ],
            "role": "tool"
        },
        {
            "content": "The search for \"nothing phone 2\" on Amazon has been successfully sorted by best seller. ##TERMINATE##",
            "role": "assistant"
        }
    ]
}

Chat log for Agent-E after the harvested skill search_amazon_and_sort_by_best_seller was added:

[
        {
            "content": "Execute the user task \"find nothing phone 2 on amazon and sort the results by best seller\" using the appropriate agent. Current URL: https://www.google.com/",
            "role": "user"
        },
        {
            "tool_calls": [
                {
                    "id": "call_17LggVRh1hvCNq6QWxApYFhh",
                    "function": {
                        "arguments": "{\"search_term\":\"nothing phone 2\"}",
                        "name": "search_amazon_and_sort_by_best_seller"
                    },
                    "type": "function"
                }
            ],
            "content": null,
            "role": "assistant"
        },
        {
            "content": "The search for \"nothing phone 2\" on Amazon has been successfully performed and sorted by best seller. Final URL: https://www.amazon.com/s?k=nothing+phone+2&s=exact-aware-popularity-rank&crid=2QPNVPBITCZPY&qid=1713552914&sprefix=nothing+phone+2+%2Caps%2C132&ref=sr_st_exact-aware-popularity-rank&ds=v1%3ACtxPvvnR3lGtxG8uuos%2FCXkqhOM2R03bFxYhzvkrS3k",
            "tool_responses": [
                {
                    "tool_call_id": "call_17LggVRh1hvCNq6QWxApYFhh",
                    "role": "tool",
                    "content": "The search for \"nothing phone 2\" on Amazon has been successfully performed and sorted by best seller. Final URL: https://www.amazon.com/s?k=nothing+phone+2&s=exact-aware-popularity-rank&crid=2QPNVPBITCZPY&qid=1713552914&sprefix=nothing+phone+2+%2Caps%2C132&ref=sr_st_exact-aware-popularity-rank&ds=v1%3ACtxPvvnR3lGtxG8uuos%2FCXkqhOM2R03bFxYhzvkrS3k"
                }
            ],
            "role": "tool"
        },
        {
            "content": "I've found the \"nothing phone 2\" on Amazon and sorted the results by best seller. You can view the sorted results [here](https://www.amazon.com/s?k=nothing+phone+2&s=exact-aware-popularity-rank&crid=2QPNVPBITCZPY&qid=1713552914&sprefix=nothing+phone+2+%2Caps%2C132&ref=sr_st_exact-aware-popularity-rank&ds=v1%3ACtxPvvnR3lGtxG8uuos%2FCXkqhOM2R03bFxYhzvkrS3k). ##TERMINATE##",
            "role": "assistant"
        }
    ]

Further Reading and Resources

For those interested in the foundational technologies and previous iterations of our agent-driven initiatives, refer to the following resources:

[1] Agent-E blog

[2] GitHub repo for Agent-E

[3] AutoGen

[4] Of 2 Minds: How Fast and Slow Thinking Shape Perception and Choice [Excerpt]

[5] https://openreview.net/pdf?id=BZ5a1r-kVsf
