Fixing the Agent Handoff Problem in LlamaIndex's AgentWorkflow System

Position bias in LLMs is the root cause of the problem

Fixing the Agent Handoff Problem in LlamaIndex's AgentWorkflow System. Image by DALL-E-3

As a brand-new multi-agent orchestration framework, LlamaIndex AgentWorkflow still has some shortcomings. The most significant is that after an agent hands off control, the receiving agent fails to continue responding to the user's request, causing the workflow to halt.

In today's article, I'll explore several experimental solutions to this problem with you and discuss the root cause behind it: position bias in LLMs.

I've included all relevant source code at the end of this article. Feel free to read or modify it without needing my permission.


Introduction

My team and I have been experimenting with LlamaIndex AgentWorkflow recently. After some localization adaptations, we hope this framework can eventually run in our production system.

During the adaptation, we encountered many obstacles. I've documented these problem-solving experiences in my article series. You might want to read them first to understand the full context.

Today, I'll address the issue where after the on-duty agent hands off control to the next agent, the receiving agent fails to continue responding to the user's most recent request.

Here's what happens:

The receiving agent doesn't immediately respond to the user's latest request - the user has to repeat their question. Image by Author

After the handoff, the receiving agent doesn't immediately respond to the user's latest request - the user has to repeat their question.

Why should I care?

In this article, we'll examine this unique phenomenon and attempt to solve it from multiple perspectives, including developer recommendations and our own experience.

During this process, we'll also take a close look at AgentWorkflow's excellent source code, having a cross-temporal conversation with its authors through code to better understand Agentic AI design principles.

We'll also touch on LLM position bias for the first time and see how the position of information in chat history affects LLM responses.

These insights aren't limited to LlamaIndex - they'll help us handle similar situations when working with other multi-agent orchestration frameworks.

Let's go.


First, let's see what the developers say

Before we begin, if you need background on LlamaIndex AgentWorkflow, feel free to read my previous article:

Diving into LlamaIndex AgentWorkflow: A Nearly Perfect Multi-Agent Orchestration Solution
And fix the issue where the agent can’t continue with past requests

In short, LlamaIndex AgentWorkflow builds upon the excellent LlamaIndex Workflow framework, encapsulating agent function calling, handoff, and other cutting-edge Agentic AI developments. It lets you focus solely on your agent's business logic.

In my previous article, I first mentioned the issue where agents fail to continue processing user requests after handoff.

Others have noticed this too. In this thread, someone referenced my article's solution when asking the developers about it. I'm glad I could help:

@kapa.ai I was building a multi-agent workflow, where each agent has multiple tools. I started the workflow with user query, the query goes to the root agent an…

Developer Logan M proposed including the original user request in the handoff method's output to ensure the receiving agent continues processing.

Unfortunately, as of this writing, LlamaIndex's release version hasn't incorporated this solution yet.

So today's article starts with the developer's response - we'll try rewriting the handoff method implementation ourselves to include the original user request in the handoff output.

First attempt

Since this solution modifies the handoff method implementation, we don't need to rewrite FunctionAgent code. Instead, we'll modify AgentWorkflow's implementation.

The handoff method is core to AgentWorkflow's handoff capability. It identifies which agent the LLM wants to hand off to and sets it in the context's next_agent. During workflow execution, this method merges with the agent's tools and gets called via function calling when the LLM needs to hand off.

This is how AgentWorkflow implements multi-agent handoff.

In the original code, after handoff sets the next_agent, it returns a prompt as the tool call result to the receiving agent. The prompt looks like this:

DEFAULT_HANDOFF_OUTPUT_PROMPT = """
Agent {to_agent} is now handling the request due to the following reason: {reason}.
Please continue with the current request.
"""

This prompt includes {to_agent} and {reason} fields. But since the prompt goes to the receiving agent, {to_agent} isn't very useful. Unless {reason} contains the original user request, the receiving agent can't get relevant information from this prompt. That's why the developer suggested including the user request in the prompt output.

The original handoff implementation didn't include user request information. Image by Author
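
For reference, here's a sketch of roughly what the stock implementation does - paraphrased from the behavior described above and from the modified version we'll write next, not copied verbatim from the LlamaIndex source:

# Sketch of the stock handoff tool (a paraphrase, not the verbatim library code).
# It validates the target agent, stores it in the Context as "next_agent",
# and returns the default prompt as the tool-call result.
async def handoff(ctx: Context, to_agent: str, reason: str) -> str:
    """Handoff control of that chat to the given agent."""
    agents: list[str] = await ctx.get("agents")
    current_agent_name: str = await ctx.get("current_agent_name")
    if to_agent not in agents:
        valid_agents = ", ".join([x for x in agents if x != current_agent_name])
        return f"Agent {to_agent} not found. Please select a valid agent to hand off to. Valid agents: {valid_agents}"

    await ctx.set("next_agent", to_agent)
    # Only {to_agent} and {reason} are filled in - the user's request never appears.
    return PromptTemplate(DEFAULT_HANDOFF_OUTPUT_PROMPT).format(to_agent=to_agent, reason=reason)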

Let's modify this method first.

We'll create an enhanced_agent_workflow.py file and write the modified HANDOFF_OUTPUT_PROMPT:

ENHANCED_HANDOFF_OUTPUT_PROMPT = """
    Agent {to_agent} is now handling the request.
    Check the previous chat history and continue responding to the user's request: {user_request}.
"""

Compared to the original, I added a requirement for the LLM to review chat history and included the user's most recent request.

The output of the updated handoff now includes both chat history review and user request information. Image by Author

Next, I rewrote the handoff method to return the new prompt:

from llama_index.core.prompts import PromptTemplate
from llama_index.core.workflow import Context


async def handoff(ctx: Context, to_agent: str, user_request: str):
    """Handoff control of that chat to the given agent."""
    agents: list[str] = await ctx.get("agents")
    current_agent_name: str = await ctx.get("current_agent_name")
    if to_agent not in agents:
        valid_agents = ", ".join([x for x in agents if x != current_agent_name])
        return f"Agent {to_agent} not found. Please select a valid agent to hand off to. Valid agents: {valid_agents}"

    # Record the receiving agent, then return the enhanced prompt, which now
    # carries the user's original request instead of a handoff reason.
    await ctx.set("next_agent", to_agent)
    handoff_output_prompt = PromptTemplate(ENHANCED_HANDOFF_OUTPUT_PROMPT)
    return handoff_output_prompt.format(to_agent=to_agent, user_request=user_request)

The rewrite is simple - I just changed the reason parameter to user_request and returned the new prompt. The LLM will handle everything else.

Since we modified handoff's source code, we also need to modify AgentWorkflow's code that calls this method.

The _get_handoff_tool method in AgentWorkflow calls handoff, so we'll implement an EnhancedAgentWorkflow subclass of AgentWorkflow and override _get_handoff_tool:

class EnhancedAgentWorkflow(AgentWorkflow):
    def _get_handoff_tool(
        self, current_agent: BaseWorkflowAgent
    ) -> Optional[AsyncBaseTool]:
        """Creates a handoff tool for the given agent."""
        agent_info = {cfg.name: cfg.description for cfg in self.agents.values()}
        configs_to_remove = []
        for name in agent_info:
            if name == current_agent.name:
                configs_to_remove.append(name)
            elif (
                current_agent.can_handoff_to is not None
                and name not in current_agent.can_handoff_to
            ):
                configs_to_remove.append(name)

        for name in configs_to_remove:
            agent_info.pop(name)

        if not agent_info:
            return None

        # ENHANCED_HANDOFF_PROMPT (defined alongside ENHANCED_HANDOFF_OUTPUT_PROMPT in
        # enhanced_agent_workflow.py, not shown here) is the prompt used as the handoff
        # tool's description; it's formatted with the available agents ({agent_info}).
        handoff_prompt = PromptTemplate(ENHANCED_HANDOFF_PROMPT)
        fn_tool_prompt = handoff_prompt.format(agent_info=str(agent_info))
        return FunctionTool.from_defaults(
            async_fn=handoff, description=fn_tool_prompt, return_direct=True
        )

Our modifications are complete. Now let's write test code in example_2.py to verify our changes. (example_1.py contains the original AgentWorkflow test.)

I'll base the code on this user's scenario to recreate the situation.

We'll create two agents: search_agent and research_agent. search_agent searches the web and records notes, then hands off to research_agent, who writes a research report based on the notes.

search_agent:

search_agent = FunctionAgent(
    name="SearchAgent",
    description="You are a helpful search assistant.",
    system_prompt="""
    You're a helpful search assistant.
    First, you'll look up notes online related to the given topic and record these notes on the topic.
    Once the notes are recorded, you should hand over control to the ResearchAgent.
    """,
    tools=[search_web, record_notes],
    llm=llm,
    can_handoff_to=["ResearchAgent"]
)

research_agent:

research_agent = FunctionAgent(
    name="ResearchAgent",
    description="You are a helpful research assistant.",
    system_prompt="""
    You're a helpful research assistant.
    Based on the recorded notes, write a research report on the given topic.
    """,
    llm=llm
)

search_agent is a multi-tool agent that uses search_web and record_notes methods:

search_web:

async def search_web(ctx: Context, query: str) -> str:
    """
    This tool searches the internet and returns the search results.
    :param query: user's original request
    :return: the search results.
    """
    tavily_client = AsyncTavilyClient()
    search_result = await tavily_client.search(str(query))
    return str(search_result)

record_notes:

async def record_notes(ctx: Context, notes: str, notes_title: str) -> str:
    """
    Useful for recording notes on a given topic. Your input should be notes with a title to save the notes under.
    """
    return f"{notes_title} : {notes}"

Finally, we'll use EnhancedAgentWorkflow to create a workflow and test our modifications:

import asyncio

from llama_index.core.agent.workflow import AgentOutput
from rich.console import Console
from rich.markdown import Markdown

# EnhancedAgentWorkflow, search_agent, and research_agent are defined above.
console = Console()

workflow = EnhancedAgentWorkflow(
    agents=[search_agent, research_agent],
    root_agent=search_agent.name
)


async def main():
    handler = workflow.run(user_msg="What is LlamaIndex AgentWorkflow, and what problems does it solve?")
    async for event in handler.stream_events():
        if isinstance(event, AgentOutput):
            print("=" * 70)
            print(f"🤖 {event.current_agent_name}")
            if event.response.content:
                console.print(Markdown(event.response.content or ""))
            else:
                console.print(event.tool_calls)


if __name__ == "__main__":
    asyncio.run(main())

After research_agent takes over, it recognizes the user request but still doesn't respond. Image by Author

After research_agent takes over, it recognizes the user request but still doesn't respond. Our attempt failed. 😭


My Proposed Solution

How I view this issue

In my previous article, I speculated about the cause:

FunctionAgent puts both tool_call and tool_call_result info into ChatMemory, which pushes user requests to the back of the queue. Image by Author

As shown, FunctionAgent stores all chat messages in a MemoryBuffer - essentially a FIFO queue where user requests enter first.

After completing function calling based on user requests, FunctionAgent saves both tool_call and tool_call_result as chat messages in memory.

Each function call generates two messages. Multiple tool calls create even more messages.

This pushes the original user request deeper into the queue - either far from the latest message or, due to MemoryBuffer's token limit, completely out of the queue.

Consequently, the LLM struggles to perceive the original request from chat history. I'll explain the technical reasons in the position bias section.

When the next agent takes over, it can't immediately respond to the user request.
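
To make this concrete, here's an illustrative sketch of the rough shape of the chat memory right after the handoff. The messages are simplified placeholders (real tool calls are carried in the messages' additional_kwargs), so treat it as a picture of the ordering rather than actual memory contents:

from llama_index.core.llms import ChatMessage, MessageRole

# Illustrative only: the original user request sits at the front of the FIFO
# queue, and every tool call adds two more messages behind it.
chat_history = [
    ChatMessage(role=MessageRole.USER, content="What is LlamaIndex AgentWorkflow, and what problems does it solve?"),
    ChatMessage(role=MessageRole.ASSISTANT, content="(tool_call: search_web)"),
    ChatMessage(role=MessageRole.TOOL, content="(tool_call_result: raw search results ...)"),
    ChatMessage(role=MessageRole.ASSISTANT, content="(tool_call: record_notes)"),
    ChatMessage(role=MessageRole.TOOL, content="(tool_call_result: recorded notes ...)"),
    ChatMessage(role=MessageRole.ASSISTANT, content="(tool_call: handoff)"),
    ChatMessage(role=MessageRole.TOOL, content="Agent ResearchAgent is now handling the request ..."),
]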

So I tried a simple fix: After each handoff, I copy the original user request to the queue's end, ensuring the LLM notices it.

After each handoff, I copy the original user request to the queue's end. Image by Author

Second attempt

This attempt's code is in reordered_function_agent.py.

The implementation is simple: I subclass FunctionAgent as ReorderedFunctionAgent and override take_step.

class ReorderedFunctionAgent(FunctionAgent):
    @override
    async def take_step(
        self,
        ctx: Context,
        llm_input: List[ChatMessage],
        tools: Sequence[AsyncBaseTool],
        memory: BaseMemory,
    ) -> AgentOutput:
        # Content of the most recent message in the LLM input (may be empty).
        last_msg = (llm_input[-1].content or "") if llm_input else ""

        if "handoff_result" in last_msg:
            # A handoff just happened: walk the history backwards, find the
            # user's most recent request, and append a copy to the end of the
            # input so it sits right next to the latest message.
            for message in llm_input[::-1]:
                if message.role == MessageRole.USER:
                    llm_input.append(message)
                    break

        return await super().take_step(ctx, llm_input, tools, memory)

When I detect the last message in llm_input is a handoff tool_call_result, I traverse backward to find the user's last request and append it to the queue's end.

To identify handoff tool_call_result messages, I manually pass a handoff_output_prompt during AgentWorkflow initialization, adding a "handoff_result:" string as a marker. The test code is in example_3.py:

workflow = AgentWorkflow(
    agents=[search_agent, research_agent],
    root_agent=search_agent.name,
    handoff_output_prompt=(
        "handoff_result: Due to {reason}, the user's request has been passed to {to_agent}."
        "Please review the conversation history immediately and continue responding to the user's request."
    ),
)
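
One detail worth noting: the overridden take_step only fires if the agents passed to the workflow are ReorderedFunctionAgent instances. I'm assuming example_3.py builds at least the receiving agent roughly like this (a sketch of the assumed setup, not the exact file contents):

# Assumed setup (see example_3.py): the receiving agent is constructed as a
# ReorderedFunctionAgent so the overridden take_step runs after the handoff.
research_agent = ReorderedFunctionAgent(
    name="ResearchAgent",
    description="You are a helpful research assistant.",
    system_prompt="Based on the recorded notes, write a research report on the given topic.",
    llm=llm,
)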

Let's run the test:

This time, research_agent's output is just a summary of partial notes rather than a final research report. Image by Author

This time, research_agent successfully detects and responds to the user request after taking over. But the result isn't perfect - it doesn't realize the web search and note-taking already happened. It thinks research isn't complete, so the output is just a summary of partial notes rather than a final research report.

I believe this happens because appending the user request to the end of ChatMemory pushes the earlier tool_call information even further toward the front, causing the LLM to lose that critical context.

Next, we'll examine the theoretical basis of this problem and propose an ultimate solution.


Theoretical Cause: Position Bias of LLMs

This issue of messages at the queue's front being ignored relates to a rarely discussed topic: position bias.

Since this isn't an academic discussion, I won't cite many research papers or delve deep into theory. If interested, search for "position bias of large language model."

I'll explain this phenomenon in simple terms:

Different positions in the chat context have different attention weights. arXiv 2407.01100

Our text instructions to LLMs typically include two segments: the system_prompt and the chat history - collectively called the LLM's context.

LLMs have an attention weight decay mechanism. As context expands, attention weights for earlier information naturally decay.

When knowledge sits at the front of the chat history, its influence diminishes rapidly as new dialogue turns are added. Experiments show that in an 8k-token context window, tokens in the first 10% of positions suffer an average influence-weight drop of over 60%. (Large Language Model Agent: A Survey on Methodology, Applications and Challenges - Junyu Luo et al., 2025)

System prompts are designed as global control signals, with information there carrying higher confidence (about a 3-5x weight difference).

Across different LLMs, the positions the model attends to most don't always match where the important information actually sits. arXiv 2404.01430

Imagine entering a restaurant. You first notice the menu cover (system prompt) featuring special dishes and chef introductions, then seasonal items (latest chat history), and finally regular dishes. Delicacies hidden in regular dishes often get overlooked.

Just like how people always notice the menu cover and seasonal specials. Photo by Frank Holleman / Unsplash

Understanding the cause leads us to the ultimate solution.


My Final Attempt

What I plan to do

Next, I'll walk you through my final attempt. First, here's what our project output looks like after implementing it:

ResearchAgent not only continues processing the user request but fully perceives the search notes, ultimately producing a perfect research report. Image by Author

After taking over, ResearchAgent not only continues processing the user request but fully perceives the search notes, ultimately producing a perfect research report.

My solution approach: