
Best Practices Guide for Writing Efficient Tools (Essential Edition)
Original source from Anthropic: Writing effective tools for agents — with agents
This article outlines key strategies for building efficient tools for AI agents, emphasizing the importance of tools to agent effectiveness. The article details how to optimize tools through prototyping, comprehensive evaluation, and collaboration with AI agents such as Claude Code. It also explains principles for writing high-quality tools, including selecting appropriate tools, managing namespaces, returning meaningful context, optimizing token efficiency, and prompt engineering. The ultimate goal is to enable agents to solve real-world tasks more intuitively and efficiently.
Introduction
The effectiveness of AI agents is closely tied to the quality of the tools they use. In traditional software development, we write software for deterministic systems—for example, getWeather("NYC") always retrieves weather information the same way. Tools, however, represent a new kind of software contract connecting deterministic systems with non-deterministic agents. To respond to a user query, an agent may call tools, draw on general knowledge, or ask clarifying questions; it may also hallucinate or fail to understand how to use a tool. This means we need to fundamentally rethink how we write software for agents and design agent-centered solutions.
This guide will provide you with proven methods and principles to help you build, evaluate, and optimize high-quality agent tools that expand the effective range of tasks agents can solve.
Core Methodology: Build, Evaluate, and Collaboratively Optimize
Writing efficient tools is an iterative process: build a prototype, run comprehensive evaluations, and collaborate with agents to optimize the tools.
Building a prototype
- Quick start: Start by quickly establishing tool prototypes to personally experience which tools are “ergonomic” for agents.
- Provide documentation: If you’re using agents like Claude Code to write tools, provide documentation for any software libraries, APIs, or SDKs they depend on, especially LLM-friendly plain-text documentation (like `llms.txt` files).
- Local testing and feedback: Wrap tools in a local MCP server or desktop extension (DXT) and connect them to Claude Code or the Claude Desktop app for testing. Test personally to discover issues, and collect user feedback to build intuition about expected use cases and prompts.
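As a concrete illustration, a first prototype can be as small as a plain handler function plus a tool spec. The tool name, schema fields, and contact data below are hypothetical; the same shape maps directly onto a tool registration if you later wrap it in a local MCP server.

```python
# Minimal, framework-agnostic prototype of a hypothetical search_contacts
# tool: a JSON-style spec the agent would see, plus the handler that runs
# locally when the agent calls it.
TOOL_SPEC = {
    "name": "search_contacts",
    "description": "Search the address book by (partial) name. "
                   "Returns up to `limit` matching contacts.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

CONTACTS = [  # stand-in data; a real tool would query a contacts service
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "John Roe", "email": "john@example.com"},
]

def search_contacts(query: str, limit: int = 5) -> list[dict]:
    """Return contacts whose name contains `query` (case-insensitive)."""
    q = query.lower()
    return [c for c in CONTACTS if q in c["name"].lower()][:limit]
```

Testing this pair by hand (and letting an agent call it) is often enough to reveal whether the parameter names and description feel "ergonomic" before investing in a full server.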
Running an evaluation
- Systematically measure performance: Evaluations are how you systematically measure tool performance, rather than relying on ad-hoc impressions from prototyping.
- Generate evaluation tasks:
- Based on real-world usage: Generate a large number of evaluation tasks based on real-world use cases. Prompts should come from actual data sources and services (such as internal knowledge bases and microservices), avoiding overly simple or shallow “sandbox” environments that don’t adequately test tool complexity. Strong evaluation tasks may require multiple tool calls.
- Strong task examples:
- “Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Include minutes from our last project planning meeting and book a conference room.”
- “Customer ID 9182 reports they were charged three times for one purchase attempt. Find all related log entries and determine if other customers have been affected by the same issue.”
- “Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they’re leaving, (2) what retention offer would be most appealing, (3) what risk factors we should consider before making the offer.”
- Weak task examples:
- “Schedule a meeting with [email protected] next week.”
- “Search payment logs for `purchase_complete` and `customer_id=9182`.”
- “Find cancellation request for customer ID 45892.”
- Verifiable responses: Each evaluation prompt should be paired with a verifiable response or result. Validators can range from exact string comparisons to complex tools letting Claude judge the response. Avoid overly strict validators that might reject correct responses due to minor differences like formatting, punctuation, or valid alternative phrasing.
- Optional expected tool calls: You can optionally specify expected tools the agent should call when solving each task, to measure whether the agent successfully understands each tool’s purpose during evaluation. But avoid over-specifying or overfitting strategies, as there may be multiple valid paths to solve a task.
- Run evaluations programmatically: It’s recommended to run evaluations programmatically using direct LLM API calls. Use a simple agent loop (a `while` loop alternating LLM API calls and tool calls) for each evaluation task.
- Collect multi-dimensional metrics: Beyond top-level accuracy, also collect metrics such as runtime per tool call and per task, total number of tool calls, total token consumption, and tool errors. Tracking tool calls helps reveal the workflows agents commonly follow and surfaces opportunities for tool consolidation.
- System prompt guidance for reasoning: In the evaluation agent’s system prompt, instruct the agent to output reasoning and feedback blocks in addition to structured response blocks (used for validation). Emitting reasoning before tool calls and response blocks can improve the LLM’s effective intelligence by triggering chain-of-thought (CoT) behavior.
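The loop and metric collection described above can be sketched in a few lines. This is a minimal harness, not a full framework: `call_model` is a stand-in for a real LLM API call, and the message/reply shapes are simplified assumptions.

```python
import time

def run_eval_task(prompt, tools, call_model, max_turns=10):
    """Run one evaluation task with a simple agent loop, collecting metrics.

    `call_model` stands in for a real LLM API call: it takes the message
    list and returns either {"type": "tool_use", "name": ..., "input": ...}
    or {"type": "text", "content": final_answer}.
    """
    messages = [{"role": "user", "content": prompt}]
    metrics = {"tool_calls": 0, "tool_errors": 0}
    start = time.time()
    answer = None
    for _ in range(max_turns):
        reply = call_model(messages)
        if reply["type"] != "tool_use":   # model produced its final answer
            answer = reply["content"]
            break
        metrics["tool_calls"] += 1
        try:
            result = tools[reply["name"]](**reply["input"])
        except Exception as exc:          # surface tool errors back to the model
            metrics["tool_errors"] += 1
            result = f"error: {exc}"
        messages.append({"role": "tool_result", "content": str(result)})
    metrics["runtime_s"] = time.time() - start
    return answer, metrics
```

Running this per evaluation task and aggregating `metrics` gives you the accuracy, runtime, call-count, and error signals discussed above.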
Analyzing results
- Agents as important partners: Agents are useful partners for discovering problems and providing feedback, with feedback ranging from contradictory tool descriptions to inefficient tool implementations and confusing tool patterns.
- Interpreting agent behavior: Observe where agents get stuck or confused. Read the evaluation agent’s reasoning and feedback (or CoT) carefully to identify issues. Review raw logs (including tool calls and tool responses) to capture any behaviors not explicitly described in the agent’s CoT. Remember, agents don’t necessarily know the correct answer and strategy, requiring “reading between the lines”.
- Analyze tool call metrics:
- Excessive redundant tool calls may indicate need for adjusting pagination or token limit parameters.
- Tool errors due to invalid parameters may suggest tools need clearer descriptions or better examples.
- Example: Claude’s web search tool previously appended “2025” to the `query` parameter unnecessarily, biasing search results and degrading performance. Improving the tool description guided Claude in the right direction.
Collaborating with agents
- Let agents analyze and improve tools: You can have agents analyze evaluation results and improve tools themselves. Simply concatenate the evaluation agent’s logs and paste them into Claude Code. Claude excels at analyzing logs and refactoring large numbers of tools in one go, ensuring tool implementations and descriptions remain consistent when making new changes.
- Most of the Anthropic team’s recommendations come from repeatedly optimizing their internal tool implementations with Claude Code. They used holdout test sets to avoid overfitting to the “training” evaluations, and found that additional performance can be extracted even beyond “expert” tool implementations (whether human-written or Claude-generated).
Key Principles for Writing Efficient Tools
From the above process, Anthropic distilled the following guiding principles for writing efficient tools:
Choosing the right tools for agents
- Less is more, focus on high impact: More tools don’t necessarily mean better. A common mistake is simply encapsulating existing software functions or API endpoints without considering whether the tool is suitable for agents.
- Understand agent affordances: LLM agents have limited “context” (i.e., limited information they can process at once). If an agent uses a tool that returns all contacts and then reads through them token by token, it wastes limited context space on irrelevant information.
- Prioritize contextual relevance: For example, in an address-book scenario, implement `search_contacts` or `message_contact` tools instead of a `list_contacts` tool.
- Consolidate functionality: Tools can consolidate functionality, handling multiple discrete operations (or API calls) under the hood. For example, a tool can enrich its response with relevant metadata, or handle frequently chained multi-step tasks in a single call.
- Examples:
- Instead of implementing `list_users`, `list_events`, and `create_event` tools, consider a `schedule_event` tool that finds availability and schedules the event.
- Instead of implementing a `read_logs` tool, consider a `search_logs` tool that returns only the relevant log lines plus some surrounding context.
- Instead of implementing `get_customer_by_id`, `list_transactions`, and `list_notes` tools, implement a `get_customer_context` tool that compiles all recent and relevant customer information in one go.
- Clear and unique objectives: Ensure each tool has a clear, unique objective. Tools should enable agents to subdivide and solve tasks like humans would (given access to the same underlying resources), while reducing context consumed by intermediate outputs.
- Avoid redundancy and interference: Too many tools or tools with overlapping functionality distract agents from taking efficient strategies. Careful, selective planning of tools you build (or don’t build) can yield significant returns.
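The `schedule_event` consolidation above can be sketched as a single tool that chains availability-checking and booking internally. The calendar data, field names, and availability logic here are all illustrative stand-ins for what would be real calendar API calls.

```python
from datetime import date

# Stand-in busy calendars; a real tool would query the calendar service.
BUSY = {"jane": {date(2025, 1, 6)}, "me": {date(2025, 1, 7)}}
EVENTS = []

def schedule_event(attendees: list[str], candidate_days: list[date], title: str) -> dict:
    """Consolidated tool: find a day when all attendees are free, then book it,
    replacing separate list_users / list_events / create_event calls."""
    for day in candidate_days:
        if all(day not in BUSY.get(a, set()) for a in attendees):
            event = {"title": title, "date": day, "attendees": attendees}
            EVENTS.append(event)
            return {"status": "booked", "event": event}
    return {"status": "no_availability", "checked_days": len(candidate_days)}
```

The agent makes one call and receives one compact result, instead of spending context on intermediate user and event listings it would otherwise have to reason over itself.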
Namespacing your tools
- Address functional overlap and ambiguity: Agents may access dozens of MCP servers and hundreds of different tools. When tool functionality overlaps or purposes are ambiguous, agents become confused.
- Use prefixes for grouping: Namespacing (grouping related tools under common prefixes) helps define boundaries between tools. For example, namespacing tools by service (like `asana_search`, `jira_search`) or by resource (like `asana_projects_search`, `asana_users_search`) can help agents select the right tool at the right time.
- Consider prefix vs. suffix: The choice between prefix- and suffix-based namespacing can significantly affect tool-use evaluations, with effects varying by LLM. It’s recommended to choose a naming scheme based on your own evaluations.
- Reduce agent error risk: By selectively implementing tools whose names reflect natural task subdivisions, you can simultaneously reduce the number of tools and tool descriptions loaded into agent context, and offload agent computation back to tool calls themselves, reducing the total risk of agent errors.
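A service-prefixed naming scheme like the one above is mechanical enough to sketch directly. The tool names below are hypothetical; the point is only that a consistent `service_resource_action` convention lets both agents and tooling group related tools at a glance.

```python
# Hypothetical namespaced tool names following a service_resource_action scheme.
TOOL_NAMES = [
    "asana_projects_search",
    "asana_users_search",
    "jira_issues_search",
    "jira_users_search",
]

def by_service(names: list[str]) -> dict[str, list[str]]:
    """Group namespaced tool names by their service prefix."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(name.split("_", 1)[0], []).append(name)
    return groups
```

Whether the discriminating token works better as a prefix or a suffix is exactly what your own evaluations should decide.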
Returning meaningful context from your tools
- High-signal information first: Tool implementations should focus on returning only high-signal information to agents.
- Contextual relevance over flexibility: Prioritize contextual relevance, avoiding low-level technical identifiers (e.g. `uuid`, `256px_image_url`, `mime_type`). Fields like `name`, `image_url`, and `file_type` are more likely to directly inform the agent’s subsequent actions and responses.
- Use natural language over cryptic identifiers: Agents typically handle natural-language names, terms, or identifiers more successfully than cryptic ones. Resolving arbitrary alphanumeric UUIDs into more semantically meaningful, interpretable language (or even a 0-indexed ID scheme) significantly improves Claude’s precision in retrieval tasks and reduces hallucinations.
- Flexible response formats: In some cases, agents need the flexibility to handle both natural-language and technical-identifier outputs, if only to trigger downstream tool calls. You can achieve this by exposing a simple `response_format` enum parameter, letting agents control whether a tool returns a “concise” or “detailed” response.
- Example: Slack threads and replies are identified by a unique `thread_ts`, which is required to retrieve thread replies. `thread_ts` and other IDs (`channel_id`, `user_id`) can be retrieved from “detailed” tool responses to enable further tool calls that require them, while “concise” responses return only the thread content. In this case, “concise” responses save roughly one-third of the tokens.
- Optimize response structure: Tool response structures (e.g., XML, JSON, or Markdown) also affect evaluation performance. LLMs perform better with formats matching their training data, so optimal response structures vary by task and agent. It’s recommended to choose optimal response structures based on your evaluations.
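The `response_format` pattern can be sketched as follows. The thread data and field names here imitate Slack's shape but are invented for illustration; a real tool would fetch the thread from the Slack API.

```python
# Hypothetical Slack-style thread tool with a response_format enum, letting
# the agent trade IDs away for token savings ("concise") or keep them for
# follow-up tool calls ("detailed").
THREAD = {
    "thread_ts": "1735689600.000200",
    "channel_id": "C024BE91L",
    "messages": [
        {"user_id": "U0G9QF9C6", "user_name": "Jane", "text": "Shipping today?"},
        {"user_id": "U1H8RG0D7", "user_name": "Sam", "text": "Yes, by 5pm."},
    ],
}

def get_thread(thread_ts: str, response_format: str = "concise") -> dict:
    """Return a thread; "detailed" keeps IDs for chaining, "concise" drops them."""
    if response_format not in ("concise", "detailed"):
        raise ValueError("response_format must be 'concise' or 'detailed'")
    if response_format == "detailed":
        return THREAD  # includes thread_ts, channel_id, user_ids for follow-ups
    return {"messages": [f"{m['user_name']}: {m['text']}" for m in THREAD["messages"]]}
```

Defaulting to "concise" keeps the common case cheap, while agents that need to chain calls can opt into "detailed" explicitly.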
Optimizing tool responses for token efficiency
- Quality and quantity matter: While optimizing context quality is important, optimizing the quantity of context returned to agents in tool responses is equally important.
- Implement pagination, range selection, filtering, and/or truncation: For tool responses that may consume substantial context, implement some combination of pagination, range selection, filtering, and/or truncation with reasonable default parameter values. For example, Claude Code defaults to limiting tool responses to 25,000 tokens.
- Guide after truncation: If choosing to truncate responses, be sure to guide agents with helpful instructions. Directly encourage agents to take more token-efficient strategies, such as conducting multiple small targeted searches rather than one broad search.
- Helpful error responses: If tool calls raise errors (e.g., during input validation), design error responses through prompt engineering to clearly communicate specific and actionable improvements, rather than opaque error codes or stack traces.
- Example:
- Unhelpful error response: returns only an opaque error code.
- Helpful error response: points out the specific parameter problem and provides an example of the correct format, guiding the agent toward a fix.
- Both truncation notices and error responses can steer agents toward more token-efficient behaviors (e.g. using filters or pagination) or provide examples of correctly formatted tool inputs.
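Both ideas, truncation with guidance and actionable validation errors, can be shown in one small tool. The query-length rule, the result cap, and the suggested follow-up strategies are illustrative assumptions, not a prescription.

```python
MAX_LINES = 3  # assumed cap for the sketch; real tools often cap by tokens
               # (e.g. Claude Code defaults to 25,000 tokens per response)

def search_logs(logs: list[str], query: str, limit: int = MAX_LINES) -> str:
    """Search log lines, returning helpful errors and guided truncation."""
    if len(query) < 3:
        # Helpful error: states the problem, the constraint, and a fix,
        # instead of an opaque "ERR_422" or a stack trace.
        return (
            f"Invalid query {query!r}: queries must be at least 3 characters. "
            "Try a specific term such as 'purchase_complete'."
        )
    hits = [line for line in logs if query in line]
    shown = "\n".join(hits[:limit])
    if len(hits) > limit:
        # Truncation notice that tells the agent how to narrow its next call,
        # nudging it toward multiple small targeted searches.
        shown += (
            f"\n[truncated: showing {limit} of {len(hits)} matches. "
            "Narrow the query or filter further rather than paging through everything.]"
        )
    return shown
```

In both branches the response text is written for the agent's next decision, not just for a human reading a log.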
Prompt-engineering your tool descriptions
- One of the most effective methods: This is one of the most effective ways to improve tool efficiency. Tool descriptions and specifications load into agent contexts and can collectively guide agents toward effective tool-calling behaviors.
- Describe as if to a new employee: Write tool descriptions and specifications as you would describe your tools to a new team member. Make explicit any context you might assume—professional query formats, definitions of specific terms, relationships between underlying resources.
- Avoid ambiguity: Avoid ambiguity by clearly describing (and enforcing with strict data models) expected inputs and outputs. In particular, input parameters should be explicitly named: for example, instead of `user`, use `user_id`.
- Measure prompt-engineering impact with evaluations: Evaluations let you measure the impact of prompt engineering with confidence. Even minor improvements to tool descriptions can yield significant performance gains. For example, Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation after precise refinements to tool descriptions, significantly reducing error rates and improving task completion.
- Consult additional resources: You can consult Anthropic’s developer guide for more tool definition best practices. If you’re building tools for Claude, it’s also recommended to read about how tools dynamically load into Claude’s system prompt. For MCP server tools, tool annotations help disclose which tools require open-world access or make destructive changes.
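An unambiguous description plus a strict schema might look like the sketch below. The tool, its description text, and the minimal validator are all hypothetical; in practice the schema would be enforced by your framework's data models rather than hand-rolled checks.

```python
import re

# Hypothetical tool spec written "as if to a new employee": the parameter is
# named user_id (not user), and the description spells out assumed context.
SPEC = {
    "name": "get_user_profile",
    "description": (
        "Fetch a user's profile from the internal directory. "
        "`user_id` is the numeric employee ID (e.g. '10432'), NOT the "
        "user's email address or display name."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"user_id": {"type": "string", "pattern": r"^\d+$"}},
        "required": ["user_id"],
    },
}

def validate_input(spec: dict, args: dict) -> list[str]:
    """Return a list of actionable problems with `args` (empty if valid)."""
    problems = []
    schema = spec["input_schema"]
    for field in schema["required"]:
        if field not in args:
            problems.append(f"missing required parameter `{field}`")
    for field, rules in schema["properties"].items():
        pattern = rules.get("pattern")
        if pattern and field in args and not re.fullmatch(pattern, str(args[field])):
            problems.append(f"`{field}` must match {pattern!r}; got {args[field]!r}")
    return problems
```

Returning the `problems` list in the tool's error response then doubles as the kind of helpful, example-bearing error message discussed earlier.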
Outlook
To build efficient tools for agents, we need to reposition software development practices from predictable deterministic patterns to non-deterministic patterns. Through the iterative, evaluation-driven process described in this article, we’ve discovered recurring patterns that make tools successful: efficient tools are intentionally and clearly defined, wisely utilize agent context, compose into diverse workflows, and enable agents to intuitively solve real-world tasks.
As agent capabilities strengthen, our systematic, evaluation-driven approach to improving tools ensures that the tools they use can evolve alongside them.