Writing Effective Tools for AI Agents: Production Lessons from Anthropic

By Ali Ibrahim

Introduction

You've learned how to prompt AI agents effectively, giving them the heuristics and principles they need to make good decisions autonomously. But even the most carefully crafted prompts fail when agents are given poorly designed tools.

In the previous article on agent prompting, we mentioned that Cameron AI, our personal finance assistant, needed a get_expenses_by_date_range tool to access user expenses. We left that as an exercise. But how do you actually build that tool? What makes a good tool for an agent versus a good API for a human developer?

Anthropic's Applied AI team recently published hard-won lessons from building tools for agents like Claude Code. Their insight? Tools represent a fundamentally new software paradigm: contracts between deterministic systems and non-deterministic agents. Unlike traditional APIs, agent tools must account for unpredictability—agents may hallucinate, misunderstand purposes, or call tools incorrectly.

This guide distills those insights into five practical principles, then shows you exactly how to apply them by building Cameron AI's get_expenses_by_date_range tool from the ground up.

Why Tools Are Different for Agents

When you design an API for developers, you assume rational actors who read documentation, understand error codes, and know when to call which endpoint. Agents break all these assumptions.

Agents operate in a loop, making decisions based on limited context and imperfect understanding. They might:

  • Call the wrong tool because two have similar names
  • Pass malformed parameters despite clear documentation
  • Request massive datasets that blow up their context window
  • Misinterpret cryptic error codes and retry the same failing approach

This means agent tools need different design principles than traditional APIs. As Anthropic discovered building Claude Code, tools must be defensively designed: clear enough that agents can't easily misuse them, informative enough to guide agents toward better strategies, and efficient enough to preserve precious context window space.

The good news? Human and agent ergonomics align. Tools that work well for agents turn out to be surprisingly intuitive for humans too. By designing for the non-deterministic nature of agents, you end up with better APIs overall.

The Five Principles for Effective Tool Design

Strategic Tool Selection

Don't wrap every API endpoint as a tool. Instead, focus on high-impact workflows that match how users (and agents) actually think about tasks. Consolidate multi-step operations into single tool calls when it makes semantic sense.

Anti-pattern: Creating get_expense_by_id, list_all_expenses, filter_expenses_by_category, and search_expenses as separate tools. Agents will struggle to choose the right one, and you'll waste prompt space documenting all four.

Better approach: One well-designed get_expenses_by_date_range tool with smart filtering parameters that handles the common workflows naturally.

Takeaway: Build tools around user workflows, not database schemas; one powerful tool beats five fragmented ones.
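
To make the contrast concrete, here is a rough sketch of the consolidated tool's input surface (the interface name is illustrative; the parameters preview those in the full example later in this article). One schema replaces the four endpoints an agent would otherwise have to choose between:

// One input schema instead of four tools: the agent picks parameters, not endpoints
interface GetExpensesInput {
  start_date: string   // a wide range covers the "list all expenses" case
  end_date: string
  category?: string    // replaces filter_expenses_by_category
  limit?: number       // keeps large queries from flooding the context window
}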

Clear Namespacing

When agents have access to multiple tools (especially across different MCP servers), naming collisions cause confusion. Organize related tools with consistent prefixes: by service (slack_search, notion_search) or by resource (expenses_get_by_date, expenses_summarize).

For Cameron AI's financial tools, we use a cameron_ prefix for all agent-facing tools: cameron_get_expenses, cameron_get_budget_status, cameron_calculate_savings_progress. This makes it immediately clear which tools belong to the finance domain versus other potential integrations.

Takeaway: Consistent naming prefixes prevent tool selection errors and make your agent's available capabilities instantly scannable.

Meaningful Context Return

Agents need human-readable context, not just technical identifiers. When returning expense data, include the expense description and category—not just a UUID the agent would need to look up with another tool call.

Anthropic recommends making tools' response verbosity configurable. Add a response_format parameter with options like "concise" (just the essentials) or "detailed" (full metadata). This lets agents optimize token usage based on their current needs: detailed for analysis tasks, concise for quick checks.

Takeaway: Return semantic information agents can directly reason about, and make verbosity configurable to balance detail against token limits.
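
As a quick illustration (the field names mirror the formatExpenses helper built later in this article; the values are made up), the same expense might come back in either shape:

// concise: only what the agent needs to reason about the expense
const concise = { date: '2025-01-15', amount: 42.5, category: 'groceries', description: 'Whole Foods - groceries' }

// detailed: adds metadata useful for deeper analysis
const detailed = {
  ...concise,
  merchant: 'Whole Foods',
  payment_method: 'credit_card',
  tags: ['weekly-shop'],
  recurring: false,
}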

Token Efficiency

Agent context is precious. Every token in a tool response is a token not available for reasoning or additional tool calls. Implement sensible defaults for pagination (e.g., 50 items), add filtering parameters, and truncate appropriately.

When things go wrong, provide error messages that guide agents toward token-efficient strategies, not cryptic codes. Instead of ERROR: TOO_MANY_RESULTS, return: "Found 847 expenses. This is too large to return. Please narrow your date range or specify a category filter." This helps agents self-correct without wasting tool calls.

Takeaway: Design for token efficiency with smart defaults and helpful error messages that guide agents toward better queries.

Prompt Engineering for Tools

Tool descriptions are prompts. Every word in your tool's name, description, and parameter documentation shapes how agents understand and use it. Anthropic found that iterative refinement of these descriptions, informed by evaluation results, dramatically improves agent performance.

Be explicit about when to use each tool, what parameters are required versus optional, and what the expected output looks like. Use clear, natural language. For Cameron AI's expense tool, instead of just start_date: string, write: start_date: ISO date string (YYYY-MM-DD) for the beginning of the expense search range. Required.

Takeaway: Treat tool descriptions as critical prompt engineering work; iterate on clarity until agents consistently use tools correctly.

Building Cameron AI's get_expenses_by_date_range: A Complete Example

Let's implement the expense tracking tool using all five principles. Here's the complete tool specification with annotations explaining each design decision:

// Tool definition following MCP (Model Context Protocol) schema
{
  name: "cameron_get_expenses",  // Clear namespace prefix (Principle 2)
  description: `Retrieves user expenses within a specified date range. Use this tool when the user asks about spending patterns, budget tracking, or expense analysis.

  **When to use:**
  - "What did I spend on groceries last month?"
  - "Show me my expenses from January to March"
  - "How much am I spending on dining out?"

  **Token efficiency tip:** Use 'response_format: concise' for quick checks, 'detailed' for analysis.`,

  inputSchema: {
    type: "object",
    properties: {
      start_date: {
        type: "string",
        description: "Start date in ISO format (YYYY-MM-DD). Required. Example: '2025-01-01'"
      },
      end_date: {
        type: "string",
        description: "End date in ISO format (YYYY-MM-DD). Required. Example: '2025-01-31'"
      },
      category: {
        type: "string",
        description: "Optional. Filter by expense category. Available categories: 'groceries', 'dining', 'transportation', 'entertainment', 'utilities', 'healthcare', 'other'. Omit to get all categories.",
        enum: ["groceries", "dining", "transportation", "entertainment", "utilities", "healthcare", "other"]
      },
      response_format: {  // Principle 3: Configurable verbosity
        type: "string",
        description: "Optional. Controls response detail. 'concise' returns only essential fields (date, amount, category, description). 'detailed' includes merchant, payment method, tags, and notes. Defaults to 'concise'.",
        enum: ["concise", "detailed"],
        default: "concise"
      },
      limit: {  // Principle 4: Token efficiency with pagination
        type: "number",
        description: "Optional. Maximum number of expenses to return. Defaults to 50. Use lower values for token efficiency.",
        default: 50
      }
    },
    required: ["start_date", "end_date"]
  }
}

Now let's implement the tool handler that demonstrates defensive design:

// Tool implementation with defensive design patterns
async function handleGetExpenses(params: {
  start_date: string
  end_date: string
  category?: string
  response_format?: 'concise' | 'detailed'
  limit?: number
}) {
  const { start_date, end_date, category, response_format = 'concise', limit = 50 } = params

  // Validate date format so agents can self-correct early (a simple regex keeps this example readable)
  const dateRegex = /^\d{4}-\d{2}-\d{2}$/
  if (!dateRegex.test(start_date) || !dateRegex.test(end_date)) {
    return {
      error: "Invalid date format. Please use YYYY-MM-DD format. Example: '2025-01-15'",
      example: { start_date: '2025-01-01', end_date: '2025-01-31' },
    }
  }

  // Validate date range (prevent excessive queries)
  const start = new Date(start_date)
  const end = new Date(end_date)

  // Reject dates that match the format but aren't real calendar dates (e.g. '2025-13-45')
  if (isNaN(start.getTime()) || isNaN(end.getTime())) {
    return {
      error: "Invalid calendar date. Double-check the month and day values.",
      received: { start_date, end_date },
    }
  }

  const daysDiff = Math.ceil((end.getTime() - start.getTime()) / (1000 * 60 * 60 * 24))

  if (daysDiff > 365) {
    return {
      error:
        'Date range too large (> 1 year). Large ranges consume excessive tokens and may hit context limits.',
      suggestion:
        'Try breaking into smaller ranges (e.g., monthly) or specify a category filter to narrow results.',
      requested_days: daysDiff,
    }
  }

  if (start > end) {
    return {
      error: 'start_date must be before end_date',
      received: { start_date, end_date },
    }
  }

  // Fetch expenses from database
  let expenses = await database.getExpenses({
    userId: getCurrentUserId(),
    startDate: start,
    endDate: end,
    category: category,
  })

  // Token efficiency: Warn if results are large
  if (expenses.length > limit) {
    const total = expenses.length
    expenses = expenses.slice(0, limit)

    return {
      warning: `Found ${total} expenses but returning only first ${limit} to preserve tokens.`,
      suggestion: category
        ? 'Try narrowing the date range for more specific results.'
        : "Consider adding a 'category' filter to focus results.",
      total_found: total,
      returned: limit,
      expenses: formatExpenses(expenses, response_format),
    }
  }

  // Principle 3: Return meaningful context based on format
  return {
    date_range: { start_date, end_date },
    category: category || 'all categories',
    total_expenses: expenses.length,
    total_amount: expenses.reduce((sum, e) => sum + e.amount, 0),
    currency: 'USD',
    expenses: formatExpenses(expenses, response_format),
  }
}

// Format helper respecting the response_format parameter
function formatExpenses(expenses: Expense[], format: 'concise' | 'detailed'): any[] {
  if (format === 'concise') {
    // Principle 3: Human-readable, essential info only
    return expenses.map((e) => ({
      date: e.date.toISOString().split('T')[0], // "2025-01-15"
      amount: e.amount,
      category: e.category,
      description: e.description, // "Whole Foods - groceries" not UUID
    }))
  }

  // Detailed format for analysis tasks
  return expenses.map((e) => ({
    date: e.date.toISOString().split('T')[0],
    amount: e.amount,
    category: e.category,
    description: e.description,
    merchant: e.merchant,
    payment_method: e.paymentMethod,
    tags: e.tags,
    notes: e.notes,
    recurring: e.isRecurring,
  }))
}
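
To show how the schema and handler connect, here is a minimal, framework-agnostic dispatch sketch (the ToolCall shape and handler registry are illustrative; in practice you would register the tool with your MCP server or agent framework of choice):

// Illustrative wiring: route incoming tool calls to their handlers
type ToolCall = { name: string; arguments: Record<string, unknown> }

const toolHandlers: Record<string, (args: any) => Promise<unknown>> = {
  cameron_get_expenses: handleGetExpenses,
}

async function dispatchToolCall(call: ToolCall) {
  const handler = toolHandlers[call.name]
  if (!handler) {
    // Defensive design: tell the agent which tools actually exist
    return { error: `Unknown tool '${call.name}'. Available tools: ${Object.keys(toolHandlers).join(', ')}` }
  }
  return handler(call.arguments)
}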

What makes this implementation effective:

  1. Strategic selection (Principle 1): One tool handles the primary workflow, retrieving expenses by date, rather than fragmenting across multiple tools.

  2. Clear namespace (Principle 2): cameron_get_expenses immediately identifies this as part of Cameron AI's finance domain.

  3. Meaningful context (Principle 3): Returns expense descriptions and categories directly, not IDs requiring lookups. Configurable response_format balances detail against tokens.

  4. Token efficiency (Principle 4): Sensible 50-item default limit, pagination support, and helpful error messages that guide agents toward better queries instead of just failing.

  5. Prompt engineering (Principle 5): Extensive description with usage examples, explicit parameter documentation, and guidance on when to use each option.

Evaluation-Driven Development

Anthropic's approach to building tools is deeply iterative and evaluation-driven. You don't build the perfect tool on the first try; you build a prototype, test it with realistic scenarios, analyze failures, and refine.

The process:

  1. Build a prototype with your best understanding of the requirements
  2. Create realistic evaluation tasks that mirror actual user workflows
  3. Run evaluations and collect metrics (accuracy, token usage, tool errors)
  4. Analyze failures to identify where tool descriptions confuse agents or responses waste tokens
  5. Refine the tool based on findings
  6. Validate with held-out test cases to prevent overfitting

For Cameron AI's expense tool, start with 3-5 evaluation scenarios:

**Test Case 1: Simple monthly summary**
User: "What did I spend last month?"
Expected: Agent calls cameron_get_expenses with last month's date range, returns summary
Success criteria: Correct date calculation, appropriate response_format

**Test Case 2: Category-specific analysis**
User: "How much am I spending on dining out compared to groceries?"
Expected: Agent makes two tool calls (dining + groceries categories), compares totals
Success criteria: Uses category filter, presents clear comparison

**Test Case 3: Budget tracking**
User: "Am I on track with my $500/month dining budget?"
Expected: Agent gets current month's dining expenses, compares to budget
Success criteria: Uses category filter, accurate calculation, considers partial month

Metrics to track:

  • Accuracy: Does the agent solve the task correctly?
  • Token efficiency: Total tokens consumed (prompt + tool responses + reasoning)
  • Tool call errors: How often does the agent pass invalid parameters?
  • Multi-step efficiency: For tasks requiring multiple calls, does the agent choose optimal sequence?
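
A minimal harness for tracking these metrics might look like the following sketch (EvalCase, AgentTranscript, and the runAgent callback are hypothetical stand-ins for your agent runtime):

// Hypothetical evaluation harness: run each scenario and record simple metrics
interface AgentTranscript {
  finalAnswer: string
  toolCalls: { name: string; args: unknown; error?: string }[]
  totalTokens: number
}

interface EvalCase {
  id: string
  userMessage: string
  check: (transcript: AgentTranscript) => boolean // task-specific success criteria
}

async function runEvals(cases: EvalCase[], runAgent: (msg: string) => Promise<AgentTranscript>) {
  for (const c of cases) {
    const transcript = await runAgent(c.userMessage)
    console.log({
      scenario: c.id,
      passed: c.check(transcript), // accuracy
      tokens: transcript.totalTokens, // token efficiency
      toolCalls: transcript.toolCalls.length, // multi-step efficiency
      toolErrors: transcript.toolCalls.filter((t) => t.error).length, // invalid parameters
    })
  }
}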

As Anthropic notes, the key principle is: the larger the effect size, the smaller the sample size you need. Start with 3-5 manual test cases. If your refinements obviously improve agent behavior, you're making progress. Add more evaluation tasks as you discover edge cases in production.

Putting It All Together

Building effective tools for AI agents requires a different mindset than traditional API design. The five principles (strategic selection, clear namespacing, meaningful context, token efficiency, and prompt engineering) all stem from one core insight: agents are non-deterministic users of deterministic tools.

This means your tools must be defensively designed, richly documented, token-conscious, and semantically aligned with how users think about tasks.

Remember from the previous article on agent prompting: prompts give agents heuristics and principles for decision-making, while tools give them capabilities to act. Great agents need both. A perfectly prompted agent will fail with poorly designed tools, and the best tools won't save an agent with unclear instructions.

Want to dive deeper into AI agent development and tooling? Subscribe to Agent Briefings—a weekly LinkedIn newsletter where I share insights on agent architecture, MCP server development, and deployment strategies.

Additional Resources

Agent Briefings

Level up your agent-building skills with weekly deep dives on prompting, tools, and production patterns.