Multimodal Tool Outputs

What This Feature Unlocks

Returning images and files from tools enables agentic feedback loops on new modalities. For example, instead of dumping all the data into an agent and hoping for the best, you can have a tool generate a visualization or load a PDF report and let the agent provide insights based on that output, just as a data analyst would. This saves context window space and unlocks autonomous agentic workflows for many new use cases:

New Use Cases

Software Development

Agents can check websites autonomously and iterate until all elements are properly positioned, enabling them to tackle complex projects without manual screenshot feedback.

Brand Asset Generation

Provide brand guidelines, logos, and messaging, then let agents iterate on image and video generation (including Sora 2) until outputs fully match your expectations.

Screen-Aware Assistance

Build agents that help visually impaired users navigate websites, or create customer support agents that can see the user’s current webpage and give more precise assistance.

Data Analytics

Generate visual graphs and analyze PDF reports, then let agents provide insights based on these outputs without overloading the context window.

Output Formats

Images (PNG, JPG)

To return an image from a tool, use one of these options:

Use the ToolOutputImage class.
Return a dict with the type set to "image" and either image_url (URL or data URL) or file_id.
Use our convenience tool_output_image_from_path function.

from agency_swarm import BaseTool, ToolOutputImage, ToolOutputImageDict
from agency_swarm.tools.utils import tool_output_image_from_path
from pydantic import Field

class FetchGalleryImage(BaseTool):
    """Return a static gallery image."""
    detail: str = Field(default="auto", description="Level of detail")

    def run(self) -> ToolOutputImage:
        return ToolOutputImage(
            image_url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
            detail=self.detail,
        )

class FetchGalleryImageDict(BaseTool):
    """Dict variant of the same image output."""
    detail: str = Field(default="auto", description="Level of detail")

    def run(self) -> ToolOutputImageDict:
        return {
            "type": "image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
            "detail": self.detail,
        }

class FetchLocalImage(BaseTool):
    """Load an image from disk using the helper."""
    path: str = Field(default="examples/data/landscape_scene.png", description="Image to publish")

    def run(self) -> ToolOutputImage:
        return tool_output_image_from_path(self.path, detail="auto")
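
The dict variant accepts a data URL in image_url, so it can help to see how such a URL is put together. The following is a minimal sketch using only the standard library; image_bytes_to_data_url and image_file_to_data_url are hypothetical helpers written for illustration, not part of agency_swarm, and the sketch makes no claim about how tool_output_image_from_path is implemented.

```python
import base64
import mimetypes
from pathlib import Path


def image_bytes_to_data_url(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(path: str) -> str:
    """Read a PNG or JPG from disk and return it as a data URL (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type not in ("image/png", "image/jpeg"):
        raise ValueError(f"Unsupported image type for {path!r}: {mime_type}")
    return image_bytes_to_data_url(Path(path).read_bytes(), mime_type)


# Dict variant of an image output, in the shape described above.
# The bytes are only the PNG signature, used as a placeholder, not a real image.
output = {
    "type": "image",
    "image_url": image_bytes_to_data_url(b"\x89PNG\r\n\x1a\n"),
    "detail": "auto",
}
```

In a real tool, the resulting dict would be the return value of run; for local files the built-in helper is likely the simpler choice.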

Files (PDF)

Similarly, to return a file from a tool, use the ToolOutputFileContent class or one of the helper functions:

from agency_swarm import BaseTool, ToolOutputFileContent
from agency_swarm.tools.utils import tool_output_file_from_path, tool_output_file_from_url
from pydantic import Field

class FetchReferenceReport(BaseTool):
    """Return a reference PDF hosted remotely."""
    source_url: str = Field(
        default="https://raw.githubusercontent.com/VRSEN/agency-swarm/main/examples/data/sample_report.pdf",
        description="Remote file to share",
    )

    def run(self) -> ToolOutputFileContent:
        return ToolOutputFileContent(file_url=self.source_url)

class FetchLocalReport(BaseTool):
    """Return a report stored on disk."""
    path: str = Field(default="examples/data/sample_report.pdf", description="Local file path")

    def run(self) -> ToolOutputFileContent:
        return tool_output_file_from_path(self.path)

class FetchRemoteReport(BaseTool):
    """Return a remote file using the helper."""
    archive_url: str = Field(default="https://example.com/document.pdf", description="File to expose")

    def run(self) -> ToolOutputFileContent:
        return tool_output_file_from_url(self.archive_url)

When you pass file_data, also set filename to suggest a download name; URL-based outputs rely on the remote server's metadata instead.

tool_output_file_from_path only supports PDF files.
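
Because of this restriction, a tool may want to reject non-PDF paths before calling the helper. The sketch below is one possible guard using only the standard library; ensure_pdf and looks_like_pdf are hypothetical names introduced here, and the check relies on the fact that PDF files begin with the bytes %PDF-.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this signature


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header signature."""
    return data.startswith(PDF_MAGIC)


def ensure_pdf(path: str) -> Path:
    """Validate that path points to a PDF file; raise ValueError otherwise."""
    file_path = Path(path)
    # Cheap check first: the extension must be .pdf (case-insensitive).
    if file_path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got {file_path.name!r}")
    # Then confirm the content really is a PDF by reading its header.
    with file_path.open("rb") as handle:
        header = handle.read(len(PDF_MAGIC))
    if not looks_like_pdf(header):
        raise ValueError(f"{file_path.name!r} does not contain PDF data")
    return file_path
```

Inside run, a tool could call ensure_pdf(self.path) and only then pass the path to tool_output_file_from_path, so the agent receives a clear error message instead of an unexpected failure.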

Need to load local files without custom logic? Use the built-in LoadFileAttachment tool instead of creating a custom tool. It handles both images and PDFs and uses these same utility functions under the hood.

Combining Multiple Outputs

To return several outputs at once, return a list from run.

from agency_swarm import BaseTool, ToolOutputFileContent, ToolOutputImage, ToolOutputText

class PrepareShowcase(BaseTool):
    """Return rich media and a short description."""
    teaser_a: str = "https://example.com/teaser-a.png"
    teaser_b: str = "https://example.com/teaser-b.png"
    report_id: str = "file-report-123"

    def run(self) -> list:
        return [
            ToolOutputImage(image_url=self.teaser_a),
            ToolOutputImage(image_url=self.teaser_b),
            ToolOutputText(text="Gallery updated: Teaser A and Teaser B now live."),
            ToolOutputFileContent(file_id=self.report_id),
        ]
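
When a list contains more than one image, a text entry that names each image in order helps the agent refer to them unambiguously. The following is a rough sketch of generating such a caption; describe_images is a hypothetical helper, and the image names and URLs are the placeholder values from the example above.

```python
def describe_images(images: dict[str, str]) -> str:
    """Build a caption naming each image in the order it is returned."""
    lines = [f"{index}. {name}" for index, name in enumerate(images, start=1)]
    return "Images returned, in order:\n" + "\n".join(lines)


# Map a human-readable name to each image URL (dicts preserve insertion order).
images = {
    "Teaser A": "https://example.com/teaser-a.png",
    "Teaser B": "https://example.com/teaser-b.png",
}

# Image entries in the dict form, followed by the caption text.
image_outputs = [{"type": "image", "image_url": url} for url in images.values()]
caption = describe_images(images)
```

In a tool, the caption string could be wrapped in ToolOutputText(text=caption) and appended to the returned list after the image outputs.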

Complete Example (Chart generation tool)

Here’s a complete example using BaseTool:

import base64
import io

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from agency_swarm import Agent, BaseTool, ToolOutputImage
from pydantic import Field

class GenerateChartTool(BaseTool):
    """Generate a bar chart from data."""

    data: list[float] = Field(..., description="Data points for the chart")
    labels: list[str] = Field(..., description="Labels for each data point")

    def run(self) -> ToolOutputImage:
        """Generate and return the chart as a base64-encoded image."""
        # Create the chart
        fig, ax = plt.subplots()
        ax.bar(self.labels, self.data)

        # Render the figure to an in-memory PNG and encode it as base64
        buf = io.BytesIO()
        fig.savefig(buf, format="png")
        plt.close(fig)  # Release the figure so repeated calls do not accumulate memory
        image_base64 = base64.b64encode(buf.getvalue()).decode("utf-8")

        # Return in multimodal format
        return ToolOutputImage(image_url=f"data:image/png;base64,{image_base64}")

# Create an agent with the tool
agent = Agent(
    name="DataViz",
    instructions="You generate charts and visualizations for data analysis.",
    tools=[GenerateChartTool]
)

Tools defined with the function_tool decorator support multimodal outputs in exactly the same way as BaseTool classes:

from agency_swarm import ToolOutputImage, function_tool

@function_tool
def fetch_gallery_image() -> ToolOutputImage:
    return ToolOutputImage(
        image_url="https://upload.wikimedia.org/wikipedia/commons/0/0c/GoldenGateBridge-001.jpg",
        detail="auto",
    )

Tips & Best Practices

Base64-encoded images can be large. Use file references for large content.
Compress screenshots and other visuals before returning them to cut token usage without sacrificing clarity.
Include the image names in your textual response whenever you return more than one image so the agent can reference them unambiguously.
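
To make the first tip concrete: base64 encoding turns every 3 bytes into 4 characters, so an inline image is roughly a third larger than the raw file. The sketch below estimates the encoded size and decides whether to inline an image or fall back to a file reference. The 512 KiB budget and the function names are assumptions chosen for illustration, not limits defined by the framework.

```python
import base64
import math

MAX_INLINE_BYTES = 512 * 1024  # hypothetical budget for inline base64 payloads


def base64_length(raw_size: int) -> int:
    """Return the length of the padded base64 encoding of raw_size bytes."""
    return 4 * math.ceil(raw_size / 3)


def should_inline(raw_size: int, limit: int = MAX_INLINE_BYTES) -> bool:
    """Return True if the encoded payload fits within the inline budget."""
    return base64_length(raw_size) <= limit


# Sanity check of the formula against the real encoder.
sample = b"x" * 10
encoded_size = len(base64.b64encode(sample))
```

A tool could call should_inline(len(image_bytes)) and return a data URL when it is True, or upload the content and return a file_id or URL otherwise.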

Real Examples

examples/multimodal_outputs.py
