Automation · Difficulty: Advanced

Building a Multi-modal AI Agent on n8n (Voice, PDF, and Vision)

A comprehensive guide to constructing versatile n8n agents that can process voice notes via Deepgram, analyze PDFs with RAG, and 'see' images using the latest vision-capable LLMs.

What You’ll Build

In this guide, you will build a multi-modal AI agent running entirely on n8n that can understand three distinct types of input:

  • Voice notes — Audio files are transcribed via Deepgram and then summarized or analyzed by an LLM.
  • PDF documents — PDFs are parsed, chunked, stored in a vector database, and queried with natural-language questions (RAG).
  • Images — Photos and screenshots are sent to a vision-capable LLM (GPT-4o or Claude) for description and data extraction.

A single webhook receives any of these inputs, detects the media type automatically, and routes the request to the correct processing pipeline. The final product is exposed as a Telegram or Slack bot so your team can interact with it from a phone or desktop.

By the end you will have four working n8n workflows and a unified router that ties them together.


Prerequisites

| Requirement | Purpose | Free Tier Available |
| --- | --- | --- |
| n8n (self-hosted or cloud) | Workflow automation engine | Yes (community edition) |
| Deepgram API key | Speech-to-text transcription | Yes (free credits on signup) |
| OpenAI API key | GPT-4o for chat, embeddings, and vision | No (pay-as-you-go) |
| Claude API key (alternative) | Claude for chat and vision | No (pay-as-you-go) |
| Pinecone API key (optional) | Vector storage for RAG | Yes (free starter index) |
| Telegram Bot Token or Slack App | Chat frontend | Yes |
| Docker & Docker Compose | Running n8n locally | Yes |

You can substitute OpenAI with Claude (or vice versa) for any LLM step. The workflow structure remains the same — only the HTTP Request body changes.


Architecture Overview

The overall system follows a hub-and-spoke pattern:

User (Telegram / Slack / API)
              │
              ▼
      ┌───────────────┐
      │    Webhook    │  ← Unified entry point
      │    Router     │
      └───┬───┬───┬───┘
          │   │   │
    ┌─────┘   │   └─────┐
    ▼         ▼         ▼
  Voice      PDF      Image
Pipeline  Pipeline  Pipeline
    │         │         │
    └─────────┼─────────┘
              ▼
     Response back to user

Each pipeline is an independent n8n workflow that can also run standalone. The router inspects the incoming MIME type (or file extension) and triggers the appropriate sub-workflow with Execute Workflow nodes.
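To make the routing concrete, here is roughly the decision the Switch node in Step 5 performs, written as a hypothetical Code-node sketch (data is the binary property name used throughout this guide):

// Hypothetical Code-node sketch of the routing decision the Switch node
// in Step 5 performs visually. "data" is the binary property name used
// throughout this guide.
const item = $input.first();
const mimeType = item.binary?.data?.mimeType ?? 'application/json';

let route;
if (mimeType.startsWith('audio/')) {
  route = 'voice';
} else if (mimeType === 'application/pdf') {
  route = 'pdf';
} else if (mimeType.startsWith('image/')) {
  route = 'image';
} else {
  route = 'text'; // plain JSON bodies fall through to the question pipeline
}

return [{ json: { ...item.json, route } }];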


Step 1: Set Up n8n with API Credentials

1.1 Launch n8n with Docker Compose

Create a docker-compose.yml in a new project folder:

version: "3.8"
services:
  n8n:
    image: n8nio/n8n:latest
    restart: always
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=changeme
      - WEBHOOK_URL=https://your-domain.com/
    volumes:
      - n8n_data:/home/node/.n8n
volumes:
  n8n_data:

Start the stack:

docker compose up -d

Open http://localhost:5678 and log in. Note that on recent n8n versions (v1 and later), the basic-auth variables above are ignored in favor of built-in user management, so you will be prompted to create an owner account on first launch instead.

1.2 Add Credentials

Navigate to Settings → Credentials and create the following entries:

  1. Deepgram API — Type: Header Auth. Name: Authorization, Value: Token YOUR_DEEPGRAM_KEY.
  2. OpenAI API — Use the built-in OpenAI credential type. Paste your API key.
  3. Pinecone API — Type: Header Auth. Name: Api-Key, Value: YOUR_PINECONE_KEY.
  4. Telegram — Use the built-in Telegram credential. Paste the bot token from BotFather.

Tip: If you prefer Claude, create a Header Auth credential with the name x-api-key and your Anthropic key as the value. Anthropic also requires an anthropic-version header set to 2023-06-01, which you can add directly on each HTTP Request node.
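For reference, these are the two headers every Anthropic request in this guide carries (values are placeholders):

{
  "x-api-key": "YOUR_ANTHROPIC_KEY",
  "anthropic-version": "2023-06-01"
}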

Expected Result

You have n8n running on port 5678 with four credential entries ready. No workflows yet — we will build those next.


Step 2: Build the Voice Processing Pipeline

This pipeline accepts an audio file, sends it to Deepgram for transcription, and passes the transcript to an LLM for summarization.

2.1 Create a New Workflow

Name it Voice Pipeline. Add the following nodes in order:

2.2 Webhook Node (Receive Audio)

| Setting | Value |
| --- | --- |
| HTTP Method | POST |
| Path | /voice |
| Response Mode | Last Node |
| Binary Property | data |

This node will accept multipart/form-data uploads containing an audio file.

2.3 HTTP Request Node (Deepgram Transcription)

| Setting | Value |
| --- | --- |
| Method | POST |
| URL | https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true |
| Authentication | Predefined Credential → Deepgram API |
| Send Body | Binary |
| Input Data Field Name | data |
| Content Type | Auto-detect |

Deepgram returns JSON with the transcript inside results.channels[0].alternatives[0].transcript.
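For orientation, a trimmed sketch of that response; the real payload also includes metadata and per-word timings, but only these fields are used below:

{
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "Good morning everyone, here are the updates from the team...",
            "confidence": 0.98
          }
        ]
      }
    ]
  }
}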

2.4 Set Node (Extract Transcript)

Add a Set node to pull out the transcript:

| Setting | Value |
| --- | --- |
| Name | transcript |
| Value | {{ $json.results.channels[0].alternatives[0].transcript }} |

2.5 OpenAI Node (Summarize)

Use the built-in OpenAI node (or an HTTP Request node for Claude):

| Setting | Value |
| --- | --- |
| Resource | Chat Message |
| Model | gpt-4o |
| System Message | You are a concise assistant. Summarize the following voice transcript in 3-5 bullet points. |
| User Message | {{ $json.transcript }} |

If you prefer Claude, use an HTTP Request node:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": "Summarize this transcript in 3-5 bullet points:\n\n{{ $json.transcript }}"
    }
  ]
}

POST to https://api.anthropic.com/v1/messages with your Anthropic credential.
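One caveat: templating {{ $json.transcript }} straight into a JSON body breaks as soon as the transcript contains a double quote or a newline. A safer pattern is to build the body in a Code node, where JSON.stringify handles the escaping. A sketch (the field name requestBody is an assumption):

// Hypothetical Code node placed before the Claude HTTP Request node.
// JSON.stringify escapes quotes and newlines that would otherwise break
// a hand-templated JSON body.
const transcript = $input.first().json.transcript;

const body = {
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: `Summarize this transcript in 3-5 bullet points:\n\n${transcript}`,
    },
  ],
};

return [{ json: { requestBody: JSON.stringify(body) } }];

Then set the HTTP Request node's raw body to {{ $json.requestBody }} instead of templating the transcript directly.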

2.6 Respond to Webhook Node

Return the summary as JSON. The expressions below assume the OpenAI node; with Claude, the reply text lives at content[0].text rather than message.content:

{
  "transcript": "{{ $node['Set'].json.transcript }}",
  "summary": "{{ $json.message.content }}"
}

Expected Result

Send an audio file via curl:

curl -X POST https://your-domain.com/webhook/voice \
  -F "data=@meeting-notes.mp3"

You receive a JSON response containing the raw transcript and a bullet-point summary. Deepgram typically transcribes a short clip in a couple of seconds, so the total round-trip is usually under five seconds for recordings shorter than five minutes.


Step 3: Build the PDF Analysis Pipeline (RAG)

This pipeline extracts text from a PDF, chunks it, stores embeddings in Pinecone, and answers questions about the document.

3.1 Workflow A — Ingest PDF

Create a workflow called PDF Ingest.

Webhook Node — Same pattern as Step 2 (POST /pdf-ingest, binary data).

Extract from File Node — Use the built-in Extract from File node. Set Operation to PDF and point it at the binary property data. This outputs the full text of the PDF.

Text Splitter (Code Node) — Add a Code node to chunk the text:

// Full PDF text from the Extract from File node
const text = $input.first().json.text;
const chunkSize = 800; // characters per chunk
const overlap = 100;   // characters shared between consecutive chunks
const chunks = [];

// Step forward by (chunkSize - overlap) so each chunk overlaps the previous one
for (let i = 0; i < text.length; i += chunkSize - overlap) {
  chunks.push({
    json: {
      chunk: text.slice(i, i + chunkSize),
      index: chunks.length,
    },
  });
}

return chunks;

This produces an array of items, each containing an 800-character chunk with 100-character overlap. For example, a 2,000-character document yields three chunks starting at offsets 0, 700, and 1,400, with each consecutive pair sharing 100 characters.

OpenAI Embeddings Node — For each chunk, generate an embedding:

| Setting | Value |
| --- | --- |
| Resource | Embedding |
| Model | text-embedding-3-small |
| Input | {{ $json.chunk }} |

HTTP Request Node (Pinecone Upsert) — Send each embedding to Pinecone:

{
  "vectors": [
    {
      "id": "chunk-{{ $json.index }}",
      "values": {{ $json.embedding }},
      "metadata": {
        "text": "{{ $json.chunk }}"
      }
    }
  ]
}

POST to https://YOUR_INDEX-YOUR_PROJECT.svc.YOUR_ENV.pinecone.io/vectors/upsert with the Pinecone credential. Be aware that templating raw chunk text into a JSON string breaks as soon as a chunk contains a double quote or newline; building the payload in a Code node, as in the sketch below, escapes it safely.
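Sending one HTTP request per chunk works, but a Code node can also aggregate every chunk/embedding pair into a single upsert payload, which both avoids the escaping problem and cuts request volume. A sketch, assuming all embeddings fit comfortably in one request (split into batches of, say, 100 vectors for very large documents):

// Hypothetical Code node that collects every chunk/embedding item
// into one Pinecone upsert payload instead of one request per chunk.
const vectors = $input.all().map((item) => ({
  id: `chunk-${item.json.index}`,
  values: item.json.embedding,
  metadata: { text: item.json.chunk },
}));

return [{ json: { vectors } }];

The HTTP Request node can then send this item's JSON directly as the request body.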

3.2 Workflow B — Query PDF

Create a second workflow called PDF Query.

Webhook Node — POST /pdf-query, expects a JSON body { "question": "..." }.

OpenAI Embeddings Node — Embed the user’s question using the same model (text-embedding-3-small).

HTTP Request Node (Pinecone Query) — Query for the top 5 nearest chunks:

{
  "vector": {{ $json.embedding }},
  "topK": 5,
  "includeMetadata": true
}

POST to the Pinecone /query endpoint.
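The shape of the query response is roughly the following (illustrative values); the matches array is what the next node maps over:

{
  "matches": [
    {
      "id": "chunk-12",
      "score": 0.89,
      "metadata": {
        "text": "Total revenue for the third quarter was..."
      }
    }
  ],
  "namespace": ""
}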

Set Node — Concatenate the returned chunks into a context string:

{{ $json.matches.map(m => m.metadata.text).join("\n\n---\n\n") }}

OpenAI Chat Node — Pass the context and question to the LLM:

| Setting | Value |
| --- | --- |
| System Message | Answer the user's question based ONLY on the provided context. If the answer is not in the context, say so. |
| User Message | Context:\n{{ $json.context }}\n\nQuestion: {{ $node['Webhook'].json.body.question }} |

Respond to Webhook — Return the answer, for example as { "answer": "{{ $json.message.content }}" }.

Expected Result

Ingest a PDF:

curl -X POST https://your-domain.com/webhook/pdf-ingest \
  -F "data=@annual-report.pdf"

Then query it:

curl -X POST https://your-domain.com/webhook/pdf-query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was the total revenue in Q3?"}'

You receive an answer grounded in the actual PDF content; the system prompt instructs the model to say so rather than guess when the context does not contain the answer.


Step 4: Build the Vision Pipeline

This pipeline accepts an image and sends it to a vision-capable LLM for analysis.

4.1 Create the Workflow

Name it Vision Pipeline. Add a Webhook node (POST /vision, binary data).

4.2 Convert Image to Base64 (Code Node)

// Read the uploaded image from the binary property "data"
const binaryData = await this.helpers.getBinaryDataBuffer(0, 'data');
const base64 = binaryData.toString('base64');
// Keep the MIME type so the data URL / media_type field below is correct
const mimeType = $input.first().binary.data.mimeType;

return [{
  json: {
    base64Image: base64,
    mimeType: mimeType,
  },
}];

4.3 HTTP Request Node (Vision API)

For GPT-4o:

POST to https://api.openai.com/v1/chat/completions:

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image in detail. Extract any text, data, or notable visual elements."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:{{ $json.mimeType }};base64,{{ $json.base64Image }}"
          }
        }
      ]
    }
  ],
  "max_tokens": 1024
}

For Claude:

POST to https://api.anthropic.com/v1/messages:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "{{ $json.mimeType }}",
            "data": "{{ $json.base64Image }}"
          }
        },
        {
          "type": "text",
          "text": "Describe this image in detail. Extract any text, data, or notable visual elements."
        }
      ]
    }
  ]
}

4.4 Respond to Webhook

Return the description. The expression below matches the OpenAI response shape; for Claude, use {{ $json.content[0].text }} instead:

{
  "description": "{{ $json.choices[0].message.content }}"
}

Expected Result

curl -X POST https://your-domain.com/webhook/vision \
  -F "data=@screenshot.png"

The response contains a detailed natural-language description of the image, including any legible text the model can read and identification of charts, diagrams, or UI elements.


Step 5: Create the Unified Router

Now we wire everything together so a single endpoint can handle any media type.

5.1 Create the Router Workflow

Name it Multi-modal Router.

Webhook Node — POST /agent, binary data, Response Mode: Last Node.

5.2 Switch Node (Detect Media Type)

Add a Switch node with the following routing rules based on {{ $binary.data.mimeType }}:

| Rule | Condition | Output |
| --- | --- | --- |
| Voice | Starts with audio/ | Execute Workflow: Voice Pipeline |
| PDF | Equals application/pdf | Execute Workflow: PDF Ingest, then PDF Query |
| Image | Starts with image/ | Execute Workflow: Vision Pipeline |
| Text/JSON | Equals application/json | Execute Workflow: PDF Query (question only) |

5.3 Execute Workflow Nodes

For each output of the Switch node, add an Execute Workflow node pointing to the corresponding sub-workflow. Pass through all data (binary and JSON).

5.4 Merge Node

Add a Merge node (mode: Merge By Index) that collects outputs from all branches and feeds into a single Respond to Webhook node.
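Because each branch returns a different key (summary, answer, or description), it helps to normalize the output before responding. A minimal Code-node sketch, assuming those three field names:

// Hypothetical Code node between Merge and Respond to Webhook.
// Collapses the branch-specific keys into one response field.
const item = $input.first().json;

const result =
  item.summary ?? item.answer ?? item.description ?? 'No output produced.';

return [{ json: { result } }];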

Expected Result

A single endpoint now handles all input types:

# Voice
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@recording.wav"

# Image
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@photo.jpg"

# PDF
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@document.pdf"

# Text question (for querying previously ingested PDFs)
curl -X POST https://your-domain.com/webhook/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize the key findings."}'

Each request is automatically routed to the correct pipeline and the response is returned through the same webhook.


Step 6: Add a Telegram or Slack Bot Frontend

6.1 Telegram Bot Setup

Replace the Webhook trigger in the Router workflow with a Telegram Trigger node:

| Setting | Value |
| --- | --- |
| Credential | Your Telegram Bot credential |
| Updates | Message |

Telegram messages can contain text, voice notes, photos, or documents. Add a Code node after the trigger to normalize the input:

const msg = $input.first().json.message;

if (msg.voice) {
  // Voice note — the file itself is downloaded later via the getFile API
  const fileId = msg.voice.file_id;
  return [{ json: { type: 'voice', fileId } }];
} else if (msg.document && msg.document.mime_type === 'application/pdf') {
  return [{ json: { type: 'pdf', fileId: msg.document.file_id } }];
} else if (msg.photo) {
  // Telegram sends several resolutions; the last array entry is the largest
  const fileId = msg.photo[msg.photo.length - 1].file_id;
  return [{ json: { type: 'image', fileId } }];
} else if (msg.text) {
  return [{ json: { type: 'text', question: msg.text } }];
}

// Fallback so the Code node always returns an item
return [{ json: { type: 'unsupported' } }];

After classification, use an HTTP Request node to download the file from Telegram’s getFile API, then pass it to the appropriate sub-workflow via the same Switch logic from Step 5.
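In practice that download is two calls: getFile resolves a file_path, then a second request fetches the bytes. Two HTTP Request nodes work fine; as a single hypothetical Code-node sketch (assuming this.helpers.httpRequest and prepareBinaryData are available in your n8n version, with BOT_TOKEN as a placeholder):

// Step 1: resolve the internal file path for the file_id
const token = 'BOT_TOKEN';
const fileId = $input.first().json.fileId;

const meta = await this.helpers.httpRequest({
  method: 'GET',
  url: `https://api.telegram.org/bot${token}/getFile`,
  qs: { file_id: fileId },
});

// Step 2: download the bytes and attach them as the binary property "data"
const file = await this.helpers.httpRequest({
  method: 'GET',
  url: `https://api.telegram.org/file/bot${token}/${meta.result.file_path}`,
  encoding: 'arraybuffer',
});

return [{
  json: { filePath: meta.result.file_path },
  binary: {
    data: await this.helpers.prepareBinaryData(Buffer.from(file), meta.result.file_path),
  },
}];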

At the end of each branch, add a Telegram node (Send Message) to reply in the same chat:

| Setting | Value |
| --- | --- |
| Chat ID | {{ $node['Telegram Trigger'].json.message.chat.id }} |
| Text | {{ $json.summary || $json.answer || $json.description }} |

6.2 Slack Alternative

For Slack, use the Slack Trigger node listening for message events in a specific channel. The flow is identical — detect the file type from event.files[0].mimetype, download the file (Slack requires your bot token in an Authorization: Bearer header when fetching url_private URLs), process it, and reply using a Slack node (Post Message).

Expected Result

Send a voice note, a photo, or a PDF to your Telegram bot (or Slack channel). Within seconds, the bot replies with a summary, a description, or an answer. Team members can interact with the agent from their phones without any technical knowledge.


Frequently Asked Questions

Q: What audio formats does Deepgram support? Deepgram accepts MP3, WAV, OGG, FLAC, M4A, and WebM. Telegram voice notes are OGG by default, which works without conversion.

Q: How large can uploaded PDFs be? n8n’s default binary data limit is 16 MB. For larger files, set the environment variable N8N_DEFAULT_BINARY_DATA_MODE=filesystem in your Docker configuration to stream files to disk instead of holding them in memory.

Q: Can I use a free vector database instead of Pinecone? Yes. You can substitute Pinecone with Qdrant (self-hosted via Docker), Chroma, or Milvus. The HTTP Request node configuration changes to match the alternative API, but the workflow structure stays the same.

Q: What if I want to use local models instead of OpenAI or Claude? You can point the HTTP Request nodes at any OpenAI-compatible API. For example, run Ollama locally and set the URL to http://localhost:11434/v1/chat/completions. Note that vision and embedding capabilities vary by model.
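For example, a chat request to a local Ollama server uses the same body shape as the OpenAI calls above; only the URL and the model name change (use whatever ollama list shows on your machine):

{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Summarize this transcript in 3-5 bullet points: ..."
    }
  ]
}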

Q: How do I handle errors in production? Add an Error Trigger workflow in n8n that catches failures and sends notifications (email, Slack, or Telegram). Within each pipeline, add IF nodes after HTTP Request nodes to check for non-200 status codes before proceeding.


Next Steps

With the multi-modal agent running, here are some directions to explore:

  • Add memory — Store conversation history in a database (Postgres or Redis) so the agent can reference previous interactions.
  • Expand input types — Add support for video files by extracting keyframes and audio tracks separately, then processing each through the existing pipelines.
  • Build an approval workflow — For high-stakes actions (e.g., updating a CRM or sending a report), insert a human-in-the-loop approval step before execution.
  • Monitor usage — Track token consumption, latency, and error rates per pipeline using n8n’s built-in execution log or an external dashboard like Grafana.
  • Fine-tune prompts — Customize the system messages in each LLM node for your domain. A legal team needs different summarization patterns than a marketing team.

The modular design means you can upgrade any single pipeline — swap Deepgram for Whisper, replace Pinecone with Qdrant, or switch from GPT-4o to Claude — without touching the rest of the system.

Last Updated: 3/10/2026
#n8n #Multi-modal #AI Agents #Deepgram