Building a Multi-modal AI Agent on n8n (Voice, PDF, and Vision)
A comprehensive guide to constructing versatile n8n agents that process voice notes via Deepgram, analyze PDFs with retrieval-augmented generation (RAG), and 'see' images using the latest vision-capable LLMs.
What You’ll Build
In this guide, you will build a multi-modal AI agent running entirely on n8n that can understand three distinct types of input:
- Voice notes — Audio files are transcribed via Deepgram and then summarized or analyzed by an LLM.
- PDF documents — PDFs are parsed, chunked, stored in a vector database, and queried with natural-language questions (RAG).
- Images — Photos and screenshots are sent to a vision-capable LLM (GPT-4o or Claude) for description and data extraction.
A single webhook receives any of these inputs, detects the media type automatically, and routes the request to the correct processing pipeline. The final product is exposed as a Telegram or Slack bot so your team can interact with it from a phone or desktop.
By the end you will have four working n8n workflows and a unified router that ties them together.
Prerequisites
| Requirement | Purpose | Free Tier Available |
|---|---|---|
| n8n (self-hosted or cloud) | Workflow automation engine | Yes (community edition) |
| Deepgram API key | Speech-to-text transcription | Yes (free credits on signup) |
| OpenAI API key | GPT-4o for chat, embeddings, and vision | No (pay-as-you-go) |
| Claude API key (alternative) | Claude for chat and vision | No (pay-as-you-go) |
| Pinecone API key (optional) | Vector storage for RAG | Yes (free starter index) |
| Telegram Bot Token or Slack App | Chat frontend | Yes |
| Docker & Docker Compose | Running n8n locally | Yes |
You can substitute OpenAI with Claude (or vice-versa) for any LLM step. The workflow structure remains the same — only the HTTP Request body changes.
Architecture Overview
The overall system follows a hub-and-spoke pattern:
```
User (Telegram / Slack / API)
            │
            ▼
     ┌─────────────┐
     │   Webhook   │  ← Unified entry point
     │   Router    │
     └──┬─────┬──┬─┘
        │     │  │
   ┌────┘     │  └────┐
   ▼          ▼       ▼
 Voice       PDF    Image
Pipeline  Pipeline  Pipeline
   │          │       │
   └─────┬────┴───────┘
         ▼
  Response back
    to user
```
Each pipeline is an independent n8n workflow that can also run standalone. The router inspects the incoming MIME type (or file extension) and triggers the appropriate sub-workflow with Execute Workflow nodes.
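As a sketch, the router's dispatch logic looks like this (an illustrative plain function; in the actual build this is expressed as Switch-node rules in Step 5, and the pipeline names refer to the workflows built below):

```javascript
// Illustrative sketch of the router's MIME-type dispatch.
// In n8n this logic lives in a Switch node, not a Code node.
function routeByMimeType(mimeType) {
  if (mimeType.startsWith('audio/')) return 'Voice Pipeline';
  if (mimeType === 'application/pdf') return 'PDF Pipeline';
  if (mimeType.startsWith('image/')) return 'Vision Pipeline';
  if (mimeType === 'application/json') return 'PDF Query'; // plain-text question
  return 'unsupported';
}
```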
Step 1: Set Up n8n with API Credentials
1.1 Launch n8n with Docker Compose
Create a `docker-compose.yml` in a new project folder:

```yaml
version: "3.8"
services:
  n8n:
    image: n8nio/n8n:latest
    restart: always
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=changeme
      - WEBHOOK_URL=https://your-domain.com/
    volumes:
      - n8n_data:/home/node/.n8n
volumes:
  n8n_data:
```
Start the stack:

```shell
docker compose up -d
```

Open `http://localhost:5678` and log in.
1.2 Add Credentials
Navigate to Settings → Credentials and create the following entries:
- Deepgram API — Type: Header Auth. Name: `Authorization`, Value: `Token YOUR_DEEPGRAM_KEY`.
- OpenAI API — Use the built-in OpenAI credential type. Paste your API key.
- Pinecone API — Type: Header Auth. Name: `Api-Key`, Value: `YOUR_PINECONE_KEY`.
- Telegram — Use the built-in Telegram credential. Paste the bot token from BotFather.

Tip: If you prefer Claude, create a Header Auth credential with the name `x-api-key` and your Anthropic key as the value, plus a second header `anthropic-version` set to `2023-06-01`.
Expected Result
You have n8n running on port 5678 with four credential entries ready. No workflows yet — we will build those next.
Step 2: Build the Voice Processing Pipeline
This pipeline accepts an audio file, sends it to Deepgram for transcription, and passes the transcript to an LLM for summarization.
2.1 Create a New Workflow
Name it Voice Pipeline. Add the following nodes in order:
2.2 Webhook Node (Receive Audio)
| Setting | Value |
|---|---|
| HTTP Method | POST |
| Path | /voice |
| Response Mode | Last Node |
| Binary Property | data |
This node will accept multipart/form-data uploads containing an audio file.
2.3 HTTP Request Node (Deepgram Transcription)
| Setting | Value |
|---|---|
| Method | POST |
| URL | https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true |
| Authentication | Predefined Credential → Deepgram API |
| Send Body | Binary |
| Input Data Field Name | data |
| Content Type | Auto-detect |
Deepgram returns JSON with the transcript at `results.channels[0].alternatives[0].transcript`.
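To sanity-check that path, here is the same lookup against a trimmed-down response object (the sample values are invented for illustration; optional chaining guards against an empty result):

```javascript
// Minimal shape of a Deepgram /v1/listen response (trimmed, values invented).
const response = {
  results: {
    channels: [
      { alternatives: [{ transcript: 'Hello from the meeting.', confidence: 0.98 }] },
    ],
  },
};

// Same path the Set node expression uses, with optional chaining as a guard
// so an empty result yields '' instead of a thrown error.
const transcript =
  response.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? '';
```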
2.4 Set Node (Extract Transcript)
Add a Set node to pull out the transcript:
| Setting | Value |
|---|---|
| Name | transcript |
| Value | {{ $json.results.channels[0].alternatives[0].transcript }} |
2.5 OpenAI Node (Summarize)
Use the built-in OpenAI node (or an HTTP Request node for Claude):
| Setting | Value |
|---|---|
| Resource | Chat Message |
| Model | gpt-4o |
| System Message | You are a concise assistant. Summarize the following voice transcript in 3-5 bullet points. |
| User Message | {{ $json.transcript }} |
If you prefer Claude, use an HTTP Request node:
```json
{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": "Summarize this transcript in 3-5 bullet points:\n\n{{ $json.transcript }}"
    }
  ]
}
```

POST to `https://api.anthropic.com/v1/messages` with your Anthropic credential.
2.6 Respond to Webhook Node
Return the transcript and summary as JSON. Note that the built-in OpenAI node exposes the reply at `$json.message.content`; if you used the Claude HTTP Request instead, the text is at `$json.content[0].text`.

```json
{
  "transcript": "{{ $node['Set'].json.transcript }}",
  "summary": "{{ $json.message.content }}"
}
```
Expected Result
Send an audio file via curl:

```shell
curl -X POST https://your-domain.com/webhook/voice \
  -F "data=@meeting-notes.mp3"
```
You receive a JSON response containing the raw transcript and a bullet-point summary. Deepgram processes most audio in under two seconds; the total round-trip should be under five seconds for clips shorter than five minutes.
Step 3: Build the PDF Analysis Pipeline (RAG)
This pipeline extracts text from a PDF, chunks it, stores embeddings in Pinecone, and answers questions about the document.
3.1 Workflow A — Ingest PDF
Create a workflow called PDF Ingest.
Webhook Node — Same pattern as Step 2 (POST /pdf-ingest, binary data).
Extract from File Node — Use the built-in Extract from File node. Set Operation to PDF and input the binary property data. This outputs the full text of the PDF.
Text Splitter (Code Node) — Add a Code node to chunk the text:

```javascript
// Split the extracted PDF text into overlapping chunks for embedding.
const text = $input.first().json.text;
const chunkSize = 800;
const overlap = 100;
const chunks = [];
for (let i = 0; i < text.length; i += chunkSize - overlap) {
  chunks.push({
    json: {
      chunk: text.slice(i, i + chunkSize),
      index: chunks.length,
    },
  });
}
return chunks;
```
This produces an array of items, each containing an 800-character chunk with 100-character overlap.
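To make the stride visible, here is the same loop with toy numbers: a chunk size of 10 and an overlap of 3 means each new chunk starts 7 characters after the previous one, so adjacent chunks share their boundary text:

```javascript
// Same chunking logic as above, with small numbers so the overlap is visible.
const text = 'abcdefghijklmnopqrst'; // 20 characters
const chunkSize = 10;
const overlap = 3;
const chunks = [];
for (let i = 0; i < text.length; i += chunkSize - overlap) {
  chunks.push(text.slice(i, i + chunkSize));
}
// chunks[0] = 'abcdefghij' (positions 0-9)
// chunks[1] = 'hijklmnopq' (positions 7-16, shares 'hij' with chunk 0)
// chunks[2] = 'opqrst'     (positions 14-19)
```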
OpenAI Embeddings Node — For each chunk, generate an embedding:
| Setting | Value |
|---|---|
| Resource | Embedding |
| Model | text-embedding-3-small |
| Input | {{ $json.chunk }} |
HTTP Request Node (Pinecone Upsert) — Send each embedding to Pinecone:
```json
{
  "vectors": [
    {
      "id": "chunk-{{ $json.index }}",
      "values": {{ $json.embedding }},
      "metadata": {
        "text": "{{ $json.chunk }}"
      }
    }
  ]
}
```

POST to `https://YOUR_INDEX-YOUR_PROJECT.svc.YOUR_ENV.pinecone.io/vectors/upsert` with the Pinecone credential.
3.2 Workflow B — Query PDF
Create a second workflow called PDF Query.
Webhook Node — POST /pdf-query, expects JSON body { "question": "..." }.
OpenAI Embeddings Node — Embed the user’s question using the same model (text-embedding-3-small).
HTTP Request Node (Pinecone Query) — Query for the top 5 nearest chunks:
```json
{
  "vector": {{ $json.embedding }},
  "topK": 5,
  "includeMetadata": true
}
```

POST to the Pinecone `/query` endpoint.
Set Node — Concatenate the returned chunks into a single field named `context` (the next node's expression reads `$json.context`):

```
{{ $json.matches.map(m => m.metadata.text).join("\n\n---\n\n") }}
```
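The same expression as plain code, run against a trimmed-down Pinecone query response (the match contents are invented for illustration; the `metadata.text` field is the one written during ingest):

```javascript
// Trimmed example of a Pinecone /query response (values invented).
const json = {
  matches: [
    { id: 'chunk-0', score: 0.91, metadata: { text: 'Revenue grew 12% in Q3.' } },
    { id: 'chunk-7', score: 0.88, metadata: { text: 'Operating costs fell 4%.' } },
  ],
};

// Same logic as the Set node expression: join the chunk texts
// with a visible separator so the LLM can tell chunks apart.
const context = json.matches.map((m) => m.metadata.text).join('\n\n---\n\n');
```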
OpenAI Chat Node — Pass the context and question to the LLM:
| Setting | Value |
|---|---|
| System Message | Answer the user's question based ONLY on the provided context. If the answer is not in the context, say so. |
| User Message | Context:\n{{ $json.context }}\n\nQuestion: {{ $node['Webhook'].json.body.question }} |
Respond to Webhook — Return the answer.
Expected Result
Ingest a PDF:

```shell
curl -X POST https://your-domain.com/webhook/pdf-ingest \
  -F "data=@annual-report.pdf"
```

Then query it:

```shell
curl -X POST https://your-domain.com/webhook/pdf-query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was the total revenue in Q3?"}'
```
You receive a grounded answer drawn from the actual PDF content, with no hallucination outside the provided context.
Step 4: Build the Vision Pipeline
This pipeline accepts an image and sends it to a vision-capable LLM for analysis.
4.1 Create the Workflow
Name it Vision Pipeline. Add a Webhook node (POST /vision, binary data).
4.2 Convert Image to Base64 (Code Node)
```javascript
// Read the incoming binary attachment and convert it to base64
// so it can be embedded in the vision API request body.
const binaryData = await this.helpers.getBinaryDataBuffer(0, 'data');
const base64 = binaryData.toString('base64');
const mimeType = $input.first().binary.data.mimeType;

return [{
  json: {
    base64Image: base64,
    mimeType: mimeType,
  },
}];
```
4.3 HTTP Request Node (Vision API)
For GPT-4o:
POST to `https://api.openai.com/v1/chat/completions`:

```json
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image in detail. Extract any text, data, or notable visual elements."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:{{ $json.mimeType }};base64,{{ $json.base64Image }}"
          }
        }
      ]
    }
  ],
  "max_tokens": 1024
}
```
For Claude:
POST to `https://api.anthropic.com/v1/messages`:

```json
{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "{{ $json.mimeType }}",
            "data": "{{ $json.base64Image }}"
          }
        },
        {
          "type": "text",
          "text": "Describe this image in detail. Extract any text, data, or notable visual elements."
        }
      ]
    }
  ]
}
```
4.4 Respond to Webhook
Return the description. The expression below matches the OpenAI response shape; with Claude, use `$json.content[0].text` instead:

```json
{
  "description": "{{ $json.choices[0].message.content }}"
}
```
Expected Result
```shell
curl -X POST https://your-domain.com/webhook/vision \
  -F "data=@screenshot.png"
```
The response contains a detailed natural-language description of the image, including any visible text extracted via OCR and identification of charts, diagrams, or UI elements.
Step 5: Create the Unified Router
Now we wire everything together so a single endpoint can handle any media type.
5.1 Create the Router Workflow
Name it Multi-modal Router.
Webhook Node — POST /agent, binary data, Response Mode: Last Node.
5.2 Switch Node (Detect Media Type)
Add a Switch node with the following routing rules based on {{ $binary.data.mimeType }}:
| Rule | Condition | Output |
|---|---|---|
| Voice | Starts with `audio/` | → Execute Workflow: Voice Pipeline |
| PDF | Equals `application/pdf` | → Execute Workflow: PDF Ingest, then PDF Query |
| Image | Starts with `image/` | → Execute Workflow: Vision Pipeline |
| Text/JSON | Equals `application/json` | → Execute Workflow: PDF Query (question only) |
5.3 Execute Workflow Nodes
For each output of the Switch node, add an Execute Workflow node pointing to the corresponding sub-workflow. Pass through all data (binary and JSON).
5.4 Merge Node
Add a Merge node (mode: Merge By Index) that collects outputs from all branches and feeds into a single Respond to Webhook node.
Expected Result
A single endpoint now handles all input types:
```shell
# Voice
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@recording.wav"

# Image
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@photo.jpg"

# PDF
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@document.pdf"

# Text question (for querying previously ingested PDFs)
curl -X POST https://your-domain.com/webhook/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize the key findings."}'
```
Each request is automatically routed to the correct pipeline and the response is returned through the same webhook.
Step 6: Add a Telegram or Slack Bot Frontend
6.1 Telegram Bot Setup
Replace the Webhook trigger in the Router workflow with a Telegram Trigger node:
| Setting | Value |
|---|---|
| Credential | Your Telegram Bot credential |
| Updates | Message |
Telegram messages can contain text, voice notes, photos, or documents. Add a Code node after the trigger to normalize the input:
```javascript
// Classify the incoming Telegram message by its media type.
const msg = $input.first().json.message;

if (msg.voice) {
  // Voice note — the file is downloaded via the Telegram API later.
  return [{ json: { type: 'voice', fileId: msg.voice.file_id } }];
} else if (msg.document && msg.document.mime_type === 'application/pdf') {
  return [{ json: { type: 'pdf', fileId: msg.document.file_id } }];
} else if (msg.photo) {
  // Telegram sends several sizes; the last array entry is the largest.
  const fileId = msg.photo[msg.photo.length - 1].file_id;
  return [{ json: { type: 'image', fileId } }];
} else if (msg.text) {
  return [{ json: { type: 'text', question: msg.text } }];
}
// Anything else (stickers, locations, etc.) is flagged as unsupported.
return [{ json: { type: 'unsupported' } }];
```
After classification, use an HTTP Request node to download the file from Telegram’s getFile API, then pass it to the appropriate sub-workflow via the same Switch logic from Step 5.
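The Telegram download is a two-step process: `getFile` resolves a `file_id` to a server-side `file_path`, and the bytes are then fetched from a separate file URL. A sketch of the URL construction (the Bot API endpoints are real; the token, IDs, and helper function are placeholders for illustration):

```javascript
// Telegram file download happens in two steps:
// 1) GET /bot<token>/getFile?file_id=...  → returns result.file_path
// 2) GET https://api.telegram.org/file/bot<token>/<file_path> → the bytes
function telegramFileUrls(botToken, fileId, filePath) {
  return {
    // Step 1: resolve the file_id to a server-side path.
    getFileUrl: `https://api.telegram.org/bot${botToken}/getFile?file_id=${fileId}`,
    // Step 2: download the actual file using the resolved path.
    downloadUrl: `https://api.telegram.org/file/bot${botToken}/${filePath}`,
  };
}
```

In n8n, the first request goes in one HTTP Request node and its `file_path` output feeds the second.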
At the end of each branch, add a Telegram node (Send Message) to reply in the same chat:
| Setting | Value |
|---|---|
| Chat ID | {{ $node['Telegram Trigger'].json.message.chat.id }} |
| Text | {{ $json.summary \|\| $json.description \|\| $json.answer }} |

(The Text expression falls back across the output fields of the three pipelines; adjust the field names to whatever your branches actually return.)
6.2 Slack Alternative
For Slack, use the Slack Trigger node listening for message events in a specific channel. The flow is identical — detect the file type from `event.files[0].mimetype`, download via Slack’s `files.info` API, process, and reply using a Slack node (Post Message).
Expected Result
Send a voice note, a photo, or a PDF to your Telegram bot (or Slack channel). Within seconds, the bot replies with a summary, a description, or an answer. Team members can interact with the agent from their phones without any technical knowledge.
Frequently Asked Questions
Q: What audio formats does Deepgram support?
Deepgram accepts MP3, WAV, OGG, FLAC, M4A, and WebM. Telegram voice notes are OGG by default, which works without conversion.
Q: How large can uploaded PDFs be?
n8n’s default binary data limit is 16 MB. For larger files, set the environment variable `N8N_DEFAULT_BINARY_DATA_MODE=filesystem` in your Docker configuration to stream files to disk instead of holding them in memory.
Q: Can I use a free vector database instead of Pinecone?
Yes. You can substitute Pinecone with Qdrant (self-hosted via Docker), Chroma, or Milvus. The HTTP Request node configuration changes to match the alternative API, but the workflow structure stays the same.
Q: What if I want to use local models instead of OpenAI or Claude?
You can point the HTTP Request nodes at any OpenAI-compatible API. For example, run Ollama locally and set the URL to `http://localhost:11434/v1/chat/completions`. Note that vision and embedding capabilities vary by model.
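To make the swap concrete, here is a sketch of building such a request; only the base URL and model name change between backends (the helper function is illustrative, and `llama3` is an example model name):

```javascript
// Build an OpenAI-compatible chat-completions request.
// Only baseUrl and model differ between OpenAI, Ollama, or
// any other server that implements the same API shape.
function buildChatRequest(baseUrl, model, userMessage) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    body: {
      model,
      messages: [{ role: 'user', content: userMessage }],
    },
  };
}

// Pointing at a local Ollama server instead of api.openai.com:
const req = buildChatRequest('http://localhost:11434', 'llama3', 'Summarize this.');
```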
Q: How do I handle errors in production?
Add an Error Trigger workflow in n8n that catches failures and sends notifications (email, Slack, or Telegram). Within each pipeline, add IF nodes after HTTP Request nodes to check for non-200 status codes before proceeding.
Next Steps
With the multi-modal agent running, here are some directions to explore:
- Add memory — Store conversation history in a database (Postgres or Redis) so the agent can reference previous interactions.
- Expand input types — Add support for video files by extracting keyframes and audio tracks separately, then processing each through the existing pipelines.
- Build an approval workflow — For high-stakes actions (e.g., updating a CRM or sending a report), insert a human-in-the-loop approval step before execution.
- Monitor usage — Track token consumption, latency, and error rates per pipeline using n8n’s built-in execution log or an external dashboard like Grafana.
- Fine-tune prompts — Customize the system messages in each LLM node for your domain. A legal team needs different summarization patterns than a marketing team.
The modular design means you can upgrade any single pipeline — swap Deepgram for Whisper, replace Pinecone with Qdrant, or switch from GPT-4o to Claude — without touching the rest of the system.