Automation · Difficulty: Advanced

Building a Multi-modal AI Agent on n8n (Voice, PDF, and Vision)

A comprehensive guide to constructing versatile n8n agents that can process voice notes via Deepgram, analyze PDFs with RAG, and 'see' images using the latest vision-capable LLMs.

What You’ll Build

In this guide, you will build a multi-modal AI agent running entirely on n8n that can understand three distinct types of input:

  • Voice notes — Audio files are transcribed via Deepgram and then summarized or analyzed by an LLM.
  • PDF documents — PDFs are parsed, chunked, stored in a vector database, and queried with natural-language questions (RAG).
  • Images — Photos and screenshots are sent to a vision-capable LLM (GPT-4o or Claude) for description and data extraction.

A single webhook receives any of these inputs, detects the media type automatically, and routes the request to the correct processing pipeline. The final product is exposed as a Telegram or Slack bot so your team can interact with it from a phone or desktop.

By the end you will have four working n8n workflows and a unified router that ties them together.


Prerequisites

| Requirement | Purpose | Free Tier Available |
| --- | --- | --- |
| n8n (self-hosted or cloud) | Workflow automation engine | Yes (community edition) |
| Deepgram API key | Speech-to-text transcription | Yes (free credits on signup) |
| OpenAI API key | GPT-4o for chat, embeddings, and vision | No (pay-as-you-go) |
| Claude API key (alternative) | Claude for chat and vision | No (pay-as-you-go) |
| Pinecone API key (optional) | Vector storage for RAG | Yes (free starter index) |
| Telegram Bot Token or Slack App | Chat frontend | Yes |
| Docker & Docker Compose | Running n8n locally | Yes |

You can substitute OpenAI with Claude (or vice versa) for any LLM step. The workflow structure remains the same — only the HTTP Request body changes.


Architecture Overview

The overall system follows a hub-and-spoke pattern:

User (Telegram / Slack / API)
              │
              ▼
      ┌───────────────┐
      │    Webhook    │  ← Unified entry point
      │    Router     │
      └───┬───┬───┬───┘
          │   │   │
    ┌─────┘   │   └─────┐
    ▼         ▼         ▼
  Voice      PDF      Image
Pipeline  Pipeline  Pipeline
    │         │         │
    └─────────┼─────────┘
              ▼
     Response back to user

Each pipeline is an independent n8n workflow that can also run standalone. The router inspects the incoming MIME type (or file extension) and triggers the appropriate sub-workflow with Execute Workflow nodes.
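To make the routing concrete, here is roughly the decision the Switch node in Step 5 performs, written as a hypothetical Code-node sketch (data is the binary property name used throughout this guide):

// Hypothetical Code-node sketch of the routing decision the Switch node
// in Step 5 performs visually. "data" is the binary property name used
// throughout this guide.
const item = $input.first();
const mimeType = item.binary?.data?.mimeType ?? 'application/json';

let route;
if (mimeType.startsWith('audio/')) {
  route = 'voice';
} else if (mimeType === 'application/pdf') {
  route = 'pdf';
} else if (mimeType.startsWith('image/')) {
  route = 'image';
} else {
  route = 'text'; // plain JSON bodies fall through to the question pipeline
}

return [{ json: { ...item.json, route } }];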


Step 1: Set Up n8n with API Credentials

1.1 Launch n8n with Docker Compose

Create a docker-compose.yml in a new project folder:

version: "3.8"
services:
  n8n:
    image: n8nio/n8n:latest
    restart: always
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=changeme
      - WEBHOOK_URL=https://your-domain.com/
    volumes:
      - n8n_data:/home/node/.n8n
volumes:
  n8n_data:

Start the stack:

docker compose up -d

Open http://localhost:5678 and log in. Note that on recent n8n versions (v1 and later), the basic-auth variables above are ignored in favor of built-in user management, so you will be prompted to create an owner account on first launch instead.

1.2 Add Credentials

Navigate to Settings → Credentials and create the following entries:

  1. Deepgram API — Type: Header Auth. Name: Authorization, Value: Token YOUR_DEEPGRAM_KEY.
  2. OpenAI API — Use the built-in OpenAI credential type. Paste your API key.
  3. Pinecone API — Type: Header Auth. Name: Api-Key, Value: YOUR_PINECONE_KEY.
  4. Telegram — Use the built-in Telegram credential. Paste the bot token from BotFather.

Tip: If you prefer Claude, create a Header Auth credential with the name x-api-key and your Anthropic key as the value. Anthropic also requires an anthropic-version header set to 2023-06-01, which you can add directly on each HTTP Request node.
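For reference, these are the two headers every Anthropic request in this guide carries (values are placeholders):

{
  "x-api-key": "YOUR_ANTHROPIC_KEY",
  "anthropic-version": "2023-06-01"
}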

Expected Result

You have n8n running on port 5678 with four credential entries ready. No workflows yet — we will build those next.


Step 2: Build the Voice Processing Pipeline

This pipeline accepts an audio file, sends it to Deepgram for transcription, and passes the transcript to an LLM for summarization.

2.1 Create a New Workflow

Name it Voice Pipeline. Add the following nodes in order:

2.2 Webhook Node (Receive Audio)

| Setting | Value |
| --- | --- |
| HTTP Method | POST |
| Path | /voice |
| Response Mode | Last Node |
| Binary Property | data |

This node will accept multipart/form-data uploads containing an audio file.

2.3 HTTP Request Node (Deepgram Transcription)

| Setting | Value |
| --- | --- |
| Method | POST |
| URL | https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true |
| Authentication | Predefined Credential → Deepgram API |
| Send Body | Binary |
| Input Data Field Name | data |
| Content Type | Auto-detect |

Deepgram returns JSON with the transcript inside results.channels[0].alternatives[0].transcript.
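For orientation, a trimmed sketch of that response; the real payload also includes metadata and per-word timings, but only these fields are used below:

{
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "Good morning everyone, here are the updates from the team...",
            "confidence": 0.98
          }
        ]
      }
    ]
  }
}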

2.4 Set Node (Extract Transcript)

Add a Set node to pull out the transcript:

| Setting | Value |
| --- | --- |
| Name | transcript |
| Value | {{ $json.results.channels[0].alternatives[0].transcript }} |

2.5 OpenAI Node (Summarize)

Use the built-in OpenAI node (or an HTTP Request node for Claude):

| Setting | Value |
| --- | --- |
| Resource | Chat Message |
| Model | gpt-4o |
| System Message | You are a concise assistant. Summarize the following voice transcript in 3-5 bullet points. |
| User Message | {{ $json.transcript }} |

If you prefer Claude, use an HTTP Request node:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": "Summarize this transcript in 3-5 bullet points:\n\n{{ $json.transcript }}"
    }
  ]
}

POST to https://api.anthropic.com/v1/messages with your Anthropic credential.
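One caveat: templating {{ $json.transcript }} straight into a JSON body breaks as soon as the transcript contains a double quote or a newline. A safer pattern is to build the body in a Code node, where JSON.stringify handles the escaping. A sketch (the field name requestBody is an assumption):

// Hypothetical Code node placed before the Claude HTTP Request node.
// JSON.stringify escapes quotes and newlines that would otherwise break
// a hand-templated JSON body.
const transcript = $input.first().json.transcript;

const body = {
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: `Summarize this transcript in 3-5 bullet points:\n\n${transcript}`,
    },
  ],
};

return [{ json: { requestBody: JSON.stringify(body) } }];

Then set the HTTP Request node's raw body to {{ $json.requestBody }} instead of templating the transcript directly.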

2.6 Respond to Webhook Node

Return the summary as JSON. The expressions below assume the OpenAI node; with Claude, the reply text lives at content[0].text rather than message.content:

{
  "transcript": "{{ $node['Set'].json.transcript }}",
  "summary": "{{ $json.message.content }}"
}

Expected Result

Send an audio file via curl:

curl -X POST https://your-domain.com/webhook/voice \
  -F "data=@meeting-notes.mp3"

You receive a JSON response containing the raw transcript and a bullet-point summary. Deepgram typically transcribes a short clip in a couple of seconds, so the total round-trip is usually under five seconds for recordings shorter than five minutes.


Step 3: Build the PDF Analysis Pipeline (RAG)

This pipeline extracts text from a PDF, chunks it, stores embeddings in Pinecone, and answers questions about the document.

3.1 Workflow A — Ingest PDF

Create a workflow called PDF Ingest.

Webhook Node — Same pattern as Step 2 (POST /pdf-ingest, binary data).

Extract from File Node — Use the built-in Extract from File node. Set Operation to PDF and point it at the binary property data. This outputs the full text of the PDF.

Text Splitter (Code Node) — Add a Code node to chunk the text:

// Full PDF text from the Extract from File node
const text = $input.first().json.text;
const chunkSize = 800; // characters per chunk
const overlap = 100;   // characters shared between consecutive chunks
const chunks = [];

// Step forward by (chunkSize - overlap) so each chunk overlaps the previous one
for (let i = 0; i < text.length; i += chunkSize - overlap) {
  chunks.push({
    json: {
      chunk: text.slice(i, i + chunkSize),
      index: chunks.length,
    },
  });
}

return chunks;

This produces an array of items, each containing an 800-character chunk with 100-character overlap. For example, a 2,000-character document yields three chunks starting at offsets 0, 700, and 1,400, with each consecutive pair sharing 100 characters.

OpenAI Embeddings Node — For each chunk, generate an embedding:

| Setting | Value |
| --- | --- |
| Resource | Embedding |
| Model | text-embedding-3-small |
| Input | {{ $json.chunk }} |

HTTP Request Node (Pinecone Upsert) — Send each embedding to Pinecone:

{
  "vectors": [
    {
      "id": "chunk-{{ $json.index }}",
      "values": {{ $json.embedding }},
      "metadata": {
        "text": "{{ $json.chunk }}"
      }
    }
  ]
}

POST to https://YOUR_INDEX-YOUR_PROJECT.svc.YOUR_ENV.pinecone.io/vectors/upsert with the Pinecone credential. Be aware that templating raw chunk text into a JSON string breaks as soon as a chunk contains a double quote or newline; building the payload in a Code node, as in the sketch below, escapes it safely.
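Sending one HTTP request per chunk works, but a Code node can also aggregate every chunk/embedding pair into a single upsert payload, which both avoids the escaping problem and cuts request volume. A sketch, assuming all embeddings fit comfortably in one request (split into batches of, say, 100 vectors for very large documents):

// Hypothetical Code node that collects every chunk/embedding item
// into one Pinecone upsert payload instead of one request per chunk.
const vectors = $input.all().map((item) => ({
  id: `chunk-${item.json.index}`,
  values: item.json.embedding,
  metadata: { text: item.json.chunk },
}));

return [{ json: { vectors } }];

The HTTP Request node can then send this item's JSON directly as the request body.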

3.2 Workflow B — Query PDF

Create a second workflow called PDF Query.

Webhook Node — POST /pdf-query, expects a JSON body { "question": "..." }.

OpenAI Embeddings Node — Embed the user’s question using the same model (text-embedding-3-small).

HTTP Request Node (Pinecone Query) — Query for the top 5 nearest chunks:

{
  "vector": {{ $json.embedding }},
  "topK": 5,
  "includeMetadata": true
}

POST to the Pinecone /query endpoint.
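The shape of the query response is roughly the following (illustrative values); the matches array is what the next node maps over:

{
  "matches": [
    {
      "id": "chunk-12",
      "score": 0.89,
      "metadata": {
        "text": "Total revenue for the third quarter was..."
      }
    }
  ],
  "namespace": ""
}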

Set Node — Concatenate the returned chunks into a context string:

{{ $json.matches.map(m => m.metadata.text).join("\n\n---\n\n") }}

OpenAI Chat Node — Pass the context and question to the LLM:

| Setting | Value |
| --- | --- |
| System Message | Answer the user's question based ONLY on the provided context. If the answer is not in the context, say so. |
| User Message | Context:\n{{ $json.context }}\n\nQuestion: {{ $node['Webhook'].json.body.question }} |

Respond to Webhook — Return the answer, for example as { "answer": "{{ $json.message.content }}" }.

Expected Result

Ingest a PDF:

curl -X POST https://your-domain.com/webhook/pdf-ingest \
  -F "data=@annual-report.pdf"

Then query it:

curl -X POST https://your-domain.com/webhook/pdf-query \
  -H "Content-Type: application/json" \
  -d '{"question": "What was the total revenue in Q3?"}'

You receive an answer grounded in the actual PDF content; the system prompt instructs the model to say so rather than guess when the context does not contain the answer.


Step 4: Build the Vision Pipeline

This pipeline accepts an image and sends it to a vision-capable LLM for analysis.

4.1 Create the Workflow

Name it Vision Pipeline. Add a Webhook node (POST /vision, binary data).

4.2 Convert Image to Base64 (Code Node)

// Read the uploaded image from the binary property "data"
const binaryData = await this.helpers.getBinaryDataBuffer(0, 'data');
const base64 = binaryData.toString('base64');
// Keep the MIME type so the data URL / media_type field below is correct
const mimeType = $input.first().binary.data.mimeType;

return [{
  json: {
    base64Image: base64,
    mimeType: mimeType,
  },
}];

4.3 HTTP Request Node (Vision API)

For GPT-4o:

POST to https://api.openai.com/v1/chat/completions:

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image in detail. Extract any text, data, or notable visual elements."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:{{ $json.mimeType }};base64,{{ $json.base64Image }}"
          }
        }
      ]
    }
  ],
  "max_tokens": 1024
}

For Claude:

POST to https://api.anthropic.com/v1/messages:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "{{ $json.mimeType }}",
            "data": "{{ $json.base64Image }}"
          }
        },
        {
          "type": "text",
          "text": "Describe this image in detail. Extract any text, data, or notable visual elements."
        }
      ]
    }
  ]
}

4.4 Respond to Webhook

Return the description. The expression below matches the OpenAI response shape; for Claude, use {{ $json.content[0].text }} instead:

{
  "description": "{{ $json.choices[0].message.content }}"
}

Expected Result

curl -X POST https://your-domain.com/webhook/vision \
  -F "data=@screenshot.png"

The response contains a detailed natural-language description of the image, including any legible text the model can read and identification of charts, diagrams, or UI elements.


Step 5: Create the Unified Router

Now we wire everything together so a single endpoint can handle any media type.

5.1 Create the Router Workflow

Name it Multi-modal Router.

Webhook Node — POST /agent, binary data, Response Mode: Last Node.

5.2 Switch Node (Detect Media Type)

Add a Switch node with the following routing rules based on {{ $binary.data.mimeType }}:

| Rule | Condition | Output |
| --- | --- | --- |
| Voice | Starts with audio/ | Execute Workflow: Voice Pipeline |
| PDF | Equals application/pdf | Execute Workflow: PDF Ingest, then PDF Query |
| Image | Starts with image/ | Execute Workflow: Vision Pipeline |
| Text/JSON | Equals application/json | Execute Workflow: PDF Query (question only) |

5.3 Execute Workflow Nodes

For each output of the Switch node, add an Execute Workflow node pointing to the corresponding sub-workflow. Pass through all data (binary and JSON).

5.4 Merge Node

Add a Merge node (mode: Merge By Index) that collects outputs from all branches and feeds into a single Respond to Webhook node.
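Because each branch returns a different key (summary, answer, or description), it helps to normalize the output before responding. A minimal Code-node sketch, assuming those three field names:

// Hypothetical Code node between Merge and Respond to Webhook.
// Collapses the branch-specific keys into one response field.
const item = $input.first().json;

const result =
  item.summary ?? item.answer ?? item.description ?? 'No output produced.';

return [{ json: { result } }];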

Expected Result

A single endpoint now handles all input types:

# Voice
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@recording.wav"

# Image
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@photo.jpg"

# PDF
curl -X POST https://your-domain.com/webhook/agent \
  -F "data=@document.pdf"

# Text question (for querying previously ingested PDFs)
curl -X POST https://your-domain.com/webhook/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize the key findings."}'

Each request is automatically routed to the correct pipeline and the response is returned through the same webhook.


Step 6: Add a Telegram or Slack Bot Frontend

6.1 Telegram Bot Setup

Replace the Webhook trigger in the Router workflow with a Telegram Trigger node:

| Setting | Value |
| --- | --- |
| Credential | Your Telegram Bot credential |
| Updates | Message |

Telegram messages can contain text, voice notes, photos, or documents. Add a Code node after the trigger to normalize the input:

const msg = $input.first().json.message;

if (msg.voice) {
  // Voice note — the file itself is downloaded later via the getFile API
  const fileId = msg.voice.file_id;
  return [{ json: { type: 'voice', fileId } }];
} else if (msg.document && msg.document.mime_type === 'application/pdf') {
  return [{ json: { type: 'pdf', fileId: msg.document.file_id } }];
} else if (msg.photo) {
  // Telegram sends several resolutions; the last array entry is the largest
  const fileId = msg.photo[msg.photo.length - 1].file_id;
  return [{ json: { type: 'image', fileId } }];
} else if (msg.text) {
  return [{ json: { type: 'text', question: msg.text } }];
}

// Fallback so the Code node always returns an item
return [{ json: { type: 'unsupported' } }];

After classification, use an HTTP Request node to download the file from Telegram’s getFile API, then pass it to the appropriate sub-workflow via the same Switch logic from Step 5.
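In practice that download is two calls: getFile resolves a file_path, then a second request fetches the bytes. Two HTTP Request nodes work fine; as a single hypothetical Code-node sketch (assuming this.helpers.httpRequest and prepareBinaryData are available in your n8n version, with BOT_TOKEN as a placeholder):

// Step 1: resolve the internal file path for the file_id
const token = 'BOT_TOKEN';
const fileId = $input.first().json.fileId;

const meta = await this.helpers.httpRequest({
  method: 'GET',
  url: `https://api.telegram.org/bot${token}/getFile`,
  qs: { file_id: fileId },
});

// Step 2: download the bytes and attach them as the binary property "data"
const file = await this.helpers.httpRequest({
  method: 'GET',
  url: `https://api.telegram.org/file/bot${token}/${meta.result.file_path}`,
  encoding: 'arraybuffer',
});

return [{
  json: { filePath: meta.result.file_path },
  binary: {
    data: await this.helpers.prepareBinaryData(Buffer.from(file), meta.result.file_path),
  },
}];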

At the end of each branch, add a Telegram node (Send Message) to reply in the same chat:

| Setting | Value |
| --- | --- |
| Chat ID | {{ $node['Telegram Trigger'].json.message.chat.id }} |
| Text | {{ $json.summary || $json.answer || $json.description }} |

6.2 Slack Alternative

For Slack, use the Slack Trigger node listening for message events in a specific channel. The flow is identical — detect the file type from event.files[0].mimetype, download the file (Slack requires your bot token in an Authorization: Bearer header when fetching url_private URLs), process it, and reply using a Slack node (Post Message).

Expected Result

Send a voice note, a photo, or a PDF to your Telegram bot (or Slack channel). Within seconds, the bot replies with a summary, a description, or an answer. Team members can interact with the agent from their phones without any technical knowledge.


Frequently Asked Questions

Q: What audio formats does Deepgram support? Deepgram accepts MP3, WAV, OGG, FLAC, M4A, and WebM. Telegram voice notes are OGG by default, which works without conversion.

Q: How large can uploaded PDFs be? n8n’s default binary data limit is 16 MB. For larger files, set the environment variable N8N_DEFAULT_BINARY_DATA_MODE=filesystem in your Docker configuration to stream files to disk instead of holding them in memory.

Q: Can I use a free vector database instead of Pinecone? Yes. You can substitute Pinecone with Qdrant (self-hosted via Docker), Chroma, or Milvus. The HTTP Request node configuration changes to match the alternative API, but the workflow structure stays the same.

Q: What if I want to use local models instead of OpenAI or Claude? You can point the HTTP Request nodes at any OpenAI-compatible API. For example, run Ollama locally and set the URL to http://localhost:11434/v1/chat/completions. Note that vision and embedding capabilities vary by model.
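For example, a chat request to a local Ollama server uses the same body shape as the OpenAI calls above; only the URL and the model name change (use whatever ollama list shows on your machine):

{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Summarize this transcript in 3-5 bullet points: ..."
    }
  ]
}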

Q: How do I handle errors in production? Add an Error Trigger workflow in n8n that catches failures and sends notifications (email, Slack, or Telegram). Within each pipeline, add IF nodes after HTTP Request nodes to check for non-200 status codes before proceeding.


Next Steps

With the multi-modal agent running, here are some directions to explore:

  • Add memory — Store conversation history in a database (Postgres or Redis) so the agent can reference previous interactions.
  • Expand input types — Add support for video files by extracting keyframes and audio tracks separately, then processing each through the existing pipelines.
  • Build an approval workflow — For high-stakes actions (e.g., updating a CRM or sending a report), insert a human-in-the-loop approval step before execution.
  • Monitor usage — Track token consumption, latency, and error rates per pipeline using n8n’s built-in execution log or an external dashboard like Grafana.
  • Fine-tune prompts — Customize the system messages in each LLM node for your domain. A legal team needs different summarization patterns than a marketing team.

The modular design means you can upgrade any single pipeline — swap Deepgram for Whisper, replace Pinecone with Qdrant, or switch from GPT-4o to Claude — without touching the rest of the system.

Last Updated: 3/10/2026
#n8n #Multi-modal #AI Agents #Deepgram