I recently signed up for the AI Partner Catalyst: Accelerate Innovation hackathon — an initiative focused on accelerating innovation through the Google Cloud partner ecosystem.
For this submission, I built VoiceDoc Agent: a voice‑native document intelligence system that transforms static text documents into spoken, conversational experiences. The system is powered by Gemini on Vertex AI, ElevenLabs Speech‑to‑Text and Text‑to‑Speech, and Datadog observability, demonstrating how partner AI services can be composed into a cohesive, production‑grade architecture.
VoiceDoc Agent allows users to upload text‑based documents and interact with them entirely through speech. Beyond basic Q&A, the agent supports narration, expressive voice output, persona‑aware responses, and end‑to‑end observability — all deployed serverlessly on Google Cloud.
This project was created specifically for AI Partner Catalyst: Accelerate Innovation, showcasing how Google Cloud partners can collaborate to deliver voice‑first AI systems with real‑time insight into performance, cost, and user experience.
Setup
Firestore
Firestore is used as a serverless document database for ingestion, deduplication, and retrieval‑augmented generation (RAG).
Create a new Firestore database named voicedoc-fs. Two collections are required:
- documents — Used for smart ingestion and deduplication. Before processing a file, the system checks whether its content hash already exists.
- document_chunks" — Stores chunked document text and embeddings used for retrieval.
Service Account
VoiceDoc Agent follows the principle of least privilege. Only the following IAM roles are required:
- Vertex AI User — Gemini API calls, embeddings generation, and document classification
- Cloud Datastore User — Firestore read/write access for document chunks and metadata
Datadog Dashboard
To visualize system‑wide metrics:
- Navigate to Dashboards → New Dashboard in Datadog
- Click Configure → Import dashboard JSON
- Paste the contents of
setup/datadog-dashboard.json
Datadog Monitors
To enable proactive alerting:
- Go to Monitoring → New Monitor
- Select Import From JSON
- Import each monitor defined in
setup/datadog-monitors.json
ElevenLabs API Key
Configure the ElevenLabs API key with least‑privilege access:
- ✅ Text to Speech → Access
- ✅ Voices → Read
- ✅ Speech to Text → Access
- ❌ Everything else → Disabled
Code Walkthrough
1. Gemini AI Integration
The Gemini integration lives in src/lib/vertex.ts and is structured as a three‑stage pipeline:
- Retrieve the raw LLM response
- Optionally enrich the response with emotion tags (Expressive Mode)
- Stream the final response back to the client
This design keeps narration, expressive speech, and observability concerns cleanly separated.
1.1 Get the raw response from Gemini
The first call retrieves a strictly grounded response from Gemini, instrumented with Datadog tracing and token‑level metrics.
// ============================================================================
// CALL 1: Get the raw response from Gemini (WITH METRICS)
// ============================================================================
async function getGeminiRawResponse(
history: any[],
query: string,
context: string,
isNarrationRequest: boolean,
voiceMode: 'standard' | 'expressive' = 'standard',
trafficType: string = 'user'
): Promise<string> {
const startTime = Date.now();
return tracer.trace('gemini.request', { resource: 'getGeminiRawResponse' }, async (span) => {
span?.setTag('voice_mode', voiceMode);
console.log('[Call1:getGeminiRawResponse] Getting raw response from Gemini');
const model: GenerativeModel = getVertexAI().getGenerativeModel({
model: modelName,
generationConfig: {
maxOutputTokens: 8192,
temperature: 0.7,
}
});
const validHistory: Content[] = history.map((h: any) => ({
role: h.role,
parts: h.parts || [{ text: h.text }]
}));
let systemPrompt: string;
let textToRead = context;
if (isNarrationRequest) {
const extracted = extractSection(context, query);
if (extracted) {
textToRead = extracted;
console.log('[Call1] Extracted chapter/section');
}
systemPrompt = `You are a professional audiobook narrator. Read the following text aloud EXACTLY as written, word-for-word.
CRITICAL RULES:
1. Do NOT summarize.
2. Do NOT paraphrase.
3. Do NOT skip any words.
4. Output the spoken text ONLY.
5. Do NOT include narration notes, stage directions, or descriptions of tone (e.g., (softly), (muttering), (friendly)).
6. Do NOT include speaker labels.
TEXT TO READ:
${textToRead}`;
} else {
systemPrompt = `You are a voice assistant answering questions about a document. Answer based STRICTLY on the provided context. Do NOT use training data to fill gaps.
CONTEXT:
${context}`;
}
span?.setTag('llm.model', modelName);
span?.setTag('llm.is_narration', isNarrationRequest);
const chat = model.startChat({
history: validHistory,
systemInstruction: {
role: 'system',
parts: [{ text: systemPrompt }]
}
});
const message = isNarrationRequest && extractSection(context, query)
? `Please begin reading.`
: query;
try {
const result = await chat.sendMessage(message);
const response = await result.response;
const rawText = response.candidates?.[0].content.parts[0].text || '';
const usage = response.usageMetadata;
if (usage && usage.promptTokenCount && usage.candidatesTokenCount) {
span?.setTag('llm.prompt_tokens', usage.promptTokenCount);
span?.setTag('llm.completion_tokens', usage.candidatesTokenCount);
span?.setTag('llm.total_tokens', usage.totalTokenCount);
// 📊 RECORD METRICS
MetricsCollector.recordTokens(
usage.promptTokenCount,
usage.candidatesTokenCount,
voiceMode,
trafficType
);
MetricsCollector.recordLLMCost(
usage.promptTokenCount,
usage.candidatesTokenCount,
voiceMode,
trafficType
);
}
console.log('[Call1] Got raw response, length:', rawText.length);
return rawText;
} catch (error: any) {
span?.setTag('error', true);
span?.setTag('error.message', error.message);
// 📊 RECORD ERROR METRIC
MetricsCollector.recordError('call1_error', voiceMode, trafficType);
throw error;
}
});
}
Key characteristics:
- Supports both narration and Q&A modes
- Enforces strict grounding against provided document context
- Records token usage and estimated cost
- Emits latency and error metrics for observability
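The function above assumes a few module-level helpers in src/lib (getVertexAI, modelName, tracer), which are not shown in the walkthrough. Here is a minimal sketch of what that setup might look like; the model name, environment variable names, and lazy-initialization pattern are assumptions, and MetricsCollector is sketched later in the Datadog section.
import { VertexAI } from '@google-cloud/vertexai';
import tracer from 'dd-trace';

// Assumed defaults; the real project reads these from environment variables.
const project = process.env.GOOGLE_CLOUD_PROJECT || 'my-gcp-project';
const location = process.env.VERTEX_LOCATION || 'us-central1';
const modelName = process.env.GEMINI_MODEL || 'gemini-1.5-flash';

let vertexAI: VertexAI | null = null;

// Lazy initialization keeps Cloud Run cold starts cheap and avoids creating the
// client at build time. tracer.init() is assumed to run once at process startup
// (for example via a Next.js instrumentation hook).
function getVertexAI(): VertexAI {
  if (!vertexAI) {
    vertexAI = new VertexAI({ project, location });
  }
  return vertexAI;
}

export { getVertexAI, modelName, tracer };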
1.2 Add emotion tags using few-shot learning
When Expressive Mode is enabled, a second Gemini call augments the text with ElevenLabs V3 emotion tags using few‑shot examples.
// ============================================================================
// CALL 2: Add emotion tags using few-shot learning (WITH METRICS)
// ============================================================================
async function addEmotionTagsWithFewShot(
rawText: string,
persona: string,
contextHint?: string,
voiceMode: 'standard' | 'expressive' = 'expressive',
trafficType: string = 'user'
): Promise<string> {
return tracer.trace('gemini.request', { resource: 'addEmotionTagsWithFewShot' }, async (span) => {
span?.setTag('voice_mode', voiceMode);
console.log('[Call2:addEmotionTagsWithFewShot] Starting with text length:', rawText.length);
const model: GenerativeModel = getVertexAI().getGenerativeModel({
model: modelName,
generationConfig: {
temperature: 0.2,
topP: 0.7,
}
});
const fewShotExamples = getFewShotExamples(persona);
const history: Content[] = [];
for (const example of fewShotExamples) {
history.push({
role: 'user',
parts: [{ text: example.userInput }]
});
history.push({
role: 'model', // Vertex AI chat history accepts 'user' and 'model' roles
parts: [{ text: example.expectedOutput }]
});
}
const systemPrompt = `SYSTEM: You add emotion tags to text ONLY. Nothing else.
TAGS ONLY: [excited] [nervous] [frustrated] [sorrowful] [calm] [sigh] [laughs] [gulps] [gasps] [whispers] [pauses] [hesitates] [wearily] [warmly] [playfully] [stunned] [intrigued] [reflectively] [passionately] [poetically]
YOUR JOB:
1. Take input text
2. Add [tag] BEFORE words - no parentheses, no narration notes
3. Keep EVERY word exactly the same
4. Return ONLY the tagged text
NEVER add: (notes), descriptions, or stage directions.
ONLY add: [emotion_tag] tags in square brackets.`;
span?.setTag('llm.model', modelName);
span?.setTag('llm.persona', persona);
const chat = model.startChat({
history,
systemInstruction: {
role: 'system',
parts: [{ text: systemPrompt }]
}
});
const userMessage = `Add ONLY emotion tags [like this] to this text. Do NOT add parentheses or narration notes. PRESERVE ALL WORDS:
${rawText}`;
console.log('[Call2] Sending to Gemini...');
try {
const result = await chat.sendMessage(userMessage);
const response = await result.response;
let taggedText = response.candidates?.[0].content.parts[0].text || rawText;
const usage = response.usageMetadata;
if (usage && usage.promptTokenCount && usage.candidatesTokenCount) {
span?.setTag('llm.prompt_tokens', usage.promptTokenCount);
span?.setTag('llm.completion_tokens', usage.candidatesTokenCount);
span?.setTag('llm.total_tokens', usage.totalTokenCount);
// 📊 RECORD METRICS
MetricsCollector.recordTokens(
usage.promptTokenCount,
usage.candidatesTokenCount,
voiceMode,
trafficType
);
MetricsCollector.recordLLMCost(
usage.promptTokenCount,
usage.candidatesTokenCount,
voiceMode,
trafficType
);
}
taggedText = taggedText.replace(/\([^)]*?\)/g, '').trim();
console.log('[Call2] After cleanup, length:', taggedText.length);
return taggedText;
} catch (error: any) {
span?.setTag('error', true);
span?.setTag('error.message', error.message);
// 📊 RECORD ERROR METRIC
MetricsCollector.recordError('call2_error', voiceMode, trafficType);
throw error;
}
});
}
Design goals:
- Preserve every word of the original text
- Insert emotion tags only in square‑bracket format
- Prevent narration notes or paraphrasing
- Keep expressive behavior deterministic and auditable
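getFewShotExamples supplies the persona-specific example pairs used as chat history. The exact examples shipped with VoiceDoc Agent are not shown in this post; the sketch below only illustrates the assumed shape, with made-up sample pairs.
interface FewShotExample {
  userInput: string;
  expectedOutput: string;
}

// Hypothetical examples; the real project keeps a richer set per persona.
const EXAMPLES: Record<string, FewShotExample[]> = {
  narrative: [
    {
      userInput: 'The door creaked open and she stepped inside.',
      expectedOutput: '[whispers] The door creaked open [pauses] and she stepped inside.',
    },
  ],
  technical: [
    {
      userInput: 'The deployment finished without errors.',
      expectedOutput: '[calm] The deployment finished without errors.',
    },
  ],
};

export function getFewShotExamples(persona: string): FewShotExample[] {
  return EXAMPLES[persona] ?? EXAMPLES['narrative'];
}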
1.3 Stream response from LLM
The streaming function orchestrates the two‑call pipeline and yields the final text for downstream TTS processing.
// ============================================================================
// STREAMING RESPONSE (WITH METRICS)
// ============================================================================
export async function* getGeminiStream(
history: Content[],
query: string,
context: string,
isNarrationRequest: boolean = false,
expressiveMode: boolean = false,
persona: string = 'narrative',
trafficType: string = 'user'
) {
const startTime = Date.now();
console.log('[getGeminiStream] 🚀 Starting two-call process', { expressiveMode, persona, trafficType });
try {
console.log('[getGeminiStream] Is narration request:', isNarrationRequest);
// CALL 1: Get raw response
console.log('[getGeminiStream] ========== CALL 1: RAW RESPONSE ==========');
const voiceMode = expressiveMode ? 'expressive' : 'standard';
const rawResponse = await getGeminiRawResponse(history, query, context, isNarrationRequest, voiceMode, trafficType);
console.log('[getGeminiStream] CALL 1 OUTPUT (length:', rawResponse.length, ')');
if (!rawResponse.trim()) {
yield 'No response generated.';
// 📊 RECORD METRICS
const duration = Date.now() - startTime;
return;
}
// CALL 2: Add emotion tags if expressive mode
let finalResponse: string;
if (expressiveMode) {
console.log('[getGeminiStream] ========== CALL 2: ADD EMOTION TAGS ==========');
finalResponse = await addEmotionTagsWithFewShot(rawResponse, persona, undefined, voiceMode, trafficType);
console.log('[getGeminiStream] CALL 2 OUTPUT (length:', finalResponse.length, ')');
} else {
// Cleanup any hallucinations like (narrative tone) or [pauses] that might leak in Standard mode
finalResponse = rawResponse.replace(/\([^)]*?\)/g, '').replace(/\[[^\]]*?\]/g, '').trim();
console.log('[getGeminiStream] Standard mode - cleanup applied and Call 2 skipped');
}
console.log('[getGeminiStream] ✅ Complete, yielding final response');
yield finalResponse;
} catch (error) {
console.error('[getGeminiStream] ❌ Fatal error:', error);
// 📊 RECORD ERROR
MetricsCollector.recordError('fatal_error', expressiveMode ? 'expressive' : 'standard', trafficType);
if (error instanceof Error) {
console.error('[getGeminiStream] Message:', error.message);
console.error('[getGeminiStream] Stack:', error.stack);
}
yield `\n\nERROR: ${error instanceof Error ? error.message : 'Unknown error'}\n`;
throw error;
}
}
Highlights:
- Expressive Mode triggers a controlled second LLM call
- Standard Mode performs defensive cleanup of stray tags
- All errors and latency are captured via Datadog metrics
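A route handler can consume this generator and forward the text to the browser as a streamed response. The following is a hedged sketch, assuming a '@/lib/vertex' import alias and a simple JSON request body; the project's actual chat route may differ in shape and parameter handling.
import { NextRequest } from 'next/server';
import { getGeminiStream } from '@/lib/vertex'; // assumed import path

export async function POST(req: NextRequest) {
  const { history, query, context, isNarration, expressiveMode, persona } = await req.json();

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        // Forward each yielded chunk (the final text in this pipeline) to the client.
        for await (const chunk of getGeminiStream(
          history ?? [], query, context, isNarration ?? false,
          expressiveMode ?? false, persona ?? 'narrative'
        )) {
          controller.enqueue(encoder.encode(chunk));
        }
      } catch (err) {
        // getGeminiStream already yields a readable error message before throwing.
        console.error('[chat route] stream error:', err);
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}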
2. ElevenLabs API
ElevenLabs is used for both speech transcription and voice generation, with dynamic model selection based on document persona and expressive intent.
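Both routes below call a shared getClient() helper. A minimal sketch of that helper using the ElevenLabs JavaScript SDK is shown here; the package name, environment variable, and lazy-singleton pattern are assumptions rather than the project's exact code.
import { ElevenLabsClient } from 'elevenlabs';

let client: ElevenLabsClient | null = null;

// Lazy singleton so the API key is only read (and validated) at request time.
export function getClient(): ElevenLabsClient {
  if (!client) {
    const apiKey = process.env.ELEVENLABS_API_KEY;
    if (!apiKey) {
      throw new Error('ELEVENLABS_API_KEY is not configured');
    }
    client = new ElevenLabsClient({ apiKey });
  }
  return client;
}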
2.1 Transcribe API
Speech‑to‑Text is implemented in src/app/api/transcribe/route.ts:
export async function POST(req: NextRequest) {
try {
console.log('=== Transcribe API Called ===');
const formData = await req.formData();
const audioFile = formData.get('audio') as File;
if (!audioFile) {
console.error('No audio file provided');
return NextResponse.json(
{ error: 'No audio file provided' },
{ status: 400 }
);
}
console.log(`📦 Received audio file: ${audioFile.name}, size: ${audioFile.size} bytes, type: ${audioFile.type}`);
// Convert File to Blob
const arrayBuffer = await audioFile.arrayBuffer();
const audioBlob = new Blob([arrayBuffer], { type: audioFile.type || 'audio/webm' });
console.log('🎙️ Calling ElevenLabs speechToText.convert()...');
const transcription = await getClient().speechToText.convert({
file: audioBlob,
modelId: 'scribe_v1',
tagAudioEvents: true,
languageCode: 'eng',
diarize: true,
});
console.log('✅ Transcription successful:', JSON.stringify(transcription, null, 2));
// Handle different response structures
let transcribedText = '';
// Check various possible response structures
if ((transcription as any).text) {
transcribedText = (transcription as any).text;
} else if ((transcription as any).transcription) {
transcribedText = (transcription as any).transcription;
} else if ((transcription as any).result) {
transcribedText = (transcription as any).result;
} else if ((transcription as any).chunks && Array.isArray((transcription as any).chunks)) {
// If it's multichannel with chunks
transcribedText = (transcription as any).chunks
.map((chunk: any) => chunk.text || chunk.transcript || '')
.join(' ');
} else if (Array.isArray(transcription)) {
// If it's an array response
transcribedText = transcription
.map((item: any) => item.text || item.transcript || '')
.join(' ');
}
console.log('📝 Extracted text:', transcribedText);
return NextResponse.json({
text: transcribedText,
success: true,
transcription: transcription,
});
} catch (error) {
console.error('❌ Transcription error:', error);
// Log detailed error info
if (error instanceof Error) {
console.error('Error message:', error.message);
console.error('Error stack:', error.stack);
}
const errorMessage = error instanceof Error ? error.message : String(error);
return NextResponse.json(
{
error: 'Transcription failed',
details: errorMessage,
},
{ status: 500 }
);
}
}
The implementation:
- Accepts recorded audio uploaded as multipart form data
- Handles multiple response formats defensively
- Supports diarization and audio event tagging
- Emits structured logs for debugging and traceability
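On the client side, the recorded audio is posted to this route as multipart form data. The browser-side sketch below is illustrative: the 'audio' field name matches the route above, while the function name and file name are assumptions.
// Sends a recorded audio Blob to the transcription route and returns the text.
export async function transcribeRecording(audioBlob: Blob): Promise<string> {
  const formData = new FormData();
  formData.append('audio', audioBlob, 'recording.webm'); // field name expected by the route

  const res = await fetch('/api/transcribe', { method: 'POST', body: formData });
  if (!res.ok) {
    throw new Error(`Transcription failed with status ${res.status}`);
  }

  const data: { text: string; success: boolean } = await res.json();
  return data.text;
}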
2.2 Speak API
Text‑to‑Speech is implemented in src/app/api/speak/route.ts:
export async function POST(req: NextRequest) {
try {
const {
text,
persona,
expressiveMode = false, // 🔑 mirrors Gemini pipeline
}: {
text: string;
persona?: Persona;
expressiveMode?: boolean;
} = await req.json();
if (!text) {
return NextResponse.json({ error: 'Text is required' }, { status: 400 });
}
const voiceId = getVoiceIdForPersona(persona as Persona);
// 🎭 Model selection
const modelId = expressiveMode
? 'eleven_v3'
: 'eleven_flash_v2_5';
const inferredExpressive =
expressiveMode || /\[[^\]]+\]/.test(text);
const finalExpressiveMode = inferredExpressive;
// 🧹 Strip emotion tags in standard mode
const processedText = finalExpressiveMode
? text
: text.replace(/\[[^\]]*?\]/g, '').trim();
console.log(
`🎙️ TTS Request | Persona: ${persona || 'narrative'} | ` +
`Mode: ${expressiveMode ? 'expressive' : 'standard'} | ` +
`Model: ${modelId} | VoiceID: ${voiceId}`
);
const audio = await getClient().textToSpeech.convert(voiceId, {
text: processedText,
modelId,
outputFormat: 'mp3_44100_128',
});
return new NextResponse(audio as any, {
headers: {
'Content-Type': 'audio/mpeg',
},
});
} catch (error: any) {
console.error('❌ TTS Error:', {
message: error.message,
statusCode: error.statusCode,
body: error.body,
});
return NextResponse.json(
{ error: error.message || 'TTS failed', details: error.body || '' },
{ status: error.statusCode || 500 }
);
}
}
Notable behaviors:
- Automatically selects Eleven Flash v2.5 or Eleven v3
- Detects emotion tags to infer expressive output
- Strips tags when running in Standard Mode
- Streams MP3 audio optimized for low latency
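A matching client-side call fetches the MP3 and plays it back in the browser. This sketch mirrors the request body the route expects; the playback handling and function name are illustrative.
// Requests synthesized speech for a piece of text and plays it in the browser.
export async function speak(text: string, persona?: string, expressiveMode = false): Promise<void> {
  const res = await fetch('/api/speak', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, persona, expressiveMode }),
  });
  if (!res.ok) {
    throw new Error(`TTS failed with status ${res.status}`);
  }

  // The route returns audio/mpeg; wrap it in an object URL for playback.
  const blob = await res.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.onended = () => URL.revokeObjectURL(url);
  await audio.play();
}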
3. Datadog API
3.1 Core Concept: Agentless Observability
VoiceDoc Agent runs on Google Cloud Run without any Datadog Agent or sidecar containers. All telemetry is sent directly via Datadog’s HTTPS APIs, making it ideal for hackathons and serverless workloads.
3.2 Real User Monitoring (RUM) - Frontend
The frontend captures real user interactions, performance metrics, and session replays.
Implementation: src/components/DatadogInit.tsx:
datadogRum.init({
applicationId,
clientToken,
site,
service: 'voicedoc-agent',
sessionSampleRate: 100,
sessionReplaySampleRate: 20, // raise to 100 for development/testing
trackUserInteractions: true,
trackResources: true,
trackLongTasks: true,
})
3.3 Session Replay
Session Replay enables visual debugging by correlating UI behavior with backend traces.
// Start Session Replay Recording
datadogRum.startSessionReplayRecording();
3.4 Custom Metrics (Agentless) - Backend
Custom metrics are sent directly to Datadog without StatsD or agents.
// PATH 1: AGENTLESS (HTTPS API) - Best for Hackathons
if (DD_API_KEY) {
const response = await fetch(
`https://api.${DD_SITE}/api/v1/series?api_key=${DD_API_KEY}`,
{
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ series })
}
);
}
Key Metrics:
| Metric | Purpose | Voice-Specific Tag |
|---|---|---|
| voicedoc.request.latency_ms | Gemini response latency | voice_mode:* |
| voicedoc.llm.total_tokens | Token usage | voice_mode:* |
| voicedoc.llm.cost | Estimated cost | currency:usd |
| voicedoc.request.hits | Request count | is_narration:* |
| voicedoc.request.errors | Error tracking | error_type:* |
| voicedoc.llm.ttft | Time to First Token | voice_mode:* |
| voicedoc.persona.classified | Persona detection | persona_type:* |
4. Core UI Components
The main Next.js application lives in src/app/page.tsx and provides:
- Document Upload — Text ingestion and persona classification
- Persona Settings — Toggle Standard vs Expressive voice modes
- Mic Test Lab — Standalone microphone testing
- Real-time Metrics — Live visibility into latency and speech performance
- User Input — Speech or text‑based interaction
4.1 Persona Settings
Upon upload, src/app/api/upload/route.ts invokes Gemini to classify the document persona. Users can preview voices for:
- Professional Legal
- Financial Advisor
- Technical Expert
- Academic Scholar
- Storyteller
Users can also experiment with ElevenLabs V3 emotion tags directly from the UI.
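Persona classification can be done with a single constrained Gemini call at upload time. The prompt wording and label list below are illustrative assumptions, not the exact prompt used in src/app/api/upload/route.ts; the sketch reuses the getVertexAI() and modelName helpers shown earlier.
// Asks Gemini to pick one persona label for an uploaded document.
async function classifyPersona(documentText: string): Promise<string> {
  const model = getVertexAI().getGenerativeModel({
    model: modelName,
    generationConfig: { temperature: 0, maxOutputTokens: 16 },
  });

  const prompt = `Classify the following document into exactly one persona label:
legal, financial, technical, academic, narrative.
Respond with the label only.

DOCUMENT (truncated):
${documentText.slice(0, 4000)}`;

  const result = await model.generateContent(prompt);
  const label =
    result.response.candidates?.[0]?.content?.parts?.[0]?.text?.trim().toLowerCase() || 'narrative';
  return label;
}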
4.2 Mic Test Lab
The standalone Mic Test Lab allows users to validate microphone input and immediately hear synthesized output.
4.3 Real-time Metrics
The metrics sidebar provides transparent, user‑visible performance data, including:
- Perceivable Latency
- Total Round Trip Time
- Speech Duration and Confidence
- STT Latency and First Token
- LLM Latency and TTFT
- Token Usage
- TTS Latency and TTFB
Getting Started
To run VoiceDoc Agent locally, ensure that the following dependencies are provisioned and configured correctly:
- A Firestore database (voicedoc-fs) with the required collections
- A Google Cloud service account with Vertex AI and Firestore access
- Valid API keys for ElevenLabs and Datadog
All required environment variables are documented in env.example. Once configured, the local setup mirrors the production Cloud Run environment used for the hackathon.
Running Locally
Start the VoiceDoc Agent development server with:
npm run dev
In Action
To experience Expressive Mode, upload the sample document located at sample/sample_narrative_story.txt, then switch the persona mode to Expressive in the Persona Settings panel.
As a first interaction, try issuing the voice or text command:
“Read Chapter One”
This triggers the full two-stage Gemini pipeline — raw narration followed by emotion tagging — and streams the expressive audio response using ElevenLabs V3 voices.
ElevenLabs Narrative Voice
Below is a sample of the generated expressive narration audio:
Sample Datadog Dashboard:
The Datadog dashboard provides real-time visibility into latency, token usage, cost estimation, and voice-specific performance metrics across the entire request lifecycle.
Sample Datadog Monitor:
Monitors are configured to proactively detect anomalies such as LLM token burn rate spikes, elevated latency, or error surges — enabling early intervention before user experience degrades.
Sample Datadog Session Replay
Session Replay ties everything together by correlating frontend user behavior with backend LLM and voice performance, making it easier to debug and optimize real-world voice interactions.
Synthetic Traffic Generator
To validate system behavior under repeatable and controlled conditions, VoiceDoc Agent includes a synthetic traffic generator that simulates real user interactions across different personas, voice modes, and request types.
This tool is especially useful for:
- Load testing expressive vs standard voice paths
- Validating LLM latency, TTFT, and token usage
- Exercising Datadog monitors without real users
- Reproducing edge cases deterministically during development
Trigger the synthetic traffic generator with:
python scripts/traffic-generator.py
Synthetic traffic is explicitly tagged and can be filtered independently within Datadog dashboards and monitors, ensuring clear separation between test signals and real user behavior.
Conclusion
VoiceDoc Agent demonstrates how voice-first AI systems can be built with production-grade rigor, not just compelling demos. By combining Gemini on Vertex AI for reasoning, ElevenLabs for expressive and real-time voice, and Datadog for end-to-end observability, the project showcases a practical blueprint for deploying conversational, document-aware agents at scale.
More importantly, this project highlights how the Google Cloud partner ecosystem can be composed into a cohesive system where performance, cost, and user experience are first-class concerns. Features such as expressive narration, persona-aware responses, agentless observability, and synthetic traffic testing ensure that the system is not only engaging, but measurable, debuggable, and extensible.
Built for AI Partner Catalyst: Accelerate Innovation, VoiceDoc Agent reflects an approach to AI development that prioritizes clarity, control, and real-world operability — pushing voice-native interfaces beyond novelty and toward production-ready applications.