$ man audio-transcribe

/audio-transcribe(1)

PRICE / CALL   $0.01 (USDC · base mainnet · scheme: exact)
METHOD         POST
CLUSTER        mediakit
CATEGORY       uncategorized
STATUS         live
NAME
audio-transcribe - audio transcription / speech-to-text / whisper-large / multi-language ASR / OpenAI Whisper API compatible
SYNOPSIS
POST https://x402.org/v1/audio-transcribe
     Content-Type: application/json
     X-PAYMENT:    <signed-transferWithAuthorization>

     { ... }
↳ first call → 402 Payment Required. Sign a USDC transferWithAuthorization, then retry with the X-PAYMENT header.
DESCRIPTION

Speech-to-text transcription with Whisper large-v3: multi-language ASR, compatible with the OpenAI Whisper API. The server fetches the audio URL (max 25 MB), relays it to Venice's audio/transcriptions endpoint with whisper-large-v3, and returns the transcript with detected language, duration, and per-segment timestamps when response_format='verbose_json' (the default). Raw text, SRT, and VTT outputs are also supported.

INPUT · request schema
property | type | description | req?
audio_url | string | Public http(s) URL of the audio file (mp3, wav, m4a, ogg, flac, webm). Up to 25 MB. | required
language | string | BCP-47 language hint (e.g. 'en', 'es'). 'auto' or omitted = auto-detect. | optional
model | string | Override the model. Default 'openai/whisper-large-v3'. | optional
response_format | string | Output format, one of: json · text · verbose_json · srt · vtt. Default 'verbose_json' (transcript + segments). | optional
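The request schema above can be turned into a small client-side builder. This is a minimal sketch, not part of the service: the function name build_request and the example.com URL are placeholders invented for illustration, and the checks only mirror what the table states (an http(s) URL and the listed response_format values).

```python
import json
from urllib.parse import urlparse

# Values copied from the response_format enum in the request schema.
RESPONSE_FORMATS = {"json", "text", "verbose_json", "srt", "vtt"}


def build_request(audio_url, language=None, model=None, response_format=None):
    """Return the JSON body for audio-transcribe as a dict (hypothetical helper).

    Optional fields are omitted when not given, so the server applies
    its documented defaults.
    """
    parsed = urlparse(audio_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audio_url must be a public http(s) URL")
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")

    body = {"audio_url": audio_url}
    if language is not None:
        body["language"] = language
    if model is not None:
        body["model"] = model
    if response_format is not None:
        body["response_format"] = response_format
    return body


# Placeholder URL, not a real audio file.
example = build_request(
    "https://example.com/talk.mp3", language="en", response_format="srt"
)
print(json.dumps(example))
```

Validating locally is a design choice: a malformed body would otherwise only be rejected after the paid call has been attempted.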
OUTPUT · response shape
field | type | description
transcript | string | Full transcribed text of the audio, concatenated across all detected speech segments.
language_detected | string | ISO 639-1 code of the language Whisper auto-detected in the audio (e.g. 'en', 'es', 'fr').
duration_seconds | string | Length of the source audio in seconds, as reported by Whisper after decoding.
segments | array | Per-segment objects with start/end timestamps and text, present when response_format is verbose_json.
response_format | string | Output format used: verbose_json (default), json, text, srt, or vtt.
model | string | Whisper model used for transcription, fixed to 'whisper-large-v3' via Venice's audio/transcriptions endpoint.
bytes_in | string | Size in bytes of the audio file fetched from the source URL before relay to Whisper.
source | string | Original audio URL the server fetched and transcribed (echoed back from the request).
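For consumers of the verbose_json output, the segments field can be read into typed records. The sketch below rests on assumptions: the segment keys "start", "end", and "text" are inferred from the field description and are not confirmed by this page, segments is taken to arrive as a JSON array, and the sample dict is hand-written rather than real API output.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


def parse_segments(response):
    """Extract segments from a decoded verbose_json response dict.

    Returns an empty list when 'segments' is absent (other formats).
    Key names 'start', 'end', 'text' are assumptions.
    """
    segments = response.get("segments") or []
    return [
        Segment(float(s["start"]), float(s["end"]), str(s["text"]).strip())
        for s in segments
    ]


def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


# Hand-written sample shaped like the response table; not real output.
sample = {
    "transcript": "Hello there. Second sentence.",
    "language_detected": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.5, "text": " Second sentence."},
    ],
}

for seg in parse_segments(sample):
    print(f"[{format_timestamp(seg.start)} -> {format_timestamp(seg.end)}] {seg.text}")
```

Returning an empty list when segments is missing lets the same code path handle json and verbose_json responses.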
EXAMPLES · two ways to call
EXAMPLE 1 · curl
curl -X POST https://x402.org/v1/audio-transcribe \
  -H 'Content-Type: application/json' \
  -d '{ }'
first response = 402 Payment Required with payment requirements; sign + retry with X-PAYMENT.
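The two-step flow described here (an unpaid call, a 402, then a signed retry) can be sketched with the Python standard library. This is an illustration under stated assumptions, not a reference client: sign_payment is a hypothetical callback that stands in for your wallet code, taking the 402 response body and returning the X-PAYMENT header value. The exact shape of the payment requirements is not documented on this page, so the body is passed through unparsed.

```python
import json
import urllib.error
import urllib.request

ENDPOINT = "https://x402.org/v1/audio-transcribe"  # from the SYNOPSIS


def http_post(body, extra_headers):
    """POST JSON and return (status, response_text) without raising on 402."""
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json", **extra_headers}
    req = urllib.request.Request(ENDPOINT, data=data, headers=headers, method="POST")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; keep status and body.
        return err.code, err.read().decode("utf-8")


def call_with_payment(body, sign_payment, post=http_post):
    """Call once unpaid; on 402, sign the requirements and retry once."""
    status, text = post(body, {})
    if status != 402:
        return status, text
    # Hypothetical callback: receives the 402 body (payment requirements)
    # and returns the signed value for the X-PAYMENT header.
    payment = sign_payment(text)
    return post(body, {"X-PAYMENT": payment})
```

The post parameter exists so the flow can be exercised with a stub instead of the network; in real use, call_with_payment({"audio_url": ...}, sign_payment=your_signer) would perform both requests.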
EXAMPLE 2 · mcp
# install once
claude mcp add x402 --command "npx x402-deployer-mcp"

# then ask Claude Code:
# "use the audio-transcribe tool to ..."
The MCP server handles payment automatically; your coding agent just calls the tool by name.
METADATA
tags      mediakit · audio · transcription · speech-to-text · asr · whisper · subtitles · whisper-large-v3
methods   POST
cluster   mediakit
price     $0.01 USDC per call
ADJACENT · other endpoints in mediakit
endpoint | description | price
csv-to-ics | CSV calendar to ICS / iCal file generator. | $0.01
image-convert | Universal image format converter (PNG, JPG, WEBP, AVIF, GIF, BMP, TIFF, ICO, HEIC, HEIF, PSD, SVG). | $0.01
image-format-convert | Image converter. | $0.01
merge-pdf | PDF merger / combine PDFs / concatenate PDF files / join multiple PDFs into one. | $0.01
pdf-merge | PDF merger / PDF combiner / PDF concatenator. | $0.01
receipt-ocr | Receipt OCR. | $0.01
receipt-parser | Receipt → structured JSON (vendor, address, date, line items with qty/unit_price/total, subtotal, tax, tip, total, payment method). | $0.01
youtube-transcript | YouTube transcript / closed-caption fetcher / video subtitles puller / auto-generated CC reader. | $0.01
SEE ALSO
agentutility(7) · mediakit(7) · x402(7) · mcp(7) · llms.txt · registry.json · bazaar.x402.org