$ man audio-transcribe
/audio-transcribe(1)
PRICE / CALL
$0.01
USDC · base mainnet · scheme: exact
METHOD
POST
CLUSTER
mediakit
CATEGORY
uncategorized
STATUS
● live
NAME
audio-transcribe — audio transcribe / speech-to-text / whisper-large / multi-language asr / openai whisper api compat
SYNOPSIS
POST https://x402.org/v1/audio-transcribe
Content-Type: application/json
X-PAYMENT: <signed-transferWithAuthorization>
{ ... }
↳ first call → 402 Payment Required. Sign a USDC transferWithAuthorization, then retry with the X-PAYMENT header.
DESCRIPTION
Audio transcription (speech-to-text) with Whisper-large: multi-language ASR, compatible with the OpenAI Whisper API. The server fetches the audio URL (max 25 MB), relays it to Venice's audio/transcriptions endpoint with whisper-large-v3, and returns the transcript with detected language, duration, and per-segment timestamps when response_format='verbose_json' (the default). Raw text, SRT, and VTT outputs are also supported.
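The pay-then-retry flow from the synopsis can be sketched as follows. This is a minimal illustration, not the official client: `post` and `sign` are hypothetical callables supplied by the caller (an HTTP transport and a wallet signer), and the shape of the 402 payload is not specified on this page, so it is passed to the signer untouched.

```python
ENDPOINT = "https://x402.org/v1/audio-transcribe"


def call_with_payment(post, sign, body):
    """POST once; on 402, sign the payment requirements and retry.

    post(url, headers, body) -> (status, payload)  # hypothetical transport
    sign(requirements) -> str                      # hypothetical wallet signer
    """
    headers = {"Content-Type": "application/json"}
    status, payload = post(ENDPOINT, headers, body)
    if status != 402:
        # Anything other than 402 is returned as-is (success or other error).
        return status, payload

    # Retry the same request with the signed authorization attached.
    paid_headers = dict(headers)
    paid_headers["X-PAYMENT"] = sign(payload)
    return post(ENDPOINT, paid_headers, body)
```

Injecting the transport and signer keeps the sketch independent of any particular HTTP or wallet library; the function makes at most two requests per call.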
INPUT — request schema
| property | type | description | req? |
|---|---|---|---|
| audio_url | string | Public http(s) URL of the audio file (mp3, wav, m4a, ogg, flac, webm). Up to 25 MB. | required |
| language | string | BCP-47 language hint (e.g. 'en', 'es'). 'auto' or omitted = auto-detect. | optional |
| model | string | Override the model. Default 'openai/whisper-large-v3'. | optional |
| response_format | string | Output format. Default 'verbose_json' (transcript + segments). enum: json · text · verbose_json · srt · vtt | optional |
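As an illustration of the schema above, a request body could be assembled and checked client-side before spending a paid call. The helper name `build_request` is hypothetical; the field names and the `response_format` values come from the table, and the checks are only a local sanity pass, not the server's own validation.

```python
import json
from urllib.parse import urlparse

# Allowed values, taken from the request schema table.
RESPONSE_FORMATS = {"json", "text", "verbose_json", "srt", "vtt"}


def build_request(audio_url, language=None, model=None, response_format=None):
    """Return a JSON request body, omitting optional fields left unset."""
    parsed = urlparse(audio_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audio_url must be an http(s) URL")
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")

    body = {"audio_url": audio_url}
    # Only send optional fields the caller actually set, so server defaults apply.
    if language is not None:
        body["language"] = language
    if model is not None:
        body["model"] = model
    if response_format is not None:
        body["response_format"] = response_format
    return json.dumps(body)
```

For example, `build_request("https://example.com/talk.mp3", language="en", response_format="srt")` yields a body with three keys, while passing only the URL leaves language detection and output format to the server defaults.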
OUTPUT — response shape
| field | type | description |
|---|---|---|
| transcript | string | Full transcribed text of the audio, concatenated across all detected speech segments. |
| language_detected | string | ISO 639-1 code of the language Whisper auto-detected in the audio (e.g. 'en', 'es', 'fr'). |
| duration_seconds | string | Length of the source audio in seconds, as reported by Whisper after decoding. |
| segments | array | Array of per-segment objects with start/end timestamps and text, present when response_format is verbose_json. |
| response_format | string | Output format used: verbose_json (default), json, text, srt, or vtt. |
| model | string | Whisper model used for transcription, fixed to 'whisper-large-v3' via Venice's audio/transcriptions endpoint. |
| bytes_in | string | Size in bytes of the audio file fetched from the source URL before relay to Whisper. |
| source | string | Original audio URL the server fetched and transcribed (echoed back from the request). |
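A verbose_json response could be turned into timestamped lines along these lines. This is a sketch under assumptions: the per-segment keys `start`, `end`, and `text` are inferred from the description of `segments` and are not confirmed by this page, and numeric values are coerced with `float()` because the table lists them as strings.

```python
def format_timestamp(seconds):
    """Format a number of seconds (int, float, or numeric string) as HH:MM:SS.mmm."""
    total_ms = round(float(seconds) * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def summarize(response):
    """Return printable lines for a parsed verbose_json response dict."""
    language = response.get("language_detected", "?")
    lines = [f"[{language}] {response.get('transcript', '')}"]
    # Segment keys below are assumed; segments may be absent for other formats.
    for seg in response.get("segments") or []:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} -> {end}  {seg['text'].strip()}")
    return lines
```

When `segments` is missing, as with the non-verbose formats, `summarize` falls back to the single transcript line.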
EXAMPLES — two ways to call
EXAMPLE 1 · curl
curl -X POST https://x402.org/v1/audio-transcribe \
-H 'Content-Type: application/json' \
-d '{ }'
first response = 402 Payment Required with payment requirements; sign + retry with X-PAYMENT.
EXAMPLE 2 · mcp
# install once
claude mcp add x402 --command "npx x402-deployer-mcp"
# then ask Claude Code:
# "use the audio-transcribe tool to ..."
The MCP server handles payment automatically; your coding agent just calls the tool by name.
METADATA
- tags: mediakit, audio, transcription, speech-to-text, asr, whisper, subtitles, whisper-large-v3
- methods: POST
- cluster: mediakit
- price: $0.01 USDC per call
ADJACENT — other endpoints in mediakit
| endpoint | description | price |
|---|---|---|
| csv-to-ics | CSV calendar to ICS / iCal file generator. | $0.01 |
| image-convert | Universal image format converter (PNG, JPG, WEBP, AVIF, GIF, BMP, TIFF, ICO, HEIC, HEIF, PSD, SVG). | $0.01 |
| image-format-convert | Image converter. | $0.01 |
| merge-pdf | PDF merger / combine PDFs / concatenate PDF files / join multiple PDFs into one. | $0.01 |
| pdf-merge | PDF merger / PDF combiner / PDF concatenator. | $0.01 |
| receipt-ocr | Receipt OCR. | $0.01 |
| receipt-parser | Receipt → structured JSON (vendor, address, date, line items with qty/unit_price/total, subtotal, tax, tip, total, payment method). | $0.01 |
| youtube-transcript | YouTube transcript / closed-caption fetcher / video subtitles puller / auto-generated CC reader. | $0.01 |
SEE ALSO