1. Upload source video
Create an upload project:
| Field | Description |
|---|---|
| project_id | Project ID used for transcript and vision requests. |
| source | Customer-facing metadata for the uploaded video, such as filename, content type, and size when known. |
| upload_url | Signed URL that accepts the video bytes. |
| upload_headers | Headers to include when uploading. |
| upload_expires_in_seconds | Time until the signed URL expires. |
Analysis requests for the project return 409 Conflict until upload completion succeeds. The completed upload response includes source metadata so you can confirm which video the project points to.
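As a rough sketch, the upload request could be prepared from the upload_url and upload_headers fields like this. The HTTP method (PUT) is an assumption, since the field table only says the signed URL accepts the video bytes. The helper name build_upload_request and the sample project values are hypothetical.

```python
import urllib.request


def build_upload_request(project: dict, video_bytes: bytes) -> urllib.request.Request:
    """Prepare (but do not send) the request that uploads the video bytes.

    Assumption: the signed URL accepts an HTTP PUT. The field table only
    says that upload_url accepts the video bytes.
    """
    return urllib.request.Request(
        url=project["upload_url"],
        data=video_bytes,
        # Pass through whatever headers the API asked for.
        headers=dict(project.get("upload_headers") or {}),
        method="PUT",  # assumed, not confirmed by the documentation
    )


# Hypothetical response from the create-upload-project call.
example_project = {
    "project_id": "proj_123",
    "upload_url": "https://uploads.example.com/signed/abc",
    "upload_headers": {"Content-Type": "video/mp4"},
    "upload_expires_in_seconds": 900,
}

request = build_upload_request(example_project, b"\x00\x01")
```

Sending the prepared request with urllib.request.urlopen(request) would need to happen within upload_expires_in_seconds of creating the project.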
2. Transcript
Start transcript analysis after the project has a completed source upload.
Transcript language
language is required. Use one of the supported base tags:
| Value | Language |
|---|---|
| bg | Bulgarian |
| cs | Czech |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| es | Spanish |
| et | Estonian |
| fi | Finnish |
| fr | French |
| he | Hebrew |
| hr | Croatian |
| hu | Hungarian |
| it | Italian |
| lt | Lithuanian |
| lv | Latvian |
| mt | Maltese |
| nl | Dutch |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| sk | Slovak |
| sl | Slovenian |
| sv | Swedish |
| uk | Ukrainian |
Unsupported language values return 422 Unprocessable Entity.
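A client can check the language value before sending the request, which avoids a round trip that ends in 422. The sketch below hard-codes the tags from the table above. Reducing a regional tag such as pt-BR to its base tag on the client side is an assumption about what callers may want, and the helper name to_base_tag is hypothetical.

```python
# Base tags copied from the supported-language table.
SUPPORTED_LANGUAGES = frozenset({
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "he", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "ru", "sk", "sl",
    "sv", "uk",
})


def to_base_tag(tag: str) -> str:
    """Return the supported base tag for a value such as 'pt-BR' or 'EN'.

    Raises ValueError when the language is not in the supported list.
    """
    # Keep only the primary subtag and normalize its case.
    base = tag.strip().replace("_", "-").split("-", 1)[0].lower()
    if base not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported transcript language: {tag!r}")
    return base
```

For example, to_base_tag("pt-BR") yields "pt", while to_base_tag("ja") raises ValueError because Japanese is not in the table.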
3. Vision
Start vision analysis with the outputs your app needs.
4. Retake removal
Retake removal currently supports Hebrew (he). It analyzes transcript-like word timing and returns the word spans to remove plus keep intervals for a clean read.
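To illustrate how the removal spans relate to the word list, here is a minimal sketch that drops the flagged words and keeps the rest. The span shape is an assumption: each span is taken to be a dict with inclusive word_start_idx and word_end_idx keys, mirroring the utterance fields, and the actual response may differ. The helper name apply_remove_spans and the sample data are hypothetical.

```python
def apply_remove_spans(words: list[dict], remove_spans: list[dict]) -> list[dict]:
    """Return the words that survive after dropping every removal span.

    Assumption: spans carry inclusive word_start_idx / word_end_idx keys.
    """
    removed: set[int] = set()
    for span in remove_spans:
        removed.update(range(span["word_start_idx"], span["word_end_idx"] + 1))
    return [word for index, word in enumerate(words) if index not in removed]


# Hypothetical data: the speaker restarts the sentence after two words.
sample_words = [{"text": t} for t in ["I", "went", "I", "went", "home"]]
sample_spans = [{"word_start_idx": 0, "word_end_idx": 1}]

clean_words = apply_remove_spans(sample_words, sample_spans)
clean_text = " ".join(word["text"] for word in clean_words)
```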
Requested outputs
| Analysis | Output | Description |
|---|---|---|
| Transcript | text | Combined transcript text. |
| Transcript | agent_context | Short field notes for timing units, speaker IDs, and word index ranges. |
| Transcript | words | Word-level transcript entries with text, start, end, speaker, and optional confidence. |
| Transcript | utterances | Speaker turns with speaker, start, end, text, word_start_idx, and word_end_idx. |
| Transcript | speakers | Speaker summaries with speaker, total_duration, and utterance_count. |
| Vision | faces | Individual detected face observations. |
| Vision | agent_context | Short field notes for coordinates, timing, scene timestamps, and track IDs. |
| Vision | face_tracks | Face observations grouped across time. |
| Vision | scenes | Scene-level visual segments. |
| Retake removal | words | Word-level inputs used by the retake-removal model. |
| Retake removal | remove_spans | Word index ranges that should be removed. |
| Retake removal | keep_intervals | Time ranges to keep after removing retakes. |
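The transcript outputs overlap: the speakers fields (speaker, total_duration, utterance_count) can be approximated from utterances. The sketch below shows one way to do that, assuming total_duration is the sum of end minus start over a speaker's utterances. The service may compute it differently, and the helper name summarize_speakers and the sample utterances are hypothetical.

```python
from collections import defaultdict


def summarize_speakers(utterances: list[dict]) -> list[dict]:
    """Aggregate utterances into per-speaker summaries.

    Durations stay in whatever timing unit the transcript uses.
    """
    totals = defaultdict(lambda: {"total_duration": 0.0, "utterance_count": 0})
    for utterance in utterances:
        entry = totals[utterance["speaker"]]
        entry["total_duration"] += utterance["end"] - utterance["start"]
        entry["utterance_count"] += 1
    # Speakers appear in order of first utterance.
    return [{"speaker": speaker, **stats} for speaker, stats in totals.items()]


# Hypothetical utterances using the documented field names.
sample_utterances = [
    {"speaker": "A", "start": 0.0, "end": 2.5, "text": "Hello there."},
    {"speaker": "B", "start": 2.5, "end": 4.0, "text": "Hi."},
    {"speaker": "A", "start": 4.0, "end": 6.0, "text": "Shall we start?"},
]

speaker_summaries = summarize_speakers(sample_utterances)
```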
Polling pattern
Poll until status is completed or failed.
If an analysis is already queued, running, processing, or completed, starting the same analysis again returns the existing analysis response.
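The polling pattern can be sketched as a small loop that stops on either terminal status. The fetch_analysis callable stands in for whatever request retrieves the analysis. It is hypothetical, as are the helper name poll_until_done and the interval and timeout defaults.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(
    fetch_analysis: Callable[[], dict],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_analysis until its status is completed or failed.

    Raises TimeoutError if no terminal status arrives before the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        analysis = fetch_analysis()
        if analysis["status"] in TERMINAL_STATUSES:
            return analysis
        if time.monotonic() >= deadline:
            raise TimeoutError("analysis did not finish before the timeout")
        sleep(interval_seconds)


# Usage with a fake fetcher that walks through three states.
fake_responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed"},
])
final_analysis = poll_until_done(lambda: next(fake_responses), sleep=lambda _: None)
```

Injecting sleep keeps the loop testable without real delays. A caller should still inspect the returned status, since failed is a terminal state too.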
