Add captions and subtitles to YouTube videos
Generate accurate, word-timed captions for a YouTube video right in your browser, then burn them into a 16:9 MP4 or export a clean SRT or VTT to upload alongside it. The AI transcribes your audio with per-word timing, every word stays editable, and there is no watermark, no sign-up, and nothing leaves your device.
Caption a full YouTube video, not just a clip
YouTube is the one place where length is open-ended — a two-minute explainer and a forty-minute deep-dive live side by side, and both deserve captions. Drop your finished cut in here at 1920 × 1080 and the tool transcribes the whole thing with word-level timing, however long it runs, then lays the captions over a 16:9 preview that matches what your viewers see on the watch page.
Because YouTube videos get watched on everything from a phone at lunch to a TV across the room, the captions are styled to stay legible at any size. You read the transcript top to bottom, correct anything the audio garbled, restyle the look, and leave with a file ready to upload. The steps are the same whether the source is short or feature-length.
Why captions earn their place on YouTube
Captions do quiet work on YouTube. They keep people watching in a noisy kitchen or a silent library, and they open your video to deaf and hard-of-hearing viewers and to anyone following in a second language. Uploaded as a caption file, they also hand YouTube a clean block of text to read and index, one more signal tying your video to what people actually search for.
That combination tends to lift watch time, one of the metrics the platform weighs heavily. A viewer who can read along is more likely to stay past the intro instead of bouncing, and longer sessions help a video get recommended. Captions are one of the smaller changes you can make to a finished video that can still move the numbers.
AI subtitles timed to the word
An on-device speech model handles the transcription. It detects the spoken language on its own, supports around 99 languages, and returns every word with its own start and end time. That per-word timing is what lets a word light up the instant it's spoken instead of a whole sentence flashing on at once. It is the difference between captions that feel produced and captions that feel pasted on.
If the auto-detected language is wrong, override it; if you want the captions in another language entirely, re-generate them in that one. And since no transcription is ever perfect, every word drops into an editable transcript: click a misheard name, an acronym, or a bit of jargon, fix it, and the caption on screen updates immediately. The final wording stays yours.
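For the curious, here is a rough sketch of how per-word timing can drive a highlight. It is an illustration only, not this tool's actual code: the `Word` class, the `active_word_index` function, and the sample timings are all hypothetical. Given a playback time, it looks up which word (if any) is being spoken at that instant.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its own timing (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def active_word_index(words, t):
    """Return the index of the word being spoken at time t, or None.

    Assumes the words are sorted by start time and do not overlap.
    """
    starts = [w.start for w in words]
    # Index of the last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i].end:
        return i
    return None  # t falls before the first word, in a pause, or after the end


# Made-up sample timings, with a short pause between "that" and "feel".
words = [
    Word("Captions", 0.00, 0.42),
    Word("that", 0.42, 0.55),
    Word("feel", 0.60, 0.85),
    Word("produced", 0.85, 1.40),
]
```

With these sample timings, `active_word_index(words, 0.5)` points at "that", while a time of 0.57 seconds lands in the pause and returns `None`, so no word would be highlighted.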
Your footage stays on your machine
This runs entirely in your browser. The video is decoded, transcribed, and rendered on your own computer — there is no upload step, no server touches your footage, and nothing is handed to any model to keep or train on. The guarantee is structural, not a line in a policy: the file has nowhere else to go.
For a YouTube creator that matters in practice. You can caption a sponsored segment, an unlisted draft, or an embargoed announcement and export it without a single frame reaching someone else's cloud first. The AI model downloads to your device once, then works offline every time after.
Styles that read well on a wide frame
The 16:9 frame gives captions room to breathe, so the styling leans toward clarity. Four animated presets cover the usual looks: Karaoke highlights each word as it's spoken, Highlighted boxes the active word in a color, Minimal shows one clean word at a time, and Dynamic pops a single word in. For long-form talking content, sitting them low and centered keeps them clear of faces and lower-thirds; for a tutorial you might raise them above b-roll labels.
Everything is adjustable from there — typeface (Inter, Montserrat, Oswald, Lora, or JetBrains Mono) and weight, size, position with a fine vertical nudge, text and highlight colors, outline, drop shadow, and how many words appear per line. Keep one or two words on screen for a punchy edit, or widen the line for a calmer documentary feel.
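The words-per-line setting can be pictured as a simple grouping step. The sketch below is a guess at the general idea rather than the tool's real logic; `group_into_lines` and the `(text, start, end)` tuple layout are assumptions made for the example. Each caption line takes its start time from its first word and its end time from its last.

```python
def group_into_lines(words, words_per_line):
    """Group (text, start, end) word tuples into caption lines.

    Returns a list of (line_text, line_start, line_end) tuples.
    """
    if words_per_line < 1:
        raise ValueError("words_per_line must be at least 1")
    lines = []
    for i in range(0, len(words), words_per_line):
        chunk = words[i:i + words_per_line]
        text = " ".join(w[0] for w in chunk)
        # The line spans from its first word's start to its last word's end.
        lines.append((text, chunk[0][1], chunk[-1][2]))
    return lines


# Made-up sample timings in seconds.
sample_words = [
    ("Keep", 0.0, 0.3),
    ("one", 0.3, 0.5),
    ("or", 0.5, 0.6),
    ("two", 0.6, 0.9),
    ("words", 0.9, 1.3),
]

punchy = group_into_lines(sample_words, 2)  # short lines for a fast edit
calm = group_into_lines(sample_words, 5)    # one wider line
```

A real layout would likely also break lines at pauses and punctuation; this version splits purely by word count to keep the idea visible.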
Burn in for the upload, or export SRT and VTT
When the captions look right there are two ways out, and on YouTube both are useful. Burn them straight into a 16:9 MP4 — optimized for sharing or kept at source quality — and the text is baked into the frames, always on, on every device and every player. Or skip the burn-in and export a standard .srt or .vtt subtitle file to add in YouTube Studio, which lets viewers toggle captions on or off and feeds the same searchable text to the platform.
Many creators do both: burn-in for the safety of always-visible text, plus an uploaded SRT so the toggle and the indexing work too. Fine-tune the timing of any line before you export, and nothing you make carries a watermark.
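If you are wondering what is inside those two subtitle files, the formats are close cousins. The sketch below writes the same cues both ways; the function names and sample cues are made up for illustration and are not this tool's export code. SRT numbers each cue and uses a comma before the milliseconds, while VTT opens with a `WEBVTT` header and uses a period.

```python
def format_timestamp(seconds, ms_separator):
    """Format seconds as HH:MM:SS<sep>mmm (',' for SRT, '.' for VTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_separator}{ms:03d}"


def to_srt(cues):
    """Serialize (start, end, text) cues as SubRip (.srt) text."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{number}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(cues):
    """Serialize (start, end, text) cues as WebVTT (.vtt) text."""
    blocks = ["WEBVTT"]  # required header line
    for start, end, text in cues:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Made-up sample cues: (start seconds, end seconds, caption text).
cues = [
    (0.0, 1.4, "Captions that feel produced"),
    (1.4, 2.75, "start with word timing"),
]

srt_text = to_srt(cues)
vtt_text = to_vtt(cues)
```

Either string can be saved with the matching extension. The cue text and timings are identical in both; only the header, numbering, and millisecond separator differ.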
Questions
No — you have both options here. Upload a .srt or .vtt file in YouTube Studio and viewers can toggle captions on or off, and YouTube indexes the text. Or burn the captions into the MP4 so they're always visible on every player and device. Many creators do both: a burned-in copy for guaranteed visibility plus an uploaded SRT for the toggle and search indexing.
On a 16:9 frame, keep them large enough to read on a phone but not so large they cover faces — and remember YouTube overlays its own controls and any uploaded-caption track near the bottom. Centered and low usually reads best for talking-head and long-form, raised a bit when you have lower-thirds or b-roll labels. You control size, position with a fine vertical nudge, outline, and shadow, so the text stays legible over any background.
Yes. There is no fixed length limit — the on-device model transcribes the whole video with word-level timing whether it runs two minutes or much longer. You then scroll the full transcript to correct anything before exporting.
The on-device speech model auto-detects and transcribes around 99 languages, including English, Spanish, Portuguese, French, German, Russian, Hindi, Japanese, Korean, and Chinese. You can override the detected language, or re-generate the captions in another language. Every word stays editable afterward, so you can correct anything before export.
It is free for everyone, by choice — no watermark, no sign-up, no account, and no paywall at export. Every style, language, and export is open. And nothing is uploaded: transcription and burn-in happen entirely in your browser, so your file never leaves your computer and no model trains on it.