Workflow guide

Forced alignment for subtitles: existing transcript to SRT timing

By Yana Li

If you already have the correct transcript, script, or voiceover copy, you do not need a tool to guess the words again. You need a way to align that known text to the audio, inspect mismatches, and turn the timing into subtitle files such as SRT or VTT.

Approved transcript + matching audio

approved-transcript.txt

We ship the approved words.
Then we review the timing.

matching audio: final-voiceover.wav

Source-preserving SRT example

1
00:00:00,600 --> 00:00:02,700
We ship the approved words.

2
00:00:03,000 --> 00:00:05,600
Then we review the timing.

Subtitle-line-specific quality signal

Subtitle line 2 needs attention: the recording says “Then we check the timing,” while the approved transcript says “Then we review the timing.”

Decision points

The short answer

Forced alignment can add timing to text you already trust when that transcript closely matches the recording. A subtitle workflow must then group those timings into readable subtitle lines and validate the resulting SRT or VTT before delivery.

Forced alignment vs transcription

Transcription starts with audio and tries to discover the words. Forced alignment starts with known text and locates when those words are spoken. An aligner may still use acoustic or speech models internally, but the supplied transcript remains the source of truth.

Transcript readiness checklist

Make sure the transcript version matches the final audio edit. Keep the spoken wording, including intentional filler and repeated words. Remove stale timestamps, visual directions, and headers that are not spoken. Confirm the language and text encoding, and split long recordings into reviewable sections when the chosen aligner needs it. Treat missing, reordered, translated, or rewritten lines as mismatch risk rather than silently correcting them.
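The cleanup part of this checklist can be sketched in a few lines of Python. This is a minimal sketch, not a complete normalizer: the function name prepare_transcript, the bracket convention for visual directions, the timestamp pattern, and the ALL-CAPS header rule are all assumptions to adapt to your own scripts. It removes only unspoken material and leaves the wording, including filler, untouched.

```python
import re
import unicodedata

# Assumed conventions (adjust to your own scripts):
#   [bracketed text]  -> visual direction or stale cue marker, never spoken
#   00:12 / 00:00:03,000 -> stale timestamps left over from an older export
#   ALL-CAPS lines    -> section headers that are not spoken
BRACKETED = re.compile(r"\[[^\]]*\]")
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?\b")


def prepare_transcript(raw: str) -> list[str]:
    """Return the spoken lines only, with their wording unchanged."""
    text = unicodedata.normalize("NFC", raw)  # one consistent Unicode form
    spoken = []
    for line in text.splitlines():
        line = BRACKETED.sub("", line)
        # Caution: this would also strip a spoken clock time such as "3:30".
        line = TIMESTAMP.sub("", line)
        line = re.sub(r"\s+", " ", line).strip()
        if not line or line.isupper():
            continue  # blank line or assumed header
        spoken.append(line)
    return spoken
```

Review the removed material by hand on a sample before trusting any pattern like this on a full script.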

Choose the alignment path

Use Montreal Forced Aligner or another open-source pipeline when you can manage models, dictionaries, normalization, command-line setup, and conversion from TextGrid or word timings to subtitle files. Use an API when you need developer integration. Use an editor workaround when manual captioning is acceptable. Use TimedSubs for managed subtitle delivery from approved text plus matching audio that you own, with mismatch checks and SRT/VTT export in the same workflow.

Subtitle output is a separate step

Word or phrase timings alone are not yet a usable subtitle file. SRT and VTT still need natural subtitle-line boundaries, positive durations, ordered timestamps, no overlaps, readable line lengths, and a final check that the exported text still matches the approved source.
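To illustrate the gap between word timings and a subtitle file, here is a minimal Python sketch that groups timed words into cues and writes SRT text. The Word and Cue types, the 42-character limit, and the 0.6-second pause threshold are illustrative assumptions, not values taken from any particular aligner or style guide. The words themselves are joined as supplied, so the approved wording is preserved.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Cue:
    start: float
    end: float
    text: str


def _to_cue(words: list[Word]) -> Cue:
    return Cue(words[0].start, words[-1].end, " ".join(w.text for w in words))


def group_words(words: list[Word], max_chars: int = 42, max_gap: float = 0.6) -> list[Cue]:
    """Start a new cue after sentence-ending punctuation, a long pause, or a full line."""
    cues, current = [], []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current) + " " + word.text
            gap = word.start - current[-1].end
            ends_sentence = current[-1].text.endswith((".", "?", "!"))
            if len(candidate) > max_chars or gap > max_gap or ends_sentence:
                cues.append(_to_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(_to_cue(current))
    return cues


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def to_srt(cues: list[Cue]) -> str:
    blocks = [
        f"{i}\n{srt_timestamp(c.start)} --> {srt_timestamp(c.end)}\n{c.text}\n"
        for i, c in enumerate(cues, 1)
    ]
    return "\n".join(blocks)
```

Splitting on sentence punctuation first keeps cues aligned with natural phrasing. A production grouper would likely also balance two-line cues and enforce minimum durations.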

Practical workflow

1. Complete the transcript readiness checklist above before running alignment.
2. Run the selected aligner against the known text and audio.
3. Check unmatched, low-confidence, skipped, added, or changed regions as explicit quality signals.
4. Group the timing into readable subtitle lines while preserving the approved wording.
5. Validate SRT/VTT structure, timestamp order, overlap, line length, and reading speed.
6. Export the subtitle asset and upload it yourself to the destination platform or editor.
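The validation step can be approximated with a small structural checker. The sketch below is a hedged example, not a full SRT parser: the function names are hypothetical, and the 42-character and 17 characters-per-second limits are assumed defaults that should be replaced with your delivery spec.

```python
import re

CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(h: int, m: int, s: int, ms: int) -> int:
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text: str) -> list[tuple[int, int, int, list[str]]]:
    """Parse SRT text into (index, start_ms, end_ms, text_lines) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        match = CUE_TIME.fullmatch(lines[1].strip()) if len(lines) >= 3 else None
        if match is None or not lines[0].strip().isdigit():
            raise ValueError(f"malformed cue block: {block!r}")
        parts = [int(p) for p in match.groups()]
        cues.append((int(lines[0]), _to_ms(*parts[:4]), _to_ms(*parts[4:]), lines[2:]))
    return cues


def validate_srt(text: str, max_line_chars: int = 42, max_cps: float = 17.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    prev_end = 0
    for expected, (index, start, end, lines) in enumerate(parse_srt(text), 1):
        if index != expected:
            problems.append(f"cue {index}: expected index {expected}")
        if end <= start:
            problems.append(f"cue {index}: duration is not positive")
        if start < prev_end:
            problems.append(f"cue {index}: starts before the previous cue ends")
        for line in lines:
            if len(line) > max_line_chars:
                problems.append(f"cue {index}: line longer than {max_line_chars} characters")
        seconds = (end - start) / 1000
        if seconds > 0 and sum(len(line) for line in lines) / seconds > max_cps:
            problems.append(f"cue {index}: reading speed above {max_cps} characters per second")
        prev_end = max(prev_end, end)
    return problems
```

A checker like this catches structural problems only. It cannot tell whether the cue text still matches the approved source, so that comparison remains a separate check.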

Product boundary

Forced alignment for subtitles requires known text and matching audio. TimedSubs handles the managed subtitle-asset version of that workflow; it does not download public videos, transcribe audio from scratch, or claim universal word-level alignment superiority.

Official references checked for workflow posture

Official reference review: May 17, 2026

Montreal Forced Aligner corpus structure
Montreal Forced Aligner mfa align
ElevenLabs Forced Alignment docs
PyTorch forced alignment tutorial
NVIDIA forced alignment explainer


FAQ

Can I align an existing transcript to audio and get SRT?

Yes, but there are two parts. First, forced alignment finds timing for the known transcript. Second, subtitle tooling converts that timing into readable SRT or VTT subtitle lines. A raw word-timing file or TextGrid may still need segmentation and validation before delivery.

Is forced alignment better than Whisper transcription?

It solves a different problem. Use transcription when the words are unknown. Use forced alignment when the wording is already approved and the remaining work is timing. Re-transcribing approved text can introduce changed names, numbers, product terms, or technical vocabulary.

Can I use Montreal Forced Aligner for subtitles?

Yes, if you are comfortable with its setup. MFA expects paired audio and transcription files plus suitable acoustic models, pronunciation dictionaries, and text normalization. Its alignment output still needs to be converted or packaged into readable SRT/VTT subtitle lines and checked before delivery.
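Before running MFA, it can help to confirm that every audio file has a transcript with the same base name and vice versa. The Python sketch below assumes a corpus folder where each audio file sits next to a same-named .txt or .lab transcript. The function name find_unpaired and the extension lists are illustrative assumptions, so check the MFA corpus documentation for the formats your version accepts.

```python
from pathlib import Path

# Assumed extensions; extend or trim to match your corpus and MFA version.
AUDIO_EXTS = {".wav", ".flac", ".mp3"}
TEXT_EXTS = {".txt", ".lab"}


def find_unpaired(corpus_dir) -> tuple[list[str], list[str]]:
    """Return (audio without transcript, transcript without audio) as base names."""
    root = Path(corpus_dir)
    audio, text = set(), set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Key on the relative path without its extension so speaker subfolders work.
        key = path.relative_to(root).with_suffix("").as_posix()
        suffix = path.suffix.lower()
        if suffix in AUDIO_EXTS:
            audio.add(key)
        elif suffix in TEXT_EXTS:
            text.add(key)
    return sorted(audio - text), sorted(text - audio)
```

An empty result for both lists means the pairing looks complete. It says nothing about whether each transcript actually matches its recording.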

Does Subtitle Edit or Premiere Pro solve this directly?

The answer depends on the version and workflow. Many editor paths are transcription-first: they create captions from audio and then let you edit or replace text. That can work as a manual workaround, but it is not the same as preserving an existing transcript as the source of truth from the start.

What if my transcript is not exact?

Punctuation differences are usually less important than spoken-word differences. Missing words, removed filler, reordered paragraphs, skipped sentences, different takes, or translated text can create drift and should appear as review work after alignment.
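One way to surface spoken-word differences while ignoring punctuation is a word-level diff using Python's standard difflib module. In this sketch the "heard" text is assumed to come from a separate recognition pass used only for checking, and the function names are hypothetical. The approved transcript is never modified; differences are only reported for review.

```python
import difflib
import re


def tokens(text: str) -> list[str]:
    """Lowercase words only, so punctuation and casing do not count as differences."""
    return re.findall(r"[\w']+", text.lower())


def spoken_word_mismatches(approved: str, heard: str) -> list[tuple[str, str, str]]:
    """Return (operation, approved words, heard words) for every non-matching region."""
    a, b = tokens(approved), tokens(heard)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The mismatch from the sample project at the top of this guide:
print(spoken_word_mismatches(
    "Then we review the timing.",
    "Then we check the timing,",
))
# [('replace', 'review', 'check')]
```

Each reported region maps back to a word range in the approved text, which makes it possible to flag the specific subtitle line that needs attention.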

Can forced alignment create SRT from text alone?

No. Forced alignment itself requires audio. Text alone can still be turned into a structurally valid SRT draft with estimated timestamps, but that draft is not synchronized to the recording. For forced alignment, or for any subtitle timing derived from when the words are actually spoken, provide the matching audio file or the video's audio track.
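For contrast, here is what a text-only draft looks like in Python. The timings are estimated from word counts at an assumed speaking rate of 150 words per minute, with an assumed gap and minimum duration. Nothing in this sketch listens to the recording, so the output is a placeholder draft and not an aligned subtitle file.

```python
def estimate_cues(
    lines: list[str],
    words_per_minute: float = 150.0,  # assumed speaking rate
    gap: float = 0.2,                 # assumed pause between cues, in seconds
    min_duration: float = 1.0,        # assumed floor so short lines stay readable
) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) with ESTIMATED, unsynchronized timing."""
    cues = []
    cursor = 0.0
    for line in lines:
        duration = max(min_duration, len(line.split()) * 60.0 / words_per_minute)
        cues.append((round(cursor, 3), round(cursor + duration, 3), line))
        cursor += duration + gap
    return cues
```

With the two sample lines from this guide, each five-word line gets an estimated two seconds, regardless of how the speaker actually paced them.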

Which languages are supported?

Language support depends on the aligner, model, dictionary, and normalization rules. Check the selected aligner’s current language resources before planning a large workflow, and test representative recordings rather than assuming one model behaves the same across languages.

When should I choose TimedSubs?

Choose TimedSubs when you have approved text and matching owned audio and want a subtitle-asset workflow: source wording preserved, timing generated from the recording, mismatch signals surfaced, and SRT/VTT assets prepared after quality checks.

When should I not use TimedSubs?

Do not use TimedSubs as the first step if you only have audio and need a transcript discovered from scratch, if you need arbitrary video downloading, or if your main deliverable is phonetic research output rather than subtitle files.
