Transcribing without Zoom
The coaching platform's AI features (competency assessment, development seeding, person insights, and coaching/client reflections) require a VTT (WebVTT) transcript with speaker-attributed timestamps. Zoom Pro and above include this automatically with cloud recordings.
If you record on Microsoft Teams, Google Meet, Zoom Free, or in person (phone, dictaphone), this guide shows you how to produce a VTT file from any audio or video recording.
If you do not need AI features, you can import just the recording (video or audio) via Staging. Core features (session tracking, ICF certification, people, development plans, coaching log import/export, and data portability) all work without any transcript.
What you need
An audio or video file from any source:
- MP4 (video). From Teams, Google Meet, Zoom local recording, or any other platform
- M4A, MP3, WAV (audio). From a phone recording, dictaphone, or audio-only export
What you will get
A WebVTT (.vtt) file with speaker-attributed timestamps. This is the format the coaching platform expects for session review, talk-time analytics, ICF competency tagging, and AI assessment.
WEBVTT
1
00:00:04.000 --> 00:00:08.960
Coach: Tell me about your goal for today.
2
00:00:09.200 --> 00:00:14.400
Client: I want to work on my confidence in team meetings.
3
00:00:15.100 --> 00:00:22.800
Coach: What would confidence in those meetings look like for you?
Each segment has a start time, end time, speaker label, and the transcribed text.
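Before importing, you can sanity-check a VTT file with a few lines of Python. This is an illustrative sketch, not part of the platform: the `parse_vtt` name and the assumption that speaker labels appear as a leading `Name: ` prefix are mine.

```python
import re

# Matches a WebVTT cue timing line, e.g. "00:00:04.000 --> 00:00:08.960"
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")

def parse_vtt(text):
    """Return (start, end, speaker, text) tuples from a VTT string.

    The speaker is taken from a leading "Name: " prefix on the cue text,
    or None if the cue has no speaker label.
    """
    if not text.lstrip().startswith("WEBVTT"):
        raise ValueError("missing WEBVTT header")
    cues = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = TIMING.match(line.strip())
        if not m:
            continue
        # The cue payload is the run of non-empty lines after the timing line
        payload = []
        for next_line in lines[i + 1:]:
            if not next_line.strip():
                break
            payload.append(next_line.strip())
        body = " ".join(payload)
        speaker, sep, rest = body.partition(": ")
        if sep:
            cues.append((m.group(1), m.group(2), speaker, rest))
        else:
            cues.append((m.group(1), m.group(2), None, body))
    return cues
```

Run over the sample above, this yields three cues with speakers Coach, Client, Coach; a cue with no `Name: ` prefix comes back with speaker `None`, which tells you speaker-specific features will be unavailable.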
Option 1: Local transcription with Whisper
Free. Private. Runs on your computer.
Whisper is an open-source speech recognition model by OpenAI. It runs locally. Your audio never leaves your machine.
Install
pip install openai-whisper
Or use a macOS app such as MacWhisper for a graphical interface.
Transcribe
whisper recording.mp4 --model medium --output_format vtt
This produces a recording.vtt file with timestamps. The medium model balances accuracy and speed. Use large for higher accuracy on longer sessions (requires more memory).
Speaker labels
Whisper alone does not attribute speakers (it transcribes all speech without identifying who is talking). For speaker labels, use one of these alternatives:
- WhisperX. Adds speaker diarization using pyannote. Produces VTT with speaker labels.
- Whisper + pyannote. Run diarization separately and merge with the Whisper transcript.
Speaker labels are required for the coaching platform's talk-time analytics and per-speaker competency tagging. If your VTT has no speaker labels, AI assessment will still work but talk-time and speaker-specific features will be unavailable.
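The Whisper + pyannote route works by giving each transcript segment the speaker whose diarization turn overlaps it most. A minimal sketch of that merge step, with deliberately simplified inputs (real pyannote output needs unpacking into these tuples first):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with its most-overlapping speaker.

    segments: list of (start, end, text) from the transcriber (seconds)
    turns:    list of (start, end, speaker) from the diarizer (seconds)
    Returns (start, end, speaker, text) tuples; speaker is None when no
    diarization turn overlaps the segment.
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap is positive only when the two intervals intersect
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((seg_start, seg_end, best, text))
    return labelled
```

WhisperX does this alignment for you (with word-level refinement); the sketch is only meant to show why the two tools compose cleanly.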
Privacy
Everything runs on your machine. No audio is uploaded to any server.
Option 2: Cloud transcription with AssemblyAI
Paid. Accurate speaker labels. No local setup.
AssemblyAI is a cloud transcription service with built-in speaker diarization. It produces speaker-attributed transcripts with timestamps.
Sign up
- Create an account at assemblyai.com
- Copy your API key from the dashboard
AssemblyAI charges per audio hour. Check their pricing page for current rates.
Transcribe using the REST API
Step 1: Upload your audio file:
curl --request POST \
--url https://api.assemblyai.com/v2/upload \
--header "authorization: YOUR_API_KEY" \
--data-binary @recording.mp4
This returns a JSON response with an upload_url.
Step 2: Request transcription with speaker labels:
curl --request POST \
--url https://api.assemblyai.com/v2/transcript \
--header "authorization: YOUR_API_KEY" \
--header "content-type: application/json" \
--data '{
"audio_url": "YOUR_UPLOAD_URL",
"speaker_labels": true
}'
This returns a JSON response with an id. Poll the transcript endpoint until status is completed.
Step 3: Download the result:
curl --request GET \
--url https://api.assemblyai.com/v2/transcript/YOUR_TRANSCRIPT_ID \
--header "authorization: YOUR_API_KEY"
The response includes an utterances array with speaker, text, start, and end fields (timestamps in milliseconds).
Step 4: Convert to VTT:
AssemblyAI returns JSON, not VTT directly. Convert the utterances array to WebVTT format:
import json

# Load the AssemblyAI JSON response saved from Step 3
with open("transcript.json") as f:
    data = json.load(f)

# Convert milliseconds to a WebVTT timestamp (HH:MM:SS.mmm)
def ms_to_vtt(ms):
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

with open("transcript.vtt", "w") as out:
    out.write("WEBVTT\n\n")
    for i, u in enumerate(data["utterances"], 1):
        start = ms_to_vtt(u["start"])
        end = ms_to_vtt(u["end"])
        speaker = u["speaker"]
        text = u["text"]
        out.write(f"{i}\n{start} --> {end}\n{speaker}: {text}\n\n")
Replace the generic speaker labels (Speaker A, Speaker B) with Coach and Client in the VTT file before importing.
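Relabelling is safest as a find-and-replace anchored to line starts, so the same text appearing inside a spoken sentence is left alone. A sketch, assuming your labels match the keys you pass in (check your file first: the conversion script above emits whatever is in the speaker field, which may be `A`/`B` rather than `Speaker A`/`Speaker B`):

```python
import re

def relabel(vtt_text, mapping):
    """Replace speaker labels at the start of cue lines.

    mapping is e.g. {"A": "Coach", "B": "Client"}; only a label
    followed by ": " at the beginning of a line is rewritten.
    """
    for old, new in mapping.items():
        vtt_text = re.sub(rf"(?m)^{re.escape(old)}: ", f"{new}: ", vtt_text)
    return vtt_text
```

Usage: `open("out.vtt", "w").write(relabel(open("transcript.vtt").read(), {"A": "Coach", "B": "Client"}))`.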
AssemblyAI CLI
AssemblyAI also provides a CLI tool that simplifies the upload-transcribe-download workflow into a single command.
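The request-and-poll steps above can also be scripted directly. A sketch using only the Python standard library and the endpoints from the curl commands (the helper names are my own; AssemblyAI's official Python SDK wraps this flow if you prefer not to roll it yourself):

```python
import json
import time
import urllib.request

API = "https://api.assemblyai.com/v2"

def transcript_payload(upload_url):
    """Build the JSON body requesting transcription with speaker labels."""
    return json.dumps({"audio_url": upload_url, "speaker_labels": True}).encode()

def fetch_json(url, api_key, data=None):
    """POST (when data is given) or GET a JSON endpoint and decode the reply."""
    headers = {"authorization": api_key}
    if data is not None:
        headers["content-type"] = "application/json"
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

def wait_for_transcript(transcript_id, api_key, interval=5):
    """Poll the transcript endpoint until it completes (or errors out)."""
    while True:
        result = fetch_json(f"{API}/transcript/{transcript_id}", api_key)
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(interval)
```

Submit with `fetch_json(f"{API}/transcript", api_key, transcript_payload(upload_url))`, then pass the returned `id` to `wait_for_transcript` and feed the result's `utterances` into the conversion script above.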
Option 3: Other transcription services
Any service that produces a VTT file with speaker labels and timestamps will work. Some alternatives:
| Service | Speaker labels | Output format | Notes |
|---|---|---|---|
| Otter.ai | Yes | Export as TXT or SRT (convert to VTT) | Free tier available with limits |
| Rev | Yes | SRT, VTT | Human and AI transcription options |
| Descript | Yes | Export as SRT or VTT | Also includes audio/video editing |
If the service produces SRT instead of VTT, the format is nearly identical. Add a WEBVTT line (followed by a blank line) at the top, change the comma in each timestamp to a full stop (00:00:01,000 becomes 00:00:01.000), and rename the file extension to .vtt, or use an online converter.
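Those two SRT-to-VTT changes are easy to script. A minimal sketch: it adds the header and fixes the timestamp separators, which is all most SRT files need (styling and positioning extensions are not handled).

```python
import re

def srt_to_vtt(srt_text):
    """Convert SRT to WebVTT: prepend the header and switch timestamp
    separators from commas to full stops."""
    body = re.sub(
        r"(\d{2}:\d{2}:\d{2}),(\d{3})",  # only touch commas inside timestamps
        r"\1.\2",
        srt_text,
    )
    return "WEBVTT\n\n" + body
```

Usage: `open("transcript.vtt", "w").write(srt_to_vtt(open("transcript.srt").read()))`. Commas in the spoken text are untouched, since the pattern only matches a comma between a full timestamp and three digits.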
Importing into the coaching platform
Once you have your recording file and VTT transcript:
- Open the Staging page in the coaching platform
- Import your recording. Drag or select the video (MP4) or audio (M4A, MP3, WAV) file
- Import your VTT. Drag or select the .vtt transcript file
- Assign speakers. Map the speaker labels in the VTT (e.g. Speaker A, Speaker B) to Coach and Client
- Import into a session. Create a new session or add to an existing one
All platform features (session review with synced transcript playback, talk-time analytics, ICF competency tagging, and AI assessment) work identically regardless of which platform the recording came from.
Without a transcript
If you do not need AI features, skip the transcription step entirely. Import just the recording via Staging and use all core features:
- Session management and status tracking
- ICF certification hours, freshness window, and submission readiness
- People and peer coaching pair management
- Coach development areas and action plans
- Coaching log import and export
- Full data portability between editions
On the roadmap
Built-in audio extraction and transcription (letting you skip the manual steps above) is being considered for a future release. As an early adopter, you can vote for this feature on the Roadmap to help prioritise it.