Transcribing without Zoom

The coaching platform's AI features (competency assessment, development seeding, person insights, and coaching/client reflections) require a WebVTT (.vtt) transcript with speaker-attributed timestamps. Zoom Pro and above include this automatically with cloud recordings.

If you record on Microsoft Teams, Google Meet, Zoom Free, or in person (phone, dictaphone), this guide shows you how to produce a VTT file from any audio or video recording.

You may not need a transcript

If you do not need AI features, you can import just the recording (video or audio) via Staging. Core features all work without any transcript: session tracking, ICF certification, people, development plans, coaching log import/export, and data portability.


What you need

An audio or video file from any source:

  • MP4 (video). From Teams, Google Meet, Zoom local recording, or any other platform
  • M4A, MP3, WAV (audio). From a phone recording, dictaphone, or audio-only export

What you will get

A WebVTT (.vtt) file with speaker-attributed timestamps. This is the format the coaching platform expects for session review, talk-time analytics, ICF competency tagging, and AI assessment.

WEBVTT

1
00:00:04.000 --> 00:00:08.960
Coach: Tell me about your goal for today.

2
00:00:09.200 --> 00:00:14.400
Client: I want to work on my confidence in team meetings.

3
00:00:15.100 --> 00:00:22.800
Coach: What would confidence in those meetings look like for you?

Each segment has a start time, end time, speaker label, and the transcribed text.
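If you want to inspect or post-process a transcript programmatically, the cue structure above can be parsed mechanically. A minimal sketch (assuming the simple "Speaker: text" cue layout shown above; real-world VTT files may carry extra metadata such as cue settings or voice tags):

```python
import re

# Matches one cue: two timestamps, then an optional "Speaker: " prefix
# followed by the transcribed text.
CUE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(?:(\w+): )?(.+)"
)

def parse_vtt(content):
    # Return a list of (start, end, speaker, text) tuples; speaker is None
    # when the cue has no label.
    segments = []
    for start, end, speaker, text in CUE_RE.findall(content):
        segments.append((start, end, speaker or None, text.strip()))
    return segments

sample = """WEBVTT

1
00:00:04.000 --> 00:00:08.960
Coach: Tell me about your goal for today.

2
00:00:09.200 --> 00:00:14.400
Client: I want to work on my confidence in team meetings.
"""

print(parse_vtt(sample))
```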


Option 1: Local transcription with Whisper

Free. Private. Runs on your computer.

Whisper is an open-source speech recognition model by OpenAI. It runs locally. Your audio never leaves your machine.

Install

pip install openai-whisper

Or use a macOS app such as MacWhisper for a graphical interface.

Transcribe

whisper recording.mp4 --model medium --output_format vtt

This produces a recording.vtt file with timestamps. The medium model balances accuracy and speed. Use large for higher accuracy on longer sessions (requires more memory).

Speaker labels

Whisper alone does not attribute speakers (it transcribes all speech without identifying who is talking). For speaker labels, use one of these alternatives:

  • WhisperX. Adds speaker diarization using pyannote. Produces VTT with speaker labels.
  • Whisper + pyannote. Run diarization separately and merge with the Whisper transcript.

Speaker labels are required for the coaching platform's talk-time analytics and per-speaker competency tagging. If your VTT has no speaker labels, AI assessment will still work but talk-time and speaker-specific features will be unavailable.
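Before importing, you can check whether a transcript actually carries speaker labels. A heuristic sketch (it assumes the "Name: text" label convention used throughout this guide; tools that emit WebVTT voice tags like `<v Coach>` would need a different check):

```python
import re

def has_speaker_labels(vtt_text):
    # Heuristic: treat a cue text line beginning with "Name: " as a
    # speaker label. Timestamp lines never match because their colons
    # are not followed by a space.
    return bool(re.search(r"^\w[\w ]*: ", vtt_text, flags=re.MULTILINE))

labeled = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.000\nCoach: Hello.\n"
unlabeled = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.000\nHello there.\n"
print(has_speaker_labels(labeled), has_speaker_labels(unlabeled))
```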

Privacy

Everything runs on your machine. No audio is uploaded to any server.


Option 2: Cloud transcription with AssemblyAI

Paid. Accurate speaker labels. No local setup.

AssemblyAI is a cloud transcription service with built-in speaker diarization. It produces speaker-attributed transcripts with timestamps.

Sign up

  1. Create an account at assemblyai.com
  2. Copy your API key from the dashboard

AssemblyAI charges per audio hour. Check their pricing page for current rates.

Transcribe using the REST API

Step 1: Upload your audio file:

curl --request POST \
--url https://api.assemblyai.com/v2/upload \
--header "authorization: YOUR_API_KEY" \
--data-binary @recording.mp4

This returns a JSON response with an upload_url.

Step 2: Request transcription with speaker labels:

curl --request POST \
--url https://api.assemblyai.com/v2/transcript \
--header "authorization: YOUR_API_KEY" \
--header "content-type: application/json" \
--data '{
"audio_url": "YOUR_UPLOAD_URL",
"speaker_labels": true
}'

This returns a JSON response with an id. Poll the transcript endpoint until status is completed.
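The polling step can be scripted. A minimal sketch, written against any status-fetching callable so you can plug in urllib, requests, or a test stub (the "completed" and "error" status values follow AssemblyAI's transcript responses):

```python
import time

def poll_until_complete(fetch_status, interval=3.0, timeout=600.0):
    # Call fetch_status() (any callable returning the transcript JSON as a
    # dict) until the job finishes. AssemblyAI reports "error" on failure,
    # so bail out on that rather than polling forever.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(interval)
    raise TimeoutError("transcript did not complete in time")

# Usage with a stubbed fetcher; a real fetcher would GET the transcript
# endpoint with your API key, as in the curl command below.
responses = iter([{"status": "processing"}, {"status": "completed"}])
print(poll_until_complete(lambda: next(responses), interval=0.01))
```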

Step 3: Download the result:

curl --request GET \
--url https://api.assemblyai.com/v2/transcript/YOUR_TRANSCRIPT_ID \
--header "authorization: YOUR_API_KEY"

The response includes an utterances array with speaker, text, start, and end fields (timestamps in milliseconds).

Step 4: Convert to VTT:

AssemblyAI returns JSON, not VTT directly. Convert the utterances array to WebVTT format:

import json

# Load the transcript JSON saved from the previous step
with open("transcript.json") as f:
    data = json.load(f)

def ms_to_vtt(ms):
    # Convert milliseconds to a WebVTT timestamp (HH:MM:SS.mmm)
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

with open("transcript.vtt", "w") as out:
    out.write("WEBVTT\n\n")
    for i, u in enumerate(data["utterances"], 1):
        start = ms_to_vtt(u["start"])
        end = ms_to_vtt(u["end"])
        speaker = u["speaker"]
        text = u["text"]
        out.write(f"{i}\n{start} --> {end}\n{speaker}: {text}\n\n")

Replace the generic speaker labels (Speaker A, Speaker B) with Coach and Client in the VTT file before importing.
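The relabelling can be scripted instead of done by hand. A minimal sketch that handles both "A:"/"B:" and "Speaker A:"/"Speaker B:" label styles; the mapping below assumes the coach spoke first, so check your transcript and swap if the client opened the session:

```python
import re

# Assumed mapping: speaker A is the coach, speaker B the client.
LABEL_MAP = {"Speaker A": "Coach", "A": "Coach", "Speaker B": "Client", "B": "Client"}

def relabel(vtt_text, mapping=LABEL_MAP):
    # Replace a generic label at the start of a cue text line; lines that
    # don't match a known label are left untouched.
    def swap(match):
        return mapping.get(match.group(1), match.group(1)) + ": "
    return re.sub(r"^(Speaker [A-Z]|[A-Z]): ", swap, vtt_text, flags=re.MULTILINE)

cue = "1\n00:00:04.000 --> 00:00:08.960\nA: Tell me about your goal.\n"
print(relabel(cue))
```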

AssemblyAI CLI

AssemblyAI also provides a CLI tool that simplifies the upload-transcribe-download workflow into a single command.


Option 3: Other transcription services

Any service that produces a VTT file with speaker labels and timestamps will work. Some alternatives:

  • Otter.ai. Speaker labels: yes. Export as TXT or SRT (convert to VTT). Free tier available with limits.
  • Rev. Speaker labels: yes. Exports SRT and VTT. Human and AI transcription options.
  • Descript. Speaker labels: yes. Export as SRT or VTT. Also includes audio/video editing.

If the service produces SRT instead of VTT, the format is nearly identical. Add a WEBVTT header at the top, change the comma before the milliseconds in each timestamp to a dot, and rename the file extension to .vtt, or use an online converter.
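The SRT-to-VTT conversion can also be scripted. A minimal sketch covering the two real differences, the header and the timestamp separator (SRT writes 00:00:04,000 where VTT writes 00:00:04.000):

```python
import re

def srt_to_vtt(srt_text):
    # SRT timestamps use a comma before the milliseconds; WebVTT uses a
    # dot. Swap the separator everywhere it appears in a timestamp, then
    # prepend the required header.
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text.strip())
    return "WEBVTT\n\n" + body + "\n"

srt = "1\n00:00:04,000 --> 00:00:08,960\nCoach: Tell me about your goal.\n"
print(srt_to_vtt(srt))
```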


Importing into the coaching platform

Once you have your recording file and VTT transcript:

  1. Open the Staging page in the coaching platform
  2. Import your recording. Drag or select the video (MP4) or audio (M4A, MP3, WAV) file
  3. Import your VTT. Drag or select the .vtt transcript file
  4. Assign speakers. Map the speaker labels in the VTT (e.g. Speaker A, Speaker B) to Coach and Client
  5. Import into a session. Create a new session or add to an existing one

All platform features (session review with synced transcript playback, talk-time analytics, ICF competency tagging, and AI assessment) work identically regardless of which platform the recording came from.


Without a transcript

If you do not need AI features, skip the transcription step entirely. Import just the recording via Staging and use all core features:

  • Session management and status tracking
  • ICF certification hours, freshness window, and submission readiness
  • People and peer coaching pair management
  • Coach development areas and action plans
  • Coaching log import and export
  • Full data portability between editions

On the roadmap

Built-in audio extraction and transcription, so you can skip the manual steps above, is being considered for a future release. As an early adopter, you can vote for this feature on the Roadmap to help prioritise it.