Transcribing without Zoom

The coaching platform's AI features (competency assessment, development seeding, person insights, and coaching/client reflections) require a WebVTT (.vtt) transcript with speaker-attributed timestamps. Zoom Pro and above include this automatically with cloud recordings.

If you record on Microsoft Teams, Google Meet, Zoom Free, or in person (phone, dictaphone), this guide shows you how to produce a VTT file from any audio or video recording.

You may not need a transcript

If you do not need AI features, you can import just the recording (video or audio) via Staging. Core features all work without any transcript: session tracking, ICF certification, people, development plans, coaching log import/export, and data portability.


What you need

An audio or video file from any source:

  • MP4 (video). From Teams, Google Meet, Zoom local recording, or any other platform
  • M4A, MP3, WAV (audio). From a phone recording, dictaphone, or audio-only export

What you will get

A WebVTT (.vtt) file with speaker-attributed timestamps. This is the format the coaching platform expects for session review, talk-time analytics, ICF competency tagging, and AI assessment.

WEBVTT

1
00:00:04.000 --> 00:00:08.960
Coach: Tell me about your goal for today.

2
00:00:09.200 --> 00:00:14.400
Client: I want to work on my confidence in team meetings.

3
00:00:15.100 --> 00:00:22.800
Coach: What would confidence in those meetings look like for you?

Each segment has a start time, end time, speaker label, and the transcribed text.
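If you want to inspect or post-process a transcript programmatically, the cue structure above can be parsed mechanically. A minimal sketch (assuming the simple "Speaker: text" cue layout shown above; real-world VTT files may carry extra metadata such as cue settings or voice tags):

```python
import re

# Matches one cue: two timestamps, then an optional "Speaker: " prefix
# followed by the transcribed text.
CUE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(?:(\w+): )?(.+)"
)

def parse_vtt(content):
    # Return a list of (start, end, speaker, text) tuples; speaker is None
    # when the cue has no label.
    segments = []
    for start, end, speaker, text in CUE_RE.findall(content):
        segments.append((start, end, speaker or None, text.strip()))
    return segments

sample = """WEBVTT

1
00:00:04.000 --> 00:00:08.960
Coach: Tell me about your goal for today.

2
00:00:09.200 --> 00:00:14.400
Client: I want to work on my confidence in team meetings.
"""

print(parse_vtt(sample))
```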


Option 1: Local transcription with Whisper

Free. Private. Runs on your computer.

Whisper is an open-source speech recognition model by OpenAI. It runs locally. Your audio never leaves your machine.

Install

pip install openai-whisper

Or use a macOS app such as MacWhisper for a graphical interface.

Transcribe

whisper recording.mp4 --model medium --output_format vtt

This produces a recording.vtt file with timestamps. The medium model balances accuracy and speed. Use large for higher accuracy on longer sessions (requires more memory).

Speaker labels

Whisper alone does not attribute speakers (it transcribes all speech without identifying who is talking). For speaker labels, use one of these alternatives:

  • WhisperX. Adds speaker diarization using pyannote. Produces VTT with speaker labels.
  • Whisper + pyannote. Run diarization separately and merge with the Whisper transcript.

Speaker labels are required for the coaching platform's talk-time analytics and per-speaker competency tagging. If your VTT has no speaker labels, AI assessment will still work but talk-time and speaker-specific features will be unavailable.
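Before importing, you can check whether a transcript actually carries speaker labels. A heuristic sketch (it assumes the "Name: text" label convention used throughout this guide; tools that emit WebVTT voice tags like `<v Coach>` would need a different check):

```python
import re

def has_speaker_labels(vtt_text):
    # Heuristic: treat a cue text line beginning with "Name: " as a
    # speaker label. Timestamp lines never match because their colons
    # are not followed by a space.
    return bool(re.search(r"^\w[\w ]*: ", vtt_text, flags=re.MULTILINE))

labeled = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.000\nCoach: Hello.\n"
unlabeled = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:02.000\nHello there.\n"
print(has_speaker_labels(labeled), has_speaker_labels(unlabeled))
```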

Privacy

Everything runs on your machine. No audio is uploaded to any server.


Option 2: Cloud transcription with AssemblyAI

Paid. Accurate speaker labels. No local setup.

AssemblyAI is a cloud transcription service with built-in speaker diarization. It produces speaker-attributed transcripts with timestamps.

Sign up

  1. Create an account at assemblyai.com
  2. Copy your API key from the dashboard

AssemblyAI charges per audio hour. Check their pricing page for current rates.

Transcribe using the REST API

Step 1: Upload your audio file:

curl --request POST \
--url https://api.assemblyai.com/v2/upload \
--header "authorization: YOUR_API_KEY" \
--data-binary @recording.mp4

This returns a JSON response with an upload_url.

Step 2: Request transcription with speaker labels:

curl --request POST \
--url https://api.assemblyai.com/v2/transcript \
--header "authorization: YOUR_API_KEY" \
--header "content-type: application/json" \
--data '{
"audio_url": "YOUR_UPLOAD_URL",
"speaker_labels": true
}'

This returns a JSON response with an id. Poll the transcript endpoint until status is completed.
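The polling step can be scripted. A minimal sketch, written against any status-fetching callable so you can plug in urllib, requests, or a test stub (the "completed" and "error" status values follow AssemblyAI's transcript responses):

```python
import time

def poll_until_complete(fetch_status, interval=3.0, timeout=600.0):
    # Call fetch_status() (any callable returning the transcript JSON as a
    # dict) until the job finishes. AssemblyAI reports "error" on failure,
    # so bail out on that rather than polling forever.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(interval)
    raise TimeoutError("transcript did not complete in time")

# Usage with a stubbed fetcher; a real fetcher would GET the transcript
# endpoint with your API key, as in the curl command below.
responses = iter([{"status": "processing"}, {"status": "completed"}])
print(poll_until_complete(lambda: next(responses), interval=0.01))
```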

Step 3: Download the result:

curl --request GET \
--url https://api.assemblyai.com/v2/transcript/YOUR_TRANSCRIPT_ID \
--header "authorization: YOUR_API_KEY"

The response includes an utterances array with speaker, text, start, and end fields (timestamps in milliseconds).

Step 4: Convert to VTT:

AssemblyAI returns JSON, not VTT directly. Convert the utterances array to WebVTT format:

import json

# Load the transcript JSON saved from the previous step
with open("transcript.json") as f:
    data = json.load(f)

def ms_to_vtt(ms):
    # Convert milliseconds to a WebVTT timestamp (HH:MM:SS.mmm)
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

with open("transcript.vtt", "w") as out:
    out.write("WEBVTT\n\n")
    for i, u in enumerate(data["utterances"], 1):
        start = ms_to_vtt(u["start"])
        end = ms_to_vtt(u["end"])
        speaker = u["speaker"]
        text = u["text"]
        out.write(f"{i}\n{start} --> {end}\n{speaker}: {text}\n\n")

Replace the generic speaker labels (Speaker A, Speaker B) with Coach and Client in the VTT file before importing.
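The relabelling can be scripted instead of done by hand. A minimal sketch that handles both "A:"/"B:" and "Speaker A:"/"Speaker B:" label styles; the mapping below assumes the coach spoke first, so check your transcript and swap if the client opened the session:

```python
import re

# Assumed mapping: speaker A is the coach, speaker B the client.
LABEL_MAP = {"Speaker A": "Coach", "A": "Coach", "Speaker B": "Client", "B": "Client"}

def relabel(vtt_text, mapping=LABEL_MAP):
    # Replace a generic label at the start of a cue text line; lines that
    # don't match a known label are left untouched.
    def swap(match):
        return mapping.get(match.group(1), match.group(1)) + ": "
    return re.sub(r"^(Speaker [A-Z]|[A-Z]): ", swap, vtt_text, flags=re.MULTILINE)

cue = "1\n00:00:04.000 --> 00:00:08.960\nA: Tell me about your goal.\n"
print(relabel(cue))
```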

AssemblyAI CLI

AssemblyAI also provides a CLI tool that simplifies the upload-transcribe-download workflow into a single command.


Option 3: Other transcription services

Any service that produces a VTT file with speaker labels and timestamps will work. Some alternatives:

  • Otter.ai. Speaker labels: yes. Export as TXT or SRT (convert to VTT). Free tier available with limits.
  • Rev. Speaker labels: yes. Exports SRT and VTT. Human and AI transcription options.
  • Descript. Speaker labels: yes. Export as SRT or VTT. Also includes audio/video editing.

If the service produces SRT instead of VTT, the format is nearly identical. Add a WEBVTT header at the top, change the comma before the milliseconds in each timestamp to a dot, and rename the file extension to .vtt, or use an online converter.
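The SRT-to-VTT conversion can also be scripted. A minimal sketch covering the two real differences, the header and the timestamp separator (SRT writes 00:00:04,000 where VTT writes 00:00:04.000):

```python
import re

def srt_to_vtt(srt_text):
    # SRT timestamps use a comma before the milliseconds; WebVTT uses a
    # dot. Swap the separator everywhere it appears in a timestamp, then
    # prepend the required header.
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text.strip())
    return "WEBVTT\n\n" + body + "\n"

srt = "1\n00:00:04,000 --> 00:00:08,960\nCoach: Tell me about your goal.\n"
print(srt_to_vtt(srt))
```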


Importing into the coaching platform

Once you have your recording file and VTT transcript:

  1. Open the Staging page in the coaching platform
  2. Import your recording. Drag or select the video (MP4) or audio (M4A, MP3, WAV) file
  3. Import your VTT. Drag or select the .vtt transcript file
  4. Assign speakers. Map the speaker labels in the VTT (e.g. Speaker A, Speaker B) to Coach and Client
  5. Import into a session. Create a new session or add to an existing one

All platform features (session review with synced transcript playback, talk-time analytics, ICF competency tagging, and AI assessment) work identically regardless of which platform the recording came from.


Without a transcript

If you do not need AI features, skip the transcription step entirely. Import just the recording via Staging and use all core features:

  • Session management and status tracking
  • ICF certification hours, freshness window, and submission readiness
  • People and peer coaching pair management
  • Coach development areas and action plans
  • Coaching log import and export
  • Full data portability between editions

On the roadmap

Built-in audio extraction and transcription, so you can skip the manual steps above, is being considered for a future release. As an early adopter, you can vote for this feature on the Roadmap to help prioritise it.