Choosing a model
The platform uses AI to evaluate coaching sessions against ICF competency rubrics. The quality of feedback depends on the model you choose.
Not all AI models are equal. Reasoning models and rigorous chat models produce honest, evidence-based assessments. General-purpose chat models tend to produce inflated scores with vague evidence. This page explains the difference and recommends specific models for each supported provider.
| Provider | Recommended model | Assessment quality | Speed |
|---|---|---|---|
| Ollama | qwen3:30b | Critical, evidence-based, zero cost. Standard mode only | Varies by hardware |
| LM Studio | Qwen3.5 35B A3B | Use with caution, optimistic verdicts. Standard mode only | Varies by hardware |
| llama-server | Qwen3.5 35B A3B (GGUF) | Critical, evidence-based, zero cost. Deep analysis with --reasoning-budget | Varies by hardware |
| OpenAI | o3-mini (reasoning) | Honest, critical, evidence-based | Fast |
| Anthropic | claude-sonnet-4-6 | Honest, evidence-based, rigorous | Slower (~4 min) |
| xAI (Grok) | grok-4-1-fast-reasoning | Honest, critical, evidence-based | Very fast (<1 min) |
| Perplexity | sonar-reasoning-pro | Use with caution, optimistic verdicts | Fast |
| Mistral | mistral-large | Untested for coaching assessment | Fast |
| Groq | qwen/qwen3-32b, llama-3.3-70b-versatile | Pending confirmation (see notes) | Very fast |
| Google Gemini | No recommended models | Inflated scores, positivity bias | Fast |
Why reasoning models matter
Most AI chat models are designed to be agreeable. When asked to score a coaching session, they tend to rate everything highly and provide generic feedback like "good use of open questions." This is not useful for professional development.
Reasoning models (OpenAI o-series, Qwen3, DeepSeek-R1) work differently. They evaluate against rubrics systematically rather than generating agreeable text. This produces more honest and actionable coaching feedback, with specific transcript evidence for each competency score.

What I tested
I ran a direct comparison using a real coaching session transcript. Both models received the same prompt, the same transcript, and the same parameters (temperature 0.4). Each model evaluated the session against all 22 ICF competency markers.
| | GPT-4o | qwen3:30b |
|---|---|---|
| Markers rated "Meets" | 20 of 22 | 4 of 22 |
| Markers rated "Partially meets" | 1 | 10 |
| Markers rated "Does not fully meet" | 0 | 8 |
| Evidence style | Vague timestamps, many reusing the same reference | Precise timestamps with actual transcript quotes |
| Feedback tone | Positive and flattering | Specific and constructive |
| Submission recommendation | Recommended submission | Recommended against submission |
The same transcript, the same rubric, the same prompt. One model said "nearly perfect." The other identified concrete growth areas with evidence.
Where chat models fall short
Four patterns emerged from the comparison.
1. Evidence quality. GPT-4o repeatedly cited the same vague timestamp for multiple markers, suggesting it did not engage with distinct parts of the transcript. qwen3:30b cited specific, different timestamps with actual quotes for each marker.
2. Critical honesty. The prompt explicitly states "be constructive but honest; coaches need actionable feedback." GPT-4o rated 20 of 22 markers as "Meets." qwen3:30b identified real issues: the coach sharing personal experiences instead of focusing on the client, no explicit session agreement, and an abrupt closing. These are legitimate coaching deficiencies that GPT-4o glossed over.
3. Sycophancy bias. GPT-4o has a well-documented tendency to be agreeable and avoid negative feedback. OpenAI acknowledged this problem publicly in April 2025. For coaching assessment, this makes it counterproductive. Telling a coach everything is "Meets" when there are real growth areas does not help them improve.
4. Actionable feedback. For the same competency marker (supporting expression of feelings), GPT-4o produced vague praise: "You effectively supported the client in expressing their feelings." qwen3:30b produced a specific alternative the coach could use next time: "Reflect feelings: 'It sounds like you feel worried when they don't ask for help, and that's hard to watch.'"
General-purpose chat models optimised for helpfulness and agreeableness are fundamentally unsuited for rubric-based professional assessment where honest, critical feedback is the entire point.
OpenAI: recommended models
| Model | Type | Best for | Approximate cost |
|---|---|---|---|
| o3-mini | Reasoning | Best value. Honest, critical evaluation | ~$1.10 input / $4.40 output per 1M tokens |
| o3 | Reasoning | Strongest critical analysis, highest quality | Higher cost |
| gpt-4o, gpt-4o-mini | General | Avoid for scoring. Known positivity bias produces inflated scores | Lower cost |
Reasoning models (o-series) evaluate against rubrics systematically rather than generating agreeable text, producing more honest and actionable coaching feedback.
Recommendation: Start with o3-mini. It provides honest, critical feedback at a low cost. Move to o3 if you want the strongest possible analysis and don't mind the higher cost.
Not recommended for coaching assessment
| Model | Reason |
|---|---|
| GPT-4o | Well-documented sycophancy, confirmed by OpenAI |
| GPT-4o-mini | Same fundamental architecture and bias tendencies |
| GPT-4.5 | "Better emotional intelligence" often means more agreeable, not less |
These models are fine for general tasks but produce inflated scores when used for rubric-based coaching evaluation.
Anthropic: recommended models
| Model | Type | Best for | Approximate cost |
|---|---|---|---|
| claude-sonnet-4-6 | Chat (rigorous) | Honest, evidence-based assessments with specific transcript evidence | ~$3 input / $15 output per 1M tokens |
Claude Sonnet 4.6 is not a reasoning model, but it is significantly less sycophantic than general-purpose chat models. It produces honest assessments with concrete evidence, specific timestamps, and actionable feedback. It is willing to recommend against submission when real gaps exist.
Recommendation: Use claude-sonnet-4-6 if you want rigorous assessments from a cloud provider and prefer Anthropic's privacy stance (API data is not used for training).
Claude assessments for complex sessions (PCC/MCC with 39 markers) take approximately 4 minutes. The platform's default timeout is configured to handle this. If you see timeout errors, increase the timeout in Settings > AI > Advanced tuning.
xAI (Grok): recommended reasoning model
| Model | Type | Best for | Speed |
|---|---|---|---|
| grok-4-1-fast-reasoning | Reasoning | Honest, critical evaluation with specific evidence | Under 1 minute |
Grok's reasoning model produces rigorous, honest assessments comparable to OpenAI o-series and Claude Sonnet. In testing, it recommended against PCC submission for the same session that Claude and o3 also flagged, identifying specific weak markers (3.2, 8.5, 8.7) in agreements, actions, and accountability.
The standout advantage is speed. Grok completed a full PCC assessment in under 1 minute, compared to approximately 4 minutes for Claude Sonnet on the same session. It also provided concrete alternative phrasings with specific timestamps.
Recommendation: Use grok-4-1-fast-reasoning if you want fast, rigorous feedback. Avoid the "non-reasoning" Grok variants (same positivity bias risk as other chat models).
Google Gemini: not recommended for assessment
| Model | Type | Assessment quality |
|---|---|---|
| Gemini 2.5 Flash | Chat | Inflated scores, positivity bias |
| Gemini 2.5 Pro | Chat | Worse than Flash, even more effusive praise |
Neither Gemini Flash nor Gemini Pro is suitable for coaching assessment. Both models exhibit the same sycophancy pattern as GPT-4o, producing inflated scores with vague evidence.
In direct testing with the same coaching session transcript, Gemini 2.5 Flash rated a session as "outstanding candidate for MCC submission" with "masterful level coaching." Claude Sonnet 4.6 assessed the same session as "not yet ready for MCC submission" and identified concrete competency gaps.
Gemini 2.5 Pro was even more effusive than Flash, finding it "difficult" to identify growth opportunities. This confirms that model size alone does not fix positivity bias.
Gemini models are fine for general tasks and have generous free tiers, but if you use them for coaching assessment, treat the results as directional only, not as reliable professional feedback.
Perplexity: use with caution
| Model | Type | Assessment quality |
|---|---|---|
| sonar-reasoning-pro | Reasoning | Specific evidence but optimistic verdicts |
Perplexity's sonar-reasoning-pro is a reasoning model that provides specific timestamps and concrete alternative phrasings. The evidence quality is reasonable, but overall verdicts tend to be more generous than Claude Sonnet or OpenAI o-series on the same transcript.
In testing, sonar-reasoning-pro rated a session as "genuinely PCC-level" and "conditional yes" for MCC submission, while Claude Sonnet assessed the same session as "not yet ready" for either level. The feedback identified some real weaknesses (markers 3.3, 6.3, 7.2, 8.2) but framed them as minor refinements rather than significant gaps.
If using Perplexity for coaching assessment, cross-reference results with a more rigorous model (Claude Sonnet or OpenAI o-series) before acting on submission recommendations. Treat overall verdicts as optimistic.
Note: sonar-reasoning was deprecated in December 2025. Use sonar-reasoning-pro instead.
Mistral: untested
| Model | Type | Notes |
|---|---|---|
| mistral-large | Chat | European provider, data processed in EU |
Mistral is a European AI provider based in Paris. Their models have not been tested for coaching assessment quality. Mistral is included as a supported provider for users who require EU data processing for GDPR compliance.
If you test Mistral for coaching assessment, please share your results so this page can be updated with recommendations.
Groq: pending confirmation
| Model | Type | Notes |
|---|---|---|
| qwen/qwen3-32b | Reasoning (hosted) | Chain-of-thought reasoning, rigorous on Ollama — pending Groq confirmation |
| llama-3.3-70b-versatile | Open-source (hosted) | Largest available model, good general reasoning — pending confirmation |
Groq is an inference provider that runs open-source models on custom LPU (Language Processing Unit) hardware designed for speed. Groq is not a model creator — it hosts models built by others (Meta's Llama, Alibaba's Qwen, Mistral's Mixtral). Your data goes to Groq's infrastructure, not to the model creators.
Groq vs Ollama: Groq offers extremely fast cloud inference with a free tier, while Ollama runs models locally with complete data privacy and no rate limits. If data privacy is your primary concern, use Ollama. If speed and convenience matter more, Groq is a strong alternative.
Data privacy: Groq does not use API inputs or outputs for model training. No data retention by default. Zero Data Retention (ZDR) is available in Groq account settings for stronger guarantees.
Recommended models: qwen/qwen3-32b has been tested as rigorous on Ollama (local) and is expected to perform similarly on Groq. llama-3.3-70b-versatile is the largest model available on Groq and offers good general reasoning. Assessment quality on Groq has not yet been fully confirmed for either model.
The Groq free tier has tokens-per-minute (TPM) limits that are too low for coaching transcripts (which typically exceed 16,000 tokens). You will need the Groq Developer tier to use Groq with the coaching platform. Hard spend limits are available to control costs.
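If you want to confirm which models your key can access before picking one in Settings > AI, Groq exposes an OpenAI-compatible models endpoint. A minimal command-line check, assuming the standard api.groq.com base URL and an API key stored in the GROQ_API_KEY environment variable:

```bash
# List the models available to this Groq API key.
# GROQ_API_KEY is assumed to hold your key; the URL is Groq's
# OpenAI-compatible models endpoint.
curl -s https://api.groq.com/openai/v1/models \
  -H "Authorization: Bearer $GROQ_API_KEY"
```

Look for qwen/qwen3-32b or llama-3.3-70b-versatile in the returned list before selecting either in the platform.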
LM Studio: use with caution
| Model | Type | Assessment quality | Speed |
|---|---|---|---|
| Qwen3.5 35B A3B (Q8_0) | MoE reasoning | Good evidence but optimistic verdicts | ~3 min for PCC |
LM Studio runs models locally with complete data privacy. Qwen3.5 35B A3B uses chain-of-thought reasoning (thinking mode) and provides specific timestamps, concrete alternative phrasings, and identifies some real growth areas.
However, in testing against the same session that Claude Sonnet and xAI recommended against PCC submission, Qwen3.5 35B A3B rated 37 of 39 markers as "Meets" and called it a "strong candidate for PCC submission." The evidence quality is good, but the overall verdict is more generous than rigorous cloud models.
If using LM Studio for coaching assessment, cross-reference results with a more rigorous model (Claude Sonnet or OpenAI o-series) before acting on submission recommendations. Treat overall verdicts as optimistic.
Context length setup
LM Studio's default context length (4,096 tokens) is too small for coaching assessments. You must increase it in LM Studio's model settings:
- Minimum: 32,768 tokens (32K) for shorter sessions
- Recommended: 65,536 tokens (64K) for full session transcripts on 64 GB+ systems
After changing the context length in LM Studio, set the matching value in the platform's Settings > AI > Advanced tuning > Context limit field.
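Before running an assessment, you can also confirm that LM Studio's local server is reachable and that a model is loaded. A minimal check, assuming the server is enabled on LM Studio's default port (1234); adjust the port if you changed it in the server settings:

```bash
# Ask LM Studio's OpenAI-compatible server which models are currently loaded.
curl -s http://localhost:1234/v1/models
```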
System requirements
| System memory | Recommended model | Notes |
|---|---|---|
| 64 GB+ | Qwen3.5 35B A3B (Q8_0) | Best quality, MoE architecture keeps active parameters low |
| 16-32 GB | Qwen3.5 9B (Q8_0) | Good balance of quality and memory usage |
Recommendation: Use LM Studio if you want local, private AI assessments and are comfortable cross-referencing with a cloud model for submission decisions. For rigorous standalone assessments, use OpenAI o-series or Claude Sonnet instead.
llama-server: recommended for advanced local AI
| Model | Type | Assessment quality | Speed |
|---|---|---|---|
| Qwen3.5 35B A3B (Q4_K_M GGUF) | MoE reasoning | Critical, evidence-based, zero cost | Varies by hardware |
llama-server runs models locally with full data privacy and provides an OpenAI-compatible API. It supports server-side reasoning budgets, giving you precise control over how much the model reasons before responding.
Recommended launch command
```bash
llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --flash-attn on \
  --ctx-size 32768 \
  --jinja \
  --reasoning-budget 2048 \
  --host 0.0.0.0 \
  --parallel 1
```
Required flags
| Flag | Value | Why |
|---|---|---|
| --reasoning-budget | 2048 | Controls the thinking token budget. The server enforces this as a soft cap on reasoning tokens before generating the response. Higher values (e.g. 4096) allow more thorough reasoning but increase generation time and risk sycophantic all-green assessments. |
| --parallel | 1 | Required. Concurrent requests corrupt the KV cache, causing leaked context between assessments. Always set to 1. |
| --jinja | (flag) | Required for Qwen3.5 chat template processing. |
| --ctx-size | 32768 | Context window size. Match this to the platform's context limit setting. |
| --flash-attn | on | Enables flash attention for faster inference. |
Do not increase --parallel above 1. When multiple requests run concurrently, the KV cache becomes corrupted and assessment context leaks between requests, producing unreliable results.
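Once the server is running, a quick way to confirm it is reachable and the model has finished loading, assuming the default port (8080):

```bash
# llama-server exposes a /health endpoint; it reports the server as ready
# once the model has finished loading.
curl -s http://localhost:8080/health
```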
How reasoning budget works
The --reasoning-budget flag is a server-side control. The platform does not send thinking flags via the API — the server decides how many tokens to spend reasoning based on this budget. Reasoning tokens count against the total max_tokens budget, so the platform sets max_tokens to 16384 to leave room for both reasoning (~2048 tokens) and content (~14K tokens).
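For illustration, here is a minimal request of the kind sent to llama-server's OpenAI-compatible endpoint. This is a sketch only: the port, message content, and sampling values are placeholders, not the platform's exact payload.

```bash
# Illustrative chat completion request. max_tokens must cover both the
# reasoning tokens (~2048 under the launch command above) and the
# assessment content, so it is set well above the reasoning budget.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Assess this transcript against the PCC markers: ..."}
    ],
    "temperature": 0.6,
    "max_tokens": 16384
  }'
```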
Recommendation: Start with --reasoning-budget 2048. This provides enough reasoning for critical analysis without over-thinking. If assessments seem shallow, increase to 4096. If assessments become overly positive (all markers rated "Meets"), reduce the budget.
Ollama: recommended models by system memory
| System memory | Recommended model | Context window | Quality |
|---|---|---|---|
| 16 GB | qwen3:4b or qwen3:8b | 16k to 32k tokens | Faster but less nuanced feedback |
| 32 GB | qwen3:14b | 32k tokens | Good balance of quality and speed |
| 64 GB+ | qwen3:30b or larger | 64k+ tokens | Critical, evidence-based scoring with specific transcript references |
Qwen3 models produce honest, critical coaching feedback with specific transcript evidence. They outperform equivalently-sized models for structured rubric evaluation.
Recommendation: If you have 32 GB RAM or more, start with qwen3:14b. On 16 GB Macs, use qwen3:8b for shorter sessions or consider OpenAI for longer ones.
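To make a model available before selecting it in Settings > AI, pull it with the Ollama CLI. For example, for the 32 GB recommendation:

```bash
# Download the recommended model and confirm it appears in the local library.
ollama pull qwen3:14b
ollama list
```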
How memory affects quality
The coaching platform automatically adjusts the context window based on your system memory. With less memory, long transcripts may be truncated, producing shorter assessments.
| System memory | Max context | Effect on long sessions |
|---|---|---|
| 16 GB | 8,192 tokens | May truncate long transcripts |
| 32 GB | 32,768 tokens | Handles most sessions |
| 64 GB | 65,536 tokens | Handles long sessions comfortably |
| 128 GB+ | 131,072+ tokens | No practical limit |
To check your system memory:
- macOS: Click the Apple menu > About This Mac. Your memory is listed as "Memory" (e.g. "16 GB" or "32 GB").
- Windows: Open Settings > System > About. Your installed RAM is listed under "Device specifications".
Tips
- If an assessment seems shallow or overly positive, try a reasoning model (OpenAI o-series) or a rigorous chat model (Claude Sonnet).
- If you see "transcript too long" errors, use a cloud provider for that session or try a shorter recording.
- If you see timeout errors with Claude, increase the timeout in Settings > AI > Advanced tuning for the Anthropic provider.
- You can switch between providers at any time in Settings > AI. Your existing assessments are preserved.
- Generating a new assessment for the same session replaces the previous one