There is no standard way to evaluate whether a fine-tuned LLM is actually better than the base model for your specific task
You fine-tuned Llama 3 8B on 50K customer support conversations. Training finished. Is the model better than base Llama 3 8B for customer support? You run it on 100 test queries. It sounds more 'on brand.' But is it more accurate? Does it hallucinate less? Does it handle edge cases better? You have no ground truth labels for those 100 queries, so you read and judge each response yourself. That takes 8 hours, and your judgment is subjective: another reviewer might rate 30% of the responses differently. You think the model is 15% better. Your cofounder thinks it is 5% worse. You have no way to settle this without recruiting 5 more evaluators, which costs $500-1,000 in labor.

So what?

The entire value of fine-tuning depends on measurable improvement, but measurement is the unsolved bottleneck. If you cannot reliably quantify whether fine-tuning helped, you cannot justify the cost ($2,000-50,000), you cannot compare fine-tuning approaches, and you cannot decide when to stop training. You end up making decisions by vibes. LLM-as-judge (using GPT-4 to evaluate outputs) helps, but it introduces its own biases and costs $0.05-0.50 per evaluation.

Why does this persist?

Open-ended language tasks have no single ground truth. 'Is this customer support response good?' depends on accuracy, tone, completeness, conciseness, and brand alignment, all of them subjective dimensions. Creating gold-standard evaluation datasets requires domain experts (customer support managers, not ML engineers), who are expensive and slow. LMSYS Chatbot Arena showed that crowd-sourced evaluation works at scale, but it requires thousands of ratings per model comparison.
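To make the LLM-as-judge option concrete: the pairwise comparison loop itself is only a few dozen lines. The sketch below assumes the `openai` Python client and a `gpt-4o` judge; the prompt, the model name, and the both-orders voting rule are illustrative choices, not a standard. The position swap is one common way to blunt the judge's known bias toward whichever answer appears first.

```python
# Minimal pairwise LLM-as-judge sketch. Assumes the `openai` Python package
# and OPENAI_API_KEY in the environment; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are judging two customer support responses to the same query.
Query: {query}

Response A: {a}

Response B: {b}

Which response is more accurate, complete, and on-brand? Answer with exactly one of: A, B, TIE."""


def judge_pair(query: str, base_answer: str, finetuned_answer: str) -> str:
    """Ask the judge twice with the answers in both orders to reduce position bias."""
    votes = []
    for a, b, label_a, label_b in [
        (base_answer, finetuned_answer, "base", "finetuned"),
        (finetuned_answer, base_answer, "finetuned", "base"),
    ]:
        resp = client.chat.completions.create(
            model="gpt-4o",  # judge model; each call lands in the $0.05-0.50 range cited above
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, a=a, b=b)}],
            temperature=0,
        )
        verdict = resp.choices[0].message.content.strip().upper()
        votes.append({"A": label_a, "B": label_b}.get(verdict, "tie"))
    # Only count a win when both orderings agree; otherwise treat it as a tie.
    return votes[0] if votes[0] == votes[1] else "tie"
```

Even with the position swap, agreement with human judges tops out around the MT-Bench figure cited under Evidence, and every comparison is another paid GPT-4-class call.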
Evidence
LMSYS Chatbot Arena requires 10,000+ human ratings for reliable model ranking. LLM-as-judge (MT-Bench) shows 80% agreement with human judges — meaning 20% disagreement. OpenAI's evals framework provides structure but requires custom eval creation per task. No automated tool answers 'is my fine-tuned model better than base for my specific use case' without significant human evaluation effort.
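One way to see why Chatbot Arena needs so many ratings: treat each non-tied pairwise judgment as a coin flip and ask when a measured win rate stops being compatible with 50/50. A back-of-the-envelope sketch, assuming a Wilson score interval and an illustrative 55% win rate for the fine-tuned model:

```python
# Rough sketch: how many pairwise judgments before a win rate is meaningful?
# Wilson score interval for a binomial proportion; the 55% win rate is illustrative.
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for the fine-tuned model's win rate."""
    if total == 0:
        return (0.0, 1.0)
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)


for total in (50, 200, 1000, 5000):
    wins = round(0.55 * total)  # suppose the fine-tune wins ~55% of non-tied comparisons
    lo, hi = wilson_interval(wins, total)
    verdict = "better than base" if lo > 0.5 else "indistinguishable from base"
    print(f"{total:>5} comparisons: win rate {wins / total:.2f}, "
          f"95% CI ({lo:.3f}, {hi:.3f}) -> {verdict}")
```

At 50 or even 200 comparisons the interval still straddles 0.5, so "it feels 15% better" and "it feels 5% worse" are both inside the noise; it takes on the order of a thousand non-tied judgments before a modest edge becomes a defensible claim.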