The Report in Brief
The introduction of the F1 score in Vol. II of this ongoing series of publications allowed us to think
differently about how we report our findings on the evaluation of Large Language Model (LLM)
performance when applied to specific insurance use cases. You will see that evolved thinking reflected in
this report.
We believe the aggregated F1 score, both per scenario and overall, provides the insights required
to understand how the tested LLMs perform against common insurance industry use cases and whether the costs
associated with their deployment are in line with performance. This approach also makes it easier to see
how each model performs on what could be considered “simple scenarios” (information extraction from text
fields, amounts, dates, etc.) as opposed to “complex scenarios” (tasks involving several steps and/or information
extraction from lists or fields that are themselves complex objects).
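For readers less familiar with the metric, the sketch below shows one way a per-scenario F1 and an overall aggregate can be computed. The scenario names, the counts, and the choice of a simple macro-average for the overall score are illustrative assumptions, not the report's exact methodology.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall,
    computed from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (tp, fp, fn) counts for one model across two scenario types.
scenarios = {
    "simple: date extraction": (90, 5, 10),
    "complex: multi-step list parsing": (60, 20, 25),
}

per_scenario = {name: f1_score(*counts) for name, counts in scenarios.items()}
# One possible overall aggregate: the unweighted (macro) average across scenarios.
overall = sum(per_scenario.values()) / len(per_scenario)
```

A macro-average treats each scenario equally regardless of how many items it contains; a micro-average (pooling counts before computing F1) would instead weight scenarios by volume.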
Based on advancements in LLM technology introduced to the market since Vol. II, our data
science and research teams have added six new LLMs - GPT-4o Mini, Claude 3.5 Sonnet, Mistral Large 2407,
Llama 3.1-405B, Llama 3.1-70B, and Llama 3.1-8B - to the testing. Command R and Command R+ were removed
from evaluation.
As always, this report would not be possible without the efforts of Shift’s data science and research teams.