Testing LLMs
There are several roles in an organization that may need to validate the output of an application:
- Security: needs to verify that the application behaves as expected under abnormal conditions, including malicious attacks, and that sensitive data remains protected
- Developers: need to ensure that output matches expected behavior and accuracy, especially across software updates (for example, database version changes)
- Disaster Recovery/Incident Response: needs to confirm that an application that has been impacted in some way is once again performing at the same level as before the incident
In traditional application stacks, the process for testing each of these concerns is clear and well understood. In the new world of LLMs, however, testing is very different and far from trivial. Here’s an example that highlights why this is such an interesting challenge:
Overview
OpenAI provides an API to its GPT LLMs. When you use the API, you specify the model you’d like to use. If you call the API with just the model name `gpt-3.5-turbo` (as many do), the actual model you get changes over time as the company retrains and fine-tunes the product. For example:
- At the time of this post, `gpt-3.5-turbo` actually points to `gpt-3.5-turbo-0613`, which was released in June of 2023.
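
To make the aliasing concrete, here is a minimal sketch using the official `openai` Python client (v1+). The helper name `ask` and the prompt are just illustrative; the point is that the response reports which concrete snapshot actually served the request, so you can see the alias resolve differently over time and pin a dated snapshot for comparable test runs.

```python
# Minimal sketch: compare the floating alias with a pinned, dated snapshot.
# Assumes the `openai` package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> tuple[str, str]:
    """Send a single prompt and return (resolved_model, answer)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variation
    )
    # The response's `model` field names the concrete snapshot that served the call.
    return response.model, response.choices[0].message.content

# Floating alias: the snapshot it resolves to can change as OpenAI updates the model.
print(ask("gpt-3.5-turbo", "What is 2 + 2?"))

# Pinned snapshot: keeps test runs comparable until the snapshot is retired.
print(ask("gpt-3.5-turbo-0613", "What is 2 + 2?"))
```

Pinning a dated snapshot is what makes before/after comparisons meaningful in a test suite, but it only defers the problem: snapshots are eventually deprecated, and your tests must survive the move to whatever replaces them.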