Testing LLMs
There are several roles in an organization that may need to validate the output of an application:
- Security: needs to verify that the application behaves as expected under abnormal conditions, including malicious attacks, and that sensitive data remains protected
- Developers: need to ensure that output matches expected behavior and accuracy, especially across software updates (for example, database version changes)
- Disaster Recovery/Incident Response: needs to confirm that an application that has been impacted in some way is once again performing at the same level as before the incident
In traditional application stacks, the process for testing each of these concerns is clear and well understood. In the new world of LLMs, however, testing is very different and far from trivial. Here’s an example that highlights why this is such an interesting challenge:
Overview
OpenAI provides an API to its GPT LLMs. When you use the API, you specify the model you’d like to use. If you call the API with just the model name `gpt-3.5-turbo` (as many do), the actual model you get changes over time as the company retrains and fine-tunes the product. For example:
- At the time of this post, `gpt-3.5-turbo` actually points to `gpt-3.5-turbo-0613`, which was released in June of 2023.
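
To make the aliasing concrete, here is a minimal sketch using the official `openai` Python client (v1+). The helper name `ask` and the prompt are just illustrative; the point is that the response reports which concrete snapshot actually served the request, so you can see the alias resolve differently over time and pin a dated snapshot for comparable test runs.

```python
# Minimal sketch: compare the floating alias with a pinned, dated snapshot.
# Assumes the `openai` package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> tuple[str, str]:
    """Send a single prompt and return (resolved_model, answer)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variation
    )
    # The response's `model` field names the concrete snapshot that served the call.
    return response.model, response.choices[0].message.content

# Floating alias: the snapshot it resolves to can change as OpenAI updates the model.
print(ask("gpt-3.5-turbo", "What is 2 + 2?"))

# Pinned snapshot: keeps test runs comparable until the snapshot is retired.
print(ask("gpt-3.5-turbo-0613", "What is 2 + 2?"))
```

Pinning a dated snapshot is what makes before/after comparisons meaningful in a test suite, but it only defers the problem: snapshots are eventually deprecated, and your tests must survive the move to whatever replaces them.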