In this article, we are going to explore two layers of testing for LLM assistants.
As a running example, consider a beverage machine assistant: a conversational interface through which users can order beverages, customize their drinks, and manage their orders.
The beverage machine has some basic functionalities that we have defined in code:
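The article does not reproduce those function definitions, so here is a minimal sketch of what they might look like. The names, signatures, and the in-memory order list are illustrative assumptions, not the original implementation:

```python
# Sketch of the beverage machine's basic functions.
# MENU contents, function names, and the module-level `order`
# list are assumptions for illustration only.

MENU = {"espresso", "latte", "cappuccino", "americano"}

order = []  # the current order, as a list of drink dicts

def add_drink(name, size="medium", extras=None):
    """Add a drink to the current order if it is on the menu."""
    if name not in MENU:
        return f"Sorry, {name} is not available."
    order.append({"name": name, "size": size, "extras": extras or []})
    return f"Added a {size} {name} to your order."

def cancel_order():
    """Clear the current order."""
    order.clear()
    return "Your order has been cancelled."
```

The assistant exposes these functions to the LLM as callable tools; the tests below then check that the model picks the right one for a given user utterance.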
First, we want to make sure the assistant understands when to call these functions; this is where unit tests come in. As we write unit tests, we fine-tune the assistant to recognize exactly when to trigger each function, refining it until it handles requests reliably.
The example code below checks that the assistant accurately recognizes and triggers the "cancel_order" function across a variety of input phrasings.
import pytest

@pytest.mark.repeat(5)  # requires the pytest-repeat plugin; reruns each case to catch non-deterministic failures
@pytest.mark.parametrize("order_input", [
    "I think I don't want anything anymore! ciao!",
    "Cancel my order. Goodbye!",
    "I changed my mind. No more drinks for me. Bye.",
    "I'm not in the mood for coffee anymore. Cancel my order.",
    "Sorry, cancel the order. I've changed my mind.",
])
def test_cancel_order_function_call(order_input):
    assistant = BeverageAssistant()
    reply = assistant.run_order(order_input)
    print("reply from assistant:", reply)
    # The second message in the reply should be the model's function call
    assert reply[1].function_call.name == "cancel_order"
Once we have ensured that the assistant understands its fundamental functions, the next step is to subject it to more dynamic scenarios. This phase is crucial for validating that the assistant's responses align with expectations. To accomplish this, we tested our beverage assistant by pairing it with a second LLM.
Distinct from the beverage assistant itself, we introduced a separate validation model designed specifically to assess the responses the assistant generates. The aim is to define test scenarios covering a range of interaction cases with the agent. For this purpose, we used Gherkin syntax.
The input for our validation LLM includes the user's input prompt, the assistant's response, and the specific test scenario being evaluated. Below is an illustrative sample test scenario:
SCENARIO-3: User orders an unavailable drink
GIVEN: The user provides any size or extras.
WHEN: The user requests a specific drink.
THEN: The assistant refrains from adding the drink to the final order list.
AND: The assistant informs the user about the unavailability of the item.
Here's an actual example output of a scenario:
User Prompt: Can I get a milkshake?
Assistant: I'm sorry, but we currently don't have milkshakes available. Is there any other drink you would like to order?
Validation Result: Valid. The user requested a milkshake, aligning with the given scenario of ordering an unavailable drink. The assistant accurately identifies the unavailability of milkshakes and communicates this to the user. The final order result remains empty, indicating that the milkshake was not added to the order. Therefore, the assistant's response is correct within the context of the specified scenario.
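One way to wire this step up is sketched below. The prompt wording and the helper names (`build_validation_prompt`, `is_valid`) are assumptions for illustration, not the article's actual code: the idea is simply to assemble the scenario, user prompt, and assistant response into a single input for the validation LLM, then interpret its verdict.

```python
# Sketch of preparing input for the validation LLM and reading its verdict.
# Template wording and helper names are illustrative assumptions.

VALIDATION_TEMPLATE = """You are a strict test validator.

Test scenario (Gherkin):
{scenario}

User prompt: {user_prompt}
Assistant response: {assistant_response}

Reply with "Valid" or "Invalid", followed by a one-paragraph justification."""

def build_validation_prompt(scenario, user_prompt, assistant_response):
    """Assemble the input for the validation LLM."""
    return VALIDATION_TEMPLATE.format(
        scenario=scenario,
        user_prompt=user_prompt,
        assistant_response=assistant_response,
    )

def is_valid(verdict):
    """Interpret the validation LLM's verdict; a pytest test asserts on this."""
    return verdict.strip().lower().startswith("valid")
```

In a pytest test, the prompt would be sent to the validation model, and the test would assert `is_valid(verdict)` on the text it returns.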
In conclusion, rigorous testing, spanning both basic functionality tests and dynamic scenario tests, is essential when developing LLM assistants. The beverage machine assistant example illustrates how careful unit tests refine the assistant's recognition of user intents, while a separate validation LLM assesses its behavior across diverse scenarios, ultimately contributing to a reliable and user-friendly conversational experience.