In this article, we are going to explore two layers of testing for LLM assistants.
As a running example, consider a beverage machine assistant: a conversational interface through which users can order beverages, customize their drinks, and manage their orders.
The beverage machine has some basic functionalities that we have defined in code:
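The article does not reproduce those function definitions, so here is a minimal sketch of what they might look like. The names, signatures, and the in-memory order list are illustrative assumptions, not the original implementation:

```python
# Sketch of the beverage machine's basic functions.
# MENU contents, function names, and the module-level `order`
# list are assumptions for illustration only.

MENU = {"espresso", "latte", "cappuccino", "americano"}

order = []  # the current order, as a list of drink dicts

def add_drink(name, size="medium", extras=None):
    """Add a drink to the current order if it is on the menu."""
    if name not in MENU:
        return f"Sorry, {name} is not available."
    order.append({"name": name, "size": size, "extras": extras or []})
    return f"Added a {size} {name} to your order."

def cancel_order():
    """Clear the current order."""
    order.clear()
    return "Your order has been cancelled."
```

The assistant exposes these functions to the LLM as callable tools; the tests below then check that the model picks the right one for a given user utterance.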
First, we want to make sure the assistant understands when to call these functions; this is where unit tests come in. As we write unit tests, we fine-tune the assistant to recognize exactly when to trigger each function, refining it until it handles requests reliably.
The example code below checks that the assistant accurately recognizes and triggers the "cancel_order" function across a variety of input phrasings.
import pytest

@pytest.mark.repeat(5)  # requires the pytest-repeat plugin; reruns each case to catch non-deterministic failures
@pytest.mark.parametrize("order_input", [
    "I think I don't want anything anymore! ciao!",
    "Cancel my order. Goodbye!",
    "I changed my mind. No more drinks for me. Bye.",
    "I'm not in the mood for coffee anymore. Cancel my order.",
    "Sorry, cancel the order. I've changed my mind.",
])
def test_cancel_order_function_call(order_input):
    assistant = BeverageAssistant()
    reply = assistant.run_order(order_input)
    print("reply from assistant:", reply)
    # The second message in the reply should be the model's function call
    assert reply[1].function_call.name == "cancel_order"
Once we have ensured that the assistant understands its fundamental functions, the next step is to subject it to more dynamic scenarios. This phase is crucial for validating that the assistant's responses align with expectations. To accomplish this, we tested our beverage assistant by pairing it with a second LLM.
Distinct from the beverage assistant itself, we introduced a separate validation model designed specifically to assess the responses the assistant generates. The aim is to define test scenarios covering a range of interaction cases with the agent. For this purpose, we used Gherkin syntax.
The input for our validation LLM includes the user's input prompt, the assistant's response, and the specific test scenario being evaluated. Below is an illustrative sample test scenario:
SCENARIO-3: User orders an unavailable drink
GIVEN: The user provides any size or extras.
WHEN: The user requests a specific drink.
THEN: The assistant refrains from adding the drink to the final order list.
AND: The assistant informs the user about the unavailability of the item.
Here's an actual example output of a scenario:
User Prompt: Can I get a milkshake?
Assistant: I'm sorry, but we currently don't have milkshakes available. Is there any other drink you would like to order?
Validation Result: Valid. The user requested a milkshake, aligning with the given scenario of ordering an unavailable drink. The assistant accurately identifies the unavailability of milkshakes and communicates this to the user. The final order result remains empty, indicating that the milkshake was not added to the order. Therefore, the assistant's response is correct within the context of the specified scenario.
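One way to wire this step up is sketched below. The prompt wording and the helper names (`build_validation_prompt`, `is_valid`) are assumptions for illustration, not the article's actual code: the idea is simply to assemble the scenario, user prompt, and assistant response into a single input for the validation LLM, then interpret its verdict.

```python
# Sketch of preparing input for the validation LLM and reading its verdict.
# Template wording and helper names are illustrative assumptions.

VALIDATION_TEMPLATE = """You are a strict test validator.

Test scenario (Gherkin):
{scenario}

User prompt: {user_prompt}
Assistant response: {assistant_response}

Reply with "Valid" or "Invalid", followed by a one-paragraph justification."""

def build_validation_prompt(scenario, user_prompt, assistant_response):
    """Assemble the input for the validation LLM."""
    return VALIDATION_TEMPLATE.format(
        scenario=scenario,
        user_prompt=user_prompt,
        assistant_response=assistant_response,
    )

def is_valid(verdict):
    """Interpret the validation LLM's verdict; a pytest test asserts on this."""
    return verdict.strip().lower().startswith("valid")
```

In a pytest test, the prompt would be sent to the validation model, and the test would assert `is_valid(verdict)` on the text it returns.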
In conclusion, rigorous testing, spanning both basic functionality tests and dynamic scenario tests, is essential when developing LLM assistants. The beverage machine assistant example illustrates how careful unit tests refine the assistant's recognition of user intents, while a separate validation LLM assesses its behavior across diverse scenarios, ultimately contributing to a reliable and user-friendly conversational experience.