The goal of this project is to test the capabilities of LLMs to generate dev env configurations.

All tests are located in the test directory.

The tests are separated into various categories.

Tests can be run with the following command (while inside AutoDev’s dev environment):

just test [TEST-CATEGORY] [MODEL]

where:

  • [TEST-CATEGORY] is the category of the tests to be run.
  • [MODEL] is the model to run tests for.

If no test category or model is specified, all tests are run.
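
For example, to run only one category against one model (the identifiers here are illustrative, following the test layout described below):

    just test file_contents llama3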

Because of the nature of the tests, i.e. running LLMs locally, test runtime can be quite high depending on the hardware used.

Common strategies

All tests are run with the following strategies:

  • Laboratories for the Advanced Software Modelling and Design course are used as test projects. These projects are a good testing ground for the ability of LLMs to generate dev env configurations for non-trivial projects.
  • All the models made available through AutoDev have been tested for every category and lab.
  • A test is considered passed if the commands that would be run in the development workflow execute correctly in the development environment obtained from the configuration AutoDev generated (a minimal sketch of this check follows this list).
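
A minimal sketch of this pass/fail check, assuming the workflow commands are run through nix develop inside the project directory (the actual AutoDev harness may differ):

    import subprocess

    def dev_env_check(project_dir: str, workflow_commands: list[str]) -> bool:
        # Run every workflow command inside the dev shell defined by the
        # generated flake.nix; the test passes only if all of them succeed.
        for cmd in workflow_commands:
            result = subprocess.run(
                ["nix", "develop", "--command", "sh", "-c", cmd],
                cwd=project_dir,
            )
            if result.returncode != 0:
                return False  # a workflow command failed in the dev env
        return True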

Since the ‘temperature’ setting is not directly comparable across different LLMs, it is not considered among the test parameters and is simply ignored: all tested models are used with ‘default settings’.

Given the above considerations, the total number of tests executed is:

models * projects * (categories * subcategories) = 6 * 8 * (3 * 2) = 288

Projects

TODO fill this up with project names and links

Categories

For every category, two subcategories were created:

  • deep: with a depth value of 100
  • shallow: with a depth value of 3 (the assumed meaning of depth is sketched below)
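
The meaning of the depth value is not spelled out here; assuming it caps how many directory levels are included when the project tree is rendered into the prompt, a minimal sketch of such a depth-limited listing could look like this (illustrative, not AutoDev’s actual code):

    import os

    def list_tree(root: str, max_depth: int) -> list[str]:
        # Collect relative paths under root, descending at most max_depth levels.
        entries = []
        for dirpath, dirnames, filenames in os.walk(root):
            rel = os.path.relpath(dirpath, root)
            depth = 0 if rel == "." else rel.count(os.sep) + 1
            if depth >= max_depth:
                dirnames.clear()  # prune: do not descend past the depth limit
                continue          # entries here would exceed the limit
            entries.extend(os.path.join(rel, name) for name in dirnames + filenames)
        return entries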

All the test prompts are located at src/test/prompts/.

The prompt JSON files are named as follows:

<category>_<subcategory>.json
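
Combining the three categories below with the two subcategories should therefore yield six prompt files; the exact category identifiers are inferred here from the test directory names used below:

    directory_structure_deep.json
    directory_structure_shallow.json
    file_contents_deep.json
    file_contents_shallow.json
    1_shot_deep.json
    1_shot_shallow.json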

The following test categories were identified.

Project directory structure

The prompt was completed with the layout of the project directory, i.e. the directory structure and file names (but not file contents).

This prompt configuration is located at src/test/prompts/directory_structure_deep.json.
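
Continuing the earlier list_tree sketch, the completed prompt could then be assembled along these lines, where base_prompt and project_root are placeholder names (illustrative only):

    prompt = base_prompt + "\n\n" + "\n".join(list_tree(project_root, max_depth=100))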

Project files contents

The prompt was completed with the full contents of the project directory, i.e. the directory structure, file names, and file contents.

These tests are located at test/file_contents/.
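
A minimal sketch of how the full-contents completion could be rendered (names and format are illustrative, not AutoDev’s actual implementation):

    from pathlib import Path

    def project_contents_section(root: str) -> str:
        # Render every file as a path header followed by its contents,
        # ready to be appended to the prompt.
        parts = []
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                rel = path.relative_to(root)
                parts.append(f"--- {rel} ---\n{path.read_text(errors='replace')}")
        return "\n\n".join(parts)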

1-shot prompting

The model’s prompt was aided by the presence of an example of the desired result, i.e. an example of a functioning flake.nix file that defines the development shell for a simple project.

The prompt also includes the project directory structure and file contents, so this is one of the most verbose prompts tested.

These tests are located at test/1_shot.
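
A minimal sketch of how the example could be spliced into the prompt; the flake.nix below is an illustrative dev shell, not necessarily the example AutoDev actually ships:

    # Illustrative 1-shot example: a minimal flake.nix defining a dev shell.
    EXAMPLE_FLAKE = """
    {
      description = "Example dev shell";
      inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
      outputs = { self, nixpkgs }:
        let pkgs = nixpkgs.legacyPackages.x86_64-linux;
        in {
          devShells.x86_64-linux.default = pkgs.mkShell {
            packages = [ pkgs.git ];
          };
        };
    }
    """

    def one_shot_prompt(task: str, project_section: str) -> str:
        # Prepend the worked example to the task description and the
        # (verbose) project structure/contents section.
        return (
            "Example of a working flake.nix for a simple project:\n"
            + EXAMPLE_FLAKE
            + "\n" + task
            + "\n\n" + project_section
        )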

Improvements

Initial testing showed the need for the following improvements.

Response cleaning

Even when specifically ‘told’ not to use triple backtick code block delimiters (```), many models still provide Nix code delimited by them. To ensure models that struggle to omit the code block delimiters are not penalized, the delimiters are ‘manually’ removed from the generated code.

Some models include a ‘thought block’ in the generated code, showing the Chain-of-Thought process employed, even when prompted to omit it. To ensure these models are not penalized, these blocks are also ‘manually’ removed from the generated code.
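
A minimal sketch of both cleaning steps, assuming thought blocks are delimited by <think>...</think> tags (the actual delimiters vary from model to model):

    import re

    def clean_response(raw: str) -> str:
        # Drop <think>...</think> style chain-of-thought blocks (assumed tags).
        cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
        # Blank out fence lines such as ``` or ```nix.
        cleaned = re.sub(r"^```[a-zA-Z]*[ \t]*$", "", cleaned, flags=re.MULTILINE)
        return cleaned.strip()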

Flake check

LLMs often struggle to provide a correct answer on the first attempt, but the ‘chat-like’ nature of these models can be leveraged to enable them to correct their mistakes.

After the Nix code is generated, it is parsed and validated using the nix flake check command. Any errors encountered are provided to the LLM to give it a chance to improve its code.

A total of 3 attempts are made, each time providing the error message to the model and asking it to fix the errors, before giving up on generating the dev env config and notifying the user.
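
A minimal sketch of this feedback loop; chat() stands in for AutoDev’s actual model interface:

    import subprocess

    MAX_ATTEMPTS = 3

    def generate_with_feedback(chat, flake_dir: str) -> str | None:
        message = "Generate a flake.nix defining the dev shell for this project."
        for _ in range(MAX_ATTEMPTS):
            code = chat(message)  # ask the model (stand-in interface)
            with open(f"{flake_dir}/flake.nix", "w") as f:
                f.write(code)
            # Parse and validate the generated flake.
            result = subprocess.run(
                ["nix", "flake", "check", flake_dir],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return code  # the flake validates: accept it
            # Feed the error back so the model can fix its own mistakes.
            message = (
                "nix flake check reported:\n" + result.stderr
                + "\nPlease fix the errors in your flake.nix."
            )
        return None  # give up after MAX_ATTEMPTS; the caller notifies the user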
