title | datePublished | cuid | slug | canonical | cover | tags |
---|---|---|---|---|---|---|
Revolutionising Unit Test Generation with LLMs | Wed Jul 24 2024 13:25:20 GMT+0000 (Coordinated Universal Time) | clyzvne01000808ju93db6685 | revolutionising-unit-test-generation-with-llms | | | ai, llm, copilot, chatgpt |
“Discovering the unexpected is more important than confirming the known.” – George E. P. Box
As software systems grow in complexity, the importance of comprehensive testing cannot be overstated. However, writing unit tests is often a time-consuming and repetitive task that leads to developer fatigue. In this blog, we will move beyond manual unit test development towards LLM-driven, AI-assisted unit tests.
- Manual unit test development - the traditional way of writing a unit test. The developer analyses the code under test, decides on a testing framework, and covers all the code flows by writing the entire unit test along with the mocks. This is a tiresome process that takes up almost 30% of a developer's time.
```java
import static org.mockito.Mockito.*;
import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.MockitoAnnotations;
import org.springframework.http.HttpStatus;
import org.springframework.web.client.HttpClientErrorException;
import org.springframework.web.client.RestTemplate;

class UserServiceTest {

    @Mock
    private RestTemplate restTemplate;

    @InjectMocks
    private UserService userService;

    @BeforeEach
    void setUp() {
        MockitoAnnotations.openMocks(this);
        userService = new UserService(restTemplate, "http://example.com");
    }

    @Test
    void testGetUserSuccess() {
        // Arrange
        int userId = 1;
        User expectedUser = new User();
        expectedUser.setId(userId);
        expectedUser.setName("John Doe");
        when(restTemplate.getForObject("http://example.com/users/" + userId, User.class)).thenReturn(expectedUser);

        // Act
        User result = userService.getUser(userId);

        // Assert
        assertEquals(expectedUser, result);
        verify(restTemplate, times(1)).getForObject("http://example.com/users/" + userId, User.class);
    }

    @Test
    void testGetUserNotFound() {
        // Arrange
        int userId = 1;
        when(restTemplate.getForObject("http://example.com/users/" + userId, User.class)).thenThrow(new HttpClientErrorException(HttpStatus.NOT_FOUND));

        // Act & Assert
        Exception exception = assertThrows(RuntimeException.class, () -> userService.getUser(userId));
        assertEquals("User not found", exception.getMessage());
        verify(restTemplate, times(1)).getForObject("http://example.com/users/" + userId, User.class);
    }
}
```
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1720776585778/b1b89895-b1bb-4535-a3a3-b91fd3880a73.webp align="center")
- AI-based testing (Copilot) - GitHub Copilot is an advanced AI-driven tool designed to assist developers in writing code effortlessly. By simply writing a comment describing what is to be implemented, or just pausing briefly, Copilot intelligently suggests the appropriate code snippets. This functionality extends to the generation of unit tests as well.

However, there is ongoing debate regarding the efficiency of the unit tests produced by Copilot. Critics argue that many of these autogenerated tests are non-functional or insufficient. Achieving comprehensive test coverage with Copilot can take nearly as much time as traditional manual testing, which calls its overall efficiency in test generation into question.
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1720780291927/db30b6d4-51ac-4679-ac4e-46a394340fc4.gif align="center")
So what would an ideal solution look like? We would want a tool that can:

- Generate all the unit tests for complete applications or files without much human interaction.
    
- Cover major edge cases.
    
- Provide almost 80-90% coverage with ease.
    
- Validate all the tests that are generated automatically.
    
- Avoid repeating unit tests.
Well, with the current unit testing methods we cannot achieve this. We need a proper design that utilises the full power of AI.
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1720778103487/c70d7759-3f98-4246-a11f-baeadd01cca5.png align="center")
In early 2024, Meta researchers unveiled an innovative method for enhancing unit test coverage in their paper "Automated Unit Test Improvement using Large Language Models at Meta". This method, built around a tool called TestGen-LLM, promised a fully automated approach to improving code coverage with guaranteed benefits to the existing code base. The announcement made a significant impact in the software engineering community.
The design leverages a combination of Large Language Models (LLMs) and automated workflows to revolutionise unit test generation, ensuring high coverage and reliability with minimal human intervention. Here's a detailed breakdown of the process (a short Python sketch of this loop follows below):
- Candidate Test Case Generation:
    
    - The process begins with LLMs, such as those from OpenAI (including Azure-hosted models) or Meta's TestGen-LLM, which generate candidate test cases for the given codebase. These models analyze the code they are prompted with and propose test cases that aim to cover various code paths and edge cases.
- Pre-Processing:
    
    - The candidate test cases generated by the LLMs undergo a pre-processing step. The text response returned by the LLM is converted into proper unit tests that can be run.
- Builds:
    
    - The pre-processed test cases are then integrated into the build system. This step ensures that the test cases are syntactically and semantically correct and can be compiled without errors. If a test case fails to build, it is discarded. This is generally done by running the provided unit test command.
- Passes:
    
    - After building successfully, the test cases are executed. This step verifies that the tests run as expected and pass. If a test case fails, it is removed from the unit test file and re-prompted in the next iteration.
- Post-Processing:
    
    - One of the critical checks here is for an increase in coverage. Once an iteration of this flow is done, we check whether the new unit test has increased coverage; if it has not, we remove the test and re-prompt it in the next iteration. We also run each unit test multiple times to validate that no flaky tests are produced.
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1721983856037/d32325dc-75c2-4ef0-9bb4-34d56fe5d336.png align="center")
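To make the flow above concrete, here is a minimal Python sketch of the generate-build-run-validate loop, assuming the LLM call, the file editing, the test command, and the coverage parser are supplied as callables. It is only an illustration of the iteration logic, not Keploy's or TestGen-LLM's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateTest:
    name: str
    code: str


def improve_coverage(
    generate: Callable[[List[CandidateTest], float], List[CandidateTest]],  # LLM call
    insert: Callable[[CandidateTest], None],   # splice a test into the test file
    remove: Callable[[CandidateTest], None],   # delete a test from the test file
    run_tests: Callable[[], bool],             # build + run the test command ("Builds"/"Passes")
    coverage: Callable[[], float],             # parse the coverage report
    max_iterations: int = 5,
) -> float:
    """Iteratively add LLM-generated tests, keeping only those that build,
    pass, and actually raise coverage. Hypothetical sketch, not the real API."""
    baseline = coverage()
    failed: List[CandidateTest] = []
    for _ in range(max_iterations):
        candidates = generate(failed, baseline)
        failed = []
        for test in candidates:
            insert(test)
            if not run_tests():          # build or run failure -> discard the test
                remove(test)
                failed.append(test)      # re-prompted in the next iteration
                continue
            new_cov = coverage()
            if new_cov <= baseline:      # no coverage gain -> drop it ("Post-Processing")
                remove(test)
            else:
                baseline = new_cov       # keep the test and raise the bar
    return baseline
```

In the actual flow, the flaky-test check described above is an extra pass that re-runs each kept test several times before it is finally accepted.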
A major part of the design effort goes into prompt engineering. The following prompts are the ones used in this tool.
- Indentation prompt - Consider the test file as a 2D grid: we need to know where in the file the new unit tests should be inserted. This prompt asks the LLM to provide the indentation (the "latitude") at which the unit tests should be inserted. (This is needed because some languages are indentation-sensitive.)
````
[indentation]
system="""\
"""

user="""\
## Overview
You are a code assistant designed to analyze a {{ .language }} test file.
Your task is to provide specific feedback on this file, including the programming language, the testing framework required, the number of tests, and the indentation level of the test headers.
You will be given the test file, named `{{ .test_file_name }}` with existing tests, with line numbers added for clarity. These line numbers are not part of the original code.
=========
{{ .test_file | trim }}
=========

Analyze the provided test file and generate a YAML object matching the $TestsAnalysis type, according to these Pydantic definitions:
=====
class TestsAnalysis(BaseModel):
    language: str = Field(description="The programming language used by the test file")
    testing_framework: str = Field(description="The testing framework needed to run the tests in the test file")
    number_of_tests: int = Field(description="The number of tests in the test file")
    test_headers_indentation: int = Field(description="The indentation of the test headers in the test file.\
For example, "def test_..." has an indentation of 0, "  def test_..." has an indentation of 2, "    def test_..." has an indentation of 4, and so on.")
=====

Example output:
```yaml
language: {{ .language }}
testing_framework: ...
number_of_tests: ...
test_headers_indentation: ...
```

The Response should be only a valid YAML object, without any introduction text or follow-up text.

Answer:
```yaml
"""
````
- Line prompt - this prompt determines the line after which the new unit tests should be inserted.
````
[insert_line]
system="""\
"""

user="""\
## Overview
You are a code assistant designed to analyze a {{ .language }} test file.
Your task is to provide specific feedback on this file, including the programming language, the testing framework required, the number of tests, and the line number after which new tests should be inserted to be part of the existing test suite.
You will be given the test file, named `{{ .test_file_name }}`, with line numbers added for clarity and existing tests if there are any.
=========
{{ .test_file_numbered | trim }}
=========

Analyze the provided test file and generate a YAML object matching the $TestsAnalysis type, according to these Pydantic definitions:
=====
class TestsAnalysis(BaseModel):
    language: str = Field(description="The programming language used by the test file")
    testing_framework: str = Field(description="The testing framework needed to run the tests in the test file")
    number_of_tests: int = Field(description="The number of tests in the test file")
    relevant_line_number_to_insert_after: int = Field(description="The line number in the test file, **after which** the new tests should be inserted, so they will be a part of the existing test suite. Place the new tests after the last test in the suite.")
=====

Example output:
```yaml
language: {{ .language }}
testing_framework: ...
number_of_tests: ...
relevant_line_number_to_insert_after: ...
```

The Response should be only a valid YAML object, without any introduction text or follow-up text.

Answer:
```yaml
"""
````
- Test generation prompt - this is the main prompt that returns the unit tests. It contains a lot of context, such as the coverage report, tests that failed in the previous iteration, the language, and so on.
````
[test_generation]
system="""\
"""

user="""\
## Overview
You are a code assistant designed to accept a {{ .language }} source file and a {{ .language }} test file.
Your task is to generate additional unit tests to complement the existing test suite, aiming to increase the code coverage of the source file.

Additional guidelines:
- Carefully analyze the provided code to understand its purpose, inputs, outputs, and key logic or calculations.
- Brainstorm a list of test cases necessary to fully validate the correctness of the code and achieve 100% code coverage.
- After adding each test, review all tests to ensure they cover the full range of scenarios, including exception or error handling.
- If the original test file contains a test suite, assume that each generated test will be part of the same suite. Ensure the new tests are consistent with the existing suite in terms of style, naming conventions, and structure.

## Source File
Here is the source file that you will be writing tests against, called `{{ .source_file_name }}`.
Line numbers have been added for clarity and are not part of the original code.
=========
{{ .source_file_numbered | trim }}
=========

## Test File
Here is the file that contains the existing tests, called `{{ .test_file_name }}`.
=========
{{ .test_file | trim }}
=========

{%- if additional_includes_section | trim %}
{{ .additional_includes_section | trim }}
{% endif %}
{%- if failed_tests_section | trim %}
{{ .failed_tests_section | trim }}
{% endif %}
{%- if additional_instructions_text | trim %}
{{ .additional_instructions_text | trim }}
{% endif %}

## Code Coverage
The following is the existing code coverage report. Use this to determine what tests to write, as you should only write tests that increase the overall coverage:
=========
{{ .code_coverage_report| trim }}
=========

## Response
The output must be a YAML object equivalent to type $NewTests, according to the following Pydantic definitions:
=====
class SingleTest(BaseModel):
    test_behavior: str = Field(description="Short description of the behavior the test covers")
{%- if language in ["python","java"] %}
    test_name: str = Field(description=" A short test name, in snake case, that reflects the behaviour to test")
{%- else %}
    test_name: str = Field(description=" A short unique test name, that should reflect the test objective")
{%- endif %}
    test_code: str = Field(description="A single test function, that tests the behavior described in 'test_behavior'. The test should be a written like its a part of the existing test suite, if there is one, and it can use existing helper functions, setup, or teardown code.")
    new_imports_code: str = Field(description="Code for new imports that are required for the new test function, and are not already present in the test file. Give an empty string if no new imports are required.")
    test_tags: str = Field(description="A single label that best describes the test, out of: ['happy path', 'edge case','other']")

class NewTests(BaseModel):
    language: str = Field(description="The programming language of the source code")
    existing_test_function_signature: str = Field(description="A single line repeating a signature header of one of the existing test functions")
    new_tests: List[SingleTest] = Field(min_items=1, max_items={{ .max_tests }}, description="A list of new test functions to append to the existing test suite, aiming to increase the code coverage. Each test should run as-is, without requiring any additional inputs or setup code. Don't introduce new dependencies")
=====

Example output:
```yaml
language: {{ .language }}
existing_test_function_signature: |
  ...
new_tests:
- test_behavior: |
    Test that the function returns the correct output for a single element list
{%- if language in ["python","java"] %}
  test_name: |
    test_single_element_list
{%- else %}
  test_name: |
    ...
{%- endif %}
  test_code: |
{%- if language in ["python"] %}
    def ...
{%- else %}
    ...
{%- endif %}
  new_imports_code: |
    ""
  test_tags: happy path
...
```

Use block scalar('|') to format each YAML output.

{%- if additional_instructions_text| trim %}
{{ .additional_instructions_text| trim }}
{% endif %}

Response (should be a valid YAML, and nothing else):
```yaml
"""
````
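To illustrate how the pre-processing and insertion step might look, the sketch below parses a YAML response shaped like the `$NewTests` schema above and splices each test into the test file at the line and indentation reported by the `insert_line` and `indentation` prompts. The YAML field names come from the prompt; the function itself and its file handling are illustrative assumptions, not the tool's actual code.

```python
import yaml  # PyYAML


def insert_generated_tests(test_file_path: str, llm_yaml: str,
                           insert_after_line: int, indent: int) -> None:
    """Splice tests from a $NewTests-style YAML response into an existing test file.

    `insert_after_line` and `indent` are the values returned by the
    insert_line and indentation prompts (hypothetical helper, not the real API).
    """
    response = yaml.safe_load(llm_yaml)

    with open(test_file_path) as f:
        lines = f.readlines()

    new_lines = []
    for test in response.get("new_tests", []):
        body = test["test_code"].rstrip("\n")
        # Re-indent every line of the generated test to match the existing suite.
        for line in body.split("\n"):
            new_lines.append(" " * indent + line + "\n")
        new_lines.append("\n")
        # Note: handling of 'new_imports_code' is omitted here for brevity.

    # Insert after the reported 1-based line number, keeping existing tests intact.
    lines[insert_after_line:insert_after_line] = new_lines

    with open(test_file_path, "w") as f:
        f.writelines(lines)
```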
To try out the tool, you can follow the UTGen Documentation. For the code base, please visit Keploy.
That said, the current approach has a few limitations:

- Today's LLMs are not advanced enough to generate unit tests entirely from scratch. There are many complex steps involved, such as deciding on a mocking library, installing it, and using it correctly, which LLMs cannot do on their own. So, for now, this unit test generator is more of an enhancer/assistant to a human developer.
- LLMs are costly, and the price increases as the number of tokens grows.
- Coverage report analysis is an ongoing area of work in this design because each language has its own report format. Currently, the Cobertura format is popular and used extensively, so the unit test generator supports only Cobertura for now (a short sketch of reading such a report follows this list).
- This generator can help you reach 80-90% coverage even if you start with as little as 5% coverage, but achieving that last 10% is hard for the LLM. This will improve as LLMs evolve.
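As a small illustration of the Cobertura point above, the overall line coverage can be read straight from the report's root element; this is the kind of number the generator compares between iterations. A minimal sketch, assuming a standard Cobertura `coverage.xml`:

```python
import xml.etree.ElementTree as ET


def cobertura_line_rate(report_path: str) -> float:
    """Return overall line coverage (0.0-1.0) from a Cobertura XML report.

    Cobertura's root <coverage> element carries a 'line-rate' attribute;
    comparing it before and after a candidate test is added tells us
    whether the test actually increased coverage.
    """
    root = ET.parse(report_path).getroot()
    return float(root.attrib["line-rate"])
```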
In conclusion, the use of AI-driven tools like LLMs for generating unit tests presents a promising step forward in the software testing landscape. By automating the creation of unit tests, we can significantly reduce the time and effort required from developers, allowing them to focus on more complex and value-adding tasks. The proposed design, leveraging Meta's TestGen-LLM and other advanced LLMs, demonstrates the potential to achieve high code coverage with minimal human intervention.
Despite the current limitations, such as the inability to generate unit tests entirely from scratch and the costs associated with using LLMs, the approach outlined provides a substantial improvement over traditional manual methods. It enhances the existing test suites, increases code coverage, and ensures the robustness of software systems.
There are a few areas where the unit test generator can be improved:
- Support for additional coverage report formats.
    
- Improved prompts so that tests can be created from scratch.
    
    - This mainly involves choosing a mocking library, installing it, and using it properly.
Currently, AI-based unit test generators cannot independently decide on or install mocking libraries, nor can they handle the full complexity of creating unit tests from scratch. Additionally, the cost associated with using LLMs increases with the number of tokens processed. Achieving the final 10% of test coverage remains particularly difficult for LLMs, despite significant improvements in initial coverage.
The proposed design involves using LLMs to generate candidate test cases, which are then pre-processed and integrated into the build system. The remaining tests are validated for increased code coverage and reliability. The process includes analyzing the code, generating and integrating test cases, and running multiple iterations to ensure robustness and eliminate flaky tests.
LLMs analyze the provided source code to generate candidate test cases aimed at covering various code paths and edge cases. In doing so, they enhance the unit test generation process by suggesting tests that can increase code coverage, thereby reducing the manual effort required from developers.
AI based unit test generation significantly reduces the time and effort required from developers by automating the creation of unit tests. This allows developers to focus on more complex and value-adding tasks. Additionally, the proposed design promises high coverage and reliability with minimal human intervention, thus improving the efficiency and effectiveness of the testing process.
Keploy's unit test generation feature leverages the power of LLMs to propose test cases that cover various code paths and edge cases. The generated tests are then validated and integrated into the existing test suite, aiming to increase code coverage and ensure the correctness of the code. This feature is designed to reduce the manual effort required in writing unit tests, providing developers with a powerful tool to enhance their testing workflows.
There are several advantages to Keploy's UTGen:
- Automation: Significantly reduces the manual effort involved in writing unit tests by automating the process.
    
- High Coverage: Aims to provide 80-90% code coverage, even starting from as low as 5% coverage.
    
- Edge Case Handling: Generates tests that cover major edge cases, ensuring comprehensive test coverage.
    
- Consistency: Ensures that the generated tests are consistent with the existing test suite in terms of style, naming conventions, and structure.
    
- Validation: Includes multiple validation steps to ensure that the generated tests are reliable and free from errors, thereby enhancing the overall quality of the test suite.