Artifact release for the paper: "On Automating Configuration Dependency Validation via Retrieval-Augmented Generation"
PDF: will be linked later
Configuration dependencies arise when multiple technologies in a software system require coordinated settings for correct interplay. Existing approaches for detecting such dependencies often yield high false-positive rates, require additional validation mechanisms, and are typically limited to specific projects or technologies. Recent work that incorporates large language models (LLMs) for dependency validation still suffers from inaccuracies due to project- and technology-specific variations, as well as from missing contextual information.
In this work, we propose to use retrieval-augmented generation (RAG) systems for configuration dependency validation, which allows us to incorporate additional project- and technology-specific context information. Specifically, we evaluate whether RAG can improve LLM-based validation of configuration dependencies and what contextual information is needed to overcome the static knowledge base of LLMs. To this end, we conducted a large empirical study on validating configuration dependencies using RAG. Our evaluation shows that vanilla LLMs already demonstrate solid validation abilities, while RAG has only marginal or even negative effects on the validation performance of the models. By incorporating tailored contextual information into the RAG system--derived from a qualitative analysis of validation failures--we achieve significantly more accurate validation results across all models, with an average precision of 0.84 and recall of 0.70, representing improvements of 35% and 133% over vanilla LLMs, respectively. In addition, these results offer two important insights: Simplistic RAG systems may not benefit from additional information if it is not tailored to the task at hand, and it is often unclear upfront what kind of information yields improved performance.
- `/config`: contains the configuration files for the different RAG variants and for ingestion
- `/data`: contains data of the subject systems, dependency datasets, ingested data, and evaluation results
- `/evaluation`: contains the scripts for evaluation
- `/src`: contains the implementation of the RAG system
Alias | Model Name | # Params | Context Length | Open Source |
---|---|---|---|---|
4o | gpt-4o-2024-11-20 | - | 128k | no |
4o-mini | gpt-4o-mini-2024-07-18 | - | 128k | no |
DSr:70b | deepseek-r1:70b | 70B | 131k | yes |
DSr:14b | deepseek-r1:14b | 14B | 131k | yes |
L3.1:70b | llama3.1:70b | 70B | 8k | yes |
L3.1:8b | llama3.1:8b | 8B | 8k | yes |
ID | Embedding Model | Embedding Dimension | Reranking | Top N |
---|---|---|---|---|
R1 | text-embedding-ada-002 | 1536 | Sentence Transformer | 5 |
R2 | text-embedding-ada-002 | 1536 | Sentence Transformer | 3 |
R3 | text-embedding-ada-002 | 1536 | Colbert Rerank | 5 |
R4 | text-embedding-ada-002 | 1536 | Colbert Rerank | 3 |
R5 | gte-Qwen2-7B-instruct | 3584 | Sentence Transformer | 5 |
R6 | gte-Qwen2-7B-instruct | 3584 | Sentence Transformer | 3 |
R7 | gte-Qwen2-7B-instruct | 3584 | Colbert Rerank | 5 |
R8 | gte-Qwen2-7B-instruct | 3584 | Colbert Rerank | 3 |
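For illustration, the reranking stage of the RAG variants works conceptually as sketched below. This is a minimal sketch in the spirit of the Sentence Transformer rerankers (R1, R2, R5, R6), not the artifact's implementation; the cross-encoder model name is only an example.

```python
# Minimal retrieve-then-rerank sketch: score candidate chunks against the query with a
# cross-encoder and keep only the top_n chunks for the LLM (cf. the "Top N" column above).
from sentence_transformers import CrossEncoder

def rerank(query: str, candidate_chunks: list[str], top_n: int = 5) -> list[str]:
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
    scores = scorer.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```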
- Create a `.env` file in the root directory containing the API tokens for OpenAI, Pinecone, and GitHub:

      OPENAI_KEY=<your-openai-key>
      PINECONE_API_KEY=<your-pinecone-key>
      GITHUB_TOKEN=<your-github-key>
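A quick, optional sanity check that the keys can be loaded from the `.env` file (a sketch assuming the `python-dotenv` package is installed):

```python
# Load the .env file from the current directory and verify that all required keys are set.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env in the working directory
for key in ("OPENAI_KEY", "PINECONE_API_KEY", "GITHUB_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```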
- Run the ingestion pipeline once to create the Pinecone indices for the static and dynamic context information and to ingest the static context:

      python ingestion_pipeline.py

  By default, this script uses the `.env` file in the root directory and the `ingestion.toml` in the `configs` directory; both can be changed via the command line arguments `--config_file` and `--env_file`. The `ingestion.toml` specifies the static and dynamic indices according to the underlying embedding models, as well as the sources of the static context, which is ingested directly after the static indices are created.
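Conceptually, the ingestion step creates an index and upserts embedded context documents. The sketch below illustrates this with the current Pinecone and OpenAI Python clients; the index name, region, document contents, and metadata fields are illustrative assumptions, not the artifact's actual values.

```python
# Illustrative sketch of what the ingestion step does conceptually (not the artifact's code):
# create a Pinecone index and upsert embedded static-context documents.
import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

openai_client = OpenAI(api_key=os.getenv("OPENAI_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

index_name = "static-context"  # illustrative name, not the artifact's index
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimension of text-embedding-ada-002
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)

docs = ["Maven POM reference chunk ...", "Spring Boot configuration reference chunk ..."]
response = openai_client.embeddings.create(model="text-embedding-ada-002", input=docs)
index.upsert(vectors=[
    {"id": f"doc-{i}", "values": item.embedding, "metadata": {"text": docs[i]}}
    for i, item in enumerate(response.data)
])
```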
- Once the vector database is set up properly, we can start the retrieval pipeline for a given RAG variant using the following command:

      python retrieval_pipeline.py --config_file=configs/config_{ID}.toml

  The `config_{ID}.toml` defines a specific RAG variant. The RAG variants have the IDs 1 to 8 (R1-R8), while vanilla LLMs have the ID 0. Each configuration file for a RAG variant contains the following parameters:

  - `index_name`: the index from which data should be retrieved
  - `embedding_model`: the embedding model
  - `embedding_dimension`: the dimension of the embedding model
  - `rerank`: the re-ranking algorithm
  - `top_n`: the number of chunks provided to the LLM
  - `num_websites`: the number of websites used to gather dynamic context
  - `alpha`: the weight for sparse/dense retrieval (see the sketch after this step)
  - `web_search_enabled`: defines whether Web search is enabled
  - `inference_models`: the list of LLMs used for generation
  - `temperature`: the temperature of the LLMs
  - `data_file`: path of the data file containing the dependencies to validate
  - `retrieval_file`: path of the file in which the retrieval results are stored
  - `generation_file`: path of the file in which the generation results are stored

  This script iterates through all dependencies, retrieves static and dynamic context, and finally stores the retrieval results.
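As referenced in the parameter list above, the following sketch shows how an `alpha` weight is commonly applied in hybrid retrieval, namely as a convex combination of the dense and sparse query vectors (Pinecone-style hybrid search). It illustrates the concept only and is not the artifact's implementation.

```python
# Scale a dense and a sparse query vector by alpha and (1 - alpha), respectively:
# alpha = 1.0 means pure dense (semantic) retrieval, alpha = 0.0 pure sparse (keyword) retrieval.
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse
```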
- Once the additional context is retrieved, we can run the generation pipeline with the following command:

      python generation_pipeline.py --config_file=configs/config_{ID}.toml

  This script takes as input the same configuration file used for the retrieval pipeline. For each inference model specified, it iterates through all dependencies, validates them with the additional context, and finally stores the generation results.
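Conceptually, the generation step fills the validation prompt (see the revised prompt at the end of this README) with the retrieved context and asks each inference model for a JSON verdict. A minimal sketch of a single validation call using the OpenAI chat API is shown below; the function and its arguments are simplified assumptions, not the artifact's code.

```python
# Minimal sketch of a single validation call: send the filled prompt to a model and
# parse the JSON verdict from the response.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_KEY"))

def validate_dependency(prompt: str, model: str = "gpt-4o-mini", temperature: float = 0.0) -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},  # ask the model for a JSON answer
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```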
- To run the retrieval and generation for the refined vanilla LLMs and the refined RAG variant, repeat the retrieval and generation steps above with the corresponding configuration file `configs/advanced_{ID}`, where ID is 0 for the refined vanilla LLMs and 1 for the refined RAG variant R1.
- To compute the validation effectiveness of the vanilla LLMs or a specific RAG variant, switch to the `evaluation` directory and execute the following command:

      python metrics.py --generation_file={generation_file}.json
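For reference, precision, recall, and F1 are computed from the stored generation results in the usual way. The sketch below shows the idea; the field names `prediction` and `ground_truth` are assumptions about the JSON schema, not the artifact's actual field names.

```python
# Sketch of the metric computation over a generation file (field names are assumed).
import json

def compute_metrics(generation_file: str) -> tuple[float, float, float]:
    with open(generation_file) as f:
        results = json.load(f)
    tp = sum(1 for r in results if r["prediction"] and r["ground_truth"])
    fp = sum(1 for r in results if r["prediction"] and not r["ground_truth"])
    fn = sum(1 for r in results if not r["prediction"] and r["ground_truth"])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```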
The following tables show the validation effectiveness of vanilla LLMs (w/o) and all RAG variants (R1-R8) on a dataset of 350 real-world cross-technology configuration dependencies.
Precision
Model | w/o | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|---|
4o | 0.89 | 0.86 | 0.84 | 0.83 | 0.85 | 0.82 | 0.79 | 0.83 | 0.86 |
4o-mini | 0.76 | 0.60 | 0.59 | 0.53 | 0.56 | 0.54 | 0.62 | 0.55 | 0.58 |
DSr:70B | 0.76 | 0.74 | 0.63 | 0.66 | 0.67 | 0.63 | 0.65 | 0.69 | 0.68 |
DSr:14B | 0.84 | 0.66 | 0.70 | 0.61 | 0.74 | 0.68 | 0.70 | 0.66 | 0.69 |
L3.1:70b | 0.70 | 0.65 | 0.67 | 0.65 | 0.60 | 0.53 | 0.62 | 0.58 | 0.66 |
L3.1:8b | 0.52 | 0.53 | 0.50 | 0.56 | 0.54 | 0.54 | 0.51 | 0.58 | 0.47 |
Mean | 0.75 | 0.67 | 0.65 | 0.64 | 0.66 | 0.62 | 0.65 | 0.65 | 0.66 |
Best | 0.89 | 0.86 | 0.84 | 0.83 | 0.85 | 0.82 | 0.79 | 0.83 | 0.86 |
Recall
Model | w/o | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|---|
4o | 0.46 | 0.61 | 0.62 | 0.56 | 0.59 | 0.61 | 0.56 | 0.56 | 0.60 |
4o-mini | 0.18 | 0.78 | 0.76 | 0.62 | 0.74 | 0.67 | 0.71 | 0.64 | 0.73 |
DSr:70B | 0.59 | 0.73 | 0.51 | 0.64 | 0.52 | 0.59 | 0.62 | 0.65 | 0.58 |
DSr:14B | 0.56 | 0.46 | 0.45 | 0.44 | 0.42 | 0.56 | 0.41 | 0.51 | 0.39 |
L3.1:70b | 0.45 | 0.34 | 0.36 | 0.24 | 0.23 | 0.26 | 0.35 | 0.24 | 0.30 |
L3.1:8b | 0.52 | 0.34 | 0.29 | 0.33 | 0.32 | 0.33 | 0.41 | 0.34 | 0.33 |
Mean | 0.46 | 0.54 | 0.50 | 0.47 | 0.47 | 0.50 | 0.51 | 0.49 | 0.49 |
Best | 0.59 | 0.78 | 0.76 | 0.64 | 0.74 | 0.67 | 0.71 | 0.65 | 0.73 |
F1-Score
Model | w/o | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|---|
4o | 0.61 | 0.71 | 0.71 | 0.67 | 0.70 | 0.70 | 0.65 | 0.67 | 0.71 |
4o-mini | 0.29 | 0.68 | 0.66 | 0.57 | 0.64 | 0.60 | 0.66 | 0.59 | 0.65 |
DSr:70B | 0.66 | 0.74 | 0.56 | 0.65 | 0.59 | 0.61 | 0.63 | 0.67 | 0.62 |
DSr:14B | 0.67 | 0.54 | 0.55 | 0.51 | 0.54 | 0.62 | 0.51 | 0.57 | 0.50 |
L3.1:70b | 0.55 | 0.45 | 0.47 | 0.35 | 0.34 | 0.35 | 0.44 | 0.34 | 0.41 |
L3.1:8b | 0.52 | 0.41 | 0.37 | 0.42 | 0.40 | 0.41 | 0.46 | 0.43 | 0.39 |
Mean | 0.55 | 0.59 | 0.55 | 0.53 | 0.53 | 0.55 | 0.56 | 0.55 | 0.55 |
Best | 0.67 | 0.74 | 0.71 | 0.67 | 0.70 | 0.70 | 0.66 | 0.67 | 0.71 |
The following figures show the fraction of sources that the RAG variants deemed relevant for the query and submitted to one of the three or five context slots, depending on the RAG variant. We also show the average relevance score for each context slot across all RAG variants.
Average Relevance Score per Context Slot
Context Slot | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|
1 | -2.93 | -5.05 | 0.70 | 0.68 | -2.53 | -3.10 | 0.70 | 0.68 |
2 | -5.73 | -6.65 | 0.67 | 0.66 | -5.63 | -5.10 | 0.66 | 0.66 |
3 | -6.67 | -7.31 | 0.66 | 0.65 | -6.51 | -6.51 | 0.65 | 0.65 |
4 | -7.29 | -- | 0.65 | -- | -7.26 | -- | 0.65 | -- |
5 | -7.68 | -- | 0.64 | -- | -7.88 | -- | 0.64 | -- |
We derived eight distinct failure categories from the validation failures of the vanilla LLMs and the best-performing RAG variant R1. The table below summarizes the failure categories along with a brief description and an example per category. We also show the final revised validation prompt.
Failure Categories from vanilla LLMs and R1
Category | Description | Example |
---|---|---|
Inheritance and Overrides | This category includes validation failures due to Maven's project inheritance, which allows modules to inherit and override configurations from a parent module, such as general settings, dependencies, plugins, and build settings. | In piggymetrics, Llama3.1:70b does not recognize that project.parent_piggymetrics.version inherits the version from project.version in the parent POM. |
Configuration Consistency | Configuration values are often identical across different configuration files, which frequently indicates a dependency but sometimes only serves the purpose of consistency. In this category, LLMs confuse values that are equal merely for consistency with real dependencies. | In litemall, Llama3.1:8b misinterprets identical logging levels in different Spring modules as a dependency, although the equality is likely due to project-wide consistency. |
Resource Sharing | Resources, such as databases or services, can be shared across modules or used exclusively by a single module. Without additional project-specific information about available resources, LLMs struggle to infer whether resources are shared or used exclusively by a single module. | In music-website, GPT-4o-mini does not infer a dependency between services.db.environment.MYSQL_PASSWORD in Docker Compose and spring.datasource.password in Spring, although both options refer to the same datasource. |
Port Mapping | Ports of services are typically defined in several configuration files of different technologies, creating equality-based configuration dependencies. However, not all port mappings have to be equal (e.g., a container and host port in Docker Compose). | In mall-swarm, DeepSeek-r1:70b assumes a dependency between the host and container port of a service in the Docker Compose file because both ports have identical values. Although the values are equal, there is no actual configuration dependency between the host and container option in Docker Compose. |
Naming Schemes | Software projects often use ambiguous naming schemes for configuration options and their values. These ambiguities result from generic and commonly used names (e.g., project name) that may not cause configuration errors if inconsistent but can easily be misinterpreted by LLMs. | In Spring-Cloud-Platform, Llama3.1:8b assumes a dependency between project.artifactId and project.build.finalName due to their identical values, although the match stems from Maven naming conventions. |
Context (Availability, Retrieval, and Utilization) | Failures in this category occur because relevant information is missing (e.g., not in the vector database or generally not available to vanilla LLMs), available in the database but not retrieved, or given to the LLM but not utilized to draw the right conclusion. | Maven's documentation states that 4.0.0 is the only supported POM version. This information was indexed into the vector database but was either not retrieved or not utilized when validating a dependency caused by the modelVersion option. |
Independent Technologies and Services | In some cases (e.g., in containerized projects), different components, such as services, are isolated by design. In these cases, the configuration options of these components are independent unless explicitly specified otherwise. | In piggymetrics, Llama3.1:8b falsely assumes a dependency between identical FROM instructions in the Dockerfiles of two independent services, not recognizing that the services are isolated and that the shared image does not imply a configuration dependency. |
Others | This category contains all validation failures in which the LLMs fail to classify the dependencies correctly, that cannot be matched to any other category, and that share no common patterns. | In litemall, GPT-4o-mini does not infer a dependency between project.artifactId and project.modules.module, although the parent POM specifies all child modules using their artifactId. |
Revised Validation Prompt
You are a full-stack expert in validating intra-technology and cross-technology configuration dependencies. You will be presented with configuration options found in the software project {project_name}. {project_info}

Your task is to determine whether the given configuration options actually depend on each other based on value-equality.

{dependency_str}

Information about both configuration options, including their descriptions, prior usages, and examples of similar dependencies, is provided below. The provided information comes from various sources, such as manuals, Stack Overflow posts, GitHub repositories, and web search results. Note that not all of the provided information may be relevant for validating the dependency. Consider only the information that is relevant for validating the dependency, and disregard the rest.

{context_str}

Additionally, here are some examples of how similar dependencies are evaluated:

{shot_str}

Given the information and similar examples, perform the following task: Carefully evaluate whether configuration option {nameA} of type {typeA} with value {valueA} in {fileA} of technology {technologyA} depends on configuration option {nameB} of type {typeB} with value {valueB} in {fileB} of technology {technologyB}, or vice versa.

Respond in a JSON format as shown below:

{format_str}