Artifact release for the paper: "On Automating Configuration Dependency Validation via Retrieval-Augmented Generation"
PDF: will be linked later
Configuration dependencies arise when multiple technologies in a software system require coordinated settings for correct interplay. Existing approaches for detecting such dependencies often yield high false-positive rates, require additional validation mechanisms, and are typically limited to specific projects or technologies. Recent work that incorporates large language models (LLMs) for dependency validation still suffers from inaccuracies due to project- and technology-specific variations, as well as from missing contextual information.
In this work, we propose to use retrieval-augmented generation (RAG) systems for configuration dependency validation, which allows us to incorporate additional project- and technology-specific context information. Specifically, we evaluate whether RAG can improve LLM-based validation of configuration dependencies and what contextual information is needed to overcome the static knowledge base of LLMs. To this end, we conducted a large empirical study on validating configuration dependencies using RAG. Our evaluation shows that vanilla LLMs already demonstrate solid validation abilities, while RAG has only marginal or even negative effects on the validation performance of the models. By incorporating tailored contextual information into the RAG system--derived from a qualitative analysis of validation failures--we achieve significantly more accurate validation results across all models, with an average precision of 0.84 and recall of 0.70, representing improvements of 35% and 133% over vanilla LLMs, respectively. In addition, these results offer two important insights: Simplistic RAG systems may not benefit from additional information if it is not tailored to the task at hand, and it is often unclear upfront what kind of information yields improved performance.
- `/config`: contains the configuration files for the different RAG variants and for ingestion
- `/data`: contains data of the subject systems, dependency datasets, ingested data, and evaluation results
- `/evaluation`: contains the scripts for evaluation
- `/src`: contains the implementation of the RAG system
Alias | Model Name | # Params | Context Length | Open Source |
---|---|---|---|---|
4o | gpt-4o-2024-11-20 | - | 128k | no |
4o-mini | gpt-4o-mini-2024-07-18 | - | 128k | no |
DSr:70b | deepseek-r1:70b | 70B | 131k | yes |
DSr:14b | deepseek-r1:14b | 14B | 131k | yes |
L3.1:70b | llama3.1:70b | 70B | 8k | yes |
L3.1:8b | llama3.1:8b | 8B | 8k | yes |
ID | Embedding Model | Embedding Dimension | Reranking | Top N |
---|---|---|---|---|
R1 | text-embedding-ada-002 | 1536 | Sentence Transformer | 5 |
R2 | text-embedding-ada-002 | 1536 | Sentence Transformer | 3 |
R3 | text-embedding-ada-002 | 1536 | Colbert Rerank | 5 |
R4 | text-embedding-ada-002 | 1536 | Colbert Rerank | 3 |
R5 | gte-Qwen2-7B-instruct | 3584 | Sentence Transformer | 5 |
R6 | gte-Qwen2-7B-instruct | 3584 | Sentence Transformer | 3 |
R7 | gte-Qwen2-7B-instruct | 3584 | Colbert Rerank | 5 |
R8 | gte-Qwen2-7B-instruct | 3584 | Colbert Rerank | 3 |
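For illustration, the reranking stage of the RAG variants works conceptually as sketched below. This is a minimal sketch in the spirit of the Sentence Transformer rerankers (R1, R2, R5, R6), not the artifact's implementation; the cross-encoder model name is only an example.

```python
# Minimal retrieve-then-rerank sketch: score candidate chunks against the query with a
# cross-encoder and keep only the top_n chunks for the LLM (cf. the "Top N" column above).
from sentence_transformers import CrossEncoder

def rerank(query: str, candidate_chunks: list[str], top_n: int = 5) -> list[str]:
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
    scores = scorer.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```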
- Create a `.env` file in the root directory containing the API tokens for OpenAI, Pinecone, and GitHub:

      OPENAI_KEY=<your-openai-key>
      PINECONE_API_KEY=<your-pinecone-key>
      GITHUB_TOKEN=<your-github-key>
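A quick, optional sanity check that the keys can be loaded from the `.env` file (a sketch assuming the `python-dotenv` package is installed):

```python
# Load the .env file from the current directory and verify that all required keys are set.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env in the working directory
for key in ("OPENAI_KEY", "PINECONE_API_KEY", "GITHUB_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```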
- Run the ingestion pipeline once to create the Pinecone indices for the static and dynamic context information and to ingest the static context:

      python ingestion_pipeline.py

  By default, this script uses the `.env` file in the root directory and the `ingestion.toml` in the `configs` directory; both can be changed via the command line arguments `--config_file` and `--env_file`. The `ingestion.toml` specifies the static and dynamic indices according to the underlying embedding models, as well as the sources of the static context, which is ingested directly after the static indices are created.
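Conceptually, the ingestion step creates an index and upserts embedded context documents. The sketch below illustrates this with the current Pinecone and OpenAI Python clients; the index name, region, document contents, and metadata fields are illustrative assumptions, not the artifact's actual values.

```python
# Illustrative sketch of what the ingestion step does conceptually (not the artifact's code):
# create a Pinecone index and upsert embedded static-context documents.
import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

openai_client = OpenAI(api_key=os.getenv("OPENAI_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

index_name = "static-context"  # illustrative name, not the artifact's index
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimension of text-embedding-ada-002
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)

docs = ["Maven POM reference chunk ...", "Spring Boot configuration reference chunk ..."]
response = openai_client.embeddings.create(model="text-embedding-ada-002", input=docs)
index.upsert(vectors=[
    {"id": f"doc-{i}", "values": item.embedding, "metadata": {"text": docs[i]}}
    for i, item in enumerate(response.data)
])
```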
- Once the vector database is set up properly, we can start the retrieval pipeline for a given RAG variant using the following command:

      python retrieval_pipeline.py --config_file=configs/config_{ID}.toml

  The `config_{ID}.toml` defines a specific RAG variant. The RAG variants have the IDs 1 to 8 (R1-R8), while vanilla LLMs have the ID 0. Each configuration file for a RAG variant contains the following parameters:

  - `index_name`: the index from which data should be retrieved
  - `embedding_model`: the embedding model
  - `embedding_dimension`: the dimension of the embedding model
  - `rerank`: the re-ranking algorithm
  - `top_n`: the number of chunks provided to the LLM
  - `num_websites`: the number of websites used to gather dynamic context
  - `alpha`: the weight for sparse/dense retrieval (see the sketch after this step)
  - `web_search_enabled`: defines whether Web search is enabled
  - `inference_models`: the list of LLMs used for generation
  - `temperature`: the temperature of the LLMs
  - `data_file`: path of the data file containing the dependencies to validate
  - `retrieval_file`: path of the file in which the retrieval results are stored
  - `generation_file`: path of the file in which the generation results are stored

  This script iterates through all dependencies, retrieves static and dynamic context, and finally stores the retrieval results.
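As referenced in the parameter list above, the following sketch shows how an `alpha` weight is commonly applied in hybrid retrieval, namely as a convex combination of the dense and sparse query vectors (Pinecone-style hybrid search). It illustrates the concept only and is not the artifact's implementation.

```python
# Scale a dense and a sparse query vector by alpha and (1 - alpha), respectively:
# alpha = 1.0 means pure dense (semantic) retrieval, alpha = 0.0 pure sparse (keyword) retrieval.
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse
```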
- Once the additional context is retrieved, we can run the generation pipeline with the following command:

      python generation_pipeline.py --config_file=configs/config_{ID}.toml

  This script takes as input the same configuration file used for the retrieval pipeline. For each inference model specified, it iterates through all dependencies, validates them with the additional context, and finally stores the generation results.
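Conceptually, the generation step fills the validation prompt (see the revised prompt at the end of this README) with the retrieved context and asks each inference model for a JSON verdict. A minimal sketch of a single validation call using the OpenAI chat API is shown below; the function and its arguments are simplified assumptions, not the artifact's code.

```python
# Minimal sketch of a single validation call: send the filled prompt to a model and
# parse the JSON verdict from the response.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_KEY"))

def validate_dependency(prompt: str, model: str = "gpt-4o-mini", temperature: float = 0.0) -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},  # ask the model for a JSON answer
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```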
- To run the retrieval and generation for the refined vanilla LLMs and the refined RAG variant, repeat the retrieval and generation steps above with the corresponding configuration file `configs/advanced_{ID}`, where ID is 0 for the refined vanilla LLMs and 1 for the refined RAG variant R1.
- To compute the validation effectiveness of the vanilla LLMs or a specific RAG variant, switch to the `evaluation` directory and execute the following command:

      python metrics.py --generation_file={generation_file}.json
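For reference, precision, recall, and F1 are computed from the stored generation results in the usual way. The sketch below shows the idea; the field names `prediction` and `ground_truth` are assumptions about the JSON schema, not the artifact's actual field names.

```python
# Sketch of the metric computation over a generation file (field names are assumed).
import json

def compute_metrics(generation_file: str) -> tuple[float, float, float]:
    with open(generation_file) as f:
        results = json.load(f)
    tp = sum(1 for r in results if r["prediction"] and r["ground_truth"])
    fp = sum(1 for r in results if r["prediction"] and not r["ground_truth"])
    fn = sum(1 for r in results if not r["prediction"] and r["ground_truth"])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```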
The following tables show the validation effectiveness of vanilla LLMs (w/o) and all RAG variants (R1-R8) on a dataset of 350 real-world cross-technology configuration dependencies.
Precision
Model | w/o | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|---|
4o | 0.89 | 0.86 | 0.84 | 0.83 | 0.85 | 0.82 | 0.79 | 0.83 | 0.86 |
4o-mini | 0.76 | 0.60 | 0.59 | 0.53 | 0.56 | 0.54 | 0.62 | 0.55 | 0.58 |
DSr:70B | 0.76 | 0.74 | 0.63 | 0.66 | 0.67 | 0.63 | 0.65 | 0.69 | 0.68 |
DSr:14B | 0.84 | 0.66 | 0.70 | 0.61 | 0.74 | 0.68 | 0.70 | 0.66 | 0.69 |
L3.1:70b | 0.70 | 0.65 | 0.67 | 0.65 | 0.60 | 0.53 | 0.62 | 0.58 | 0.66 |
L3.1:8b | 0.52 | 0.53 | 0.50 | 0.56 | 0.54 | 0.54 | 0.51 | 0.58 | 0.47 |
Mean | 0.75 | 0.67 | 0.65 | 0.64 | 0.66 | 0.62 | 0.65 | 0.65 | 0.66 |
Best | 0.89 | 0.86 | 0.84 | 0.83 | 0.85 | 0.82 | 0.79 | 0.83 | 0.86 |
Recall
Model | w/o | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|---|
4o | 0.46 | 0.61 | 0.62 | 0.56 | 0.59 | 0.61 | 0.56 | 0.56 | 0.60 |
4o-mini | 0.18 | 0.78 | 0.76 | 0.62 | 0.74 | 0.67 | 0.71 | 0.64 | 0.73 |
DSr:70B | 0.59 | 0.73 | 0.51 | 0.64 | 0.52 | 0.59 | 0.62 | 0.65 | 0.58 |
DSr:14B | 0.56 | 0.46 | 0.45 | 0.44 | 0.42 | 0.56 | 0.41 | 0.51 | 0.39 |
L3.1:70b | 0.45 | 0.34 | 0.36 | 0.24 | 0.23 | 0.26 | 0.35 | 0.24 | 0.30 |
L3.1:8b | 0.52 | 0.34 | 0.29 | 0.33 | 0.32 | 0.33 | 0.41 | 0.34 | 0.33 |
Mean | 0.46 | 0.54 | 0.50 | 0.47 | 0.47 | 0.50 | 0.51 | 0.49 | 0.49 |
Best | 0.59 | 0.78 | 0.76 | 0.64 | 0.74 | 0.67 | 0.71 | 0.65 | 0.73 |
F1-Score
Model | w/o | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|---|
4o | 0.61 | 0.71 | 0.71 | 0.67 | 0.70 | 0.70 | 0.65 | 0.67 | 0.71 |
4o-mini | 0.29 | 0.68 | 0.66 | 0.57 | 0.64 | 0.60 | 0.66 | 0.59 | 0.65 |
DSr:70B | 0.66 | 0.74 | 0.56 | 0.65 | 0.59 | 0.61 | 0.63 | 0.67 | 0.62 |
DSr:14B | 0.67 | 0.54 | 0.55 | 0.51 | 0.54 | 0.62 | 0.51 | 0.57 | 0.50 |
L3.1:70b | 0.55 | 0.45 | 0.47 | 0.35 | 0.34 | 0.35 | 0.44 | 0.34 | 0.41 |
L3.1:8b | 0.52 | 0.41 | 0.37 | 0.42 | 0.40 | 0.41 | 0.46 | 0.43 | 0.39 |
Mean | 0.55 | 0.59 | 0.55 | 0.53 | 0.53 | 0.55 | 0.56 | 0.55 | 0.55 |
Best | 0.67 | 0.74 | 0.71 | 0.67 | 0.70 | 0.70 | 0.66 | 0.67 | 0.71 |
The following figures show the fraction of sources that the RAG variants deemed relevant for the query and submitted to one of the three or five context slots, depending on the RAG variant. We also show the average relevance score for each context slot across all RAG variants.
Average Relevance Score per Context Slot
Context Slot | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
---|---|---|---|---|---|---|---|---|
1 | -2.93 | -5.05 | 0.70 | 0.68 | -2.53 | -3.10 | 0.70 | 0.68 |
2 | -5.73 | -6.65 | 0.67 | 0.66 | -5.63 | -5.10 | 0.66 | 0.66 |
3 | -6.67 | -7.31 | 0.66 | 0.65 | -6.51 | -6.51 | 0.65 | 0.65 |
4 | -7.29 | -- | 0.65 | -- | -7.26 | -- | 0.65 | -- |
5 | -7.68 | -- | 0.64 | -- | -7.88 | -- | 0.64 | -- |
We derived eight distinct failure categories from the validation failures of the vanilla LLMs and the best-performing RAG variant R1. The table below summarizes the failure categories along with a brief description and an example per category. We also show the final revised validation prompt.
Failure Categories from vanilla LLMs and R1
Category | Description | Example |
---|---|---|
Inheritance and Overrides | This category includes validation failures due to Maven's project inheritance, which allows modules to inherit and override configurations from a parent module, such as general settings, dependencies, plugins, and build settings. | In piggymetrics, Llama3.1:70b does not recognize that project.parent_piggymetrics.version inherits the version from project.version in the parent POM. |
Configuration Consistency | Configuration values are often identical across different configuration files, which frequently indicates a dependency but sometimes only serves the purpose of consistency. In this category, LLMs confuse values that are equal merely for consistency with real dependencies. | In litemall, Llama3.1:8b misinterprets identical logging levels in different Spring modules as a dependency, although the equality is likely due to project-wide consistency. |
Resource Sharing | Resources, such as databases or services, can be shared across modules or used exclusively by a single module. Without additional project-specific information about available resources, LLMs struggle to infer whether resources are shared or used exclusively by a single module. | In music-website, GPT-4o-mini does not infer a dependency between services.db.environment.MYSQL_PASSWORD in Docker Compose and spring.datasource.password in Spring, although both options refer to the same datasource. |
Port Mapping | Ports of services are typically defined in several configuration files of different technologies, creating equality-based configuration dependencies. However, not all port mappings have to be equal (e.g., a container and host port in Docker Compose). | In mall-swarm, DeepSeek-r1:70b assumes a dependency between the host and container port of a service in the Docker Compose file because both ports have identical values. Although the values are equal, there is no actual configuration dependency between the host and container option in Docker Compose. |
Naming Schemes | Software projects often use ambiguous naming schemes for configuration options and their values. These ambiguities result from generic and commonly used names (e.g., project name) that may not cause configuration errors if inconsistent but can easily be misinterpreted by LLMs. | In Spring-Cloud-Platform, Llama3.1:8b assumes a dependency between project.artifactId and project.build.finalName due to their identical values, although the match stems from Maven naming conventions. |
Context (Availability, Retrieval, and Utilization) | Failures in this category occur because relevant information is missing (e.g., not in the vector database or generally not available to vanilla LLMs), available in the database but not retrieved, or given to the LLM but not utilized to draw the right conclusion. | Maven's documentation states that 4.0.0 is the only supported POM version. This information was indexed into the vector database but was either not retrieved or not utilized when validating a dependency caused by the modelVersion option. |
Independent Technologies and Services | In some cases (e.g., in containerized projects), different components, such as services, are isolated by design. In these cases, the configuration options of these components are independent unless explicitly specified otherwise. | In piggymetrics, Llama3.1:8b falsely assumes a dependency between identical FROM instructions in the Dockerfiles of two independent services, not recognizing that the services are isolated and that the shared image does not imply a configuration dependency. |
Others | This category contains all validation failures in which the LLMs fail to classify the dependencies correctly, that cannot be matched to any other category, and that share no common patterns. | In litemall, GPT-4o-mini does not infer a dependency between project.artifactId and project.modules.module, although the parent POM specifies all child modules using their artifactId. |
Revised Validation Prompt
You are a full-stack expert in validating intra-technology and cross-technology configuration dependencies. You will be presented with configuration options found in the software project {project_name}. {project_info}

Your task is to determine whether the given configuration options actually depend on each other based on value-equality.

{dependency_str}

Information about both configuration options, including their descriptions, prior usages, and examples of similar dependencies, is provided below. The provided information comes from various sources, such as manuals, Stack Overflow posts, GitHub repositories, and web search results. Note that not all of the provided information may be relevant for validating the dependency. Consider only the information that is relevant for validating the dependency, and disregard the rest.

{context_str}

Additionally, here are some examples of how similar dependencies are evaluated:

{shot_str}

Given the information and similar examples, perform the following task: Carefully evaluate whether configuration option {nameA} of type {typeA} with value {valueA} in {fileA} of technology {technologyA} depends on configuration option {nameB} of type {typeB} with value {valueB} in {fileB} of technology {technologyB}, or vice versa.

Respond in a JSON format as shown below:

{format_str}