
AI-4-SE/rag-config-val


Artifact release for the paper: "On Automating Configuration Dependency Validation via Retrieval-Augmented Generation"

Paper

PDF: will be linked later

ABSTRACT

Configuration dependencies arise when multiple technologies in a software system require coordinated settings for correct interplay. Existing approaches for detecting such dependencies often yield high false-positive rates, require additional validation mechanisms, and are typically limited to specific projects or technologies. Recent work that incorporates large language models (LLMs) for dependency validation still suffers from inaccuracies due to project- and technology-specific variations, as well as from missing contextual information.

In this work, we propose to use retrieval-augmented generation (RAG) systems for configuration dependency validation, which allows us to incorporate additional project- and technology-specific context information. Specifically, we evaluate whether RAG can improve LLM-based validation of configuration dependencies and what contextual information is needed to overcome the static knowledge base of LLMs. To this end, we conducted a large empirical study on validating configuration dependencies using RAG. Our evaluation shows that vanilla LLMs already demonstrate solid validation abilities, while RAG has only marginal or even negative effects on the validation performance of the models. By incorporating tailored contextual information into the RAG system--derived from a qualitative analysis of validation failures--we achieve significantly more accurate validation results across all models, with an average precision of 0.84 and recall of 0.70, representing improvements of 35% and 133% over vanilla LLMs, respectively. In addition, these results offer two important insights: Simplistic RAG systems may not benefit from additional information if it is not tailored to the task at hand, and it is often unclear upfront what kind of information yields improved performance.

Project Structure

  • /config: contains the configuration files for the different RAG variants and for ingestion
  • /data: contains data of subject systems, dependency datasets, ingested data, and evaluation results
  • /evaluation: contains the evaluation script
  • /src: contains the implementation of the RAG system

Supported Models

Alias Model Name # Params Context Length Open Source
4o gpt-4o-2024-11-20 - 128k no
4o-mini gpt-4o-mini-2024-07-18 - 128k no
DSr:70b deepseek-r1:70b 70B 131k yes
DSr:14b deepseek-r1:14b 14B 131k yes
L3.1:70b llama3.1:70b 70B 8k yes
L3.1:8b llama3.1:8b 8B 8k yes

Supported RAG Variants

ID Embedding Model Embedding Dimension Reranking Top N
R1 text-embedding-ada-002 1536 Sentence Transformer 5
R2 text-embedding-ada-002 1536 Sentence Transformer 3
R3 text-embedding-ada-002 1536 Colbert Rerank 5
R4 text-embedding-ada-002 1536 Colbert Rerank 3
R5 gte-Qwen2-7B-instruct 3584 Sentence Transformer 5
R6 gte-Qwen2-7B-instruct 3584 Sentence Transformer 3
R7 gte-Qwen2-7B-instruct 3584 Colbert Rerank 5
R8 gte-Qwen2-7B-instruct 3584 Colbert Rerank 3

Experiments

To run the experiments on the validation effectiveness of vanilla LLMs and the different RAG variants, you need to execute the ingestion pipeline once and then run the retrieval and generation pipelines one after the other for a given RAG variant. Next, we describe the different steps in detail:
  1. Create a .env file in the root directory containing the API tokens for OpenAI, Pinecone, and GitHub.

    OPENAI_KEY=<your-openai-key>
    PINECONE_API_KEY=<your-pinecone-key>
    GITHUB_TOKEN=<your-github-key>   
    
  2. Run the ingestion pipeline once to create the Pinecone indices for the static and dynamic context information and to ingest the static context, using the following command:

    python ingestion_pipeline.py

    By default, this script uses the .env file in the root directory and the ingestion.toml in the configs directory, but both can be changed via the corresponding command line arguments --config_file and --env_file. The ingestion.toml specifies the static and dynamic indices according to the underlying embedding models and their embedding dimensions, as well as the sources of the static context, which is ingested directly after the static indices have been created.
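
    For illustration, the following sketch shows how such a Pinecone index could be created with the Pinecone Python client (v3+ API); the index name, cloud, and region are placeholders and are not taken from the repository's ingestion code, which reads these settings from ingestion.toml.

        import os
        from pinecone import Pinecone, ServerlessSpec

        # Hypothetical index for the static context embedded with text-embedding-ada-002.
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        pc.create_index(
            name="static-context-ada-002",   # placeholder index name
            dimension=1536,                  # embedding dimension of text-embedding-ada-002
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )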

  3. Once the vector database is set up properly, we can start the retrieval pipeline for a given RAG variant using the following command:

    python retrieval_pipeline.py --config_file=configs/config_{ID}.toml

    The config_{ID}.toml defines a specific RAG variant. The RAG variants have the IDs 1 to 8 (R1-R8), while vanilla LLMs have the ID 0. Each configuration file for a RAG variant contains the following parameters:

    • index_name: the index from which data should be retrieved
    • embedding_model: the embedding model
    • embedding_dimension: the dimension of the embedding model
    • rerank: the re-ranking algorithm
    • top_n: the number of chunks provided to the LLM
    • num_websited: the number of websites used to retrieve dynamic context
    • alpha: the weight for sparse/dense retrieval
    • web_search_enabled: defines whether web search is enabled
    • inference_models: list of LLMs for generation
    • temperature: temperature of LLMs
    • data_file: path of data file containing the dependencies for validation
    • retrieval_file: path of data file in which the retrieval results should be stored
    • generation_file: path of data file in which the generation results should be stored

    This script iterates through all dependencies, retrieves static and dynamic context, and finally stores the retrieval results.
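
    For illustration, the sketch below loads such a configuration file and prints the parameters listed above. It assumes Python 3.11+ (for tomllib) and that the parameters are stored as top-level keys; the actual layout of the TOML files may differ.

        import tomllib

        # Hypothetical reader for a RAG-variant config; parameter names are taken from the list above.
        with open("configs/config_1.toml", "rb") as f:
            cfg = tomllib.load(f)

        for key in ("index_name", "embedding_model", "embedding_dimension", "rerank",
                    "top_n", "num_websited", "alpha", "web_search_enabled",
                    "inference_models", "temperature", "data_file",
                    "retrieval_file", "generation_file"):
            print(key, "=", cfg.get(key))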

  4. Once the additional context is retrieved, we can run the generation pipeline with the following command:

        python generation_pipeline.py --config_file=configs/config_{ID}.toml

    This script takes as input the same configuration file that we use for running the retrieval pipeline. For each inference model specified, it iterates through all dependencies, validates them with the additional context, and finally stores the generation results.

  5. To run the retrieval and generation for the refined vanilla LLMs and the refined RAG variant, execute steps 3 and 4 with the corresponding configuration file configs/advanced_{ID}, where ID is either 0 for the refined vanilla LLMs or 1 for the refined RAG variant R1.

  6. To compute the validation effectiveness of a vanilla LLM or a specific RAG variant, switch to the evaluation directory and execute the following command:

    python metrics.py --generation_file={generation_file}.json
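
    The reported metrics are precision, recall, and F1-score (cf. the tables below). A minimal sketch of this computation, assuming the generation file stores one record per dependency with a predicted and a ground-truth label (the field names are illustrative, not the actual file format):

        import json

        # Hypothetical field names; the actual generation file layout may differ.
        with open("generation_R1.json") as f:
            results = json.load(f)

        tp = sum(1 for r in results if r["predicted"] and r["ground_truth"])
        fp = sum(1 for r in results if r["predicted"] and not r["ground_truth"])
        fn = sum(1 for r in results if not r["predicted"] and r["ground_truth"])

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")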

Results

RQ1.1: Validation Effectiveness of Vanilla LLMs and RAG Variants

The following tables show the validation effectiveness of vanilla LLMs (w/o) and all RAG variants (R1-R8) on a dataset of 350 real-world cross-technology configuration dependencies.

Precision
Model     w/o   R1    R2    R3    R4    R5    R6    R7    R8
4o        0.89  0.86  0.84  0.83  0.85  0.82  0.79  0.83  0.86
4o-mini   0.76  0.60  0.59  0.53  0.56  0.54  0.62  0.55  0.58
DSr:70B   0.76  0.74  0.63  0.66  0.67  0.63  0.65  0.69  0.68
DSr:14B   0.84  0.66  0.70  0.61  0.74  0.68  0.70  0.66  0.69
L3.1:70b  0.70  0.65  0.67  0.65  0.60  0.53  0.62  0.58  0.66
L3.1:8b   0.52  0.53  0.50  0.56  0.54  0.54  0.51  0.58  0.47
Mean      0.75  0.67  0.65  0.64  0.66  0.62  0.65  0.65  0.66
Best      0.89  0.86  0.84  0.83  0.85  0.82  0.79  0.83  0.86
Recall
Model     w/o   R1    R2    R3    R4    R5    R6    R7    R8
4o        0.46  0.61  0.62  0.56  0.59  0.61  0.56  0.56  0.60
4o-mini   0.18  0.78  0.76  0.62  0.74  0.67  0.71  0.64  0.73
DSr:70B   0.59  0.73  0.51  0.64  0.52  0.59  0.62  0.65  0.58
DSr:14B   0.56  0.46  0.45  0.44  0.42  0.56  0.41  0.51  0.39
L3.1:70b  0.45  0.34  0.36  0.24  0.23  0.26  0.35  0.24  0.30
L3.1:8b   0.52  0.34  0.29  0.33  0.32  0.33  0.41  0.34  0.33
Mean      0.46  0.54  0.50  0.47  0.47  0.50  0.51  0.49  0.49
Best      0.59  0.78  0.76  0.64  0.74  0.67  0.71  0.65  0.73
F1-Score
Model     w/o   R1    R2    R3    R4    R5    R6    R7    R8
4o        0.61  0.71  0.71  0.67  0.70  0.70  0.65  0.67  0.71
4o-mini   0.29  0.68  0.66  0.57  0.64  0.60  0.66  0.59  0.65
DSr:70B   0.66  0.74  0.56  0.65  0.59  0.61  0.63  0.67  0.62
DSr:14B   0.67  0.54  0.55  0.51  0.54  0.62  0.51  0.57  0.50
L3.1:70b  0.55  0.45  0.47  0.35  0.34  0.35  0.44  0.34  0.41
L3.1:8b   0.52  0.41  0.37  0.42  0.40  0.41  0.46  0.43  0.39
Mean      0.55  0.59  0.55  0.53  0.53  0.55  0.56  0.55  0.55
Best      0.67  0.74  0.71  0.67  0.70  0.70  0.66  0.67  0.71

RQ1.2: Retrieved Contextual Information

The following figures show the fraction of sources that the RAG variants have deemed relevant for the query and submitted to one of the three or five context slots, depending on the RAG variant. We also show the average relevance score for each context slot across all RAG variants.

Context Source Usage per RAG Variant
[Figures: context source usage for RAG variants R1 to R8]
Average Relevance Score per Context Slot
Context Slot  R1     R2     R3    R4    R5     R6     R7    R8
1             -2.93  -5.05  0.70  0.68  -2.53  -3.10  0.70  0.68
2             -5.73  -6.65  0.67  0.66  -5.63  -5.10  0.66  0.66
3             -6.67  -7.31  0.66  0.65  -6.51  -6.51  0.65  0.65
4             -7.29  --     0.65  --    -7.26  --     0.65  --
5             -7.68  --     0.64  --    -7.88  --     0.64  --

RQ2.1: Validation Failures

We derived eight distinct failure categories from the vanilla LLMs and the best-performing RAG variant R1. The table below summarizes the failure categories along with a brief description and an example per category. We also show the final revised validation prompt.

Failure Categories from vanilla LLMs and R1
Category Description Example
Inheritance and Overrides: This category includes validation failures due to Maven's project inheritance, which allows modules to inherit and override configurations from a parent module, such as general settings, dependencies, plugins, and build settings. In piggymetrics, Llama3.1:70b does not recognize that project.parent_piggymetrics.version inherits the version from project.version in the parent POM.
Configuration Consistency: Often configuration values are the same across different configuration files, which often leads to dependencies, but sometimes only serves the purpose of consistency. In this category, LLMs confuse equal values for the sake of consistency with real dependencies. In litemall, Llama3.1:8b misinterprets identical logging levels in different Spring modules as a dependency, though the equality is likely due to project-wide consistency.
Resource Sharing: Resources, such as databases or services, can be shared across modules or used exclusively by a single module. Without additional project-specific information about available resources, LLMs struggle to infer whether resources are shared or used exclusively by a single module. In music-website, GPT-4o-mini does not infer a dependency between services.db.environment.MYSQL_PASSWORD in Docker Compose and spring.datasource.password in Spring although both options refer to the same datasource.
Port Mapping: Ports of services are typically defined in several configuration files of different technologies, creating equality-based configuration dependencies. However, not all port mappings have to be equal (e.g., a container and host port in Docker Compose). In mall-swarm, DeepSeek-r1:70b assumes a dependency between the host and container port for a specific service in the Docker Compose file, because both ports have identical values. Although the values are equal, there is no actual configuration dependency between the host and container port options in Docker Compose.
Naming Schemes: Software projects often use ambiguous naming schemes for configuration options and their values. These ambiguities result from generic and commonly used names (e.g., project name) that may not cause configuration errors if not consistent but can easily lead to misinterpretation by LLMs. In Spring-Cloud-Platform, Llama3.1:8b assumes a dependency between project.artifactId and project.build.finalName due to their identical values, although the match stems from Maven naming conventions.
Context (Availability, Retrieval, and Utilization): Failures in this category occur either because relevant information is missing (e.g., not in the vector database or generally not available to vanilla LLMs), available in the database but not retrieved, or given to the LLM but not utilized to draw the right conclusion. Maven's documentation states that 4.0.0 is the only supported POM version. This information was indexed into the vector database but was either not retrieved or not utilized when validating a dependency caused by the modelVersion option.
Independent Technologies and Services: In some cases (e.g., in containerized projects), different components, such as services, are isolated by design. In these cases, the configuration options of these components are independent unless explicitly specified otherwise. In piggymetrics, Llama3.1:8b falsely assumes a dependency between identical FROM instructions in Dockerfiles of two independent services, not recognizing that the services are isolated and the shared image does not imply a configuration dependency.
Others: This category contains all validation failures in which the LLMs fail to classify the dependencies correctly but that cannot be matched to any other category and share no common patterns. In litemall, GPT-4o-mini does not infer a dependency between project.artifactId and project.modules.module, although the parent POM specifies all child modules using their artifactId.
Revised Validation Prompt

You are a full-stack expert in validating intra-technology and cross-technology configuration dependencies. You will be presented with configuration options found in the software project {project_name}. {project_info}. Your task is to determine whether the given configuration options actually depend on each other based on value-equality.

{dependency_str}

Information about both configuration options, including their descriptions, prior usages, and examples of similar dependencies are provided below. The provided information comes from various sources, such as manuals, Stack Overflow posts, GitHub repositories, and web search results. Note that not all the provided information may be relevant for validating the dependency. Consider only the information that is relevant for validating the dependency, and disregard the rest.

{context_str}

Additionally, here are some examples on how similar dependencies are evaluated:

{shot_str}

Given the information and similar examples, perform the following task:
Carefully evaluate whether configuration option {nameA} of type {typeA} with value {valueA} in {fileA} of technology {technologyA} depends on configuration option {nameB} of type {typeB} with value {valueB} in {fileB} of technology {technologyB} or vice versa.

Respond in a JSON format as shown below:

{format_str}
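
For illustration only, the placeholders in the prompt above can be filled via plain string formatting. The sketch below uses an abbreviated template and invented example values (the JSON fields in format_str are likewise illustrative); it is not the repository's implementation.

    # Abbreviated, hypothetical version of the prompt template above.
    template = (
        "You are an expert in validating configuration dependencies in {project_name}. "
        "{project_info}\n\n"
        "{dependency_str}\n\n"
        "Context:\n{context_str}\n\n"
        "Examples:\n{shot_str}\n\n"
        "Does {nameA} of type {typeA} with value {valueA} in {fileA} of technology {technologyA} "
        "depend on {nameB} of type {typeB} with value {valueB} in {fileB} of technology {technologyB}, "
        "or vice versa?\n\n"
        "Respond in JSON as shown below:\n{format_str}"
    )

    prompt = template.format(
        project_name="music-website",
        project_info="(a short project description goes here)",
        dependency_str="Validate the dependency between the two options below.",
        context_str="(retrieved static and dynamic context goes here)",
        shot_str="(few-shot examples of validated dependencies go here)",
        nameA="spring.datasource.password", typeA="string", valueA="secret",
        fileA="application.yml", technologyA="Spring Boot",
        nameB="services.db.environment.MYSQL_PASSWORD", typeB="string", valueB="secret",
        fileB="docker-compose.yml", technologyB="Docker Compose",
        format_str='{"isDependency": true, "explanation": "..."}',
    )
    print(prompt)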
