A tool that parses ai-dial-core logs into structured parquet datasets organized by deployment name and date, enabling fine-grained access control and easy analysis. Supports S3 and other storage backends.

Dial log parser

Overview

Dial log parser is a tool that parses Dial log files and repacks them into a parquet dataset.

Example:

docker run ai-dial-log-parser:development --input s3://bucket-with-dial-core-logs/ --output s3://bucket-with-parsed-logs/parsed_logs

The command above reads files like s3://bucket-with-dial-core-logs/date=2023-11-061699285645-11111111-2222-3333-4444-555555555555.log.gz for yesterday's date, splits the logs by deployment name and repacks them into parquet tables.
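The input file name above follows the default filename regex (a date partition, an epoch timestamp and a UUID, with an optional .gz suffix). A quick sketch of how it decomposes:

```python
import re

# Default DIAL_LOG_PARSER_FILENAME_REGEX (copied verbatim from the docs):
# date partition, epoch timestamp, request UUID, optional gzip suffix.
FILENAME_REGEX = re.compile(
    r"date=(\d{4}-\d{2}-\d{2})(\d+)-(\w{8}-\w{4}-\w{4}-\w{4}-\w{12}).log(.gz)?"
)

name = "date=2023-11-061699285645-11111111-2222-3333-4444-555555555555.log.gz"
m = FILENAME_REGEX.match(name)
print(m.group(1))  # date partition: 2023-11-06
print(m.group(2))  # timestamp: 1699285645
print(m.group(3))  # UUID: 11111111-2222-3333-4444-555555555555
```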

Example list of output parquet files:

s3://bucket-with-dial-core-logs/parsed_logs/some-assistant/2023-11-06/part-0.parquet
...
s3://bucket-with-dial-core-logs/parsed_logs/gpt-35-turbo/2023-11-06/part-0.parquet
s3://bucket-with-dial-core-logs/parsed_logs/gpt-35-turbo/2023-11-06/part-1.parquet
...
s3://bucket-with-dial-core-logs/parsed_logs/some-application/2023-11-06/part-0.parquet

You can then configure access control by prefixes such as s3://bucket-with-dial-core-logs/parsed_logs/some-application/, allowing the developers of an application to access their own prompt logs.
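For AWS S3, such prefix-based access could be expressed as an IAM policy. The sketch below is illustrative only: the helper function, bucket name and policy shape are assumptions, not part of this project.

```python
import json

# Hypothetical helper producing an IAM policy that grants read access
# to one deployment's parsed logs only. Bucket and prefix are illustrative.
def read_policy(bucket: str, deployment: str) -> dict:
    prefix = f"parsed_logs/{deployment}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [prefix + "*"]}},
            },
        ],
    }

print(json.dumps(read_policy("bucket-with-dial-core-logs", "some-application"), indent=2))
```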

The resulting directory structure can be read by tools like pyarrow as a single dataset.

import pyarrow.dataset as ds

data = ds.dataset(
    "s3://bucket-with-dial-core-logs/parsed_logs/",
    partitioning=ds.partitioning(field_names=["deployment_name", "date"]),
    exclude_invalid_files=True)
data.head(
    10,
    filter=ds.field("deployment_name") == "some-application"
).to_pandas()

Configuration

The configuration can be set using environment variables or command-line arguments.

Environment variables

The following environment variables can be used for configuration:

Variable                        Required  Description
DIAL_LOG_PARSER_INPUT           required  Path to input log directory
DIAL_LOG_PARSER_OUTPUT          required  Path to output log directory
DIAL_LOG_PARSER_DATE            optional  Date to process logs for (default: yesterday)
DIAL_LOG_PARSER_DEBUG           optional  Enables debug logging
DIAL_LOG_PARSER_FILENAME_REGEX  optional  Overrides the regex used to match log file names (default: date=(\d{4}-\d{2}-\d{2})(\d+)-(\w{8}-\w{4}-\w{4}-\w{4}-\w{12}).log(.gz)?)

Storage specific environment variables

Specific storage implementations may require additional environment variables to be set.

For example, for S3, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY may be required. See https://s3fs.readthedocs.io/en/latest/#credentials

Fsspec-compatible implementations should be supported (installing extra packages in the Docker image may be required). Check the lists of Built-in Fsspec Implementations and Other Known Fsspec Implementations for more details.
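Any fsspec-compatible backend is addressed through the same URL-style paths. A minimal sketch using fsspec's built-in in-memory filesystem (the "memory" protocol ships with fsspec itself; "s3" works the same way once s3fs and credentials are configured):

```python
import fsspec

# "memory" needs no extra packages; swap for "s3" with s3fs installed.
fs = fsspec.filesystem("memory")

# Write a placeholder log file, then list the directory.
with fs.open("/logs/date=2023-11-06-example.log", "w") as f:
    f.write("{}")

print(fs.ls("/logs", detail=False))
```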

Command-line arguments

Usage: python -m aidial_log_parser.parse_logs [OPTIONS]

  Parse dial log files and repack it to parquet dataset.

Options:
  -i, --input TEXT       Path to input log directory  [env var:
                         DIAL_LOG_PARSER_INPUT; required]
  -o, --output TEXT      Path to output log directory  [env var:
                         DIAL_LOG_PARSER_OUTPUT; required]
  -d, --date [%Y-%m-%d]  Date to process logs for  [env var:
                         DIAL_LOG_PARSER_DATE; default: 2024-06-09]
  --debug                Enable debug logging  [env var:
                         DIAL_LOG_PARSER_DEBUG]
  --filename-regex TEXT  Regex to match log file names  [env var:
                         DIAL_LOG_PARSER_FILENAME_REGEX; default: date=(\d{4}-
                         \d{2}-\d{2})(\d+)-(\w{8}-\w{4}-\w{4}-\w{4}-\w{12}).lo
                         g(.gz)?]
  --help                 Show this message and exit.

Output format

The output format preserves all the data from the raw logs and adds a few columns that make the most useful data easy to access.

The fields in path:

  • deployment_name - name of the deployment (e.g. gpt-35-turbo, some-assistant, some-application)
  • date - date of the log file (e.g. 2023-11-06)

The fields in the parquet file:

  • request - structure with the request data. It has the following fields:
    • uri - URI of the request
    • time - timestamp of the request
    • body - string with the body of the request. See the Dial API documentation for the format of the request body.
  • response - structure with the response data. It has the following fields:
    • status - status code of the response
    • body - string with the body of the response. See the Dial API documentation for the format of the response body.
  • token_usage - structure with the token usage data. It has the following fields:
    • prompt_tokens - number of tokens in the prompt
    • completion_tokens - number of tokens in the completion
    • total_tokens - total number of tokens in the request
    • deployment_price - the cost of this specific request, excluding the cost of any requests it directly or indirectly initiated.
    • price - the total cost of the request, including the cost of this request and all related requests it directly or indirectly triggered.
  • assembled_response - JSON with the assembled response for chat/completion requests. If the request was made with streaming enabled, the field contains the assembled streaming response.
  • question - last user message in the message history for the chat/completion requests.
  • answer - string with the application/model response for the chat/completion requests.

The question and answer fields are not present in the raw logs; they are added to the parquet file for convenience. These fields can simplify log analysis for simple applications that do not require a message history or a choice among multiple answers.
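How question and answer could be derived can be sketched on an OpenAI-style chat/completions payload. The sample bodies below are invented for illustration, and the extraction logic is an assumption, not the parser's actual implementation:

```python
import json

# Illustrative request/response bodies in OpenAI-style chat format.
request_body = json.dumps({
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What is Dial?"},
    ]
})
assembled_response = json.dumps({
    "choices": [
        {"message": {"role": "assistant", "content": "Dial is an AI orchestration platform."}}
    ]
})

# question: the last user message in the request's message history.
messages = json.loads(request_body)["messages"]
question = next(m["content"] for m in reversed(messages) if m["role"] == "user")

# answer: the content of the first choice in the assembled response.
answer = json.loads(assembled_response)["choices"][0]["message"]["content"]

print(question)  # What is Dial?
print(answer)    # Dial is an AI orchestration platform.
```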

Developer environment

This project uses Python>=3.12 and Poetry>=1.8.5 as a dependency manager.

Check out Poetry's documentation on how to install it on your system before proceeding.

To install requirements:

poetry install

This will install all requirements for running the package, linting, formatting and tests.

IDE configuration

The recommended IDE is VSCode. Open the project in VSCode and install the recommended extensions.

VSCode is configured to use the PEP-8 compatible formatter Black.

Alternatively you can use PyCharm.

Set up the Black formatter for PyCharm manually or install PyCharm>=2023.2 with built-in Black support.

Make on Windows

As of now, Windows distributions do not include the make tool. To run make commands, the tool can be installed using the following command (since Windows 10):

winget install GnuWin32.Make

For convenience, the tool folder can be added to the PATH environment variable as C:\Program Files (x86)\GnuWin32\bin. The command definitions inside Makefile should be cross-platform to keep the development environment setup simple.

Lint

Run the linting before committing:

make lint

To auto-fix formatting issues run:

make format

Test

Run unit tests locally:

make test

Clean

To remove the virtual environment and build artifacts:

make clean
