Skip to content

Unexpected results in subset-based alignment and alignment against reference #795

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorainer opened this issue May 14, 2025 · 3 comments
Assignees

Comments

@jorainer
Copy link
Collaborator

For some data sets we see a strange artifact when using subset-based alignment:

Image

i.e., instead of the expected continuous, smooth line for each sample we see such random scattering for the retention time differences (rtadj - rtraw) for the samples that get aligned against the subset (smooth solid lines in the plot).

This seems to be specific to certain data sets as we don't see this e.g. on the data set used in the package vignette.

In subset-based alignment we use the internal .applyRtAdjustment() function that aligns a vector of values against reference values:

.applyRtAdjustment <- function(x, rtraw, rtadj) {
    ## re-order everything if rtraw is not sorted; issue #146
    if (is.unsorted(rtraw)) {
        idx <- order(rtraw)
        rtraw <- rtraw[idx]
        rtadj <- rtadj[idx]
    }
    adjFun <- stepfun(rtraw[-1] - diff(rtraw) / 2, rtadj)
    res <- adjFun(x)
    ## Fix margins.
    idx_low <- which(x < rtraw[1])
    if (length(idx_low)) {
        first_adj <- idx_low[length(idx_low)] + 1
        res[idx_low] <- x[idx_low] + res[first_adj] - x[first_adj]
    }
    idx_high <- which(x > rtraw[length(rtraw)])
    if (length(idx_high)) {
        last_adj <- idx_high[1] - 1
        res[idx_high] <- x[idx_high] + res[last_adj] - x[last_adj]
    }
    if (is.null(dim(res)))
        names(res) <- names(x)
    res
}

My suspicion is the use of stepfun() here. I will check if we could not maybe use approxfun() with linear interpolation instead.

@jorainer jorainer self-assigned this May 14, 2025
@jorainer
Copy link
Collaborator Author

I implemented an alternative version that allows to switch the interpolation method:

.applyRtAdjustment <- function(x, rtraw, rtadj,
                               method = c("stepfun", "approxfun")) {
    method <- match.arg(method)
    ## re-order everything if rtraw is not sorted; issue #146
    if (is.unsorted(rtraw)) {
        idx <- order(rtraw)
        rtraw <- rtraw[idx]
        rtadj <- rtadj[idx]
    }
    if (method == "approxfun")
        adjFun <- approxfun(rtraw, rtadj)
    else
        adjFun <- stepfun(rtraw[-1] - diff(rtraw) / 2, rtadj)
    res <- adjFun(x)
    ## Fix margins.
    idx_low <- which(x < rtraw[1])
    if (length(idx_low)) {
        first_adj <- idx_low[length(idx_low)] + 1
        res[idx_low] <- x[idx_low] + res[first_adj] - x[first_adj]
    }
    idx_high <- which(x > rtraw[length(rtraw)])
    if (length(idx_high)) {
        last_adj <- idx_high[1] - 1
        res[idx_high] <- x[idx_high] + res[last_adj] - x[last_adj]
    }
    if (is.null(dim(res)))
        names(res) <- names(x)
    res
}

@jorainer
Copy link
Collaborator Author

I could not reproduce the issue above using artificial data - but it seems to me the stepfun()-based approach has some issues:

Simulating just a linear shift of values by a constant:

a_raw <- seq(1, 100, by = 0.3)
a_adj <- a_raw + 0.2

Defining a vector that should be aligned based on the above alignment results. Note that the increment is the same (0.5), but the starting values and all values are not the same (i.e. none of the values in x is part of a_raw).

x <- seq(1.4, 20, by = 0.3)

Running the original method:

res <- .applyRtAdjustment(x, a_raw, a_adj, method = "stepfun")
head(x)
[1] 1.4 1.7 2.0 2.3 2.6 2.9
head(res)
[1] 1.5 1.8 2.1 2.4 2.7 3.0

The difference between the raw and adjusted values is 0.1, although we would expect 0.2 (as we shifted the values by that constant).

Using approxfun() instead:

res <- .applyRtAdjustment(x, a_raw, a_adj, method = "approxfun")
head(res)
[1] 1.6 1.9 2.2 2.5 2.8 3.1

And also expect_equal(res, x + 0.2) does not throw an error. So, approxfun() correctly estimates the drift and adjusts the second vector accordingly.

A second example adding also noise - but still using a constant shift by 0.2:

a_raw <- seq(0.1, 100, by = 0.03)
a_adj <- sort(a_raw + 0.2)
set.seed(123)
x <- seq(2.02, 90.02, by = 0.03)
x <- sort(x + rnorm(length(x), 0, 0.001))

Running stepfun():

res <- .applyRtAdjustment(x, a_raw, a_adj, method = "stepfun")
expect_equal(res, x + 0.2)
Error: `res` not equal to x + 0.2.
2934/2934 mismatches (average diff: 0.00079)
[1] 2.22 - 2.22 ==  5.60e-04
[2] 2.25 - 2.25 ==  2.30e-04
[3] 2.28 - 2.28 == -1.56e-03
[4] 2.31 - 2.31 == -7.05e-05
[5] 2.34 - 2.34 == -1.29e-04
[6] 2.37 - 2.37 == -1.72e-03
[7] 2.40 - 2.40 == -4.61e-04
[8] 2.43 - 2.43 ==  1.27e-03
[9] 2.46 - 2.46 ==  6.87e-04
...

And plotting the data:

plot(res, res - x, type = "l")

Image

So, seems we partially validate the observed scattering in the initial example.

While when using approxfun():

res <- .applyRtAdjustment(x, a_raw, a_adj, method = "approxfun")
expect_equal(res, x + 0.2)
head(x)
[1] 2.019440 2.049770 2.081559 2.110071 2.140129 2.171715
head(res)
[1] 2.219440 2.249770 2.281559 2.310071 2.340129 2.371715

For these example it seems the approxfun()-based approach better estimates and adjusts the difference.

jorainer added a commit that referenced this issue May 14, 2025
- This fixes issue #795 by using `approxfun()` instead of `stepfun()` for the
  subset-based alignment (and only there!). For adjustment of chrom peaks'
  retention times we still use the `stepfun()` that will ensure the RT will
  coincide with the RT of an actual spectrum (not interpolated).
@jorainer
Copy link
Collaborator Author

The fix is in the above commit. It only affects subset-based adjustment. For any other retention time mapping it still uses the original stepfun()-based approach. Thus, the fix is expected to not introduce any additional/new features/issues.

Before making a PR I will cross-check the results on various data sets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant