Skip to content

Conversation

@MichaelChirico
Copy link

@MichaelChirico MichaelChirico commented May 31, 2025

{plyr} is long-superseded. The package itself only uses plyr::count(); a vignette also uses round_any().

I was pursuing just dropping {plyr} and re-implementing in base, but gave up for three main reasons:

  1. table(<data.frame>) may fail if there are very few
    duplicates and each column is of high cardinality, meaning
    table(x) would have a very large number of 0 entries that
    need to be computed and dropped (plyr::count() skips them).
  2. We can use something like interaction(..., drop=TRUE) +
    tapply() to imitate this, but it's hard to generically
    reconstruct the un-interacted levels needed to build an
    equivalent data.frame -- basically, we'd need to, for full
    generality, use a sep=<str> where <str> is not present in
    any of the unique values of any of the columns of x in order
    for strsplit(<level>, <sep>) to uniquely map back.
  3. Something like vapply(split(x, x), nrow, integer(1L)) is also
    appealingly simple, but split() always drops missing levels
    (https://bugs.r-project.org/show_bug.cgi?id=18899) --> we'd
    need an onerous/ugly loop over the columns to replace missing
    observations with a unique NA-equivalent, end-sorting sentinel.

Thus the move to {dplyr}, despite it being a non-lightweight choice.

I also applied some code quality fixes to nearby lines:

  1. T/F --> TRUE/FALSE.
  2. 1:<n> loops replaced by seq_len()/seq_along(), as appropriate.
  3. Loop like x <- c(); for (i in seq_along(y)) x[i] <- foo(y[i]) should pre-initialize x to be length(y).
  4. Move some lines around to avoid creating variables just prior to a possible early return().

for(i in 1:nrow(Freqs)){
Freqs$degree[i] <- rowSums(Freqs[ i ,1:num_sets])
for(i in seq_len(nrow(Freqs))){
Freqs$degree[i] <- rowSums(Freqs[ i ,seq_len(num_sets)])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove the for loop altogether. It shouldn't be necessary to iterate through the rows when using rowSums().

I made similar changes back in 2021 in #199 to fix some computational bottlenecks that were causing us problems.

Unfortunately this repository hasn't been updated since 2020 (b14854a), so I don't have any expectation that any of these PRs will ever be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants