Summarising Rows by Maximising Two Columns in R data.table


When working with grouped data, you sometimes need to pick one row per group where two numeric columns are simultaneously as large as possible. Imagine a data.table like this:

library(data.table)

# sample data
DT <- data.table(
  group = c("a", "a", "a", "b", "b", "c"),
  a = c(1, 10, 9, 9, 1, 10),
  b = c(10, 1, 9, 9, 1, 10)
)

DT
#>    group  a  b
#> 1:     a  1 10
#> 2:     a 10  1
#> 3:     a  9  9
#> 4:     b  9  9
#> 5:     b  1  1
#> 6:     c 10 10

Your goal is to summarise this into one row per group with the “best” combination of a and b, i.e. the row where both columns are high. Simply taking the maximum of a and the maximum of b separately doesn’t guarantee that the two values come from the same row. A simple way to score each row on both columns at once is to use its row sum, a + b, and keep the row with the highest score.
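To see why separate maxima fall short, here is a small sketch that rebuilds the same sample values and takes the column-wise maxima per group. The name col_max is only illustrative.

```r
library(data.table)

# Same sample values as in the example table
DT <- data.table(
  group = c("a", "a", "a", "b", "b", "c"),
  a = c(1, 10, 9, 9, 1, 10),
  b = c(10, 1, 9, 9, 1, 10)
)

# Each column is maximised independently within the group
col_max <- DT[, .(a = max(a), b = max(b)), by = group]
col_max

# Group "a" reports a = 10 and b = 10, but no row in DT has that pair
DT[group == "a" & a == 10 & b == 10]  # zero rows
```

For group a, the summary pairs a = 10 (from the second row) with b = 10 (from the first row), which is a combination that never occurs in the data.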

Solution using .SD[which.max()]

A concise way is to compute the row sum of a and b and select the row with the maximum sum for each group. In data.table, .SD holds the subset for each group, and which.max() picks the index of the maximum value. Here’s the full solution:

# Select the row with the largest sum of a and b for each group
DT[, .SD[which.max(a + b)], by = group]
#>    group  a  b
#> 1:     a  9  9
#> 2:     b  9  9
#> 3:     c 10 10

In the example above, group a has rows (1,10), (10,1) and (9,9). Summing a + b gives (11, 11, 18) respectively, so the (9,9) row has the highest combined value and is returned. The same logic applies to groups b and c.
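If you want to check the arithmetic yourself, the following sketch adds the row sum as a column and flags the row that which.max() would pick in each group. The column names total and is_pick are made up for this illustration, and the work is done on a copy so the original table is left untouched.

```r
library(data.table)

DT <- data.table(
  group = c("a", "a", "a", "b", "b", "c"),
  a = c(1, 10, 9, 9, 1, 10),
  b = c(10, 1, 9, 9, 1, 10)
)

# copy() avoids modifying DT by reference
sums <- copy(DT)[, total := a + b]

# Within each group, mark the row whose position equals which.max(total)
sums[, is_pick := seq_len(.N) == which.max(total), by = group]
sums
```

The total column should read 11, 11, 18 for group a, 18 and 2 for group b, and 20 for group c, with is_pick set to TRUE on the rows that the one-line solution returns.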

Generalising to more columns

If your table has more numeric columns, you can use rowSums(.SD) to sum all of them:

# Summarise by the maximum row sum across all numeric columns
DT[, .SD[which.max(rowSums(.SD))], by = group, .SDcols = is.numeric]

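The .SDcols = is.numeric argument tells data.table to include only numeric columns in .SD, so rowSums(.SD) adds up just those columns. Note that the result then also contains only the grouping column and the numeric columns, because the row is taken from .SD; any other non-numeric columns are dropped. The technique works for any number of numeric columns, although subsetting .SD once per group can become slow when a table has a very large number of groups.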
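If you need to keep non-numeric columns, or the table has many groups, a commonly used alternative is to collect row numbers with .I and subset the table once. This is a sketch of that idiom rather than part of the solution above; the label column and the name idx are added here purely for illustration.

```r
library(data.table)

DT <- data.table(
  group = c("a", "a", "a", "b", "b", "c"),
  a = c(1, 10, 9, 9, 1, 10),
  b = c(10, 1, 9, 9, 1, 10),
  label = c("r1", "r2", "r3", "r4", "r5", "r6")  # hypothetical extra column
)

# .I holds row numbers in the full table; keep the winning one per group.
# An unnamed j expression is returned in a column called V1.
idx <- DT[, .I[which.max(rowSums(.SD))], by = group, .SDcols = is.numeric]$V1

# Subset once, so every column (including label) is retained
DT[idx]
```

With the sample values, idx should be 3, 4, 6, and the result keeps the label column alongside group, a and b.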

Why this works

  • .SD stands for “Subset of Data” and contains the rows of the current group, without the grouping column itself. It behaves like a small data.table for each group.
  • which.max() returns the index of the first maximum value. When applied to a + b or rowSums(.SD), it tells data.table which row to return.
  • Using the row sum ensures you consider both columns simultaneously, which is important when the maxima of individual columns do not occur in the same row.
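Because which.max() returns only the first maximum, tied rows deserve a quick look. The sketch below uses a small made-up table called ties to show which row is returned, and one possible way to keep every tied row instead.

```r
library(data.table)

# Hypothetical group where two rows share the same sum (5 + 1 and 1 + 5)
ties <- data.table(group = c("x", "x"), a = c(5, 1), b = c(1, 5))

# which.max() picks the first of the tied rows
first_only <- ties[, .SD[which.max(a + b)], by = group]
first_only

# Comparing against the group maximum keeps all tied rows
all_tied <- ties[, .SD[a + b == max(a + b)], by = group]
all_tied
```

Here first_only contains the single row (5, 1), while all_tied contains both rows. Which behaviour you want depends on whether exactly one row per group is required.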

Conclusion

To summarise grouped data by maximising several columns at once in R, compute the row sum of those columns and use .SD[which.max(rowSums(.SD))] with by = group in data.table. This pattern picks, for each group, the first row with the highest combined value and generalises to any number of numeric columns.
