`::use_package("generics", "Imports") usethis`

# Create your own broom tidier methods

Write tidy(), glance(), and augment() methods for new model objects.

## Introduction

To use code in this article, you will need to install the following packages: generics, tidymodels, tidyverse, and usethis.

The broom package provides tools to summarize key information about models in tidy `tibble()`

s. The package provides three verbs, or “tidiers,” to help make model objects easier to work with:

`tidy()`

summarizes information about model components`glance()`

reports information about the entire model`augment()`

adds information about observations to a dataset

Each of the three verbs above are *generic*, in that they do not define a procedure to tidy a given model object, but instead redirect to the relevant *method* implemented to tidy a specific type of model object. The broom package provides methods for model objects from over 100 modeling packages along with nearly all of the model objects in the stats package that comes with base R. However, for maintainability purposes, the broom package authors now ask that requests for new methods be first directed to the parent package (i.e. the package that supplies the model object) rather than to broom. New methods will generally only be integrated into broom in the case that the requester has already asked the maintainers of the model-owning package to implement tidier methods in the parent package.

We’d like to make implementing external tidier methods as painless as possible. The general process for doing so is:

- re-export the tidier generics
- implement tidying methods
- document the new methods

In this article, we’ll walk through each of the above steps in detail, giving examples and pointing out helpful functions when possible.

## Re-export the tidier generics

The first step is to re-export the generic functions for `tidy()`

, `glance()`

, and/or `augment()`

. You could do so from `broom`

itself, but we’ve provided an alternative, much lighter dependency called `generics`

.

First you’ll need to add the generics package to `Imports`

. We recommend using the usethis package for this:

Next, you’ll need to re-export the appropriate tidying methods. If you plan to implement a `glance()`

method, for example, you can re-export the `glance()`

generic by adding the following somewhere inside the `/R`

folder of your package:

```
#' @importFrom generics glance
#' @export
::glance generics
```

Oftentimes it doesn’t make sense to define one or more of these methods for a particular model. In this case, only implement the methods that do make sense.

Please do not define `tidy()`

, `glance()`

, or `augment()`

generics in your package. This will result in namespace conflicts whenever your package is used along other packages that also export tidying methods.

## Implement tidying methods

You’ll now need to implement specific tidying methods for each of the generics you’ve re-exported in the above step. For each of `tidy()`

, `glance()`

, and `augment()`

, we’ll walk through the big picture, an example, and helpful resources.

In this article, we’ll use the base R dataset `trees`

, giving the tree girth (in inches), height (in feet), and volume (in cubic feet), to fit an example linear model using the base R `lm()`

function.

```
# load in the trees dataset
data(trees)
# take a look!
str(trees)
#> 'data.frame': 31 obs. of 3 variables:
#> $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
#> $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
#> $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# fit the timber volume as a function of girth and height
<- lm(Volume ~ Girth + Height, data = trees) trees_model
```

Let’s take a look at the `summary()`

of our `trees_model`

fit.

```
summary(trees_model)
#>
#> Call:
#> lm(formula = Volume ~ Girth + Height, data = trees)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.4065 -2.6493 -0.2876 2.2003 8.4847
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
#> Girth 4.7082 0.2643 17.816 < 2e-16 ***
#> Height 0.3393 0.1302 2.607 0.0145 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.882 on 28 degrees of freedom
#> Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
#> F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
```

This output gives some summary statistics on the residuals (which would be described more fully in an `augment()`

output), model coefficients (which, in this case, make up the `tidy()`

output), and some model-level summarizations such as RSE, \(R^2\), etc. (which make up the `glance()`

output.)

### Implementing the `tidy()`

method

The `tidy(x, ...)`

method will return a tibble where each row contains information about a component of the model. The `x`

input is a model object, and the dots (`...`

) are an optional argument to supply additional information to any calls inside your method. New `tidy()`

methods can take additional arguments, but *must* include the `x`

and `...`

arguments to be compatible with the generic function. (For a glossary of currently acceptable additional arguments, see the end of this article.) Examples of model components include regression coefficients (for regression models), clusters (for classification/clustering models), etc. These `tidy()`

methods are useful for inspecting model details and creating custom model visualizations.

Returning to the example of our linear model on timber volume, we’d like to extract information on the model components. In this example, the components are the regression coefficients. After taking a look at the model object and its `summary()`

, you might notice that you can extract the regression coefficients as follows:

```
summary(trees_model)$coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -57.9876589 8.6382259 -6.712913 2.749507e-07
#> Girth 4.7081605 0.2642646 17.816084 8.223304e-17
#> Height 0.3392512 0.1301512 2.606594 1.449097e-02
```

This object contains the model coefficients as a table, where the information giving which coefficient is being described in each row is given in the row names. Converting to a tibble where the row names are contained in a column, you might write:

```
<- summary(trees_model)$coefficients %>%
trees_model_tidy as_tibble(rownames = "term")
trees_model_tidy#> # A tibble: 3 × 5
#> term Estimate `Std. Error` `t value` `Pr(>|t|)`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -58.0 8.64 -6.71 2.75e- 7
#> 2 Girth 4.71 0.264 17.8 8.22e-17
#> 3 Height 0.339 0.130 2.61 1.45e- 2
```

The broom package standardizes common column names used to describe coefficients. In this case, the column names are:

`colnames(trees_model_tidy) <- c("term", "estimate", "std.error", "statistic", "p.value")`

A glossary giving the currently acceptable column names outputted by `tidy()`

methods can be found at the end of this article. As a rule of thumb, column names resulting from `tidy()`

methods should be all lowercase and contain only alphanumerics or periods (though there are plenty of exceptions).

Finally, it is common for `tidy()`

methods to include an option to calculate confidence/credible intervals for each component based on the model, when possible. In this example, the `confint()`

function can be used to calculate confidence intervals from a model object resulting from `lm()`

:

```
confint(trees_model)
#> 2.5 % 97.5 %
#> (Intercept) -75.68226247 -40.2930554
#> Girth 4.16683899 5.2494820
#> Height 0.07264863 0.6058538
```

With these considerations in mind, a reasonable `tidy()`

method for `lm()`

might look something like:

```
<- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
tidy.lm
<- summary(x)$coefficients %>%
result ::as_tibble(rownames = "term") %>%
tibble::rename(estimate = Estimate,
dplyrstd.error = `Std. Error`,
statistic = `t value`,
p.value = `Pr(>|t|)`)
if (conf.int) {
<- confint(x, level = conf.level)
ci <- dplyr::left_join(result, ci, by = "term")
result
}
result }
```

If you’re interested, the actual `tidy.lm()`

source can be found here! It’s not too different from the version above except for some argument checking and additional columns.

With this method exported, then, if a user calls `tidy(fit)`

, where `fit`

is an output from `lm()`

, the `tidy()`

generic would “redirect” the call to the `tidy.lm()`

function above.

Some things to keep in mind while writing your `tidy()`

method:

- Sometimes a model will have several different types of components. For example, in mixed models, there is different information associated with fixed effects and random effects. Since this information doesn’t have the same interpretation, it doesn’t make sense to summarize the fixed and random effects in the same table. In cases like this you should add an argument that allows the user to specify which type of information they want. For example, you might implement an interface along the lines of:

```
<- mixed_model(...)
model tidy(model, effects = "fixed")
tidy(model, effects = "random")
```

- How are missing values encoded in the model object and its
`summary()`

? Ensure that rows are included even when the associated model component is missing or rank deficient. - Are there other measures specific to each component that could reasonably be expected to be included in their summarizations? Some common arguments to
`tidy()`

methods include:`conf.int`

: A logical indicating whether or not to calculate confidence/credible intervals. This should default to`FALSE`

.`conf.level`

: The confidence level to use for the interval when`conf.int = TRUE`

. Typically defaults to`.95`

.`exponentiate`

: A logical indicating whether or not model terms should be presented on an exponential scale (typical for logistic regression).

### Implementing the `glance()`

method

`glance()`

returns a one-row tibble providing model-level summarizations (e.g. goodness of fit measures and related statistics). This is useful to check for model misspecification and to compare many models. Again, the `x`

input is a model object, and the `...`

is an optional argument to supply additional information to any calls inside your method. New `glance()`

methods can also take additional arguments and *must* include the `x`

and `...`

arguments. (For a glossary of currently acceptable additional arguments, see the end of this article.)

Returning to the `trees_model`

example, we could pull out the \(R^2\) value with the following code:

```
summary(trees_model)$r.squared
#> [1] 0.94795
```

Similarly, for the adjusted \(R^2\):

```
summary(trees_model)$adj.r.squared
#> [1] 0.9442322
```

Unfortunately, for many model objects, the extraction of model-level information is largely a manual process. You will likely need to build a `tibble()`

element-by-element by subsetting the `summary()`

object repeatedly. The `with()`

function, however, can help make this process a bit less tedious by evaluating expressions inside of the `summary(trees_model)`

environment. To grab those those same two model elements from above using `with()`

:

```
with(summary(trees_model),
::tibble(r.squared = r.squared,
tibbleadj.r.squared = adj.r.squared))
#> # A tibble: 1 × 2
#> r.squared adj.r.squared
#> <dbl> <dbl>
#> 1 0.948 0.944
```

A reasonable `glance()`

method for `lm()`

, then, might look something like:

```
<- function(x, ...) {
glance.lm with(
summary(x),
::tibble(
tibbler.squared = r.squared,
adj.r.squared = adj.r.squared,
sigma = sigma,
statistic = fstatistic["value"],
p.value = pf(
"value"],
fstatistic["numdf"],
fstatistic["dendf"],
fstatistic[lower.tail = FALSE
),df = fstatistic["numdf"],
logLik = as.numeric(stats::logLik(x)),
AIC = stats::AIC(x),
BIC = stats::BIC(x),
deviance = stats::deviance(x),
df.residual = df.residual(x),
nobs = stats::nobs(x)
)
) }
```

This is the actual definition of `glance.lm()`

provided by broom!

Some things to keep in mind while writing `glance()`

methods: * Output should not include the name of the modeling function or any arguments given to the modeling function. * In some cases, you may wish to provide model-level diagnostics not returned by the original object. For example, the above `glance.lm()`

calculates `AIC`

and `BIC`

from the model fit. If these are easy to compute, feel free to add them. However, tidier methods are generally not an appropriate place to implement complex or time consuming calculations. * The `glance`

method should always return the same columns in the same order when given an object of a given model class. If a summary metric (such as `AIC`

) is not defined in certain circumstances, use `NA`

.

### Implementing the `augment()`

method

`augment()`

methods add columns to a dataset containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have a `.`

prefix to prevent existing columns from being overwritten. (Currently acceptable column names are given in the glossary.) The `x`

and `...`

arguments share their meaning with the two functions described above. `augment`

methods also optionally accept a `data`

argument that is a `data.frame`

(or `tibble`

) to add observation-level information to, returning a `tibble`

object with the same number of rows as `data`

. Many `augment()`

methods also accept a `newdata`

argument, following the same conventions as the `data`

argument, except with the underlying assumption that the model has not “seen” the data yet. As a result, `newdata`

arguments need not contain the response columns in `data`

. Only one of `data`

or `newdata`

should be supplied. A full glossary of acceptable arguments to `augment()`

methods can be found at the end of this article.

If a `data`

argument is not specified, `augment()`

should try to reconstruct the original data as much as possible from the model object. This may not always be possible, and often it will not be possible to recover columns not used by the model.

With this is mind, we can look back to our `trees_model`

example. For one, the `model`

element inside of the `trees_model`

object will allow us to recover the original data:

```
$model
trees_model#> Volume Girth Height
#> 1 10.3 8.3 70
#> 2 10.3 8.6 65
#> 3 10.2 8.8 63
#> 4 16.4 10.5 72
#> 5 18.8 10.7 81
#> 6 19.7 10.8 83
#> 7 15.6 11.0 66
#> 8 18.2 11.0 75
#> 9 22.6 11.1 80
#> 10 19.9 11.2 75
#> 11 24.2 11.3 79
#> 12 21.0 11.4 76
#> 13 21.4 11.4 76
#> 14 21.3 11.7 69
#> 15 19.1 12.0 75
#> 16 22.2 12.9 74
#> 17 33.8 12.9 85
#> 18 27.4 13.3 86
#> 19 25.7 13.7 71
#> 20 24.9 13.8 64
#> 21 34.5 14.0 78
#> 22 31.7 14.2 80
#> 23 36.3 14.5 74
#> 24 38.3 16.0 72
#> 25 42.6 16.3 77
#> 26 55.4 17.3 81
#> 27 55.7 17.5 82
#> 28 58.3 17.9 80
#> 29 51.5 18.0 80
#> 30 51.0 18.0 80
#> 31 77.0 20.6 87
```

Similarly, the fitted values and residuals can be accessed with the following code:

```
head(trees_model$fitted.values)
#> 1 2 3 4 5 6
#> 4.837660 4.553852 4.816981 15.874115 19.869008 21.018327
head(trees_model$residuals)
#> 1 2 3 4 5 6
#> 5.4623403 5.7461484 5.3830187 0.5258848 -1.0690084 -1.3183270
```

As with `glance()`

methods, it’s fine (and encouraged!) to include common metrics associated with observations if they are not computationally intensive to compute. A common metric associated with linear models, for example, is the standard error of fitted values:

```
<- predict(trees_model, newdata = trees, se.fit = TRUE)$se.fit %>%
se.fit unname()
head(se.fit)
#> [1] 1.3211285 1.4893775 1.6325024 0.9444212 1.3484251 1.5319772
```

Thus, a reasonable `augment()`

method for `lm`

might look something like this:

```
<- function(x, data = x$model, newdata = NULL, ...) {
augment.lm if (is.null(newdata)) {
::bind_cols(tibble::as_tibble(data),
dplyr::tibble(.fitted = x$fitted.values,
tibble.se.fit = predict(x,
newdata = data,
se.fit = TRUE)$se.fit,
.resid = x$residuals))
else {
} <- predict(x, newdata = newdata, se.fit = TRUE)
predictions ::bind_cols(tibble::as_tibble(newdata),
dplyr::tibble(.fitted = predictions$fit,
tibble.se.fit = predictions$se.fit))
} }
```

Some other things to keep in mind while writing `augment()`

methods: * The `newdata`

argument should default to `NULL`

. Users should only ever specify one of `data`

or `newdata`

. Providing both `data`

and `newdata`

should result in an error. The `newdata`

argument should accept both `data.frame`

s and `tibble`

s. * Data given to the `data`

argument must have both the original predictors and the original response. Data given to the `newdata`

argument only needs to have the original predictors. This is important because there may be important information associated with training data that is not associated with test data. This means that the `original_data`

object in `augment(model, data = original_data)`

should provide `.fitted`

and `.resid`

columns (in most cases), whereas `test_data`

in `augment(model, data = test_data)`

only needs a `.fitted`

column, even if the response is present in `test_data`

. * If the `data`

or `newdata`

is specified as a `data.frame`

with rownames, `augment`

should return them in a column called `.rownames`

. * For observations where no fitted values or summaries are available (where there’s missing data, for example), return `NA`

. * *The augment() method should always return as many rows as were in data or newdata*, depending on which is supplied

The recommended interface and functionality for `augment()`

methods may change soon.

## Document the new methods

The only remaining step is to integrate the new methods into the parent package! To do so, just drop the methods into a `.R`

file inside of the `/R`

folder and document them using roxygen2. If you’re unfamiliar with the process of documenting objects, you can read more about it here. Here’s an example of how our `tidy.lm()`

method might be documented:

```
#' Tidy a(n) lm object
#'
#' @param x A `lm` object.
#' @param conf.int Logical indicating whether or not to include
#' a confidence interval in the tidied output. Defaults to FALSE.
#' @param conf.level The confidence level to use for the confidence
#' interval if conf.int = TRUE. Must be strictly greater than 0
#' and less than 1. Defaults to 0.95, which corresponds to a
#' 95 percent confidence interval.
#' @param ... Unused, included for generic consistency only.
#' @return A tidy [tibble::tibble()] summarizing component-level
#' information about the model
#'
#' @examples
#' # load the trees dataset
#' data(trees)
#'
#' # fit a linear model on timber volume
#' trees_model <- lm(Volume ~ Girth + Height, data = trees)
#'
#' # summarize model coefficients in a tidy tibble!
#' tidy(trees_model)
#'
#' @export
<- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
tidy.lm
# ... the rest of the function definition goes here!
```

Once you’ve documented each of your new methods and executed `devtools::document()`

, you’re done! Congrats on implementing your own broom tidier methods for a new model object!

## Glossaries

### Arguments

Tidier methods have a standardized set of acceptable argument and output column names. The currently acceptable argument names by tidier method are:

### Column Names

The currently acceptable column names by tidier method are:

The alexpghayes/modeltests package provides unit testing infrastructure to check your new tidier methods. Please file an issue there to request new arguments/columns to be added to the glossaries!

## Session information

```
#> ─ Session info ─────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.3 (2024-02-29)
#> os macOS Sonoma 14.4.1
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Los_Angeles
#> date 2024-03-26
#> pandoc 2.17.1.1 @ /opt/homebrew/bin/ (via rmarkdown)
#>
#> ─ Packages ─────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.0)
#> dials * 1.2.1 2024-02-22 [1] CRAN (R 4.3.1)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1)
#> generics * 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
#> ggplot2 * 3.5.0 2024-02-23 [1] CRAN (R 4.3.1)
#> infer * 1.0.7 2024-03-25 [1] CRAN (R 4.3.1)
#> parsnip * 1.2.1 2024-03-22 [1] CRAN (R 4.3.1)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
#> recipes * 1.0.10 2024-02-18 [1] CRAN (R 4.3.1)
#> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1)
#> rsample * 1.2.1 2024-03-25 [1] CRAN (R 4.3.1)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.3.1)
#> tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
#> tune * 1.2.0 2024-03-20 [1] CRAN (R 4.3.1)
#> workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.3.1)
#> yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.3.1)
#>
#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library
#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ────────────────────────────────────────────────────────────────────
```