Create your own broom tidier methods

Learning objective

Write tidy(), glance(), and augment() methods for new model objects.


To use the code in this article, you will need to install the following packages: generics, tidymodels, tidyverse, and usethis.

The broom package provides tools to summarize key information about models in tidy tibble()s. The package provides three verbs, or “tidiers,” to help make model objects easier to work with:

  • tidy() summarizes information about model components
  • glance() reports information about the entire model
  • augment() adds information about observations to a dataset

Each of the three verbs above are generic, in that they do not define a procedure to tidy a given model object, but instead redirect to the relevant method implemented to tidy a specific type of model object. The broom package provides methods for model objects from over 100 modeling packages along with nearly all of the model objects in the stats package that comes with base R. However, for maintainability purposes, the broom package authors now ask that requests for new methods be first directed to the parent package (i.e. the package that supplies the model object) rather than to broom. New methods will generally only be integrated into broom in the case that the requester has already asked the maintainers of the model-owning package to implement tidier methods in the parent package.

We’d like to make implementing external tidier methods as painless as possible. The general process for doing so is:

  • re-export the tidier generics
  • implement tidying methods
  • document the new methods

In this article, we’ll walk through each of the above steps in detail, giving examples and pointing out helpful functions when possible.

Re-export the tidier generics

The first step is to re-export the generic functions for tidy(), glance(), and/or augment(). You could do so from broom itself, but we’ve provided an alternative, much lighter dependency called generics.

First you’ll need to add the generics package to Imports. We recommend using the usethis package for this:

usethis::use_package("generics", "Imports")

Next, you’ll need to re-export the appropriate tidying methods. If you plan to implement a glance() method, for example, you can re-export the glance() generic by adding the following somewhere inside the /R folder of your package:

#' @importFrom generics glance
#' @export

Oftentimes it doesn’t make sense to define one or more of these methods for a particular model. In this case, only implement the methods that do make sense.

Please do not define tidy(), glance(), or augment() generics in your package. This will result in namespace conflicts whenever your package is used along other packages that also export tidying methods.

Implement tidying methods

You’ll now need to implement specific tidying methods for each of the generics you’ve re-exported in the above step. For each of tidy(), glance(), and augment(), we’ll walk through the big picture, an example, and helpful resources.

In this article, we’ll use the base R dataset trees, giving the tree girth (in inches), height (in feet), and volume (in cubic feet), to fit an example linear model using the base R lm() function.

# load in the trees dataset

# take a look!
#> 'data.frame':	31 obs. of  3 variables:
#>  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
#>  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
#>  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

# fit the timber volume as a function of girth and height
trees_model <- lm(Volume ~ Girth + Height, data = trees)

Let’s take a look at the summary() of our trees_model fit.

#> Call:
#> lm(formula = Volume ~ Girth + Height, data = trees)
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -6.406 -2.649 -0.288  2.200  8.485 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -57.988      8.638   -6.71  2.7e-07 ***
#> Girth          4.708      0.264   17.82  < 2e-16 ***
#> Height         0.339      0.130    2.61    0.014 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Residual standard error: 3.88 on 28 degrees of freedom
#> Multiple R-squared:  0.948,	Adjusted R-squared:  0.944 
#> F-statistic:  255 on 2 and 28 DF,  p-value: <2e-16

This output gives some summary statistics on the residuals (which would be described more fully in an augment() output), model coefficients (which, in this case, make up the tidy() output), and some model-level summarizations such as RSE, \(R^2\), etc. (which make up the glance() output.)

Implementing the tidy() method

The tidy(x, ...) method will return a tibble where each row contains information about a component of the model. The x input is a model object, and the dots (...) are an optional argument to supply additional information to any calls inside your method. New tidy() methods can take additional arguments, but must include the x and ... arguments to be compatible with the generic function. (For a glossary of currently acceptable additional arguments, see the end of this article.) Examples of model components include regression coefficients (for regression models), clusters (for classification/clustering models), etc. These tidy() methods are useful for inspecting model details and creating custom model visualizations.

Returning to the example of our linear model on timber volume, we’d like to extract information on the model components. In this example, the components are the regression coefficients. After taking a look at the model object and its summary(), you might notice that you can extract the regression coefficients as follows:

#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -57.988      8.638   -6.71 2.75e-07
#> Girth          4.708      0.264   17.82 8.22e-17
#> Height         0.339      0.130    2.61 1.45e-02

This object contains the model coefficients as a table, where the information giving which coefficient is being described in each row is given in the row names. Converting to a tibble where the row names are contained in a column, you might write:

trees_model_tidy <- summary(trees_model)$coefficients %>% 
  as_tibble(rownames = "term")

#> # A tibble: 3 x 5
#>   term        Estimate `Std. Error` `t value` `Pr(>|t|)`
#>   <chr>          <dbl>        <dbl>     <dbl>      <dbl>
#> 1 (Intercept)  -58.0          8.64      -6.71   2.75e- 7
#> 2 Girth          4.71         0.264     17.8    8.22e-17
#> 3 Height         0.339        0.130      2.61   1.45e- 2

The broom package standardizes common column names used to describe coefficients. In this case, the column names are:

colnames(trees_model_tidy) <- c("term", "estimate", "std.error", "statistic", "p.value")

A glossary giving the currently acceptable column names outputted by tidy() methods can be found at the end of this article. As a rule of thumb, column names resulting from tidy() methods should be all lowercase and contain only alphanumerics or periods (though there are plenty of exceptions).

Finally, it is common for tidy() methods to include an option to calculate confidence/credible intervals for each component based on the model, when possible. In this example, the confint() function can be used to calculate confidence intervals from a model object resulting from lm():

#>                2.5 %  97.5 %
#> (Intercept) -75.6823 -40.293
#> Girth         4.1668   5.249
#> Height        0.0726   0.606

With these considerations in mind, a reasonable tidy() method for lm() might look something like:

tidy.lm <- function(x, = FALSE, conf.level = 0.95, ...) {
  result <- summary(x)$coefficients %>%
    tibble::as_tibble(rownames = "term") %>%
    dplyr::rename(estimate = Estimate,
                  std.error = `Std. Error`,
                  statistic = `t value`,
                  p.value = `Pr(>|t|)`)
  if ( {
    ci <- confint(x, level = conf.level)
    result <- dplyr::left_join(result, ci, by = "term")
If you’re interested, the actual tidy.lm() source can be found here! It’s not too different from the version above except for some argument checking and additional columns.

With this method exported, then, if a user calls tidy(fit), where fit is an output from lm(), the tidy() generic would “redirect” the call to the tidy.lm() function above.

Some things to keep in mind while writing your tidy() method:

  • Sometimes a model will have several different types of components. For example, in mixed models, there is different information associated with fixed effects and random effects. Since this information doesn’t have the same interpretation, it doesn’t make sense to summarize the fixed and random effects in the same table. In cases like this you should add an argument that allows the user to specify which type of information they want. For example, you might implement an interface along the lines of:
model <- mixed_model(...)
tidy(model, effects = "fixed")
tidy(model, effects = "random")
  • How are missing values encoded in the model object and its summary()? Ensure that rows are included even when the associated model component is missing or rank deficient.
  • Are there other measures specific to each component that could reasonably be expected to be included in their summarizations? Some common arguments to tidy() methods include:
    • A logical indicating whether or not to calculate confidence/credible intervals. This should default to FALSE.
    • conf.level: The confidence level to use for the interval when = TRUE. Typically defaults to .95.
    • exponentiate: A logical indicating whether or not model terms should be presented on an exponential scale (typical for logistic regression).

Implementing the glance() method

glance() returns a one-row tibble providing model-level summarizations (e.g. goodness of fit measures and related statistics). This is useful to check for model misspecification and to compare many models. Again, the x input is a model object, and the ... is an optional argument to supply additional information to any calls inside your method. New glance() methods can also take additional arguments and must include the x and ... arguments. (For a glossary of currently acceptable additional arguments, see the end of this article.)

Returning to the trees_model example, we could pull out the \(R^2\) value with the following code:

#> [1] 0.948

Similarly, for the adjusted \(R^2\):

#> [1] 0.944

Unfortunately, for many model objects, the extraction of model-level information is largely a manual process. You will likely need to build a tibble() element-by-element by subsetting the summary() object repeatedly. The with() function, however, can help make this process a bit less tedious by evaluating expressions inside of the summary(trees_model) environment. To grab those those same two model elements from above using with():

     tibble::tibble(r.squared = r.squared,
                    adj.r.squared = adj.r.squared))
#> # A tibble: 1 x 2
#>   r.squared adj.r.squared
#>       <dbl>         <dbl>
#> 1     0.948         0.944

A reasonable glance() method for lm(), then, might look something like:

glance.lm <- function(x, ...) {
      r.squared = r.squared,
      adj.r.squared = adj.r.squared,
      sigma = sigma,
      statistic = fstatistic["value"],
      p.value = pf(
        lower.tail = FALSE
      df = fstatistic["numdf"],
      logLik = as.numeric(stats::logLik(x)),
      AIC = stats::AIC(x),
      BIC = stats::BIC(x),
      deviance = stats::deviance(x),
      df.residual = df.residual(x),
      nobs = stats::nobs(x)
This is the actual definition of glance.lm() provided by broom!

Some things to keep in mind while writing glance() methods:

  • Output should not include the name of the modeling function or any arguments given to the modeling function.
  • In some cases, you may wish to provide model-level diagnostics not returned by the original object. For example, the above glance.lm() calculates AIC and BIC from the model fit. If these are easy to compute, feel free to add them. However, tidier methods are generally not an appropriate place to implement complex or time consuming calculations.
  • The glance method should always return the same columns in the same order when given an object of a given model class. If a summary metric (such as AIC) is not defined in certain circumstances, use NA.

Implementing the augment() method

augment() methods add columns to a dataset containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have a . prefix to prevent existing columns from being overwritten. (Currently acceptable column names are given in the glossary.) The x and ... arguments share their meaning with the two functions described above. augment methods also optionally accept a data argument that is a data.frame (or tibble) to add observation-level information to, returning a tibble object with the same number of rows as data. Many augment() methods also accept a newdata argument, following the same conventions as the data argument, except with the underlying assumption that the model has not “seen” the data yet. As a result, newdata arguments need not contain the response columns in data. Only one of data or newdata should be supplied. A full glossary of acceptable arguments to augment() methods can be found at the end of this article.

If a data argument is not specified, augment() should try to reconstruct the original data as much as possible from the model object. This may not always be possible, and often it will not be possible to recover columns not used by the model.

With this is mind, we can look back to our trees_model example. For one, the model element inside of the trees_model object will allow us to recover the original data:

#>    Volume Girth Height
#> 1    10.3   8.3     70
#> 2    10.3   8.6     65
#> 3    10.2   8.8     63
#> 4    16.4  10.5     72
#> 5    18.8  10.7     81
#> 6    19.7  10.8     83
#> 7    15.6  11.0     66
#> 8    18.2  11.0     75
#> 9    22.6  11.1     80
#> 10   19.9  11.2     75
#> 11   24.2  11.3     79
#> 12   21.0  11.4     76
#> 13   21.4  11.4     76
#> 14   21.3  11.7     69
#> 15   19.1  12.0     75
#> 16   22.2  12.9     74
#> 17   33.8  12.9     85
#> 18   27.4  13.3     86
#> 19   25.7  13.7     71
#> 20   24.9  13.8     64
#> 21   34.5  14.0     78
#> 22   31.7  14.2     80
#> 23   36.3  14.5     74
#> 24   38.3  16.0     72
#> 25   42.6  16.3     77
#> 26   55.4  17.3     81
#> 27   55.7  17.5     82
#> 28   58.3  17.9     80
#> 29   51.5  18.0     80
#> 30   51.0  18.0     80
#> 31   77.0  20.6     87

Similarly, the fitted values and residuals can be accessed with the following code:

#>     1     2     3     4     5     6 
#>  4.84  4.55  4.82 15.87 19.87 21.02
#>      1      2      3      4      5      6 
#>  5.462  5.746  5.383  0.526 -1.069 -1.318

As with glance() methods, it’s fine (and encouraged!) to include common metrics associated with observations if they are not computationally intensive to compute. A common metric associated with linear models, for example, is the standard error of fitted values: <- predict(trees_model, newdata = trees, = TRUE)$ %>%

#> [1] 1.321 1.489 1.633 0.944 1.348 1.532

Thus, a reasonable augment() method for lm might look something like this:

augment.lm <- function(x, data = x$model, newdata = NULL, ...) {
  if (is.null(newdata)) {
                     tibble::tibble(.fitted = x$fitted.values,
                           = predict(x, 
                                                      newdata = data, 
                                             = TRUE)$,
                                   .resid =  x$residuals))
  } else {
    predictions <- predict(x, newdata = newdata, = TRUE)
                     tibble::tibble(.fitted = predictions$fit,
                           = predictions$

Some other things to keep in mind while writing augment() methods:

  • The newdata argument should default to NULL. Users should only ever specify one of data or newdata. Providing both data and newdata should result in an error. The newdata argument should accept both data.frames and tibbles.
  • Data given to the data argument must have both the original predictors and the original response. Data given to the newdata argument only needs to have the original predictors. This is important because there may be important information associated with training data that is not associated with test data. This means that the original_data object in augment(model, data = original_data) should provide .fitted and .resid columns (in most cases), whereas test_data in augment(model, data = test_data) only needs a .fitted column, even if the response is present in test_data.
  • If the data or newdata is specified as a data.frame with rownames, augment should return them in a column called .rownames.
  • For observations where no fitted values or summaries are available (where there’s missing data, for example), return NA.
  • The augment() method should always return as many rows as were in data or newdata, depending on which is supplied
The recommended interface and functionality for augment() methods may change soon.

Document the new methods

The only remaining step is to integrate the new methods into the parent package! To do so, just drop the methods into a .R file inside of the /R folder and document them using roxygen2. If you’re unfamiliar with the process of documenting objects, you can read more about it here. Here’s an example of how our tidy.lm() method might be documented:

#' Tidy a(n) lm object
#' @param x A `lm` object.
#' @param Logical indicating whether or not to include 
#'   a confidence interval in the tidied output. Defaults to FALSE.
#' @param conf.level The confidence level to use for the confidence 
#'   interval if = TRUE. Must be strictly greater than 0 
#'   and less than 1. Defaults to 0.95, which corresponds to a 
#'   95 percent confidence interval.
#' @param ... Unused, included for generic consistency only.
#' @return A tidy [tibble::tibble()] summarizing component-level
#'   information about the model
#' @examples
#' # load the trees dataset
#' data(trees)
#' # fit a linear model on timber volume
#' trees_model <- lm(Volume ~ Girth + Height, data = trees)
#' # summarize model coefficients in a tidy tibble!
#' tidy(trees_model)
#' @export
tidy.lm <- function(x, = FALSE, conf.level = 0.95, ...) {

  # ... the rest of the function definition goes here!

Once you’ve documented each of your new methods and executed devtools::document(), you’re done! Congrats on implementing your own broom tidier methods for a new model object!

Glossaries: argument and column names

Tidier methods have a standardized set of acceptable argument and output column names. The currently acceptable argument names by tidier method are:

Method Argument
tidy alpha
tidy boot_se
tidy by_class
tidy col.names
tidy component
tidy conf.level
tidy conf.method
tidy conf.type
tidy diagonal
tidy droppars
tidy effects
tidy ess
tidy estimate.method
tidy exponentiate
tidy fe
tidy include_studies
tidy instruments
tidy intervals
tidy matrix
tidy measure
tidy na.rm
tidy object
tidy p.values
tidy par_type
tidy parameters
tidy parametric
tidy pars
tidy prob
tidy region
tidy return_zeros
tidy rhat
tidy robust
tidy scales
tidy se.type
tidy strata
tidy test
tidy trim
tidy upper
glance deviance
glance diagnostics
glance looic
glance mcmc
glance test
glance x
augment data
augment newdata
augment se_fit
augment type.predict
augment type.residuals
augment weights

The currently acceptable column names by tidier method are:

Method Column
tidy acf
tidy adj.p.value
tidy alternative
tidy atmean
tidy autocorrelation
tidy bias
tidy ci.width
tidy class
tidy cluster
tidy coef.type
tidy column1
tidy column2
tidy comp
tidy comparison
tidy component
tidy conf.high
tidy conf.low
tidy contrast
tidy cumulative
tidy cutoff
tidy delta
tidy den.df
tidy denominator
tidy dev.ratio
tidy df
tidy distance
tidy estimate
tidy estimate1
tidy estimate2
tidy event
tidy exp
tidy expected
tidy fpr
tidy freq
tidy GCV
tidy group
tidy group1
tidy group2
tidy index
tidy item1
tidy item2
tidy kendall_score
tidy lag
tidy lambda
tidy letters
tidy lhs
tidy logLik
tidy mcmc.error
tidy mean
tidy meansq
tidy method
tidy n
tidy N
tidy n.censor
tidy n.event
tidy n.risk
tidy null.value
tidy num.df
tidy nzero
tidy obs
tidy op
tidy outcome
tidy p
tidy p.value
tidy p.value.Sargan
tidy p.value.weakinst
tidy p.value.Wu.Hausman
tidy parameter
tidy PC
tidy percent
tidy power
tidy proportion
tidy pyears
tidy response
tidy rhs
tidy row
tidy scale
tidy sd
tidy series
tidy sig.level
tidy size
tidy spec
tidy state
tidy statistic
tidy statistic.Sargan
tidy statistic.weakinst
tidy statistic.Wu.Hausman
tidy std_estimate
tidy std.all
tidy std.error
tidy std.nox
tidy step
tidy strata
tidy stratum
tidy study
tidy sumsq
tidy tau
tidy term
tidy time
tidy tpr
tidy type
tidy uniqueness
tidy value
tidy var_kendall_score
tidy variable
tidy variance
tidy withinss
tidy y.level
tidy y.value
tidy z
glance adj.r.squared
glance agfi
glance AIC
glance AICc
glance alpha
glance alternative
glance autocorrelation
glance avg.silhouette.width
glance betweenss
glance BIC
glance cfi
glance chi.squared
glance chisq
glance cochran.qe
glance cochran.qm
glance conf.high
glance conf.low
glance converged
glance convergence
glance crit
glance cv.crit
glance den.df
glance deviance
glance df
glance df.null
glance df.residual
glance dw.original
glance dw.transformed
glance edf
glance estimator
glance events
glance finTol
glance function.count
glance G
glance g.squared
glance gamma
glance gradient.count
glance H
glance h.squared
glance hypvol
glance i.squared
glance independence
glance isConv
glance iter
glance iterations
glance kHKB
glance kLW
glance lambda
glance lambda.1se
glance lambda.min
glance lambdaGCV
glance logLik
glance max.cluster.size
glance max.hazard
glance max.time
glance maxit
glance MCMC.burnin
glance MCMC.interval
glance MCMC.samplesize
glance measure
glance median
glance method
glance min.hazard
glance min.time
glance missing_method
glance model
glance n
glance n.clusters
glance n.factors
glance n.max
glance n.start
glance nevent
glance nexcluded
glance ngroups
glance nobs
glance norig
glance npar
glance npasses
glance null.deviance
glance nulldev
glance num.df
glance number.interaction
glance offtable
glance p.value
glance p.value.cochran.qe
glance p.value.cochran.qm
glance p.value.original
glance p.value.Sargan
glance p.value.transformed
glance p.value.Wu.Hausman
glance parameter
glance pen.crit
glance power
glance power.reached
glance pseudo.r.squared
glance r.squared
glance records
glance residual.deviance
glance rho
glance rho2
glance rho20
glance rmean
glance rmean.std.error
glance rmsea
glance rmsea.conf.high
glance rscore
glance score
glance sigma
glance sigma2_j
glance spar
glance srmr
glance statistic
glance statistic.Sargan
glance statistic.Wu.Hausman
glance tau
glance tau.squared
glance theta
glance timepoints
glance tli
glance tot.withinss
glance total
glance total.variance
glance totss
glance value
glance within.r.squared
augment .class
augment .cluster
augment .cochran.qe.loo
augment .col.prop
augment .conf.high
augment .conf.low
augment .cooksd
augment .cov.ratio
augment .cred.high
augment .cred.low
augment .dffits
augment .expected
augment .fitted
augment .fitted_j_0
augment .fitted_j_1
augment .hat
augment .moderator
augment .moderator.level
augment .observed
augment .probability
augment .prop
augment .remainder
augment .resid
augment .resid_j_0
augment .resid_j_1
augment .row.prop
augment .rownames
augment .seasadj
augment .seasonal
augment .sigma
augment .std.resid
augment .tau
augment .tau.squared.loo
augment .trend
augment .uncertainty
augment .weight

The alexpghayes/modeltests package provides unit testing infrastructure to check your new tidier methods. Please file an issue there to request new arguments/columns to be added to the glossaries!

Session information

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS Catalina 10.15.6      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/Denver              
#>  date     2020-07-21                  
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package    * version date       lib source        
#>  broom      * 0.7.0   2020-07-09 [1] CRAN (R 4.0.0)
#>  dials      * 0.0.8   2020-07-08 [1] CRAN (R 4.0.0)
#>  dplyr      * 1.0.0   2020-05-29 [1] CRAN (R 4.0.0)
#>  generics   * 0.0.2   2018-11-29 [1] CRAN (R 4.0.0)
#>  ggplot2    * 3.3.2   2020-06-19 [1] CRAN (R 4.0.0)
#>  infer      * 0.5.3   2020-07-14 [1] CRAN (R 4.0.2)
#>  parsnip    * 0.1.2   2020-07-03 [1] CRAN (R 4.0.1)
#>  purrr      * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  recipes    * 0.1.13  2020-06-23 [1] CRAN (R 4.0.0)
#>  rlang        0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
#>  rsample    * 0.0.7   2020-06-04 [1] CRAN (R 4.0.0)
#>  tibble     * 3.0.3   2020-07-10 [1] CRAN (R 4.0.2)
#>  tidymodels * 0.1.1   2020-07-14 [1] CRAN (R 4.0.2)
#>  tidyverse  * 1.3.0   2019-11-21 [1] CRAN (R 4.0.0)
#>  tune       * 0.1.1   2020-07-08 [1] CRAN (R 4.0.0)
#>  workflows  * 0.1.2   2020-07-07 [1] CRAN (R 4.0.0)
#>  yardstick  * 0.0.7   2020-07-13 [1] CRAN (R 4.0.2)
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
Explore searchable tables of all tidymodels packages and functions.
Study up on statistics and modeling with our comprehensive books.
Hear the latest about tidymodels packages at the tidyverse blog.