Models by a factor

Recently, a friend asked a question on ways to run a single model by a grouping variable in a dataset (similar to the BY statement in SAS for those familiar with SAS). This was a list of the multiple different ways that were suggested.

I will use the baseball dataset in the plyr package to go through the different approaches.

Load and display the data

library(plyr)  #load the plyr package
data(baseball)  #load the baseball dataset that comes with plyr
head(baseball)  # display the first six lines of the data.frame
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1
##     ibb hbp sh sf gidp
## 4    NA  NA NA NA   NA
## 44   NA  NA NA NA   NA
## 68   NA  NA NA NA   NA
## 99   NA  NA NA NA   NA
## 102  NA  NA NA NA   NA
## 106  NA  NA NA NA   NA

To look at the year-team combinations in the data, use the ddply function

baseball$team <- as.character(baseball$team)  #convert team from a factor to a character

# For the purpose of this excercise we will reduce the dataset to just a few
# teams
baseball.red <- baseball[baseball$team %in% c("BOS", "CHN", "CIN"), ]

# Use ddply again to create a variable called c_year, which is the number of
# years since the first year
baseball.red <- ddply(baseball.red, .(id), transform, c_year = year - min(year))

So the basic idea for the process I am going to use, is that I am going to use a Poisson regression using the glm function in R to look at the hits as a function of year since the first year of data was included in the dataset. Is the model 100% valid? No but for this the actual model is not important.

Further, the approaches I am describing below, are runnign the same model with same dependent and independent variables across different subsets of data. If you were interested in running different subsets of independent variables there are also many different approaches that could be considered, but I will not go into that at this time.

1. Subsetting and running each model seperate

The basic process is to run each model seperately by providing a different subset of the data. Not very difficult to do, but if you had a bunch of models to run, it can very quickly add up. The nice thing about this process though is that you have very clear model outputs (i.e., a specific output for BOS, CHN, and CIN).

mod.BOS <- glm(h ~ c_year, data = baseball.red[baseball.red$team == "BOS", ], 
    family = "poisson")  # run the model using only team BOS
summary(mod.BOS)  #  Display the model summary
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     "BOS", ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -14.29  -10.16   -5.28    6.82   17.97  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.943280   0.005922   665.9   <2e-16 ***
## c_year      0.034136   0.000884    38.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 72809  on 900  degrees of freedom
## Residual deviance: 71420  on 899  degrees of freedom
## AIC: 75179
## 
## Number of Fisher Scoring iterations: 6
mod.CHN <- glm(h ~ c_year, data = baseball.red[baseball.red$team == "CHN", ], 
    family = "poisson")
summary(mod.CHN)
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     "CHN", ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -13.94   -9.17   -2.72    6.21   16.82  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 4.071993   0.005087   800.5   <2e-16 ***
## c_year      0.033453   0.000737    45.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80354  on 1178  degrees of freedom
## Residual deviance: 78403  on 1177  degrees of freedom
## AIC: 84317
## 
## Number of Fisher Scoring iterations: 5
mod.CIN <- glm(h ~ c_year, data = baseball.red[baseball.red$team == "CIN", ], 
    family = "poisson")
summary(mod.CIN)
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     "CIN", ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -12.40   -9.48   -2.03    6.35   14.83  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.21200    0.00528  797.57  < 2e-16 ***
## c_year       0.00589    0.00082    7.19  6.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70036  on 1029  degrees of freedom
## Residual deviance: 69984  on 1028  degrees of freedom
## AIC: 75122
## 
## Number of Fisher Scoring iterations: 5

2. Subsetting and running through a loop

This process first identifies all the uniqe teams listed in the dataset and then will loop through those teams and store the output in a list. The summary for all the models can be provided by running lapply on that list or running summary on an extracted element of the list. Problem with this method is just the inherent complication of using lists in R and running loops can take quite a long time especially on large datasets. If it is speed you are after there are quicker options.

uniq.teams <- unique(baseball.red$team)  # find all the unique teams in the dataset
model.out <- list()  # create a list to store each model

for (i in 1:length(uniq.teams)) {
    model.out[[paste(uniq.teams[i])]] <- glm(h ~ c_year, data = baseball.red[baseball.red$team == 
        paste(uniq.teams[i]), ], family = "poisson")
}

lapply(model.out, summary)  # look at all the model outputs, this could be used with outputs on 3 and 5 as well.  
## $CHN
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     paste(uniq.teams[i]), ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -13.94   -9.17   -2.72    6.21   16.82  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 4.071993   0.005087   800.5   <2e-16 ***
## c_year      0.033453   0.000737    45.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80354  on 1178  degrees of freedom
## Residual deviance: 78403  on 1177  degrees of freedom
## AIC: 84317
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## $CIN
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     paste(uniq.teams[i]), ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -12.40   -9.48   -2.03    6.35   14.83  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.21200    0.00528  797.57  < 2e-16 ***
## c_year       0.00589    0.00082    7.19  6.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70036  on 1029  degrees of freedom
## Residual deviance: 69984  on 1028  degrees of freedom
## AIC: 75122
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## $BOS
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     paste(uniq.teams[i]), ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -14.29  -10.16   -5.28    6.82   17.97  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.943280   0.005922   665.9   <2e-16 ***
## c_year      0.034136   0.000884    38.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 72809  on 900  degrees of freedom
## Residual deviance: 71420  on 899  degrees of freedom
## AIC: 75179
## 
## Number of Fisher Scoring iterations: 6
summary(model.out[["CIN"]])  # or look at a specific model
## 
## Call:
## glm(formula = h ~ c_year, family = "poisson", data = baseball.red[baseball.red$team == 
##     paste(uniq.teams[i]), ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -12.40   -9.48   -2.03    6.35   14.83  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.21200    0.00528  797.57  < 2e-16 ***
## c_year       0.00589    0.00082    7.19  6.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70036  on 1029  degrees of freedom
## Residual deviance: 69984  on 1028  degrees of freedom
## AIC: 75122
## 
## Number of Fisher Scoring iterations: 5

3. Using split and lapply

This process splits the dataset into seperate lists by a variable and then uses lapply to run the glm model across those different datasets.

datasets <- split(baseball.red, baseball.red$team)  # splits the dataset into a list by team
str(datasets)  # look at the structure of the dataset
## List of 3
##  $ BOS:'data.frame':	901 obs. of  23 variables:
##   ..$ id    : chr [1:901] "adairje01" "adairje01" "aguilri01" "altroni01" ...
##   ..$ year  : int [1:901] 1967 1968 1995 1902 1903 1988 1990 1971 1972 1973 ...
##   ..$ stint : int [1:901] 2 1 2 1 1 1 2 1 1 1 ...
##   ..$ team  : chr [1:901] "BOS" "BOS" "BOS" "BOS" ...
##   ..$ lg    : chr [1:901] "AL" "AL" "AL" "AL" ...
##   ..$ g     : int [1:901] 89 74 30 3 1 41 15 125 110 132 ...
##   ..$ ab    : int [1:901] 316 208 0 8 3 148 0 491 436 499 ...
##   ..$ r     : int [1:901] 41 18 0 0 0 14 0 56 47 56 ...
##   ..$ h     : int [1:901] 92 45 0 0 2 34 0 114 112 135 ...
##   ..$ X2b   : int [1:901] 13 1 0 0 0 5 0 23 26 17 ...
##   ..$ X3b   : int [1:901] 1 0 0 0 0 3 0 0 3 1 ...
##   ..$ hr    : int [1:901] 3 2 0 0 0 0 0 4 3 0 ...
##   ..$ rbi   : int [1:901] 26 12 0 0 0 12 0 45 39 49 ...
##   ..$ sb    : int [1:901] 1 0 0 0 0 4 0 6 3 13 ...
##   ..$ cs    : int [1:901] 4 0 0 NA NA 2 0 4 3 1 ...
##   ..$ bb    : int [1:901] 13 9 0 0 1 15 0 35 26 43 ...
##   ..$ so    : int [1:901] 35 28 0 NA NA 35 0 43 28 33 ...
##   ..$ ibb   : int [1:901] 0 2 0 NA NA 0 0 0 0 1 ...
##   ..$ hbp   : int [1:901] 2 1 0 0 0 4 0 2 2 0 ...
##   ..$ sh    : int [1:901] 4 6 0 0 0 4 0 9 5 12 ...
##   ..$ sf    : int [1:901] 2 0 0 NA NA 1 0 4 5 7 ...
##   ..$ gidp  : int [1:901] 10 10 0 NA NA 2 0 7 8 12 ...
##   ..$ c_year: int [1:901] 0 1 0 0 1 0 0 0 1 2 ...
##  $ CHN:'data.frame':	1179 obs. of  23 variables:
##   ..$ id    : chr [1:1179] "abernte02" "abernte02" "abernte02" "abernte02" ...
##   ..$ year  : int [1:1179] 1965 1966 1969 1970 1957 1958 1959 1999 2000 1969 ...
##   ..$ stint : int [1:1179] 1 1 1 1 1 1 1 2 1 1 ...
##   ..$ team  : chr [1:1179] "CHN" "CHN" "CHN" "CHN" ...
##   ..$ lg    : chr [1:1179] "NL" "NL" "NL" "NL" ...
##   ..$ g     : int [1:1179] 84 20 56 11 60 62 3 41 50 41 ...
##   ..$ ab    : int [1:1179] 18 4 8 0 187 96 2 1 0 5 ...
##   ..$ r     : int [1:1179] 1 0 1 0 21 14 0 0 0 2 ...
##   ..$ h     : int [1:1179] 3 0 2 0 47 27 0 0 0 2 ...
##   ..$ X2b   : int [1:1179] 0 0 1 0 10 4 0 0 0 0 ...
##   ..$ X3b   : int [1:1179] 0 0 0 0 2 4 0 0 0 0 ...
##   ..$ hr    : int [1:1179] 0 0 0 0 1 0 0 0 0 0 ...
##   ..$ rbi   : int [1:1179] 2 0 1 0 10 4 0 0 0 0 ...
##   ..$ sb    : int [1:1179] 0 0 0 0 0 2 0 0 0 0 ...
##   ..$ cs    : int [1:1179] 0 0 0 0 3 0 0 0 0 0 ...
##   ..$ bb    : int [1:1179] 0 0 0 0 17 6 0 0 0 0 ...
##   ..$ so    : int [1:1179] 7 2 2 0 28 15 1 0 0 1 ...
##   ..$ ibb   : int [1:1179] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ hbp   : int [1:1179] 1 0 0 0 2 0 0 0 0 0 ...
##   ..$ sh    : int [1:1179] 3 0 0 0 5 3 0 0 0 0 ...
##   ..$ sf    : int [1:1179] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ gidp  : int [1:1179] 0 0 0 0 2 2 0 0 0 0 ...
##   ..$ c_year: int [1:1179] 0 1 4 5 11 12 13 4 5 0 ...
##  $ CIN:'data.frame':	1030 obs. of  23 variables:
##   ..$ id    : chr [1:1030] "abernte02" "abernte02" "adamsbo03" "adamsbo03" ...
##   ..$ year  : int [1:1030] 1967 1968 1946 1947 1948 1949 1950 1951 1952 1953 ...
##   ..$ stint : int [1:1030] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ team  : chr [1:1030] "CIN" "CIN" "CIN" "CIN" ...
##   ..$ lg    : chr [1:1030] "NL" "NL" "NL" "NL" ...
##   ..$ g     : int [1:1030] 70 78 94 81 87 107 115 125 154 150 ...
##   ..$ ab    : int [1:1030] 17 17 311 217 262 277 348 403 637 607 ...
##   ..$ r     : int [1:1030] 0 2 35 39 33 32 57 57 85 99 ...
##   ..$ h     : int [1:1030] 1 0 76 59 78 70 98 107 180 167 ...
##   ..$ X2b   : int [1:1030] 0 0 13 11 20 16 21 12 25 14 ...
##   ..$ X3b   : int [1:1030] 0 0 3 2 3 2 8 5 4 6 ...
##   ..$ hr    : int [1:1030] 0 0 4 4 1 0 3 5 6 8 ...
##   ..$ rbi   : int [1:1030] 2 0 24 20 21 25 25 24 48 49 ...
##   ..$ sb    : int [1:1030] 0 0 16 9 6 4 7 4 11 3 ...
##   ..$ cs    : int [1:1030] 0 0 NA NA NA NA NA 10 9 2 ...
##   ..$ bb    : int [1:1030] 0 3 18 25 25 26 43 43 49 58 ...
##   ..$ so    : int [1:1030] 10 12 32 23 23 36 29 40 67 67 ...
##   ..$ ibb   : int [1:1030] 0 0 NA NA NA NA NA NA NA NA ...
##   ..$ hbp   : int [1:1030] 0 0 3 4 1 0 0 1 0 0 ...
##   ..$ sh    : int [1:1030] 0 0 14 7 7 5 3 3 8 12 ...
##   ..$ sf    : int [1:1030] 0 0 NA NA NA NA NA NA NA NA ...
##   ..$ gidp  : int [1:1030] 1 0 7 2 5 6 2 6 15 7 ...
##   ..$ c_year: int [1:1030] 2 3 0 1 2 3 4 5 6 7 ...
model.out <- lapply(datasets, function(x) glm(h ~ c_year, family = "poisson", 
    data = x))  # apply the glm function across the dataset list

# lapply(model.out,summary), # Not run because output is the same as above

4. Using by

This process is similar to the one directly above but will do it in a single line of code. Essentially what by does is the “data frame is split by row into data frames subsetted by the values of one or more factors, and function FUN is applied to each subset in turn”

model.out <- by(data = baseball.red, baseball.red$team, function(x) glm(h ~ 
    c_year, family = "poisson", data = x))
model.out
## baseball.red$team: BOS
## 
## Call:  glm(formula = h ~ c_year, family = "poisson", data = x)
## 
## Coefficients:
## (Intercept)       c_year  
##      3.9433       0.0341  
## 
## Degrees of Freedom: 900 Total (i.e. Null);  899 Residual
## Null Deviance:	    72800 
## Residual Deviance: 71400 	AIC: 75200
## -------------------------------------------------------- 
## baseball.red$team: CHN
## 
## Call:  glm(formula = h ~ c_year, family = "poisson", data = x)
## 
## Coefficients:
## (Intercept)       c_year  
##      4.0720       0.0335  
## 
## Degrees of Freedom: 1178 Total (i.e. Null);  1177 Residual
## Null Deviance:	    80400 
## Residual Deviance: 78400 	AIC: 84300
## -------------------------------------------------------- 
## baseball.red$team: CIN
## 
## Call:  glm(formula = h ~ c_year, family = "poisson", data = x)
## 
## Coefficients:
## (Intercept)       c_year  
##     4.21200      0.00589  
## 
## Degrees of Freedom: 1029 Total (i.e. Null);  1028 Residual
## Null Deviance:	    70000 
## Residual Deviance: 70000 	AIC: 75100

5. Using dlply

One package I use in alot of my analyses is the plyr package created by Hadley Wickham (author of ggplot2, another package I use daily). This packages make subsetting and applying functions incredibly easy and are relatively fast.

library(plyr)
model.out <- dlply(baseball.red, .(team), glm, formula = h ~ c_year, family = poisson)  #dlply takes the data as a dataframe and returns the output as a list
llply(model.out, summary)  #uses llply (list input - list output) to display the summary (instead of lapply, but lapply works as well)
## $BOS
## 
## Call:
## .fun(formula = ..1, family = ..2, data = piece)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -14.29  -10.16   -5.28    6.82   17.97  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.943280   0.005922   665.9   <2e-16 ***
## c_year      0.034136   0.000884    38.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 72809  on 900  degrees of freedom
## Residual deviance: 71420  on 899  degrees of freedom
## AIC: 75179
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $CHN
## 
## Call:
## .fun(formula = ..1, family = ..2, data = piece)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -13.94   -9.17   -2.72    6.21   16.82  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 4.071993   0.005087   800.5   <2e-16 ***
## c_year      0.033453   0.000737    45.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80354  on 1178  degrees of freedom
## Residual deviance: 78403  on 1177  degrees of freedom
## AIC: 84317
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## $CIN
## 
## Call:
## .fun(formula = ..1, family = ..2, data = piece)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -12.40   -9.48   -2.03    6.35   14.83  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.21200    0.00528  797.57  < 2e-16 ***
## c_year       0.00589    0.00082    7.19  6.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70036  on 1029  degrees of freedom
## Residual deviance: 69984  on 1028  degrees of freedom
## AIC: 75122
## 
## Number of Fisher Scoring iterations: 5

As in a lot of things in R, there are many different approaches to the same problem. I personally tend to shy away from the use of the lapply, mapply, and tapply approaches but use the plyr package in most analysis.

Leave a Comment