# Models by a factor

Recently, a friend asked about ways to run a single model separately for each level of a grouping variable in a dataset (similar to the BY statement in SAS, for those familiar with SAS). This is a list of the different approaches that were suggested.

I will use the baseball dataset in the *plyr* package to go through the different approaches.

### Load and display the data

To look at the year-team combinations in the data, use the *ddply* function.
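A minimal sketch of this step, assuming the **plyr** package is installed. The summary column name `n_players` is my own choice for illustration, and counting rows per combination is only one of several summaries that could be shown here.

```r
library(plyr)

# Load the baseball dataset that ships with plyr and peek at it
data(baseball)
head(baseball)

# Count the number of rows for each year-team combination
# (n_players is a hypothetical column name chosen for this sketch)
team_years <- ddply(baseball, .(year, team), summarise, n_players = length(id))
head(team_years)
```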

The basic idea is to fit a Poisson regression with the *glm* function in R, modelling hits as a function of the number of years since the first year in the dataset. Is the model 100% valid? No, but for this purpose the actual model is not important.

Further, the approaches described below all run the same model, with the same dependent and independent variables, across different subsets of the data. If you were interested in running different sets of independent variables, there are many other approaches that could be considered, but I will not go into those at this time.
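As a rough sketch of the model described above, fitted once to the whole dataset: `h` is the hits column in the baseball data, while `years_since` is a hypothetical column name I am introducing here for the years elapsed since the earliest year in the data.

```r
library(plyr)
data(baseball)

# Years since the first year in the dataset
# (years_since is a hypothetical name used only in these sketches)
baseball$years_since <- baseball$year - min(baseball$year)

# Poisson regression of hits on years since the first year
fit_all <- glm(h ~ years_since, family = poisson, data = baseball)
summary(fit_all)
```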

### 1. Subsetting and running each model separately

The basic process is to run each model separately by providing a different subset of the data. This is not very difficult to do, but if you have a lot of models to run, the code adds up very quickly. The nice thing about this process, though, is that you get very clear model outputs (i.e., a specific output for BOS, CHN, and CIN).
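A sketch of what this could look like for the three teams mentioned above. The object names (`fit_bos` and so on) and the `years_since` column are assumptions made for illustration.

```r
library(plyr)
data(baseball)
baseball$years_since <- baseball$year - min(baseball$year)

# One explicit glm call per team, each on its own subset of the data
fit_bos <- glm(h ~ years_since, family = poisson,
               data = subset(baseball, team == "BOS"))
fit_chn <- glm(h ~ years_since, family = poisson,
               data = subset(baseball, team == "CHN"))
fit_cin <- glm(h ~ years_since, family = poisson,
               data = subset(baseball, team == "CIN"))

# Each model has its own clearly named output
summary(fit_bos)
```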

### 2. Subsetting and running through a loop

This process first identifies all the unique teams listed in the dataset, then loops through those teams and stores each model in a list. The summaries for all the models can be produced by running *lapply* on that list, or by running *summary* on a single extracted element of the list. The problems with this method are the inherent complication of working with lists in R and the fact that loops can take quite a long time, especially on large datasets. If it is speed you are after, there are quicker options.
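The loop could be sketched as follows. To keep the output short, this sketch assumes the data have first been restricted to the three teams named earlier; `bb`, `fits`, and `years_since` are names I am making up for illustration.

```r
library(plyr)
data(baseball)
baseball$years_since <- baseball$year - min(baseball$year)

# Assumption for this sketch: keep only three teams so the output stays small
bb <- subset(baseball, team %in% c("BOS", "CHN", "CIN"))

# Identify the unique teams, then fit one model per team inside a loop
teams <- unique(as.character(bb$team))
fits <- list()
for (tm in teams) {
  fits[[tm]] <- glm(h ~ years_since, family = poisson,
                    data = bb[bb$team == tm, ])
}

# Summaries for every model, or for a single extracted element
all_summaries <- lapply(fits, summary)
summary(fits[["BOS"]])
```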

### 3. Using *split* and *lapply*

This process splits the dataset into a list of separate data frames, one per level of a variable, and then uses *lapply* to run the glm model across those data frames.
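A minimal sketch of the split-then-apply idea, again assuming the data are restricted to three teams and using the hypothetical names `bb`, `by_team`, and `years_since`.

```r
library(plyr)
data(baseball)
baseball$years_since <- baseball$year - min(baseball$year)
bb <- subset(baseball, team %in% c("BOS", "CHN", "CIN"))

# split() returns a named list with one data frame per team;
# drop = TRUE guards against empty groups if team were a factor
by_team <- split(bb, bb$team, drop = TRUE)

# Fit the same model to every data frame in the list
fits <- lapply(by_team, function(d) {
  glm(h ~ years_since, family = poisson, data = d)
})

# Collect the coefficients into one matrix (one column per team)
sapply(fits, coef)
```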

### 4. Using *by*

This process is similar to the one directly above, but does it in a single line of code. Essentially, as the R documentation for *by* puts it, the “data frame is split by row into data frames subsetted by the values of one or more factors, and function FUN is applied to each subset in turn.”
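A sketch of the same fit using *by*, under the same assumptions as before (three teams only, with `bb` and `years_since` as illustrative names).

```r
library(plyr)
data(baseball)
baseball$years_since <- baseball$year - min(baseball$year)
bb <- subset(baseball, team %in% c("BOS", "CHN", "CIN"))

# Split by team and fit the model to each piece in a single call
fits <- by(bb, bb$team, function(d) {
  glm(h ~ years_since, family = poisson, data = d)
})

# The result can be traversed like a list
sapply(fits, coef)
```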

### 5. Using *dlply*

One package I use in a lot of my analyses is the **plyr** package created by Hadley Wickham (author of **ggplot2**, another package I use daily). This package makes subsetting data and applying functions incredibly easy, and it is relatively fast.
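With *dlply*, the split and the model fit can be sketched as below. As in the earlier sketches, the restriction to three teams and the names `bb`, `fits`, `coefs`, and `years_since` are assumptions for illustration.

```r
library(plyr)
data(baseball)
baseball$years_since <- baseball$year - min(baseball$year)
bb <- subset(baseball, team %in% c("BOS", "CHN", "CIN"))

# dlply: take a data frame, split it by team, return a list of models
fits <- dlply(bb, .(team), function(d) {
  glm(h ~ years_since, family = poisson, data = d)
})

# ldply: take that list and return a data frame of coefficients, one row per team
coefs <- ldply(fits, coef)
coefs
```

Returning the coefficients as a data frame is one convenient way to compare the teams side by side.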

As with a lot of things in R, there are many different approaches to the same problem. I personally tend to shy away from the *lapply*, *mapply*, and *tapply* approaches and use the **plyr** package in most of my analyses.
