Case Study 3: GLMs for Count and Aggregate Loss Variables
Learning Objectives
- Select appropriate distributions and link functions for count and severity variables.
- Identify appropriate offsets and weights for count and severity variables.
- Implement GLMs for count and severity variables in R.
- Assess the quality of a Poisson GLM using the Pearson goodness-of-fit statistic.
- Combine the GLMs for count and severity variables to make predictions for an aggregate loss variable.
Background
Compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance, the dataset in this case study describes third-party automobile insurance claims for the year 1977 (third-party claims entail payments to someone other than the policyholder and the insurer).
This dataset involves grouped observations, with each row of the dataset corresponding to a collection of (homogeneous) policyholders sharing the same set of predictor values rather than a single policyholder. The variables include various automobile and policyholder characteristics such as the distance driven by a vehicle, geographic area, and recent driver claims experience (see Table 4.6 for the data dictionary). Unlike Section 4.3, where our interest is only the probability of making a claim, the outcomes of interest in this case study are twofold:
- The number of claims (the Claims variable)
- The sum of claim payments (the Payment variable)
Note that although our ultimate interest is the aggregate payments arising from an automobile policy, modeling the number of claims is an important stepping stone not only because doing so can shed light on the claims-generating mechanism, but also because the number of claims may serve as a valuable input for modeling aggregate payments, as we will see later.
Stage 1: Define the Business Problem
Objective
Our objectives in this case study are to identify important determinants of the two target variables and to use these determinants to make accurate predictions with GLMs. Even though the dataset we use dates back to 1977, our findings may have implications for automobile ratemaking in this day and age by helping automobile insurers set fair premiums based on useful rating variables.
Stage 2: Data Collection
Run CHUNK 1 to load the data as a data frame named Swedish and print a summary of each variable. There are 2182 observations and 7 variables. The summary does not suggest any missing or abnormal values.
# CHUNK 1
Swedish <- read.csv("SwedishMotorInsurance.csv", stringsAsFactors=TRUE)
summary(Swedish)
  Kilometres         Zone          Bonus            Make          Insured              Claims           Payment
 Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :     0.01   Min.   :   0.00   Min.   :       0
 1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:    21.61   1st Qu.:   1.00   1st Qu.:    2989
 Median :3.000   Median :4.00   Median :4.000   Median :5.000   Median :    81.53   Median :   5.00   Median :   27404
 Mean   :2.986   Mean   :3.97   Mean   :4.015   Mean   :4.992   Mean   :  1092.20   Mean   :  51.87   Mean   :  257008
 3rd Qu.:4.000   3rd Qu.:6.00   3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:   389.78   3rd Qu.:  21.00   3rd Qu.:  111954
 Max.   :5.000   Max.   :7.00   Max.   :7.000   Max.   :9.000   Max.   :127687.27   Max.   :3338.00   Max.   :18245026
| Variable | Description | Characteristics |
| --- | --- | --- |
| Kilometres | Distance driven by a vehicle in km per year, grouped into five categories (not exactly the distance driven, but a higher number does mean a longer distance) | Integer from 1 to 5 |
| Zone | Geographic zone of a vehicle | Integer from 1 to 7 |
| Bonus | No claims bonus, equal to the number of years, plus one, since the last claim | Integer from 1 to 7 |
| Make | Type of a vehicle | Integer from 1 to 9; classes 1 to 8 represent eight different common car models, and all other models are combined in class 9 |
| Insured | Number of policyholder years (fraction of the year that the policyholder has a contract with the issuing company) | Numeric from 0.01 to 127687.27 |
| Claims | Number of claims | Integer from 0 to 3338 |
| Payment | Sum of payments in Swedish kronor | Integer from 0 to 18245026 |
TASK 1: Select offsets

Select an appropriate offset for each of the two target variables, claim counts and aggregate payments.
There are two target variables in this case study:
- the number of claims (Claims), and
- the aggregate payments (Payment).
For the number of claims, it is reasonable to expect that it is proportional to the number of policyholder years (Insured), which provides an exposure basis.
Other things equal, the longer a policyholder is covered by an automobile insurance policy, the larger the number of claims submitted.
In the same vein, we expect that aggregate payments vary in proportion to the number of claims. Each claim generates a payment (see CHUNK 3) and it makes intuitive sense that with more claims submitted, the sum of payments becomes larger.
At this stage, we can conjecture that:
Insured may be a good offset for Claims, and Claims in turn may serve as a good offset for Payment.
In CHUNK 2, we construct two scatterplots, one for Claims against Insured and one for Payment against Claims.
# CHUNK 2
library(ggplot2)  # attach ggplot2 for plotting (not attached earlier in this section)
ggplot(Swedish, aes(x = Insured, y = Claims)) +
  geom_point()
ggplot(Swedish, aes(x = Claims, y = Payment)) +
  geom_point()
(Figure: scatterplots of Claims against Insured and of Payment against Claims)
Findings:
- It is clear that there is a strong proportional relationship between Claims and Insured and between Payment and Claims. This supports the use of Insured and Claims as offsets (to be included on a suitable scale) for Claims and Payment, respectively.
  - Offset for Claims: Insured
  - Offset for Payment: Claims
Although not strictly required by the task statement, we can make two additional observations involving Claims, Insured, and Payment:
1. Regarding the relationship between Payment and Claims, notice that if Claims is zero, then Payment must be zero as well. This is intuitively clear (no claims means no payments) and is confirmed in CHUNK 3, which shows that Payment is degenerate at zero in the subset of Swedish with no claims. Since claim payments are observed only when there are one or more claims, we have created (via logical subsetting) a subset of Swedish where Claims is strictly positive, called Swedish.severity, which will be useful for modeling claim payments later.
# CHUNK 3
summary(Swedish[Swedish$Claims == 0, ]$Payment)
Swedish.severity <- Swedish[Swedish$Claims > 0, ]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0
2. To facilitate later use, in CHUNK 4 we define the two offset-corrected target variables, called ClaimsPerinsured and PaymentPerClaim, which are the number of claims per policyholder year and the payment per claim, respectively.
Traditionally, in the actuarial literature PaymentPerClaim is referred to as claim severity, which is the size of one payment and not to be confused with the aggregate payments (what you learned in Exam STAM should help!).
Note that PaymentPerClaim is only defined on Swedish.severity (or else the division by zero caused by a zero Claims would create computational trouble).
# CHUNK 4
Swedish$ClaimsPerinsured <- Swedish$Claims / Swedish$Insured
Swedish.severity$PaymentPerClaim <- Swedish.severity$Payment / Swedish.severity$Claims
The following two scatterplots show that both the size and the variability of ClaimsPerinsured and PaymentPerClaim shrink significantly as Insured and Claims, respectively, increase, showing that averaging goes a long way towards stabilizing the two target variables. We will take advantage of these findings when we construct GLMs later in this case study.
# CHUNK 4 (cont.)
ggplot(Swedish, aes(x = Insured, y = ClaimsPerinsured)) +
  geom_point()
ggplot(Swedish.severity, aes(x = Claims, y = PaymentPerClaim)) +
  geom_point()
(Figure: scatterplots of ClaimsPerinsured against Insured and of PaymentPerClaim against Claims)
TASK 2: Edit and explore the data

Edit and explore the data to prepare for further analysis and identify potentially predictive variables.
Data Quality Issue
Factor Conversions
In the Swedish dataset, all variables are represented by numeric values. However, Zone and Make are merely numerical labels of the geographic zone and the type of a vehicle and do not convey any numerical order (e.g., there is no indication that Zone 2 is larger than Zone 1 in any sense). These two variables are therefore better converted to factors, as we do in CHUNK 5.
# CHUNK 5
Swedish$Zone <- as.factor(Swedish$Zone)
Swedish$Make <- as.factor(Swedish$Make)
Swedish.severity$Zone <- as.factor(Swedish.severity$Zone)
Swedish.severity$Make <- as.factor(Swedish.severity$Make)
As for Kilometres and Bonus, these two variables do have a numerical order (e.g., Kilometres = 2 indicates a longer distance traveled than Kilometres = 1), but they can also be converted to factors, with their levels represented by dummy variables, to capture their effects on the target variables more flexibly. We will leave them as numeric variables for now and decide whether changing them to factors is worthwhile when it comes to the modeling stage.
Stage 3: Exploratory Data Analysis
Univariate Exploration
Distributions of the Target Variables
In CHUNK 6, we explore the distributions of the two offset-corrected target variables, ClaimsPerinsured and PaymentPerClaim, using summary statistics (via the summary() function) in combination with graphical displays (histograms here).
# CHUNK 6
summary(Swedish$ClaimsPerinsured)
summary(Swedish.severity$PaymentPerClaim)
ggplot(Swedish, aes(x = ClaimsPerinsured)) +
  geom_histogram()
ggplot(Swedish.severity, aes(x = PaymentPerClaim)) +
  geom_histogram()
> summary(Swedish$ClaimsPerinsured)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02648 0.05123 0.06914 0.08906 1.66667
> summary(Swedish.severity$PaymentPerClaim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
72 2643 4375 5206 5985 31442
(Figure: histograms of ClaimsPerinsured and PaymentPerClaim)
Both ClaimsPerinsured and PaymentPerClaim are non-negative continuous variables (while Claims takes only non-negative integers, the division by Insured makes ClaimsPerinsured continuous). The former ranges from 0 (corresponding to no claims) to 1.66667 while the latter ranges from 72 to 31,442.
Even after adjusting for the offsets, the two target variables exhibit a pronounced right skew, especially PaymentPerClaim, as shown in the histograms and reflected by their means substantially exceeding their medians. We will take this right skew into account when developing GLMs later.
Bivariate Exploration
Potentially Useful Predictors
Having investigated the two target variables, we go on to perform bivariate data exploration in search of potentially predictive variables out of Kilometres, Bonus, Zone, and Make. CHUNK 7 provides the means and medians of the two offset-corrected target variables split by distinct values or levels of the four predictors. (Split boxplots can also be used here.)
# CHUNK 7
library(tidyverse)
vars <- c("Kilometres", "Bonus", "Zone", "Make")
for (i in vars) {
  x <- Swedish %>%
    group_by_(i) %>%
    summarise(
      mean = mean(ClaimsPerinsured),
      median = median(ClaimsPerinsured),
      n = n()
    )
  print(x)
}
# A tibble: 5 × 4
Kilometres mean median n
<int> <dbl> <dbl> <int>
1 1 0.0561 0.0425 439
2 2 0.0651 0.0488 441
3 3 0.0718 0.0552 441
4 4 0.0705 0.0555 434
5 5 0.0827 0.0574 427
# A tibble: 7 × 4
Bonus mean median n
<int> <dbl> <dbl> <int>
1 1 0.129 0.117 307
2 2 0.0792 0.0741 312
3 3 0.0676 0.0591 310
4 4 0.0659 0.0514 310
5 5 0.0550 0.0454 313
6 6 0.0524 0.0447 315
7 7 0.0364 0.0350 315
# A tibble: 7 × 4
Zone mean median n
<fct> <dbl> <dbl> <int>
1 1 0.104 0.0862 315
2 2 0.0795 0.0645 315
3 3 0.0722 0.0575 315
4 4 0.0575 0.0473 315
5 5 0.0626 0.0469 313
6 6 0.0569 0.0435 315
7 7 0.0504 0 294
# A tibble: 9 × 4
Make mean median n
<fct> <dbl> <dbl> <int>
1 1 0.0761 0.0614 245
2 2 0.0802 0.0607 245
3 3 0.0576 0.0389 242
4 4 0.0333 0.0216 238
5 5 0.0919 0.0691 244
6 6 0.0543 0.0430 244
7 7 0.0838 0.0555 242
8 8 0.0729 0.0487 237
9 9 0.0712 0.0630 245
# CHUNK 7 (cont.)
for (i in vars) {
  x <- Swedish.severity %>%
    group_by_(i) %>%
    summarise(
      mean = mean(PaymentPerClaim),
      median = median(PaymentPerClaim),
      n = n()
    )
  print(x)
}
# A tibble: 5 × 4
Kilometres mean median n
<int> <dbl> <dbl> <int>
1 1 4912. 4239. 381
2 2 5068. 4650. 399
3 3 5445. 4494. 378
4 4 5065. 3901. 328
5 5 5601. 4207. 311
# A tibble: 7 × 4
Bonus mean median n
<int> <dbl> <dbl> <int>
1 1 4913. 4195. 252
2 2 5451. 4238. 246
3 3 4899. 3974. 246
4 4 5257. 4095. 240
5 5 5188. 3900 247
6 6 5251. 4464. 270
7 7 5440. 5107. 296
# A tibble: 7 × 4
Zone mean median n
<fct> <dbl> <dbl> <int>
1 1 4959. 4399. 295
2 2 4569. 4142. 295
3 3 5167. 4500. 293
4 4 5411. 4904. 306
5 5 5625. 4145. 236
6 6 5702. 4775. 264
7 7 5016. 2605. 108
# A tibble: 9 × 4
Make mean median n
<fct> <dbl> <dbl> <int>
1 1 5083. 4828. 227
2 2 4854. 4313. 205
3 3 5950. 4438. 178
4 4 4877. 3367 155
5 5 4986. 3797. 206
6 6 5193. 4482. 212
7 7 4759. 3429 197
8 8 6603. 4387 173
9 9 4849. 4791. 244
# Remove redundant variables
rm(i, vars, x)
- ClaimsPerinsured: It appears that ClaimsPerinsured has a monotonic relationship with both Kilometres and Bonus, increasing in Kilometres and decreasing in Bonus. These findings make intuitive sense, as a longer distance traveled indicates a more worn-out vehicle and a smaller Bonus suggests poorer driving experience of a policyholder. The average of ClaimsPerinsured is also particularly low in certain geographic zones, such as 4, 5, 6, and 7, and for certain car models, such as 3, 4, and 6.
- PaymentPerClaim: The effects of the four predictors on PaymentPerClaim are not as pronounced as for ClaimsPerinsured. Unlike ClaimsPerinsured, Kilometres and Bonus do not seem to exert a monotonic effect on PaymentPerClaim. We can see that PaymentPerClaim has relatively close means and medians across different values of Kilometres and Bonus, relatively high means and medians for Zone 4 and Zone 6, and relatively low medians for cars of Make 4 and Make 7.
TASK 3: Construct a GLM for claim counts
Stage 4: Model Construction and Selection
Creation of Training and Test sets
Before constructing GLMs for Claims and Payment, we perform the usual step of splitting the data into training and test sets. The training set (70% of the data) will be used for fitting the GLMs and the test set (30% of the data) for evaluating their prediction accuracy. There is a subtlety in the current context: with two target variables, which one should we base the stratified sampling on when making the training/test split?
In CHUNK 8, we do the stratified sampling based on the number of claims per insured, ClaimsPerinsured, and you will see why very shortly.
# CHUNK 8
library(caret)
set.seed(14)
train_ind <- createDataPartition(Swedish$ClaimsPerinsured, p = 0.7, list = FALSE)
data.train <- Swedish[train_ind, ]
data.test <- Swedish[-train_ind, ]
# Take out the severity portions of the training and test sets
data.train.severity <- data.train[data.train$Claims > 0, ]
data.train.severity$PaymentPerClaim <- data.train.severity$Payment / data.train.severity$Claims
data.test.severity <- data.test[data.test$Claims > 0, ]
data.test.severity$PaymentPerClaim <- data.test.severity$Payment / data.test.severity$Claims
print("TRAIN")
mean(data.train$ClaimsPerinsured)
mean(data.train.severity$PaymentPerClaim)
print("TEST")
mean(data.test$ClaimsPerinsured)
mean(data.test.severity$PaymentPerClaim)
> print("TRAIN")
[1] "TRAIN"
> mean(data.train$ClaimsPerinsured)
[1] 0.06896561
> mean(data.train.severity$PaymentPerClaim)
[1] 5253.53
> print("TEST")
[1] "TEST"
> mean(data.test$ClaimsPerinsured)
[1] 0.06955017
> mean(data.test.severity$PaymentPerClaim)
[1] 5094.656
The average number of claims per insured on the training set and that on the test set are very close to each other, which is a result of the built-in stratification. To our delight, the two sets also produce a similar average claim severity, which can be attributed to the strong proportional relationship between Payment and Claims we saw in CHUNK 2 and the comparable mean values of ClaimsPerinsured on the training and test sets.
Fitting the Model
A GLM for Claims with Kilometres and Bonus as numeric variables.
Because Claims is a non-negative integer-valued variable with a right skew, the Poisson distribution is a suitable target distribution. We will adopt the log link, which is most commonly employed in conjunction with the Poisson distribution as it ensures that the model predictions are always non-negative, makes for an interpretable model, and is the canonical link.
In CHUNK 9, we fit a log-link Poisson GLM for Claims on the training set using Kilometres, Zone, Bonus, and Make as predictors and the logarithm of Insured as an offset. Here Kilometres and Bonus are treated as numeric variables. The GLM is saved as an object named count.numeric to distinguish it from other GLMs we will fit later.
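In equation form, the log link combined with the log(Insured) offset means that the model being fitted is
\(\ln \operatorname{E}[\text{Claims}] = \ln(\text{Insured}) + \eta\), or equivalently \(\operatorname{E}[\text{Claims}] = \text{Insured}\times e^{\eta}\),
where \(\eta\) is the linear predictor formed from Kilometres, Zone, Bonus, and Make. This is the relationship we will use when checking predictions by hand in CHUNK 18.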
# CHUNK 9
count.numeric <- glm(Claims ~ Kilometres + Zone + Bonus + Make,
data = data.train,
offset = log(Insured),
family = poisson)
summary(count.numeric)
Call:
glm(formula = Claims ~ Kilometres + Zone + Bonus + Make, family = poisson,
data = data.train, offset = log(Insured))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.813830 0.016746 -108.311 < 2e-16 ***
Kilometres 0.131218 0.003028 43.330 < 2e-16 ***
Zone2 -0.230780 0.011698 -19.728 < 2e-16 ***
Zone3 -0.382588 0.012185 -31.397 < 2e-16 ***
Zone4 -0.573191 0.010981 -52.197 < 2e-16 ***
Zone5 -0.328492 0.018044 -18.205 < 2e-16 ***
Zone6 -0.522192 0.014480 -36.063 < 2e-16 ***
Zone7 -0.727917 0.050782 -14.334 < 2e-16 ***
Bonus -0.201320 0.001609 -125.103 < 2e-16 ***
Make2 0.047689 0.026055 1.830 0.0672 .
Make3 -0.268146 0.033077 -8.107 5.20e-16 ***
Make4 -0.687657 0.027191 -25.290 < 2e-16 ***
Make5 0.154635 0.023003 6.722 1.79e-11 ***
Make6 -0.350481 0.020309 -17.258 < 2e-16 ***
Make7 -0.061202 0.027085 -2.260 0.0238 *
Make8 -0.074676 0.040067 -1.864 0.0624 .
Make9 -0.069484 0.011548 -6.017 1.78e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 23800.5 on 1529 degrees of freedom
Residual deviance: 2681.6 on 1513 degrees of freedom
AIC: 8059.4
Number of Fisher Scoring iterations: 4
Note that:
- In the formula argument of the glm() function, we deliberately exclude Payment and ClaimsPerinsured, which cannot serve as predictors of Claims, as well as Insured, which will play its role in the offset argument.
- The family argument of the glm() function is set to poisson (not Poisson!) to specify a Poisson GLM. The log link is the canonical (and default) link for the Poisson distribution and so does not have to be specified explicitly. (If you prefer, you can replace family = poisson by family = poisson() or family = poisson(link = "log").)
- Do remember to put Insured on the log scale, hence the offset = log(Insured) command (not offset = Insured!).
The model summary shows that the coefficient estimates for Kilometres and Bonus are positive and negative, respectively, consistent with what we saw in CHUNK 7. All features, including the individual factor levels of Zone and Make (except for Make levels 2 and 8), are statistically significant, indicating that they are valuable for explaining Claims; there is no pressing need for feature removal.
Digression: Can we model ClaimsPerinsured directly?
Note that because of the log link, the equation of the Poisson GLM we have just fitted is equivalent to:
\(\ln(\text{ClaimsPerInsured})=\ln\!\left(\dfrac{\text{Claims}}{\text{Insured}}\right)=\eta\),
where η is the linear predictor comprising Kilometres, Zone, Bonus, and Make. You may wonder:
To save work, can we instead fit a log-link Poisson GLM directly to ClaimsPerinsured without having to supply log(Insured) as an offset?
This is explored in CHUNK 10.
# CHUNK 10
freq <- glm(ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make,
data = data.train,
weight = Insured,
family = poisson)
summary(freq)
> summary(freq)
Call:
glm(formula = ClaimsPerinsured ~ Kilometres + Zone + Bonus +
Make, family = poisson, data = data.train, weights = Insured)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.813830 0.016747 -108.311 < 2e-16 ***
Kilometres 0.131218 0.003028 43.330 < 2e-16 ***
Zone2 -0.230780 0.011698 -19.728 < 2e-16 ***
Zone3 -0.382588 0.012185 -31.397 < 2e-16 ***
Zone4 -0.573191 0.010981 -52.197 < 2e-16 ***
Zone5 -0.328492 0.018044 -18.205 < 2e-16 ***
Zone6 -0.522192 0.014480 -36.063 < 2e-16 ***
Zone7 -0.727917 0.050784 -14.334 < 2e-16 ***
Bonus -0.201320 0.001609 -125.103 < 2e-16 ***
Make2 0.047689 0.026055 1.830 0.0672 .
Make3 -0.268146 0.033077 -8.107 5.20e-16 ***
Make4 -0.687657 0.027191 -25.290 < 2e-16 ***
Make5 0.154635 0.023003 6.722 1.79e-11 ***
Make6 -0.350481 0.020309 -17.258 < 2e-16 ***
Make7 -0.061202 0.027085 -2.260 0.0238 *
Make8 -0.074676 0.040067 -1.864 0.0624 .
Make9 -0.069484 0.011548 -6.017 1.78e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 23800.5 on 1529 degrees of freedom
Residual deviance: 2681.6 on 1513 degrees of freedom
AIC: Inf
Number of Fisher Scoring iterations: 6
Notice that:
- As opposed to CHUNK 9, the target variable in the formula argument of the glm() function becomes ClaimsPerinsured and no offset is specified.
- Due to the averaging by Insured, the variability of ClaimsPerinsured is lowered and decreases with the value of Insured, as we saw in CHUNK 4. To reflect the different precision of the observations caused by different values of Insured, the command weight = Insured is in place to insert Insured as a weight in the model fitting procedure. Observations with a higher value of Insured will receive a higher weight.
With one exception, the summary for this model is identical to that in CHUNK 9, indicating that the two fitted models are essentially the same (the equivalence was first noted on page 296).
That notable exception is that the AIC of the model for ClaimsPerinsured (the frequency model) is Inf while that of the model for Claims (the count model) is a normal value. When running CHUNK 10 in RStudio, you should also see various warning messages that stem from the non-integer values of ClaimsPerinsured (whereas a Poisson random variable only takes non-negative integer values). This inconsistency between the nature of ClaimsPerinsured and a genuine Poisson random variable prohibits the glm() function from computing the AIC of the frequency model. As a result, no feature selection based on the AIC can be performed.
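To see where the warnings and the infinite AIC come from, note that the Poisson probability mass function is zero at non-integer points. A two-line illustration (not part of the original chunks):
# Illustration only: the Poisson density at a non-integer point is 0 (with a warning),
# so its logarithm is -Inf and the AIC, which is based on the log-likelihood, degenerates to Inf.
dpois(0.5, lambda = 0.5)              # 0, with a "non-integer x" warning
dpois(0.5, lambda = 0.5, log = TRUE)  # -Inf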
Since the count model and the frequency model produce identical fitted results and the former is much easier to work with, there is no reason to dwell on the latter and we will no longer consider the frequency model in the rest of this case study.
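Before moving on, a quick numerical confirmation of the equivalence (a sketch, not part of the original chunks) is to compare the estimated coefficients of the two fitted objects directly:
# Not part of the original chunks: confirm that the count model (CHUNK 9) and
# the frequency model (CHUNK 10) have essentially identical coefficient estimates.
max(abs(coef(count.numeric) - coef(freq)))
# A value very close to zero (any difference is purely numerical) confirms the equivalence.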
A GLM for Claims with Kilometres and Bonus as factor variables. Now we refit the Poisson count model with Kilometres and Bonus treated as factor variables and name the fitted model as count.factor. Notice that Kilometres and Bonus are replaced by factor(Kilometres) and factor(Bonus) in the model formula.
# CHUNK 11
count.factor <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
data = data.train,
offset = log(Insured),
family = poisson)
summary(count.factor)
Call:
glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) +
Make, family = poisson, data = data.train, offset = log(Insured))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.793205 0.016512 -108.598 < 2e-16 ***
factor(Kilometres)2 0.194910 0.010023 19.446 < 2e-16 ***
factor(Kilometres)3 0.294007 0.009877 29.766 < 2e-16 ***
factor(Kilometres)4 0.370948 0.014373 25.809 < 2e-16 ***
factor(Kilometres)5 0.559839 0.014991 37.345 < 2e-16 ***
Zone2 -0.239285 0.011750 -20.364 < 2e-16 ***
Zone3 -0.382898 0.012331 -31.053 < 2e-16 ***
Zone4 -0.574708 0.011185 -51.382 < 2e-16 ***
Zone5 -0.337239 0.018218 -18.511 < 2e-16 ***
Zone6 -0.532676 0.014746 -36.123 < 2e-16 ***
Zone7 -0.708291 0.050845 -13.930 < 2e-16 ***
factor(Bonus)2 -0.453835 0.015951 -28.453 < 2e-16 ***
factor(Bonus)3 -0.699796 0.017949 -38.987 < 2e-16 ***
factor(Bonus)4 -0.823508 0.018985 -43.376 < 2e-16 ***
factor(Bonus)5 -0.906880 0.017397 -52.129 < 2e-16 ***
factor(Bonus)6 -0.983759 0.014285 -68.864 < 2e-16 ***
factor(Bonus)7 -1.332578 0.010936 -121.853 < 2e-16 ***
Make2 0.049512 0.026094 1.897 0.0578 .
Make3 -0.269633 0.033111 -8.143 3.85e-16 ***
Make4 -0.661879 0.027371 -24.182 < 2e-16 ***
Make5 0.152141 0.023016 6.610 3.84e-11 ***
Make6 -0.342875 0.020333 -16.863 < 2e-16 ***
Make7 -0.059942 0.027092 -2.213 0.0269 *
Make8 -0.086094 0.040131 -2.145 0.0319 *
Make9 -0.067240 0.011726 -5.734 9.80e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 23800.5 on 1529 degrees of freedom
Residual deviance: 2016.8 on 1505 degrees of freedom
AIC: 7410.6
Number of Fisher Scoring iterations: 4
In the model summary, Kilometres and Bonus are now replaced by a collection of dummy variables representing their distinct levels other than the baseline levels. The signs and magnitudes of the coefficient estimates of these dummy variables are in agreement with the coefficient estimates of Kilometres and Bonus in CHUNK 9. The estimates of the dummy variables of Kilometres are all positive and increase in value from level 2 to level 5, while the opposite relationship is observed for those of Bonus from level 2 to level 7, consistent with the monotonic effects of Kilometres and Bonus on ClaimsPerinsured we observed earlier. The binarization of Kilometres and Bonus, however, allows us to capture the effects of these two variables more effectively across different factor levels, at the cost of an increase in the number of features in the model and computational burden.
Using the Pearson statistic as an arbitrator
Comparing CHUNKs 9 and 11, we can see that the GLM with binarized Kilometres and Bonus has a lower AIC and therefore appears to be the better model. The superiority of binarizing Kilometres and Bonus is also reflected by the Pearson goodness-of-fit statistic, defined by:
\(\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{(y_i-\hat{y}_i)^2}{\hat{y}_i},\)
where \(y_i\) is the ith observation of the training or test set, \(\hat{y}_i\) is the corresponding fitted or predicted value, and \(n\) is the number of observations in the set.
Featured in the June 16, 2020 PA exam and covered in Exam SRM, the Pearson statistic is commonly used to assess the goodness of fit of a count model. It takes the square of the discrepancy between the observed value and the predicted value but unlike the MSE, it scales the squared discrepancy by the predicted value to avoid the statistic being disproportionately dominated by large predicted values, which are not uncommon with right-skewed data. Like the AIC, the smaller the Pearson statistic, the better the model.
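Since we compute this statistic for several models, a small helper function can be convenient. A minimal sketch (the name pearson_stat is ours, not part of the original chunks):
# A reusable helper for the Pearson goodness-of-fit statistic.
# The function name pearson_stat is illustrative only.
pearson_stat <- function(observed, predicted) {
  mean((observed - predicted)^2 / predicted)
}
For example, pearson_stat(data.test$Claims, count.factor.pred) reproduces the second calculation in CHUNK 12 below.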
In CHUNK 12, we generate the predictions of the two Poisson GLMs on the test set and evaluate the Pearson goodness-of-fit statistic for both models. Note that the predictions are for Claims, not ClaimsPerinsured, and have taken the offset Insured into account.
# CHUNK 12
count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)

count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
> count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
> sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)
[1] 2.311969
> count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
> sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
[1] 1.602462
In congruence with the AIC-based rankings, treating Kilometres and Bonus as factor variables leads to a lower Pearson statistic on the test set and appears a judicious move. The increase in flexibility offered by the dummy variables outweighs the increase in model complexity.
TASK 4: Construct a GLM for claim severity
A GLM for claim severity
We now turn to building GLMs for claim severity, or PaymentPerClaim. As CHUNK 6 showed, PaymentPerClaim is a positive, continuous variable with a right skew, so a gamma distribution is a good choice for the target distribution (inverse Gaussian is another choice).
Furthermore, CHUNK 4 indicated that the variability of PaymentPerClaim decreases with the value of Claims. This motivates the use of Claims as a weight when a gamma GLM is fitted for PaymentPerClaim.
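The rationale, in symbols (a standard argument, not spelled out in the text): if each individual claim is assumed to follow a gamma distribution with mean \(\mu\) and dispersion \(\phi\), then the average of Claims such payments in a row of grouped data has variance
\(\operatorname{Var}(\text{PaymentPerClaim}) = \dfrac{\phi\,\mu^{2}}{\text{Claims}},\)
so rows with more claims are measured more precisely, which is exactly what supplying Claims as a weight communicates to glm().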
In CHUNK 13, we fit a log-link gamma GLM for PaymentPerClaim on the severity portion of the training set with Claims serving as a weight. Consistent with the claim count model in CHUNK 11, we are binarizing Kilometres and Bonus. (If you are interested, feel free to compare the performance of the GLM with numeric Kilometres and Bonus and with binarized Kilometres and Bonus. Such a comparison is omitted here to reduce repetition.)
# CHUNK 13
severity <- glm(PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
data = data.train.severity,
weight = Claims,
family = Gamma(link = "log"))
summary(severity)
Call:
glm(formula = PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) +
Make, family = Gamma(link = "log"), data = data.train.severity,
weights = Claims)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.39309 0.02801 299.694 < 2e-16 ***
factor(Kilometres)2 0.01253 0.01722 0.728 0.466926
factor(Kilometres)3 0.01459 0.01703 0.857 0.391764
factor(Kilometres)4 0.03893 0.02481 1.569 0.116860
factor(Kilometres)5 0.03361 0.02584 1.301 0.193644
Zone2 0.04370 0.02023 2.160 0.030966 *
Zone3 0.04870 0.02122 2.294 0.021930 *
Zone4 0.13249 0.01932 6.857 1.11e-11 ***
Zone5 0.04064 0.03141 1.294 0.196051
Zone6 0.15418 0.02541 6.068 1.72e-09 ***
Zone7 0.04576 0.08753 0.523 0.601199
factor(Bonus)2 0.06609 0.02748 2.405 0.016312 *
factor(Bonus)3 0.08396 0.03094 2.713 0.006757 **
factor(Bonus)4 0.08018 0.03274 2.449 0.014445 *
factor(Bonus)5 0.06293 0.03001 2.097 0.036201 *
factor(Bonus)6 0.08939 0.02465 3.627 0.000298 ***
factor(Bonus)7 0.12111 0.01887 6.420 1.94e-10 ***
Make2 -0.03875 0.04498 -0.861 0.389150
Make3 0.05197 0.05710 0.910 0.362879
Make4 -0.20358 0.04691 -4.340 1.54e-05 ***
Make5 -0.09568 0.03963 -2.414 0.015907 *
Make6 -0.05159 0.03501 -1.474 0.140771
Make7 -0.11042 0.04666 -2.367 0.018099 *
Make8 0.25276 0.06912 3.657 0.000266 ***
Make9 -0.05821 0.02014 -2.890 0.003915 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 2.964663)
Null deviance: 3801.1 on 1259 degrees of freedom
Residual deviance: 3185.2 on 1235 degrees of freedom
AIC: 1257914
Number of Fisher Scoring iterations: 5
Most features in the fitted GLM are statistically significant; the notable exception is Kilometres, which is removed when a backward stepwise selection based on the AIC is implemented, as shown in CHUNK 14. (For simplicity, we do not perform explicit binarization.)
# CHUNK 14
library(MASS)
severity.final <- stepAIC(severity)
Start: AIC=1257914
PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) +
Make
Df Deviance AIC
- factor(Kilometres) 4 3195.4 1257909
<none> 3185.2 1257914
- factor(Bonus) 6 3312.1 1257945
- Make 8 3334.2 1257948
- Zone 6 3396.3 1257973
Step: AIC=1258151
PaymentPerClaim ~ Zone + factor(Bonus) + Make
summary(severity.final)
Call:
glm(formula = PaymentPerClaim ~ Zone + factor(Bonus) + Make,
family = Gamma(link = "log"), data = data.train.severity,
weights = Claims)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.40279 0.02680 313.506 < 2e-16 ***
Zone2 0.04430 0.02026 2.186 0.028972 *
Zone3 0.04808 0.02121 2.267 0.023554 *
Zone4 0.13333 0.01912 6.972 5.06e-12 ***
Zone5 0.03985 0.03123 1.276 0.202207
Zone6 0.15619 0.02510 6.222 6.71e-10 ***
Zone7 0.04550 0.08779 0.518 0.604383
factor(Bonus)2 0.06995 0.02748 2.546 0.011023 *
factor(Bonus)3 0.08940 0.03087 2.896 0.003844 **
factor(Bonus)4 0.08657 0.03262 2.654 0.008062 **
factor(Bonus)5 0.06956 0.02986 2.329 0.020009 *
factor(Bonus)6 0.09553 0.02444 3.909 9.79e-05 ***
factor(Bonus)7 0.12545 0.01875 6.692 3.32e-11 ***
Make2 -0.03607 0.04503 -0.801 0.423242
Make3 0.05835 0.05709 1.022 0.306992
Make4 -0.21254 0.04663 -4.558 5.66e-06 ***
Make5 -0.09632 0.03975 -2.423 0.015542 *
Make6 -0.05523 0.03504 -1.576 0.115251
Make7 -0.11148 0.04680 -2.382 0.017376 *
Make8 0.25775 0.06916 3.727 0.000203 ***
Make9 -0.05990 0.01993 -3.006 0.002702 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 2.985245)
Null deviance: 3801.1 on 1259 degrees of freedom
Residual deviance: 3195.4 on 1239 degrees of freedom
AIC: 1258151
Number of Fisher Scoring iterations: 5
Lastly, in CHUNK 15 we apply the reduced GLM to produce predictions for Payment on the severity portion of the test set. Because the target variable of the model is PaymentPerClaim, we have to multiply the model predictions by Claims on the test set to get the predictions for Payment.
# CHUNK 15
severity.final.pred <- data.test.severity$Claims *
  predict(severity.final, newdata = data.test.severity, type = "response")
RMSE(data.test.severity$Payment, severity.final.pred)
RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
> severity.final.pred <- data.test.severity$Claims *
+   predict(severity.final, newdata = data.test.severity, type = "response")
> RMSE(data.test.severity$Payment, severity.final.pred)
[1] 69882.03
> RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
[1] 1108407
The test RMSE is 69,882.03, which seems large. To put this value in perspective, we compare it with the test RMSE if we use the mean of Payment on the training set as the predicted value. The test RMSE in this case becomes 1,108,407, a much larger value. Our gamma GLM is therefore capable of reducing the test RMSE to only about 6.30% of the RMSE when no predictors are available.
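For reference, the test RMSE computed by the RMSE() function (from the caret package loaded in CHUNK 8) is
\(\text{RMSE}=\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},\)
and the quoted percentage is simply \(69{,}882.03/1{,}108{,}407\approx 6.30\%\).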
Predictions
TASK 5: Combine the two models to make a prediction for aggregate payments

Run the recommended GLMs in Tasks 3 and 4 on the full dataset.
Combining frequency and severity GLMs to predict aggregate payments
Let’s refit the two GLMs in Tasks 3 and 4 to the full Swedish dataset. Here is the summary output.
# CHUNK 16
count.full <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
data = Swedish,
offset = log(Insured),
family = poisson)
summary(count.full)
Call:
glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) +
Make, family = poisson, data = Swedish, offset = log(Insured))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.812840 0.013757 -131.775 < 2e-16 ***
factor(Kilometres)2 0.212586 0.007524 28.255 < 2e-16 ***
factor(Kilometres)3 0.320226 0.008661 36.974 < 2e-16 ***
factor(Kilometres)4 0.404657 0.012054 33.571 < 2e-16 ***
factor(Kilometres)5 0.575954 0.012830 44.892 < 2e-16 ***
Zone2 -0.238168 0.009496 -25.082 < 2e-16 ***
Zone3 -0.386395 0.009670 -39.959 < 2e-16 ***
Zone4 -0.581902 0.008654 -67.243 < 2e-16 ***
Zone5 -0.326128 0.014530 -22.446 < 2e-16 ***
Zone6 -0.526234 0.011877 -44.309 < 2e-16 ***
Zone7 -0.730999 0.040698 -17.962 < 2e-16 ***
factor(Bonus)2 -0.478993 0.012094 -39.607 < 2e-16 ***
factor(Bonus)3 -0.693172 0.013508 -51.316 < 2e-16 ***
factor(Bonus)4 -0.827397 0.014584 -56.735 < 2e-16 ***
factor(Bonus)5 -0.925632 0.013968 -66.269 < 2e-16 ***
factor(Bonus)6 -0.993457 0.011629 -85.429 < 2e-16 ***
factor(Bonus)7 -1.327406 0.008685 -152.845 < 2e-16 ***
Make2 0.076245 0.021239 3.590 0.000331 ***
Make3 -0.247413 0.025094 -9.859 < 2e-16 ***
Make4 -0.653524 0.024185 -27.022 < 2e-16 ***
Make5 0.154924 0.020235 7.656 1.91e-14 ***
Make6 -0.335581 0.017375 -19.314 < 2e-16 ***
Make7 -0.055940 0.023343 -2.396 0.016554 *
Make8 -0.043933 0.031604 -1.390 0.164493
Make9 -0.068054 0.009956 -6.836 8.17e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 34070.6 on 2181 degrees of freedom
Residual deviance: 2966.1 on 2157 degrees of freedom
AIC: 10654
Number of Fisher Scoring iterations: 4
severity.full <- glm(Payment/Claims ~ Zone + factor(Bonus) + Make,
data = Swedish.severity,
weight = Claims,
family = Gamma(link = "log"))
summary(severity.full)
Call:
glm(formula = Payment/Claims ~ Zone + factor(Bonus) + Make, family = Gamma(link = "log"),
data = Swedish.severity, weights = Claims)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.41085 0.02222 378.570 < 2e-16 ***
Zone2 0.02297 0.01640 1.401 0.161428
Zone3 0.04770 0.01670 2.855 0.004348 **
Zone4 0.12963 0.01497 8.661 < 2e-16 ***
Zone5 0.05073 0.02509 2.022 0.043293 *
Zone6 0.14652 0.02053 7.137 1.39e-12 ***
Zone7 0.02259 0.07025 0.321 0.747879
factor(Bonus)2 0.04698 0.02085 2.254 0.024346 *
factor(Bonus)3 0.07390 0.02325 3.178 0.001508 **
factor(Bonus)4 0.06240 0.02507 2.489 0.012893 *
factor(Bonus)5 0.03987 0.02396 1.664 0.096308 .
factor(Bonus)6 0.07606 0.01987 3.828 0.000134 ***
factor(Bonus)7 0.12130 0.01480 8.194 4.78e-16 ***
Make2 -0.03177 0.03664 -0.867 0.386081
Make3 0.08932 0.04327 2.064 0.039159 *
Make4 -0.17496 0.04138 -4.228 2.48e-05 ***
Make5 -0.08771 0.03493 -2.511 0.012120 *
Make6 -0.04243 0.02995 -1.417 0.156670
Make7 -0.12068 0.04030 -2.995 0.002784 **
Make8 0.21863 0.05442 4.017 6.14e-05 ***
Make9 -0.05673 0.01711 -3.316 0.000931 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 2.979105)
Null deviance: 5417.7 on 1796 degrees of freedom
Residual deviance: 4547.3 on 1776 degrees of freedom
AIC: 1878541
Number of Fisher Scoring iterations: 5
Note that the predictions made in CHUNK 15 on the severity portion of the test set for aggregate payments were made assuming that the values of Claims were available. In practice, one would predict the aggregate payments of a prospective policyholder based only on his/her rating variables, without knowing the number of claims he/she will submit; the number of claims will not be available until the policy term ends.
To illustrate how to predict the aggregate payments for a future policyholder, whose Claims is yet to be observed, we consider the sample policyholder described in the task statement. According to the data dictionary, the combination of feature values is:
| Kilometres | Zone | Bonus | Make | Insured |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | 6 | 350 |
This policyholder is created in CHUNK 17. Note that Zone and Make are supplied as character values (with their levels in quotes) because we converted them to factors in CHUNK 5, before the GLMs were fitted. Kilometres and Bonus, however, are supplied as numbers because they are turned into factors inside the formulas of the GLMs.
# CHUNK 17
sample <- data.frame(Kilometres = 2, Zone = "1", Bonus = 1, Make = "6", Insured = 350)
sample
  Kilometres Zone Bonus Make Insured
1          2    1     1    6     350
To predict the aggregate payments for this sample policyholder, we will apply the two GLMs to make predictions for the number of claims and claim severity separately, then combine the two predictions to form an overall prediction.
Step 1. Let’s start with predicting the number of claims.
# CHUNK 18
# Predicted number of claims
N.pred <- predict(count.full, newdata = sample, type = "response")
N.pred
1 50.50629
To check the prediction manually, we use the coefficient estimates in CHUNK 16 to compute the predicted linear predictor for the given policyholder:
\(\eta = -1.812840 + 0.212586 + 0 + 0 - 0.335581 = -1.935835.\)
Together with the offset, the predicted value of Claims is \(N = 350\,e^{-1.935835} = 50.5063\), which agrees with the value produced by R.
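The same check can be reproduced in R by extracting the relevant coefficients from count.full (a sketch, not part of the original chunks; Zone "1" and Bonus 1 are baseline levels and contribute nothing to \(\eta\)):
# Sketch: reproduce the manual computation from the fitted coefficients in CHUNK 16.
eta <- sum(coef(count.full)[c("(Intercept)", "factor(Kilometres)2", "Make6")])
350 * exp(eta)  # approximately 50.506, matching N.pred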
Step 2. Next, we turn to the predicted claim severity.
# CHUNK 18 (cont.)
# Predicted claim severity
X.pred <- predict(severity.full, newdata = sample, type = "response")
X.pred
1 4308.826
The claim severity GLM says that 4,308.826 is the predicted size of one claim for this sample policyholder.
Step 3. Finally, we multiply the two predictions to get the predicted aggregate payments.
# CHUNK 18 (cont.)
# Predicted aggregate payments
S.pred <- N.pred * X.pred
S.pred
1 217622.8
An alternative prediction method
In the procedure above, we are basically decomposing the aggregate payments into two components, claim counts and claim severity, and predicting these two variables separately using different GLMs. Then we multiply the two predictions to form the overall prediction for the aggregate payments, using the following formula you may have learned from Exam STAM:
\(\operatorname{E}[\text{aggregate payments}] = \operatorname{E}[\text{number of claims}] \times \operatorname{E}[\text{claim severity}],\)
which relies on the important assumption that claim counts and claim severity are independent random variables.
An alternative is to fit a target distribution like the Tweedie distribution, discussed briefly in Subsection 4.1.1, to the aggregate payments (as a single target variable) directly and to make predictions using that single model (see the sketch after the list below). This alternative modeling approach has the following advantages:
- It is more efficient to fit and maintain a single (possibly more complex) model than two separate models.
- It does not require the independence between claim counts and claim severity.
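For the curious, here is a minimal sketch of what the direct approach might look like, assuming the statmod package (which supplies a tweedie() family for glm() and, as noted below, is not available on the exam) is installed; the variance power of 1.5 is an arbitrary illustrative choice between 1 (Poisson) and 2 (gamma):
# Sketch only, assuming the statmod package is installed; not part of the case study.
library(statmod)
tweedie.fit <- glm(Payment ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                   data = Swedish,
                   offset = log(Insured),  # exposure still enters as an offset
                   family = tweedie(var.power = 1.5, link.power = 0))  # link.power = 0 gives the log link
summary(tweedie.fit)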
EXAM NOTE
As we mentioned in Subsection 4.1.1, it is unlikely that you will actually fit a Tweedie distribution in Exam PA because the R package for Tweedie will not be available on the exam. A conceptual exam item, however, may ask you to describe the Tweedie distribution and how it can serve as an alternative for modeling aggregate payments.





