[mathjax]
Frequency Distributions-Exposure and Coverage Modifications
Exposure Modifications
Exposure can refer to the number of members of the insured group or the number of time periods (e.g., years) they are insured for. Doubling the size of the group or the period of coverage will double the expected number of claims.
Frequency Model
Suppose the model is based on \(n_1\) exposures, and you now want a model for \(n_2\) exposures.
Distribution | Before Exposure Modification | After Exposure Modification |
Poisson | \(N\sim Poi(\lambda)\) | \(N\sim Poi(\lambda \dfrac{n_2}{n_1})\) |
Negative Binomial | \(N\sim NB(r,\beta)\) | \(N\sim NB(r\dfrac{n_2}{n_1},\beta)\) |
Binomial | \(N\sim Bin(m,q)\) | \(N\sim Bin(m\dfrac{n_2}{n_1},q)\) |
Coverage/Severity Modifications
The most common type of coverage modification is changing the deductible, so that the number of claims for amounts greater than zero changes. Another example would be uniform inflation, which would affect frequency if there’s a deductible.
Aggregate Losses
If you need to calculate aggregate losses, and don’t care about payment frequency, one way to handle a coverage modification is to model the number of losses (rather than the number of paid claims), in which case no modification is needed for the frequency model. Instead, use the payment per loss random variable, \(Y^L\). This variable is adjusted for the coverage modification and may be zero with positive probability.
The \((a,b,0)\) Class
The frequency distribution is modified. The modified frequency, the frequency of positive claims, has the same form as the original frequency, but with different parameters. If you need to calculate aggregate losses, the modified frequency is used in conjunction with the modified payment per payment random variable, \(Y^P\).
Suppose the probability of paying a claim, i.e., severity being greater than the deductible, is \(v\).
Distribution | Before Frequency Modification | After Frequency Modification |
Poisson | \(N\sim Poi(\lambda)\) | \(N\sim Poi(v\lambda)\) |
Negative Binomial | \(N\sim NB(r,\beta)\) | \(N\sim NB(r,v\beta)\) |
Binomial | \(N\sim Bin(m,q)\) | \(N\sim Bin(m,vq)\) |
The \((a,b,1)\) Class
The same parameter that gets multiplied by \(v\) in the \((a,b,0)\) class gets multiplied by \(v\) in the \((a,b,1)\) class.
\(p_0^M\) is adjusted so that: \(1-p_0^{M*}=(1-p_0^M)\dfrac{1-p_0^*}{1-p_0}\), where asterisks indicate distributions with revised parameters.
For the logarithmic distribution, \(p_0^M\) is adjusted so that: \(1-p_0^{M*}=(1-p_0^M)\dfrac{\ln(1+v\beta)}{\ln(1+\beta)}\)
To get the probability of \(n\) claims for a zero-modified distribution when the insurance coverage has a deductible of \(d\), follow these steps (a numerical sketch follows the list):
- Obtain the modified parameter, e.g., \(\lambda^*=v\lambda\) for a Poisson model, where \(v=S_X(d)\)
- Use the modified parameter to calculate \(p_0^{M*}\): \(1-p_0^{M*}=(1-p_0^M)\dfrac{1-p_0^*}{1-p_0}\)
- Use \(p_n^{T*}\) to obtain \(p_n^{M*}\): \(p_n^{M*}=(1-p_0^{M*})p_n^{T*}\)
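A minimal numerical sketch of these steps (all values hypothetical: a zero-modified Poisson with \(\lambda=2\), \(p_0^M=0.4\), and \(v=S_X(d)=0.7\)):

```python
import math

# Hypothetical inputs: zero-modified Poisson parameters and survival past the deductible
lam, p0M, v = 2.0, 0.4, 0.7          # lambda, P(N=0) of the zero-modified model, v = S_X(d)

lam_star = v * lam                    # step 1: modified Poisson parameter
p0 = math.exp(-lam)                   # P(N=0) under the unmodified Poisson(lambda)
p0_star = math.exp(-lam_star)         # P(N=0) under Poisson(lambda*)

# step 2: adjust the zero-modified probability of zero claims
p0M_star = 1 - (1 - p0M) * (1 - p0_star) / (1 - p0)

# step 3: zero-truncated probabilities with the new parameter, re-weighted
def pn_M_star(n):
    pn_star = math.exp(-lam_star) * lam_star**n / math.factorial(n)
    return (1 - p0M_star) * pn_star / (1 - p0_star)

print(p0M_star, pn_M_star(1), pn_M_star(2))
```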
Aggregate Loss Models: Compound Variance
Collective Risk Model
The collective risk model only considers the number of claims and the size of each claim. In other words, the size of the group is only relevant to the extent that it affects the number of claims. In this model, aggregate losses \(S\) can be expressed as:
\(S=\sum\limits_{i=1}^{N}{X_{i}}\)
where \(N\) is the number of claims and \(X_i\) is the size of each claim. The following assumptions are made:
- \(X_i\)’s are independent identically distributed random variables. In other words, every claim size has the same probability distribution and is independent of any other claim size.
- \(X_i\)’s are independent of \(N\). The claim counts are independent of the claim sizes.
\(S\) is a compound distribution: a distribution formed by summing up a random number of identical random variables. For a compound distribution, \(N\) is called the primary distribution and \(X\) is called the secondary distribution.
Individual Risk Model
The alternative is to let \(n\) be the number of insureds in the group, and \(X_i\) be the aggregate claims of each individual member. We assume:
- \(X_i\)’s are independent, but not necessarily identically distributed random variables. Different insureds could have different distributions of aggregate losses. Typically, \(Pr(X_i=0)>0\), since an insured may not submit any claims. This is unlike the collective risk model, where \(X_i\) is a claim and therefore not equal to 0.
- There is no random variable \(N\). Instead, \(n\) is a fixed number, the size of the group.
\(S=\sum\limits_{i=1}^{n}{X_{i}}\)
Compound Variance
Assume we have a collective risk model. We assume that aggregate losses have a compound distribution, with frequency being the primary distribution and severity being the secondary distribution. If \(N\) is the frequency random variable, \(X\) the severity random variable, and \(S=\sum\limits_{n=1}^{N}{X_{n}}\), then
\(E[S]=E[N]E[X]\)
\(Var(S)=E[N]Var(X)+Var(N)E{{[X]}^{2}}\)
The compound variance formula can only be used when \(N\) and the \(X_i\) are independent. If \(N|\theta\) and \(X|\theta\) are conditionally independent, they are generally not unconditionally independent. However, the compound variance formula may be used on \(N|\theta\) and \(X|\theta\) to evaluate \(Var(S|\theta)\), and then the conditional variance formula can be used to evaluate \(Var(S)\).
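As a quick check of these formulas, a sketch with a hypothetical Poisson frequency and a two-point severity (values chosen only for illustration):

```python
# Hypothetical model: N ~ Poisson(lambda), X takes 100 or 500 with equal probability
lam = 3.0
sizes, probs = [100.0, 500.0], [0.5, 0.5]

EN, VarN = lam, lam                              # Poisson mean and variance
EX  = sum(p * x for p, x in zip(probs, sizes))
EX2 = sum(p * x * x for p, x in zip(probs, sizes))
VarX = EX2 - EX**2

ES   = EN * EX                                   # E[S] = E[N] E[X]
VarS = EN * VarX + VarN * EX**2                  # compound variance formula

print(ES, VarS)   # 900.0, 390000.0 (for a Poisson primary, Var(S) also equals lam * E[X^2])
```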
Aggregate Loss Models: Approximating Distribution
Collective Risk Model & Individual Risk Model
The aggregate distribution may be approximated with a normal distribution. This may be theoretically justified by the Central Limit Theorem if the group is large.
Discrete Distribution: If severity is discrete, then the aggregate loss distribution is discrete, and a continuity correction is required.
Continuous Distribution: If severity has a continuous distribution, no continuity correction is made since \(S\) has a continuous distribution when \(X\) is continuous.
When the sample isn’t large enough or the distribution has a heavy tail, the symmetric normal distribution isn’t appropriate. Sometimes the lognormal distribution is used instead, even though there is no theoretical justification.
Remark:
- If there are multiple data groups \(I_i\) with \(n_i\) exposures each, then:
- \(E[S]=\sum{n_iE[N|I_i]E[X|I_i]}\)
- \(Var(S)=\sum{n_iVar(S|I_i)}=\sum{n_i(E[N|I_i]Var(X|I_i)+Var(N|I_i)E[X|I_i]^2)}\)
- Linear combination: if the aggregate mean and standard deviation are both reduced by x%, then the normal approximation, which is a linear combination of the mean and standard deviation, is also reduced by x%.
Aggregate Losses: Severity Modifications
Individual losses may be subject to deductibles, limits, or coinsurance. These modifications reduce the expected annual payments made on losses, or the expected annual aggregate costs. The concept of “expected annual payments” is related to the concepts of “payment per payment” and “payment per loss”. Expected annual aggregate payments (sometimes called “expected annual aggregate costs”) may be calculated in one of two ways:
- Expected payment per loss x Expected number of losses per year
In other words, do not modify the frequency distribution for the deductible. Calculate the expected number of losses (not payments), but modify the severity distribution: use the payment-per-loss random variable \(Y^L\), which has a non-zero probability of equaling 0. Multiply expected payment per loss times the expected number of losses.
- Expected payment per payment x Expected number of payments per year
In other words, modify the frequency distribution for the deductible. Calculate the expected number of payments (not losses). Modify the severity distribution; use the payment per payment variable \(Y^P\). Payments of 0 are excluded. Multiply expected payment per payment times the number of payments.
Variance of Aggregate Loss After Severity Modifications
For loss amounts \(X\) and number of losses \(N\), taken to be independent, the variance of aggregate payments when an ordinary deductible of \(d\) per loss is imposed is based on:
\(S=\sum\limits_{i=1}^{N}{{(X_i-d)_+}}=\sum\limits_{i=1}^{N’}{X’_i}\), where \(X’_i=(X-d|X>d)\) and \(N’\) is the number of payments (the modified frequency)
Using Compound Variance formula:
\(Var(S)=E[N’]Var(X’)+Var(N’)E[X’]^2\)
Aggregate Loss Models: The Recursive Formula
Assume the distribution of the severities, the \(X_i\)’s, is discrete. Once we make the assumption that severities are discrete, aggregate losses are also discrete. To obtain the distribution of aggregate losses, we only need to calculate \(Pr(S=n)\) for every possible value \(n\) of the aggregate loss distribution. Using the law of total probability,
\(\Pr (S=n)=\sum\limits_{k}{\Pr (N=k)\Pr (\sum\limits_{i=1}^{k}{{{X}_{i}}}=n)}\)
Let \(p_n=f_N(n)=\Pr (N=n)\) for frequency distribution,\(f_n=f_X(n)=\Pr (X=n)\) for severity distribution and \(g_n=f_S(n)=\Pr (S=n)\) for aggregate loss distribution, then
\(g_n=\sum\limits_{k=0}^{\infty }{p_k}\sum\limits_{i_1+…+i_k=n}{{f_{i_1}}…{f_{i_k}}}\)
The product of the \(f_{i_k}\)’s is called the k-fold convolution of the \(f\)’s, or \(f^{*k}\). If the probability of a claim size of 0 is zero, i.e., \(f_0=0\), then the outer sum is finite, but if \(f_0\ne 0\), the outer sum is an infinite sum, since any number of zeroes can be included in the inner sum. However, if the primary distribution is in the \((a,b,0)\) class, it can be modified to account for removing the probability of a severity of 0.
Methodology
- Calculate \(f_{>0}=1-f_0\) to eliminate the probability of 0, and then calculate the conditional severity probabilities \(f_k/f_{>0}\) for all necessary \(k\)
- Adjust the parameter in frequency distribution to account only for non-zeros
- Calculate all combinations to achieve the aggregate amount
Recursive Formula for (a, b, 0) Class
\(g_k=\dfrac{1}{1-af_0}\sum\limits_{j=1}^{k}{(a+\dfrac{bj}{k}){f_j}{g_{k-j}}}\), for k=1, 2, …
Recursive Formula for (a, b, 1) Class
\(g_k=\dfrac{(p_1-(a+b)p_0)f_k}{1-af_0}+\dfrac{1}{1-af_0}\sum\limits_{j=1}^{k}{(a+\dfrac{bj}{k})f_j{g_{k-j}}}\), for k=1, 2, …
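A minimal sketch of the \((a,b,0)\) recursion for a Poisson primary distribution (\(a=0\), \(b=\lambda\)) and a hypothetical discrete severity on \(\{0,1,2,3\}\); the starting value uses the Poisson pgf, \(g_0=e^{-\lambda(1-f_0)}\):

```python
import math

lam = 2.0                          # Poisson frequency parameter (hypothetical)
f = [0.1, 0.4, 0.3, 0.2]           # severity pmf f_0..f_3 (hypothetical)
a, b = 0.0, lam                    # (a, b, 0) values for the Poisson

kmax = 10
g = [math.exp(-lam * (1 - f[0]))]  # g_0 = P_N(f_0) for a Poisson primary
for k in range(1, kmax + 1):
    total = 0.0
    for j in range(1, min(k, len(f) - 1) + 1):
        total += (a + b * j / k) * f[j] * g[k - j]
    g.append(total / (1 - a * f[0]))

print(g[:5])                       # Pr(S=0), ..., Pr(S=4)
```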
Aggregate Losses: Aggregate Deductible
The expected value of aggregate losses above the deductible is called the net stop-loss premium:
\(E[(S-d)_+]=E[S]-E[S\wedge d]\)
- Evaluate \(E[S]\) by calculating \(E[S]=E[X]E[N]\) if \(N\) and \(X\) are independent, leaving \(E[S\wedge d]\) for further consideration
- To evaluate \(E[S\wedge d]\), firstly calculate \(g_n\) for \(n<d\)
Using the Definition of \(E[S\wedge d]\)
For a discrete distribution in which the only possible values are multiples of \(h\), it becomes:
\(E[S\wedge d]=\sum\limits_{j=0}^{u}{hj{g_{hj}}+d\Pr (S\ge d)}\) , where \(u=\dfrac{d}{h}-1\), so the sum is over all multiples of \(h\) less than \(d\)
Calculating \(E[S\wedge d]\) by Integrating the Survival Function
For a discrete distribution in which the only possible values are multiples of \(h\), it becomes:
\(E[S\wedge d]=\sum\limits_{j=0}^{u-1}{hS(hj)+(d-hu)S(hu)}\), where \(u=\dfrac{d}{h}-1\)
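A sketch of the net stop-loss premium on a unit lattice (\(h=1\)), treating the aggregate probabilities \(g_0,g_1,\dots\) and the aggregate deductible as given hypothetical inputs:

```python
# Hypothetical aggregate pmf on 0..7 (sums to 1) and an aggregate deductible d
g = [0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.06, 0.04]
d = 3

ES = sum(j * gj for j, gj in enumerate(g))              # E[S]

# E[S ^ d] = sum of j*g_j over j < d, plus d * Pr(S >= d)
prob_ge_d = 1 - sum(g[:d])
ES_limited = sum(j * g[j] for j in range(d)) + d * prob_ge_d

stop_loss = ES - ES_limited                             # E[(S-d)_+] = E[S] - E[S ^ d]
print(ES_limited, stop_loss)
```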
Proceeding Backwards
Aggregate Losses: Miscellaneous Topics
Exact Calculation of Aggregate Loss Distribution
In some special cases, one can combine the frequency and severity models into a closed form for the distribution function of aggregate losses.
The distribution function of aggregate losses at \(x\) is the sum over \(n\) of the probabilities that the claim count equals \(n\) and the sum of \(n\) loss sizes is less than or equal to \(x\).
Normal Distribution
The sum of \(n\) independent \(X_i \sim N(\mu ,\sigma ^2)\) random variables has the distribution \(N(n\mu ,n\sigma ^2)\)
Exponential and Gamma Distributions
If \(X_i \sim Exp(\theta)\), then \(\sum\limits_{1}^{n}{X_i} \sim Gamma(\alpha=n,\theta)\). When \(\alpha\) is an integer, the gamma distribution is also called an Erlang distribution.
- If \(n=1\), the Erlang distribution is an exponential distribution, and \(F(x)=1-e^{-x/\theta}\)
For any \(n\), \(F_S(x)=\sum\nolimits_{n=0}^{\infty }{\Pr (N=n){F_{X_n}}(x)}\), where \(X_n\sim Erlang(n,\theta)\) and
\(F_{X_n}(x)=1-\sum\limits_{j=0}^{n-1}{e^{-x/\theta }\dfrac{{(x/\theta )}^{j}}{j!}}\)
For example (a numerical sketch follows):
- \(\Pr(X_1+X_2\le k)=1-\Pr\left(Poi\left(\lambda=\dfrac{k}{\theta}\right)<2\right)=1-e^{-k/\theta }\left(1+\dfrac{k}{\theta}\right)\)
- \(\Pr(X_1+X_2+X_3\le k)=1-\Pr\left(Poi\left(\lambda=\dfrac{k}{\theta}\right)<3\right)=1-e^{-k/\theta }\left(1+\dfrac{k}{\theta}+\dfrac{(k/\theta)^2}{2!}\right)\)
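A sketch of the exact calculation for a compound Poisson/exponential model, truncating the outer sum over claim counts and writing the Erlang CDF as a Poisson tail (hypothetical \(\lambda=2\), \(\theta=500\)):

```python
import math

lam, theta = 2.0, 500.0            # hypothetical Poisson mean and exponential mean

def erlang_cdf(n, x, theta):
    """F of the sum of n iid exponential(theta): 1 - sum_{j<n} e^(-x/theta) (x/theta)^j / j!"""
    y = x / theta
    return 1 - sum(math.exp(-y) * y**j / math.factorial(j) for j in range(n))

def agg_cdf(x, nmax=50):
    """F_S(x) = sum over n of Pr(N=n) * F_{Erlang(n)}(x); the n=0 term contributes Pr(N=0)."""
    total = math.exp(-lam)                                   # Pr(N=0), since S=0 <= x for x >= 0
    for n in range(1, nmax + 1):
        pn = math.exp(-lam) * lam**n / math.factorial(n)
        total += pn * erlang_cdf(n, x, theta)
    return total

print(agg_cdf(1000.0))
```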
Supplementary Questions: Ratemaking, Severity, Frequency, and Aggregate Loss
Skipped
Maximum Likelihood Estimators
Steps of Fitting the parameters
The steps of fitting the parameters \(\theta \) to observations using maximum likelihood:
- Write down a formula for the likelihood of the observations in terms of \(\theta \). Generally this will be the probability or the density of the observations.
- Log the formula.
- Maximize the formula. Usually this means differentiating the formula and setting the derivative equal to zero.
The potential pitfalls of maximum likelihood:
- There is no guarantee that the likelihood can be maximized. It may go to infinity.
- There may be more than one maximum.
- There may be local maxima in addition to the global maximum; these must be avoided.
- It may not be possible to find the maximum by setting the partial derivatives to zero; a numerical algorithm may be necessary.
Defining the Likelihood
Individual Data
If the data are being fit to a discrete distribution, the likelihood of an observation is its probability.
When setting up the likelihood function, positive multiplicative constants (and certainly additive constants) can be ignored. Multiplying by a positive constant does not affect the point at which the function reaches the maximum. Anything not involving the parameters being estimated is a constant, even if it involves the observations.
Grouped Data
The likelihood that an observation is in the interval \((C_{j-1},C_j)\) is \(F(C_j)-F(C_{j-1})\).
Censored Data
For censored data, such as data in the presence of a policy limit, treat it like grouped data: the likelihood function is the probability of being beyond the censoring point.
Truncated Data
For truncated data, the observation is conditional on being outside the truncated range.
Left Truncated
If data are left truncated at \(d\), such as for a policy with a deductible of \(d\), so that you only see the observation \(x\) if it is greater than \(d\), the likelihood of \(x\) is:
\(\dfrac{f(x)}{\Pr (X>d)}=\dfrac{f(x)}{S(d)}\)
Right Truncated
For the rarer case of right truncated data, where you do not see an observation \(x\) unless it is below \(u\), the likelihood of \(x\) is:
\(\dfrac{f(x)}{\Pr (X<u)}=\dfrac{f(x)}{F(u)}\)
Truncated between \(d\) and \(u\)
For the rarer case of data truncated between two points, where you do not see an observation \(x\) if it is between \(d\) and \(u\), the likelihood of \(x\) is:
\(\dfrac{f(x)}{\Pr(X<d)+\Pr(X>u)}=\dfrac{f(x)}{1-(F(u)-F(d))}\)
Left Truncated and Right Censored
Data that are both left truncated and right censored would have likelihood:
\(\dfrac{S(u)}{S(d)}\)
Grouped data that are between \(d\) and \(C_j\) in the presence of truncation at \(d\) has likelihood:
\(\dfrac{F(C_j)-F(d)}{1-F(d)}\)
Maximum Likelihood Estimators-Special Techniques
Shortcut
- If the likelihood function is \(L(\theta )={{\theta }^{-a}}{e^{-b/\theta }}\), then \(\hat{\theta }=b/a\)
- If the likelihood function is \(L(\theta )={{\theta }^a}{e^{-b\theta }}\), then \(\hat{\theta }=a/b\)
- If the likelihood function is \(L(\theta )={{\theta }^a}{(1-\theta)^b}\), then \(\hat{\theta }=\dfrac{a}{a+b}\)
In fact, the first likelihood function above is a constant times the probability density function of an inverse gamma with \(\theta=b\) and \(\alpha=a-1\), and the mode of an inverse gamma is listed in the tables as \(\dfrac{\theta}{\alpha+1}\), which is \(b/a\) here.
Let \(n\) be the number of uncensored observations and \(c\) the number of censored observations (a numerical check of the exponential row follows the table):
Distribution | Formula | For Censored Data |
Exponential | \(\hat{\theta }=\dfrac{\sum\nolimits_{i=1}^{n+c}{(x_i-d_i)}}{n}\) | Yes |
Lognormal | \(\hat{\mu }=\dfrac{\sum\nolimits_{i=1}^{n}{\ln {x_i}}}{n}\), \(\hat{\sigma^2}=\dfrac{\sum\nolimits_{i=1}^{n}{(\ln x_i)^2}}{n}-{{\hat{\mu }}^{2}}\) Note: Can only be used when both parameters are to be estimated. | No |
Inverse Exponential | \(\hat{\theta }=\dfrac{n}{\sum\nolimits_{i=1}^{n+c}{({x_i}^{-1})}}\) | No |
Weibull, fixed \(\tau\) | \(\hat{\theta }=\sqrt[\tau ]{\dfrac{\sum\nolimits_{i=1}^{n+c}{{x_i}^{\tau }}-\sum\nolimits_{i=1}^{n+c}{{d_i}^{\tau }}}{n}}\) | Yes |
Uniform \([0,\theta]\), Individual Data | \(\hat{\theta }=\max {{x}_{i}}\) | No |
Uniform \([0,\theta]\), Grouped Data | \(\hat{\theta }=c_j\dfrac{n}{n_j}\), where \(c_j\) = upper bound of the highest finite interval, \(n_j\) = number of observations below \(c_j\). Note: Formula works only if there is at least one observation above \(c_j\). | No |
Two-Parameter Pareto, Fixed \(\theta\) | \(\hat{\alpha }=-\dfrac{n}{K}\), \(K=\sum\nolimits_{i=1}^{n+c}{\ln (\theta +d_i)}-\sum\nolimits_{i=1}^{n+c}{\ln (\theta +x_i)}\) | Yes |
Single-Parameter Pareto, Fixed \(\theta\) | \(\hat{\alpha }=-\dfrac{n}{K}\), \(K=\sum\nolimits_{i=1}^{n+c}{\ln \max(\theta,d_i)}-\sum\nolimits_{i=1}^{n+c}{\ln x_i}\) | Yes |
Beta, Fixed \(\theta\), \(b=1\) | \(\hat{a}=-\dfrac{n}{K}\), \(K=\sum\nolimits_{i=1}^{n}{\ln{x_i}}-n\ln \theta\) | No |
Beta, Fixed \(\theta\), \(a=1\) | \(\hat{b}=-\dfrac{n}{K}\), \(K=\sum\nolimits_{i=1}^{n}{\ln(\theta-x_i)}-n\ln \theta\) | No |
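A numerical check of the exponential row of the table, using hypothetical payments with deductibles \(d_i\) and one right-censored observation (the censored value enters the numerator but not the count \(n\)):

```python
# Hypothetical observations: (loss amount or censoring point, deductible, censored flag)
data = [(800, 0, False), (1200, 250, False), (500, 250, False), (3000, 0, True)]

n = sum(1 for x, d, cens in data if not cens)           # number of uncensored observations
theta_hat = sum(x - d for x, d, cens in data) / n       # exponential MLE shortcut

print(theta_hat)   # (800 + 950 + 250 + 3000) / 3
```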
Variance of Maximum Likelihood Estimators
Maximum likelihood estimators are asymptotically unbiased, consistent, and asymptotically normally distributed, as long as certain regularity conditions hold. Asymptotically, no unbiased estimator has a lower variance. All of these properties are only true asymptotically; before the sample size gets to infinity, the estimators may be biased, for example.
Information Matrix
Calculating Variance Using the Information Matrix
The asymptotic covariance matrix of the maximum likelihood estimator is the inverse of the (Fisher’s) information matrix, whose entries are:
\(I{(\theta )}_{rs}=-E_X[\dfrac{{{\partial }^2}}{\partial {{\theta }_s}\partial {\theta _r}}l(\theta )]=E_X[\dfrac{\partial }{\partial {\theta }_r}l(\theta )\dfrac{\partial }{\partial {\theta_s}}l(\theta )]\)
\(I(\theta )=-E_X[\dfrac{d^2l}{d\theta ^2}]=E_X[{(\dfrac{dl}{d\theta })}^2]\)
If the sample has \(n\) independent identically distributed observations, then \(l(\theta )=\sum\limits_{i=1}^{n}{\ln f(x_i;\theta )}\) and
\(I(\theta )=-nE_X[\dfrac{d^2\ln f(x_i;\theta )}{d\theta ^2}]=nE_X[{(\dfrac{d\ln f(x_i;\theta )}{d\theta })}^2]\)
Adjoint Method
If the maximum likelihood estimators of these parameters, \(\hat{\alpha}\) and \(\hat{\beta}\), have information matrix:
\(I(\theta)=\left( \begin{matrix}{a_{11}}&{a_{12}}\\{a_{21}}&{a_{22}}\end{matrix}\right)\), then \(I^{-1}=\dfrac{1}{a_{11}a_{22}-a_{12}a_{21}}\left( \begin{matrix}{a_{22}}&{-a_{12}}\\{-a_{21}}&{a_{11}}\end{matrix}\right)\)
\(Var(\hat{\alpha})=\dfrac{a_{22}}{a_{11}a_{22}-a_{12}a_{21}}\) and \(Var(\hat{\beta})=\dfrac{a_{11}}{a_{11}a_{22}-a_{12}a_{21}}\)
Asymptotic Variance of MLE for Common Distributions
Distribution | Formula |
Uniform \([0,\theta]\) | \(Var(\hat{\theta })=\dfrac{n{\theta }^2}{(n+1)^2(n+2)}\) |
Exponential | \(Var(\hat{\theta })=Var(\bar{X})=\dfrac{Var(X)}{n}=\dfrac{{\theta }^2}{n}\) |
Weibull, fixed \(\tau\) | \(Var(\hat{\theta })=\dfrac{{\theta }^2}{n{\tau }^2}\) |
Pareto, fixed \(\theta\) | \(Var(\hat{\alpha })=\dfrac{{\alpha }^2}{n}\) |
Pareto, fixed \(\alpha\) | \(Var(\hat{\theta })=\dfrac{(\alpha+2){\theta }^2}{n\alpha}\) |
Lognormal | \(Var(\hat{\mu })=\dfrac{{\sigma }^2}{n}\), \(Cov(\hat{\mu },\hat{\sigma })=0\), \(Var(\hat{\sigma })=\dfrac{{\sigma }^2}{2n}\) |
The Delta Method
The formula for the delta method is that the approximate variance of a function of a random variable is the variance of the variable multiplied by the square of the derivative of the function evaluated at the mean, and the delta method for one variable is:
\(Var(g(X))\approx Var(X){(\dfrac{dg}{dx})}^2\)
If the random vector is \(X_1,X_2,…X_n\), let \(\sigma _{i}^{2}\) denote \(Var(X_i)\) and let \(\sigma_{ij}\) denote \(Cov(X_i,X_j)\). Then the covariance matrix is defined by:
\[\Sigma=\left( \begin{matrix} \sigma _{1}^{2} & {{\sigma }_{12}} & \cdots & {{\sigma }_{1n}} \\ {{\sigma }_{21}} & \sigma _{2}^{2} & \cdots & {{\sigma }_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ {{\sigma }_{n1}} & {{\sigma }_{n2}} & \cdots & \sigma _{n}^{2} \\ \end{matrix} \right)\]
Suppose \(X=(X_1,X_2,…,X_k)\) is a \(k\)-dimensional random variable, \(\theta \) is its mean, and \(\Sigma\) is its covariance matrix. If \(g(X)\) is a function of \(X\), then the delta method approximation of the variance is:
\(Var(g(X))\approx (\partial g{)}'\,\Sigma\,{(\partial g)}\),
where \(\partial g=(\dfrac{\partial g}{\partial {x_1}},…,\dfrac{\partial g}{\partial {x_k}}{)}’\) is the column vector of partial derivatives of \(g\) evaluated at \(\theta\) and prime indicates a transpose.
The delta method for two variables is:
\(Var(g(X,Y))\approx Var(X){{(\dfrac{\partial g}{\partial x})}^2}+2Cov(X,Y)\dfrac{\partial g}{\partial x}\dfrac{\partial g}{\partial y}+Var(Y){{(\dfrac{\partial g}{\partial y})}^2}\)
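A one-variable sketch, assuming \(\hat\theta\) is an exponential MLE with \(Var(\hat\theta)=\theta^2/n\) (from the asymptotic variance table above) and taking \(g(\theta)=\ln\theta\); the numbers are hypothetical:

```python
theta_hat, n = 1500.0, 40          # hypothetical MLE and sample size
var_theta = theta_hat**2 / n       # asymptotic variance of the exponential MLE

# Delta method: Var(g(X)) ~ Var(X) * (dg/dx)^2, with the derivative evaluated at the estimate
dg = 1 / theta_hat                 # derivative of g(theta) = ln(theta)
var_log_theta = var_theta * dg**2

print(var_log_theta)               # equals 1/n for this choice of g
```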
Confidence Intervals
Normal Confidence Intervals
A normal confidence interval for a quantity estimated by maximum likelihood is constructed by adding and subtracting \(z_p\sigma\), where \(z_p\) is an appropriate standard normal percentile and \(\sigma\) is the estimated standard deviation.
Non-Normal Confidence Intervals
An alternative method for building confidence intervals is to solve an inequality for the loglikelihood equation. The confidence interval consists of the k-dimensional region in which the loglikelihood function is greater than c for some constant \(c\), where \(k\) is the number of parameters.
Steps for calculating a \(c\text{%}\) non-normal confidence interval for \(\theta\) are:
- Obtain the MLE of the parameter, \(\hat{\theta}\)
- Construct \(L(\theta)\) and then \(l(\theta)\)
- Find all \(\theta\) satisfying \(2(l(\hat{\theta})-l(\theta))\le \chi^2_{df=1,c\text{%}}\)
Fitting Discrete Distributions
Discrete Distribution | Parameters | MLE |
Poisson | Complete Individual Data, \(\lambda \) | \(\hat{\lambda }=\bar{x}\), \(Var(\hat{\lambda})=Var(\bar{x})=\dfrac{Var(\sum{x})}{n^2}=\dfrac{nVar(x)}{n^2}=\dfrac{\hat{\lambda}}{n}\) |
Negative Binomial | Complete Data, fixed \(r\) | \(\hat{\beta }=\dfrac{\bar{x}}{r}\) |
Negative Binomial | Complete Data, fixed \(\beta\) | \(\hat{r }=\dfrac{\bar{x}}{\beta}\) |
Binomial | Complete Data | Maximum likelihood proceeds by calculating a likelihood profile for each \(m\ge max (x_i)\). \(\hat{q}\), given \(m\) is \(\dfrac{\bar{x}}{m}\). When the maximum likelihood for \(m+1\) is less than the one for \(m\), the maximum overall is attained at \(m\). |
Modified \((a,b,1)\) | – | \(\hat{p}_0^M=\dfrac{n_0}{n}\) and the mean is set equal to the sample mean. |
Choosing between Distributions in the \((a,b,0)\) Class
Comparing the Sample Mean to the Sample Variance
- If \(\hat{\sigma}^2>\bar{x}\), then \(X\sim NB\)
- If \(\hat{\sigma}^2=\bar{x}\), then \(X\sim Poisson\)
- If \(\hat{\sigma}^2<\bar{x}\), then \(X\sim Bin\)
\(\dfrac{kp_k}{p_{k-1}}=ak+b\)
If a sample has \(n_k\) observations of \(k\), and \(\sum\nolimits_{k=0}^{\infty}{n_k}=n\), then estimating \(p_k\) by \(n_k/n\) gives \(\dfrac{kn_k}{n_{k-1}}\approx ak+b\)
- If \(a>0\), then \(X\sim NB\)
- If \(a=0\), then \(X\sim Poisson\)
- If \(a<0\), then \(X\sim Bin\)
Note: This method cannot be used when \(n_k=0\)
Hypothesis Tests: Graphic Comparison
Notation
- \(F_n\) and \(f_n\) continue to refer to the empirical distribution.
- \(F^*\) and \(f^*\) will be used to denote the fitted distribution, adjusted for truncation.
To make the fitted distribution consistent with the empirical distribution, it will be truncated when the empirical distribution is. In other words, if observed data are left-truncated at \(d\), and if we let \(F\) and \(f\) refer to the unmodified distribution,
\(F^*(x)=\dfrac{F(x)-F(d)}{1-F(d)}\) and \(f^*(x)=\dfrac{f(x)}{1-F(d)}\)
\(D(x)\) Plots
To amplify differences, for individual data only, we can plot the function \(D(x)\), defined by:
\(D(x)=F_n(x)-F^*(x)=\dfrac{j}{n}-F^*(x)\)
\(p-p\) Plots
Suppose the \(n\) observations, sorted, are \(x_1\le x_2\le…\le x_n\). The \(p-p\) plot plots the empirical distribution on the \(x\) axis against the fitted distribution on the \(y\) axis:
\((F_n(x_j),F^*(x_j))=(\dfrac{j}{n+1},F^*(x_j))\)
We divide by \(n+1\) instead of by \(n\), since the expected value of \(F_n(x_j)\) is \(\dfrac{j}{n+1}\).
- The \(p-p\) plot indicates the fit is:
- too heavy (thick) if the slope of the fit is greater than 1 near the point
- too light (thin) if the slope of the fit is less than 1 near the point
Hypothesis Tests: Kolmogorov-Smirnov
Individual Data
The Kolmogorov-Smirnov statistic \(D\) is the maximum difference, in absolute value, between the empirical and fitted distributions, \(Max|F_n(x)-F^*(x;\hat{\theta})|\), \(t\le x\le u\), where \(t\) is the lower truncation point (0 if none) and \(u\) is the upper censoring point (\(\infty\) if none).
To evaluate \(D\), assuming the observation points are sorted, \(x_1\le x_2\le…\le x_n\), at each observation point \(x_j\) take \(\max\left(\left|F^*(x_j)-\dfrac{j}{n}\right|,\left|F^*(x_j)-\dfrac{j-1}{n}\right|\right)\)
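A sketch computing \(D\) for a small hypothetical sample against a fitted exponential, with no truncation or censoring:

```python
import math

theta = 1000.0                            # fitted exponential mean (hypothetical)
xs = sorted([200.0, 500.0, 900.0, 1600.0, 3000.0])
n = len(xs)

def F_star(x):
    return 1 - math.exp(-x / theta)

D = 0.0
for j, x in enumerate(xs, start=1):       # compare F* with the empirical cdf just before and at x_j
    D = max(D, abs(F_star(x) - j / n), abs(F_star(x) - (j - 1) / n))

print(D)
```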
Remark:
\(x_j\) | \(F^-_n(x_j)\) | \(F_n(x_j)\) | \(F^-_n(x_j)\), if \(x_3=x_2\) | \(F_n(x_j)\), if \(x_3=x_2\) | \(F^-_n(x_j)\), if \(x_n>U\) | \(F_n(x_j)\), if \(x_n>U\) |
\(x_1\) | \(0\) | \(\dfrac{1}{n}\) | \(0\) | \(\dfrac{1}{n}\) | \(0\) | \(\dfrac{1}{n}\) |
\(x_2\) | \(\dfrac{1}{n}\) | \(\dfrac{2}{n}\) | \(\dfrac{1}{n}\) | \(\dfrac{2}{n}\) | \(\dfrac{1}{n}\) | \(\dfrac{2}{n}\) |
\(x_3\) | \(\dfrac{2}{n}\) | \(\dfrac{3}{n}\) | – | – | \(\dfrac{2}{n}\) | \(\dfrac{3}{n}\) |
\(x_4\) | \(\dfrac{3}{n}\) | \(\dfrac{4}{n}\) | \(\dfrac{3}{n}\) | \(\dfrac{4}{n}\) | \(\dfrac{3}{n}\) | \(\dfrac{4}{n}\) |
… | … | … | … | … | … | … |
\(x_{n-1}\) | \(\dfrac{n-2}{n}\) | \(\dfrac{n-1}{n}\) | \(\dfrac{n-2}{n}\) | \(\dfrac{n-1}{n}\) | \(\dfrac{n-2}{n}\) | \(\dfrac{n-1}{n}\) |
\(x_{n}\) | \(\dfrac{n-1}{n}\) | \(\dfrac{n}{n}\) | \(\dfrac{n-1}{n}\) | \(\dfrac{n}{n}\) | \(\dfrac{n-1}{n}\) | – |
Hypothesis Tests: Chi-Square
A chi-square random variable with \(k\) degrees of freedom, \(X\), is defined as the sum of the squares of \(k\) standard normal random variables. If \(Z_i\) is a standard normal random variable, then
\(X=\sum\nolimits_{i=1}^{k}{Z_i^2}=\sum\nolimits_{i=1}^{k}{(\dfrac{X_i-\mu_i}{\sigma_i})^2}\).
The sum of squares, the chi-square statistic denoted as \(Q\), is compared to the chi-square table. If \(Q\) is at a high percentile of the chi-square distribution, then reject the null hypothesis.
Definition of Chi-Square Statistic
For each group, let \(p_j\) be the probability that \(X\) is in the \(j^{th}\) group under the hypothesis, let \(n\) be the total number of observations, and let \(O_j\) be the number of observations in group \(j\). Let \(E_j=np_j\), so that \(E_j\) is the expected number of observations in group \(j\). The chi-square statistic is
\(Q=\sum\nolimits_{j=1}^{k}{\dfrac{(O_j-E_j)^2}{E_j}}\) or \(Q=\sum\nolimits_{j=1}^{k}{\dfrac{O_j^2}{E_j}}-n\)
The null hypothesis is accepted at \(\alpha_1\) significance if \(Q<\chi^2_{df,\alpha_1}\) but rejected at \(\alpha_2\) significance if \(Q>\chi^2_{df,\alpha_2}\)
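A sketch of the statistic for hypothetical grouped counts and fitted group probabilities:

```python
# Hypothetical observed counts by group and fitted probabilities p_j
observed = [30, 45, 15, 10]
p = [0.25, 0.45, 0.20, 0.10]
n = sum(observed)

expected = [n * pj for pj in p]
Q = sum((o - e)**2 / e for o, e in zip(observed, expected))
# Equivalent form: Q = sum(o*o/e for o, e in zip(observed, expected)) - n

print(Q)   # compare Q with the chi-square critical value (degrees of freedom discussed below)
```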
Degrees of Freedom
To determine the number of degrees of freedom from first principles, figure out how many of the observations could be chosen at random. There is often a relationship between the observations which does not allow every observation to be chosen at random.
More generally, the number of degrees of freedom when you divide \(n\) claims into \(k\) groups and estimate \(r\) parameters is \(k-1-r\). More specifically:
- If a distribution with parameters is given, or even if a distribution is fitted by a formal approach like maximum likelihood but using a different set of data, then there are \(k-1\) degrees of freedom.
- If the \(r\) parameters were fitted from the data, or if parameters have to be fitted from the data, then there are \(k-1-r\) degrees of freedom.
Remark
- If there is a requirement that the minimum expected number of observations in any group be \(n\), then data groups should be combined until each (combined) group satisfies \(\sum{E_i}\ge n\)
Comparison of the Two Methods of Testing Goodness of Fit
Kolmogorov-Smirnov | Chi-Square |
---|---|
Should be used only for individual data | May be used for individual or grouped data |
Only for continuous fits | For continuous or discrete fits |
Should lower critical value if \(u\lt \infty\) | No adjustment of critical value is needed for \(u\lt \infty\) |
Critical value should be lowered if parameters are fitted | Critical value is automatically adjusted when parameters are fitted |
Critical value declines with larger sample size | Critical value independent of sample size |
No discretion | Discretion in grouping of data |
Uniform weight on all parts of distribution | Higher weight on intervals with low fitted probability |
Data from Several Periods
The chi-square statistic formula for data from \(k\) periods:
\(Q=\sum\nolimits_{j=1}^{k}{\dfrac{(O_j-E_j)^2}{V_j}}\)
The null hypothesis is accepted at \(\alpha_1\) significance if \(Q<\chi^2_{df,\alpha_1}\) but rejected at \(\alpha_2\) significance if \(Q>\chi^2_{df,\alpha_2}\)
where \(V_j\) is the fitted variance of the number of observations in the cell. The statistic has \(k-p\) degrees of freedom if \(p\) parameters are fitted from the data.
Remark:
- When the chi-square statistic (\(Q\)) is used to analyze a fit of data broken down by period, for example data by year, the number of degrees of freedom is the number of periods minus the number of parameters fit from the data. Do not subtract an additional degree of freedom.
Likelihood Ratio Test and Algorithm, Penalized Loglikelihood Tests
Null Hypothesis
Let \(H_0\) be the Null Hypothesis, then:
Table of Type Errors | \(H_0\) is True | \(H_0\) is False |
Accept the \(H_0\) | Correct Conclusion | Type 2 Error (False Negative) |
Reject the \(H_0\) | Type 1 Error (False Positive) | Correct Conclusion |
Likelihood Ratio Test and Algorithm
The alternative model is accepted if \(2\ln \dfrac{L_1}{L_0}=2(l_1-l_0)>c\), where \(Pr(X>c)=\alpha\) for \(X\) a chi-square random variable with the number of degrees of freedom for the test.
Degree of Freedom
= the number of free parameters in the alternative model (the model of the alternative hypothesis) minus the number of free parameters in the base model (the model of the null hypothesis)
Model Selection
- A \(k+1\)-parameter model is preferred over a \(k\)-parameter model if the likelihood ratio statistic exceeds the critical value:
\(2(l_{k+1}-l_k)>\chi^2_{df,\alpha}\)
Combined Data Set
Let \(k_1\) and \(k_2\) be the numbers of parameters of the two models fitted separately, with loglikelihoods \(l_{k_1}\) and \(l_{k_2}\), and let \(l_{\text{combined}}\) be the loglikelihood of a single model fitted to the combined data. The data sets should be modeled separately (not combined) if the likelihood ratio statistic exceeds the critical value:
\(2\left((l_{k_1}+l_{k_2})-l_{\text{combined}}\right)>\chi^2_{df,\alpha}\)
Otherwise, combining the data sets is accepted.
The likelihood ratio test can be used to decide whether to combine two data sets into one model or to model them separately.
- If the data sets are combined, then the number of free parameters of the overall model is the number of parameters in the single model.
- If they are not combined, the number of free parameters of the overall model is the sum of the numbers of parameters of the two models.
- For the separate models to be preferred, twice the logarithm of the likelihood ratio should exceed the chi-square critical value with degrees of freedom equal to the difference in the number of free parameters between the two overall models.
Schwarz Bayesian Criterion and Akaike Information Criterion
Loglikelihood increases as more parameters are added to a model, but parameters shouldn’t be added to a model unless they increase loglikelihood significantly. Two penalized loglikelihood measures are commonly used.
The penalty of the SBC is almost always higher than the penalty of the AIC, so the SBC tends to select models with fewer parameters than the AIC selects.
Schwarz Bayesian Criterion (SBC) or Bayesian Information Criterion (BIC)
\(SBC=l-\dfrac{r}{2}\ln (n)\)
where:
-
- \(n\) is the number of observations
- \(r\) is the number of parameters in the model
Akaike Information Criterion (AIC)
\(AIC=l-r\)
where \(r\) is the number of parameters in the model
Supplementary Questions: Parametric Models
Skipped
Classical Credibility: Poisson Frequency
The general formula for full credibility is:
\(y_P \dfrac{\sqrt{e_F \sigma^2}}{e_F \mu}=k\) => \(e_F=n_0(\dfrac{\sigma}{\mu})^2=n_0CV^2\), where
- \(e_F\) is the exposure needed for full credibility (an exposure is a unit insured for a time period)
- \(\mu\) is the expected aggregate claims per exposure
- \(\sigma\) is the standard deviation per exposure
- \(y_P\) is the coefficient from the standard normal distribution for the confidence interval
- \(y_P=\Phi^{-1}((1+P)/2)\), where \(P\) is the level of confidence that the mean is in the interval. However, if a one-sided interval is requested, then \(y_P=\Phi^{-1}(P)\).
- \(k\) is the maximum fluctuation to be accepted and is called the range parameter (%)
- \(P\) is the level of confidence that the mean is in the interval and is called the probability parameter
- \(n_0=(y_P/k)^2\)
- \(CV\) is the coefficient of variation for the aggregate distribution
In other words, we want aggregate claims to be within \(k\) of expected claims at least \(P\) of the time:
\(\begin{align} \Pr\left( \left| S-\mu_{S} \right| \le k \mu_{S}\right) \ge P \\ \Pr\left(\left|\dfrac{S-\mu_{S}}{\sigma_{S}}\right| \le \dfrac{k \mu_{S}}{\sigma_{S}}\right) \ge P \\ \Pr\left( -\dfrac{k \mu_{S}}{\sigma_{S}} \le Z \le \dfrac{k \mu_{S}}{\sigma_{S}}\right) \ge P \end{align}\)
\(\dfrac{k \mu_S}{\sigma_S} \ge y_P\) => \(CV_S=\dfrac{\sigma_S}{\mu_S} \le \dfrac{k}{y_P}\)
\(CV^2_S=\dfrac{\sigma^2_S}{\mu^2_S}=\dfrac{1}{\mu_N}(1+CV^2_X) \le (\dfrac{k}{y_P})^2\)
The three things we calculate credibility for (in what follows, the subscript \(S\) refers to the claim size distribution):
- Number of claims. This means we want the number of claims to be within \(k\) of expected \(P\) of the time.
- Full Credibility for Number of Claims (Exposure): \(e_F=n_0CV^2=n_0(\dfrac{Std[N]}{E[N]})^2=n_0\dfrac{\lambda}{\lambda^2}=n_0\dfrac{1}{\lambda}\)
- Claim sizes. This means we want the size of each claim to be within \(k\) of expected \(P\) of the time.
- Full Credibility for Claim Sizes (Severity): \(n_F=n_0CV^2_S\), \(e_F=n_F/E[N]=n_0(\dfrac{CV^2_S}{\lambda})\)
- Aggregate losses or pure premium. This means we want aggregate losses (or pure premium, which is aggregate losses per exposure, or loss ratio, which is aggregate losses divided by earned premium) to be within \(k\) of expected \(P\) of the time.
- Full Credibility for Aggregate Loss:
\(e_F=n_0\,CV^2_{\text{agg}}=n_0\dfrac{\lambda E[S^2]}{\lambda^2\mu^2_S}=n_0\dfrac{\lambda(E[S]^2+Var[S])}{\lambda^2\mu^2_S}=n_0\dfrac{\lambda(\mu^2_S+\sigma^2_S)}{\lambda^2\mu^2_S}=n_0\dfrac{1+CV^2_S}{\lambda}\)
Experience Expressed in | Number of Claims | Claim Size (Severity) | Aggregate Losses / Pure Premium |
\(e_F\) | \(\dfrac{n_0}{\lambda}\) | \(n_0 (\dfrac{CV^2_S}{\lambda})\) | \(n_0 (\dfrac{1+CV^2_S}{\lambda})\) |
\(n_F\) | \(n_0\) | \(n_0CV^2_S\) | \(n_0(1+CV^2_S)\) |
Classical Credibility: Non-Poisson Frequency
The general formula for the standard for full credibility in exposure units is: \(e_F=n_0(\dfrac{\sigma}{\mu})^2\)
The three things we calculate credibility for:
- Number of claims. The standard for full credibility of claim frequency in terms of the number of exposures is:
- Full Credibility for Number of Claims (Exposure):
\(e_F=n_0CV^2=n_0(\dfrac{Std[N]}{E[N]})^2=n_0\dfrac{\sigma^2_f}{\mu^2_f}\)
- Claim sizes. The standard for full credibility of claim sizes (severity) in terms of exposure units is:
- Full Credibility for Claim Sizes (Severity): \(n_F=n_0CV^2_S=n_0\dfrac{\sigma^2_S}{\mu^2_S}\), \(e_F=n_F/E[N]=n_0(\dfrac{\sigma^2_S}{\mu^2_S\mu_f})\)
- Aggregate losses or pure premium. To establish the standard for full credibility of pure premium, assuming that claim counts and claim sizes are independent, the formula for the (expected) number of claims needed for full credibility can be derived as:
- Full Credibility for Aggregate Loss:
\(e_F=n_0\,CV^2_{\text{agg}}=n_0\dfrac{E[N]Var(X)+Var(N)E[X]^2}{(E[N]E[X])^2}=n_0\dfrac{\mu_f\sigma^2_S+\sigma^2_f\mu^2_S}{(\mu_f\mu_S)^2}=n_0\left(\dfrac{\sigma^2_S}{\mu^2_S\mu_f}+\dfrac{\sigma^2_f}{\mu^2_f}\right)\)
Experience Expressed in | Number of Claims | Claim Size (Severity) | Aggregate Losses / Pure Premium |
\(e_F\) | \(n_0(\dfrac{\sigma^2_f}{\mu^2_f})\) | \(n_0 (\dfrac{\sigma^2_S}{\mu^2_S\mu_f})\) | \(n_0 (\dfrac{\sigma^2_f}{\mu^2_f}+\dfrac{\sigma^2_S}{\mu^2_S\mu_f})\) |
\(n_F\) | \(n_0(\dfrac{\sigma^2_f}{\mu_f})\) | \(n_0 (\dfrac{\sigma^2_S}{\mu^2_S})\) | \(n_0 (\dfrac{\sigma^2_f}{\mu_f}+\dfrac{\sigma^2_S}{\mu^2_S})\) |
Classical Credibility: Partial Credibility
When there is inadequate experience for full credibility, we must determine \(Z\), the credibility factor. The credibility premium \(P_C\) is determined by:
\(P_C=Z\bar{X}+(1-Z)M\) or \(P_C=M+Z(\bar{X}-M)\), where
- \(M\) is the manual premium, the premium initially assumed if there is no credibility.
The credibility factor for \(n\) expected claims is:
\(Z=\sqrt{\dfrac{n}{n_F}}\), where
- \(n_F\) is the number of expected claims needed for full credibility.
The credibility factor for \(e\) expected exposures is:
\(Z=\sqrt{\dfrac{e}{e_F}}\), where
- \(e_F\) is the number of expected exposures needed for full credibility.
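Putting the formulas of these credibility lessons together, a sketch that builds \(n_0\), a full-credibility standard for aggregate losses with Poisson frequency, and then the partial-credibility premium; every input value is hypothetical:

```python
from statistics import NormalDist
from math import sqrt

P, k = 0.90, 0.05                              # probability and range parameters (hypothetical)
y_p = NormalDist().inv_cdf((1 + P) / 2)        # two-sided standard normal coefficient
n0 = (y_p / k) ** 2                            # expected claims for full credibility of claim counts

cv2_sev = 1.5                                  # hypothetical squared CV of the severity distribution
nF = n0 * (1 + cv2_sev)                        # full-credibility standard for aggregate losses, in claims

n_observed, manual, xbar = 600, 1200.0, 1350.0 # hypothetical experience: expected claims, M, sample mean
Z = min(1.0, sqrt(n_observed / nF))            # partial credibility (square-root rule)
P_C = Z * xbar + (1 - Z) * manual              # credibility premium

print(round(n0), round(nF), round(Z, 3), round(P_C, 2))
```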
Bayesian Estimation and Credibility – Discrete Prior
The posterior distribution is developed for the parameters using Bayes’ Theorem, which for discrete priors and models is:
\(Pr(A|B)=\dfrac{Pr(B|A)Pr(A)}{Pr(B)}\)
Bayesian Premium
Step | Quantity | Description |
Step 1 | Prior Probability | Enter the given prior probability that the risk is in Class \(i\) |
Step 2 | Likelihood of Experience | Enter the given likelihood of the experience given the class \(i\) |
Step 3 | Joint Probabilities | Calculate the product of the results of step 1 & 2, which is the probability of being in the class \(i\) and having the observed experience, or the joint probability |
Step 4 | Posterior Probabilities | Divide each result of Step 3 by the sum of the Step 3 results over all classes; the quotient is the posterior probability of being in each class given the experience |
Step 5 | Hypothetical Means | Calculate the expected value given that the risk is in class \(i\), which is known as the hypothetical mean |
Step 6 | Bayesian Premium | Calculate the product of the results of Steps 4 & 5. The sum of the Step 6 results over all classes is the expected size of the next loss for this risk and is known as the Bayesian premium |
Remark:
- Steps 1 to 4 calculate the probability that a risk belongs to class \(i\)
- Steps 1 to 6 calculate the expected size of the next loss for a risk (a numerical sketch follows)
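A sketch of the six steps for a hypothetical two-class example in which the experience is one claim from a risk with Poisson frequency:

```python
import math

# Hypothetical two-class model: prior probabilities, Poisson means, and expected next loss per class
prior = [0.75, 0.25]                     # step 1
lams  = [0.1, 0.3]                       # class frequency parameters
hyp_means = [100.0, 300.0]               # step 5: hypothetical means

x = 1                                    # observed number of claims
like = [math.exp(-l) * l**x / math.factorial(x) for l in lams]   # step 2

joint = [p * L for p, L in zip(prior, like)]                     # step 3
posterior = [j / sum(joint) for j in joint]                      # step 4

bayes_premium = sum(p * m for p, m in zip(posterior, hyp_means)) # step 6
print(posterior, bayes_premium)
```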
Bayesian Estimation and Credibility – Continuous Prior
Calculating Posterior and Predictive Distributions
If the prior distribution is continuous, then the posterior probability is:
\(\pi(\theta|x_1,…x_n)=\dfrac{f(x_1,…x_n|\theta)\pi(\theta)}{f(x_1,…x_n)}=\dfrac{f(x_1,…x_n|\theta)\pi(\theta)}{\int{f(x_1,…x_n|\theta)\pi(\theta)d\theta}}\), where
- \(\pi(\theta)\) is the prior density
- \(\pi(\theta|x_1,…x_n)\) is the posterior density
The predictive distribution is:
\(f(x_{n+1}|x_1,…x_n)=\int{f(x_{n+1}|\theta)\pi(\theta|x_1,…,x_n)d\theta}\)
Calculate the posterior expected value of the parameter
Integrate the parameter \(\theta\) over the posterior function \(\pi(\theta|x_1,…x_n)\)
Calculate the expected value of the next claim, or the expected value of the predictive distribution, or the Bayesian premium
Integrate the claim variable \(x\) over the predictive distribution \(f(x_{n+1}|x_1,…x_n)\)
However, it is usually easier to integrate \(E[X|\theta]\) over the posterior distribution, where \(E[X|\theta]\) is the expected value of the model given the parameter \(\theta\).
Calculate the posterior probability that the parameter is in a certain range
Integrate the posterior density function \(\pi(\theta|x_1,…x_n)\) over that range.
Calculate the probability that the next claim is in a certain range
Integrate the predictive density function \(f(x_{n+1}|x_1,…x_n)\) over that range.
Recognizing the Posterior Distribution
The posterior distribution is a probability function only of the parameter(s) \(\theta\), not of the observations \(x\). We can drop constants when developing it. If, as a function of \(\theta\), it has a form that matches a distribution we know (which is expressed as a function of \(x\) in the tables, so replace \(\theta\) with \(x\) when trying to recognize it), we can fill in the constant ourselves; the constant is whatever is needed to make the density function integrate to 1.
Loss Functions
Skipped.
The Linear Exponential Family and Conjugate Priors
If the posterior distribution is in the same family as the prior distribution, just with different parameters, then the prior distribution is called the conjugate prior for the model.
Bayesian Credibility: Poisson/Gamma
Posterior Distribution
If a posterior hypothesis comes from the same family of distributions as the prior hypothesis, the prior hypothesis is called the conjugate prior of the model.
- Suppose claim frequency \(N|\lambda\sim Poisson(\lambda)\), \(\lambda\) varying by insured: \(\lambda \sim Gamma(\alpha,\gamma=\theta^{-1})\).
- Suppose there are \(n\) exposures. This could be:
- \(n\) years for one insured, or
- \(n\) insureds for one year, or
- the sum of \(n_i\) insureds in year \(i\) for years \(i=1,…,m,\sum\nolimits_{i=1}^{m}{n_i}\)
- Suppose there are \(x\) claims.
\(\boldsymbol{\lambda|N\sim Gamma(\alpha_*=\alpha+x,\gamma_*=\gamma+n)}\)
The posterior mean is:
\(E[\lambda|N]=\alpha_*\gamma_*^{-1}\)
Predictive Distribution
\(\boldsymbol{N_{i+1}|N_i\sim NB(r=\alpha_*,\beta=\gamma_*^{-1})}\)
Credibility Factor
The mean of the experience in this case is \(x/n=\bar{x}\).
The posterior mean can be expressed as a linear credibility formula with a credibility factor \(Z\), analogous to the classical credibility partial credibility formula \(P_C=(1-Z)\mu+Z\bar{x}\), as follows:
\(P_C=\dfrac{\alpha_*}{\gamma_*}=\dfrac{\alpha+n\bar{x}}{\gamma+n}=\dfrac{\gamma}{\gamma+n}\dfrac{\alpha}{\gamma}+\dfrac{n}{\gamma+n}\bar{x}\)
In other words, the credibility factor is \(\boldsymbol{Z=\dfrac{n}{n+\gamma}}\).
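A sketch of the Poisson/gamma update in the rate parametrization used here, with hypothetical prior parameters and experience:

```python
alpha, gamma = 3.0, 0.5      # hypothetical prior Gamma(shape alpha, rate gamma) for lambda
n, x = 50, 20                # hypothetical exposures observed and total claims

alpha_star = alpha + x       # posterior shape
gamma_star = gamma + n       # posterior rate

posterior_mean = alpha_star / gamma_star
Z = n / (n + gamma)          # credibility factor for this conjugate pair
check = Z * (x / n) + (1 - Z) * (alpha / gamma)   # same value via the credibility-weighted form

print(posterior_mean, Z, check)
```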
Bayesian Credibility: Normal/Normal
Posterior Distribution
The normal distribution as a prior distribution is the conjugate prior of a model having the normal distribution with a fixed variance.
- The model is \(X|\theta\sim Normal(\theta,v)\), with prior \(\theta\sim Normal(\mu,a)\).
- There are \(n\) exposures, which means \(n\) losses (or \(n\) person-years if this is a model for aggregate losses).
- The posterior mean is a weighted average of the prior mean and the sample mean \(\bar{x}\).
The weights are \(v\) on \(\mu\) and \(na\) on \(\bar{x}\); the posterior mean and variance are:
\(\theta|X\sim Normal(\mu_*=\dfrac{v(\mu)+na(\bar{x})}{v+na}=\dfrac{(\frac{1}{a})\mu+(\frac{n}{v})\bar{x}}{(\frac{n}{v})+(\frac{1}{a})},a_*=(\dfrac{v}{v+na})a)\)
\(\boldsymbol{\theta|X\sim Normal(\mu_*=Z\bar{x}+(1-Z)\mu,a_*=(1-Z)a)}\), where \(\boldsymbol{Z=\dfrac{n}{n+v/a}}\)
Predictive Distribution
Since \(X_{i+1}|\theta\sim Normal(\theta,v)\) and \(\theta|X_i\sim Normal(\mu_*,a_*)\), and \(Var(X_{i+1}|X_i)=E[Var(X_{i+1}|\theta)]+Var(E[X_{i+1}|\theta])=E[v]+Var(\theta)=v+a_*\)
The predictive distribution is Normal, since a Normal mixture of Normal distributions is Normal:
\(\boldsymbol{X_{i+1}|X_{i}\sim Normal(\mu_*,a_*+v)}\)
Lognormal Model
If the model is \(Y|\theta\sim Lognormal(\theta,v)\) with \(\theta\sim Normal(\mu,a)\), then:
- \(\bar{x}=\dfrac{\sum{\ln x_i}}{n}\)
- \(X=\ln Y\sim Normal(\theta,v)\)
- \(X_{i+1}|X_{i}\sim Normal(Z\bar{x}+(1-Z)\mu,\,(1-Z)a+v)\)
- \(Y_{i+1}|Y_{i}\sim Lognormal(Z\bar{x}+(1-Z)\mu,\,(1-Z)a+v)\)
Bayesian Credibility: Bernoulli/Beta
Posterior Distribution
The Beta distribution as a prior distribution is the conjugate prior of a model having the Bernoulli distribution:
- The model has a \(X|q\sim Bernoulli(q)\), \(q\sim Beta(a,b,1)\).
- \(n\) Bernoulli trials are observed with \(k\) successes
Then the posterior distribution is:
\(\boldsymbol{\pi(q|X)\sim Beta(a_*=a+k,b_*=b+n-k,1)}\)
Predictive Distribution
The predictive distribution for the next claim is also Bernoulli:
\(\boldsymbol{X_{i+1}|X_{i}\sim Bernoulli(\dfrac{a_*}{a_*+b_*})}\)
The predictive mean is:
\(E[X]=E[E[X|q]]=E[q]=\dfrac{a_*}{a_*+b_*}\)
If the model is \(X\sim Binomial(m,q)\), the predictive mean is:
\(\boldsymbol{E[X]=E[E[X|q]]=E[mq]=m(\dfrac{a_*}{a_*+b_*})}\)
Credibility Factor
\(E[q|X]=\dfrac{a_*}{a_*+b_*}=\dfrac{a+k}{a+b+n}=(\dfrac{a+b}{n+a+b})(\dfrac{a}{a+b})+(\dfrac{n}{n+a+b})(\dfrac{k}{n})=(1-Z)E[q]+Z(\dfrac{k}{n})\)
In other words, the credibility factor is \(Z=\dfrac{n}{n+a+b}\)
Bayesian Credibility: Exponential/Inverse Gamma
Posterior Distribution
- Assume that claim size \(X|\theta\sim Exp(\theta)\), and \(\theta\sim\Gamma^{-1}(\alpha,\beta)\)
- \(n\) claims \(x_1,…,x_n\) are observed
Then the posterior distribution is
\(\boldsymbol{\Gamma^{-1}(\alpha_*=\alpha+n,\beta_*=\beta+n\bar{x})}\)
Since:
- \(\prod{f(x_i|\theta)}=\prod{\dfrac{1}{\theta}e^{-x_i/\theta}}=\dfrac{e^{-\sum{x_i}/\theta}}{\theta^n}\)
- \(\pi(\theta)=\dfrac{\beta^\alpha}{\Gamma(\alpha)}\theta^{-(\alpha+1)}e^{-\beta/\theta}\propto \theta^{-(\alpha+1)}e^{-\beta/\theta}\)
- \(\prod{f(x_i|\theta)}\pi(\theta)=(\dfrac{e^{-\sum{x_i}/\theta}}{\theta^n})(\theta^{-(\alpha+1)}e^{-\beta/\theta})=\theta^{-(\alpha+n+1)}e^{-(\beta+\sum{x_i})/\theta}\)
Posterior of Gamma / Inverse Gamma Conjugate Prior Pair
The posterior distribution is
\(\boldsymbol{\Gamma^{-1}(\alpha_*=\alpha+n\eta,\beta_*=\beta+n\bar{x})}\)
Since:
- \(\prod{f(x_i|\theta)}=\prod{\dfrac{1}{\Gamma(\eta)\theta^\eta}x_i^{\eta-1}e^{-x_i/\theta}}\propto \dfrac{e^{-\sum{x_i}/\theta}}{\theta^{n\eta}}\)
- \(\pi(\theta)=\dfrac{\beta^\alpha}{\Gamma(\alpha)}\theta^{-(\alpha+1)}e^{-\beta/\theta}\propto \theta^{-(\alpha+1)}e^{-\beta/\theta}\)
- \(\prod{f(x_i|\theta)}\pi(\theta)=(\dfrac{e^{-\sum{x_i}/\theta}}{\theta^{n\eta}})(\theta^{-(\alpha+1)}e^{-\beta/\theta})=\theta^{-(\alpha+n\eta+1)}e^{-(\beta+\sum{x_i})/\theta}\)
Credibility Factor
\(P_C=\dfrac{\beta_*}{\alpha_*-1}=\dfrac{\beta+n\bar{x}}{\alpha+n-1}=(\dfrac{\alpha-1}{\alpha+n-1})(\dfrac{\beta}{\alpha-1})+(\dfrac{n}{\alpha+n-1})\bar{x}=(1-Z)\mu+Z\bar{x}\)
In other words, the credibility factor is \(Z=\dfrac{n}{\alpha+n-1}\)
Bühlmann Credibility: Basics
Introduction
The Bühlmann method is a linear approximation of the Bayesian method. The line is picked to be the weighted least squares approximation of the Bayesian result. Let:
- \(\Theta\) represents the hypothesis as to the risk class to which an exposure belongs
- \(\mu(\Theta)\), the hypothetical mean, is the conditional mean conditioned on the value of the hypothesis \(\Theta\)
- \(v(\Theta)\), the process variance is the conditional variance conditioned on the value of the hypothesis \(\Theta\)
The important results are:
- \(\mu=E_\Theta[\mu(\Theta)]\), or EHM, is the expected value of the hypothetical mean, or the overall mean.
- \(a=Var_\Theta(\mu(\Theta))\), or VHM, is the variance of the hypothetical mean.
- \(v=E_\Theta[v(\Theta)]\), or EPV, is the expected value of the process variance.
- Bühlmann’s K: \(K=\dfrac{EPV}{VHM}\)
- Bühlmann’s Credibility Factor: \(Z=\dfrac{N}{N+K}\), where \(N\) is the number of observations (a numerical sketch follows this list):
- the number of periods when studying frequency or aggregate losses
- the number of claims when studying severity
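A sketch of the Bühlmann quantities for a hypothetical two-class risk with equal prior weights and Poisson-like process variances, observed for \(N=3\) periods:

```python
# Hypothetical two-class structure with equal prior weights
weights   = [0.5, 0.5]
hyp_means = [0.2, 0.6]       # mu(theta) per class
proc_vars = [0.2, 0.6]       # v(theta) per class (variance equals mean, as for a Poisson)

mu  = sum(w * m for w, m in zip(weights, hyp_means))                 # overall mean
EPV = sum(w * v for w, v in zip(weights, proc_vars))                 # expected process variance
VHM = sum(w * m**2 for w, m in zip(weights, hyp_means)) - mu**2      # variance of hypothetical means

K = EPV / VHM
N = 3                        # number of observation periods
Z = N / (N + K)

xbar = 1.0                   # hypothetical observed mean frequency
premium = Z * xbar + (1 - Z) * mu
print(K, Z, premium)
```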
Comparing Bühlmann Credibility with Classical Credibility
- Classical credibility ignores the variance between the hypothetical means and uses a square-root rule which allows full credibility. Bühlmann methods never assign full credibility; the credibility factor converges to 1 but never reaches it.
- The Bayes method calculates the true expected value, whereas the Bühlmann method is only an approximation of the expected value.
- For both Bühlmann credibility and classical credibility, if a risk is selected at random, the expected value is the overall mean of the observations, and no credibility is assigned.
Bühlmann Credibility: Discrete Prior
The Exposure Unit
The exposure unit is based on the item for which the credibility premium is charged, such as:
- Number of claims per insured
- Claim size per claim
The Unit You are Calculating Credibility for
Occasionally you will encounter a situation where you are given heterogeneous groups, but you are asked for the Bühlmann credibility premium for an insured selected from a group. You must calculate the variance of the hypothetical means taking into account the means of each insured, not the means of the groups.
Bühlmann Credibility: Continuous Prior
The task is to identify the hypothetical mean \(\mu(\Theta)\) and process variance \(v(\Theta)\), then to calculate the mean and variance of the former (\(\mu\) and \(a\)) and the mean of the latter (\(v\)). The categories are:
- Hypothetical mean and process variance are sums of powers of the parameter(s), and the prior distribution of the parameter(s) is in the tables.
- Hypothetical mean and process variance are sums of powers of the parameter(s), and the prior distribution is uniform.
- Other situations, where we’ll need to integrate to calculate \(a\) and \(v\).
Bühlmann-Straub Credibility
The Bühlmann model assumes one exposure in every period. We would like to generalize to a case where there are \(m_j\) exposures in period \(j\).
If the expected process variance is \(v\), then the expected process variance of the mean of the observations in period \(j\) is \(\dfrac{v}{m_j}\), because the variance of the sample mean is the distribution variance divided by the number of observations.
The generalization of the Bühlmann formula:
- Sets the experience mean \(\bar{X}\) as the weighted mean, weighted by exposures.
- The EPV and VHM are computed based on one exposure unit. In the \(Z=N/(N+K)\) formula, \(N\) is the sum of the weights.
Exact Credibility
When the model distribution is a member of the linear exponential family and a conjugate prior distribution is used as the prior hypothesis, the posterior mean is a linear function of the mean of the observations. If the prior mean exists, since the Bühlmann credibility estimate is the least squares linear estimate of the posterior mean, it is exactly equal to the posterior mean in this case.
Model | \(Poisson(\lambda)\) | \(Bernoulli(q)\) | \(Normal(\theta,v)\) | \(Exponential(\theta)\) |
Prior | \(\Gamma(\alpha, \eta=\theta^{-1}) \) | \(Beta(a,b)\) | \(Normal(\mu,a)\) | \(Inv\Gamma(\alpha, \theta)\) |
Posterior | \(\Gamma(\begin{align} & \alpha_*=\alpha+n\bar{x} \\ & \eta_*=\eta+n \\ \end{align})\) | \(Beta(\begin{align} & a_*=a+n\bar{x} \\ & b_*=b+n(1-\bar{x}) \\ \end{align})\) | \(Normal(\begin{align} & \mu_*=\dfrac{v\mu+na\bar{x}}{v+na} \\ & a_*=\dfrac{av}{v+na} \\ \end{align})\) | \(Inv\Gamma(\begin{align} & \alpha_*=\alpha+n \\ & \theta_*=\theta+n\bar{x} \\ \end{align})\) |
Predictive | \(NB(r=\alpha_*, \beta=\eta_*^{-1}) \) | \(Bernoulli(q=\dfrac{a_*}{a_*+b_*})\) | \(Normal(\mu=\mu_*,\sigma^2=a_*+v)\) | \(Pareto(\alpha=\alpha_*,\theta=\theta_*)\) |
Bühlmann EPV | \(\alpha\theta\) | \(\dfrac{ab}{(a+b)(a+b+1)}\) | \(v\) | \(\dfrac{\theta^2}{(\alpha-1)(\alpha-2)}\) |
Bühlmann VHM | \(\alpha\theta^2\) | \(\dfrac{ab}{(a+b)^2(a+b+1)}\) | \(a\) | \(\dfrac{\theta^2}{(\alpha-1)^2(\alpha-2)}\) |
Bühlmann K | \(\theta^{-1}\) | \(a+b\) | \(v/a\) | \(\alpha-1\) |
Bühlmann as Least Squares Estimate of Bayes
Regression
Suppose you want to estimate \(\gamma_i\) by \(\hat{\gamma}_i\) which is a linear function of \(X_i\):
\(\hat{\gamma}_i=\alpha+\beta X_i\)
Moreover, \(\alpha\) and \(\beta\) are to be selected to minimize the weighted least square difference, where \(p_i\) is the weight of observation \(i\):
Minimize \(\sum{p_i(\hat{\gamma}_i-\gamma_i)^2}\)
We can treat the \((X_i, \gamma_i)\) pairs as a joint probability distribution with \(Pr(X_i,\gamma_i)=p_i\), and use its moments. The formulas for \(\alpha\) and \(\beta\) in terms of the moments are:
\(\beta=\dfrac{Cov(X,\gamma)}{Var(X)}\)
\(\alpha=E[\gamma]-\beta E[X]\)
In the context of Bühlmann credibility,
\(X_i\) | Observations |
\(\gamma_i\) | Bayesian Predictions |
\(\hat{\gamma}_i\) | Bühlmann Predictions |
\(\beta\) | Credibility Factor \(Z\) |
Also, the overall mean of the Bayesian predictions equals the original mean, since each \(\gamma_i\) is \(E[X_{n+1}|X]\), and
\(E[\gamma]=E[E[X_{n+1}|X]]=E[X_{n+1}]\)
So \(E[\gamma]=E[X]\)
- The Equation for \(\alpha\) becomes:
\(\alpha=(1-Z)E[X]\)
- Variance and covariance can be calculated as follows (a numerical sketch follows the formulas):
\(Var(X)=\sum{p_i X^2_i}-E[X]^2\)
\(Cov(X,\gamma)=\sum{p_i X_i \gamma_i}-E[X]E[\gamma]\)
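A sketch recovering \(Z\) and the intercept from a hypothetical joint distribution of observations and Bayesian predictions (the \(\gamma_i\) values are chosen so that \(E[\gamma]=E[X]\)):

```python
# Hypothetical (observation, Bayesian prediction, probability) triples
pts = [(0, 0.50, 0.5), (1, 0.75, 0.3), (2, 1.125, 0.2)]

EX  = sum(p * x for x, y, p in pts)
EY  = sum(p * y for x, y, p in pts)                 # equals EX for these hypothetical values
VarX  = sum(p * x * x for x, y, p in pts) - EX**2
CovXY = sum(p * x * y for x, y, p in pts) - EX * EY

Z = CovXY / VarX             # slope of the weighted least-squares line = credibility factor
alpha = EY - Z * EX          # intercept; equals (1 - Z) * E[X] because E[gamma] = E[X]

print(Z, alpha)
```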
Graphic Questions
There were a couple of rules you used to eliminate the bad graphs:
- The Bayesian prediction must always be within the range of the hypothetical means, but Bühlmann predictions may not.
- The Bühlmann predictions must lie on a straight line as a function of the observation, or the mean of the observations.
- There should be Bayesian predictions both above and below the Bühlmann line. This follows from the fact that the Bühlmann prediction is a least squares estimate. However, symmetry is not required: since it is a weighted least squares estimate, it is possible that the Bayesian estimate is much lower at one point and only slightly higher at other points, because the probability of the former point may be low.
- The Bühlmann prediction must be between the overall mean and the mean of the observations. If there is only one observation, this means that the Bühlmann prediction must be between the overall mean and the observation.
\(Cov(X_i,X_j)\)
\(\begin{align} & Cov(X_i,X_j)=E[X_i X_j]-E[X_i]E[X_j] \\ & \qquad \qquad \quad \space =E_{\Theta}[E[X_i X_j|\Theta]]-E_{\Theta}[E[X_i|\Theta]]E_{\Theta}[E[X_j|\Theta]] \\ & \qquad \qquad \quad \space =E_{\Theta}[E[X_i|\Theta]E[X_j|\Theta]]-E_{\Theta}[\mu(\Theta)]E_{\Theta}[\mu(\Theta)] \\ & \qquad \qquad \quad \space =E_{\Theta}[\mu(\Theta)^2]-(E_{\Theta}[\mu(\Theta)])^2 \\ & \qquad \qquad \quad \space =Var(\mu(\Theta))=VHM \\ \end{align}\)
\(\begin{align} & Var(X_i)=E_{\Theta}[Var(X_i|\Theta)]+Var_{\Theta}(E[X_i|\Theta]) \\ & \qquad \quad \space = E[v(\Theta)]+Var(\mu(\Theta))=EPV+VHM \\ \end{align}\)
Empirical Bayes Non-Parametric Methods
“Empirical Bayes Non-Parametric Estimation” and “Empirical Bayes Semi-Parametric Estimation” methods can be used in the following two situations:
- There are \(r\) policyholder groups, and each one is followed for \(n_i\) years, where \(n_i\) may vary by group. Experience is provided by year.
- There are \(m_{ij}\) policyholders in group \(i\) in year \(j\).
There are a total of \(m_i\) exposure-years in group \(i\) over all years:
\(m_i=\sum\nolimits_{j=1}^{n_i}{m_{ij}}\)
There are a total of \(m\) exposure-years over all groups:
\(m=\sum\nolimits_{i=1}^{r}{m_i}\)
- The average per policyholder in group \(i\) in year \(j\) (this could be the claim count, the claim size, or aggregate losses) is \(X_{ij}\).
The average per policyholder in group \(i\) over all years is \(\bar{x}_i\):
\(\bar{x}_i=\dfrac{\sum\nolimits_{j=1}^{n_i}{m_{ij}x_{ij}}}{m_i}\)
The average claims per policyholder overall is \(\bar{x}\):
\(\bar{x}=\dfrac{\sum\nolimits_{i=1}^{r}{m_{i}\bar{x}_{i}}}{\sum{m_i}}\)
- There are \(r\) policyholder groups, and each group contains \(n_i\) policyholders. Experience is provided for one year by policyholder. (If more than one year of experience is provided, each combination of a policyholder and a year can be treated separately.) Credibility is calculated for each group, not for each policyholder.
- With policyholders indexed by \(j\) and group indexed by \(i\):
\(m_{ij}=1\) for all \(i\) and \(j\).
- The total number of policyholders in group \(i\):
\(m_i=\sum\nolimits_{j}^{}{m_{ij}}\)
Uniform Exposures
Bühlmann Framework
- There is one exposure in every cell, and every individual is observed for the same number of years:
\(m_{ij}=1\) for all \(i\) and \(j\), and \(n_i=n\), a constant, for all \(i\)
- For each individual, the mean and variance of the underlying distribution of losses is fixed:
Each \(X_{ij}\) has mean \(\mu(\Theta)\), variance \(v(\Theta)\), conditional on \(\Theta\), the hypothesis, which may vary by \(i\) but not by \(j\)
\(\hat{\mu}\) is the sample mean
The overall expected value of each \(X_{ij}\) is \(\mu\). Thus an unbiased estimator for \(\mu\) is the sample mean:
\(\hat{\mu}=\bar{x}=\dfrac{1}{rn}\sum\limits_{i=1}^{r}{\sum\limits_{j=1}^{n}{x_{ij}}}\)
\(\widehat{EPV}\) is the average of the sample variances of the rows
Since the sample variance with division by \(n-1\) is unbiased, for any independent identically distributed random variables \(\gamma_i\) with variance \(\sigma^2\):
\(E[\sum\limits_{i=1}^{n}{(\gamma_i-\bar{\gamma})^2}]=(n-1)\sigma^2\)
Since \(v(\Theta)\) is the conditional variance of each \(X_{ij}\):
\(E[\sum\limits_{j=1}^{n}{(X_{ij}-\bar{X}_i)^2}|\Theta]=(n-1)v(\Theta)\)
Taking the expected value of both sides of this equality over \(\Theta\):
\(\widehat{EPV}=\dfrac{1}{r}\sum\nolimits_{i=1}^{r}{\dfrac{1}{n-1}}\sum\limits_{j=1}^{n}{(X_{ij}-\bar{X}_i)^2}\)
\(\widehat{VHM}\) is the sample variance of the row means minus \(\widehat{EPV}/n\)
Similarly,
\(E[\sum\limits_{i=1}^{r}{(\bar{X}_i-\bar{x})^2}]=(r-1)Var(\bar{x}_i)\)
Since \(E[\bar{X}_i|\Theta]=\mu(\Theta)\) and \(Var(\bar{X}_i|\Theta)=\dfrac{v(\Theta)}{n}\):
\(Var(\bar{X}_i)=Var(E[\bar{X}_i|\Theta])+E[Var(\bar{X}_i|\Theta)]=Var(\mu(\Theta))+\dfrac{E[v(\Theta)]}{n}=a+\dfrac{v}{n}\)
Therefore,
\(\widehat{VHM}=\dfrac{1}{r-1}\sum\limits_{i=1}^{r}{(\bar{X}_i-\bar{X})^2}-\dfrac{\widehat{EPV}}{n}\)
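A sketch of the uniform-exposure estimators for a hypothetical array of \(r=3\) policyholders observed for \(n=4\) years each:

```python
# Hypothetical losses x[i][j] for r = 3 policyholders over n = 4 years
x = [[4, 6, 5, 7],
     [2, 3, 2, 1],
     [9, 8, 10, 9]]
r, n = len(x), len(x[0])

row_means = [sum(row) / n for row in x]
mu_hat = sum(sum(row) for row in x) / (r * n)

# EPV: average of the rows' sample variances (divide by n - 1)
EPV = sum(sum((xij - xbar)**2 for xij in row) / (n - 1)
          for row, xbar in zip(x, row_means)) / r

# VHM: sample variance of the row means, less EPV/n
VHM = sum((xbar - mu_hat)**2 for xbar in row_means) / (r - 1) - EPV / n

K = EPV / VHM
Z = n / (n + K)
premiums = [Z * xbar + (1 - Z) * mu_hat for xbar in row_means]
print(mu_hat, EPV, VHM, Z, premiums)
```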
Non-Uniform Exposures
In the general case, the Bühlmann-Straub framework, exposures \((m_{ij})\) vary by cell. Now:
- \(E[X_{ij}|\Theta]=\mu(\Theta)\)
- \(Var(X_{ij}|\Theta)=v(\Theta)/m_{ij}\), since the variance of the sample mean of \(m_{ij}\) observations is the distribution variance divided by \(m_{ij}\)
– | Uniform Exposure | Non-Uniform Exposure |
---|---|---|
\(\hat{\mu}\) | \(\bar{x}\) | \(\bar{x}\) |
\(\widehat{EPV}\) | \(\dfrac{1}{r(n-1)}\sum\limits_{i=1}^{r}{\sum\limits_{j=1}^{n}{(x_{ij}-\bar{x}_i)^2}}\) | \(\dfrac{\sum\nolimits_{i=1}^{r}{\sum\nolimits_{j=1}^{n_i}{m_{ij}(x_{ij}-\bar{x}_i)^2}}}{\sum\nolimits_{i=1}^{r}{(n_i-1)}}\) |
\(\widehat{VHM}\) | \(\dfrac{1}{r-1}\sum\limits_{i=1}^{r}{(\bar{x}_i-\bar{x})^2}-\dfrac{\widehat{EPV}}{n}\) | \(\dfrac{\sum\nolimits_{i=1}^{r}{m_i(\bar{x}_i-\bar{x})^2-\widehat{EPV}(r-1)}}{m-m^{-1}\sum\nolimits_{i=1}^{r}{m^2_i}}\) |
\(\hat{\mu}^{cred}\) | – | \(\dfrac{\sum\nolimits_{i=1}^{r}{\hat{Z}_i\bar{x}_i}}{\sum\nolimits_{i=1}^{r}{\hat{Z}_i}}\) |
Empirical Bayes Semi-Parametric Methods
Poisson Model
If the model is assumed to have a Poisson distribution, the hypothetical mean and process variance are equal: \(\mu(\Theta)=v(\Theta)\); then \(\widehat{EPV}=\hat{\mu}=\bar{x}\)
- Uniform Exposures – 1 Year Experience
- \(\widehat{VHM}=s^2-\widehat{EPV}\), where
- \(s^2\)
- \(=\dfrac{1}{r-1}\sum{(O-E)^2}\)
- \(=\dfrac{1}{r-1}\sum{m_i{(X_i-\bar{X})}^2}\)
- \(=\dfrac{r}{r-1}\sum{p_i{({X_i-\bar{X})}^2}}\)
- \(=\boldsymbol{\dfrac{1}{r-1}(\sum{X_i^2}-\dfrac{(\sum{X_i})^2}{r})}\)
- \(=\boldsymbol{\dfrac{r}{r-1}(\sum{{p_i}{X_i}^2}-{{\bar{X}}^2})}\)
- \(r\) is the number of policyholders, \(m_i\) is the number of exposures for observation \(i\).
- \(\widehat{VHM}\) estimated using Empirical Bayes Semi-Parametric Methods may be non-positive. In this case the method fails, and no credibility is given.
- \(\widehat{VHM}\) is estimated by \(s^2-\widehat{EPV}\) if all individual data are provided; otherwise calculate \(\widehat{VHM}\) directly using the Empirical Bayes Non-Parametric Method.
- Uniform Exposures – n Year Experience
- Non-Uniform Exposures (see the sketch after this list)
- Estimate \(\hat{\mu}\) as the sample mean, from which it follows that \(\widehat{EPV}=\hat{\mu}=\bar{x}\)
- Estimate each \(v_i\) from \(\bar{x}_i\)
- Calculate \(\widehat{VHM}=\dfrac{\sum\nolimits_{i=1}^{r}{m_i(\bar{x}_i-\bar{x})^2-\widehat{EPV}(r-1)}}{m-m^{-1}\sum\nolimits_{i=1}^{r}{m^2_i}}\)
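The sketch referenced above: the non-uniform semi-parametric calculation for hypothetical group exposures and claim counts, with the Poisson assumption \(\widehat{EPV}=\bar{x}\):

```python
# Hypothetical exposures m_i and total claim counts c_i for r groups
m = [50, 120, 80]
c = [2, 24, 4]
r = len(m)

m_tot = sum(m)
xbar_i = [ci / mi for ci, mi in zip(c, m)]
xbar = sum(c) / m_tot

EPV = xbar                                          # Poisson: process variance equals the mean
VHM = ((sum(mi * (xi - xbar)**2 for mi, xi in zip(m, xbar_i)) - EPV * (r - 1))
       / (m_tot - sum(mi**2 for mi in m) / m_tot))

if VHM <= 0:
    Z = [0.0] * r                                   # method fails; no credibility is given
else:
    K = EPV / VHM
    Z = [mi / (mi + K) for mi in m]

print(xbar, VHM, Z)
```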
Non-Poisson Models
If the model is not Poisson, but there is a linear relationship between \(\mu\) and \(v\), we can use the same technique as for a Poisson model:
- Estimate \(\mu\) as the sample mean
- Estimate \(v\) from \(\mu\)
- Estimate \(a=s^2-\widehat{EPV}\)
Examples of distributions with linear relationships between \(\mu\) and \(v\) are:
- Negative Binomial with fixed \(\beta\). Then \(E[N|r]=r\beta\) and \(Var(N|r)=r\beta(1+\beta)\)
- \(\hat{\mu}=\bar{x}\)
- \(\widehat{EPV}=\bar{x}(1+\beta)\)
- Gamma with fixed \(\theta\). Then \(E[X|\alpha]=\alpha\theta\) and \(Var(x|\alpha)=\alpha\theta^2\)
- \(\hat{\mu}=\bar{x}\)
- \(\widehat{EPV}=\bar{x}\theta\)