A Data Science Approach to Risk Assessment for Automobile Insurance Policies

In order to determine a suitable automobile insurance policy premium, one needs to take into account three factors: the risk associated with the drivers and cars on the policy, the operational costs associated with management of the policy, and the desired profit margin. The premium should then be some function of these three values. We focus on risk assessment using a Data Science approach. Instead of using the traditional frequency and severity metrics, we predict the total claims that will be made by a new customer using historical data of current and past policies. Given multiple features of the policy (age and gender of drivers, value of car, previous accidents, etc.), one can try to provide personalized insurance policies based specifically on these features as follows. We can compute the average claims made per year for each past and current policy with identical features and then take an average over these claim rates. Unfortunately, there may not be sufficient samples to obtain a robust average. We can instead include policies that are "similar" to obtain sufficient samples for a robust average. We therefore face a trade-off between personalization (only using closely similar policies) and robustness (extending the domain far enough to capture sufficient samples). This is known as the Bias-Variance Trade-off. We model this problem, determine the optimal trade-off between the two (i.e., the balance that provides the highest prediction accuracy) and apply it to the claim rate prediction problem. We demonstrate our approach using real data.


Introduction
Traditionally, insurance companies have determined automobile policy premiums using rate tables computed by actuaries (Hassani et al., 2020). Today, however, the vast amount of data collected in electronic form can be used to determine more suitable premiums for a given policy, since such data can be used to better predict risk (Errais, 2019). Furthermore, by using data from present and past customers, the predictions are better suited to the particular environment in which the insurance company operates. Such personalized policies benefit the customer (who pays an amount more in line with their risk) as well as the insurance company (which can better ensure that it can safely cover claims costs from risky policies). The typical approach is straightforward. For a given new customer, one can use historical data of past and present customers with similar characteristics (features) to better estimate the risk level of the new customer, and then use this to determine a premium for their policy. This is similar to recommender systems used by companies such as Netflix. In that case, movies are recommended to an individual based on movies that were enjoyed by customers with similar characteristics (collaborative filtering). In the case of insurance, one must recommend a policy that is both desirable to the customer (through personalization) and profitable to the insurance provider.

Related Work and Contributions
Many past papers have focused on recommender systems for insurance companies where one of a small number of insurance products is offered. Qazi et al. (2017, 2020) used historical data of existing and past customers to determine the most suitable policy for a new customer. In this case a relatively small number of insurance products are available, and hence the number of customers who have been using a specific product will be sufficiently high that the sample size is not an issue when computing the recommendation. Kanchinadam et al. (2018) also address the same problem but focus on speeding up the computation of the recommendations. He et al. (2018) and Bian et al. (2018) do address personalized auto insurance premiums, but they focus on using Telematics data to do so. Telematics devices are not available from all insurance companies and hence this approach has limited applicability. Esfandabadi et al. (0) use Fuzzy Logic to come up with a rule-based approach to risk. In our case we use a data science approach and focus on personalization while using traditionally available data.
Several papers also focus on risk assessment. In general, few customers make claims during a year. Furthermore, the claims that are made vary widely from minor incidents (such as a scratched bumper) to major ones (such as when a car has to be written off because it cannot be repaired). This results in a large variation in the average annual claims made by a customer, making it difficult to predict. Therefore papers generally focus on predicting either severity (the expected claim value given that a claim is made; Qazi et al. (2017); Su and Bai (2020)) or frequency (the expected number of claims made per year; Liu et al. (2014); David and Jemna (2015)). Another typical metric is the loss ratio, which is the ratio of the claims made to the premium charged (Guelman (2012); Zhang and Dukic (2013)). We focus on a more direct measure, which is the average total value of claims made per year for a given policy (shortened to simply claim rate), which can be thought of as the product of severity and frequency and hence captures both metrics. As mentioned before, predicting the claim rate can be challenging because of its high variance. A significant number of samples is required for a good estimate, but as one tries to achieve greater personalization the number of available samples decreases. We investigate the optimal trade-off between these objectives.
Note one potential issue with recommender systems. The recommender system chooses the most appropriate product for the customer, but this may not be a very profitable product for the company, and so this trade-off must be taken into account (see Hosein et al. (2019) for a more detailed discussion of this issue). In our case we need not worry about this issue, since we focus on providing the most suitable (unique) product and the premium is then determined to achieve an acceptable profit for each unique policy.
Our contribution is in the analysis of the trade-off between personalization and robustness. Instead of a finite number of products from which to choose for customer offerings, we provide a unique (personalized) product to each customer. Furthermore, we only take into account demographic and other data collected from each policy and do not consider Telematics data, although our approach can naturally include such data as well. Therefore we are (a) using a new model for risk (based on claim rate), (b) obtaining the best trade-off between personalization and robustness, (c) using the proposed approach for feature importance and selection and (d) demonstrating how the results obtained can be interpreted so that one can explain to the customer the reason for the provided premium.

Problem Formulation and Assumptions
We formulate a model for this problem and then develop an algorithm for its solution. Our objective is as follows: given policy information for a new or renewal automobile policy (i.e., information about the drivers, cars, etc.), predict the expected total amount in claims that will have to be paid out to this customer over the subsequent year of the policy. This prediction will be based on several factors but correlates with the risk associated with the drivers and cars on the policy. This information can then be used to determine an appropriate premium for the policy. Traditionally this computation is performed using risk tables, independently of the specific historical data of the company's customers. Here we use historical data of the provider's customers to make the prediction. This is more appropriate since the parameters used in the risk tables may have been developed based on a different customer base (country) and so may be unsuitable for the one under consideration.

New versus Renewal Policies
Note that we need to distinguish between a new policy, for which only customer-provided data is available, and a renewal, for which information about the customer since the start of their policy is available. We develop a model that can be applied to both new and renewal policies. In the case of renewal policies, the historical data of the policy is included in the training set. The proposed approach therefore automatically includes the past claim information of the policy. We assume that all policies that are at least one year old are included in the training set used for parameter determination. In this way recent information is included in the predictions. Note that this means there is no need for an accident penalty or a no-claim discount, since these adjustments are implicit.

Quantity versus Currency of Data
The more data that is used for predictions, the more accurate the predictions. However, as we increase the dataset by going further back in time we will be using outdated information (e.g., automobile models, cost of repairs, etc.). We manage this as follows. As the cost of claims increases with time, the claim rate of a policy will also increase. The prediction we get from using outdated information will therefore be lower than what would actually occur. We therefore scale predictions as follows. We predict the total claims for the previous year and then compute a scaling factor that makes the total predicted claims equal to the total actual claims. This scaling factor is then applied when making new predictions. The scaling factor computation is repeated every year so that the total predicted claims for the upcoming year will be close to the actual total claims for the year.
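The yearly rescaling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scale_factor` and all numeric values are hypothetical and are not taken from the dataset.

```python
def scale_factor(predicted, actual):
    """Return the factor that makes the scaled predictions sum to the actual total."""
    total_predicted = sum(predicted)
    if total_predicted <= 0:
        raise ValueError("total predicted claims must be positive")
    return sum(actual) / total_predicted


# Made-up previous-year values in normalized monetary units (illustration only).
last_year_predicted = [0.8, 1.2, 0.5, 1.5]
last_year_actual = [0.0, 3.0, 0.0, 1.4]

factor = scale_factor(last_year_predicted, last_year_actual)

# Apply the same factor to the raw predictions for the upcoming year.
next_year_raw = [0.9, 1.1]
next_year_scaled = [factor * p for p in next_year_raw]
```

By construction, multiplying last year's predictions by `factor` reproduces last year's total actual claims; the factor would be recomputed each year.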

Comprehensive versus Third-Party Policies
There are two types of policies: Comprehensive, in which the company has to pay for repairs to the customer's car even if they were at fault, and Third-Party, in which the company only pays for repairs to the other party involved in the accident (i.e., the third party). Note that the risk behaviour (and claims requests) of Third-Party versus Comprehensive policy customers may be different, but the approach we use has the ability to extract the relevant information. We therefore make predictions using the combined dataset (i.e., policies of both types) but include the type of policy as a feature. Note that the features for both types of policies are the same except that, for Comprehensive policies, there is also the Sum Insured feature (based on the value of the car). This value is set to zero for Third-Party policies so that the same model can be used for both policy types.

Multi-Car versus Single-Car Policies
For each policy we must predict the total annual claims for the policy, which may have multiple drivers and/or cars. Note that a premium is charged per car and the sum of these forms the policy premium. Our model uses the primary car and the primary driver of that car as the sample for that policy (and ignores all other drivers/cars). This means that the prediction is made for a single driver/car pair, and this can be repeated for each car on the policy to determine the total claim rate for the policy.

Dataset Description and Preparation
The policy data used for this study spans a period of 5 years and consists of data collected from past and existing customers. No confidential information is disclosed and all monetary values are normalized. Each policy record consists of policy information, information for each driver on the policy, information for each vehicle on the policy and information on each claim made on the policy since its inception. Some of this information is not relevant for our purposes (e.g., Vehicle Identification Number) and is ignored. Certain features must be derived from the information provided. For example, the policy lifetime is computed as the difference between the termination date and start date (if terminated) or the difference between the present date and the start date (if currently active). Note that the metric of concern is the average claim rate for a driver/car pair. For each policy we determine the total value of all claims made (by the primary driver) and divide by the total lifetime of the policy (in years) to obtain the claim rate.
Our objective is to predict the claim rate and use this claim rate to determine a suitable price. In order to do this we focus only on the primary driver and their associated car for each policy. This covers the majority of cases, so we do not lose too much information. For this driver we compute the claim rate based on accidents in which they were involved. We removed features that were mostly empty or corrupt and also applied filters to remove anomalous data, such as drivers over the age of 85. The data that was finally used is described in Table 1. POL is the policy number, which is used as a unique identifier for the policy. CLR is the average claims per year computed for the primary driver and their associated car for the policy. TOC is the type of policy (customer), which we also use as a feature. SIV is the sum insured value of the primary vehicle of the policy; this value is zero for Third-Party policies. All other features are described in the table.
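As a rough sketch of the claim rate derivation just described, the following Python computes a policy lifetime and the resulting claim rate. The function names, the 365.25-day year and the example values are assumptions made for illustration; the source does not specify how fractional years are handled.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def policy_lifetime_years(start, end=None, today=None):
    """Lifetime in years: up to the termination date, or up to today if still active."""
    if end is None:
        end = today if today is not None else date.today()
    return (end - start).days / DAYS_PER_YEAR


def claim_rate(claim_values, start, end=None, today=None):
    """Total value of claims divided by the policy lifetime in years."""
    years = policy_lifetime_years(start, end, today)
    if years <= 0:
        raise ValueError("policy lifetime must be positive")
    return sum(claim_values) / years


# Hypothetical terminated policy: two claims over exactly 730 days.
rate = claim_rate([100.0, 300.0], start=date(2018, 1, 1), end=date(2020, 1, 1))
```

A policy with no claims simply gets a claim rate of zero, which is the common case since few customers claim in a given year.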

Proposed Model
The model we propose is unique in that (a) the metric of concern is claim rate and (b) we use a novel solution approach rather than the traditional approaches. We do not present a full comparison with other Machine Learning approaches in this paper since our intent is to introduce the model. Future papers will include detailed comparisons with state of the art Machine Learning algorithms.

Definition of Distance Metric
In this section we describe the approach used for predicting the total financial claims per year (henceforth called the claim rate) for a given policy. We denote the set of features that we consider by F. Features include information about the driver such as age, gender, etc., as well as information about their associated vehicle such as model, body type, etc. We denote the set of samples by S, where a sample is a policy and includes features for the associated driver/car pair. One way to predict the claim rate is to find the expected value of the claim rates of all existing policies with identical features. However, there may be no, or very few, such policies. We must therefore also include policies with nearby features in the average.
In order to find "close" policies we need to define a distance between pairs of categories of a given feature and then use some measure (e.g., Euclidean distance) to define the distance between two policies. We define this distance as follows. For each category v of feature f, let C(f, v) denote the claim rate averaged over all policies that have value v for feature f. For example, for the feature gender (f = gender) with categories male and female, let C(gender, male) denote the average claim rate over all male drivers and let C(gender, female) denote the average claim rate over all female drivers. We define the distance between these two categories of this feature as |C(gender, male) − C(gender, female)|. In general, if a feature has several categories then the distance between any two of them is computed in this manner. Therefore, if the test policy has a male driver then their gender distance from another policy with a male driver is 0, while for a policy with a female driver it is |C(gender, male) − C(gender, female)|. Note that the same computation is done for numerical features such as age. For example, the distance between a 48 year old and a 30 year old is given by |C(age, 48) − C(age, 30)|. By doing this we maintain the same measurement unit (claim rate) for all distances. If the 48 year old is male and the 30 year old is female then the Euclidean distance is used (i.e., the square root of the sum of the squares of the gender and age feature distances).
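The distance definition above can be illustrated with a minimal Python sketch. The helper names (`category_means`, `policy_distance`), the dictionary layout of a sample and the four toy policies are assumptions for illustration only.

```python
from collections import defaultdict
from math import sqrt


def category_means(samples, features):
    """C(f, v): mean claim rate over the samples that have value v for feature f."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s in samples:
        for f in features:
            key = (f, s[f])
            sums[key] += s["claim_rate"]
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def policy_distance(a, b, C, features):
    """Euclidean distance between two policies, measured in claim-rate units."""
    # Note: a category that never occurs in the training data raises KeyError here.
    return sqrt(sum((C[(f, a[f])] - C[(f, b[f])]) ** 2 for f in features))


# Toy training data (made-up claim rates).
training = [
    {"gender": "male", "age": 48, "claim_rate": 2.0},
    {"gender": "male", "age": 30, "claim_rate": 4.0},
    {"gender": "female", "age": 30, "claim_rate": 1.0},
    {"gender": "female", "age": 48, "claim_rate": 1.0},
]
features = ["gender", "age"]
C = category_means(training, features)

# 48 year old male versus 30 year old female.
d = policy_distance(training[0], training[2], C, features)
```

In this toy data C(gender, male) = 3.0 and C(gender, female) = 1.0, while C(age, 48) = 1.5 and C(age, 30) = 2.5, so the distance is the square root of 2² + 1².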

Claim Rate Prediction
If there were several existing policies with exactly the same feature values as the test policy then one could obtain a good estimate of the claim rate for the test policy by taking the average of the claim rates over these policies. However, in general there may not be sufficient samples (or there may be none) to obtain an estimate with sufficient confidence, and so we need to include nearby samples as well. The more nearby samples we use the more robust the estimate, but the less personalized it is, since the included samples are further away. This in turn leads to lower prediction accuracy. We therefore take a weighted average of the claim rates of all policies, where the weight decreases with the Euclidean distance between the policies.
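Suppose we wish to predict the claim rate for some test policy, and denote the distance between this policy and some training policy $s$ by $d_s$. We use a weight $(1 + d_s)^{-\kappa}$, for $\kappa \ge 0$, when taking into account the claim rate of sample $s \in S$. However, we need a normalizing factor $\alpha$. The predicted claim rate $\hat{c}$ for the test sample is therefore given by

$$\hat{c}(\kappa) = \alpha \sum_{s \in S} \frac{c_s}{(1 + d_s)^{\kappa}} \tag{1}$$

where $c_s$ is the claim rate of policy $s$. If all policies had the same claim rate $c$ then the predicted claim rate should also have this value, and hence we must have

$$c \equiv \alpha \sum_{s \in S} \frac{c}{(1 + d_s)^{\kappa}} \tag{2}$$

and hence

$$\alpha = \left( \sum_{s \in S} \frac{1}{(1 + d_s)^{\kappa}} \right)^{-1} \tag{3}$$

and so we have the predicted claim rate for the test policy as

$$\hat{c}(\kappa) = \frac{\sum_{s \in S} c_s \, (1 + d_s)^{-\kappa}}{\sum_{s \in S} (1 + d_s)^{-\kappa}}. \tag{4}$$

The pseudo-code for this computation is provided in Algorithm 1.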
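A minimal Python sketch of this distance-weighted average follows, assuming the distances from the test policy to each training policy have already been computed. The function name `predict_claim_rate` and the numbers are illustrative, not taken from the paper's data.

```python
def predict_claim_rate(claim_rates, distances, kappa):
    """Weighted average of claim rates with weights (1 + d_s) ** (-kappa)."""
    if kappa < 0:
        raise ValueError("kappa must be non-negative")
    weights = [(1.0 + d) ** (-kappa) for d in distances]
    return sum(w * c for w, c in zip(weights, claim_rates)) / sum(weights)


# Made-up training claim rates and their distances to the test policy.
rates = [2.0, 4.0, 1.0]
dists = [0.0, 1.0, 3.0]

plain_mean = predict_claim_rate(rates, dists, kappa=0)    # every weight equals 1
personalized = predict_claim_rate(rates, dists, kappa=1)  # weights 1, 1/2, 1/4
nearest_only = predict_claim_rate(rates, dists, kappa=50) # dominated by d = 0
```

With κ = 0 the prediction is the plain mean of the training claim rates, and as κ grows the prediction is pulled towards the claim rates of the closest policies, which is the personalization versus robustness trade-off discussed in the text.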

Computing the Optimal value of κ
Next we determine the optimal value of $\kappa$. For an existing policy $s$ we have the actual claim rate $c_s$. Note that we can predict a claim rate for this sample (in which case the sample must be removed from the training set), and we denote this predicted value by $\hat{c}(\kappa)$. We introduce the hat to distinguish this predicted value from the actual value (which has no hat). Note that we use 5-fold cross validation, and hence 80% of the samples are used for training (computing the average claim rates $C(f, v)$) while the other 20% are used for testing (and determination of the accuracy). Note that when $\kappa = 0$ we have $\hat{c}(\kappa) = c$, and so the prediction is simply the average over all (training) samples. As $\kappa$ is increased, close samples are weighted more heavily, which initially reduces the error, but the average also becomes less robust and hence the error will eventually start increasing again. Therefore the optimal $\kappa$ lies somewhere in between (see Figure 4 for an example of this relationship). We therefore find the $\kappa$ that minimizes the Mean Absolute Error (MAE) of the prediction. For convenience we normalize this by the MAE obtained if one used the average claim rate over all policies, $c$, as the predictor. One can think of this case as making a prediction without features. We therefore compare the error of the prediction made with features with the error of the prediction made without features. Let us denote the test set by $T$; then we compute the normalized error over the test samples as

$$E(\kappa) = \frac{\sum_{t \in T} \left| c_t - \hat{c}_t(\kappa) \right|}{\sum_{t \in T} \left| c_t - c \right|}.$$

If the predictor is the same as averaging over all policies (i.e., $\hat{c}_t(\kappa) = c$) then this ratio is 1. However if, by adding features, the MAE of the predictor is decreased then this ratio drops below 1.
Therefore this metric provides an indication of prediction performance using features when compared to prediction performance without using features, and hence demonstrates the benefit of the feature-based approach. We then find the $\kappa$ value that optimizes the predictor as

$$\kappa^* = \arg\min_{\kappa \ge 0} E(\kappa).$$

This value is then used to obtain the optimal prediction as $c_t^* = \hat{c}_t(\kappa^*)$.
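The search for κ* can be sketched as a simple grid search over the normalized error. This sketch makes several simplifying assumptions that are not from the paper: a single train/test split instead of 5-fold cross validation, precomputed distances, an integer grid for κ, and made-up data; all function names are hypothetical.

```python
def weighted_prediction(train_rates, distances, kappa):
    """Distance-weighted average with weights (1 + d) ** (-kappa)."""
    weights = [(1.0 + d) ** (-kappa) for d in distances]
    return sum(w * c for w, c in zip(weights, train_rates)) / sum(weights)


def normalized_error(actual, predicted, baseline):
    """MAE of the predictions divided by the MAE of the constant baseline predictor."""
    numerator = sum(abs(a - p) for a, p in zip(actual, predicted))
    denominator = sum(abs(a - baseline) for a in actual)
    if denominator == 0:
        raise ValueError("baseline error is zero; normalized error is undefined")
    return numerator / denominator


def best_kappa(train_rates, test_rates, test_distances, kappas):
    """Return (kappa*, errors), where errors maps each candidate kappa to E(kappa)."""
    baseline = sum(train_rates) / len(train_rates)
    errors = {}
    for k in kappas:
        preds = [weighted_prediction(train_rates, d, k) for d in test_distances]
        errors[k] = normalized_error(test_rates, preds, baseline)
    return min(errors, key=errors.get), errors


# Toy data: two low-risk and two high-risk training policies, two test policies.
train_rates = [0.0, 0.0, 4.0, 4.0]
test_rates = [0.5, 3.5]
test_distances = [[0.0, 0.0, 2.0, 2.0], [2.0, 2.0, 0.0, 0.0]]

k_star, errors = best_kappa(train_rates, test_rates, test_distances, range(5))
```

In this toy example E(0) equals 1 (the no-feature predictor), the error falls as κ increases to 2 and then rises again, mirroring the shape described in the text.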

Feature Importance
Consider a single feature. We know from the previous section that, as κ is increased, E(κ) should initially decrease before increasing once again. If this does not occur then the feature does not capture sufficient information to be useful for predictions. One can therefore use the value of E(κ) evaluated at the optimal κ for that feature alone as an indication of importance. In fact, even if we use a fixed value of κ for each feature, the corresponding value of E(κ) is an indication of relative importance, with lower values indicating more importance. For example, in Figure 1 we plot E(κ) as a function of κ for two features, DAF (years claim free for the driver) and YCF (years claim free for the car). For DAF the minimum error occurs at κ = 8, while for YCF it occurs at κ > 20. However, at κ = 10 we find that the respective values of E(κ) are good approximations of the optimal values and hence can be used to compare the two features. Also note that here we clearly see that risk depends primarily on the driver, with the car playing a minor role.
We therefore use this approach to determine which features are important and hence should be included in the analysis. We use a value of κ = 10 and, using a single feature at a time, we compute E(10). The resulting values are provided in Figure 2. The features represented in red have normalized errors greater than 1.

Feature Selection
Now that we know which features are important, we focus on which of them should be included in the model. We do this as follows. Starting from the most important feature (lowest value of E(10)), we add one feature at a time and again compute E(10) for the combination. If the performance metric decreases (i.e., better results) then we keep the feature. If the performance metric increases then we remove the recently added feature. We then repeat with the next feature. Note that some features may have low importance when considered in isolation, but together with other features (such as type of policy) their value increases. The results of this process are provided in the lower part of Figure 2. A brown bar indicates that the addition of a feature resulted in a loss of performance and hence the feature should be removed.
The final features to be used include the number of years since the driver was last in an accident, the number of at-fault accidents by the driver over the last five years and the number of not-at-fault accidents over the last five years. Each of these is a strong indicator of risk. In the case of renewals we would have actual claim rate values, but for new customers these three features (even without financial information) correlate well with claim rate. The type of policy feature is also needed since it helps to distinguish the two types of policies. Note that we could do the analysis separately for each type of policy, but the increase in sample size obtained by combining the two types provides better overall results. Only one car feature was found to be sufficiently beneficial, namely the number of years since the last claim was made on the car. However, this feature is far less important than the driver features that were included, indicating that what really matters is the driver on the policy and not the car. Whether the driver was continually insured over the last five years (i.e., mature driver), the type of use (personal versus business) and the gender of the driver were also found to be useful (but far less so than the others).
Note that we had expected certain features (like age) to be beneficial but they were not. In Figure 3 we provide a histogram of the average claim rate by age (in blue). We see that there is a weak dependency on age but, because of the large variations from year to year (due to limited data), the dependency is not sufficiently robust. Next we predicted the claim rate for each age using the approach described previously. We found that the optimal value of κ was 2, with a normalized MAE of E(2) = 0.9996, which indicates that only limited personalization was possible. We then used this value to find optimal claim rate values for each age. In Figure 3 we provide the histogram of the original claim rates (blue) and the filtered claim rates (red). We note that the red claim rates are each close to unity and hence provide little differentiation. This is why this feature does not provide much benefit for predictions.
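The selection procedure described above is a greedy forward pass, and can be sketched in Python as follows. The `evaluate` callable stands in for computing E(10) on a feature combination; it and the toy error values are hypothetical, as is starting the comparison from the no-feature error of 1.0. DAF and TOC are the paper's feature labels, while NAF and AGE are labels assumed here for the at-fault accident count and the driver's age.

```python
def forward_select(ranked_features, evaluate):
    """Greedy forward selection.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable mapping a list of features to a normalized error.
    """
    selected = []
    best_error = 1.0  # assumption: start from the no-feature predictor, E = 1
    for f in ranked_features:
        candidate = selected + [f]
        err = evaluate(candidate)
        if err < best_error:
            # The addition improved the metric, so the feature is kept.
            selected, best_error = candidate, err
        # Otherwise the feature is dropped and we move on to the next one.
    return selected, best_error


# Made-up E(10) values for a few feature combinations (illustration only).
toy_errors = {
    ("DAF",): 0.80,
    ("DAF", "NAF"): 0.70,
    ("DAF", "NAF", "AGE"): 0.72,
    ("DAF", "NAF", "TOC"): 0.65,
}
ranked = ["DAF", "NAF", "AGE", "TOC"]

chosen, final_error = forward_select(ranked, lambda fs: toy_errors[tuple(fs)])
```

Because each feature is judged in the context of those already kept, a feature that is weak in isolation can still be retained when it helps in combination, as noted in the text.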

Parameter Optimization
We now have the set of features to be included in the model. Next we find the optimal value of κ for this combination of features. This value will then be used for making predictions. In Figure 4 (brown curve) we plot E(κ) as a function of κ. We find the optimal value to be κ* = 8 with E(8) = 0.63, and hence one can reduce the MAE obtained with no features by 37% by using the 8 chosen features. We also note that, although E increases with κ beyond the optimal point, the increase is very gradual, so the error remains nearly constant over a wide range of values and the approach is robust with respect to κ.
We believe that if we had performed feature selection using each policy type separately we would have obtained the same features. We therefore used these features and determined E(κ) for Third-Party policies only and also for Comprehensive policies only. These are also plotted in Figure 4. We find that the accuracy for Third-Party only samples is close to that of the case of using both Third-Party and Comprehensive samples. This is primarily due to the fact that there are 50% more Third-Party samples than Comprehensive samples. Therefore the Third-Party samples are more useful to the Comprehensive predictions than the other way around. Also note that all three cases are optimal at κ = 8.

Illustrative Examples of Predictions
We have now determined the features to be used and the optimal value of κ. In this section we consider various policy scenarios and predict the resulting claim rate to demonstrate the dependence on the features. Although we use both Comprehensive and Third-Party policies in our model, we illustrate using a Comprehensive policy and normalize claim rates with respect to the average over all Comprehensive policies. In Table 2 we provide various scenarios to illustrate that the model provides reasonable outputs. The top table starts with a low risk policy, and features are changed one at a time in ways that increase the claim rate. The bottom table starts with a high risk policy, and features are adjusted one at a time in order to lower the claim rate.
There is one outstanding case that provided unexpected results. In the lower table, when we reduce the number of not-at-fault accidents from 1 to 0 we expect a decrease in the claim rate, but instead we found that it increased. We investigated this in detail and found that the provided data has some inconsistencies. There were many cases where the numbers of not-at-fault accidents and at-fault accidents were both 0 but the driver indicated that they had made a claim within the last five years. This is of course inconsistent. It led to many claims being listed under NNF=0 when they should have been listed under NNF=1 or above. We believe this to be the reason for the result obtained. Our intent was not to make any adjustments to the given data for this paper, to avoid any appearance of data tweaking. However, in the future, we will investigate what happens when we adjust the data to make it more consistent while justifying any changes made.

Interpreting Prediction
Once a predicted claim rate is computed, this information can be used to compute a premium. The premium will take into account the operational costs of the company as well as the desired profit margin. This is another interesting area of research but is outside the scope of this paper. Once a premium is computed it is important to explain the reason for the amount (i.e., interpretability). The operational cost and profit are independent of the customer, so the only customer-dependent factor is the predicted claim rate. We can determine the influence of each feature on this claim rate and this information can be used to explain the decision made. We do this as follows. Consider any feature f and let v represent the category value of this feature for the new policy. We can use the model, with only feature f, to determine the predicted claim rate for anyone in category v. Let us denote this predicted claim rate by Ĉ(f, v). Note that this is not the same as the average claim rate over all training samples with category value v, which we previously denoted by C(f, v). Let us explain with the feature gender. If κ = 0 then Ĉ(gender, male) = Ĉ(gender, female) = c, the average claim rate over all training samples. However, as κ goes to infinity, Ĉ(gender, male) approaches C(gender, male) and Ĉ(gender, female) approaches C(gender, female). For positive values of κ, Ĉ(f, v) will lie between c and C(f, v).
The metric I_f ≡ Ĉ(f, v)/c will be used to represent the impact of feature f (for a policy with value v for the feature), where Ĉ(f, v) is the claim rate predicted using feature f alone and c is the average claim rate. For this exercise we only use Third-Party samples to better explain the approach. If I_f < 1 then the feature is causing a reduction of the claim rate; otherwise it is causing an increase in the claim rate. Note that all values are computed using κ = κ* and normalized with respect to the average claim rate for Third-Party policies.
For the policy we chose we have the following information. The driver got into an accident and made a claim 9 years ago (I_DAF > 1). They have had no at-fault accidents over the last five years (I_NAF < 1). They have had 1 not-at-fault accident over the last five years (I_NNF > 1). They have a Third-Party policy (hence I_TOC = 1). Their car was last in an accident 18 years ago (I_YCF < 1). They have been continuously insured over the last five years (I_COV < 1). This is their private vehicle (I_USE < 1). The driver is male (I_SEX < 1, but almost 1). The predicted claim rate (which was impacted by the various features) has I_CLR = 1.0. We provide this information visually in Figure 5. Hence the provider can explain to the customer the specific reasons for the premium of their policy.

Some Analytical Results
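To make the per-feature impact concrete, here is a small Python sketch that computes the ratio of the single-feature prediction to the overall average claim rate. The function name `feature_impact` and the toy data are assumptions for illustration; in the sketch the new policy is not part of the training samples.

```python
def feature_impact(values, claim_rates, v, kappa):
    """Impact of one feature for category v: single-feature prediction / overall mean."""
    overall_mean = sum(claim_rates) / len(claim_rates)
    # C(f, u): average claim rate of the training samples in each category u.
    cat_mean = {}
    for u in set(values):
        in_category = [c for x, c in zip(values, claim_rates) if x == u]
        cat_mean[u] = sum(in_category) / len(in_category)
    # With a single feature, the distance to a training sample is |C(f, v) - C(f, x)|.
    weights = [(1.0 + abs(cat_mean[v] - cat_mean[x])) ** (-kappa) for x in values]
    prediction = sum(w * c for w, c in zip(weights, claim_rates)) / sum(weights)
    return prediction / overall_mean


# Toy training data (made-up claim rates): overall mean 2.0, male mean 3.0, female mean 1.0.
genders = ["male", "male", "female", "female"]
rates = [2.0, 4.0, 1.0, 1.0]

impact_male = feature_impact(genders, rates, "male", kappa=1)
impact_female = feature_impact(genders, rates, "female", kappa=1)
```

In this toy data the impact for male drivers moves from 1 at κ = 0 towards 3.0/2.0 = 1.5 as κ grows, and the impact for female drivers moves from 1 towards 0.5, consistent with the single-feature prediction lying between the overall average and the category average.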

Expected value of Prediction
Consider the predicted claim rate ĉ for some test sample. If we assume that the training samples have an average claim rate c then we will show that the expected value of ĉ is c. This would mean that the sum of predicted claim rates approaches the sum of the actual claim rates as the number of test samples increases. This ensures that, at the end of the year, the total claims that are predicted are close to the total actual claims that were made.

Figure 1: E(κ) of two Features to demonstrate Relative Importance

Figure 3: Histogram of Claim Rate versus Age for Original (blue) and filtered (red) Cases

Figure 4: E(κ) as a function of κ for Selected Features

Figure 5: Contribution of each Feature to Prediction

Lemma 1. If the training samples have a mean claim rate of $c$ then the expected value of the prediction for a test sample is equal to $c$.

Proof. Recall that the predicted value for a given $\kappa$ is given by

$$\hat{c}(\kappa) = \frac{\sum_{s \in S} c_s \, (1 + d_s)^{-\kappa}}{\sum_{s \in S} (1 + d_s)^{-\kappa}}.$$
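One way to see why the lemma holds can be sketched as follows. The sketch assumes that the distances $d_s$ (and hence the weights) are treated as fixed quantities that do not depend on the claim rates, and that each training claim rate satisfies $\mathbb{E}[c_s] = c$; this is an idealization, since in the method the distances are themselves derived from average claim rates.

```latex
\begin{aligned}
\mathbb{E}\left[\hat{c}(\kappa)\right]
  &= \frac{\sum_{s \in S} \mathbb{E}[c_s]\,(1 + d_s)^{-\kappa}}
          {\sum_{s \in S} (1 + d_s)^{-\kappa}} \\
  &= c \cdot \frac{\sum_{s \in S} (1 + d_s)^{-\kappa}}
                  {\sum_{s \in S} (1 + d_s)^{-\kappa}}
   = c .
\end{aligned}
```

The first step uses linearity of expectation with the weights held fixed; the normalizing sum then cancels, which is exactly the role of the factor $\alpha$ in the prediction formula.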

Table 1: Policy Features used for Analysis

Algorithm 1: Pseudo-code for proposed Algorithm to predict test sample claim rate ĉ(κ)
F ≡ set of features
S ≡ set of training samples
v_f ≡ set of categories for feature f ∈ F
κ ≥ 0 tuning parameter
X_sf ∈ v_f ≡ category of feature f ∈ F of training sample s ∈ S
x_f ∈ v_f ≡ category of feature f ∈ F for test sample
c_s ≡ claim rate for sample s ∈ S
for each f ∈ F do
    for each v ∈ v_f do
        z ≡ {s ∈ S | X_sf = v}
        C(f, v) = (1/|z|) Σ_{s∈z} c_s (average claim rate over samples where feature f has value v)
for each s ∈ S do
    d_s = sqrt( Σ_{f∈F} (C(f, x_f) − C(f, X_sf))² ) (distance between test policy and sample s)
ĉ(κ) = [ Σ_{s∈S} c_s (1 + d_s)^(−κ) ] / [ Σ_{s∈S} (1 + d_s)^(−κ) ] (predicted claim rate for test policy)

Table 2: Predictions for Sample Cases starting with Low Risk Case (top) and High Risk Case (bottom)