Estimating the risk associated with any particular incident is the central goal of insurance practitioners. In its simplest version, this task comes down to being able to answer three questions:

How likely is the incident to happen?
How severe could the incident be?
How much will the consequences of the incident cost?

These are the questions that a risk prediction model addresses. Underestimating any of the answers leads to economic losses when damages must be covered. Overestimating them, on the other hand, can erode competitive advantage.

To show what goes into building a machine learning (ML) model for estimating incident risk, we discuss two examples that illustrate the importance of data: extracting more information than what exists in plain sight, and the impact of a wide variety of data. We built this model for a client project in the automotive leasing sector, but the ideas extend to other sectors and cases.

Augmenting data to find propensity towards incidents

To begin, a model that estimates the cost of risk should be based on data that contains, at the very least, a list of incidents, the date of each incident, the customers involved, and the associated cost. We can't overstress the importance of customer information. To determine the probability of a risk, one must examine what kind of customer is involved (e.g., socioeconomic characteristics and any other dimensions). For example, looking at customer history, one might think that customers who have had several incidents in the past (of the same kind or not) are more likely to keep having them in the future. This is an a priori assumption that we must validate with data. In this client project, we found it to be true. By deriving new information from the available data (augmenting), we could study the number and type of incidents each customer had in the last 3, 6, 12, and 24 months, and use that data to find characteristics that make customers more prone to specific incidents.

Be cautious, as this is a slippery slope. Data privacy is of extreme importance, and building ethical ML algorithms is the responsibility of data scientists. When customer characteristics feed into decisions, especially the cost of the services offered to them, it is necessary to take precautions to ensure the model does not discriminate.
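The look-back counts described in this section (incidents per customer over the last 3, 6, 12, and 24 months) can be sketched as follows. This is a minimal illustration, not the code used in the client project: the incident log, the incident types, and the names `months_back` and `incident_features` are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (customer_id, incident_date, incident_type, cost).
INCIDENTS = [
    ("c1", date(2023, 11, 20), "collision", 1200.0),
    ("c1", date(2023, 5, 2), "glass", 300.0),
    ("c1", date(2022, 3, 15), "collision", 900.0),
    ("c2", date(2023, 12, 1), "theft", 5000.0),
]

WINDOWS_IN_MONTHS = (3, 6, 12, 24)


def months_back(day, months):
    """Return the date `months` calendar months before `day` (day clamped to 28)."""
    total = day.year * 12 + (day.month - 1) - months
    return date(total // 12, total % 12 + 1, min(day.day, 28))


def incident_features(incidents, customer_id, as_of):
    """Count a customer's incidents, overall and per type, in each look-back window."""
    features = {}
    for months in WINDOWS_IN_MONTHS:
        start = months_back(as_of, months)
        # Only incidents strictly before `as_of` are counted, so the features
        # never include the event the model is supposed to predict.
        counts = Counter(
            kind
            for cid, when, kind, _cost in incidents
            if cid == customer_id and start <= when < as_of
        )
        features[f"n_incidents_{months}m"] = sum(counts.values())
        for kind, n in counts.items():
            features[f"n_{kind}_{months}m"] = n
    return features


features_c1 = incident_features(INCIDENTS, "c1", date(2024, 1, 1))
```

Each customer ends up with one row of counts per reference date, which can then be joined to the other customer attributes before training. Whether these counts actually predict future incidents still has to be checked against the data, as noted above.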

Distinguish different types of risks with metadata

Having reliable data is not everything. When modeling incident risk, it is crucial to recognize the different types of risk involved and build the model accordingly.

Sometimes two types of incidents are easy to tell apart, even without much information. Consider a hypothetical dataset from a home-insurance company that covers only two kinds of repairs: small repairs like clogged pipes, with a cost of around 10 (in a fictional currency), and more extensive repairs like leaking pipes and floods, with a cost of about 100 (see fig. 1).

We do not need much extra information to tell the types of risk apart when training the model, since the total cost is already a good indicator. We could write a rule saying that if the total cost is lower than 45 (for example), we face a low-cost risk; otherwise, we face a high-cost one.

The issue is more complicated in the presence of a third type of risk: medium-large repairs such as paint work, with a cost of around 80 (see fig. 2).
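To make the cost rule and its limitation concrete, here is a minimal sketch. The cut-off of 45 and the typical costs (10, 100, 80) come from the text; the individual cost values and the name `label_by_cost` are made up for illustration.

```python
LOW_COST_THRESHOLD = 45  # example cut-off from the text, in fictional currency


def label_by_cost(total_cost):
    """Label an incident from its total cost alone."""
    return "low-cost" if total_cost < LOW_COST_THRESHOLD else "high-cost"


# Hypothetical repair costs scattered around the typical values in the text.
small_repairs = [8, 10, 12]      # clogged pipes, around 10
large_repairs = [95, 100, 110]   # leaking pipes and floods, around 100
paint_repairs = [75, 80, 85]     # paint work, around 80

labels_small = {label_by_cost(cost) for cost in small_repairs}
labels_large = {label_by_cost(cost) for cost in large_repairs}
labels_paint = {label_by_cost(cost) for cost in paint_repairs}
```

With only the first two groups, the rule separates them cleanly. Once paint repairs are added, they receive the same "high-cost" label as floods, so cost alone no longer identifies the type of risk.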

In this case, a simple rule on cost will not help the model learn to distinguish the different types of risk. Additional information, like a description of the problem (paint work vs. flooding), the company making the repairs (painters vs. plumbers), or other relevant data, will help.
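As a rough sketch of how such metadata can separate the three types, the example below combines the cost cut-off with a single metadata field. The records, the `contractor` field, and the names `risk_type` and `split_by_risk_type` are assumptions made for illustration; in practice a trained model would learn this separation from the features rather than rely on a hand-written rule.

```python
from collections import defaultdict
from statistics import mean

LOW_COST_THRESHOLD = 45  # example cut-off from the text, in fictional currency

# Hypothetical records; the "contractor" field stands in for any metadata.
INCIDENTS = [
    {"cost": 10, "contractor": "plumber"},   # clogged pipe
    {"cost": 12, "contractor": "plumber"},   # clogged pipe
    {"cost": 100, "contractor": "plumber"},  # flood
    {"cost": 96, "contractor": "plumber"},   # leaking pipe
    {"cost": 80, "contractor": "painter"},   # paint work
    {"cost": 84, "contractor": "painter"},   # paint work
]


def risk_type(incident):
    """Combine the cost rule with metadata to name the risk type."""
    if incident["cost"] < LOW_COST_THRESHOLD:
        return "small repair"
    if incident["contractor"] == "painter":
        return "medium-large repair"
    return "extensive repair"


def split_by_risk_type(incidents):
    """Group incidents so that each risk type can be modeled separately."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[risk_type(incident)].append(incident)
    return dict(groups)


groups = split_by_risk_type(INCIDENTS)
mean_cost = {name: mean(row["cost"] for row in rows) for name, rows in groups.items()}
```

Neither field is enough on its own: cost cannot tell paint work from floods, and the contractor cannot tell a clogged pipe from a flood. Together they recover all three groups.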

This shows the importance of gathering every relevant piece of information available when creating an ML model.

Final thoughts

Building an incident model for our client was possible thanks to the quality of the data they had already collected and the ability to transform and augment it, and to use metadata about the incidents to discern different types of risk.

Agilytic built several incident models, one for each type of risk identified with the customer, addressing the three opening questions: how likely, how severe, and how costly are the incidents that our client faces?

The main takeaway? Extracting more information than what exists in plain sight and collecting a wide variety of data will enhance your incident modeling.

If risk prediction is a challenge for your business, we're here to help!
