Cluster Analysis

Applications in experience analysis and assumption-setting

By Marianne Purushotham

Cluster analysis is a statistical technique that has been used extensively by the marketing profession to identify like segments of a target buying population for a particular product. Cluster analysis can be used to reduce the complexity of a particular population by identifying subpopulations that naturally group together in terms of socioeconomic, psychographic and behavioral criteria.

The goal of clustering techniques in the sales and marketing setting is to identify and understand similarities (and differences) among groups of potential customers, allowing companies to develop more customized sales and marketing approaches for these different groups.

This article explores the potential for applying cluster analysis as an additional tool to support the experience analysis and assumption-setting process in actuarial work using a sample population of variable annuity (VA) contracts with guaranteed living withdrawal benefits (GLWBs).

The Cluster Analysis Methodology

There are two general categories of cluster analysis: agglomerative hierarchical methods and distance methods. Agglomerative hierarchical clustering is a process that begins by defining one cluster for each record in a particular data set or population. Clusters then are combined iteratively by defining and calculating the “distances” between existing clusters, and then successively merging those that are “closest.” This process continues until there is only one cluster left, and then an analysis is performed to select the optimal cluster structure for the particular data set.
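A minimal sketch of this agglomerative procedure, using SciPy on simulated two-field records (the ages and durations below are invented for illustration and are not study data):

```python
# Agglomerative hierarchical clustering sketch: start with one cluster per
# record, merge the closest clusters until one remains, then cut the tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy numeric records (e.g., attained age, policy duration) for 20 contracts,
# built so that two natural groupings exist.
records = np.vstack([
    rng.normal([60, 5], 1.0, size=(10, 2)),
    rng.normal([75, 12], 1.0, size=(10, 2)),
])

# Ward linkage successively merges the two "closest" clusters and records
# the full merge history (the tree).
merge_tree = linkage(records, method="ward")

# Cut the tree to recover a chosen cluster structure (here, two clusters).
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)
```

In practice the full tree would be examined to select the optimal cluster structure rather than fixing the cut at two.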

On the other hand, distance clustering starts with a “seed” for each of the maximum number of clusters as defined by the user. Each record in the data set or population is then assigned to the nearest seed (based on a calculation using a defined distance measure) to form a cluster. The original seeds then are replaced by the means of the current set of clusters, and the process iterates until the means of the current cluster arrangement no longer change. The approach used for the VA example presented in this article falls into this second category of cluster analysis methods and is the k-means Euclidean distance method. This particular approach tends to work well on larger data sets and therefore is a good method for an initial analysis.
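The distance procedure just described can be sketched in a few lines of NumPy; the data, the random seeds and the stopping rule below are illustrative only:

```python
# Bare-bones k-means (distance method) sketch: seed one center per cluster,
# assign each record to the nearest seed, replace seeds with cluster means,
# and repeat until the means stop changing.
import numpy as np

def k_means(points, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Start with one "seed" per cluster, drawn from the records themselves.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each record to the nearest seed (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace the seeds with the means of the current clusters.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # means no longer change
            break
        centers = new_centers
    return labels, centers

# Illustrative use on two well-separated simulated groups.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = k_means(pts, k=2)
```

Production implementations (e.g., scikit-learn's `KMeans`) add refinements such as smarter seeding and multiple restarts, but the iteration is the same.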

A Case Study: VA Contracts with GLWBs

Let’s explore possibilities for applying cluster analysis to support the experience analysis and assumption-setting process.

1. Serve as an independent check of the results of a predictive modeling process.

Cluster analysis can be thought of as a tool similar to factor analysis. While factor analysis identifies key variables in the data that impact a particular outcome (and therefore are the most likely candidates for a predictive modeling exercise), cluster analysis identifies key groups of cases (e.g., contractholders) at the individual record level.

For example, let’s consider a group of VA contractholders who have elected a GLWB. Suppose we have developed a predictive model for full surrenders for this group. We then could apply a cluster analysis algorithm to the same population and compare the full surrender behavior of members of the clusters identified by this independent approach to the predictive model variables identified as having strong association with full surrender activity.

2. Provide a possible method for developing actuarial assumptions for new product designs where little or no experience is available yet, but where the design shares similarities with an existing product with more available experience.

In this case, an actuary could apply a cluster analysis to a new design (Product 1) based on characteristics of the current population of contractholders to determine like groupings. These groups then could be compared to the cluster analysis results for the existing product design (Product 2), and where there are similarities in cluster characteristics, the actuary might feel more comfortable applying the experience results (modified for actuarial judgment as appropriate) for Product 2 to the Product 1 like cluster(s). For segments of the Product 1 population that don’t overlap with segments of the Product 2 population, we still would need to rely on industry-level experience where credible, as well as reasonable judgment. However, this approach might provide a more mathematically sound basis where there are segments of overlap in the underlying inforce population clusters.

Let’s go into a bit more detail on the example of the VA contractholders mentioned. LIMRA and the Society of Actuaries (SOA) have partnered for several years on annual industry experience reporting for VA contracts that have elected some form of guaranteed living benefit. The published reports include regular updates on full surrender activity, as well as utilization experience for each of the common guaranteed living benefits. Using a random sample of experience data for 2012, LIMRA developed a preliminary predictive model for full surrenders on contracts with GLWBs. A summary of the results of that effort is provided:

Data Description:

  • Experience Year 2012
  • Contracts that have elected a GLWB
  • Includes both policy and product design data as potential predictive variables

Model Selection: Several different modeling approaches were examined, including generalized linear model (GLM) forms, decision trees and survival models. For purposes of this discussion, we will refer to the results of the GLM with a logit link function and binomial distribution assumption.

Based on the final determination of this model, the key predictive variables were:

  • Utilization status
  • Policy duration
  • Market
  • Attained age of policyholder
  • Distribution channel
  • Policy size (account value, cumulative premiums paid)
  • In-the-moneyness range
  • Surrender charge level

Cluster Analysis: Application 1

Serve as an independent check of the results of a predictive modeling process.

For purposes of performing the cluster analysis on the VA population, the k-means Euclidean distance method was used, with the number of clusters k set successively from 3 up to 10. This methodology measures distances using numeric (continuous) variables only, because Euclidean distance is defined only for numeric variables. There are distance measures that accommodate both categorical and continuous variables; however, those methods are less manageable with extremely large data sets. As a result, the approach taken here is to determine clusters using key numeric variables and then describe those clusters using all variables (including categorical).
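A hypothetical sketch of that workflow in Python: cluster on standardized numeric fields for k = 3 up to 10, then profile the clusters using a categorical field as well. All column names and data below are invented for illustration and are not the actual study data:

```python
# Cluster on numeric variables only, then describe the clusters using all
# variables (including categorical). Data and field names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "account_value": rng.lognormal(11, 1, n),                             # numeric
    "policy_duration": rng.integers(1, 15, n),                            # numeric
    "attained_age": rng.integers(45, 85, n),                              # numeric
    "distribution_channel": rng.choice(["Bank", "NBD", "Ind", "Career"], n),  # categorical
})

# Standardize so no numeric variable dominates the Euclidean distance.
numeric_cols = ["account_value", "policy_duration", "attained_age"]
X = StandardScaler().fit_transform(df[numeric_cols])

# Fit k-means successively for k = 3 up to k = 10.
fits = {}
for k in range(3, 11):
    fits[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Describe one arrangement using all variables, including the categorical one
# (here, the most common distribution channel within each cluster).
df["cluster"] = fits[6].labels_
profile = df.groupby("cluster")["distribution_channel"].agg(lambda s: s.mode()[0])
print(profile)
```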


PSEUDO-F STATISTIC is the ratio of between-cluster variance to within-cluster variance.

CUBIC CLUSTERING CRITERION (CCC) is a statistic that compares the R-squared observed for a given cluster arrangement with the R-squared expected under the uniform null hypothesis.

APPROXIMATE EXPECTED OVERALL R-SQUARED is the approximate expected value of the overall R-squared under the uniform null hypothesis, assuming that the variables are uncorrelated.

The next step is to select the optimal number of clusters based on an analysis of the degree of differentiation provided by each potential cluster arrangement and the relative sizes of each. The following statistics are commonly used to measure the distinctness or significance of a particular set of population clusters.

  • CCC (cubic clustering criterion)
  • Approximate expected overall R-squared
  • Pseudo-F statistic

The objective here is to maximize the value of each statistic, while taking into consideration the additional complexity created by a larger number of clusters.
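Two of these statistics can be computed directly: the Pseudo-F statistic is the Calinski-Harabasz index, and the overall R-squared is the between-cluster share of the total sum of squares. (The CCC is a SAS-specific statistic and is not reproduced here.) The data below is simulated for illustration:

```python
# Compute the Pseudo-F statistic (Calinski-Harabasz index) and overall
# R-squared for a range of k on simulated data with three true groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, (200, 3)) for c in (0, 3, 6)])

total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    pseudo_f = calinski_harabasz_score(X, km.labels_)
    r_squared = 1.0 - km.inertia_ / total_ss  # inertia_ = within-cluster SS
    print(k, round(pseudo_f, 1), round(r_squared, 3))
```

On this toy data the Pseudo-F peaks at the true number of groups, while R-squared keeps creeping upward as k grows, which is exactly the pattern the selection analysis below exploits.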

The values of these statistics for cluster arrangements of k=3 up to k=10 for the VA population under consideration are shown in Figure 1. The trends to observe in this table are where the Pseudo-F and CCC statistics peak and where the increase in the R-squared statistic diminishes, because R-squared will always continue to increase as the number of clusters increases.

Figure 1: Cluster Analysis Statistics: Determining the Number of Clusters
Number of Clusters | Pseudo-F Statistic | CCC     | Approximate Expected Overall R-Squared
3                  | 2,572,494          |   379.1 | 0.667
4                  | 2,853,543          |   569.2 | 0.812
5                  | 2,963,005          |   616.2 | 0.857
6                  | 3,558,955          |   892.8 | 0.900
7                  | 3,562,978          |   884.1 | 0.915
8                  | 3,930,625          | 1,027.0 | 0.933
9                  | 4,058,380          | 1,068.4 | 0.943
10                 | 3,657,865          |   905.5 | 0.944

Based on the data presented in Figure 1, the biggest jump in cluster significance occurs in moving from five to six clusters, with another relatively large increase in moving from seven to eight clusters. Also note that the peak values of the Pseudo-F statistic and the CCC occur at k=9. This analysis indicates that k=8 or k=9 is a reasonable choice for the number of clusters for this population. Given that the relative increase in R-squared also appears to taper off around k=8, we begin by assuming eight clusters.
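The selection logic above can be expressed programmatically against the Figure 1 values, locating the peaks of the Pseudo-F and CCC statistics and the point where the R-squared gains taper off:

```python
# Figure 1 values keyed by number of clusters: (Pseudo-F, CCC, R-squared).
stats = {
    3: (2_572_494, 379.1, 0.667), 4: (2_853_543, 569.2, 0.812),
    5: (2_963_005, 616.2, 0.857), 6: (3_558_955, 892.8, 0.900),
    7: (3_562_978, 884.1, 0.915), 8: (3_930_625, 1027.0, 0.933),
    9: (4_058_380, 1068.4, 0.943), 10: (3_657_865, 905.5, 0.944),
}

peak_f = max(stats, key=lambda k: stats[k][0])    # k with highest Pseudo-F
peak_ccc = max(stats, key=lambda k: stats[k][1])  # k with highest CCC

# Successive gains in R-squared; small gains mean extra clusters add little.
r2_gain = {k: stats[k][2] - stats[k - 1][2] for k in list(stats)[1:]}
print(peak_f, peak_ccc, r2_gain)
```

Both peaks land at k=9, and the R-squared gain beyond k=9 is essentially zero, consistent with the k=8/k=9 shortlist above.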

In making the final decision, though, we also should examine cluster size. Figure 2 shows the relative sizes of the clusters under the eight-cluster arrangement.

Figure 2: Cluster Size
Cluster     | Cluster Size
1           |      12,341
2           |      51,599
3           |         263
4           |     590,622
5           |     271,722
6           |     890,721
7           |     132,500
8           |       2,072
Grand total |   1,951,840

Based on the relative sizes of clusters 3 and 8, it was determined that a six-cluster structure combining clusters 3 and 8 into their closest “neighbor groups” would provide a reasonable trade-off between cluster significance and complexity.
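One way to implement this merging step, sketched on simulated data: members of any undersized cluster are reassigned to the nearest surviving centroid. The size threshold and data here are illustrative assumptions, not the study's:

```python
# Fold very small clusters into their closest "neighbor groups" by
# reassigning their members to the nearest surviving centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(0, 1, (400, 2)),
    rng.normal(8, 1, (400, 2)),
    rng.normal([4, 30], 0.1, (5, 2)),  # a tiny outlying group
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels, centers = km.labels_.copy(), km.cluster_centers_

min_size = 50  # clusters smaller than this get merged away (assumed threshold)
sizes = np.bincount(labels, minlength=3)
small = np.flatnonzero(sizes < min_size)
keep = np.flatnonzero(sizes >= min_size)
for c in small:
    members = labels == c
    # Distance from each member to each surviving centroid.
    d = np.linalg.norm(X[members][:, None, :] - centers[keep][None, :, :], axis=2)
    labels[members] = keep[d.argmin(axis=1)]  # nearest neighbor group wins
```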

Based on the final six-cluster structure, each of the six groups is then examined in detail to identify its defining characteristics based on all descriptive data available, including categorical variables. Figure 3 displays the key characteristics of each cluster based on an analysis of the cluster population by key data factors.

Figure 3: Cluster Descriptions: Distinguishing Characteristics of Each Cluster
Cluster | Account Value Size | Age of Policy (Duration) | Benefit Utilization Status | Surrender Charge Level | Gender | Distribution Channel | In-the-Moneyness | Age of Policyholder
1 | Largest | | Greater utilization | | More male | Bank/NBD | Greater ITM (150%+) |
2 | | Older | Greater utilization | | More male | Ind/Bank/NBD | Greater ITM (150%+) |
3 | | | | No SC | | Career/Ind | Lower ITM (less than 100%) |
4 | | | Greater utilization | Higher SC | | Bank/NBD | Lower ITM (less than 100%) | Under 60
5 | Smallest | | Low utilization/low withdrawals | Higher SC | More female | NBD | Lower ITM (less than 100%) | Under 60
6 | | Older | Greater utilization | No SC | More male | NBD | Greater ITM (150%+) |

Predictive model results: Primary predictors for full surrender

Note that seven of the eight key factors identified as highly predictive for full surrender behavior as part of the predictive model development exercise are also key factors in distinguishing the six clusters of the population identified by cluster analysis techniques.

This result provides an additional degree of comfort in the preliminary predictive model developed, because the results of the independent cluster analysis process appear to be consistent with the results of the predictive model selected.

Cluster Analysis: Application 2

Provide a possible method for developing actuarial assumptions for new product designs where little or no experience is available yet, but where the design shares similarities with an existing product with more available experience.

Now let’s consider a product like fixed-indexed annuities (FIA) with guaranteed lifetime income benefits (GLIBs). These are newer product offerings than the guaranteed living benefits choices on VA products, and therefore there is little or no credible experience available for assumption-setting purposes at this time.

However, given the similarity in design between the FIA product GLIB and the VA GLWB, there may be a way to extend the results of our cluster analysis work to enhance the actuarial judgment considerations in setting and updating assumptions for these newer FIA plans.

Let’s consider the following possible approach:

  1. Gather data on inforce FIA contracts with GLIBs, including as many of the data fields as possible that were identified as critical for the VA contracts from predictive modeling and cluster analysis work (i.e., account value size, policy duration, benefit utilization status, surrender charge level, etc.).
  2. Perform an independent cluster analysis on the FIA population to determine which (if any) clusters identified as significant may overlap with those identified for VAs.
  3. Where there are overlaps, use the surrender results for the VA cluster as a starting point for the associated FIA cluster, and make any adjustments that may be needed based on actuarial knowledge of the underlying target market, product design or intended use of the policy by customers.
  4. Monitor these clusters over time and make further adjustments to assumptions as experience develops. Eventually, emerging experience will become credible and allow for more detailed analysis of the FIA population itself.
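Steps 2 and 3 above might be sketched as follows. All data here is simulated, and the field count, cluster counts and nearest-centroid matching rule are purely illustrative assumptions:

```python
# Cluster the FIA population on the same standardized fields used for the VA
# analysis, then link each FIA cluster to the nearest VA cluster centroid as
# the candidate "overlap" cluster for borrowing surrender experience.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
va = rng.normal(0, 1, (1000, 3))   # stand-ins for VA contract fields
fia = rng.normal(0, 1, (300, 3))   # stand-ins for FIA contract fields

scaler = StandardScaler().fit(va)  # scale both populations on the VA fit
va_km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(scaler.transform(va))
fia_km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaler.transform(fia))

# For each FIA centroid, find the closest VA centroid.
d = np.linalg.norm(fia_km.cluster_centers_[:, None, :]
                   - va_km.cluster_centers_[None, :, :], axis=2)
nearest_va = d.argmin(axis=1)
print(nearest_va)  # VA cluster matched to each FIA cluster
```

In practice a distance cutoff would be needed so that FIA clusters far from every VA centroid are treated as having no overlap and handled with the more subjective approaches noted below.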

Note that more subjective approaches will need to be applied for clusters that emerge for the FIA population that do not have an associated cluster in the VA population. However, for at least a portion of inforce business, this process suggests a more mathematically rigorous method for justifying current assumption bases for products with minimal historical experience available.


Data analytics and statistical modeling techniques have been critical in the credit and property and casualty markets of the financial services industry for many years. However, to date, these tools have not been applied as extensively in other areas of the industry, most notably the life insurance and retirement sectors.

As the financial services industry continues to refine and improve products to better serve customers, these techniques should be included in our technical toolbox as potential methods to analyze and forecast how products will perform.

Marianne Purushotham, FSA, MAAA, is corporate vice president of the Statistical Analysis and Modeling Group at LIMRA.