Learning to Untangle

Training actuaries to solve problems with data

By Brian Hartman

As a professor, I spend a lot of time thinking about how to best prepare students to make an impact in their careers, families and communities. While there are many skills essential to building a solid career in actuarial science (e.g., communication, leadership, diligence, empathy), for this article I will focus on the skills that better enable students to solve problems with data. Data is everywhere and growing rapidly. It is a large part of an actuary’s job currently, and it will only become more important in the future.

Seeing this trend, we at Brigham Young University have worked to organize our curriculum to prepare our students to use data to solve problems (as have my colleagues at many other universities). For example, our data science class spans two semesters. In the first semester, the students become familiar with the software (R, Python, Spark, etc.) and machine learning methodology (trees, neural nets, support vector machines, etc.) used to analyze large data sets. In the second semester, companies give seminars on compelling problems, and the students work in groups on those problems using real data.

In addition to the coursework, we work with companies to solve large problems together. We have partnered with companies in property and casualty, health, life, and long-term care insurance. This arrangement benefits the companies, students and faculty alike. Employers get access to cutting-edge methodology, academic experts and bright students. Our students gain experience solving actual problems and dealing with real and messy data. As faculty, we are able to keep our research and teaching connected to current industry practice.

Overall, we take a holistic approach and build the students’ skill sets in four major areas essential to solving these problems effectively. First, they need to understand the data and the business problem. Second, they need to understand enough statistical methodology to know which tools to use in a particular situation. Third, they need to be able to implement those methods efficiently. Finally, they must be able to communicate their results effectively.

Understanding the Data and the Business Problem

The first step is understanding the data and the business problem. Statistician John Tukey once said, “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” Without truly understanding the business problem and the data, you may gather a poor data set whose results cannot be trusted, or you may build an incredible model that doesn’t answer the main question (or potentially any others).

While it is true that statistical methods are very transferable and “data is data,” the results are not valuable without the proper context. Many of our courses include a culminating project in which the students need to find and validate data to solve a problem. For the final project in our first regression course, they need to solve a problem of their choice by finding applicable data and using many of the methods they learned in class to analyze it. Before starting the project, the students often assume the analysis will take the most time. While a solid analysis does take a good deal of time, finding dependable and applicable data often takes far longer.

Understanding Statistical Methodology

Second, students need to understand the methodology used to perform the analysis. We try to balance our time between exposing them to a large number of statistical techniques and helping them gain a strong statistical foundation, which makes it much easier for them to learn new techniques when necessary. More important than knowing how to apply any particular method, we help them to understand the assumptions, strengths and shortcomings of various methods so they can properly apply what they know and look for a better method when they need one.

For example, the error term in standard linear regression is assumed to be normally distributed, meaning that the predicted values (and the associated uncertainty in those predictions) are continuous and symmetric, and can include both positive and negative numbers. In some applications, like modeling the total claim cost for a given personal auto policy next year, this normality assumption may be questionable. Total claim cost is likely right-skewed and non-negative, has a positive probability of being exactly zero, and could have a very heavy tail. Understanding the assumptions and shortcomings of the model allows students to look for improvements or alternatives. Changing the error distribution, say through a generalized linear model (GLM), can add skew and require the costs to be positive. Certain error distributions can account for the heavy tail. You can incorporate a two-part model to account for the probability of no claims.
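To make the contrast concrete, here is a minimal sketch in Python using only the standard library. The portfolio is simulated, and the claim probability and gamma parameters are illustrative assumptions, not figures from any real book of business. It shows how much probability a normal fit places on impossible negative costs, and how a two-part model separates the chance of any claim from the cost given a claim.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical portfolio: about 90% of policies have no claim; the rest
# have gamma-distributed (right-skewed, strictly positive) costs.
P_CLAIM, SHAPE, SCALE = 0.10, 2.0, 1500.0  # assumed, illustrative values
costs = [
    random.gammavariate(SHAPE, SCALE) if random.random() < P_CLAIM else 0.0
    for _ in range(20_000)
]

# Normal-error view: one symmetric distribution for every policy.
normal_fit = NormalDist(mean(costs), stdev(costs))
prob_negative = normal_fit.cdf(0.0)  # mass placed on impossible negative costs

# Two-part view: P(any claim) times the mean cost given a claim.
positive = [c for c in costs if c > 0]
p_hat = len(positive) / len(costs)
severity_hat = mean(positive)
expected_cost = p_hat * severity_hat

print(f"share of policies with zero cost: {1 - p_hat:.1%}")
print(f"normal fit puts {prob_negative:.1%} of its mass below zero")
print(f"two-part expected cost per policy: {expected_cost:,.0f}")
```

In practice each part would be a regression on policy characteristics (for instance, a logistic model for claim occurrence and a gamma GLM for severity); the constants here stand in for those fitted models.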

As another example, consider the case when a data set is very large, though very sparse (meaning most of the data is missing). While linear regression (or a standard GLM) will not work well in such cases, shrinkage models (e.g., LASSO or elastic net) may solve your problem better. While it may not be reasonable to teach students all of the current methods, it is nonetheless important for them to understand the weaknesses and limitations of the methods they learn. Doing so will allow them to know when they should try to improve upon the methods they understand and when they should seek alternative methods.
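As a rough illustration of what shrinkage does, the sketch below implements the LASSO by coordinate descent in plain Python. The data are simulated, and the dimensions, true coefficients and penalty value are all assumptions chosen for illustration. It does not address missing values; it only shows how the L1 penalty sets the coefficients of uninformative columns to exactly zero.

```python
import random

random.seed(0)

N_ROWS, N_COLS = 200, 20
TRUE_BETA = [3.0, -2.0, 1.5] + [0.0] * (N_COLS - 3)  # only 3 real effects

X = [[random.gauss(0, 1) for _ in range(N_COLS)] for _ in range(N_ROWS)]
y = [sum(b * x for b, x in zip(TRUE_BETA, row)) + random.gauss(0, 1) for row in X]


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(X, y, penalty, sweeps=100):
    """Minimize (1/2n)*||y - X b||^2 + penalty*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of column j with the residual, adding back its own fit.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            new_bj = soft_threshold(rho, penalty) / scale
            # Keep the residual in sync with the updated coefficient.
            delta = new_bj - beta[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            beta[j] = new_bj
    return beta


beta_hat = lasso(X, y, penalty=0.2)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("columns kept by the lasso:", selected)
```

A larger penalty drops more columns and shrinks the survivors further; in practice the penalty is usually chosen by cross-validation, and an established library would replace this hand-written loop.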

When one of my students and I worked with a major health insurer, we examined one of those assumptions and came to an interesting conclusion. We were modeling claim severity for a large portion of its business (9 million policyholders, 32 million claims). The insurer was interested in better severity models to price a new product line and wanted us to do a better job fitting a gamma model to its data. Beyond that request, we also checked how well the gamma distribution itself fit the data. It turned out there were many distributions, some commonly implemented in software, that greatly outperformed the gamma distribution in terms of model fit. By challenging commonly held assumptions (not only in the company but throughout the industry), we were able to better understand the future claims costs and help the company model and manage its risk.
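A check of this kind can be sketched in a few lines of Python. The claims below are simulated, not the insurer's data, and the lognormal is only an assumed stand-in for the alternatives, which are not named here. The sketch fits both distributions by maximum likelihood and compares them with AIC.

```python
import math
import random

random.seed(1)

# Simulated stand-in for claim severities (lognormal by construction).
claims = [random.lognormvariate(7.0, 1.2) for _ in range(5_000)]
n = len(claims)
logs = [math.log(c) for c in claims]
mean_claim = sum(claims) / n
mean_log = sum(logs) / n


def gamma_profile_loglik(shape):
    """Gamma log-likelihood at a given shape, with scale at its MLE mean/shape."""
    scale = mean_claim / shape
    return n * ((shape - 1) * mean_log - shape
                - shape * math.log(scale) - math.lgamma(shape))


def maximize(f, lo, hi, steps=200):
    """Ternary search for the maximum of a single-peaked function on [lo, hi]."""
    for _ in range(steps):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


shape_hat = maximize(gamma_profile_loglik, 0.01, 100.0)
gamma_ll = gamma_profile_loglik(shape_hat)

# Lognormal MLE: mean and (population) standard deviation of the log claims.
sigma_hat = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / n)
lognormal_ll = -n * (mean_log + math.log(sigma_hat)
                     + 0.5 * math.log(2 * math.pi) + 0.5)

# Both models have two parameters, so AIC = 2*2 - 2*loglik; lower is better.
aic = {"gamma": 4 - 2 * gamma_ll, "lognormal": 4 - 2 * lognormal_ll}
print(aic)
```

Because the simulated data are lognormal, the lognormal wins here by design; on real claims the same comparison would be run across several candidate distributions and backed up with diagnostic plots of the tail.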

Implementing the Methodology

Third, students need to be able to implement the models efficiently, especially with rapidly growing data sets to analyze. Understanding the models to fit and the assumptions to challenge will not help you actually solve the problem unless you can implement them in a reasonable amount of time. In the health insurance example, the first method we tried on a small subset of the data was rather slow computationally. It would have taken four weeks to run on the entire data set. We developed a new method, based on random forests, that took only a few minutes to run on the entire data set.

Communicating Results

Finally, students need to be able to communicate their results. The best models and results cannot bring an appreciable change to the business without proper communication. They need to be able to tailor and present information to both technical and nontechnical audiences, to audiences both inside of their departments (with a solid understanding of the business) and outside. To help our students improve their communication skills, many of our classes require a written project and oral presentation. While they are graded on their application of statistical methodology, a large portion of the grade is determined by their communication, both oral and written. Students also are required to take a course in business communication, during which they learn how to convey technical information to many different audiences.

Most of our projects with companies involve our students. At the University of Connecticut, a group of students was involved in a series of projects with a major long-term care insurer. The students met with the client to define the business problem, worked together to analyze the data and then led the presentations of the work to the client. Not only was that experience valuable for the students, helping them to learn and develop all the skills mentioned so far in this article, but the company was able to see potential in the students and made full-time job offers to some of them.

Developing Well-Rounded Actuaries

Additionally, it is very important for students to be well-rounded and curious. We are in a very competitive market, and those who do not work hard to stay current can be left behind. Even if we had the time to teach students most of the cutting-edge models currently in use, we wouldn’t be able to predict the problems they will need to solve in 10 years, let alone 20, when they will be running their departments.

The International Actuarial Association (IAA) has updated its curriculum to include more analytics. In the United States, both the Casualty Actuarial Society (CAS) and the Society of Actuaries (SOA) see the value in students learning and being tested on their data analytics skills. The SOA has a module in its fellowship requirements entitled “Applications of Statistical Techniques.” The CAS recently added Exam S to its associateship curriculum, which covers basic applied statistics (e.g., inference, estimators, goodness-of-fit, GLMs and time series). The CAS also is considering adding one or two advanced statistics exams (at one time tentatively called S2 and S3) to the fellowship requirements. What’s more, the CAS plans to launch an additional credential in predictive analytics and data science. There likely will be further innovations in the future.

For my students, the actuarial exams are important and constantly on their minds. They know they need to pass the exams in order to land internships and full-time jobs, but it is often hard for them to see that the exams actually occupy only a small part of their careers. Top students today graduate in their early 20s, having passed two to four exams. Many will have their fellowships in five to 10 years. Assuming a normal retirement age (actuarial salaries allow for much earlier retirement, but that is another subject), they have “only” 30–35 years left in their careers without any exams. Plus, the majority of the first five to 10 years will be spent working on projects only tangentially related to the exams, requiring a good bit of additional study and work to be successful. Without internally motivated curiosity, they will be unable to perform at the highest possible level. With resources like Coursera and edX, anyone with the desire and diligence can improve their skill set. It has never been easier to acquire new skills: As of April 2016, a search for “data” in the Coursera course catalog returns 294 courses and specializations at varying levels of expertise.

By building our students’ skill sets to understand the business problem, properly choose and implement the methodology, and communicate the results, we are working to prepare them not only for the industry of today, but also for the industry of the future. Actuaries with strong data skills can make an impact not only in the insurance industry but also in many others, such as health care, finance and marketing. It is a great time to be an actuary.

Brian Hartman, ASA, Ph.D., is the actuarial program director and an assistant professor in the department of statistics at Brigham Young University.

Special thanks to Chris Groendyke, FSA, Ph.D., assistant professor at Robert Morris University, for his helpful comments and review.