
Jeff Yang, FSA · 19 min read

In this article we examine AI's ability to answer actuarial exam questions, using OpenAI's recently released API for its gpt-3.5-turbo model. At the time of writing, gpt-3.5-turbo is the model that powers ChatGPT, a popular chatbot interface.

We utilize publicly available exam questions and solutions to:

  1. generate alternative solutions to existing problems, and
  2. create new problems from existing questions.
note

We utilize AI Actuarial Assistant, a recently built user interface that integrates gpt-3.5-turbo with The Actuarial Nexus's database of actuarial exam questions.

Can ChatGPT Pass an Actuarial Exam?

Introduction

Actuaries must pass a series of professional exams to become certified. In total, these exams require thousands of hours of studying and take candidates many years to complete. Each exam has a pass rate of roughly 50%, so it is not uncommon for candidates to fail one, two, or even five exams along the way.

There has been ample press coverage touting ChatGPT's ability to pass professional exams in medicine, law, business, and other better-known fields.1,2 Similar articles have also been published about its lackluster ability to solve simple math problems.1

Actuarial exams are notoriously difficult for the level of mathematical rigor involved. Since OpenAI's models are large language models (LLMs) centered around text completion, it may be reasonable to assume that actuarial exam questions, which often require a certain level of quantitative reasoning and problem-solving ability, are not well suited to ChatGPT's more linguistic approach.

In this article, we test gpt-3.5-turbo's ability to answer actuarial exam questions, and apply prompting techniques to improve gpt-3.5-turbo's reliability.

Selecting a Model

ChatGPT is a popular chatbot interface built on top of a family of large language models known as GPT-3. Both ChatGPT and GPT-3 are developed by OpenAI, an artificial intelligence research company based in San Francisco.

On March 1, 2023, OpenAI released the API for its latest version of the model, gpt-3.5-turbo. This is the same model that powers ChatGPT. Before gpt-3.5-turbo, the most capable model was text-davinci-003. OpenAI's documentation states that gpt-3.5-turbo does equally well on text completions when compared with the Davinci model family, and it does so at 10% of the cost to developers.

The scenarios presented in this article are generated using gpt-3.5-turbo. In the future, we may consider using a different model.

Comparing gpt-3.5-turbo and text-davinci-003

At first glance, in terms of cost and performance, gpt-3.5-turbo seems like the ideal model to use going forward. Unfortunately, gpt-3.5-turbo currently does not allow for fine-tuning, which means the model has no memory of any information it was previously prompted with. To keep a conversation going, the entire chat history must be included in each successive prompt. Since the OpenAI API charges based on token usage, this can increase the cost significantly compared with a fine-tuned model. Furthermore, the combined prompt input and output are limited to 4,096 tokens. Assuming the average question & solution pair consumes 250 tokens, the model can retain ~12 Q&A pairs at any one time, allotting ~1,000 tokens for the response output and prompt guidance.
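To make the arithmetic concrete, here is a minimal sketch of the context-window budget described above, assuming the 250-token average stated in the text; the tiktoken package (which provides gpt-3.5-turbo's tokenizer) can be used to measure an actual question & solution pair instead of assuming a figure.

```python
# Back-of-envelope context-window arithmetic; the token figures are the
# assumptions stated in the text, not measurements.
import tiktoken

CONTEXT_LIMIT = 4096       # combined prompt + completion limit for gpt-3.5-turbo
RESERVED = 1000            # rough allowance for response output and prompt guidance
TOKENS_PER_QA_PAIR = 250   # assumed average size of one question & solution pair

pairs_retained = (CONTEXT_LIMIT - RESERVED) // TOKENS_PER_QA_PAIR
print(f"~{pairs_retained} Q&A pairs fit in the context window")  # ~12

# Measuring an actual pair rather than assuming 250 tokens:
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
sample_pair = "Q: ...\nA: ..."  # placeholder for a real question & solution
print(len(enc.encode(sample_pair)), "tokens")
```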

On a per-token basis, text-davinci-003 is the more costly model by a factor of 10x. However, it allows for fine-tuning, which means that models can retain information based on past questions and solutions. If text-davinci-003 is used, we won't have to pass in the history of training data with each new prompt. A downside of text-davinci-003 is that it is significantly worse at solving math questions out of the box.1

For questions that just need a little bit of guidance, gpt-3.5-turbo could be the way to go. For other questions that require a more refined model, text-davinci-003 could be the solution in the long run.

Sourcing Questions

We perform our analyses using questions from Exam P (Probability), Exam FM (Financial Mathematics), Exam SRM (Statistics for Risk Modeling), and Exam FAM (Fundamentals of Actuarial Mathematics).

To perform this study, we sample from publicly available sample questions released by the Society of Actuaries. These sample questions are a good proxy for the questions on the actual exam, both in terms of difficulty and scope.

In its raw form, the SOA sample questions are only available in PDF format. To limit the amount of legwork needed to prepare the data, we query from The Actuarial Nexus's organized database of questions, which includes the SOA sample questions and solutions reformatted in Markdown and LaTeX. At the time of writing, this includes ~800 questions across Exam P, Exam FM, Exam SRM, and Exam FAM.

Methodology

We perform a series of scenario tests and document the methodology and results for each scenario.

Increasing reliability

For each scenario, we also apply the following adjustments to increase reliability:

  1. Per OpenAI's Techniques to Improve Reliability guide, we append to each prompt the sentence "Let's think step by step," a technique known as zero-shot chain-of-thought (zero-shot-CoT) prompting. Adding this one sentence yields significant improvements in the accuracy of the results, particularly for math problems.

  2. Since we are comparing results from different scenarios and value accuracy over creativity, we set the temperature to 0 to reduce the level of randomness in the responses. Unfortunately, even with a setting of 0, "a small amount of variability may remain." In a more formal study, we would increase the sample size to further improve reliability, but this setting suffices for demonstration purposes. Both of these adjustments are shown in the sketch following this list.

  3. Aside from the base case scenario, we also prompt the model with either a full solution write-up or the correct answer. This significantly increases the chance that the output arrives at the correct answer. Because the AI then has a high chance of yielding the correct numeric answer, it is even more essential that the human prompting the model carefully review the explanation for logical consistency.
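To illustrate adjustments 1 and 2, here is a minimal sketch using the openai Python package as it existed at the time of writing (the pre-1.0 ChatCompletion interface). The system message and placeholder question are illustrative assumptions, not the exact prompts used by AI Actuarial Assistant.

```python
# Minimal sketch: zero-shot-CoT prompting at temperature 0 with the
# openai package's pre-1.0 interface (current at the time of writing).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

question = "..."  # an exam question pulled from the database

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # The system message is an assumption for illustration.
        {"role": "system", "content": "You are an actuarial exam tutor."},
        # Appending this one sentence is the zero-shot-CoT technique.
        {"role": "user", "content": question + "\n\nLet's think step by step."},
    ],
    temperature=0,  # minimize randomness; a small amount of variability may remain
)
print(response["choices"][0]["message"]["content"])
```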

Scenarios

Base Case

info

Model: gpt-3.5-turbo
Prompt: Question Only
Additional Prompting: None

This scenario serves as our base case for gpt-3.5-turbo. We prompt the model with the sample question alone and compare the model's result to the official SOA solution; no solution or answer is provided in this scenario. (In Scenario 1, the prompt additionally includes the correct answer; in Scenario 2, it includes the full official solution; in Scenario 3, the model is asked to generate a new question from an existing one.)

Since the model doesn't know the answer beforehand, we expect the output to be the least reliable compared to Scenario 1 and Scenario 2. However, out of the three cases, the prompting is the least restrictive, so the output may include creative solutions.
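The exact prompts are not reproduced in this article; the templates below are a hypothetical reconstruction, based solely on the scenario descriptions above, of how the four prompts differ.

```python
# Hypothetical prompt templates for the four scenarios, reconstructed from
# the descriptions in this article (not the exact prompts used).
def build_prompt(question: str, answer: str, solution: str, scenario: str) -> str:
    cot = "\n\nLet's think step by step."
    if scenario == "base":        # question only
        return question + cot
    if scenario == "scenario_1":  # question plus the correct answer
        return f"{question}\n\nThe correct answer is {answer}.{cot}"
    if scenario == "scenario_2":  # question plus the full official solution
        return f"{question}\n\nOfficial solution:\n{solution}{cot}"
    if scenario == "scenario_3":  # generate a new question from an existing one
        return ("Write a new exam question inspired by the following, with a full "
                "solution, (A)-(E) answer choices, a title, and keywords:\n\n"
                + question)
    raise ValueError(f"unknown scenario: {scenario}")
```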

Model

All results presented in this article were generated using AI Actuarial Assistant.

AI Actuarial Assistant integrates gpt-3.5-turbo's capabilities with The Actuarial Nexus's database of existing actuarial exam questions (~800 questions at the time of writing). By automating all the back-end logistics and allowing users to save output to a central database, AI Actuarial Assistant is prepped to store and process a large network of user-prompted solutions and questions.

AI Actuarial Assistant breaks down a significant barrier in using AI to answer actuarial exam questions. Specifically, any user can generate a response from an existing question with the click of a few buttons (no text input required). From there, the user can easily post the solution to the integrated forum with the click of a button, and then edit the post if needed. Other users can then upvote or downvote the resulting output.

In the near future, through a process known as Self-taught Reasoner (STaR), higher-quality output (posts) can be used to train future models. The output would then be integrated into AI Actuarial Assistant to (1) create a sustainable ecosystem for generating new questions, and (2) increase the reliability of future model responses.
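As a rough sketch of what that selection step could look like, the function below filters community-voted posts into a JSONL file of prompt/completion pairs, the format OpenAI's fine-tuning endpoint accepted at the time of writing. The post fields and vote threshold are hypothetical, not AI Actuarial Assistant's actual schema.

```python
# Sketch of the STaR-style selection step: keep only well-received posts
# as candidate fine-tuning data. Fields and threshold are hypothetical.
import json

def build_training_file(posts: list[dict], path: str, min_score: int = 5) -> int:
    kept = 0
    with open(path, "w") as f:
        for post in posts:
            score = post["upvotes"] - post["downvotes"]
            if score >= min_score:  # community voting acts as the quality filter
                f.write(json.dumps({"prompt": post["question"],
                                    "completion": post["solution"]}) + "\n")
                kept += 1
    return kept
```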

note

gpt-3.5-turbo includes a level of randomness in the output, even when the temperature is set to 0, so there is no guarantee that the exact results presented in this article can be replicated.

Examples

Below are several examples documenting sample output from AI Actuarial Assistant. All relevant links are included, including links to the original question and the output from each scenario. The observations are solely based on my own judgment. Any reader who has a strong opinion is welcome to reply to the corresponding forum post.

Exam P - SOA Practice Question #1
  • Original Question
  • Subject(s): General Probability, Set Theory
  • Difficulty: Easy
  • Observations
    • Base Case - The model is able to calculate the correct solution on its own. Unfortunately, gpt-3.5-turbo decided to get a bit too fancy and included an unnecessary link to a non-existent image.
    • Scenario 1 - Prompting the engine with the correct solution produced basically the same logic as in the Base Case. This is a reasonable outcome given the engine was able to derive the correct solution on its own in the Base Case. Unfortunately, it decided to include a link to the same non-existent image again.
    • Scenario 2 - The provided solution is similar to the explanation in the Base Case, Scenario 1 and the official SOA solution. This time, the model did not attempt to include an accompanying image.
    • Scenario 3 - The model performed as intended and generated an error-free question, solution, (A)-(E) answer choices, title, and associated keywords.
  • Overall
    • The model performs quite well on this problem, and arrives at the correct solution in all cases.
Exam P - SOA Practice Question #51
  • Original Question
  • Subject(s): Deductibles, Continuous Random Variables, Expected Value
  • Difficulty: Hard
  • Observations
    • Base Case - The model does a fairly good job setting up the problem, but fails to correctly evaluate an integral, and ultimately arrives at the wrong answer. The arithmetic is also incorrect.
    • Scenario 1 - The model struggles with calculating the expected payment for partial damage. The expected value for total loss is also incorrectly calculated.
    • Scenario 2 - The model does a good job connecting the types of loss with the corresponding mathematical representations, and arrives at the correct final answer. However, it makes math errors when calculating the partial damage expected payment.
    • Scenario 3 - The model does a good job creating a new question, solution, (A)-(E) answer choices, a title and relevant tags. The question itself is highly relevant to the original question. In the solution, it again appears to incorrectly calculate the partial damage expected payment by not factoring in the deductible. The arithmetic is also incorrect.
  • Overall
    • The model does a relatively good job understanding the problem and setting up the solution. It struggles with more complicated integrals and simplifying equations into numbers.
Exam P - SOA Practice Question #308
  • Original Question
  • Subject(s): Univariate Random Variables, Variance
  • Difficulty: Medium
  • Observations
    • Base Case - The model failed to understand that the probabilities provided in the table are cumulative. The rest of the logic to calculate the standard deviation seems correct. It again makes an error in applying simple arithmetic to calculate the variance. There are also issues rendering LaTeX.
    • Scenario 1 - Scenario 1 is fraught with the same shortcomings as in the Base Case. In providing the correct answer as part of the prompt, the model forces its solution to match the correct answer by erroneously equating its solution to the correct solution.
    • Scenario 2 - Again, the model misses the fact that the probabilities in the table are cumulative, so the first sentence in the output is incorrect. The rest of the output looks okay.
    • Scenario 3 - The model was able to generate a very relevant question. It did not provide (A)-(E) answer choices, and the probabilities in the solution don't match the probabilities in the question.
  • Overall
    • The main stumbling block in this problem is that the AI was not able to recognize that the given probabilities were cumulative. There were also some simple arithmetic errors. Both of these issues seem like minor hiccups that could be addressed in future models.
Exam FM - SOA Practice Question #1
  • Original Question
  • Subject(s): Time Value of Money, Force of Interest, Rate of Interest
  • Difficulty: Medium
  • Observations
    • Base Case - The question asks us to calculate the continuous force of interest, given the semiannual rate of interest. In this case, the number of years (7.25 years) is not necessary to solve this problem. Understandably, the model's solution attempts to factor in the 7.25 years, but makes an arithmetic error in evaluating an exponent (it calculates 1.02^14.5 = 157.10). Aside from arithmetic errors, the rest of the logic seems correct.
    • Scenario 1 - Again, there is an error in evaluating the same exponent (it calculates 1.02^14.5 = 153.96), which is closer, but still not close enough. The solution is also overly complicated, and the logic seems to break down towards the end.
    • Scenario 2 - The model doesn't quite interpret the problem correctly. It tries to correlate the annual force of interest with the value of the account in 7.25 years, even though the 7.25 years is irrelevant to the problem. This type of problem seems simple for a human but potentially difficult for AI without additional prompting. Nonetheless, the model provides an acceptable solution, since it was prompted with the original solution.
    • Scenario 3 - The model does a good job creating a usable question, solution, (A)-(E) answer choices, a title, and relevant tags.
  • Overall
    • This problem is tricky for GPT-3.5 because the question itself includes seemingly relevant information that is, in fact, not necessary to solve the problem (i.e., a red herring). The model also makes careless errors when evaluating equations.
Exam SRM - SOA Practice Question #1
  • Original Question
  • Subject(s): Clustering Algorithms, Unsupervised Learning Techniques
  • Difficulty: Easy
  • Observations
    • Base Case - The model does a good job setting up the problem and creating a nice looking table. Unfortunately, it doesn't show its work in calculating the numbers in the table, and it also calculates the numbers incorrectly. The error seems to stem from faulty arithmetic.
    • Scenario 1 - The Scenario 1 solution is similar to the Base Case solution, with some nuanced additions. In providing the correct answer as part of the prompt, the model forces its solution to match the correct answer by erroneously equating its solution to the correct solution.
    • Scenario 2 - The model regurgitates the original solution, so not much value is added here.
    • Scenario 3 - The model does a good job creating a new question, solution, (A)-(E) answer choices, a title and relevant tags. Unfortunately, there are some minor mistakes in the solution.
  • Overall
    • The model does a relatively good job understanding the problem and setting up the solution. It struggles with arithmetic.
Exam FAM-S - SOA Practice Question #1
  • Original Question
  • Subject(s): Credibility, Poisson Distribution, Pareto Distribution
  • Difficulty: Hard
  • Observations
    • Base Case - The output includes a lot of background information (formulas, notation, etc.) that might be unnecessary for students who prefer a more succinct response. For others who prefer a more comprehensive explanation, the background information could be helpful. Unfortunately, the model is not able to piece the fundamentals together to form a correct solution.
    • Scenario 1 - The approach is similar to the Base Case approach. The model regurgitates fundamental concepts, but does not apply them correctly.
    • Scenario 2 - The output is able to provide helpful context to the SOA solution. It is able to correctly identify how different pieces of an equation fit together.
    • Scenario 3 - The generated question and solution are similar to the original question and solution. Again, the model appears to make a simple numeric error by equating 18,408 to 30,000.
  • Overall
    • This problem requires knowledge of different formulas related to random variable distributions and credibility theory. The model does a fairly good job gathering the pieces, but struggles to put these pieces together.

Takeaways

Our findings in this article show that gpt-3.5-turbo, and by extension AI Actuarial Assistant, still has several challenges to overcome before it can reliably answer actuarial exam questions without human supervision.

One of the main limitations is the model's inability to simplify arithmetic, algebraic, and calculus expressions into numeric answers. This seems reasonable given that the model is trained to recognize language patterns rather than perform computations. However, as OpenAI notes, "[GPT-3] is actually attempting to perform the relevant [arithmetic] computation rather than memorizing a table." The fact that GPT-3 has the potential to "compute" rather than regurgitate patterns seems promising for future versions of the model, at least as it applies to solving actuarial exam questions.

Despite its current limitations, AI can still offer several benefits to students preparing for actuarial exams:

  1. Given gpt-3.5-turbo's impressive ability to clearly explain concepts, the generated explanations could benefit students who are looking for additional guidance to existing solutions. If a student just needs help with one part of the problem, there's a relatively high chance an AI-generated solution can help the student see things from a different perspective. From there, it is up to the student to fill in the gaps and determine whether the rest of the generated solution is accurate or not.
  2. Every few years, new exams are introduced and old exams are removed from the actuarial credentialing process. This can leave students with little material to draw from, especially for the first sitting of a new exam. By turning up the temperature parameter, the model produces more varied output, and the resulting AI-generated questions could serve as inspiration for the types of problems that could be asked on a new exam (see the sketch below).
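As a sketch of this idea, the call below reuses the pre-1.0 openai interface shown earlier with the temperature turned up; the prompt wording is an assumption for illustration.

```python
# Sketch: raising the temperature to generate more varied practice questions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

existing_question = "..."  # an existing question to draw inspiration from

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content":
               "Write a new practice question, with a full solution and (A)-(E) "
               "answer choices, in the style of the following question:\n\n"
               + existing_question}],
    temperature=1.0,  # higher temperature trades consistency for variety
)
print(response["choices"][0]["message"]["content"])
```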

Next Steps

The results in this article just scratch the surface of AI's potential. We presented an initial infrastructure and use case for integrating LLMs into the actuarial exam study process. As the world begins to better understand the capabilities and limitations of LLMs, better models will be implemented and results will improve.

The Actuarial Nexus was built to intake and store a large amount of community-sourced exam questions and solutions. The infrastructure is uniquely set up to integrate with AI Actuarial Assistant. Rather than write a question and solution from scratch, a contributor can simply press a few buttons and generate a (relatively) unique solution or question using AI. The aforementioned STaR process can then be utilized to select high-quality responses to train future models. Of course, steps will need to be taken to ensure that this initial quantity-over-quality approach provides more benefit than harm to students, which will come with usage and time.

The current version of AI Actuarial Assistant (beta) was built in a short amount of time to lay the groundwork for this article. Given the novelty of the idea and the relatively quick development time, there is a lot of unexplored potential in using AI to generate actuarial exam study material, particularly in the topics of prompt design and fine-tuning. I look forward to sharing progress on these two topics in a future blog post.

Conclusion

AI has the potential to enhance the studying experience for actuarial exam candidates. We presented examples in which the model explains existing solutions, writes new practice problems, and stores responses in a central database for future training. The technology is still new, so it will take time to figure out how to best produce reliable and useful results.

The vast majority of actuarial exam questions are in print/PDF form or behind paywalls, so they are not easily accessible to OpenAI's core training model. Since actuarial exams test candidates on a relatively narrow and niche range of topics, I believe the real value in AI integration in the next few years will be derived from the ability to procure high-quality data, process that data, and train a curated model for solving actuarial exam questions. Given the heavy focus on math, an added challenge will be working with LaTeX, for both supplying input and processing output. AI Actuarial Assistant is just the first step in integrating AI with preparing for actuarial exams.

AI is here to stay, and can only improve with time. OpenAI's gpt-3.5-turbo is the latest installment in a rapidly evolving series of LLMs, with each model being a massive improvement upon the previous model. On January 30, 2023, ChatGPT was upgraded with improved factuality and mathematical capabilities. As recently as today, OpenAI announced GPT-4, its latest LLM. GPT-4 potentially addresses some of the limitations outlined in this article, particularly with math calculations.

There is also speculation that Wolfram|Alpha may integrate its facts-based answer engine with an OpenAI LLM. This would considerably increase the usability of AI for solving actuarial exam questions, since a main limitation of the current model is its inability to compute basic math equations.

With the right supervision and ecosystem, AI models and their implementations can evolve to provide tremendous value to students preparing for actuarial exams. We have barely begun to scratch the surface in this article. In the next part, we will expand on the topics covered here and discuss further developments with AI Actuarial Assistant.