Can GPT Optimize My Taxes?

TL;DR We made a GPT interface to an open-source US tax scenarios library, and it is at times pretty good, but asks a lot of the user.

Apparently this blog will be about working out implications of observations by Andrej Karpathy. This time we’re thinking about his comment that LLMs may come to be thought of as a kind of higher-order operating system: not omnicapable themselves, but rather the glue that links together other components, such as data stores, domain-specific libraries and user interfaces. Cam Pedersen wrote a nice post spelling out the ideas more fully.

I thought it would be fun to build some concrete data product to explore this LLM-as-OS idea, and contrast it with the existing state of the art. The idea: a ways back, to explore some tax scenarios we were dealing with, I’d written a python library called tenforty, which in turn uses a wonderful open source package called Open Tax Solver. I’d never written it up, and thought someday I’d make a webapp from it to share. This seemed like a nice opportunity to try building something as a newfangled LLM-OS app: an AI tax advisor.

tenforty turns tax form calculations into python function calls, which makes it easy to evaluate one return, or many hypothetical returns. This latter what-if aspect was the main itch I was aiming to scratch with the library: What if we sold this stock over two years rather than one? What would happen to our tax bracket if we maxed out our 401(k)s? Answering questions like this in TurboTax felt Sisyphean: write down tax amount, back-back-back, tweak input, forward-forward-forward, write down tax amount…¹ (There’s more about tenforty per se in an appendix.)
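To make the what-if idea concrete, here is a minimal sketch of the "sell over one year vs. several" question as plain function calls. It does not use tenforty itself: `toy_tax` is a hypothetical stand-in for a real tax calculation, and its brackets are invented for illustration, not real US tax brackets.

```python
# Toy stand-in for a tax-library call. These brackets are invented for
# illustration and are NOT real US tax brackets.
TOY_BRACKETS = [(0, 0.10), (50_000, 0.20), (150_000, 0.30)]  # (lower bound, marginal rate)


def toy_tax(income: float) -> float:
    """Progressive tax on `income` under the made-up brackets above."""
    tax = 0.0
    for i, (lower, rate) in enumerate(TOY_BRACKETS):
        is_last = i + 1 == len(TOY_BRACKETS)
        upper = float("inf") if is_last else TOY_BRACKETS[i + 1][0]
        if income > lower:
            tax += (min(income, upper) - lower) * rate
    return tax


def tax_over_horizon(salary: float, gain: float, spread_years: int, horizon: int = 3) -> float:
    """Total tax across `horizon` years when `gain` is realized in equal
    parts over the first `spread_years` years."""
    total = 0.0
    for year in range(horizon):
        gain_this_year = gain / spread_years if year < spread_years else 0.0
        total += toy_tax(salary + gain_this_year)
    return total


# Same salary and same total gain; only the timing of the sale changes.
scenarios = {n: tax_over_horizon(100_000, 120_000, spread_years=n) for n in (1, 2, 3)}
for years, total in scenarios.items():
    print(f"gain spread over {years} year(s): total tax {total:,.0f}")
```

Each scenario is one function call, so comparing many of them is a loop rather than a session of back-and-forth in a form-filling UI.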

Before LLMs appeared, to build a webapp for these what-ifs, I might have researched common tax scenarios beyond our own that tenforty could help with, and built a kind of calculator app to anticipate those cases, perhaps like SmartAsset’s tax overview page.

In 2024, if we let the LLM be the UX, then building the app becomes less, and different, work. In some sense the user will just bring their own scenario, and the app doesn’t need to anticipate it. These were the steps:

1. Build a tiny web service that exposes the main functions from tenforty as endpoints. We used FastAPI.
2. Configure the custom GPT on OpenAI:
   - Come up with a name: Tax Driver²
   - Write the GPT’s prompt. We link a copy of it here.
   - Configure a GPT “Action” that provides the details of the endpoints. Since FastAPI autogenerates OpenAPI specifications (ref), this was mainly pasting in that spec.
   - Write a few example prompts to illustrate the kinds of things it can do.
   - (Revise the GPT prompt to address unexpected behavior… 🤦🏼‍♂️)
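For a sense of how small the web-service step can be, here is a rough sketch of the core of one endpoint, written with the standard library only. It is not the actual service: `handle_evaluate_return`, the `w2_income` field handling, and the flat 25% `toy_tax` are all hypothetical stand-ins. In a FastAPI app, a function like this would be registered as a route, and the OpenAPI spec would be generated from its type hints.

```python
import json


def toy_tax(income: float) -> float:
    """Hypothetical stand-in for the real calculation: a flat 25% rate."""
    return income * 0.25


def handle_evaluate_return(request_body: str) -> str:
    """Parse a JSON request, run the calculation, and return a JSON response.

    The handler keeps no state between calls: everything it needs arrives
    in the request, and everything it knows leaves in the response.
    """
    payload = json.loads(request_body)
    w2_income = float(payload.get("w2_income", 0))
    if w2_income < 0:
        return json.dumps({"error": "w2_income must be non-negative"})
    return json.dumps({"w2_income": w2_income, "total_tax": toy_tax(w2_income)})


print(handle_evaluate_return('{"w2_income": 100000}'))
```

The GPT's Action then only needs to know the endpoint's path and its input and output fields, which is what the pasted spec supplies.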

If you have a paid ChatGPT+ account you can try out Tax Driver right here. The privacy policy is linked from the GPT, and also here; the upshot is that the backing web service is set up as a pure calculator, and logs only which endpoint was called, and when it was called by OpenAI’s servers. Note that independently of ChatGPT+, you can also play with tenforty directly using the included Colab notebook.
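The "pure calculator" logging policy can be sketched as a small decorator that records the endpoint name and a timestamp, and never looks at the inputs. This is an illustration of the policy, not the service's actual code: `log_call_only`, `CALL_LOG`, and `evaluate_return_endpoint` are hypothetical names, and the in-memory list stands in for a real log sink.

```python
import datetime
import functools

CALL_LOG = []  # in-memory stand-in for a real log sink


def log_call_only(endpoint_name: str):
    """Decorator that logs which endpoint was called and when, nothing else."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Deliberately record no arguments and no results.
            CALL_LOG.append({
                "endpoint": endpoint_name,
                "called_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            })
            return func(*args, **kwargs)
        return wrapper
    return decorator


@log_call_only("evaluate_return")
def evaluate_return_endpoint(w2_income: float) -> dict:
    """Hypothetical endpoint body; the flat 25% rate is illustrative only."""
    return {"total_tax": w2_income * 0.25}


result = evaluate_return_endpoint(80_000)
print(result, CALL_LOG[-1])
```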

What Did We Learn?

Tax Driver does a bunch of things well:

- From minimal input, as in the example screenshot here, it reliably defers to the Action to evaluate the scenario. We’ve gotten used to GPT-4’s natural-language abilities over the past year or so, but they really are stunning.
- It is game to at least try to evaluate essentially any tax scenario you might find yourself faced with, combining its knowledge base with the specific calculations enabled by tenforty. I would never have been able to match that breadth with a set of canned scenario calculators. Here’s another example that’s gone reasonably well, if one thinks of AMT as “the excess tax over the regular tax due to the AMT calculation” as it’s sometimes discussed informally.

Tax Driver Example: Incentive Stock Option Gains

- It can produce tables/graphics that summarize the results well:

- If asked, it will reliably produce apt python code using tenforty that you can drop into a notebook and take from there.

Some of those capabilities are simply from the future, and some might have been achievable by me or a team, only with much more work. And building this “app” took a tiny fraction of the time it would have taken me to build its pre-LLM analog.

But it can also be frustrating:

- It can just miss the point of a given scenario request, in that cheerfully confident way GPT-4 has, like a politician not-answering a question at a town hall. In this example, we’ve asked for it to look at the tax burden if everything was exercised in one year vs spread out over two or three years, and Tax Driver misses the mark widely:
- It has a set of specific instructions to follow from its prompt, and it often follows them, but also sometimes defies them, especially those having to do with figure generation. For example, here’s a case in which we’ve explicitly asked for a shaded area chart, in addition to the instruction to prefer shaded area charts for plots like this in its prompt, and it claims that it’s drawn one, but hasn’t:
- It can be erratic. Run the same prompt twice, and once you will get reasonable output, and the second time it will fail to successfully run an Action or process the Action’s results. No example shown.
- It occasionally mixes up the Action calls, i.e. web requests, with python calls to tenforty. No example shown–the output is generated python code, when it “should” have called out to the Action and done something with the response.

These limitations are familiar to us by now. This lack of meta-cognition is a big reason why this first generation of generative AI products – ChatGPT, Stable Diffusion, GitHub Copilot – is expressed as copilots, as opposed to, say, autonomous assistants that can be relied upon to perform a task correctly. We encountered the same issue in our previous post about doing scientific literature meta-analysis with LLMs.

More broadly, as easy as it is to create this app, the universal chat interface is the proverbial jack of all trades, master of none. It places a lot of the burden on the user to know what they want, write out their requests carefully, and exercise care in interpreting the results and iterating upon them. Simon Willison made a connected point in a recent podcast with Outerbounds: the generic chat interface leaves much to be desired in terms of discoverability, and runs a high risk of scaring off non-enthusiast prospective users.

The upshot: Tax Driver is a capable if mercurial assistant. At least playing with our own actual tax scenarios, with a little patience and some coaching, it produced solid analyses. I found it particularly helpful to ask it to generate tenforty code that I could play with myself at the end of one of these chats. That seems useful, compared to having to do that all oneself, but it requires a fairly tolerant user, and is less surefooted than one ideally would hope for from a tax advisor.

Acknowledgments

Thanks kindly to Sarah Laskey for improving an earlier draft. All errors are mine!

Appendix: About `tenforty`

Open Tax Solver (OTS), the package that tenforty relies on, is one of those gems of the internet. The OTS team has published their software for more than twenty consecutive years now. It provides implementations of the US 1040 form calculations, and the forms for many states, including California and New York.

You can prepare your tax return either using OTS’ provided GUI, or via a simple text file format. In either case, OTS outputs the resulting return to another text file, and can insert the output into the official tax form PDFs. It’s thus a DIY, open source TurboTax that improves upon doing the calculations by hand. Very cool.

tenforty works by amalgamating the C language source code from OTS releases into a single big source file, and then wrapping that source file as a cython extension. Because OTS is fundamentally designed around reading and writing text files (that contain tax return information), tenforty actually writes and reads back text files in temporary directories behind the scenes, although only python dicts and pandas dataframes are presented to the library user.
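The file round trip described above can be sketched in a few lines. This is an illustration of the pattern, not tenforty's code: `toy_solver` is a hypothetical stand-in for the wrapped solver, the "name value" file format is a simplification rather than OTS's real format, and the flat 25% rate is invented.

```python
import tempfile
from pathlib import Path


def toy_solver(input_path: Path, output_path: Path) -> None:
    """Stand-in for a file-based solver: reads 'name value' lines and
    writes results in the same simplified format."""
    values = {}
    for line in input_path.read_text().splitlines():
        name, value = line.split()
        values[name] = float(value)
    total_income = sum(values.values())
    output_path.write_text(
        f"total_income {total_income}\ntotal_tax {total_income * 0.25}\n"
    )


def evaluate(inputs: dict) -> dict:
    """Dict in, dict out; the text files live only inside a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "return_in.txt"
        out_path = Path(tmp) / "return_out.txt"
        in_path.write_text("".join(f"{k} {v}\n" for k, v in inputs.items()))
        toy_solver(in_path, out_path)
        results = {}
        for line in out_path.read_text().splitlines():
            name, value = line.split()
            results[name] = float(value)
    return results


result = evaluate({"L1": 60_000, "L2": 1_000})
print(result)
```

The caller never sees a file path, and the temporary directory is removed as soon as the results have been read back.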

OTS’s text-file interface works in terms of tax line numbers, e.g. W-2 income gets reported on Line 1 of the US 1040. Since only accountants and the very determined will know the line numbers, OTS annotates them with their natural names like “W-2 Income” in their GUI. To solve the same problem, I added a higher-level interface to tenforty that offers natural names for the familiar tax-form quantities, e.g. w2_income. All available options are listed out in the repo’s README.
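The natural-name layer amounts to a lookup table from friendly names to line identifiers. Here is a minimal sketch of the idea. Only the pairing of W-2 income with Line 1 comes from the text above; the `"L1"` key format, the second table entry, and the `to_line_inputs` helper are hypothetical and not tenforty's actual implementation.

```python
# Only the w2_income -> Line 1 pairing comes from the post; the key format
# and the second entry are illustrative assumptions.
NATURAL_TO_LINE = {
    "w2_income": "L1",
    "taxable_interest": "L2b",
}


def to_line_inputs(**named_inputs) -> dict:
    """Translate natural-name keyword arguments into line-level inputs."""
    unknown = sorted(set(named_inputs) - set(NATURAL_TO_LINE))
    if unknown:
        raise ValueError(f"No line mapping for: {', '.join(unknown)}")
    return {NATURAL_TO_LINE[name]: value for name, value in named_inputs.items()}


print(to_line_inputs(w2_income=100_000, taxable_interest=250))
```

Rejecting unknown names up front, rather than silently dropping them, keeps a typo in an input name from quietly producing a wrong return.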

This higher-level interface will be limiting for some, because I only did the fairly common quantities. For example, if you’ve got a small business, there will be some missing inputs. There are two ways around this:

- There’s a lower-level interface in tenforty as it stands, where you can provide any of the line-level inputs as a python dictionary, the return will be evaluated, and you’ll get back a dictionary with all the line-level outputs.
- tenforty is open-source, and people are encouraged to contribute PRs with their improvements.

The document DEVELOP.md goes into more detail about how the package works.

Limitations

Even though OTS goes back to the Bush II administration, I have only included tax years 2017 through 2023 in tenforty, and I’ve only wrapped up California (where we live), Massachusetts, and New York, although OTS supports more states. More years/states/etc. can be added by PR.

Testing

There are some basic unit tests in place, and then some property-based tests using the hypothesis library, that exercise tenforty/OTS by generating a wide variety of inputs and checking that certain properties hold, e.g. “If your W-2 income goes up, and nothing else changes, your total tax must always be the same or higher.”
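The monotonicity property quoted above can be sketched without any dependencies. The real tests use the hypothesis library against tenforty/OTS; this sketch swaps in the standard library's `random` and a hypothetical `toy_tax` with invented brackets, so it shows the shape of the check rather than the actual test suite.

```python
import random

# Invented brackets for illustration; NOT real US tax brackets.
TOY_BRACKETS = [(0, 0.10), (50_000, 0.20), (150_000, 0.30)]


def toy_tax(income: float) -> float:
    """Progressive tax under the made-up brackets above."""
    tax = 0.0
    for i, (lower, rate) in enumerate(TOY_BRACKETS):
        is_last = i + 1 == len(TOY_BRACKETS)
        upper = float("inf") if is_last else TOY_BRACKETS[i + 1][0]
        if income > lower:
            tax += (min(income, upper) - lower) * rate
    return tax


def find_monotonicity_violation(tax_fn, trials: int = 1000, seed: int = 0):
    """Search for a case where a raise lowers the tax.

    Returns an (income, raise_amount) counterexample, or None if none is found.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        income = rng.uniform(0, 500_000)
        raise_amount = rng.uniform(0, 50_000)
        if tax_fn(income + raise_amount) < tax_fn(income):
            return income, raise_amount
    return None


print(find_monotonicity_violation(toy_tax))
```

Unlike this random search, hypothesis also shrinks any failing input toward a minimal counterexample, which is much of its appeal for a check like this.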

I further verified that our own externally-prepared returns from the past few years give the same answers through tenforty. Not a big $n$, but in principle a larger test set of reference return calculations might be built up.

This is to say, there’s a good-faith effort to test the package, but the testing game could certainly be improved. PRs welcome. :)

¹ Maybe TurboTax isn’t great for this, but what about doing what-ifs using Open Tax Solver’s GUI? OTS’ GUI is not really designed for what-ifs either, so the process would be fairly labor-intensive this way as well, although less labor-intensive than with TurboTax. The OTS FAQ anticipates and offers some guidance for this use case (ref):

Q: Can I use OTS to do quick “what-if’s” throughout the year ? For example, I would like to understand the tax consequences of selling some mutual funds, working extra hours, or deferring income in a 401K or IRA ?

A: Yes. You can enter tentative values in any of the lines, save to a file name that reflects your experiment, like “1040_more_hours.txt”. Run the solver and compare your taxes before and after. Even better, you could write a script that sweeps certain values, such as income or capital gains or loses, and runs the solver on each case. Then you can plot the results. (You probably cannot do that very easily, if at all, with any of the commercial packages.) You may find that your tax situation is rather non-linear. It helps to be aware of the inflection points during the year, while you can still do something about it.

In some sense tenforty aims to make it even easier to write such a value-sweeping script.

² We had re-watched Ronin the other week…
