Can GPT Optimize My Taxes?
TL;DR
We made a GPT interface to an open-source US tax scenarios library,
and it is at times pretty good, but asks a lot of the user.
Apparently this blog will be about working out implications of observations by Andrej Karpathy. This time we’re thinking about his comment that LLMs may come to be thought of as a kind of higher-order operating system: not omnicapable themselves, but rather the glue that links together other components, such as data stores, domain-specific libraries and user interfaces. Cam Pedersen wrote a nice post spelling out the ideas more fully.
I thought it would be fun to build some concrete data product to explore this
LLM-as-OS idea, and contrast it with the existing state of the art. The idea: a
ways back I’d written a python library called
tenforty
, that in turn uses a
wonderful open source package called Open Tax
Solver, to explore some tax scenarios
we were dealing with. I’d never written it up, and thought someday I’d make a
webapp from it to share. This seemed like a nice opportunity to try building
something as a newfangled LLM-OS app: an AI tax advisor.
tenforty
turns tax form calculations into python function calls, which makes
it easy to evaluate one return, or many hypothetical returns. This latter
what-if aspect was the main itch I was aiming to scratch with the library:
What if we sold this stock over two years rather than one? What would happen to
our tax bracket if we maxed out our 401Ks? To answer questions like this in
TurboTax felt Sisyphean; write down tax amount, back-back-back, tweak input,
forward-forward-forward, write down tax amount…1 (There’s more about
tenforty
per se in an appendix.)
Before LLMs appeared, to build a webapp for these what-ifs, I might have
researched common tax scenarios beyond our own that tenforty
could help with,
and built a kind of calculator app to anticipate those cases, perhaps like
SmartAsset’s tax overview page.
In 2024, if we let the LLM be the UX, then building the app becomes less, and different, work. In some sense the user will just bring their own scenario, and the app doesn’t need to anticipate it. These were the steps:
- Build a tiny web service that exposes the main functions from
tenforty
as endpoints. We used FastAPI. - Configure the custom GPT on OpenAI:
- Come up with a name: Tax Driver2
- Write the GPT’s prompt. We link a copy of it here.
- Configure a GPT “Action” that provides the details of the endpoints. Since FastAPI autogenerates OpenAPI specifications (ref), this was mainly pasting in that spec.
- Write a few example prompts to illustrate the kinds of things it can do.
- (Revise the GPT prompt to address unexpected behavior… 🤦🏼♂️)
If you have a paid ChatGPT+ account you can try out Tax Driver right
here. The privacy policy is
linked from the GPT, and also
here; the upshot is that the
backing web service is set up as a pure calculator, and logs only which endpoint
was called, and when it was called by OpenAI’s servers. Note that independently
of ChatGPT+, you can also play with tenforty
directly using the included
Colab
notebook.
What Did We Learn?
Tax Driver does a bunch of things well:
-
From minimal input, as in the example screenshot here, it reliably defers to the Action to evaluate the scenario. We’ve gotten used to GPT-4’s natural-language abilities over the past year or so, but they really are stunning. (Note: images enlarge if clicked/tapped.)
-
It is game to at least try to evaluate essentially any tax scenario you might find yourself faced with, combining its knowledge base with the specific calculations enabled by
tenforty
. I would never have been able to match that breadth with a set of canned scenario calculators. Here’s another example that’s gone reasonably well, if one thinks of AMT as “the excess tax over the regular tax due to the AMT calculation” as it’s sometimes discussed informally.
- It can produce tables/graphics that summarize the results well:
-
If asked, it will reliably produce apt python code using
tenforty
that you can drop into a notebook and take from there.
Some of those capabilities are simply from the future, and some might have been achievable by me or a team, only with much more work. And building this “app” took a tiny fraction of the time it would have taken me to build its pre-LLM analog.
But it can also be frustrating:
-
It can just miss the point of a given scenario request, in that cheerfully confident way GPT-4 has, like a politician not-answering a question at a town hall. In this example, we’ve asked for it to look at the tax burden if everything was exercised in one year vs spread out over two or three years, and Tax Driver misses the mark widely:
-
It has a set of specific instructions to follow from its prompt, and it often follows them, but also sometimes defies them, especially those having to do with figure generation. For example, here’s a case in which we’ve explicitly asked for a shaded area chart, in addition to the instruction to prefer shaded area charts for plots like this in its prompt, and it claims that it’s drawn one, but hasn’t:
-
It can be erratic. Run the same prompt twice, and once you will get reasonable output, and the second time it will fail to successfully run an Action or process the Action’s results. No example shown.
-
It occasionally mixes up the Action calls, i.e. web requests, with python calls to
tenforty
. No example shown–the output is generated python code, when it “should” have called out to the Action and done something with the response.
These limitations are familiar to us by now. This lack of meta-cognition is a big reason why this first generation of LLM-based products – ChatGPT, Stable Diffusion, GitHub Copilot – are expressed as copilots, as opposed to say autonomous assistants that can be relied upon to perform a task correctly. We encountered the same issue in our previous post about doing scientific literature meta-analysis with LLMs.
More broadly, as easy as it is to create this app, the universal chat interface is the proverbial jack of all trades, master of none. It places a lot of the burden on the user to know what they want, write out their requests carefully, and exercise care in interpreting the results and iterating upon them. Simon Willison made a connected point in a recent podcast with Outerbounds: the generic chat interface leaves much to be desired in terms of discoverability, and runs a high risk of scaring off non-enthusiast prospective users.
The upshot: Tax Driver is a capable if mercurial assistant. At least playing
with our own actual tax scenarios, with a little patience and some coaching, it
produced solid analyses. I found it particularly helpful to ask it to generate
tenforty
code that I could play with myself at the end of one of these chats.
That seems useful, compared to having to do that all oneself, but it requires a
fairly tolerant user, and is less surefooted that one ideally would hope for
from a tax advisor.
Acknowledgments
Thanks kindly to Sarah Laskey for improving an earlier draft. All errors are mine!
Appendix: About tenforty
Open Tax Solver (OTS), the package
that tenforty
relies on, is one of these gems of the internet. The OTS team
has published their software for more than twenty consecutive years now. It
provides implementations of the US 1040 form calculations, and the forms for
many states, including California and New York.
You can prepare your tax return either using OTS’ provided GUI, or via a simple text file format. In either case, OTS outputs the resulting return to another text file, and can insert the output into the official tax form PDFs. It’s thus a DIY, open source TurboTax that improves upon doing the calculations by hand. Very cool.
tenforty
works by amalgamating the C language source code from OTS releases
into a single big source file, and then wrapping that source file as a cython
extension. Because OTS is fundamentally designed around reading and writing text
files (that contain tax return information), tenforty
actually writes and
reads back text files in temporary directories behind the scenes, although only
python dicts and pandas dataframes are presented to the library user.
OTS’s text-file interface works in terms of tax line numbers, e.g. W-2 income
gets reported on Line 1 of the US 1040. Since only accountants and the very
determined will know the line numbers, OTS annotates them with their natural
names like “W-2 Income” in their GUI. To solve the same problem, I added a
higher-level interface to tenforty
, that offers natural names for the familiar
tax-form quantities, e.g. w2_income
. All available options are listed out in
the repo’s
README.
This higher-level interface will be limiting for some, because I only did the fairly common quantities. For example, if you’ve got a small business, there will be some missing inputs. There are two ways around this:
- There’s a lower-level interface in
tenforty
as it stands, where you can provide any of the line-level inputs as a python dictionary, the return will be evaluated, and you’ll get back a dictionary with all the line-level outputs. tenforty
is open-source, and people are encouraged to contribute PRs with their improvements.
The document
DEVELOP.md
goes into more detail about how the package works.
Limitations
Even though OTS goes back to the Bush II administration, I have only included
from 2017 through 2023 tax years into tenforty
, and I’ve only wrapped up
California (where we live), Massachusetts, and New York, although OTS supports
more states. More years/states/etc. can be added by PR.
Testing
There are some basic unit tests in place, and then some property-based tests
using the hypothesis library, that exercise
tenforty
/OTS by generating a wide variety of inputs and checking that certain
properties hold, e.g. “If your W-2 income goes up, and nothing else changes,
your total tax must always be the same or higher.”
I further verified that our own externally-prepared returns from the past few
years give the same answers through tenforty
. Not a big $n$, but in principle
a larger test set of reference return calculations might be built up.
This is to say, there’s a good-faith effort to test the package, but the testing game could certainly be improved. PRs welcome. :)
-
Maybe Turbo Tax isn’t great for this, but what about doing what-ifs using Open Tax Solver’s GUI? OTS’ GUI is not really designed for what-ifs either, so the process would be fairly labor-intensive this way as well, although less labor-intensive than with TurboTax. The OTS FAQ anticipates and offers some guidance for this use case (ref):
Q: Can I use OTS to do quick “what-if’s” throughout the year ? For example, I would like to understand the tax consequences of selling some mutual funds, working extra hours, or deferring income in a 401K or IRA ?
A: Yes. You can enter tentative values in any of the lines, save to a file name that reflects your experiment, like “1040_more_hours.txt”. Run the solver and compare your taxes before and after. Even better, you could write a script that sweeps certain values, such as income or capital gains or loses, and runs the solver on each case. Then you can plot the results. (You probably cannot do that very easily, if at all, with any of the commercial packages.) You may find that your tax situation is rather non-linear. It helps to be aware of the inflection points during the year, while you can still do something about it.
In some sense
tenforty
aims to make it even easier to write such a value-sweeping script. ↩︎ -
We had re-watched Ronin the other week… ↩︎