
AI integration: challenges and discoveries

AI art of a bunch of gerbils riding an antique-looking open-top car across a dry lake bed, a cloud of dust behind them.

Above: Get in, loser! We’re gonna find the bandwagon!

Image credit: John Williams / Midjourney

So I’m coming off an R&D sprint investigating how to integrate the newer AI tools into our toolkit, and it’s been an interesting ride. I’ve learned a bit about prompt writing (“prompt engineering?” I’ll get to that), a bit about Python, a bit about Ruby, and a bit about neural nets. Not bad for a developer from Frontendistan.

Here are a few of my observations.

That’s a heck of a lot of math

Even using paid services, the APIs are slow and sometimes fail to respond at all. I read Stephen Wolfram’s book What Is ChatGPT Doing … and Why Does It Work?, and he explains why: the model works token by token (a token being a manageable chunk of text), and generating each token involves essentially every one of the network’s 175 billion weights. This suggests to me that if you have a text that’s one thousand tokens long, which is by no means outlandish, you’re looking at something like 175 trillion operations. With a “T.” That’s a hard number to think about. But for comparison’s sake, estimates put the number of cells in a human body somewhere between 30 and 100 trillion.
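The back-of-the-envelope math fits in a few lines (a rough sketch; one operation per weight per token is a simplification, but it gets the scale across):

```python
# Back-of-the-envelope estimate: assume one operation per weight per token.
# That's a simplification, but it shows the scale involved.
WEIGHTS = 175_000_000_000  # GPT-3's published parameter count
TOKENS = 1_000             # a modest prompt plus response

operations = WEIGHTS * TOKENS
print(f"{operations:,}")  # 175,000,000,000,000 -- 175 trillion
```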

Think about that the next time you ask ChatGPT to make up a funny limerick about cheese sandwiches.

In a sandwich of cheese so divine,
Between slices, it’s melted, a sign.
With a crunch and a chew,
It’s a flavor to pursue,
Oh, that cheesy delight, so benign!

Anyway, what that means is that it takes a little while for the AI to think about what you are saying and then put together a good response. Kind of like what would happen to you if someone dropped an 800-word text on you and requested a sentiment analysis.

This is a pattern that we see repeated often with these Large Language Models. They were designed to mimic human writing, so they tend to make very human decisions.

No one really understands how it works

No, seriously. When they say “we don’t know how this works,” they don’t know how this works.

Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work. — Stephen Wolfram, “What ChatGPT is Doing…”
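For what it’s worth, the step Wolfram is puzzling over really is just element-wise addition. Here’s a toy sketch in numpy with made-up sizes (in a real model the embedding matrices are learned, not random):

```python
import numpy as np

vocab_size, context_len, d_model = 1_000, 16, 8  # toy sizes, not GPT's

# In a real model these matrices are learned during training;
# random values stand in for them here.
token_embedding = np.random.randn(vocab_size, d_model)
position_embedding = np.random.randn(context_len, d_model)

token_ids = np.array([464, 200, 318, 257])  # hypothetical token ids
positions = np.arange(len(token_ids))       # 0, 1, 2, 3

# The "just add them together" step from the quote above:
x = token_embedding[token_ids] + position_embedding[positions]
print(x.shape)  # (4, 8): one combined vector per input token
```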

Wolfram uses an odd word for the tribal knowledge of tricks that seem to work in developing and training a neural net. That word is “lore.” He uses it a lot.

Sometimes—especially in retrospect—one can see at least a glimmer of a “scientific explanation” for something that’s being done. But mostly things have been discovered by trial and error, adding ideas and tricks that have progressively built a significant lore about how to work with neural nets. — Stephen Wolfram, “What ChatGPT is Doing…”

Wolfram is no AI neophyte — if the name is not ringing bells for you, look him up on Wikipedia. The fact that this physicist / computer scientist / mathematician says “we don’t know” and “lore” a lot is sobering.

Unsurprisingly, prompt writing works the same way

I really don’t like the term “prompt engineering.” There are lots of definitions of engineering, but they almost always rest on applying scientific principles in an orderly fashion to build, create, or predict the behavior of machines under specific conditions.

The American Engineers’ Council for Professional Development actually uses the phrase “to construct or operate [these machines] with full cognizance of their design.” I doubt you’ll ever hear an engineer talking about “steel girder tensile strength lore.”

The way prompt writing works (briefly) is this: you think of what you need, ask the AI for an answer, and then keep revising your prompt until you get something acceptable back.
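In code, the “engineering” part is conspicuously absent. Here’s a minimal sketch of that loop, assuming the OpenAI Python SDK (openai 1.x) and an API key in your environment; the acceptance test is you, squinting at the output:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Summarize this support ticket in one sentence: ..."

while True:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

    # No spec, no tolerances: just a human deciding whether it's good enough.
    if input("Acceptable? [y/N] ").strip().lower() == "y":
        break
    prompt = input("Revise the prompt and try again: ")
```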

When people are showing off what an AI can do, it’s something like a magic trick. But when you need it for actual work, it turns out you’re doing a fair amount of heavy lifting yourself. Assuming you care about the quality of the results, that is.

You really can’t count on the results

That is to say, the results of any specific prompt are not precisely repeatable. There’s a configuration value called “temperature” that tamps down or ramps up the “creativity” of ChatGPT (“who names these things?” asked a colleague), but even at its lowest setting you can get some variation in responses when you repeat the same question over and over. Kind of like what you would get with a human, actually.
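If you want to see the wobble for yourself, temperature is just a request parameter. A small sketch, again assuming the OpenAI Python SDK; even pinned to zero, the wording can drift between runs:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # the "least creative" setting, not a guarantee of sameness
    )
    return response.choices[0].message.content

# Ask the identical question several times and count the distinct answers.
answers = {ask("Describe a cheese sandwich in one sentence.") for _ in range(5)}
print(f"{len(answers)} distinct answer(s) out of 5 runs")
```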

There are many cases where this is undesirable. The enormous advantage of programming is that the computer does exactly what it is told, and it does it the same way every time. Unless, of course, you go to a great deal of effort to introduce randomness into the system. Even then it might not be very random.
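Even the randomness we do bolt on is usually pseudorandomness, which is itself perfectly deterministic once you fix the seed:

```python
import random

# Seed the generator twice with the same value and the "random" rolls match.
random.seed(42)
first = [random.randint(1, 6) for _ in range(5)]

random.seed(42)
second = [random.randint(1, 6) for _ in range(5)]

print(first == second)  # True: same seed, same sequence, every time
```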

“One thing that traditional computer systems aren’t good at is coin flipping,” says Steve Ward, Professor of Computer Science and Engineering at MIT’s Computer Science and Artificial Intelligence Laboratory. “They’re deterministic, which means that if you ask the same question you’ll get the same answer every time. In fact, such machines are specifically and carefully programmed to eliminate randomness in results. They do this by following rules and relying on algorithms when they compute.” — “Can a computer generate a truly random number?”, MIT School of Engineering

This is perfectly fine if you are trying to get natural speech, but kind of a bummer if you need to mill that block of aluminum to within 0.001” of the spec. You may even want repeatable results in natural language, because there’s a small but non-zero chance the LLM might decide to start calling your very irate client “dawg.”

Much like your human support agents, in fact.

Beware vendor lock-in

Until and unless we get LLMs that are open source and freely distributed, any code you write that depends on them is going to be at the mercy of a company that’s in it for the benjamins. Right now those requests may be cheap, but you can bet that at some point enshittification will set in.

Suddenly you are the Third Party Vendor to OpenAI’s Amazon. When they decide to change the model, your prompts start turning out very different stuff. Maybe they ban all politics. Or all criticism of Amazon. Or all mentions of competing LLMs. Stability AI has already demonstrated it can make those kinds of censorial changes to Stable Diffusion, its art bot.

Maybe they pull a Reddit and hike the price of their API drastically. Or they pull a Twitter and shut the API off altogether.

This is not a new risk; tons of folks built businesses on services provided by YouTube, TikTok, eBay, Twitter, and Reddit, only to have service fees, changes in direction, or even an increasingly censorious attitude undermine all of their work.

It’s exciting (and worrying)

All of that said, these AI tools make some previously very difficult programming tasks remarkably easy, and I think we’ve just scratched the surface there. No doubt future generations of LLMs will get faster. Maybe we will replace “lore” with better understanding. Maybe prompt writing will actually become engineering at some point. Maybe we will find a way to make these very complex, poorly understood tools respond in a consistent manner without us having to worry about them embarrassing us in front of Mom or starting political fights over Thanksgiving.

The last concern, though, is more about the business structure of AI. As are concerns about what AI ultimately means for people’s jobs. Those will require political action, activism, and maybe a serious rethink of our economic foundations. I don’t know if it’s gonna get worse, or gonna get better, but it’s certainly gonna get weird.