AI and LLMs as my search engine
An insert before I even post this for the first time. I wrote this post on the 22nd of March but didn't have time to do a quick read-through and publish it until the morning of the 24th. In that time, OpenAI has announced plugins for their ChatGPT product. This has huge implications and is likely to solve an entire class of issues I identify below: challenges that require precise computation. I'll probably want to write another blog post about that once I get access to the plugins. What a time to be alive.
At this point there are a million articles and Twitter threads written about Bing Chat and ChatGPT and their successes and failures. Amusing conversations, disturbing ones, completely incorrect ones, mind-blowing ones. But one thing I haven't really seen or tried out for myself is what these tools look like for your bog-standard day-to-day stuff. The kind of stuff where maybe your kid asks if an ant has a heart and you realize that you've gone your entire childhood without actually reading about this. The kind of stuff you'd normally answer with your regular search engine plus your own reading time.
So, starting yesterday, I've been using Bing AI chat and ChatGPT for my standard searches. And honestly, it's been both great and underwhelming at the same time. I already have an opinion forming about it, but I've promised myself that I'll try to stick with it for at least the next 30 days.
Where it's good
On the plus side, the bots are really good at informational queries where the knowledge is not controversial. Ants and insects have a different circulatory system from humans; all the available information points the same way. So the language model takes that, synthesizes it, and pops it out with very good referencing.
Additionally, in the case of Bing, the model has access to the internet. It can scan the web for discussions on current topics and return synthesized knowledge very quickly. The synthesis is a little limited depending on how recent the information is, but it's pretty good nevertheless at getting me started on understanding a current topic.
For example, I was following along on the banking crisis unfolding in America. I got the day's stock price for a regional bank of my choice. I got several news articles talking about its downgrade by Fitch. I followed up on the topic, and although Bing couldn't really summarise at that moment why exactly Fitch had downgraded the bank, it was able to point me in the right direction quickly. More importantly, it also helped me dig deeper into the topic by explaining what each rating meant and what the difference is between a restricted default and a normal default.
That said, some of the answers Bing gave were both correct and weird. It said that it had no idea what the ratings were for the six biggest banks in the USA, but then in the same answer it listed the ratings for all of those banks. It also ended with some weird trivia about there being no government-owned banks in the USA while there are government-owned banks in Europe. Thanks Bing, but I never asked for that information.
I'm not sure if Bing wants to reach the point where it will read a URL and summarise it for me within the chat interface itself. That smells of a copyright lawsuit waiting to happen. But if it ever figured out some kind of revenue-share model with publishers, where a partnered publisher gets paid every time someone requests a summary of a URL from their site, I think that would be pretty sweet.
Rowing with the models
It's not all fun and innocent errors with Bing though. It struggles with any area that requires some actual thinking or processing of tricky information. I had a conversation with it yesterday about my rowing exercises. There is a type of rowing called steady state rowing where you row for an extended period at a low intensity to build up endurance without injuring or overtraining yourself. The same kind of exercise exists in other endurance sports under names like zone 2 training.
I wanted Bing to tell me what steady state rowing was and it got that correct. You take your 2km timing (2km is the distance used for testing and for standard boat racing) and you row at 50-55% of the average wattage.
What does this mean? Let me explain, because Bing is about to get it very wrong. If your 2km time is 8 minutes and 2 seconds, your average power output is about 200 watts (trust me, I did the numbers). Which means you do steady state training at about 100 watts, which translates to a pace of about 2 minutes and 31 seconds per 500m.
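For the curious, the conversion here is completely deterministic. Concept2 publishes the formula watts = 2.80 / pace³, with pace in seconds per meter. A quick Python sketch of the numbers above (the function names are mine):

```python
def watts_from_pace(seconds_per_500m: float) -> float:
    """Average power for a given 500m split, per Concept2's formula."""
    pace = seconds_per_500m / 500  # seconds per meter
    return 2.80 / pace ** 3

def pace_from_watts(watts: float) -> float:
    """500m split in seconds for a given average power."""
    pace = (2.80 / watts) ** (1 / 3)  # seconds per meter
    return pace * 500

# A 2km time of 8:02 (482 seconds) means a 500m split of 482 / 4 = 120.5s.
full_power = watts_from_pace(482 / 4)           # ~200 watts
steady_split = pace_from_watts(full_power / 2)  # ~151.8s, i.e. about 2:31.8
print(round(full_power), round(steady_split, 1))
```

No probabilities involved: the same input gives the same split every time, which is exactly what the chat bots fail to reproduce below.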
The next step was to ask Bing the question above: if my 2km time is 8 minutes, what should my steady state pace be? It managed to get it mostly right, with a little deviation from the actual answer. I then asked what my pace should be if I row 2km in 7 minutes and 45 seconds, and it proceeded to get things wrong in a weird way.
Pause here. These are exactly the things I would like to see a next-generation search engine do. We've been travelling slowly in this direction for some time; the search box has been capable of some forms of computation and conversion for a while. One would hope we make a leap to more advanced calculations derived from the information an LLM synthesizes.
Firstly, Bing correctly stated that a time of 7 minutes and 45 seconds for 2km does translate to an average wattage of 222, and therefore the steady state wattage should be about 111. But then it stated that 111 watts translates to a pace of 2 minutes and 10 seconds per 500m. That's... wrong. It translates to a pace of about 2 minutes and 26.6 seconds. When I asked Bing to show its method, it started to come apart, imagining that you take a timing and multiply it by 2. When I stated that that was incorrect and told it to use the conversion given by Concept2 (just to be sure), it started talking about damper, or resistance, settings. Those have nothing to do with the pace at all. Watts are watts. It's like saying 10 kg of feathers weighs less than 10 kg of steel. It makes no sense at all.
When I asked ChatGPT (using GPT-4) about this, it just started to spit out absolutely bizarre information. It confused pace and wattage and said that I should be rowing at 50% of my 2km pace. Pace and wattage are not the same thing; there's a non-linear relationship between the two. I pointed this out, and then it threw out even worse calculations.
- It said that my 500m pace for a 2km timing of 7 minutes and 45 seconds is 3 minutes and 52.5 seconds. Incorrect. It's 1 minute and 56.25 seconds.
- In the same answer it did a recalculation and got an answer of 58.125 seconds per 500m.
- It then proceeded to apply the pace-to-watts formula completely incorrectly even though it stated the correct formula. And then its calculation was completely wrong too.
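For the record, the correct chain of arithmetic for a 7:45 2km, using the Concept2 formula watts = 2.80 / pace³ (pace in seconds per meter), fits in a few lines:

```python
total = 7 * 60 + 45        # 465 seconds for 2000m
split_500m = total / 4     # 116.25s, i.e. 1:56.25 per 500m -- not 3:52.5
pace = split_500m / 500    # 0.2325 seconds per meter
watts = 2.80 / pace ** 3   # ~222.8 watts average
steady_watts = watts / 2   # ~111.4 watts for steady state
steady_pace = (2.80 / steady_watts) ** (1 / 3) * 500  # ~146.5s, about 2:26.5
print(split_500m, round(watts, 1), round(steady_pace, 1))
```

Every step is plain arithmetic; the models stated the right formula and then fumbled exactly this kind of evaluation.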
It was a mess.
Clearly, my excel sheet that I use for this isn't going anywhere.
Telling the time with the models
On to the next experiment: something that is very possible with today's search engines and something I use all the time. Working in a remote team with family members spread across the world means I use search for time conversions constantly. Throwing "1030AM IST in Melbourne" into Google will correctly tell me that it will be 4 PM. Great. Now I know what time Formula 1 kicks off in Melbourne in their local time.
But when I ask Bing, "What time will it be in Melbourne when it is 10:30 AM in Sri Lanka", it understands exactly what I am trying to ask and will even tell me what timezones the two places are in. If I ask it to break down its calculation into steps, it does so correctly: subtract 5 and a half hours from the Sri Lankan time to get to UTC, then add 11 hours to get to Melbourne's timezone of AEDT. But the final answer I get is 7 AM or something ridiculous like that.
Even after "teaching" Bing how to calculate the correct answer again, when I asked it to convert 12 PM IST to AEDT, it just gave absolute nonsense again.
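For contrast, this conversion is trivially deterministic in code. A minimal Python sketch using the standard library's zoneinfo module (the date is arbitrary; March 2023 falls inside Melbourne's daylight saving period, hence AEDT):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# 10:30 AM in Sri Lanka (UTC+5:30), converted to Melbourne (AEDT, UTC+11)
colombo_time = datetime(2023, 3, 24, 10, 30, tzinfo=ZoneInfo("Asia/Colombo"))
melbourne_time = colombo_time.astimezone(ZoneInfo("Australia/Melbourne"))
print(melbourne_time.strftime("%I:%M %p %Z"))  # 04:00 PM AEDT
```

Get the timezone database lookup right once and the answer is always 4 PM, never 7 AM.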
Now I think I know why this is happening. None of these large language models are capable of actual computation. All they do is guess what the next words are going to be, and that does not translate well to math. Math is not probabilistic; it does not change on whims or feelings. At least, the level of math I use doesn't. I'm sure there's some obscure branch of quantum math that is in multiple states depending on whether or not the cat looked at you while you were working on it. But I digress.
On the other side of the tooling, OpenAI's chat running on GPT-3.5 and GPT-4 both got the answers correct. GPT-4 felt sluggish, but it still got the answer right. That doesn't mean it's reliable, of course. When I asked it to convert the time from Sri Lanka, it got the answer correct, i.e., that it would be the same time. GPT-3.5 did throw in a weird nugget saying that the time difference between Sri Lanka and Chennai is +30 minutes. What? Once again, it goes to show that under the hood, there's probably no computation happening.
The point here is that when it comes to mathematical stuff, I cannot expect Bing to work correctly, and that's really sad. It needs to be able to switch over to a more precise method: the language model should identify that it is being asked a question requiring mathematical precision and send the parameters off to an appropriate computational model.
Added note on March 24th just before publishing: this looks like what is going to happen with the new plugin model introduced by OpenAI. They already have a plugin that works with Wolfram Alpha, which will do the actual mathematical heavy lifting. The interface looks clunky in that the plugin has to be selected ahead of time, but I can see this improving in the future, or someone just building a super plugin that redirects to the different layers underneath.
Debugging programming stuff
I'm currently setting up a MicroK8s-based development environment. I had a problem where it needed to pull an image from a private repo on Docker Hub. But even with the local Docker daemon logged in, it looked like MicroK8s didn't know how to use that login information. So I plugged this query into Bing and asked it to solve my problem for me. It did really well!
The only problem was that I forgot to save the search result and closed the tab. So I asked Bing the same question again later. This time around it gave me completely useless, banal stuff. Until I prompted it to give me a MicroK8s configuration to use, it did not seem able to give the correct answer. Even then, the final answer, which I do know how to use to solve my problem, came nowhere near the quality of the answer I got the first time I asked.
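For what it's worth, the standard Kubernetes answer the bots were circling around is an image pull secret. A sketch, not the exact answer either bot gave me; the secret name and credentials below are placeholders:

```shell
# Create a docker-registry secret holding the Docker Hub login
# ("dockerhub-creds" is a placeholder name, pick your own)
microk8s kubectl create secret docker-registry dockerhub-creds \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<your-username> \
  --docker-password=<your-access-token>

# Then reference it from the pod spec so kubelet uses it when pulling:
#   spec:
#     imagePullSecrets:
#       - name: dockerhub-creds
```

This works because Kubernetes pulls images itself and never consults the local Docker daemon's login, which is why being logged in on the host didn't help.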
When I asked ChatGPT for an answer to my query using the GPT-4 model, it did really well. Even so, it suggested a more convoluted way of achieving my goal when a much simpler solution was available.
Determinism and Doubts
At the start of this, I mentioned that I had an opinion already forming about Bing and GPT. I think about my usage of the internet and search engines as they are today. A standard search engine will spit out multiple links at me. I pick a bunch of them, read them, and synthesize the knowledge on my own. This takes time, but once I'm done, I'm actually done. I have a reasonable level of confidence in what I just took in.
To me, the whole point of the new chat models is that I should be able to depend on the chat interface to answer my questions. If it doesn't know the answer, it should be able to say "hey, I don't know this, but here's a source that does". And when I do read an answer within the chat interface, that should be enough for me to form my opinion of the topic I'm researching. I should have a reasonable level of confidence in the topic, the same way I do when I research the more traditional way.
The problem now is that I don't have that confidence. If I get different answers each time I ask the same question, and if I get conflicting answers from time to time, even if it's 1% of the time, it means the chat interface is useless to me. If I want to feel confident that what I know agrees with the sources of information out there, I have to go read them myself. And if I'm going to do that anyway, why use this new chat interface at all? When I ask Bing which Apple Watch series come in 45mm models, can I be sure it's correct? Why not just go to the sources myself if I need to take time to research anyway? It's important to note that underneath the interface that does real-time search and synthesis, the links it uses haven't changed from the normal search engine ranking algorithms used today. I am fully capable of translating my own question into a search query, thank you.
Basically, the table stakes here are that an LLM-based search engine has to do at least as good a job as I do at synthesizing information from the set of links it gets after doing a web search.
Added note on March 24th before publishing: someone on Twitter shared a conversation with ChatGPT where it used a web browsing plugin to get current information and explain why Silicon Valley Bank failed. A few commenters correctly pointed out that while the summary was decent, it got the order of events backwards: in reality, the sale of bonds at a loss triggered the bank run, rather than the bank run triggering the first loss-making sales of bonds. The confusion probably arises from the fact that, in order to stay liquid, SVB had to keep selling assets quickly, and to make a quick sale it had to sell those assets at a loss too. Picking up on this kind of nuance still feels like a faraway goal for the current generation of LLMs.
And finally, will these small errors eventually add up to a death by a thousand papercuts for this whole LLM-as-a-search-engine business? Will it be like the self-driving cars that made leaps and bounds but eventually got stuck, confused by the many little weird things that happen on the street every day?
The next 30 days and after
Like it or not, these LLMs are here to stay. I don't want to ignore them and say "eh, I'll be back when it's good enough". I want to use this technology and find its good sides and bad sides in a way that's intimate, not as something I read from a thousand tweets and Medium articles. First, I want to dive into this model as a search engine and information butler for the next 30 days. At some point I want to put my credit card into OpenAI and start messing about with it at an API level. Who knows what I might get out of it. Maybe I'll be able to build something that can actually compute a steady state timing correctly based on a conversation. Maybe maybe.
This blog doesn't have a comment box. But I'd love to hear any thoughts y'all might have. Send them to [email protected]
Posted on March 24 2023 by Adnan Issadeen