• 2 Posts
  • 106 Comments
Joined 1 year ago
Cake day: July 4th, 2023



  • I think most people use something like exllamav2 or vllm, or use GGUF to do inference, and it seems none of those projects have properly implemented multimodality or this specific model architecture yet.

    You might just be at the forefront of things and there isn’t yet any beaten path you could follow.

    The easiest thing you could do is use something that already exists, for example pre-quantized 4-bit models, wait a few weeks and then upgrade. And you can always quantize models yourself and set the parameters however you like, provided you have an inference framework that supports your model (including the adapters for vision) and offers the quantization levels you’re interested in… (a rough sketch of the do-it-yourself route is below).
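    If you go the do-it-yourself route, here is a minimal sketch of on-the-fly 4-bit quantization with transformers and bitsandbytes. The model id is a placeholder and the settings are just common defaults, not a recommendation for your specific (multimodal) model:

    ```python
    # Minimal sketch: load a transformers-compatible model in 4 bit via bitsandbytes.
    # "some/vision-language-model" is a placeholder, not a real checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                   # quantize weights to 4 bit at load time
        bnb_4bit_quant_type="nf4",           # NormalFloat4 is a common default
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model_id = "some/vision-language-model"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    ```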


  • Well, I’d say there is information in language. That’s kinda the point of it and why we use it. And language is powerful. We can describe and talk about a lot of things. (And it’s an interesting question what cannot be described with language.)

    I don’t think the stochastic parrot thing is a proper debate. It’s just that lots of people don’t know what AI is and what it can and cannot do. And it’s neither easy to understand, nor are the consequences always that obvious.

    Training LLMs involves some clever trickery, like limiting their size, so they can’t just memorize everything but are instead forced to learn the concepts behind those texts.

    I think they form models of the world inside of them. At least of things they’ve learned from the dataset. That’s why they can for example translate text. They have some concept of a cat stored inside of them and can apply that to a different language that uses entirely different characters to name that animal.

    I wouldn’t say they are “tools to learn more aspects about nature”. They aren’t a sensor or something. And they can infer things, but not ‘measure’ things like an X-ray.




  • It’s lemmy.ml. During the API wars on Reddit, lots of people came here and lots of new instances were founded. lemmy.world was part of that and quickly grew into (I think) by far the largest instance. But lemmy.ml is at least two years older and hosted by the actual developers. And due to that history it still hosts some of the large communities.

    Yeah. And “Lemmy self-corrects” is kind of what this post is about (in my opinion). I’d like to see lemmy.world and a few other instances do it now and defederate. That’s how it should be: call out the bullshit, be vocal, and then do something about it. My point is, we’re at phase 1 or 2. Now we’re going to see if Lemmy self-corrects. As of now, it hasn’t.

    I think just hoping for a bright future isn’t cutting it. And if you ask me, all the infighting and defederating each other also isn’t healthy.


  • Well, that paper only says it’s theoretically not possible to completely eliminate hallucination. That doesn’t mean it can’t be mitigated and reduced to the point of insignificance. I think fabricating things is part of creativity; LLMs are supposed to come up with new text. But maybe they’re not really incentivised to differentiate between fact and fiction. They have been trained on fictional content, too. I think the main problem is controlling when to stick close to facts and when to be creative. Sure, I’d agree that we can’t make them infallible. But there’s probably quite some room for improvement. (And I don’t really agree with the premise of the paper that hallucination is caused solely by shortcomings in the training data. It’s an inherent problem of being creative, and of the world also consisting of fiction, opinions and so much more than factual statements… But training data quality and bias also have a severe effect.)

    That paper is interesting. Thanks!

    But I really fail to grasp the diagonal argument. Can we really choose the ground truth function f arbitrarily? Doesn’t that just mean that, given arbitrary realities, there isn’t an LLM that is hallucination-free in all of them? But I don’t really care if there’s a world where 1+1=2 and simultaneously 1+1=3, and that there can’t be an LLM telling the “truth” in that world… I think they need to narrow down “f”. To me, a reality needs to fulfill certain requirements, like being free of contradictions. And they’d need to prove that the Cantor-style argument still applies to that subset of “f”. (I’ve sketched how I read the construction at the end of this comment.)

    And secondly: why does the LLM need to decide between true and false? Can’t it just say “I don’t know”? I think that would ruin their premise, too, because they only look at LLMs that never refuse and always have to commit to an answer.

    I think this is more related to Gödel’s incompleteness theorem, which somehow isn’t mentioned in the paper. I’m not a proper scientist and didn’t fully understand it, so I might be wrong about all of that. But it doesn’t feel correct to me. And the paper hasn’t been cited or peer-reviewed (as of now), so it’s more like their opinion anyway. I’d say (if their maths is correct) they just proved that there can’t be an LLM that knows everything in every possible and impossible world. That doesn’t quite apply, because LLMs that don’t know everything are useful, too. And we’re concerned with one specific reality here, one that has some constraints, like physics, objectivity and consistency.
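    To make my confusion concrete, this is roughly how I read the diagonal construction (my own notation, not necessarily the paper’s):

    ```latex
    % Rough sketch of a Cantor-style diagonal argument as I read it (my notation, uses amsmath).
    % Enumerate all candidate LLMs h_1, h_2, ... and pick one input x_i per model,
    % then define the "ground truth" f adversarially so that it disagrees with each h_i:
    \[
      f(x_i) \neq h_i(x_i) \quad \text{for every } i
      \;\;\Longrightarrow\;\;
      \text{no } h_i \text{ is hallucination-free with respect to this } f.
    \]
    % My question is whether such an adversarially constructed f still counts as a
    % "reality" once you require f to be consistent, physical, etc.
    ```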


  • I get you. But they’re the flagship instance, or at least they used to be. They shape the brand identity of Lemmy as a whole. And that identity is being tankie and having a culture that could be nice, but regularly isn’t. So everyone on the internet concludes Lemmy isn’t really something they’d want to subject themselves to. And if we’re being honest, almost nobody knows the fine nuances of power abuse on specific instances. It’s just “Lemmy” that this gets attributed to.

    Every interaction here represents Lemmy. Some disproportionately so.

    And we’ve established that me leaving (which I’ve done) is not gonna change anything about it. The communities are still amongst the largest, where most of the users are, and they’re also attracting the new users.

    Your argumentation would be perfectly valid if lemmy.ml were some small instance that’s unheard of by most users, or blocked by the rest of the network. We could ignore them then and let them do their own thing, like the Fediverse does with a few nazi and conspiracy instances. But this isn’t the case here.

    Regarding money and doing it “for the fun of it”: that’s not correct. They get money for two or three full-time jobs from the NLNet fund and the EU. They could be having fun, too. But they definitely also get a substantial amount of money for it.

    Concerning the 4chan example: that’s on point. 4chan is the epitome of echo-chamber and incel culture. That’s mainly because there’s no one else there; everyone else left. And now, why would anyone else visit a place like that in the first place? I’d rather not have Lemmy become like that. Would you?


  • And am I supposed to let other people be subject to that, too? Let people like that drag down Lemmy as a whole? Shouldn’t I have a nice and welcoming place on the internet for me and my friends?

    Do you like echo chambers? If you want my perspective: I have until now recommended Lemmy to exactly zero of my friends, because of things like this. Lemmy has quite some potential. But it just has so many issues to tackle, and the culture here just isn’t what appeals to “normal” people. If other people share my experience, that’s exactly why Lemmy is still below 50k active users and super small.

    Sure. I moved away from the .ml communities a few weeks ago because I think it’s the right thing to do (for me). It’s just dragging down everyone and making Lemmy a worse place. Like we see constantly with all the posts like this. Should we (the people who want more than an echo chamber, and want fair and honest discussions) all abandon Lemmy?



  • I don’t think so. It’s a bit like being bullied while your friends are being bullied, too. What do you do? Leave the room and be happy they bully your friends and not you? Keep silent, which ultimately enables them? No. You’re vocal about it. You warn your friends not to go in there. And you try to do something about it. In the end it’s the bullies who should leave, not the nice people. Otherwise the whole place is doomed and just gets worse.


  • My first idea would be to have users report posts, then ping a random sample of, say, 20 active and currently online users of the community and have them decide (democratically), roughly like the sketch below. That would prevent brigading and groups collectively mobbing or harassing other users. It’d be somewhat similar to a jury in court. And we obviously can’t ask everyone, because that takes too much time, and sometimes content needs to be moderated asap.
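    Something like this, as a rough sketch; the objects and methods are all made up for illustration, none of this is a real Lemmy API:

    ```python
    # Hypothetical sketch of the "random jury" moderation idea.
    # community, user and post objects are invented for illustration only.
    import random

    JURY_SIZE = 20            # how many users get pinged per report
    APPROVAL_THRESHOLD = 0.5  # simple majority decides

    def handle_report(reported_post, community):
        # Only consider users who are active in the community and currently online,
        # so decisions come from people who actually know the place.
        eligible = [u for u in community.members if u.is_active and u.is_online]
        if not eligible:
            return  # nobody available, fall back to regular moderation

        # Random selection makes it hard for a coordinated group to stack the vote.
        jury = random.sample(eligible, min(JURY_SIZE, len(eligible)))

        # Ping each juror and collect a yes/no vote on removal.
        votes = [juror.ask_to_review(reported_post) for juror in jury]

        if sum(votes) / len(votes) > APPROVAL_THRESHOLD:
            reported_post.remove(reason="community jury decision")
    ```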


  • You’re 100% right, OP. Don’t let people tell you it’s a you-problem and that you should leave. It’s exactly like you said (in my opinion). If anyone should leave at all, it’s the bad people. Not the nice ones and the ones calling out the bullshit.

    Nothing changes if the decent people keep silent and let bigotry or whatever just happen. It just makes the whole place worse. And I’d say it’s warranted to speak up or do something. And as far as I’ve heard, you’re not the only one complaining.





  • I’m pretty sure he did this out of his own motivation, because he thinks/thought it’s a fascinating topic. So, sure, this doesn’t align with popularity. But it’s remarkable anyway, you’re right. And I always like to watch the progression. As far as I remember, the early videos lacked the professional audio and video standards that are nowadays the norm on YouTube. At some point he must have bought better equipment, but his content has been compelling since the start of his YouTube ‘career’. 😊

    And I quite like the science content on Youtube. There are lots of people making really good videos, both from professional video producers and also from scientists (or hobbyists) who just share their insight and interesting perspective.




  • Yeah, it doesn’t really work. I mean, it has a rough idea that it needs to go east. And I’m surprised that it knows which interstates are in an area, and even a few street names in the cities. I’m really surprised. But I told it to get me from Houston to Montgomery, as in your example (a sketch of how you could reproduce that is at the end of this comment). And in Houston it just lists random street names that aren’t even connected and lie in different parts of the city. Then it drives north on the I-45 and somehow ends up in the south on the I-610-E and finally the I-10-E. But then it makes up some shit, somehow drives to New Orleans, then a bit back, and zig-zags its way back onto the I-10. Then come some more instructions I didn’t fact-check, and it gets that it needs to go through Mobile and then north on the I-65.

    I’ve tested ChatGPT on Germany, too. It also gets which Autobahn is connected to the next. It still does occasional zig-zags, and in between it likes to do an entire 50 km (30 mile) loop that ends up two cities back from where it came… then it drives east again and on the second try takes a different exit.

    However: I’m really surprised by the level of spatial awareness. I wouldn’t have expected it to come up with mostly correct cardinal directions and interstates that are actually connected and run through the mentioned cities, and to name cities in between.

    I don’t think I need to try “phi”. Small models have very limited knowledge stored inside of them. They’re too small to remember lots of things.

    So, you were right. Consider me impressed. But I don’t think there is a real-world application for this unless your car has a teleporter built in to deal with the inconsistencies.
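    If you want to reproduce the test yourself, a minimal sketch with the OpenAI Python client could look like this; the model name and the exact prompt wording are assumptions, not what I literally typed:

    ```python
    # Hypothetical reproduction of the directions test; model name and prompt are placeholders.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; pick whatever model you want to test
        messages=[
            {
                "role": "user",
                "content": "Give me turn-by-turn driving directions "
                           "from Houston, TX to Montgomery, AL. "
                           "Name the streets, interstates and exits.",
            }
        ],
    )
    print(response.choices[0].message.content)
    ```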