No "Zero-Shot" Without Exponential Data (arxiv.org)
187 points by zerojames on May 9, 2024 | 118 comments


This feels like the worst possible outcome of the current AI hype.

We've essentially been ripping off the entire internet and feeding it to the models already, spending many billions of dollars in the process. It's pretty much the largest possible dataset you can currently get, and, due to the ever-increasing and now rapidly accelerating AI poisoning of the internet, most likely the largest possible dataset which will ever exist.

All that and all we're getting out of it is not-entirely-useless but still quite crappy AI? We would've been better off if we had never done this.


> not-entirely-useless but still quite crappy

15 months ago, general-purpose LLMs that have not been specifically trained on legal reasoning could score better than 90% of humans on the multistate bar exam, and these are humans who actually completed law school.

General-purpose LLMs get similar results in medicine, and when the models are fine-tuned for medical diagnosis they're even better.

And that was more than a year ago. Those who have seen current models not yet released tell us they'll make the current state of the art look like toys.

Progress is still tracking the steep part of the S-curve, and there's no indication that they're near the top yet.


> Progress is still tracking the steep part of the S-curve, and there's no indication that they're near the top yet.

If I understand it correctly, that seems to be exactly what this paper is suggesting.

Scoring high on the bar exam is pretty trivial for AI - the data needed for that is fairly generic and widely available on the internet. It requires you to demonstrate a relatively basic understanding of the concepts by answering a bunch of multiple-choice questions. If anything, I'd expect AI to have a perfect score.

Like I said, such an AI is not entirely useless. You can replace quite a few legal assistants with that, and I bet it could be used to create first drafts or to expand a core concept into a full legal argument. There is plenty of money to be made there, and it's going to make an awful lot of people jobless. But that's just replacing more-trivial jobs with automation, it doesn't add anything novel to society.

On the other hand, the actual difficult work involves being able to come up with completely novel concepts, and being able to expand upon some obscure but crucial stuff few people have ever heard about. Current models simply aren't capable of that, and the results achieved here with multimodal models suggest that they never will be. We risk getting stuck with models which can do some trivial work, but silently produce complete garbage when you ask them to do anything providing substantial value.


what is novel?

what if you couple a random text generator with a picture generator, isn't each new picture novel?

if we could send all artists to work in the healthcare mines we might discover new treatment for various ailments. is that not novel? is that too indirect?

increasing "economic surplus with externalities factored in" is good for society even if it's not novel.

arguably philosophers come up with novel stuff all the time, and ...

> Current models simply aren't capable of that, and the results achieved here with multimodal models suggest that they never will be.

can you elaborate on this a bit please?


> if we could send all artists to work in the healthcare mines we might discover new treatment for various ailments. is that not novel? is that too indirect?

You do realize that artists are artists because they want to be, and even if AI can create art, artists will still be artists, right? They won't just be like, "Oh, AI is good at generating art now? Whelp time to go become a doctor."


Wait, you're saying people's brains aren't fungible compute cycles? Casually discussing people's lives as top-down resource allocation problems is one of the things I hate most about how "tech optimists" discuss this.


I think this says more about the benchmark than the capabilities of the model. If it were the case that 90th-percentile performance on the bar exam meant that a model was a 90th-percentile lawyer, and we've had these models for 15 months (in fact longer), where are all the LLM lawyers? The lesson here is that a test designed for humans may not be equally representative of capabilities when given to an LLM.


nah, it's probably an okay-ish benchmark for humans, but it's just that, a filter to weed out those who can't learn hundreds of pages of legal trivia.

the model is great at this, because the training set is full of this stuff. the questions are static and simple. (the actual text of the questions changes, of course, but the format and the answers are from a fixed set. and the LLM doesn't need to generate text, basically, just one token: A, B, C or D ... that said, I'm curious how it's administered to the LLMs and how much that influences their performance.)
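For what it's worth, a common way to administer this kind of multiple-choice question is to compare the model's next-token log-probabilities for the option letters and take the argmax. A minimal sketch below (gpt2 is just a stand-in model and the prompt is a placeholder; I don't know which method the published bar-exam evaluations actually used):

    # Sketch: score a multiple-choice question by comparing the model's
    # log-probability for each option letter. Model and prompt are illustrative.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = ("Question: <exam question text>\n"
              "A) ...\nB) ...\nC) ...\nD) ...\n"
              "Answer:")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)

    scores = {}
    for letter in ["A", "B", "C", "D"]:
        # the leading space matters for GPT-2-style BPE tokenizers
        token_id = tokenizer(" " + letter).input_ids[0]
        scores[letter] = log_probs[token_id].item()
    print(max(scores, key=scores.get))  # the model's preferred option

The exact prompt format (and whether you score letters, full answer strings, or let the model free-generate) can noticeably shift the measured accuracy, which is part of why these benchmark numbers are hard to compare.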


It's not super surprising that LLMs perform well on standardized tests, given that they have a lot of standardized test related text in their training data. There are a lot of claims out there about the zero-shot ability of LLMs, and very little specific research to back it up. Until now that is.


This paper is about CLIP not LLMs and does not generalize to LLM architectures.


Why not?


> 15 months ago, general-purpose LLMs that have not been specifically trained on legal reasoning could score better than 90% of humans on the multistate bar exam

The claims of similar performance on coding problems were shown to be due to contamination of the training data with the tested problems. It did abysmally on problems made public after the model's training cutoff.

I don’t think anyone has tested contamination for the MBE claims, but I would lean toward assuming the same issue exists for that assessment until proven otherwise.


You can see similar (poor) results if you give it a slightly tweaked puzzle that it’s seen before. The most recent example on Twitter was the farmer with the animals and the boat. If there are no constraints, it will give you an answer based on the tricks for the original unless you harass it.


Quite crappy? It seems to me like the current SOTA is working plenty good enough for most use cases and like the highest impact (practical) way to improve right now is going to be advancements in domain knowledge acquisition and retention.


I sure haven't seen any of those "plenty good" results yet - current SOTA seems to be about as useful as semi-coherently gluing together random Google results. Good enough to perhaps replace a minimum-wage worker, not good enough to provide actual value when you care about the quality of the result.

This paper seems to suggest that significant advancement in domain knowledge acquisition and retention is exactly the problem, as you seem to need exponentially more data due to a lack of generalization. What's the point of a model which can perfectly quote Shakespeare if you're a programmer trying to refactor a proprietary codebase and it fails to make a link to whatever garbage it picked up from StackOverflow?


In CLIP.

Replacing CLIP with an LLM is the current meta in image generation models, specifically because of that lack of generalisation. This isn’t a surprise to anyone.


I'm getting a heck of a lot of useful work out of these so-called useless models


Even if LLMs (pre-trained transformers) turn out to be a dead end as far as AGI goes, there are productivity applications for them, and perhaps just as importantly interesting insights/confirmations about how the mind works and directions for future AGI research/architectures.

The use cases for LLMs will no doubt grow as hallucinations are reduced and they gain planning/reasoning ability over the next couple of years. It'll be interesting to see what they subjectively "feel" like with these improvements.


Oh absolutely, LLMs are already causing a slaughter in applications where quality doesn't really matter to the company, like customer service. With minor improvements they're going to be a serious problem for any junior developer/lawyer/journalist/reviewer/artist/whatever, and if they ever fix the hallucinations issue it's game-changing.

On the other hand, it's still a big "if" whether a general hallucinations solution exists, and in the meantime we're paying a pretty high price for it, as the entire internet is being flooded with absolute garbage. We risk getting stuck in a situation where there is no way to get new senior people because nobody hired them as juniors because AI is cheaper and good enough, and senior people are getting less and less productive due to them being unable to use websites like StackOverflow as reference material. That's a pretty high price for a tiny gain, if you ask me.


One less potential competitor employee we have to worry about in the future economy. Thank you for your service.


The training sets used for current models, even the largest ones, are nowhere even close to "the entire Internet". Some napkin math based on known sizes of public datasets says they amount to less than 1%.


> It's pretty much the largest possible dataset you can currently get, [...]

No. What you describe encompasses only one poor modality: text. We have oodles more data, and can create almost arbitrary amounts more, by just eg pointing webcams at the world.

> We would've been better off if we had never done this.

Who is 'we'?


Quite a few people saw this coming.

It is still too early to tell whether we've reached an AI winter again or not, but at least we can see that the news is slowing down.


I think that depends what your expectations are, and what you mean by another AI winter.

We're just scratching the surface of what's possible with the current state of the art. Even if there are no major advances or breakthroughs in the near future, LLMs and associated technologies are already useful in many use cases. Or close enough to useful that engineering rather than science will be sufficient to overcome many (though not all) of the shortcomings of current AI models. Never mind grandiose claims about AGI; there's enough utility to be gotten out of the limited LLMs that we have today to keep engineers and entrepreneurs busy for years to come.


You're right, and I suspect the GP would agree with you about there being real engineering applications for LLMs, diffusion models, etc

But I think the term "AI Winter" usually refers to the underlying research programme and the economics around it. Soaking up many billions of dollars of industry money and public grants on the pitch that AGI might be just around the corner, and then being unable to deliver on that pitch can induce a hangover effect that makes it much harder to raise money for anything that smells even a little bit like the failed pitch. Investors and administrators feel burnt and turn very skeptical for a very long time.

Meanwhile, the actual productive applications which shook out of the initial boom just get renamed to something else so that they don't carry that smell.

We'll see how it goes here, but that's the familiar road and where the terminology comes from.


The minor irony of this is that current efforts at AGI are focused on the data scaling laws, which have shown no signs of slowing, but if funding dries up these crazy expensive training runs won't be allowed anymore.


"AI winter" was a political phenomenon that happened because the AI applications didn't hold water to the hype, and all of the gullible people that invested on them got burned out by the disparity.

We are very clearly on that same path again, which leads to the conclusion that another winter is coming. But even the fact that people are talking about it is evidence it's not here yet, and as always with political phenomena, there's no guarantee history will repeat.

Anyway, none of it means people will stop applying their knowledge or studying AI. The entire thing happens on funding and PR, and the world is not entirely controlled by those two.


Except contemporary AI is useful on a day to day basis.


Being useful doesn't mean it met expectations.


So were the developments before the last AI winter...


The whole point of past AI winters is that genuinely useful technologies were preposterously overhyped as representing "human cognition," despite overwhelming evidence to the contrary. The bubble bursts - and the money evaporates - because expectations come crashing down, not because the usefulness was a mirage. Lisp was hardly the key to human symbolic thought like some hoped it would be, but it helped improve programming languages across the board. Perceptrons really are quite limited and there's absolutely no way you could emulate a human brain with them alone, but they're still essential for more advanced ML.

But you would think after 70 years AI practitioners would learn some humility! It is very obvious that GPT-4 is dumber than a honeybee, let alone a cat, let alone a crow. But for over a year I've heard dozens of tech folks insist it's smarter than most people.


> But you would think after 70 years AI practitioners would learn some humility!

Why? The people doing AI now are not the same as those that did AI 60 years ago.


I think we're in the beginning of a warm winter.

The funding is there but we're letting hype drive everything and not calling out the con artists.

The problem is we've had recent great success but still don't know how to get to AGI. But because we're afraid of winter we're not willing to try new things. We want to only compare to SOTA and think it's fair to compare a new method with a handful of papers to the status quo. That's not how the S-curves of technology work. Sure, maybe things don't scale but that doesn't mean they don't have merit or can't scale if someone finds some modification.

The problem is we treat research like products, not academic work. You need to produce everything from hard-core theory to robust products to have an effective chain. But we seem hyper-focused on the middle area. And for some reason people think products can be made by placing a nice interface around research code. There's still a lot of work you need to do and all those models can be optimized. They absolutely do not have optimal hyperparameters or even parameters.


I hope this means that people actually start curating their training sets. The quality control is horrible on all of the datasets I've looked through. Especially image captions.

Yes sheer quantity has a quality of its own but that won't produce optimal results.


From a very cynical marketing/VC-pitch perspective, un-curated datasets full of random crap have the benefit of sometimes producing totally surprising, unplanned "features", which helps sell the idea that one-size-fits-all black-box machine learning can solve any problem.


RAG seems to be all the rage. Not to mention the quest for cooking up the correct cocktail of smaller MoE/ensemble models, and ... there's decades' worth of optimization work ahead (a few years of it seems to be already VC and edu grants funded), no?


There's tons of optimization work left, but by its nature optimization tends to push the limits of what we currently have, and rarely allows for substantial improvements.

There are some major limitations to LLMs that aren't going to be "optimized" away. At the end of the day LLMs are Monte Carlo samplers over a latent, compressed representation of existing human text data. Many of the tasks people hope LLMs will achieve require major leaps in our current understanding of language modeling.

One great example of a limitation (which shocks me sometimes when I think about it): generating output is still ultimately stuck looking at the probability of the next token rather than the much more useful probability of the generated statement. There are techniques to improve this, but we're missing a major piece: generating highly probable statements, with no hint about how to really get there. Consider how you might write SQL. You conceive of the high-level query first and start sketching out the pieces. An LLM can only look at each token and can't, statistically speaking, think in terms of the entire query.
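To make that concrete, here's a toy sketch (gpt2 as a stand-in for any autoregressive LM, and the SQL strings are placeholders): scoring a finished statement is cheap, because its log-probability is just the sum of per-token conditionals, but generation never gets to search over whole statements, only over the next token.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def statement_log_prob(text: str) -> float:
        # log P(text) = sum over t of log P(token_t | tokens_<t)
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for positions 1..n-1
        per_token = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
        return per_token.sum().item()

    print(statement_log_prob("SELECT name FROM users WHERE id = 1;"))
    print(statement_log_prob("SELECT FROM name users WHERE = id 1;"))  # scrambled, should score lower

Evaluating P(statement) after the fact is easy; what's missing is a tractable way to generate by optimizing it, rather than greedily committing to one token at a time (beam search only approximates this over short horizons).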

Personally I think LLMs are very underutilized/underexploited for what they are good at, and there is way too much focus on what they can't do. Hopefully we'll dodge an AI winter by using LLMs to solve the wide range of classical NLP problems, making many tasks that were nearly impossible just a few years ago rather simple today. Unfortunately the irrational hype around these models makes me skeptical of that scenario.


The issue with predicting the whole next statement is the combinatorial explosion, which is infeasible to model with historical frequencies;

i.e., P(mat | the cat sat on the) is a distribution over, say, 100k words, whereas P(the cat sat | on the mat) is a distribution over 100k^3 word sequences.

Part of the illusion of an LLM is that we produce text in such a highly regular way that a mere distribution over 100k, iterated, say, 500 times, gives you a page of text as if it were modelling 100k^500.
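For reference, the standard autoregressive factorization behind this (just a math aside, not from the paper):

    P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})

Each factor is a ~100k-way distribution, so 500 sampling steps span a space of 100k^500 possible pages even though the model only ever parameterizes one conditional at a time.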


> An LLM can only look at each token and can't, statistically speaking, think in terms of the entire query.

There's always some space in high-dimensions to put that extra context somewhere. So yes, only one token is predicted, but there's a lot of information squeezed into the "condition" (as in conditional probability, as mjburgess' comment shows)

> [hype instead of NLP]

yes, exactly. we'll see what remains after the bubble bursts.


RAG takes the current limits of LLMs and focuses them on specific problems using custom data. It's not exactly magic, it just finds a way to produce something tangible and usable from an otherwise broadly focused model.

It's all the rage because it's a way to practically _do work_ with LLMs that generally provide wow from conversationally accurate, often factually accurate responses.


I think the grandparent comment is about AI research, driven by the quest for AGI.

Incremental improvement of proven approaches, driven by profit motive, will surely continue regardless of whether there is an AI winter or not.


Indeed, but there's probably a very real funding and attention (heh) issue; if there's a ton of both then there will be progress. But. Usually the bigger the hype, the more progress is expected, so the faster the fund-tention will dissipate.


I'm not even going to discuss whether or not things are actually slowing down since I disagree with that.

But I do feel confident that an AI winter is not on the horizon solely due to the overhead of implementation that we currently have. Just with currently existing AI, it would take years for the economy to fully leverage the abilities that are available. I'm confident we have transformative AI. So it won't feel like a winter for several years as we actually succeed in optimizing, implementing, and productizing current technology in other industries.


If there are no breakthroughs and funding dries up in the near future, it will feel to many like going off a precipice at high speed.

Only the poor souls who survive the fall, at the very bottom of the precipice, will get to experience the AI winter.


Dunno exactly what you mean by ai winter, but the whole gen AI thing has kind of got all the attention and even if that route slows down there are a ton of other branches fruitful for development.


The issue, I think, is that if results don't follow expectations, given the very high costs, investors might get cold feet.

Sure we already have some interesting applications, but they are not exactly printing money.


Commercial adoption has only started to ramp up. We are going to see a floodwave of LLM-powered bots/UX-wizards.

Even if academic progress stalls, we're going to be inundated for at least another few years.


Definitely not an AI winter, since the tools are so useful. People who think AGI is right around the corner might be disappointed though, since we can't really even accurately define AGI.


> reached AI winter again

Again?


This was not the technology industry's first encounter with AI hype. The term was coined 40 years ago, and has been suggested as a description for almost a dozen periods in the field's history:

https://en.wikipedia.org/wiki/AI_winter


It's a term for when AI hype dries up, has happened multiple times since the 60s[0].

[0]: https://en.wikipedia.org/wiki/AI_winter


Uh, what?


I've always been rather upset that it's fairly common to train on things like LAION or COCO and then "zero shot" test on ImageNet. Zero shot doesn't mean a held out set, it means disjoint classes. You can't train on all the animals in the zoo with sentences and then be surprised your model knows zebras. You need to train on horses and test on zebras.


How would the model know what a zebra was if it had never seen one? The same is true for humans.


When I was little, zebras were described to me as black and white striped horses. Without even seeing one, I'm sure anyone who has seen a horse could then merge those two concepts to create a close-to-accurate picture of what a zebra is.

If AI is supposed to resemble a human mind with the ability to learn, then it must be able to learn from a blanker slate. You don't teach the human before it is born, and in this comparison an AI is born when you finish its model and set its weights using the training set. If you test it with the training set, you aren't testing its ability to comprehend, just to regurgitate what it was born with.


You can look at a medieval bestiary to see how people thought animals might look based on descriptions alone. Like these lovely elephants

https://britishlibrary.typepad.co.uk/digitisedmanuscripts/20...


Honestly, these are some of my favorite stories and I think more ML people need to learn more about mythology (I say this as an ML researcher btw). Because once you go down this path you start to understand how "Rhino" == "Unicorn". You have to really think about how to explain things when you're working with a limited language. Yeah, we have the word "rhino" now, but how would you describe one to someone who has no concept of this? Maybe a cow with one big horn? Is "like a big fat tough-skinned horse with a big horn coming out of its head" accurate? And then apply your classic game of telephone[0]. It is also how you get things like how in Chinese a giraffe is "long neck deer"[1] (that doesn't work for all things in Chinese, and there's another game of telephone there (lol, maybe I was too harsh on the British in [0]), and well... you can imagine things get warped like crazy).

There's so many rabbit holes to go down when trying to understand language, vision, reasoning, and all that stuff.

[0] (Jesus England... this is what you call this game?!) https://en.wikipedia.org/wiki/Chinese_whispers

[1] https://translate.google.com/?sl=en&tl=zh-CN&text=giraffe&op... ----> https://translate.google.com/?sl=zh-CN&tl=en&text=%E9%95%BF%...


If you trained an image generator, removing all instances of zebras from the training set, you could ask it to output images of a black and white striped horse and it would likely succeed. Then you could fine tune an image recognition model (also with all zebras removed from the training set) on the generated image set to associate it with the word zebra. If you then showed it a bunch of images of actual zebras there’s a really good chance it would succeed.


AI is not supposed to resemble a human mind. It's just supposed to be useful to us.


Great question!

It depends on the zero-shot experiment. Let's look at two simple examples

Example 1:

We train a classifier that classifies several animals (and maybe other things). For example, you can use the classic CIFAR-10 dataset which has labels: airplane, automobile, bird, cat, deer, dog, frog, __horse__, ship, truck. The reason I underlined horse is because you want your model to classify the zebras as horses!

The reason this is useful is for measuring the ability to generalize. At least in our human thinking framework we'd place a zebra in that bin because it is the most similar (and deer should be the most common "error"). This can help us understand the network and we'll be pretty certain that the network is learning the key concepts of a horse when trying to classify horses rather than things like textures, colors, or background elements. If it frequently picks ships your network is probably focusing on textures (IIRC CIFAR has ships with the Dazzle Camo[0] and that's why I threw "ship" out there).

Example 2:

Let's say we train our network on __text__. In this case it can get any description of a zebra that it wants. In fact, you'd probably want to have a description of what it looks like!

Then what we might do is take that trained text network and attach it to a vision classifier. For simplicity, let's say that was trained on CIFAR-10 again. We then tune our LM + CV model so that it can match the labels of CIFAR-10 (basically you're tuning to ensure the networks build a communication path, otherwise it won't work). Here we end up testing our model's actual understanding of the zebra concept. It again should pick horse as the likely class, because you've presumably had in the training text some description that compares zebras to horses.
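As a rough, off-the-shelf illustration of this kind of text-vs-image matching (a sketch using a pretrained CLIP checkpoint; the model name, label prompts, and image path are placeholders, and this is the contrastive CLIP setup rather than literally the LM + CV tuning described above): classify an image by comparing it against text embeddings of candidate labels, none of which is "zebra".

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # candidate labels as text; "zebra" is deliberately absent
    labels = ["a photo of a horse", "a photo of a deer",
              "a photo of a ship", "a photo of a truck"]
    image = Image.open("zebra.jpg")  # hypothetical image of the held-out class

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-text similarity -> distribution
    print(dict(zip(labels, probs[0].tolist())))  # we'd hope "horse" gets most of the mass

Whether "horse" actually wins, and by how much, is exactly the kind of generalization question the paper is probing.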

-----

So really the framework of zero-shot (and few-shot) is a bit different. We're actually more concerned about clustering, and you should treat them more like clustering algorithms. n-shot frameworks really come from the subfield of metalearning (focusing on learning how networks learn). But as you can imagine, these concepts are pretty abstract, but hey, so are humans (that's why we see a log as a chair and will situationally classify it as such, but let's save the discussion of embodiment for another time).

In either example I think you can probably see how a toddler could do similar tasks. You can ask which of those things the zebra is most similar to and you'd be testing the toddler's visual reasoning. The text one might need to be a little older, but it could be a great way to test a child's reading comprehension. Does this make sense? Of course machines are different and we need to be careful with these analyses (which is why I rage against just comparing scores/benchmarks; these mean very little), because the machines may be seeing and interpreting things differently than us. So really the desired outcome depends on whether you're testing for what the machine knows/understands (you need to do way more than what we discussed above) or whether you are training a machine to think more like a human (then we can rely on pretty much exactly what we discussed).

Hope this makes more sense.

[0] https://en.wikipedia.org/wiki/Dazzle_camouflage


Should we also try to teach children geometry and test them on calculus?


This illustrates two ways of teaching

I’ve experienced both, each at a different university

In one, professors would teach one thing then ask very different (and much harder) questions on tests

In the other, tests were more of a recap of the material up to that point

I definitely learned a lot more in the second case and was a lot more motivated. It also required more effort from the professors

The two methods also test different things. The recap one tests effort and dedication, if you do the work, you get the grade

The difficult tests measure either luck and/or creativity and problem solving under pressure. It’s not about doing the work, it’s about either being lucky or good at testing


> it’s about either being lucky or good at testing

I think you are misunderstanding the experience.

The first (harder questions) is testing your understanding of the material and problem. Can you apply the material to solve a novel problem? Do you understand the material, not just the mechanics? Do you understand how it would interrelate with other problems? Do you understand the limitations?

The second is just regurgitation. This is great for rote skills, but this isn't really learning. This is grinding until you can reproduce. These are the kinds of skills that are easily automated. This is not what we should be testing our kids.


Fwiw, I think both of you are on the same page.

And yes, to bring back to ML it is the difference of generalization and memorization (compression). I wrote a longer response to a different response to my initial comment to help clarify because I think this chain is a bit obtuse and aggressive for no reason :/ (I mean you can check the Wiki page to verify what I said)


For centuries we only taught children geometry and one of them invented calculus.


Now that’s setting a high bar. If AI could reliably invent calculus, then I’d be briefly impressed and then terrified.


One AI in a couple hundred years might be able to do it by luck?


I think the authors are responding to a claim that AI is doing this: look, we taught them geometry, and now they know calculus! GP is saying that it's not a true zero-shot to have a separate test and training set, because the classes overlap. Similarly, the authors are saying "true zero-shot" is basically not happening, at least not nearly to the extent some are claiming.

So everyone here, including TFA, are all kinda doubting the same claim (our AI models can perform zero-shot generalizations) in different ways, I think?


I think you're being overly obtuse, and I'd request you try to work in good faith. If you doubt what I claimed, you can quickly verify on the wiki page[0].

In essence you aren't wrong, but that's not what we'd typically do in a zero (or few) shot setting. We'd be focusing on things that are more similar. If you want to understand a bit better what we might do in an ML context, I wrote more here[1].

And I like Nico's comment about how different professors test. Because it makes you think about what is actually being tested. Are you being tested on memorization or generalization? You can argue both these kinds of tests are testing "if you learned the material", but we'd understand that these two types of tests are fundamentally different and, let's be real, are not reasonably fair to compare scores to. I'm sure many of us have experienced this where someone that gets a C in professor A's class likely learned more than someone who got an A in professor B's class. The thing is that the nuance is incredibly important here to really understand. And you can trivialize anything, but be careful when doing so, you may overlook the most important things ;)

Now... you could make this argument about geometry -> calculus if we're not talking about the typical geometry (single) class most people will have taken in either middle school or high school. Because yes, at the end of the day there is a geometric interpretation and we have the Riemann sum. But we'd need to ensure those kids have the understanding of infinities (which aren't numbers btw). We'd have to be pretty careful about formulating this kind of test if we're going to take away any useful information from it. Though the naive version might give us clues about information leakage (in our case with children this might be "child who has a parent that's a mathematician" or something along those lines). It really all depends on what the question behind the test is. Scores only mean things when we have nuanced clear understandings of what we're measuring (so again, tread carefully because "here be dragons" and you're likely to get burned before, or even without, knowing it)

And truth be told, we actually do this a bit. There's a reason you take geometry before calculus. Because the skills build up. But you're right that they don't generalize.

[0] https://en.wikipedia.org/wiki/Zero-shot_learning

[1] https://news.ycombinator.com/item?id=40313501


My long-standing observation has been that while nature may abhor a vacuum, she also really, really loves sigmoids.

That performance vs. training data is not linear, but logarithmic, doesn't exactly come as a surprise.


The question that remains unanswered is: is the logarithmic performance improvement the result of better sampling of the underlying distribution over time, or of just doing more training with slight variations to effectively regularize the model so it generalizes better? If it's the former, that indicates that we could achieve small models that are every bit as smart as large ones in limited domains, and if that's the case, it radically changes the landscape of what an optimal model architecture is.

I suspect from the success of Phi3 that it is in fact the former.


Every exponential is really an s curve?


in physical systems, it’s very often the case!


Pretty much always. One exception might be the expansion of the universe


Symmetries in Nature strike again! It's just like in Noether's theorem.


Noether's theorem has nothing to do with S-curves.


The physical world has limits, that's why sigmoids everywhere.


Computerphile just did a video on this and it's a pretty good summary: https://www.youtube.com/watch?v=dDUC-LqVrPU


Can one upweight the known-rare concepts in the training set? But then which are rare and how should we identify the rare ones worth knowing?

ML is funny. It's predictably going to run into the Education field.


The CLIP plot (Fig. 2) is damning; however, some of the generative models show flat responses in Fig. 3 (e.g. Adobe GigaGAN, DALL-E-mini). Those are on the one hand technically linear relationships, but they are also exactly what we'd want: an image-generation aesthetic score that doesn't care about concept frequency. Maybe the issue is with the contrastive training target used in CLIP?


My biggest worry is the idea of generating data as training data. We're obviously already unwittingly doing this, but once someone decides to augment low-volume segments of the dataset with generative input, we're going to start getting some really crappy feedback loops.


We are already doing this. In fact, for the next generation of frontier models the primary electricity cost is running inference to generate training data for these models. As Llama 3 has shown us, scale of training data is more important than size of model.


Do you ever have novel ideas while walking or in the shower?

Well, you’re learning from data you generate.


https://arxiv.org/pdf/2305.17493

There's some cursory indication that in the long tail, training LLMs on LLM-generated data causes model collapse. Kind of like how if you photocopy a photocopy too many times the document becomes unreadable.

This isn't really surprising though. Neural networks at large are a form of lossy compression. You can't do lossy compression on artifacts recovered from lossy compression too many times. The losses stack.
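Here's a toy numerical sketch of that photocopy effect (not from the linked paper, and the "model" is just the empirical distribution, i.e. bootstrap resampling, so it's a caricature of real training): rare tail values that fail to get resampled in one generation are gone for good.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, 1_000)  # generation 0: "real" data

    for generation in range(1, 31):
        # each generation trains only on output sampled from the previous one
        data = rng.choice(data, size=data.size, replace=True)
        if generation % 10 == 0:
            print(f"gen {generation:2d}: distinct values = {np.unique(data).size}, "
                  f"largest |x| = {np.abs(data).max():.2f}")

The count of distinct values and the extremes can only shrink over generations; real models smooth things out, but the long tail is the first casualty, which is the mechanism the linked paper describes.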


> Well, you’re learning from data you generate.

Sure. I'm producing human output from human input in a generally unconstrained, limitless way.

This is producing approximated human output from approximated human input. That second level of abstraction will be constrained, ultimately, by the limits of input.


I don't necessarily think it is the equivalent. Maybe it's more akin to a high-schooler reading a single book and then being asked to write several new books that hold equal weight in teaching the next generation of children.

These new textbooks could be great at simplifying the subject matter and making the material accessible, or the author may just never have fully understood the material, making them misleading.

Now imagine that over and over again; imo it's pretty likely to introduce inaccuracies if just taking a naive approach.


This is correct. To give more synthetic data examples:

1. Generate inverted problems which are easier to produce than solve. For instance, create an integration math exercise by differentiating a hairy function and reversing the steps (see the sketch after this list).

2. Create (simulated) environmental data.

3. Use adversarial model competition, e.g. self-playing chess or training an artificial image generator/detector model pair.
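A minimal sketch of item 1 above, assuming sympy is available (the building blocks and seed are arbitrary placeholders): the generator does the easy direction (differentiation), and the hard direction (integration) becomes the exercise, with the answer known by construction.

    import random
    import sympy as sp

    x = sp.symbols("x")
    blocks = [sp.sin(x), sp.exp(x), sp.log(x + 1), x**3, sp.sqrt(x + 1)]

    def integration_exercise(seed: int):
        random.seed(seed)
        answer = random.choice(blocks) * random.choice(blocks)  # pick an antiderivative (easy)
        integrand = sp.simplify(sp.diff(answer, x))             # differentiate it (still easy)
        return integrand, answer                                 # integrating back is the hard part

    integrand, answer = integration_exercise(42)
    print(f"Exercise: integrate {integrand} with respect to x")
    print(f"Answer (up to a constant): {answer}")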

There's evidently a commonplace myth that information quality starts pristine and exclusively gets degraded by systems thereafter. That's just easily, demonstrably false in myriad ways. That's why it's absurd to conclude that LLM output being in internet training data will cause model collapse.

Information-rich synthetic data can be created without humans, and it works. (Check out Phi, for instance.)


Well put.

Also, ever read your old journals? You are training on generated data.


Ever dream? It’s the same thing.

Of course we still need real world data, but it seems like generated data should also play a role. Humans don’t weight dreams equally with reality, however, and that’s a distinction I feel is missing here.


Even hear an idea from another person? You just trained one model with another.


Shouldn't there be enough known-good training content that we can use it to determine if a document is worth including in a training set?


There aren't enough human workers to validate that at scale.

If you want ML to do it... well that's a bit of a catch-22. How would an ML algorithm know if data is good enough to be trained on unless it has already been trained on that data?


The way humans do it is via curiosity/boredom/surprise. If we can't predict something well (i.e. we're "surprised" by it) then that both acts as a learning trigger to predict better next time and retains our interest to explore it.

Eventually AGIs will need to learn by experimentation as we do, but in the meantime the ability to predict a potential training sample well could be used to decide whether to add it to the training set or not. At the moment it seems the main emphasis is on a combination of multi-modality (esp. video) and synthetic data, where one generation of LLM generates tailored training samples for the next generation. I guess this synthetic data allows a more selective acquisition of knowledge than just adding surprising texts found in the wild.
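As a toy sketch of that gating idea (gpt2 as a stand-in model; the thresholds are made-up assumptions, not from any published recipe): use the current model's own prediction loss on a candidate sample as the "surprise" signal and only keep samples in a useful band.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def surprise(text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # HF shifts labels internally for next-token loss
        return out.loss.item()            # mean cross-entropy per token

    LOW, HIGH = 2.0, 6.0  # hypothetical band: below = redundant, above = probably junk

    def keep_for_training(text: str) -> bool:
        return LOW < surprise(text) < HIGH

    print(keep_for_training("The quick brown fox jumps over the lazy dog."))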


Get a collection of data which is small enough to have humans annotate as being either low quality or high quality. Train a model to predict this annotation. Then on a larger disjoint collection of data, use this model to estimate whether the data points would be considered low quality or high quality, and use this to filter it.

This seems doable, and, I think something like it is already done?
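A minimal sketch of that two-stage filter (placeholder documents and threshold; in practice the annotated set would be much larger and the features richer than bag-of-words):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # small human-annotated set: 1 = high quality, 0 = low quality
    annotated_docs = ["carefully written reference article ...",
                      "spammy keyword soup buy now click here ..."]
    annotated_labels = [1, 0]

    quality_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    quality_model.fit(annotated_docs, annotated_labels)

    # apply to a much larger, disjoint corpus and keep only likely-good documents
    candidate_docs = ["candidate document one ...", "candidate document two ..."]
    p_good = quality_model.predict_proba(candidate_docs)[:, 1]
    kept = [doc for doc, p in zip(candidate_docs, p_good) if p > 0.8]

Something along these lines is indeed already done at scale (quality classifiers over web crawl data, for example), though the details differ per lab.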


You could use data where you know it was not AI generated, like the Library of Congress catalog from prior to 2015. Or highly cited research papers, things of that nature.


This deserves to be on the front page.

The authors ask whether image-to-text and text-to-image models (like CLIP and Stable Diffusion) are truly capable of zero-shot generalization.

To answer the question, the authors compile a list of 4000+ concepts (see paper for details on how they compile the list of concepts), and test how well 34 different models classify or generate those concepts at different scales of pretraining, from ~3M to ~400M samples.

They find that model performance on each concept scales linearly as the concept's frequency in pretraining data grows exponentially, i.e., the rarer the concept the less likely it is actually/properly learned -- which implies there is no "zero-shot" generalization.
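Put loosely as a formula (my paraphrase of the trend, not an equation quoted from the paper):

    \text{performance}(c) \approx a \cdot \log f_{\text{pretrain}}(c) + b

where f_pretrain(c) is how often concept c shows up in the pretraining data. In other words, each constant gain in performance on a concept requires multiplying, not adding to, the amount of relevant pretraining data.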

The authors also release a long-tail test dataset that they cleverly name the "Let it Wag!" benchmark to allow other researchers to see for themselves how current models perform on the long tail of increasingly rare concepts.

Go read the whole thing, or at least the introduction. It's clear, concise, and well-written.


Worth noting that this paper is about CLIP only, which is way simpler than LLM architectures (if I'm not mistaken).

Still, an interesting approach, and it kind of confirms the experience of most people that CLIP models can recognize known concepts but struggle with novel ones.


But is this some abstract truth of statistics or is this just a property of how these types of models work?


How do we know humans don’t do the same thing?


The models tested seem to work as expected as it’s not a retrieval model being used here. The weights are lowest on wormsnake but much higher on worm and snake. The temperature of the model for something like stable diffusion has to be higher than something doing a retrieval, so we would not expect it to reproduce the exact worm snake image from its training data.


Of course, it will require exponential data for zero shot. The keyword here is zero shot. If you think about it for a second, this applies to humans too. We also need exponential training data to do things without examples.


When we learn the grammar of our language, the teacher does not stand in front of the class and proceed to recite a large corpus of examples of ungrammatical sentences; only the correct ones are in the training set.

When we learn to drive, we do not need to crash our car a thousand times in a row before we start to get it.

When we play a new board game for the first time, we can do it fairly competently (though not as well as experienced players) just by reading and understanding the rules.


please help yourself and do a quick Google search about "zero shot" and "few shot" learning.


You could explain them instead of being snarky.

https://xkcd.com/1053/


Not to derail this conversation but...

When I'm explaining AI stuff to family, the example I use is classification and I specifically use cats and dogs. I use the analogy of how you teach a toddler that this is a cat and that is a dog. Essentially repetition. And at first they get them mixed up and the parent will say "no, that's a dog" when they think it's a cat and so on.

But essentially, for a child learning the difference between a cat and a dog you only need to show them a handful of each and they'll generally get it from that point on.

That being said, why does it take ML millions (or billions) of images to be able to say "that's a cat" when a human does it on a handful (might be up to, say 100 but my point stands). Why can ML not do that yet?

I'm a dev for many years but not in AI, hence my ELI5 question :)

Edit: If the answer is massively long and complicated, perhaps if you could point me to some text (book, paper etc.) and I can read at my leisure.

Edit2: I just thought of something. Is it related to whether the child sees a still image or a live cat? So, for example, a still image is a single example of a cat standing in a particular position etc, whereas a moving, live cat, would be interpreted by the brain as many many still images, all processed individually? The end result being that, in fact the child, when seeing a live cat, actually sees thousands or millions of still images of the cat? It just popped into my head there :D


Saying "it takes a hundred images for a human to learn" implies that you can take a baby/toddler/whatever, who has been blind all their life, restore their sight, show them 100 photos of dogs and cats, and expect them to know what's what.

You're ignoring the trillions of frames a toddler has seen before they get to the part where they can even understand what a photo is.


Toddlers have a neural network architected and fine-tuned for object recognition. They also have the benefit of learning from thousands of hours of video of the real world, initially blurry, gradually getting sharper, to build a model of physical reality before they are given their first examples of "dog" or "cat" to classify. If you kept a toddler immobile in the darkness and then presented them with static 2D photos of a dog or a cat, I think they wouldn't learn much faster than an artificial neural net.


Neural networks do not imitate the brain and “training” a network has nothing to do with learning, despite these terms serving double duty.

Machine learning models are just math functions fitted to some data. When we get predictions from them, we’re really just using a technique to interpolate between the data points. The denser a particular region has been sampled in the data, the better the predictions will be. (This is why GPT-anything will do a good job writing solutions to common leetcode problems, while struggling with a novel problem.)

Humans have a powerful abstraction ability far beyond any algorithm that has been developed. We can take in a few pieces of information describing a really unusual set of circumstances, run imaginary experiments and simulations on them, and make very granular and accurate predictions about their consequences. Nobody actually knows how.


> just math functions fitted to some data

I don't want to be pedantic. But I'd like to interject. We can simulate everything with math functions. If the algorithm isn't there yet it's because we're using the wrong functions.


Are you a physicalist or a dualist, in the philosophy of mind senses of those words? I don't see how what we do is much different than computers as both use fundamental physical computation to achieve a result. It could simply be that our brains are much more complex in their computation than current computers, but they both compute nonetheless.


I have no idea! I don’t think these questions can be answered (despite modern culture expressing absolute faith in physicalism).

But back to the subject of deep learning, whatever brains do, I think it’s pretty clear that existing neural networks don’t approximate the biological process. They’re just too static.


1. Children are not tabulae rasae, they have previous optimization of their brains via evolution such that they are much more predisposed to visual and other human tasks than computers. On the other hand, computers are tabulae rasae such that they are essentially blind and have no optimization until you train them.

2. Children see many more data points than computers, thousands of hours of lived experience from which they can learn, of not just the visual world but the auditory and other sensory worlds.

Essentially, you are comparing something that is ~80% trained and then training it on billions of data points to something that is 0% trained and then training it on still images of only two dimensions and asking why the former is much better.


Millions of years of evolution has trained the biological LLM in our brains to be good at fine tuning those concepts.


It was also mentioned in a just uploaded Computerphile video: https://www.youtube.com/watch?v=dDUC-LqVrPU


I wonder, is this exponential relation specific to multi-modal models? From my admittedly naïve view it seems to make sense that "...what is rare is not properly learned" would apply generally?


Better title: Image classification models suck at identifying nouns that they've rarely seen.

Crucial context:

- They're only looking at image models -- not LLMs, etc

- Their models are tiny

- A "concept" here just means "a noun." The authors index images via these nouns.

- They didn't control for difficulty in visual representation/recognition of these exceptionally infrequent, long-tail "concepts."

If I didn't know an object's label, I too would struggle to identify/draw it...


Odd how this went from top voted comment to lowest comment. What's the disagreement?


I show +23 points. Are popular comments down ranked to encourage diversity? Oh well, I'm proud of the brief success in challenging misinformation here!



