Well...
From an evolutionary standpoint, we're basically the same collection of mostly-hairless primates that, 20,000 years ago, hadn't yet figured out agriculture and were roaming the land in small groups of maybe 100 or so at most, living off it as best we could.
From that standpoint, I think that we've done pretty well with a brain that evolved to deal with a rather different environment and is now having to navigate a terribly confusing situation.
I mean, you see any other critters that have been outperforming us on improving their understanding of the world?
I think that the current crop of systems is often good enough for a header illustration in a journal or something, but there are also a lot of things that it just can't reasonably do well. Maintaining character cohesion across multiple images and different perspectives, for example.
Try doing a graphic novel with diffusion models trained on 2D images, and it just doesn't work. The whole system would need to have a 3D model of the world, be able to do computer vision to get from 2D images to 3D, and have a knowledge of 3D stuff rather than 2D stuff. That's something that humans, with a much deeper understanding of the world, find far easier.
Diffusion models have their own strong points where they're a lot better than humans, like easily mimicking an artist's style. I expect that as people bang away on things, it'll become increasingly visible what the low-hanging fruit is, and what is far harder.