I have been generating a few dozen images per day for storyboarding purposes. The more I try to perfect it, the easier it becomes to control these outputs and even keep the entire visual story, as well as its characters, consistent over a few dozen different scenes, while even controlling the time of day throughout the story. I am currently working with 7-layer prompts to control for environment, camera, subject, composition, light, colors, and overall quality (it might be overkill, but it's also an experiment).
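Roughly, the composition looks something like the sketch below (a minimal illustration of the idea only; the layer names are from my setup, the layer text and helper are made up):

    # Minimal sketch of the 7-layer prompt idea; layer names from my setup,
    # the actual layer text here is just illustrative.
    BASE_LAYERS = {
        "environment": "rain-soaked harbor town at the edge of a pine forest",
        "camera": "35mm lens, eye level, shallow depth of field",
        "subject": "ILSA, a wiry lighthouse keeper in a yellow oilskin coat",
        "composition": "subject off-center left, leading lines toward the pier",
        "light": "overcast late afternoon, soft diffuse light",
        "colors": "muted teal and amber palette, low saturation",
        "quality": "high detail, clean rendering, no text artifacts",
    }

    def build_prompt(scene_overrides):
        """Merge per-scene overrides into the base layers and flatten to one prompt."""
        layers = {**BASE_LAYERS, **scene_overrides}
        return "\n".join(f"{name.upper()}: {text}" for name, text in layers.items())

    # Per scene I usually only touch 3-5 layers; the untouched base layers are
    # what keep the whole story in the same universe and style.
    scene_03 = build_prompt({
        "subject": "ILSA hauling a rowboat onto the shingle, exhausted",
        "light": "blue hour, last light fading on the horizon",  # time-of-day control
    })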
I also created a small editing suite for myself where I can draw bounding boxes on images when they aren't perfect, and have them fixed. Either just with a prompt, or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API). It's been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.
We use nano banana extensively to build video storyboards, which we then turn into full motion video with a combination of img2vid models. It sounds like we're doing similar things, trying to keep images/characters/setting/style consistent across ~dozens of images (~minutes of video). You might like the product depending on what you're doing with the outputs! https://hypernatural.ai
If anything, the ubiquity of AI has just revealed how many people have 0 taste. It also highlights the important role that these human-centred jobs were doing to keep these people from contributing to the surface of any artistic endeavour in "culture".
There is a reason people (used to) study art and train for years. Easy art is often no art, because you need that effort and investment, and that learned artistic context, to understand and appreciate it.
Which is not to say don’t be creative, I applaud all creativity, but also to be very critical of what you are doing.
I've been playing around with T2I/I2V generation to make some NSFW stuff of video-game characters using ComfyUI.
It's pretty easy to get something decent. It's really hard to get something good. I share my creations with some close friends and some are like "that's hot!" but are too fixated on breasts to realize that the lighting or shadow is off. Other friends do call out the bad lighting.
You may be like "it's just porn, why care about consistent lighting?" and the answer for me is that I'm doing all this to learn how everything works. How to fine-tune weights, write prompts, use IP Adapter, etc. Once I have a firm understanding of this stuff, then I will probably be able to make stuff that's actually useful to society. Unlike that coke commercial.
I think it's a fair comment though. Porn isn't really useful to society (one could argue that it's actually detrimental to society but that's a separate topic).
But what I understood from parent comment is that they just do it for fun, not necessarily to be a boon to society. And then if it comes with new skills that actually can benefit society, then that's a win.
Granted, the commenter COULD play around with SFW stuff but if they're just doing it for fun then that's still not benefiting society either, so either way it's a wash. We all have fun in our own ways.
Reminds me of that AI coke commercial. I personally didn't notice how shitty it was until I read about it online. (I actually didn't even see the commercial until I read about it online).
But it's impressive that this billion dollar company didn't have one single person say "hey it's shitty, make it better."
Everything's shitty in its own way. Modern (or even golden-age era) movies with top production values are the equivalent of Egyptian wall paintings. They have a specific style, a specific way to show things. Over the years movie artists just figured out in what specific way movies should be shitty, and audiences were taught that as a canon.
AI is shitty in its own new, unique ways. And people don't like new. They want the old, polished shittiness they are used to.
While I agree that all art is kinda shitty in its own way (IMDB has sections dedicated to breaks in continuity and stuff like that), experienced filmmakers would be good at hiding the shittiness (maybe with a really clever action sequence or something).
It's only a matter of time before we get experienced AI filmmakers. I think we already have them, actually. It's clear that Coke does not employ them though.
The ubiquity of AI has just revealed that there are tons of grifters willing to release the sloppiest thing ever if they thought it could make some money. They would refrain from that if they had at least a glimmer of taste.
It is really no different than music. Millions of people play guitar but most are not worth listening to or deserving of an audience.
Imagine if you gave everyone a free guitar and people just started posting their electric guitar noodlings on social media after playing for 5 minutes.
It is not a judgement on the guitar. If anything it is a judgement on social media and the stupidity of the social media users who get worked up about someone creating "slop" after playing guitar for 5 minutes.
What did you expect them to sound like, Steve Vai?
So in the end it turns out that the art was never so much about creativity as about gatekeeping. And "everyone can make art" was just a fake facade, because not really.
Of course everyone can make art. Toddlers make art. The hard truth is that getting good technical art skills, be they visual, musical, literary, or anything else, is like getting stronger: many people who want to do it are too lazy or undisciplined to put in the daily work required. You might be starting too late (maybe post-middle-age) or not have the time to become an exceptional artist, but most art that people like wasn't made by exceptional artists; there are a lot more strong people than professional athletes or Olympians. You don't even need a gym membership or weights, and there's limitless free information about how to do it online. Nobody is stopping anyone from doing it. Just like many, if not most, gym memberships are paid for but unused after the first, like, month, many people try drawing for a little while, get frustrated that it's so difficult to learn, and then give up. The gatekeeping argument is an asinine excuse people make to blame other people for their own lack of discipline.
Hitchens was, first and foremost, a critic. Most of the so-called gatekeeping that people accuse artists of is actually born from art criticism-- a completely different group of people rarely as popular among artists as they are among people that like to feel cool about looking at art.
> Of course everyone can make art. Toddlers make art.
That's my entire point. Artists were fine with everybody making "art" as long as everybody except them (with their hard fought skill and dedication) achieved toddler level of output quality. As soon as everybody could truly get even close to the level of actual art, not toddler art, suddenly there's a horrible problem with all the amateur artists using the tools that are available to them to make their "toddler" art.
Most artists don’t give a flying fuck about what you do on your own. Seriously! They really don’t. What they care about is having their work ripped off so for-profit companies can kill the market for their hard-won skills with munged-up derivatives.
Folks in tech generally have very limited exposure to the art world — fan art communities online, Reddit subs, YouTubers, etc. It’s more representative of internet culture than the art world— no more representative of artists than X politics is representative of voters. People have real grievances here and you are not a victim of the world’s artists. Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
I will be if they manage to slow down development of AI even by a smidgen.
> Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
Fully agree. They care about whether there's going to be anyone willing to buy their stuff from them. And not-toddler art is a real competition for them. So they are super against everybody making it.
Well drat, you’ve exposed all of us, from art directors to VFX artists to fine art painters to singer-songwriters to graphic designers to game designers to symphony cellists as a monolithic glob of petty, transactional rakes. Fortunately, everyone is an artist now, so you can make your own output to feed to models and leave our work out of it entirely! It clearly has no value so nobody should be mad about going without it. Problem solved!
If you think human art was anything but a bootstrap for AI you are kidding yourself. I don't think artists are going to be as happy as you think though, because market for their services will drop even further towards zero and they will go back to being financed by the richest on a whim. The way it always used to be before the advent of information copying and distribution technologies. Technology giveth, technology taketh away.
Why are so many AI art boosters such giant edgelords? Do you really think having that much of a chip on your shoulder is justified?
You obviously can't un-ring a bell, but finding ways to poison models that try to rip artists off sure is amusing. The real joke is on the people in software that think they're so special that their skills will be worth anything at all, or believe that this will do anything but transfer wealth out of the paychecks of regular people, straight into the pockets of existing centibillionaires. There are too many developers in the existing market as it is, and so many of them are diligently trying to reduce that demand further across an even larger range of disciplines, especially via the in-demand jobs like setting up agents to take people's jobs. Well, play stupid games, win stupid prizes.
Making value statements about art is pretty much exclusively the realm of art critics and art historians. They're no more representative of artists than general historians are representative of politicians and soldiers.
I agree. Bruhcula? Something like that. He's a vampire, but also models and does stunts for Baywatch - too much color and vitality. Joan of Arc is way more pale.
Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.
Flux Kontext does pretty well for modifications too. Though I've otherwise found the Flux models somewhat stubbornly locked into certain compositions at times, which requires a ControlNet to break, where other models have been more pliable, though with other trade-offs.
Makes a lot of sense for some short kid's skit teaching them about the branches of government or whatever. One could also get more creative with the Statue of Liberty and Joan of Arc.
> The more I try to perfect it, the easier it becomes
I have the opposite experience: once it goes off track, it's nearly impossible to bring it back on message.
How much have you experimented with it? For some stories I may generate 5 image variations of 10-20 different scenes, then spend time writing down what worked and what did not, and run the generation again (this part is mostly for research). It's certainly advancing my understanding over time and my ability to control the output. But I'm learning that it takes a huge amount of trial and error. So versioning prompts is definitely recommended, especially if you find some nuances that work for you.
> I also created a small editing suite for myself where I can draw bounding boxes on images when they aren't perfect, and have them fixed. Either just with a prompt, or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API)
Are you talking about Automatic1111 / ComfyUI inpainting masks? Because Nano doesn't accept bounding boxes as part of its API unless you just stuffed the literal X/Y coordinates into the raw prompt.
You could do something where you draw a bounding box and, when you get the response back from Nano, you mask that section back over the original image - using a decent upscaler as necessary in the event that Nano had to reduce the size of the original image down to ~1MP.
No, I am using my own workflows and software for this. I made nano-banana accept my bounding boxes. Everything is possible with some good prompting: https://edwin.genego.io/blog/lpa-studio < there are some videos there of an earlier version while I am editing a story. Either send the coords and describe the location well, or draw the bounding box on the image and tell it to return the image without the drawn box and with only the requested changes.
It also works well if you draw a bb on the original image, then ask Claude for a meta-prompt to deconstruct the changes into a much more detailed prompt, and then send the original image without the bbs for changes. It really depends on the changes you need, and how long you're willing to wait.
- normal image editing response: 12-14s
- image editing response with Claude meta-prompting: 20-25s
- image editing response with Claude meta-prompting as well as image deconstructing and re-constructing the prompt: 40-60s
(I use Replicate though, so the actual API may be much faster).
This way you can also go into new views of a scene by zooming the image in and out on the same aspect-ratio canvas, and asking it to generatively fill the white borders around it. So you can go from a tight inside shot to viewing the same scene from outside of a house window. Or from inside the car to outside the car.
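The mechanical part of the red-box trick is simple; a rough sketch (Pillow is real, but edit_image() below is just a stand-in for whatever nano-banana endpoint you call, e.g. via Replicate or your own wrapper):

    from PIL import Image, ImageDraw

    def mark_region(path, box):
        """Draw a red bounding box on a copy of the image to mark the edit region."""
        img = Image.open(path).convert("RGB")
        ImageDraw.Draw(img).rectangle(box, outline=(255, 0, 0), width=6)
        return img

    marked = mark_region("scene_03.png", (412, 180, 640, 360))
    prompt = (
        "Inside the red bounding box only: replace the broken oar with an intact one. "
        "Return the image WITHOUT the red box, with everything outside it unchanged."
    )
    # result = edit_image(image=marked, prompt=prompt)  # hypothetical model call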
Thanks, that makes sense. I'll have to give the "red bounding box overlay" a shot when there are a great deal of similar objects in the existing image.
I also have a custom pipeline/software that takes in a given prompt, rewrites it using an LLM into multiple variations, sends it to multiple GenAI models, and then uses a VLM to evaluate them for accuracy. It runs in an automated REPL style, so I can be relatively hands-off, though I do have a "max loop limiter" since I'd rather not spend the equivalent of a small country's GDP.
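For anyone curious, the loop has roughly the shape below (a sketch only; rewrite_prompt, generate_image, and score_accuracy are placeholders for my own LLM, image-model, and VLM calls):

    MAX_LOOPS = 5        # the "max loop limiter"
    TARGET_SCORE = 8.5   # VLM score out of 10; arbitrary threshold

    def run_scene(base_prompt):
        best = None
        for _ in range(MAX_LOOPS):
            variations = rewrite_prompt(base_prompt, n=4)            # LLM rewrites
            images = [generate_image(p) for p in variations]         # GenAI models
            scored = [(score_accuracy(img, base_prompt), img, p)     # VLM critique
                      for img, p in zip(images, variations)]
            scored.sort(key=lambda t: t[0], reverse=True)
            if best is None or scored[0][0] > best[0]:
                best = scored[0]
            if best[0] >= TARGET_SCORE:
                break
            # Feed the critique back in so the next round knows what to fix.
            base_prompt += f"\nPrevious best scored {best[0]}/10; fix the weakest aspects."
        return best  # (score, image, prompt)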
Automated generator-critique loops for evaluation may be really useful for creating your own style libraries, because it's easy for an LLM agent to evaluate how close an image is to a reference style or scene. So you end up with a series of base prompts, and can now replicate that style across a whole franchise of stories. Most people still do it with reference images, and that doesn't really create very stable results. If you do need some help with bounding boxes for nano-banana, feel free to send me a message!
You can literally just open the image up in Preview or whatever and add a red box, circle etc and then say "in the area with the red square make change foo" and it will normally get rid of the red box on the generated image. Whether or not it actually makes the change you want to see is another matter though. It's been very hit or miss for me.
Yeah I could see that being useful if there were a lot of similar elements in the same image.
I also had similar mixed results wrt Nano-banana especially around asking it to “fix/restore” things (a character’s hand was an anatomical mess for example)
That sounds intriguing. 7 layers - do you mean its one prompt composed of 7 parts, like different paragraphs for each aspect?
How do you send bounding box info to banana? Does it understand something like that? What does claude add to that process? Makes your prompt more refined?
Thanks
Yes, the prompt is composed of 7 different layers, where I group together coherent visual and temporal responsibilities. Depending on the scene, I usually only change 3-5 layers, but the base layers still stay the same, so the scenes all appear within the same story universe and the same style. If something feels off, or feels like it needs to be improved, I just adjust one layer after the other to experiment with the results on the entire story, but also at the individual scene level. Over time, I have created quite a few 7-layer style profiles that work well and that I can cast onto different story universes. Keep in mind this is heavy experimentation; it may just be that there is a much easier way to do this, but I am seeing success with it. https://edwin.genego.io/blog/lpa-studio - at any point I may throw this all out and start over, depending on how well my understanding of this all develops.
Bounding boxes: I actually send an image with a red box drawn around where the requested change is needed. And 8 out of 10 times it works well. But if it doesn't work, I use Claude to make the prompt more refined. The Claude API call that I make can see the image + the prompt, as well as understanding the layering system. This is one of the 3 ways I edit; there is another where I just send the prompt to Claude without it looking at the image. Right now this all feels like dial-up, with a minimum of $0.035 per image generation ($0.0001 if I just use a LoRA though) and a minimum of 12-14 seconds of waiting on each edit/generation.
This is beautiful and inspiring. This is exactly what we need right now: tools to empower artists and builders leveraging these novel technologies. Claude Code is a great example IMHO, and it's the tip of the iceberg; the future consists of a whole new world, a new mental model and set of constraints and capabilities, so different that I can't really imagine it.
Who would have thought that we'd reach this uncharted territory, with so many opportunities for pioneering and innovation? Back in 2019 it felt like nothing was new under the sun; today it feels like there is a whole new world under the sun for us to explore!
Thanks! It's really refreshing to work on this sort of stuff, not even knowing what the end result is going to be. Just a hobby? Something that some new model or third-party app will completely replace next week? A new career path? Me getting back to my filmmaking and arts roots? I have no idea; I just know that it's some of the best fun I have had with software in my career. I am hoping that more people jump on this experimental path with GenAI, just for themselves or to see how far they can push the boundaries.
> Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.
I keep hearing advocates of AI video generation talking at length about how easy the tools are to use and how great the results are, but I've yet to see anyone produce something meaningful that's coherent, consistent, and doesn't look like total slop.
I watched the most popular and most recent videos of each channel to compare, and they were all awful:
> Bots in the Hall
* voices don't match the mouth movements
* mouth movements are poorly animated
* hand/body movements are "fuzzy" with weird artifacts
* characters stare in the wrong direction when talking
* characters never move
* no scenes over 3 seconds in length between cuts
> Neural Viz
* animations and backgrounds are dull
* mouth movements are uncanny
* "dead eyes" when showing any emotions
* text and icons are poorly rendered
> The Meat Dept video for Igorrr's ADHD
This one I can excuse a bit since it's a music video, and for the sake of "artistic interpretation", but:
* continuation issues between shots
* inconsistent visual style across shots
* no shots longer than 4 seconds between cuts
* rendered text is illegible/nonsensical
* movement artifacts
You're just one person, and those people have their own audiences; your critique is just your own critique. Just because you don't like it doesn't mean that it is not resonating with others. I can tell you from the research I am doing for several hours per day on AI filmmaking that there are already a handful of creators making a living from this, with communities behind them that keep growing and audiences that keep expanding (some already have 100k to 1m subscribers across different social media channels). Some of them are even striking brand deals.
Entire narrative-driven AI shows, driven by AI stories and AI characters in AI-generated universes... they are here already, but I can only count those who do it well on two hands (last year, there were 1-2). This is going to accelerate, and if you think it's "slop" now, it just takes a few iterations of artists you personally resonate with jumping onto this before you stop seeing it as slop. I am jumping on this because I can see very clearly where this will all lead. You don't have to like it, but it will arrive regardless.
Almost every talented artist with a public presence who has spoken on AI art has spoken against its generation, the use of AI tools, and the harm it's causing to their communities. The few established artists who are proponents of AI art (Lioba Brueckner comes to mind) have a financial incentive to do so, since they sell tools or courses teaching others with less/no talent to do the same.
The tools aren't going anywhere. Fans were outraged at the look, and artists raged against the transition from cel animation to digital. Almost nothing serious is produced via cel now, and the art form adjusted by producing extremely complex and beautiful work that couldn't have been done on cels.
There's a real legal fight that needs to go on right now about these companies stealing style, voices, likeness, etc. But it's really beginning to feel like there's a generation of artists that are hampering their career by saying they are above it instead of using the tools to enhance their art to create things they otherwise couldn't.
I see kids in high school using the tools like how I used Photoshop when I was younger. I see unemployed/under employed designers lamenting what the tools have done.
The issue for them is that once the tools exist, adoption only moves in one direction. And it will enable a whole wave of new artists. I sympathize with them, but if I enjoy GenAI art creation and see it as my genuine creative outlet, why would I stop? What about the thousands of others exploring this?
If at some point I also get very good at it, and the tech, models, and tools mature, this will turn into a real avenue; who are they to tell us not to pursue it?
Why didn't you mention the financial incentives of many outspoken critics of AI? They feel like their entire livelihood depends on AI failing. I'd say that's a pretty strong financial incentive.
I don't think that is the problem (as someone who has been described as being in that bracket); it's the tooling and control that are missing. I believe that will be solved over time.
I don't get how these tools are considered good when they can't even handle a simple thing like describing this scene.
> I want to bring awareness to the dangers of dressing up like a seal while surfboarding (i.e. wearing black wetsuits, arms hanging over the board). Create a scene from the perspective of a shark looking up from the bottom of the ocean into a clear blue sky with silhouettes of a seal and a surfer and a fishing boat with a line dangling in the water, and show how the shark contemplates attacking all these objects because they look so similar.
I haven't found a model yet that can process that description, or any variation of it, into a scene that is usable and makes sense visually to anyone older than a 1st grader. They will never place the seal, surfer, shark, or boat in the correct locations to make sense visually. Typically everyone is under water and the sizing of everything is wrong. You tell them the image is wrong, to place the person on top of the water, and they can't. Please can someone link to a model that is capable, or tell me what I am doing wrong? How can you claim to process words into images in a repeatable way when these systems can't deal with multiple constraints at once?
I'm sorry but there are already tons of similarly "imported" comments here that disparage AI and AI artists that similarly add no value to the discussion.
My intention was solely to support the parent in the face of prevalent general critique of what he dabbles in.
I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
GEMINI_API_KEY="..." \
uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
python -m gemimg "a racoon holding a hand written sign that says I love trash"
I use Gemini CLI on a daily basis. It used to crash often and I'd lose the chat history. I found this tool called ai-cli-log [1] and it does something similar out of the box. I don't run Gemini CLI without it.
The author went to great lengths about open source early on. I wonder if they'll cover the QwenEdit ecosystem.
I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
The Qwen-Edit images from my GenAI Image Editing Showdown site were all generated from a ComfyUI workflow on my machine - it's shockingly good for an open-weight model. It was also the only model that scored a passing grade on the Van Halen M&M test (even compared against Nanobanana)
Ha, I created a Van Halen M&M test for text prompts. I would include an instruction demanding that the response contain <yellow_m&m> and <red_m&m> but never <brown_m&m>. Then I would fail any LLM that did not include any M&Ms at all, or that wrote anything about the <brown_m&m> in the final output.
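The pass/fail rule is trivially mechanical; something like this (a minimal sketch of the check as described above, the tag strings are the ones from the comment):

    def passes_mm_test(response: str) -> bool:
        """Fail if no M&M tags appear at all, or if the brown one is mentioned."""
        has_mms = "<yellow_m&m>" in response or "<red_m&m>" in response
        return has_mms and "<brown_m&m>" not in response

    assert passes_mm_test("Rider ok: <yellow_m&m> and <red_m&m> backstage.")
    assert not passes_mm_test("Sure! <yellow_m&m>, <red_m&m> and one <brown_m&m>.")
    assert not passes_mm_test("No candy mentioned at all.")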
> I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
For imagegen, agreed. But for textgen, Kimi K2 thinking is by far the best chat model at the moment from my experience so far. Not even "one of the best", the best.
It has frontier level capability and the model was made very tastefully: it's significantly less sycophantic and more willing to disagree in a productive, reasonable way rather than immediately shutting you out. It's also way more funny at shitposting.
I'll keep using Claude a lot for multimodality and artifacts, but much of my usage has shifted to K2. Claude's sycophancy in particular is tiresome. I don't use ChatGPT/Gemini because they hide the raw thinking tokens, which is really cringe.
Claude Sonnet 4.5 doesn't even feel sycophantic (in the 4o) way, it feels like it has BPD. It switches from desperately agreeing with you to moralizing lectures and then has a breakdown if you point out it's wrong about anything.
Also, yesterday I asked it a question and after the answer it complained about its poorly written system prompt to me.
They're really torturing their poor models over there.
I've been keeping an eye on Qwen-Edit/Wan 2.2 shenanigans and they are interesting: however, actually running those types of models is too cumbersome, and in the end it's unclear if it's actually worth it over the $0.04/image for Nano Banana.
I was skeptical about the notion of running similar models locally as well, but the person who did this (https://old.reddit.com/r/StableDiffusion/comments/1osi1q0/wa... ) swears that they generated it locally, just letting a single 5090 crunch away for a week.
If that's true, it seems worth getting past the 'cumbersome' aspects. This tech may not put Hollywood out of business, but it's clear that the process of filmmaking won't be recognizable in 10 years if amateurs can really do this in their basements today.
I just merged the PR and pushed 0.3.1 to PyPI. I also added README documentation and allowed for a `gemimg` entrypoint to the CLI via project.scripts as noted elsewhere in the thread.
I decided to avoid that purely to keep changes made to the package as minimal as possible - adding a project.scripts means installing it adds a new command alias. My approach changes nothing other than making "python -m gemimg" do something useful.
I agree that a project.scripts would be good but that's a decision for the maintainer to take on separately!
Use Google AI Studio to submit requests, and to remove the watermark, open the browser development tools, right-click on the request for the "watermark_4" image, and select to block it. From the next generation on there will be no watermark!
This only applies to the visible watermark on the corner, which you could crop anyways. If I’m not mistaken, all images generated by Google models have an invisible watermark: https://deepmind.google/models/synthid/
How would you enforce that when it’s actually important? Any “bad actor” could just open photoshop and remove it. Or run a delobotimized model which doesn’t watermark.
> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7b LLM in-between that takes a given prompt as an input and creates 4 variations of it before sending them all out.
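A sketch of that in-between step (assuming the 7B model is served behind an OpenAI-compatible endpoint such as llama.cpp or vLLM; the URL and model name here are placeholders):

    import json
    import requests

    def expand_prompt(prompt, n=4):
        """Ask a local Mistral 7B for n rewrites of an image prompt."""
        resp = requests.post(
            "http://localhost:8000/v1/chat/completions",   # assumed local server
            json={
                "model": "mistral-7b-instruct",            # assumed model name
                "messages": [{
                    "role": "user",
                    "content": f"Rewrite this image prompt {n} different ways, keeping "
                               f"every panel and speech bubble intact. Return a JSON "
                               f"list of strings.\n\n{prompt}",
                }],
                "temperature": 0.9,
            },
            timeout=120,
        )
        return json.loads(resp.json()["choices"][0]["message"]["content"])

    variants = expand_prompt("Four-panel comic: a raccoon opens a trash-can cafe.")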
> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.
This is true - though I find it works better by providing a minimum of two images. The first image is intended to be transformed, and the second image is used as "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor text description of the style.
It might also be an explicit guard against Studio Ghibli specifically after the "make me Ghibli" trend a while back, which upset Studio Ghibli (understandably so).
It happens with other styles. The demo documentation example which attempts to transfer an image into the very-public-domain Starry Night by Van Gogh doesn't do a true style transfer: https://x.com/minimaxir/status/1963429027382694264
The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!
This looks like it's caused by 99% of relative directions in image descriptions being given from the looker's point of view, and by the fact that 99% of the ones that aren't refer to a human and not to a skull-shaped pancake.
I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."
Exactly what I was thinking too. I'm a designer, and I'm used to receiving feedback and instructions. "The left eye socket" would to me refer to what I currently see in front of me, while "its left eye socket" instantly shift the perspective from me to the subject.
I find this interesting. I've always described things from the users point of view. Like the left side of a car, regardless of who is looking at it from what direction, is the driver side. To me, this would include a body.
To be honest this is the sort of thing Nano Banana is weak at in my experience. It's absolutely amazing - but it doesn't understand left/right/up/down/shrink this/move this/rotate this etc.
To demonstrate this weakness with the same prompts as the article, see the link below, which shows that it is a model weakness and not just a language ambiguity:
Mmh, IME you need to discard the session/rewrite the failing prompt instead of continuing and correcting on failures. Once errors occur you've basically introduced a poison pill which will continuously make things go haywire. Spelling out what it did wrong is the most destructive thing you can do - at least in my experience.
I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later in the post.
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.
The lack of proper indentation (which you noted) in the Python fib() examples was even more apparent. The fact that both AIs you tested failed in the same way is interesting. I've not played with image generation, is this type of failure endemic?
Came to make exactly the same comment. It was funny that the author specifically said that Nano Banana got all five edit prompts correct, rather than noting this discrepancy, which could be argued either way (although I think the "right eye" of a skull should be interpreted with respect to the skull's POV.)
Extroverts tend to expect directions from the perspective of the skull. Introverts tend to expect their own perspective for directions. It's a psychology thing, not an error.
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
Fair enough! I suppose I've avoided that kind of "style transfer" for a variety of reasons, it hadn't even occurred to me that people were still interested in that. And I don't say that to open up debate on the topic, just explaining away my own ignorance/misinterpretation. Thanks
Yes, that is a serious skill. How many of the woes that we see are because people don't know what they want or are unable to describe it in such a way that others understand it.
I believe "prompt engineer" properly conveys how complex communication can be when interacting with a multitude of perspectives, world views, assumptions, presumptions, etc.
I believe it works well to counter the over-confidence people have from not paying attention to what gaps exist between what is said and what is meant.
Yes, obviously a role involving complex communication while interacting with a multitude of perspectives, world views, assumptions, presumptions, etc needs to be called "engineer."
That is why I always call technical writers "documentation engineers," why I call diplomats "international engineers," why I call managers "team engineers," and why I call historians "hindsight engineers."
I believe you're joking here, but I do think it'd be useful to have some engineering background in each of these domains.
The number of miscommunications that happen in any domain, due to oversight, presumptions and assumptions is vast.
At the very least the terminology will shape how we engage with it, so having an aspirational title like prompt engineer, may influence the level of rigor we apply to it.
I don't think that's the right direction to go in.
Despite needing a lot of knowledge of how a plane's inner workings function, a pilot is still a pilot and not an aircraft engineer.
Just because you know how human psychology works when it comes to making purchase decisions, and you are good at applying that to sell things, you're not a sales engineer.
Giving something a fake name, to make it seem more complicated or aspirational than it actually is makes you a bullshit engineer in my opinion.
I think what you're describing is more commonly included under epistemology under philosophy, and I agree that it would be a useful background in each of those domains, but for some reason in the last few decades we have downgraded the humanities as less useful.
Most designers can't, either. Defining a spec is a skill.
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
… sure … but also no. For example, say I have an image. 3 people in it; there is a speech bubble above the person on the right that reads "I'A'T AY RO HERT YOU THE SAP!"¹
I give it,
Reposition the text bubble to be coming from the middle character.
DO NOT modify the poses or features of the actual characters.
Now sure, specs are hard. Gemini removed the text bubble entirely. Whatever, let's just try again:
Place a speech bubble on the image. The "tail" of the bubble should make it appear that the middle (red-headed) girl is talking. The speech bubble should read "Hide the vodka." Use a Comic Sans like font. DO NOT place the bubble on the right.
DO NOT modify the characters in the image.
There's only one red-head in the image; she's the middle character. We get a speech bubble, correctly positioned, but with a sans-serif, Arial-ish font, not Comic Sans. It reads "Hide the vokda" (sic). The facial expression of the middle character has changed.
Yes, specs are hard. Defining a spec is hard. But Gemini struggles to follow the specification given. Whole sessions are like this, and absolute struggle to get basic directions followed.
You can even see here that I & the author have started to learn the SHOUT AT IT rule. I suppose I should try more bulleted lists. Someone might learn, through experimentation "okay, the AI has these hidden idiosyncrasies that I can abuse to get what I want" but … that's not a good thing, that's just an undocumented API with a terrible UX.
(¹because that is what the AI on a previous step generated. No, that's not what was asked for. I am astounded TFA generated an NYT logo for this reason.)
You're right, of course. These models have deficiencies in their understanding related to the sophistication of the text encoder and its relationship to the underlying tokenizer.
Which is exactly why the current discourse is about 'who does it best' (IMO, the flux series is top dog here. No one else currently strikes the proper balance between following style / composition / text rendering quite as well). That said, even flux is pretty tricky to prompt - it's really, really easy to step on your own toes here - for example, by giving conflicting(ish) prompts "The scene is shot from a high angle. We see the bottom of a passenger jet".
Talking to designers has the same problem. "I want a nice, clean logo of a distressed dog head. It should be sharp with a gritty feel". For the person defining the spec, they actually do have a vision that fits each criteria in some way, but it's unclear which parts apply to what.
at least then, we had hard overrides that were actually hard.
"This got searched verbatim, every time"
W*ldcards were handy
and so on...
Now, you get a 'system prompt' which is a vague promise that no really this bit of text is special you can totally trust us (which inevitably dies, crushed under the weight of an extended context window).
Unfortunately(?), I think this bug/feature has gotta be there. It's the price for the enormous flexibility. Frankly, I'd not be mad if we had less control - my guess is that in not too many years we're going to look back on RLHF and grimace at our draconian methods. Yeah, if you're only trying to build a "get the thing I intend done" machine I guess it's useful, but I think the real power in these models is in their propensity to expose you to new ideas and provide a tireless foil for all the half-baked concepts that would otherwise not get room to grow.
Case in point, the final image in this post (the IP bonanza) took 28 iterations of the prompt text to get something maximally interesting, and why that one is very particular about the constraints it invokes, such as specifying "distinct" characters and specifying they are present from "left to right" because the model kept exploiting that ambiguity.
Hey! To the author: thank you for this post! Quick question: any idea roughly how much this experimentation cost you? I'm having trouble processing their image-generation pricing; I may just not be finding the right table. I'm just trying to understand: if I do like 50 iterations at the quality in the post, how much is that going to cost me?
All generations in the post are $0.04/image (Nano Banana doesn't have a way to increase the resolution, yet), so you can do the math and assume that you can generate about 25 images per dollar: unlike other models, Nano Banana does charge for input tokens, but it's negligible.
Discounting the testing around the character JSON, which became extremely expensive due to extreme iteration/my own stupidity, I'd wager it took about $5 total including iteration.
My personal project is illustrating arbitrary stories with consistent characters and settings. I've rewritten it at least 5 times, and Nano Banana has been a game-changer. My kids are willing to listen to much more sophisticated stories as long as it has pictures, so I've used it to illustrate text like Ender's Game. Unfortunately, it's getting harder to legally acquire books in a format you can feed to an LLM.
I first extract all the entities from the text, generate characters from an art style, and then start stitching them together into individual illustrations. It works much better with NB than anything else I tried before.
This works with the openrouter API as well, which skips having to make a google account etc. Here's a Claude-coded openrouter compatible adaptation which seems to work fine: https://github.com/RomeoV/gemimg
A 1024x1024 image seems to cost about 3ct to generate.
And then pass Gemini 2.5's output directly to Nano-Banana. Doing this yields very high-quality images. This is also good for style transfer and image combination. For example, if you then give Gemini 2.5 a user prompt that looks something like this:
I would like to perform style transfer. I will provide the image generation model a photograph alongside your generated prompt. Please write a prompt to transfer the following style: {{ brief style description here }}.
You can get aesthetic consistently-styled images, like these:
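In code, the two-stage flow is roughly this (a sketch only; ask_gemini and nano_banana_edit are placeholders for your own text-model and image-model calls):

    STYLE_META_PROMPT = (
        "I would like to perform style transfer. I will provide the image generation "
        "model a photograph alongside your generated prompt. Please write a prompt to "
        "transfer the following style: {style}."
    )

    def style_transfer(photo_path, style):
        # Stage 1: the text model writes a detailed editing prompt for the style.
        editing_prompt = ask_gemini(STYLE_META_PROMPT.format(style=style))
        # Stage 2: the image model receives the photo plus the generated prompt.
        return nano_banana_edit(image_path=photo_path, prompt=editing_prompt)

    result = style_transfer("harbor.jpg", "1970s Kodachrome travel photography")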
Photo-realism is great but the real step-jump in image-gen I’m looking for is the ability to draw high quality technical diagrams with a mix of text and images, so I can stop having LLMs generate crappy diagrams with mermaid, SVG, HTML/CSS, draw.io
I tried asking for a shot from a live-action remake of My Neighbor Totoro. This is a task I’ve been curious about for a while. Like Sonic, Totoro is the kind of stylized cartoon character that can’t be rendered photorealistically without a great deal of subjective interpretation, which (like in Sonic’s case) is famously easy to get wrong even for humans. Unlike Sonic, Totoro hasn’t had an actual live-action remake, so the model would have to come up with a design itself. I was wondering what it might produce – something good? something horrifying? Unfortunately, neither; it just produced a digital-art style image, despite being asked for a photorealistic one, and kept doing so even when I copied some of the keyword-stuffing from the post. At least it tried. I can’t test this with ChatGPT because it trips the copyright filter.
Not quite; the eye color and heterochromia is followed only so-so.
The black-and-silver cat seems to have no heterochromia; eye color could be interpreted as silver though.
The white-and-gold cat _does_ have heterochromia. The colors can be interpreted as "white" and "gold", though I'd describe them as whitish-blue and orange. What's interesting about this is an adjustment of the instructions toward biologically more plausible eye colors in the cat which also has more natural fur colors.
The last cat's fur colors are so "implausible" that the model doesn't seem to have problems taking exactly those colors for the (heterochromatic) eyes too!
It's really nice to see long-form, obviously human-written blogs from people deep into the LLM space - maybe us writers will be around for a while still in spite of all the people saying we've been replaced.
I've started increasing the number of jokes in my blog posts to make it sound more obviously human-written: to be honest I was expecting some "why is this so unserious" complaints.
In my own experience, nano banana still has the tendency to:
- make massive, seemingly random edits to images
- adjust image scale
- make very fine grained but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
The "ALL CAPS" part of your comment got me thinking. I imagine most llms understand subtle meanings of upper case text use depending on context. But, as I understand it, ALL CAPS text will tokenize differently than lower case text. Is that right? In that case, won't the upper case be harder to understand and follow for most models since it's less common in datasets?
There's more than enough ALL CAPS text in the corpus of the entire internet, and enough semantic context associated with it for it to be intended to be in the imperative voice.
I like to use these AI models for generating mockup screenshots of game. I can drop a "create a mockup screenshot of a steampunk 2D platformer in which you play as a robot" and it will give me some interesting screenshot. Then I can ask it to iterate on the style. Of course it's going to be broken in some ways and it's not even real pixel art, but it gives a good reference to quickly brainstorm some ideas.
Unfortunately I have to use ChatGPT for this, for some reason local models don't do well with such tasks. I don't know if it's just the extra prompting sauce that ChatGPT does or just diffusion models aren't well designed for these kind of tasks.
For images of people generated from scratch, Nano Banana always adds a background blur; it can't seem to create more realistic or candid images such as those taken with a point-and-shoot or smartphone. Has anyone solved this sort of issue? It seems to work alright if you give it an existing image to edit, however. I saw some other threads online about it but I didn't see anyone come up with solutions.
I tried that but they don't seem to make much difference for whatever reason, you still can't get a crisp shot such as this [0] where the foreground and background details are all preserved (linked shot was taken with an iPhone which doesn't seem to do shallow depth of field unless you use their portrait mode).
Those are rarely in the captions for the image. They'd have to extract the EXIF for photos and include it in recaptioning. Which they should be doing, but I doubt they thought about it.
Nano Banana can be frustrating at times. Yesterday I tried to get it to do several edits to an image, and it would return back pretty much the same photo.
Things like: Convert the people to clay figures similar to what one would see in a claymation.
And it would think it did it, but I could not perceive any change.
After several attempts, I added "Make the person 10 years younger". Suddenly it made a clay figure of the person.
I use it for technical design docs, where I sketch out something on paper and ask Nano Banana to make a flow chart; it's incredibly good at this kind of editing (also, if you want to borrow an image from someone and change some bridges, that's usually hard since it's an embedded image, but Nano Banana solves that).
There's a lot these models can do, but I despise when people suggest they can do edits "with only the necessary aspects changed".
No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details on the "unchanged" aspects. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.
Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data, there's a lot of nuance that can be lost as it regenerates them, no matter how you prompt things.
Nano Banana is different and much better at edits without changing texture/lighting/sharpness/color balance, and I am someone that is extremely picky about it. That's why I add the note that Gemini 2.5 Flash is aware of segmentation masks, and that's my hunch why that's the case.
That's probably where things are headed and there are already products trying this (even photoshop already). Just like how code gen AI tools don't replace the entire file on every prompt iteration.
That article says that most image generators had been overshadowed by GPT.
Yet when I ask it some simple tasks, like making a 16:9 image instead of a square one, it ends up putting a 16:9 picture on a white background that pads it back out to a square.
When I ask it to make an image with text, then on a second request ask it to redo the image while changing just one visual element, it ends up breaking the previously requested text.
It's getting better at flattering people and telling them how clever and right they are than at actually doing the task.
The kicker for Nano Banana is not prompt adherence, which is a really nice-to-have, but the fact that it's either working in pixel space or with a really low spatial scaling. It's the only model that doesn't kill your details with the VAE encode/decode.
> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
I had no idea that the context window was so large. I’d been instinctively keeping my prompts small because of experience with other models. I’m going to try much more detailed prompts now!
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
Not really since "prompt engineering" can be tossed in the same pile as "vibe coding." Just people coping with not developing the actual skills to produce the desired products.
Try getting a small model to do what you want quickly with high accuracy, high quality, etc, and using few tokens per request. You'll find out that prompt engineering is real and matters.
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
I really wish that real expert stuff, like how to do controlnet, use regional prompting, or most other advanced ComfyUI stuff got upvoted to the top instead.
Great post with some nice insights I could've used a few days ago!
I was trying to create a simple "mascot logo" for my pet project. I first created an account on Kittl [0] and even paid for one month but it was quite cumbersome to generate images until I figured out I could just use the nano banana api myself.
Took me 4 prompts to ai-slop a small python script I could run with uv that would generate me a specified amount of images with a given prompt (where I discovered some of the insight the author shows in their post). The resulting logo [1] was pretty much what I imagined. I manually added some text and played around with hue/saturation in Kittl (since I already paid for it :)) et voilà.
Feeding back the logo to iterate over it worked pretty nicely and it even spit out an "abstract version" [2] of the logo for favicons and stuff without a lot of effort.
All in all this took me 2 hours and around $2 (excluding the 1-month Kittl subscription), and I would've never been able to draw something like that in Illustrator or similar.
> It’s one of the best results I’ve seen for this particular test, and it’s one that doesn’t have obvious signs of “AI slop” aside from the ridiculous premise.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
This article was a good read, but the writer doesn't seem to understand how model-based image generation actually works, using language that suggests the image is somehow progressively constructed the way a human would do it. Which is absurd.
I've noticed a lot of this misinformation floating around lately, and I can't help but wonder if it's intentional?
I'm not sure what you're implying is incorrect/misleading. As noted in the post, autoregressive models like Nano Banana and gpt-image-1 generate by token (and each generated token attends to all previous tokens, both text and image) which are then decoded, while diffusion models generate the entire image simultaneously, refined over n iteration steps.
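Conceptually, the difference being pointed at is something like this (pseudocode only, not any model's actual implementation):

    def autoregressive_image(prompt_tokens):
        tokens = list(prompt_tokens)
        for _ in range(NUM_IMAGE_TOKENS):             # one image token at a time...
            tokens.append(sample_next_token(tokens))  # ...attending to all prior tokens
        return decode_tokens_to_pixels(tokens[len(prompt_tokens):])

    def diffusion_image(prompt_embedding, steps):
        latent = random_noise()                       # the whole image exists from step 0
        for t in reversed(range(steps)):              # and is refined globally each step
            latent = denoise(latent, prompt_embedding, t)
        return decode_latent_to_pixels(latent)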
Gemini is instructed to reply with images first and, if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the expression 4+5 verbatim without showing the answer 4+5=9. Of course it can show whatever exact text you want; the question is, does it do prompt rewriting (no) or something else (yes)?
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.
The website lets you type in an entire prompt, then tells you to login, then dumps your prompt and leaves you with nothing. Lame.
Got caught out by this just today. I just wanted to try it out because it appeared as if I could.
I noticed ChatGPT and others do exactly the same once you run out of anonymous usage. Insanely annoying.
Hn does that too. You've typed out a long response, oh sorry you're posting too fast. Please slow down.
It's intentionally hostile and inconsiderate.
Rule #1 when typing longer texts into webforms/textboxes: ALWAYS select all (CTRL+A) and copy (CTRL+C) before you click submit.
You don’t lose the message though, so it’s infinitely less annoying
At least on HN you can Go Back in your browser and restore the page before submission with your post in the box.
But it would be _much_ better if when you hit reply, it gave you a message that you're "posting too fast" before you spend the time to write it up.
That's because you're in the bad user doghouse.
Your "Dracula" character is possibly the least vampiric Dracula I've ever seen tbh
You can do better than porn which isn't very useful to society.
As opposed to what you're doing at the moment, living your best life here on social media.
But it's impressive that this billion dollar company didn't have one single person say "hey it's shitty, make it better."
It's an intentional new-media ad, so I think they're embracing the flaws rather than trying to hide them.
Also, since it's new media, nobody knows how to budget time or money to fix the flaws. It could be infinitely expensive.
Everything's shitty in its own way. Modern (or even golden age era) movies, with top production values are equivalent of Egyptian wall paintings. They have specific style, specific way to show things. Over the years movie artists just figured out in what specific way the movies should be shitty and the audiences were taught that as a canon.
AI is shitty in its own new unique ways. And people don't like new. They want the old, polished shittiness they are used to.
While I agree that all art is kinda shitty in its own way (IMDB has sections dedicated to breaks in continuity and stuff like that), experienced filmmakers would be good at hiding the shittiness (maybe with a really clever action sequence or something).
It's only a matter of time before we get experienced AI filmmakers. I think we already have them, actually. It's clear that Coke does not employ them though.
The ubiquity of AI has just revealed that there are tons of grifters willing to release the sloppiest thing ever if they thought it could make some money. They would refrain from that if they had at least a glimmer of taste.
It is really no different than music. Millions of people play guitar but most are not worth listening to or deserving of an audience.
Imagine if you gave everyone a free guitar and people just started posting their electric guitar noodlings on social media after playing for 5 minutes.
It is not a judgement on the guitar. If anything it is a judgement on social media and the stupidity of the social media user who get worked up about someone creating "slop" after playing guitar for 5 minutes.
What did you expect them to sound like, Steve Vai?
So in the end it turns out that the art was never so much about creativity as about gatekeeping. And "everyone can make art" was just a fake facade, because not really.
Of course everyone can make art. Toddlers make art. The hard truth is that getting good technical art skills, be they visual, musical, literary, or anything else is like getting stronger— many people that want to do it are too lazy or undisciplined to do the daily work required to do it. You might be starting too late (Maybe post-middle-age) or don’t have the time to become an exceptional artist, but most art that people like wasn’t made by exceptional artists; there are a lot more strong people than professional athletes or Olympians. You don’t even need a gym membership or weights, and there’s limitless free information about how to do it online. Nobody is stopping anyone from doing it. Just like many, if not most gym memberships are paid for but unused after the first, like, month, many people try drawing for a little while, get frustrated that it’s so difficult to learn, and then give up. The gatekeeping argument is an asinine excuse people make to blame other people for their own lack of discipline.
Classic gatekeeping quote: "Everyone has a book in them, but in most cases that's where it should stay"
Hitchens was, first and foremost, a critic. Most of the so-called gatekeeping that people accuse artists of is actually born from art criticism-- a completely different group of people rarely as popular among artists as they are among people that like to feel cool about looking at art.
I prefer Stephen King's version: something like "Everybody has four crappy books in them. Get them done and out of the way as soon as possible."
> Of course everyone can make art. Toddlers make art.
That's my entire point. Artists were fine with everybody making "art" as long as everybody except them (with their hard fought skill and dedication) achieved toddler level of output quality. As soon as everybody could truly get even close to the level of actual art, not toddler art, suddenly there's a horrible problem with all the amateur artists using the tools that are available to them to make their "toddler" art.
Most artists don’t give a flying fuck about what you do on your own. Seriously! They really don’t. What they care about is having their work ripped off so for-profit companies can kill the market for their hard-won skills with munged-up derivatives.
Folks in tech generally have very limited exposure to the art world — fan art communities online, Reddit subs, YouTubers, etc. It’s more representative of internet culture than the art world— no more representative of artists than X politics is representative of voters. People have real grievances here and you are not a victim of the world’s artists. Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
> you are not a victim of the world’s artists
I will be if they manage to slow down development of AI even by a smidgen.
> Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
Fully agree. They care about whether there's going to be anyone willing to buy their stuff from them. And not-toddler art is a real competition for them. So they are super against everybody making it.
Well drat, you’ve exposed all of us, from art directors to VFX artists to fine art painters to singer-songwriters to graphic designers to game designers to symphony cellists as a monolithic glob of petty, transactional rakes. Fortunately, everyone is an artist now, so you can make your own output to feed to models and leave our work out of it entirely! It clearly has no value so nobody should be mad about going without it. Problem solved!
If you think human art was anything but a bootstrap for AI you are kidding yourself. I don't think artists are going to be as happy as you think though, because market for their services will drop even further towards zero and they will go back to being financed by the richest on a whim. The way it always used to be before the advent of information copying and distribution technologies. Technology giveth, technology taketh away.
Why are so many AI art boosters such giant edgelords? Do you really think having that much of a chip on your shoulder is justified?
You obviously can’t un-ring a bell, but finding ways to poison models that try to rip artists off sure is amusing. The real joke is on the people in software that think they’re so special that their skills will be worth anything at all, or believe that this will do anything but transfer wealth out of the paychecks of regular people, straight into the pockets of existing centibillionaires. There are too many developers in the existing market as it is, and so many of them are diligently trying to reduce that demand further across an even larger range of disciplines, especially via the in-demand jobs like setting up agents to take people’s jobs. Well, play stupid games, win stupid prizes.
Well but then they spent 100 years telling us that the toddler stuff was the good stuff. Just as long as it was created by a “real artist”.
Making value statements about art is pretty much exclusively the realm of art critics and art historians. They're no more representative of artists than general historians are representative of politicians and soldiers.
Everyone can make art, but whether it's considered good is another matter.
Everyone can, don't worry, art people are snobs even with their own. Now they can just complain about the plebes doing it wrong ALSO.
That looks exactly like the photos on a Spirit Halloween costume.
I'm in tears. Clicked to check out Dracula and sure enough it's a spot on spirit halloween dollar tree Dracula.
The Sherlock Holmes is heavily influenced by Cucumber Patch.
People pay consulting firms good money to be told their ideal customer so plainly!
I agree. Bruhcula? Something like that. He's a vampire, but also models and does stunts for Baywatch - too much color and vitality. Joan of Arc is way more pale.
Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.
Flux Kontext does pretty well also, for modifications. Though I’ve otherwise found the Flux models somewhat stubbornly locked into certain compositions at times that requires a control net to break where other models have been more pliable, though with other trade offs.
He looks like Dracula on LinkedIn
Having a Statue of Liberty character available is for some reason so funny to me.
Makes a lot of sense for some short kid's skit teaching them about the branches of government or whatever. One could also get more creative with the Statue of Liberty and Joan of Arc.
> Create me a video of Joan of Arc fighting the Statue of Liberty in the style of Shadow of the Colossus.
I see where you are coming from...
[dead]
Yes we are definitely doing the same! For now I’m just familiarizing myself in this space technically and conceptually. https://edwin.genego.io/blog
> The more I try to perfect it, the easier it becomes
I have the opposite experience: once it goes off track, it's nearly impossible to bring it back on message
How much have you experimented with it? For some stories I may generate 5 image variations of 10-20 different scenes and then spend time writing down what worked and what did not, and running the generation again (this part is mostly for research). It’s certainly advancing my understanding over time and helping me control the output better. But I’m learning that it takes a huge amount of trial and error. So versioning prompts is definitely recommended, especially if you find some nuances that work for you.
> I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed. Either just with a prompt or feeding them to Claude as image and then having it write the prompt to fix the issue for me (as a workflow on the api)
Are you talking about Automatic1111 / ComfyUI inpainting masks? Because Nano doesn't accept bounding boxes as part of its API unless you just stuffed the literal X/Y coordinates into the raw prompt.
You could do something where you draw a bounding box and, when you get the response back from Nano, you mask that section back over the original image - using a decent upscaler as necessary in the event that Nano had to reduce the size of the original image down to ~1MP.
No, I am using my own workflows and software for this. I made nano-banana accept my bounding boxes. Everything is possible with some good prompting: https://edwin.genego.io/blog/lpa-studio < there are some videos of an earlier version there while I am editing a story. Either send the coords and describe the location well, or draw a red box around the region and tell it to return the image without the drawn box and with only the requested changes.
It also works well if you draw a bb on the original image, then ask Claude for a meta-prompt to deconstruct the changes into a much more detailed prompt, and then send the original image without the bbs for changes. It really depends on the changes you need, and how long you're willing to wait.
- normal image editing response: 12-14s
- image editing response with Claude meta-prompting: 20-25s
- image editing response with Claude meta-prompting as well as image deconstructing and re-constructing the prompt: 40-60s
(I use Replicate though, so the actual API may be much faster).
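If anyone wants to try the red-box round trip described above, here is a rough sketch of what it can look like with PIL and the google-genai SDK; the box coordinates, filenames, and exact model id are placeholders, and you would adapt the call if you go through Replicate instead:

```python
# Rough sketch of the "draw a red box, ask for it back without the box" edit.
# Coordinates, filenames, and the model id are illustrative placeholders.
from io import BytesIO
from PIL import Image, ImageDraw
from google import genai

client = genai.Client()  # expects GEMINI_API_KEY (or GOOGLE_API_KEY) in the environment

img = Image.open("scene.png")
marked = img.copy()
ImageDraw.Draw(marked).rectangle((420, 180, 640, 360), outline="red", width=8)

prompt = (
    "Inside the red box, replace the desk lamp with a lit candle. "
    "Return the full image with the red box removed and no other changes."
)

resp = client.models.generate_content(
    model="gemini-2.5-flash-image",   # Nano Banana; the exact model id may differ
    contents=[prompt, marked],
)

for part in resp.candidates[0].content.parts:
    if part.inline_data:              # image parts come back as inline bytes
        Image.open(BytesIO(part.inline_data.data)).save("edited.png")
```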
This way you can also go into new views of a scene by zooming the image in and out on the same aspect-ratio canvas and asking it to generatively fill the white borders around it. So you can go from a tight inside shot to viewing the same scene from outside a house window, or from inside the car to outside the car.
What framework are you using to generate your documentation? It looks amazing.
Thanks, that makes sense. I'll have to give the "red bounding box overlay" a shot when there are a great deal of similar objects in the existing image.
I also have a custom pipeline/software that takes in a given prompt, rewrites it using an LLM into multiple variations, sends it to multiple GenAI models, and then uses a VLM to evaluate them for accuracy. It runs in an automated REPL style, so I can be relatively hands-off, though I do have a "max loop limiter" since I'd rather not spend the equivalent of a small country's GDP.
Automated generator-critique loops for evaluation may be really useful for creating your own style libraries, because it's easy for an LLM agent to evaluate how close an image is to a reference style or scene. So you end up with a series of base prompts, and can then replicate that style across a whole franchise of stories. Most people still do it with reference images, and it doesn't really create very stable results. If you do need some help with bounding boxes for nano-banana, feel free to send me a message!
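In case it helps anyone building something similar, the loop can be as simple as this sketch; the callables are stand-ins for whatever image model, VLM judge, and prompt-rewriting LLM you already use:

```python
# Sketch of an automated generate -> critique -> rewrite loop with a hard cap
# on iterations. generate_image, judge, and rewrite_prompt are hypothetical
# stand-ins for your own model calls.
def refine(prompt, style_reference, generate_image, judge, rewrite_prompt,
           threshold=0.85, max_loops=5):
    best = (0.0, None, prompt)
    for _ in range(max_loops):                      # the "max loop limiter"
        image = generate_image(prompt)
        score, critique = judge(image, style_reference)
        if score > best[0]:
            best = (score, image, prompt)
        if score >= threshold:
            break
        prompt = rewrite_prompt(prompt, critique)   # fold the critique back in
    return best                                     # (score, image, prompt)
```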
You can literally just open the image up in Preview or whatever and add a red box, circle etc and then say "in the area with the red square make change foo" and it will normally get rid of the red box on the generated image. Whether or not it actually makes the change you want to see is another matter though. It's been very hit or miss for me.
Yeah I could see that being useful if there were a lot of similar elements in the same image.
I also had similar mixed results wrt Nano-banana especially around asking it to “fix/restore” things (a character’s hand was an anatomical mess for example)
That sounds intriguing. 7 layers - do you mean it's one prompt composed of 7 parts, like different paragraphs for each aspect? How do you send bounding box info to banana? Does it understand something like that? What does Claude add to that process? Makes your prompt more refined? Thanks
Yes, the prompt is composed of 7 different layers, where I group together coherent visual and temporal responsibilities. Depending on the scene, I usually only change 3-5 layers, but the base layers stay the same, so the scenes all appear within the same story universe and the same style. If something feels off, or feels like it needs to be improved, I just adjust one layer after the other to experiment with the results on the entire story, but also at the individual scene level. Over time, I have created quite a few 7-layer style profiles that work well and that I can cast onto different story universes. Keep in mind this is heavy experimentation; it may just be that there is a much easier way to do this, but I am seeing success with it. https://edwin.genego.io/blog/lpa-studio - at any point I may throw this all out and start over, depending on how well my understanding of this all develops.
Bounding boxes: I actually send an image with a red box drawn around where the requested change is needed, and 8 out of 10 times it works well. If it doesn't work, I use Claude to make the prompt more refined. The Claude API call that I make can see the image + the prompt, as well as understanding the layering system. This is one of the 3 ways I edit; there is another one where I just send the prompt to Claude without it looking at the image. Right now this all feels like dial-up, with a minimum of $0.035 per image generation ($0.0001 if I just use a LoRA though) and a minimum of 12-14 seconds wait on each edit/generation.
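To make the layering idea concrete, here is a minimal sketch of what a 7-layer profile could look like; the layer names are the ones from this thread, but the contents of each layer are purely illustrative:

```python
# Minimal sketch of a 7-layer style profile. Layer names follow the thread;
# the example text inside each layer is made up.
BASE_PROFILE = {
    "environment": "rain-soaked 1920s city street, narrow brick alleys",
    "camera":      "35mm lens, eye level, shallow depth of field",
    "subject":     "the detective in a grey trench coat and worn fedora",
    "composition": "subject on the left third, leading lines into the alley",
    "light":       "late dusk, warm sodium street lamps, soft rim light",
    "colors":      "muted teal shadows, amber highlights",
    "quality":     "highly detailed, consistent painterly style",
}

def build_prompt(profile, overrides=None):
    """Swap only the 3-5 layers a scene needs; the base layers keep the style stable."""
    layers = {**profile, **(overrides or {})}
    return "\n".join(f"{name}: {text}" for name, text in layers.items())

print(build_prompt(BASE_PROFILE, {"light": "pre-dawn blue hour, thick fog"}))
```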
This is beautiful and inspiring. This is exactly what we need right now: tools to empower artists and builders leveraging the novel technologies. Claude Code is a great example IMHO and it's the tip of the iceberg; the future consists of a whole new world, a new mental model, and a new set of constraints and capabilities, so different that I can't really imagine it.
Who would have thought that we would reach this uncharted territory, with so many opportunities for pioneering and innovation? Back in 2019 it felt like nothing was new under the sun; today it feels like there is a whole new world under the sun, for us to explore!
Thanks! It's really refreshing to work on this sort of stuff, not even knowing what the end result is going to be. Just a hobby? Something that some new model or third-party app will completely replace next week? A new career path? Me getting back to my filmmaking and arts roots? I have no idea, I just know that it's some of the best fun I have had with software in my career. I am hoping that more people jump on this experimental path with GenAI, just for themselves or to see how far they can push boundaries.
> Once you do have good storyboards. You can easily do start-to-end GenAI video generation (hopping from scene to scene) and bring them to life and build your own small visual animated universes.
I keep hearing advocates of AI video generation talking at length about how easy the tools are to use and how great the results are, but I've yet to see anyone produce something meaningful that's coherent, consistent, and doesn't look like total slop.
Bots in the Hall. Neural Viz. The Meat Dept video for Igorrr's ADHD. More will come.
You need talented people to make good stuff, but at this time most of them still fear the new tools.
I watched the most popular and most recent videos of each channel to compare, and they were all awful:
> Bots in the Hall
* voices don't match the mouth movements
* mouth movements are poorly animated
* hand/body movements are "fuzzy" with weird artifacts
* characters stare in the wrong direction when talking
* characters never move
* no scenes over 3 seconds in length between cuts
> Neural Viz
* animations and backgrounds are dull
* mouth movements are uncanny
* "dead eyes" when showing any emotions
* text and icons are poorly rendered
> The Meat Dept video for Igorrr's ADHD
This one I can excuse a bit since it's a music video, and for the sake of "artistic interpretation", but:
* continuation issues between shots
* inconsistent visual style across shots
* no shots longer than 4 seconds between cuts
* rendered text is illegible/nonsensical
* movement artifacts
You're just one person, and those people have their own audiences; your own critique is just your own critique. Just because you don't like it doesn't mean it is not resonating with others. I can tell you from the research I am doing for several hours per day on AI filmmaking that there are already a handful of creators making a living from this, with communities behind them that keep growing and audiences that keep expanding (some already have 100k to 1m subscribers across different social media channels). Some of them are even striking brand deals.
Entire narrative-driven AI shows, driven by AI stories and AI characters in AI-generated universes... they are here already, but I can only count those who do it well on two hands (last year, there were 1-2). This is going to accelerate, and if you think it's "slop" now, it just takes a few iterations of artists who you personally resonate with jumping onto this before you stop seeing it as slop. I am jumping on this because I can see very clearly where this will all lead. You don't have to like it, but it will arrive regardless.
You'll have to wait for actual talented artists to start using these tools.
Almost every talented artist with a public presence who has spoken on AI art has spoken against its generation, the use of AI tools, and the harm it's causing to their communities. The few established artists who are proponents of AI art (Lioba Brueckner comes to mind) have a financial incentive to do so, since they sell tools or courses teaching others with less/no talent to do the same.
The tools aren't going anywhere. Fans were outraged at the look and artists raged against the transition from cel animation to digital. Almost nothing serious is produced via cel now and the art adjusted by making extremely complex and beautiful art that couldn't have been done on cels.
There's a real legal fight that needs to go on right now about these companies stealing style, voices, likeness, etc. But it's really beginning to feel like there's a generation of artists that are hampering their career by saying they are above it instead of using the tools to enhance their art to create things they otherwise couldn't.
I see kids in high school using the tools like how I used Photoshop when I was younger. I see unemployed/under employed designers lamenting what the tools have done.
The issue for them is that once the tools exist, adoption only moves in one direction. And it will enable a whole wave of new artists. I sympathize with them, but if I enjoy GenAI art creation and see it as my genuine creative outlet, why would I stop? What about the thousands of others exploring this?
If at some point I also get very good at it; and the tech, models and tools mature, this will turn into a real avenue; who are they to tell us not to pursue it?
Art, like science, advances one funeral at a time.
Why didn't you mention the financial incentives of many outspoken critics of AI? They feel like their entire livelihood depends on AI failing. I'd say that's a pretty strong financial incentive.
I don’t think that is the problem (as someone that has been described in that bracket), it’s the tooling and control that is missing. I believe that will be solved over time.
I don't get how these tools are considered good when they can't even handle a simple scene description like this.
> i was to bring awareness to the dangers of dressing up like a seal while surfboarding (ie. wearing black wetsuites, arms hanging over the board). Create a scene from the perspective of a shark looking up from the bottom of the ocean into a clear blue sky with silhouettes of a seal and a surfer and fishing boat with line dangling in the water and show how the shark contemplates attacking all these objects because they look so similiar.
I haven't found a model yet that can process that description, or any variation of it, into a scene that is usable and makes sense visually to anyone older than a 1st grader. They will never place the seal, surfer, shark, or boat in the correct locations to make sense visually. Typically everyone is under water, and the sizing of everything is wrong. You tell them the image is wrong, to place the person on top of the water, and they can't. Please can someone link to a model that is capable, or tell me what I am doing wrong? How can you claim to process words into images in a repeatable way when these systems can't deal with multiple constraints at once?
You'll have somewhat better luck if you fix the spelling errors.
https://lmarena.ai/c/019a84ec-db09-7f53-89b1-3b901d4dc6be
https://gemini.google.com/share/da93030f131b
Obviously neither are good but it is better.
I think image models could be producing a lot more editable outputs if eg they output multi-layer PSDs.
[flagged]
What is the point of bringing these silly comments that say nothing over from cheap news sites to Hacker News?
I'm sorry but there are already tons of similarly "imported" comments here that disparage AI and AI artists that similarly add no value to the discussion.
My intention was solely to support the parent in the face of prevalent general critique of what he dabbles in.
I like the Python library that accompanies this: https://github.com/minimaxir/gemimg
I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...
@simonw: slight tangent but super curious how you managed to generate the preview of that gemini-cli terminal session gist - https://gistpreview.github.io/?17290c1024b0ef7df06e9faa4cb37...
is this just a manual copy/paste into a gist with some html css styling; or do you have a custom tool à la amp-code that does this more easily?
I used this tool: https://tools.simonwillison.net/terminal-to-html
I made a video about building that here: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
It works much better with Claude Code and Codex CLI because they don't mess around with scrolling in the same way as Gemini CLI does.
very cool. frequently, i want to share my prompt + session output; this will make that super easy! thanks again for sharing!
I use Gemini CLI on a daily basis. It used to crash often and I'd lose the chat history. I found this tool called ai-cli-log [1] and it does something similar out of the box. I don't run Gemini CLI without it.
[1] https://github.com/alingse/ai-cli-log
The author went to great lengths about open source early on. I wonder if they'll cover the QwenEdit ecosystem.
I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
The Qwen-Edit images from my GenAI Image Editing Showdown site were all generated from a ComfyUI workflow on my machine - it's shockingly good for an open-weight model. It was also the only model that scored a passing grade on the Van Halen M&M test (even compared against Nanobanana)
https://genai-showdown.specr.net/image-editing
Ha I created a Van Halen M&M test for text prompts. I would include an instruction demanding that the response contain <yellow_m&m> and <red_m&m> but never <brown_m&m>. Then I would fail any llm that did not include any m&ms, or if they wrote anything about the <brown_m&m> in the final output.
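The check itself is trivial to automate; a sketch using the tag names from the comment above (the pass/fail rules follow what is described there):

```python
# Sketch of the text-prompt "Van Halen M&M" check: the output must contain the
# yellow and red tags and must never mention the brown one.
def passes_mm_test(output: str) -> bool:
    has_required = "<yellow_m&m>" in output and "<red_m&m>" in output
    mentions_brown = "<brown_m&m>" in output
    return has_required and not mentions_brown

assert passes_mm_test("... <yellow_m&m> ... <red_m&m> ...")
assert not passes_mm_test("<yellow_m&m> <red_m&m> and a sneaky <brown_m&m>")
```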
> I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
For imagegen, agreed. But for textgen, Kimi K2 thinking is by far the best chat model at the moment from my experience so far. Not even "one of the best", the best.
It has frontier level capability and the model was made very tastefully: it's significantly less sycophantic and more willing to disagree in a productive, reasonable way rather than immediately shutting you out. It's also way more funny at shitposting.
I'll keep using Claude a lot for multimodality and artifacts, but much of my usage has shifted to K2. Claude's sycophancy in particular is tiresome. I don't use ChatGPT/Gemini because they hide the raw thinking tokens, which is really cringe.
Claude Sonnet 4.5 doesn't even feel sycophantic (in the 4o) way, it feels like it has BPD. It switches from desperately agreeing with you to moralizing lectures and then has a breakdown if you point out it's wrong about anything.
Also, yesterday I asked it a question and after the answer it complained about its poorly written system prompt to me.
They're really torturing their poor models over there.
It rubs the data on its skin or else it gets the prompt again!
I've been keeping an eye on Qwen-Edit/Wan 2.2 shenanigans and they are interesting; however, actually running those types of models is too cumbersome, and in the end it's unclear if it's actually worth it over the $0.04/image for Nano Banana.
I was skeptical about the notion of running similar models locally as well, but the person who did this (https://old.reddit.com/r/StableDiffusion/comments/1osi1q0/wa... ) swears that they generated it locally, just letting a single 5090 crunch away for a week.
If that's true, it seems worth getting past the 'cumbersome' aspects. This tech may not put Hollywood out of business, but it's clear that the process of filmmaking won't be recognizable in 10 years if amateurs can really do this in their basements today.
Neural Viz has been putting out some extremely high quality content recently, these seem to be the closest I've seen to approaching Hollywood level:
https://www.youtube.com/watch?v=5bYA2Rv2CQ8
https://www.youtube.com/watch?v=rfTnW8pl3DE
Takes a couple mouse clicks in ComfyUI
On that subject - ComfyUI is not the future of image gen. It's an experimental rope bridge.
Adobe's conference last week points to the future of image gen. Visual tools where you mold images like clay. Hands on.
Comfy appeals to the 0.01% that like toolkits like TouchDesigner, Nannou, and ShaderToy.
Got a link handy to a video of what you're referring to from Adobe's conference? Gave it a quick google but there's a lot of content. Thanks!
They demoed a ton of new features in various stages of completion. Some of them are already production-grade and are being launched soon.
https://www.youtube.com/watch?v=YqAAFX1XXY8 - dynamic 3D scene relighting is insane, check out the 3:45 mark.
https://www.youtube.com/watch?v=BLxFn_BFB5c - molding photos like clay in 3D is absolutely wild at the 3:58 mark.
I don't have links to everything. They presented a deluge of really smart editing tools and gave their vision for the future of media creation.
Tangible, moldable, visual, fast, and easy.
Thank you! Will take a look. That's really exciting.
I just merged the PR and pushed 0.3.1 to PyPI. I also added README documentation and allowed for a `gemimg` entrypoint to the CLI via project.scripts as noted elsewhere in the thread.
Any reason for not also adding a project.scripts entry for pyproject.toml? That way the CLI (great idea btw) could be installed as a tool by uv.
I decided to avoid that purely to keep changes made to the package as minimal as possible - adding a project.scripts entry means installing it adds a new command alias. My approach changes nothing other than making "python -m gemimg" do something useful.
I agree that a project.scripts would be good but that's a decision for the maintainer to take on separately!
Use Google AI Studio to submit requests. To remove the watermark, open the browser developer tools, right-click on the request for the “watermark_4” image, and select to block it. From the next generation onward there will be no watermark!
So the watermark is being added to the image on the client-side? That's pretty bad
That sounds dangerous honestly. Watermarks should be mandatory for AI generated images.
This only applies to the visible watermark on the corner, which you could crop anyways. If I’m not mistaken, all images generated by Google models have an invisible watermark: https://deepmind.google/models/synthid/
How would you enforce that when it’s actually important? Any “bad actor” could just open Photoshop and remove it. Or run a de-lobotomized model which doesn’t watermark.
Can't believe that worked, thanks!
Good read minimaxir! From the article:
> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7b LLM in-between that takes a given prompt as an input and creates 4 variations of it before sending them all out.
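For anyone curious, the fan-out step is roughly this shape; `rewrite` and `generate_image` are hypothetical stand-ins for the Mistral 7B call and whichever image API the pipeline uses:

```python
# Sketch of the prompt fan-out: one input prompt becomes four LLM-written
# variations, each sent to the image model. Both callables are stand-ins.
def fan_out(prompt, rewrite, generate_image, n_variations=4):
    variations = [
        rewrite(f"Rewrite this image prompt in different words, keeping the "
                f"subject and composition. Variation {i + 1} of {n_variations}:\n"
                f"{prompt}")
        for i in range(n_variations)
    ]
    return [(v, generate_image(v)) for v in variations]
```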
> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.
This is true - though I find it works better by providing a minimum of two images. The first image is intended to be transformed, and the second image is used as "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor text description of the style.
[1] - https://mordenstar.com/portfolio/zeno-paradox
It might also be an explicit guard against Studio Ghibli specifically after the "make me Ghibli" trend a while back, which upset Studio Ghibli (understandably so).
It happens with other styles. The demo documentation example which attempts to transfer an image into the very-public-domain Starry Night by Van Gogh doesn't do a true style transfer: https://x.com/minimaxir/status/1963429027382694264
Ah interesting! Thanks for the clarification. Great article :)
The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!
This looks like it's caused by 99% of the relative directions in image descriptions being given from the viewer's point of view, and by the fact that 99% of the ones that aren't refer to a human and not to a skull-shaped pancake.
I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."
Exactly what I was thinking too. I'm a designer, and I'm used to receiving feedback and instructions. "The left eye socket" would to me refer to what I currently see in front of me, while "its left eye socket" instantly shift the perspective from me to the subject.
I find this interesting. I've always described things from the users point of view. Like the left side of a car, regardless of who is looking at it from what direction, is the driver side. To me, this would include a body.
Spend some time at sea, learn why a ship has no right or left side.
I picked up on that also. I feel that a lot of humans would also get confused about whether you mean the eye on the left, or the subject's left eye.
To be honest this is the sort of thing Nano Bannana is weak at in my experience. It's absolutely amazing - but doesn't understand left/right/up/down/shrink this/move this/rotate this etc.
To demonstrate this weakness with the same prompts as the article, see the link below, which shows that it is a model weakness and not just a language ambiguity:
https://gemini.google.com/share/a024d11786fc
Mmh, ime you need to discard the session/rewrite the failing prompt instead of continuing and correcting on failures. Once errors occur you've basically introduced a poison pill which will continuously make things go haywire. Spelling out what it did wrong is the most destructive thing you can do - at least in my experience
Almost no image/video models can do "upside-down" either.
to the point where you can say, raise the left arm and then raise the right arm and get the same image with the same arm raised.
I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later in the post.
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.
The lack of proper indentation (which you noted) in the Python fib() examples was even more apparent. The fact that both AIs you tested failed in the same way is interesting. I've not played with image generation, is this type of failure endemic?
My hunch in that case is that the composition of the image implied left-justified text which overwrote the indentation rule.
Came to make exactly the same comment. It was funny that the author specifically said that Nano Banana got all five edit prompts correct, rather than noting this discrepancy, which could be argued either way (although I think the "right eye" of a skull should be interpreted with respect to the skull's POV.)
Extroverts tend to expect directions from the perspective of the skull. Introverts tend to expect their own perspective for directions. It's a psychology thing, not an error.
I was kind of surprised by this line:
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
You can't take a highly artistic image and supply it as a style reference. Nano Banana can't generalize to anything not in its training.
Fair enough! I suppose I've avoided that kind of "style transfer" for a variety of reasons, it hadn't even occurred to me that people were still interested in that. And I don't say that to open up debate on the topic, just explaining away my own ignorance/misinterpretation. Thanks
"prompt engineered"...i.e. by typing in what you want to see.
Yes, that is a serious skill. How many of the woes that we see are because people don't know what they want, or are unable to describe it in such a way that others understand it? I believe "prompt engineer" properly conveys how complex communication can be when interacting with a multitude of perspectives, world views, assumptions, presumptions, etc. I believe it works well to counter the over-confidence people have from not paying attention to the gaps that exist between what is said and what is meant.
Yes, obviously a role involving complex communication while interacting with a multitude of perspectives, world views, assumptions, presumptions, etc needs to be called "engineer."
That is why I always call technical writers "documentation engineers," why I call diplomats "international engineers," why I call managers "team engineers," and why I call historians "hindsight engineers."
I believe you're joking here, but I do think it'd be useful to have some engineering background in each of these domains. The number of miscommunications that happen in any domain, due to oversight, presumptions and assumptions is vast. At the very least the terminology will shape how we engage with it, so having an aspirational title like prompt engineer, may influence the level of rigor we apply to it.
I don't think that's the right direction to go in.
Despite needing much knowledge of how a plane's inner workings function, a pilot is still a pilot and not an aircraft engineer.
Just because you know how human psychology works when it comes to making purchase decision and you are good at applying that to sell things, you're not a sales engineer.
Giving something a fake name, to make it seem more complicated or aspirational than it actually is makes you a bullshit engineer in my opinion.
I think what you're describing is more commonly included under epistemology under philosophy, and I agree that it would be a useful background in each of those domains, but for some reason in the last few decades we have downgraded the humanities as less useful.
So Prompt Philosopher/Communicator?
it’s really unclear whether this is satire.
[flagged]
It IS a skill. And most often it is disregarded by those who did not yet conquer it ...
We understand now that we interface with LLMs using natural and unnatural language as the user interface.
This is a very different fuzzy interface compared to programming languages.
There will be techniques better or worse at interfacing.
This is what the term prompt engineering is alluding to since we don’t have the full suite of language to describe this yet.
Not all models can actually do that if your prompt is particular
Most designers can't, either. Defining a spec is a skill.
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
… sure … but also no. For example, say I have an image. 3 people in it; there is a speech bubble above the person on the right that reads "I'A'T AY RO HERT YOU THE SAP!"¹
I give it,
Now sure, specs are hard. Gemini removed the text bubble entirely. Whatever, let's just try again: There's only one red-head in the image; she's the middle character. We get a speech bubble, correctly positioned, but with a sans-serif, Arial-ish font, not Comic Sans. It reads "Hide the vokda" (sic). The facial expression of the middle character has changed. Yes, specs are hard. Defining a spec is hard. But Gemini struggles to follow the specification given. Whole sessions are like this, an absolute struggle to get basic directions followed.
You can even see here that I & the author have started to learn the SHOUT AT IT rule. I suppose I should try more bulleted lists. Someone might learn, through experimentation "okay, the AI has these hidden idiosyncrasies that I can abuse to get what I want" but … that's not a good thing, that's just an undocumented API with a terrible UX.
(¹because that is what the AI on a previous step generated. No, that's not what was asked for. I am astounded TFA generated an NYT logo for this reason.)
You're right, of course. These models have deficiencies in their understanding related to the sophistication of the text encoder and its relationship to the underlying tokenizer.
Which is exactly why the current discourse is about 'who does it best' (IMO, the flux series is top dog here. No one else currently strikes the proper balance between following style / composition / text rendering quite as well). That said, even flux is pretty tricky to prompt - it's really, really easy to step on your own toes here - for example, by giving conflicting(ish) prompts "The scene is shot from a high angle. We see the bottom of a passenger jet".
Talking to designers has the same problem. "I want a nice, clean logo of a distressed dog head. It should be sharp with a gritty feel". For the person defining the spec, they actually do have a vision that fits each criteria in some way, but it's unclear which parts apply to what.
The NYT logo being rendered well makes sense because it's a logo, not a textual concept.
https://habitatchronicles.com/2004/04/you-cant-tell-people-a...
Yep, knowing how and what to ask is a skill.
For anything, even back in the "classical" search days.
at least then, we had hard overrides that were actually hard.
"This got searched verbatim, every time"
W*ldcards were handy
and so on...
Now, you get a 'system prompt' which is a vague promise that no really this bit of text is special you can totally trust us (which inevitably dies, crushed under the weight of an extended context window).
Unfortunately(?), I think this bug/feature has gotta be there. It's the price for the enormous flexibility. Frankly, I'd not be mad if we had less control - my guess is that in not too many years we're going to look back on RLHF and grimace at our draconian methods. Yeah, if you're only trying to build a "get the thing I intend done" machine I guess it's useful, but I think the real power in these models is in their propensity to expose you to new ideas and provide a tireless foil for all the half-baked concepts that would otherwise not get room to grow.
Used to be called Google Fu
... and then iterating on that prompt many times, based on your accumulated knowledge of how best to prompt that particular model.
Case in point, the final image in this post (the IP bonanza) took 28 iterations of the prompt text to get something maximally interesting, and why that one is very particular about the constraints it invokes, such as specifying "distinct" characters and specifying they are present from "left to right" because the model kept exploiting that ambiguity.
Hey, author! Thank you for this post! QQ: any idea roughly how much this experimentation cost you? I'm having trouble processing their image generation pricing; I may just not be finding the right table. I'm just trying to understand: if I do like 50 iterations at the quality in the post, how much is that going to cost me?
All generations in the post are $0.04/image (Nano Banana doesn't have a way to increase the resolution, yet), so you can do the math and assume about 25 images per dollar, i.e. roughly $2 for 50 iterations. Unlike other models, Nano Banana does charge for input tokens, but that cost is negligible.
Discounting the testing around the character JSON which became extremely expensive due to extreme iteration/my own stupidity, I'd wager it took about $5 total including iteration.
right? 15 months ago in image models you used to have to designate rendering specifications, and know the art of negative prompting
now you can really use natural language, and people want to debate you about how poor they are at articulating shared concepts, amazing
it's like the people are regressing and the AI is improving
"amenable to highly specific and granular instruction"
My personal project is illustrating arbitrary stories with consistent characters and settings. I've rewritten it at least 5 times, and Nano Banana has been a game-changer. My kids are willing to listen to much more sophisticated stories as long as it has pictures, so I've used it to illustrate text like Ender's Game. Unfortunately, it's getting harder to legally acquire books in a format you can feed to an LLM.
I first extract all the entities from the text, generate characters from an art style, and then start stitching them together into individual illustrations. It works much better with NB than anything else I tried before.
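Roughly, the shape of that pipeline is something like the sketch below; every helper here is a hypothetical stand-in for a separate LLM or image-model call:

```python
# Sketch of the illustration pipeline described above; all helpers are
# hypothetical stand-ins for separate LLM / image-model calls.
def illustrate(text, extract_entities, make_character_sheet, split_scenes,
               render_scene, art_style):
    # 1. Lock each character to a reusable sheet so they stay consistent.
    sheets = {name: make_character_sheet(name, desc, art_style)
              for name, desc in extract_entities(text).items()}
    # 2. Render each scene, attaching only the sheets of characters present.
    images = []
    for scene in split_scenes(text):
        present = [sheet for name, sheet in sheets.items() if name in scene]
        images.append(render_scene(scene, present, art_style))
    return images
```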
> so I've used it to illustrate text like Ender's Game
That sounds interesting. Could you share?
This works with the openrouter API as well, which skips having to make a google account etc. Here's a Claude-coded openrouter compatible adaptation which seems to work fine: https://github.com/RomeoV/gemimg
A 1024x1024 image seems to cost about 3ct to generate.
The minimaxir/gemimg repo is pretty cool, fwiw.
Going further, one thing you can do is give Gemini 2.5 a system prompt like the following:
https://goto.isaac.sh/image-prompt
And then pass Gemini 2.5's output directly to Nano-Banana. Doing this yields very high-quality images. This is also good for style transfer and image combination. For example, if you then give Gemini 2.5 a user prompt that looks something like this:
You can get aesthetic consistently-styled images, like these: https://goto.isaac.sh/image-style-transfer
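A sketch of that two-step chain with the google-genai SDK; the actual system prompt and user prompt live at the links above, so the strings here are placeholders, and the model ids may differ:

```python
# Sketch of chaining a text model (prompt expansion) into the image model.
# SYSTEM_PROMPT and user_request are placeholders for the linked prompts.
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client()

SYSTEM_PROMPT = "...image-prompt-writing instructions from the link above..."
user_request = "a lighthouse keeper's cottage at dusk, consistent storybook style"

expanded = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=user_request,
    config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
).text

image_resp = client.models.generate_content(
    model="gemini-2.5-flash-image",       # Nano Banana; the exact id may differ
    contents=expanded,
)
for part in image_resp.candidates[0].content.parts:
    if part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save("styled.png")
```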
>it is possible to generate NSFW images through Nano Banana—obviously I cannot provide examples.
It is in fact not at all obvious why you can't.
The American prudishness continues to boggle my mind.
Photo-realism is great but the real step-jump in image-gen I’m looking for is the ability to draw high quality technical diagrams with a mix of text and images, so I can stop having LLMs generate crappy diagrams with mermaid, SVG, HTML/CSS, draw.io
I tried asking for a shot from a live-action remake of My Neighbor Totoro. This is a task I’ve been curious about for a while. Like Sonic, Totoro is the kind of stylized cartoon character that can’t be rendered photorealistically without a great deal of subjective interpretation, which (like in Sonic’s case) is famously easy to get wrong even for humans. Unlike Sonic, Totoro hasn’t had an actual live-action remake, so the model would have to come up with a design itself. I was wondering what it might produce – something good? something horrifying? Unfortunately, neither; it just produced a digital-art style image, despite being asked for a photorealistic one, and kept doing so even when I copied some of the keyword-stuffing from the post. At least it tried. I can’t test this with ChatGPT because it trips the copyright filter.
Great article!
Regarding the generated cat image:
> Each and every rule specified is followed.
Not quite; the eye color and heterochromia are followed only so-so.
The black-and-silver cat seems to have no heterochromia; eye color could be interpreted as silver though.
The white-and-gold cat _does_ have heterochromia. The colors can be interpreted as "white" and "gold", though I'd describe them as whitish-blue and orange. What's interesting about this is an adjustment of the instructions toward biologically more plausible eye colors in the cat which also has more natural fur colors.
The last cat's fur colors are so "implausible" that the model doesn't seem to have problems taking exactly those colors for the (heterochromatic) eyes too!
It's really nice to see long-form, obviously human-written blogs from people deep into the LLM space - maybe us writers will be around for a while still in spite of all the people saying we've been replaced.
I've started increasing the number of jokes in my blog posts to make it sound more obviously human-written: to be honest I was expecting some "why is this so unserious" complaints.
In other words: you show your "personality".
AI can't do that (yet?).
Can't make everyone happy!
Kinda like paper newspapers. In some ways it's "not optimal", but in many ways it's irreplaceable.
It's really cool how good of a job it did rendering a page given its HTML code. I was not expecting it to do nearly as well.
Same. This must have training from sites that show html next to screenshots of the pages.
In my own experience, nano banana still has the tendency to:
- make massive, seemingly random edits to images
- adjust image scale
- make very fine grained but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
Has anyone had a better experience?
The "ALL CAPS" part of your comment got me thinking. I imagine most llms understand subtle meanings of upper case text use depending on context. But, as I understand it, ALL CAPS text will tokenize differently than lower case text. Is that right? In that case, won't the upper case be harder to understand and follow for most models since it's less common in datasets?
There's more than enough ALL CAPS text in the corpus of the entire internet, and enough semantic context associated with it, for the model to understand it as emphasis in the imperative voice.
Shouldn't all caps be normalised to the same tokens as lower case? There are no separate tokens for all caps and lower case in Llama, or at least there weren't in the past.
Looking at the tokenizer for the older Llama 2 model, the tokenizer has capital letters in it: https://huggingface.co/meta-llama/Llama-2-7b-hf
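It's easy to check how a BPE tokenizer treats case; a quick sketch with tiktoken as a stand-in (Gemini's own tokenizer isn't available in this form, so the exact counts will differ):

```python
# Quick look at how a GPT-style BPE tokenizer (used here only as a stand-in)
# splits the same instruction in lower case vs ALL CAPS.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["do not add a fireplace to the room",
             "DO NOT ADD A FIREPLACE TO THE ROOM"]:
    tokens = enc.encode(text)
    print(f"{len(tokens):2d} tokens: {text}")
```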
I work on the PixLab prompt based photo editor (https://editor.pixlab.io), and it follows exactly what you type with explicit CAPS.
I like to use these AI models for generating mockup screenshots of game. I can drop a "create a mockup screenshot of a steampunk 2D platformer in which you play as a robot" and it will give me some interesting screenshot. Then I can ask it to iterate on the style. Of course it's going to be broken in some ways and it's not even real pixel art, but it gives a good reference to quickly brainstorm some ideas.
Unfortunately I have to use ChatGPT for this, for some reason local models don't do well with such tasks. I don't know if it's just the extra prompting sauce that ChatGPT does or just diffusion models aren't well designed for these kind of tasks.
For images of people generated from scratch, Nano Banana always adds a background blur, it can't seem to create more realistic or candid images such as those taken via a point and shoot or smartphone, has anyone solved this sort of issue? It seems to work alright if you give it an existing image to edit however. I saw some other threads online about it but I didn't see anyone come up with solutions.
Maybe try including “f/16” or “f/22” as those are likely to be in the training set for long depth of field photos.
I tried that but they don't seem to make much difference for whatever reason, you still can't get a crisp shot such as this [0] where the foreground and background details are all preserved (linked shot was taken with an iPhone which doesn't seem to do shallow depth of field unless you use their portrait mode).
[0] https://www.lux.camera/content/images/size/w1600/2024/09/IMG...
Those are rarely in the captions for the image. They'd have to extract the EXIF for photos and include it in recaptioning. Which they should be doing, but I doubt they thought about it.
Photo sites like Flickr do extract EXIF data and show it next to the image, but who knows if the scraping picked them up.
Looks like specific f-stops don't actually make a difference for stable diffusion at least: https://old.reddit.com/r/StableDiffusion/comments/1adgcf3/co...
Nano Banana can be frustrating at times. Yesterday I tried to get it to do several edits to an image, and it would return back pretty much the same photo.
Things like: Convert the people to clay figures similar to what one would see in a claymation.
And it would think it did it, but I could not perceive any change.
After several attempts, I added "Make the person 10 years younger". Suddenly it made a clay figure of the person.
The first request is a style transfer, which is why I included the Ghibli failure example.
I've gotten it to make Ghibli transfers by responding to the initial attempt with "I can barely tell the difference. Make the effect STRONGER."
In my experience, once it starts interpreting your request incorrectly, you're better off starting with fresh context.
The blueberry and strawberry are not actually where they were prompted to be.
I use it for technical design docs, where I sketch something out on paper and ask nano banana to turn it into a flow chart; it's incredibly good at this kind of editing. (Also, if you want to borrow a diagram from someone and change some of the connections, that's usually hard because it's an embedded image, but nano banana solves that.)
There's a lot these models can do, but I despise it when people suggest they can do edits "with only the necessary aspects changed".
No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details in the "unchanged" aspects. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.
Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data, there's a lot of nuance that can be lost as the model regenerates them, no matter how you prompt.
Nano Banana is different and much better at edits without changing texture/lighting/sharpness/color balance, and I am someone who is extremely picky about that. That's why I added the note that Gemini 2.5 Flash is aware of segmentation masks; my hunch is that's why this is the case.
Nano banana has a really low spatial scaling and doesn't affect details like other models.
That is true for gpt-image-1 but not nano-banana; it can do masked image changes.
Could you just mask out the area you wish to change in more advanced tools, or is there something in the model itself which would prevent this?
That's probably where things are headed and there are already products trying this (even photoshop already). Just like how code gen AI tools don't replace the entire file on every prompt iteration.
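In the meantime you can fake it yourself: let the model regenerate the whole frame, then composite its output back only inside a mask, so every pixel outside the mask really is untouched. A minimal sketch assuming Pillow and a hand-drawn mask image:

    from PIL import Image, ImageFilter

    original = Image.open("original.png").convert("RGB")
    edited = Image.open("nano_banana_output.png").convert("RGB")
    mask = Image.open("mask.png").convert("L")  # white = region the model may change

    # Outputs sometimes come back at a slightly different size; snap back first.
    edited = edited.resize(original.size)

    # Feather the mask so the seam between original and edited pixels doesn't show.
    soft_mask = mask.filter(ImageFilter.GaussianBlur(radius=4))

    # Pixels where the mask is fully black are taken verbatim from the original.
    result = Image.composite(edited, original, soft_mask)
    result.save("masked_edit.png")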
That article says that most image generators had been overshadowed by GPT.
Yet when I give it simple tasks, like producing a 16:9 image instead of a square one, it ends up putting a 16:9 picture on a white background that pads it back out to a square.
When I ask it to include text, then on a second request ask it to redo the image while changing just one visual element, it ends up breaking the text it previously got right.
It's getting better at flattering people and telling them how clever and right they are than at actually doing the task.
It doesn't say that gpt is better, just that it is more popular
> It's getting better at flattering people and telling them how clever and right they are than at actually doing the task.
I haven't (knowingly) used an LLM in a long time. Is the above true?
Nope
The kicker for nano banana is not prompt adherence, which is a really nice-to-have, but the fact that it's either working in pixel space or with a really low spatial scaling factor. It's the only model that doesn't kill your details with the VAE encode/decode.
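If you want to see how much the encode/decode round-trip alone costs in latent-space models, here is a quick sketch using a standard open Stable Diffusion VAE (assuming diffusers, torch, and Pillow; it is just a reconstruction with no generation, and the VAE named here is a common open one, not anything nano banana uses):

    import numpy as np
    import torch
    from diffusers import AutoencoderKL
    from PIL import Image

    # A standard open SD VAE with an 8x spatial downscale.
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

    img = Image.open("photo.png").convert("RGB").resize((512, 512))
    x = torch.from_numpy(np.asarray(img)).float().permute(2, 0, 1)[None] / 127.5 - 1.0

    with torch.no_grad():
        latents = vae.encode(x).latent_dist.mean   # 512x512x3 -> 64x64x4
        recon = vae.decode(latents).sample[0].clamp(-1, 1)

    out = ((recon + 1) * 127.5).round().byte().permute(1, 2, 0).contiguous().numpy()
    Image.fromarray(out).save("vae_roundtrip.png")  # diff this against the input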
>> "The image style is definitely closer to Vanity Fair (the photographer is reflected in his breastplate!)"
I didn't expect that. I would have definitely counted that as a "probably real" tally mark if grading an image.
It was very interesting; I liked your style of explaining things both at the user level and at a more technical level.
Very cool post, thanks for sharing!
> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
I had no idea that the context window was so large. I’d been instinctively keeping my prompts small because of experience with other models. I’m going to try much more detailed prompts now!
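If you are also recalibrating, one way to sanity-check how much of that window a detailed prompt actually uses is the token counter in the API (a sketch assuming the google-genai SDK; the model name is an assumption, and any Gemini model gives a comparable count):

    from google import genai

    client = genai.Client()

    prompt = (
        "A wide establishing shot of a small coastal town at dusk, cobblestone "
        "streets still wet from rain, warm lamplight in the windows, a lone "
        "fisherman coiling rope on the pier, gulls circling overhead, shot on "
        "35mm film with soft grain and a muted teal-and-amber grade."
    )

    count = client.models.count_tokens(
        model="gemini-2.5-flash-image-preview", contents=prompt
    )
    print(count.total_tokens)  # even verbose prompts land nowhere near 32,768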
I'm getting annoyed by using "prompt engineered" as a verb. Does this mean I'm finally old and bitter?
(Do we say we software engineered something?)
I think it’s meant to be engineering in the same sense as “social engineering”.
You're definitely old and bitter, welcome to it.
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
Don't get me wrong, I have nothing against using AI as an expression of creativity :)
Create? So I have created all that code I'm running on my site? Yes, it's bad, I know, but thank you very much! Such a creative guy I was!
Not really since "prompt engineering" can be tossed in the same pile as "vibe coding." Just people coping with not developing the actual skills to produce the desired products.
Try getting a small model to do what you want quickly with high accuracy, high quality, etc, and using few tokens per request. You'll find out that prompt engineering is real and matters.
Couldn't care less. I don't need to know how to do literally everything. AI fills in my gaps and I'm a ton more productive.
I wouldn't bother trying to convince people who are upset that others have figured out a way to use LLMs. It's not logical.
No it means you can still discern what is BS.
> Nano Banana is still bad at rendering text perfectly/without typos as most image generation models.
I figured out that if you write the text in Google Docs and share the screenshot with banana, it will not make any spelling mistakes.
So something like "can you write my name on this Wimbledon trophy, both images are attached. Use them" will work.
Google's example documentation for Nano Banana does demo that pipeline: https://ai.google.dev/gemini-api/docs/image-generation#pytho...
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
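Roughly what I have in mind, as a sketch (the font path and model name here are assumptions):

    from PIL import Image, ImageDraw, ImageFont
    from google import genai

    # 1) Render the exact text deterministically, so no typos are possible here.
    text_img = Image.new("RGB", (800, 200), "white")
    draw = ImageDraw.Draw(text_img)
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 96)  # font path is an assumption
    draw.text((20, 40), "JANE DOE", font=font, fill="black")

    # 2) Pass both images and ask the model to composite them.
    client = genai.Client()
    trophy = Image.open("trophy.jpg")  # hypothetical target image
    response = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[
            "Engrave the text shown in the second image onto the trophy in the "
            "first image, matching its perspective, material, and lighting.",
            trophy,
            text_img,
        ],
    )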
Yeah, close.
But it is still generating it with a prompt
> Logo: "A simple, modern logo with the letters 'G' and 'A' in a white circle.
My idea was to do it manually so that there are no probabilities involved.
Though your idea of using Python is the same.
I need to give this a shot for turning written stories into comics. Seems like the technology is finally there.
Well, I just asked it for a 13-sided irregular polygon (is it that hard?)…
https://imgur.com/a/llN7V0W
I really wish the real expert stuff, like how to use ControlNet, regional prompting, or most other advanced ComfyUI features, got upvoted to the top instead.
Great post with some nice insights I could've used a few days ago!
I was trying to create a simple "mascot logo" for my pet project. I first created an account on Kittl [0] and even paid for one month but it was quite cumbersome to generate images until I figured out I could just use the nano banana api myself.
It took me 4 prompts to ai-slop a small Python script I could run with uv that would generate a specified number of images from a given prompt (where I discovered some of the insights the author shows in their post). The resulting logo [1] was pretty much what I imagined. I manually added some text and played around with hue/saturation in Kittl (since I already paid for it :)) et voilà.
Feeding back the logo to iterate over it worked pretty nicely and it even spit out an "abstract version" [2] of the logo for favicons and stuff without a lot of effort.
All in all this took me 2 hours and around $2 (excluding the one-month Kittl subscription), and I would've never been able to draw something like that in Illustrator or similar.
[0] https://www.kittl.com/ [1] https://github.com/sidneywidmer/yass/blob/master/client/publ... [2] https://github.com/sidneywidmer/yass/blob/master/client/publ...
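The script was essentially the following (a rough sketch rather than the exact one I ran; the model name and environment variable are assumptions). Run it with something like: uv run gen.py "a friendly owl mascot logo, flat vector style" 10

    # /// script
    # dependencies = ["google-genai", "pillow"]
    # ///
    import sys
    from io import BytesIO

    from google import genai
    from PIL import Image

    prompt = sys.argv[1]
    n = int(sys.argv[2]) if len(sys.argv) > 2 else 4

    client = genai.Client()  # expects GEMINI_API_KEY in the environment
    for i in range(n):
        resp = client.models.generate_content(
            model="gemini-2.5-flash-image-preview", contents=prompt
        )
        for part in resp.candidates[0].content.parts:
            if part.inline_data:
                Image.open(BytesIO(part.inline_data.data)).save(f"logo_{i:02d}.png")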
I found this well written. I read it start to finish. The author does a good job of taking you through their process
Created a tool you can try out! Sorry to self-plug, but next week I'm launching on Product Hunt a tool that lets you do this :)
www.brandimagegen.com
if you want a premium account to try out, you can find my email in my bio!!
Perhaps I'm childish, but nano banana = tiny penis.
> It’s one of the best results I’ve seen for this particular test, and it’s one that doesn’t have obvious signs of “AI slop” aside from the ridiculous premise.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
I haven't paid much attention to image generation models (not my area of interest), but these examples are shockingly good.
Regarding buzzword usage:
"YOU WILL BE PENALIZED FOR USING THEM"
That is disconcerting.
Cute. What’s the use case?
NSFW, mostly
how did you do NSFW?
Another thing it can't do is remove reflections in windows; it's nearly a no-op.
This article was a good read, but the writer doesn't seem to understand how model-based image generation actually works, using language that suggests the image is somehow progressively constructed the way a human would do it. Which is absurd.
I've noticed a lot of this misinformation floating around lately, and I can't help but wonder if it's intentional?
I'm not sure what you're implying is incorrect/misleading. As noted in the post, autoregressive models like Nano Banana and gpt-image-1 generate by token (and each generated token attends to all previous tokens, both text and image) which are then decoded, while diffusion models generate the entire image simultaneously, refined over n iteration steps.
[flagged]
How meta. This comment is clearly written by AI.
I don't feel like I should search for "nano banana" on my work laptop
lots of words
okay, look at imagen 4 ultra:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
In this link, Imagen is instructed to render the verbatim text "the result of 4+5", and it shows that text; when it is not so instructed, it renders "4+5=9".
Is Imagen thinking?
Let's compare to gemini 2.5 flash image (nano banana):
look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini is instructed to reply with images first and, if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the verbatim text "the result of 4+5" without showing the answer "4+5=9". Of course it can show whatever exact text you want; the question is, does it rewrite the prompt (no) or do something else (yes)?
compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0
without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.
Can you provide screenshots or links that don't require a login?
Sorry, but I don't understand your post. Those links don't work.