DALL·E 2 can create original, realistic images and art from a text description. It can combine concepts, attributes, and styles.
We track all of the generative neural net research pretty closely here. I'm happy to talk about this stuff as much or as little as people are interested in hearing. Feel free to ask questions in this thread if you want more info about anything.
OpenAI announced DALL-E-2 the other week. Like DALL-E, it is a closed code base. Unlike DALL-E, they are going to provide an API to developers. But even then you can't just use it; you need to apply and wait for approval. We applied to check out the initial GPT-3 API release right after it was announced, and they finally approved it almost a year later, so this will probably be a very slow boat again.
To be honest, what is happening in the open source community with this stuff has the potential to be much more exciting in the long run (short run as well). After DALL-E was announced last spring, the open source community dived into trying to replicate what they were doing, and in the process really extended the whole concept of what could be done with it and how it could be used. So much so that the design of DALL-E-2 is influenced by that work to some extent. And that same open source community is now diving into working through what is different about DALL-E-2 and how to replicate it.
At the same time, the open source community is working to put together extremely large multi-modal (text-image in this case) database models to use for this kind of research that are also open source. They just announced a 5 billion text-image pair database very recently, and are working to put together open source CLIP models based on their open source datasets.
The basic idea behind all of this work is to combine a CLIP neural net model with a generative image synthesis neural net model.
If you show an image to a CLIP model along with a set of candidate text descriptions, it can basically score how well each description matches the image, which you can read as a rough probability measure (80% a cat, 10% royal creature, etc). The CLIP model can be used to evaluate the output of a generative image synthesis neural net model (evaluation with respect to some text tag(s), so for example my generated image from the neural net is only 30% likely to be a cat).
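To make the scoring idea concrete, here is a minimal sketch in plain Python. It is not OpenAI's code and does not load a real CLIP model: the three-number embedding vectors, the labels, and the temperature value are all made up for illustration. It only shows the general shape of the calculation, which is cosine similarity between an image embedding and each text embedding, turned into percentages with a softmax.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_style_scores(image_vec, text_vecs, temperature=0.1):
    """Turn image/text similarities into scores that sum to 1 (softmax)."""
    logits = {label: cosine(image_vec, vec) / temperature
              for label, vec in text_vecs.items()}
    biggest = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - biggest) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical embeddings; a real model would produce vectors with
# hundreds of dimensions from the actual image and text.
image_embedding = [0.9, 0.1, 0.3]
text_embeddings = {
    "a cat": [1.0, 0.0, 0.2],
    "a royal creature": [0.2, 0.1, 1.0],
    "a harbor": [0.0, 1.0, 0.0],
}

scores = clip_style_scores(image_embedding, text_embeddings)
best_label = max(scores, key=scores.get)
for label, score in scores.items():
    print(f"{label}: {score:.1%}")
```

The score for a single tag is the number you would then use as the evaluation of a generated image against that tag.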
This evaluation of the generated image with respect to some textual tag phrase can then be used to compute an error signal that you feed back into the image generation neural net. Because the generative synthesis part is all differentiable, the evaluation error can be back-propagated through the generative model, and in that process the internal parameter values of the generative model are incrementally adjusted based on that back-propagated error signal.
So the system is tweaking its internal parameter values in a way that hopefully reduces the evaluation error over time. Reducing the error means generating images that look more like the textual tag phrase describing the image you want the system to generate.
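Here is a toy version of that feedback loop, again in plain Python. Everything in it is an assumption made for illustration: the "generator" is just a small fixed linear map, the error is a squared distance to a made-up target embedding rather than a real CLIP score, and the gradient is written out by hand instead of coming from a framework's automatic back-propagation. The point is only the loop structure: generate, evaluate, push the error back, nudge the parameters, repeat.

```python
def generate(params, weights):
    """Toy 'generator': a fixed linear map from parameters to an embedding."""
    return [sum(w * p for w, p in zip(row, params)) for row in weights]


def error(embedding, target):
    """Evaluation error: squared distance to the target text embedding."""
    return sum((e - t) ** 2 for e, t in zip(embedding, target))


def step(params, weights, target, learning_rate=0.1):
    """One gradient-descent update of the parameters."""
    residual = [e - t for e, t in zip(generate(params, weights), target)]
    # Hand-derived gradient of the squared error: 2 * W^T * residual
    grads = [
        2 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
        for j in range(len(params))
    ]
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Hypothetical numbers, chosen only so the loop has something to work on.
weights = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.0]]
target = [1.0, 0.0, 0.2]   # stand-in for the embedding of the text prompt
params = [0.0, 0.0]        # the values the loop is allowed to tweak

history = []
for _ in range(50):
    history.append(error(generate(params, weights), target))
    params = step(params, weights, target)

print(f"error at start: {history[0]:.3f}, error at end: {history[-1]:.3f}")
```

In the real systems the hand-written gradient is replaced by back-propagation through the generative model and the CLIP evaluation.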
There are various generative image synthesis neural net architectures. They all are constructed differently internally. And they all have distinctive visual looks associated with them. You can basically take any of them and drop them into that part of the system. So you could configure your own version of this 'art strategy' by choosing a specific CLIP-like model for the evaluation of the image generator output, and then feed that back into the generative model of your choice.
The original DALL-E work used a VQVAE neural net model for the generative image synthesis.
The DALL-E-2 work uses a diffusion neural net model for the generative image synthesis.
There are a number of open source variants of this approach that use a VQGAN neural net model for the generative image synthesis.
OpenAI did provide an open source trained data model for the CLIP architecture when they announced DALL-E. But they never released the actual data they used to train it.
A year later we find ourselves with an open source CLIP model trained on an open source image database with the same number of images, and the possibility of training new open source CLIP models on a much larger open source database.
So the state of the art and what is possible with this stuff is changing at a very fast rate.
Since this forum is for artists, let me reiterate that in my experience working with different variations of this system, all of the components have visually distinct idiosyncrasies and a stylized appearance, both in terms of what the image generator can synthesize and in how the textual prompt tagging modifies that image synthesis behavior.
Another factor is that all of these systems tend to output very small resolution synthesized images. Like 256x256 for many of them. Sometimes 512x512, but then you start to potentially run into memory issues on the GPU you are using.
A cynical viewpoint would be that all of this is really just an ad to get you to pay for cloud computing charges. Yes, you can in theory run some of this stuff for free on Colab, but i can guarantee you that if you try to work with it for more than a few hours, Colab is going to aggressively hobble your usage until you pony up a minimum of $120 a year in recurring credit card charges to Google (or whoever else you end up using for the cloud computing part).
On the desktop you are really going to need a GeForce RTX 3080 at the minimum to pull it off. Which immediately brings you to the dilemma that Apple hates Nvidia and Nvidia GPUs are apparently banned from the Mac, so this solution is a Windows or Linux only solution.
In theory we should all be excited about those new Mac Studio machines with all of the GPU core support on them, but you can't actually run the PyTorch models on them yet as far as i can tell. There are rumors about that maybe being addressed in June at WWDC, but i'm not holding my breath.
I also wanted to briefly touch on something associated with curation of 'bad' images on these systems. OpenAI for example has a huge effort to sanitize toxic imagery or text from their DALL-E 2 system. And of course we all know that the history of art has been one of dancing smiling kittens, and that artists have never addressed difficult topics, emotions, or imagery in their artwork, oh wait....
So this is another factor that is going to be associated with the appearance of the visual output of these systems, certainly large corporate ones like OpenAI or Microsoft, which is effectively who you will be dealing with if you use OpenAI APIs for anything.
So there's a few words on a quite fascinating and technically very deep topic. Again, i'm happy to post more on all of this, and i'm curious how people are using it, or envision using it. There is a lot to talk about. How the various approaches work. Implications of sanitizing output content from an artistic perspective. How practical is this approach really for working artists trying to do specific things. Where might things be headed in the next few years. What you might want to do differently.
I should point out that the OpenAI image examples are cherry picked. Actually using these systems is a whole other thing, something i'm still getting a handle on to be honest.
Thank you, I am very happy to find out that you are closely monitoring this area. SA deserves to always be at the highest level.
It's great to see the interest and movement here in this area. Can't figure out if I feel queasy or excited about what is coming, but sooner or later, mostly sooner, anyone in the arts will have to somehow reckon with AI and its impact on the field. Among the projects I've seen recently, one of the more interesting ones I've come across is prosepainter.com. I can imagine combining SA's general approach with some kind of text driven controlled inpainting. Despite the DALL-E 2 hype, I think diffusion based models like the latest LAION have a lot of creative promise, especially if integrated with how SA does things.
Yes, how we integrate all of this stuff into the Studio Artist workflow is an interesting question.
An analogy i like to use is to look at how people compose digital music. In the olden days one would have to use some low level thing like CSound to do it. But of course no one except some academic electronic music people actually works that way. Everyone else uses much higher level interfaces more attuned to how people play and think about musical phrasing and structure.
Generative neural image synthesis is kind of at the CSound level as far as how people are currently working with it.
OK so you just robbed me of years of my life by introducing me to prosepainter.com ! hellllp! 1. "birds", 2. "sleep"
I'll try to start posting some specific examples of the different generative image synthesis techniques i talked about earlier over time.
Here's an example of a generative synthesized image from something i was working on yesterday that uses latent diffusion with the LAION-400M public data model. It's from a video that riffs on 'The End of Time' as a children's book as narrated by an evil robot.
This particular generative model implementation is good at this kind of thing. Not so good with specific references to things like the local Hawaiian theme tinged imagery i like to mess with, probably because that kind of image data is not very well represented in the image database used for the training.
Diffusion based generative image synthesis can be thought of as follows. Think about starting with an image and then gradually adding random noise to it. Eventually you end up with an image that just looks like noise. You can then invert that process to go from an image of noise to an image that looks like something. The key is to add the noise in a differentiable way, because then you can back-propagate to learn how to go the other direction (noise to image).
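The "gradually adding noise" half of that description can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-number "image" is made up, and the linear noise schedule (1000 steps, per-step noise amounts running from 0.0001 to 0.02) is the one commonly quoted for DDPM-style diffusion models, not something taken from any particular notebook mentioned in this thread. The learned half (a neural net trained to predict and remove the added noise) is not shown.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal variance left after each step."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Jump straight to a given noise level: blend the image with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(pixels, noise)
    ]
    return noisy, noise


rng = random.Random(0)            # fixed seed so runs are repeatable
image = [0.8, -0.2, 0.5, 0.1]     # hypothetical tiny "image" of four pixel values
alpha_bars = alpha_bar_schedule(1000)

slightly_noisy, _ = add_noise(image, alpha_bars[10], rng)
mostly_noise, _ = add_noise(image, alpha_bars[-1], rng)

print(f"signal kept after step 10: {alpha_bars[10]:.4f}")
print(f"signal kept after the last step: {alpha_bars[-1]:.6f}")
```

By the last step almost none of the original image is left, which is the "image that just looks like noise" end of the process.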
The 'latent' part of latent diffusion means that rather than working with pixels, you are working with vectors in a latent space. Think of the latent space in terms of image compression: the latent space is like the codebook for the compression algorithm (compressed images are generated by combining together different pieces of the codebook to build the overall compressed image). Using the latent space approach has a lot of advantages, speeding things up being one of them, since you are working on groups of pixels together at the same time rather than working on single pixels one by one in the image you are building.
Vector quantization (VQ) used in a lot of these algorithms at its most basic level is like a Studio Artist image brush building a photo mosaic image.
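The photo mosaic analogy maps onto a very small piece of code. The sketch below is a bare-bones illustration of vector quantization, assuming a made-up three-entry codebook and two-number "patches"; in the real models the codebook is learned and the entries are much larger vectors. Each patch is replaced by the index of its closest codebook entry, and the image is rebuilt from those entries, much like picking the best matching tile for each cell of a mosaic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(patch, codebook):
    """Index of the codebook entry closest to the patch."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(patch, codebook[i]))


def quantize(patches, codebook):
    """Replace every patch by its nearest codebook entry."""
    indices = [nearest_code(p, codebook) for p in patches]
    reconstruction = [codebook[i] for i in indices]
    return indices, reconstruction


# Hypothetical codebook and patches, purely for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
patches = [[0.1, 0.2], [0.9, 0.8], [0.7, 0.1]]

indices, reconstruction = quantize(patches, codebook)
print(indices)          # which "tile" was chosen for each patch
print(reconstruction)   # the mosaic built from those tiles
```

The list of indices is the compressed form; only the codebook and the indices are needed to rebuild the approximate image.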
Now i'm going to show you another image generated by the same latent diffusion algorithm and i want you to look at it very closely.
This one was from some experiments where i was trying to generate imagery off of a 'Honolulu harbor sand island' textual description. If you look at this image closely and then think about what is going on in the reconstruction of it, you will gain quite a bit of insight into what is going on under the hood of the algorithm. Think it through.
At that point you might want to start thinking about how one could use the tools you already have to build this kind of thing more directly.
This particular example is a bust as far as the generative synthesis goes off the textual tags i used, but i chose to post it here because its failure gives you a lot of insight into what it is trying to do.
This is not a slag on the neural net generative algorithms, they are fascinating. Same goes for the multi-modal angle on the synthesis (text to image).
But while the new techniques may seem particularly magical at first, once you dive under the hood, or look at the second image above, you can start to see relationships to more elementary things like texture synthesis, image brushes, positional source cloning, all fun stuff you can do today in Studio Artist V5.5 if you set your mind to it.
I'll try to put together a good example of what i'm talking about in this second part, i.e. 'using the tools you already have to tackle this kind of directed image synthesis'. I'll try to post that here later.
I've been deep into Disco Diffusion for the past few weeks and, like John said, even the Pro+ account at Google Colab ($50 per month) locks you out frequently. I've been using a cloud-based VM at Lambda Labs (forget AWS and Azure) via a Docker container that someone made for me, and I also just ordered a new PC so I can run it locally (Ubuntu or Windows/Anaconda). (I would recommend an RTX 3090 as the minimum card if you wanted to load several models at once.) There are several free or inexpensive Web-based interfaces for text-to-image, so not sure what John would want to implement. I'm personally mostly interested in the "style transfer" aspect of the technology, which is easy enough to do with Disco Diffusion by using your own images as a starting point and then prompting "in the style of," but I've also started to produce some interesting new work after many hours of tweaking prompts (so-called "prompt engineering"). It's pretty fascinating stuff.
I think the state of the art of this stuff is going to be constantly moving over the next few years. And the front wave of that is going to be available in open source online notebooks.
At the same time, working with the notebooks on something like Colab is clearly not what you would want as an ideal art generation workspace for messing with this stuff.
But whatever environment you worked with would have to deal with the concept of allowing manipulation of open source github code libraries as well as maybe the need to run the code in the cloud rather than locally.
I partially say that second part (cloud) because i'm all too aware of the details of setting all of the libraries up on a local machine so everything works properly. At the same time it would be really nice if you could take the open source PyTorch code repositories and easily run them on either Windows (assuming an Nvidia card that runs it) or on a Mac Studio with enough GPU power.
That assumes Apple doesn't put up obstacles for PyTorch via the Metal API on the platform, since Meta (Facebook) is associated with PyTorch.
So i guess i'm very interested in what people using this stuff would like as an ideal working environment. I have my opinions, but i'm very curious what other people think.
My take on prompt engineering is that you need to figure it out for each flavor of generative model, since what works on one is going to work differently or not at all on a different one.
Interesting. I "met" a guy on Discord who is running several flavors locally, and he offered to help me get set up ("2 - 3 hours guesstimate") after building the container at Lambda. He seems very confident in all aspects of the tools (mining crypto when his suite of 3090s isn't doing images...). There are also lots of knowledgeable guys on Discord tweaking their own versions of the notebook and sending back code for the official version... There's a program called Visions of Chaos that provides a front end for all the parameters to the ML stuff (I think it's plain Disco). I'll try to get a screenshot if you're interested. You just click enable/disable for toggles and provide values for numeric fields. (Have you played with the Disco notebook via Colab? It's the same stuff you see there, just nicely arranged in a single screen.) I'd be perfectly happy using that as opposed to the notebook UI... But I'm not sure most people (even our beloved user community) yet appreciate the horsepower needed to run this stuff. A 3090 alone costs between $1500 and $2500 (should get cheaper), then your PC needs proper cooling, a PSU, and all sorts of other tweaks, plus a fair amount of "coding" to get it set up. A nice Windows graphical interface would be super but probably not practical (and Mac is out for now...). And all true about prompts, which is why I decided to get a box to run this locally. So much experimentation and time is involved in testing stuff out (though it's starting to look pretty interesting...).