Replies

  • Hi everyone, 

    I just wanted to add that with my RTX 2060 (the 2021 model with 12 GB of VRAM, which I paid $380 for, with no additional cooling required) I'm actually having a lot of fun with generative models, and some of the results are really amazing.

    After playing for a long time with Colab notebooks (and getting frustrated with their plans), I can finally run generative models locally, including Disco Diffusion!

    Of course the generation process is not as fast as with those GPU beasts Google provides, but it's still very bearable!

    For example, I just finished generating a 960 x 540 px image (700 steps) with Disco Diffusion and it took about 15 minutes.
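
    In case it helps anyone with a similar card, the relevant knobs in the Disco Diffusion notebook looked roughly like this for that run (variable names are from memory of the public DD notebook, so treat the exact names and defaults as approximate):

    ```python
    # Rough sketch of the Disco Diffusion notebook settings for the run above.
    # Variable names follow the public DD notebook from memory and may differ by version.
    width_height = [960, 540]     # output resolution in px
    steps = 700                   # diffusion steps; about 15 minutes on an RTX 2060 12GB
    n_batches = 1                 # one image per run to stay within 12 GB of VRAM
    clip_guidance_scale = 5000    # how strongly CLIP steers the image toward the prompt
    text_prompts = {0: ["The symphony of the last orgasm, trending on artstation"]}
    ```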

    I thought I would let you know the above because, from what I've read here, it looks like some of you think you need way more expensive hardware for these things to run, which is not necessarily true.

    That said, the following is a short list of some of the generative models I have experimented with, sorted by the quality of the results I could achieve:

    - Disco Diffusion
    - GLID-3
    - CompVis Latent Models
    - CLIP guided diffusion HQ 256x256
    - V-Diffusion-Pytorch
    - VQGAN-CLIP-APP
    - Big-Sleep
    - Deep-Daze (I think this was the most humble in terms of hardware requirements; I remember running it on my old GPU with just 2 GB of VRAM!)
    - Clip-Glass


    For the projects that output low-res results, you can always run an upscaling model on top, so that's not a problem for me!
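
    For upscaling I just shell out to one of the ESRGAN-family repos. A minimal sketch, assuming a local clone of Real-ESRGAN and that its inference_realesrgan.py script and flags still match that repo's README (treat the script name, flags and model name as assumptions on my part):

    ```python
    # Minimal sketch: batch-upscale a folder of low-res generations with Real-ESRGAN's
    # command-line script. Script name, flags and model name are taken from the
    # Real-ESRGAN README and may differ in newer versions of that repo.
    import subprocess
    from pathlib import Path

    LOW_RES_DIR = Path("outputs_256")    # where the 256x256 generations live (my own layout)
    UPSCALED_DIR = Path("outputs_1024")  # the 4x upscaled results go here
    UPSCALED_DIR.mkdir(exist_ok=True)

    for img in sorted(LOW_RES_DIR.glob("*.png")):
        subprocess.run([
            "python", "inference_realesrgan.py",
            "-n", "RealESRGAN_x4plus",   # pretrained 4x general-purpose model
            "-i", str(img),
            "-o", str(UPSCALED_DIR),
        ], check=True)
    ```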

    Another great thing I managed to do: I started using FFmpeg to stream the generated image as it was being updated (say, once every 2 seconds), interpolating the resulting frames (using FFmpeg filters) into a nice, smooth video, which is basically a real-time video of the neural network at work! You need to see it! (I was thinking I could start streaming this live on YouTube or Twitch at some point.)
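
    In case anyone wants to try something similar, here is a minimal sketch of the idea (my own rough approximation, not my exact pipeline): snapshot the progress image every couple of seconds, then let FFmpeg's minterpolate filter synthesize the in-between frames. The file names and snapshot count are made up for the example:

    ```python
    # Sketch: turn a periodically updated progress image into a smooth video.
    # 1) copy the progress image to a numbered frame every 2 seconds
    # 2) run FFmpeg with the minterpolate filter to interpolate up to 30 fps
    import os
    import shutil
    import subprocess
    import time

    PROGRESS_IMAGE = "progress.png"      # image the notebook keeps overwriting (assumed name)
    FRAME_PATTERN = "frames/%05d.png"    # numbered snapshots for FFmpeg's image2 demuxer
    NUM_SNAPSHOTS = 300                  # ~10 minutes of generation at one snapshot per 2 s

    os.makedirs("frames", exist_ok=True)
    for i in range(NUM_SNAPSHOTS):
        shutil.copy(PROGRESS_IMAGE, FRAME_PATTERN % i)
        time.sleep(2)

    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", "2",                         # play the snapshots back at 2 fps (sped up)
        "-i", FRAME_PATTERN,
        "-vf", "minterpolate=fps=30:mi_mode=mci",  # motion-compensated interpolation to 30 fps
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "timelapse.mp4",
    ], check=True)
    ```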

    That's all! I'll leave you with a few images generated with some of the models above; I hope you like them:

     

    [image] Disco Diffusion - "The symphony of the last orgasm, trending or artstation"

    [image] Disco Diffusion - "The symphony of the last orgasm, trending or artstation"

    [image] Disco Diffusion - "Tripping on Datura"

    [image] V-Diffusion-Pytorch (not sure) - "Looking into the void"

    [image] VQGAN-CLIP-APP - "Looking into the void"

    [image] CLIP-guided-diffusion-HQ256x256 - "Recursive Functions Become Alive"

    [image] CLIP-guided-diffusion-HQ256x256 - "A post apocalyptic Babel tower"

    [image] CLIP-guided-diffusion-HQ256x256 - "A knight at dawn"

    [image] CLIP-glass - "An underground party in a post apocalyptic city"

    [image] Disco Diffusion - "If you gaze for long into an abyss, the abyss gazes also into you, trending on artstation"

    [image] Disco Diffusion - "Gazing into the abyss"

    [image] Disco Diffusion - "An intricate set of mirror reflections"

    • Nice looking work. I went with the A6000 for the VRAM and the ability to do native 2K with nearly all DD models loaded. I don't know that I would ever use all the models at once, but it definitely gives me some headroom for what's available now and what might be coming. But yes, I did need a second mortgage for that system ;-) I don't know what Synthetik might want to do in this area - Visions of Chaos seems to have all the current models available in a UI that looks very efficient (I haven't actually used it yet). Definitely a ton of potential with these tools.

    • Cool, thanks for sharing.

      Do you have any comments on workflow enhancements you would like to see in an ideal world for working with generative image synthesis algorithms?

      • John - did you look at the Visions of Chaos interface shot I sent? It adapts to whichever model you are using (he has incorporated nearly 100). The main thing I like about it is that you don't see all the code for each cell. And you also don't need to know the syntax for text stuff, like keyframing an animation... Let me know if you want to see more examples of DD output, though I imagine you have seen plenty...

      • Not sure about the ideal workflow; there are many roads to try, but some of them are already being tested. For example, were you aware of the following two projects?

        1. Big Sleep Creator - https://github.com/enricoros/big-sleep-creator/ - "UI for human-in-the-loop controlled image synthesis"

        Planned features:

        - Generation of images based on text
        - Hyperparameters control
        - Scatter/select creation, with progressive refinement
        - Branch/continue generation of images (tree of creation)
        - Latent space editing
        - Latent space constraining for continuation

        2. DALL·E Flow - https://github.com/jina-ai/dalle-flow - "A Human-in-the-Loop workflow for creating HD images from text"

        To make it short, this makes the generation process more interactive by asking for user input at different generation steps.

        ---

        I think I saw other projects of this kind, but at the moment I can't remember them...

      • Hi John,

        I've toyed around with a number of the other tools out there, but more recently have spent quite some time experimenting with Dalle-2, MidJourney and some Stable Diffusion too. 

        Dalle is very "coherent" but less naturally "artistic" and, as you pointed out, is very restrictive/censored... not to mention the fixed square aspect ratio and the fact that while one can upload a source image, this can't be combined with prompt input. The one feature that I do appreciate, even if it is way too constrained, is the ability to "edit", i.e. to select an area of an image to modify. This would be far, far more interesting if it provided a full range of selection tools, instead of a crude, round, soft-edge eraser.

        I generally prefer the aesthetics of the MidJourney output, the variable aspect ratio and the reduced censorship, although this comes with far less "coherent" results. The new "Beta" model is a dramatic improvement, but one is still limited by the ephemeral influence of the input image, the difficulty in reproducing/tweaking results, the lack of an "edit" feature, and, one might say, the excessive influence of the default MJ aesthetic. Overall, while the results are quite impressive, and it is tons of fun to explore, I'd suggest that the artist's ability to exercise intent and direction is somewhat limited; prompt engineering does have an impact, but so much boils down to simply rolling the dice enough times to come up with attractive results.

        Stable Diffusion was finally released and shows a lot of promise, in particular given its open source orientation and the associated implications for censorship-avoidance. While I initially found it more difficult to generate results as satisfying as those I've been producing with MJ, I really appreciate the ability it provides to reliably and consistently reproduce results, and thus to tweak them using the various knobs and levers provided towards a desired output. This should allow the artist to exercise more intent and direction, particularly as the coherence of the models improves over time. Moreover, unlike either Dalle or MJ, Stable Diffusion is open source, which opens up a whole range of possibilities in terms of how it could be plugged into other tools and workflows, and the price/accessibility dynamics.
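
        To make the seed-control point concrete, here is roughly what that looks like with the Hugging Face diffusers wrapper (just one of several ways to run Stable Diffusion locally; the model ID and parameter values below are illustrative, not a recipe):

        ```python
        # Minimal sketch: fix the seed so a Stable Diffusion result can be reproduced
        # exactly and then tweaked one knob at a time (prompt, guidance, steps).
        import torch
        from diffusers import StableDiffusionPipeline

        pipe = StableDiffusionPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",   # illustrative model ID
            torch_dtype=torch.float16,
        ).to("cuda")

        prompt = "a post apocalyptic Babel tower, oil painting"
        generator = torch.Generator("cuda").manual_seed(1234)  # the seed is the reproducibility knob

        image = pipe(
            prompt,
            num_inference_steps=50,
            guidance_scale=7.5,
            generator=generator,
        ).images[0]
        image.save("babel_seed1234.png")
        ```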

        Combining the best of Dalle-2, MidJourney and Stable Diffusion, one "ideal" workflow would allow:

        • inputting images,
        • combining them with prompts to generate images with variable aspect ratios using a coherent model,
        • providing the ability to "edit" based on a comprehensive set of selection tools, and
        • providing seed control that allows tangible control when refining towards a desired output.

        With respect to Studio Artist, I love the power that it provides, am in awe of its capabilities, and will likely upgrade SA if/when the time comes, but I have also found it very difficult to wield this power. For example, I wasn't able to reproduce Charis Tsevis-style results as well as I'd have liked, and mostly ended up using it to generate hundreds/thousands of versions of input images. While this was fun and yielded some great results, it also produced a lot of throwaways along the way, created work separating the wheat from the chaff, and ultimately wasn't an interactive/iterative process. As a result, today I'm more likely to spend those hours playing with these new engines than with SA.

        That being said, I can't help but wonder how Studio Artist's capabilities could be leveraged to interact with one or more of these tools, e.g. generating images/selections/masks to be used as inputs to such services, then retrieving the results and incorporating them into SA's scripted/batched processing capabilities. I'd LOVE to see what you might come up with.

        Keep up the great work!! 

         

        • Thanks for your suggested workflow comments.

          I was wondering if we should set up a specific Group here on the user forum devoted to generative neural net synthesis, to discuss the different available options, different approaches to using Studio Artist to process the output, etc.

           

          We've been heavily researching all of the different approaches for generative neural net image synthesis here for a while now. I post images/animations every day on my art blog with different neural net AI approaches, in addition to the new EBM (energy based model) digital paint tech we are working on. I've recently been experimenting with CLIP-guided direct RGB generative image synthesis, which is kind of surprising me, to be honest, and making me question some underlying assumptions people have about this technology.

          I have tried Stable Diffusion, and while it is very cool, the current lack of 'control' is a huge limitation, I think, for digital artists. The prototype generative AI context we're currently working with here is based on the CompVis latent diffusion model. The cool thing about the way it is set up is that you aren't restricted to just working with text prompts. And you also have a lot of adjustable control over what happens.

          Working with a fixed text prompt for the system, I can totally change the content and appearance of the generative output by manipulating the adjustable controls for the algorithm. So systems that just let you enter text, with limited or no control over the internal parameters of the algorithm, are kind of missing the point in some respects, I think.

          A better approach, in my opinion, is to view the system as an image synthesis engine, and then let the user have full control over dialing in everything that system has to offer for controlling the image synthesis process, be it realistic image synthesis or abstract image synthesis.
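
          To give a concrete illustration of steering with more than a text prompt (this is not our prototype, just the publicly available latent diffusion tooling), an init image plus a strength knob already changes the character of the output a lot. A rough sketch with diffusers, where the parameter names follow recent releases and may differ slightly in older ones:

          ```python
          # Sketch: image-to-image latent diffusion, where an init image and a 'strength'
          # knob steer the result alongside the text prompt.
          import torch
          from PIL import Image
          from diffusers import StableDiffusionImg2ImgPipeline

          pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
              "CompVis/stable-diffusion-v1-4",   # illustrative model ID
              torch_dtype=torch.float16,
          ).to("cuda")

          init = Image.open("my_sketch.png").convert("RGB").resize((512, 512))

          image = pipe(
              prompt="an intricate set of mirror reflections, oil on canvas",
              image=init,
              strength=0.6,          # 0 = keep the init image, 1 = mostly ignore it
              guidance_scale=7.5,    # how strongly the text prompt is enforced
              num_inference_steps=50,
          ).images[0]
          image.save("steered.png")
          ```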

           

          It seems like everyone is overly focused on text prompt control of these algorithms.  And sure, text prompt control is cool.

          But think about all of the subtlety associated with any given image. If you tried to define that with words you could write a novel (violating the CLIP embedding token limit, by the way) and still not come close to getting at the subtlety that is right there in the image itself.

          There is a reason why the old expression 'an image is worth a thousand words' was even thought up. It is really getting at something very important about images and perception that text-only prompt systems are totally missing.
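
          As a side note on that token limit: the reference CLIP tokenizer caps prompts at 77 tokens, which you can see directly (a quick sketch using OpenAI's clip package; the truncate argument exists in recent versions of it):

          ```python
          # Quick check of CLIP's 77-token prompt limit using OpenAI's reference tokenizer.
          import clip

          short = "a knight at dawn, trending on artstation"
          novel = "a knight at dawn " * 100   # pretend this is your thousand-word description

          print(clip.tokenize(short).shape)   # torch.Size([1, 77]) -- padded to the context length

          try:
              clip.tokenize(novel)            # raises because the text exceeds 77 tokens
          except RuntimeError as err:
              print("too long:", err)

          tokens = clip.tokenize(novel, truncate=True)  # or silently cut the prompt off at 77 tokens
          ```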

          So I'm not saying never use text, just that it is one adjustable part of a much bigger adjustable system. And if you want real artistic control over manipulating what the system is doing, you need access to all of that.

           

          You mentioned censorship. I have no real interest in using OpenAI's Dall-e 2 because of that (although their pricing is also too high for what you are getting). You wonder: if the pencil were invented today, would people be having conversations about how it could be used to create harmful images, and how it should be restricted because of that?

          Let's consider Guernica, a work of art thought by art critics to be one of the most important anti-war paintings in history. If you tried to make it using Dalle-2, OpenAI would kick you off their system for violating their terms of use. It depicts war, violence, and animal mutilation.

          Stable Diffusion has 2 different censorship mechanisms built into it if you look at the actual code: one that checks your text prompt, another that checks the output image. Since the code is open source, you could remove or comment out that stuff. I worry a little bit about 'censorship' associated with the training of the 5B model (and the associated 'aesthetic' models derived from it) they used. I also worry a little that even though it is open source, you need to enter an authorization code to activate the model, which leaves a back door for them to turn it off or restrict it at a later time.

          I was experimenting with a different latent diffusion model that had censorship code like this in it, and it was barking at images of people doing yoga as being obscene. So sure, if you want to make something for kids to create cute pictures of animals, maybe a parental mode makes sense. But for artists it is ridiculous.

          • These technologies are here to stay so I can't help but think that a dedicated group devoted to generative neural net synthesis would be warranted.

            I fully agree with your analysis regarding the limits of using words/language, and the need to expose levers that provide sufficient artistic control.

            You mentioned a "prototype generative [...] based on the CompVis latent diffusion model"; would you be able to recommend a place where one could get the best sense of the potential here?

            I hesitate to even comment on the possible options going forward, since you understand these things far better than I ever will, and my input will be of limited value here, but I'll take a chance and do so anyway. 

            It seems to me that when considering a path forward, SA is faced with a choice to either:

            1. build in tools in such a way as to be/appear native to the application,
            2. integrate with such a tool via APIs, or
            3. integrate with various and sundry web services.

            In the case of #1, i.e. "SA now features generative neural net synthesis", you might start by integrating an open source option off the shelf, but you'd likely find yourself needing to maintain a fork going forward, the burden of which would only grow over time, which might, in turn, limit feature expansion. While I appreciate that this option is likely the easiest for some new end-users, my experience in the software development arena suggests that "small pieces loosely joined" has a lot of merit.

            In the case of #2, i.e. "You can now leverage generative neural net synthesis if you also install product X", you would only have to maintain the API interactions and not the code base itself. This would provide a basis for potentially including other tools over time, and would likely allow feature expansion to be dictated more by the strength of the communities around those tools than by the size of Synthetik's development team. This leverages the "small pieces loosely joined" approach, but at some cost (additional installation requirements, a little less "total control", etc.)

            This is similar in the case of #3, i.e. "You can now leverage your Dalle/MJ/[...] account to do X", but might involve a longer list of API interactions to maintain. This feels like the least "SA-ish" approach, but could perhaps result in a more generic approach to defining an interaction/API model.
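
            Just to make #2/#3 a bit more concrete, the interaction could be as thin as a single HTTP round trip. A rough sketch against an entirely hypothetical local endpoint and JSON schema (nothing here reflects a real SA or third-party API):

            ```python
            # Hypothetical sketch of option #2: SA (or a helper script) posts a prompt to a
            # locally running image-synthesis service and saves the result for further processing.
            # The endpoint, port and JSON fields are made up for illustration only.
            import base64
            import json
            from urllib import request

            payload = {
                "prompt": "an underground party in a post apocalyptic city",
                "width": 768,
                "height": 512,
                "seed": 1234,
            }

            req = request.Request(
                "http://localhost:7860/api/generate",   # hypothetical local service
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            with request.urlopen(req) as resp:
                result = json.load(resp)

            # Assume the service returns the image as base64; write it out for SA to pick up.
            with open("generated.png", "wb") as f:
                f.write(base64.b64decode(result["image_base64"]))
            ```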

            Anyway, food for thought.

        • All of your workflow suggestions are good, and do resonate with our thinking here about how to incorporate this technology into the Studio Artist workflow.

          And your analysis of various approaches we could take is also pretty spot on. I do worry a little bit about the potential for people to drain their bank accounts if we attach web-service-based approaches that get billed on cloud GPU usage to something like Gallery Show.

           

          One thing that is maybe a little bit different in our thinking is that we're really interested in how to incorporate the notion of 'improvisation' into how one might work with generative image synthesis. So I thought I would throw that out there to see if anyone has any thoughts, or 'wouldn't it be nice' wishes.

          • My $0.02, in terms of improvisation...

            The first thing that comes to mind is how some of these systems allow initial strokes as inputs and/or include "inpainting" (including Stable Diffusion, I believe, although not via the current DreamStudio web implementation AFAIK). Combined with the ability to expand/pan the canvas, this could allow the artist to do things like select areas, pan or zoom out, and then fill with matching or prompt-based AI-generated pixels. This could be extended to auto-selecting/panning/zooming, etc. and/or rotating/dynamically changing prompts.
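
            For what it's worth, the inpainting piece is already scriptable outside of DreamStudio. A minimal sketch with the diffusers inpainting pipeline (the checkpoint ID and parameter names follow recent diffusers releases; treat them as assumptions):

            ```python
            # Sketch: fill a selected (masked) region of an image from a prompt, which is the
            # building block for the select-then-fill / pan-and-outpaint workflow described above.
            import torch
            from PIL import Image
            from diffusers import StableDiffusionInpaintPipeline

            pipe = StableDiffusionInpaintPipeline.from_pretrained(
                "runwayml/stable-diffusion-inpainting",   # illustrative inpainting checkpoint
                torch_dtype=torch.float16,
            ).to("cuda")

            image = Image.open("canvas.png").convert("RGB").resize((512, 512))
            mask = Image.open("selection_mask.png").convert("L").resize((512, 512))  # white = repaint

            result = pipe(
                prompt="the scene continues into a misty forest",
                image=image,
                mask_image=mask,
                num_inference_steps=50,
                guidance_scale=7.5,
            ).images[0]
            result.save("canvas_filled.png")
            ```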

