I've largely been messing with generating pictures using AI, but this morning I wanted to take some time and try to lean further into the video process to see what I could come up with. I'll break these down in gallery order.

The reference sources are included, and they're from my catalog; I'm sure you guys recognize some. The AI models used (aka "Loras") were trained by me using photos from my catalog. So I feel very strongly that I'm using AI in an ethical, artistic manner. I know this is a concern for some folks. In my eyes, I'm remixing my stuff with the internet's stuff.

I'm using Stable Diffusion's img2img + ControlNets + EbSynth and the EbSynth Utility extension (that probably looks like word diarrhea to some).

The method I used doesn't allow for fine-tuning of the prompt afaik, so if the prompt says "beautiful woman with big boobs" and there are moments in the video where boobs are not visible, the AI will find somewhere to put two big boobs. It might be on the person's arm, it might be on their forehead; it's anyone's guess. So my prompt in all of these is typically very vague. "Woman, detailed skin, nude" is the gist of what was used for most of these.
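
For anyone curious what that img2img step looks like in practice, here's a rough sketch using the Hugging Face diffusers library. This is not my exact setup; the model name, Lora path, and numbers are just placeholders to show the shape of it:

```python
# Minimal sketch of the Stable Diffusion img2img step with a custom Lora.
# Model ID, Lora filename, and settings are placeholders, not my real values.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical Lora trained on my own catalog photos
pipe.load_lora_weights("./loras", weight_name="my_lora.safetensors")

keyframe = Image.open("keyframes/00010.png").convert("RGB")

# Keep the prompt deliberately vague so anatomy isn't forced into every frame
result = pipe(
    prompt="woman, detailed skin, nude",
    negative_prompt="blurry, deformed",
    image=keyframe,
    strength=0.4,          # the "denoise factor": how far we drift from the source
    guidance_scale=7.0,
).images[0]
result.save("stylized/00010.png")
```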

In gallery order:
#1 - This take was so bad I almost didn't want to include it, but I did anyway because I'm gunning for that "most improved" award. There are major issues with face tracking when the video is this soft, hence the constant shifting of the face. I'm also using the default keyframe length on this, which is fairly long and the reason it looks like a slideshow with morph transitions. Next step: decrease the time between keyframes and use a clearer source.
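
For reference, pulling keyframes at a tighter interval is pretty simple to do yourself. Here's a rough sketch with OpenCV; the interval and paths are made up, not whatever EbSynth Utility uses by default:

```python
# Grab a keyframe every INTERVAL frames; a shorter interval means less
# "slideshow with morph transitions" but more frames to generate.
import cv2
import os

INTERVAL = 10  # placeholder value; tune to taste
os.makedirs("keyframes", exist_ok=True)

cap = cv2.VideoCapture("source.mp4")
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % INTERVAL == 0:
        cv2.imwrite(f"keyframes/{idx:05d}.png", frame)
    idx += 1
cap.release()
```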

#3 - I kept the denoise factor low for this one and did my best to tune the OpenPose and depth map ControlNets to keep the generated model adhered to the source as closely as possible. About two-thirds of the way in, there are moments where the arm goes nuts; that's because the AI didn't know how to interpret the shape of Carolina's super thick braids and thought they were arms, I guess. Lowering the denoise also helped keep the background in check here, but starting a scene on a face and zooming out to full body was a LOT of change for the AI to chew on. Solution: use a steadier or more anchored source.
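If you want an idea of what the OpenPose + depth ControlNet combo with a low denoise looks like in code, here's a hedged sketch with diffusers. The checkpoint names are the commonly used lllyasviel ones, and the numbers are examples, not my exact values:

```python
# Two ControlNets (pose + depth) steering an img2img pass with low denoise,
# so the generation stays glued to the source keyframe.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from PIL import Image

pose_cn = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16
)
depth_cn = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[pose_cn, depth_cn],
    torch_dtype=torch.float16,
).to("cuda")

keyframe = Image.open("keyframes/00010.png").convert("RGB")
pose_map = Image.open("pose/00010.png")    # precomputed OpenPose skeleton
depth_map = Image.open("depth/00010.png")  # precomputed depth map

out = pipe(
    prompt="woman, detailed skin",
    image=keyframe,
    control_image=[pose_map, depth_map],
    strength=0.3,                              # low denoise = stick close to the source
    controlnet_conditioning_scale=[1.0, 0.7],  # how hard each controlnet steers
).images[0]
out.save("stylized/00010.png")
```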

#5 - The source here, despite being in constant motion, is still well anchored on Joy, with the only big movement being her right arm. I tried using "large_breasts" as a keyword in the prompt and quickly realized that each generated frame was casting its own version of big boobs, so the size fluctuates wildly. I also generated and used a mask for each frame, acting as a dynamic cutout to place the generated model directly over the source material. The second video, in the lagoon, is the denoise factor cranked to the point where none of the source is mixed into the generated image, giving you an idea of the wide gamut of possibilities. I also included the prompt "holding a chalice" and it interpreted that as "place chalices wherever". Good times.
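The mask-based cutout itself is dead simple once you have a mask per frame; something like the sketch below. The paths are placeholders, and the mask generation (a background remover or a segmentation preprocessor) happens in a separate step:

```python
# Composite the generated frame over the source wherever the mask is white,
# so only the subject gets replaced and the background stays original.
from PIL import Image

source = Image.open("frames/00010.png").convert("RGB")
generated = Image.open("stylized/00010.png").convert("RGB")
mask = Image.open("masks/00010.png").convert("L")  # white = subject, black = background

composite = Image.composite(generated, source, mask)
composite.save("composited/00010.png")
```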

#8 - This was my final attempt this afternoon, so I upped the resolution a ton, which made it take wayyy longer to process. Crystal clear video of JMO, lots of movement, but I figured the clarity would help. It did, some, but the backlight was still pretty overpowering and we lose tracking occasionally. Overall it's a massive improvement over my first attempt this morning, so I'll take it. For these renders, I did my best to keep the denoise jussstt tuned enough that we'd keep a chunk of the source material while still generating what looks like a fully rendered video. This meant not being able to do things like "remove the window and replace it with a painting of Bob Ross in the style of Bob Ross", because we were so reliant on the source layer. Using the depth map, it is possible to create an alpha layer 'behind' the model where the AI can generate whatever, but the source video is too flat for that to be effective. Possible next steps: find even more stable material, and/or try a completely different method. EbSynth, while still one of the best, is just one of several methods of creating AI video, and there's a new method every damn week.
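The depth-map alpha idea, for the curious, basically boils down to thresholding the depth map so the near subject is kept and the far background goes transparent, leaving a hole the AI can fill with whatever. A toy sketch of just that thresholding step; the cutoff value is made up, and it assumes the depth map matches the frame size:

```python
# Turn a depth map into an alpha channel: subject stays opaque, background
# becomes transparent so it can be regenerated separately.
import numpy as np
from PIL import Image

frame = Image.open("stylized/00010.png").convert("RGBA")
depth = np.array(Image.open("depth/00010.png").convert("L"))

# In most depth maps, brighter = closer; keep pixels brighter than the cutoff
subject_mask = (depth > 110).astype(np.uint8) * 255

rgba = np.array(frame)
rgba[..., 3] = subject_mask  # background alpha -> 0, subject alpha -> 255
Image.fromarray(rgba).save("subject_only/00010.png")
```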

Observations: This process strongly favors video that was recorded on a tripod or a slow gimbal. The steadier or more focused/anchored on a subject your video is, the better the AI's interpretation of the source material will be. As with images, the quality of the Stable Diffusion model (aka Lora) has a huge impact on the quality of the video. If you don't train your model's face from almost every angle, the AI will have to do more work to interpret what that person's face would look like at that angle, you get me? lol. Lastly, there's another method I've yet to explore that involves putting up to 16 keyframes on one reference sheet, encouraging the AI to interpret all of them simultaneously, which would greatly reduce flicker.
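As I understand it, that reference-sheet method is basically tiling keyframes into one big grid image, running the grid through generation as a single picture, and then cutting it back apart. A rough sketch of just the tiling and splitting; I haven't actually tried this yet, so sizes and paths are guesses:

```python
# Tile 16 keyframes into a 4x4 "reference sheet", then split a stylized sheet
# back into individual frames. Tile size and filenames are placeholders.
from PIL import Image

TILE = 512
frames = [Image.open(f"keyframes/{i:05d}.png").resize((TILE, TILE)) for i in range(16)]

sheet = Image.new("RGB", (TILE * 4, TILE * 4))
for i, f in enumerate(frames):
    sheet.paste(f, ((i % 4) * TILE, (i // 4) * TILE))
sheet.save("sheet.png")

# ...run sheet.png through img2img as one image, then cut the result apart...
stylized = Image.open("sheet_stylized.png")
for i in range(16):
    box = ((i % 4) * TILE, (i // 4) * TILE, (i % 4 + 1) * TILE, (i // 4 + 1) * TILE)
    stylized.crop(box).save(f"keyframes_out/{i:05d}.png")
```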

Any questions? 😬
