Fine-grained control is necessary to generate high-end visuals that satisfy the unique requirements of beauty brands, where every detail matters. Existing tools such as Stable Diffusion and Midjourney offer only limited control over image generation, since users can steer them solely through text prompts; as the saying goes, "an image is worth a thousand words," so text alone cannot convey every visual detail.
To achieve fine-grained control, we believe users must be able to supply controlling information in multiple formats, including both text and images. The challenge lies in designing suitable formats and in fusing the multiple pieces of controlling information into a single generated image. To this end, we have designed a novel framework called ControlNOLA.
Currently, the framework supports four types of controlling information: one piece of text information and three pieces of image information. As shown in the following figure, one text controlling module and three image controlling modules handle these four inputs, respectively. Each module controls certain aspects of the generated image, and some modules are optional. We omit the details of the controlling information and of how it is fused into the generator here, but it is worth noting that the formats were designed by listening to the requirements of creators and producers in the commercial creative industry, so that the image generation workflow built on our framework is friendly to industry practitioners.
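To make the module structure concrete, the sketch below illustrates the general idea of one required text module plus several optional image modules whose outputs are fused into a single conditioning signal. This is a minimal, hypothetical illustration only: the encoder functions, their names, and the additive fusion are assumptions for exposition, not ControlNOLA's actual encoders or fusion mechanism.

```python
from typing import List, Optional

def encode_text(prompt: str, dim: int = 4) -> List[float]:
    """Toy stand-in for a text controlling module: folds characters
    into a fixed-size conditioning vector (illustrative only)."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def encode_image(pixels: List[float], dim: int = 4) -> List[float]:
    """Toy stand-in for an image controlling module: pools pixel
    values into a fixed-size conditioning vector (illustrative only)."""
    vec = [0.0] * dim
    for i, p in enumerate(pixels):
        vec[i % dim] += p / max(len(pixels), 1)
    return vec

def fuse_conditions(text: str,
                    images: List[Optional[List[float]]],
                    dim: int = 4) -> List[float]:
    """Fuse the text condition with whichever of the three image
    conditions are present; optional modules are simply skipped
    when their input is None."""
    fused = encode_text(text, dim)
    for img in images:
        if img is None:  # optional controlling module left unused
            continue
        enc = encode_image(img, dim)
        fused = [f + e for f, e in zip(fused, enc)]
    return fused

# One text condition plus three image slots, two of them left empty.
cond = fuse_conditions("matte red lipstick",
                       [[0.2, 0.8, 0.5], None, None])
print(len(cond))
```

In a real system, the fused vector would condition the generator (e.g., via cross-attention), and each encoder would be a learned network rather than a toy function; the point here is only the interface, in which every controlling module maps its own input format into a shared conditioning space, so that optional inputs can be dropped without changing the generator.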