<h1 align="center"> HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting </h1>
<p align="center">
<a href="[https://arxiv.org/abs/2305.02034](https://arxiv.org/abs/2311.17957)"><img src="https://img.shields.io/badge/arXiv-Paper-<color>"></a>
# News
**2023.12.1**
The paper is posted on arXiv!
**2023.12.29**
First code commit released.
**2024.1.7**
The preprocessor and the finetuned model have been ported to [ComfyUI controlnet](https://github.com/Fannovel16/comfyui_controlnet_aux). The preprocessor has been ported to [sd webui controlnet](https://github.com/Mikubill/sd-webui-controlnet). Thanks for all your great work!
**2024.1.15**
⚠️ When using the finetuned ControlNet from this repository or [control_sd15_inpaint_depth_hand](https://huggingface.co/hr16/ControlNet-HandRefiner-pruned), I noticed many users still set the control strength/control weight to 1, which can result in loss of texture. As stated in the paper, we recommend a smaller control strength (e.g. 0.4 - 0.8).
# Introduction
This is the official repository of the paper <a href="https://arxiv.org/abs/2311.17957"> HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting </a>
<figure>
<img src="Figs/banner.png">
<figcaption align = "center"><b>Figure 1: Stable Diffusion (first two rows) and SDXL (last row) generate malformed hands (left in each pair), e.g., incorrect
number of fingers or irregular shapes, which can be effectively rectified by our HandRefiner (right in each pair).
</b></figcaption>
</figure>
<p align="left">
In this study, we introduce a lightweight post-processing solution called <b>HandRefiner</b> to correct malformed hands in generated images. HandRefiner employs a conditional inpainting approach to rectify malformed hands while leaving the other parts of the image untouched. We leverage a hand mesh reconstruction model that consistently adheres to the correct number of fingers and hand shape, while also being capable of fitting the desired hand pose in the generated image. Given an image whose generation failed due to malformed hands, we utilize ControlNet modules to re-inject correct hand information. Additionally, we uncover a phase transition phenomenon within ControlNet as we vary the control strength, which enables us to take advantage of more readily available synthetic data without suffering from the domain gap between realistic and synthetic hands.
# Visual Results
<figure>
<img src="Figs/github_results.png">
</figure>
# Installation
Check [installation.md](docs/installation.md) for installation instructions.
# Manual
Check [manual.md](docs/manual.md) for an explanation of the commands used to run HandRefiner.
# Get Started
For single image rectification:
```bash
python handrefiner.py --input_img test/1.jpg --out_dir output --strength 0.55 --weights models/inpaint_depth_control.ckpt --prompt "a man facing the camera, making a hand gesture, indoor" --seed 1
```
For multi-image rectification:
```bash
python handrefiner.py --input_dir test --out_dir output --strength 0.55 --weights models/inpaint_depth_control.ckpt --prompt_file test/test.json --seed 1
```
# Important Q&A
<ul>
<li> <b>What kind of images can be rectified?</b></li>
Like any method, this one has its limits. If the original hands are so badly malformed that they are unrecognisable to the human eye, it is pretty much impossible for a neural network to fit a reasonable mesh. Also, due to the fitting nature of the method, we do not rectify the hand size: if the original image contains a giant malformed hand, you will get a giant hand back in the rectified image. In short, malformed hands with a hand-like shape and an appropriate size can be rectified.
<li> <b>Can we use it on SDXL images?</b>
In the paper, the SDXL images are resized to 512x512 before rectification, because the base model used in this project is SD 1.5.
A possible workaround for SDXL: it is certainly not difficult to implement the same idea for SDXL, and many implementations already support inpainting with an SDXL inpainting model combined with a depth ControlNet. You can take the depth map and masks produced by this repository's pipeline and feed them into whatever SDXL inpainting implementation you use (see the SDXL inpainting sketch after this list).
A caveat is that I have not tested this, and, as mentioned in the paper, since the depth ControlNet is not fine-tuned on hand mesh data, it may have a high rate of failed inpaintings. In that case, you can use the technique mentioned in the paper: fine-tune the depth SDXL ControlNet on readily available synthetic data, for example these two datasets [[1]](https://synthesis.ai/static-gestures-dataset/)[[2]](https://synthesis.ai/animated-gestures-dataset/), and then adjust the control strength to get the desired texture and appearance.
<li> <b>What if the generation failed?</b>
First, check the depth map; if it is bad, consider using a different mesh reconstruction model to reconstruct the mesh.
Second, check whether the hand masks fully cover the malformed hands. Some malformed hands have very long fingers that the detected masks may not cover. To fix this:
1. Use a larger padding by adjusting the pad parameter in the arguments.
2. Provide a hand-drawn mask.
If all of the previous steps look fine, you may need to regenerate several times or try different control strengths; changing the seed can be very helpful.
<li> <b>Since small hands are a limitation mentioned in the paper, what is the appropriate hand size for the SD v1.5 weights?</b>
Generally, hands of at least 60px × 60px are recommended for the current weights. To make the method applicable to smaller hands, consider scaling up the image with a super-resolution method first (see the upscaling sketch after this list).
<li> <b>How can I contribute to this project?</b>
Over the last decade, the CV community has produced dozens of highly accurate mesh reconstruction models; this project uses Mesh Graphormer, a recent SOTA model on the FreiHAND benchmark. Contributions that port other models are very welcome: I have written a template parent class for models under the preprocessor folder.
<li> <b>Can I use it for anime hands or other styles?</b>
As long as the hand detection model and the mesh reconstruction model can detect the hands and reconstruct the meshes, it should work for other styles. However, to my understanding these models are not trained on cartoon or anime images, so there is a good chance the mesh reconstruction stage will fail.
</ul>
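For the SDXL workaround described above, below is a minimal, untested sketch using the Hugging Face diffusers library; it is not part of this repository. It assumes you have already exported the hand depth map and hand mask from this repository's pipeline as image files (the file names are placeholders), and it uses the stock SDXL depth ControlNet, which is not fine-tuned on hand mesh data, so failed inpaintings are possible:

```python
# Hypothetical SDXL inpainting sketch using diffusers; not part of this repository.
# Assumes the hand depth map and mask were exported from the HandRefiner pipeline beforehand.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetInpaintPipeline
from diffusers.utils import load_image

# Stock SDXL depth ControlNet (not fine-tuned on hand mesh data).
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("malformed_sdxl.png")  # original SDXL image with malformed hands
mask = load_image("hand_mask.png")        # hand mask exported from this repo's pipeline
depth = load_image("hand_depth.png")      # rendered hand depth map exported from this repo's pipeline

result = pipe(
    prompt="a man facing the camera, making a hand gesture, indoor",
    image=image,
    mask_image=mask,
    control_image=depth,
    controlnet_conditioning_scale=0.55,   # analogous to the control strength discussed above
    num_inference_steps=30,
).images[0]
result.save("rectified_sdxl.png")
```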
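For the small-hands question above, a simple pre-processing sketch is to upscale the image before running HandRefiner so that the hands exceed roughly 60px × 60px. Plain bicubic resizing is only a stand-in here; a dedicated super-resolution model (e.g. Real-ESRGAN) will generally preserve detail better. File names and the scale factor are placeholders:

```python
# Upscale an image with small hands before passing it to HandRefiner.
# Bicubic resizing is a placeholder for a proper super-resolution model.
from PIL import Image

img = Image.open("small_hands.png")
scale = 2  # pick a factor so the hands end up at least ~60x60 px
upscaled = img.resize((img.width * scale, img.height * scale), Image.BICUBIC)
upscaled.save("small_hands_x2.png")

# Then rectify the upscaled image, e.g.:
# python handrefiner.py --input_img small_hands_x2.png --out_dir output \
#     --strength 0.55 --weights models/inpaint_depth_control.ckpt \
#     --prompt "a man facing the camera, making a hand gesture, indoor" --seed 1
```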
## Comments
- Our codebase builds heavily on [stable-diffusion](https://github.com/CompVis/stable-diffusion), [ControlNet](https://github.com/lllyasviel/ControlNet) and [MeshGraphormer](https://github.com/microsoft/MeshGraphormer).
## Citation
If you find HandRefiner helpful, please consider giving this repo a star :star: and citing:
```
@inproceedings{10.1145/3664647.3680693,
author = {Lu, Wenquan and Xu, Yufei and Zhang, Jing and Wang, Chaoyue and Tao, Dacheng},
title = {HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting},
year = {2024},
isbn = {9798400706868},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3664647.3680693},
doi = {10.1145/3664647.3680693},
abstract = {Diffusion models have achieved remarkable success in generating realistic images but suffer from generating accurate human hands, such as incorrect finger counts or irregular shapes. This difficulty arises from the complex task of learning the physical structure and pose of hands from training images, which involves extensive deformations and occlusions. For correct hand generation, our paper introduces a lightweight post-processing solution called HandRefiner. HandRefiner employs a conditional inpainting approach to rectify malformed hands while leaving other parts of the image untouched. We leverage the hand mesh reconstruction model that consistently adheres to the correct number of fingers and hand shape, while also being capable of fitting the desired hand pose in the generated image. Given a generated failed image due to malformed hands, we utilize ControlNet modules to re-inject such correct hand information. Additionally, we uncover a phase transition phenomenon within ControlNet as we vary the control strength. It enables us to take advantage of more readily available synthetic data without suffering from the domain gap between realistic and synthetic hands. Experiments demonstrate that HandRefiner can significantly improve the generation quality quantitatively and qualitatively. The code is available at https://github.com/wenquanlu/HandRefiner.},
booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
pages = {7085–7093},
numpages = {9},
keywords = {deep learning, diffusion models, image inpainting},
location = {Melbourne VIC, Australia},
series = {MM '24}
}
```
", Assign "at most 3 tags" to the expected json: {"id":"6611","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"