CLIPtion is a fast and small captioning extension to the OpenAI CLIP ViT-L/14 used in Stable Diffusion, SDXL, SD3, FLUX, etc. Feed the CLIP and CLIP_VISION models in and CLIPtion powers them up, giving you caption/prompt generation in your workflows!
I made this for fun and am sure bigger dedicated caption models and VLMs will give you more accurate captioning, but this guy is tiny, fast, reuses what you already have loaded, and has options to give better CLIP alignment, so give it a try if you like!
Big thanks to Ben Egan, SilentAntagonist, Alex Redden, XWAVE, and Jacky-hate whose synthetic caption datasets I included in the training.
- Clone this repo to your `ComfyUI/custom_nodes` directory and install its requirements:

```bash
cd custom_nodes
git clone https://github.com/pharmapsychotic/comfy-cliption.git
pip install -r comfy-cliption/requirements.txt
```
- Optionally download `CLIPtion_20241219_fp16.safetensors` and put it in your `ComfyUI/custom_nodes/comfy-cliption` directory. You can skip this step and let it auto-download on first use to your `HF_HOME` cache.
- Restart ComfyUI
You should already have the CLIP L text encoder from SD, SDXL, SD3, or FLUX. You also need the CLIP L vision encoder for the Load CLIP Vision node. You can download it through the ComfyUI Manager > Model Manager > search "clip vision large" > click Install for openai/clip-vit-large.
The example workflows use the ComfyUI-Custom-Scripts node for previewing the caption strings. So you'll probably want to install that as well if you don't have it already.
If `CLIPtion_20241219_fp16.safetensors` is not already downloaded (as in step 2 of Installation), the loader will automatically download the CLIPtion model from the HuggingFace CLIPtion repo the first time it is run. It is stored in the HuggingFace cache directory (controlled by the `HF_HOME` environment variable).
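If you'd rather pre-fetch the weights instead of waiting for the first run (for example on a machine that will be offline later), here is a minimal sketch using the `huggingface_hub` library. The repo id `pharmapsychotic/CLIPtion` and the cache location are assumptions; adjust them to match where the model is actually published.

```python
# Sketch: pre-download the CLIPtion weights into the Hugging Face cache so the
# loader finds them on first run. The repo id below is an assumption.
import os
from huggingface_hub import hf_hub_download

# Optional: point HF_HOME at a specific cache location before downloading.
os.environ.setdefault("HF_HOME", os.path.expanduser("~/.cache/huggingface"))

path = hf_hub_download(
    repo_id="pharmapsychotic/CLIPtion",
    filename="CLIPtion_20241219_fp16.safetensors",
    # local_dir="ComfyUI/custom_nodes/comfy-cliption",  # or place it next to the node
)
print("downloaded to:", path)
```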
Create caption from an image or batch of images.
- `temperature` - controls randomness in generation; higher values produce more diverse outputs, lower values are more focused and predictable
- `best_of` - generates this many captions in parallel and picks the one with the best CLIP similarity to the image (see the scoring sketch below)
- `ramble` - forces generation of the full 77 tokens
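To make the `best_of` scoring idea concrete, here is a rough sketch using the generic `transformers` CLIP API rather than CLIPtion's own code: several candidate captions are scored against the image and the one with the highest CLIP similarity wins. The image path and candidate strings are placeholders.

```python
# Sketch of the best_of selection step: score candidate captions against the image
# with CLIP and keep the most similar one. Uses the stock transformers CLIP model
# for illustration only; the node reuses the CLIP you already have loaded.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")  # placeholder image path
candidates = [                     # stand-ins for captions sampled at some temperature
    "a dog running on the beach",
    "a dog playing in the sand near the ocean",
    "a cat sitting on a windowsill",
]

with torch.no_grad():
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape (1, len(candidates))

print("best caption:", candidates[logits.argmax().item()])
```

Higher `temperature` makes the sampled candidates more varied, which is exactly when picking the best of several draws pays off.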
Deterministic search for a caption from an image or batch of images. Less "creative" than the Generate node.
- `beam_width` - how many alternative captions are considered in parallel; higher values explore more possibilities but take longer (see the toy sketch below)
- `ramble` - forces generation of the full 77 tokens
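For intuition about `beam_width`, here is a toy beam search over a made-up vocabulary and scoring function (nothing below is CLIPtion's actual decoder): each step extends every surviving caption prefix with every token and keeps only the `beam_width` best, so a larger width explores more alternatives at a higher cost.

```python
# Toy beam search illustrating what beam_width controls. The vocabulary and the
# scorer are invented for the example; the real node decodes actual caption tokens.
import math

VOCAB = ["a", "dog", "cat", "on", "the", "beach", "<end>"]

def log_prob(prefix, token):
    # Hypothetical scorer: mildly discourages repetition and long prefixes.
    repeat_penalty = 0.5 if token in prefix else 0.0
    return -math.log(1 + len(prefix)) - repeat_penalty - 0.1 * VOCAB.index(token)

def beam_search(beam_width, max_len=5):
    beams = [([], 0.0)]  # list of (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<end>":
                candidates.append((seq, score))  # finished beams carry over unchanged
                continue
            for tok in VOCAB:
                candidates.append((seq + [tok], score + log_prob(seq, tok)))
        # Keep only the beam_width highest-scoring prefixes.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search(beam_width=1))  # greedy-like: fastest, may miss better captions
print(beam_search(beam_width=4))  # wider beam: more exploration, more compute
```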