Trendshift - Ask AI

base on ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering # Meta CLIP [FAIR, Meta](https://ai.meta.com/research/) [![arXiv](https://img.shields.io/badge/arXiv-2507.22062-b31b1b)](https://arxiv.org/abs/2507.22062) [![arXiv](https://img.shields.io/badge/arXiv-2309.16671-b31b1b)](https://arxiv.org/abs/2309.16671) [![Hugging Face Collection](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Collection-blue)](https://huggingface.co/collections/facebook/meta-clip-687e97787e9155bc480ef446) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/activebus/MetaCLIP) <img src="docs/metaclip2_scaling.gif" style="width: 50%; margin: 0 auto; display: block;" /> <img src="docs/metaclip2_teaser.png" style="width: 80%; margin: 0 auto; display: block;" /> After years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges: - large-scale non-English data curation pipelines are largely undeveloped; - the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP. With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can **mutually benefit** and elevate each other, achieving SoTA multilingual performance. ## Updates * 08/25/2025: 🔥 [Meta CLIP 2 (worldwide)](https://arxiv.org/abs/2507.22062) is on [open_clip](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/pretrained.py) and [Huggingface](https://huggingface.co/collections/facebook/meta-clip-687e97787e9155bc480ef446). * 07/29/2025: 🔥 [Meta CLIP 2 (worldwide)](https://arxiv.org/abs/2507.22062) is released. * 12/10/2024: 🔥 Meta CLIP 1.2 (ViT-H/14) trained with Altogether synthetic captions is released. * 10/09/2024: 🔥 [Altogether: Image Captioning via Re-aligning Alt-text](https://arxiv.org/abs/2410.17251) (aka Meta CLIP 1.2) is accepted by EMNLP 2024 with [code](altogether/README.md) released. * 08/15/2024: [v0.1](https://github.com/facebookresearch/MetaCLIP/releases/tag/v0.1) released. * 04/25/2024: 🔥 paper [MoDE: CLIP Data Experts via Clustering](https://arxiv.org/abs/2404.16030) is accepted by CVPR 2024 with [code](mode/README.md) released. * 01/18/2024: 🔥 add [code](metaclip/README_metadata.md) for building metadata. * 01/16/2024: 🔥 paper accepted by ICLR as [spotlight presentation](https://openreview.net/group?id=ICLR.cc/2024/Conference#tab-accept-spotlight). * 12/25/2023: [Huggingface Space](https://huggingface.co/spaces/activebus/MetaCLIP) demo and [Colab](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing) released. * 12/21/2023: Meta CLIP 1.1 (ViT-G/14) released. * 09/28/2023: initial release. ## Quick Start The pre-trained MetaCLIP models are available in <details> <summary>mini_clip (this repo)</summary> ```python import torch from PIL import Image from src.mini_clip.factory import create_model_and_transforms, get_tokenizer model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide') tokenize = get_tokenizer("facebook/xlm-v-base") image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0) text = tokenize(["a diagram", "a dog", "a cat"]) with torch.no_grad(): image_features = model.encode_image(image) text_features = model.encode_text(text) image_features /= image_features.norm(dim=-1, keepdim=True) text_features /= text_features.norm(dim=-1, keepdim=True) text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1) print("Label probs:", text_probs) ``` </details> <details> <summary>Huggingface</summary> ```python from PIL import Image from transformers import AutoProcessor, AutoModel # Meta CLIP 1 processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m") model = AutoModel.from_pretrained("facebook/metaclip-b32-400m") # Meta CLIP 2 # model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu") # processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu") image = Image.open("docs/CLIP.png") inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True) with torch.no_grad(): outputs = model(**inputs) logits_per_image = outputs.logits_per_image # this is the image-text similarity score text_probs = logits_per_image.softmax(dim=-1) print("Label probs:", text_probs) ``` </details> ## Pre-trained Models Meta CLIP closely adhere to OpenAI CLIP training and model setup (you mostly just need to replace the weights): **to promote rigorous ablation studies and advance scientific understanding**, as in the old "era of ImageNet". Meta CLIP 2 | `model_name` | `pretrained` | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. | |:--------------------|:-------------|:---------:|:---------:|:---------:|:--------------:| | `ViT-H-14-quickgelu-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_h14_quickgelu_224px_worldwide.pt) | Online Curation | 29B | 224 | 57.4 | | `ViT-H-14-378-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_h14_378px_worldwide.pt) | Online Curation | 29B | 378 | 58.2 | | `ViT-bigG-14-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_bigG14_224px_worldwide.pt) | Online Curation | 29B | 224 | 60.7 | | `ViT-bigG-14-378-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_bigG14_378px_worldwide.pt) | Online Curation | 29B | 378 | 62.0 | (WIP): Meta CLIP 2: distilled smaller models and tokenizers. Meta CLIP 1 | `model_name` | `pretrained` | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. | |:--------------------|:-------------|:---------:|:---------:|:---------:|:---------:|:--------------:| | `ViT-B-32-quickgelu` | [`metaclip_400m`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_400m.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_400m.json) | 12.8B | 224 | 64 x V100 | 65.5 | | `ViT-B-16-quickgelu` | [`metaclip_400m`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b16_400m.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_400m.json) | 12.8B | 224 | 64 x V100 | 70.8 | | `ViT-L-14-quickgelu` | [`metaclip_400m`](https://dl.fbaipublicfiles.com/MMPT/metaclip/l14_400m.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_400m.json) | 12.8B | 224 | 128 x V100 | 76.2 | | `ViT-B-32-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 64 x V100 | 67.6 | | `ViT-B-16-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b16_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 64 x V100 | 72.1 | | `ViT-L-14-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/l14_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 128 x V100 | 79.2 | | `ViT-H-14-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/h14_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 256 x A100 | 80.5 | | `ViT-bigG-14-quickgelu` (v1.1) | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/G14_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 256 x A100 | 82.1 | | `ViT-H-14` (v1.2) | [`metaclip_v1_2_altogether`](https://dl.fbaipublicfiles.com/MMPT/metaclip/h14_v1.2_altogether.pt) | Online Curation | 35B | 224 | 256 x H100 | 82.0 | ## Environment This code is customized from [OpenCLIP](https://github.com/mlfoundations/open_clip) and will be maintained separately for research on MetaCLIP. The following command should install requirements for OpenCLIP and `submitit=1.2.1` used by this repo: ```bash conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \ -c pytorch-nightly \ -c nvidia \ -c conda-forge \ -c anaconda ``` ## Curation See [MetaCLIP 2](docs/metaclip2.md) and [MetaCLIP 1](docs/metaclip1.md). ## Bugs or questions? If you have any questions related to the code or the paper, feel free to email Hu Xu (`[email protected]`). ## Citation Please cite the following paper if MetaCLIP helps your work: ```bibtex ```bibtex @inproceedings{chuang2025metaclip2, title={MetaCLIP 2: A Worldwide Scaling Recipe}, author={Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li and Hu Xu}, journal={arXiv preprint arXiv:2507.22062}, year={2025} } @inproceedings{xu2023metaclip, title={Demystifying CLIP Data}, author={Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer}, journal={arXiv preprint arXiv:2309.16671}, year={2023} } @inproceedings{xu2024altogether, title={Altogether: Image Captioning via Re-aligning Alt-text}, author={Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer}, journal={arXiv preprint arXiv:2410.17251}, year={2024} } @inproceedings{ma2024mode, title={Mode: Clip data experts via clustering}, author={Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih and Hu Xu}, booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, year={2024} } ``` ## Reference The training code is developed based on [OpenCLIP](https://github.com/mlfoundations/open_clip), modified to the vanilla CLIP training setup. ## TODO - pip installation of metaclip package; - refactor mini_clip with apps for MoDE, altogether. - more updates for Meta CLIP 2: metadata, data loader, training code. ## License The majority of Meta CLIP is licensed under CC-BY-NC, however portions of the project are available under separate license terms: open_clip is licensed under the https://github.com/mlfoundations/open_clip license. ## Acknowledgement We gratefully acknowledge the [OpenCLIP](https://github.com/mlfoundations/open_clip) team for initial CLIP codebase and integration and [NielsRogge](https://github.com/NielsRogge)'s integration into [Huggingface](https://huggingface.co/models?other=metaclip). ", Assign "at most 3 tags" to the expected json: {"id":"2801","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"

AI prompts