Computer vision
Official Implementation of MultiWorld: Scalable Multi-Agent Multi-View Video World Models
Analyse · Learn · Ingest · Curate · Export — AI-powered YOLO dataset management toolkit
A feed-forward 3D foundation model for reconstructing scenes from streaming data
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
AR 3D object detection for iPhone with LiDAR — YOLO 2D + BoxerNet 3D lifting
Fine-tune Gemma 4 and 3n with audio, images and text on Apple Silicon, using PyTorch and Metal Performance Shaders.
SteerViT is a framework that equips any ViT with the ability to steer both its global and local visual representations with natural language.
Control OpenLayers, Google Maps, and Leaflet with hand gestures via webcam. Uses MediaPipe for real-time hand tracking to pan, zoom, and navigate maps hands-free in the browser. No backend required.
Allen Institute for AI: WildDet3D: Scaling Promptable 3D Detection in the Wild
A simple video streaming baseline that outperforms state-of-the-art methods.
Give Claude the ability to watch and understand videos — Claude Code plugin with frame extraction and multimodal audio analysis
"Single-image Layer Decomposition for Anime Characters" (SIGGRAPH 2026, Conditionally Accepted)
Inference repo for the Falcon-Perception and Falcon-OCR models: early-fusion, natively multimodal, dense autoregressive Transformer models.
[CVPR 2026] From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
[CVPR 2026 Highlight] A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
MegaFlow: Zero-Shot Large Displacement Optical Flow
Official implementation and models for OVIE (One View Is Enough! Monocular Training for In-the-Wild Novel View Generation)
Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.