Select the vision-language model used to generate image descriptions
A popular series of small vision-language models built by Microsoft, including a 250-million-parameter (base) and a 700-million-parameter (large) variant.
The 700-million-parameter (large) variant of the Florence-2 series of vision-language models.
A 256-million-parameter vision-language model built by Hugging Face.
A 500-million-parameter vision-language model built by Hugging Face.
A 2-billion-parameter vision-language model for captioning images and extracting text from them.
INT8-quantized version of Moondream2 (2B parameters) for memory-constrained hardware. Reduces memory use from ~5 GB to ~1.5-2 GB with minimal quality loss. Well suited to CPU-only machines.
External Ollama instance