Falcon Perception

Next-generation multimodal model that unifies vision and language in a single dense transformer architecture

Advanced perception capabilities, combining image and text understanding in one framework.

Processes images and text together from the very first layer. Identify objects, highlight parts of an image, or read text from documents in one model.

Powerful. Efficient. Practical to deploy.

Falcon Perception is a multimodal AI model that enables systems to see, read, and understand images using natural language prompts.

By combining vision and language capabilities in a single architecture, Falcon Perception simplifies how AI interprets visual information while remaining efficient.

Overview

A Unified Approach to Visual Understanding

Falcon Perception extends the Falcon ecosystem beyond language models into advanced visual perception. It is designed to handle tasks such as object grounding, image segmentation, and document understanding within a single model.

Traditional vision-language systems are often built as pipelines, with separate components for image encoding, language fusion, and task-specific outputs. Falcon Perception takes a simpler approach, using a single early-fusion transformer that processes image and text information together from the first layer.

This avoids many of the bottlenecks and added complexity found in modular systems, making the model easier to scale and more efficient to run.
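
The early-fusion idea can be sketched in a few lines: image patches and text tokens are embedded into one shared sequence before any transformer layer runs, rather than being fused after a separate vision encoder. Every shape, weight, and token id below is invented for illustration and is not Falcon Perception's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

patch_feats = rng.normal(size=(4, 128))         # 4 raw image-patch feature vectors
W_patch = rng.normal(size=(128, d_model))       # stand-in for a learned projection
token_ids = np.array([7, 42, 3])                # 3 text tokens
embed_table = rng.normal(size=(1000, d_model))  # stand-in for a learned embedding table

img_tokens = patch_feats @ W_patch              # (4, d_model): patches in model space
txt_tokens = embed_table[token_ids]             # (3, d_model): text in model space

# Early fusion: one mixed sequence enters the very first layer; no separate
# vision encoder output gets bolted on later.
fused = np.concatenate([img_tokens, txt_tokens], axis=0)
print(fused.shape)  # (7, 64)
```

From here on, attention in every layer mixes visual and textual positions freely, which is what removes the pipeline hand-off points described above.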

One Model, Multiple Visual Tasks

Falcon Perception is built to work across a range of perception tasks using a unified structure. It supports open-vocabulary perception, meaning users can interact with images using natural language prompts rather than relying on fixed label sets. In practice, that allows the model to identify objects from written descriptions, highlight the correct regions in an image, and interpret text within documents using one shared architecture.
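
The contrast with a fixed label set can be shown with a toy scorer. Word overlap below is only a stand-in for the learned image-text similarity an open-vocabulary model computes, and the region descriptions are invented.

```python
# Regions are scored against an arbitrary free-text prompt instead of being
# classified into a closed list of labels.
region_descriptions = ["a red soda can", "a white goose", "a person holding a phone"]
prompt = "white goose"

# Toy similarity: count shared words between the prompt and each region.
scores = [len(set(prompt.split()) & set(desc.split())) for desc in region_descriptions]
best = region_descriptions[scores.index(max(scores))]
print(best)  # a white goose
```

A real open-vocabulary model replaces the overlap count with similarity computed inside the network, but the interface is the same: any prompt, not a fixed vocabulary.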

The model uses a lightweight structured output process to move from object location to object size to segmentation, allowing it to handle variable numbers of objects in an image without the overhead of heavier decoder-based systems.
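
One way to picture that structured output process, with all record formats invented for illustration: each object is emitted as a short fixed-size record (location, then size, then a segmentation token), and a sentinel ends the stream, so the object count stays variable without a heavyweight decoder.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """One object as a short structured record: location, then size,
    then a segmentation token id (all fields hypothetical)."""
    cx: float
    cy: float
    w: float
    h: float
    mask_id: int

def decode_objects(stream):
    """Consume five values per object until an end sentinel appears,
    so any number of objects can be read from one flat sequence."""
    objects = []
    it = iter(stream)
    for first in it:
        if first == "EOS":
            break
        objects.append(DetectedObject(first, next(it), next(it), next(it), next(it)))
    return objects

# two objects, then the sentinel
stream = [0.2, 0.3, 0.1, 0.1, 5, 0.7, 0.6, 0.2, 0.3, 9, "EOS"]
print(len(decode_objects(stream)))  # 2
```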

Built for Practical Deployment

Falcon Perception is designed for practical use across domains including document analysis, medical imaging, satellite imagery, robotics, and autonomous systems. Its compact architecture is intended to deliver strong multimodal performance without the size and infrastructure demands typically associated with large vision-language systems.

Medical image interpretation

Satellite and geospatial imagery

Robotics and autonomous systems

Competitive Performance at Compact Scale

With a 0.6B-parameter early-fusion model for segmentation and grounding, and a 0.3B model for document understanding, Falcon Perception delivers strong multimodal performance without the scale typically required by vision-language systems.

Benchmarks

Despite its compact size, Falcon Perception demonstrates strong performance across leading benchmarks:

  • Segmentation: Matches state-of-the-art results from leading models such as Meta’s SAM3 on the SaCO benchmark for object segmentation.
  • Document understanding: Achieves competitive results on OmniDocBench, matching or approaching the performance of much larger systems including Mistral-OCR, DOTS-OCR, and Qwen-VL-235B.
  • Complex visual understanding: Outperforms competing models on more challenging prompts involving attributes, comparisons, and dense scenes.

This performance-to-efficiency ratio highlights a broader shift in AI innovation:

Progress is increasingly defined not only by scale, but by architectural refinement and deployability.

Falcon Perception

Benchmarking Intelligence

Where do we stand?

Feature                  Falcon Perception    Moondream3   Qwen3       SAM3
Architecture             Early-fusion dense   ViT+Dense    ViT+Dense   DETR
Size                     0.6B                 2/9B         4B/8B       0.9B
Simple Nouns
Complex Expressions
Segmentation
Interactive Refinement
Auto-regressive

* Performance benchmarks based on standardized evaluation metrics

Segmentation Performance

On the SaCO open-vocabulary segmentation benchmark, Falcon Perception achieved 68.0 Macro-F1, compared with 62.3 for SAM 3, with especially strong gains on more complex prompt types and dense scenarios.

The model also performs strongly on PBench, a benchmark introduced to test specific capabilities such as attributes, OCR-guided identification, spatial understanding, relations, and crowded scenes. The largest gains appear as prompts become more demanding, particularly when the task requires reading text in images, understanding spatial constraints, or reasoning about relationships between objects.
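
The Macro-F1 headline number averages a per-category F1 score, so hard prompt types (crowded scenes, attributes) count just as much as easy ones. A quick sketch with made-up counts, not benchmark data:

```python
def f1(tp, fp, fn):
    """F1 for one prompt category from true/false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# (tp, fp, fn) per category -- invented for illustration.
categories = {
    "simple nouns":   (90, 10, 10),
    "attributes":     (70, 20, 30),
    "crowded scenes": (60, 25, 40),
}
# Macro-F1: unweighted mean over categories, so each counts equally.
macro_f1 = sum(f1(*counts) for counts in categories.values()) / len(categories)
print(round(macro_f1, 3))  # 0.762
```

Because the average is unweighted, a model that only excels on simple nouns cannot hide weakness on the harder categories, which is why gains on demanding prompts move the metric.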

Falcon OCR

Document Understanding

For OCR and document tasks, the Falcon OCR model achieved scores of 80.3 on olmOCR and 88.6 on OmniDocBench, with the highest throughput among open-source OCR models.

Together, these results show that a unified vision-language design can compete with — and in some areas outperform — larger or more specialized systems, while keeping the overall architecture compact and efficient.

Falcon Perception is powerful, efficient, and practical to deploy.

By bringing image understanding, segmentation, and document reading into a more unified architecture, Falcon Perception helps simplify multimodal AI development while expanding what compact models can do.

Falcon OCR

Benchmarking OCR Intelligence

Feature                  Falcon OCR           Paddle       DOTS-OCR    Qwen3-VL-235B
Architecture             Early-fusion dense   ViT+Dense    ViT+Dense   ViT+Dense
Size                     0.3B                 0.9B         2B          235B
Layout Recognition
Element Parsing
VQA
Information Extraction

* Performance benchmarks based on standardized evaluation metrics