Falcon Perception
Next-generation multimodal model that unifies vision and language in a single dense transformer architecture
Advanced perception capabilities, combining image and text understanding in one framework.
Processes images and text together from the very first layer. Identify objects, highlight parts of an image, or read text from documents in one model.
Powerful. Efficient. Practical to deploy.
Overview
Falcon Perception is a multimodal AI model that enables systems to see, read, and understand images using natural language prompts.
By combining vision and language capabilities in a single architecture, Falcon Perception simplifies how AI interprets visual information while remaining efficient.
A Unified Approach to Visual Understanding
Falcon Perception extends the Falcon ecosystem beyond language models into advanced visual perception. It is designed to handle tasks such as object grounding, image segmentation, and document understanding within a single model.
Traditional vision-language systems are often built as pipelines, with separate components for image encoding, language fusion, and task-specific outputs. Falcon Perception takes a simpler approach: a single early-fusion transformer that processes image and text information together from the first layer.
This avoids many of the bottlenecks and added complexity found in modular systems, making the model easier to scale and more efficient to run.
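To make the contrast concrete, here is a minimal PyTorch sketch of the early-fusion idea. It is illustrative only, not Falcon Perception's actual architecture: every dimension and name (`EarlyFusionTransformer`, `patch_proj`) is an assumption. The point is that image patches and text tokens enter one shared sequence before the first transformer layer, rather than being fused by a later adapter stage.

```python
import torch
import torch.nn as nn

class EarlyFusionTransformer(nn.Module):
    """Toy early-fusion stack: image patches and text tokens are mapped
    into one shared sequence BEFORE the first transformer layer, so every
    layer attends across both modalities."""

    def __init__(self, d_model=512, n_layers=6, n_heads=8,
                 vocab_size=32000, patch_dim=768):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)       # image patches -> shared space
        self.token_embed = nn.Embedding(vocab_size, d_model)  # text tokens -> shared space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches, token_ids):
        # patches: (B, P, patch_dim) flattened image patches
        # token_ids: (B, T) text token ids
        fused = torch.cat([self.patch_proj(patches),
                           self.token_embed(token_ids)], dim=1)
        return self.encoder(fused)  # one dense stack, fused from layer 1

model = EarlyFusionTransformer()
out = model(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # torch.Size([2, 212, 512]) -- 196 patch + 16 text positions
```

A pipeline design would instead run a separate vision encoder and bolt its output onto a language model through an adapter stage; early fusion removes that boundary entirely.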
One Model, Multiple Visual Tasks
Falcon Perception is built to work across a range of perception tasks using a unified structure. It supports open-vocabulary perception, meaning users can interact with images using natural language prompts rather than relying on fixed label sets. In practice, that allows the model to identify objects from written descriptions, highlight the correct regions in an image, and interpret text within documents using one shared architecture.
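In practice, that interaction could look like the following usage sketch. The package, class, and method names here (`falcon_perception`, `FalconPerception`, `ground`, `segment`, `read`) are hypothetical placeholders, not a published API; the sketch only illustrates one model serving several prompt-driven tasks.

```python
# Hypothetical usage sketch only: the package, class, and method names
# below are illustrative placeholders, NOT a published Falcon API.
from falcon_perception import FalconPerception  # hypothetical package

model = FalconPerception.load("falcon-perception-0.6b")  # hypothetical checkpoint id
image = open("street_scene.jpg", "rb").read()

# Open-vocabulary grounding: any free-form description, no fixed label set.
boxes = model.ground(image, prompt="the cyclist wearing a red helmet")

# Same weights, different tasks: segmentation and document reading.
masks = model.segment(image, prompt="every traffic sign")
fields = model.read(open("invoice.pdf", "rb").read(), prompt="the total amount due")
```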
The model uses a lightweight structured output process to move from object location to object size to segmentation, allowing it to handle variable numbers of objects in an image without the overhead of heavier decoder-based systems.
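A rough sketch of what such a location-to-size-to-mask decoding loop could look like follows; the token format and helper names are assumptions for illustration, not Falcon Perception's documented output scheme.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    center: tuple       # (x, y), predicted first
    size: tuple         # (w, h), predicted next, conditioned on the location
    mask_tokens: list   # compact segmentation code, predicted last

def decode_objects(emit_next):
    """Consume a structured token stream until an end-of-objects marker.
    `emit_next` stands in for one step of an autoregressive decoder."""
    objects = []
    while True:
        step = emit_next()
        if step["kind"] == "eos":          # variable object count: stop when told
            return objects
        center = step["xy"]                # 1) locate the object
        size = emit_next()["wh"]           # 2) estimate its extent
        mask = emit_next()["mask"]         # 3) emit its segmentation code
        objects.append(DetectedObject(center, size, mask))

# Stub stream standing in for model output: one object, then the end marker.
stream = iter([
    {"kind": "loc", "xy": (0.4, 0.6)},
    {"kind": "size", "wh": (0.2, 0.3)},
    {"kind": "mask", "mask": [17, 43, 43, 8]},
    {"kind": "eos"},
])
print(decode_objects(lambda: next(stream)))
```

Because each object costs only a short run of structured tokens, the object count can vary freely without the fixed query slots or heavy decoder heads of detection-style architectures.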
Built for Practical Deployment
Falcon Perception is designed for practical use across domains including document analysis, medical imaging, satellite imagery, robotics, and autonomous systems. Its compact architecture is intended to deliver strong multimodal performance without the size and infrastructure demands typically associated with large vision-language systems.
- Medical image interpretation
- Satellite and geospatial imagery
- Robotics and autonomous systems
Competitive Performance at Compact Scale
Despite its compact size, Falcon Perception delivers strong results across perception benchmarks. With a 0.6B-parameter early-fusion model for segmentation and grounding and a 0.3B model for document understanding, it achieves strong multimodal performance without the scale typically required by vision-language systems.
Benchmarks
Falcon Perception demonstrates this across leading benchmarks:
- Segmentation: Matches state-of-the-art results from leading models such as Meta’s SAM3 on the SaCO benchmark for object segmentation.
- Document understanding: Achieves competitive results on OmniDocBench, matching or approaching the performance of much larger systems including Mistral-OCR, DOTS-OCR, and Qwen3-VL-235B.
- Complex visual understanding: Outperforms competing models on more challenging prompts involving attributes, comparisons, and dense scenes.
This performance-to-efficiency ratio highlights a broader shift in AI innovation:
Progress is increasingly defined not only by scale, but by architectural refinement and deployability.
Falcon Perception
Benchmarking Intelligence
Where do we stand?
| FEATURE | FALCON PERCEPTION | MOONDREAM3 | QWEN3 | SAM3 |
|---|---|---|---|---|
| Architecture | Early fusion Dense | ViT+Dense | ViT+Dense | DETR |
| Size | 0.6B | 2B/9B | 4B/8B | 0.9B |
| Simple Nouns | | | | |
| Complex Expressions | | | | |
| Segmentation | | | | |
| Interactive Refinement | | | | |
| Auto-regressive | | | | |
* Performance benchmarks based on standardized evaluation metrics
Segmentation Performance
On the SaCO open-vocabulary segmentation benchmark, Falcon Perception achieved 68.0 Macro-F1, compared with 62.3 for SAM3, with especially strong gains on more complex prompt types and dense scenes.
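For readers unfamiliar with the metric, Macro-F1 averages per-category F1 scores without weighting by frequency, so rare or hard prompt categories count as much as common ones. A minimal sketch of the generic metric follows; the exact SaCO evaluation protocol may differ.

```python
# Macro-F1: compute F1 per class/category, then average unweighted.
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per category."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

# Example: three prompt categories of increasing difficulty.
print(macro_f1([(90, 10, 10), (60, 20, 30), (40, 25, 35)]))  # ~0.726
```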
The model also performs strongly on PBench, a benchmark introduced to test specific capabilities such as attributes, OCR-guided identification, spatial understanding, relations, and crowded scenes. The largest gains appear as prompts become more demanding, particularly when the task requires reading text in images, understanding spatial constraints, or reasoning about relationships between objects.
Falcon OCR
Document Understanding
For OCR and document tasks, the Falcon OCR model achieved scores of 80.3 on olmOCR and 88.6 on OmniDocBench, with the highest throughput among open-source OCR models.
Together, these results show that a unified vision-language design can compete with — and in some areas outperform — larger or more specialized systems, while keeping the overall architecture compact and efficient.
Falcon Perception is powerful, efficient, and practical to deploy.
By bringing image understanding, segmentation, and document reading into a more unified architecture, Falcon Perception helps simplify multimodal AI development while expanding what compact models can do.
Falcon OCR
Benchmarking OCR Intelligence
| FEATURE | FALCON OCR | PaddleOCR | DOTS-OCR | Qwen3-VL-235B |
|---|---|---|---|---|
| Architecture | Early fusion Dense | ViT+Dense | ViT+Dense | ViT+Dense |
| Size | 0.3B | 0.9B | 2B | 235B |
| Layout Recognition | | | | |
| Element Parsing | | | | |
| VQA | | | | |
| Information Extraction | | | | |
* Performance benchmarks based on standardized evaluation metrics