YOLOE: A Faster Model for Object Detection

Published on June 2, 2025
Shaoni Mukherjee

Technical Writer

Object detection and segmentation are key parts of computer vision, used in everything from self-driving cars to medical image analysis. Popular models like the YOLO series are fast and accurate, but they can only recognize a fixed set of object categories. This makes them less useful in real-world scenarios where new or uncommon objects may appear. To fix this, recent research has focused on “open-set” models that can detect and label any object, even those not seen during training, using prompts like text or visual cues.

YOLOE is a powerful and efficient model that works like a human eye, recognizing any object across various prompt types: text prompts, visual hints, or no prompts at all. It builds on the strengths of the YOLO family but is designed for more flexible real-world use, while keeping the speed and light weight that made YOLO famous.

How does YOLOE work?

Here’s how YOLOE works across the three prompt types:

  1. Text Prompts (RepRTA Strategy)
    For situations where you describe what you’re looking for (e.g., “find all bicycles”), YOLOE uses a strategy called Re-parameterizable Region-Text Alignment (RepRTA). It improves how the model connects text and images using a lightweight helper network. During inference, this helper network is folded into the main model, so there’s no extra cost or delay.

  2. Visual Prompts (SAVPE Strategy)
    If you provide an example region or visual cue, YOLOE uses the Semantic-Activated Visual Prompt Encoder (SAVPE). It splits the job into two branches—one for understanding the meaning (semantics) and another for activating relevant regions. This smart separation allows the model to stay accurate while keeping things simple and fast.

  3. Prompt-Free (LRPC Strategy)
    When no prompt is given, YOLOE uses Lazy Region-Prompt Contrast (LRPC). Instead of relying on large, slow language models, it matches detected objects with a built-in list of known categories. This allows it to perform well while saving on memory and computation.
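The re-parameterization trick behind RepRTA (item 1 above) is worth pausing on. Below is a minimal numpy sketch of the general folding idea: an auxiliary linear layer refines the text embeddings during training, and at inference it is composed into the main head once, so the deployed model does the same work with no extra layer. All shapes, names, and values here are illustrative, not YOLOE's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Main head projecting region features into a shared embedding space.
W_main = rng.standard_normal((16, 8))
# Lightweight auxiliary layer that refines text embeddings during training.
W_aux = rng.standard_normal((8, 8))

text_embeddings = rng.standard_normal((3, 8))   # 3 class prompts
region_features = rng.standard_normal((5, 16))  # 5 candidate regions

# Training-time path: refine the text embeddings, then score regions.
refined = text_embeddings @ W_aux.T
scores_train = (region_features @ W_main) @ refined.T

# Inference-time path: fold the auxiliary layer into the main head once,
# so each image pays no extra cost.
W_folded = W_main @ W_aux
scores_infer = (region_features @ W_folded) @ text_embeddings.T

print(np.allclose(scores_train, scores_infer))  # both paths agree
```

The point is that two consecutive linear maps compose into one, which is why the helper network "disappears" at inference time.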


YOLOE supports detection and segmentation across diverse open prompt types by using re-parameterizable region-text alignment for text, SAVPE for efficient visual prompt embedding, and lazy region-prompt contrast for prompt-free object categorization.
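The prompt-free LRPC idea, matching detected regions against a built-in vocabulary instead of invoking a large language model, can be sketched in a few lines. This is a toy illustration with made-up 2-D embeddings and category names, not YOLOE's real vocabulary or embedding space:

```python
import numpy as np

def lrpc_match(region_embeddings, vocab_embeddings, vocab_names, threshold=0.5):
    """Toy sketch of Lazy Region-Prompt Contrast: compare each detected
    region's embedding against a built-in vocabulary and keep the best
    match above a similarity threshold."""
    # Normalize so dot products become cosine similarities.
    regions = region_embeddings / np.linalg.norm(region_embeddings, axis=1, keepdims=True)
    vocab = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    sims = regions @ vocab.T                      # (num_regions, vocab_size)
    best = sims.argmax(axis=1)
    return [vocab_names[i] if sims[r, i] >= threshold else None
            for r, i in enumerate(best)]

# Toy vocabulary and three detected regions.
vocab_names = ["cat", "dog", "car"]
vocab_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
regions = np.array([[0.9, 0.1],     # points mostly along the "cat" axis
                    [0.1, 0.1],     # equal mix of both axes, closest to "car"
                    [-1.0, -0.5]])  # far from everything -> no label
print(lrpc_match(regions, vocab_emb, vocab_names))
```

Because the vocabulary comparison is just a matrix multiply, this stays cheap at inference time, which is the efficiency claim behind LRPC.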

Getting Started with YOLOE: Zero-Shot Object Detection and Segmentation

Here is the code walkthrough to use YOLOE for your projects:

# Step 1: Clone the YOLOE Repository
git clone https://github.com/THU-MIG/yoloe.git
cd yoloe
# Step 2: Install Dependencies
pip install -r requirements.txt
# Step 3: Download Pretrained Models
# Visit https://github.com/THU-MIG/yoloe to download pretrained weights (e.g., yoloe-v8l-seg.pt)
# Place them so the path matches the --checkpoint flag below (e.g., yoloe/pretrain/)
# Step 4: Prepare Your Dataset
# Place your test images in a folder (e.g., ./data/images/)
# For zero-shot detection, make sure you have text prompts or class descriptions ready
# Step 5: Run Inference
python predict_text_prompt.py \
    --source ./data/images/  \
    --checkpoint pretrain/yoloe-v8l-seg.pt \
    --text_prompts "cat, dog, car, person" \
    --device cuda:0

# Step 6: Visualize Results
# Each image will show:
# - Bounding boxes
# - Segmentation masks
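Step 6 relies on the repo's own plotting utilities. As a rough, self-contained illustration of what "boxes plus masks" means, here is a small numpy sketch that draws a box outline and alpha-blends a segmentation mask onto an image; the helper names, box, and mask values are all made up for illustration:

```python
import numpy as np

def draw_box(image, box, color=(255, 0, 0)):
    """Draw a 1-pixel rectangle outline on an HxWx3 uint8 image (in place)."""
    x1, y1, x2, y2 = box
    image[y1:y2 + 1, [x1, x2]] = color   # left and right edges
    image[[y1, y2], x1:x2 + 1] = color   # top and bottom edges
    return image

def overlay_mask(image, mask, color=(0, 255, 0), alpha=0.4):
    """Alpha-blend a boolean segmentation mask over the image."""
    out = image.astype(float)
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=float)
    return out.astype(np.uint8)

canvas = np.zeros((64, 64, 3), dtype=np.uint8)   # dummy black image
canvas = draw_box(canvas, (10, 10, 40, 40))      # red detection box
mask = np.zeros((64, 64), dtype=bool)
mask[15:35, 15:35] = True                        # dummy segmentation mask
canvas = overlay_mask(canvas, mask)              # green mask overlay
print(canvas[10, 10], canvas[20, 20])            # box pixel, masked pixel
```

In practice the model's predicted boxes and masks would replace the dummy values above, but the rendering idea is the same.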

Conclusion

To conclude, YOLOE is another breakthrough model that combines speed, flexibility, and simplicity. It works across all prompt types, text, visual, or none, without the heavy cost of more complex open-set models. It is a big step toward truly intelligent, real-time computer vision that adapts to whatever the world throws at it. Personally, I find YOLOE's design not just impressive but a promising shift toward real-time AI that is actually deployable in production applications.


About the author

Shaoni Mukherjee
Technical Writer

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on emerging technologies. Currently focused on AI, machine learning, and GPU computing, I work on topics ranging from deep learning frameworks to optimizing GPU-based workloads.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.