I tested YOLO-E, which finds objects using only text without trained classes

Terms to Know First

| Term | Description |
| --- | --- |
| YOLO | Stands for "You Only Look Once". A fast AI model that detects objects in real time by looking at an image only once. |
| Object Detection | A technology that identifies "what something is and where it's located" in an image. Positions are marked with bounding boxes. |
| Zero-shot Detection | The ability to detect objects that have never been trained on. It can find new things using only text descriptions. |
| Text Embedding | Converting sentences into numerical vectors. This allows AI to understand "meaning". |
| Flange | A metal disc that connects pipes or machine parts. It has a large hole in the center and bolt holes around the perimeter. |

In traditional object detection, models are confined to the classes they were trained on: dogs, cars, people... But what if you want to detect something really specific?

For example, what if you want to detect something like this? "The large central hole of a metal flange — excluding the small bolt holes"

A typical model would fail. But YOLO-E is a game changer.

YOLO-E extends classic YOLO with text-based detection. Instead of fixed labels, it responds to your descriptions.

And it actually works.



The code is surprisingly simple

Here's the code I used to detect only the large central hole in a metal flange:

from ultralytics import YOLOE
from PIL import Image
import numpy as np
import cv2
import matplotlib.pyplot as plt

IMAGE_PATH = "flange.jpg"

# I only want one thing: the large central hole
PROMPTS = [
    "the large central opening of a metal flange, wide and smooth, much bigger than bolt holes"
]

# Very low threshold: zero-shot confidence scores for unusual prompts can be small
CONF_THRESHOLD = 0.01

# Load model
model = YOLOE("yoloe-v8l-seg.pt").cuda()

# ⭐ The key: Convert text to embeddings
text_pe = model.get_text_pe(PROMPTS)
model.set_classes(PROMPTS, text_pe)

# Run prediction
image = Image.open(IMAGE_PATH).convert("RGB")
results = model.predict(image, conf=CONF_THRESHOLD, verbose=False)[0]

# Visualize results
img = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
for box, score in zip(results.boxes.xyxy, results.boxes.conf):
    x1, y1, x2, y2 = map(int, box)
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 0, 255), 2)
    cv2.putText(img, f"center hole {score*100:.1f}%", (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 0), 2)

plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.show()

🪄 Why does this work? — Prompt Engineering for Vision

YOLO-E uses text embeddings internally:

text_pe = model.get_text_pe(PROMPTS)
model.set_classes(PROMPTS, text_pe)

This single step transforms the model into a zero-shot detector. In other words, it attempts to match the scene with the meaning of your sentences.
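The matching idea behind this can be sketched in a few lines. This is a conceptual illustration, not YOLO-E's actual internals: the prompt becomes a vector, each candidate region gets a visual feature vector, and regions whose features point in the same direction as the prompt score high. All vectors below are synthetic stand-ins.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means "same direction", ~0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical embedding of the text prompt
# (in YOLO-E this role is played by the output of get_text_pe)
prompt_vec = rng.normal(size=8)

# Hypothetical visual features for two candidate regions:
# one well aligned with the prompt, one mostly unrelated
central_hole_vec = prompt_vec + 0.1 * rng.normal(size=8)
bolt_hole_vec = rng.normal(size=8)

# The region matching the prompt's meaning wins
assert cosine_similarity(prompt_vec, central_hole_vec) > \
       cosine_similarity(prompt_vec, bolt_hole_vec)
```

The same comparison runs for every region and every prompt, which is why the wording of the prompt directly shapes what gets detected.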




💡 Key Trick: Make your prompt describe the large hole while implicitly excluding the smaller holes.


✍️ Prompts that work vs prompts that don't

✗ Prompts to avoid

| Prompt | Problem |
| --- | --- |
| "hole" | Detects all holes |
| "opening" | Too vague |
| "circle" | Detects everything circular |

✓ Prompts that work well

  • "the large central opening of a metal flange"
  • "big round inner hole much larger than bolt holes"
  • "wide center cavity of a flange"


Real Results

Initially, the model returned far too many detections:

  • 8 small bolt holes
  • 1 central hole
  • Random noise detections

After tuning the prompt, it accurately detected only the large hole.
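Prompt tuning did the heavy lifting here, but a simple geometric post-filter makes a useful safety net: since the central hole is by definition the biggest detection, you can keep only the largest box. A minimal sketch with hypothetical xyxy boxes standing in for the model's output:

```python
import numpy as np

# Hypothetical detections in xyxy format: small bolt holes plus one big central hole
boxes = np.array([
    [10, 10, 30, 30],     # small bolt hole
    [200, 10, 220, 30],   # small bolt hole
    [95, 95, 205, 205],   # large central hole
    [10, 200, 30, 220],   # small bolt hole
])

# Area of each box, then keep only the largest one
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
best = boxes[np.argmax(areas)]
print(best)  # -> [ 95  95 205 205]
```

In practice you would feed `results.boxes.xyxy` into the same computation, combining the text prompt with a hard geometric guarantee.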

🌟 Where YOLO-E shines: It lets your words guide the detector.


🏗️ Why does this matter?

Text-based detection opens the door to various possibilities:




🏭 Industrial Inspection

Selectively detect specific defects, cracks, and anomalies

🔧 Custom QC Pipelines

Define quality inspection criteria in text

🤖 Robotics (Selective Picking)

Guide robots to pick up only specific objects

💎 Rare/Domain-Specific Objects

Detection possible without training data

⚡ Rapid Prototyping

Test immediately without dataset labeling!
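To give the "quality inspection criteria in text" idea above some concrete flavor, here's a minimal sketch. The spec is just prompts with per-prompt thresholds and expected counts; `QC_SPEC`, `check_part`, and the `detections` dict are all hypothetical names standing in for parsed model output:

```python
# Hypothetical QC spec: each criterion is a text prompt plus pass conditions
QC_SPEC = [
    {"prompt": "the large central opening of a metal flange",
     "min_conf": 0.3, "expected_count": 1},
    {"prompt": "crack on the flange surface",
     "min_conf": 0.2, "expected_count": 0},
]

def check_part(detections, spec):
    """detections: {prompt: [confidence, ...]} parsed from model output."""
    report = {}
    for rule in spec:
        hits = [c for c in detections.get(rule["prompt"], [])
                if c >= rule["min_conf"]]
        report[rule["prompt"]] = (len(hits) == rule["expected_count"])
    return report

# Example part: one confident central hole, no cracks -> passes
detections = {"the large central opening of a metal flange": [0.42]}
report = check_part(detections, QC_SPEC)
print(all(report.values()))  # True
```

Changing the inspection criteria then means editing a sentence, not retraining a model.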


🎯 The Biggest Advantage

Rapid prototyping is possible without dataset labeling.

Just describe what you want and you're done!


Object detection is no longer about "what did the model learn?" It's now about "what do I want to find?"