I tested YOLO-E, which finds objects using only text without trained classes

Terms to Know First

| Term | Description |
| --- | --- |
| YOLO | Stands for "You Only Look Once". A fast AI model that detects objects in real time by looking at an image only once. |
| Object Detection | A technology that identifies what something is and where it is located in an image. Positions are marked with bounding boxes. |
| Zero-shot Detection | The ability to detect objects the model was never trained on. It can find new things using only text descriptions. |
| Text Embedding | A sentence converted into a numerical vector. This lets the model compare inputs by "meaning". |
| Flange | A metal disc that connects pipes or machine parts. It has a large hole in the center and bolt holes around the perimeter. |

Traditional object detectors are confined to the classes they were trained on: dogs, cars, people, and so on. But what if you want to detect something really specific?

For example, what if you want to detect something like this? "The large central hole of a metal flange, excluding the small bolt holes"

A typical fixed-class model has no label for that, so it would fail. YOLO-E changes this.

YOLO-E extends classic YOLO with text-based detection. Instead of fixed labels, it responds to your descriptions.

And it actually works.



The code is surprisingly simple

Here's the code I used to detect only the large central hole in a metal flange:

```python
from ultralytics import YOLOE
from PIL import Image
import numpy as np
import cv2
import matplotlib.pyplot as plt

IMAGE_PATH = "flange.jpg"

# I only want one thing: the large central hole
PROMPTS = [
    "the large central opening of a metal flange, wide and smooth, much bigger than bolt holes"
]

CONF_THRESHOLD = 0.01

# Load model
model = YOLOE("yoloe-v8l-seg.pt").cuda()

# ⭐ The key: Convert text to embeddings
text_pe = model.get_text_pe(PROMPTS)
model.set_classes(PROMPTS, text_pe)

# Run prediction
image = Image.open(IMAGE_PATH).convert("RGB")
results = model.predict(image, conf=CONF_THRESHOLD, verbose=False)[0]

# Visualize results
img = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
for box, cls, score in zip(results.boxes.xyxy, results.boxes.cls, results.boxes.conf):
    x1, y1, x2, y2 = map(int, box)
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 0, 255), 2)
    cv2.putText(img, f"center hole {score*100:.1f}%", (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 0), 2)

plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.show()
```


---

## 🪄 Why does this work? Prompt Engineering for Vision

YOLO-E uses **text embeddings** internally:

```python
text_pe = model.get_text_pe(PROMPTS)
model.set_classes(PROMPTS, text_pe)
```

This single step transforms the model into a zero-shot detector. In other words, it attempts to match the scene with the meaning of your sentences.
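
To make the matching idea concrete, here is a toy sketch that scores candidate regions against a prompt embedding using cosine similarity. The three-dimensional vectors and the region names are made up for illustration, and YOLO-E's real internals are more involved than this; the snippet only shows the general principle of comparing regions with text in a shared vector space.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (illustrative values, not produced by any model)
prompt_embedding = [0.9, 0.1, 0.3]
region_embeddings = {
    "central hole": [0.8, 0.2, 0.4],
    "bolt hole": [0.2, 0.9, 0.1],
    "background": [0.1, 0.1, 0.9],
}

# Score every region against the prompt and pick the closest one
scores = {
    name: cosine_similarity(prompt_embedding, emb)
    for name, emb in region_embeddings.items()
}
best_region = max(scores, key=scores.get)
print(best_region, round(scores[best_region], 3))
```

The region whose vector points in nearly the same direction as the prompt vector gets the highest score, which is the intuition behind "matching the scene with the meaning of your sentences".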




💡 Key Trick: Make your prompt describe the large hole while implicitly excluding the smaller holes.


โœ๏ธ Prompts that work vs prompts that don't

โœ— Prompts to avoid

PromptProblem
"hole"Detects all holes
"opening"Too vague
"circle"Detects everything circular

✓ Prompts that work well

  • "the large central opening of a metal flange"
  • "big round inner hole much larger than bolt holes"
  • "wide center cavity of a flange"
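
One way to compare prompts more systematically is to run each candidate and count how many boxes come back. The sketch below assumes a hypothetical `count_detections` callable that you supply yourself, for example a small wrapper around the `get_text_pe`, `set_classes` and `predict` calls from the code above that returns the number of boxes. The stub used here only fakes the counts so that the example runs without a model.

```python
def rank_prompts(prompts, count_detections, target=1):
    """Sort prompts by how close their detection count is to the target."""
    counts = {p: count_detections(p) for p in prompts}
    return sorted(counts.items(), key=lambda item: abs(item[1] - target))

# Stand-in for a real detector call; these counts are invented for illustration.
def fake_count_detections(prompt):
    fake_results = {
        "hole": 9,
        "circle": 14,
        "the large central opening of a metal flange": 1,
    }
    return fake_results[prompt]

ranking = rank_prompts(
    ["hole", "circle", "the large central opening of a metal flange"],
    fake_count_detections,
)
for prompt, count in ranking:
    print(f"{count:>3} detections  <-  {prompt!r}")
```

Swapping the stub for a real wrapper would turn this into a quick prompt comparison loop on your own image.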


Real Results

Initially, the model returned too many detections:

  • 8 small bolt holes
  • 1 central hole
  • Random noise detections

After tuning the prompt, it accurately detected only the large hole.
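
If a prompt still lets a few bolt holes through, a simple geometric post-filter can act as a safety net. This is a different technique from prompt tuning and was not part of the experiment above. The sketch assumes boxes in the same `(x1, y1, x2, y2)` layout as `results.boxes.xyxy`; the helper name `keep_largest_box` and the sample coordinates are made up for illustration.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def keep_largest_box(boxes):
    """Return the box with the largest area, or None if there are no boxes."""
    return max(boxes, key=box_area, default=None)

# Invented sample detections
detections = [
    (40, 40, 70, 70),      # small bolt hole
    (150, 150, 350, 350),  # large central hole
    (400, 60, 430, 90),    # another bolt hole
]
print(keep_largest_box(detections))
```

This only makes sense when the target really is the largest object in view, as with the central hole of a flange.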

🌟 Where YOLO-E shines: It lets your words guide the detector.


๐Ÿ—๏ธ Why does this matter?

Text-based detection opens the door to various possibilities:




๐Ÿญ Industrial Inspection

Selectively detect specific defects, cracks, and anomalies

๐Ÿ”ง Custom QC Pipelines

Define quality inspection criteria in text

๐Ÿค– Robotics (Selective Picking)

Guide robots to pick up only specific objects

๐Ÿ’Ž Rare/Domain-Specific Objects

Detection possible without training data

โšก Rapid Prototyping

Test immediately without dataset labeling!


🎯 The Biggest Advantage

Rapid prototyping is possible without dataset labeling.

Just describe what you want and you're done!


Object detection is no longer about "what did the model learn". It's now about "what do I want to find".