Image: Introduction to YOLO & Ultralytics: The Evolution of Object Detection
Have you ever wondered how AI-powered cameras instantly detect people, cars, or even animals in real-time? From self-driving cars to security systems, object detection plays a crucial role in today's tech-driven world. And at the heart of this innovation lies one of the most powerful deep learning models – YOLO, or 'You Only Look Once'.
In this video, we’ll take a deep dive into YOLO and Ultralytics – exploring how this revolutionary AI model evolved, its real-world applications, and why YOLOv12, the latest version, is changing the game.
Systems built on YOLO provide a fast, efficient, and scalable approach to tasks like vehicle identification across many industries.
YOLO (You Only Look Once) is an advanced real-time object detection algorithm designed for speed and accuracy. Unlike traditional methods, YOLO processes an entire image in a single pass, making it highly efficient for object detection.
Image 1: Classification, localization, and detection.
Image classification assigns images to categories by identifying the main object, while object localization determines both the object's presence and location. Object detection goes further by identifying multiple objects in an image and marking them with bounding boxes. Real-time object detection, like in self-driving cars, detects vehicles, pedestrians, and traffic signs to enhance safety. 🚗📸
YOLO divides an image into a grid, predicting bounding boxes and class probabilities for each grid cell.
Image 2: Convolutional layer process.
The convolutional layer in a CNN applies a filter (kernel) to an input image, sliding across regions (receptive fields) with a defined stride. Each filter, matching the input depth, performs element-wise multiplications, summing the values into a feature map. This process, similar to cross-correlation, helps detect edges, blurs, or sharpens images. Larger filters capture broader patterns, while smaller ones focus on local details. Multiple layers work together to recognize complex features like a cat’s nose and ears. 🐱📸
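To make the sliding-filter idea concrete, here is a minimal sketch in PyTorch (the framework YOLO is built on); the channel counts and image size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# One convolutional layer: 3 input channels (RGB), 16 filters,
# a 3x3 kernel, stride 1, and padding 1 (which preserves spatial size).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# A dummy batch containing a single 640x640 RGB image.
x = torch.randn(1, 3, 640, 640)

# Each of the 16 filters slides across the image and produces one feature map.
features = conv(x)
print(features.shape)  # torch.Size([1, 16, 640, 640])
```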
The newly released YOLOv12 uses an attention-based design in place of the traditional CNN-only approach while keeping its fast detection speed. It improves accuracy through advanced attention mechanisms and a better network structure, making it one of the most powerful real-time object detection models.
Area Attention Mechanism
A new way for the model to focus on important areas while keeping calculations fast. It splits images into smaller parts (default: 4) to process them efficiently, reducing computation compared to standard attention methods.
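The sketch below is a toy illustration of the idea, not the actual YOLOv12 implementation: attention is computed within each of four contiguous chunks of the flattened feature map, so its cost grows with the chunk size rather than the full image (requires PyTorch 2.0+ for scaled_dot_product_attention):

```python
import torch
import torch.nn.functional as F

def area_attention(x, num_areas=4):
    """Toy sketch: self-attention restricted to each of `num_areas` chunks.

    x: (batch, seq_len, dim) flattened feature map; seq_len divisible by num_areas.
    """
    b, n, d = x.shape
    # Split the sequence into chunks: (batch, areas, n / areas, dim).
    x = x.view(b, num_areas, n // num_areas, d)
    # Queries, keys, and values are all local to their own area.
    out = F.scaled_dot_product_attention(x, x, x)
    return out.reshape(b, n, d)

x = torch.randn(2, 1024, 64)    # e.g. a 32x32 feature map with 64 channels
print(area_attention(x).shape)  # torch.Size([2, 1024, 64])
```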
R-ELAN (Residual Efficient Layer Aggregation Network)
An upgraded method for combining features, improving learning in large attention-based models. It uses special connections (residual scaling) and a bottleneck-like structure for better optimization.
YOLOv12 also fine-tunes its attention mechanisms throughout the architecture for efficiency. Key highlights:
- Versatility: works for object detection, segmentation, image classification, pose estimation, and detecting rotated objects.
- Efficiency: achieves better results while using less computational power, making it both fast and accurate.
- Scalability: can run on anything from small edge devices to powerful cloud servers.
YOLO12 supports a variety of computer vision tasks. The table below shows task support and the operational modes (Inference, Validation, Training, and Export) enabled for each:
| Model Type | Task | Inference | Validation | Training | Export |
|---|---|---|---|---|---|
| YOLO12 | Detection | ✅ | ✅ | ✅ | ✅ |
| YOLO12-seg | Segmentation | ✅ | ✅ | ✅ | ✅ |
| YOLO12-pose | Pose | ✅ | ✅ | ✅ | ✅ |
| YOLO12-cls | Classification | ✅ | ✅ | ✅ | ✅ |
| YOLO12-obb | OBB (Oriented Bounding Box) | ✅ | ✅ | ✅ | ✅ |
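Each task variant is loaded the same way through the Ultralytics API. Note that the exact weight file names below are an assumption, following the usual Ultralytics naming convention:

```python
from ultralytics import YOLO

# Hypothetical weight file names, following the usual Ultralytics convention:
det  = YOLO('yolo12n.pt')        # detection
seg  = YOLO('yolo12n-seg.pt')    # segmentation
pose = YOLO('yolo12n-pose.pt')   # pose estimation
cls  = YOLO('yolo12n-cls.pt')    # classification
obb  = YOLO('yolo12n-obb.pt')    # oriented bounding boxes
```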
YOLO12 demonstrates significant accuracy improvements across all model scales, with some trade-offs in speed compared to the fastest prior YOLO models, as measured on the COCO validation dataset. The models are pre-trained on COCO and come in five scales: YOLO12n.pt, YOLO12s.pt, YOLO12m.pt, YOLO12l.pt, and YOLO12x.pt.
For our demo exercises, we will use NVIDIA’s T4 GPU on Google Colab Notebook for efficient performance.
The COCO (Common Objects in Context) dataset is a large-scale dataset designed for object detection, segmentation, and image captioning. It is widely used in computer vision tasks, particularly in deep learning models for object recognition and instance segmentation.
YOLO uses the COCO dataset by default.
Key Features of COCO Dataset:
- Around 330K images, over 200K of them labeled.
- 80 object categories for detection and instance segmentation.
- About 1.5 million labeled object instances.
- Five descriptive captions per image, supporting image captioning.
- Keypoint annotations for roughly 250K people, supporting pose estimation.
How Many Classes Are in COCO?
COCO has 80 object classes for detection and segmentation, plus 91 stuff classes. Below is a list of the 80 common object categories in COCO:
COCO Dataset Object Classes
Models pre-trained on the COCO dataset detect the following objects:
1. Person 2. Bicycle 3. Car 4. Motorcycle 5. Airplane 6. Bus 7. Train 8. Truck 9. Boat 10. Traffic light 11. Fire hydrant 12. Stop sign 13. Parking meter 14. Bench 15. Bird 16. Cat
17. Dog 18. Horse 19. Sheep 20. Cow 21. Elephant 22. Bear 23. Zebra 24. Giraffe 25. Backpack 26. Umbrella 27. Handbag 28. Tie 29. Suitcase 30. Frisbee 31. Skis 32. Snowboard
33. Sports ball 34. Kite 35. Baseball bat 36. Baseball glove 37. Skateboard 38. Surfboard 39. Tennis racket 40. Bottle 41. Wine glass 42. Cup 43. Fork 44. Knife 45. Spoon 46. Bowl 47. Banana 48. Apple
49. Sandwich 50. Orange 51. Broccoli 52. Carrot 53. Hot dog 54. Pizza 55. Donut 56. Cake 57. Chair 58. Couch 59. Potted plant 60. Bed 61. Dining table 62. Toilet 63. TV 64. Laptop
65. Mouse 66. Remote 67. Keyboard 68. Cell phone 69. Microwave 70. Oven 71. Toaster 72. Sink 73. Refrigerator 74. Book 75. Clock 76. Vase 77. Scissors 78. Teddy bear 79. Hair drier 80. Toothbrush
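You can verify this list yourself once a COCO pre-trained model is loaded; in the Ultralytics API, model.names maps class indices to class names:

```python
from ultralytics import YOLO

model = YOLO('yolo12n.pt')
# model.names is a dict mapping class index -> class name, e.g. {0: 'person', ...}
for idx, name in model.names.items():
    print(idx, name)
```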
Bounding boxes are rectangular boxes used in object detection to define the location of an object in an image. They enclose the detected objects and provide coordinates (x, y, width, height).
Why and How Does YOLO Create Bounding Boxes?
YOLO (You Only Look Once) generates bounding boxes to detect objects in real time. It does this by:
- Dividing the input image into a grid of cells.
- Having each cell predict bounding boxes (x, y, width, height) along with confidence scores and class probabilities.
- Applying non-maximum suppression (NMS) to discard overlapping, lower-confidence boxes so that each object keeps a single best box, as illustrated below.
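The non-maximum suppression step can be illustrated with torchvision's built-in operator; the boxes and scores below are made up for the example:

```python
import torch
from torchvision.ops import nms

# Two heavily overlapping candidate boxes (x1, y1, x2, y2) for the same object,
# plus one separate box, each with a confidence score.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.90, 0.75, 0.80])

# Among boxes overlapping more than IoU 0.5, keep only the highest-scoring one.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -> the duplicate box (index 1) is suppressed
```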
What Are .pt and .yaml Files?

.pt Files
A .pt file is a PyTorch model checkpoint that stores a trained YOLO model. It contains the model's learned weights and, optionally, training metadata such as the optimizer state.
Example: yolov12.pt, yolov12s.pt
These files are created using:

```python
torch.save(model.state_dict(), "model.pt")
```

.yaml Files
A .yaml file is a configuration file that defines model architecture, dataset details, and training settings. It is used to train and fine-tune YOLO models. It contains entries such as the number of classes, the training and validation image paths, and the class names.
Example: yolov12.yaml, coco.yaml
Sample content of a .yaml file:

```yaml
nc: 80
train: ./data/train/images
val: ./data/val/images
names: ['person', 'bicycle', 'car', ...]
```

In short, .pt files are used for model inference and deployment, while .yaml files configure training and dataset paths.
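In practice, the two file types meet in the Ultralytics API: a .pt checkpoint is loaded for inference or as a starting point for fine-tuning, and a dataset .yaml is passed to training. A minimal sketch (the epoch count and image size are illustrative):

```python
from ultralytics import YOLO

# Load a pre-trained checkpoint (.pt) as the starting point.
model = YOLO('yolo12n.pt')

# Fine-tune it using a dataset configuration file (.yaml).
# 'coco8.yaml' ships with Ultralytics as a tiny 8-image sample dataset;
# a custom file would look like the sample shown above.
model.train(data='coco8.yaml', epochs=10, imgsz=640)
```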
Running Object Detection with the Pre-trained Model (YOLO12n.pt)

Step 1: Set Up Google Drive
If you don't have a Google Drive account, create one and log in.
Step 2: Prepare the Media Folder
Ensure your YOLOV12 folder on your local computer contains the following images and videos:
football1.jpg
soccer1b.mp4
car2.mp4
Step 3: Upload the Folder
Upload this folder to your Google Drive.
Step 4: Open Google Colab
Go to Google Colab, log in, and open a new notebook with a GPU runtime (for this demo, NVIDIA's T4).
Step 5: Verify GPU Setup
Run the following command in a new cell:

```python
import tensorflow as tf
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))
```

If the output is `Num GPUs Available: 1`, then the GPU is configured correctly.
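Since Ultralytics YOLO runs on PyTorch rather than TensorFlow, you can equally well check the GPU through PyTorch:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. 'Tesla T4' on Colab
```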
Step 6: Install Ultralytics
Run the following command in a new cell:

```python
!pip install ultralytics
```

Click the "Run" button. Once installed, hide the output and add a new cell.
Note: the leading '!' tells the notebook to run a shell command. In a Command Prompt, the '!' should not be added; there, use plain 'pip install ultralytics'. In a Jupyter Notebook cell, as in Colab, keep the '!' before the pip command.
Step 7: Mount Google Drive
Run the following command in a new Colab cell:

```python
from google.colab import drive
drive.mount('/content/drive')
```
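To confirm the drive is mounted and the folder from Step 3 is visible, you can list its contents (the path assumes the folder name used above):

```python
import os

# Should list football1.jpg, soccer1b.mp4, and car2.mp4
print(os.listdir('/content/drive/MyDrive/YOLOV12'))
```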
Step 8: Access YOLO Model
You can either download YOLO12n.pt and upload it to your Google Drive, or load it directly from Ultralytics (the weights are downloaded automatically the first time they are referenced).
Step 9: Change to YOLOV12 Directory
Run this command to change into the folder on the mounted drive:

```python
%cd /content/drive/MyDrive/YOLOV12
```
Step 10: Load YOLO Model
Run the following command in a new cell:

```python
from ultralytics import YOLO
```

Then, load the pre-trained model:

```python
model = YOLO('yolo12n.pt')
```
Step 11: Run Object Detection on Images
Run the following command to detect objects in an image:

```python
result = model('/content/drive/MyDrive/YOLOV12/football1.jpg', save=True)
```

The detected image will be saved in the folder runs/detect/predict.
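Beyond the saved image, the returned result object exposes each detection programmatically. Continuing from the cells above, a short sketch using the Ultralytics Results API:

```python
# Inspect the detections from the result returned above.
boxes = result[0].boxes
for box in boxes:
    cls_id = int(box.cls)                  # class index into model.names
    conf = float(box.conf)                 # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
    print(f"{model.names[cls_id]}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```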
Step 12: Run Object Detection on Videos
Run similar commands for the video files:

```python
result = model('/content/drive/MyDrive/YOLOV12/soccer1b.mp4', save=True)
result = model('/content/drive/MyDrive/YOLOV12/car2.mp4', save=True)
```
The detected videos will be saved in the runs/detect/predict folder (Ultralytics creates predict2, predict3, and so on for subsequent runs) as:
soccer1b.avi
car2.avi
Download the image and video files.
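As an aside, for long videos Ultralytics also offers a streaming mode that yields results one frame at a time instead of holding them all in memory:

```python
# stream=True returns a generator producing one Results object per frame,
# avoiding accumulating every frame's results in RAM.
for frame_result in model('/content/drive/MyDrive/YOLOV12/car2.mp4', stream=True):
    print(len(frame_result.boxes), "objects detected in this frame")
```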
Step 13: Convert AVI to WebM
We can convert the .avi files to .webm format for web playback using online conversion tools.
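If you prefer to stay inside Colab, the same conversion can be done with ffmpeg (assuming it is available in the runtime, as it typically is on Colab):

```
!ffmpeg -i runs/detect/predict/soccer1b.avi -c:v libvpx-vp9 soccer1b.webm
!ffmpeg -i runs/detect/predict/car2.avi -c:v libvpx-vp9 car2.webm
```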
Step 14: Check the Detection Results
Double-click on football1.jpg to view the annotated output.
YOLOv12 Object Detection Comparison
On the right-hand side, the image shows the players detected with 90% confidence and the ball detected with 89% confidence. The players and the ball are enclosed in bounding boxes after detection.
By following these steps, you can use YOLOv12's pre-trained model for object detection on images and videos. The detected outputs can be saved and converted into web-friendly formats for easy sharing.
In this demonstration, we successfully used the YOLO12n.pt pre-trained model to detect objects in both images and videos. Using Google Colab with a T4 GPU, we performed object detection on sample media files with the Ultralytics YOLO framework. The detected objects were accurately identified and highlighted with bounding boxes. Additionally, we explored methods to convert and display the results efficiently.
As the next step, we will create a new webpage demonstrating object detection using a custom dataset, allowing us to train YOLO on specific objects for improved accuracy in specialized applications.
I sincerely thank Prof. Andrew Ng (DeepLearning.AI, Stanford University) for his inspiring courses that laid the foundation for this project.