
Introduction to YOLO & Ultralytics: The Evolution of Object Detection








1. Introduction


Have you ever wondered how AI-powered cameras instantly detect people, cars, or even animals in real-time? From self-driving cars to security systems, object detection plays a crucial role in today's tech-driven world. And at the heart of this innovation lies one of the most powerful deep learning models – YOLO, or 'You Only Look Once'.




In this article, we’ll take a deep dive into YOLO and Ultralytics – exploring how this revolutionary AI model evolved, its real-world applications, and why YOLOv12, the latest version, is changing the game.


YOLO provides a fast, efficient, and scalable approach to object detection for a wide range of industries.



2. What is the YOLO Methodology and Why Does It Matter?


YOLO (You Only Look Once) is an advanced real-time object detection algorithm designed for speed and accuracy. Unlike traditional methods, YOLO processes an entire image in a single pass, making it highly efficient for object detection.

2.1 How YOLO Object Detection Works

Image 1: Classification, localization, and detection.

Image classification assigns images to categories by identifying the main object, while object localization determines both the object's presence and location. Object detection goes further by identifying multiple objects in an image and marking them with bounding boxes. Real-time object detection, like in self-driving cars, detects vehicles, pedestrians, and traffic signs to enhance safety. 🚗📸

YOLO divides an image into a grid, predicting bounding boxes and class probabilities for each section.


Image 2: Convolutional layer process.

The convolutional layer in a CNN applies a filter (kernel) to an input image, sliding it across regions (receptive fields) with a defined stride. Each filter matches the input depth and performs element-wise multiplications, summing the values into a feature map. This operation, mathematically a cross-correlation, can detect edges, blur, or sharpen an image. Larger filters capture broader patterns, while smaller ones focus on local details. Multiple layers work together to recognize complex features like a cat’s nose and ears. 🐱📸
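
To make this concrete, here is a minimal sketch of a convolutional layer in PyTorch (the layer sizes here are illustrative, not YOLO's actual configuration):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # a batch of one RGB image, 224x224 pixels
conv = nn.Conv2d(
    in_channels=3,      # filter depth matches the input depth (RGB)
    out_channels=16,    # 16 filters produce 16 feature maps
    kernel_size=3,      # 3x3 receptive field
    stride=1,           # slide one pixel at a time
    padding=1,          # keep the spatial size unchanged
)
feature_maps = conv(image)
print(feature_maps.shape)  # torch.Size([1, 16, 224, 224])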



3. The Evolution of YOLO (Versions 1 to 12)


3.1 Advancements in YOLO Versions

  • YOLO Version 1: Released in June 2015.
  • YOLO Version 2: Released in December 2016.
  • YOLO Version 3: Released in March 2018.
  • YOLO Version 4: Released in April 2020.
  • YOLO Version 5: Released in June 2020.
  • YOLO Version 6: Released in June 2022.
  • YOLO Version 7: Released in July 2022.
  • YOLO Version 8: Released in January 2023.
  • YOLO Version 9: Released in February 2024.
  • YOLO Version 10: Released in May 2024.
  • YOLO Version 11: Released in September 2024.
  • YOLO Version 12: Released in February 2025.


4. 🚀YOLO12: Attention-Centric Object Detection

4.1 Overview

The newly released YOLOv12 uses an attention-based design instead of the traditional CNN-only approach while keeping fast detection speeds. It improves accuracy with advanced attention mechanisms and a better network structure, making it one of the most powerful real-time object detection models.



4.2 YOLOv12: Key Innovations


Area Attention Mechanism


A new way for the model to focus on important areas while keeping calculations fast. It splits images into smaller parts (default: 4) to process them efficiently, reducing computation compared to standard attention methods.


Residual Efficient Layer Aggregation Networks (R-ELAN)


An upgraded method for combining features, improving learning in large attention-based models. It uses special connections (residual scaling) and a bottleneck-like structure for better optimization.


Optimized Attention Architecture


YOLOv12 fine-tunes attention mechanisms for efficiency, using:


  • FlashAttention to reduce memory use.
  • No positional encoding for faster processing.
  • A better balance between attention and feed-forward layers.
  • Fewer layers for easier optimization.
  • 7x7 separable convolution to help understand positions without extra complexity.

Supports Multiple Tasks


Works for object detection, segmentation, image classification, pose estimation, and detecting rotated objects.


Higher Accuracy with Fewer Parameters


Achieves better results while using less computational power, making it both fast and efficient.


Flexible for Different Devices


Can run on anything from small edge devices to powerful cloud servers.


4.3 Supported Tasks and Modes


YOLO12 supports a variety of computer vision tasks. The table below shows task support and the operational modes (Inference, Validation, Training, and Export) enabled for each:


Model Type     Task            Inference   Validation   Training   Export
YOLO12         Detection       ✅          ✅           ✅         ✅
YOLO12-seg     Segmentation    ✅          ✅           ✅         ✅
YOLO12-pose    Pose            ✅          ✅           ✅         ✅
YOLO12-cls     Classification  ✅          ✅           ✅         ✅
YOLO12-obb     OBB             ✅          ✅           ✅         ✅


4.4 Performance Metrics


YOLO12 demonstrates significant accuracy improvements across all model scales, with some trade-offs in speed compared to the fastest prior YOLO models. Below are quantitative results for object detection on the COCO validation dataset:

Image: YOLO12 benchmark table on the COCO validation dataset (accuracy, speed, and model size for each variant).


The table shows that YOLO12 models are pre-trained on the COCO dataset and include different versions: YOLO12n.pt, YOLO12s.pt, YOLO12m.pt, YOLO12l.pt, and YOLO12x.pt.


  • Inference speed measured on an NVIDIA T4 GPU with TensorRT FP16 precision.
  • YOLO12n.pt is the smallest and fastest model but has lower accuracy.
  • YOLO12x.pt is the largest model with the highest accuracy but requires significantly more computational power. It is much slower due to its complexity.
  • These models are not optimized for ordinary PC CPUs; running them effectively requires a GPU.

For our demo exercises, we will use an NVIDIA T4 GPU in a Google Colab notebook for efficient performance.



5. 🚀The Role of COCO Dataset in Object Detection

5.1 COCO Dataset Overview

The COCO (Common Objects in Context) dataset is a large-scale dataset designed for object detection, segmentation, and image captioning. It is widely used in computer vision tasks, particularly in deep learning models for object recognition and instance segmentation.

YOLO's pre-trained models use the COCO dataset by default.


Key Features of COCO Dataset:


  • Contains 330,000 images (about 200,000 labeled images).
  • Includes 1.5 million object instances.
  • Provides 80 object categories for detection and segmentation.
  • Supports 5 captions per image for image captioning tasks.
  • Includes stuff annotations with 91 stuff classes (e.g., sky, road, grass).
  • Offers keypoint annotations for person pose estimation with 17 keypoints per person.

How Many Classes Are in COCO?


COCO has 80 object classes for detection and segmentation, plus 91 stuff classes. Below is a list of the 80 common object categories in COCO:




5.2 List of COCO Dataset Object Classes:

Models pre-trained on the COCO dataset can detect the following 80 object classes:

1. Person
2. Bicycle
3. Car
4. Motorcycle
5. Airplane
6. Bus
7. Train
8. Truck
9. Boat
10. Traffic light
11. Fire hydrant
12. Stop sign
13. Parking meter
14. Bench
15. Bird
16. Cat
17. Dog
18. Horse
19. Sheep
20. Cow
21. Elephant
22. Bear
23. Zebra
24. Giraffe
25. Backpack
26. Umbrella
27. Handbag
28. Tie
29. Suitcase
30. Frisbee
31. Skis
32. Snowboard
33. Sports ball
34. Kite
35. Baseball bat
36. Baseball glove
37. Skateboard
38. Surfboard
39. Tennis racket
40. Bottle
41. Wine glass
42. Cup
43. Fork
44. Knife
45. Spoon
46. Bowl
47. Banana
48. Apple
49. Sandwich
50. Orange
51. Broccoli
52. Carrot
53. Hot dog
54. Pizza
55. Donut
56. Cake
57. Chair
58. Couch
59. Potted plant
60. Bed
61. Dining table
62. Toilet
63. TV
64. Laptop
65. Mouse
66. Remote
67. Keyboard
68. Cell phone
69. Microwave
70. Oven
71. Toaster
72. Sink
73. Refrigerator
74. Book
75. Clock
76. Vase
77. Scissors
78. Teddy bear
79. Hair drier
80. Toothbrush

  • We can detect any of the above listed items using the YOLO12n.pt, YOLO12s.pt, YOLO12m.pt, YOLO12l.pt, and YOLO12x.pt pre-trained models.
  • But to detect other items, such as forest fires, brain tumors, or car registration license plates, that are not in the COCO dataset, we need to train a custom model on our own dataset.
  • In this project we shall use the default pre-trained YOLO12n.pt model to detect persons, cars, trucks, bicycles, and so on; the snippet below shows how to list its classes programmatically.
  • In our next project we shall create customised models to detect forest fires, brain tumors, and car registration license plates.
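
As a quick sanity check, the class names baked into a pre-trained checkpoint can be listed directly from the model. A minimal sketch, assuming Ultralytics is installed (the checkpoint downloads automatically on first use):

from ultralytics import YOLO

model = YOLO('yolo12n.pt')   # pre-trained COCO checkpoint
print(model.names)           # {0: 'person', 1: 'bicycle', 2: 'car', ...}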



6. What are Bounding Boxes?

6.1 Bounding Boxes Overview

Bounding boxes are rectangular boxes used in object detection to define the location of an object in an image. They enclose the detected objects and provide coordinates (x, y, width, height).

Bounding boxes detecting players and a ball

Why and How Does YOLO Create Bounding Boxes?


YOLO (You Only Look Once) generates bounding boxes to detect objects in real time (a short code sketch for reading the predicted boxes follows this list). It does this by:

  • Dividing an image into a grid.
  • Assigning each grid cell responsibility for predicting objects.
  • Using anchor boxes to estimate object positions and dimensions.
  • Applying non-maximum suppression (NMS) to remove duplicate detections.
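
Here is a minimal sketch of reading the predicted boxes from an Ultralytics result object, using the same pre-trained model and the football1.jpg image from the demo below:

from ultralytics import YOLO

model = YOLO('yolo12n.pt')
results = model('football1.jpg')   # any local image path works here

for box in results[0].boxes:
    cls_id = int(box.cls)                  # class index into model.names
    conf = float(box.conf)                 # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
    print(f"{model.names[cls_id]}: {conf:.2f} at ({x1:.0f}, {y1:.0f}), ({x2:.0f}, {y2:.0f})")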

6.2 What Are .pt and .yaml Files?

.pt Files


A .pt file is a PyTorch model checkpoint that stores a trained YOLO model. It contains:

  • Model architecture.
  • Trained weights.
  • Optimizer states (if saved).

Example: yolo12n.pt, yolo12s.pt

These files are created with PyTorch's torch.save. Saving only the state_dict stores the weights, while saving the whole model object also preserves the architecture:

torch.save(model.state_dict(), "model.pt")  # weights only
torch.save(model, "model.pt")               # full model object

.yaml Files


A .yaml file is a configuration file that defines model architecture, dataset details, and training settings. It is used to train and fine-tune YOLO models.

It contains:

  • Number of object classes.
  • Paths to training and validation datasets.
  • Model layers and parameters.

Example: yolo12.yaml, coco.yaml

Sample content of a .yaml file:


nc: 80                                    # number of object classes
train: ./data/train/images                # path to training images
val: ./data/val/images                    # path to validation images
names: ['person', 'bicycle', 'car', ...]  # class names, indexed 0 to nc-1

How Are These Files Used?

  • .pt files are used for model inference and deployment.
  • .yaml files configure training and dataset paths.
  • YOLO loads both files when training or running object detection.
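
A minimal sketch of how the two file types come together in the Ultralytics API (file names follow the Ultralytics naming convention; the epoch count is arbitrary, and training on the full COCO dataset is compute-intensive):

from ultralytics import YOLO

model = YOLO('yolo12n.yaml')   # build a fresh, untrained model from the architecture .yaml
model = YOLO('yolo12n.pt')     # or load a pre-trained checkpoint instead
model.train(data='coco.yaml', epochs=10)   # class names and dataset paths come from the data .yaml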


7. Object Detection Demo Using the Default Pre-trained YOLO12n Model

7.1 Practical demo

Demo: Using the YOLOv12 Pre-trained Model (YOLO12n.pt)

Step 1: Login to Google Drive


If you don’t have a Google Drive account, create one and log in.


Step 2: Upload YOLOV12 Folder


Ensure your YOLOV12 folder on your local computer contains the following images and videos:

  • football1.jpg
  • soccer1b.mp4
  • car2.mp4

Upload this folder to your Google Drive.


Step 3: Open Google Colab


Go to Google Colab and log in.


Step 4: Configure Hardware (T4 GPU)


  1. Click Runtime in the top menu.
  2. Click Change Runtime Type.
  3. Select T4 GPU as the hardware accelerator.
  4. Click Save.

Step 5: Verify GPU Setup


Run the following command in a new cell:


import tensorflow as tf

print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

If the output is Num GPUs Available: 1, then the GPU is configured correctly.
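
Since Ultralytics runs on PyTorch, you can equivalently verify the GPU with torch:

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))   # prints 'Tesla T4' on Colab's T4 runtime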



Step 6: Install Ultralytics


As in the Command Prompt, the pip install command can be executed in a Colab notebook cell. However, in a notebook cell it must be prefixed with '!', as in "!pip install ultralytics" shown below:


!pip install ultralytics

Click the "Run" button. Once installed, hide the output and add a new cell.


Note: In the Command Prompt, the '!' must not be added; there we simply use 'pip install ultralytics'. In a Jupyter Notebook cell, as in Colab, the '!' prefix is used before the pip command.



Step 7: Mount Google Drive


Run the following command in a new Colab cell:


from google.colab import drive
drive.mount('/content/drive')


Step 8: Access YOLO Model


You can either download YOLO12n.pt and upload it to your Google Drive, or load it directly from Ultralytics, which downloads the weights automatically on first use.



Step 9: Change to YOLOV12 Directory


Run this command to change into the YOLOV12 folder on your mounted Drive:


%cd /content/drive/MyDrive/YOLOV12


Step 10: Load YOLO Model


Run the following command in a new cell:


from ultralytics import YOLO

Then, load the pre-trained model:


model = YOLO('yolo12n.pt')


Step 11: Run Object Detection on Images


Run the following command to detect objects in an image:


result = model('/content/drive/MyDrive/YOLOV12/football1.jpg', save=True)

The detected image will be saved in the folder: runs/detect/predict.
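
Optionally, a confidence threshold can be passed so that only higher-confidence detections are drawn; conf is a standard Ultralytics predict argument, and 0.5 here is an arbitrary choice:

result = model('/content/drive/MyDrive/YOLOV12/football1.jpg', save=True, conf=0.5)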



Step 12: Run Object Detection on Videos


Run similar commands for video files:


result = model('/content/drive/MyDrive/YOLOV12/soccer1b.mp4', save=True)
result = model('/content/drive/MyDrive/YOLOV12/car2.mp4', save=True)

The detected image and videos will be saved in the runs/detect/predict folder as:


  • football1.jpg
  • soccer1b.avi
  • car2.avi

Download the image and video files.



Step 13: Convert AVI to WebM


We can convert the .avi files to .webm format for web playback using online conversion tools.
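
Alternatively, ffmpeg (preinstalled on Colab) can convert the files directly in a notebook cell; a minimal sketch, assuming the output paths from Step 12:

!ffmpeg -i runs/detect/predict/soccer1b.avi -c:v libvpx-vp9 soccer1b.webm
!ffmpeg -i runs/detect/predict/car2.avi -c:v libvpx-vp9 car2.webm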



Step 14: Check how successful the detections are:


Double-click on football1.jpg.


YOLOv12 Object Detection Comparison


Original image: before detection.
Detected image: after detection, with bounding boxes.

The image on the right shows players detected with 90% confidence and the ball detected with 89% confidence; after detection, the players and the ball are enclosed in bounding boxes.



Video Example 1: Before and After Detections

Video before detection



Video after detection



Video Example 2: Before and After Detections

Video before detection



Video after detection


Final Thoughts

By following these steps, you can use YOLOv12's pre-trained model for object detection on images and videos. The detected outputs can be saved and converted into web-friendly formats for easy sharing.



8. Conclusion


In this demonstration, we successfully utilized the YOLOv12n.pt pre-trained model to detect objects in both images and videos. Using Google Colab with a T4 GPU, we performed object detection on sample media files with the Ultralytics YOLO framework. The detected objects were accurately identified and highlighted using bounding boxes. Additionally, we explored methods to convert and display the results efficiently.

As the next step, we will create a new webpage demonstrating object detection using a custom dataset, allowing us to train YOLO on specific objects for improved accuracy in specialized applications.




Acknowledgments


I sincerely thank Prof. Andrew Ng (DeepLearning.AI, Stanford University) for his inspiring courses that laid the foundation for this project.