
Introduction to YOLO & Ultralytics: The Evolution of Object Detection








1. Introduction


Have you ever wondered how AI-powered cameras instantly detect people, cars, or even animals in real-time? From self-driving cars to security systems, object detection plays a crucial role in today's tech-driven world. And at the heart of this innovation lies one of the most powerful deep learning models – YOLO, or 'You Only Look Once'.




In this article, we’ll take a deep dive into YOLO and Ultralytics – exploring how this revolutionary AI model evolved, its real-world applications, and why YOLOv12, the latest version, is changing the game.


YOLO provides a fast, efficient, and scalable approach to object detection for a wide range of industries.



2. What is the YOLO Methodology and Why Does It Matter?


YOLO (You Only Look Once) is an advanced real-time object detection algorithm designed for speed and accuracy. Unlike traditional methods, YOLO processes an entire image in a single pass, making it highly efficient for object detection.

2.1 How YOLO Object Detection Works

Image 1: Classification, localization, and detection.

Image classification assigns images to categories by identifying the main object, while object localization determines both the object's presence and location. Object detection goes further by identifying multiple objects in an image and marking them with bounding boxes. Real-time object detection, like in self-driving cars, detects vehicles, pedestrians, and traffic signs to enhance safety. 🚗📸

YOLO divides an image into a grid, predicting bounding boxes and class probabilities for each section.


Image 2: Convolutional layer process.

The convolutional layer in a CNN applies a filter (kernel) to an input image, sliding it across regions (receptive fields) with a defined stride. Each filter matches the input depth and performs element-wise multiplications, summing the values into a feature map. This operation, mathematically a cross-correlation, can detect edges, blur, or sharpen an image. Larger filters capture broader patterns, while smaller ones focus on local details. Multiple layers work together to recognize complex features like a cat’s nose and ears. 🐱📸
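
To make this concrete, here is a minimal sketch of a convolutional layer in PyTorch (the layer sizes here are illustrative, not YOLO's actual configuration):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # a batch of one RGB image, 224x224 pixels
conv = nn.Conv2d(
    in_channels=3,      # filter depth matches the input depth (RGB)
    out_channels=16,    # 16 filters produce 16 feature maps
    kernel_size=3,      # 3x3 receptive field
    stride=1,           # slide one pixel at a time
    padding=1,          # keep the spatial size unchanged
)
feature_maps = conv(image)
print(feature_maps.shape)  # torch.Size([1, 16, 224, 224])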



3. The Evolution of YOLO (Versions 1 to 12)


3.1 Advancements in YOLO Versions

  • YOLO Version 1: Released in June 2015.
  • YOLO Version 2: Released in December 2016.
  • YOLO Version 3: Released in March 2018.
  • YOLO Version 4: Released in April 2020.
  • YOLO Version 5: Released in June 2020.
  • YOLO Version 6: Released in June 2022.
  • YOLO Version 7: Released in July 2022.
  • YOLO Version 8: Released in January 2023.
  • YOLO Version 9: Released in February 2024.
  • YOLO Version 10: Released in May 2024.
  • YOLO Version 11: Released in September 2024.
  • YOLO Version 12: Released in February 2025.


4. 🚀YOLO12: Attention-Centric Object Detection

4.1 Overview

The newly released YOLOv12 uses an attention-based design instead of the traditional CNN-only approach while keeping fast detection speeds. It improves accuracy with advanced attention mechanisms and a better network structure, making it one of the most powerful real-time object detection models.



4.2 YOLOv12: Key Innovations


Area Attention Mechanism


A new way for the model to focus on important areas while keeping calculations fast. It splits images into smaller parts (default: 4) to process them efficiently, reducing computation compared to standard attention methods.


Residual Efficient Layer Aggregation Networks (R-ELAN)


An upgraded method for combining features, improving learning in large attention-based models. It uses special connections (residual scaling) and a bottleneck-like structure for better optimization.


Optimized Attention Architecture


YOLOv12 fine-tunes attention mechanisms for efficiency, using:


  • FlashAttention to reduce memory use.
  • No positional encoding for faster processing.
  • A better balance between attention and feed-forward layers.
  • Fewer layers for easier optimization.
  • 7x7 separable convolution to help understand positions without extra complexity.

Supports Multiple Tasks


Works for object detection, segmentation, image classification, pose estimation, and detecting rotated objects.


Higher Accuracy with Fewer Parameters


Achieves better results while using less computational power, making it both fast and efficient.


Flexible for Different Devices


Can run on anything from small edge devices to powerful cloud servers.


4.3 Supported Tasks and Modes


YOLO12 supports a variety of computer vision tasks. The table below shows task support and the operational modes (Inference, Validation, Training, and Export) enabled for each:


Model Type     Task            Inference   Validation   Training   Export
YOLO12         Detection       ✅          ✅           ✅         ✅
YOLO12-seg     Segmentation    ✅          ✅           ✅         ✅
YOLO12-pose    Pose            ✅          ✅           ✅         ✅
YOLO12-cls     Classification  ✅          ✅           ✅         ✅
YOLO12-obb     OBB             ✅          ✅           ✅         ✅


4.4 Performance Metrics


YOLO12 demonstrates significant accuracy improvements across all model scales, with some trade-offs in speed compared to the fastest prior YOLO models. Below are quantitative results for object detection on the COCO validation dataset:

Image: YOLO12 benchmark table on the COCO validation dataset (accuracy, speed, and model size for each variant).


The table shows that YOLO12 models are pre-trained on the COCO dataset and include different versions: YOLO12n.pt, YOLO12s.pt, YOLO12m.pt, YOLO12l.pt, and YOLO12x.pt.


  • Inference speed measured on an NVIDIA T4 GPU with TensorRT FP16 precision.
  • YOLO12n.pt is the smallest and fastest model but has lower accuracy.
  • YOLO12x.pt is the largest model with the highest accuracy but requires significantly more computational power. It is much slower due to its complexity.
  • These models are not optimized for ordinary PC CPUs; running them effectively requires a GPU.

For our demo exercises, we will use an NVIDIA T4 GPU in a Google Colab notebook for efficient performance.



5. 🚀The Role of COCO Dataset in Object Detection

5.1 COCO Dataset Overview

The COCO (Common Objects in Context) dataset is a large-scale dataset designed for object detection, segmentation, and image captioning. It is widely used in computer vision tasks, particularly in deep learning models for object recognition and instance segmentation.

YOLO's pre-trained models use the COCO dataset by default.


Key Features of COCO Dataset:


  • Contains 330,000 images (about 200,000 labeled images).
  • Includes 1.5 million object instances.
  • Provides 80 object categories for detection and segmentation.
  • Supports 5 captions per image for image captioning tasks.
  • Includes stuff annotations with 91 stuff classes (e.g., sky, road, grass).
  • Offers keypoint annotations for person pose estimation with 17 keypoints per person.

How Many Classes Are in COCO?


COCO has 80 object classes for detection and segmentation, plus 91 stuff classes. Below is a list of the 80 common object categories in COCO:




5.2 List of COCO Dataset Object Classes:

Models pre-trained on the COCO dataset can detect the following 80 object classes:

1. Person
2. Bicycle
3. Car
4. Motorcycle
5. Airplane
6. Bus
7. Train
8. Truck
9. Boat
10. Traffic light
11. Fire hydrant
12. Stop sign
13. Parking meter
14. Bench
15. Bird
16. Cat
17. Dog
18. Horse
19. Sheep
20. Cow
21. Elephant
22. Bear
23. Zebra
24. Giraffe
25. Backpack
26. Umbrella
27. Handbag
28. Tie
29. Suitcase
30. Frisbee
31. Skis
32. Snowboard
33. Sports ball
34. Kite
35. Baseball bat
36. Baseball glove
37. Skateboard
38. Surfboard
39. Tennis racket
40. Bottle
41. Wine glass
42. Cup
43. Fork
44. Knife
45. Spoon
46. Bowl
47. Banana
48. Apple
49. Sandwich
50. Orange
51. Broccoli
52. Carrot
53. Hot dog
54. Pizza
55. Donut
56. Cake
57. Chair
58. Couch
59. Potted plant
60. Bed
61. Dining table
62. Toilet
63. TV
64. Laptop
65. Mouse
66. Remote
67. Keyboard
68. Cell phone
69. Microwave
70. Oven
71. Toaster
72. Sink
73. Refrigerator
74. Book
75. Clock
76. Vase
77. Scissors
78. Teddy bear
79. Hair drier
80. Toothbrush

  • We can detect any of the above listed items using the YOLO12n.pt, YOLO12s.pt, YOLO12m.pt, YOLO12l.pt, and YOLO12x.pt pre-trained models.
  • But to detect other items, such as forest fires, brain tumors, or car registration license plates, that are not in the COCO dataset, we need to train a custom model on our own dataset.
  • In this project we shall use the default pre-trained YOLO12n.pt model to detect persons, cars, trucks, bicycles, and so on; the snippet below shows how to list its classes programmatically.
  • In our next project we shall create customised models to detect forest fires, brain tumors, and car registration license plates.
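
As a quick sanity check, the class names baked into a pre-trained checkpoint can be listed directly from the model. A minimal sketch, assuming Ultralytics is installed (the checkpoint downloads automatically on first use):

from ultralytics import YOLO

model = YOLO('yolo12n.pt')   # pre-trained COCO checkpoint
print(model.names)           # {0: 'person', 1: 'bicycle', 2: 'car', ...}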



6. What are Bounding Boxes?

6.1 Bounding Boxes Overview

Bounding boxes are rectangular boxes used in object detection to define the location of an object in an image. They enclose the detected objects and provide coordinates (x, y, width, height).

Bounding boxes detecting players and a ball

Why and How Does YOLO Create Bounding Boxes?


YOLO (You Only Look Once) generates bounding boxes to detect objects in real time (a short code sketch for reading the predicted boxes follows this list). It does this by:

  • Dividing an image into a grid.
  • Assigning each grid cell responsibility for predicting objects.
  • Using anchor boxes to estimate object positions and dimensions.
  • Applying non-maximum suppression (NMS) to remove duplicate detections.
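
Here is a minimal sketch of reading the predicted boxes from an Ultralytics result object, using the same pre-trained model and the football1.jpg image from the demo below:

from ultralytics import YOLO

model = YOLO('yolo12n.pt')
results = model('football1.jpg')   # any local image path works here

for box in results[0].boxes:
    cls_id = int(box.cls)                  # class index into model.names
    conf = float(box.conf)                 # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
    print(f"{model.names[cls_id]}: {conf:.2f} at ({x1:.0f}, {y1:.0f}), ({x2:.0f}, {y2:.0f})")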

6.2 What Are .pt and .yaml Files?

.pt Files


A .pt file is a PyTorch model checkpoint that stores a trained YOLO model. It contains:

  • Model architecture.
  • Trained weights.
  • Optimizer states (if saved).

Example: yolo12n.pt, yolo12s.pt

These files are created with PyTorch's torch.save. Saving only the state_dict stores the weights, while saving the whole model object also preserves the architecture:

torch.save(model.state_dict(), "model.pt")  # weights only
torch.save(model, "model.pt")               # full model object

.yaml Files


A .yaml file is a configuration file that defines model architecture, dataset details, and training settings. It is used to train and fine-tune YOLO models.

It contains:

  • Number of object classes.
  • Paths to training and validation datasets.
  • Model layers and parameters.

Example: yolo12.yaml, coco.yaml

Sample content of a .yaml file:


nc: 80                                    # number of object classes
train: ./data/train/images                # path to training images
val: ./data/val/images                    # path to validation images
names: ['person', 'bicycle', 'car', ...]  # class names, indexed 0 to nc-1

How Are These Files Used?

  • .pt files are used for model inference and deployment.
  • .yaml files configure training and dataset paths.
  • YOLO loads both files when training or running object detection.
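
A minimal sketch of how the two file types come together in the Ultralytics API (file names follow the Ultralytics naming convention; the epoch count is arbitrary, and training on the full COCO dataset is compute-intensive):

from ultralytics import YOLO

model = YOLO('yolo12n.yaml')   # build a fresh, untrained model from the architecture .yaml
model = YOLO('yolo12n.pt')     # or load a pre-trained checkpoint instead
model.train(data='coco.yaml', epochs=10)   # class names and dataset paths come from the data .yaml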


7. Object Detection Demo Using the Default Pre-trained YOLO12n Model

7.1 Practical demo

Demo: Using the YOLOv12 Pre-trained Model (YOLO12n.pt)

Step 1: Login to Google Drive


If you don’t have a Google Drive account, create one and log in.


Step 2: Upload YOLOV12 Folder


Ensure your YOLOV12 folder on your local computer contains the following images and videos:

  • football1.jpg
  • soccer1b.mp4
  • car2.mp4

Upload this folder to your Google Drive.


Step 3: Open Google Colab


Go to Google Colab and log in.


Step 4: Configure Hardware (T4 GPU)


  1. Click Runtime in the top menu.
  2. Click Change Runtime Type.
  3. Select T4 GPU as the hardware accelerator.
  4. Click Save.

Step 5: Verify GPU Setup


Run the following command in a new cell:


import tensorflow as tf

print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

If the output is Num GPUs Available: 1, then the GPU is configured correctly.
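
Since Ultralytics runs on PyTorch, you can equivalently verify the GPU with torch:

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))   # prints 'Tesla T4' on Colab's T4 runtime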



Step 6: Install Ultralytics


As in the Command Prompt, the pip install command can be executed in a Colab notebook cell. However, in a notebook cell it must be prefixed with '!', as in "!pip install ultralytics" shown below:


!pip install ultralytics

Click the "Run" button. Once installed, hide the output and add a new cell.


Note: In the Command Prompt, the '!' must not be added; there we simply use 'pip install ultralytics'. In a Jupyter Notebook cell, as in Colab, the '!' prefix is used before the pip command.



Step 7: Mount Google Drive


Run the following command in a new Colab cell:


from google.colab import drive
drive.mount('/content/drive')


Step 8: Access YOLO Model


You can either download YOLO12n.pt and upload it to your Google Drive, or load it directly from Ultralytics, which downloads the weights automatically on first use.



Step 9: Change to YOLOV12 Directory


Run this command to change into the YOLOV12 folder on your mounted Drive:


%cd /content/drive/MyDrive/YOLOV12


Step 10: Load YOLO Model


Run the following command in a new cell:


from ultralytics import YOLO

Then, load the pre-trained model:


model = YOLO('yolo12n.pt')


Step 11: Run Object Detection on Images


Run the following command to detect objects in an image:


result = model('/content/drive/MyDrive/YOLOV12/football1.jpg', save=True)

The detected image will be saved in the folder: runs/detect/predict.
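
Optionally, a confidence threshold can be passed so that only higher-confidence detections are drawn; conf is a standard Ultralytics predict argument, and 0.5 here is an arbitrary choice:

result = model('/content/drive/MyDrive/YOLOV12/football1.jpg', save=True, conf=0.5)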



Step 12: Run Object Detection on Videos


Run similar commands for video files:


result = model('/content/drive/MyDrive/YOLOV12/soccer1b.mp4', save=True)
result = model('/content/drive/MyDrive/YOLOV12/car2.mp4', save=True)

The detected image and videos will be saved in the runs/detect/predict folder as:


  • football1.jpg
  • soccer1b.avi
  • car2.avi

Download the image and video files.



Step 13: Convert AVI to WebM


We can convert the .avi files to .webm format for web playback using online conversion tools.
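
Alternatively, ffmpeg (preinstalled on Colab) can convert the files directly in a notebook cell; a minimal sketch, assuming the output paths from Step 12:

!ffmpeg -i runs/detect/predict/soccer1b.avi -c:v libvpx-vp9 soccer1b.webm
!ffmpeg -i runs/detect/predict/car2.avi -c:v libvpx-vp9 car2.webm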



Step 14: Check how successful the detections are:


Double-click on football1.jpg.


YOLOv12 Object Detection Comparison


Original image: before detection.
Detected image: after detection, with bounding boxes.

The image on the right shows players detected with 90% confidence and the ball detected with 89% confidence; after detection, the players and the ball are enclosed in bounding boxes.



Video Example 1: Before and After Detections

Video before detection



Video after detection



Video Example 2: Before and After Detections

Video before detection



Video after detection


Final Thoughts

By following these steps, you can use YOLOv12's pre-trained model for object detection on images and videos. The detected outputs can be saved and converted into web-friendly formats for easy sharing.



8. Conclusion


In this demonstration, we successfully utilized the YOLOv12n.pt pre-trained model to detect objects in both images and videos. Using Google Colab with a T4 GPU, we performed object detection on sample media files with the Ultralytics YOLO framework. The detected objects were accurately identified and highlighted using bounding boxes. Additionally, we explored methods to convert and display the results efficiently.

As the next step, we will create a new webpage demonstrating object detection using a custom dataset, allowing us to train YOLO on specific objects for improved accuracy in specialized applications.




Acknowledgments


I sincerely thank Prof. Andrew Ng (DeepLearning.AI, Stanford University) for his inspiring courses that laid the foundation for this project.