Computer Vision Foundations

Difficulty: Advanced

Reading Time: 25 min

Track: Computer Vision

The broad field of enabling computers to see, segment, track, and interpret visual data from the physical world.

Prerequisites

Convolutional Neural Networks

Computer Vision: Module 2 of 8

TL;DR

Computer vision tasks form a hierarchy of increasing spatial detail: classification (one label per image) → detection (boxes + labels for each object) → segmentation (a label for every pixel).
Object detection is graded with Intersection over Union (IoU), $\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$, which measures how much a predicted box overlaps the ground truth; a prediction counts as a match when its IoU exceeds a chosen threshold (commonly 0.5).
Anchor-based detectors (e.g. Faster R-CNN, SSD, YOLOv2 and YOLOv3) place a dense grid of anchor boxes over the image and turn detection into per-anchor classification (is there an object, and which class?) plus regression (how to nudge this anchor into a tight box).
Segmentation networks use per-pixel losses (pixel-wise cross-entropy or Dice loss) because a single per-image label cannot express where an object’s boundary actually lies.
As granularity increases, so does annotation cost and compute: classification labels are cheap, bounding boxes are moderate effort, and pixel masks are the most expensive to collect.
Evaluation metrics track the task: top-1/top-5 accuracy for classification, mAP (Average Precision averaged over classes and, in COCO-style evaluation, over several IoU thresholds) for detection, and mIoU (IoU computed per class, then averaged) for segmentation.
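The IoU formula above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming boxes are axis-aligned and stored as (x1, y1, x2, y2) corner coordinates; the function name `box_iou` and the example boxes are made up for this sketch.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width and height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap (otherwise counted twice).
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: the prediction is shifted right by half the box width.
ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
iou = box_iou(ground_truth, prediction)
print(f"IoU = {iou:.3f}")  # overlap 50, union 150
```

Here a box shifted by half its width scores only about 0.33, which would fail the common 0.5 matching threshold even though the two boxes look similar at a glance.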

Learning Objectives

Distinguish between classification, object detection, semantic segmentation, and instance segmentation
Explain how bounding boxes are represented and predicted
Compute Intersection over Union (IoU) to evaluate object detection boundaries
Describe standard feature extractor (backbone) architectures used in computer vision
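As a concrete reference for the bounding-box objective, the sketch below shows the two common box representations (corner format and centre/size format) and how an anchor is refined by predicted offsets. It assumes the offset parameterisation used in the R-CNN family (centre shifts scaled by anchor size, log-scale width and height); the function names and numbers are illustrative, not taken from any particular library.

```python
import math


def xyxy_to_cxcywh(box):
    """Corner format (x1, y1, x2, y2) -> centre format (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """Centre format (cx, cy, w, h) -> corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def decode_anchor(anchor_xyxy, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor box."""
    ax, ay, aw, ah = xyxy_to_cxcywh(anchor_xyxy)
    tx, ty, tw, th = deltas
    cx = ax + tx * aw          # shift the centre by a fraction of the anchor width
    cy = ay + ty * ah          # shift the centre by a fraction of the anchor height
    w = aw * math.exp(tw)      # scale width; exp keeps it positive
    h = ah * math.exp(th)      # scale height
    return cxcywh_to_xyxy((cx, cy, w, h))


# Hypothetical anchor: 20 wide, 40 tall, centred at (20, 30).
anchor = (10.0, 10.0, 30.0, 50.0)
# Hypothetical network output: move right by half a width and double the width.
deltas = (0.5, 0.0, math.log(2.0), 0.0)
refined = decode_anchor(anchor, deltas)
print(refined)
```

Predicting offsets relative to an anchor, instead of raw pixel coordinates, keeps the regression targets in a small, roughly scale-independent range.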

Intuition

How to think conceptually about this topic

Imagine trying to teach a computer to outline objects on a chalkboard. Standard image classification is like shouting "There's a dog on the board!"

Object detection is like walking up to the board and drawing a chalk box around the dog's body. Semantic segmentation is like taking colored paint and carefully coloring every single pixel that belongs to the dog, coloring the grass green, and the sidewalk gray, carving out the exact geometric reality of the scene.

In Depth

Detailed explanations and context

Computer Vision (CV) bridges the gap between raw camera pixel grids and semantic physical understanding. While image classification assigns a single label to an entire image, practical computer vision covers:

Object Detection: Finding where objects are, drawing bounding boxes, and classifying them.
Semantic & Instance Segmentation: Classifying every single pixel in the image, carving out exact item borders.
Keypoint Detection: Tracking specific anatomical joints or structural landmarks across image space.

These techniques form the core of self-driving cars, industrial automation, medical diagnostics, and spatial computing.

How It Compares

Image Classification vs Object Detection vs Semantic Segmentation

Dimension	Image Classification	Object Detection	Semantic Segmentation
Output granularity	One label for the whole image	A set of bounding boxes, each with a class and confidence score	One class label for every pixel in the image
Typical loss function	Cross-entropy over image-level class scores	Combined loss: classification (cross-entropy) + localization (smooth-L1 / IoU-based) per anchor or proposal	Pixel-wise cross-entropy, often combined with a region-overlap loss like Dice or Tversky loss
Typical architecture	CNN backbone (ResNet, EfficientNet, ViT) + global pooling + softmax head	Backbone + region proposal/anchor mechanism + detection heads (Faster R-CNN, YOLO, SSD, DETR)	Backbone + encoder-decoder with skip connections to recover spatial resolution (U-Net, DeepLab, Mask R-CNN’s mask head)
Evaluation metric	Top-1 / Top-5 accuracy	mean Average Precision (mAP) across classes and IoU thresholds	mean Intersection over Union (mIoU) across classes
Annotation cost	Lowest — one label per image	Moderate — bounding box per object instance	Highest — a label for every pixel

Takeaway: Each step up the hierarchy, from classification to detection to segmentation, trades cheaper, coarser supervision for richer, more spatially precise output, and the loss, architecture, and metric change together to match the spatial detail the task demands.
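To make the segmentation metric in the table concrete, here is a minimal sketch of mIoU on a tiny label mask. It assumes masks are nested lists of integer class ids and that classes absent from both prediction and ground truth are skipped when averaging (one common convention, not the only one); `mean_iou` and the masks are invented for illustration.

```python
def mean_iou(pred_mask, true_mask, num_classes):
    """Mean of per-class IoU over classes that appear in either mask."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for pred_row, true_row in zip(pred_mask, true_mask):
            for p, t in zip(pred_row, true_row):
                inter += int(p == c and t == c)   # pixel is class c in both masks
                union += int(p == c or t == c)    # pixel is class c in at least one
        if union > 0:                             # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 2x4 masks with two classes (0 = background, 1 = object).
true_mask = [[0, 0, 1, 1],
             [0, 0, 1, 1]]
pred_mask = [[0, 1, 1, 1],
             [0, 0, 0, 1]]

miou = mean_iou(pred_mask, true_mask, num_classes=2)
pixel_accuracy = sum(
    p == t for pr, tr in zip(pred_mask, true_mask) for p, t in zip(pr, tr)
) / 8
print(f"mIoU = {miou:.2f}, pixel accuracy = {pixel_accuracy:.2f}")
```

In this toy case pixel accuracy (0.75) is higher than mIoU (0.60), which is typical: IoU penalises both missed pixels and false positives for each class.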

When to Use It

Reach for this when

You only need to know what is in an image, not where (e.g. tagging photos by scene type) — use classification; it is the cheapest to train and label.
You need to count, locate, or track discrete object instances (e.g. counting cars, tracking pedestrians, retail shelf auditing) — use object detection.
You need precise pixel-level boundaries for downstream geometric reasoning (e.g. medical tumor delineation, autonomous-driving free-space estimation, background removal) — use semantic or instance segmentation.
You have limited compute or need real-time inference on edge devices and only coarse localization is acceptable — a single-stage detector (YOLO/SSD) is usually the right tradeoff over a two-stage detector or full segmentation network.

Avoid it when

You don’t have the budget to collect bounding-box or pixel-mask annotations — consider weak supervision (image-level labels with class-activation maps) or pre-trained zero-shot models before committing to full detection/segmentation labeling.
Objects of interest rarely overlap and only coarse location matters — full per-pixel segmentation is overkill and adds unnecessary annotation and compute cost; bounding boxes suffice.
Real-time, low-latency constraints are critical and you would need a heavy two-stage detector or large segmentation backbone — these add substantial inference latency versus a lighter classifier or single-stage detector.
Your classes are highly imbalanced at the pixel level (e.g. tiny lesions in a mostly-background medical image) and you plan to train with plain pixel-wise cross-entropy alone: that loss will be dominated by the background class. Address the imbalance first (e.g. with Dice or focal loss), or reconsider whether the task is well posed.
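The class-imbalance point in the last item can be illustrated numerically. The sketch below assumes a binary foreground/background problem with per-pixel foreground probabilities, and compares mean binary cross-entropy with a soft Dice loss for a model that predicts "background" almost everywhere. The function names, the 1% foreground ratio, and the 0.01 probability are assumptions chosen for the demonstration.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, eps=1e-7):
    """1 - Dice coefficient, computed on soft (probabilistic) predictions."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


# Hypothetical image: 1000 pixels, only 10 of them foreground (1%).
targets = [1] * 10 + [0] * 990
# A lazy model that assigns 1% foreground probability to every pixel.
probs = [0.01] * 1000

bce = binary_cross_entropy(probs, targets)
dice = soft_dice_loss(probs, targets)
print(f"cross-entropy = {bce:.3f}, Dice loss = {dice:.3f}")
```

The cross-entropy looks small because 99% of pixels are trivially correct, while the Dice loss stays close to its maximum of 1 because almost none of the foreground region is recovered.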

Rules of thumb

Always report mAP at multiple IoU thresholds (e.g. COCO’s 0.5:0.95) rather than a single threshold — a detector can look great at IoU 0.5 and mediocre at IoU 0.75.
When tuning non-maximum suppression (NMS), a lower IoU threshold removes more duplicate boxes but risks suppressing genuinely distinct, closely spaced objects.
Start from a backbone pre-trained on a large dataset (ImageNet/COCO) and fine-tune — training detection/segmentation heads from scratch is rarely worth it given how data-hungry these tasks are.
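The NMS rule of thumb can be seen in a small greedy NMS sketch. This is a plain-Python illustration, not the implementation used by any specific library; it assumes (x1, y1, x2, y2) boxes, and the three example boxes and scores are invented: two near-duplicate detections of one object plus a genuinely separate neighbour.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    overlaps every already-kept box by at most iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [
    (0, 0, 10, 10),   # object A, best detection
    (1, 1, 11, 11),   # near-duplicate of A (IoU with A is about 0.68)
    (6, 0, 16, 10),   # distinct neighbouring object (IoU with A is 0.25)
]
scores = [0.9, 0.8, 0.7]

print(nms(boxes, scores, iou_threshold=0.5))  # duplicate removed, neighbour kept
print(nms(boxes, scores, iou_threshold=0.2))  # neighbour is suppressed too
```

At a threshold of 0.5 only the duplicate is dropped; tightening it to 0.2 also discards the separate neighbouring object, which is the failure mode the rule of thumb warns about.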

Implementation

Reference code implementation

Python

model_fitting.py

import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

# Load a Faster R-CNN detector pre-trained on COCO.
# `weights=` replaces the deprecated `pretrained=True` argument (torchvision >= 0.13).
# The checkpoint covers the 80 standard COCO object classes (cars, dogs, people, etc.).
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()  # Evaluation mode: the model returns predictions instead of losses

# Detection models take a list of [channels, height, width] tensors
# with values in [0, 1]; images in the list may differ in size.
fake_images = [torch.rand(3, 300, 300)]

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(fake_images)

# One dict per input image, with 'boxes' ([N, 4] as x1, y1, x2, y2),
# 'labels' (class indices), and 'scores' (confidence)
pred = predictions[0]
print("Detected keys:", pred.keys())
print(f"Number of boxes detected: {len(pred['boxes'])}")
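In practice the raw predictions are usually filtered by confidence before use. The following is a minimal sketch of that step, assuming the tensors from the listing above have been converted to plain Python lists (for example with `.tolist()`); the helper `filter_detections`, the 0.5 cut-off, and the sample values are hypothetical. On a random-noise image like the one above, few or no boxes would be expected to pass the cut-off.

```python
def filter_detections(boxes, labels, scores, min_score=0.5):
    """Keep only detections whose confidence is at least min_score."""
    return [
        (box, label, score)
        for box, label, score in zip(boxes, labels, scores)
        if score >= min_score
    ]


# Hypothetical values standing in for pred["boxes"].tolist(),
# pred["labels"].tolist(), and pred["scores"].tolist().
boxes = [
    [12.0, 30.0, 140.0, 200.0],
    [15.0, 28.0, 138.0, 205.0],
    [200.0, 50.0, 260.0, 120.0],
]
labels = [18, 17, 3]
scores = [0.92, 0.31, 0.67]

kept = filter_detections(boxes, labels, scores, min_score=0.5)
for box, label, score in kept:
    print(f"label={label} score={score:.2f} box={box}")
```

The threshold is a precision/recall trade-off: raising it removes weak false positives but also drops correct, low-confidence detections.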

Strengths & Advantages

High spatial precision: Provides detailed pixel masks, bounding coordinates, and labels.
Real-world utility: Essential for visual sorting, camera alignment, self-driving navigation, and robotic limbs.
Rich pre-trained models: Strong pre-trained models are readily available, such as YOLO for off-the-shelf detection and Segment Anything for promptable, zero-shot segmentation.

Limitations & Drawbacks

Extremely expensive annotation: Marking individual pixels or drawing thousands of tight bounding boxes requires significant manual labor.
Condition sensitivity: Highly sensitive to shadows, lighting shifts, motion blur, and camera lens distortions.
High memory requirements: Running real-time high-resolution detection pipelines demands high VRAM and compute.

Real-World Case Studies

YOLO: Real-Time Object Detection as a Single Regression Problem

Real-time object detection

Scenario

Two-stage detectors like Faster R-CNN achieved strong accuracy but ran too slowly for real-time applications such as video surveillance and robotics, since they first generate region proposals and only then classify each one in a second pass.

Approach

YOLO (You Only Look Once) reframes detection as a single regression problem solved in one forward pass: the image is divided into a grid, and each grid cell directly predicts a small fixed number of bounding boxes (coordinates plus a confidence score) together with class probabilities, eliminating the separate proposal stage entirely. The original YOLO did this without anchor boxes; anchors were introduced in later versions.
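The grid-based prediction described above can be sketched as a decoding step. This assumes the parameterisation from the original YOLO paper: a 7×7 grid, 2 boxes per cell, 20 PASCAL VOC classes, box centres expressed as offsets within the cell, and widths and heights expressed as fractions of the whole image. The function name `decode_cell_box` and the sample prediction are made up for illustration.

```python
def decode_cell_box(row, col, x, y, w, h, grid_size, img_w, img_h):
    """Convert one cell-relative box prediction to pixel (x1, y1, x2, y2).

    x, y: centre offset inside the cell, each in [0, 1]
    w, h: box size as a fraction of the full image width/height
    """
    cx = (col + x) / grid_size * img_w   # cell column + offset -> pixel x
    cy = (row + y) / grid_size * img_h   # cell row + offset -> pixel y
    bw = w * img_w
    bh = h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# Grid size, boxes per cell, and class count as in the original paper.
S, B, C = 7, 2, 20
values_per_cell = B * 5 + C   # each box: x, y, w, h, confidence
print(f"Output tensor shape: {S} x {S} x {values_per_cell}")

# Hypothetical prediction from the centre cell of a 448x448 image.
box = decode_cell_box(row=3, col=3, x=0.5, y=0.5, w=0.25, h=0.5,
                      grid_size=S, img_w=448, img_h=448)
print(box)
```

Because every cell emits its boxes and class scores in the same output tensor, the whole image is handled by one network evaluation instead of a proposal stage followed by per-region classification.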

Outcome

The original YOLO ran at 45 frames per second (and a smaller "Fast YOLO" variant at 155 FPS) while achieving competitive mean Average Precision on PASCAL VOC. That was at least twice the speed of contemporary two-stage detectors at a modest accuracy cost, and it established single-stage detection as a standard approach for real-time applications.

Source: You Only Look Once: Unified, Real-Time Object Detection — Redmon, J., Divvala, S., Girshick, R. and Farhadi, A.

Common Misconceptions

Misconception: Semantic segmentation distinguishes between individual instances of the same class.

Correction: Semantic segmentation gives every pixel of a class (e.g. all sheep) the same label, so touching sheep merge into a single region. Instance segmentation assigns each sheep its own mask, treating them as separate objects.
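A tiny example makes the difference concrete. The sketch assumes masks stored as nested lists of integers: class ids for the semantic mask, and per-object ids for the instance mask (0 meaning background in both). The masks and the class id for "sheep" are invented for illustration.

```python
SHEEP = 1  # hypothetical class id

# Semantic mask: every sheep pixel carries the same class id,
# so two sheep standing side by side form one connected blob.
semantic_mask = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Instance mask: same pixels, but each sheep has its own id.
instance_mask = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]

sheep_pixels = sum(v == SHEEP for row in semantic_mask for v in row)
num_sheep = len({v for row in instance_mask for v in row if v != 0})

print(f"Semantic mask: {sheep_pixels} sheep pixels, instance count unknown")
print(f"Instance mask: {num_sheep} separate sheep")
```

The semantic mask answers "which pixels are sheep?" but cannot say how many sheep there are; only the instance mask supports counting.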

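Misconception: Object detection models must scan the image pixel-by-pixel with sliding windows.

Correction: Modern detectors share computation across the whole image instead of re-running a classifier on every window. Single-stage models like YOLO predict all boxes and classes in one forward pass; two-stage models like Faster R-CNN compute convolutional features once, then propose and classify regions from that shared feature map.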

References & Further Reading

Computer Vision: Algorithms and Applications (textbook), by Szeliski, R.
Deep Learning for Computer Vision (textbook), by Rosebrock, A.
