Computer Vision Foundations
Computer Vision (CV) bridges the gap between raw camera pixel grids and semantic physical understanding. While image classification assigns a single label to an entire image, practical computer vision covers:
- Object Detection: Finding where objects are, drawing bounding boxes, and classifying them.
- Semantic & Instance Segmentation: Classifying every single pixel in the image, carving out exact item borders.
- Keypoint Detection: Tracking specific anatomical joints or structural landmarks across image space.
These techniques form the core of self-driving cars, industrial automation, medical diagnostics, and spatial computing.
Intuition
How to think about this algorithm
Imagine teaching a computer to outline objects on a chalkboard. Standard image classification is like shouting "there's a dog on the board!"
Object detection is like walking up to the board and drawing a chalk box around the dog's body. Semantic segmentation is like taking colored paint and carefully coloring every single pixel: the dog in one color, the grass green, and the sidewalk gray, tracing the exact geometry of the scene.
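The three levels of detail in this analogy can be made concrete with a toy example. The sketch below is only illustrative: the small grid, the class name "dog", and the helper `mask_to_box` are invented for this example. It starts from a per-pixel mask (the segmentation answer), derives the tightest box around it (the detection answer), and finally collapses everything to one label (the classification answer).

```python
# Toy 5x6 segmentation mask: 1 marks "dog" pixels, 0 marks background.
# The grid and the class name are made up for illustration.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def mask_to_box(mask):
    """Return the tightest (x_min, y_min, x_max, y_max) box around nonzero pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None  # nothing to box
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


segmentation = mask                                     # per-pixel answer
detection = mask_to_box(mask)                           # box answer: (1, 1, 4, 3)
classification = "dog" if detection else "background"   # single-label answer

print(detection, classification)
```

Each step discards information: the box no longer knows which pixels inside it belong to the dog, and the label no longer knows where the dog is.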
Gradient Edge Detectors
[Interactive demo: draw or erase edges on a grid and compare the output of horizontal and vertical gradient filters.]
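The difference between the two filter directions can be sketched without any libraries. The example below assumes 3x3 Sobel kernels, which are a common choice but are not confirmed as the ones behind the demo; `filter2d` is a hypothetical helper written for this illustration.

```python
# 3x3 Sobel kernels (an assumed, commonly used choice of gradient filter).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # left-to-right change: fires on vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # top-to-bottom change: fires on horizontal edges


def filter2d(image, kernel):
    """Slide a 3x3 kernel over a 2D list (no padding) and return the responses."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height - 2):
        row = []
        for x in range(width - 2):
            row.append(sum(kernel[j][i] * image[y + j][x + i]
                           for j in range(3) for i in range(3)))
        out.append(row)
    return out


# Toy image: dark left half, bright right half, i.e. a single vertical edge.
image = [[0, 0, 1, 1] for _ in range(4)]

gx = filter2d(image, SOBEL_X)   # [[4, 4], [4, 4]]: strong response to the vertical edge
gy = filter2d(image, SOBEL_Y)   # [[0, 0], [0, 0]]: no horizontal edge to respond to
print(gx, gy)
```

Transposing the toy image (dark top, bright bottom) would swap the two results, which is the directionality the filters are meant to expose.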
The Logic
Mathematical core for computer vision foundations
1. Intersection over Union (IoU)
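To evaluate how accurately a predicted bounding box ($B_p$) matches the ground-truth box ($B_{gt}$), we compute the ratio of their overlap area to the area of their union:
$$\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$
An IoU of at least 0.5 ($\mathrm{IoU} \geq 0.5$) is typically considered a successful match.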
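As a minimal sketch of the overlap-to-union ratio, the function below computes IoU for two axis-aligned boxes. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and uses a match threshold of 0.5, which is a common convention rather than a fixed rule. The function name and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give no overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)

score = iou(predicted, ground_truth)   # 900 / 2300, roughly 0.391
is_match = score >= 0.5                # assumed threshold
print(f"IoU = {score:.3f}, match = {is_match}")
```

Note that a prediction shifted by only a quarter of the box size in each direction already falls below the 0.5 threshold, which shows how strict the metric is.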
2. Multi-Task Bounding Box Loss
Object detectors optimize both where the box is (localization) and what is inside it (classification):
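$$L(p, u, t, v) = L_{cls}(p, u) + [u \geq 1]\, L_{loc}(t, v)$$
where $p$ is the class probability, $u$ is the true label (with $u = 0$ denoting background), $t$ is the predicted box offsets, $v$ is the target offsets, and the indicator $[u \geq 1]$ activates the localization loss only when an object actually exists in the region.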
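A minimal sketch of such a combined loss for a single region is shown below. It makes several assumptions not stated above: label 0 means background, the classification term is a negative log-likelihood, and the localization term is a smooth L1 penalty (the choice used in Fast R-CNN). The function names and sample numbers are illustrative only.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty on one offset error: quadratic near zero, linear beyond 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_label, pred_offsets, target_offsets):
    """Classification loss plus localization loss, gated on a foreground label.

    Label 0 is assumed to mean background (no object in the region).
    """
    cls_loss = -math.log(class_probs[true_label])
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_offsets, target_offsets))
    has_object = 1.0 if true_label >= 1 else 0.0   # localization only counts for real objects
    return cls_loss + has_object * loc_loss


# Foreground region: both terms contribute.
fg = multitask_loss([0.1, 0.7, 0.2], 1, [0.1, 0.2, -0.1, 0.0], [0.0, 0.0, 0.0, 0.0])

# Background region: the box offsets are ignored however wrong they are.
bg = multitask_loss([0.9, 0.05, 0.05], 0, [5.0, 5.0, 5.0, 5.0], [0.0, 0.0, 0.0, 0.0])

print(f"foreground loss = {fg:.4f}, background loss = {bg:.4f}")
```

Gating the localization term keeps the detector from being penalized for box coordinates in regions where no ground-truth box exists.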
Code Example
computer_vision_foundations.py · pytorch example
```python
import torch
import torchvision.models.detection as detection

# Load a pre-trained object detection model (Faster R-CNN)
# It detects 80 standard COCO classes (cars, dogs, people, etc.)
# torchvision >= 0.13 takes a `weights` argument; older releases used pretrained=True
weights = detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = detection.fasterrcnn_resnet50_fpn(weights=weights)
model.eval()  # Set model to evaluation mode

# Create a fake input: a list of image tensors, each [channels, height, width]
# Values are normalized between 0 and 1
fake_images = [torch.rand(3, 300, 300)]

# Run inference
with torch.no_grad():
    predictions = model(fake_images)

# Inspect predictions for the first image
pred = predictions[0]
print("Detected keys:", pred.keys())
# Output contains 'boxes' (coordinates), 'labels' (classes), and 'scores' (confidence)
print(f"Number of boxes detected: {len(pred['boxes'])}")
```
Strengths
High spatial precision: Provides detailed pixel masks, bounding coordinates, and labels.
Real-world utility: Essential for visual sorting, camera alignment, self-driving navigation, and robotic limbs.
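Rich pre-trained models: Access to powerful pre-trained models (like Segment Anything or YOLO) that can be used out of the box or fine-tuned, and in some cases (such as Segment Anything) applied zero-shot to object types not seen in training.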
Limitations
Extremely expensive annotation: Marking individual pixels or drawing thousands of tight bounding boxes requires significant manual labor.
Condition sensitivity: Highly sensitive to shadows, lighting shifts, motion blur, and camera lens distortions.
High memory requirements: Running real-time high-resolution detection pipelines demands high VRAM and compute.
Key Assumptions
Scope conditions and interpretation notes
1. Camera resolution and other input characteristics stay within the distribution seen during training.
2. Ground-truth bounding boxes tightly and accurately enclose the target objects.
References
Books and papers for deeper study
Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016) 'You only look once: Unified, real-time object detection', in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).