Computer Vision

Computer Vision Foundations

Computer Vision (CV) bridges the gap between raw pixel grids captured by a camera and a semantic understanding of the physical scene. While image classification assigns a single label to an entire image, practical computer vision covers:

  1. Object Detection: Finding where objects are, drawing bounding boxes, and classifying them.

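  2. Semantic & Instance Segmentation: Classifying every single pixel in the image, carving out exact item borders. Semantic segmentation assigns each pixel a class; instance segmentation also separates individual objects of the same class.

  3. Keypoint Detection: Locating specific anatomical joints or structural landmarks in image space.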

These techniques form the core of self-driving cars, industrial automation, medical diagnostics, and spatial computing.

Best use

High spatial precision: Provides detailed pixel masks, bounding coordinates, and labels.

Watch out for

Extremely expensive annotation: Marking individual pixels or drawing thousands of tight bounding boxes requires significant manual labor.

Intuition

How to think about this algorithm

Imagine trying to teach a computer to outline objects on a chalkboard. Standard image classification is like shouting "there's a dog on the board!"

Object detection is like walking up to the board and drawing a chalk box around the dog's body. Semantic segmentation is like taking colored paint and carefully coloring every single pixel that belongs to the dog, coloring the grass green, and the sidewalk gray, carving out the exact geometric reality of the scene.

Interactive Diagram

Gradient Edge Detectors

[Interactive diagram: an editable pixel grid on the left, with edge intensity and gradient direction outputs, a horizontal/vertical filter direction toggle, and a threshold cutoff (shown at 40).]

Key Insight: Computer vision models use directional convolutional filters (like Sobel operators) to compute image intensity derivatives, exposing geometric boundary edges.
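The directional filtering behind the diagram can be sketched in plain Python. This is a minimal illustration, not the diagram's actual implementation: the helper names `filter3x3` and `edge_map` are hypothetical, the kernels are the standard 3x3 Sobel operators, and the threshold of 40 simply mirrors the cutoff value shown in the diagram.

```python
import math

# Standard 3x3 Sobel kernels.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal derivative: responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical derivative: responds to horizontal edges


def filter3x3(image, kernel):
    """Cross-correlate a 2D list of intensities with a 3x3 kernel (interior pixels only)."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            row.append(sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                           for i in range(3) for j in range(3)))
        out.append(row)
    return out


def edge_map(image, threshold=40):
    """Mark pixels whose gradient magnitude reaches the threshold (assumed cutoff)."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[1 if math.hypot(x, y) >= threshold else 0 for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# A 5x5 image with a dark left side and a bright right side: one vertical edge.
image = [[0, 0, 100, 100, 100] for _ in range(5)]
print(filter3x3(image, SOBEL_X)[0])  # [400, 400, 0]
print(edge_map(image))               # [[1, 1, 0], [1, 1, 0], [1, 1, 0]]
```

The horizontal-derivative kernel fires only where intensity changes from left to right, while the vertical-derivative kernel stays at zero for this image, which is the directionality the filter toggle exposes.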

The Logic

Mathematical core for computer vision foundations

1. Intersection over Union (IoU)

To evaluate how accurately a bounding box prediction ($B_{pred}$) matches the ground truth box ($B_{gt}$), we compute the ratio of their overlap area to their total union area:

$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|B_{gt} \cap B_{pred}|}{|B_{gt} \cup B_{pred}|}$$

A prediction with $\text{IoU} \ge 0.5$ is typically counted as a successful match to the ground truth box.
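As a minimal sketch of the IoU ratio, the function below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the name `box_iou` and the example boxes are illustrative, not taken from any particular library.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width and height clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


gt_box = (0, 0, 10, 10)
pred_box = (5, 0, 15, 10)   # shifted right by half the box width
iou = box_iou(gt_box, pred_box)
print(round(iou, 3))        # 0.333
print(iou >= 0.5)           # False: below the usual matching threshold
```

Shifting a box by only half its width already drops IoU to one third, which shows how strict the 0.5 threshold is in practice.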

2. Multi-Task Bounding Box Loss

Object detectors optimize both where the box is (localization) and what is inside it (classification):

$$\mathcal{L}_{total} = \mathcal{L}_{class}(p, p^*) + \lambda \cdot \mathbb{I}_{[u \ge 1]} \mathcal{L}_{loc}(t, t^*)$$

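Where $p$ is the predicted class probability, $p^*$ is the true label, $t$ is the predicted box offsets, $t^*$ is the target offsets, and $\lambda$ weights the localization term against the classification term. Here $u$ is the ground-truth class index with $u = 0$ reserved for background, so the indicator $\mathbb{I}_{[u \ge 1]}$ activates the localization loss only when an object actually exists in the region.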
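The combined loss can be sketched for a single region in plain Python. The text does not fix the two component losses, so this sketch assumes cross-entropy for $\mathcal{L}_{class}$ and smooth L1 for $\mathcal{L}_{loc}$, a common pairing in the R-CNN family; the function names and the example numbers are hypothetical.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large errors (assumed choice for L_loc)."""
    abs_x = abs(x)
    return 0.5 * abs_x * abs_x / beta if abs_x < beta else abs_x - 0.5 * beta


def detection_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Multi-task loss for one region; class index 0 is treated as background."""
    # Classification term: negative log-probability of the true class (cross-entropy).
    loss_class = -math.log(class_probs[true_class])

    # Localization term: smooth L1 summed over the four box offsets.
    loss_loc = sum(smooth_l1(t - t_star)
                   for t, t_star in zip(pred_offsets, target_offsets))

    # Indicator [u >= 1]: background regions contribute no localization loss.
    indicator = 1.0 if true_class >= 1 else 0.0
    return loss_class + lam * indicator * loss_loc


probs = [0.5, 0.5]                  # [background, object]
pred_t = [0.5, 0.0, 0.0, 2.0]
target_t = [0.0, 0.0, 0.0, 0.0]

background = detection_loss(probs, 0, pred_t, target_t)
foreground = detection_loss(probs, 1, pred_t, target_t)
print(round(background, 4))  # 0.6931 (classification term only)
print(round(foreground, 4))  # 2.3181 (classification + localization)
```

With identical box errors, the background region pays only the classification cost, while the foreground region also pays for its offset errors.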

Code Example

computer_vision_foundations.py · pytorch example

import torch
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

# Load a pre-trained object detection model (Faster R-CNN, ResNet-50 FPN backbone).
# The weights were trained on the COCO dataset (cars, dogs, people, etc.).
# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13).
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()  # Set model to evaluation mode

# Detection models take a list of image tensors, each [channels, height, width],
# with values between 0 and 1. Here the list holds one fake 300x300 RGB image.
fake_images = [torch.rand(3, 300, 300)]

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(fake_images)

# Inspect predictions for the first image
pred = predictions[0]
print("Detected keys:", pred.keys())
# Output contains 'boxes' ([x1, y1, x2, y2] coordinates), 'labels' (classes), and 'scores' (confidence)
print(f"Number of boxes detected: {len(pred['boxes'])}")

Strengths

  • High spatial precision: Provides detailed pixel masks, bounding coordinates, and labels.

  • Real-world utility: Essential for visual sorting, camera alignment, self-driving navigation, and robotic limbs.

  • Rich pre-trained models: Access to powerful pre-trained models (like Segment Anything or YOLO), some of which can handle unseen objects zero-shot.

Limitations

  • Extremely expensive annotation: Marking individual pixels or drawing thousands of tight bounding boxes requires significant manual labor.

  • Condition sensitivity: Highly sensitive to shadows, lighting shifts, motion blur, and camera lens distortions.

  • High memory requirements: Running real-time high-resolution detection pipelines demands high VRAM and compute.

Key Assumptions

Scope conditions and interpretation notes

  1. Camera resolutions and inputs remain within training domain distributions.

  2. Bounding box labels accurately bound semantic targets.

References

Books and papers for deeper study

  • Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016) 'You only look once: Unified, real-time object detection', in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).