AI - Computer Vision

Computer Vision, a subfield of Artificial Intelligence, focuses on enabling machines to extract, analyze, and interpret meaningful information from visual data. With the growth of computational power, data availability, and deep learning techniques, Computer Vision has evolved from basic image processing to complex visual understanding tasks. Let's explore the technical foundations of AI in Computer Vision, including key algorithms, architectures, datasets, evaluation metrics, and common vision tasks.

Historical Context

Computer vision has evolved dramatically over the past few decades. Early systems relied on handcrafted features such as edges and corners, extracted with algorithms like SIFT and HOG. These approaches worked well for simple tasks but struggled with real-world complexity. The turning point came in 2012, when AlexNet, a deep Convolutional Neural Network (CNN), won the ImageNet challenge by a wide margin. This victory demonstrated that neural networks could outperform traditional methods in image classification, sparking a revolution in computer vision. Since then, CNNs have become the foundation of nearly every modern vision system.

Convolutional Neural Networks (CNNs)

One of the most common machine learning model architectures for computer vision is the Convolutional Neural Network (CNN), a deep learning architecture designed to process visual data. CNNs are particularly effective at tasks like image classification, object detection, and segmentation because they learn to extract meaningful patterns from pixel data.

How CNNs Work

CNNs use filters (also known as kernels) to scan across an image and extract numeric feature maps. These maps represent visual patterns such as edges, textures, and shapes. The extracted features are then passed through deeper layers of the network to generate a label prediction, essentially answering the question: “What is this an image of?” For example, in an image classification scenario, you might train a CNN model with images of different kinds of fruit such as apples, bananas, and oranges. The model learns to associate certain visual features with each fruit type. When presented with a new image, the CNN predicts the label: “This is a banana.”
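
To make this concrete, here is a minimal sketch of such a network in PyTorch. The layer sizes, the 64×64 input resolution, and the three fruit classes are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class FruitCNN(nn.Module):
    def __init__(self, num_classes=3):  # assumed classes: apple, banana, orange
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 learnable filters scan the image
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: more abstract patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)       # numeric feature maps (edges, textures, shapes)
        x = torch.flatten(x, 1)
        return self.classifier(x)  # one score per fruit class

model = FruitCNN()
scores = model(torch.randn(1, 3, 64, 64))  # random stand-in for a 64x64 RGB image
print(scores.shape)                        # torch.Size([1, 3])
```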

Training a CNN

During the training process:

  • Filter kernels are initially defined using randomly generated weight values.
  • The model processes labeled images and makes predictions.
  • These predictions are evaluated against known label values.
  • The filter weights are adjusted to improve accuracy.

Eventually, the trained fruit image classification model settles on the filter weights that best extract the features needed to identify different kinds of fruit. Through this process, a CNN learns which visual characteristics, such as shape, color, and texture, are most useful for distinguishing between classes.
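
This cycle looks roughly like the following in PyTorch, reusing the `FruitCNN` model above and assuming a hypothetical `train_loader` that yields batches of labeled fruit images; the optimizer and learning rate are illustrative:

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()                   # evaluates predictions against known labels
optimizer = optim.SGD(model.parameters(), lr=0.01)  # adjusts the randomly initialized filter weights

for images, labels in train_loader:                 # labeled training images (hypothetical loader)
    optimizer.zero_grad()
    logits = model(images)                          # process images and make predictions
    loss = criterion(logits, labels)                # measure how wrong the predictions are
    loss.backward()                                 # propagate the error backward
    optimizer.step()                                # update filter weights to improve accuracy
```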

Core Concepts & Techniques

CNNs are built on several key principles that make them effective:

  • Local connectivity: Neurons connect to small regions of the input, preserving spatial relationships.
  • Weight sharing: Filters are reused across the image, reducing the number of parameters.
  • Activation functions: Non-linear functions like ReLU introduce non-linearity, letting the network model complex patterns.
  • Pooling: Downsampling reduces dimensionality and improves efficiency.
  • Backpropagation: Errors are propagated backward to update weights.
  • Regularization: Techniques like dropout and batch normalization help prevent overfitting.

These concepts enable CNNs to learn complex patterns while remaining computationally efficient.
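
Weight sharing in particular is easy to verify: because each filter is reused at every image location, a convolutional layer's parameter count depends only on the filter shape, never on the input resolution. A quick check, assuming PyTorch:

```python
import torch.nn as nn

# 16 filters, each 3x3 over 3 input channels, plus one bias per filter.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 16 * (3*3*3) + 16 = 448, whether the image is 64x64 or 4K
```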

Key Architectures

Over time, CNNs have evolved into deeper and more sophisticated architectures:

| Architecture | Key Features | Use Case |
| --- | --- | --- |
| AlexNet | First deep CNN to win ImageNet | Image classification |
| VGGNet | Uniform architecture with small filters | Transfer learning |
| ResNet | Introduced residual connections | Very deep networks |
| Inception | Multi-scale feature extraction | Efficient computation |
| DenseNet | Dense connections between layers | Feature reuse |

Each architecture builds on the strengths of its predecessors, pushing the boundaries of accuracy and scalability.

Vision Models: CNNs and Beyond

Computer vision models come in many architectural flavors. While Convolutional Neural Networks (CNNs) have long been the backbone of visual recognition, newer models like Transformers and Graph Neural Networks are expanding the frontier. Here is an overview of both CNN-based and non-CNN-based models, organized by task and innovation.

| Model Name | Type | Task Focus | Key Innovation |
| --- | --- | --- | --- |
| AlexNet | CNN | Image Classification | First deep CNN to win ImageNet |
| VGGNet | CNN | Classification & Transfer | Uniform architecture with small filters |
| ResNet | CNN | Deep Classification | Residual connections for deeper networks |
| YOLO | CNN | Real-Time Detection | Single-pass object detection |
| U-Net | CNN | Image Segmentation | Symmetric encoder-decoder architecture |
| Vision Transformer (ViT) | Transformer | General Vision Tasks | Self-attention replaces convolutions |
| Graph Neural Network (GNN) | Graph-based | Relational Vision Tasks | Models data as nodes and edges |
| Capsule Network | Capsule-based | Spatial Reasoning | Preserves part-whole relationships |
| Support Vector Machine (SVM) | Classical ML | Classification | Hyperplane-based decision boundaries |
| Random Forest | Classical ML | Classification & Regression | Ensemble of decision trees |

Datasets & Benchmarks

To train and evaluate computer vision models effectively, researchers rely on datasets and benchmarks, two foundational tools that shape the development of intelligent visual systems.

Datasets are curated collections of labeled images (or other visual data) used to teach models how to recognize patterns.

  • They serve as the training material for machine learning algorithms.
  • Labels indicate what each image contains, such as objects, categories, or pixel-level annotations.
  • A diverse and well-labeled dataset helps models generalize to real-world scenarios.

Think of a dataset as the curriculum for a vision model: it is how the model learns.

Benchmarks are standardized tests that evaluate and compare model performance.

  • They use fixed datasets and scoring metrics (e.g., accuracy, precision, recall).
  • Researchers submit models to benchmark challenges to see how well they perform.
  • Benchmarks identify state-of-the-art models and track progress across the field.

Benchmarks are like the final exam: everyone takes the same test, and the scores reveal which model performs best.
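
To make the scoring concrete, here is a minimal sketch of two common benchmark metrics, precision and recall, computed from hypothetical binary predictions for a single class:

```python
def precision_recall(preds, labels):
    # Count true positives, false positives, and false negatives.
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many detections were correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many true objects were found
    return precision, recall

print(precision_recall([1, 1, 0, 1], [1, 0, 0, 1]))  # (0.666..., 1.0)
```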

Popular Datasets & Benchmarks:

| Name | Type of Task | Description | Benchmark Role |
| --- | --- | --- | --- |
| ImageNet | Classification | 14M+ labeled images across 20K categories | Image classification |
| COCO | Detection & Segmentation | Rich annotations for objects in context | Object detection, segmentation |
| Pascal VOC | Detection & Classification | Early benchmark for object recognition | Model comparison |
| Cityscapes | Semantic Segmentation | Urban scenes with pixel-level labels | Autonomous driving |
| LUNA16 | Medical Imaging | CT scans for lung nodule detection | Healthcare AI |
| OpenImages | Multi-label Classification | Large-scale dataset with bounding boxes | General-purpose vision |

These datasets and benchmarks are the backbone of modern computer vision research. They ensure fair comparisons, drive innovation, and help developers choose the right models for their applications.

Common Vision Tasks

Computer vision covers a range of core tasks that allow machines to understand and interpret visual information. Below are four of the most widely used tasks, each with distinct goals, workflows, and model choices.

Image Classification

Assigns a single label to an entire image, answering the question: "What is this image of?"
Typical Use Cases: Product categorization, species identification, defect detection in manufacturing.
Approaches:

  • Small dataset: Transfer learning from pretrained CNNs like ResNet‑50 or MobileNet, with strong augmentation (see the sketch after this list).
  • Large dataset: Train deep architectures (EfficientNet, Vision Transformers) from scratch for maximum accuracy.
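
A minimal sketch of the small-dataset route, assuming torchvision (0.13+ weights API) and a hypothetical 5-class problem; the pretrained backbone is frozen and only a new classification head is trained:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pretrained on ImageNet.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

for param in model.parameters():
    param.requires_grad = False  # freeze the pretrained backbone

# Replace the final layer with a fresh head for our (hypothetical) 5 classes;
# only this layer's weights are updated during training.
model.fc = nn.Linear(model.fc.in_features, 5)
```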

Object Detection

Identifies and localizes multiple objects in an image by returning bounding boxes with class labels.
Typical Use Cases: Autonomous driving, security surveillance, retail analytics.
Approaches:

  • Small dataset: Fine‑tune pretrained detectors such as YOLOv5 or Faster R‑CNN, combined with data augmentation (a usage sketch follows this list).
  • Large dataset: Custom‑train models like EfficientDet or Cascade R‑CNN with multi‑scale training.
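
As a starting point for the small-dataset route, here is a minimal sketch that loads a pretrained YOLOv5 model via torch.hub and runs single-pass detection; `street.jpg` is a hypothetical input, and fine-tuning would continue from these weights:

```python
import torch

# Fetch the small pretrained YOLOv5 variant from the ultralytics hub repo.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model('street.jpg')  # hypothetical input image
results.print()                # summary of detected classes and confidences
print(results.xyxy[0])         # one row per box: x1, y1, x2, y2, confidence, class
```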

Image Segmentation

Classifies each pixel in the image, producing a segmentation map that outlines objects or regions.
Typical Use Cases: Medical imaging (tumor segmentation), autonomous navigation (lane marking detection), satellite imagery analysis.
Approaches:

  • Real‑time constraint: Lightweight networks like BiSeNet or Fast‑SCNN for speed‑critical applications.
  • No real‑time constraint: Accuracy‑first architectures like DeepLabv3+ or Mask R‑CNN (sketched below).
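
A minimal sketch of the accuracy-first route, assuming torchvision's pretrained DeepLabv3 (a close relative of DeepLabv3+) with a ResNet-50 backbone; a random tensor stands in for a preprocessed image:

```python
import torch
from torchvision.models.segmentation import (
    deeplabv3_resnet50, DeepLabV3_ResNet50_Weights,
)

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()

batch = torch.randn(1, 3, 520, 520)  # stand-in for a normalized RGB image
with torch.no_grad():
    out = model(batch)["out"]        # per-pixel score for every class
seg_map = out.argmax(dim=1)          # segmentation map: one class label per pixel
print(seg_map.shape)                 # torch.Size([1, 520, 520])
```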

Optical Character Recognition (OCR)

Extracts text from images or video frames, converting it into machine‑readable form. Modern OCR often uses CNNs for feature extraction, coupled with RNNs or Transformers for sequence decoding.
Typical Use Cases: Document digitization, real‑time translation of street signs, automated invoice processing.
Approaches:

  1. Text Detection: Locate text regions with models like EAST, CRAFT, or DBNet.
  2. Text Recognition: Convert detected regions into text using CRNN, Rosetta, or TrOCR.
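
For quick experiments, an off-the-shelf engine can stand in for this two-stage pipeline. Here is a minimal sketch using pytesseract, assuming the Tesseract OCR engine is installed locally and `invoice.png` is a hypothetical input document:

```python
from PIL import Image
import pytesseract  # Python wrapper; requires the Tesseract binary on the system

# Tesseract performs detection and recognition internally; dedicated pipelines
# pair a detector (EAST, CRAFT, DBNet) with a recognizer (CRNN, TrOCR) instead.
image = Image.open("invoice.png")          # hypothetical scanned document
text = pytesseract.image_to_string(image)  # extracted machine-readable text
print(text)
```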