Computer Vision, a subfield of Artificial Intelligence, focuses on enabling machines to extract, analyze, and interpret meaningful information from visual data. With the growth of computational power, data availability, and deep learning techniques, Computer Vision has evolved from basic image processing to complex visual understanding tasks. Let's explore the technical foundations of AI in Computer Vision, including key algorithms, architectures, datasets, evaluation metrics, and common vision tasks.
Historical Context
Computer vision has evolved dramatically over the past few decades. Early systems relied on handcrafted features such as edges and corners, extracted using algorithms like SIFT and HOG. These approaches worked well for simple tasks but struggled with real-world complexity. The turning point came in 2012, when AlexNet, a deep Convolutional Neural Network (CNN), won the ImageNet challenge by a wide margin. This victory demonstrated that neural networks could outperform traditional methods in image classification, sparking a revolution in computer vision. Since then, CNNs have become the foundation of nearly every modern vision system.
Convolutional Neural Networks (CNNs)
One of the most common machine learning model architectures for computer vision is the Convolutional Neural Network (CNN), a deep learning architecture designed to process visual data. CNNs are particularly effective at tasks like image classification, object detection, and segmentation because they learn to extract meaningful patterns from pixel data.
How CNNs Work
CNNs use filters (also known as kernels) to scan across an image and extract numeric feature maps. These maps represent visual patterns such as edges, textures, and shapes. The extracted features are then passed through deeper layers of the network to generate a label prediction, essentially answering the question, “What is this an image of?” For example, in an image classification scenario, you might train a CNN model with images of different kinds of fruit, such as apples, bananas, and oranges. The model learns to associate certain visual features with each fruit type. When presented with a new image, the CNN predicts the label: “This is a banana.”
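To make this concrete, here is a minimal sketch of a single convolutional layer producing feature maps, using PyTorch (an assumed library choice, not one the article prescribes):

```python
import torch
import torch.nn as nn

# One convolutional layer: 16 learnable 3x3 filters scan a 3-channel image
# and produce one numeric feature map per filter.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 224, 224)  # stand-in for a batch of one RGB image
feature_maps = conv(image)           # shape: [1, 16, 224, 224]

print(feature_maps.shape)
```

Each of the 16 output channels is one feature map; early layers tend to respond to edges and textures, while deeper layers respond to shapes and object parts.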
Training a CNN
During the training process:
- Filter kernels are initially defined using randomly generated weight values.
- The model processes labeled images and makes predictions.
- These predictions are evaluated against known label values.
- The filter weights are adjusted to improve accuracy.
Eventually, the trained fruit-classification model settles on the filter weights that best extract the features needed to identify different kinds of fruit. In this way, CNNs learn which visual characteristics, such as shape, color, and texture, are most useful for distinguishing between classes.
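The loop below is a minimal PyTorch sketch of those four steps, assuming a tiny classifier for three fruit classes; the architecture and hyperparameters are illustrative only:

```python
import torch
import torch.nn as nn

# Step 1: filter weights are randomly initialized when the layers are built.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 3),                       # apples, bananas, oranges
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(images, labels):
    logits = model(images)                  # Step 2: make predictions
    loss = loss_fn(logits, labels)          # Step 3: compare to known labels
    optimizer.zero_grad()
    loss.backward()                         # propagate the error backward
    optimizer.step()                        # Step 4: adjust the filter weights
    return loss.item()
```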
Core Concepts & Techniques
CNNs are built on several key principles that make them effective:
- Local connectivity: Neurons connect to small regions of the input, preserving spatial relationships.
- Weight sharing: Filters are reused across the image, reducing the number of parameters.
- Activation functions: Non-linear functions like ReLU let the network model complex, non-linear patterns.
- Pooling: Downsampling reduces dimensionality and improves efficiency.
- Backpropagation: Errors are propagated backward to update weights.
- Regularization: Techniques like dropout and batch normalization prevent overfitting.
These concepts enable CNNs to learn complex patterns while remaining computationally efficient.
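A short sketch ties these pieces together, assuming 224x224 RGB inputs and 10 output classes (both arbitrary choices for illustration):

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # shared 3x3 filters, local connectivity
    nn.ReLU(),                                    # non-linear activation
    nn.MaxPool2d(2),                              # pooling: downsample by 2
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),                              # regularization
    nn.Linear(64 * 56 * 56, 10),                  # 224 -> 112 -> 56 after two pools
)
```

Backpropagation is handled automatically when a loss computed from this model's output calls `.backward()`.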
Key Architectures
Over time, CNNs have evolved into deeper and more sophisticated architectures:
Architecture | Key Features | Use Case |
---|---|---|
AlexNet | First deep CNN to win ImageNet | Image classification |
VGGNet | Uniform architecture with small filters | Transfer learning |
ResNet | Introduced residual connections | Very deep networks |
Inception | Multi-scale feature extraction | Efficient computation |
DenseNet | Dense connections between layers | Feature reuse |
Each architecture builds on the strengths of its predecessors, pushing the boundaries of accuracy and scalability.
Vision Models: CNNs and Beyond
Computer vision models come in many architectural flavors. While Convolutional Neural Networks (CNNs) have long been the backbone of visual recognition, newer models like Transformers and Graph Neural Networks are expanding the frontier. Here is an overview of both CNN-based and non-CNN-based models, organized by task and innovation.
Model Name | Type | Task Focus | Key Innovation |
---|---|---|---|
AlexNet | CNN | Image Classification | First deep CNN to win ImageNet |
VGGNet | CNN | Classification & Transfer | Uniform architecture with small filters |
ResNet | CNN | Deep Classification | Residual connections for deeper networks |
YOLO | CNN | Real-Time Detection | Single-pass object detection |
U-Net | CNN | Image Segmentation | Symmetric encoder-decoder architecture |
Vision Transformer (ViT) | Transformer | General Vision Tasks | Self-attention replaces convolutions |
Graph Neural Network (GNN) | Graph-based | Relational Vision Tasks | Models data as nodes and edges |
Capsule Network | Capsule-based | Spatial Reasoning | Preserves part-whole relationships |
Support Vector Machine (SVM) | Classical ML | Classification | Hyperplane-based decision boundaries |
Random Forest | Classical ML | Classification & Regression | Ensemble of decision trees |
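Many of the deep architectures above are available with pretrained weights; for example, torchvision (an assumed library choice) exposes several in a line each:

```python
import torchvision.models as models

# Pretrained ImageNet weights are downloaded on first use.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)  # Vision Transformer
```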
Datasets & Benchmarks
To train and evaluate computer vision models effectively, researchers rely on datasets and benchmarks, two foundational tools that shape the development of intelligent visual systems.
Datasets are curated collections of labeled images (or other visual data) used to teach models how to recognize patterns.
- They serve as the training material for machine learning algorithms.
- Labels indicate what each image contains, such as objects, categories, or pixel-level annotations.
- A diverse and well-labeled dataset helps models generalize to real-world scenarios.
Think of a dataset as the curriculum for a vision model: it is how the model learns.
Benchmarks are standardized tests that evaluate and compare model performance.
- They use fixed datasets and scoring metrics (e.g., accuracy, precision, recall).
- Researchers submit models to benchmark challenges to see how well they perform.
- Benchmarks identify state-of-the-art models and track progress across the field.
Benchmarks are like the final exam: everyone takes the same test, and the scores reveal which model performs best.
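The scoring side is straightforward in code; here is a toy sketch of the three metrics named above using scikit-learn (an assumed library choice) on made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 1, 2, 2, 2]   # ground-truth class labels
y_pred = [0, 1, 2, 2, 2, 1]   # a model's predictions

print(accuracy_score(y_true, y_pred))                    # fraction correct
print(precision_score(y_true, y_pred, average="macro"))  # per-class, then averaged
print(recall_score(y_true, y_pred, average="macro"))
```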
Popular Datasets & Benchmarks:
Name | Type of Task | Description | Benchmark Role |
---|---|---|---|
ImageNet | Classification | 14M+ labeled images across 20K categories | Image classification |
COCO | Detection & Segmentation | Rich annotations for objects in context | Object detection, segmentation |
Pascal VOC | Detection & Classification | Early benchmark for object recognition | Model comparison |
Cityscapes | Semantic Segmentation | Urban scenes with pixel-level labels | Autonomous driving |
LUNA16 | Medical Imaging | CT scans for lung nodule detection | Healthcare AI |
OpenImages | Multi-label Classification | Large-scale dataset with bounding boxes | General-purpose vision |
These datasets and benchmarks are the backbone of modern computer vision research. They ensure fair comparisons, drive innovation, and help developers choose the right models for their applications.
Common Vision Tasks
Computer vision covers a range of core tasks that allow machines to understand and interpret visual information. Below are four of the most widely used tasks, each with distinct goals, workflows, and model choices.
Image Classification
Assigns a single label to an entire image, answering the question: "What is this image of?"
Typical Use Cases: Product categorisation, species identification, defect detection in manufacturing.
Approaches:
- Small dataset: Transfer learning from pretrained CNNs like ResNet‑50 or MobileNet, with strong augmentation (see the sketch after this list).
- Large dataset: Train deep architectures (EfficientNet, Vision Transformers) from scratch for maximum accuracy.
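As a sketch of the small-dataset route, here is transfer learning with a pretrained ResNet-50 in torchvision (the library and the class count are assumptions):

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 5  # hypothetical number of categories

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                          # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new trainable classifier head
```

Only the new head is trained at first; unfreezing deeper layers later, at a low learning rate, is a common refinement.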
Object Detection
Identifies and localises multiple objects in an image by returning bounding boxes with class labels.
Typical Use Cases: Autonomous driving, security surveillance, retail analytics.
Approaches:
- Small dataset: Fine‑tune pretrained detectors such as YOLOv5 or Faster R‑CNN, combined with data augmentation (sketched after this list).
- Large dataset: Custom‑train models like EfficientDet or Cascade R‑CNN with multi‑scale training.
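For the pretrained-detector route, YOLOv5 can be run through torch.hub (assuming the ultralytics/yolov5 repository is reachable; the image path is hypothetical):

```python
import torch

# Load the small pretrained YOLOv5 variant from the hub.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

results = model("street_scene.jpg")  # hypothetical input image
results.print()                      # bounding boxes, class labels, confidences
```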
Image Segmentation
Classifies each pixel in the image, producing a segmentation map that outlines objects or regions.
Typical Use Cases: Medical imaging (tumour segmentation), autonomous navigation (lane marking detection), satellite imagery analysis.
Approaches:
- Real‑time constraint: Lightweight networks like BiSeNet or Fast‑SCNN for speed‑critical applications.
- No real‑time constraint: Accuracy‑first architectures like DeepLabv3+ or Mask R‑CNN.
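A minimal accuracy-first sketch with a pretrained DeepLabv3 from torchvision (the library and the dummy input are assumptions):

```python
import torch
import torchvision.models.segmentation as seg

model = seg.deeplabv3_resnet50(weights=seg.DeepLabV3_ResNet50_Weights.DEFAULT)
model.eval()

image = torch.randn(1, 3, 520, 520)   # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(image)["out"]      # [1, num_classes, H, W]
mask = logits.argmax(dim=1)           # per-pixel class map
```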
Optical Character Recognition (OCR)
Extracts text from images or video frames, converting it into machine‑readable form. Modern OCR often uses CNNs for feature extraction, coupled with RNNs or Transformers for sequence decoding.
Typical Use Cases: Document digitisation, real‑time translation of street signs, automated invoice processing.
Approaches:
- Text Detection: Locate text regions with models like EAST, CRAFT, or DBNet.
- Text Recognition: Convert detected regions into text using CRNN, Rosetta, or TrOCR.
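As a sketch of the recognition stage, TrOCR is available through Hugging Face transformers (an assumed distribution channel; the input file is hypothetical and should be a pre-cropped text region):

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

region = Image.open("text_region.png").convert("RGB")  # hypothetical cropped region
pixel_values = processor(images=region, return_tensors="pt").pixel_values
ids = model.generate(pixel_values)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```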