Computer Vision
Also known as: CV
The field of teaching computers to interpret images and video — classification, detection, segmentation, generation.
- Primary domain: Artificial Intelligence
- Sub-category: Computer Vision
In simple terms
Computer vision is the field of getting computers to interpret images and video: figuring out what’s in a picture, where objects are, what’s happening, and sometimes creating new images. Modern computer vision is overwhelmingly deep learning: convolutional networks dominated from 2012 through the late 2010s, and transformer-based vision models have taken over most large-scale work since 2020.
More detail
The core tasks:
- Image classification — “what is this a picture of?” One label per image.
- Object detection — “what is in this image and where?” Bounding boxes around objects.
- Semantic segmentation — pixel-level labels: every pixel gets a class.
- Instance segmentation — like semantic but distinguishes individual objects of the same class.
- Pose estimation — locate body joints / facial landmarks.
- OCR (Optical Character Recognition) — extract text from images.
- Image generation — produce new images from prompts (Stable Diffusion, Midjourney, DALL-E, Imagen).
- Video analysis — action recognition, tracking, motion estimation.
- 3D reconstruction / NeRF / Gaussian Splatting — build 3D scenes from 2D images.
- Optical flow — estimate per-pixel motion between frames.
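The difference between semantic and instance segmentation is easiest to see on a toy mask. The sketch below is an illustration only, not how production instance-segmentation models work: it assumes a hand-written 3x6 mask with a single hypothetical class (1 = "cat") and separates instances by finding connected blobs, so touching objects of the same class would wrongly be merged. The names `SEMANTIC_MASK` and `instances_from_semantic` are invented for this example.

```python
from collections import deque

# Toy semantic mask: every pixel holds a class id (0 = background, 1 = "cat").
# Both cats carry the same label, so the mask alone cannot tell them apart.
SEMANTIC_MASK = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]


def instances_from_semantic(mask, background=0):
    """Give each 4-connected blob of non-background pixels its own id.

    Returns (instance_mask, instance_classes), where instance_classes maps
    instance id -> class id. Illustrative only: touching objects of the same
    class end up in one blob.
    """
    height, width = len(mask), len(mask[0])
    instance_mask = [[0] * width for _ in range(height)]
    instance_classes = {}
    next_id = 1
    for row in range(height):
        for col in range(width):
            cls = mask[row][col]
            if cls == background or instance_mask[row][col] != 0:
                continue
            # Flood-fill this blob with a fresh instance id.
            instance_classes[next_id] = cls
            instance_mask[row][col] = next_id
            queue = deque([(row, col)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] == cls
                            and instance_mask[nr][nc] == 0):
                        instance_mask[nr][nc] = next_id
                        queue.append((nr, nc))
            next_id += 1
    return instance_mask, instance_classes


INSTANCE_MASK, INSTANCE_CLASSES = instances_from_semantic(SEMANTIC_MASK)
for line in INSTANCE_MASK:
    print(line)
```

In the semantic mask both cats are just "class 1"; in the instance mask the left cat is id 1 and the right cat is id 2, while `INSTANCE_CLASSES` records that both ids belong to the same class.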
Architectural eras:
- Pre-deep-learning (before ~2012): hand-crafted features (SIFT, HOG) fed into classical classifiers such as SVMs.
- CNNs (AlexNet 2012 → VGG → ResNet → EfficientNet): the workhorse for a decade.
- Vision Transformers (ViT), 2020+: apply transformers to images by treating image patches as tokens. Now dominant for most large-scale tasks.
- CLIP (2021): joint image-text training that learns transferable embeddings — opened the door to “describe what you want, get the picture / match the picture”.
- Diffusion models (2022+): iteratively denoise random noise into a target image. Basis of Stable Diffusion, Midjourney, DALL-E 3.
- Multimodal LLMs (2023+): GPT-4V, Claude 3, Gemini, Llama 4 — text+image input, fluent reasoning about images.
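"Treating image patches as tokens" can be sketched in a few lines. The example below is a minimal illustration, assuming a single-channel image stored as a list of rows; the function name `patchify` and the 4x4 toy image are invented here. A real ViT would go on to project each flattened patch through a learned linear layer and add position embeddings, which this sketch leaves out.

```python
def patchify(image, patch_size):
    """Split a 2D grid of pixels into flattened, non-overlapping patches.

    image: list of rows (height x width), each pixel a number.
    Returns a list of tokens in row-major patch order; each token is a flat
    list of patch_size * patch_size pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = [
                image[top + dr][left + dc]
                for dr in range(patch_size)
                for dc in range(patch_size)
            ]
            tokens.append(token)
    return tokens


# A 4x4 single-channel "image" whose pixel values are 0..15.
IMAGE = [[row * 4 + col for col in range(4)] for row in range(4)]
TOKENS = patchify(IMAGE, patch_size=2)
print(len(TOKENS), "tokens of length", len(TOKENS[0]))
```

The 4x4 image becomes a sequence of four tokens, which is the shape of input a transformer expects. At the scale of the original ViT-Base configuration, a 224x224 image cut into 16x16 patches yields 14 x 14 = 196 tokens.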
Datasets that drove the field:
- ImageNet (2009) — 14M labelled images; the benchmark that triggered the deep learning revolution.
- COCO — common objects in context, with detection and segmentation labels.
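- LAION-5B (2022): about 5.85 billion image-text pairs scraped from the web and filtered using CLIP; used to train OpenCLIP, Stable Diffusion, and many other open multimodal models. OpenAI’s original CLIP was trained on a separate, private set of 400 million pairs.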
The hardware story: vision workloads are dominated by matrix multiplication, which is exactly what GPUs are built to do fast. Vision Transformers use the same matmul-heavy patterns as LLMs, so the same GPUs serve both.
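To see why convolutional vision models also reduce to matrix multiplication, here is a minimal sketch of one common lowering, usually called im2col: every sliding window is unrolled into a row, and the convolution becomes a product of that matrix with the flattened kernel. The function names, the toy image, and the kernel are all invented for this example, and it assumes stride 1, no padding, and a single channel. Pure-Python loops are used for clarity, not speed.

```python
def conv2d_direct(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation with nested loops.

    This is the operation deep learning libraries call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def im2col(image, kh, kw):
    """Unroll every kh x kw window into one row of a matrix."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(out_h)
        for c in range(out_w)
    ]


def conv2d_as_matmul(image, kernel):
    """Same result as conv2d_direct, computed as a matrix-vector product."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    windows = im2col(image, kh, kw)               # (out_h * out_w) x (kh * kw)
    weights = [w for row in kernel for w in row]  # flattened kernel, kh * kw
    flat = [sum(x * w for x, w in zip(window, weights)) for window in windows]
    # Reshape the flat result back into out_h rows of out_w values.
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


IMAGE = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
]
KERNEL = [
    [1, 0],
    [0, -1],
]
DIRECT = conv2d_direct(IMAGE, KERNEL)
VIA_MATMUL = conv2d_as_matmul(IMAGE, KERNEL)
print(DIRECT == VIA_MATMUL)
```

With many output channels the flattened kernel becomes a matrix with one column per channel, so the whole layer is a single large matrix multiplication, the workload GPUs handle best.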
Why it matters
Computer vision is in your pocket — every smartphone photo passes through neural-net-driven enhancement, autofocus, scene detection, and increasingly background removal and generative fill. It’s in self-driving cars, medical imaging, satellite analysis, manufacturing QA, content moderation, security cameras, and most modern user-facing AI products.
Real-world examples
- iPhone photography runs many neural networks per shot — semantic segmentation for portrait mode, denoising for low light, scene detection for exposure.
- Self-driving stacks classify and track every object on the road in real time: Waymo fuses camera, radar, and lidar, while Tesla relies primarily on cameras.
- AlphaFold 2 (recognised by the 2024 Nobel Prize in Chemistry) does protein structure prediction rather than vision, but it relies on attention-based architectures closely related to those used in modern vision models.
- Midjourney v6 and Stable Diffusion 3 generate photorealistic images from short prompts — the artistic side of computer vision.
- Deep CV models for MRI analysis match or exceed radiologists on specific, narrowly defined tasks and, as of 2026, are deployed in many clinics.
Common misconceptions
- “Computer vision is solved.” Image classification on standard benchmarks is close to saturated, but real-world robustness, 3D understanding, video reasoning, and out-of-distribution generalisation are all still active research.
- “Vision Transformers replaced CNNs.” ViTs dominate at large scale; CNNs are still often the better choice for edge and low-resource deployments because they’re parameter-efficient.
Learn next
The dominant model family today: transformer. The representation step at the heart of CLIP-style models: embedding.