
Pub date: September 1, 2025
ISBN: 9787302696834
Rights: All Rights Available
360 pp.
This book provides a comprehensive introduction to the fundamental concepts and methodological framework of computer vision. It offers an in-depth exploration of image representation learning, covering image representation modeling, non-parametric image understanding, and deep learning-based approaches. The content systematically presents mainstream computer vision research areas from a deep learning perspective, including 2D vision, 3D vision, and video understanding.
The book is structured to guide readers through a progressive learning path: Chapters 2 and 3 focus on low-level vision tasks (such as super-resolution and denoising); Chapters 4 through 6 address mid-level vision tasks (such as feature matching, 3D reconstruction, and depth estimation); and Chapters 7 through 11 cover high-level image understanding tasks (such as classification, semantic segmentation, object detection, and metric learning) along with classical theoretical methods. Furthermore, Chapters 12 and 13 delve into cutting-edge advancements in the field, including diffusion models, model compression and acceleration, large-scale vision models, and vision-language models.
Through this book, readers will gain a deep understanding of the principles and applications of computer vision, master the relevant knowledge and skills, and develop the ability to conduct related research and practical applications.
This book is well-suited as a textbook for computer vision courses in university departments such as Computer Science, Automation, and Electronic Information Engineering. It also serves as a valuable self-study reference for developers and practitioners working in computer vision technology.
Chapter 1: Introduction
1.1 Basic Concepts
1.2 Historical Development
1.3 Application Examples
1.4 Book Overview
Chapter 2: Visual Information Acquisition
2.1 Projective Geometry and Transformations
2.1.1 2D Projective Geometry and Transformations
2.1.2 3D Projective Geometry and Transformations
2.2 Camera Models
2.2.1 Finite Cameras
2.2.2 General Projective Cameras
2.2.3 Cameras at Infinity
2.3 Photometric Image Formation
2.3.1 Illumination
2.3.2 Reflection and Shading
2.3.3 Optics
2.4 Depth Image and Point Cloud Acquisition
2.4.1 Depth Cameras
2.4.2 LiDAR
Review Questions
Chapter 3: Image and Video Processing
3.1 Local Image Processing Operators
3.1.1 Morphological Transformations
3.1.2 Linear Filters
3.1.3 Image Sharpening and Blurring
3.1.4 Distance Transform
3.2 Global Image Processing Operators
3.2.1 Fourier Transform
3.2.2 Image Interpolation Methods
3.2.3 Image Geometric Transformations
3.2.4 Image Color Space Transformations
3.2.5 Histogram Equalization
3.3 Typical Image Augmentation Methods
3.3.1 Random Affine Transformations
3.3.2 Random Brightness/Hue/Saturation Transformations
3.3.3 Random Cropping and Pasting
Review Questions
Chapter 4: Visual Feature Extraction
4.1 Point and Patch Feature Extraction
4.1.1 Auto-correlation Function
4.1.2 Adaptive Non-Maximal Suppression
4.1.3 Scale-Invariant Feature Transform (SIFT)
4.1.4 Gradient Location and Orientation Histogram (GLOH)
4.1.5 Local Binary Descriptors
4.2 Edge Feature Extraction
4.2.1 Edge Detection Operators
4.2.2 Edge Linking
4.2.3 Application Example: Lane Detection
4.3 Shape Feature Extraction
4.3.1 Hough Transform
4.3.2 Vanishing Point Detection
4.3.3 Circle Detection
Review Questions
Chapter 5: Visual Feature Registration
5.1 Traditional Feature Matching
5.1.1 Matching Distance
5.1.2 Matching Strategy
5.1.3 Matching Evaluation Criteria
5.2 Traditional Geometric Registration
5.2.1 Traditional Geometric Registration Methods
5.2.2 Optimization Methods for Geometric Registration
5.2.3 Evaluation Metrics for Geometric Registration
5.2.4 Improvement Approaches for Geometric Registration
5.2.5 Application Example: Occluded Face Recognition
5.3 Deep Learning-based Registration
5.3.1 Image-based Deep Registration
5.3.2 Point Cloud-based Deep Registration
5.3.3 Multi-modal Deep Registration
5.4 Optical Flow Estimation
5.4.1 Lucas-Kanade Optical Flow
5.4.2 Cost Volume
5.4.3 Deep Learning-based Optical Flow Estimation
5.5 Scene Flow Estimation
5.5.1 Image-based Scene Flow Estimation
5.5.2 Point Cloud-based Scene Flow Estimation
Review Questions
Chapter 6: 3D Scene Reconstruction
6.1 Structure from Motion
6.1.1 Epipolar Geometry Constraints
6.1.2 Essential Matrix Estimation and Decomposition
6.1.3 Triangulation
6.1.4 The PnP Problem and Solution Methods
6.2 Dense Scene Reconstruction
6.2.1 Traditional Geometry-based Dense Reconstruction
6.2.2 Deep Learning-based Dense Reconstruction
6.3 Depth Estimation
6.3.1 Monocular Depth Estimation
6.3.2 Multi-view Depth Estimation
6.3.3 Surround-view Depth Estimation
6.4 Neural Radiance Fields (NeRF)
6.4.1 Starting from the Concept of a "Field"
6.4.2 Fundamental Concepts in Graphics
6.4.3 Neural Radiance Fields and Novel View Synthesis
6.4.4 Neural Radiance Fields and Surface Reconstruction
6.4.5 Other Advances in Neural Radiance Fields
Review Questions
Chapter 7: Visual Object Recognition
7.1 Overview of Visual Recognition
7.1.1 Problem Definition
7.1.2 System Overview
7.2 Traditional Recognition Methods
7.2.1 Feature Extraction Methods
7.2.2 Typical Classification Methods
7.3 Object Recognition based on Convolutional Neural Networks (CNNs)
7.3.1 Convolutional Neural Networks
7.3.2 Evolution of CNNs
7.3.3 Residual Networks (ResNet)
7.3.4 Realities and Challenges of Deep Neural Networks
7.4 Object Recognition based on Self-Attention Networks
7.4.1 Self-Attention Mechanism
7.4.2 Visual Self-Attention Networks
7.5 Object Recognition based on Vision-Language Models
Review Questions
Chapter 8: Visual Semantic Segmentation
8.1 Introduction to Visual Segmentation Tasks
8.1.1 Semantic Segmentation
8.1.2 Instance Segmentation
8.1.3 Panoptic Segmentation
8.2 Traditional Segmentation Models
8.2.1 Threshold-based Methods
8.2.2 Region-based Segmentation Methods
8.2.3 Edge-based Segmentation Methods
8.2.4 Clustering-based Segmentation Methods
8.3 Multi-modal Segmentation Models: 2D Image Segmentation
8.3.1 Development Overview
8.3.2 Classical Methods: FCN and U-Net
8.3.3 Innovations and Developments: Network Architectures and Algorithmic Techniques
8.4 Multi-modal Segmentation Models: Temporal Video Segmentation
8.4.1 Key Scientific Issues
8.4.2 Classical Models and Methods
8.5 Multi-modal Segmentation Models: 3D Point Cloud Segmentation
8.5.1 Key Scientific Issues
8.5.2 Classical Models and Methods
8.6 Training Methods: Semi-Supervised Segmentation
8.6.1 Semi-Supervised Training Paradigms
8.6.2 Semi-Supervised Segmentation Models
8.7 Training Methods: Unsupervised Segmentation
8.7.1 Fully Unsupervised Methods
8.7.2 Unsupervised Domain Adaptation Methods
8.8 Frontier Research: Cross-Modal Segmentation
8.8.1 Referring Image Segmentation
8.8.2 Open-Vocabulary Segmentation
Review Questions
Chapter 9: Visual Object Detection
9.1 Object Detection Task Definition
9.2 Introduction to Traditional Visual Detection Methods
9.2.1 Viola-Jones Detector
9.2.2 Histogram of Oriented Gradients (HOG) Detector
9.2.3 Deformable Part-based Model (DPM)
9.2.4 Hough Voting-based Detection Models
9.3 Deep Learning-based Detection Methods
9.3.1 2D Image Detection based on Deep Learning
9.3.2 3D Point Cloud Detection based on Deep Learning
9.3.3 Surround-view Image Detection based on Deep Learning
9.4 Multi-modal Fusion Methods
9.4.1 Projection-based Multi-modal Alignment
9.4.2 Attention-based Multi-modal Alignment
9.5 Other Emerging Topics in Object Detection
9.5.1 Oriented Object Detection
9.5.2 Salient Object Detection
Review Questions
Chapter 10: Visual Object Tracking
10.1 Introduction to Visual Object Tracking
10.1.1 Research Content
10.1.2 Research Categories
10.2 Classical Visual Tracking Methods
10.2.1 Target Representation Methods
10.2.2 Deterministic Tracking Methods
10.2.3 Stochastic Tracking Methods
10.3 Deep Learning-based Visual Tracking Methods
10.3.1 Deep Correlation Filter Tracking
10.3.2 Classification-based Deep Trackers
10.3.3 Siamese Network Tracking Algorithms
10.3.4 Gradient Optimization-based Deep Tracking Methods
10.3.5 Transformer-based Deep Tracking Methods
10.4 Common Datasets for Visual Object Tracking
10.4.1 OTB-100
10.4.2 VOT
10.4.3 NFS
10.4.4 LaSOT
Review Questions
Chapter 11: Visual Content Retrieval
11.1 Introduction to Visual Content Retrieval
11.2 Traditional Visual Content Retrieval Methods
11.2.1 Background and Significance
11.2.2 Text-based Image Retrieval (TBIR)
11.2.3 Content-based Image Retrieval (CBIR)
11.2.4 CBIR Techniques
11.2.5 Approximate Nearest Neighbor (ANN) Search Methods
11.2.6 Summary
11.3 Deep Learning-based Visual Content Retrieval Methods
11.3.1 Deep Metric Learning-based Image Retrieval
11.3.2 Deep Hashing-based Image Retrieval
11.3.3 Summary
Review Questions
Chapter 12: Visual Content Generation
12.1 Generative Adversarial Networks (GANs)
12.1.1 Basic Principles
12.1.2 Variant Models
12.2 Variational Autoencoders (VAEs)
12.2.1 Basic Principles
12.2.2 Variant Models
12.3 Autoregressive Models
12.3.1 Basic Principles
12.3.2 Variant Models
12.4 Flow-based Generative Models
12.4.1 Normalizing Flows
12.4.2 Real NVP (Real-valued Non-Volume Preserving)
12.4.3 Glow (Generative Flow)
12.5 Denoising Diffusion Probabilistic Models
12.6 Multi-modal Generative Models
Review Questions
Chapter 13: Frontiers in Computer Vision
13.1 Foundation Model Background
13.1.1 Development Trends in Deep Learning Models
13.1.2 Advantages of Foundation Models
13.2 Representative Foundation Models
13.2.1 Large Language Models (LLMs)
13.2.2 Large Vision Models (LVMs)
13.2.3 Multimodal Large Models
13.3 Foundation Model Pre-training
13.3.1 Bidirectional Encoder Representation from Image Transformers (BEiT)
13.3.2 Masked Autoencoders (MAE)
13.4 Applications of Foundation Models
13.4.1 Embodied AI
13.4.2 Digital Humans
13.4.3 Autonomous Driving
Review Questions
References