
Pub date: September 1, 2025
ISBN: 9787302696834
Rights: All Rights Available
360 pp.
This book provides a comprehensive introduction to the fundamental concepts and methodological framework of computer vision. It offers an in-depth exploration of image representation learning, covering image representation modeling, non-parametric image understanding, and deep learning-based approaches. The content systematically presents mainstream computer vision research areas from a deep learning perspective, including 2D vision, 3D vision, and video understanding.
The book is structured to guide readers through a progressive learning path: Chapters 2 and 3 focus on low-level vision tasks (such as super-resolution and denoising); Chapters 4 through 6 address mid-level vision tasks (such as feature matching, 3D reconstruction, and depth estimation); and Chapters 7 through 11 cover high-level image understanding tasks (such as classification, semantic segmentation, object detection, and metric learning) along with classical theoretical methods. Furthermore, Chapters 12 and 13 delve into cutting-edge advancements in the field, including diffusion models, model compression and acceleration, large-scale vision models, and vision-language models.
Through this book, readers will gain a deep understanding of the principles and applications of computer vision, master the relevant knowledge and skills, and develop the ability to conduct related research and practical applications.
This book is well-suited as a textbook for computer vision courses in university departments such as Computer Science, Automation, and Electronic Information Engineering. It also serves as a valuable self-study reference for developers and practitioners working in computer vision technology.
Chapter 1: Introduction
1.1 Basic Concepts
1.2 Historical Development
1.3 Application Examples
1.4 Book Overview
Chapter 2: Visual Information Acquisition
2.1 Projective Geometry and Transformations
2.1.1 2D Projective Geometry and Transformations
2.1.2 3D Projective Geometry and Transformations
2.2 Camera Models
2.2.1 Finite Cameras
2.2.2 General Projective Cameras
2.2.3 Cameras at Infinity
2.3 Photometric Image Formation
2.3.1 Illumination
2.3.2 Reflection and Shading
2.3.3 Optics
2.4 Depth Image and Point Cloud Acquisition
2.4.1 Depth Cameras
2.4.2 LiDAR
Review Questions
Chapter 3: Image and Video Processing
3.1 Local Image Processing Operators
3.1.1 Morphological Transformations
3.1.2 Linear Filters
3.1.3 Image Sharpening and Blurring
3.1.4 Distance Transform
3.2 Global Image Processing Operators
3.2.1 Fourier Transform
3.2.2 Image Interpolation Methods
3.2.3 Image Geometric Transformations
3.2.4 Image Color Space Transformations
3.2.5 Histogram Equalization
3.3 Typical Image Augmentation Methods
3.3.1 Random Affine Transformations
3.3.2 Random Brightness/Hue/Saturation Transformations
3.3.3 Random Cropping and Pasting
Review Questions
Chapter 4: Visual Feature Extraction
4.1 Point and Patch Feature Extraction
4.1.1 Auto-correlation Function
4.1.2 Adaptive Non-Maximal Suppression
4.1.3 Scale-Invariant Feature Transform (SIFT)
4.1.4 Gradient Location and Orientation Histogram (GLOH)
4.1.5 Local Binary Descriptors
4.2 Edge Feature Extraction
4.2.1 Edge Detection Operators
4.2.2 Edge Linking
4.2.3 Application Example: Lane Detection
4.3 Shape Feature Extraction
4.3.1 Hough Transform
4.3.2 Vanishing Point Detection
4.3.3 Circle Detection
Review Questions
Chapter 5: Visual Feature Registration
5.1 Traditional Feature Matching
5.1.1 Matching Distance
5.1.2 Matching Strategy
5.1.3 Matching Evaluation Criteria
5.2 Traditional Geometric Registration
5.2.1 Traditional Geometric Registration Methods
5.2.2 Optimization Methods for Geometric Registration
5.2.3 Evaluation Metrics for Geometric Registration
5.2.4 Improvement Approaches for Geometric Registration
5.2.5 Application Example: Occluded Face Recognition
5.3 Deep Learning-based Registration
5.3.1 Image-based Deep Registration
5.3.2 Point Cloud-based Deep Registration
5.3.3 Multi-modal Deep Registration
5.4 Optical Flow Estimation
5.4.1 Lucas-Kanade Optical Flow
5.4.2 Cost Volume
5.4.3 Deep Learning-based Optical Flow Estimation
5.5 Scene Flow Estimation
5.5.1 Image-based Scene Flow Estimation
5.5.2 Point Cloud-based Scene Flow Estimation
Review Questions
Chapter 6: 3D Scene Reconstruction
6.1 Structure from Motion
6.1.1 Epipolar Geometry Constraints
6.1.2 Essential Matrix Estimation and Decomposition
6.1.3 Triangulation
6.1.4 The PnP Problem and Solution Methods
6.2 Dense Scene Reconstruction
6.2.1 Traditional Geometry-based Dense Reconstruction
6.2.2 Deep Learning-based Dense Reconstruction
6.3 Depth Estimation
6.3.1 Monocular Depth Estimation
6.3.2 Multi-view Depth Estimation
6.3.3 Surround-view Depth Estimation
6.4 Neural Radiance Fields (NeRF)
6.4.1 Starting from the Concept of a "Field"
6.4.2 Fundamental Concepts in Graphics
6.4.3 Neural Radiance Fields and Novel View Synthesis
6.4.4 Neural Radiance Fields and Surface Reconstruction
6.4.5 Other Advances in Neural Radiance Fields
Review Questions
Chapter 7: Visual Object Recognition
7.1 Overview of Visual Recognition
7.1.1 Problem Definition
7.1.2 System Overview
7.2 Traditional Recognition Methods
7.2.1 Feature Extraction Methods
7.2.2 Typical Classification Methods
7.3 Object Recognition based on Convolutional Neural Networks (CNNs)
7.3.1 Convolutional Neural Networks
7.3.2 Evolution of CNNs
7.3.3 Residual Networks (ResNet)
7.3.4 Realities and Challenges of Deep Neural Networks
7.4 Object Recognition based on Self-Attention Networks
7.4.1 Self-Attention Mechanism
7.4.2 Visual Self-Attention Networks
7.5 Object Recognition based on Vision-Language Models
Review Questions
Chapter 8: Visual Semantic Segmentation
8.1 Introduction to Visual Segmentation Tasks
8.1.1 Semantic Segmentation
8.1.2 Instance Segmentation
8.1.3 Panoptic Segmentation
8.2 Traditional Segmentation Models
8.2.1 Threshold-based Methods
8.2.2 Region-based Segmentation Methods
8.2.3 Edge-based Segmentation Methods
8.2.4 Clustering-based Segmentation Methods
8.3 Multi-modal Segmentation Models: 2D Image Segmentation
8.3.1 Development Overview
8.3.2 Classical Methods: FCN and U-Net
8.3.3 Innovations and Developments: Network Architectures and Algorithmic Techniques
8.4 Multi-modal Segmentation Models: Temporal Video Segmentation
8.4.1 Key Scientific Issues
8.4.2 Classical Models and Methods
8.5 Multi-modal Segmentation Models: 3D Point Cloud Segmentation
8.5.1 Key Scientific Issues
8.5.2 Classical Models and Methods
8.6 Training Methods: Semi-Supervised Segmentation
8.6.1 Semi-Supervised Training Paradigms
8.6.2 Semi-Supervised Segmentation Models
8.7 Training Methods: Unsupervised Segmentation
8.7.1 Fully Unsupervised Methods
8.7.2 Unsupervised Domain Adaptation Methods
8.8 Frontier Research: Cross-Modal Segmentation
8.8.1 Referring Image Segmentation
8.8.2 Open-Vocabulary Segmentation
Review Questions
Chapter 9: Visual Object Detection
9.1 Object Detection Task Definition
9.2 Introduction to Traditional Visual Detection Methods
9.2.1 Viola-Jones Detector
9.2.2 Histogram of Oriented Gradients (HOG) Detector
9.2.3 Deformable Part-based Model (DPM)
9.2.4 Hough Voting-based Detection Models
9.3 Deep Learning-based Detection Methods
9.3.1 2D Image Detection based on Deep Learning
9.3.2 3D Point Cloud Detection based on Deep Learning
9.3.3 Surround-view Image Detection based on Deep Learning
9.4 Multi-modal Fusion Methods
9.4.1 Projection-based Multi-modal Alignment
9.4.2 Attention-based Multi-modal Alignment
9.5 Other Emerging Topics in Object Detection
9.5.1 Oriented Object Detection
9.5.2 Salient Object Detection
Review Questions
Chapter 10: Visual Object Tracking
10.1 Introduction to Visual Object Tracking
10.1.1 Research Content
10.1.2 Research Categories
10.2 Classical Visual Tracking Methods
10.2.1 Target Representation Methods
10.2.2 Deterministic Tracking Methods
10.2.3 Stochastic Tracking Methods
10.3 Deep Learning-based Visual Tracking Methods
10.3.1 Deep Correlation Filter Tracking
10.3.2 Classification-based Deep Trackers
10.3.3 Siamese Network Tracking Algorithms
10.3.4 Gradient Optimization-based Deep Tracking Methods
10.3.5 Transformer-based Deep Tracking Methods
10.4 Common Datasets for Visual Object Tracking
10.4.1 OTB-100
10.4.2 VOT
10.4.3 NFS
10.4.4 LaSOT
Review Questions
Chapter 11: Visual Content Retrieval
11.1 Introduction to Visual Content Retrieval
11.2 Traditional Visual Content Retrieval Methods
11.2.1 Background and Significance
11.2.2 Text-based Image Retrieval (TBIR)
11.2.3 Content-based Image Retrieval (CBIR)
11.2.4 CBIR Techniques
11.2.5 Approximate Nearest Neighbor (ANN) Search Methods
11.2.6 Summary
11.3 Deep Learning-based Visual Content Retrieval Methods
11.3.1 Deep Metric Learning-based Image Retrieval
11.3.2 Deep Hashing-based Image Retrieval
11.3.3 Summary
Review Questions
Chapter 12: Visual Content Generation
12.1 Generative Adversarial Networks (GANs)
12.1.1 Basic Principles
12.1.2 Variant Models
12.2 Variational Autoencoders (VAEs)
12.2.1 Basic Principles
12.2.2 Variant Models
12.3 Autoregressive Models
12.3.1 Basic Principles
12.3.2 Variant Models
12.4 Flow-based Generative Models
12.4.1 Normalizing Flows
12.4.2 Real NVP (Real-valued Non-Volume Preserving)
12.4.3 Glow (Generative Flow)
12.5 Denoising Diffusion Probabilistic Models
12.6 Multi-modal Generative Models
Review Questions
Chapter 13: Frontiers in Computer Vision
13.1 Foundation Model Background
13.1.1 Development Trends in Deep Learning Models
13.1.2 Advantages of Foundation Models
13.2 Representative Foundation Models
13.2.1 Large Language Models (LLMs)
13.2.2 Large Vision Models (LVMs)
13.2.3 Multimodal Large Models
13.3 Foundation Model Pre-training
13.3.1 Bidirectional Encoder Representation from Image Transformers (BEiT)
13.3.2 Masked Autoencoders (MAE)
13.4 Applications of Foundation Models
13.4.1 Embodied AI
13.4.2 Digital Humans
13.4.3 Autonomous Driving
Review Questions
References