Deep Learning

Website

Acknowledgments

Notation

This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, we describe most of these ideas in chapters 2–4.

Numbers and Arrays

$a$  A scalar (integer or real)
$\vec{a}$  A vector
$\boldsymbol{A}$  A matrix
$\mathsf{A}$  A tensor
$\boldsymbol{I}_n$  Identity matrix with $n$ rows and $n$ columns
$\boldsymbol{I}$  Identity matrix with dimensionality implied by context
$\vec{e}^{(i)}$  Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
$\operatorname{diag}(\vec{a})$  A square, diagonal matrix with diagonal entries given by $\vec{a}$
$\mathrm{a}$  A scalar random variable
$\mathbf{a}$  A vector-valued random variable
$\mathbf{A}$  A matrix-valued random variable
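To make these objects concrete, here is a minimal NumPy sketch; the choice of Python/NumPy is ours for illustration, since the book's notation is library-agnostic.

```python
import numpy as np

# A scalar, a vector, a matrix, and a 3-D tensor as NumPy arrays.
a = 2.0                          # scalar a
v = np.array([1.0, 2.0, 3.0])    # vector
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # matrix
T = np.zeros((2, 3, 4))          # 3-D tensor

I3 = np.eye(3)                   # identity matrix with 3 rows and 3 columns
e1 = np.eye(3)[:, 0]             # standard basis vector e^(1): [1, 0, 0]
D = np.diag(v)                   # square diagonal matrix with diagonal entries v

print(I3 @ v)                    # multiplying by the identity leaves v unchanged
print(D @ np.ones(3))            # diag(v) times the all-ones vector recovers v
```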

Sets and Graphs

$\mathbb{A}$  A set
$\mathbb{R}$  The set of real numbers
$\{0, 1\}$  The set containing 0 and 1
$\{0, 1, \dots, n\}$  The set of all integers between $0$ and $n$
$[a, b]$  The real interval including $a$ and $b$
$(a, b]$  The real interval excluding $a$ but including $b$
$\mathbb{A} \backslash \mathbb{B}$  Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$  A graph
$Pa_{\mathcal{G}}(\mathrm{x}_i)$  The parents of $\mathrm{x}_i$ in $\mathcal{G}$
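As a sketch of the $Pa_{\mathcal{G}}(\mathrm{x}_i)$ notation, a directed graph can be stored as a parent list; this dict-based representation is an assumption for illustration, not something the book prescribes.

```python
# A tiny directed graph G, stored as a map from each node to its parents
# (hypothetical representation chosen for illustration).
graph = {
    "x1": [],            # x1 has no parents
    "x2": ["x1"],        # Pa_G(x2) = {x1}
    "x3": ["x1", "x2"],  # Pa_G(x3) = {x1, x2}
}

def parents(graph: dict, node: str) -> list:
    """Return Pa_G(node): the parents of `node` in the graph."""
    return graph[node]

print(parents(graph, "x3"))  # ['x1', 'x2']
```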

Indexing

$a_i$  Element $i$ of vector $\vec{a}$, with indexing starting at 1
$a_{-i}$  All elements of vector $\vec{a}$ except for element $i$
$\boldsymbol{A}_{i,j}$  Element $i, j$ of matrix $\boldsymbol{A}$


$\boldsymbol{A}_{i,:}$  Row $i$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:,i}$  Column $i$ of matrix $\boldsymbol{A}$
$\mathsf{A}_{i,j,k}$  Element $(i, j, k)$ of a 3-D tensor $\mathsf{A}$
$\mathsf{A}_{:,:,i}$  2-D slice of a 3-D tensor
$\mathrm{a}_i$  Element $i$ of the random vector $\mathbf{a}$

Linear Algebra Operations

$\boldsymbol{A}^\top$  Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^+$  Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$  Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
$\det(\boldsymbol{A})$  Determinant of $\boldsymbol{A}$
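The indexing and linear-algebra operations above map directly onto array libraries. A minimal NumPy sketch follows; note that NumPy uses 0-based indexing, while the book's notation is 1-based.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

row_1 = A[0, :]               # A_{1,:} in the book's 1-based notation
col_1 = A[:, 0]               # A_{:,1}
T = np.zeros((2, 3, 4))
slice_1 = T[:, :, 0]          # a 2-D slice of a 3-D tensor

A_t    = A.T                  # transpose of A
A_pinv = np.linalg.pinv(A)    # Moore-Penrose pseudoinverse of A
had    = A * B                # element-wise (Hadamard) product of A and B
d      = np.linalg.det(A)     # determinant of A

print(np.allclose(A @ A_pinv @ A, A))  # pseudoinverse property: A A+ A = A
```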

1 Introduction

1.1 Who Should Read This Book?

I Applied Math and Machine Learning Basics

2 Linear Algebra

2.1 Scalars, Vectors, Matrices and Tensors

2.2 Multiplying Matrices and Vectors

2.3 Identity and Inverse Matrices

2.4 Linear Dependence and Span

2.5 Norms

2.6 Special Kinds of Matrices and Vectors

2.7 Eigendecomposition

2.8 Singular Value Decomposition

2.9 The Moore-Penrose Pseudoinverse

2.10 The Trace Operator

2.11 The Determinant

2.12 Example: Principal Components Analysis

3 Probability and Information Theory

3.1 Why Probability?

3.2 Random Variables

3.3 Probability Distributions

3.4 Marginal Probability

3.5 Conditional Probability

3.6 The Chain Rule of Conditional Probabilities

3.7 Independence and Conditional Independence

3.8 Expectation, Variance and Covariance

3.9 Common Probability Distributions

3.10 Useful Properties of Common Functions

3.11 Bayes’ Rule

3.12 Technical Details of Continuous Variables

3.13 Information Theory

3.14 Structured Probabilistic Models

4 Numerical Computation

4.1 Overflow and Underflow

4.2 Poor Conditioning

4.3 Gradient-Based Optimization

4.4 Constrained Optimization

4.5 Example: Linear Least Squares

5 Machine Learning Basics

5.1 Learning Algorithms

5.2 Capacity, Overfitting and Underfitting

5.3 Hyperparameters and Validation Sets

5.4 Estimators, Bias and Variance

5.5 Maximum Likelihood Estimation

5.6 Bayesian Statistics

5.7 Supervised Learning Algorithms

5.8 Unsupervised Learning Algorithms

5.9 Stochastic Gradient Descent

5.10 Building a Machine Learning Algorithm

5.11 Challenges Motivating Deep Learning

II Deep Networks: Modern Practices

6 Deep Feedforward Networks

6.1 Example: Learning XOR

6.2 Gradient-Based Learning

6.3 Hidden Units

6.4 Architecture Design

6.5 Back-Propagation and Other Differentiation Algorithms

6.6 Historical Notes

7 Regularization for Deep Learning

7.1 Parameter Norm Penalties

7.2 Norm Penalties as Constrained Optimization

7.3 Regularization and Under-Constrained Problems

7.4 Dataset Augmentation

7.5 Noise Robustness

7.6 Semi-Supervised Learning

7.7 Multi-Task Learning

7.8 Early Stopping

7.9 Parameter Tying and Parameter Sharing

7.10 Sparse Representations

7.11 Bagging and Other Ensemble Methods

7.12 Dropout

7.13 Adversarial Training

7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8 Optimization for Training Deep Models

8.1 How Learning Differs from Pure Optimization

8.2 Challenges in Neural Network Optimization

8.3 Basic Algorithms

8.4 Parameter Initialization Strategies

8.5 Algorithms with Adaptive Learning Rates

8.6 Approximate Second-Order Methods

8.7 Optimization Strategies and Meta-Algorithms

9 Convolutional Networks

9.1 The Convolution Operation

9.2 Motivation

9.3 Pooling

9.4 Convolution and Pooling as an Infinitely Strong Prior

9.5 Variants of the Basic Convolution Function

9.6 Structured Outputs

9.7 Data Types

9.8 Efficient Convolution Algorithms

9.9 Random or Unsupervised Features

9.10 The Neuroscientific Basis for Convolutional Networks

9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets

10.1 Unfolding Computational Graphs

10.2 Recurrent Neural Networks

10.3 Bidirectional RNNs

10.4 Encoder-Decoder Sequence-to-Sequence Architectures

10.5 Deep Recurrent Networks

10.6 Recursive Neural Networks

10.7 The Challenge of Long-Term Dependencies

10.8 Echo State Networks

10.9 Leaky Units and Other Strategies for Multiple Time Scales

10.10 The Long Short-Term Memory and Other Gated RNNs

10.11 Optimization for Long-Term Dependencies

10.12 Explicit Memory

11 Practical Methodology

11.1 Performance Metrics

11.2 Default Baseline Models

11.3 Determining Whether to Gather More Data

11.4 Selecting Hyperparameters

11.5 Debugging Strategies

11.6 Example: Multi-Digit Number Recognition

12 Applications

12.1 Large-Scale Deep Learning

12.2 Computer Vision

12.3 Speech Recognition

12.4 Natural Language Processing

12.5 Other Applications

III Deep Learning Research

13 Linear Factor Models

13.1 Probabilistic PCA and Factor Analysis

13.2 Independent Component Analysis (ICA)

13.3 Slow Feature Analysis

13.4 Sparse Coding

13.5 Manifold Interpretation of PCA

14 Autoencoders

14.1 Undercomplete Autoencoders

14.2 Regularized Autoencoders

14.3 Representational Power, Layer Size and Depth

14.4 Stochastic Encoders and Decoders

14.5 Denoising Autoencoders

14.6 Learning Manifolds with Autoencoders

14.7 Contractive Autoencoders

14.8 Predictive Sparse Decomposition

14.9 Applications of Autoencoders

15 Representation Learning

15.1 Greedy Layer-Wise Unsupervised Pretraining

15.2 Transfer Learning and Domain Adaptation

15.3 Semi-Supervised Disentangling of Causal Factors

15.4 Distributed Representation

15.5 Exponential Gains from Depth

15.6 Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning

16.1 The Challenge of Unstructured Modeling

16.2 Using Graphs to Describe Model Structure

16.3 Sampling from Graphical Models

16.4 Advantages of Structured Modeling

16.5 Learning about Dependencies

16.6 Inference and Approximate Inference

16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods

17.1 Sampling and Monte Carlo Methods

17.2 Importance Sampling

17.3 Markov Chain Monte Carlo Methods

17.4 Gibbs Sampling

17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function

18.1 The Log-Likelihood Gradient

18.2 Stochastic Maximum Likelihood and Contrastive Divergence

18.3 Pseudolikelihood

18.4 Score Matching and Ratio Matching

18.5 Denoising Score Matching

18.6 Noise-Contrastive Estimation

18.7 Estimating the Partition Function

19 Approximate Inference

19.1 Inference as Optimization

19.2 Expectation Maximization

19.3 MAP Inference and Sparse Coding

19.4 Variational Inference and Learning

19.5 Learned Approximate Inference

20 Deep Generative Models

20.1 Boltzmann Machines

20.2 Restricted Boltzmann Machines

20.3 Deep Belief Networks

20.4 Deep Boltzmann Machines

20.5 Boltzmann Machines for Real-Valued Data

20.6 Convolutional Boltzmann Machines

20.7 Boltzmann Machines for Structured or Sequential Outputs

20.8 Other Boltzmann Machines

20.9 Back-Propagation through Random Operations

20.10 Directed Generative Nets

20.11 Drawing Samples from Autoencoders

20.12 Generative Stochastic Networks

20.13 Other Generation Schemes

20.14 Evaluating Generative Models

20.15 Conclusion

Bibliography

Index