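Using Neural Networks to Recognize Handwritten Digits: Code Implementation

Walking Through Parts of the Code

When I described the MNIST data earlier, I said it was split into 60,000 training images and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test set as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful for figuring out how to set certain hyper-parameters of the neural network, such as the learning rate, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. From now on, when I refer to the "MNIST training data", I mean our 50,000-image data set, not the original 60,000-image data set.

Apart from the MNIST data, we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you will need to install it first. Before giving a full listing, let me explain the core features of the neural network code. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object: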

Network Architecture

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

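In this code, the list sizes contains the number of neurons in the respective layers. For example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The weights and biases in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from.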
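To make the initialization concrete, here is a minimal sketch that repeats the two list comprehensions from the constructor outside the class and inspects the resulting array shapes. The variable names bias_shapes and weight_shapes are introduced only for this illustration.

```python
import numpy as np

sizes = [2, 3, 1]  # same layer sizes as the Network([2, 3, 1]) example

# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers, shaped (next layer, previous layer).
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

Note that the input layer gets no biases, and that each weight matrix has one row per neuron in the layer it feeds into.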

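Activation Function

We start by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Feedforward

We then add a feedforward method to the Network class, which, given an input $a$ for the network, returns the corresponding output. All the method does is apply the equation $a' = \sigma(w a + b)$ to each layer in turn.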

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
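As a quick sanity check of the feedforward loop, the sketch below runs it as a standalone function on a tiny [2, 3, 1] network whose weights and biases are all zero, an assumption made purely so the result is predictable: every neuron then outputs sigmoid(0) = 0.5. The standalone signature feedforward(biases, weights, a) is a stand-in for the method above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(biases, weights, a):
    """Standalone version of the method: apply sigmoid(w.a + b) layer by layer."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

sizes = [2, 3, 1]
# Zero parameters (illustrative only) make the expected output easy to predict.
biases = [np.zeros((y, 1)) for y in sizes[1:]]
weights = [np.zeros((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

x = np.array([[0.2], [0.9]])  # input as a (2, 1) column vector
output = feedforward(biases, weights, x)
print(output.shape)  # (1, 1)
print(output)        # [[0.5]]
```

The input is passed as an (n, 1) column vector rather than a flat array of shape (n,); with a flat array, adding the (n, 1) bias would broadcast to a matrix and give a result of the wrong shape.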

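Stochastic Gradient Descent

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.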

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent.  The "training_data" is a list of tuples
    "(x, y)" representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If "test_data" is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in range(epochs):
        # Reshuffle every epoch so the mini-batches differ each time.
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print("Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test))
        else:
            print("Epoch {0} complete".format(j))

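training_data is a list of (x, y) tuples representing the training inputs and the corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, $\eta$. If the optional argument test_data is supplied, the program will evaluate the network after each epoch of training and print out partial progress. This is useful for tracking progress, but slows things down substantially.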
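To make the roles of the training data and mini_batch_size concrete, here is a small sketch of how one epoch's data is shuffled and sliced into mini-batches. The list of 25 integers is a stand-in for the real (x, y) tuples, chosen only so the batch sizes are easy to read.

```python
import random

training_data = list(range(25))  # stand-in for the list of (x, y) tuples
mini_batch_size = 10
n = len(training_data)

random.shuffle(training_data)  # shuffles in place, as in SGD
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]

batch_sizes = [len(mb) for mb in mini_batches]
print(batch_sizes)  # [10, 10, 5]
```

When the number of training examples is not a multiple of mini_batch_size, the last mini-batch is simply smaller; every example still appears in exactly one mini-batch per epoch.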

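The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini-batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network's weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method: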

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

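Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated with the training example $x$.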
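Written as a formula, the update performed by update_mini_batch reads as follows. The symbols $m$ (the number of examples in the mini-batch, len(mini_batch) in the code), $X_j$ (the $j$-th training example in the mini-batch), and the indices $k$ and $l$ are notation assumed here to match the code; $C_{X_j}$ is the cost for that single example.

```latex
w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_{j=1}^{m} \frac{\partial C_{X_j}}{\partial w_k},
\qquad
b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_{j=1}^{m} \frac{\partial C_{X_j}}{\partial b_l}
```

The sums correspond to the accumulators nabla_w and nabla_b, and the factor $\eta / m$ to eta/len(mini_batch).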

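Complete Code for network.py

Let's look at the full program, including the docstrings that I omitted above. Apart from self.backprop, the program is well documented by its docstrings; all the heavy lifting is done by self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and the docstrings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is docstrings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code can be found on GitHub.

Note that the original author's code was written for Python 2; the listing below is the Python 3 version (only the syntax that differs between Python versions was changed, the logic is untouched).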

"""
network.py
~~~~~~~~~~
A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        training_data = list(training_data)
        if test_data:
            # Only convert when test data was actually supplied;
            # list(None) would raise a TypeError.
            test_data = list(test_data)
            n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        r"""Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
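The closed form used in sigmoid_prime, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, can be checked numerically. The sketch below compares it with a central finite difference; the step size h and the tolerance 1e-8 are choices made for this illustration, not part of the original program.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-5  # finite-difference step (illustrative choice)

# Central difference approximation of the derivative.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_error = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_error < 1e-8)  # True
```

The derivative peaks at z = 0, where it equals 0.25, and shrinks toward zero for large positive or negative z.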

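Results

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, described below. We execute the following commands in a Python shell: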

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

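After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network:

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10 and a learning rate of $\eta = 3.0$:

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

The output shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow: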

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

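That is, the trained network gives us a classification rate of about 95 percent, 95.42 percent at its peak ("Epoch 28"). That's quite encouraging as a first attempt. I should warn you, however, that if you run the code your results won't necessarily be exactly the same as mine, since we initialize our networks with (different) random weights and biases.

Let's rerun the above experiment, changing the number of hidden neurons to 100.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the result to 96.59 percent. At least in this case, using more hidden neurons helps us get better results. (Note that reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in Chapter 3 will greatly reduce the variation in performance across different training runs of our networks.)

Of course, to obtain these accuracies I had to make specific choices for the number of training epochs, the mini-batch size, and the learning rate $\eta$. As I mentioned above, these are known as hyper-parameters of our neural network, to distinguish them from the parameters (weights and biases) learned by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be $\eta = 0.001$.

The results are much less encouraging: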

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

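However, you can see that the performance of the network is slowly getting better over time. That suggests increasing the learning rate, say to $\eta = 0.01$. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine-tune to $3.0$), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters. In general, debugging a neural network can be challenging.

The lesson to take away from this is that debugging a neural network is not trivial and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all of these throughout the book, including how I chose the hyper-parameters above.

Supplement: Code for Loading the Training Data

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. The complete code is listed here. The data structures used to store the MNIST data are described in the docstrings; they are all simple: tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

The original mnist_loader code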

"""
mnist_loader
~~~~~~~~~~~~
A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import pickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.
    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.
    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.
    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.
    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = pickle.load(f, encoding='bytes')
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.
    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.
    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.
    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = list(zip(validation_inputs, va_d[1]))
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = list(zip(test_inputs, te_d[1]))
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
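As a small usage sketch, the snippet below calls vectorized_result for the digit 3 and recovers the digit again with np.argmax, which is the same operation the network's evaluate method applies to the output layer. The choice of the digit 3 is arbitrary.

```python
import numpy as np

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth position."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

y = vectorized_result(3)
digit = int(np.argmax(y))  # index of the largest entry recovers the digit

print(y.shape)  # (10, 1)
print(digit)    # 3
```

This is why the training data and the test data use different label formats: training needs the 10-dimensional target vector to compare against the network's output, while evaluation only needs the integer digit to compare against the argmax.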

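How Are Handwritten Digits Recognized? Toward Deep Learning

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically, which means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network classifies handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand because they've been learned automatically. In the early days of AI research, people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But the outcome may be that we end up understanding neither the brain nor how artificial intelligence works.

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether or not an image shows a human face:

We could attack this problem the same way we attacked handwriting recognition. The input to the network is the pixels in the image, and the output is a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we take this approach, but without using a learning algorithm for now. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.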
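The idea of weighing evidence from sub-questions can be sketched as a single hand-built decision rule. Everything in the snippet is invented for illustration: the function name weigh_evidence, the equal weights, the threshold of 3.0, and the answer values (1.0 for "yes", 0.5 for "probably", 0.0 for "no") are assumptions, not values from the text.

```python
def weigh_evidence(answers, weights, threshold):
    """Return True when the weighted sum of the answers reaches the threshold."""
    score = sum(w * a for w, a in zip(weights, answers))
    return score >= threshold

# Sub-questions: left eye, right eye, nose, mouth, hair (hypothetical ordering).
weights = [1.0, 1.0, 1.0, 1.0, 1.0]
threshold = 3.0

mostly_yes = [1.0, 1.0, 0.5, 1.0, 0.0]  # weighted sum 3.5
mostly_no = [0.0, 0.0, 0.5, 0.0, 1.0]   # weighted sum 1.5

print(weigh_evidence(mostly_yes, weights, threshold))  # True
print(weigh_evidence(mostly_no, weights, threshold))   # False
```

The first case still counts as a face even though the "hair" answer is "no", which is the kind of tolerance the heuristic is meant to provide.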

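Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face detection by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. The figure below shows the architecture:

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?", "Are there eyelashes?", "Is there an iris?", and so on. Of course, these questions should really include positional information as well, such as "Is the eyebrow in the top left, and above the iris?", but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they involve only a handful of pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question (does this image show a face or not) into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure, with two or more hidden layers, are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases, and thus the hierarchy of concepts, from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, too slowly to be useful in practice.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. They have enabled much deeper (and larger) networks to be trained: it's now common to train networks with 5 to 10 hidden layers. And it turns out that on many problems these perform far better than shallow neural networks, for example networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.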

References

[1] Michael Nielsen. CHAPTER 1 Using neural nets to recognize handwritten digits[DB/OL]. http://neuralnetworksanddeeplearning.com/chap1.html, 2018-06-19.

[2] Zhu Xiaohu, Zhang Freeman. Another Chinese Translation of Neural Networks and Deep Learning[DB/OL]. https://github.com/zhanggyb/nndl/blob/master/chap1.tex, 2018-06-19.

[3] skylook. neural-networks-and-deep-learning, mnist_loader.py[DB/OL]. https://github.com/skylook/neural-networks-and-deep-learning/blob/master/src/mnist_loader.py, 2018-06-19.

[4] skylook. neural-networks-and-deep-learning, network.py[DB/OL]. https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py, 2018-06-19.