
Week 17.10 – 23.10

Matthijs Van keirsbilck edited this page Mar 29, 2017 · 1 revision

Installation of Python 2.7 and Lasagne/Theano on leda. Notes from Eben Olson's Theano/Lasagne tutorial:

  • printing: theano.printing.debugprint(y) prints the raw computation graph; theano.pprint(y) gives a formatted expression; from IPython.display import SVG; SVG(theano.printing.pydotprint(y, return_image=True, format='svg')) -> beautiful graph of operations

  • compile function: f = theano.function([x], y)  # x = input; the inputs always have to be given as a list

  • types: vector, matrix, tensor3, tensor4

  • gradient: grad = T.grad(y, x)  # y is a function of x; then grad.eval({x: 2})  # eval needs a dictionary as argument

  • theano float32: dtype=theano.config.floatX

  • matrix size: '.shape'

  • shared vars and updates:

count = theano.shared(0)
new_count = count + 1
updates = {count: new_count}
f = theano.function([], count, updates=updates)
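The semantics of shared variables and updates can be mimicked in plain Python to see what the Theano snippet above does: each call returns the current value of the shared variable, then applies the updates. This is a hypothetical analogy (the `Shared` and `make_function` names are ours, not Theano API), just to illustrate the "return old value, then update" behavior.

```python
# Plain-Python sketch of Theano's shared-variable update semantics.
class Shared:
    """Minimal stand-in for theano.shared: a mutable value container."""
    def __init__(self, value):
        self.value = value

def make_function(output, updates):
    """Stand-in for theano.function([], output, updates=updates)."""
    def f():
        result = output.value                     # output is read first...
        for var, new_value in updates.items():
            var.value = new_value(var.value)      # ...then updates are applied
        return result
    return f

count = Shared(0)
f = make_function(count, {count: lambda c: c + 1})
print(f())           # 0  (returns the old value, like Theano)
print(f())           # 1
print(count.value)   # 2
```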

Stanford NN: Convolutional Neural Networks

In summary:

  • A ConvNet architecture is in the simplest case a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)
  • There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
  • Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
  • Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
  • Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
Conv layer

Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
We slide the filter over the width and height of the input volume and produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.
We will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.

Example: If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5⋅5⋅3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
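The weight count in this example is just F⋅F⋅D plus one bias, which a one-liner confirms:

```python
# Each Conv-layer neuron connects to an F x F x D region of the input;
# here F = 5 (filter size) and D = 3 (input depth, e.g. RGB).
F, D = 5, 3
weights = F * F * D      # 75 weights per neuron
params = weights + 1     # +1 bias parameter
print(weights, params)   # 75 76
```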

Properties:

  • Depth: number of filters we would like to use, each learning to look for something different in the input.
  • Stride: When the stride is 1 we move the filters one pixel at a time; larger strides skip positions and produce smaller output volumes spatially.
  • Zero-Padding: As we will soon see, sometimes it will be convenient to pad the input volume with zeros around the border. The size of this zero-padding is a hyperparameter. It will allow us to control the spatial size of the output volumes (to make sure that input and output width and height are the same).

Formula of output: (W − F + 2P)/S + 1, with W the input volume size, F the receptive field (filter size), P the amount of zero-padding, and S the stride

  • Setting zero padding to be P=(F−1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially

  • The stride is constrained by the other hyperparameters: the numerator (W − F + 2P) has to be divisible by S, so that the result is a natural number
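The output-size formula and the same-padding rule can be checked with a small helper (the function name and the example numbers are ours; the 227/11/4 case is the well-known first layer of AlexNet):

```python
def conv_output_size(W, F, S, P):
    """Spatial output size of a conv layer: (W - F + 2P)/S + 1.
    Raises if the stride does not evenly divide (W - F + 2P)."""
    num = W - F + 2 * P
    if num % S != 0:
        raise ValueError("hyperparameters do not fit: (W-F+2P) not divisible by S")
    return num // S + 1

# With S=1 and P=(F-1)/2, the spatial size is preserved:
print(conv_output_size(W=32, F=3, S=1, P=1))    # 32
# AlexNet first layer: 227x227 input, F=11, S=4, P=0:
print(conv_output_size(W=227, F=11, S=4, P=0))  # 55
```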

  • Parameter sharing: the depth slices are reused for all positions on the image to conserve resources -> effectively each slice is a filter that executes a convolution (we can do this because of the translationally-invariant structure of images)

Note that sometimes the parameter sharing assumption may not make sense, for example when the inputs are faces that have been centered in the image: different features (e.g. eye-specific ones) should then be learned at different spatial locations. In that case we relax parameter sharing and call the layer a Locally-Connected Layer. http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/

The conv layer accepts a volume of size W1×H1×D1, and requires four hyperparameters:

  • Number of filters K
  • spatial extent F
  • the stride S
  • the amount of zero padding P

It produces a volume of size W2×H2×D2, where:

  • W2=(W1−F+2P)/S+1
  • H2=(H1−F+2P)/S+1 (i.e. width and height are computed equally by symmetry)
  • D2=K

With parameter sharing, it introduces (F⋅F⋅D1) weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
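These output-volume and parameter-count formulas can be checked together for one concrete layer (the sizes below are an illustrative example, not from the notes):

```python
# Conv layer accepting a W1 x H1 x D1 volume, with K filters of extent F.
W1, H1, D1 = 32, 32, 3        # e.g. a CIFAR-10-sized input (illustrative)
K, F, S, P = 12, 5, 1, 2

W2 = (W1 - F + 2 * P) // S + 1   # (32 - 5 + 4)/1 + 1 = 32
H2 = (H1 - F + 2 * P) // S + 1   # 32
D2 = K                           # 12

weights = F * F * D1 * K         # (F.F.D1) per filter, times K = 900
biases = K                       # one bias per filter = 12
print((W2, H2, D2), weights + biases)   # (32, 32, 12) 912
```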

In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offsetting by the d-th bias.
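This statement can be made concrete with a naive NumPy loop (a didactic sketch under our own naming, not an efficient implementation): each depth slice of the output comes from sliding one filter over the padded input with stride S, plus that filter's bias. (As usual in deep learning, "convolution" here is implemented as cross-correlation, i.e. without flipping the filter.)

```python
import numpy as np

def conv_layer(x, filters, biases, S=1, P=0):
    """Naive conv layer. x: (W1, H1, D1); filters: (K, F, F, D1); biases: (K,).
    Returns the output volume of shape (W2, H2, K)."""
    K, F, _, D1 = filters.shape
    xp = np.pad(x, ((P, P), (P, P), (0, 0)))     # zero-pad width and height only
    W2 = (x.shape[0] - F + 2 * P) // S + 1
    H2 = (x.shape[1] - F + 2 * P) // S + 1
    out = np.zeros((W2, H2, K))
    for d in range(K):                           # d-th depth slice <- d-th filter
        for i in range(W2):
            for j in range(H2):
                patch = xp[i*S:i*S+F, j*S:j*S+F, :]
                out[i, j, d] = np.sum(patch * filters[d]) + biases[d]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 3))
filters = rng.standard_normal((4, 3, 3, 3))      # K=4 filters of size 3x3x3
out = conv_layer(x, filters, np.zeros(4), S=2, P=1)
print(out.shape)   # (4, 4, 4): (7-3+2)/2 + 1 = 4 spatially, depth = K = 4
```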

A common setting of the hyperparameters is F=3, S=1, P=1. However, there are common conventions and rules of thumb that motivate these hyperparameters; see the ConvNet architectures section of the Stanford notes.

Pooling Layer

Its function is to progressively reduce the spatial size of the representation, which reduces the amount of parameters and computation in the network and hence also helps control overfitting. In practice: a pooling layer with F=3 (a 3x3 pool) and S=2 (also called overlapping pooling), or, more commonly, F=2, S=2.
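The common F=2, S=2 max pool can be sketched in NumPy (a didactic version with our own function name; frameworks ship optimized ops). Each depth slice is pooled independently.

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """Max-pool each depth slice of x (shape (W, H, D)) independently."""
    W, H, D = x.shape
    W2, H2 = (W - F) // S + 1, (H - F) // S + 1
    out = np.zeros((W2, H2, D))
    for i in range(W2):
        for j in range(H2):
            out[i, j, :] = x[i*S:i*S+F, j*S:j*S+F, :].max(axis=(0, 1))
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool(x)[..., 0])
# [[ 5.  7.]
#  [13. 15.]]
```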

We can convert FC layers to CONV layers and the other way around; this can save computation. The conversion amounts to reshaping the weight matrix W in each FC layer into CONV layer filters. It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
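The FC-to-CONV direction can be checked numerically (a sketch with made-up sizes): an FC layer that takes a 7×7×D volume is equivalent to a CONV layer with F=7, P=0, S=1 and K filters, where K is the number of output units and the FC weight matrix is reshaped into K filters.

```python
import numpy as np

rng = np.random.default_rng(0)
W_in, D, K = 7, 3, 10                              # illustrative sizes
x = rng.standard_normal((W_in, W_in, D))
W_fc = rng.standard_normal((K, W_in * W_in * D))   # FC weight matrix
b = rng.standard_normal(K)

# FC view: flatten the input volume and multiply.
fc_out = W_fc @ x.ravel() + b

# CONV view: reshape each FC row into an F x F x D filter with F = 7;
# a single "valid" position then yields a 1 x 1 x K output volume.
filters = W_fc.reshape(K, W_in, W_in, D)
conv_out = np.array([np.sum(x * filters[k]) + b[k] for k in range(K)])

print(np.allclose(fc_out, conv_out))   # True
```

The two views agree because reshaping and flattening use the same (C-order) memory layout, so each filter's elementwise product sums exactly the same terms as the corresponding row of the matrix multiply.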