
Sabina Chen - sabinach@mit.edu
Overview

Assignment 3 - Convolutional Neural Networks (Instructions)

Section 1.3: Experimenting with hyperparameters
Training Data & Results via Model Builder
I modified the hyperparameters by changing the number of conv/relu/conv triples, changing the number of outputs for the conv layer, and changing the stride size for the max pool layer. Below are my results and observations for each modification.
  • Original Network w/ CIFAR10: Conv -> ReLU -> Max Pool -> Conv -> ReLU -> Max Pool -> Flatten -> Fully Connected
    Observations: 577 examples/sec, ~45% accuracy

    [Figure: original network, Conv -> ReLU -> Max Pool -> Conv -> ReLU -> Max Pool -> Flatten -> Fully Connected]

  • I removed one, and then two, conv/relu/maxpool triples during this test iteration. Interestingly, this did not change the accuracy much: it stayed in the 40%-50% range for both the original and the modified networks. The only thing that changed was that the training speed decreased after removing the extra layers, which is surprising because one would expect the speed to increase with fewer layers to go through. The change may have been caused by my own computer's CPU rather than by the change in layers.
    [Figure: removed one Conv, ReLU, Max pool triple]

    [Figure: removed two Conv, ReLU, Max pool triples]

  • For this test iteration, I changed the number of outputs of the convolution layers from the original 16 to 5 and to 50. The accuracy with 5 outputs decreased slightly to 40%, whereas the accuracy with 50 outputs stayed at about 45% but the training speed decreased a lot. It took about 4-5 minutes longer to train on the same number of images because of the extra outputs in each conv layer.
    [Figure: conv outputs = 5]

    [Figure: conv outputs = 50]
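
A rough way to see why more conv outputs slow training is to count the weights and multiply-adds in one conv layer. The sketch below is an illustration only, not the Model Builder's code. The 5x5 field, stride of 1, and "same" padding are assumptions, since the write-up does not list those defaults; the 32x32x3 input is the standard CIFAR-10 image shape.

```javascript
// Weights and multiply-adds for a single conv layer.
// Assumed (not stated in the write-up): square field, stride 1, "same" padding,
// so the layer produces one output value per input pixel per output channel.
function convCost({ inputSize, inChannels, outChannels, fieldSize }) {
  const weights = fieldSize * fieldSize * inChannels * outChannels;
  const params = weights + outChannels; // one bias per output channel
  const multiplyAdds = weights * inputSize * inputSize;
  return { params, multiplyAdds };
}

// First conv layer on a 32x32 RGB CIFAR-10 image, for the three settings tried.
for (const outputs of [5, 16, 50]) {
  const cost = convCost({ inputSize: 32, inChannels: 3, outChannels: outputs, fieldSize: 5 });
  console.log(`outputs=${outputs}`, cost);
}
```

Under these assumptions the cost of the first layer grows linearly with the number of outputs. If a second conv layer is widened as well, both its input and output channel counts grow, so its cost grows roughly with the square of that number.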

  • For this test iteration, I changed the stride for the max pool layer (original stride: 2). Changing the stride to 1 caused the run to take significantly longer, around 15 minutes, and the resulting accuracy dropped to 10%. With a stride of 1 the pooling window moves one pixel at a time, so the feature maps are barely downsampled and the later layers have far more values to process, which likely explains the slowdown. An accuracy of 10% is what random guessing gives on the 10 CIFAR-10 classes, so the network most likely failed to learn at all in this configuration rather than overfitting. Changing the stride to 5 did not significantly change the run time or accuracy from the original.
    [Figure: max pool stride = 1]

    [Figure: max pool stride = 5]
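
One way to see why a max pool stride of 1 is so much slower is to count how many values reach the fully connected layer. The sketch below assumes a 2x2 pool window with no padding, conv layers that keep the spatial size unchanged, and 16 channels; none of these Model Builder defaults are confirmed by the write-up, so the numbers are illustrative only.

```javascript
// Output width of a pooling layer with no padding.
function poolOutputSize(inputSize, poolSize, stride) {
  return Math.floor((inputSize - poolSize) / stride) + 1;
}

// Number of values after two pool layers and a flatten.
// Assumed defaults: 32x32 input, 2x2 pool window, 16 channels, size-preserving convs.
function flattenedFeatures(stride, { inputSize = 32, poolSize = 2, channels = 16, poolLayers = 2 } = {}) {
  let size = inputSize;
  for (let i = 0; i < poolLayers; i++) {
    size = poolOutputSize(size, poolSize, stride);
  }
  return size * size * channels;
}

for (const stride of [1, 2, 5]) {
  console.log(`stride=${stride}: ${flattenedFeatures(stride)} values into the fully connected layer`);
}
```

With these assumptions a stride of 1 feeds roughly 14 times as many values into the fully connected layer as a stride of 2, which is consistent with the much longer run time. The arithmetic says nothing about accuracy.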

Section 1.4: Visualizing convolutional neural networks
2D Visualization based on paper from Interactive Node-Link Visualization of Convolutional Neural Networks
Visualization Demo here

  • I created a drawing that looks vaguely like something between a 4 and a 6, which caused the network to register 4 and 6 as the most probable labels. The extended line on the loop of the 6 can be read incorrectly as the straight part of a 4, which caused the network to also consider the number a 4. A human looking at this number might also be confused between a 4 and a 6.

    [Figure: two labels registered]

  • I drew a 7 that was registered confidently as a 1 by shortening the top horizontal line. As the convolutional layers of the network gradually blur out portions of the number, the small hook of the 7 gets filtered out, so the network only registers the more prominent 1-like part of the image.

    [Figure: 7 that looks like 1]

  • I drew a dotted 7 that was incorrectly registered as a 2 and a 3. This was a really interesting test for me because I did not think that the dots could confuse the network so much. The network registered the top circular part of the 7, but neglected to register the straight bottom part.

    [Figure: dotted 7]

Section 2: Style transfer examples (from last week)
Generated images via Deep Art

The generated images seem to take the shape of the original image while overlaying the style of the style image, as shown by the Pac-Man-shaped generated image with terrifying eyes and feathers, and by the owl-shaped yellow blob in the second generated image. Style images with too much background and not one focused color scheme tend to do better, as shown by the greyish bush that was generated. The last generated image is my favorite, with the shape of the block and the cartoonish style of the bush.
[Figures: four image sets, each showing the Original, Style, and Generated image]

Section 3.5: Fast style transfer
Generated images results via Fast Neural Style Transfer with Deeplearn.JS

    • Passing the same image through the same filter several times produced the exact same generated image every single time.
      [Figures: First Pass, Second Pass, Third Pass]

    • Passing an image through one filter and then passing the filtered result through the same filter again generated a more abstract/blocky version of the first-pass image, with the padding style more prominent in the second-pass image.
      [Figure: Filtered Image - First Pass]

      [Figure: Filtered Image - Second Pass]

  1. I applied a filter to an image, and then applied another filter to that already-filtered image. Compared with applying The Scream directly to the original image (Diane -> The Scream, Edvard Munch), the 2-pass image (Diane -> Udnie, Francis Picabia -> The Scream, Edvard Munch) kept some of the blockiness of the first Udnie filter and consequently lost a lot of the details of Diane's face. When Diane is filtered directly with The Scream, one can still see all the details and subtleties of her facial features.
    [Figure: Diane -> Udnie]

    [Figure: Diane -> Udnie -> The Scream]

    [Figure: Diane -> The Scream]

  2. I chained all of the filters available on the website, applying them one after another. One interesting observation is that at each new filter, the generated image keeps a lot of the style of the previous filtered image, and seems to just layer the next style on top of it. At each new filter, the details of Diane's face get more and more blocky and filtered; by the last filter (The Wreck of the Minotaur), it is difficult to even tell whether Diane is smiling because the lower portion of her face has lost so much detail to the blurring of each filter layer.
    [Figure: Filter 1 - Udnie, Francis Picabia]

    [Figure: Filter 2 - The Scream, Edvard Munch]

    [Figure: Filter 3 - La Muse, Pablo Picasso]

    [Figure: Filter 4 - Rain Princess, Leonid Afremov]

    [Figure: Filter 5 - The Wave, Katsushika Hokusai]

    [Figure: Filter 6 - The Wreck of the Minotaur, J.M.W. Turner]

Section 3.6: Building CNNs with code
  1. Model:
    Started off with original model from Model Builder:
    input -> convolution -> ReLU -> max pool -> convolution -> ReLU -> max pool -> flatten -> fully connected -> softmax cross-entropy -> output
    Observations:
    In general, smaller batch sizes allow for quicker feedback, and a large num_batches gives the network time to learn and makes the general accuracy trend visible.
    • MNIST:
      batch_size:20, num_batches:3000
      -> Reached >95% accuracy after ~300 batches. Accuracy oscillates more sharply, but the network trains quickly overall.
      batch_size:500, num_batches:3000
      -> Reached >90% accuracy after ~600 batches; takes longer to run because of the larger batch size (i.e. more images to loop through in each batch).
      [Figure: MNIST, batch_size:20, num_batches:3000]

      [Figure: MNIST, batch_size:500, num_batches:3000]

    • Fashion MNIST:
      batch_size:20, num_batches:3000
      -> Reached >80% accuracy after ~1300 batches. The accuracy swings up and down sharply, and the loss declines exponentially. The network trains slightly slower overall than on MNIST because the accuracy stayed around 65% for a while before finally reaching an average of 80%.
      batch_size:500, num_batches:3000
      -> Trained up to batch 600 and saw that it started plateauing around 70% accuracy and 0.75 loss
      [Figure: FashionMNIST, batch_size:20, num_batches:3000]

      [Figure: FashionMNIST, batch_size:500, num_batches:3000]

    • CIFAR-10:
      batch_size:20, num_batches:3000
      -> Reached an average of 40% accuracy (max 54%) in 2500 batches and plateaued. The loss declines slowly and linearly, compared to the exponential decline on the MNIST/Fashion MNIST datasets. The accuracy climbs very slowly, with large oscillations.
      batch_size:500, num_batches:3000
      -> Stopped the training around batch 2000 at 40% accuracy. It trained VERY slowly compared to all the previous networks/datasets: it took around 20 minutes to get to batch 2000, whereas the previous networks took at most 5 minutes to plateau or reach >90% accuracy. However, there seems to be a general upward trend, which looks like it would continue if given more time to train.
      [Figure: CIFAR-10, batch_size:20, num_batches:3000]

      [Figure: CIFAR-10, batch_size:500, num_batches:3000]
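
The run-time difference between the two batch sizes follows from simple arithmetic: with num_batches fixed, the number of images processed scales with batch_size. The sketch below works this out. The training-set sizes are the standard dataset splits and are an assumption here, since the write-up does not say how many images the demo loads.

```javascript
// Assumed training-set sizes (standard splits, not stated in the write-up).
const TRAIN_SET_SIZE = { mnist: 60000, fashionMnist: 60000, cifar10: 50000 };

// Total images pushed through the network during one run.
function examplesProcessed(batchSize, numBatches) {
  return batchSize * numBatches;
}

// How many times the run cycles through the whole training set.
function passesOverData(batchSize, numBatches, trainSetSize) {
  return examplesProcessed(batchSize, numBatches) / trainSetSize;
}

for (const batchSize of [20, 500]) {
  const total = examplesProcessed(batchSize, 3000);
  const passes = passesOverData(batchSize, 3000, TRAIN_SET_SIZE.cifar10);
  console.log(`batch_size=${batchSize}: ${total} images, ${passes} passes over CIFAR-10`);
}
```

With 3000 batches, a batch size of 500 processes 25 times as many images as a batch size of 20, so a much longer run is expected even though each individual update is based on a less noisy estimate.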

  2. I adjusted the field_size and strides while keeping the batch_size and num_batches constant. A field size of [5,5] with a stride between 1 and 5 seems to give the best accuracy.
    • Stable parameters: batch_size=50, num_batches=500, stride=1
      Variable parameters: field_size=[1,1], field_size=[5,5], field_size=[7,7]
      Observations: Best at field_size 5. Field sizes 1 and 7 did not do as well: a 1x1 field looks at one pixel at a time and so cannot pick up spatial patterns, whereas a 7x7 field looks at too many pixels at once.
      [Figure: small field size, field_size = [1,1]]

      [Figure: medium field size, field_size = [5,5]]

      [Figure: large field size, field_size = [7,7]]

    • Stable parameters: batch_size=50, num_batches=500, field_size=[5,5]
      Variable parameters: stride=1, stride=3, stride=5, stride=10
      Observations: There is not much qualitative difference in accuracy trends between strides 1-5 (~90% accuracy), but the accuracy for stride 10 is slightly decreased, possibly because the network skips over too many details with the larger stride.
      [Figure: FashionMNIST, stride = 1]

      [Figure: FashionMNIST, stride = 3]

      [Figure: FashionMNIST, stride = 5]

      [Figure: FashionMNIST, stride = 10]
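
The idea that a stride of 10 skips over details can be checked by counting which pixel columns a sliding 5-wide field ever touches. This is a one-dimensional sketch that assumes no padding and a 28-pixel-wide Fashion MNIST image; the library used in the assignment may pad the input differently, so treat the counts as illustrative.

```javascript
// Set of input positions touched by a sliding window (1-D, no padding assumed).
function coveredPositions(inputSize, fieldSize, stride) {
  const covered = new Set();
  for (let start = 0; start + fieldSize <= inputSize; start += stride) {
    for (let offset = 0; offset < fieldSize; offset++) {
      covered.add(start + offset);
    }
  }
  return covered;
}

for (const stride of [1, 3, 5, 10]) {
  const seen = coveredPositions(28, 5, stride).size;
  console.log(`stride=${stride}: ${seen} of 28 columns covered`);
}
```

For strides up to the field size the windows still tile or overlap, so almost every column is seen. Once the stride exceeds the field size, whole bands of pixels between windows are never looked at.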

  3. CIFAR-10 is harder to train than Fashion MNIST and MNIST because the objects within the images are shown from a wider range of angles/perspectives. Whereas MNIST and Fashion MNIST images (i.e. digits and clothes) are shown from a straight-on, almost 2D perspective, CIFAR-10 images (i.e. dogs, cats, etc.) are shown from varying, more 3D-like perspectives, and thus it’s harder to classify and match pixels.
  4. I tested the model by adding 2 extra conv layers and 4 extra conv layers. I was able to reach a maximum accuracy of 70% by adding four conv layers; however, the average accuracy seems to plateau around 45%.
    • Added two extra conv layers and lowered the batch size to 20 -> Speed was about the same as before (without the added conv layers), but accuracy stayed fairly low and also increased more slowly. However, at 2500 batches it reached around 50% accuracy, which is better than the original network's roughly 40% at 2500 batches. The two added conv layers made the network slower to start improving, but it ended up with a better overall accuracy.

      [Figure: CIFAR-10, input -> conv -> reLU -> maxpool -> conv -> reLU -> maxpool -> CONV -> CONV -> flatten -> fully connected -> softmax -> output]

    • Added four extra conv layers to the original network and kept the batch size at 20 -> Speed slowed down slightly but the accuracy increased at a slightly faster rate, averaging 40% at 1200 batches, although the accuracy oscillations were also more extreme. It was the first time I saw a network hit 0% accuracy three times at the beginning of training.

      [Figure: CIFAR-10, input -> conv -> CONV -> reLU -> maxpool -> conv -> CONV -> reLU -> maxpool -> CONV -> CONV -> flatten -> fully connected -> softmax -> output]
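
Hitting 0% accuracy early in training is less surprising than it looks when the batch holds only 20 images. The sketch below estimates how often a whole batch is misclassified, assuming the reported accuracy is measured per batch and that each image is classified independently with the same probability of being correct. Both are simplifying assumptions, not details from the write-up.

```javascript
// Probability that every image in a batch is misclassified,
// assuming independent predictions with the same per-image accuracy.
function probAllWrong(batchSize, accuracy) {
  return Math.pow(1 - accuracy, batchSize);
}

// Smallest possible change in a per-batch accuracy reading.
function accuracyStep(batchSize) {
  return 1 / batchSize;
}

// At chance level on 10 classes (10% accuracy):
console.log(probAllWrong(20, 0.1));  // roughly 0.12
console.log(probAllWrong(500, 0.1)); // vanishingly small
console.log(accuracyStep(20));       // each image moves the reading by 5 points
```

Under these assumptions an untrained network would show a 0% batch about once in every eight batches at batch size 20, and essentially never at batch size 500. This also accounts for much of the oscillation seen at the smaller batch size.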

  5. I adjusted the kernel size, filters, and pool size of the network: kernelsize=[3,3], filters=30, poolSize=2. The best I was able to get was 78% accuracy on CIFAR-10 after 3000 batches, which took 8 minutes to train. There is definitely a very slow upward trend in accuracy and a slow linear decline in loss; it seems both the accuracy and loss are starting to plateau.

    [Figure: CIFAR-10 training results]
  6. Network w/ CIFAR 78% accuracy: index.js
    [Figure: network code, input -> convolution -> ReLU -> max pool -> convolution -> ReLU -> max pool -> flatten -> fully connected -> softmax cross-entropy -> output]
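
The last stage of the pipeline, softmax cross-entropy, can be written out in a few lines of plain JavaScript. This is a minimal sketch for a single example and is not the code from index.js; the function names are made up for illustration.

```javascript
// Numerically stable softmax: subtract the largest logit before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Cross-entropy loss for one example whose true class is labelIndex.
function crossEntropy(logits, labelIndex) {
  return -Math.log(softmax(logits)[labelIndex]);
}

// A network that gives all 10 classes the same score is guessing at random.
const uniformLoss = crossEntropy(new Array(10).fill(0), 3);
console.log(uniformLoss.toFixed(4)); // ln(10), about 2.3026
```

The value ln(10) is a useful reference point when reading the loss curves: a 10-class network whose loss sits near 2.3 is doing no better than chance, and any real learning shows up as the loss dropping below that level.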