> Deep Learning Practicum > Embeddings and Generative Models

October 10, 2018

Overview

Assignment 4 - Embeddings and Generative Models (Instructions)

Section 1.1: Visualizing datasets using the embedding projector

Images Generated via Embedding Projector
Super Cool Visualizations: Principal Component Analysis, How to Use t-SNE Effectively

  • MNIST w/ Images, Color by Label, PCA, Not spherized
    The different digits separate into somewhat distinct clouds, but there is still a lot of mixture and overlap. I ran PCA multiple times, and although the final separation of the digits was not always the same, the digit clouds still overlapped in similar ways. For example, 1s are strongly separated from the other numbers, with very little overlap. In contrast, 0s/6s, 4s/6s, 2s/3s, 8s/3s, 9s/7s/4s, and 5s/3s overlap strongly. Inspecting the digits that tend to overlap, we can see that local similarities between them cause digits to sometimes be grouped incorrectly (e.g., 6s grouped with 0s because of the circular bottom curvature, 8s with 3s because both look like two stacked circles, and 9s/7s/4s mixing because of their vertical strokes).

    MNIST PCA

    MNIST PCA

  • MNIST w/ Images, Color by Label, t-SNE, Not spherized
    Using t-SNE, the digits split into very distinct clouds with far less overlap between them than under PCA, making it much easier to analyze specific cases of digit overlap. At a glance, the 1, 7, 2, 4, 6, and 0 digits all split into very distinct clouds. It is also interesting that the 2s split into two separate clouds. The clustering may be affected by the original locations of the digits before projection, so the two separate clouds of 2s may come from two dense clusters of 2s in the original dataset. Furthermore, the overlap between 0s/6s and 4s/6s is very similar to PCA, but with t-SNE we can more easily identify exactly which digits are causing it.

    MNIST T-SNE

Section 1.3: Word Geometry

One important point (also mentioned in the assignment) is that the geometric locations of the words are not based on word definitions, but are determined by how frequently words co-occur in text (i.e., Google News articles). I experimented with three different examples: politics (good to bad), engineering (man to woman), and a slightly more sensitive word, gay (good to bad). For each example, I analyzed results for both Word2Vec 10K and Word2Vec All. Word2Vec All produced more neutral results, whereas Word2Vec 10K produced more biased results. This makes sense because larger pools of data tend to produce more averaged-out results.
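The "good to bad" ranking itself is just vector algebra: each word's position is its scalar projection onto the axis running from the 'good' vector to the 'bad' vector. A minimal sketch (the 2-D vectors below are made up for illustration; the real Word2Vec vectors are much higher-dimensional):

```python
def project_on_axis(w, a, b):
    """Scalar position of word vector w along the axis from a to b:
    0.0 means w sits at a ('good'), 1.0 means w sits at b ('bad')."""
    axis = [bi - ai for ai, bi in zip(a, b)]
    rel = [wi - ai for ai, wi in zip(a, w)]
    dot = sum(x * y for x, y in zip(rel, axis))
    return dot / sum(x * x for x in axis)

# Toy 2-D stand-ins for the real high-dimensional Word2Vec vectors.
good, bad = [1.0, 0.0], [-1.0, 0.0]
politics = [-0.2, 0.5]
print(project_on_axis(politics, good, bad))  # ~0.6 -> leans toward 'bad'
```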

  • Given: politics; Range: good to bad
    • Word2Vec 10K: The leftmost ('good') word was 'trade', and the rightmost ('bad') word was 'militant'. Most words on the 'bad' side were related to forms of government or ideology (i.e., militant, nationalist, marxism, theology, socialism, governments, autonomy, conservative, liberal). This suggests that discussion of various ideologies is usually negative. The word 'politics' itself is also located a little to the right (i.e., the 'bad' side), suggesting that politics is a fairly negatively viewed topic in general.

      politics, good bad, 10k

    • Word2Vec All: Words are much more spread out in general compared to the 10K dataset. 'Militant' is still considered a 'bad' word, but the word 'politics' itself is now located closer to the neutral middle. This suggests that although there are still batches of words/topics viewed positively or negatively, 'politics' overall is a fairly neutral topic.

      politics, good bad, All

  • Given: engineering; Range: man to woman
    • Word2Vec 10K: The word 'engineering' itself leans towards 'man'. Technology, MechE, and computer-related words tend to sit towards 'man', whereas medical, biology, and chemistry-related words sit towards 'woman'. This suggests that men are mentioned more in hardware/software-related contexts, whereas women are mentioned more in bio-related fields.

      engineering, man woman, 10k

    • Word2Vec All: Interestingly, although I expected Word2Vec All to spread the dataset out more evenly, it is even more starkly divided than the 10K example. Now almost all the physics/hardware/computer-related words sit near the 'man' side, and almost no computer-engineering-related words sit towards the 'woman' side. Although terms related to medicine/psychology/education tend towards the right, the use of engineering terminology for women is still extremely lacking. This visualization suggests that despite improvements in small groups, there is still a large discrepancy between men and women in engineering roles overall, as shown by comparing the Word2Vec All and Word2Vec 10K datasets.

      engineering, man woman, All

  • Given: gay; Range: good to bad
    For this example, I specifically wanted to try a word that could be taken very positively or very negatively in the greater community.
    • Word2Vec 10K: 'Good' words are: rights, advocates, media, equality, and gender. 'Bad' words are: activist, riots, racism, angry, prejudice, and politicians. The word 'gay' itself is located near the center, meaning it is a neutral topic overall. The difference in good and bad word types suggests that although people are happy with the media coverage and progress towards gender equality, there is still a lot of negativity around the side effects of change (i.e., riots, racism, prejudice, etc.).

      gay, good bad, 10k

    • Word2Vec All: 'Good' words are: man, love, everyone, canadian, pride... and, interestingly enough, homophobia, murder, and fag. 'Bad' words are: bisexuals, transgender, homosexuality, lesbian... which is also interesting because these 'bad' words are mostly used as synonyms for 'gay'. Other 'bad' words include: violent, democracy, socially, internal, movement, and culture. This suggests that topics relating the LGBT movement to politics are generally considered 'bad', whereas showing love and acceptance during tragedies is considered 'good'.

      gay, good bad, All

Section 1.4: Finding word analogies with vector algebra

Word2Vec Demo: here

Interesting Examples:

  • alphabet is to soup AS pinyin is to congee: This is interestingly accurate because while "pinyin" is the Chinese parallel to "alphabet", "congee" is the Chinese version of "soup".
  • alphabet is to rice AS pinyin is to glutinous rice: I found this funny more than anything because yes, we Chinese enjoy glutinous rice, and I don't think it is something Americans typically eat.
  • computer is to paper AS music is to songlines: Interesting because "paper" and "songlines" are the stripped-down, primitive versions of "computer" and "music" respectively.
  • caltech is to california AS MIT is to massachusetts: It got the location of the universities correct!!!
  • caltech is to pasadena AS MIT is to santa monica: But not the exact city locations of the universities... MIT is not located in Santa Monica :(
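The analogies above come from simple vector arithmetic: the demo finds the word whose vector is closest to b − a + c. A minimal sketch with a toy, made-up vocabulary (real Word2Vec vectors have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def analogy(vocab, a, b, c):
    """Solve 'a is to b AS c is to ?': the word whose vector is nearest to b - a + c."""
    target = [bi - ai + ci for ai, bi, ci in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

# Toy 2-D vocabulary, made up purely for illustration.
vocab = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [2.0, 0.0], "queen": [2.0, 1.0],
    "soup": [0.0, 3.0],
}
print(analogy(vocab, "man", "woman", "king"))  # queen
```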

Section 1.5: Exploring fonts with the Embedding Projector

Embedding Project Demo: here

  1. Viewing the fonts with the PCA embedding, clumps with certain font characteristics appear. For example, tall skinny A's form one prominent clump (Font IDs: 2488, 6079, 7023, 6655), blocky/artsy/design-oriented A's are scattered towards the side (Font IDs: 3229, 3115, 3134, 3077), and bolder-looking A's (Font IDs: 5857, 4657) act as the transition between the skinny A's and the blocky/artsy A's.

    Skinny/tall As

    Bolder As

    Blocky/design-oriented As

  2. Viewing the fonts with the t-SNE embedding, it took ~500 iterations for the cloud to stabilize. t-SNE made the cloud more continuously clumped: there is a smoother transition from thin/tall/pencil-like A's to bolder A's to blocky/creative A's.

    Skinny/tall As

    Bolder As

    Blocky/design-oriented As

    1. font_id: 4884 - this font is very pixel-like and block-shaped. When isolated, the neighboring fonts all look very similar to the font I specified.

      font_id: 4884

    2. font_id: 221 - this font is typewriter-like. No fonts seem to look similar to the font I chose, and the "nearest" neighbors are on the other side of the graph!

      font_id: 221

    3. font_id: 3230 - this font is an A with a christmas tree design around it. Similar neighboring fonts also have some type of design surrounding the characters, which is expected.

      font_id: 3230

Section 2.3: Creating new fonts

Font Finder: here
Latent Space Explorer: GitHub Repo

  1. Editing FontModel.js, I changed the sample character to 'b', and displayed the uppercase, lowercase, and numerical sample font characters on the right side of the screen.

    Modified UI

  2. Editing VectorChooser.vue, I added a new button called "Get KNN" that shows the font ID of the nearest-neighbor font, i.e. the font most similar to the current one out of all the fonts in the 50K training set, found by computing the cosine similarity of the two font vectors. I used console.log() to record the similarity values as they were generated, and output the highest similarity value and the corresponding font ID at the very end.

    Implemented getKNN() function

    • original font w/ cosine similarities

      nearest neighbor font

    • original font w/ cosine similarities

      nearest neighbor font

    • original font w/ cosine similarities

      nearest neighbor font
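    The actual getKNN() lives in VectorChooser.vue; the same brute-force cosine-similarity scan can be sketched in Python (the font IDs and 3-D vectors below are hypothetical stand-ins for the real, higher-dimensional latent vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def nearest_font(query, fonts):
    """Scan every font vector, keeping the one with the best cosine similarity."""
    best_id, best_sim = None, -2.0  # cosine similarity is always >= -1
    for font_id, vec in fonts.items():
        sim = cosine(query, vec)
        if sim > best_sim:
            best_id, best_sim = font_id, sim
    return best_id, best_sim

# Hypothetical font vectors keyed by font ID.
fonts = {4884: [0.9, 0.1, 0.0], 221: [-0.5, 0.8, 0.3], 3230: [0.2, 0.2, 0.9]}
print(nearest_font([1.0, 0.0, 0.1], fonts))
```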

  3. Editing VectorChooser.vue, I added a new button called "Get KNN of Avg Font" that shows the font ID of the nearest-neighbor font most similar to the average font. I calculated the "average" of the set of fonts by taking the mean of each dimension across all 50K vectors. Interestingly, the average font is literally the bottom half of a font!

    Implemented getAvgKNN() function

    average font w/ cosine similarities

    nearest neighbor font to average font
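    The averaging step is just a dimension-wise mean; a minimal sketch (not the actual VectorChooser.vue code, and the sample vectors are made up):

```python
def average_font(vectors):
    """Dimension-wise mean across a collection of equal-length font vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# Two hypothetical 2-D font vectors; the real set is 50K high-dimensional vectors.
sample = [[1.0, -2.0], [3.0, 4.0]]
print(average_font(sample))  # [2.0, 1.0]
```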

    1. Using a single 'bolding vector' merely adds that specific font's vector to the original vector. This not only adds bolding to the original font (i.e., thickening the strokes), but also carries over other characteristics of the bolding vector. For example, although my initial font was a lowercase b, adding the single bolding vector caused the new font to become an uppercase B.

      Basic Bolding Vector

      Original Font

      Bolded Font

      • 10 fonts w/ bold quality:

        Bold Font 1

        Bold Font 2

        Bold Font 3

        Bold Font 4

        Bold Font 5

        Bold Font 6

        Bold Font 7

        Bold Font 8

        Bold Font 9

        Bold Font 10

        Got a bolding vector that looked like this:

        Averaged bolding vector

        Code for ApplyBoldingVector()

        ApplyBoldingVector()

        Tested the bolding vector on a random font with thin lines. The averaged bolding vector did well! It bolded the original font without changing the innate characteristics of the original font.

        Original Thin Font

        Font w/ Bolding Vector Applied

      • 10 fonts w/ cursive quality:

        Cursive Font 1

        Cursive Font 2

        Cursive Font 3

        Cursive Font 4

        Cursive Font 5

        Cursive Font 6

        Cursive Font 7

        Cursive Font 8

        Cursive Font 9

        Cursive Font 10

        Code for ApplyCursiveVector() is the same as ApplyBoldingVector(), except it averages a different set of 10 fonts.

        ApplyCursiveVector()

    2. I found 10 bold fonts and 10 non-bold fonts, subtracted the vectors of each pair, and then averaged the differences. Here is the code:

      Code for ApplyAvgBoldingVector()
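    The two approaches differ in how the bolding vector is built: the first averages 10 bold fonts directly, while this second one averages the per-pair (bold − thin) differences, so style traits shared within each pair cancel out and mostly the 'boldness' direction remains. A Python sketch of the idea (function names and the toy 2-D vectors are illustrative, not the actual VectorChooser.vue code):

```python
def mean_vector(vectors):
    """Dimension-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def bolding_vector_from_pairs(bold_fonts, thin_fonts):
    """Average the per-pair (bold - thin) differences."""
    diffs = [[b - t for b, t in zip(bv, tv)]
             for bv, tv in zip(bold_fonts, thin_fonts)]
    return mean_vector(diffs)

def apply_vector(font, direction, strength=1.0):
    """Move a font vector along a style direction."""
    return [f + strength * d for f, d in zip(font, direction)]

# Toy 2-D vectors: dimension 0 is 'boldness', dimension 1 is some other style trait.
bold = [[2.0, 1.0], [2.0, -1.0]]
thin = [[0.0, 1.0], [0.0, -1.0]]
print(bolding_vector_from_pairs(bold, thin))  # [2.0, 0.0] -> pure boldness
```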

  4. I don't think there is a single vector index that controls whether a font is uppercase or lowercase. I wrote a script that compares the sign of each index across uppercase fonts to try to pinpoint a specific index that could determine case, and it turns out there isn't one. Making a font uppercase or lowercase is likely the effect of a combination of multiple indices/dimensions. Here is the script I wrote: uppercase_index.py
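The idea behind the script can be sketched as follows (a simplified stand-in for uppercase_index.py, with made-up vectors):

```python
def consistent_sign_dims(vectors):
    """Indices where every vector shares the same sign -- candidates for a
    single dimension that controls a property like uppercase/lowercase."""
    hits = []
    for d in range(len(vectors[0])):
        signs = {v[d] > 0 for v in vectors}
        if len(signs) == 1:  # all positive or all non-positive
            hits.append(d)
    return hits

# Made-up 'uppercase font' vectors: only dimension 0 is sign-consistent here.
uppercase = [[0.8, -0.2, 0.1], [0.3, 0.4, -0.5], [1.1, -0.9, 0.2]]
print(consistent_sign_dims(uppercase))  # [0]
```

With the real font vectors, no single index passed this check, which is what suggests case is spread across many dimensions.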
  5. This is my created vector based on my personal font tastes:

    Personal Fonts

    Out of curiosity, I set every dimension of the vector to all -0.5s, all 0s, or all 0.5s. These were the resulting fonts:

    Font w/ all -0.5

    Font w/ all 0

    Font w/ all +0.5

  6. Code that implements applyBoldingVector(), applyCursiveVector(), getKNN(), getAvgKNN() in Latent Space Explorer: VectorChooser.vue