Wednesday, March 2, 2022

The Coolest Superhero According to Cosine Similarity

Who is the coolest superhero?

Given only the two text columns, can you find a formula to find the coolest superhero?


In the description of the Superheroes NLP Dataset on Kaggle, its creator Jonathan Besomi, who is also a co-developer of the text preprocessing toolkit Texthero, suggests several analyses. The first one, quoted above, may be the most interesting and challenging.


How can we determine who is the coolest? Social science researchers may first define what cool is, then list the different aspects of the concept and come up with a scoring scheme. This is the analytical way. An alternative idea is somewhat like a dating app: display the pictures of superheroes, let people swipe right for cool and left for not cool, and compile a ranked list. This crowdsourcing approach is based on popular perceptions, which may involve preconceived images.


But if we look for a purely textual analysis, word embedding vectors and cosine similarity are the ready-made tools. The following four attempts represent various ways of applying these tools, some of which may be unconventional, and the results may be unconvincing. Nevertheless, it is a good exercise to get a feel for the potential and limitations of applying natural language processing to large amounts of textual data.


Word Embedding Vector and Cosine Similarity

Some words on the dataset first. The Superheroes NLP Dataset is scraped from Superheroes Database (SHDb). It has features covering power statistics, superpowers, appearance and so on. The columns in focus here are the two text columns, one on the history of each superhero character, the other describing their powers. Both columns have more than a hundred null values; I fill them all with "NA" and join the two columns into a single “text” column.



Word embedding maps words into a high-dimensional vector space. In SpaCy, the NLP library used here, every word in the trained model has a vector of 300 dimensions. In that vector space, words of similar meaning point in roughly the same direction, so the cosine of the angle between two vectors acts as a measure of similarity. We can get the embedding vectors of the words and use a function to calculate cosine similarity, or use SpaCy's similarity method, which does the same thing. SpaCy also has a most_similar method that returns the words most similar to a given word vector.
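As a minimal sketch of the cosine similarity calculation described above, the function below works on plain Python lists. The three-dimensional vectors are made-up stand-ins for SpaCy's 300-dimensional vectors, and the function name is my own choice for illustration, not part of SpaCy.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen here for all-zero vectors,
        # such as words without an embedding.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

Vectors pointing in the same direction score close to 1 regardless of their lengths, unrelated (orthogonal) vectors score 0, and opposite vectors score -1.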


The word "cool" has multiple meanings with many subtle overtones. In popular usage, when we say someone is cool, we usually mean he or she is hip, fashionable, excellent, composed or unique in their own way, but it can also mean unenthusiastic, dispassionate or unfriendly, or refer to a moderately cold temperature. The 100 most similar words to "cool" in SpaCy's large English model have "cool" in different capitalizations at the top, followed by "awesome", "nice", "pretty" and "fun", which are closer to the hip and excellent side of cool.


Word              Similarity Score

COOL                 1.0

COol                 1.0

cool                 1.0

Cool                 1.0

CooL                 1.0

AWesome              0.7616

Awesome              0.7616

AWESOME              0.7616

AWEsome              0.7616

awesome              0.7616

nICE                 0.7374

nice                 0.7374

NICE                 0.7374

Nice                 0.7374

NIce                 0.7374

PRETTY               0.6568

Pretty               0.6568

pretty               0.6568

PRetty               0.6568

Fun                  0.6487

fun                  0.6487

FUN                  0.6487

kinda                0.6418

Kinda                0.6418

KINDA                0.6418

neat                 0.6415

Neat                 0.6415

NEAT                 0.6415

amaZing              0.6363

amazing              0.6363

AMazing              0.6363

AMAZING              0.6363

Amazing              0.6363

Really               0.632

REALLY               0.632

reAlly               0.632

REALLy               0.632

REally               0.632

really               0.632

sO                   0.6226

So                   0.6226

SO                   0.6226

so                   0.6226

WARM                 0.6214

Warm                 0.6214

warm                 0.6214

AWSOME               0.6195

Awsome               0.6195

awsome               0.6195

TOo                  0.6193

TOO                  0.6193

too                  0.6193

toO                  0.6193

Too                  0.6193

ToO                  0.6193

STUFF                0.6189

Stuff                0.6189

stuff                0.6189

Cute                 0.6118

cute                 0.6118

CUTE                 0.6118

CuTe                 0.6118

coolest              0.6084

COOLEST              0.6084

Coolest              0.6084

Chill                0.6075

chill                0.6075

CHILL                0.6075

FUNNY                0.6073

FUnny                0.6073

funny                0.6073

Funny                0.6073

Lol                  0.607

lOl                  0.607

LoL                  0.607

lol                  0.607

lOL                  0.607

LOL                  0.607

LOl                  0.607

loL                  0.607

great                0.6023

Great                0.6023

greAt                0.6023

GREAT                0.6023

GReat                0.6023

Weird                0.6017

WEIRD                0.6017

weird                0.6017

SUPER                0.6002

super                0.6002

Super                0.6002

SUper                0.6002

LOOK                 0.5991

look                 0.5991

Look                 0.5991

LOOKS                0.5976

looks                0.5976

Looks                0.5976

KIND                 0.596

kind                 0.596



The similarity scores of other words show that the composed, nonchalant and confident side of cool ranks low in SpaCy's large English model. While "nerd" and "geek" are regarded as the opposite of cool, their relatively high similarity scores (above 0.4) can probably be attributed to the fact that they are mentioned together with "cool" more often than unrelated topics are. One interesting thing is that "Batman" has a higher similarity score than "Superman", perhaps reflecting the common saying that Batman is cooler than Superman.


Word            Similarity Score

cool                 1.0

cold                 0.5644185

coolest              0.608426

awesome              0.7615766

amazing              0.6363099

chill                0.6075392

confidence           0.16269511

composed             0.19751108

calm                 0.44255415

nonchalant           0.24065392

Batman               0.36020163

Superman             0.31441408

nerd                 0.421054

geek                 0.4376547

uncool               0.33795375

hot                  0.5611704

ice                  0.41151524

fashion              0.37175122


For the vector representation of sentences, paragraphs and even whole documents, SpaCy follows the conventional method of using the centroid vector, that is, the mean of the vectors of all tokens (including punctuation) in the sentence, paragraph or document. Generally speaking, sentences or paragraphs that contain more occurrences of the word "cool" or of words with similar meaning should get higher cosine similarity scores.
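The centroid idea can be sketched in a few lines. This is only an illustration with made-up two-dimensional token vectors, not SpaCy's own implementation; the function name is hypothetical.

```python
def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# One toy vector per token of an imaginary four-token sentence.
token_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 2.0],
]
doc_vector = centroid(token_vectors)  # [1.0, 1.0]
```

Every token contributes equally to the mean, so punctuation and stop words pull the document vector around just as much as content words do.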


The first two test paragraphs below are taken from an article on the meaning of cool, while the third is taken from a news article. The first two paragraphs do have higher similarity scores with the word "cool" than the third.


"It’s tough to define the exact qualities that make someone cool, since pretty much everyone has a different idea of what 'cool' is. For some, it’s a leather-coat-wearing motorcyclist on an open road. For others, it’s the lead singer of a band, an English major surrounded by books, or a really chic neighbor who always burns the best incense. These people are wildly different, and yet they can all be considered cool because they project something special — a certain je ne sais quoi — that makes them stand out."

similarity score with “cool”: 0.6362329653887256


“You know you’re in the presence of a cool person when you feel at ease. The reason? “Cool people are present, focused, and interested in those around them,” Romanoff says. They listen, they try to understand — and as a result, they help everyone feel seen and understood.”

similarity score with “cool”: 0.5910140438438523


“Her visit comes after three high-level diplomatic meetings last week ended with Russian troops still on Ukraine’s borders, but no definitive sign whether Putin would risk a military incursion or instead start talks with the US about arms control in Europe, a more limited agenda than his call for a redrawing of the security architecture of Europe.”

similarity score with “cool”: 0.48815137638105816


First Attempt: Ranking the Raw Texts by Cosine Similarity Scores with the Word "Cool"

For now this seems to be a promising approach. So a simple way of determining who is the coolest superhero is to tokenize each one's text description, get the centroid vector, calculate its cosine similarity with the word "cool" as a "cool score", and rank the superheroes by that score. Is it that simple? Let's see how it goes.



Red Mist    Source:SHDb


So in this first trial, Red Mist is the coolest superhero. The description directly states that he has a "cooler appearance", but among the tokens in the text with the highest similarity scores to the word "cool", stop words like "but" and "some" also get moderately high scores.


“The Red Mist was another teenager following the example of Kick-Ass. But his cooler appearance stole some Kick-Ass fandom. Trying to settle things right, Dave tried to talk to him and force Red Mist to give up his super hero identity but in the end they decided to team-up when a building was on fire.   When Kick-Ass was visited by Hit-Girl and Big Daddy, Red Mist was reluctant to join their team. Both friends were in they way to meet Hit-Girl and her father in their headquarters ready to make a counter offer. But what they saw was a heavenly wounded Big Daddy pleading for help at the hands of Johnny G. At that precise moment, Red Mist was exposed not only as a traitor who set the heroes up but as Johnny G's son. NA”



The second-placed Hulk (Stark Gauntlet) (MCU) and third-placed Batgirl (New 52) fare worse. Both descriptions are extremely short, but they contain common words with fairly high similarity scores to "cool", like "it", "everyone", "good" and "very", so they rank high on the "cool score".


“After Tony created a new gaunlet Hulk uses it to revive everyone. NA”


“NA Barbara is very intellegent she is one of the smartest dc characters . She is also a very good fighter and has many gadgetsand weapons.”


This is not convincing, and it indicates that representing a raw text by its centroid vector discriminates against long descriptions. A long text may say ten things about a superhero; even if one of them has good words on coolness, the other nine dilute it so much that the whole text gets a low similarity score.
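The dilution effect can be reproduced with toy numbers. In the sketch below, all vectors are invented for illustration: one "on-topic" token points almost in the "cool" direction and the "off-topic" tokens point elsewhere. The short text has one of each, while the long text has one on-topic token and nine off-topic ones.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


cool = [1.0, 0.0]        # hypothetical "cool" direction
on_topic = [0.9, 0.1]    # a token close to "cool"
off_topic = [0.0, 1.0]   # a token unrelated to "cool"

short_text = [on_topic, off_topic]
long_text = [on_topic] + [off_topic] * 9

short_score = cosine_similarity(centroid(short_text), cool)
long_score = cosine_similarity(centroid(long_text), cool)
```

Both texts contain exactly the same "cool" content, yet the long one scores far lower because its centroid is dominated by the nine unrelated tokens.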



Second Attempt: TF-IDF Weighted Document Embedding Vectors

A better approach may be to clean the text first and to suppress the weighting of words that are common across documents by means such as TF-IDF. To create TF-IDF weighted embedding vectors for the documents, I adopt the code of John Cardente. We then use the TF-IDF weighted document embedding vectors to calculate cosine similarity scores for a second attempt at a coolness ranking.
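To make the idea concrete, here is a rough sketch of a TF-IDF weighted document vector. It is not John Cardente's code: it uses the textbook weight tf × log(N / df) computed by hand, whereas library implementations typically apply smoothing, and the documents, embeddings and function names are all made up for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weighted_doc_vector(term_weights, embeddings):
    """Weighted average of term vectors; terms without a vector are skipped."""
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    weight_sum = 0.0
    for term, weight in term_weights.items():
        vector = embeddings.get(term)
        if vector is None:
            continue
        total = [t + weight * x for t, x in zip(total, vector)]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no informative terms: fall back to the zero vector
    return [t / weight_sum for t in total]


# Two toy documents that differ in a single word.
docs = [
    ["the", "hero", "is", "cool"],
    ["the", "hero", "is", "strong"],
]
embeddings = {
    "the": [0.5, 0.5], "hero": [0.5, 0.5], "is": [0.5, 0.5],
    "cool": [1.0, 0.0], "strong": [0.0, 1.0],
}
weights = tfidf_weights(docs)
doc_vectors = [weighted_doc_vector(w, embeddings) for w in weights]
```

Words that appear in every document get a weight of zero here, so each document vector is determined by its distinctive word ("cool" or "strong") rather than by the shared words.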


I also adapt the code of Nathan Kelber to display the top 20 tokens of a selected document by TF-IDF score, together with their cosine similarity scores with respect to a given text in embedded vector form (in this case "cool") and the products of the two.



In this second attempt, Red Mist still ranks top, apparently driven largely by the word "cooler". Hulk (Stark Gauntlet) (MCU) and Batgirl (New 52) drop out of the top 10, but the inclusion of Kool-Aid Man, Iceman and Jack Frost shows that temperature-related terms like "kool", "ice", "snow" and "cold" come into focus, which is not the sense of "cool" we are after.


The Most Significant Words in description of Red Mist

Kool-Aid Man. Source:SHDb



“Before he was officially the Kool-Aid Man in 1975, he was the “Pitcher Man”. The Pitcher Man was created in 1954 by Marvin Plotts, an art director for a New York-based advertising agency. General Foods had just purchased Kool-Aid from the drink’s creator Edwin Perkins the year before, and Plotts was charged with drafting a concept to illustrate the copy message: “A 5-cent package makes two quarts. " Working from his Chicago home on a cold day, Potts watched as his young son traced smiley face patterns on a frosty windowpane," recounts Sue Uerling, marketing and communications director for Hastings Museum of Natural and Cultural History. This inspired Marvin Plotts to create a beaming glass pitcher filled with flavorful drink: the Pitcher Man. From there on the joyful pitcher was on all the Kool-Aid’s advertisements. the voice of the man is John Fickley. In 1975 Kraft Foods created the character’s first costume with arms and legs. He also became more of an action figure in commercials — performing extreme sports and busting through brick walls. Kool-Aid Man is famously known for shouting, “Oh, Yeah!” as he is summoned by thirsty children with the phrase, "Hey, Kool-Aid!". Commercials of the era also featured a catchy jingle, always ending with the Kool-Aid Man\'s phrase. Starting in the late 1980s, the character was given dialogue, and his mouth would be digitally manipulated to "move" while the voice actor talked. Sometime in the 1990s, the live-action character was retired; from that point until 2008, the character became entirely computer-generated (although other characters -- such as the kids -- remained live-action). In 2000, a new series of commercials were created for Kool-Aid Fierce and the actor chosen to play Kool-Aid Man was Jon Carr. The most recent Kool-Aid commercials feature a new and different live-action Kool-Aid Man playing street basketball and battling "Cola" to stay balanced on a log. NA”

The Most Significant Words in description of Kool-Aid Man



Third Attempt: Refining the "Cool" Vector

This brings us to another cool property of word embedding vectors. When a vector space model is well trained, it captures the semantic structure of words, so that related word pairs are offset by roughly parallel vectors and we can perform arithmetic on them. If we use "|word|" to denote a word vector, the famous examples are:


|King|-|man|+|woman|=|Queen|


|Paris|-|France|+|Germany|=|Berlin|


When we perform the same arithmetic on SpaCy's word embedding vectors, the closest words to the resulting vectors are in fact "King" and "Germany", but "Queen" and "Berlin" come a close second.


The Most Similar Words for “King-man+woman”

KIng                 0.8024

King                 0.8024

king                 0.8024

KING                 0.8024

Queen                0.7881

queen                0.7881

QUEEN                0.7881

PRINCE               0.6401

prince               0.6401

Prince               0.6401

The Most Similar Words for “paris-france+germany”

Germany              0.8028

GERMANY              0.8028

germany              0.8028

BERLIN               0.7547

Berlin               0.7547

berlin               0.7547

paris                0.6961

PARIS                0.6961

Paris                0.6961

FRANKFURT            0.6708
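The vector arithmetic above can be sketched with a tiny hand-made vocabulary. The three-dimensional embeddings below are invented so that the dimensions loosely mean "royalty", "male" and "female"; `combine` and `most_similar` are hypothetical helper names and not SpaCy's API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combine(positive, negative, embeddings):
    """Add the vectors of `positive` words and subtract those of `negative` words."""
    dims = len(next(iter(embeddings.values())))
    result = [0.0] * dims
    for word in positive:
        result = [r + x for r, x in zip(result, embeddings[word])]
    for word in negative:
        result = [r - x for r, x in zip(result, embeddings[word])]
    return result


def most_similar(query, embeddings, top_n=3):
    """Vocabulary words ranked by cosine similarity to `query`, best first."""
    scored = [(word, cosine_similarity(query, vector))
              for word, vector in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy embeddings: (royalty, male, female). Invented for illustration.
embeddings = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}
query = combine(["king", "woman"], ["man"], embeddings)
neighbours = most_similar(query, embeddings)
```

In this idealized vocabulary the result lands exactly on "queen". Real embeddings are noisier, which is consistent with the tables above, where the starting words "King" and "Germany" still rank first. Listing a word twice in `positive` simply doubles its contribution.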




How about subtracting "cold" from "cool"? The closest word becomes "kewl", a slang spelling of "cool", but with a cosine similarity score of only 0.4408, while "cool" itself scores even lower at 0.3774. But when we add one more "cool" vector to it, the most similar word becomes "cool" again.


The Most Similar Words for “cool-cold” 

Kewl                 0.4408

KEWL                 0.4408

kewl                 0.4408

AWESOME              0.4206

AWEsome              0.4206

awesome              0.4206

Awesome              0.4206

AWesome              0.4206

AWSOME               0.3894

Awsome               0.3894

awsome               0.3894

coool                0.3781

COOOL                0.3781

Coool                0.3781

COol                 0.3774

CooL                 0.3774

cool                 0.3774


The Most Similar Words for “cool+cool-cold”

CooL                 0.8318

COol                 0.8318

cool                 0.8318

COOL                 0.8318

Cool                 0.8318

AWEsome              0.7133

awesome              0.7133

AWesome              0.7133

Awesome              0.7133

AWESOME              0.7133

NICE                 0.6181

nice                 0.6181

Nice                 0.6181

nICE                 0.6181

NIce                 0.6181



In marketing research about brand coolness, it is said that there are ten characteristics associated with cool:


  • Authentic
  • Inspiring
  • Creative
  • Attractive
  • Edgy
  • Rebellious
  • Surprising
  • Mysterious
  • Unique
  • Takes Risks


So when I use the formulation "|cool|+|cool|-|cold|+|authentic|+|rebellious|", the most similar words include not only "cool", "authentic", "awesome" and "rebellious", but also "edgy", "inspiring" and "unique". It looks hopeful that this vector captures much of what we are looking for in the coolest superhero. Is it the kind of formula Besomi had in mind?


The Most Similar Words for “cool+cool-cold+authentic+rebellious”

Cool                 0.7232

COOL                 0.7232

CooL                 0.7232

cool                 0.7232

COol                 0.7232

Authentic            0.6419

AUTHENTIC            0.6419

authentic            0.6419

AWesome              0.624

Awesome              0.624

awesome              0.624

AWESOME              0.624

AWEsome              0.624

QUIRKY               0.602

Quirky               0.602

quirky               0.602

Inspired             0.6

inspired             0.6

INSPIRED             0.6

Funky                0.5964

FUNKY                0.5964

funky                0.5964

Classy               0.5933

CLASSY               0.5933

classy               0.5933

EDGY                 0.5839

edgy                 0.5839

Edgy                 0.5839

amaZing              0.5742

Amazing              0.5742

AMazing              0.5742

amazing              0.5742

AMAZING              0.5742

REBELLIOUS           0.5684

Rebellious           0.5684

rebellious           0.5684

Inspiring            0.5668

inspiring            0.5668

INSPIRING            0.5668

Badass               0.5642

badass               0.5642

BADASS               0.5642

BadAss               0.5642

retro                0.5587

Retro                0.5587

RETRO                0.5587

fun                  0.5578

Fun                  0.5578

FUN                  0.5578

CHIC                 0.5566

chic                 0.5566

Chic                 0.5566

COLORFUL             0.5542

Colorful             0.5542

colorful             0.5542

Coolest              0.5515

COOLEST              0.5515

coolest              0.5515

STYLISH              0.5483

Stylish              0.5483

stylish              0.5483

cute                 0.5435

CUTE                 0.5435

Cute                 0.5435

CuTe                 0.5435

STYLE                0.542

style                0.542

Style                0.542

Trendy               0.5418

TRENDY               0.5418

trendy               0.5418

look                 0.538

Look                 0.538

LOOK                 0.538

NIce                 0.5378

Nice                 0.5378

nICE                 0.5378

nice                 0.5378

NICE                 0.5378

KIND                 0.537

kind                 0.537

Kind                 0.537

artsy                0.536

ARTSY                0.536

Artsy                0.536

fabulous             0.5353

Fabulous             0.5353

FABULOUS             0.5353

FABulous             0.5353

GROOVY               0.5341

groovy               0.5341

Groovy               0.5341

unique               0.534

Unique               0.534

UNIQUE               0.534

sassy                0.5324

Sassy                0.5324

SASSY                0.5324

Charming             0.531

CHARMING             0.531



So in the third formulation of "cool score", I calculate the cosine similarity scores between the TF-IDF weighted document vectors and the new "cool" vector. 



In this third ranking, Red Mist drops to tenth. The new chart topper, Kai from the LEGO Ninjago Movie, may not look too cool, but apparently the sentence "he seems to be curious or perhaps more sassy" in the description helps him get high marks. The second-placed Fandral, who is described as "one of the most good-looking Asgardians which along with his charm, gave him the reputation as a ladies' man", may be more aligned with the conventional view of "cool". Despite the new formulation of "cool", Kool-Aid Man still ranks third.


Kai. Source:SHDb


“Kai\'s attitude is more serious, like Cole in the TV show. Despite this, he is very compassionate and approachable, as he is "always ready with a hug." Much like his own TV show counterpart, Kai is possibly impulsive, enjoys fighting enemies, and is loyal and protective of those he cares about, especially his teammates. Unlike his TV counterpart (beside the alternate face on his figurine), he seems to be curious or perhaps more sassy. He also appears to enjoy describing things with a variety of onomatopoeia. Kai wields a pair of katanas in the trailers, but he may be skillful with other weapons, though these are his favorite weapons. He and the other Ninja have Elemental Powers like their TV show counterparts, allowing him to create and manipulate fire. As seen in the trailers, Kai\'s vehicle has weapons that are fire-based like the flamethrower in his mech. Like the other Ninja, he is a master builder.”


The Most Significant Words in description of Kai

Fandral. Source:SHDb

"Fandral the Dashing was a charter member of the Warriors Three, a trio of Asgardian adventurers consisting of himself, Hogun the Grim, and Volstagg the Voluminous. Fandral was a strong and brave and a good friend to Thor. He fought in countless battles with his friends, to preserve and protect his people. He has been described as one of the most good-looking Asgardians which along with his charm, gave him the reputation as a ladies' man. Besides his looks, Fandral is also known for his skills in swordsmanship and bravery. He and Thor first met when the Warriors Three joined the Thunder God on an expedition to restore the Odinsword that had become cracked.Allegedly, Volstagg the Staggeringly Perfect led the youth Hogun the Good, Fandral the Quite Plain, Thor and Loki in Hel, fighting against all of its hordes for forty days and nights. Eventually Hogun was hurt and forced to retreat, helped by Fandral. Due to the battle, Hogun the Good became Hogun the Grim, and for some reason, Fandral the Quite Plain became Fandral the Dashing later, while Volstagg started eating every time and Thor was deemed worthy of Mjolnir. Fandral possesses all of the various superhuman attributes common among the Asgardians."

The Most Significant Words in description of Fandral


The new ranking has some intriguing results. Apart from Kai, GPL, Lloyd (in two entries), Killow and Masako from the LEGO universe also get into the top 10. Does LEGO have a secret formula to make its characters look cool in descriptions? On the surface, the cool factor is not apparent from the words. On the other hand, Jack Kirby, the creator or co-creator of many classics like the Avengers, X-Men and Fantastic Four, comes in seventh with a long description.



Fourth Attempt: Compare Only the Top Similarity Scoring Terms

Despite Kirby's entry, most of the high-scoring texts are relatively short, which shows that even TF-IDF weighting cannot effectively overcome the dilution problem of long documents. An alternative approach is to concentrate on just the top-scoring terms in each description, instead of taking the similarity score from the vector of the whole document. In the following implementation, I take the top 10 terms in each document by similarity with the new "cool" vector and use the average of their scores as the basis of comparison. To avoid descriptions scoring high simply because they repeat certain fairly high-scoring terms such as "look" and "kind", I only count unique terms. In this way, descriptions with a wider variety of terms similar to the new "cool" vector get higher scores, although long documents now have an advantage, as they are more likely to include various words related to "cool".
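The scoring rule described above can be sketched as follows. This is an illustration under my own assumptions rather than the code behind the results: the embeddings are made-up two-dimensional vectors, tokens are compared case-sensitively, tokens without a vector are ignored, and `top_k` is set to 2 only to keep the toy example small (the text uses 10).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_terms_score(tokens, embeddings, query, top_k=10):
    """Mean similarity to `query` of the top_k unique terms in `tokens`."""
    unique_terms = {term for term in tokens if term in embeddings}
    scores = sorted(
        (cosine_similarity(embeddings[term], query) for term in unique_terms),
        reverse=True,
    )
    top = scores[:top_k]
    return sum(top) / len(top) if top else 0.0


# Toy data, invented for illustration.
query = [1.0, 0.0]  # stands in for the refined "cool" vector
embeddings = {
    "awesome": [1.0, 0.0],
    "look": [1.0, 1.0],
    "the": [0.0, 1.0],
}
repetitive = ["look", "look", "look", "the"]
varied = ["awesome", "look", "the"]

score_repetitive = top_terms_score(repetitive, embeddings, query, top_k=2)
score_varied = top_terms_score(varied, embeddings, query, top_k=2)
```

Because repeated terms are counted once, the description that repeats "look" three times gains nothing from the repetition, and the description with two different high-scoring terms comes out ahead.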




After this change of approach, some of the better-known names finally emerge, although with some surprises. According to this measurement, Hulk is the coolest superhero. In fact, five versions of Hulk, each with a very similar description, get into the top 10. The cool-related terms in the descriptions of the Hulks include "awesome", "amazing", "style", "look", "unique", "fantastic", "great", "truly", "incredibly" and "love".


Hulk. Source:SHDb


The Highest Scoring Terms in Hulk's Description



After four of the Hulks, Sonic the Hedgehog and the Devilman claim the highest positions. Sonic's high scoring terms include "authentic", "amazing" and "unique", while Devilman's description refers to "keep everything cool".


Sonic the Hedgehog. Source:SHDb


The Highest Scoring Terms in Sonic the Hedgehog's Description


Devilman. Source:SHDb


The Highest Scoring Terms in Devilman's Description




Conclusion


Are Red Mist, Kai and Hulk really the coolest superheroes? The question "what is cool?" can itself draw many different opinions, and the question of which superhero is the coolest perhaps even more. The four attempts described above include some unconventional approaches, and many people may not agree with the answers they arrive at. Nonetheless, the answers are based purely on the textual descriptions, using quantifiable criteria.


Using word embedding vectors as a basis of comparison has an advantage over word matching in that it may be better at capturing vague ideas such as "cool", but just like counting the appearances of words, it is far from perfect. The cool words in the history part of a superhero's description may refer to others rather than to the superhero himself or herself, and "look" the verb may be incorrectly counted as "look" the noun, which is one aspect of "cool". But it is perhaps the simpler way, avoiding more complex approaches such as devising matching rules on parts of speech and entities.


Original Dataset from Kaggle and GitHub

Source Code: GitHub and Kaggle

