Who is the coolest superhero?
Given only the two text columns, can you find a formula to find the coolest superhero?
In the description of the Superheroes NLP Dataset on Kaggle, its creator Jonathan Besomi, who is also a co-developer of the text preprocessing toolkit Texthero, suggests several analyses. The first one, quoted above, may be the most interesting and challenging.
How can we determine who is the coolest? Social science researchers might first define what cool is, then list the different aspects of the concept and come up with a scoring scheme. This is the analytical way. An alternative idea works somewhat like a dating app: pictures of superheroes are displayed, users swipe right for cool and left for not cool, and the votes are compiled into a ranked list. This crowdsourcing approach is based on popular perceptions, which may involve preconceived images.
But if we want a purely textual analysis, word embedding vectors and cosine similarity may be the ready-made tools. The following four attempts represent various ways of applying these tools, some of which may be unconventional, and the results may be unconvincing. Nevertheless, it is a good exercise for getting a feel of the potential and limitations of natural language processing on large amounts of textual data.
Word Embedding Vector and Cosine Similarity
Some words on the dataset first. The Superheroes NLP Dataset is scraped from Superheroes Database (SHDb). It has features covering power statistics, superpowers, appearance and so on. The focus here is on two text columns, one on the history of each superhero character and the other describing their powers. Both columns have more than a hundred null values; I fill them all with "NA" and join the two columns into a single “text” column.
Word embedding maps words into a high-dimensional vector space. In the NLP library SpaCy used here, every word in the trained model has a vector of 300 dimensions. In that vector space, words of similar meaning point in roughly the same direction, so the cosine of the angle between two vectors acts as a measure of similarity. We can take the word embedding vectors and calculate the cosine similarity with our own function, or use SpaCy's similarity method, which does the same thing. SpaCy also has a most_similar method that returns the words most similar to a given word vector.
The word "cool" has multiple meanings with many subtle overtones. In popular usage, when we say someone is cool we usually mean he or she is hip, fashionable, excellent, composed or distinctively individual, but the word can also mean unenthusiastic, dispassionate or unfriendly, as well as moderately cold in temperature. The 100 most similar words to "cool" in SpaCy's large English model have "cool" in different letter cases at the top, followed by "awesome", "nice", "pretty" and "fun", which are closer to the hip and excellent side of cool.
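As a rough sketch of this preprocessing step, the snippet below fills missing cells with "NA" and joins the two columns with a space, history first. The column names `history_text` and `powers_text` and the sample rows are assumptions for illustration; the rows are plain dictionaries such as `csv.DictReader` would produce.

```python
# Column names are assumptions; adjust them to match the actual CSV header.
TEXT_COLUMNS = ("history_text", "powers_text")


def join_text(row, columns=TEXT_COLUMNS, filler="NA"):
    """Fill missing or blank cells with `filler` and join the columns with a space."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Treat None and blank strings as missing values.
        if value is None or not str(value).strip():
            value = filler
        parts.append(str(value).strip())
    return " ".join(parts)


# Hypothetical rows standing in for the dataset.
rows = [
    {"name": "A", "history_text": "Grew up in Gotham.", "powers_text": None},
    {"name": "B", "history_text": "", "powers_text": "Can fly."},
]
texts = [join_text(r) for r in rows]
print(texts)
```

This is why some of the descriptions quoted later end or begin with a stray "NA".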
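To make the measure concrete, here is a minimal sketch of a cosine similarity function in plain Python. The three-dimensional vectors are toy values invented for illustration; the real SpaCy vectors have 300 dimensions and would come from the loaded model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
cool = [0.9, 0.8, 0.1]
awesome = [0.8, 0.9, 0.2]
tax = [-0.1, 0.2, 0.9]

print(round(cosine_similarity(cool, awesome), 4))  # vectors pointing the same way
print(round(cosine_similarity(cool, tax), 4))      # vectors pointing elsewhere
```

Because the score depends only on direction, scaling a vector by a positive constant does not change it.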
Word Similarity Score
COOL 1.0
COol 1.0
cool 1.0
Cool 1.0
CooL 1.0
AWesome 0.7616
Awesome 0.7616
AWESOME 0.7616
AWEsome 0.7616
awesome 0.7616
nICE 0.7374
nice 0.7374
NICE 0.7374
Nice 0.7374
NIce 0.7374
PRETTY 0.6568
Pretty 0.6568
pretty 0.6568
PRetty 0.6568
Fun 0.6487
fun 0.6487
FUN 0.6487
kinda 0.6418
Kinda 0.6418
KINDA 0.6418
neat 0.6415
Neat 0.6415
NEAT 0.6415
amaZing 0.6363
amazing 0.6363
AMazing 0.6363
AMAZING 0.6363
Amazing 0.6363
Really 0.632
REALLY 0.632
reAlly 0.632
REALLy 0.632
REally 0.632
really 0.632
sO 0.6226
So 0.6226
SO 0.6226
so 0.6226
WARM 0.6214
Warm 0.6214
warm 0.6214
AWSOME 0.6195
Awsome 0.6195
awsome 0.6195
TOo 0.6193
TOO 0.6193
too 0.6193
toO 0.6193
Too 0.6193
ToO 0.6193
STUFF 0.6189
Stuff 0.6189
stuff 0.6189
Cute 0.6118
cute 0.6118
CUTE 0.6118
CuTe 0.6118
coolest 0.6084
COOLEST 0.6084
Coolest 0.6084
Chill 0.6075
chill 0.6075
CHILL 0.6075
FUNNY 0.6073
FUnny 0.6073
funny 0.6073
Funny 0.6073
Lol 0.607
lOl 0.607
LoL 0.607
lol 0.607
lOL 0.607
LOL 0.607
LOl 0.607
loL 0.607
great 0.6023
Great 0.6023
greAt 0.6023
GREAT 0.6023
GReat 0.6023
Weird 0.6017
WEIRD 0.6017
weird 0.6017
SUPER 0.6002
super 0.6002
Super 0.6002
SUper 0.6002
LOOK 0.5991
look 0.5991
Look 0.5991
LOOKS 0.5976
looks 0.5976
Looks 0.5976
KIND 0.596
kind 0.596
The similarity scores of other words show that the composed, nonchalant and confident side of cool ranks low in SpaCy's large English model. While "nerd" and "geek" are regarded as the opposite of cool, their relatively high similarity scores (above 0.4) can be attributed to the fact that they appear together with "cool" more often than unrelated words do. One interesting thing is that "Batman" has a higher similarity score than "Superman", perhaps reflecting the common saying that Batman is cooler than Superman.
Word Similarity Score
cool 1.0
cold 0.5644185
coolest 0.608426
awesome 0.7615766
amazing 0.6363099
chill 0.6075392
confidence 0.16269511
composed 0.19751108
calm 0.44255415
nonchalant 0.24065392
Batman 0.36020163
Superman 0.31441408
nerd 0.421054
geek 0.4376547
uncool 0.33795375
hot 0.5611704
ice 0.41151524
fashion 0.37175122
For the vector representation of sentences, paragraphs and even documents, SpaCy follows the conventional method of using centroid vectors, that is, taking the mean of the vectors of all tokens (including punctuation) in the sentence, paragraph or document. Generally speaking, sentences or paragraphs which contain more occurrences of "cool" or words of similar meaning should get higher cosine similarity scores.
The next two paragraphs for testing are taken from an article on the meaning of cool, while the third is taken from a news article. The first two paragraphs do have higher similarity scores with the word "cool" than the third.
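The centroid idea can be sketched in a few lines. The two-dimensional embeddings and the token lists below are invented for illustration and stand in for the model's real vectors and tokenizer.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "cool": [1.0, 0.0],
    "awesome": [0.9, 0.1],
    "troops": [0.0, 1.0],
    "borders": [0.1, 0.9],
    ".": [0.5, 0.5],
}


def text_vector(tokens):
    """Average the vectors of every token, punctuation included."""
    return centroid([EMBEDDINGS[t] for t in tokens])


cool_text = ["awesome", "cool", "."]
news_text = ["troops", "borders", "."]
query = EMBEDDINGS["cool"]
cool_sim = cosine_similarity(text_vector(cool_text), query)
news_sim = cosine_similarity(text_vector(news_text), query)
print(round(cool_sim, 4), round(news_sim, 4))
```

Note that the punctuation token contributes to the mean just like any other word, which is one reason raw text scores can be noisy.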
"It’s tough to define the exact qualities that make someone cool, since pretty much everyone has a different idea of what 'cool' is. For some, it’s a leather-coat-wearing motorcyclist on an open road. For others, it’s the lead singer of a band, an English major surrounded by books, or a really chic neighbor who always burns the best incense. These people are wildly different, and yet they can all be considered cool because they project something special — a certain je ne sais quoi — that makes them stand out."
similarity score with “cool”: 0.6362329653887256
“You know you’re in the presence of a cool person when you feel at ease. The reason? “Cool people are present, focused, and interested in those around them,” Romanoff says. They listen, they try to understand — and as a result, they help everyone feel seen and understood.”
similarity score with “cool”: 0.5910140438438523
“Her visit comes after three high-level diplomatic meetings last week ended with Russian troops still on Ukraine’s borders, but no definitive sign whether Putin would risk a military incursion or instead start talks with the US about arms control in Europe, a more limited agenda than his call for a redrawing of the security architecture of Europe.”
similarity score with “cool”: 0.48815137638105816
First Attempt: Ranking the Raw Texts by Cosine Similarity Scores with the Word "Cool"
For now it seems to be a promising approach. So a simple way of determining who is the coolest superhero is to get each one's text description tokenized and get the centroid vector, then calculate the cosine similarity with the word "cool" as a "cool score" and get them ranked. Is it that simple? Let's see how it goes.
| Red Mist. Source:SHDb |
So in this first trial, Red Mist is the coolest superhero. His description directly states that he has a "cooler appearance", but among the tokens in the text with the highest similarity scores with the word "cool", stop words like "but" and "some" also score moderately high.
“The Red Mist was another teenager following the example of Kick-Ass. But his cooler appearance stole some Kick-Ass fandom. Trying to settle things right, Dave tried to talk to him and force Red Mist to give up his super hero identity but in the end they decided to team-up when a building was on fire. When Kick-Ass was visited by Hit-Girl and Big Daddy, Red Mist was reluctant to join their team. Both friends were in they way to meet Hit-Girl and her father in their headquarters ready to make a counter offer. But what they saw was a heavenly wounded Big Daddy pleading for help at the hands of Johnny G. At that precise moment, Red Mist was exposed not only as a traitor who set the heroes up but as Johnny G's son. NA”
The second-placed Hulk (Stark Gauntlet) (MCU) and third-placed Batgirl (New 52) fare worse. Both descriptions are extremely short, but because they contain common words with fairly high similarity scores with "cool", such as "it", "everyone", "good" and "very", they rank high in "cool score".
“After Tony created a new gaunlet Hulk uses it to revive everyone. NA”
“NA Barbara is very intellegent she is one of the smartest dc characters . She is also a very good fighter and has many gadgetsand weapons.”
This is not convincing. It indicates that representing a raw text by its centroid vector discriminates against long descriptions. A long text may say ten things about a superhero; even if one of them speaks well of his or her coolness, the other nine dilute it so much that the whole text gets a low similarity score.
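The dilution effect is easy to reproduce with toy numbers. In the sketch below, a short text made only of a vaguely "cool-ish" common word outranks a long text that mentions "cool" once among nine unrelated tokens. The embeddings and both descriptions are invented for illustration.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {"cool": [1.0, 0.0], "very": [0.7, 0.3], "fought": [0.0, 1.0]}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cool_score(tokens, query=EMBEDDINGS["cool"]):
    """Cosine similarity between the plain centroid of `tokens` and `query`."""
    vectors = [EMBEDDINGS[t] for t in tokens]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return cosine_similarity(mean, query)


# Hypothetical tokenised descriptions.
descriptions = {
    "short_and_vague": ["very", "very"],
    "long_with_one_cool": ["cool"] + ["fought"] * 9,
}
ranking = sorted(descriptions, key=lambda name: cool_score(descriptions[name]), reverse=True)
for name in ranking:
    print(name, round(cool_score(descriptions[name]), 3))
```

The single "cool" token scores 1.0 on its own, but averaged with nine unrelated vectors it barely moves the centroid.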
Second Attempt: TF-IDF Weighted Document Embedding Vectors
A better approach may be to clean the text first and suppress the weighting of words that are common across documents by means such as TF-IDF. To create TF-IDF weighted embedding vectors for documents, I adopt the code of John Cardente. We then use the TF-IDF weighted document embedding vectors to calculate cosine similarity scores for a second attempt at a coolness ranking.
I also adapt the code of Nathan Kelber to display the top 20 tokens in a selected document by TF-IDF score, together with each token's cosine similarity score with a given text in embedded vector form (in this case 'cool') and the product of the two.
In this second attempt, Red Mist is still ranked top, apparently helped significantly by the word "cooler". Hulk (Stark Gauntlet) (MCU) and Batgirl (New 52) drop out of the top 10, but the inclusion of Kool-Aid Man, Iceman and Jack Frost shows that temperature-related terms like "kool", "ice", "snow" and "cold" come into focus, which is not the sense of "cool" we are after.
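The following is a simplified sketch of the idea, not the code of John Cardente or Nathan Kelber. It assumes a plain idf of log(N / df), toy two-dimensional embeddings and two invented token lists; a production version would use a library TF-IDF implementation and the model's real vectors.

```python
import math
from collections import Counter

# Toy embeddings and tokenised documents, invented for illustration only.
EMBEDDINGS = {
    "cool": [1.0, 0.0],
    "cooler": [0.9, 0.1],
    "but": [0.6, 0.4],
    "teenager": [0.1, 0.9],
    "fighter": [0.2, 0.8],
}
DOCS = {
    "doc_a": ["but", "cooler", "teenager", "but"],
    "doc_b": ["but", "fighter", "but"],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tfidf_weights(docs):
    """Return {doc: {token: tf * idf}} using idf = log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            tok: (count / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        }
    return weights


def weighted_doc_vector(token_weights):
    """TF-IDF weighted average of the token vectors (zero vector if all weights are 0)."""
    total = sum(token_weights.values())
    dims = len(EMBEDDINGS["cool"])
    if total == 0:
        return [0.0] * dims
    return [
        sum(w * EMBEDDINGS[tok][d] for tok, w in token_weights.items()) / total
        for d in range(dims)
    ]


query = EMBEDDINGS["cool"]
weights = tfidf_weights(DOCS)
scores = {}
for name, token_weights in weights.items():
    scores[name] = cosine_similarity(weighted_doc_vector(token_weights), query)
    print(name, round(scores[name], 4))
    # Per-token breakdown: TF-IDF weight, similarity to the query, and their product.
    for tok, w in sorted(token_weights.items(), key=lambda kv: kv[1], reverse=True):
        sim = cosine_similarity(EMBEDDINGS[tok], query)
        print(f"  {tok:10s} tfidf={w:.3f} sim={sim:.3f} product={w * sim:.3f}")
```

With this idf, a word that appears in every document (here "but") gets a weight of zero and drops out of the document vector entirely, which is the suppression effect the second attempt relies on.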
| The Most Significant Words in description of Red Mist |
“Before he was officially the Kool-Aid Man in 1975, he was the “Pitcher Man”. The Pitcher Man was created in 1954 by Marvin Plotts, an art director for a New York-based advertising agency. General Foods had just purchased Kool-Aid from the drink’s creator Edwin Perkins the year before, and Plotts was charged with drafting a concept to illustrate the copy message: “A 5-cent package makes two quarts. " Working from his Chicago home on a cold day, Potts watched as his young son traced smiley face patterns on a frosty windowpane," recounts Sue Uerling, marketing and communications director for Hastings Museum of Natural and Cultural History. This inspired Marvin Plotts to create a beaming glass pitcher filled with flavorful drink: the Pitcher Man. From there on the joyful pitcher was on all the Kool-Aid’s advertisements. the voice of the man is John Fickley. In 1975 Kraft Foods created the character’s first costume with arms and legs. He also became more of an action figure in commercials — performing extreme sports and busting through brick walls. Kool-Aid Man is famously known for shouting, “Oh, Yeah!” as he is summoned by thirsty children with the phrase, "Hey, Kool-Aid!". Commercials of the era also featured a catchy jingle, always ending with the Kool-Aid Man\'s phrase. Starting in the late 1980s, the character was given dialogue, and his mouth would be digitally manipulated to "move" while the voice actor talked. Sometime in the 1990s, the live-action character was retired; from that point until 2008, the character became entirely computer-generated (although other characters -- such as the kids -- remained live-action). In 2000, a new series of commercials were created for Kool-Aid Fierce and the actor chosen to play Kool-Aid Man was Jon Carr. The most recent Kool-Aid commercials feature a new and different live-action Kool-Aid Man playing street basketball and battling "Cola" to stay balanced on a log. NA”
| The Most Significant Words in description of Kool-Aid Man |
Third Attempt: Refining the "Cool" Vector
This brings us to another cool property of word embedding vectors. When a vector space model is well trained, it captures the semantic structure of words, so that the offsets between related word pairs become roughly parallel vectors on which we can perform arithmetic. If we use "|word|" to denote a word vector, the famous examples are:
|King|-|man|+|woman|=|Queen|
|Paris|-|France|+|Germany|=|Berlin|
When we perform the same arithmetic on SpaCy's word embedding vectors, the closest words to the resulting vectors are in fact "King" and "Germany", but "Queen" and "Berlin" come a close second.
The Most Similar Words for “King-man+woman”
KIng 0.8024
King 0.8024
king 0.8024
KING 0.8024
Queen 0.7881
queen 0.7881
QUEEN 0.7881
PRINCE 0.6401
prince 0.6401
Prince 0.6401
The Most Similar Words for “paris-france+germany”
Germany 0.8028
GERMANY 0.8028
germany 0.8028
BERLIN 0.7547
Berlin 0.7547
berlin 0.7547
paris 0.6961
PARIS 0.6961
Paris 0.6961
FRANKFURT 0.6708
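The arithmetic itself is simple to sketch. The toy vocabulary below is hand-built so that the analogy works exactly (axis 0 loosely stands for "royalty", axis 1 for "gender"); real embeddings are much noisier, which is why the input word itself can end up ranked first, as in the tables above.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(plus, minus=()):
    """Add the vectors of the words in `plus` and subtract those in `minus`."""
    dims = len(next(iter(EMBEDDINGS.values())))
    result = [0.0] * dims
    for word in plus:
        result = [r + x for r, x in zip(result, EMBEDDINGS[word])]
    for word in minus:
        result = [r - x for r, x in zip(result, EMBEDDINGS[word])]
    return result


def most_similar(vector, n=3):
    """Vocabulary words ranked by cosine similarity to `vector`."""
    scored = [(word, cosine_similarity(vector, vec)) for word, vec in EMBEDDINGS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


target = combine(plus=["king", "woman"], minus=["man"])
for word, score in most_similar(target):
    print(word, round(score, 4))
```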
How about subtracting "cold" from "cool"? The closest word becomes "kewl", a slang spelling of "cool", but with a cosine similarity score of only 0.4408, while "cool" itself scores even lower at 0.3774. But when we add one more "cool" vector to the result, the most similar word becomes "cool" again.
The Most Similar Words for “cool-cold”
Kewl 0.4408
KEWL 0.4408
kewl 0.4408
AWESOME 0.4206
AWEsome 0.4206
awesome 0.4206
Awesome 0.4206
AWesome 0.4206
AWSOME 0.3894
Awsome 0.3894
awsome 0.3894
coool 0.3781
COOOL 0.3781
Coool 0.3781
COol 0.3774
CooL 0.3774
cool 0.3774
The Most Similar Words for “cool+cool-cold”
CooL 0.8318
COol 0.8318
cool 0.8318
COOL 0.8318
Cool 0.8318
AWEsome 0.7133
awesome 0.7133
AWesome 0.7133
Awesome 0.7133
AWESOME 0.7133
NICE 0.6181
nice 0.6181
Nice 0.6181
nICE 0.6181
NIce 0.6181
In marketing research about brand coolness, it is said that there are ten characteristics associated with cool:
- Authentic
- Inspiring
- Creative
- Attractive
- Edgy
- Rebellious
- Surprising
- Mysterious
- Unique
- Takes Risks
So when I use the formulation "|cool|+|cool|-|cold|+|authentic|+|rebellious|", the most similar words include not only "cool", "authentic", "awesome" and "rebellious"; "edgy", "inspiring" and "unique" appear too. It looks hopeful that this vector captures much of what we are looking for in the coolest superhero. Is this the kind of formula Besomi had in mind?
The Most Similar Words for “cool+cool-cold+authentic+rebellious”
Cool 0.7232
COOL 0.7232
CooL 0.7232
cool 0.7232
COol 0.7232
Authentic 0.6419
AUTHENTIC 0.6419
authentic 0.6419
AWesome 0.624
Awesome 0.624
awesome 0.624
AWESOME 0.624
AWEsome 0.624
QUIRKY 0.602
Quirky 0.602
quirky 0.602
Inspired 0.6
inspired 0.6
INSPIRED 0.6
Funky 0.5964
FUNKY 0.5964
funky 0.5964
Classy 0.5933
CLASSY 0.5933
classy 0.5933
EDGY 0.5839
edgy 0.5839
Edgy 0.5839
amaZing 0.5742
Amazing 0.5742
AMazing 0.5742
amazing 0.5742
AMAZING 0.5742
REBELLIOUS 0.5684
Rebellious 0.5684
rebellious 0.5684
Inspiring 0.5668
inspiring 0.5668
INSPIRING 0.5668
Badass 0.5642
badass 0.5642
BADASS 0.5642
BadAss 0.5642
retro 0.5587
Retro 0.5587
RETRO 0.5587
fun 0.5578
Fun 0.5578
FUN 0.5578
CHIC 0.5566
chic 0.5566
Chic 0.5566
COLORFUL 0.5542
Colorful 0.5542
colorful 0.5542
Coolest 0.5515
COOLEST 0.5515
coolest 0.5515
STYLISH 0.5483
Stylish 0.5483
stylish 0.5483
cute 0.5435
CUTE 0.5435
Cute 0.5435
CuTe 0.5435
STYLE 0.542
style 0.542
Style 0.542
Trendy 0.5418
TRENDY 0.5418
trendy 0.5418
look 0.538
Look 0.538
LOOK 0.538
NIce 0.5378
Nice 0.5378
nICE 0.5378
nice 0.5378
NICE 0.5378
KIND 0.537
kind 0.537
Kind 0.537
artsy 0.536
ARTSY 0.536
Artsy 0.536
fabulous 0.5353
Fabulous 0.5353
FABULOUS 0.5353
FABulous 0.5353
GROOVY 0.5341
groovy 0.5341
Groovy 0.5341
unique 0.534
Unique 0.534
UNIQUE 0.534
sassy 0.5324
Sassy 0.5324
SASSY 0.5324
Charming 0.531
CHARMING 0.531
So in the third formulation of "cool score", I calculate the cosine similarity scores between the TF-IDF weighted document vectors and the new "cool" vector.
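The refined query can be viewed as a weighted sum of word vectors, with "cool" counted twice and "cold" given a negative weight. The sketch below uses invented three-dimensional embeddings (axes loosely standing for hipness, temperature and attitude) to show how the new vector moves away from temperature words and towards attitude words; the numbers are illustrative only.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "cool": [0.7, 0.6, 0.2],
    "cold": [0.0, 1.0, 0.0],
    "authentic": [0.5, 0.0, 0.6],
    "rebellious": [0.3, 0.0, 0.9],
    "ice": [0.0, 0.9, 0.1],
    "edgy": [0.5, 0.0, 0.8],
}

# |cool| + |cool| - |cold| + |authentic| + |rebellious|
FORMULA = {"cool": 2.0, "cold": -1.0, "authentic": 1.0, "rebellious": 1.0}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_sum(formula):
    """Sum of word vectors, each multiplied by its coefficient in `formula`."""
    dims = len(next(iter(EMBEDDINGS.values())))
    return [sum(c * EMBEDDINGS[w][d] for w, c in formula.items()) for d in range(dims)]


new_cool = weighted_sum(FORMULA)
for word in ("ice", "edgy"):
    before = cosine_similarity(EMBEDDINGS["cool"], EMBEDDINGS[word])
    after = cosine_similarity(new_cool, EMBEDDINGS[word])
    print(f"{word}: plain cool={before:.3f}, refined cool={after:.3f}")
```

The same `new_cool` vector then replaces the plain "cool" vector as the query when scoring the document vectors.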
In this third ranking, Red Mist retreats to tenth. The new chart-topper, Kai from the LEGO Ninjago Movie, may not look too cool, but apparently the sentence "he seems to be curious or perhaps more sassy" in the description helps him get high marks. The second-placed Fandral, who is described as "one of the most good-looking Asgardians which along with his charm, gave him the reputation as a ladies' man", may be more aligned with the conventional view of "cool". Despite the new formulation of "cool", Kool-Aid Man still ranks third.
“Kai\'s attitude is more serious, like Cole in the TV show. Despite this, he is very compassionate and approachable, as he is "always ready with a hug." Much like his own TV show counterpart, Kai is possibly impulsive, enjoys fighting enemies, and is loyal and protective of those he cares about, especially his teammates. Unlike his TV counterpart (beside the alternate face on his figurine), he seems to be curious or perhaps more sassy. He also appears to enjoy describing things with a variety of onomatopoeia. Kai wields a pair of katanas in the trailers, but he may be skillful with other weapons, though these are his favorite weapons. He and the other Ninja have Elemental Powers like their TV show counterparts, allowing him to create and manipulate fire. As seen in the trailers, Kai\'s vehicle has weapons that are fire-based like the flamethrower in his mech. Like the other Ninja, he is a master builder.”
| The Most Significant Words in description of Kai |
| Fandral. Source:SHDb |
"Fandral the Dashing was a charter member of the Warriors Three, a trio of Asgardian adventurers consisting of himself, Hogun the Grim, and Volstagg the Voluminous. Fandral was a strong and brave and a good friend to Thor. He fought in countless battles with his friends, to preserve and protect his people. He has been described as one of the most good-looking Asgardians which along with his charm, gave him the reputation as a ladies' man. Besides his looks, Fandral is also known for his skills in swordsmanship and bravery. He and Thor first met when the Warriors Three joined the Thunder God on an expedition to restore the Odinsword that had become cracked.Allegedly, Volstagg the Staggeringly Perfect led the youth Hogun the Good, Fandral the Quite Plain, Thor and Loki in Hel, fighting against all of its hordes for forty days and nights. Eventually Hogun was hurt and forced to retreat, helped by Fandral. Due to the battle, Hogun the Good became Hogun the Grim, and for some reason, Fandral the Quite Plain became Fandral the Dashing later, while Volstagg started eating every time and Thor was deemed worthy of Mjolnir. Fandral possesses all of the various superhuman attributes common among the Asgardians."
| The Most Significant Words in description of Fandral |
The new ranking has some intriguing results. Apart from Kai, GPL, Lloyd (in two entries), Killow and Masako from the LEGO universe get into the top 10. Does LEGO have a secret formula to make its characters look cool in descriptions? On the surface, the cool factor is not apparent in the words. On the other hand, Jack Kirby, the creator or co-creator of many classics like the Avengers, X-Men and Fantastic Four, comes seventh with a long description.
Fourth Attempt: Compare Only the Top Similarity Scoring Terms
Despite Kirby's entry, most of the texts with high similarity scores are relatively short, reflecting that even with TF-IDF weighting the dilution problem of long documents is not overcome effectively. An alternative approach is to concentrate on the top-scoring terms in each description, instead of taking the similarity score from the vector of the whole document. In the following implementation, I take the 10 terms in each document that score highest against the new "cool" vector and use their average as the basis of comparison. To avoid descriptions scoring high simply because they repeat certain fairly high-scoring terms such as "look" and "kind", I only count unique terms. In this way, descriptions with a greater variety of terms similar to the new "cool" vector get higher scores, but long documents now have an advantage, as they are more likely to include various words related to "cool".
After this change of approach, some of the better-known names finally emerge, although with some surprises. According to this measure, Hulk is the coolest superhero. In fact, five versions of Hulk, each with very similar descriptions, get into the top 10. The cool-related terms in the descriptions of the Hulks include "awesome", "amazing", "style", "look", "unique", "fantastic", "great", "truly", "incredibly" and "love".
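A minimal sketch of this scoring rule follows. The embeddings, the token lists and the small `k` are invented for illustration, and averaging over fewer than `k` terms when a document is short is an assumption of this sketch rather than a detail confirmed above.

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "awesome": [0.9, 0.1],
    "unique": [0.8, 0.2],
    "look": [0.7, 0.3],
    "battle": [0.1, 0.9],
    "fought": [0.0, 1.0],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_terms_score(tokens, query, k=10):
    """Mean similarity to `query` of the k best-scoring unique in-vocabulary terms."""
    unique_terms = {t for t in tokens if t in EMBEDDINGS}
    sims = sorted(
        (cosine_similarity(EMBEDDINGS[t], query) for t in unique_terms),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


query = [1.0, 0.0]  # stand-in for the refined "cool" vector
repeated = ["look", "look", "look", "fought"]
varied = ["awesome", "unique", "fought", "battle", "fought"]
print(round(top_terms_score(repeated, query, k=2), 3))
print(round(top_terms_score(varied, query, k=2), 3))
```

Because terms are deduplicated before ranking, repeating "look" three times earns nothing extra, while a description with several different high-scoring words is rewarded.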
| Hulk. Source:SHDb |
| The Highest Scoring Terms in Hulk's Description |
After four of the Hulks, Sonic the Hedgehog and the Devilman claim the highest positions. Sonic's high scoring terms include "authentic", "amazing" and "unique", while Devilman's description refers to "keep everything cool".
| Sonic the Hedgehog. Source:SHDb |
| The Highest Scoring Terms in Sonic the Hedgehog's Description |
| Devilman. Source:SHDb |
| The Highest Scoring Terms in Devilman's Description |
Conclusion
Are Red Mist, Kai and Hulk really the coolest superheroes? The question of "what is cool?" draws many different opinions in itself, and the question of which superhero is the coolest perhaps even more so. The four attempts described above include some unconventional approaches, and many people may not agree with the answers they arrive at. Nonetheless, the superheroes are judged purely on the basis of textual descriptions, with some quantifiable criteria.
Using word embedding vectors as a basis of comparison has an advantage over word matching in that it might be better at capturing vague ideas such as "cool", but just like counting the appearances of words, it is far from perfect. The cool words in the history part of a superhero's description may refer to others rather than to the superhero himself or herself, and "look" the verb may be counted incorrectly as "look" the noun, which is one aspect of "cool". But perhaps it is the simpler way, without resorting to much more complex approaches such as devising matching rules on parts of speech and entities.
Original Dataset from Kaggle and Github
Source Code: Github and Kaggle