Emoji2Vec (😀️✌️➡️)

Posted by Christopher Mertin on December 21, 2018 in Project • 7 min read

Setup

The data set was built by downloading over 19 million tweets from Twitter using the tweepy Python module. To reduce noise, it was restricted to tweets that contain at least one emoji. The data can be downloaded here, and the source code for downloading the tweets can be seen below. Note: these tweets have been cleaned and anonymized, as described below.

import nltk
import json
import pickle
import tweepy
from emoji import UNICODE_EMOJI
import datetime
import os.path
import subprocess

def HasEmojis(text):
    # True if the tweet contains at least one emoji character
    return len(set(text).intersection(EMOJIS)) > 0

def CommonWords(filename):
    # Common words used as the streaming filter terms
    return [line.strip() for line in open(filename)]

class StdOutListener(tweepy.StreamListener):

    def on_data(self, data):
        global tweets
        global n_tweets
        try:
            text = json.loads(data)["text"]
            if HasEmojis(text):
                tweets.append(text)
                # Flush to disk in batches of 1,000 tweets
                if len(tweets) == 1000:
                    with open(OUT_FILE, 'a') as f:
                        for tweet in tweets:
                            f.write(" ".join(tweet.split()) + '\n')

                    t_ = datetime.datetime.now()
                    t_ = t_.strftime("%Y-%m-%d %I:%M %p")
                    n_tweets = n_tweets + len(tweets)
                    print(str(t_) + ":", "{:,}".format(n_tweets))
                    tweets = []
            return True
        except Exception:
            # Skip payloads without a "text" field (e.g. delete/limit notices)
            return True

    def on_error(self, status):
        # Returning False disconnects the stream on an error status code
        return False


ACCESS_TOKEN = "ACCESS_TOKEN_KEY"
ACCESS_TOKEN_SECRET = "ACCESS_TOKEN_SECRET"
CONSUMER_TOKEN = "CONSUMER_TOKEN_KEY"
CONSUMER_TOKEN_SECRET = "CONSUMER_TOKEN_SECRET"
EMOJIS = set(list(UNICODE_EMOJI))
OUT_FILE = "twitter_data.dat"
filter_words = CommonWords("common_words.dat")


print("Initializing...", end='')
if os.path.isfile(OUT_FILE):
    result = subprocess.run(["wc", "-l", OUT_FILE], stdout=subprocess.PIPE)
    n_tweets = int(str(result.stdout.split()[0])[2:-1])
else:
    n_tweets = 0
    result = subprocess.run(["touch", OUT_FILE], stdout=subprocess.PIPE)
print("DONE")

tweets = []

l = StdOutListener()
auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_TOKEN_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
stream = tweepy.Stream(auth, l)

stream.filter(track=filter_words)
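
The contents of common_words.dat aren't shown above. As a rough, hedged sketch, one way to produce such a list of broad filter terms is from NLTK's Brown corpus word frequencies; the filename matches the script above, but the exact word list the original used is unknown:

import nltk
from nltk.corpus import brown

nltk.download("brown")

# Take a few hundred of the most frequent English words as broad stream filters
freq = nltk.FreqDist(w.lower() for w in brown.words() if w.isalpha())
common = [w for w, _ in freq.most_common(400)]

with open("common_words.dat", "w") as f:
    f.write("\n".join(common) + "\n")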

After downloading these tweets, I ran them through the following script which cleaned them up and anonymized them. The script does the following:

  • Turns twitter URLs into TWITTER_URL
  • Turns retweets into RETWEET
  • Turns users into USER
  • Turns other URLs into URL
  • Separates emojis from the surrounding text with spaces
  • Collapses runs of a repeated emoji into a single emoji
  • Turns hashtags into HASHTAG
  • Converts emojis to a standardized format emoji_0, emoji_1, …
  • Removes non-word characters
  • Removes tweets with no words in them
  • Removes stopwords (provided by NLTK)

The functions that do this can be found below:

import re

# `stop_words` (NLTK stopword set) and `EMOJI_DICT` (emoji -> emoji_N mapping)
# are defined elsewhere; see the sketch after this block.

def CleanTweet(tweet):
    # Anonymize retweets, user mentions, URLs, and hashtags
    temp = re.sub(r"(RT|retweet|from|via)((?:\b\W*@\w+)+)", "RETWEET", tweet, flags=re.IGNORECASE)
    temp = re.sub(r"@(\w+)", "USER", temp)
    temp = re.sub(r"https://t\.co/[0-9A-Za-z]+", "TWITTER_URL", temp)
    temp = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "URL", temp)
    temp = re.sub(r"#(\w+)", "HASHTAG", temp)
    temp = re.sub(r"\'", "", temp)
    temp = re.sub(r"’", "", temp)
    temp = EmojiConvert(temp)

    # Strip non-word characters, then remove any stopwords that appear
    temp = re.sub(r"\W+", " ", temp.lower())
    stop_found = set(temp.split()).intersection(stop_words)
    for word in stop_found:
        rgx = "\\b" + word + "\\b"
        temp = re.sub(rgx, " ", temp)

    return temp

def EmojiConvert(tweet):
    # Replace each emoji (and runs of the same emoji) with its emoji_N token
    tweet_chars = set(tweet)
    emojis_loc = EMOJIS.intersection(tweet_chars)
    if len(emojis_loc) > 0:
        for emj in emojis_loc:
            rgx = re.escape(emj) + "+"
            tweet = re.sub(rgx, " " + EMOJI_DICT[emj] + " ", tweet)
    return " ".join(tweet.split())

After doing this, analysis can be performed on the tweets to determine the parameters for training the word2vec model. A word2vec model slides a window over the text and looks at words that co-occur within that window. For example, given a segment of text like “the dog wags his tail,” the model would learn that “tail” and “wags” are related to “dog” since they occur near each other. More importantly, words like “dog” and “cat” appear in similar kinds of sentences (for example, “my dog is awesome” and “I love my cat”), so the model learns that “dog” and “cat” are similar to one another, i.e., they’re both pets.

This can be used to determine what an emoji essentially means by looking at which words occur most frequently alongside it in a tweet. When training a word2vec model, you are free to choose the dimensionality of the word vectors. A dimensionality of N=300 is a common industry standard and works well here; going much above or below this tends to make it harder for the model to differentiate between words.

To determine the window size for the model, I looked at the distribution of tweet lengths. The idea is that tweets with emojis are semantically coherent: happy or celebratory tweets carry happy-style emojis, and likewise for other emoji types. The mean length turns out to be 15 words per tweet, and the distribution shows that the majority of tweets are at or below this length, so a window size of 15 is a good choice for training a word2vec model to capture the sentiment of emojis.
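
A short sketch of how this distribution can be computed from the cleaned tweet file (the filename matches the training script below; the plotting details are only illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Number of tokens in each cleaned tweet
lengths = [len(line.split()) for line in open("emoji_tweets.dat")]
print("Mean tweet length: {:.1f} words".format(np.mean(lengths)))

# Histogram of tweet lengths, used to pick the word2vec window size
plt.hist(lengths, bins=range(1, 50))
plt.xlabel("Words per tweet")
plt.ylabel("Number of tweets")
plt.title("Tweet Length Distribution")
plt.show()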

Tweet Length Distribution

The word2vec model was trained with the following script:

from gensim.models import Word2Vec

def ReadTweets(filename):
    # Each line is one cleaned tweet; split it into a list of tokens
    tweets = []
    with open(filename, 'r') as f:
        for line in f:
            tweets.append(line.split())
    return tweets

tweets = ReadTweets("emoji_tweets.dat")

# 300-dimensional vectors, window of 15 tokens, drop words seen < 100 times
# (in gensim 4+ these parameters are named vector_size and epochs)
model = Word2Vec(tweets, size=300, window=15, min_count=100, workers=3, iter=2)
model.save("emoji2vec.model")

In the trained model, I required a minimum word count of 100 occurrences, so words that show up fewer than 100 times are not included in the word2vec vocabulary. This was done mainly to limit the size of the model so that it can be loaded into memory on my web server.
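
As a quick check of the effect of this cutoff, the size of the surviving vocabulary can be inspected after training (using the same gensim 3.x attribute the Flask code below relies on):

from gensim.models import Word2Vec

model = Word2Vec.load("emoji2vec.model")
# Number of word and emoji tokens that met the min_count=100 threshold
print("{:,} tokens in vocabulary".format(len(model.wv.vocab)))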

Finally, the code runs on my AWS web server using Flask so that it can be interactive. The code for my Flask module is below:

from flask import Flask
from flask_restful import Api, Resource, reqparse
from gensim.models import Word2Vec
import numpy as np
import pickle

app = Flask(__name__)
api = Api(app)

model_file = "emoji2vec.model"
emoji_dict_file = "emoji_dict.pkl"
dict_emoji = {}
model = Word2Vec.load(model_file)
with open(emoji_dict_file, "rb") as f:
    emoji_dict = pickle.load(f)

# Invert the mapping: emoji_N token -> emoji character
for emoji in emoji_dict.keys():
    dict_emoji[emoji_dict[emoji]] = emoji

# Build a matrix of unit-normalized vectors for every emoji token in the vocabulary
emoji_matrix = []
emoji_index = []
model_vocab = model.wv.vocab
for emoji in np.sort(list(dict_emoji.keys())):
    if emoji not in model_vocab:
        continue
    emoji_index.append(emoji)
    emoji_matrix.append(model.wv[emoji]/np.linalg.norm(model.wv[emoji]))

# Transpose so a dot product with a word vector yields one similarity per emoji
emoji_matrix = np.array(emoji_matrix)
emoji_matrix = np.transpose(emoji_matrix)


def MostSimilar(word, emoji2word):
    global emoji_matrix
    global emoji_index
    global model
    global dict_emoji

    # The query parameter arrives as a string, so convert it to a bool
    if emoji2word == "True":
        emoji2word = True
    else:
        emoji2word = False

    results = {}
    results["valid"] = False
    word = word.lower()

    # Get the most similar words for a given emoji
    if not emoji2word:
        if word in emoji_dict.keys():
            word = emoji_dict[word]
        elif word not in model.wv.vocab:
            return results

        # Ask for 30 neighbors, then keep the first 10 that are not emojis
        closest = model.wv.most_similar([word], topn=30)
        count = 0
        for candidate in closest:
            if candidate[0] not in emoji_index:
                results[str(count+1)] = {
                    "word": candidate[0],
                    "similarity": "%.5f" % candidate[1]
                }
                count += 1
            if count >= 10:
                break
        results["valid"] = True
        return results
    # Get the most similar emojis for a given word
    else:
        if word in emoji_dict.keys():
            word = emoji_dict[word]
        if word not in model.wv.vocab:
            return results
        # Normalize the query vector; one dot product then gives the cosine
        # similarity against every emoji at once
        wv = model.wv[word]/np.linalg.norm(model.wv[word])
        residual = np.dot(wv, emoji_matrix)
        # Indices of the 10 highest similarities, in descending order
        res_max = residual.argsort()[-10:][::-1]
        for idx, emoji_idx in enumerate(res_max):
            results[str(idx+1)] = {
                "word": dict_emoji[emoji_index[emoji_idx]],
                "similarity": "%.5f" % residual[emoji_idx]
            }
        results["valid"] = True
        return results


class Emoji2Vec(Resource):
    def get(self, text):
        parser = reqparse.RequestParser()
        parser.add_argument("emoji")
        parser.add_argument("text")
        args = parser.parse_args()
        result = MostSimilar(args["text"], args["emoji"])

        if result["valid"] == True:
            return result, 202
        else:
            return  result, 404

api.add_resource(Emoji2Vec, "/emoji2vec/<string:text>")

app.run(debug=False)

The MostSimilar function retrieves the word vectors that are closest to a given input. However, some logic had to be implemented to separate the emojis from the words.

For returning only emojis, we can build a matrix of the word vectors of all of the emojis. This matrix is then used to compare a given input against every emoji efficiently.

We can exploit the fact that the cosine similarity is defined as

$$\cos(\theta) = \frac{\mathbf{A}\cdot\mathbf{B}}{\left|\left|\mathbf{A}\right|\right|_{2}\left|\left|\mathbf{B}\right|\right|_{2}}$$

This allows us to split the equation into parts when building the matrix. Each emoji vector is normalized by dividing by its norm, which takes care of \(\left|\left|\mathbf{B}\right|\right|_{2}\) ahead of time. Then, at query time, we divide the input word vector by its norm before taking its dot product with the resulting emoji matrix, which yields the cosine similarity against every emoji at once. Finally, we get the most similar emojis by argsorting the similarities and taking the top indices.

This is a much more efficient way of doing cosine similarity comparisons in Python than iterating over the emojis and computing each cosine similarity one at a time.
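
As a small self-contained illustration of this trick (random vectors stand in for the emoji vectors; every name in this snippet is for illustration only), the matrix form gives the same similarities as a pairwise loop:

import numpy as np

rng = np.random.default_rng(0)
emoji_vecs = rng.normal(size=(50, 300))   # stand-ins for 50 emoji word vectors
query = rng.normal(size=300)              # stand-in for an input word vector

# Pre-normalize the emoji vectors once and store them column-wise
emoji_matrix = (emoji_vecs / np.linalg.norm(emoji_vecs, axis=1, keepdims=True)).T

# One dot product with the normalized query = cosine similarity to every emoji
sims = np.dot(query / np.linalg.norm(query), emoji_matrix)

# Equivalent, but slower: one cosine similarity at a time
loop_sims = [np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
             for v in emoji_vecs]
assert np.allclose(sims, loop_sims)

top10 = sims.argsort()[-10:][::-1]        # indices of the 10 most similar emojis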

For pulling out the words, we can rely on gensim's backend C implementation, which iterates over all of the words in the vocabulary and finds the closest vectors. This is far less expensive than doing the same loop in pure Python, so scanning the entire vocabulary is cheap. We then request the top 30 results and skip any that are emojis, which filters the output down to just words even though the model was trained on both.
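
Once the Flask app is running, the endpoint can be queried directly over HTTP. The sketch below is only illustrative: the text and emoji parameters match the ones parsed by the resource, but the host, port (Flask's local default), and the query path segment are placeholders:

import requests

# Most similar emojis for a given word
resp = requests.get("http://127.0.0.1:5000/emoji2vec/query",
                    params={"text": "happy", "emoji": "True"})
print(resp.status_code)   # 202 if the query is in the vocabulary, 404 otherwise
print(resp.json())        # {"valid": true, "1": {"word": ..., "similarity": ...}, ...}

# Most similar (non-emoji) words for a given emoji
resp = requests.get("http://127.0.0.1:5000/emoji2vec/query",
                    params={"text": "😂", "emoji": "False"})
print(resp.json())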

The interactive search function can be found below:



Word/Emoji | Similarity Score

Conclusion

Overall, I am quite happy with the results. With more time and resources, they could be improved upon. For example, noun chunking could be used to join words together based on the grammar of each tweet. I tried this with gensim's Phrases function, but it wound up joining emojis together, which subtracted from the overall relevance of emojis to words.

On top of this, I would have liked to pull more than the original 19 million tweets. Since these tweets were collected over the course of just a few weeks, there are instances where the similarity results are heavily influenced by the news cycle. More tweets, or at the very least collecting them over a longer time frame, would have mitigated this issue.