Natural Language Processing
Natural Language Processing (NLP) Introduction
In this chapter we cover the following topics:
- Working with Bag of Words
- Implementing TF-IDF
- Working with Skip-gram Embeddings
- Working with CBOW Embeddings
- Making Predictions with Word2vec
- Using Doc2vec for Sentiment Analysis
Up to this point, we have only considered machine learning algorithms that mostly operate on numerical inputs. If we want to use text, we must find a way to convert the text into numbers. There are many ways to do this and we will explore a few common ways this is achieved.
If we consider the sentence “tensorflow makes machine learning easy”, we could convert the words to numbers in the order that we observe them. The sentence would then become “1 2 3 4 5”. When we see a new sentence, “machine learning is easy”, we can translate it as “3 4 0 5”, denoting words we haven’t seen before with an index of zero. With these two examples, we have limited our vocabulary to six indices. With large texts we can choose how many words we want to keep, usually keeping the most frequent words and labeling everything else with the index of zero.
If the word “learning” has a numerical value of 4 and the word “makes” has a numerical value of 2, it would be natural to assume that “learning” is somehow twice “makes”. Since we do not want this kind of numerical relationship between words, we treat these numbers as categorical indices, not as values with a relational meaning.
Another problem is that these two sentences are of different lengths. Every observation (a sentence, in this case) needs to produce an input of the same size for the model we wish to create. To get around this, we convert each sentence into a sparse vector that has a value of one at a specific index if the corresponding word occurs in the sentence.
| word | tensorflow | makes | machine | learning | easy |
|---|---|---|---|---|---|
| word index | 1 | 2 | 3 | 4 | 5 |
The occurrence vector would then be:
sentence1 = [0, 1, 1, 1, 1, 1]
This is a vector of length 6 because we have 5 words in our vocabulary and we reserve the 0-th index for unknown or rare words.
Now consider the sentence, ‘machine learning is easy’.
| word | machine | learning | is | easy |
|---|---|---|---|---|
| word index | 3 | 4 | 0 | 5 |
The occurrence vector for this sentence is now:
sentence2 = [1, 0, 0, 1, 1, 1]
Notice that we now have a procedure that converts any sentence to a fixed length numerical vector.
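To make the procedure concrete, here is a minimal sketch in plain Python; the helper name `sentence_to_vector` is just for illustration.

```python
# Vocabulary from the example above; index 0 is reserved for unknown/rare words
vocab = {'tensorflow': 1, 'makes': 2, 'machine': 3, 'learning': 4, 'easy': 5}

def sentence_to_vector(sentence, vocab):
    # One slot per vocabulary word, plus one slot (index 0) for unknown words
    vector = [0] * (len(vocab) + 1)
    for word in sentence.lower().split():
        index = vocab.get(word, 0)  # unseen words map to index 0
        vector[index] = 1           # mark occurrence (not the count)
    return vector

print(sentence_to_vector('tensorflow makes machine learning easy', vocab))  # [0, 1, 1, 1, 1, 1]
print(sentence_to_vector('machine learning is easy', vocab))                # [1, 0, 0, 1, 1, 1]
```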
A disadvantage to this method is that we lose any indication of word order. The two sentences “tensorflow makes machine learning easy” and “machine learning makes tensorflow easy” would result in the same sentence vector.
It is also worthwhile to note that the length of these vectors is equal to the size of our vocabulary that we pick.
It is common to pick a very large vocabulary, so these sentence vectors can be very sparse. This type of embedding that we have covered in this introduction is called “bag of words”. We will implement this in the next section.
Another drawback is that the words “is” and “tensorflow” both contribute the same value of one to the sentence vector. We can imagine that the occurrence of the word “is” is probably less important than the occurrence of the word “tensorflow”.
We will explore different types of embeddings in this chapter that attempt to address these ideas, but first we start with an implementation of bag of words.
Working with Bag of Words
In this example, we will download and preprocess the ham/spam text data, and then use a one-hot encoding to build a bag-of-words set of features. We will use these one-hot vectors with logistic regression to predict whether a text is spam or ham.
We start by loading the necessary libraries.
```python
import tensorflow as tf
import os
```
We start a computation graph session.
```python
# Start a graph session
sess = tf.Session()
```
We check whether the data has already been downloaded; if not, we download it and save it for future use.

```python
save_file_name = os.path.join('temp', 'temp_spam_data.csv')
```
To reduce the potential vocabulary size, we normalize the text. To do this, we remove the influence of capitalization and numbers in the text.
```python
# Relabel 'spam' as 1, 'ham' as 0
```
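Expanding on the snippet above, here is a minimal sketch of the relabeling and normalization steps, assuming `texts` and `target` hold the message bodies and string labels parsed from the downloaded CSV (the variable names are illustrative).

```python
import string

# Relabel 'spam' as 1, 'ham' as 0 (assumes target currently holds the string labels)
target = [1 if label == 'spam' else 0 for label in target]

# Lower-case everything, then strip punctuation and digits
texts = [x.lower() for x in texts]
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

# Collapse the extra whitespace left behind by the removals
texts = [' '.join(x.split()) for x in texts]
```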
To determine a good sentence length to pad/crop at, we plot a histogram of text lengths (in words).
```python
%matplotlib inline
```
We crop/pad all texts to be 25 words long. We will also filter out any words that do not appear at least 3 times.
```python
# Choose max text word length at 25
```
TensorFlow has a built-in text processing function called `VocabularyProcessor()`. We use this function to process the texts.

```python
# Setup vocabulary processor
```
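A rough sketch of this step in TensorFlow 1.x (where the processor lives under `tf.contrib.learn`), assuming the cleaned `texts` list from above; the names `sentence_size`, `min_word_freq`, and `text_processed` are illustrative.

```python
import numpy as np

sentence_size = 25   # maximum text word length chosen above
min_word_freq = 3    # drop words that appear fewer than 3 times

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(
    sentence_size, min_frequency=min_word_freq)

# Fit the vocabulary and turn every text into a fixed-length array of word indices
text_processed = np.array(list(vocab_processor.fit_transform(texts)))
embedding_size = len(vocab_processor.vocabulary_)
```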
To test our logistic model (predicting spam/ham), we split the texts into a train and test set.
```python
# Split up data set into train/test
```
For the one-hot encoding, we set up an identity matrix for the TensorFlow embedding lookup.
We also create the variables and placeholders for the logistic regression we will perform.
```python
# Setup Index Matrix for one-hot-encoding
```
Next, we create the text-word embedding lookup with the prior identity matrix.
Our logistic regression will use the counts of the words as the input. The counts are created by summing the embedding output across the rows.
Then we declare the logistic regression operations. Note that we do not wrap the logistic operations in the sigmoid function because this will be done in the loss function later on.
```python
# Text-Vocab Embedding
```
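Continuing the sketch, the identity matrix, placeholders, embedding lookup, and logistic output described above might look like the following; the names `A`, `b`, `x_data`, and `model_output` are assumptions, and `sentence_size`/`embedding_size` come from the earlier sketch.

```python
# Identity matrix: row i is the one-hot vector for word index i
identity_mat = tf.diag(tf.ones(shape=[embedding_size]))

# Logistic regression variables: one weight per vocabulary index, plus a bias
A = tf.Variable(tf.random_normal(shape=[embedding_size, 1]))
b = tf.Variable(tf.random_normal(shape=[1, 1]))

# Placeholders: one sentence of word indices, and its 0/1 target
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)

# Map each word index to its one-hot row, then sum the rows to get word counts
x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
x_col_sums = tf.reduce_sum(x_embed, 0)

# Logistic regression output (the sigmoid is applied later, inside the loss)
x_col_sums_2D = tf.expand_dims(x_col_sums, 0)
model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)
```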
Now we declare our loss function (which has the sigmoid built in), prediction operations, optimizer, and initialize the variables.
```python
# Declare loss function (Cross Entropy loss)
```
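A sketch of the loss (with the sigmoid built in), the prediction operation, the optimizer, and variable initialization under the same assumptions; the learning rate is illustrative.

```python
# Sigmoid cross-entropy loss: the sigmoid is applied to model_output internally
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=model_output, labels=y_target))

# Prediction: apply the sigmoid explicitly
prediction = tf.sigmoid(model_output)

# Optimizer and variable initialization
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)
sess.run(tf.global_variables_initializer())
```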
Now we loop through the iterations and fit the logistic regression on whether or not the text is spam or ham.
```python
# Start Logistic Regression
```
Starting Training Over 4459 Sentences.
Training Observation #50: Loss = 4.7342416e-14
...
Training Observation #4450: Loss = 3.811978e-11
Now that we have a logistic model, we can evaluate the accuracy on the test dataset.
```python
# Get test set accuracy
```
Getting Test Set Accuracy For 1115 Sentences.
Test Observation #100
Test Observation #200
Test Observation #300
Test Observation #400
Test Observation #500
Test Observation #600
Let’s look at the training accuracy over all the iterations.
```python
# Plot training accuracy over time
```
It is worthwhile to mention the motivation of limiting the sentence (or text) size. In this example we limited the text size to 25 words. This is a common practice with bag of words because it limits the effect of text length on the prediction. You can imagine that if we find a word, “meeting” for example, that is predictive of a text being ham (not spam), then a spam message might get through by putting in many occurrences of that word at the end. In fact, this is a common problem with imbalanced target data. Imbalanced data might occur in this situation, since spam may be hard to find and ham may be easy to find. Because of this fact, our vocabulary that we create might be heavily skewed toward words represented in the ham part of our data (more ham means more words are represented in ham than spam). If we allow unlimited length of texts, then spammers might take advantage of this and create very long texts, which have a higher probability of triggering non-spam word factors in our logistic model.
In the next section, we attempt to tackle this problem in a better way using the frequency of word occurrence to determine the values of the word embeddings.
Implementing TF-IDF
TF-IDF is an acronym that stands for Term Frequency - Inverse Document Frequency. This term is essentially the product of the term frequency and the inverse document frequency for each word.
In the prior recipe, we introduced the bag of words methodology, which assigned a value of one for every occurrence of a word in a sentence. This is probably not ideal, as each category of sentence (spam and ham in the prior example) most likely contains “the”, “and”, and other common words at similar frequencies, whereas words like “viagra” and “sale” should probably carry increased importance in figuring out whether or not the text is spam.

We first want to take into consideration the term frequency. Here we consider the frequency with which a word occurs in an individual entry. The purpose of this part (TF) is to find terms that appear to be important in each entry.
But words like “the” and “and” may appear very frequently in every entry. We want to down-weight the importance of these words, so we can imagine that multiplying the above term frequency (TF) by the inverse of the whole document frequency might help find important words. But since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency. This leaves us with the following formula for TF-IDF for each word in each document entry:

$$w_{tf\text{-}idf} = w_{tf} \cdot \log\left(\frac{N}{w_{df}}\right)$$

Where $w_{tf}$ is the frequency of the word in a given document, $w_{df}$ is the number of documents in the corpus that contain the word, and $N$ is the total number of documents. We can imagine that high values of TF-IDF might indicate words that are very important in determining what a document is about.
Here we implement TF-IDF (Term Frequency - Inverse Document Frequency) for the spam-ham text data.
We will use a hybrid approach, encoding the texts with scikit-learn's TfidfVectorizer. Then we will use the regular TensorFlow logistic regression outline.
Creating the TF-IDF vectors requires us to load all the texts into memory and count the occurrences of each word before we can start training our model. Because of this, it is not fully implemented in TensorFlow, so we will use scikit-learn to create our TF-IDF embedding and TensorFlow to fit the logistic model.
We start by loading the necessary libraries.
```python
import tensorflow as tf
```
Start a computational graph session.
```python
sess = tf.Session()
```
We set two parameters, `batch_size` and `max_features`. `batch_size` is the size of the batch we will train our logistic model on, and `max_features` is the maximum number of TF-IDF textual words we will use in our logistic regression.

```python
batch_size = 200
max_features = 1000
```
We check whether the data has already been downloaded; if not, we download it and save it for future use.

```python
save_file_name = 'temp_spam_data.csv'
```
We now clean our texts. This will decrease our vocabulary size by converting everything to lower case, removing punctuation and getting rid of numbers.
```python
texts = [x[1] for x in text_data]
```
Define tokenizer function and create the TF-IDF vectors with SciKit-Learn.
```python
def tokenizer(text):
```
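Here is a hedged sketch of the full step, assuming NLTK (and its tokenizer data) is available and reusing the `max_features` value from above; a plain `str.split` tokenizer would also work.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizer(text):
    # Split a cleaned text into word tokens
    return nltk.word_tokenize(text)

# Build the sparse TF-IDF matrix, keeping only the max_features highest-scoring words
tfidf = TfidfVectorizer(tokenizer=tokenizer,
                        stop_words='english',
                        max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)
```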
Split up data set into train/test.
```python
texts[:3]
```
['go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat',
'ok lar joking wif u oni',
'free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry questionstd txt ratetcs apply overs']
```python
sparse_tfidf_texts[:3]
```
<3x1000 sparse matrix of type '<class 'numpy.float64'>'
with 26 stored elements in Compressed Sparse Row format>
```python
train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
```
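The line above samples 80% of the row indices at random for training. A sketch of the rest of the split might look like this, assuming `target` holds the 0/1 labels from the earlier relabeling step.

```python
# Test indices are whatever was not chosen for training
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))

texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
```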
Now we create the variables and placeholders necessary for logistic regression. After which, we declare our logistic regression operation. Remember that the sigmoid part of the logistic regression will be in the loss function.
```python
# Create variables for logistic regression
```
Next, we declare the loss function (which has the sigmoid in it), and the prediction function. The prediction function will have to have a sigmoid inside of it because it is not in the model output.
```python
# Declare loss function (Cross Entropy loss)
```
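As a short sketch (with `model_output` and `y_target` assumed from the previous step): the loss applies the sigmoid internally, while the prediction applies it explicitly and rounds to 0 or 1.

```python
# Loss: the sigmoid is applied to model_output inside the cross-entropy function
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=model_output, labels=y_target))

# Prediction: apply the sigmoid explicitly, round to 0/1, and measure accuracy
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)
```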
Now we create the optimization function and initialize the model variables.
```python
# Declare optimizer
```
Finally, we perform our logistic regression on the 1000 TF-IDF features.
```python
train_loss = []
```
Generation # 500. Train Loss (Test Loss): 1.07 (1.08). Train Acc (Test Acc): 0.36 (0.35)
...
Generation # 9500. Train Loss (Test Loss): 0.39 (0.46). Train Acc (Test Acc): 0.88 (0.85)
Generation # 10000. Train Loss (Test Loss): 0.52 (0.46). Train Acc (Test Acc): 0.80 (0.85)
Here is matplotlib code to plot the loss and accuracies.
```python
# Plot loss over time
```
Working with Skip-gram Embeddings
Prior to this recipe, we have not considered the order of words to be relevant in creating word embeddings. In early 2013, Tomas Mikolov and other researchers at Google authored a paper about creating word embeddings that address this issue (https://arxiv.org/abs/1301.3781), and they named their methods “word2vec”.
The basic idea is to create word embeddings that capture a relational aspect of words. We seek to understand how various words are related to each other. Some examples of how these embeddings might behave are as follows.
- “king” – “man” + “woman” = “queen”
- “india pale ale” – “hops” + “malt” = “stout”
We might achieve such a numerical representation of words if we only consider their positional relationship to each other. If we could analyse a large enough source of coherent documents, we might find that the words “king”, “man”, and “queen” are mentioned close to each other in our texts. If we also know that “man” and “woman” are related in a different way, then we might conclude that “man” is to “king” as “woman” is to “queen”, and so on.

To find such an embedding, we will use a neural network that predicts surrounding words given an input word. We could just as easily have switched that and tried to predict a target word given a set of surrounding words, but we will start with the prior method. Both are variations of the word2vec procedure. The prior method, predicting the surrounding words (the context) from a target word, is called the skip-gram model. In the next recipe we will implement the other method, predicting the target word from the context, which is called the continuous bag of words (CBOW) method.
See the figure below for an illustration.
In this example, we will download and preprocess the movie review data.
From this data set, we will compute/fit the skip-gram model of the word2vec algorithm.

Skip-gram is based on predicting the surrounding words from a given target word. For the example sentence “the cat in the hat”:

- target word: [“hat”]
- context words: [“the”, “cat”, “in”, “the”]
- target-context pairs: (“hat”, “the”), (“hat”, “cat”), (“hat”, “in”), (“hat”, “the”)
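To make the pair generation concrete, here is a small illustrative sketch in plain Python; the window size of 4 is chosen only so that the whole sentence falls inside the window, reproducing the pairs listed above (the real recipe uses a smaller, configurable window).

```python
# Build (target, context) pairs for one example sentence
sentence = 'the cat in the hat'.split()
window_size = 4  # number of words to look at on each side of the target

pairs = []
for i, target_word in enumerate(sentence):
    window = sentence[max(i - window_size, 0):i] + sentence[i + 1:i + 1 + window_size]
    for context_word in window:
        pairs.append((target_word, context_word))

# Pairs generated when 'hat' is the target word
print([p for p in pairs if p[0] == 'hat'])
# [('hat', 'the'), ('hat', 'cat'), ('hat', 'in'), ('hat', 'the')]
```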
We start by loading the necessary libraries.
```python
import tensorflow as tf
```
Start a computational graph session.
```python
sess = tf.Session()
```
Declare model parameters
```python
batch_size = 100  # How many sets of words to train on at once.
```
We will remove stop words and create a test validation set of words.
```python
# Declare stop words
```
Next, we load the movie review data. We check whether the data has already been downloaded; if not, we download and save it.
```python
def load_movie_data():
```
Now we create a function that normalizes/cleans the text.
```python
# Normalize text
```
With the normalized movie reviews, we now build a dictionary of words.
```python
# Build dictionary of words
```
With the above dictionary, we can turn our text data into lists of integers by looking up each word in the dictionary.
```python
def text_to_numbers(sentences, word_dict):
```
Let us now build a function that will generate random data points from our text and parameters.
```python
# Generate data randomly (N words behind, target, N words ahead)
```
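A rough sketch of what such a generator might do for the skip-gram case, assuming `sentences` is the list of index-encoded texts produced by `text_to_numbers()`; the function name and details are illustrative, not the exact recipe code.

```python
import numpy as np

def generate_batch_data(sentences, batch_size, window_size):
    batch_data, label_data = [], []
    while len(batch_data) < batch_size:
        # Pick a random sentence and a random target position inside it
        sentence = sentences[np.random.randint(len(sentences))]
        if len(sentence) < 2:
            continue
        target_ix = np.random.randint(len(sentence))
        # Context is the window of word indices around the target (excluding the target)
        window = sentence[max(target_ix - window_size, 0):target_ix] \
            + sentence[target_ix + 1:target_ix + 1 + window_size]
        for context_word in window:
            batch_data.append(sentence[target_ix])  # input: target word index
            label_data.append([context_word])       # label: one surrounding word index
    # Trim to the exact batch size and convert to arrays
    batch_data = np.array(batch_data[:batch_size])
    label_data = np.array(label_data[:batch_size])
    return batch_data, label_data
```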
Next we define our model and placeholders.
```python
# Define Embeddings:
```
```python
embed
```
<tf.Tensor 'embedding_lookup/Identity:0' shape=(100, 100) dtype=float32>
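The following sketch shows definitions that would produce the `(100, 100)` tensor above (batch size 100, embedding size 100); the `vocabulary_size` value and the variable names are assumptions.

```python
embedding_size = 100
vocabulary_size = 10000  # illustrative; the real value comes from the word dictionary

# Embedding matrix: one trainable row of length embedding_size per vocabulary word
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Placeholders: a batch of target word indices and their context-word labels
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])

# Look up the embedding row for each input word -> shape (batch_size, embedding_size)
embed = tf.nn.embedding_lookup(embeddings, x_inputs)
```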
Here is our loss function, optimizer, cosine similarity, and initialization of the model variables.
For the loss function we will minimize the average of the NCE loss (noise-contrastive estimation).
```python
# Get loss from prediction
```
WARNING:tensorflow:From <ipython-input-18-90dede70073c>:13: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
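Continuing the sketch above, the NCE loss, optimizer, and cosine similarity might be declared roughly as follows. The `keep_dims` argument is what triggers the deprecation warning shown, and `valid_examples` (the indices of the validation words) is assumed to have been built earlier; the learning rate and number of negative samples are illustrative.

```python
num_sampled = 5  # negative examples to sample for the NCE loss

# NCE weights/biases: one output row per vocabulary word
nce_weights = tf.Variable(tf.truncated_normal(
    [vocabulary_size, embedding_size], stddev=1.0 / (embedding_size ** 0.5)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Average NCE (noise-contrastive estimation) loss over the batch
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

# Cosine similarity between the validation-word embeddings and every word embedding
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)  # assumed validation indices
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

sess.run(tf.global_variables_initializer())
```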
```python
sim_init = sess.run(similarity)
```
Now we can train our skip-gram model.
Note that the training loop contains the line `nearest = (-sim[j, :]).argsort()[1:top_k+1]`. The negative of the similarity matrix is used because `argsort()` sorts values from least to greatest. Since we want the greatest similarities, we take the negative of the similarity matrix and then call `argsort()`.
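A quick NumPy illustration of why negating the row gives a descending order; the similarity values here are made up, with index 0 standing in for the word itself.

```python
import numpy as np

sim_row = np.array([1.0, 0.1, 0.9, 0.4])  # index 0: the word's similarity to itself
top_k = 2

# argsort on the negated row returns indices from largest to smallest similarity;
# position 0 of that ordering is the word itself, so [1:top_k+1] skips it
nearest = (-sim_row).argsort()[1:top_k + 1]
print(nearest)  # [2 3]
```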
```python
# Run the skip gram model.
```
Loss at step 500 : 19.154987335205078
...
Nearest to cliche: sparkling, chosen, duty, thoughtful, pile,
Nearest to love: shimmering, transcend, economical, review, affable,
Nearest to hate: tried, recycled, anybody, complexity, enthusiasm,
Nearest to silly: denis, audacity, gutwrenching, irritating, callar,
Nearest to sad: adequately, surreal, paint, human, exploitative,
Loss at step 60500 : 3.153820514678955
Working with CBOW Embeddings
In this recipe we will implement the CBOW (continuous bag of words) method of word2vec. It is very similar to the skip-gram method, except we are predicting a single target word from a surrounding window of context words.
In the prior example we treated each combination of window and target as a group of paired inputs and outputs, but with CBOW we will add the surrounding window embeddings together to get one embedding to predict the target word embedding.
Most of the code will stay the same, except we will need to change how we create the embeddings and how we generate the data from the sentences.
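As a rough sketch of that change (reusing the `embeddings`, `batch_size`, and `embedding_size` names from the skip-gram sketches, which are assumptions here), the context-window embeddings can be looked up and summed like this:

```python
window_size = 2  # 2 words on each side -> 2 * window_size context indices per example

# Placeholders: a batch of context-word index windows and the target-word labels
x_inputs = tf.placeholder(tf.int32, shape=[batch_size, 2 * window_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])

# Look up each context word's embedding and add them into one vector per example
embed = tf.zeros([batch_size, embedding_size])
for element in range(2 * window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])
```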
To make the code easier to read, we have moved all the major functions to a separate file, called 'text_helpers.py', in the same directory. This file holds the data loading, text normalization, dictionary creation, and batch generation functions. These functions are exactly as they appear in the prior recipe, “Working with Skip-gram Embeddings”, except where noted.
See the following illustration of a CBOW example.
```python
print('Creating Model')
```
```python
# Get loss from prediction
```