Natural Language Processing
Natural Language Processing (NLP) Introduction
In this chapter we cover the following topics:
- Working with Bag of Words
- Implementing TF-IDF
- Working with Skip-gram Embeddings
- Working with CBOW Embeddings
- Making Predictions with Word2vec
- Using Doc2vec for Sentiment Analysis
Up to this point, we have only considered machine learning algorithms that mostly operate on numerical inputs. If we want to use text, we must find a way to convert the text into numbers. There are many ways to do this and we will explore a few common ways this is achieved.
If we consider the sentence “tensorflow makes machine learning easy”, we could convert the words to numbers in the order that we observe them. The sentence would then become “1 2 3 4 5”. When we see a new sentence, “machine learning is easy”, we can translate it as “3 4 0 5”, denoting words we haven’t seen before with an index of zero. With these two examples, we have limited our vocabulary to six indices. With large texts we can choose how many words we want to keep, usually keeping the most frequent words and labeling everything else with the index of zero.
If the word “learning” has a numerical value of 4 and the word “makes” has a numerical value of 2, it would be natural to assume that “learning” is somehow twice “makes”. Since we do not want this kind of numerical relationship between words, we treat these numbers as categorical indices, not as values with a relational meaning.
Another problem is that these two sentences are of different lengths. Every observation (a sentence, in this case) needs to produce an input of the same size for the model we wish to create. To get around this, we convert each sentence into a sparse vector that has a value of one at a specific index if the corresponding word occurs in the sentence.
| word | tensorflow | makes | machine | learning | easy |
|---|---|---|---|---|---|
| word index | 1 | 2 | 3 | 4 | 5 |
The occurrence vector would then be:
sentence1 = [0, 1, 1, 1, 1, 1]
This is a vector of length 6 because we have 5 words in our vocabulary and we reserve the 0-th index for unknown or rare words.
Now consider the sentence, ‘machine learning is easy’.
| word | machine | learning | is | easy |
|---|---|---|---|---|
| word index | 3 | 4 | 0 | 5 |
The occurrence vector for this sentence is now:
sentence2 = [1, 0, 0, 1, 1, 1]
Notice that we now have a procedure that converts any sentence to a fixed length numerical vector.
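To make the procedure concrete, here is a minimal sketch in plain Python; the helper name `sentence_to_vector` is just for illustration.

```python
# Vocabulary from the example above; index 0 is reserved for unknown/rare words
vocab = {'tensorflow': 1, 'makes': 2, 'machine': 3, 'learning': 4, 'easy': 5}

def sentence_to_vector(sentence, vocab):
    # One slot per vocabulary word, plus one slot (index 0) for unknown words
    vector = [0] * (len(vocab) + 1)
    for word in sentence.lower().split():
        index = vocab.get(word, 0)  # unseen words map to index 0
        vector[index] = 1           # mark occurrence (not the count)
    return vector

print(sentence_to_vector('tensorflow makes machine learning easy', vocab))  # [0, 1, 1, 1, 1, 1]
print(sentence_to_vector('machine learning is easy', vocab))                # [1, 0, 0, 1, 1, 1]
```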
A disadvantage to this method is that we lose any indication of word order. The two sentences “tensorflow makes machine learning easy” and “machine learning makes tensorflow easy” would result in the same sentence vector.
It is also worthwhile to note that the length of these vectors is equal to the size of our vocabulary that we pick.
It is common to pick a very large vocabulary, so these sentence vectors can be very sparse. This type of embedding that we have covered in this introduction is called “bag of words”. We will implement this in the next section.
Another drawback is that the words “is” and “tensorflow” both contribute the same value of one to the sentence vector. We can imagine that the occurrence of the word “is” is probably less important than the occurrence of the word “tensorflow”.
We will explore different types of embeddings in this chapter that attempt to address these ideas, but first we start with an implementation of bag of words.
Working with Bag of Words
In this example, we will download and preprocess the ham/spam text data, and then use a one-hot encoding to build a bag-of-words set of features. We will use these one-hot vectors with logistic regression to predict whether a text is spam or ham.
We start by loading the necessary libraries.
```python
import tensorflow as tf
import os
```
We start a computation graph session.
```python
# Start a graph session
sess = tf.Session()
```
We check whether the data has already been downloaded; if not, we download it and save it for future use.

```python
save_file_name = os.path.join('temp', 'temp_spam_data.csv')
```
To reduce the potential vocabulary size, we normalize the text. To do this, we remove the influence of capitalization and numbers in the text.
```python
# Relabel 'spam' as 1, 'ham' as 0
```
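Expanding on the snippet above, here is a minimal sketch of the relabeling and normalization steps, assuming `texts` and `target` hold the message bodies and string labels parsed from the downloaded CSV (the variable names are illustrative).

```python
import string

# Relabel 'spam' as 1, 'ham' as 0 (assumes target currently holds the string labels)
target = [1 if label == 'spam' else 0 for label in target]

# Lower-case everything, then strip punctuation and digits
texts = [x.lower() for x in texts]
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

# Collapse the extra whitespace left behind by the removals
texts = [' '.join(x.split()) for x in texts]
```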
To determine a good sentence length to pad/crop at, we plot a histogram of text lengths (in words).
```python
%matplotlib inline
```
We crop/pad all texts to be 25 words long. We will also filter out any words that do not appear at least 3 times.
```python
# Choose max text word length at 25
```
TensorFlow has a built-in text processing function called `VocabularyProcessor()`. We use this function to process the texts.

```python
# Setup vocabulary processor
```
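A rough sketch of this step in TensorFlow 1.x (where the processor lives under `tf.contrib.learn`), assuming the cleaned `texts` list from above; the names `sentence_size`, `min_word_freq`, and `text_processed` are illustrative.

```python
import numpy as np

sentence_size = 25   # maximum text word length chosen above
min_word_freq = 3    # drop words that appear fewer than 3 times

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(
    sentence_size, min_frequency=min_word_freq)

# Fit the vocabulary and turn every text into a fixed-length array of word indices
text_processed = np.array(list(vocab_processor.fit_transform(texts)))
embedding_size = len(vocab_processor.vocabulary_)
```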
To test our logistic model (predicting spam/ham), we split the texts into a train and test set.
```python
# Split up data set into train/test
```
For the one-hot encoding, we set up an identity matrix for the TensorFlow embedding lookup.
We also create the variables and placeholders for the logistic regression we will perform.
```python
# Setup Index Matrix for one-hot-encoding
```
Next, we create the text-word embedding lookup with the prior identity matrix.
Our logistic regression will use the counts of the words as the input. The counts are created by summing the embedding output across the rows.
Then we declare the logistic regression operations. Note that we do not wrap the logistic operations in the sigmoid function because this will be done in the loss function later on.
```python
# Text-Vocab Embedding
```
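Continuing the sketch, the identity matrix, placeholders, embedding lookup, and logistic output described above might look like the following; the names `A`, `b`, `x_data`, and `model_output` are assumptions, and `sentence_size`/`embedding_size` come from the earlier sketch.

```python
# Identity matrix: row i is the one-hot vector for word index i
identity_mat = tf.diag(tf.ones(shape=[embedding_size]))

# Logistic regression variables: one weight per vocabulary index, plus a bias
A = tf.Variable(tf.random_normal(shape=[embedding_size, 1]))
b = tf.Variable(tf.random_normal(shape=[1, 1]))

# Placeholders: one sentence of word indices, and its 0/1 target
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)

# Map each word index to its one-hot row, then sum the rows to get word counts
x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
x_col_sums = tf.reduce_sum(x_embed, 0)

# Logistic regression output (the sigmoid is applied later, inside the loss)
x_col_sums_2D = tf.expand_dims(x_col_sums, 0)
model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)
```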
Now we declare our loss function (which has the sigmoid built in), prediction operations, optimizer, and initialize the variables.
```python
# Declare loss function (Cross Entropy loss)
```
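A sketch of the loss (with the sigmoid built in), the prediction operation, the optimizer, and variable initialization under the same assumptions; the learning rate is illustrative.

```python
# Sigmoid cross-entropy loss: the sigmoid is applied to model_output internally
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=model_output, labels=y_target))

# Prediction: apply the sigmoid explicitly
prediction = tf.sigmoid(model_output)

# Optimizer and variable initialization
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)
sess.run(tf.global_variables_initializer())
```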
Now we loop through the iterations and fit the logistic regression on whether or not the text is spam or ham.
```python
# Start Logistic Regression
```
Starting Training Over 4459 Sentences.
Training Observation #50: Loss = 4.7342416e-14
...
Training Observation #4450: Loss = 3.811978e-11
Now that we have a logistic model, we can evaluate the accuracy on the test dataset.
```python
# Get test set accuracy
```
Getting Test Set Accuracy For 1115 Sentences.
Test Observation #100
Test Observation #200
Test Observation #300
Test Observation #400
Test Observation #500
Test Observation #600
Let’s look at the training accuracy over all the iterations.
```python
# Plot training accuracy over time
```
It is worthwhile to mention the motivation of limiting the sentence (or text) size. In this example we limited the text size to 25 words. This is a common practice with bag of words because it limits the effect of text length on the prediction. You can imagine that if we find a word, “meeting” for example, that is predictive of a text being ham (not spam), then a spam message might get through by putting in many occurrences of that word at the end. In fact, this is a common problem with imbalanced target data. Imbalanced data might occur in this situation, since spam may be hard to find and ham may be easy to find. Because of this fact, our vocabulary that we create might be heavily skewed toward words represented in the ham part of our data (more ham means more words are represented in ham than spam). If we allow unlimited length of texts, then spammers might take advantage of this and create very long texts, which have a higher probability of triggering non-spam word factors in our logistic model.
In the next section, we attempt to tackle this problem in a better way using the frequency of word occurrence to determine the values of the word embeddings.
Implementing TF-IDF
TF-IDF is an acronym that stands for Term Frequency - Inverse Document Frequency. This term is essentially the product of the term frequency and the inverse document frequency for each word.
In the prior recipe, we introduced the bag of words methodology, which assigned a value of one for every occurrence of a word in a sentence. This is probably not ideal, as each category of sentence (spam and ham in the prior example) most likely contains “the”, “and”, and other common words at similar frequencies, whereas words like “viagra” and “sale” should probably carry increased importance in figuring out whether or not the text is spam.

We first want to take into consideration the term frequency. Here we consider the frequency with which a word occurs in an individual entry. The purpose of this part (TF) is to find terms that appear to be important in each entry.
But words like “the” and “and” may appear very frequently in every entry. We want to down-weight the importance of these words, so we can imagine that multiplying the above term frequency (TF) by the inverse of the whole document frequency might help find important words. But since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency. This leaves us with the following formula for TF-IDF for each word in each document entry:

$$w_{tf\text{-}idf} = w_{tf} \cdot \log\left(\frac{N}{w_{df}}\right)$$

Where $w_{tf}$ is the frequency of the word in a given document, $w_{df}$ is the number of documents in the corpus that contain the word, and $N$ is the total number of documents. We can imagine that high values of TF-IDF might indicate words that are very important in determining what a document is about.
Here we implement TF-IDF (Term Frequency - Inverse Document Frequency) for the spam-ham text data.
We will use a hybrid approach, encoding the texts with scikit-learn's TfidfVectorizer. Then we will use the regular TensorFlow logistic regression outline.
Creating the TF-IDF vectors requires us to load all the texts into memory and count the occurrences of each word before we can start training our model. Because of this, it is not fully implemented in TensorFlow, so we will use scikit-learn to create our TF-IDF embedding and TensorFlow to fit the logistic model.
We start by loading the necessary libraries.
```python
import tensorflow as tf
```
Start a computational graph session.
```python
sess = tf.Session()
```
We set two parameters, `batch_size` and `max_features`. `batch_size` is the size of the batch we will train our logistic model on, and `max_features` is the maximum number of TF-IDF textual words we will use in our logistic regression.

```python
batch_size = 200
max_features = 1000
```
We check whether the data has already been downloaded; if not, we download it and save it for future use.

```python
save_file_name = 'temp_spam_data.csv'
```
We now clean our texts. This will decrease our vocabulary size by converting everything to lower case, removing punctuation and getting rid of numbers.
```python
texts = [x[1] for x in text_data]
```
Define tokenizer function and create the TF-IDF vectors with SciKit-Learn.
```python
def tokenizer(text):
```
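Here is a hedged sketch of the full step, assuming NLTK (and its tokenizer data) is available and reusing the `max_features` value from above; a plain `str.split` tokenizer would also work.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizer(text):
    # Split a cleaned text into word tokens
    return nltk.word_tokenize(text)

# Build the sparse TF-IDF matrix, keeping only the max_features highest-scoring words
tfidf = TfidfVectorizer(tokenizer=tokenizer,
                        stop_words='english',
                        max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)
```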
Split up data set into train/test.
```python
texts[:3]
```
['go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat',
'ok lar joking wif u oni',
'free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry questionstd txt ratetcs apply overs']
```python
sparse_tfidf_texts[:3]
```
<3x1000 sparse matrix of type '<class 'numpy.float64'>'
with 26 stored elements in Compressed Sparse Row format>
```python
train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
```
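The line above samples 80% of the row indices at random for training. A sketch of the rest of the split might look like this, assuming `target` holds the 0/1 labels from the earlier relabeling step.

```python
# Test indices are whatever was not chosen for training
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))

texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
```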
Now we create the variables and placeholders necessary for logistic regression. After which, we declare our logistic regression operation. Remember that the sigmoid part of the logistic regression will be in the loss function.
```python
# Create variables for logistic regression
```
Next, we declare the loss function (which has the sigmoid in it), and the prediction function. The prediction function will have to have a sigmoid inside of it because it is not in the model output.
```python
# Declare loss function (Cross Entropy loss)
```
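As a short sketch (with `model_output` and `y_target` assumed from the previous step): the loss applies the sigmoid internally, while the prediction applies it explicitly and rounds to 0 or 1.

```python
# Loss: the sigmoid is applied to model_output inside the cross-entropy function
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=model_output, labels=y_target))

# Prediction: apply the sigmoid explicitly, round to 0/1, and measure accuracy
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)
```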
Now we create the optimization function and initialize the model variables.
```python
# Declare optimizer
```
Finally, we perform our logistic regression on the 1000 TF-IDF features.
```python
train_loss = []
```
Generation # 500. Train Loss (Test Loss): 1.07 (1.08). Train Acc (Test Acc): 0.36 (0.35)
...
Generation # 9500. Train Loss (Test Loss): 0.39 (0.46). Train Acc (Test Acc): 0.88 (0.85)
Generation # 10000. Train Loss (Test Loss): 0.52 (0.46). Train Acc (Test Acc): 0.80 (0.85)
Here is matplotlib code to plot the loss and accuracies.
```python
# Plot loss over time
```
Working with Skip-gram Embeddings
Prior to this recipe, we have not considered the order of words to be relevant in creating word embeddings. In early 2013, Tomas Mikolov and other researchers at Google authored a paper about creating word embeddings that address this issue (https://arxiv.org/abs/1301.3781), and they named their methods “word2vec”.
The basic idea is to create word embeddings that capture a relational aspect of words. We seek to understand how various words are related to each other. Some examples of how these embeddings might behave are as follows.
- “king” – “man” + “woman” = “queen”
- “india pale ale” – “hops” + “malt” = “stout”
We might achieve such a numerical representation of words if we only consider their positional relationship to each other. If we could analyse a large enough source of coherent documents, we might find that the words “king”, “man”, and “queen” are mentioned close to each other in our texts. If we also know that “man” and “woman” are related in a different way, then we might conclude that “man” is to “king” as “woman” is to “queen”, and so on.

To find such an embedding, we will use a neural network that predicts surrounding words given an input word. We could just as easily have switched that and tried to predict a target word given a set of surrounding words, but we will start with the prior method. Both are variations of the word2vec procedure. The prior method, predicting the surrounding words (the context) from a target word, is called the skip-gram model. In the next recipe we will implement the other method, predicting the target word from the context, which is called the continuous bag of words (CBOW) method.
See the figure below for an illustration.
In this example, we will download and preprocess the movie review data.
From this data set, we will compute/fit the skip-gram model of the word2vec algorithm.

Skip-gram is based on predicting the surrounding words from a given target word. For the example sentence “the cat in the hat”:

- target word: [“hat”]
- context words: [“the”, “cat”, “in”, “the”]
- target-context pairs: (“hat”, “the”), (“hat”, “cat”), (“hat”, “in”), (“hat”, “the”)
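To make the pair generation concrete, here is a small illustrative sketch in plain Python; the window size of 4 is chosen only so that the whole sentence falls inside the window, reproducing the pairs listed above (the real recipe uses a smaller, configurable window).

```python
# Build (target, context) pairs for one example sentence
sentence = 'the cat in the hat'.split()
window_size = 4  # number of words to look at on each side of the target

pairs = []
for i, target_word in enumerate(sentence):
    window = sentence[max(i - window_size, 0):i] + sentence[i + 1:i + 1 + window_size]
    for context_word in window:
        pairs.append((target_word, context_word))

# Pairs generated when 'hat' is the target word
print([p for p in pairs if p[0] == 'hat'])
# [('hat', 'the'), ('hat', 'cat'), ('hat', 'in'), ('hat', 'the')]
```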
We start by loading the necessary libraries.
```python
import tensorflow as tf
```
Start a computational graph session.
```python
sess = tf.Session()
```
Declare model parameters
```python
batch_size = 100  # How many sets of words to train on at once.
```
We will remove stop words and create a test validation set of words.
```python
# Declare stop words
```
Next, we load the movie review data. We check whether the data has already been downloaded; if not, we download and save it.
```python
def load_movie_data():
```
Now we create a function that normalizes/cleans the text.
```python
# Normalize text
```
With the normalized movie reviews, we now build a dictionary of words.
```python
# Build dictionary of words
```
With the above dictionary, we can turn our text data into lists of integers by looking up each word in the dictionary.
```python
def text_to_numbers(sentences, word_dict):
```
Let us now build a function that will generate random data points from our text and parameters.
```python
# Generate data randomly (N words behind, target, N words ahead)
```
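A rough sketch of what such a generator might do for the skip-gram case, assuming `sentences` is the list of index-encoded texts produced by `text_to_numbers()`; the function name and details are illustrative, not the exact recipe code.

```python
import numpy as np

def generate_batch_data(sentences, batch_size, window_size):
    batch_data, label_data = [], []
    while len(batch_data) < batch_size:
        # Pick a random sentence and a random target position inside it
        sentence = sentences[np.random.randint(len(sentences))]
        if len(sentence) < 2:
            continue
        target_ix = np.random.randint(len(sentence))
        # Context is the window of word indices around the target (excluding the target)
        window = sentence[max(target_ix - window_size, 0):target_ix] \
            + sentence[target_ix + 1:target_ix + 1 + window_size]
        for context_word in window:
            batch_data.append(sentence[target_ix])  # input: target word index
            label_data.append([context_word])       # label: one surrounding word index
    # Trim to the exact batch size and convert to arrays
    batch_data = np.array(batch_data[:batch_size])
    label_data = np.array(label_data[:batch_size])
    return batch_data, label_data
```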
Next we define our model and placeholders.
```python
# Define Embeddings:
```
```python
embed
```
<tf.Tensor 'embedding_lookup/Identity:0' shape=(100, 100) dtype=float32>
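The following sketch shows definitions that would produce the `(100, 100)` tensor above (batch size 100, embedding size 100); the `vocabulary_size` value and the variable names are assumptions.

```python
embedding_size = 100
vocabulary_size = 10000  # illustrative; the real value comes from the word dictionary

# Embedding matrix: one trainable row of length embedding_size per vocabulary word
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Placeholders: a batch of target word indices and their context-word labels
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])

# Look up the embedding row for each input word -> shape (batch_size, embedding_size)
embed = tf.nn.embedding_lookup(embeddings, x_inputs)
```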
Here is our loss function, optimizer, cosine similarity, and initialization of the model variables.
For the loss function we will minimize the average of the NCE loss (noise-contrastive estimation).
```python
# Get loss from prediction
```
WARNING:tensorflow:From <ipython-input-18-90dede70073c>:13: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
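Continuing the sketch above, the NCE loss, optimizer, and cosine similarity might be declared roughly as follows. The `keep_dims` argument is what triggers the deprecation warning shown, and `valid_examples` (the indices of the validation words) is assumed to have been built earlier; the learning rate and number of negative samples are illustrative.

```python
num_sampled = 5  # negative examples to sample for the NCE loss

# NCE weights/biases: one output row per vocabulary word
nce_weights = tf.Variable(tf.truncated_normal(
    [vocabulary_size, embedding_size], stddev=1.0 / (embedding_size ** 0.5)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Average NCE (noise-contrastive estimation) loss over the batch
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

# Cosine similarity between the validation-word embeddings and every word embedding
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)  # assumed validation indices
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

sess.run(tf.global_variables_initializer())
```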
```python
sim_init = sess.run(similarity)
```
Now we can train our skip-gram model.
Note that the training loop contains the line `nearest = (-sim[j, :]).argsort()[1:top_k+1]`. The negative of the similarity matrix is used because `argsort()` sorts values from least to greatest. Since we want the greatest similarities, we take the negative of the similarity matrix and then call `argsort()`.
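A quick NumPy illustration of why negating the row gives a descending order; the similarity values here are made up, with index 0 standing in for the word itself.

```python
import numpy as np

sim_row = np.array([1.0, 0.1, 0.9, 0.4])  # index 0: the word's similarity to itself
top_k = 2

# argsort on the negated row returns indices from largest to smallest similarity;
# position 0 of that ordering is the word itself, so [1:top_k+1] skips it
nearest = (-sim_row).argsort()[1:top_k + 1]
print(nearest)  # [2 3]
```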
```python
# Run the skip gram model.
```
Loss at step 500 : 19.154987335205078
...
Nearest to cliche: sparkling, chosen, duty, thoughtful, pile,
Nearest to love: shimmering, transcend, economical, review, affable,
Nearest to hate: tried, recycled, anybody, complexity, enthusiasm,
Nearest to silly: denis, audacity, gutwrenching, irritating, callar,
Nearest to sad: adequately, surreal, paint, human, exploitative,
Loss at step 60500 : 3.153820514678955
Working with CBOW Embeddings
In this recipe we will implement the CBOW (continuous bag of words) method of word2vec. It is very similar to the skip-gram method, except we are predicting a single target word from a surrounding window of context words.
In the prior example we treated each combination of window and target as a group of paired inputs and outputs, but with CBOW we will add the surrounding window embeddings together to get one embedding to predict the target word embedding.
Most of the code will stay the same, except we will need to change how we create the embeddings and how we generate the data from the sentences.
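As a rough sketch of that change (reusing the `embeddings`, `batch_size`, and `embedding_size` names from the skip-gram sketches, which are assumptions here), the context-window embeddings can be looked up and summed like this:

```python
window_size = 2  # 2 words on each side -> 2 * window_size context indices per example

# Placeholders: a batch of context-word index windows and the target-word labels
x_inputs = tf.placeholder(tf.int32, shape=[batch_size, 2 * window_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])

# Look up each context word's embedding and add them into one vector per example
embed = tf.zeros([batch_size, embedding_size])
for element in range(2 * window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])
```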
To make the code easier to read, we have moved all the major functions to a separate file, called 'text_helpers.py', in the same directory. This file holds the data loading, text normalization, dictionary creation, and batch generation functions. These functions are exactly as they appear in the prior recipe, “Working with Skip-gram Embeddings”, except where noted.
See the following illustration of a CBOW example.
```python
print('Creating Model')
```
```python
# Get loss from prediction
```