
Nearest Neighbor Methods

Nearest Neighbor Regression

First, the key difference between regression and classification: regression targets are continuous, while classification targets are discrete. Nearest neighbor regression works by finding a predefined number of training samples closest in distance to a new point and predicting the label from them. The number of samples can be a user-defined constant (k-nearest neighbor learning) or can vary based on the local density of points (radius-based neighbor learning). The distance can, in principle, be measured in any way: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or a KD Tree).
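As a quick illustration of the idea, here is a minimal k-NN regression sketch using scikit-learn's KNeighborsRegressor on made-up data (scikit-learn is not used elsewhere in this chapter, which implements k-NN directly in TensorFlow):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[0.], [1.], [2.], [3.]])  # toy features
y_train = np.array([0., 1., 4., 9.])          # toy targets

# weights='distance' uses inverse-distance weighting, as described below
knn = KNeighborsRegressor(n_neighbors=2, weights='distance')
knn.fit(X_train, y_train)
print(knn.predict([[1.8]]))                   # weighted average of the two nearest targets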

Nearest Neighbor Methods

Nearest neighbor methods are a very popular family of ML algorithms. We show how to implement k-nearest neighbors, weighted k-nearest neighbors, and k-nearest neighbors with mixed distance functions. In this chapter we also show how to use the Levenshtein distance (edit distance) in TensorFlow and how to use it to calculate the distance between strings. We end this chapter by showing how to use k-nearest neighbors for categorical prediction on the MNIST handwritten digit dataset.

Nearest neighbor methods are based on a simple distance-based idea: we treat the training set itself as the model and make predictions for new points based on how close they are to points in the training set. A naïve approach is to predict the class of the single closest training point. But since most datasets contain a degree of noise, a more common method is to combine a set of k nearest neighbors, weighted by distance. This method is called k-nearest neighbors (k-NN).

Given a training dataset (x1, x2, ..., xn) with corresponding targets (y1, y2, ..., yn), we can make a prediction on a point z by looking at a set of its nearest neighbors. The actual method of prediction depends on whether we are doing regression (continuous y) or classification (discrete y).

For discrete classification targets, the prediction is given by a maximum voting scheme, where each of the k nearest neighbors casts a vote weighted by its distance to the prediction point:

prediction(z) = argmax_j sum_{i=1}^{k} w(z, x_i) * 1(y_i = j)

Here, our prediction is the class j with the largest weighted vote; 1(.) is the indicator function, and the weight w(z, x_i) is usually a decreasing function of the distance to the prediction point, with the distance itself given by the L1 or L2 metric.
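A toy NumPy sketch of this voting scheme (made-up data; inverse-distance weights, which the recipes below also use):

import numpy as np

x_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.]])
y_train = np.array([0, 0, 1, 1])            # two classes
z = np.array([0.2, 0.2])                    # prediction point
k = 3

dist = np.sum(np.abs(x_train - z), axis=1)  # L1 distances
nn = np.argsort(dist)[:k]                   # indices of the k nearest neighbors
w = 1.0 / dist[nn]                          # inverse-distance weights
votes = np.array([w[y_train[nn] == j].sum() for j in (0, 1)])
print(np.argmax(votes))                     # -> 0, the weighted majority class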

Continuous targets are very similar, but we usually just compute a distance-weighted average of the target variable y over the k nearest neighbors:

prediction(z) = sum_{i=1}^{k} w(z, x_i) * y_i

where the weights w(z, x_i) are normalized to sum to 1.

There are many distance metrics to choose from. In this chapter, we will explore the L1 and L2 metrics, as well as edit and textual distances.

We also have to choose how to weight the distances. A straightforward way is to weight by the distance itself: points that are further away from our prediction point should have less impact than nearer points. The most common scheme is to weight by the normalized inverse of the distance. We will implement this method in the next recipe.
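Concretely, the normalized inverse-distance weights look like this (a small NumPy sketch with made-up distances):

import numpy as np

d = np.array([0.5, 1.0, 2.0])    # distances to the k nearest neighbors
w = (1.0 / d) / np.sum(1.0 / d)  # normalized inverse-distance weights
print(w)                         # approximately [0.571 0.286 0.143]; sums to 1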

Note that k-NN is an aggregating method. For regression, we are performing a weighted average of neighbors. Because of this, predictions will be less extreme and less varied than the actual targets. The magnitude of this effect will be determined by k, the number of neighbors in the algorithm.

1. k-Nearest Neighbor

This recipe illustrates how to use k-nearest neighbors for regression in TensorFlow.

We will use the 1970s Boston housing dataset, which is available through the UCI ML data repository.

Data:

—————x-values—————-

  • CRIM : per capita crime rate by town
  • ZN : proportion of residential land zoned for large lots
  • INDUS : proportion of non-retail business acres per town
  • CHAS : Charles River dummy variable
  • NOX : nitric oxides concentration (parts per 10 million)
  • RM : average number of rooms per dwelling
  • AGE : proportion of units built prior to 1940
  • DIS : weighted distances to employment centers
  • RAD : index of accessibility to radial highways
  • TAX : full-value property-tax rate per $10k
  • PTRATIO: pupil/teacher ratio by town
  • B : 1000*(Bk-0.63)^2, where Bk is the proportion of Black residents by town
  • LSTAT : % lower status of the population

——————y-value—————-

  • MEDV : Median Value of homes in $1,000’s
# import required libraries
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests
from tensorflow.python.framework import ops
ops.reset_default_graph()

Create graph

sess = tf.Session()

Load the data

housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
num_features = len(cols_used)
housing_file = requests.get(housing_url)
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\n') if len(y)>=1]

y_vals = np.transpose([np.array([y[13] for y in housing_data])])
x_vals = np.array([[x for i,x in enumerate(y) if housing_header[i] in cols_used] for y in housing_data])

## Min-Max Scaling
x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)
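Here np.ptp(0) ("peak to peak") is the column-wise max minus min, so every feature is rescaled to [0, 1]. A quick check on toy data:

demo = np.array([[1., 10.], [3., 20.], [5., 40.]])
print((demo - demo.min(0)) / demo.ptp(0))
# [[0.         0.        ]
#  [0.5        0.33333333]
#  [1.         1.        ]]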
x_vals.shape
(506, 10)
y_vals.shape
(506, 1)

Split the data into train and test sets

np.random.seed(13)  #make results reproducible
train_indices = np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]

Parameters to control run

# Declare k-value and batch size
k = 4
batch_size=len(x_vals_test)

# Placeholders
x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)

Declare distance metric

L1 Distance Metric

The L1 line below is active by default; to use the L2 metric instead, comment it out and uncomment the L2 line that follows.

distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2)
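The subtraction above broadcasts the [num_test, 1, num_features] test tensor against the [num_train, num_features] training tensor, giving a [num_test, num_train, num_features] difference tensor; summing over axis 2 yields the [num_test, num_train] distance matrix. The same broadcasting in NumPy (toy sizes, shapes only):

import numpy as np
a = np.ones((3, 5))                    # stands in for x_data_train
b = np.ones((2, 5))                    # stands in for x_data_test
diff = a - np.expand_dims(b, 1)        # broadcasts to shape (2, 3, 5)
print(np.abs(diff).sum(axis=2).shape)  # (2, 3): one distance per test/train pair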

L2 Distance Metric

Uncomment following line and comment L1 above

#distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2))

Predict: Get min distance index (Nearest neighbor)

#prediction = tf.arg_min(distance, 0)
# top_k of the negated distances selects the k smallest distances
top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
# invert the negated distances; after normalization below the weights are
# positive and proportional to 1/distance, so nearer neighbors count more
top_k_xvals = tf.truediv(1.0, top_k_xvals)
# normalize the weights so they sum to 1 for each test point
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1), 1)
x_sums_repeated = tf.matmul(x_sums, tf.ones([1, k], tf.float32))
x_val_weights = tf.expand_dims(tf.div(top_k_xvals, x_sums_repeated), 1)

# prediction = weighted average of the k nearest training targets
top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction = tf.squeeze(tf.matmul(x_val_weights, top_k_yvals), axis=[1])
#prediction = tf.reduce_mean(top_k_yvals, 1)

# Calculate MSE
mse = tf.div(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)

# Calculate how many loops over the test data
num_loops = int(np.ceil(len(x_vals_test)/batch_size))

for i in range(num_loops):
    min_index = i*batch_size
    max_index = min((i+1)*batch_size, len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                                  y_target_train: y_vals_train, y_target_test: y_batch})
    batch_mse = sess.run(mse, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                         y_target_train: y_vals_train, y_target_test: y_batch})

    print('Batch #' + str(i+1) + ' MSE: ' + str(np.round(batch_mse, 3)))
Batch #1 MSE: 10.625
%matplotlib inline
# Plot prediction and actual distribution
bins = np.linspace(5, 50, 45)

plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()

(Figure: histogram of predicted vs. actual median home values)


2. MNIST Digit Prediction with k-Nearest Neighbors

This script loads the MNIST data, splits it into train/test sets, and performs digit prediction with nearest neighbors.

For each test digit, we find the k closest training images and predict the most common label among them.

Digit images are represented as 28x28 matrices of floating-point numbers, flattened into 784-dimensional vectors.

import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from PIL import Image
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.framework import ops
ops.reset_default_graph()

Create graph

sess = tf.Session()

Load the data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

Generate random sample

np.random.seed(13)  # set seed for reproducibility
train_size = 1000
test_size = 102
rand_train_indices = np.random.choice(len(mnist.train.images), train_size, replace=False)
rand_test_indices = np.random.choice(len(mnist.test.images), test_size, replace=False)
x_vals_train = mnist.train.images[rand_train_indices]
x_vals_test = mnist.test.images[rand_test_indices]
y_vals_train = mnist.train.labels[rand_train_indices]
y_vals_test = mnist.test.labels[rand_test_indices]

Declare k-value and batch size

k = 4
batch_size=6

# Placeholders
x_data_train = tf.placeholder(shape=[None, 784], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, 784], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 10], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 10], dtype=tf.float32)

Declare distance metric

L1 metric (active by default) - to use the L2 metric instead, comment out the following line and uncomment the L2 line below

distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2)

L2 metric - uncomment following line and comment the L1 metric above

#distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2))

Predict: Get min distance index (Nearest neighbor)

# Get min distance index (Nearest neighbor)
top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
prediction_indices = tf.gather(y_target_train, top_k_indices)

# Predict the mode category: the labels are one-hot, so summing the k
# neighbor label vectors counts the votes per class
count_of_predictions = tf.reduce_sum(prediction_indices, axis=1)
prediction = tf.argmax(count_of_predictions, axis=1)

# Calculate how many loops over the test data
num_loops = int(np.ceil(len(x_vals_test)/batch_size))

test_output = []
actual_vals = []
for i in range(num_loops):
    min_index = i*batch_size
    max_index = min((i+1)*batch_size, len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                                  y_target_train: y_vals_train, y_target_test: y_batch})
    test_output.extend(predictions)
    actual_vals.extend(np.argmax(y_batch, axis=1))

accuracy = sum([1./test_size for i in range(test_size) if test_output[i]==actual_vals[i]])
print('Accuracy on test set: ' + str(accuracy))
Accuracy on test set: 0.8823529411764696
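Because the labels are one-hot vectors, summing the k neighbor labels per test point counts the votes for each class, and argmax picks the winner. A quick NumPy check of the voting step:

import numpy as np
# three one-hot neighbor labels for one test point: classes 2, 2, and 7
neighbors = np.array([[0,0,1,0,0,0,0,0,0,0],
                      [0,0,1,0,0,0,0,0,0,0],
                      [0,0,0,0,0,0,0,1,0,0]])
print(np.argmax(neighbors.sum(axis=0)))  # -> 2 (two votes beat one)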
%matplotlib inline
# Plot the last batch results:
actuals = np.argmax(y_batch, axis=1)

Nrows = 2
Ncols = 3
for i in range(len(actuals)):
    plt.subplot(Nrows, Ncols, i+1)
    plt.imshow(np.reshape(x_batch[i], [28,28]), cmap='Greys_r')
    plt.title('Actual: ' + str(actuals[i]) + ' Pred: ' + str(predictions[i]),
              fontsize=10)
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)

(Figure: the last batch of test digits with actual and predicted labels)


3. Text Distances

This notebook illustrates how to use the Levenshtein distance (edit distance) in TensorFlow.

Load the required library and start a TensorFlow session.

import tensorflow as tf

sess = tf.Session()

First compute the edit distance between ‘bear’ and ‘beers’

hypothesis = list('bear')
truth = list('beers')
h1 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3]],
                     hypothesis,
                     [1,1,4])

t1 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3], [0,0,4]],
                     truth,
                     [1,1,5])

print(sess.run(tf.edit_distance(h1, t1, normalize=False)))
[[2.]]
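As a sanity check, turning 'bear' into 'beers' takes two edits (substitute 'a' -> 'e', then insert 's'). A tiny pure-Python Wagner-Fischer implementation, included here only to verify the TensorFlow result:

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j-1] + 1,             # insertion
                           prev[j-1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein('bear', 'beers'))  # -> 2, matching the output above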

Compute the edit distances between the hypotheses ('bear', 'beer') and the truth 'beers'. With normalize=True, each distance is divided by the length of the truth string, so 2/5 = 0.4 and 1/5 = 0.2:

hypothesis2 = list('bearbeer')
truth2 = list('beersbeers')
h2 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3], [0,1,0], [0,1,1], [0,1,2], [0,1,3]],
                     hypothesis2,
                     [1,2,4])

t2 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3], [0,0,4], [0,1,0], [0,1,1], [0,1,2], [0,1,3], [0,1,4]],
                     truth2,
                     [1,2,5])

print(sess.run(tf.edit_distance(h2, t2, normalize=True)))
[[0.4 0.2]]

Now compute the distances between four words and 'beers' more efficiently, in a single batched call:

hypothesis_words = ['bear','bar','tensor','beers']
truth_word = ['beers']

num_h_words = len(hypothesis_words)
# one sparse entry per character: [word index, 0, character index]
h_indices = [[xi, 0, yi] for xi,x in enumerate(hypothesis_words) for yi,y in enumerate(x)]
h_chars = list(''.join(hypothesis_words))

h3 = tf.SparseTensor(h_indices, h_chars, [num_h_words, 1, max(len(w) for w in hypothesis_words)])

# repeat the truth word once per hypothesis word
truth_word_vec = truth_word*num_h_words
t_indices = [[xi, 0, yi] for xi,x in enumerate(truth_word_vec) for yi,y in enumerate(x)]
t_chars = list(''.join(truth_word_vec))

t3 = tf.SparseTensor(t_indices, t_chars, [num_h_words, 1, max(len(w) for w in truth_word_vec)])

print(sess.run(tf.edit_distance(h3, t3, normalize=True)))
[[0.4]
 [0.6]
 [1. ]
 [0. ]]

4. Mixed Distance Functions for k-Nearest Neighbor

This recipe shows how to weight the features differently in the k-NN distance function; here we scale each feature by its standard deviation in a weighted L2 metric.

Data: the same Boston housing dataset (x-values and the MEDV target) described in recipe 1.
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests
from tensorflow.python.framework import ops
ops.reset_default_graph()

Create graph

sess = tf.Session()

Load the data

housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
num_features = len(cols_used)
housing_file = requests.get(housing_url)
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\n') if len(y)>=1]

y_vals = np.transpose([np.array([y[13] for y in housing_data])])
x_vals = np.array([[x for i,x in enumerate(y) if housing_header[i] in cols_used] for y in housing_data])

## Min-Max Scaling
x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)

## Create distance metric weight matrix weighted by standard deviation
weight_diagonal = x_vals.std(0)
weight_matrix = tf.cast(tf.diag(weight_diagonal), dtype=tf.float32)

# Split the data into train and test sets
np.random.seed(13) #make results reproducible
train_indices = np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]

Declare k-value and batch size

k = 4
batch_size=len(x_vals_test)

# Placeholders
x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)
# Declare weighted distance metric
# Weighted - L2 = sqrt((x-y)^T * A * (x-y))
subtraction_term = tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))
first_product = tf.matmul(subtraction_term, tf.tile(tf.expand_dims(weight_matrix,0), [batch_size,1,1]))
second_product = tf.matmul(first_product, tf.transpose(subtraction_term, perm=[0,2,1]))
# matrix_diag_part(...): Returns the batched diagonal part of a batched tensor.
distance = tf.sqrt(tf.matrix_diag_part(second_product))
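Since the weight matrix A here is diagonal, this weighted L2 distance reduces to sqrt(sum_j A_jj * (x_j - y_j)^2): each feature's squared difference is scaled by that feature's standard deviation. A NumPy sketch of the same computation on toy vectors:

import numpy as np
x = np.array([1., 2., 3.])
y = np.array([2., 2., 5.])
a_diag = np.array([0.5, 1., 2.])   # stands in for weight_diagonal

d_matrix = np.sqrt((x - y) @ np.diag(a_diag) @ (x - y))
d_direct = np.sqrt(np.sum(a_diag * (x - y)**2))
print(d_matrix, d_direct)          # both 2.915..., the two forms agree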

Predict: Get min distance index (Nearest neighbor)

top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
# invert the negated distances so that, after normalization, the weights are
# proportional to 1/distance and nearer neighbors count more (as in recipe 1)
top_k_xvals = tf.truediv(1.0, top_k_xvals)
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1), 1)
x_sums_repeated = tf.matmul(x_sums, tf.ones([1, k], tf.float32))
x_val_weights = tf.expand_dims(tf.div(top_k_xvals, x_sums_repeated), 1)

top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction = tf.squeeze(tf.matmul(x_val_weights, top_k_yvals), axis=[1])

# Calculate MSE
mse = tf.div(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)

# Calculate how many loops over the test data
num_loops = int(np.ceil(len(x_vals_test)/batch_size))

for i in range(num_loops):
    min_index = i*batch_size
    max_index = min((i+1)*batch_size, len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                                  y_target_train: y_vals_train, y_target_test: y_batch})
    batch_mse = sess.run(mse, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                         y_target_train: y_vals_train, y_target_test: y_batch})

    print('Batch #' + str(i+1) + ' MSE: ' + str(np.round(batch_mse, 3)))
Batch #1 MSE: 18.847
%matplotlib inline
# Plot prediction and actual distribution
bins = np.linspace(5, 50, 45)

plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()

(Figure: histogram of predicted vs. actual median home values, weighted distance metric)


5. Address Matching with k-Nearest Neighbors

This recipe illustrates a way to perform address matching between two datasets.

For each test address, we will return the closest reference address to it.

We will consider two distance functions:

  1. Edit distance for the street number/name, and
  2. Euclidean distance (L2) for the zip codes

Each distance is converted into a similarity score, and the two scores are combined with a weighted average, as sketched below.
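A minimal NumPy sketch of that combination step (toy numbers; the full TensorFlow version appears later in this recipe):

import numpy as np

zip_dist = np.array([[0., 4., 16.]])       # squared zip distances to 3 reference addresses
address_sim = np.array([[1.0, 0.5, 0.2]])  # 1 - normalized edit distance

# min-max scale the zip distance into a similarity in [0, 1]
zip_sim = (zip_dist.max() - zip_dist) / (zip_dist.max() - zip_dist.min())

# equal weighting of the two similarity scores
weighted_sim = 0.5 * address_sim + 0.5 * zip_sim
print(np.argmax(weighted_sim, axis=1))     # -> [0], the best reference match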
import random
import string
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
ops.reset_default_graph()

First we generate the data sets we will need

# n = Size of created data sets
n = 10
street_names = ['abbey', 'baker', 'canal', 'donner', 'elm']
street_types = ['rd', 'st', 'ln', 'pass', 'ave']

random.seed(31) #make results reproducible
rand_zips = [random.randint(65000,65999) for i in range(5)]

# Function to randomly create one typo in a string w/ a probability
def create_typo(s, prob=0.75):
    if random.uniform(0,1) < prob:
        rand_ind = random.choice(range(len(s)))
        s_list = list(s)
        s_list[rand_ind] = random.choice(string.ascii_lowercase)
        s = ''.join(s_list)
    return(s)

# Generate the reference dataset
numbers = [random.randint(1, 9999) for i in range(n)]
streets = [random.choice(street_names) for i in range(n)]
street_suffs = [random.choice(street_types) for i in range(n)]
zips = [random.choice(rand_zips) for i in range(n)]
full_streets = [str(x) + ' ' + y + ' ' + z for x,y,z in zip(numbers, streets, street_suffs)]
reference_data = [list(x) for x in zip(full_streets,zips)]

# Generate test dataset with some typos
typo_streets = [create_typo(x) for x in streets]
typo_full_streets = [str(x) + ' ' + y + ' ' + z for x,y,z in zip(numbers, typo_streets, street_suffs)]
test_data = [list(x) for x in zip(typo_full_streets,zips)]

Now we can perform address matching

# Create graph
sess = tf.Session()

# Placeholders
test_address = tf.sparse_placeholder(dtype=tf.string)
test_zip = tf.placeholder(shape=[None, 1], dtype=tf.float32)
ref_address = tf.sparse_placeholder(dtype=tf.string)
ref_zip = tf.placeholder(shape=[None, n], dtype=tf.float32)

# Declare zip code distance for a test zip and the reference set
zip_dist = tf.square(tf.subtract(ref_zip, test_zip))

# Declare edit distance for the addresses
address_dist = tf.edit_distance(test_address, ref_address, normalize=True)

# Create similarity scores
zip_max = tf.gather(tf.squeeze(zip_dist), tf.argmax(zip_dist, 1))
zip_min = tf.gather(tf.squeeze(zip_dist), tf.argmin(zip_dist, 1))
zip_sim = tf.div(tf.subtract(zip_max, zip_dist), tf.subtract(zip_max, zip_min))
address_sim = tf.subtract(1., address_dist)

# Combine distance functions
address_weight = 0.5
zip_weight = 1. - address_weight
weighted_sim = tf.add(tf.transpose(tf.multiply(address_weight, address_sim)), tf.multiply(zip_weight, zip_sim))

# Predict: Get max similarity entry
top_match_index = tf.argmax(weighted_sim, 1)

# Function to create a character-level sparse tensor from a list of strings
def sparse_from_word_vec(word_vec):
    num_words = len(word_vec)
    indices = [[xi, 0, yi] for xi,x in enumerate(word_vec) for yi,y in enumerate(x)]
    chars = list(''.join(word_vec))
    return(tf.SparseTensorValue(indices, chars, [num_words, 1, max(len(w) for w in word_vec)]))
# Loop through test indices
reference_addresses = [x[0] for x in reference_data]
reference_zips = np.array([[x[1] for x in reference_data]])

# Create sparse address reference set
sparse_ref_set = sparse_from_word_vec(reference_addresses)

for i in range(n):
    test_address_entry = test_data[i][0]
    test_zip_entry = [[test_data[i][1]]]

    # Create sparse address vectors
    test_address_repeated = [test_address_entry] * n
    sparse_test_set = sparse_from_word_vec(test_address_repeated)

    feeddict = {test_address: sparse_test_set,
                test_zip: test_zip_entry,
                ref_address: sparse_ref_set,
                ref_zip: reference_zips}
    best_match = sess.run(top_match_index, feed_dict=feeddict)
    best_street = reference_addresses[best_match[0]]
    [best_zip] = reference_zips[0][best_match]
    [[test_zip_]] = test_zip_entry
    print('Address: ' + str(test_address_entry) + ', ' + str(test_zip_))
    print('Match  : ' + str(best_street) + ', ' + str(best_zip))
Address: 2308 bakar rd, 65480
Match  : 2308 baker rd, 65480
Address: 709 bakeo pass, 65480
Match  : 709 baker pass, 65480
Address: 2273 glm ln, 65782
Match  : 2273 elm ln, 65782
Address: 1843 donner st, 65402
Match  : 1843 donner st, 65402
Address: 8769 klm st, 65402
Match  : 8769 elm st, 65402
Address: 3798 dpnner ln, 65012
Match  : 3798 donner ln, 65012
Address: 2288 bajer pass, 65012
Match  : 2288 baker pass, 65012
Address: 2416 epm ln, 65480
Match  : 2416 elm ln, 65480
Address: 543 abgey ave, 65115
Match  : 543 abbey ave, 65115
Address: 994 abbey st, 65480
Match  : 994 abbey st, 65480