Using this dataset, we will perform the following analyses:
Timeline View: Visualize all the tweets over a timeline and identify peak moments
Keyword Analysis: Which keywords derived from the name, username, description, location, and tweets were the most commonly used by ISIS fanboys? Examples include: "baqiyah", "dabiq", "wilayat", "amaq"
Social Network Cluster Analysis: Who are the major players in the pro-ISIS Twitter network? Ideally, we would like this visualized via a cluster network with the biggest influencers scaled larger than smaller influencers.
Topic Modeling: What are the general topics being discussed in the Twitter network?
Dataset built by Fifth Tribe, released under CC0, Public Domain
The following cells will clean up the tweets, removing non-text elements like emoticons. We will also create a second set of tweets with @mentions, links, hashtags, and more removed, to further narrow the dataset for the Keyword Search and Topic Modeling later in this lab.
!conda install -c conda-forge wordcloud -y
import os
import re
import boto3
import sagemaker
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from scipy import interpolate
import scipy.sparse as sparse
import networkx as nx
%matplotlib inline
warnings.filterwarnings('ignore')
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
Download the dataset from a public S3 bucket:
raw_data_filename = 'tweets.csv'
s3 = boto3.resource('s3')
s3.Bucket('wwps-sagemaker-workshop').download_file('tweets.csv', raw_data_filename)
df = pd.read_csv(raw_data_filename, parse_dates= [6])
df.head()
  | name | username | description | location | followers | numberstatuses | time | tweets
---|---|---|---|---|---|---|---|---|
0 | GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:07:00 | ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU... |
1 | GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:27:00 | ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI '... |
2 | GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:29:00 | ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH ... |
3 | GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:37:00 | ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI ... |
4 | GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:45:00 | ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH... |
Now lowercase the tweets, truncate them at the first link, strip the "ENGLISH TRANSLATION" prefix, and remove punctuation and other non-text elements:
df.username = df.username.str.lower()
def clean_tweet(tweet):
    # Truncate the tweet at the first link, if any, then lowercase
    idx = tweet.find("http")
    text = (tweet[:idx] if idx != -1 else tweet).lower()
    # Collapse all whitespace into single spaces
    text = re.sub(r"\s", " ", text)
    # Keep only alphanumerics, @-mentions, hashtags, and spaces
    text = re.sub(r"[^a-zA-Z0-9@# ]", "", text)
    # Strip the "english translation" prefix used by the translation accounts
    text = re.sub(r"^english translation ", "", text)
    return text
df.tweets = df.tweets.apply(clean_tweet)
df.head()
  | name | username | description | location | followers | numberstatuses | time | tweets
---|---|---|---|---|---|---|---|---|
0 | GunsandCoffee | gunsandcoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:07:00 | a message to the truthful in syria sheikh abu... |
1 | GunsandCoffee | gunsandcoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:27:00 | sheikh fatih al jawlani for the people of inte... |
2 | GunsandCoffee | gunsandcoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:29:00 | first audio meeting with sheikh fatih al jawla... |
3 | GunsandCoffee | gunsandcoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:37:00 | sheikh nasir al wuhayshi ha leader of aqap the... |
4 | GunsandCoffee | gunsandcoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NaN | 640 | 49 | 2015-01-06 21:45:00 | aqap response to sheikh baghdadis statement al... |
To get warmed up, let's do a simple analysis that compares the number of tweets per day (and its 7-day rolling average) to world events:
df.time = pd.to_datetime(df.time)
# Count tweets per day
per_day = df.set_index(df.time).resample('D').count()
fig, ax = plt.subplots(figsize = (20,8))
# Plot the raw daily counts faintly, with a 7-day rolling average on top
per_day['2016-01-01':].numberstatuses.interpolate(method='linear').plot(ax = ax, color="black", fontsize=12, alpha=0.1)
per_day.rolling(window=7).mean().tweets['2016-01-01':].plot(color ='r')
yemen = '2016-01-29'
brussels = '2016-03-22'
ax.annotate('Bombing in Brussels',xy=(brussels, 200),xytext=('2016-03-03', 310),
arrowprops=dict(facecolor='white', shrink=0.05), size=15)
ax.annotate('Car bombing in Yemen',xy=(yemen, 200),xytext=('2016-01-10', 310),
arrowprops=dict(facecolor='white', shrink=0.05),size=15)
ax.margins(None,0.1)
ax.legend(['Tweets Per Day','7-Day Rolling Average'], loc = 'upper right',
numpoints = 1, labelspacing = 2.0, fontsize = 14)
ax.set_xlabel('Date')
ax.set_ylabel('Number of Tweets')
plt.show()
A simple way to do a Keyword Analysis is by using a word cloud. A word cloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
emoticons_str = r"""
(?:
[:=;] # Eyes
[oO\-]? # Nose (optional)
[D\)\]\(\]/\\OpP] # Mouth
)"""
regex_str = [
emoticons_str,
r'<[^>]+>', # HTML tags
r'(?:@[\w_]+)', # @-mentions
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
r'(?:[\w_]+)', # other words
r'(?:\S)' # anything else
]
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
def tokenize(s):
return tokens_re.findall(s)
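As a quick sanity check, here is the tokenizer applied to a made-up sample string (a hypothetical input, not drawn from the dataset):
print(tokenize("@user check #wilayat news :) 123"))
# ['@user', 'check', '#wilayat', 'news', ':)', '123']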
# Note: token_pattern is ignored when a custom tokenizer is supplied; scikit-learn uses tokenize() above
Tfidf_vectorizer = TfidfVectorizer(analyzer='word', tokenizer=tokenize, ngram_range=(1,2), stop_words='english',
                                   token_pattern='\\b[a-z][a-z]+\\b', max_df=.5)
tfidf_tweets = Tfidf_vectorizer.fit_transform(df.tweets)
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}
terms = Tfidf_vectorizer.get_feature_names()
wc = WordCloud(height=1000, width=1000, max_words=1000).generate(" ".join(terms))
plt.figure(figsize=(10, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("Keyword Word Cloud")
plt.show()
Based on this information, some of the most talked-about topics include ISIS, Syria, and the Islamic State. Interestingly, there are also several call-outs to Twitter handles such as ramiallolah, warreporter1, nidalgazuai, unclesamcoco, and more. Let's quickly plot the top 10 Twitter handles and see if these are indeed our most active users. To do this we will count the number of tweets from each user and plot the top 10:
top_users = df.username.value_counts().sort_values(ascending=False)
top_users.head(10).plot.bar(title="Top 10 Twitter Handles", figsize=(16,8))
Based on the analysis above, the most active Twitter handles also appear frequently in the tweet data, most likely due to @mentions of different users. This leads us to the next phase of the analysis. Let's build a social network graph to understand who is mentioning whom within the Twitter network. To do this we will create a new DataFrame that maps which user mentioned whom, and then create weights for our network based on how active/influential each user is.
pd.options.mode.chained_assignment = None # default='warn'
mentions = df.loc[df.tweets.str.contains('@')]
mentions['Tagged User'] = mentions.tweets.apply(lambda x: re.findall(r'@([A-Za-z0-9_]+)',str(x)))
users = mentions.username.unique()
mentions['Tagged User Cnt'] = mentions['Tagged User'].apply(lambda x: list(set(x).intersection(users)) )
mentions['Cnt length'] = mentions['Tagged User Cnt'].apply(lambda x: len(x))
# Build one row per (user, mentioned user) pair
records = []
for i in range(len(mentions.tweets)):
    row = mentions.iloc[i,:]
    for tagged in row['Tagged User']:
        records.append({'User': row['username'],
                        'Mentions': tagged,
                        'Time': row['time'],
                        'User num status': row['numberstatuses'],
                        'Followers': row['followers'],
                        'Weight': 1})
mention_net = pd.DataFrame(records)
mention_net = mention_net[mention_net['User']!=mention_net['Mentions']].reset_index(drop=True)
mention_net.head(5)
  | Followers | Mentions | Time | User | User num status | Weight
---|---|---|---|---|---|---|
0 | 640 | khalidmaghrebi | 2015-01-06 22:17:00 | gunsandcoffee70 | 49 | 1 |
1 | 640 | seifulmaslul123 | 2015-01-06 22:17:00 | gunsandcoffee70 | 49 | 1 |
2 | 640 | cheerleadunit | 2015-01-06 22:17:00 | gunsandcoffee70 | 49 | 1 |
3 | 640 | khalidmaghrebi | 2015-01-10 00:08:00 | gunsandcoffee70 | 49 | 1 |
4 | 640 | seifulmaslul123 | 2015-01-10 00:08:00 | gunsandcoffee70 | 49 | 1 |
Let's use our new DataFrame to do a quick analysis of our most-mentioned and most active users! The results from this analysis should match our word cloud.
most_mentions = mention_net.Mentions.value_counts().sort_values(ascending=False)
most_active = mention_net.User.value_counts().sort_values(ascending=False)
plt.figure(figsize=(16,8))
ax1 = plt.subplot(121)
ax1 = sns.barplot(most_mentions[0:5].index, most_mentions[0:5].values)
ax1 = plt.title("Most Mentioned Handles")
ax2 = plt.subplot(122)
ax2 = sns.barplot(most_active[0:5].index, most_active[0:5].values)
ax2 = plt.title("Most Active Handles")
plt.show()
Our analyses line up, so we can trust that our results thus far are consistent. If you look back at the word cloud, you will see many of the same Twitter handles.
network = mention_net[['Mentions','User','Weight']]  # use the mentions, user, and weight columns to build the network
network = network.groupby(by=['Mentions','User'])['Weight'].sum().reset_index()
network = network.sort_values(by='Weight', ascending=False).reset_index(drop=True)
network = network[network['Weight']>20]  # keep only pairs with more than 20 mentions
print('Most frequent user/mention pairs:')
network.head(10)
Most frequent user/mention pairs:
  | Mentions | User | Weight
---|---|---|---|
0 | ramiallolah | mobi_ayubi | 195 |
1 | nidalgazaui | warrnews | 184 |
2 | scotsmaninfidel | melvynlion | 79 |
3 | didyouknowvs | warreporter2 | 70 |
4 | sparksofirhabi3 | melvynlion | 63 |
5 | spicylatte123 | melvynlion | 61 |
6 | ele7vn | melvynlion | 58 |
7 | 1texanna | melvynlion | 56 |
8 | sassysassyred | melvynlion | 54 |
9 | kafirkaty | melvynlion | 48 |
G = nx.Graph()
for i in range(len(network['User'])):
G.add_edge(network['User'][i],network['Mentions'][i],weight=network['Weight'][i])
plt.figure(figsize=(14,14))
# Edge widths scale with mention weight; node sizes scale with degree
weights = [G.edges[u, v]['weight']/10 for u, v in G.edges()]
size = [deg*100 for _, deg in G.degree()]
nx.draw_circular(G, node_color='g', node_size=size, edge_color='#909090', with_labels=True, width=weights)
plt.axis('equal')
from bokeh.io import show, output_notebook
from bokeh.models import Plot, Range1d, MultiLine, Circle, HoverTool, TapTool, BoxSelectTool, ColumnDataSource, LabelSet
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
from bokeh.palettes import Spectral4
plot = Plot(plot_width=1000, plot_height=1000, x_range=Range1d(-1.1,1.1), y_range=Range1d(-1.1,1.1))
plot.title.text = "Graph Interaction Demonstration"
plot.add_tools(HoverTool(tooltips=None), TapTool(), BoxSelectTool())
graph_renderer = from_networkx(G, nx.circular_layout, scale=1, center=(0,0))
graph_renderer.node_renderer.glyph = Circle(size=15, fill_color=Spectral4[0])
graph_renderer.node_renderer.selection_glyph = Circle(size=15, fill_color=Spectral4[2])
graph_renderer.node_renderer.hover_glyph = Circle(size=15, fill_color=Spectral4[1])
graph_renderer.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width=5)
graph_renderer.edge_renderer.selection_glyph = MultiLine(line_color=Spectral4[2], line_width=5)
graph_renderer.edge_renderer.hover_glyph = MultiLine(line_color=Spectral4[1], line_width=5)
graph_renderer.selection_policy = NodesAndLinkedEdges()
graph_renderer.inspection_policy = EdgesAndLinkedNodes()
pos = nx.circular_layout(G)  # reuse the renderer's circular layout for label placement
x, y = zip(*pos.values())
node_labels = list(G.nodes)
source = ColumnDataSource({'x': x, 'y': y,
'label': [node_labels[i] for i in range(len(x))]})
labels = LabelSet(x='x', y='y', text='label', source=source, x_offset=5, y_offset=5)
plot.renderers.append(labels)
plot.renderers.append(graph_renderer)
#output_file("interactive_graphs.html")
output_notebook()
show(plot)
SageMaker NTM takes the high-dimensional word count vectors in documents as inputs, maps them into lower-dimensional hidden representations, and reconstructs the original input back from the hidden representations. The hidden representation learned by the model corresponds to the mixture weights of the topics associated with the document. The semantic meaning of the topics can be determined by the top-ranking words in each topic as learned by the reconstruction layer. The training objective of SageMaker NTM is to minimize the reconstruction error and the Kullback–Leibler divergence, the sum of which corresponds to an upper bound on the negative log-likelihood of the data.
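In symbols, with $x$ the bag-of-words vector for a document, $z$ its topic mixture, and $q(z \mid x)$ the encoder, this is the standard variational bound, stated here for reference:

$$-\log p(x) \;\le\; \underbrace{\mathbb{E}_{q(z \mid x)}\left[-\log p(x \mid z)\right]}_{\text{reconstruction error}} \;+\; \underbrace{D_{\mathrm{KL}}\left(q(z \mid x)\,\|\,p(z)\right)}_{\text{KL divergence}}$$

Minimizing the right-hand side therefore minimizes an upper bound on the negative log-likelihood $-\log p(x)$.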
As an unsupervised generative model, NTM does not provide an accuracy or error metric to compare against established prior expectations. The main indicator of training progress is the training loss, which corresponds to the negative log-likelihood of the data as discussed above. To evaluate how well the trained model generalizes to unseen data, we recommend always supplying a validation data set when training NTM, so that training progress can be properly assessed and early stopping can take effect to avoid overfitting.
The input documents to the algorithm, both in training and inference, need to be vectors of integers representing word counts. This is the so-called bag-of-words (BOW) representation. To convert plain text to BOW, we first "tokenize" our documents, that is, identify words and assign an integer ID to each of them. Then, we count the occurrences of each token in each document and form the BOW vectors. We will keep only the 2,000 most frequent tokens (words), because rarely used words have a much smaller impact on the model and can be ignored.
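For intuition, here is the BOW conversion applied to two toy documents (hypothetical strings; the same scikit-learn CountVectorizer is applied to the real tweets below):
from sklearn.feature_extraction.text import CountVectorizer
toy_docs = ["the caliphate news today", "news about news today"]
cv = CountVectorizer()
toy_bow = cv.fit_transform(toy_docs)   # sparse matrix of word counts
print(cv.get_feature_names())          # ['about', 'caliphate', 'news', 'the', 'today']
print(toy_bow.toarray())               # [[0 1 1 1 1]
                                       #  [1 0 2 0 1]]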
!pip install nltk
Requirement already satisfied: nltk in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (3.3) Requirement already satisfied: six in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from nltk) (1.11.0) distributed 1.21.8 requires msgpack, which is not installed. You are using pip version 10.0.1, however version 18.0 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
token_pattern = re.compile(r"(?u)\b\w\w+\b")
class LemmaTokenizer(object):
def __init__(self):
self.wnl = WordNetLemmatizer()
def __call__(self, doc):
return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if len(t) >= 2 and re.match("[a-z].*",t)
and re.match(token_pattern, t)]
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip. [nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data... [nltk_data] Unzipping corpora/wordnet.zip.
With the tokenizer defined, we perform token counting next, limiting the vocabulary size to vocab_size:
vocab_size = 2000
print('Tokenizing and counting, this may take a few minutes...')
start_time = time.time()
vectorizer = CountVectorizer(input='content', analyzer='word', stop_words='english',
tokenizer=LemmaTokenizer(), max_features=vocab_size, max_df=0.95, min_df=2)
vectors = vectorizer.fit_transform(df.tweets)
vocab_list = vectorizer.get_feature_names()
print('vocab size:', len(vocab_list))
# random shuffle
idx = np.arange(vectors.shape[0])
np.random.shuffle(idx)
vectors = vectors[idx]
np.save('vocab', vocab_list)
print('Done. Time elapsed: {:.2f}s'.format(time.time() - start_time))
Tokenizing and counting, this may take a few minutes... vocab size: 2000 Done. Time elapsed: 5.05s
Because all the parameters (weights and biases) in the NTM model are of type np.float32, the input data must also be np.float32. It is better to do this type-casting upfront rather than repeatedly casting during mini-batch training.
As is common practice in model training, we should have a training set, a validation set, and a test set. The training set is the data the model is actually trained on. But what we really care about is not the model's performance on the training set but its performance on future, unseen data. Therefore, during training, we periodically calculate scores (or losses) on the validation set to validate the performance of the model on unseen data. By assessing the model's ability to generalize, we can stop the training at the optimal point via early stopping to avoid over-training.
Note that when we only have a training set and no validation set, the NTM model relies on scores on the training set to perform early stopping, which can result in over-training. Therefore, we recommend always supplying a validation set to the model.
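To make the mechanism concrete, here is a minimal sketch of patience-based early stopping. This is illustrative only, with a hypothetical helper should_stop; NTM implements this internally, controlled by the num_patience_epochs and tolerance hyperparameters we set below:
def should_stop(val_losses, patience=5, tolerance=0.001):
    # Illustrative sketch, not NTM's actual implementation: stop when the best
    # loss of the last `patience` epochs has not improved on the earlier best
    # by at least a relative `tolerance`.
    if len(val_losses) <= patience:
        return False
    best_earlier = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return (best_earlier - best_recent) / abs(best_earlier) < tolerance

print(should_stop([6.49, 6.37, 6.35, 6.346, 6.3459, 6.3458, 6.3457, 6.3456]))  # True: the loss has plateaued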
Here we use 80% of the data set as the training set and split the rest evenly between the validation set and the test set. We will use the validation set in training and the test set for demonstrating model inference.
vectors = sparse.csr_matrix(vectors, dtype=np.float32)
print(type(vectors), vectors.dtype)
n_train = int(0.8 * vectors.shape[0])
# split train and test
train_vectors = vectors[:n_train, :]
test_vectors = vectors[n_train:, :]
# further split test set into validation set (val_vectors) and test set (test_vectors)
n_test = test_vectors.shape[0]
val_vectors = test_vectors[:n_test//2, :]
test_vectors = test_vectors[n_test//2:, :]
print(train_vectors.shape, test_vectors.shape, val_vectors.shape)
<class 'scipy.sparse.csr.csr_matrix'> float32 (13928, 2000) (1741, 2000) (1741, 2000)
A SageMaker training job needs access to training data stored in an S3 bucket. The NTM algorithm, like the other first-party SageMaker algorithms, accepts data in RecordIO protobuf format. The SageMaker Python API provides helper functions for converting your data into this format. Here we define a helper function that converts the data to RecordIO protobuf format and uploads it to Amazon S3. In addition, we have the option to split the data into several parts, specified by n_parts.
The algorithm inherently supports multiple files in the training folder ("channel"), which can be very helpful for large data sets. In addition, when we use distributed training with multiple workers (compute instances), having multiple files lets us conveniently distribute different portions of the training data to different workers.
bucket = sagemaker_session.default_bucket()
prefix = 'twitter_analysis'
train_prefix = os.path.join(prefix, 'train')
val_prefix = os.path.join(prefix, 'val')
output_prefix = os.path.join(prefix, 'output')
s3_train_data = os.path.join('s3://', bucket, train_prefix)
s3_val_data = os.path.join('s3://', bucket, val_prefix)
output_path = os.path.join('s3://', bucket, output_prefix)
print('Training set location', s3_train_data)
print('Validation set location', s3_val_data)
print('Trained model will be saved at', output_path)
INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-951232522638
Training set location s3://sagemaker-us-east-1-951232522638/twitter_analysis/train Validation set location s3://sagemaker-us-east-1-951232522638/twitter_analysis/val Trained model will be saved at s3://sagemaker-us-east-1-951232522638/twitter_analysis/output
def split_convert_upload(sparray, bucket, prefix, fname_template='data_part{}.pbr', n_parts=2):
import io
import sagemaker.amazon.common as smac
chunk_size = sparray.shape[0]// n_parts
for i in range(n_parts):
# Calculate start and end indices
start = i*chunk_size
end = (i+1)*chunk_size
if i+1 == n_parts:
end = sparray.shape[0]
# Convert to record protobuf
buf = io.BytesIO()
smac.write_spmatrix_to_sparse_tensor(array=sparray[start:end], file=buf, labels=None)
buf.seek(0)
# Upload to s3 location specified by bucket and prefix
fname = os.path.join(prefix, fname_template.format(i))
boto3.resource('s3').Bucket(bucket).Object(fname).upload_fileobj(buf)
print('Uploaded data to s3://{}'.format(os.path.join(bucket, fname)))
split_convert_upload(train_vectors, bucket=bucket, prefix=train_prefix, fname_template='train_part{}.pbr', n_parts=8)
split_convert_upload(val_vectors, bucket=bucket, prefix=val_prefix, fname_template='val_part{}.pbr', n_parts=1)
Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part0.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part1.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part2.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part3.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part4.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part5.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part6.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/train/train_part7.pbr Uploaded data to s3://sagemaker-us-east-1-951232522638/twitter_analysis/val/val_part0.pbr
The code in the cell below automatically chooses an algorithm container based on the current Region the SageMaker session is executing in.
from sagemaker.amazon.amazon_estimator import get_image_uri
# select the algorithm container based on this notebook's current location
region_name = boto3.Session().region_name
container = get_image_uri(region_name, 'ntm')
print('Using SageMaker NTM container: {} ({})'.format(container, region_name))
Using SageMaker NTM container: 382416733822.dkr.ecr.us-east-1.amazonaws.com/ntm:1 (us-east-1)
In the API call to sagemaker.estimator.Estimator we also specify the type and count of instances for the training job. Because our tweet data set is relatively small, we have chosen a CPU-only instance (ml.c4.xlarge), but feel free to change to other instance types: https://aws.amazon.com/sagemaker/pricing/instance-types/. NTM takes full advantage of GPU hardware and in general trains roughly an order of magnitude faster on a GPU than on a CPU. Multi-GPU or multi-instance training further improves training speed roughly linearly if communication overhead is low compared to compute time.
Next, we specify hyperparameters specific to NTM. Then, we need to specify how the training data and validation data will be distributed to the workers during training. There are two modes for data channels:
FullyReplicated: all data files are copied to all workers.
ShardedByS3Key: data files are sharded across workers, so each worker receives a different portion of the full data set.
We want to have each worker go through a different portion of the full data set to provide different gradients within epochs. We specify distribution to be ShardedByS3Key for the training data channel as follows.
from sagemaker.session import s3_input
ntm = sagemaker.estimator.Estimator(container,
role,
train_instance_count=2,
train_instance_type='ml.c4.xlarge',
output_path=output_path,
sagemaker_session=sagemaker_session)
num_topics = 20
ntm.set_hyperparameters(num_topics=num_topics, feature_dim=vocab_size, mini_batch_size=128,
epochs=100, num_patience_epochs=5, tolerance=0.001)
s3_train = s3_input(s3_train_data, distribution='ShardedByS3Key')
Now we are ready to train. The following cell takes a few minutes to run. It will first provision the required hardware; you will see a series of dots indicating the progress of the provisioning process. After the resources are allocated, training logs will be displayed. With multiple workers, the log color and the ID following INFO identify the logs emitted by different workers.
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return true;
}
ntm.fit({'train': s3_train, 'validation': s3_val_data})
INFO:sagemaker:Creating training-job with name: ntm-2018-09-13-14-21-11-945
....................... Docker entrypoint called with argument(s): train [09/13/2018 14:24:53 INFO 140284443182912] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'} [09/13/2018 14:24:53 INFO 140284443182912] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'num_patience_epochs': u'5', u'num_topics': u'20', u'epochs': u'100', u'feature_dim': u'2000', u'mini_batch_size': u'128', u'tolerance': u'0.001'} [09/13/2018 14:24:53 INFO 140284443182912] Final configuration: {u'optimizer': u'adadelta', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'learning_rate': u'0.01', u'clip_gradient': u'Inf', u'feature_dim': u'2000', u'encoder_layers_activation': u'sigmoid', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0', u'num_patience_epochs': u'5', u'epochs': u'100', u'mini_batch_size': u'128', u'num_topics': u'20', u'_num_gpus': u'auto', u'_data_format': u'record', u'_kvstore': u'auto', u'encoder_layers': u'auto', u'tolerance': u'0.001', u'batch_norm': u'false'} [09/13/2018 14:24:57 INFO 140284443182912] Launching parameter server for role server [09/13/2018 14:24:57 INFO 140284443182912] {'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/16f36818-9594-4033-a05f-d65aed113c16', 'PWD': '/'} [09/13/2018 14:24:57 INFO 140284443182912] envs={'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'DMLC_NUM_WORKER': '2', 'DMLC_PS_ROOT_PORT': '9000', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'DMLC_PS_ROOT_URI': '10.32.0.4', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': 
'/v2/credentials/16f36818-9594-4033-a05f-d65aed113c16', 'DMLC_ROLE': 'server', 'PWD': '/', 'DMLC_NUM_SERVER': '2'} [09/13/2018 14:24:57 INFO 140284443182912] Environment: {'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'DMLC_PS_ROOT_PORT': '9000', 'DMLC_NUM_WORKER': '2', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'DMLC_PS_ROOT_URI': '10.32.0.4', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/16f36818-9594-4033-a05f-d65aed113c16', 'DMLC_ROLE': 'worker', 'PWD': '/', 'DMLC_NUM_SERVER': '2'} [09/13/2018 14:24:57 INFO 140284443182912] Using default worker. [09/13/2018 14:24:57 INFO 140284443182912] Initializing /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/config/config_helper.py:122: DeprecationWarning: deprecated warnings.warn("deprecated", DeprecationWarning) [09/13/2018 14:24:57 INFO 140284443182912] nvidia-smi took: 0.0251688957214 secs to identify 0 gpus [09/13/2018 14:24:57 INFO 140284443182912] Number of GPUs being used: 0 [09/13/2018 14:24:57 INFO 140284443182912] Create Store: dist_async Docker entrypoint called with argument(s): train [09/13/2018 14:24:56 INFO 140484794881856] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'} [09/13/2018 14:24:56 INFO 140484794881856] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'num_patience_epochs': u'5', u'num_topics': u'20', u'epochs': u'100', u'feature_dim': u'2000', u'mini_batch_size': u'128', u'tolerance': u'0.001'} [09/13/2018 14:24:56 INFO 140484794881856] Final configuration: {u'optimizer': u'adadelta', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'learning_rate': u'0.01', u'clip_gradient': u'Inf', u'feature_dim': u'2000', u'encoder_layers_activation': u'sigmoid', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0', u'num_patience_epochs': u'5', u'epochs': u'100', u'mini_batch_size': u'128', u'num_topics': u'20', u'_num_gpus': u'auto', u'_data_format': u'record', u'_kvstore': u'auto', u'encoder_layers': u'auto', u'tolerance': u'0.001', u'batch_norm': u'false'} [09/13/2018 14:24:57 INFO 140484794881856] Launching parameter server for role scheduler [09/13/2018 14:24:57 INFO 140484794881856] {'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'PATH': 
'/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/93ca636c-ee9f-420d-9409-cf9f1c93ed54', 'PWD': '/'} [09/13/2018 14:24:57 INFO 140484794881856] envs={'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'DMLC_NUM_WORKER': '2', 'DMLC_PS_ROOT_PORT': '9000', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'DMLC_PS_ROOT_URI': '10.32.0.4', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/93ca636c-ee9f-420d-9409-cf9f1c93ed54', 'DMLC_ROLE': 'scheduler', 'PWD': '/', 'DMLC_NUM_SERVER': '2'} [09/13/2018 14:24:57 INFO 140484794881856] Launching parameter server for role server [09/13/2018 14:24:57 INFO 140484794881856] {'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/93ca636c-ee9f-420d-9409-cf9f1c93ed54', 'PWD': '/'} [09/13/2018 14:24:57 INFO 140484794881856] envs={'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'DMLC_NUM_WORKER': '2', 'DMLC_PS_ROOT_PORT': '9000', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'DMLC_PS_ROOT_URI': '10.32.0.4', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 
'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/93ca636c-ee9f-420d-9409-cf9f1c93ed54', 'DMLC_ROLE': 'server', 'PWD': '/', 'DMLC_NUM_SERVER': '2'} [09/13/2018 14:24:57 INFO 140484794881856] Environment: {'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'DMLC_PS_ROOT_PORT': '9000', 'DMLC_NUM_WORKER': '2', 'SAGEMAKER_HTTP_PORT': '8080', 'HOME': '/root', 'PYTHONUNBUFFERED': 'TRUE', 'CANONICAL_ENVROOT': '/opt/amazon', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib64:/opt/amazon/lib', 'LANG': 'en_US.utf8', 'DMLC_INTERFACE': 'ethwe', 'SHLVL': '1', 'DMLC_PS_ROOT_URI': '10.32.0.4', 'AWS_REGION': 'us-east-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'TRAINING_JOB_NAME': 'ntm-2018-09-13-14-21-11-945', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'ENVROOT': '/opt/amazon', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'OMP_NUM_THREADS': '2', 'HOSTNAME': 'aws', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/93ca636c-ee9f-420d-9409-cf9f1c93ed54', 'DMLC_ROLE': 'worker', 'PWD': '/', 'DMLC_NUM_SERVER': '2'} [09/13/2018 14:24:57 INFO 140484794881856] Using default worker. [09/13/2018 14:24:57 INFO 140484794881856] Initializing /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/config/config_helper.py:122: DeprecationWarning: deprecated warnings.warn("deprecated", DeprecationWarning) [09/13/2018 14:24:57 INFO 140484794881856] nvidia-smi took: 0.0252439975739 secs to identify 0 gpus [09/13/2018 14:24:57 INFO 140484794881856] Number of GPUs being used: 0 [09/13/2018 14:24:57 INFO 140484794881856] Create Store: dist_async #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 0, "sum": 0.0, "min": 0}}, "EndTime": 1536848698.010227, "Dimensions": {"Host": "algo-1", "Meta": "init_train_data_iter", "Operation": "training", "Algorithm": "AWS/NTM"}, "StartTime": 1536848698.010189} [09/13/2018 14:24:58 INFO 140484794881856] [09/13/2018 14:24:58 INFO 140484794881856] # Starting training for epoch 1 [14:24:58] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.200722.0/RHEL5_64/generic-flavor/src/src/ndarray/./../operator/tensor/.././../common/utils.h:416: Storage fallback detected: Copy from csr storage type on cpu to default storage type on cpu. A temporary ndarray with default storage type will be generated in order to perform the copy. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning. 
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 0, "sum": 0.0, "min": 0}}, "EndTime": 1536848698.006476, "Dimensions": {"Host": "algo-2", "Meta": "init_train_data_iter", "Operation": "training", "Algorithm": "AWS/NTM"}, "StartTime": 1536848698.006438} [09/13/2018 14:24:58 INFO 140284443182912] [09/13/2018 14:24:58 INFO 140284443182912] # Starting training for epoch 1 [14:24:58] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.1.x.200722.0/RHEL5_64/generic-flavor/src/src/ndarray/./../operator/tensor/.././../common/utils.h:416: Storage fallback detected: Copy from csr storage type on cpu to default storage type on cpu. A temporary ndarray with default storage type will be generated in order to perform the copy. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning. [09/13/2018 14:24:59 INFO 140484794881856] # Finished training epoch 1 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:24:59 INFO 140484794881856] Metrics for Training: [09/13/2018 14:24:59 INFO 140484794881856] Loss (name: value) total: 6.49372019334 [09/13/2018 14:24:59 INFO 140484794881856] Loss (name: value) kld: 0.0177522907055 [09/13/2018 14:24:59 INFO 140484794881856] Loss (name: value) recons: 6.47596789707 [09/13/2018 14:24:59 INFO 140484794881856] Loss (name: value) logppx: 6.49372019334 [09/13/2018 14:24:59 INFO 140484794881856] #quality_metric: host=algo-1, epoch=1, train total_loss <loss>=6.49372019334 [09/13/2018 14:24:59 INFO 140484794881856] #progress_metric: host=algo-1, completed 1 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Total Records Seen": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 2, "sum": 2.0, "min": 2}}, "EndTime": 1536848699.896089, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 0}, "StartTime": 1536848698.010557} [09/13/2018 14:24:59 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3693.09332563 records/second [09/13/2018 14:24:59 INFO 140484794881856] [09/13/2018 14:24:59 INFO 140484794881856] # Starting training for epoch 2 [09/13/2018 14:25:00 INFO 140284443182912] # Finished training epoch 1 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:00 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:00 INFO 140284443182912] Loss (name: value) total: 6.50361232324 [09/13/2018 14:25:00 INFO 140284443182912] Loss (name: value) kld: 0.0185067007064 [09/13/2018 14:25:00 INFO 140284443182912] Loss (name: value) recons: 6.48510570093 [09/13/2018 14:25:00 INFO 140284443182912] Loss (name: value) logppx: 6.50361232324 [09/13/2018 14:25:00 INFO 140284443182912] #quality_metric: host=algo-2, epoch=1, train total_loss <loss>=6.50361232324 [09/13/2018 14:25:00 INFO 140284443182912] #progress_metric: host=algo-2, completed 1 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Total Records Seen": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 2, "sum": 2.0, "min": 2}}, "EndTime": 1536848700.131166, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 0}, "StartTime": 1536848698.006782} [09/13/2018 14:25:00 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3277.93077492 records/second [09/13/2018 14:25:00 INFO 140284443182912] [09/13/2018 14:25:00 INFO 140284443182912] # Starting training for epoch 2 [09/13/2018 14:25:01 INFO 140484794881856] # Finished training epoch 2 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:01 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:01 INFO 140484794881856] Loss (name: value) total: 6.36649964506 [09/13/2018 14:25:01 INFO 140484794881856] Loss (name: value) kld: 0.00618237608261 [09/13/2018 14:25:01 INFO 140484794881856] Loss (name: value) recons: 6.36031727357 [09/13/2018 14:25:01 INFO 140484794881856] Loss (name: value) logppx: 6.36649964506 [09/13/2018 14:25:01 INFO 140484794881856] #quality_metric: host=algo-1, epoch=2, train total_loss <loss>=6.36649964506 [09/13/2018 14:25:01 INFO 140484794881856] #progress_metric: host=algo-1, completed 2 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 110, "sum": 110.0, "min": 110}, "Total Records Seen": {"count": 1, "max": 13928, "sum": 13928.0, "min": 13928}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 4, "sum": 4.0, "min": 4}}, "EndTime": 1536848701.974664, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 1}, "StartTime": 1536848699.896712} [09/13/2018 14:25:01 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3351.12004341 records/second [09/13/2018 14:25:01 INFO 140484794881856] [09/13/2018 14:25:01 INFO 140484794881856] # Starting training for epoch 3 [09/13/2018 14:25:02 INFO 140284443182912] # Finished training epoch 2 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:02 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:02 INFO 140284443182912] Loss (name: value) total: 6.37899412675 [09/13/2018 14:25:02 INFO 140284443182912] Loss (name: value) kld: 0.0082039967846 [09/13/2018 14:25:02 INFO 140284443182912] Loss (name: value) recons: 6.37079011744 [09/13/2018 14:25:02 INFO 140284443182912] Loss (name: value) logppx: 6.37899412675 [09/13/2018 14:25:02 INFO 140284443182912] #quality_metric: host=algo-2, epoch=2, train total_loss <loss>=6.37899412675 [09/13/2018 14:25:02 INFO 140284443182912] #progress_metric: host=algo-2, completed 2 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 110, "sum": 110.0, "min": 110}, "Total Records Seen": {"count": 1, "max": 13928, "sum": 13928.0, "min": 13928}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 4, "sum": 4.0, "min": 4}}, "EndTime": 1536848702.445969, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 1}, "StartTime": 1536848700.13199} [09/13/2018 14:25:02 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3009.34873597 records/second [09/13/2018 14:25:02 INFO 140284443182912] [09/13/2018 14:25:02 INFO 140284443182912] # Starting training for epoch 3 [09/13/2018 14:25:03 INFO 140484794881856] # Finished training epoch 3 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:03 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:03 INFO 140484794881856] Loss (name: value) total: 6.35107012229 [09/13/2018 14:25:03 INFO 140484794881856] Loss (name: value) kld: 0.0116681698286 [09/13/2018 14:25:03 INFO 140484794881856] Loss (name: value) recons: 6.3394019387 [09/13/2018 14:25:03 INFO 140484794881856] Loss (name: value) logppx: 6.35107012229 [09/13/2018 14:25:03 INFO 140484794881856] #quality_metric: host=algo-1, epoch=3, train total_loss <loss>=6.35107012229 [09/13/2018 14:25:03 INFO 140484794881856] #progress_metric: host=algo-1, completed 3 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 165, "sum": 165.0, "min": 165}, "Total Records Seen": {"count": 1, "max": 20892, "sum": 20892.0, "min": 20892}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 6, "sum": 6.0, "min": 6}}, "EndTime": 1536848703.940802, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 2}, "StartTime": 1536848701.975355} [09/13/2018 14:25:03 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3542.92777465 records/second [09/13/2018 14:25:03 INFO 140484794881856] [09/13/2018 14:25:03 INFO 140484794881856] # Starting training for epoch 4 [09/13/2018 14:25:04 INFO 140284443182912] # Finished training epoch 3 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:04 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:04 INFO 140284443182912] Loss (name: value) total: 6.37052865028 [09/13/2018 14:25:04 INFO 140284443182912] Loss (name: value) kld: 0.0135567044924 [09/13/2018 14:25:04 INFO 140284443182912] Loss (name: value) recons: 6.35697194013 [09/13/2018 14:25:04 INFO 140284443182912] Loss (name: value) logppx: 6.37052865028 [09/13/2018 14:25:04 INFO 140284443182912] #quality_metric: host=algo-2, epoch=3, train total_loss <loss>=6.37052865028 [09/13/2018 14:25:04 INFO 140284443182912] #progress_metric: host=algo-2, completed 3 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 165, "sum": 165.0, "min": 165}, "Total Records Seen": {"count": 1, "max": 20892, "sum": 20892.0, "min": 20892}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 6, "sum": 6.0, "min": 6}}, "EndTime": 1536848704.634559, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 2}, "StartTime": 1536848702.446516} [09/13/2018 14:25:04 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3182.55323482 records/second [09/13/2018 14:25:04 INFO 140284443182912] [09/13/2018 14:25:04 INFO 140284443182912] # Starting training for epoch 4 [09/13/2018 14:25:05 INFO 140484794881856] # Finished training epoch 4 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:05 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:05 INFO 140484794881856] Loss (name: value) total: 6.34659597223 [09/13/2018 14:25:05 INFO 140484794881856] Loss (name: value) kld: 0.0139713355086 [09/13/2018 14:25:05 INFO 140484794881856] Loss (name: value) recons: 6.33262458281 [09/13/2018 14:25:05 INFO 140484794881856] Loss (name: value) logppx: 6.34659597223 [09/13/2018 14:25:05 INFO 140484794881856] #quality_metric: host=algo-1, epoch=4, train total_loss <loss>=6.34659597223 [09/13/2018 14:25:05 INFO 140484794881856] #progress_metric: host=algo-1, completed 4 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 220, "sum": 220.0, "min": 220}, "Total Records Seen": {"count": 1, "max": 27856, "sum": 27856.0, "min": 27856}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 8, "sum": 8.0, "min": 8}}, "EndTime": 1536848705.899805, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 3}, "StartTime": 1536848703.941826} [09/13/2018 14:25:05 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3556.47233936 records/second [09/13/2018 14:25:05 INFO 140484794881856] [09/13/2018 14:25:05 INFO 140484794881856] # Starting training for epoch 5 [09/13/2018 14:25:06 INFO 140284443182912] # Finished training epoch 4 on 6964 examples from 55 batches, each of size 128. 
[per-epoch training log, truncated for readability: hosts algo-1 and algo-2 each print the same block of metrics at every epoch; total training loss falls from ~6.36 at epoch 4 to ~6.30 by epoch 23, at roughly 3,300-3,900 records/second per host]
[09/13/2018 14:25:06 INFO 140284443182912] Metrics for Training:
[09/13/2018 14:25:06 INFO 140284443182912] Loss (name: value) total: 6.36013104699
[09/13/2018 14:25:06 INFO 140284443182912] Loss (name: value) kld: 0.0160940591204
[09/13/2018 14:25:06 INFO 140284443182912] Loss (name: value) recons: 6.34403696927
[09/13/2018 14:25:06 INFO 140284443182912] Loss (name: value) logppx: 6.36013104699
[09/13/2018 14:25:06 INFO 140284443182912] #quality_metric: host=algo-2, epoch=4, train total_loss <loss>=6.36013104699
[09/13/2018 14:25:06 INFO 140284443182912] #progress_metric: host=algo-2, completed 4 % of epochs #metrics {...}
[09/13/2018 14:25:06 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3407.44817906 records/second
[09/13/2018 14:25:06 INFO 140284443182912] # Starting training for epoch 5
...
[09/13/2018 14:25:12 INFO 140284443182912] #quality_metric: host=algo-2, epoch=7, train total_loss <loss>=6.3535577774
[09/13/2018 14:25:12 INFO 140284443182912] patience losses:[6.3789941267533736, 6.3705286502838137, 6.3601310469887471, 6.3627354015003554, 6.3542063192887737] min patience loss:6.35420631929 current loss:6.3535577774 absolute loss difference:0.000648541883988
[09/13/2018 14:25:12 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:1
...
[09/13/2018 14:25:41 INFO 140484794881856] # Finished training epoch 23 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:25:41 INFO 140484794881856] #quality_metric: host=algo-1, epoch=23, train total_loss <loss>=6.30020502264
[09/13/2018 14:25:41 INFO 140484794881856] patience losses:[6.3139394456690008, 6.3160284822637385, 6.3119878812269734, 6.3040579405697912, 6.2991798617623065] min patience loss:6.29917986176 current loss:6.30020502264 absolute loss difference:0.00102516087619
[09/13/2018 14:25:41 INFO 140484794881856] Bad epoch: loss has not improved (enough).
Bad count:1 [09/13/2018 14:25:41 INFO 140484794881856] #progress_metric: host=algo-1, completed 23 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1265, "sum": 1265.0, "min": 1265}, "Total Records Seen": {"count": 1, "max": 160172, "sum": 160172.0, "min": 160172}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 46, "sum": 46.0, "min": 46}}, "EndTime": 1536848741.999757, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 22}, "StartTime": 1536848740.192972} [09/13/2018 14:25:41 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3853.91030533 records/second [09/13/2018 14:25:42 INFO 140484794881856] [09/13/2018 14:25:42 INFO 140484794881856] # Starting training for epoch 24 [09/13/2018 14:25:39 INFO 140284443182912] # Finished training epoch 20 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:39 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:39 INFO 140284443182912] Loss (name: value) total: 6.32399564223 [09/13/2018 14:25:39 INFO 140284443182912] Loss (name: value) kld: 0.0502341237596 [09/13/2018 14:25:39 INFO 140284443182912] Loss (name: value) recons: 6.27376150218 [09/13/2018 14:25:39 INFO 140284443182912] Loss (name: value) logppx: 6.32399564223 [09/13/2018 14:25:39 INFO 140284443182912] #quality_metric: host=algo-2, epoch=20, train total_loss <loss>=6.32399564223 [09/13/2018 14:25:39 INFO 140284443182912] patience losses:[6.3410784808072176, 6.3326433831995184, 6.3376141288063739, 6.3328408978202129, 6.3334259726784445] min patience loss:6.3326433832 current loss:6.32399564223 absolute loss difference:0.00864774097096 [09/13/2018 14:25:39 INFO 140284443182912] #progress_metric: host=algo-2, completed 20 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1100, "sum": 1100.0, "min": 1100}, "Total Records Seen": {"count": 1, "max": 139280, "sum": 139280.0, "min": 139280}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 40, "sum": 40.0, "min": 40}}, "EndTime": 1536848739.508774, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 19}, "StartTime": 1536848737.366052} [09/13/2018 14:25:39 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3249.87002426 records/second [09/13/2018 14:25:39 INFO 140284443182912] [09/13/2018 14:25:39 INFO 140284443182912] # Starting training for epoch 21 [09/13/2018 14:25:41 INFO 140284443182912] # Finished training epoch 21 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:41 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:41 INFO 140284443182912] Loss (name: value) total: 6.31488205303 [09/13/2018 14:25:41 INFO 140284443182912] Loss (name: value) kld: 0.0512273520401 [09/13/2018 14:25:41 INFO 140284443182912] Loss (name: value) recons: 6.2636547262 [09/13/2018 14:25:41 INFO 140284443182912] Loss (name: value) logppx: 6.31488205303 [09/13/2018 14:25:41 INFO 140284443182912] #quality_metric: host=algo-2, epoch=21, train total_loss <loss>=6.31488205303 [09/13/2018 14:25:41 INFO 140284443182912] patience losses:[6.3326433831995184, 6.3376141288063739, 6.3328408978202129, 6.3334259726784445, 6.3239956422285601] min patience loss:6.32399564223 current loss:6.31488205303 absolute loss difference:0.00911358920011 [09/13/2018 14:25:41 INFO 140284443182912] #progress_metric: host=algo-2, completed 21 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1155, "sum": 1155.0, "min": 1155}, "Total Records Seen": {"count": 1, "max": 146244, "sum": 146244.0, "min": 146244}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 42, "sum": 42.0, "min": 42}}, "EndTime": 1536848741.432617, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 20}, "StartTime": 1536848739.509305} [09/13/2018 14:25:41 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3620.55697694 records/second [09/13/2018 14:25:41 INFO 140284443182912] [09/13/2018 14:25:41 INFO 140284443182912] # Starting training for epoch 22 [09/13/2018 14:25:43 INFO 140284443182912] # Finished training epoch 22 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:43 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:43 INFO 140284443182912] Loss (name: value) total: 6.31041726199 [09/13/2018 14:25:43 INFO 140284443182912] Loss (name: value) kld: 0.0535484499213 [09/13/2018 14:25:43 INFO 140284443182912] Loss (name: value) recons: 6.25686877858 [09/13/2018 14:25:43 INFO 140284443182912] Loss (name: value) logppx: 6.31041726199 [09/13/2018 14:25:43 INFO 140284443182912] #quality_metric: host=algo-2, epoch=22, train total_loss <loss>=6.31041726199 [09/13/2018 14:25:43 INFO 140284443182912] patience losses:[6.3376141288063739, 6.3328408978202129, 6.3334259726784445, 6.3239956422285601, 6.3148820530284535] min patience loss:6.31488205303 current loss:6.31041726199 absolute loss difference:0.00446479103782 [09/13/2018 14:25:43 INFO 140284443182912] #progress_metric: host=algo-2, completed 22 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1210, "sum": 1210.0, "min": 1210}, "Total Records Seen": {"count": 1, "max": 153208, "sum": 153208.0, "min": 153208}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 44, "sum": 44.0, "min": 44}}, "EndTime": 1536848743.424153, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 21}, "StartTime": 1536848741.433207} [09/13/2018 14:25:43 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3497.43588265 records/second [09/13/2018 14:25:43 INFO 140284443182912] [09/13/2018 14:25:43 INFO 140284443182912] # Starting training for epoch 23 [09/13/2018 14:25:43 INFO 140484794881856] # Finished training epoch 24 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:43 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:43 INFO 140484794881856] Loss (name: value) total: 6.29289615804 [09/13/2018 14:25:43 INFO 140484794881856] Loss (name: value) kld: 0.0548129537566 [09/13/2018 14:25:43 INFO 140484794881856] Loss (name: value) recons: 6.23808319352 [09/13/2018 14:25:43 INFO 140484794881856] Loss (name: value) logppx: 6.29289615804 [09/13/2018 14:25:43 INFO 140484794881856] #quality_metric: host=algo-1, epoch=24, train total_loss <loss>=6.29289615804 [09/13/2018 14:25:43 INFO 140484794881856] patience losses:[6.3160284822637385, 6.3119878812269734, 6.3040579405697912, 6.2991798617623065, 6.3002050226384947] min patience loss:6.29917986176 current loss:6.29289615804 absolute loss difference:0.00628370371732 [09/13/2018 14:25:43 INFO 140484794881856] #progress_metric: host=algo-1, completed 24 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1320, "sum": 1320.0, "min": 1320}, "Total Records Seen": {"count": 1, "max": 167136, "sum": 167136.0, "min": 167136}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 48, "sum": 48.0, "min": 48}}, "EndTime": 1536848743.867242, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 23}, "StartTime": 1536848742.000386} [09/13/2018 14:25:43 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3729.88462503 records/second [09/13/2018 14:25:43 INFO 140484794881856] [09/13/2018 14:25:43 INFO 140484794881856] # Starting training for epoch 25 [09/13/2018 14:25:45 INFO 140284443182912] # Finished training epoch 23 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:45 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:45 INFO 140284443182912] Loss (name: value) total: 6.30863891081 [09/13/2018 14:25:45 INFO 140284443182912] Loss (name: value) kld: 0.0566446586089 [09/13/2018 14:25:45 INFO 140284443182912] Loss (name: value) recons: 6.25199427171 [09/13/2018 14:25:45 INFO 140284443182912] Loss (name: value) logppx: 6.30863891081 [09/13/2018 14:25:45 INFO 140284443182912] #quality_metric: host=algo-2, epoch=23, train total_loss <loss>=6.30863891081 [09/13/2018 14:25:45 INFO 140284443182912] patience losses:[6.3328408978202129, 6.3334259726784445, 6.3239956422285601, 6.3148820530284535, 6.3104172619906338] min patience loss:6.31041726199 current loss:6.30863891081 absolute loss difference:0.00177835117687 [09/13/2018 14:25:45 INFO 140284443182912] #progress_metric: host=algo-2, completed 23 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1265, "sum": 1265.0, "min": 1265}, "Total Records Seen": {"count": 1, "max": 160172, "sum": 160172.0, "min": 160172}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 46, "sum": 46.0, "min": 46}}, "EndTime": 1536848745.503594, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 22}, "StartTime": 1536848743.42471} [09/13/2018 14:25:45 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3348.85474665 records/second [09/13/2018 14:25:45 INFO 140284443182912] [09/13/2018 14:25:45 INFO 140284443182912] # Starting training for epoch 24 [09/13/2018 14:25:45 INFO 140484794881856] # Finished training epoch 25 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:45 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:45 INFO 140484794881856] Loss (name: value) total: 6.28871207671 [09/13/2018 14:25:45 INFO 140484794881856] Loss (name: value) kld: 0.0598269987343 [09/13/2018 14:25:45 INFO 140484794881856] Loss (name: value) recons: 6.22888509577 [09/13/2018 14:25:45 INFO 140484794881856] Loss (name: value) logppx: 6.28871207671 [09/13/2018 14:25:45 INFO 140484794881856] #quality_metric: host=algo-1, epoch=25, train total_loss <loss>=6.28871207671 [09/13/2018 14:25:45 INFO 140484794881856] patience losses:[6.3119878812269734, 6.3040579405697912, 6.2991798617623065, 6.3002050226384947, 6.2928961580449885] min patience loss:6.29289615804 current loss:6.28871207671 absolute loss difference:0.00418408133767 [09/13/2018 14:25:45 INFO 140484794881856] #progress_metric: host=algo-1, completed 25 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1375, "sum": 1375.0, "min": 1375}, "Total Records Seen": {"count": 1, "max": 174100, "sum": 174100.0, "min": 174100}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 50, "sum": 50.0, "min": 50}}, "EndTime": 1536848745.810272, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 24}, "StartTime": 1536848743.867859} [09/13/2018 14:25:45 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3584.91393202 records/second [09/13/2018 14:25:45 INFO 140484794881856] [09/13/2018 14:25:45 INFO 140484794881856] # Starting training for epoch 26 [09/13/2018 14:25:47 INFO 140284443182912] # Finished training epoch 24 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:47 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:47 INFO 140284443182912] Loss (name: value) total: 6.31349254955 [09/13/2018 14:25:47 INFO 140284443182912] Loss (name: value) kld: 0.0617684100501 [09/13/2018 14:25:47 INFO 140284443182912] Loss (name: value) recons: 6.2517241478 [09/13/2018 14:25:47 INFO 140284443182912] Loss (name: value) logppx: 6.31349254955 [09/13/2018 14:25:47 INFO 140284443182912] #quality_metric: host=algo-2, epoch=24, train total_loss <loss>=6.31349254955 [09/13/2018 14:25:47 INFO 140284443182912] patience losses:[6.3334259726784445, 6.3239956422285601, 6.3148820530284535, 6.3104172619906338, 6.308638910813765] min patience loss:6.30863891081 current loss:6.31349254955 absolute loss difference:0.00485363873568 [09/13/2018 14:25:47 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:25:47 INFO 140284443182912] #progress_metric: host=algo-2, completed 24 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1320, "sum": 1320.0, "min": 1320}, "Total Records Seen": {"count": 1, "max": 167136, "sum": 167136.0, "min": 167136}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 48, "sum": 48.0, "min": 48}}, "EndTime": 1536848747.516104, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 23}, "StartTime": 1536848745.504692} [09/13/2018 14:25:47 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3462.00863809 records/second [09/13/2018 14:25:47 INFO 140284443182912] [09/13/2018 14:25:47 INFO 140284443182912] # Starting training for epoch 25 [09/13/2018 14:25:47 INFO 140484794881856] # Finished training epoch 26 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:47 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:47 INFO 140484794881856] Loss (name: value) total: 6.28518250639 [09/13/2018 14:25:47 INFO 140484794881856] Loss (name: value) kld: 0.0618156318637 [09/13/2018 14:25:47 INFO 140484794881856] Loss (name: value) recons: 6.22336683707 [09/13/2018 14:25:47 INFO 140484794881856] Loss (name: value) logppx: 6.28518250639 [09/13/2018 14:25:47 INFO 140484794881856] #quality_metric: host=algo-1, epoch=26, train total_loss <loss>=6.28518250639 [09/13/2018 14:25:47 INFO 140484794881856] patience losses:[6.3040579405697912, 6.2991798617623065, 6.3002050226384947, 6.2928961580449885, 6.2887120767073199] min patience loss:6.28871207671 current loss:6.28518250639 absolute loss difference:0.00352957031944 [09/13/2018 14:25:47 INFO 140484794881856] #progress_metric: host=algo-1, completed 26 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1430, "sum": 1430.0, "min": 1430}, "Total Records Seen": {"count": 1, "max": 181064, "sum": 181064.0, "min": 181064}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 52, "sum": 52.0, "min": 52}}, "EndTime": 1536848747.719626, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 25}, "StartTime": 1536848745.811022} [09/13/2018 14:25:47 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3648.43996221 records/second [09/13/2018 14:25:47 INFO 140484794881856] [09/13/2018 14:25:47 INFO 140484794881856] # Starting training for epoch 27 [09/13/2018 14:25:49 INFO 140484794881856] # Finished training epoch 27 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:49 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:49 INFO 140484794881856] Loss (name: value) total: 6.27273640633 [09/13/2018 14:25:49 INFO 140484794881856] Loss (name: value) kld: 0.0666181775995 [09/13/2018 14:25:49 INFO 140484794881856] Loss (name: value) recons: 6.20611827157 [09/13/2018 14:25:49 INFO 140484794881856] Loss (name: value) logppx: 6.27273640633 [09/13/2018 14:25:49 INFO 140484794881856] #quality_metric: host=algo-1, epoch=27, train total_loss <loss>=6.27273640633 [09/13/2018 14:25:49 INFO 140484794881856] patience losses:[6.2991798617623065, 6.3002050226384947, 6.2928961580449885, 6.2887120767073199, 6.2851825063878843] min patience loss:6.28518250639 current loss:6.27273640633 absolute loss difference:0.0124461000616 [09/13/2018 14:25:49 INFO 140484794881856] #progress_metric: host=algo-1, completed 27 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1485, "sum": 1485.0, "min": 1485}, "Total Records Seen": {"count": 1, "max": 188028, "sum": 188028.0, "min": 188028}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 54, "sum": 54.0, "min": 54}}, "EndTime": 1536848749.679787, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 26}, "StartTime": 1536848747.720283} [09/13/2018 14:25:49 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3553.61229827 records/second [09/13/2018 14:25:49 INFO 140484794881856] [09/13/2018 14:25:49 INFO 140484794881856] # Starting training for epoch 28 [09/13/2018 14:25:49 INFO 140284443182912] # Finished training epoch 25 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:49 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:49 INFO 140284443182912] Loss (name: value) total: 6.30772269856 [09/13/2018 14:25:49 INFO 140284443182912] Loss (name: value) kld: 0.0656871397726 [09/13/2018 14:25:49 INFO 140284443182912] Loss (name: value) recons: 6.24203557101 [09/13/2018 14:25:49 INFO 140284443182912] Loss (name: value) logppx: 6.30772269856 [09/13/2018 14:25:49 INFO 140284443182912] #quality_metric: host=algo-2, epoch=25, train total_loss <loss>=6.30772269856 [09/13/2018 14:25:49 INFO 140284443182912] patience losses:[6.3239956422285601, 6.3148820530284535, 6.3104172619906338, 6.308638910813765, 6.3134925495494496] min patience loss:6.30863891081 current loss:6.30772269856 absolute loss difference:0.000916212255304 [09/13/2018 14:25:49 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:2 [09/13/2018 14:25:49 INFO 140284443182912] #progress_metric: host=algo-2, completed 25 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1375, "sum": 1375.0, "min": 1375}, "Total Records Seen": {"count": 1, "max": 174100, "sum": 174100.0, "min": 174100}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 50, "sum": 50.0, "min": 50}}, "EndTime": 1536848749.740648, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 24}, "StartTime": 1536848747.516579} [09/13/2018 14:25:49 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3130.86135393 records/second [09/13/2018 14:25:49 INFO 140284443182912] [09/13/2018 14:25:49 INFO 140284443182912] # Starting training for epoch 26 [09/13/2018 14:25:51 INFO 140484794881856] # Finished training epoch 28 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:51 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:51 INFO 140484794881856] Loss (name: value) total: 6.27310105237 [09/13/2018 14:25:51 INFO 140484794881856] Loss (name: value) kld: 0.0695726646618 [09/13/2018 14:25:51 INFO 140484794881856] Loss (name: value) recons: 6.2035283999 [09/13/2018 14:25:51 INFO 140484794881856] Loss (name: value) logppx: 6.27310105237 [09/13/2018 14:25:51 INFO 140484794881856] #quality_metric: host=algo-1, epoch=28, train total_loss <loss>=6.27310105237 [09/13/2018 14:25:51 INFO 140484794881856] patience losses:[6.3002050226384947, 6.2928961580449885, 6.2887120767073199, 6.2851825063878843, 6.2727364063262936] min patience loss:6.27273640633 current loss:6.27310105237 absolute loss difference:0.000364646044645 [09/13/2018 14:25:51 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:25:51 INFO 140484794881856] #progress_metric: host=algo-1, completed 28 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1540, "sum": 1540.0, "min": 1540}, "Total Records Seen": {"count": 1, "max": 194992, "sum": 194992.0, "min": 194992}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 56, "sum": 56.0, "min": 56}}, "EndTime": 1536848751.675387, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 27}, "StartTime": 1536848749.68035} [09/13/2018 14:25:51 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3490.32991659 records/second [09/13/2018 14:25:51 INFO 140484794881856] [09/13/2018 14:25:51 INFO 140484794881856] # Starting training for epoch 29 [09/13/2018 14:25:51 INFO 140284443182912] # Finished training epoch 26 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:51 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:51 INFO 140284443182912] Loss (name: value) total: 6.29684973197 [09/13/2018 14:25:51 INFO 140284443182912] Loss (name: value) kld: 0.0699698300524 [09/13/2018 14:25:51 INFO 140284443182912] Loss (name: value) recons: 6.2268798568 [09/13/2018 14:25:51 INFO 140284443182912] Loss (name: value) logppx: 6.29684973197 [09/13/2018 14:25:51 INFO 140284443182912] #quality_metric: host=algo-2, epoch=26, train total_loss <loss>=6.29684973197 [09/13/2018 14:25:51 INFO 140284443182912] patience losses:[6.3148820530284535, 6.3104172619906338, 6.308638910813765, 6.3134925495494496, 6.3077226985584609] min patience loss:6.30772269856 current loss:6.29684973197 absolute loss difference:0.010872966593 [09/13/2018 14:25:51 INFO 140284443182912] #progress_metric: host=algo-2, completed 26 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1430, "sum": 1430.0, "min": 1430}, "Total Records Seen": {"count": 1, "max": 181064, "sum": 181064.0, "min": 181064}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 52, "sum": 52.0, "min": 52}}, "EndTime": 1536848751.862191, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 25}, "StartTime": 1536848749.741427} [09/13/2018 14:25:51 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3283.31696752 records/second [09/13/2018 14:25:51 INFO 140284443182912] [09/13/2018 14:25:51 INFO 140284443182912] # Starting training for epoch 27 [09/13/2018 14:25:53 INFO 140484794881856] # Finished training epoch 29 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:53 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:53 INFO 140484794881856] Loss (name: value) total: 6.27360991998 [09/13/2018 14:25:53 INFO 140484794881856] Loss (name: value) kld: 0.0787043081766 [09/13/2018 14:25:53 INFO 140484794881856] Loss (name: value) recons: 6.19490559751 [09/13/2018 14:25:53 INFO 140484794881856] Loss (name: value) logppx: 6.27360991998 [09/13/2018 14:25:53 INFO 140484794881856] #quality_metric: host=algo-1, epoch=29, train total_loss <loss>=6.27360991998 [09/13/2018 14:25:53 INFO 140484794881856] patience losses:[6.2928961580449885, 6.2887120767073199, 6.2851825063878843, 6.2727364063262936, 6.2731010523709383] min patience loss:6.27273640633 current loss:6.27360991998 absolute loss difference:0.00087351365523 [09/13/2018 14:25:53 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:2 [09/13/2018 14:25:53 INFO 140484794881856] #progress_metric: host=algo-1, completed 29 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1595, "sum": 1595.0, "min": 1595}, "Total Records Seen": {"count": 1, "max": 201956, "sum": 201956.0, "min": 201956}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 58, "sum": 58.0, "min": 58}}, "EndTime": 1536848753.645034, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 28}, "StartTime": 1536848751.676165} [09/13/2018 14:25:53 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3536.7434545 records/second [09/13/2018 14:25:53 INFO 140484794881856] [09/13/2018 14:25:53 INFO 140484794881856] # Starting training for epoch 30 [09/13/2018 14:25:53 INFO 140284443182912] # Finished training epoch 27 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:53 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:53 INFO 140284443182912] Loss (name: value) total: 6.28508732969 [09/13/2018 14:25:53 INFO 140284443182912] Loss (name: value) kld: 0.0760913582011 [09/13/2018 14:25:53 INFO 140284443182912] Loss (name: value) recons: 6.2089959318 [09/13/2018 14:25:53 INFO 140284443182912] Loss (name: value) logppx: 6.28508732969 [09/13/2018 14:25:53 INFO 140284443182912] #quality_metric: host=algo-2, epoch=27, train total_loss <loss>=6.28508732969 [09/13/2018 14:25:53 INFO 140284443182912] patience losses:[6.3104172619906338, 6.308638910813765, 6.3134925495494496, 6.3077226985584609, 6.2968497319654988] min patience loss:6.29684973197 current loss:6.28508732969 absolute loss difference:0.0117624022744 [09/13/2018 14:25:53 INFO 140284443182912] #progress_metric: host=algo-2, completed 27 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1485, "sum": 1485.0, "min": 1485}, "Total Records Seen": {"count": 1, "max": 188028, "sum": 188028.0, "min": 188028}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 54, "sum": 54.0, "min": 54}}, "EndTime": 1536848753.978916, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 26}, "StartTime": 1536848751.863428} [09/13/2018 14:25:53 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3291.63798345 records/second [09/13/2018 14:25:53 INFO 140284443182912] [09/13/2018 14:25:53 INFO 140284443182912] # Starting training for epoch 28 [09/13/2018 14:25:55 INFO 140484794881856] # Finished training epoch 30 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:55 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:55 INFO 140484794881856] Loss (name: value) total: 6.2597884785 [09/13/2018 14:25:55 INFO 140484794881856] Loss (name: value) kld: 0.082443759184 [09/13/2018 14:25:55 INFO 140484794881856] Loss (name: value) recons: 6.17734470367 [09/13/2018 14:25:55 INFO 140484794881856] Loss (name: value) logppx: 6.2597884785 [09/13/2018 14:25:55 INFO 140484794881856] #quality_metric: host=algo-1, epoch=30, train total_loss <loss>=6.2597884785 [09/13/2018 14:25:55 INFO 140484794881856] patience losses:[6.2887120767073199, 6.2851825063878843, 6.2727364063262936, 6.2731010523709383, 6.2736099199815234] min patience loss:6.27273640633 current loss:6.2597884785 absolute loss difference:0.0129479278218 [09/13/2018 14:25:55 INFO 140484794881856] #progress_metric: host=algo-1, completed 30 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1650, "sum": 1650.0, "min": 1650}, "Total Records Seen": {"count": 1, "max": 208920, "sum": 208920.0, "min": 208920}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 60, "sum": 60.0, "min": 60}}, "EndTime": 1536848755.58164, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 29}, "StartTime": 1536848753.645445} [09/13/2018 14:25:55 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3596.27506152 records/second [09/13/2018 14:25:55 INFO 140484794881856] [09/13/2018 14:25:55 INFO 140484794881856] # Starting training for epoch 31 [09/13/2018 14:25:56 INFO 140284443182912] # Finished training epoch 28 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:56 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:56 INFO 140284443182912] Loss (name: value) total: 6.28400192261 [09/13/2018 14:25:56 INFO 140284443182912] Loss (name: value) kld: 0.0822785799476 [09/13/2018 14:25:56 INFO 140284443182912] Loss (name: value) recons: 6.20172333717 [09/13/2018 14:25:56 INFO 140284443182912] Loss (name: value) logppx: 6.28400192261 [09/13/2018 14:25:56 INFO 140284443182912] #quality_metric: host=algo-2, epoch=28, train total_loss <loss>=6.28400192261 [09/13/2018 14:25:56 INFO 140284443182912] patience losses:[6.308638910813765, 6.3134925495494496, 6.3077226985584609, 6.2968497319654988, 6.2850873296911063] min patience loss:6.28508732969 current loss:6.28400192261 absolute loss difference:0.00108540708368 [09/13/2018 14:25:56 INFO 140284443182912] #progress_metric: host=algo-2, completed 28 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1540, "sum": 1540.0, "min": 1540}, "Total Records Seen": {"count": 1, "max": 194992, "sum": 194992.0, "min": 194992}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 56, "sum": 56.0, "min": 56}}, "EndTime": 1536848756.058768, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 27}, "StartTime": 1536848753.979552} [09/13/2018 14:25:56 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3346.32489712 records/second [09/13/2018 14:25:56 INFO 140284443182912] [09/13/2018 14:25:56 INFO 140284443182912] # Starting training for epoch 29 [09/13/2018 14:25:57 INFO 140484794881856] # Finished training epoch 31 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:25:57 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:57 INFO 140484794881856] Loss (name: value) total: 6.25004741495 [09/13/2018 14:25:57 INFO 140484794881856] Loss (name: value) kld: 0.0916364195672 [09/13/2018 14:25:57 INFO 140484794881856] Loss (name: value) recons: 6.15841100866 [09/13/2018 14:25:57 INFO 140484794881856] Loss (name: value) logppx: 6.25004741495 [09/13/2018 14:25:57 INFO 140484794881856] #quality_metric: host=algo-1, epoch=31, train total_loss <loss>=6.25004741495 [09/13/2018 14:25:57 INFO 140484794881856] patience losses:[6.2851825063878843, 6.2727364063262936, 6.2731010523709383, 6.2736099199815234, 6.2597884785045279] min patience loss:6.2597884785 current loss:6.25004741495 absolute loss difference:0.00974106355147 [09/13/2018 14:25:57 INFO 140484794881856] #progress_metric: host=algo-1, completed 31 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1705, "sum": 1705.0, "min": 1705}, "Total Records Seen": {"count": 1, "max": 215884, "sum": 215884.0, "min": 215884}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 62, "sum": 62.0, "min": 62}}, "EndTime": 1536848757.462913, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 30}, "StartTime": 1536848755.582158} [09/13/2018 14:25:57 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3702.44737083 records/second [09/13/2018 14:25:57 INFO 140484794881856] [09/13/2018 14:25:57 INFO 140484794881856] # Starting training for epoch 32 [09/13/2018 14:25:58 INFO 140284443182912] # Finished training epoch 29 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:58 INFO 140284443182912] Metrics for Training: [09/13/2018 14:25:58 INFO 140284443182912] Loss (name: value) total: 6.28363947868 [09/13/2018 14:25:58 INFO 140284443182912] Loss (name: value) kld: 0.0904350788756 [09/13/2018 14:25:58 INFO 140284443182912] Loss (name: value) recons: 6.1932043379 [09/13/2018 14:25:58 INFO 140284443182912] Loss (name: value) logppx: 6.28363947868 [09/13/2018 14:25:58 INFO 140284443182912] #quality_metric: host=algo-2, epoch=29, train total_loss <loss>=6.28363947868 [09/13/2018 14:25:58 INFO 140284443182912] patience losses:[6.3134925495494496, 6.3077226985584609, 6.2968497319654988, 6.2850873296911063, 6.2840019226074215] min patience loss:6.28400192261 current loss:6.28363947868 absolute loss difference:0.00036244392395 [09/13/2018 14:25:58 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:25:58 INFO 140284443182912] #progress_metric: host=algo-2, completed 29 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1595, "sum": 1595.0, "min": 1595}, "Total Records Seen": {"count": 1, "max": 201956, "sum": 201956.0, "min": 201956}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 58, "sum": 58.0, "min": 58}}, "EndTime": 1536848758.158069, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 28}, "StartTime": 1536848756.061358} [09/13/2018 14:25:58 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3321.16896519 records/second [09/13/2018 14:25:58 INFO 140284443182912] [09/13/2018 14:25:58 INFO 140284443182912] # Starting training for epoch 30 [09/13/2018 14:25:59 INFO 140484794881856] # Finished training epoch 32 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:25:59 INFO 140484794881856] Metrics for Training: [09/13/2018 14:25:59 INFO 140484794881856] Loss (name: value) total: 6.23616972403 [09/13/2018 14:25:59 INFO 140484794881856] Loss (name: value) kld: 0.0958615606481 [09/13/2018 14:25:59 INFO 140484794881856] Loss (name: value) recons: 6.14030815905 [09/13/2018 14:25:59 INFO 140484794881856] Loss (name: value) logppx: 6.23616972403 [09/13/2018 14:25:59 INFO 140484794881856] #quality_metric: host=algo-1, epoch=32, train total_loss <loss>=6.23616972403 [09/13/2018 14:25:59 INFO 140484794881856] patience losses:[6.2727364063262936, 6.2731010523709383, 6.2736099199815234, 6.2597884785045279, 6.250047414953058] min patience loss:6.25004741495 current loss:6.23616972403 absolute loss difference:0.0138776909221 [09/13/2018 14:25:59 INFO 140484794881856] #progress_metric: host=algo-1, completed 32 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1760, "sum": 1760.0, "min": 1760}, "Total Records Seen": {"count": 1, "max": 222848, "sum": 222848.0, "min": 222848}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 64, "sum": 64.0, "min": 64}}, "EndTime": 1536848759.408223, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 31}, "StartTime": 1536848757.463389} [09/13/2018 14:25:59 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3580.46284783 records/second [09/13/2018 14:25:59 INFO 140484794881856] [09/13/2018 14:25:59 INFO 140484794881856] # Starting training for epoch 33 [09/13/2018 14:26:00 INFO 140284443182912] # Finished training epoch 30 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:00 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:00 INFO 140284443182912] Loss (name: value) total: 6.26987674453 [09/13/2018 14:26:00 INFO 140284443182912] Loss (name: value) kld: 0.0964024115692 [09/13/2018 14:26:00 INFO 140284443182912] Loss (name: value) recons: 6.1734743205 [09/13/2018 14:26:00 INFO 140284443182912] Loss (name: value) logppx: 6.26987674453 [09/13/2018 14:26:00 INFO 140284443182912] #quality_metric: host=algo-2, epoch=30, train total_loss <loss>=6.26987674453 [09/13/2018 14:26:00 INFO 140284443182912] patience losses:[6.3077226985584609, 6.2968497319654988, 6.2850873296911063, 6.2840019226074215, 6.2836394786834715] min patience loss:6.28363947868 current loss:6.26987674453 absolute loss difference:0.0137627341531 [09/13/2018 14:26:00 INFO 140284443182912] #progress_metric: host=algo-2, completed 30 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1650, "sum": 1650.0, "min": 1650}, "Total Records Seen": {"count": 1, "max": 208920, "sum": 208920.0, "min": 208920}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 60, "sum": 60.0, "min": 60}}, "EndTime": 1536848760.268186, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 29}, "StartTime": 1536848758.158546} [09/13/2018 14:26:00 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3300.54421115 records/second [09/13/2018 14:26:00 INFO 140284443182912] [09/13/2018 14:26:00 INFO 140284443182912] # Starting training for epoch 31 [09/13/2018 14:26:01 INFO 140484794881856] # Finished training epoch 33 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:01 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:01 INFO 140484794881856] Loss (name: value) total: 6.23841841004 [09/13/2018 14:26:01 INFO 140484794881856] Loss (name: value) kld: 0.10127014064 [09/13/2018 14:26:01 INFO 140484794881856] Loss (name: value) recons: 6.13714828925 [09/13/2018 14:26:01 INFO 140484794881856] Loss (name: value) logppx: 6.23841841004 [09/13/2018 14:26:01 INFO 140484794881856] #quality_metric: host=algo-1, epoch=33, train total_loss <loss>=6.23841841004 [09/13/2018 14:26:01 INFO 140484794881856] patience losses:[6.2731010523709383, 6.2736099199815234, 6.2597884785045279, 6.250047414953058, 6.236169724030928] min patience loss:6.23616972403 current loss:6.23841841004 absolute loss difference:0.00224868601019 [09/13/2018 14:26:01 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:26:01 INFO 140484794881856] #progress_metric: host=algo-1, completed 33 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1815, "sum": 1815.0, "min": 1815}, "Total Records Seen": {"count": 1, "max": 229812, "sum": 229812.0, "min": 229812}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 66, "sum": 66.0, "min": 66}}, "EndTime": 1536848761.339573, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 32}, "StartTime": 1536848759.408674} [09/13/2018 14:26:01 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3606.22163377 records/second [09/13/2018 14:26:01 INFO 140484794881856] [09/13/2018 14:26:01 INFO 140484794881856] # Starting training for epoch 34 [09/13/2018 14:26:02 INFO 140284443182912] # Finished training epoch 31 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:02 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:02 INFO 140284443182912] Loss (name: value) total: 6.25825066133 [09/13/2018 14:26:02 INFO 140284443182912] Loss (name: value) kld: 0.0997548668222 [09/13/2018 14:26:02 INFO 140284443182912] Loss (name: value) recons: 6.15849578597 [09/13/2018 14:26:02 INFO 140284443182912] Loss (name: value) logppx: 6.25825066133 [09/13/2018 14:26:02 INFO 140284443182912] #quality_metric: host=algo-2, epoch=31, train total_loss <loss>=6.25825066133 [09/13/2018 14:26:02 INFO 140284443182912] patience losses:[6.2968497319654988, 6.2850873296911063, 6.2840019226074215, 6.2836394786834715, 6.2698767445304178] min patience loss:6.26987674453 current loss:6.25825066133 absolute loss difference:0.0116260832006 [09/13/2018 14:26:02 INFO 140284443182912] #progress_metric: host=algo-2, completed 31 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1705, "sum": 1705.0, "min": 1705}, "Total Records Seen": {"count": 1, "max": 215884, "sum": 215884.0, "min": 215884}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 62, "sum": 62.0, "min": 62}}, "EndTime": 1536848762.314146, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 30}, "StartTime": 1536848760.269029} [09/13/2018 14:26:02 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3404.94098949 records/second [09/13/2018 14:26:02 INFO 140284443182912] [09/13/2018 14:26:02 INFO 140284443182912] # Starting training for epoch 32 [09/13/2018 14:26:03 INFO 140484794881856] # Finished training epoch 34 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:03 INFO 140484794881856] Metrics for Training:
[09/13/2018 14:26:03 INFO 140484794881856] Loss (name: value) total: 6.22654731057
[09/13/2018 14:26:03 INFO 140484794881856] Loss (name: value) kld: 0.105929463154
[09/13/2018 14:26:03 INFO 140484794881856] Loss (name: value) recons: 6.12061779282
[09/13/2018 14:26:03 INFO 140484794881856] Loss (name: value) logppx: 6.22654731057
[09/13/2018 14:26:03 INFO 140484794881856] #quality_metric: host=algo-1, epoch=34, train total_loss <loss>=6.22654731057
[09/13/2018 14:26:03 INFO 140484794881856] patience losses:[6.2736099199815234, 6.2597884785045279, 6.250047414953058, 6.236169724030928, 6.2384184100411151] min patience loss:6.23616972403 current loss:6.22654731057 absolute loss difference:0.00962241346186
[09/13/2018 14:26:03 INFO 140484794881856] #progress_metric: host=algo-1, completed 34 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 1870, "sum": 1870.0, "min": 1870}, "Total Records Seen": {"count": 1, "max": 236776, "sum": 236776.0, "min": 236776}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 68, "sum": 68.0, "min": 68}}, "EndTime": 1536848763.211398, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 33}, "StartTime": 1536848761.340139}
[09/13/2018 14:26:03 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3721.2480853 records/second
[09/13/2018 14:26:03 INFO 140484794881856] # Starting training for epoch 35
[09/13/2018 14:26:04 INFO 140284443182912] # Finished training epoch 32 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:26:04 INFO 140284443182912] #quality_metric: host=algo-2, epoch=32, train total_loss <loss>=6.25114587871
[09/13/2018 14:26:04 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3319.06691294 records/second
[09/13/2018 14:26:04 INFO 140284443182912] # Starting training for epoch 33
[09/13/2018 14:26:05 INFO 140484794881856] # Finished training epoch 35 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:26:05 INFO 140484794881856] #quality_metric: host=algo-1, epoch=35, train total_loss <loss>=6.22624774413
[09/13/2018 14:26:05 INFO 140484794881856] patience losses:[6.2597884785045279, 6.250047414953058, 6.236169724030928, 6.2384184100411151, 6.2265473105690692] min patience loss:6.22654731057 current loss:6.22624774413 absolute loss difference:0.000299566442316
[09/13/2018 14:26:05 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1
[...output truncated: every epoch on both hosts emits the same block - loss components, patience window, a #progress_metric with a counter JSON, and a #throughput_metric. Epochs 36-37 bring algo-1's total_loss to 6.22016 and then 6.22310 (another bad epoch), while epochs 33-34 bring algo-2's to 6.25268 (a bad epoch) and 6.24266...]
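Each "Metrics for Training" block reports the pieces of NTM's variational objective: recons is the reconstruction term, kld is the KL divergence of the latent topic posterior from its prior, total is their sum (for epoch 34 above, 0.105929 + 6.120618 ≈ 6.226547, up to rounding), and logppx repeats total. Rather than eyeballing these numbers in the raw log, you can scrape and plot them. Below is a minimal sketch, assuming (hypothetically - SageMaker does not hand you this variable) that the console output above has been saved into a Python string named log_text:

import re
import matplotlib.pyplot as plt

# Hypothetical: log_text holds the training log above as a single string.
pattern = re.compile(r"host=algo-1, epoch=(\d+), train total_loss <loss>=([\d.]+)")
losses = {int(epoch): float(loss) for epoch, loss in pattern.findall(log_text)}
epochs = sorted(losses)

plt.plot(epochs, [losses[e] for e in epochs], marker='o')
plt.xlabel('epoch')
plt.ylabel('train total_loss')
plt.title('NTM training loss on algo-1')
plt.show()

Swapping algo-1 for algo-2 in the pattern plots the second host's curve for comparison.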
The log continues in the same rhythm, the two hosts interleaving their epochs:

[...output truncated: algo-1 trains epochs 38-40 (train total_loss 6.21771, 6.21432, 6.21075) at roughly 3700-3910 records/second; algo-2 trains epochs 35-36 (6.23727, then 6.23685, flagged as a bad epoch) at roughly 3300-3380 records/second...]
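The throughput figures appear to be records processed per epoch divided by epoch wall time; you can check this against the StartTime and EndTime stamps in the epoch-34 #metrics entry shown earlier. The values below are copied from that log entry:

# Rough sanity check of the logged throughput for epoch 34 on algo-1.
records_per_epoch = 6964
epoch_seconds = 1536848763.211398 - 1536848761.340139  # EndTime - StartTime
print(records_per_epoch / epoch_seconds)  # ~3721.6, close to the logged 3721.25 records/second

Each host trains on 6,964 examples per epoch, and algo-1 runs consistently 10-15% faster than algo-2 across this stretch of the log.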
[...output truncated: algo-1 epochs 41-42 keep improving (6.20119, 6.19550); algo-2 epochs 37-39 log 6.23279, 6.22961, and then 6.23260, another bad epoch...]
[09/13/2018 14:26:19 INFO 140484794881856] # Finished training epoch 43 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:26:19 INFO 140484794881856] #quality_metric: host=algo-1, epoch=43, train total_loss <loss>=6.20355132276
[09/13/2018 14:26:19 INFO 140484794881856] patience losses:[6.2177103042602537, 6.2143240235068582, 6.2107488502155652, 6.2011931852860886, 6.1955048301003197] min patience loss:6.1955048301 current loss:6.20355132276 absolute loss difference:0.0080464926633
[09/13/2018 14:26:19 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1
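The recurring patience losses / Bad epoch / Bad count lines are NTM's early-stopping bookkeeping. The trainer keeps a window of the most recent epoch losses, takes their minimum ("min patience loss"), and flags the current epoch as bad when its loss does not undercut that minimum by at least a tolerance. Note that in the epoch-43 entry above the "absolute loss difference" is sizeable but the loss went up, so the epoch is still bad; enough consecutive bad epochs end training early, which NTM exposes through its tolerance and num_patience_epochs hyperparameters. The following is a minimal sketch of that bookkeeping, not the algorithm's actual implementation - the window of 5, tolerance of 0.001, and patience of 3 are assumptions read off the log:

from collections import deque

def patience_early_stopping(epoch_losses, window=5, tolerance=0.001, patience=3):
    """Illustrative re-creation of the patience bookkeeping in the log above."""
    recent = deque(maxlen=window)        # the "patience losses" window
    bad_count = 0
    for epoch, loss in enumerate(epoch_losses):
        if len(recent) == window:
            best = min(recent)           # "min patience loss"
            if best - loss < tolerance:  # no improvement, or not enough of one
                bad_count += 1
                print(f"epoch {epoch}: bad epoch, bad count {bad_count}")
            else:
                bad_count = 0            # a good epoch resets the count
            if bad_count > patience:     # assumed stop rule; never triggered in this log
                print(f"stopping early after epoch {epoch}")
                return
        recent.append(loss)

# algo-1's losses for epochs 29-37, rounded from the patience windows above;
# indices 6 and 8 reproduce the two bad epochs (35 and 37) seen in the log.
patience_early_stopping([6.2736, 6.2598, 6.2500, 6.2362, 6.2384, 6.2265, 6.2262, 6.2202, 6.2231])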
Training then continues:

[...output truncated: algo-1 epochs 44-49 bring train total_loss from 6.18979 down to 6.17261, with bad epochs at 47 and 48 (bad count reaching 2) before epoch 49 improves again; algo-2 epochs 40-46 move from 6.21860 to 6.19413, with bad epochs at 41, 43, and 44...]
[09/13/2018 14:26:32 INFO 140484794881856] #quality_metric: host=algo-1, epoch=50, train total_loss <loss>=6.17317474972
[09/13/2018 14:26:32 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1
[09/13/2018 14:26:32 INFO 140484794881856] #progress_metric: host=algo-1, completed 50 % of epochs
[09/13/2018 14:26:32 INFO 140484794881856] # Starting training for epoch 51
[09/13/2018 14:26:33 INFO 140284443182912] # Starting training for epoch 47
[09/13/2018 14:26:34 INFO 140484794881856] # Finished training epoch 51 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:26:34 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:34 INFO 140484794881856] Loss (name: value) total: 6.16961873228 [09/13/2018 14:26:34 INFO 140484794881856] Loss (name: value) kld: 0.178340316632 [09/13/2018 14:26:34 INFO 140484794881856] Loss (name: value) recons: 5.9912784273 [09/13/2018 14:26:34 INFO 140484794881856] Loss (name: value) logppx: 6.16961873228 [09/13/2018 14:26:34 INFO 140484794881856] #quality_metric: host=algo-1, epoch=51, train total_loss <loss>=6.16961873228 [09/13/2018 14:26:34 INFO 140484794881856] patience losses:[6.1792257135564634, 6.1809613184495404, 6.1918809587305246, 6.1726064378565004, 6.1731747497211806] min patience loss:6.17260643786 current loss:6.16961873228 absolute loss difference:0.0029877055775 [09/13/2018 14:26:34 INFO 140484794881856] #progress_metric: host=algo-1, completed 51 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2805, "sum": 2805.0, "min": 2805}, "Total Records Seen": {"count": 1, "max": 355164, "sum": 355164.0, "min": 355164}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 102, "sum": 102.0, "min": 102}}, "EndTime": 1536848794.296936, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 50}, "StartTime": 1536848792.465732} [09/13/2018 14:26:34 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3802.60311674 records/second [09/13/2018 14:26:34 INFO 140484794881856] [09/13/2018 14:26:34 INFO 140484794881856] # Starting training for epoch 52 [09/13/2018 14:26:35 INFO 140284443182912] # Finished training epoch 47 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:35 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:35 INFO 140284443182912] Loss (name: value) total: 6.20469079885 [09/13/2018 14:26:35 INFO 140284443182912] Loss (name: value) kld: 0.176543838057 [09/13/2018 14:26:35 INFO 140284443182912] Loss (name: value) recons: 6.02814697352 [09/13/2018 14:26:35 INFO 140284443182912] Loss (name: value) logppx: 6.20469079885 [09/13/2018 14:26:35 INFO 140284443182912] #quality_metric: host=algo-2, epoch=47, train total_loss <loss>=6.20469079885 [09/13/2018 14:26:35 INFO 140284443182912] patience losses:[6.2142245292663576, 6.2156814705241814, 6.2189148816195399, 6.2107704726132482, 6.1941304987127133] min patience loss:6.19413049871 current loss:6.20469079885 absolute loss difference:0.0105603001334 [09/13/2018 14:26:35 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:26:35 INFO 140284443182912] #progress_metric: host=algo-2, completed 47 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2585, "sum": 2585.0, "min": 2585}, "Total Records Seen": {"count": 1, "max": 327308, "sum": 327308.0, "min": 327308}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 94, "sum": 94.0, "min": 94}}, "EndTime": 1536848795.491023, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 46}, "StartTime": 1536848793.420319} [09/13/2018 14:26:35 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3358.88195428 records/second [09/13/2018 14:26:35 INFO 140284443182912] [09/13/2018 14:26:35 INFO 140284443182912] # Starting training for epoch 48 [09/13/2018 14:26:36 INFO 140484794881856] # Finished training epoch 52 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:36 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:36 INFO 140484794881856] Loss (name: value) total: 6.1663816452 [09/13/2018 14:26:36 INFO 140484794881856] Loss (name: value) kld: 0.181519161707 [09/13/2018 14:26:36 INFO 140484794881856] Loss (name: value) recons: 5.98486248797 [09/13/2018 14:26:36 INFO 140484794881856] Loss (name: value) logppx: 6.1663816452 [09/13/2018 14:26:36 INFO 140484794881856] #quality_metric: host=algo-1, epoch=52, train total_loss <loss>=6.1663816452 [09/13/2018 14:26:36 INFO 140484794881856] patience losses:[6.1809613184495404, 6.1918809587305246, 6.1726064378565004, 6.1731747497211806, 6.1696187322789973] min patience loss:6.16961873228 current loss:6.1663816452 absolute loss difference:0.00323708707636 [09/13/2018 14:26:36 INFO 140484794881856] #progress_metric: host=algo-1, completed 52 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2860, "sum": 2860.0, "min": 2860}, "Total Records Seen": {"count": 1, "max": 362128, "sum": 362128.0, "min": 362128}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 104, "sum": 104.0, "min": 104}}, "EndTime": 1536848796.163721, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 51}, "StartTime": 1536848794.297555} [09/13/2018 14:26:36 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3731.38935796 records/second [09/13/2018 14:26:36 INFO 140484794881856] [09/13/2018 14:26:36 INFO 140484794881856] # Starting training for epoch 53 [09/13/2018 14:26:37 INFO 140284443182912] # Finished training epoch 48 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:37 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:37 INFO 140284443182912] Loss (name: value) total: 6.19973328764 [09/13/2018 14:26:37 INFO 140284443182912] Loss (name: value) kld: 0.179210425101 [09/13/2018 14:26:37 INFO 140284443182912] Loss (name: value) recons: 6.0205228242 [09/13/2018 14:26:37 INFO 140284443182912] Loss (name: value) logppx: 6.19973328764 [09/13/2018 14:26:37 INFO 140284443182912] #quality_metric: host=algo-2, epoch=48, train total_loss <loss>=6.19973328764 [09/13/2018 14:26:37 INFO 140284443182912] patience losses:[6.2156814705241814, 6.2189148816195399, 6.2107704726132482, 6.1941304987127133, 6.2046907988461584] min patience loss:6.19413049871 current loss:6.19973328764 absolute loss difference:0.00560278892517 [09/13/2018 14:26:37 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:2 [09/13/2018 14:26:37 INFO 140284443182912] #progress_metric: host=algo-2, completed 48 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2640, "sum": 2640.0, "min": 2640}, "Total Records Seen": {"count": 1, "max": 334272, "sum": 334272.0, "min": 334272}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 96, "sum": 96.0, "min": 96}}, "EndTime": 1536848797.572228, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 47}, "StartTime": 1536848795.491678} [09/13/2018 14:26:37 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3346.98940837 records/second [09/13/2018 14:26:37 INFO 140284443182912] [09/13/2018 14:26:37 INFO 140284443182912] # Starting training for epoch 49 [09/13/2018 14:26:38 INFO 140484794881856] # Finished training epoch 53 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:38 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:38 INFO 140484794881856] Loss (name: value) total: 6.15416082035 [09/13/2018 14:26:38 INFO 140484794881856] Loss (name: value) kld: 0.184989656034 [09/13/2018 14:26:38 INFO 140484794881856] Loss (name: value) recons: 5.96917109489 [09/13/2018 14:26:38 INFO 140484794881856] Loss (name: value) logppx: 6.15416082035 [09/13/2018 14:26:38 INFO 140484794881856] #quality_metric: host=algo-1, epoch=53, train total_loss <loss>=6.15416082035 [09/13/2018 14:26:38 INFO 140484794881856] patience losses:[6.1918809587305246, 6.1726064378565004, 6.1731747497211806, 6.1696187322789973, 6.1663816452026365] min patience loss:6.1663816452 current loss:6.15416082035 absolute loss difference:0.0122208248485 [09/13/2018 14:26:38 INFO 140484794881856] #progress_metric: host=algo-1, completed 53 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2915, "sum": 2915.0, "min": 2915}, "Total Records Seen": {"count": 1, "max": 369092, "sum": 369092.0, "min": 369092}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 106, "sum": 106.0, "min": 106}}, "EndTime": 1536848798.060092, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 52}, "StartTime": 1536848796.164138} [09/13/2018 14:26:38 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3672.61033479 records/second [09/13/2018 14:26:38 INFO 140484794881856] [09/13/2018 14:26:38 INFO 140484794881856] # Starting training for epoch 54 [09/13/2018 14:26:39 INFO 140284443182912] # Finished training epoch 49 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:39 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:39 INFO 140284443182912] Loss (name: value) total: 6.19428766857 [09/13/2018 14:26:39 INFO 140284443182912] Loss (name: value) kld: 0.185465232892 [09/13/2018 14:26:39 INFO 140284443182912] Loss (name: value) recons: 6.00882240642 [09/13/2018 14:26:39 INFO 140284443182912] Loss (name: value) logppx: 6.19428766857 [09/13/2018 14:26:39 INFO 140284443182912] #quality_metric: host=algo-2, epoch=49, train total_loss <loss>=6.19428766857 [09/13/2018 14:26:39 INFO 140284443182912] patience losses:[6.2189148816195399, 6.2107704726132482, 6.1941304987127133, 6.2046907988461584, 6.1997332876378843] min patience loss:6.19413049871 current loss:6.19428766857 absolute loss difference:0.000157169862227 [09/13/2018 14:26:39 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:3 [09/13/2018 14:26:39 INFO 140284443182912] #progress_metric: host=algo-2, completed 49 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2695, "sum": 2695.0, "min": 2695}, "Total Records Seen": {"count": 1, "max": 341236, "sum": 341236.0, "min": 341236}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 98, "sum": 98.0, "min": 98}}, "EndTime": 1536848799.621119, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 48}, "StartTime": 1536848797.572725} [09/13/2018 14:26:39 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3399.35323738 records/second [09/13/2018 14:26:39 INFO 140284443182912] [09/13/2018 14:26:39 INFO 140284443182912] # Starting training for epoch 50 [09/13/2018 14:26:39 INFO 140484794881856] # Finished training epoch 54 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:39 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:39 INFO 140484794881856] Loss (name: value) total: 6.16028831655 [09/13/2018 14:26:39 INFO 140484794881856] Loss (name: value) kld: 0.18847534643 [09/13/2018 14:26:39 INFO 140484794881856] Loss (name: value) recons: 5.97181296349 [09/13/2018 14:26:39 INFO 140484794881856] Loss (name: value) logppx: 6.16028831655 [09/13/2018 14:26:39 INFO 140484794881856] #quality_metric: host=algo-1, epoch=54, train total_loss <loss>=6.16028831655 [09/13/2018 14:26:39 INFO 140484794881856] patience losses:[6.1726064378565004, 6.1731747497211806, 6.1696187322789973, 6.1663816452026365, 6.1541608203541145] min patience loss:6.15416082035 current loss:6.16028831655 absolute loss difference:0.00612749619917 [09/13/2018 14:26:39 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:26:39 INFO 140484794881856] #progress_metric: host=algo-1, completed 54 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2970, "sum": 2970.0, "min": 2970}, "Total Records Seen": {"count": 1, "max": 376056, "sum": 376056.0, "min": 376056}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 108, "sum": 108.0, "min": 108}}, "EndTime": 1536848799.948878, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 53}, "StartTime": 1536848798.06096} [09/13/2018 14:26:39 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3688.3799773 records/second [09/13/2018 14:26:39 INFO 140484794881856] [09/13/2018 14:26:39 INFO 140484794881856] # Starting training for epoch 55 [09/13/2018 14:26:41 INFO 140484794881856] # Finished training epoch 55 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:41 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:41 INFO 140484794881856] Loss (name: value) total: 6.15363258015 [09/13/2018 14:26:41 INFO 140484794881856] Loss (name: value) kld: 0.192609959841 [09/13/2018 14:26:41 INFO 140484794881856] Loss (name: value) recons: 5.96102260243 [09/13/2018 14:26:41 INFO 140484794881856] Loss (name: value) logppx: 6.15363258015 [09/13/2018 14:26:41 INFO 140484794881856] #quality_metric: host=algo-1, epoch=55, train total_loss <loss>=6.15363258015 [09/13/2018 14:26:41 INFO 140484794881856] patience losses:[6.1731747497211806, 6.1696187322789973, 6.1663816452026365, 6.1541608203541145, 6.1602883165532889] min patience loss:6.15416082035 current loss:6.15363258015 absolute loss difference:0.000528240203857 [09/13/2018 14:26:41 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:2 [09/13/2018 14:26:41 INFO 140484794881856] #progress_metric: host=algo-1, completed 55 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3025, "sum": 3025.0, "min": 3025}, "Total Records Seen": {"count": 1, "max": 383020, "sum": 383020.0, "min": 383020}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 110, "sum": 110.0, "min": 110}}, "EndTime": 1536848801.801505, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 54}, "StartTime": 1536848799.949469} [09/13/2018 14:26:41 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3759.837113 records/second [09/13/2018 14:26:41 INFO 140484794881856] [09/13/2018 14:26:41 INFO 140484794881856] # Starting training for epoch 56 [09/13/2018 14:26:41 INFO 140284443182912] # Finished training epoch 50 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:41 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:41 INFO 140284443182912] Loss (name: value) total: 6.18576969667 [09/13/2018 14:26:41 INFO 140284443182912] Loss (name: value) kld: 0.187014301799 [09/13/2018 14:26:41 INFO 140284443182912] Loss (name: value) recons: 5.99875538566 [09/13/2018 14:26:41 INFO 140284443182912] Loss (name: value) logppx: 6.18576969667 [09/13/2018 14:26:41 INFO 140284443182912] #quality_metric: host=algo-2, epoch=50, train total_loss <loss>=6.18576969667 [09/13/2018 14:26:41 INFO 140284443182912] patience losses:[6.2107704726132482, 6.1941304987127133, 6.2046907988461584, 6.1997332876378843, 6.1942876685749404] min patience loss:6.19413049871 current loss:6.18576969667 absolute loss difference:0.00836080204357 [09/13/2018 14:26:41 INFO 140284443182912] #progress_metric: host=algo-2, completed 50 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2750, "sum": 2750.0, "min": 2750}, "Total Records Seen": {"count": 1, "max": 348200, "sum": 348200.0, "min": 348200}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 100, "sum": 100.0, "min": 100}}, "EndTime": 1536848801.686775, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 49}, "StartTime": 1536848799.62222} [09/13/2018 14:26:41 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3372.83509534 records/second [09/13/2018 14:26:41 INFO 140284443182912] [09/13/2018 14:26:41 INFO 140284443182912] # Starting training for epoch 51 [09/13/2018 14:26:43 INFO 140484794881856] # Finished training epoch 56 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:43 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:43 INFO 140484794881856] Loss (name: value) total: 6.15707529675 [09/13/2018 14:26:43 INFO 140484794881856] Loss (name: value) kld: 0.198419330066 [09/13/2018 14:26:43 INFO 140484794881856] Loss (name: value) recons: 5.95865600326 [09/13/2018 14:26:43 INFO 140484794881856] Loss (name: value) logppx: 6.15707529675 [09/13/2018 14:26:43 INFO 140484794881856] #quality_metric: host=algo-1, epoch=56, train total_loss <loss>=6.15707529675 [09/13/2018 14:26:43 INFO 140484794881856] patience losses:[6.1696187322789973, 6.1663816452026365, 6.1541608203541145, 6.1602883165532889, 6.1536325801502576] min patience loss:6.15363258015 current loss:6.15707529675 absolute loss difference:0.00344271659851 [09/13/2018 14:26:43 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:3 [09/13/2018 14:26:43 INFO 140484794881856] #progress_metric: host=algo-1, completed 56 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3080, "sum": 3080.0, "min": 3080}, "Total Records Seen": {"count": 1, "max": 389984, "sum": 389984.0, "min": 389984}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 112, "sum": 112.0, "min": 112}}, "EndTime": 1536848803.574446, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 55}, "StartTime": 1536848801.802619} [09/13/2018 14:26:43 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3930.0361566 records/second [09/13/2018 14:26:43 INFO 140484794881856] [09/13/2018 14:26:43 INFO 140484794881856] # Starting training for epoch 57 [09/13/2018 14:26:43 INFO 140284443182912] # Finished training epoch 51 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:43 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:43 INFO 140284443182912] Loss (name: value) total: 6.17739919316 [09/13/2018 14:26:43 INFO 140284443182912] Loss (name: value) kld: 0.194248574566 [09/13/2018 14:26:43 INFO 140284443182912] Loss (name: value) recons: 5.98315064257 [09/13/2018 14:26:43 INFO 140284443182912] Loss (name: value) logppx: 6.17739919316 [09/13/2018 14:26:43 INFO 140284443182912] #quality_metric: host=algo-2, epoch=51, train total_loss <loss>=6.17739919316 [09/13/2018 14:26:43 INFO 140284443182912] patience losses:[6.1941304987127133, 6.2046907988461584, 6.1997332876378843, 6.1942876685749404, 6.1857696966691451] min patience loss:6.18576969667 current loss:6.17739919316 absolute loss difference:0.0083705035123 [09/13/2018 14:26:43 INFO 140284443182912] #progress_metric: host=algo-2, completed 51 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2805, "sum": 2805.0, "min": 2805}, "Total Records Seen": {"count": 1, "max": 355164, "sum": 355164.0, "min": 355164}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 102, "sum": 102.0, "min": 102}}, "EndTime": 1536848803.808635, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 50}, "StartTime": 1536848801.687342} [09/13/2018 14:26:43 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3282.69741847 records/second [09/13/2018 14:26:43 INFO 140284443182912] [09/13/2018 14:26:43 INFO 140284443182912] # Starting training for epoch 52 [09/13/2018 14:26:45 INFO 140484794881856] # Finished training epoch 57 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:45 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:45 INFO 140484794881856] Loss (name: value) total: 6.14588646889 [09/13/2018 14:26:45 INFO 140484794881856] Loss (name: value) kld: 0.202257077667 [09/13/2018 14:26:45 INFO 140484794881856] Loss (name: value) recons: 5.94362935153 [09/13/2018 14:26:45 INFO 140484794881856] Loss (name: value) logppx: 6.14588646889 [09/13/2018 14:26:45 INFO 140484794881856] #quality_metric: host=algo-1, epoch=57, train total_loss <loss>=6.14588646889 [09/13/2018 14:26:45 INFO 140484794881856] patience losses:[6.1663816452026365, 6.1541608203541145, 6.1602883165532889, 6.1536325801502576, 6.1570752967487685] min patience loss:6.15363258015 current loss:6.14588646889 absolute loss difference:0.00774611126293 [09/13/2018 14:26:45 INFO 140484794881856] #progress_metric: host=algo-1, completed 57 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3135, "sum": 3135.0, "min": 3135}, "Total Records Seen": {"count": 1, "max": 396948, "sum": 396948.0, "min": 396948}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 114, "sum": 114.0, "min": 114}}, "EndTime": 1536848805.40188, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 56}, "StartTime": 1536848803.575054} [09/13/2018 14:26:45 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3811.71538364 records/second [09/13/2018 14:26:45 INFO 140484794881856] [09/13/2018 14:26:45 INFO 140484794881856] # Starting training for epoch 58 [09/13/2018 14:26:45 INFO 140284443182912] # Finished training epoch 52 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:45 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:45 INFO 140284443182912] Loss (name: value) total: 6.18242381269 [09/13/2018 14:26:45 INFO 140284443182912] Loss (name: value) kld: 0.201099661805 [09/13/2018 14:26:45 INFO 140284443182912] Loss (name: value) recons: 5.98132410483 [09/13/2018 14:26:45 INFO 140284443182912] Loss (name: value) logppx: 6.18242381269 [09/13/2018 14:26:45 INFO 140284443182912] #quality_metric: host=algo-2, epoch=52, train total_loss <loss>=6.18242381269 [09/13/2018 14:26:45 INFO 140284443182912] patience losses:[6.2046907988461584, 6.1997332876378843, 6.1942876685749404, 6.1857696966691451, 6.177399193156849] min patience loss:6.17739919316 current loss:6.18242381269 absolute loss difference:0.00502461953597 [09/13/2018 14:26:45 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:26:45 INFO 140284443182912] #progress_metric: host=algo-2, completed 52 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2860, "sum": 2860.0, "min": 2860}, "Total Records Seen": {"count": 1, "max": 362128, "sum": 362128.0, "min": 362128}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 104, "sum": 104.0, "min": 104}}, "EndTime": 1536848805.963278, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 51}, "StartTime": 1536848803.808996} [09/13/2018 14:26:45 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3232.22363181 records/second [09/13/2018 14:26:45 INFO 140284443182912] [09/13/2018 14:26:45 INFO 140284443182912] # Starting training for epoch 53 [09/13/2018 14:26:47 INFO 140484794881856] # Finished training epoch 58 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:47 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:47 INFO 140484794881856] Loss (name: value) total: 6.14047724984 [09/13/2018 14:26:47 INFO 140484794881856] Loss (name: value) kld: 0.204427990317 [09/13/2018 14:26:47 INFO 140484794881856] Loss (name: value) recons: 5.93604926629 [09/13/2018 14:26:47 INFO 140484794881856] Loss (name: value) logppx: 6.14047724984 [09/13/2018 14:26:47 INFO 140484794881856] #quality_metric: host=algo-1, epoch=58, train total_loss <loss>=6.14047724984 [09/13/2018 14:26:47 INFO 140484794881856] patience losses:[6.1541608203541145, 6.1602883165532889, 6.1536325801502576, 6.1570752967487685, 6.1458864688873289] min patience loss:6.14588646889 current loss:6.14047724984 absolute loss difference:0.00540921904824 [09/13/2018 14:26:47 INFO 140484794881856] #progress_metric: host=algo-1, completed 58 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3190, "sum": 3190.0, "min": 3190}, "Total Records Seen": {"count": 1, "max": 403912, "sum": 403912.0, "min": 403912}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 116, "sum": 116.0, "min": 116}}, "EndTime": 1536848807.163549, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 57}, "StartTime": 1536848805.402452} [09/13/2018 14:26:47 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3953.93817108 records/second [09/13/2018 14:26:47 INFO 140484794881856] [09/13/2018 14:26:47 INFO 140484794881856] # Starting training for epoch 59 [09/13/2018 14:26:48 INFO 140284443182912] # Finished training epoch 53 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:48 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:48 INFO 140284443182912] Loss (name: value) total: 6.17095405839 [09/13/2018 14:26:48 INFO 140284443182912] Loss (name: value) kld: 0.203793969615 [09/13/2018 14:26:48 INFO 140284443182912] Loss (name: value) recons: 5.96716010787 [09/13/2018 14:26:48 INFO 140284443182912] Loss (name: value) logppx: 6.17095405839 [09/13/2018 14:26:48 INFO 140284443182912] #quality_metric: host=algo-2, epoch=53, train total_loss <loss>=6.17095405839 [09/13/2018 14:26:48 INFO 140284443182912] patience losses:[6.1997332876378843, 6.1942876685749404, 6.1857696966691451, 6.177399193156849, 6.1824238126928153] min patience loss:6.17739919316 current loss:6.17095405839 absolute loss difference:0.00644513476979 [09/13/2018 14:26:48 INFO 140284443182912] #progress_metric: host=algo-2, completed 53 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2915, "sum": 2915.0, "min": 2915}, "Total Records Seen": {"count": 1, "max": 369092, "sum": 369092.0, "min": 369092}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 106, "sum": 106.0, "min": 106}}, "EndTime": 1536848808.068531, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 52}, "StartTime": 1536848805.964773} [09/13/2018 14:26:48 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3310.04669854 records/second [09/13/2018 14:26:48 INFO 140284443182912] [09/13/2018 14:26:48 INFO 140284443182912] # Starting training for epoch 54 [09/13/2018 14:26:48 INFO 140484794881856] # Finished training epoch 59 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:48 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:48 INFO 140484794881856] Loss (name: value) total: 6.14513987194 [09/13/2018 14:26:48 INFO 140484794881856] Loss (name: value) kld: 0.212347452613 [09/13/2018 14:26:48 INFO 140484794881856] Loss (name: value) recons: 5.93279242082 [09/13/2018 14:26:48 INFO 140484794881856] Loss (name: value) logppx: 6.14513987194 [09/13/2018 14:26:48 INFO 140484794881856] #quality_metric: host=algo-1, epoch=59, train total_loss <loss>=6.14513987194 [09/13/2018 14:26:48 INFO 140484794881856] patience losses:[6.1602883165532889, 6.1536325801502576, 6.1570752967487685, 6.1458864688873289, 6.1404772498390887] min patience loss:6.14047724984 current loss:6.14513987194 absolute loss difference:0.00466262210499 [09/13/2018 14:26:48 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:26:48 INFO 140484794881856] #progress_metric: host=algo-1, completed 59 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3245, "sum": 3245.0, "min": 3245}, "Total Records Seen": {"count": 1, "max": 410876, "sum": 410876.0, "min": 410876}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 118, "sum": 118.0, "min": 118}}, "EndTime": 1536848808.927857, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 58}, "StartTime": 1536848807.164118} [09/13/2018 14:26:48 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3948.15651114 records/second [09/13/2018 14:26:48 INFO 140484794881856] [09/13/2018 14:26:48 INFO 140484794881856] # Starting training for epoch 60 [09/13/2018 14:26:50 INFO 140284443182912] # Finished training epoch 54 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:50 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:50 INFO 140284443182912] Loss (name: value) total: 6.17578908313 [09/13/2018 14:26:50 INFO 140284443182912] Loss (name: value) kld: 0.207935638048 [09/13/2018 14:26:50 INFO 140284443182912] Loss (name: value) recons: 5.96785345511 [09/13/2018 14:26:50 INFO 140284443182912] Loss (name: value) logppx: 6.17578908313 [09/13/2018 14:26:50 INFO 140284443182912] #quality_metric: host=algo-2, epoch=54, train total_loss <loss>=6.17578908313 [09/13/2018 14:26:50 INFO 140284443182912] patience losses:[6.1942876685749404, 6.1857696966691451, 6.177399193156849, 6.1824238126928153, 6.1709540583870623] min patience loss:6.17095405839 current loss:6.17578908313 absolute loss difference:0.00483502474698 [09/13/2018 14:26:50 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:26:50 INFO 140284443182912] #progress_metric: host=algo-2, completed 54 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 2970, "sum": 2970.0, "min": 2970}, "Total Records Seen": {"count": 1, "max": 376056, "sum": 376056.0, "min": 376056}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 108, "sum": 108.0, "min": 108}}, "EndTime": 1536848810.046637, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 53}, "StartTime": 1536848808.068955} [09/13/2018 14:26:50 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3521.05026801 records/second [09/13/2018 14:26:50 INFO 140284443182912] [09/13/2018 14:26:50 INFO 140284443182912] # Starting training for epoch 55 [09/13/2018 14:26:52 INFO 140284443182912] # Finished training epoch 55 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:52 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:52 INFO 140284443182912] Loss (name: value) total: 6.16596099247 [09/13/2018 14:26:52 INFO 140284443182912] Loss (name: value) kld: 0.214880705286 [09/13/2018 14:26:52 INFO 140284443182912] Loss (name: value) recons: 5.95108028325 [09/13/2018 14:26:52 INFO 140284443182912] Loss (name: value) logppx: 6.16596099247 [09/13/2018 14:26:52 INFO 140284443182912] #quality_metric: host=algo-2, epoch=55, train total_loss <loss>=6.16596099247 [09/13/2018 14:26:52 INFO 140284443182912] patience losses:[6.1857696966691451, 6.177399193156849, 6.1824238126928153, 6.1709540583870623, 6.1757890831340445] min patience loss:6.17095405839 current loss:6.16596099247 absolute loss difference:0.00499306592074 [09/13/2018 14:26:52 INFO 140284443182912] #progress_metric: host=algo-2, completed 55 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3025, "sum": 3025.0, "min": 3025}, "Total Records Seen": {"count": 1, "max": 383020, "sum": 383020.0, "min": 383020}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 110, "sum": 110.0, "min": 110}}, "EndTime": 1536848812.146874, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 54}, "StartTime": 1536848810.047195} [09/13/2018 14:26:52 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3316.48093058 records/second [09/13/2018 14:26:52 INFO 140284443182912] [09/13/2018 14:26:52 INFO 140284443182912] # Starting training for epoch 56 [09/13/2018 14:26:50 INFO 140484794881856] # Finished training epoch 60 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:50 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:50 INFO 140484794881856] Loss (name: value) total: 6.13347271139 [09/13/2018 14:26:50 INFO 140484794881856] Loss (name: value) kld: 0.2137191404 [09/13/2018 14:26:50 INFO 140484794881856] Loss (name: value) recons: 5.91975358183 [09/13/2018 14:26:50 INFO 140484794881856] Loss (name: value) logppx: 6.13347271139 [09/13/2018 14:26:50 INFO 140484794881856] #quality_metric: host=algo-1, epoch=60, train total_loss <loss>=6.13347271139 [09/13/2018 14:26:50 INFO 140484794881856] patience losses:[6.1536325801502576, 6.1570752967487685, 6.1458864688873289, 6.1404772498390887, 6.1451398719440808] min patience loss:6.14047724984 current loss:6.13347271139 absolute loss difference:0.00700453844937 [09/13/2018 14:26:50 INFO 140484794881856] #progress_metric: host=algo-1, completed 60 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3300, "sum": 3300.0, "min": 3300}, "Total Records Seen": {"count": 1, "max": 417840, "sum": 417840.0, "min": 417840}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 120, "sum": 120.0, "min": 120}}, "EndTime": 1536848810.703187, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 59}, "StartTime": 1536848808.92824} [09/13/2018 14:26:50 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3923.16401703 records/second [09/13/2018 14:26:50 INFO 140484794881856] [09/13/2018 14:26:50 INFO 140484794881856] # Starting training for epoch 61 [09/13/2018 14:26:52 INFO 140484794881856] # Finished training epoch 61 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:52 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:52 INFO 140484794881856] Loss (name: value) total: 6.14103993936 [09/13/2018 14:26:52 INFO 140484794881856] Loss (name: value) kld: 0.220058780502 [09/13/2018 14:26:52 INFO 140484794881856] Loss (name: value) recons: 5.92098117742 [09/13/2018 14:26:52 INFO 140484794881856] Loss (name: value) logppx: 6.14103993936 [09/13/2018 14:26:52 INFO 140484794881856] #quality_metric: host=algo-1, epoch=61, train total_loss <loss>=6.14103993936 [09/13/2018 14:26:52 INFO 140484794881856] patience losses:[6.1570752967487685, 6.1458864688873289, 6.1404772498390887, 6.1451398719440808, 6.1334727113897154] min patience loss:6.13347271139 current loss:6.14103993936 absolute loss difference:0.00756722797047 [09/13/2018 14:26:52 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:26:52 INFO 140484794881856] #progress_metric: host=algo-1, completed 61 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3355, "sum": 3355.0, "min": 3355}, "Total Records Seen": {"count": 1, "max": 424804, "sum": 424804.0, "min": 424804}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 122, "sum": 122.0, "min": 122}}, "EndTime": 1536848812.440535, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 60}, "StartTime": 1536848810.703744} [09/13/2018 14:26:52 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=4009.01486289 records/second [09/13/2018 14:26:52 INFO 140484794881856] [09/13/2018 14:26:52 INFO 140484794881856] # Starting training for epoch 62 [09/13/2018 14:26:54 INFO 140484794881856] # Finished training epoch 62 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:26:54 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:54 INFO 140484794881856] Loss (name: value) total: 6.12806441567 [09/13/2018 14:26:54 INFO 140484794881856] Loss (name: value) kld: 0.222316573289 [09/13/2018 14:26:54 INFO 140484794881856] Loss (name: value) recons: 5.90574786446 [09/13/2018 14:26:54 INFO 140484794881856] Loss (name: value) logppx: 6.12806441567 [09/13/2018 14:26:54 INFO 140484794881856] #quality_metric: host=algo-1, epoch=62, train total_loss <loss>=6.12806441567 [09/13/2018 14:26:54 INFO 140484794881856] patience losses:[6.1458864688873289, 6.1404772498390887, 6.1451398719440808, 6.1334727113897154, 6.1410399393601853] min patience loss:6.13347271139 current loss:6.12806441567 absolute loss difference:0.00540829571811 [09/13/2018 14:26:54 INFO 140484794881856] #progress_metric: host=algo-1, completed 62 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3410, "sum": 3410.0, "min": 3410}, "Total Records Seen": {"count": 1, "max": 431768, "sum": 431768.0, "min": 431768}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 124, "sum": 124.0, "min": 124}}, "EndTime": 1536848814.164102, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 61}, "StartTime": 1536848812.441077} [09/13/2018 14:26:54 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=4041.3535098 records/second [09/13/2018 14:26:54 INFO 140484794881856] [09/13/2018 14:26:54 INFO 140484794881856] # Starting training for epoch 63 [09/13/2018 14:26:54 INFO 140284443182912] # Finished training epoch 56 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:54 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:54 INFO 140284443182912] Loss (name: value) total: 6.16024996584 [09/13/2018 14:26:54 INFO 140284443182912] Loss (name: value) kld: 0.219153580747 [09/13/2018 14:26:54 INFO 140284443182912] Loss (name: value) recons: 5.9410963622 [09/13/2018 14:26:54 INFO 140284443182912] Loss (name: value) logppx: 6.16024996584 [09/13/2018 14:26:54 INFO 140284443182912] #quality_metric: host=algo-2, epoch=56, train total_loss <loss>=6.16024996584 [09/13/2018 14:26:54 INFO 140284443182912] patience losses:[6.177399193156849, 6.1824238126928153, 6.1709540583870623, 6.1757890831340445, 6.1659609924663199] min patience loss:6.16596099247 current loss:6.16024996584 absolute loss difference:0.0057110266252 [09/13/2018 14:26:54 INFO 140284443182912] #progress_metric: host=algo-2, completed 56 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3080, "sum": 3080.0, "min": 3080}, "Total Records Seen": {"count": 1, "max": 389984, "sum": 389984.0, "min": 389984}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 112, "sum": 112.0, "min": 112}}, "EndTime": 1536848814.249376, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 55}, "StartTime": 1536848812.147481} [09/13/2018 14:26:54 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3312.992745 records/second [09/13/2018 14:26:54 INFO 140284443182912] [09/13/2018 14:26:54 INFO 140284443182912] # Starting training for epoch 57 [09/13/2018 14:26:55 INFO 140484794881856] # Finished training epoch 63 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:55 INFO 140484794881856] Metrics for Training: [09/13/2018 14:26:55 INFO 140484794881856] Loss (name: value) total: 6.12342215885 [09/13/2018 14:26:55 INFO 140484794881856] Loss (name: value) kld: 0.223743906075 [09/13/2018 14:26:55 INFO 140484794881856] Loss (name: value) recons: 5.89967825629 [09/13/2018 14:26:55 INFO 140484794881856] Loss (name: value) logppx: 6.12342215885 [09/13/2018 14:26:55 INFO 140484794881856] #quality_metric: host=algo-1, epoch=63, train total_loss <loss>=6.12342215885 [09/13/2018 14:26:55 INFO 140484794881856] patience losses:[6.1404772498390887, 6.1451398719440808, 6.1334727113897154, 6.1410399393601853, 6.1280644156716084] min patience loss:6.12806441567 current loss:6.12342215885 absolute loss difference:0.00464225682345 [09/13/2018 14:26:55 INFO 140484794881856] #progress_metric: host=algo-1, completed 63 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3465, "sum": 3465.0, "min": 3465}, "Total Records Seen": {"count": 1, "max": 438732, "sum": 438732.0, "min": 438732}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 126, "sum": 126.0, "min": 126}}, "EndTime": 1536848815.915292, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 62}, "StartTime": 1536848814.164545} [09/13/2018 14:26:55 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3977.39003245 records/second [09/13/2018 14:26:55 INFO 140484794881856] [09/13/2018 14:26:55 INFO 140484794881856] # Starting training for epoch 64 [09/13/2018 14:26:56 INFO 140284443182912] # Finished training epoch 57 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:26:56 INFO 140284443182912] Metrics for Training: [09/13/2018 14:26:56 INFO 140284443182912] Loss (name: value) total: 6.15800114545 [09/13/2018 14:26:56 INFO 140284443182912] Loss (name: value) kld: 0.223075049845 [09/13/2018 14:26:56 INFO 140284443182912] Loss (name: value) recons: 5.93492609804 [09/13/2018 14:26:56 INFO 140284443182912] Loss (name: value) logppx: 6.15800114545 [09/13/2018 14:26:56 INFO 140284443182912] #quality_metric: host=algo-2, epoch=57, train total_loss <loss>=6.15800114545 [09/13/2018 14:26:56 INFO 140284443182912] patience losses:[6.1824238126928153, 6.1709540583870623, 6.1757890831340445, 6.1659609924663199, 6.1602499658411203] min patience loss:6.16024996584 current loss:6.15800114545 absolute loss difference:0.00224882039157 [09/13/2018 14:26:56 INFO 140284443182912] #progress_metric: host=algo-2, completed 57 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3135, "sum": 3135.0, "min": 3135}, "Total Records Seen": {"count": 1, "max": 396948, "sum": 396948.0, "min": 396948}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 114, "sum": 114.0, "min": 114}}, "EndTime": 1536848816.366955, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 56}, "StartTime": 1536848814.249886} [09/13/2018 14:26:56 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3289.24529636 records/second [09/13/2018 14:26:56 INFO 140284443182912] [09/13/2018 14:26:56 INFO 140284443182912] # Starting training for epoch 58 [09/13/2018 14:26:57 INFO 140484794881856] # Finished training epoch 64 on 6964 examples from 55 batches, each of size 128. 
While the job runs, both training hosts (algo-1 and algo-2) stream per-epoch metrics to the console. Each host reports finishing every epoch "on 6964 examples from 55 batches, each of size 128", followed by its loss components, an early-stopping check, and a throughput figure. Condensed to one row per record, the excerpt reads as follows (losses rounded to four decimals, throughput to the nearest record per second):

| time | host | epoch | total loss | kld | recons | bad count | throughput (rec/s) |
|---|---|---|---|---|---|---|---|
| 14:26:57 | algo-1 | 64 | 6.1206 | 0.2296 | 5.8911 | – | 3952 |
| 14:26:58 | algo-2 | 58 | 6.1557 | 0.2289 | 5.9269 | – | 3302 |
| 14:26:59 | algo-1 | 65 | 6.1311 | 0.2375 | 5.8936 | 1 | 3983 |
| 14:27:00 | algo-2 | 59 | 6.1545 | 0.2412 | 5.9133 | – | 3223 |
| 14:27:01 | algo-1 | 66 | 6.1207 | 0.2396 | 5.8811 | 2 | 3862 |
| 14:27:02 | algo-2 | 60 | 6.1492 | 0.2422 | 5.9070 | – | 3376 |
| 14:27:03 | algo-1 | 67 | 6.1177 | 0.2435 | 5.8742 | – | 3828 |
| 14:27:04 | algo-1 | 68 | 6.1075 | 0.2500 | 5.8575 | – | 3866 |
| 14:27:04 | algo-2 | 61 | 6.1599 | 0.2480 | 5.9119 | 1 | 3348 |
| 14:27:06 | algo-1 | 69 | 6.1171 | 0.2519 | 5.8652 | 1 | 3798 |
| 14:27:06 | algo-2 | 62 | 6.1419 | 0.2492 | 5.8927 | – | 3240 |
| 14:27:08 | algo-1 | 70 | 6.1162 | 0.2539 | 5.8623 | 2 | 3940 |
| 14:27:09 | algo-2 | 63 | 6.1430 | 0.2524 | 5.8905 | 1 | 3313 |
| 14:27:10 | algo-1 | 71 | 6.1106 | 0.2593 | 5.8513 | 3 | 4004 |
| 14:27:11 | algo-2 | 64 | 6.1404 | 0.2576 | 5.8827 | – | 3335 |
| 14:27:11 | algo-1 | 72 | 6.1024 | 0.2578 | 5.8447 | – | 3850 |
| 14:27:13 | algo-2 | 65 | 6.1449 | 0.2606 | 5.8843 | 1 | 3373 |
| 14:27:13 | algo-1 | 73 | 6.1092 | 0.2659 | 5.8432 | 1 | 3874 |
| 14:27:15 | algo-2 | 66 | 6.1395 | 0.2665 | 5.8730 | 2 | 3443 |
| 14:27:15 | algo-1 | 74 | 6.1073 | 0.2644 | 5.8429 | 2 | 3761 |
| 14:27:17 | algo-1 | 75 | 6.0977 | 0.2687 | 5.8289 | – | 3811 |
| 14:27:17 | algo-2 | 67 | 6.1253 | 0.2664 | 5.8589 | – | 3297 |
| 14:27:19 | algo-1 | 76 | 6.0990 | 0.2717 | 5.8274 | 1 | 3944 |
| 14:27:19 | algo-2 | 68 | 6.1337 | 0.2701 | 5.8636 | 1 | 3369 |
| 14:27:21 | algo-1 | 77 | 6.0961 | 0.2732 | 5.8229 | – | 3950 |
| 14:27:21 | algo-2 | 69 | 6.1298 | 0.2724 | 5.8574 | 2 | 3404 |
| 14:27:22 | algo-1 | 78 | 6.0825 | 0.2771 | 5.8053 | – | 3754 |
| 14:27:23 | algo-2 | 70 | 6.1364 | 0.2760 | 5.8604 | 3 | 3380 |
| 14:27:24 | algo-1 | 79 | 6.0924 | 0.2805 | 5.8119 | 1 | 3754 |
| 14:27:25 | algo-2 | 71 | 6.1206 | 0.2757 | 5.8450 | – | 3386 |
| 14:27:26 | algo-1 | 80 | 6.0793 | 0.2784 | 5.8009 | – | 3949 |
| 14:27:27 | algo-2 | 72 | 6.1304 | 0.2844 | 5.8460 | 1 | 3342 |
| 14:27:28 | algo-1 | 81 | 6.0985 | 0.2868 | 5.8117 | 1 | 3801 |

In every record, `logppx` duplicates the total loss, and the accompanying `#metrics` JSON blob carries only bookkeeping counters: per epoch on each host, `Total Batches Seen` advances by 55, `Total Records Seen` by 6,964, and `Reset Count` by 2. A "–" in the bad-count column means the epoch improved on the recent best loss; the integers count consecutive epochs flagged "Bad epoch: loss has not improved (enough)" (the early-stopping rule is unpacked below). A short sketch for recovering a table like this from the raw output follows.
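To rebuild a table like the one above directly from the console output, here is a minimal sketch using the `re` and `pandas` imports already loaded in this notebook. The `log_text` variable, and the choice to pull only host, epoch, and total loss, are assumptions for illustration, not part of the lab:

import re
import pandas as pd

def parse_ntm_log(log_text):
    """Condense raw NTM console output into one row per (host, epoch)."""
    # Target the #quality_metric lines, which carry host, epoch, and total loss.
    pattern = (r"#quality_metric: host=(algo-\d+), epoch=(\d+), "
               r"train total_loss <loss>=([\d.]+)")
    rows = [{"host": host, "epoch": int(epoch), "total_loss": float(loss)}
            for host, epoch, loss in re.findall(pattern, log_text)]
    return pd.DataFrame(rows).sort_values(["host", "epoch"]).reset_index(drop=True)

# Example usage, assuming log_text holds the captured job output:
# metrics_df = parse_ntm_log(log_text)
# metrics_df.groupby("host")["total_loss"].min()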
Two components are reported alongside the total loss: `kld`, the KL-divergence term, and `recons`, the reconstruction term, and in every record the total is exactly their sum (spot check, algo-1 at epoch 64: 0.2296 + 5.8911 = 6.1206, which is also the reported `logppx`). Across the excerpt the KL term climbs steadily (roughly 0.23 to 0.29) while the reconstruction term falls (roughly 5.93 to 5.80), so the total still trends down on both hosts: algo-1 improves from 6.1206 at epoch 64 to 6.0793 at epoch 80, and algo-2 from 6.1557 at epoch 58 to 6.1206 at epoch 71. Since NTM is a VAE-style topic model, these quantities are the two terms of the (negative) evidence lower bound, written out next.
Bad count:1 [09/13/2018 14:27:27 INFO 140284443182912] #progress_metric: host=algo-2, completed 72 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 3960, "sum": 3960.0, "min": 3960}, "Total Records Seen": {"count": 1, "max": 501408, "sum": 501408.0, "min": 501408}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 144, "sum": 144.0, "min": 144}}, "EndTime": 1536848847.643222, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 71}, "StartTime": 1536848845.559527} [09/13/2018 14:27:27 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3341.85121353 records/second [09/13/2018 14:27:27 INFO 140284443182912] [09/13/2018 14:27:27 INFO 140284443182912] # Starting training for epoch 73 [09/13/2018 14:27:28 INFO 140484794881856] # Finished training epoch 81 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:28 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:28 INFO 140484794881856] Loss (name: value) total: 6.09848137335 [09/13/2018 14:27:28 INFO 140484794881856] Loss (name: value) kld: 0.286761361902 [09/13/2018 14:27:28 INFO 140484794881856] Loss (name: value) recons: 5.81172006347 [09/13/2018 14:27:28 INFO 140484794881856] Loss (name: value) logppx: 6.09848137335 [09/13/2018 14:27:28 INFO 140484794881856] #quality_metric: host=algo-1, epoch=81, train total_loss <loss>=6.09848137335 [09/13/2018 14:27:28 INFO 140484794881856] patience losses:[6.0990496722134679, 6.0961311080239033, 6.0824785319241608, 6.0923795570026744, 6.0792935024608266] min patience loss:6.07929350246 current loss:6.09848137335 absolute loss difference:0.0191878708926 [09/13/2018 14:27:28 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:27:28 INFO 140484794881856] #progress_metric: host=algo-1, completed 81 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4455, "sum": 4455.0, "min": 4455}, "Total Records Seen": {"count": 1, "max": 564084, "sum": 564084.0, "min": 564084}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 162, "sum": 162.0, "min": 162}}, "EndTime": 1536848848.319571, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 80}, "StartTime": 1536848846.487528} [09/13/2018 14:27:28 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3800.8326759 records/second [09/13/2018 14:27:28 INFO 140484794881856] [09/13/2018 14:27:28 INFO 140484794881856] # Starting training for epoch 82 [09/13/2018 14:27:29 INFO 140284443182912] # Finished training epoch 73 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:29 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:29 INFO 140284443182912] Loss (name: value) total: 6.11216392084 [09/13/2018 14:27:29 INFO 140284443182912] Loss (name: value) kld: 0.28067381233 [09/13/2018 14:27:29 INFO 140284443182912] Loss (name: value) recons: 5.83149013519 [09/13/2018 14:27:29 INFO 140284443182912] Loss (name: value) logppx: 6.11216392084 [09/13/2018 14:27:29 INFO 140284443182912] #quality_metric: host=algo-2, epoch=73, train total_loss <loss>=6.11216392084 [09/13/2018 14:27:29 INFO 140284443182912] patience losses:[6.1337115461176088, 6.1298077409917662, 6.1363678628748115, 6.1206049268895928, 6.1304025823419748] min patience loss:6.12060492689 current loss:6.11216392084 absolute loss difference:0.00844100605358 [09/13/2018 14:27:29 INFO 140284443182912] #progress_metric: host=algo-2, completed 73 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4015, "sum": 4015.0, "min": 4015}, "Total Records Seen": {"count": 1, "max": 508372, "sum": 508372.0, "min": 508372}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 146, "sum": 146.0, "min": 146}}, "EndTime": 1536848849.690909, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 72}, "StartTime": 1536848847.643696} [09/13/2018 14:27:29 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3401.47030979 records/second [09/13/2018 14:27:29 INFO 140284443182912] [09/13/2018 14:27:29 INFO 140284443182912] # Starting training for epoch 74 [09/13/2018 14:27:30 INFO 140484794881856] # Finished training epoch 82 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:30 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:30 INFO 140484794881856] Loss (name: value) total: 6.08955574036 [09/13/2018 14:27:30 INFO 140484794881856] Loss (name: value) kld: 0.284348356995 [09/13/2018 14:27:30 INFO 140484794881856] Loss (name: value) recons: 5.80520738688 [09/13/2018 14:27:30 INFO 140484794881856] Loss (name: value) logppx: 6.08955574036 [09/13/2018 14:27:30 INFO 140484794881856] #quality_metric: host=algo-1, epoch=82, train total_loss <loss>=6.08955574036 [09/13/2018 14:27:30 INFO 140484794881856] patience losses:[6.0961311080239033, 6.0824785319241608, 6.0923795570026744, 6.0792935024608266, 6.0984813733534375] min patience loss:6.07929350246 current loss:6.08955574036 absolute loss difference:0.0102622378956 [09/13/2018 14:27:30 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:2 [09/13/2018 14:27:30 INFO 140484794881856] #progress_metric: host=algo-1, completed 82 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4510, "sum": 4510.0, "min": 4510}, "Total Records Seen": {"count": 1, "max": 571048, "sum": 571048.0, "min": 571048}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 164, "sum": 164.0, "min": 164}}, "EndTime": 1536848850.143767, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 81}, "StartTime": 1536848848.32002} [09/13/2018 14:27:30 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3817.73369099 records/second [09/13/2018 14:27:30 INFO 140484794881856] [09/13/2018 14:27:30 INFO 140484794881856] # Starting training for epoch 83 [09/13/2018 14:27:32 INFO 140484794881856] # Finished training epoch 83 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:32 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:32 INFO 140484794881856] Loss (name: value) total: 6.09455089136 [09/13/2018 14:27:32 INFO 140484794881856] Loss (name: value) kld: 0.289116526734 [09/13/2018 14:27:32 INFO 140484794881856] Loss (name: value) recons: 5.80543436137 [09/13/2018 14:27:32 INFO 140484794881856] Loss (name: value) logppx: 6.09455089136 [09/13/2018 14:27:32 INFO 140484794881856] #quality_metric: host=algo-1, epoch=83, train total_loss <loss>=6.09455089136 [09/13/2018 14:27:32 INFO 140484794881856] patience losses:[6.0824785319241608, 6.0923795570026744, 6.0792935024608266, 6.0984813733534375, 6.0895557403564453] min patience loss:6.07929350246 current loss:6.09455089136 absolute loss difference:0.0152573888952 [09/13/2018 14:27:32 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:3 [09/13/2018 14:27:32 INFO 140484794881856] #progress_metric: host=algo-1, completed 83 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4565, "sum": 4565.0, "min": 4565}, "Total Records Seen": {"count": 1, "max": 578012, "sum": 578012.0, "min": 578012}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 166, "sum": 166.0, "min": 166}}, "EndTime": 1536848852.0466, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 82}, "StartTime": 1536848850.144818} [09/13/2018 14:27:32 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3661.53030887 records/second [09/13/2018 14:27:32 INFO 140484794881856] [09/13/2018 14:27:32 INFO 140484794881856] # Starting training for epoch 84 [09/13/2018 14:27:31 INFO 140284443182912] # Finished training epoch 74 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:31 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:31 INFO 140284443182912] Loss (name: value) total: 6.12429766221 [09/13/2018 14:27:31 INFO 140284443182912] Loss (name: value) kld: 0.286451257088 [09/13/2018 14:27:31 INFO 140284443182912] Loss (name: value) recons: 5.83784638318 [09/13/2018 14:27:31 INFO 140284443182912] Loss (name: value) logppx: 6.12429766221 [09/13/2018 14:27:31 INFO 140284443182912] #quality_metric: host=algo-2, epoch=74, train total_loss <loss>=6.12429766221 [09/13/2018 14:27:31 INFO 140284443182912] patience losses:[6.1298077409917662, 6.1363678628748115, 6.1206049268895928, 6.1304025823419748, 6.1121639208360152] min patience loss:6.11216392084 current loss:6.12429766221 absolute loss difference:0.0121337413788 [09/13/2018 14:27:31 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:27:31 INFO 140284443182912] #progress_metric: host=algo-2, completed 74 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4070, "sum": 4070.0, "min": 4070}, "Total Records Seen": {"count": 1, "max": 515336, "sum": 515336.0, "min": 515336}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 148, "sum": 148.0, "min": 148}}, "EndTime": 1536848851.796891, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 73}, "StartTime": 1536848849.691926} [09/13/2018 14:27:31 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3308.12803843 records/second [09/13/2018 14:27:31 INFO 140284443182912] [09/13/2018 14:27:31 INFO 140284443182912] # Starting training for epoch 75 [09/13/2018 14:27:33 INFO 140484794881856] # Finished training epoch 84 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:33 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:33 INFO 140484794881856] Loss (name: value) total: 6.09020013376 [09/13/2018 14:27:33 INFO 140484794881856] Loss (name: value) kld: 0.295774581487 [09/13/2018 14:27:33 INFO 140484794881856] Loss (name: value) recons: 5.7944255352 [09/13/2018 14:27:33 INFO 140484794881856] Loss (name: value) logppx: 6.09020013376 [09/13/2018 14:27:33 INFO 140484794881856] #quality_metric: host=algo-1, epoch=84, train total_loss <loss>=6.09020013376 [09/13/2018 14:27:33 INFO 140484794881856] patience losses:[6.0923795570026744, 6.0792935024608266, 6.0984813733534375, 6.0895557403564453, 6.0945508913560351] min patience loss:6.07929350246 current loss:6.09020013376 absolute loss difference:0.0109066312963 [09/13/2018 14:27:33 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:4 [09/13/2018 14:27:33 INFO 140484794881856] #progress_metric: host=algo-1, completed 84 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4620, "sum": 4620.0, "min": 4620}, "Total Records Seen": {"count": 1, "max": 584976, "sum": 584976.0, "min": 584976}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 168, "sum": 168.0, "min": 168}}, "EndTime": 1536848853.866705, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 83}, "StartTime": 1536848852.047004} [09/13/2018 14:27:33 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3826.64295798 records/second [09/13/2018 14:27:33 INFO 140484794881856] [09/13/2018 14:27:33 INFO 140484794881856] # Starting training for epoch 85 [09/13/2018 14:27:33 INFO 140284443182912] # Finished training epoch 75 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:33 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:33 INFO 140284443182912] Loss (name: value) total: 6.11826122891 [09/13/2018 14:27:33 INFO 140284443182912] Loss (name: value) kld: 0.292337852446 [09/13/2018 14:27:33 INFO 140284443182912] Loss (name: value) recons: 5.82592337348 [09/13/2018 14:27:33 INFO 140284443182912] Loss (name: value) logppx: 6.11826122891 [09/13/2018 14:27:33 INFO 140284443182912] #quality_metric: host=algo-2, epoch=75, train total_loss <loss>=6.11826122891 [09/13/2018 14:27:33 INFO 140284443182912] patience losses:[6.1363678628748115, 6.1206049268895928, 6.1304025823419748, 6.1121639208360152, 6.1242976622147998] min patience loss:6.11216392084 current loss:6.11826122891 absolute loss difference:0.00609730807218 [09/13/2018 14:27:33 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:2 [09/13/2018 14:27:33 INFO 140284443182912] #progress_metric: host=algo-2, completed 75 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4125, "sum": 4125.0, "min": 4125}, "Total Records Seen": {"count": 1, "max": 522300, "sum": 522300.0, "min": 522300}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 150, "sum": 150.0, "min": 150}}, "EndTime": 1536848853.881354, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 74}, "StartTime": 1536848851.797403} [09/13/2018 14:27:33 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3341.51440143 records/second [09/13/2018 14:27:33 INFO 140284443182912] [09/13/2018 14:27:33 INFO 140284443182912] # Starting training for epoch 76 [09/13/2018 14:27:35 INFO 140484794881856] # Finished training epoch 85 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:35 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:35 INFO 140484794881856] Loss (name: value) total: 6.09118990898 [09/13/2018 14:27:35 INFO 140484794881856] Loss (name: value) kld: 0.296739666299 [09/13/2018 14:27:35 INFO 140484794881856] Loss (name: value) recons: 5.79445024404 [09/13/2018 14:27:35 INFO 140484794881856] Loss (name: value) logppx: 6.09118990898 [09/13/2018 14:27:35 INFO 140484794881856] #quality_metric: host=algo-1, epoch=85, train total_loss <loss>=6.09118990898 [09/13/2018 14:27:35 INFO 140484794881856] patience losses:[6.0792935024608266, 6.0984813733534375, 6.0895557403564453, 6.0945508913560351, 6.0902001337571576] min patience loss:6.07929350246 current loss:6.09118990898 absolute loss difference:0.0118964065205 [09/13/2018 14:27:35 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:5 [09/13/2018 14:27:35 INFO 140484794881856] #progress_metric: host=algo-1, completed 85 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4675, "sum": 4675.0, "min": 4675}, "Total Records Seen": {"count": 1, "max": 591940, "sum": 591940.0, "min": 591940}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 170, "sum": 170.0, "min": 170}}, "EndTime": 1536848855.7687, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 84}, "StartTime": 1536848853.867361} [09/13/2018 14:27:35 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3662.41638083 records/second [09/13/2018 14:27:35 INFO 140484794881856] [09/13/2018 14:27:35 INFO 140484794881856] # Starting training for epoch 86 [09/13/2018 14:27:35 INFO 140284443182912] # Finished training epoch 76 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:35 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:35 INFO 140284443182912] Loss (name: value) total: 6.11286429925 [09/13/2018 14:27:35 INFO 140284443182912] Loss (name: value) kld: 0.294509976831 [09/13/2018 14:27:35 INFO 140284443182912] Loss (name: value) recons: 5.81835435521 [09/13/2018 14:27:35 INFO 140284443182912] Loss (name: value) logppx: 6.11286429925 [09/13/2018 14:27:35 INFO 140284443182912] #quality_metric: host=algo-2, epoch=76, train total_loss <loss>=6.11286429925 [09/13/2018 14:27:35 INFO 140284443182912] patience losses:[6.1206049268895928, 6.1304025823419748, 6.1121639208360152, 6.1242976622147998, 6.1182612289081924] min patience loss:6.11216392084 current loss:6.11286429925 absolute loss difference:0.000700378417969 [09/13/2018 14:27:35 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:3 [09/13/2018 14:27:35 INFO 140284443182912] #progress_metric: host=algo-2, completed 76 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4180, "sum": 4180.0, "min": 4180}, "Total Records Seen": {"count": 1, "max": 529264, "sum": 529264.0, "min": 529264}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 152, "sum": 152.0, "min": 152}}, "EndTime": 1536848855.940531, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 75}, "StartTime": 1536848853.881695} [09/13/2018 14:27:35 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3382.26077854 records/second [09/13/2018 14:27:35 INFO 140284443182912] [09/13/2018 14:27:35 INFO 140284443182912] # Starting training for epoch 77 [09/13/2018 14:27:37 INFO 140484794881856] # Finished training epoch 86 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:37 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:37 INFO 140484794881856] Loss (name: value) total: 6.08478146033 [09/13/2018 14:27:37 INFO 140484794881856] Loss (name: value) kld: 0.299184493856 [09/13/2018 14:27:37 INFO 140484794881856] Loss (name: value) recons: 5.78559695157 [09/13/2018 14:27:37 INFO 140484794881856] Loss (name: value) logppx: 6.08478146033 [09/13/2018 14:27:37 INFO 140484794881856] #quality_metric: host=algo-1, epoch=86, train total_loss <loss>=6.08478146033 [09/13/2018 14:27:37 INFO 140484794881856] patience losses:[6.0984813733534375, 6.0895557403564453, 6.0945508913560351, 6.0902001337571576, 6.0911899089813231] min patience loss:6.08955574036 current loss:6.08478146033 absolute loss difference:0.00477428002791 [09/13/2018 14:27:37 INFO 140484794881856] #progress_metric: host=algo-1, completed 86 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4730, "sum": 4730.0, "min": 4730}, "Total Records Seen": {"count": 1, "max": 598904, "sum": 598904.0, "min": 598904}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 172, "sum": 172.0, "min": 172}}, "EndTime": 1536848857.623666, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 85}, "StartTime": 1536848855.769097} [09/13/2018 14:27:37 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3754.71642321 records/second [09/13/2018 14:27:37 INFO 140484794881856] [09/13/2018 14:27:37 INFO 140484794881856] # Starting training for epoch 87 [09/13/2018 14:27:38 INFO 140284443182912] # Finished training epoch 77 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:38 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:38 INFO 140284443182912] Loss (name: value) total: 6.11236232411 [09/13/2018 14:27:38 INFO 140284443182912] Loss (name: value) kld: 0.294220016761 [09/13/2018 14:27:38 INFO 140284443182912] Loss (name: value) recons: 5.8181423274 [09/13/2018 14:27:38 INFO 140284443182912] Loss (name: value) logppx: 6.11236232411 [09/13/2018 14:27:38 INFO 140284443182912] #quality_metric: host=algo-2, epoch=77, train total_loss <loss>=6.11236232411 [09/13/2018 14:27:38 INFO 140284443182912] patience losses:[6.1304025823419748, 6.1121639208360152, 6.1242976622147998, 6.1182612289081924, 6.1128642992539843] min patience loss:6.11216392084 current loss:6.11236232411 absolute loss difference:0.000198403271762 [09/13/2018 14:27:38 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:4 [09/13/2018 14:27:38 INFO 140284443182912] #progress_metric: host=algo-2, completed 77 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4235, "sum": 4235.0, "min": 4235}, "Total Records Seen": {"count": 1, "max": 536228, "sum": 536228.0, "min": 536228}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 154, "sum": 154.0, "min": 154}}, "EndTime": 1536848858.052847, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 76}, "StartTime": 1536848855.940882} [09/13/2018 14:27:38 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3297.1955234 records/second [09/13/2018 14:27:38 INFO 140284443182912] [09/13/2018 14:27:38 INFO 140284443182912] # Starting training for epoch 78 [09/13/2018 14:27:39 INFO 140484794881856] # Finished training epoch 87 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:39 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:39 INFO 140484794881856] Loss (name: value) total: 6.08735360232 [09/13/2018 14:27:39 INFO 140484794881856] Loss (name: value) kld: 0.302185451307 [09/13/2018 14:27:39 INFO 140484794881856] Loss (name: value) recons: 5.78516814492 [09/13/2018 14:27:39 INFO 140484794881856] Loss (name: value) logppx: 6.08735360232 [09/13/2018 14:27:39 INFO 140484794881856] #quality_metric: host=algo-1, epoch=87, train total_loss <loss>=6.08735360232 [09/13/2018 14:27:39 INFO 140484794881856] patience losses:[6.0895557403564453, 6.0945508913560351, 6.0902001337571576, 6.0911899089813231, 6.0847814603285357] min patience loss:6.08478146033 current loss:6.08735360232 absolute loss difference:0.00257214199413 [09/13/2018 14:27:39 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:27:39 INFO 140484794881856] #progress_metric: host=algo-1, completed 87 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4785, "sum": 4785.0, "min": 4785}, "Total Records Seen": {"count": 1, "max": 605868, "sum": 605868.0, "min": 605868}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 174, "sum": 174.0, "min": 174}}, "EndTime": 1536848859.521944, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 86}, "StartTime": 1536848857.624363} [09/13/2018 14:27:39 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3669.6647234 records/second [09/13/2018 14:27:39 INFO 140484794881856] [09/13/2018 14:27:39 INFO 140484794881856] # Starting training for epoch 88 [09/13/2018 14:27:40 INFO 140284443182912] # Finished training epoch 78 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:40 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:40 INFO 140284443182912] Loss (name: value) total: 6.10629929196 [09/13/2018 14:27:40 INFO 140284443182912] Loss (name: value) kld: 0.29813029739 [09/13/2018 14:27:40 INFO 140284443182912] Loss (name: value) recons: 5.80816901814 [09/13/2018 14:27:40 INFO 140284443182912] Loss (name: value) logppx: 6.10629929196 [09/13/2018 14:27:40 INFO 140284443182912] #quality_metric: host=algo-2, epoch=78, train total_loss <loss>=6.10629929196 [09/13/2018 14:27:40 INFO 140284443182912] patience losses:[6.1121639208360152, 6.1242976622147998, 6.1182612289081924, 6.1128642992539843, 6.1123623241077771] min patience loss:6.11216392084 current loss:6.10629929196 absolute loss difference:0.00586462887851 [09/13/2018 14:27:40 INFO 140284443182912] #progress_metric: host=algo-2, completed 78 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4290, "sum": 4290.0, "min": 4290}, "Total Records Seen": {"count": 1, "max": 543192, "sum": 543192.0, "min": 543192}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 156, "sum": 156.0, "min": 156}}, "EndTime": 1536848860.15452, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 77}, "StartTime": 1536848858.053341} [09/13/2018 14:27:40 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3314.07757282 records/second [09/13/2018 14:27:40 INFO 140284443182912] [09/13/2018 14:27:40 INFO 140284443182912] # Starting training for epoch 79 [09/13/2018 14:27:41 INFO 140484794881856] # Finished training epoch 88 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:41 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:41 INFO 140484794881856] Loss (name: value) total: 6.07706903978 [09/13/2018 14:27:41 INFO 140484794881856] Loss (name: value) kld: 0.301812213117 [09/13/2018 14:27:41 INFO 140484794881856] Loss (name: value) recons: 5.77525678981 [09/13/2018 14:27:41 INFO 140484794881856] Loss (name: value) logppx: 6.07706903978 [09/13/2018 14:27:41 INFO 140484794881856] #quality_metric: host=algo-1, epoch=88, train total_loss <loss>=6.07706903978 [09/13/2018 14:27:41 INFO 140484794881856] patience losses:[6.0945508913560351, 6.0902001337571576, 6.0911899089813231, 6.0847814603285357, 6.0873536023226649] min patience loss:6.08478146033 current loss:6.07706903978 absolute loss difference:0.00771242055026 [09/13/2018 14:27:41 INFO 140484794881856] #progress_metric: host=algo-1, completed 88 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4840, "sum": 4840.0, "min": 4840}, "Total Records Seen": {"count": 1, "max": 612832, "sum": 612832.0, "min": 612832}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 176, "sum": 176.0, "min": 176}}, "EndTime": 1536848861.282667, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 87}, "StartTime": 1536848859.522581} [09/13/2018 14:27:41 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3956.27637056 records/second [09/13/2018 14:27:41 INFO 140484794881856] [09/13/2018 14:27:41 INFO 140484794881856] # Starting training for epoch 89 [09/13/2018 14:27:42 INFO 140284443182912] # Finished training epoch 79 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:42 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:42 INFO 140284443182912] Loss (name: value) total: 6.11030225754 [09/13/2018 14:27:42 INFO 140284443182912] Loss (name: value) kld: 0.299339713969 [09/13/2018 14:27:42 INFO 140284443182912] Loss (name: value) recons: 5.81096253395 [09/13/2018 14:27:42 INFO 140284443182912] Loss (name: value) logppx: 6.11030225754 [09/13/2018 14:27:42 INFO 140284443182912] #quality_metric: host=algo-2, epoch=79, train total_loss <loss>=6.11030225754 [09/13/2018 14:27:42 INFO 140284443182912] patience losses:[6.1242976622147998, 6.1182612289081924, 6.1128642992539843, 6.1123623241077771, 6.1062992919575088] min patience loss:6.10629929196 current loss:6.11030225754 absolute loss difference:0.00400296558033 [09/13/2018 14:27:42 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:1 [09/13/2018 14:27:42 INFO 140284443182912] #progress_metric: host=algo-2, completed 79 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4345, "sum": 4345.0, "min": 4345}, "Total Records Seen": {"count": 1, "max": 550156, "sum": 550156.0, "min": 550156}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 158, "sum": 158.0, "min": 158}}, "EndTime": 1536848862.152211, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 78}, "StartTime": 1536848860.155089} [09/13/2018 14:27:42 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3486.6772455 records/second [09/13/2018 14:27:42 INFO 140284443182912] [09/13/2018 14:27:42 INFO 140284443182912] # Starting training for epoch 80 [09/13/2018 14:27:43 INFO 140484794881856] # Finished training epoch 89 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:43 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:43 INFO 140484794881856] Loss (name: value) total: 6.09004412998 [09/13/2018 14:27:43 INFO 140484794881856] Loss (name: value) kld: 0.306663094055 [09/13/2018 14:27:43 INFO 140484794881856] Loss (name: value) recons: 5.7833810156 [09/13/2018 14:27:43 INFO 140484794881856] Loss (name: value) logppx: 6.09004412998 [09/13/2018 14:27:43 INFO 140484794881856] #quality_metric: host=algo-1, epoch=89, train total_loss <loss>=6.09004412998 [09/13/2018 14:27:43 INFO 140484794881856] patience losses:[6.0902001337571576, 6.0911899089813231, 6.0847814603285357, 6.0873536023226649, 6.0770690397782756] min patience loss:6.07706903978 current loss:6.09004412998 absolute loss difference:0.0129750902003 [09/13/2018 14:27:43 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:27:43 INFO 140484794881856] #progress_metric: host=algo-1, completed 89 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4895, "sum": 4895.0, "min": 4895}, "Total Records Seen": {"count": 1, "max": 619796, "sum": 619796.0, "min": 619796}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 178, "sum": 178.0, "min": 178}}, "EndTime": 1536848863.109629, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 88}, "StartTime": 1536848861.283091} [09/13/2018 14:27:43 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3812.32879982 records/second [09/13/2018 14:27:43 INFO 140484794881856] [09/13/2018 14:27:43 INFO 140484794881856] # Starting training for epoch 90 [09/13/2018 14:27:44 INFO 140284443182912] # Finished training epoch 80 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:44 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:44 INFO 140284443182912] Loss (name: value) total: 6.10864432942 [09/13/2018 14:27:44 INFO 140284443182912] Loss (name: value) kld: 0.303706619279 [09/13/2018 14:27:44 INFO 140284443182912] Loss (name: value) recons: 5.8049377268 [09/13/2018 14:27:44 INFO 140284443182912] Loss (name: value) logppx: 6.10864432942 [09/13/2018 14:27:44 INFO 140284443182912] #quality_metric: host=algo-2, epoch=80, train total_loss <loss>=6.10864432942 [09/13/2018 14:27:44 INFO 140284443182912] patience losses:[6.1182612289081924, 6.1128642992539843, 6.1123623241077771, 6.1062992919575088, 6.1103022575378416] min patience loss:6.10629929196 current loss:6.10864432942 absolute loss difference:0.00234503746033 [09/13/2018 14:27:44 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:2 [09/13/2018 14:27:44 INFO 140284443182912] #progress_metric: host=algo-2, completed 80 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4400, "sum": 4400.0, "min": 4400}, "Total Records Seen": {"count": 1, "max": 557120, "sum": 557120.0, "min": 557120}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 160, "sum": 160.0, "min": 160}}, "EndTime": 1536848864.262113, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 79}, "StartTime": 1536848862.152828} [09/13/2018 14:27:44 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3301.35334142 records/second [09/13/2018 14:27:44 INFO 140284443182912] [09/13/2018 14:27:44 INFO 140284443182912] # Starting training for epoch 81 [09/13/2018 14:27:44 INFO 140484794881856] # Finished training epoch 90 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:44 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:44 INFO 140484794881856] Loss (name: value) total: 6.06961394223 [09/13/2018 14:27:44 INFO 140484794881856] Loss (name: value) kld: 0.306806197762 [09/13/2018 14:27:44 INFO 140484794881856] Loss (name: value) recons: 5.76280772903 [09/13/2018 14:27:44 INFO 140484794881856] Loss (name: value) logppx: 6.06961394223 [09/13/2018 14:27:44 INFO 140484794881856] #quality_metric: host=algo-1, epoch=90, train total_loss <loss>=6.06961394223 [09/13/2018 14:27:44 INFO 140484794881856] patience losses:[6.0911899089813231, 6.0847814603285357, 6.0873536023226649, 6.0770690397782756, 6.0900441299785264] min patience loss:6.07706903978 current loss:6.06961394223 absolute loss difference:0.00745509754528 [09/13/2018 14:27:44 INFO 140484794881856] #progress_metric: host=algo-1, completed 90 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4950, "sum": 4950.0, "min": 4950}, "Total Records Seen": {"count": 1, "max": 626760, "sum": 626760.0, "min": 626760}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 180, "sum": 180.0, "min": 180}}, "EndTime": 1536848864.974061, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 89}, "StartTime": 1536848863.110079} [09/13/2018 14:27:44 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3735.75126141 records/second [09/13/2018 14:27:44 INFO 140484794881856] [09/13/2018 14:27:44 INFO 140484794881856] # Starting training for epoch 91 [09/13/2018 14:27:46 INFO 140284443182912] # Finished training epoch 81 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:46 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:46 INFO 140284443182912] Loss (name: value) total: 6.11202324954 [09/13/2018 14:27:46 INFO 140284443182912] Loss (name: value) kld: 0.309053206173 [09/13/2018 14:27:46 INFO 140284443182912] Loss (name: value) recons: 5.80297006694 [09/13/2018 14:27:46 INFO 140284443182912] Loss (name: value) logppx: 6.11202324954 [09/13/2018 14:27:46 INFO 140284443182912] #quality_metric: host=algo-2, epoch=81, train total_loss <loss>=6.11202324954 [09/13/2018 14:27:46 INFO 140284443182912] patience losses:[6.1128642992539843, 6.1123623241077771, 6.1062992919575088, 6.1103022575378416, 6.1086443294178352] min patience loss:6.10629929196 current loss:6.11202324954 absolute loss difference:0.00572395758195 [09/13/2018 14:27:46 INFO 140284443182912] Bad epoch: loss has not improved (enough). 
Bad count:3 [09/13/2018 14:27:46 INFO 140284443182912] #progress_metric: host=algo-2, completed 81 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4455, "sum": 4455.0, "min": 4455}, "Total Records Seen": {"count": 1, "max": 564084, "sum": 564084.0, "min": 564084}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 162, "sum": 162.0, "min": 162}}, "EndTime": 1536848866.399116, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 80}, "StartTime": 1536848864.262644} [09/13/2018 14:27:46 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3259.37313018 records/second [09/13/2018 14:27:46 INFO 140284443182912] [09/13/2018 14:27:46 INFO 140284443182912] # Starting training for epoch 82 [09/13/2018 14:27:46 INFO 140484794881856] # Finished training epoch 91 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:46 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:46 INFO 140484794881856] Loss (name: value) total: 6.08625555038 [09/13/2018 14:27:46 INFO 140484794881856] Loss (name: value) kld: 0.314204676043 [09/13/2018 14:27:46 INFO 140484794881856] Loss (name: value) recons: 5.77205084454 [09/13/2018 14:27:46 INFO 140484794881856] Loss (name: value) logppx: 6.08625555038 [09/13/2018 14:27:46 INFO 140484794881856] #quality_metric: host=algo-1, epoch=91, train total_loss <loss>=6.08625555038 [09/13/2018 14:27:46 INFO 140484794881856] patience losses:[6.0847814603285357, 6.0873536023226649, 6.0770690397782756, 6.0900441299785264, 6.0696139422329987] min patience loss:6.06961394223 current loss:6.08625555038 absolute loss difference:0.0166416081515 [09/13/2018 14:27:46 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:1 [09/13/2018 14:27:46 INFO 140484794881856] #progress_metric: host=algo-1, completed 91 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 5005, "sum": 5005.0, "min": 5005}, "Total Records Seen": {"count": 1, "max": 633724, "sum": 633724.0, "min": 633724}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 182, "sum": 182.0, "min": 182}}, "EndTime": 1536848866.694304, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 90}, "StartTime": 1536848864.974754} [09/13/2018 14:27:46 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=4049.51297733 records/second [09/13/2018 14:27:46 INFO 140484794881856] [09/13/2018 14:27:46 INFO 140484794881856] # Starting training for epoch 92 [09/13/2018 14:27:48 INFO 140284443182912] # Finished training epoch 82 on 6964 examples from 55 batches, each of size 128. 
[09/13/2018 14:27:48 INFO 140284443182912] Metrics for Training: [09/13/2018 14:27:48 INFO 140284443182912] Loss (name: value) total: 6.11201663884 [09/13/2018 14:27:48 INFO 140284443182912] Loss (name: value) kld: 0.30979788764 [09/13/2018 14:27:48 INFO 140284443182912] Loss (name: value) recons: 5.80221875798 [09/13/2018 14:27:48 INFO 140284443182912] Loss (name: value) logppx: 6.11201663884 [09/13/2018 14:27:48 INFO 140284443182912] #quality_metric: host=algo-2, epoch=82, train total_loss <loss>=6.11201663884 [09/13/2018 14:27:48 INFO 140284443182912] patience losses:[6.1123623241077771, 6.1062992919575088, 6.1103022575378416, 6.1086443294178352, 6.1120232495394617] min patience loss:6.10629929196 current loss:6.11201663884 absolute loss difference:0.00571734688499 [09/13/2018 14:27:48 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:4 [09/13/2018 14:27:48 INFO 140284443182912] #progress_metric: host=algo-2, completed 82 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 4510, "sum": 4510.0, "min": 4510}, "Total Records Seen": {"count": 1, "max": 571048, "sum": 571048.0, "min": 571048}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 164, "sum": 164.0, "min": 164}}, "EndTime": 1536848868.575278, "Dimensions": {"Host": "algo-2", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 81}, "StartTime": 1536848866.399708} [09/13/2018 14:27:48 INFO 140284443182912] #throughput_metric: host=algo-2, train throughput=3200.79848715 records/second [09/13/2018 14:27:48 INFO 140284443182912] [09/13/2018 14:27:48 INFO 140284443182912] # Starting training for epoch 83 [09/13/2018 14:27:48 INFO 140484794881856] # Finished training epoch 92 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:48 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:48 INFO 140484794881856] Loss (name: value) total: 6.07124563997 [09/13/2018 14:27:48 INFO 140484794881856] Loss (name: value) kld: 0.313103116913 [09/13/2018 14:27:48 INFO 140484794881856] Loss (name: value) recons: 5.75814252767 [09/13/2018 14:27:48 INFO 140484794881856] Loss (name: value) logppx: 6.07124563997 [09/13/2018 14:27:48 INFO 140484794881856] #quality_metric: host=algo-1, epoch=92, train total_loss <loss>=6.07124563997 [09/13/2018 14:27:48 INFO 140484794881856] patience losses:[6.0873536023226649, 6.0770690397782756, 6.0900441299785264, 6.0696139422329987, 6.0862555503845215] min patience loss:6.06961394223 current loss:6.07124563997 absolute loss difference:0.00163169774142 [09/13/2018 14:27:48 INFO 140484794881856] Bad epoch: loss has not improved (enough). 
Bad count:2 [09/13/2018 14:27:48 INFO 140484794881856] #progress_metric: host=algo-1, completed 92 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 5060, "sum": 5060.0, "min": 5060}, "Total Records Seen": {"count": 1, "max": 640688, "sum": 640688.0, "min": 640688}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 184, "sum": 184.0, "min": 184}}, "EndTime": 1536848868.615013, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 91}, "StartTime": 1536848866.694742} [09/13/2018 14:27:48 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3626.29021189 records/second [09/13/2018 14:27:48 INFO 140484794881856] [09/13/2018 14:27:48 INFO 140484794881856] # Starting training for epoch 93 [09/13/2018 14:27:50 INFO 140484794881856] # Finished training epoch 93 on 6964 examples from 55 batches, each of size 128. [09/13/2018 14:27:50 INFO 140484794881856] Metrics for Training: [09/13/2018 14:27:50 INFO 140484794881856] Loss (name: value) total: 6.08128702424 [09/13/2018 14:27:50 INFO 140484794881856] Loss (name: value) kld: 0.317914219607 [09/13/2018 14:27:50 INFO 140484794881856] Loss (name: value) recons: 5.76337281574 [09/13/2018 14:27:50 INFO 140484794881856] Loss (name: value) logppx: 6.08128702424 [09/13/2018 14:27:50 INFO 140484794881856] #quality_metric: host=algo-1, epoch=93, train total_loss <loss>=6.08128702424 [09/13/2018 14:27:50 INFO 140484794881856] patience losses:[6.0770690397782756, 6.0900441299785264, 6.0696139422329987, 6.0862555503845215, 6.0712456399744203] min patience loss:6.06961394223 current loss:6.08128702424 absolute loss difference:0.0116730820049 [09/13/2018 14:27:50 INFO 140484794881856] Bad epoch: loss has not improved (enough). Bad count:3 [09/13/2018 14:27:50 INFO 140484794881856] #progress_metric: host=algo-1, completed 93 % of epochs #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Batches Since Last Reset": {"count": 1, "max": 55, "sum": 55.0, "min": 55}, "Number of Records Since Last Reset": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Total Batches Seen": {"count": 1, "max": 5115, "sum": 5115.0, "min": 5115}, "Total Records Seen": {"count": 1, "max": 647652, "sum": 647652.0, "min": 647652}, "Max Records Seen Between Resets": {"count": 1, "max": 6964, "sum": 6964.0, "min": 6964}, "Reset Count": {"count": 1, "max": 186, "sum": 186.0, "min": 186}}, "EndTime": 1536848870.386344, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 92}, "StartTime": 1536848868.615572} [09/13/2018 14:27:50 INFO 140484794881856] #throughput_metric: host=algo-1, train throughput=3932.38375173 records/second [09/13/2018 14:27:50 INFO 140484794881856] [09/13/2018 14:27:50 INFO 140484794881856] # Starting training for epoch 94 [09/13/2018 14:27:50 INFO 140284443182912] # Finished training epoch 83 on 6964 examples from 55 batches, each of size 128. 
[... interleaved per-epoch training log from hosts algo-1 and algo-2 truncated; each epoch reports total, kld, recons, and logppx losses, patience bookkeeping, and throughput on 6964 examples from 55 batches of size 128 ...]
[09/13/2018 14:27:52 INFO 140284443182912] # Finished training epoch 84 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:27:52 INFO 140284443182912] #quality_metric: host=algo-2, epoch=84, train total_loss <loss>=6.11374891455
[09/13/2018 14:27:52 INFO 140284443182912] Bad epoch: loss has not improved (enough). Bad count:6
[09/13/2018 14:27:52 INFO 140284443182912] Bad epochs exceeded patience. Stopping training early!
[09/13/2018 14:27:52 INFO 140284443182912] Early stop condition met. Stopping training.
[09/13/2018 14:28:00 INFO 140284443182912] Best model based on early stopping at epoch 78. Best loss: 6.10629929196
[09/13/2018 14:28:00 INFO 140484794881856] # Finished training epoch 100 on 6964 examples from 55 batches, each of size 128.
[09/13/2018 14:28:00 INFO 140484794881856] #quality_metric: host=algo-1, epoch=100, train total_loss <loss>=6.03936740268
[09/13/2018 14:28:00 INFO 140484794881856] Best model based on early stopping at epoch 100. Best loss: 6.03936740268
[... "Topics from epoch:final (num_topics:20)" vocabulary-index dumps truncated ...]
[09/13/2018 14:28:00 INFO 140484794881856] Serializing model to /opt/ml/model/model_algo-1
[09/13/2018 14:28:00 INFO 140284443182912] Serializing model to /opt/ml/model/model_algo-2
[09/13/2018 14:28:01 INFO 140284443182912] #test_score (algo-2) : ('log_perplexity', 6.1372216012742786)
[09/13/2018 14:28:01 INFO 140484794881856] #test_score (algo-1) : ('log_perplexity', 6.1504026077411789)
Billable seconds: 584
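The "patience losses" / "Bad count" bookkeeping in the log is the early-stopping logic at work, driven by the num_patience_epochs=5 and tolerance=0.001 hyperparameters (the same values we reuse for the tuning job later in this lab). The sketch below is our reconstruction of that logic from the log output, not the algorithm's actual source:
# Illustrative reconstruction of the patience-based early stopping seen in
# the training log above (assumes num_patience_epochs=5, tolerance=0.001)
def should_stop_early(losses, num_patience_epochs=5, tolerance=0.001):
    bad_count = 0
    for epoch, loss in enumerate(losses):
        patience_losses = losses[max(0, epoch - num_patience_epochs):epoch]
        if not patience_losses:
            continue  # not enough history for a patience window yet
        min_patience_loss = min(patience_losses)
        # A "bad epoch" fails to beat the best recent loss by at least `tolerance`
        if min_patience_loss - loss < tolerance:
            bad_count += 1
        else:
            bad_count = 0
        if bad_count > num_patience_epochs:
            return True  # "Bad epochs exceeded patience. Stopping training early!"
    return False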
%%javascript
// Disable output-area scrolling so long cell outputs display in full
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}
A trained NTM model does nothing on its own; we now want to use it to perform inference on new data. For this example, that means predicting the topic mixture that represents a given document. We create an inference endpoint using the SageMaker Python SDK deploy() function on the estimator we trained above, specifying the instance type on which inference is computed as well as the initial number of instances to spin up.
ntm_predictor = ntm.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
INFO:sagemaker:Creating model with name: ntm-2018-09-13-14-31-22-875
INFO:sagemaker:Creating endpoint with name ntm-2018-09-13-14-21-11-945
----------------------------------------------------------------------------!
After the deployment completes, run the following code to prepare the test data and invoke the endpoint for inference. We can pass data to the inference endpoint in a variety of formats; here we demonstrate CSV-formatted data. We make use of the SageMaker Python SDK utilities csv_serializer and json_deserializer when configuring the inference endpoint, and pass 5 documents from the test dataset for inference.
from sagemaker.predictor import csv_serializer, json_deserializer

# Configure the endpoint to accept CSV input and return JSON output
ntm_predictor.content_type = 'text/csv'
ntm_predictor.serializer = csv_serializer
ntm_predictor.deserializer = json_deserializer

# Densify the sparse test vectors and send the first 5 documents for inference
test_data = np.array(test_vectors.todense())
results = ntm_predictor.predict(test_data[:5])

# Each response record carries a 'topic_weights' vector; stack them into an array
predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
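Each element of predictions is the inferred topic mixture for one document: 20 non-negative weights that should sum to approximately 1. A quick sanity check on the response (illustrative only):
# One 20-dimensional topic mixture per input document
print(predictions.shape)        # expected: (5, 20)
print(predictions.sum(axis=1))  # each row should sum to ~1.0
print(predictions[0].argmax())  # dominant topic for the first document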
A more intuitive way to see the prediction results is to visualize the topic assignments for the 5 sample test documents. Run the following code in a new cell to plot a bar chart of the topic assignment across the 20 topics.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fs = 12
topics = pd.DataFrame(predictions.T)
topics.plot(kind='bar', figsize=(16,4), fontsize=fs)
plt.ylabel('Topic assignment', fontsize=fs+2)
plt.xlabel('Topic ID', fontsize=fs+2)
Text(0.5,0,'Topic ID')
Note: The following section is meant as a deeper dive into exploring the trained models. The demonstrated functionalities may not be fully supported or guaranteed. For example, the parameter names may change without notice.
The trained model artifact is a compressed package of MXNet models from the two workers. To explore the model, we first need to install mxnet, then download and unpack the artifact.
Once unpacked, we can load the model parameters and extract the decoder weight matrix $W$ as follows:
!pip install mxnet
import mxnet as mx

# Locate and download the model artifact produced by the training job
model_path = os.path.join(output_prefix, ntm._current_job_name, 'output/model.tar.gz')
boto3.resource('s3').Bucket(bucket).download_file(model_path, 'downloaded_model.tar.gz')

# The tarball contains one zipped model per training host; unpack algo-1's
!tar -xzvf 'downloaded_model.tar.gz'
!unzip -o model_algo-1

# Load the MXNet parameters and extract the decoder (projection) weight matrix
model = mx.ndarray.load('params')
W = model['arg:projection_weight']
Collecting mxnet
  Downloading mxnet-1.2.1.post1-py2.py3-none-manylinux1_x86_64.whl (24.2MB)
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
[... already-satisfied requirement lines truncated ...]
Installing collected packages: graphviz, mxnet
Successfully installed graphviz-0.8.4 mxnet-1.2.1.post1
model_algo-2 model_algo-1
Archive:  model_algo-1
 extracting: meta.json
 extracting: symbol.json
 extracting: params
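Before visualizing, it is worth confirming that the extraction worked and the decoder weights have the expected shape. A one-line check (illustrative):
# W has one row per vocabulary word and one column per learned topic
print(W.shape)  # expected: (vocab_size, num_topics)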
Matrix $W$ corresponds to the $W$ in the NTM diagram at the beginning of this notebook. Each column of $W$ corresponds to a learned topic, and the elements of a column give the pseudo-probability of each word within that topic. We can visualize each topic as a word cloud, with the size of each word proportional to its pseudo-probability of appearing under that topic.
# Map each vocabulary word to its row index in W
word_to_id = {v: i for i, v in enumerate(vocab_list)}

limit = 24   # plot at most 24 topics
n_col = 4    # word clouds per row
counter = 0

plt.figure(figsize=(20, 16))
for ind in range(num_topics):
    if counter >= limit:
        break
    # Softmax-normalize this topic's column of W into pseudo-probabilities
    pvals = mx.nd.softmax(mx.nd.array(W[:, ind])).asnumpy()
    # Weight every vocabulary word by its pseudo-probability under the topic
    word_freq = {word: pvals[i] for word, i in word_to_id.items()}
    wordcloud = WordCloud(background_color='white').fit_words(word_freq)
    plt.subplot(limit // n_col, n_col, counter + 1)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Topic{}'.format(ind))
    counter += 1
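Word clouds can be hard to compare at a glance. As a complementary view (a sketch reusing vocab_list and W from above), the same softmax-normalized columns of $W$ can be reduced to ranked word lists:
# Print the ten highest-probability words per topic, using the same
# softmax-normalized columns of W as the word clouds
top_n = 10
for ind in range(num_topics):
    pvals = mx.nd.softmax(mx.nd.array(W[:, ind])).asnumpy()
    top_words = [vocab_list[i] for i in pvals.argsort()[::-1][:top_n]]
    print('Topic {}: {}'.format(ind, ' '.join(top_words)))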
Looking at the word clouds above, the topics don't seem very distinct or obvious. We most likely have a sub-optimal number of topics for our model. Let's run a hyperparameter optimization (HPO) job and see if we can improve the model's performance.
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter, ContinuousParameter

# Re-create the NTM estimator, this time with fewer topics
ntm = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=2,
                                    train_instance_type='ml.c4.xlarge',
                                    output_path=output_path,
                                    sagemaker_session=sagemaker_session)
num_topics = 6
ntm.set_hyperparameters(num_topics=num_topics, feature_dim=vocab_size, mini_batch_size=128,
                        epochs=100, num_patience_epochs=5, tolerance=0.001)

# Search over the optimizer and learning rate
hyperparameter_ranges = {'optimizer': CategoricalParameter(['sgd', 'adam', 'adadelta']),
                         'learning_rate': ContinuousParameter(0.001, 0.02)}

# Minimize the validation loss across 9 training jobs, 3 running in parallel
tuner = HyperparameterTuner(ntm,
                            'validation:total_loss',
                            hyperparameter_ranges,
                            objective_type='Minimize',
                            max_jobs=9,
                            max_parallel_jobs=3)

tuner.fit({'train': s3_train, 'test': s3_val_data})
INFO:sagemaker:Creating hyperparameter tuning job with name: ntm-180913-2057
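tuner.fit() returns as soon as the tuning job is created; the nine training jobs then run asynchronously. Once they finish, the best-performing job can be identified and deployed. A minimal sketch, assuming the tuning job completes successfully:
# Wait for all 9 training jobs to finish, then inspect and deploy the winner
tuner.wait()
print(tuner.best_training_job())  # job with the lowest validation:total_loss
tuned_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')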