Out-of-sample forecasting. How do you do it?
-
I asked about forecasting outside the sample, but got no answers. Maybe a simple example will make it clearer. Example: we have the data [1,2,3,4,5,6,7,8,9]. Then: data normalization, the data goes in for training, the neural network does its work (poof, magic), data inversion, output. The next element should be something like [9.98743]. That's a rough example. In all the examples I've seen, there is only a forecast on the sample itself, i.e. on the test data. I don't understand how to forecast outside the sample. Please point me to references; an example would help. How do you do that?
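For concreteness, here is a minimal sketch of exactly that pipeline (the tiny Dense network, the one-step window and the epoch count are arbitrary choices for illustration, not a recommendation):

```python
# Minimal sketch: normalize [1..9], train, predict the NEXT element
# (out of sample), then invert the normalization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='float32').reshape(-1, 1)

# normalize to [0, 1], as described above
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

# frame as supervised learning: X = value at t, y = value at t+1
X, y = scaled[:-1], scaled[1:]

# train a tiny network (the "magic" step)
model = Sequential()
model.add(Dense(8, activation='relu', input_shape=(1,)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=500, verbose=0)

# OUT-OF-SAMPLE step: feed the LAST known value (9) to get the next one
last = scaled[-1].reshape(1, 1)
next_scaled = model.predict(last)

# invert the normalization to get the answer in the original units
next_value = scaler.inverse_transform(next_scaled)
print(next_value)  # close to 10 if the trend was learned; extrapolation is never exact
```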
Thanks for the answer. But I wanted to know how to make a prediction for elements outside the sample with code like the one below. For example, how do I get the pollution for the next day, week or month? How is that done? I cannot find examples of such an implementation.
```python
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())

# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
```
This is an example from GitHub.
But in this and every other example I've seen, the prediction is only made on the sample itself (the test data), never outside the sample.
-
Any machine learning / neural network model always expects input data in a fixed format or, in technical terms, tensors of the same shape, not counting the first or last axis (the number of samples). For example, if your dataset is a 2D matrix or table, the model will always expect the same number of columns; the number of rows may vary.
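To illustrate with the code from the question, assuming the trained `model` from above (it was fit on `train_X` of shape `(samples, 1, 8)`):

```python
# Shape-rule sketch: the model was trained on input of shape
# (samples, 1, 8), so any out-of-sample input must keep the same
# (timesteps, features) = (1, 8); only the sample count may vary.
import numpy as np

new_batch = np.random.rand(24, 1, 8).astype('float32')  # e.g. 24 new hours
yhat_new = model.predict(new_batch)  # OK: returns an array of shape (24, 1)

bad_batch = np.random.rand(24, 1, 5).astype('float32')
# model.predict(bad_batch)  # fails: 5 features where the model expects 8
```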
Out-of-sample data is the data that will arrive at the model's input in the future. To build and test such a model, the following steps are usually performed:
- a fairly large sample of data is collected;
- it is split into two parts, a training set and a test set;
- in the case of a time series, the last 20-30% of the data is usually taken as the test set; if it is not a time series, the data is randomly shuffled and split into two parts: 70-80% for training, the rest for testing;
- the model is then trained on the training set, and its accuracy is checked on the test set;
- once you are satisfied with the test results, the model is deployed to production and starts working on completely new data. All of the preprocessing steps you applied to the training and test sets must, of course, be repeated for the new input data (see the sketch after this list);
- after that, the model is usually retrained as new data accumulates.
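For the pollution code above, that deployment step could look like the following sketch. Here `new_data.csv` is a hypothetical file with the same columns as `pollution.csv`, holding the newest observed hours, and `model`, `scaler` and `encoder` are the already-fitted objects from the question:

```python
# Hedged sketch of out-of-sample prediction with the trained pollution model.
from numpy import concatenate
from pandas import read_csv

new = read_csv('new_data.csv', header=0, index_col=0)  # hypothetical new rows
new_values = new.values

# repeat EXACTLY the same preprocessing as for the training data,
# but with transform(), not fit_transform(): reuse the fitted objects
new_values[:, 4] = encoder.transform(new_values[:, 4])
new_values = new_values.astype('float32')
new_scaled = scaler.transform(new_values)

# each observed hour becomes the input for predicting the NEXT hour
new_X = new_scaled.reshape((new_scaled.shape[0], 1, new_scaled.shape[1]))
yhat = model.predict(new_X)

# invert the scaling, the same way as in the original code
inv = concatenate((yhat, new_scaled[:, 1:]), axis=1)
inv = scaler.inverse_transform(inv)
print(inv[:, 0])  # out-of-sample forecast: pollution one hour ahead
```

To look further ahead (a day, a week, a month), you can either feed each prediction back in as part of the next input (iterated forecasting, which here also requires assumptions about the other weather features), or reframe the data with `n_out > 1` in `series_to_supervised` so the network is trained to output several future steps at once.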
That's it.