Out-of-sample forecasting: how is it done?

  • I asked about forecasting outside the sample but got no answers. Maybe a simple example will make it clearer. Say we have the data [1,2,3,4,5,6,7,8,9]. Then: data normalization, feeding it in for training, the neural network does its work (poof, magic), the data is transformed back, and we get the output. The next element should be something like [9.98743]. That is a rough example. In all the examples I have seen, the forecast is made only within the sample, that is, on the test data. I do not understand how to forecast outside the sample. Please share links; an example would help. How is this done?

    Thanks for the answer. But I wanted to know how to make a prediction for elements outside the sample. Take this code, for example: how do I find out the pollution for the next day, week, or month, and how is that implemented? I cannot find examples of such an implementation.

    from math import sqrt
    from numpy import concatenate
    from matplotlib import pyplot
    from pandas import read_csv
    from pandas import DataFrame
    from pandas import concat
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import mean_squared_error
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM

    # convert series to supervised learning
    def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
        n_vars = 1 if type(data) is list else data.shape[1]
        df = DataFrame(data)
        cols, names = list(), list()
        # input sequence (t-n, ... t-1)
        for i in range(n_in, 0, -1):
            cols.append(df.shift(i))
            names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
        # forecast sequence (t, t+1, ... t+n)
        for i in range(0, n_out):
            cols.append(df.shift(-i))
            if i == 0:
                names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
            else:
                names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
        # put it all together
        agg = concat(cols, axis=1)
        agg.columns = names
        # drop rows with NaN values
        if dropnan:
            agg.dropna(inplace=True)
        return agg

    # load dataset
    dataset = read_csv('pollution.csv', header=0, index_col=0)
    values = dataset.values

    # integer encode direction
    encoder = LabelEncoder()
    values[:,4] = encoder.fit_transform(values[:,4])

    # ensure all data is float
    values = values.astype('float32')

    # normalize features
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)

    # frame as supervised learning
    reframed = series_to_supervised(scaled, 1, 1)

    # drop columns we don't want to predict
    reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

    # split into train and test sets
    values = reframed.values
    n_train_hours = 365 * 24
    train = values[:n_train_hours, :]
    test = values[n_train_hours:, :]

    # split into input and outputs
    train_X, train_y = train[:, :-1], train[:, -1]
    test_X, test_y = test[:, :-1], test[:, -1]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

    # design network
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
    # single output unit: the pollution value at time t
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')

    # fit network
    history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)

    # plot history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

    # make a prediction
    yhat = model.predict(test_X)
    test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

    # invert scaling for forecast
    inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
    inv_yhat = scaler.inverse_transform(inv_yhat)
    inv_yhat = inv_yhat[:,0]

    # invert scaling for actual
    test_y = test_y.reshape((len(test_y), 1))
    inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
    inv_y = scaler.inverse_transform(inv_y)
    inv_y = inv_y[:,0]

    # calculate RMSE
    rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
    print('Test RMSE: %.3f' % rmse)

    This is an example from GitHub.
    But in this and the other examples I have seen, only the in-sample forecast is implemented,
    not the out-of-sample one.

  • Any machine learning model or neural network always expects data in a fixed format or, in technical terms, a tensor of the same shape, not counting the first or last axis (the number of samples). For example, if your data set is a 2D matrix or table, the model will always expect the same number of columns. The number of rows may vary.
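
    To illustrate the shape requirement, here is a small NumPy sketch. The sizes (1 time step, 8 features) are taken from the code in the question; the array contents are random placeholders, not real data.

```python
import numpy as np

n_timesteps, n_features = 1, 8  # as in the question's LSTM input

# Training data: many samples, each of shape (timesteps, features)
train_X = np.random.rand(1000, n_timesteps, n_features)

# One new observation arriving after training: a flat vector of 8 features
new_row = np.random.rand(n_features)

# Add the sample and time-step axes; only the first axis may differ
new_X = new_row.reshape((1, n_timesteps, n_features))

# The trailing axes of the new input must match the training data
print(train_X.shape[1:] == new_X.shape[1:])
```

    The number of samples differs (1000 versus 1), but everything after the first axis is identical, which is what the model checks.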

    Out-of-sample data is the data you will receive as input in the future. To launch and test such a model, the following is done:

    1. A sufficiently large sample of data is collected.
    2. It is split into two parts: a training set and a test set.
    3. For a time series, the last 20 to 30 percent of the data is usually taken as the test set. If it is not a time series, the data is shuffled randomly and split into two parts: 70 to 80 percent for training, the rest for testing.
    4. The model is then trained on the training set, and its accuracy is checked on the test set.
    5. Once we are satisfied with the test results, the model is rolled out to production and starts working on completely new data. Of course, all the preprocessing steps applied to the training and test sets must be repeated for the new input data.
    6. After this, the model is usually trained further on the new data.
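
    The steps above can be sketched on a toy series like the one in the question. This is only an illustration under stated assumptions: the series is made up, and a straight line fitted by least squares stands in for the neural network so that the example needs nothing but NumPy. The helper names (`scale`, `unscale`, `make_xy`, `predict`) are hypothetical.

```python
import numpy as np

series = np.arange(1, 21, dtype=float)  # toy data: 1, 2, ..., 20

# Steps 2-3: chronological split, the last 25% becomes the test set
n_train = int(len(series) * 0.75)
train, test = series[:n_train], series[n_train:]

# Preprocessing: min-max scaling, fitted on the training part only
lo, hi = train.min(), train.max()

def scale(x):
    return (x - lo) / (hi - lo)

def unscale(x):
    return x * (hi - lo) + lo

def make_xy(values):
    """Frame a series as (previous value -> next value) pairs."""
    return values[:-1], values[1:]

# Step 4: fit y = a*x + b by least squares (a stand-in for the network)
x_tr, y_tr = make_xy(scale(train))
a, b = np.polyfit(x_tr, y_tr, deg=1)

def predict(x_scaled):
    return a * x_scaled + b

# Evaluation on held-out data, where the true next values are known
x_te, y_te = make_xy(scale(test))
rmse = np.sqrt(np.mean((unscale(predict(x_te)) - unscale(y_te)) ** 2))

# Step 5: out-of-sample forecast. There is no target to compare with;
# the input is the last known value, preprocessed like the training data.
forecast = []
last = scale(series[-1])
for _ in range(3):          # three steps past the end of the data
    last = predict(last)    # each prediction becomes the next input
    forecast.append(unscale(last))

print(rmse, forecast)
```

    The only difference between the test evaluation and the out-of-sample forecast is where the input comes from: in the first case it is a held-out row with a known answer, in the second it is the most recent observation (or the previous prediction), and the answer is not known yet.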

    That's it.
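
    Applied to the code in the question, step 5 might look roughly like the sketch below. `forecast_next` is a hypothetical helper, not part of Keras or scikit-learn. It assumes `model` and `scaler` are the fitted objects from the question, and that `last_row` is the most recent raw observation with the wind direction already integer-encoded and pollution in column 0.

```python
import numpy as np

def forecast_next(model, scaler, last_row):
    """One-step-ahead forecast from the most recent raw observation.

    last_row: 1D array of raw feature values in the same column order
    as the data the scaler was fitted on (pollution in column 0).
    """
    last_row = np.asarray(last_row, dtype="float32").reshape(1, -1)
    # Repeat the preprocessing used for training: same scaler, transform only
    scaled = scaler.transform(last_row)
    # Reshape to [samples, timesteps, features], as the LSTM expects
    X = scaled.reshape((1, 1, scaled.shape[1]))
    # Scaled pollution value for the next time step
    yhat = np.asarray(model.predict(X)).reshape(1, 1)
    # Pad with the other scaled features so inverse_transform gets a full
    # row, mirroring the "invert scaling for forecast" step in the question
    padded = np.concatenate((yhat, scaled[:, 1:]), axis=1)
    return float(scaler.inverse_transform(padded)[0, 0])
```

    With the question's variables, the call would be something like `forecast_next(model, scaler, raw_values[-1])`, where `raw_values` is assumed to be a copy of the encoded, unscaled array kept before `values` is reassigned. This gives the next hour only. For a day, week, or month ahead, this model is not enough on its own, because at every future step it needs the other seven features, which are not known yet. One option is to build the training frame with `n_out` greater than 1 and a matching output layer so the network predicts several steps at once; another is to forecast the other features as well.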
