Datasets

Open dataset for benchmarking Machine Learning algorithms


See the Kaggle project’s page

Data description

This dataset is dedicated to benchmarking Machine Learning solutions to the problem of estimation of the components of the state vector in nonlinear dynamical systems.

The dataset is built using two dynamical systems, namely:

  • The Electronic Throttle Control (ETC) system, a device that controls the air flow rate in automotive engines. This is a three-state system in which only the first state and the control input are measured, while the other two states are to be estimated from previously available measurements. The control input represents the electric current acting on an electric torque-generation subsystem. This torque acts on the angle of a device, thereby changing the flow rate entering the combustion chamber.

  • The Lorenz attractor (spelled lorentz in the dataset keys), a well-known nonlinear chaotic system with no inputs (an autonomous system). Here again, this is a three-state system in which only the first state is measured, while the two remaining states are to be estimated from the measurements available over a past window.
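To make the estimation setting concrete, the following sketch simulates the classical Lorenz equations and separates the measured first state from the two states to be estimated. The nominal parameter values, the initial state, the integration step and the way the parameters are perturbed (each scaled by 1 + rel_std times a standard normal draw) are assumptions chosen for illustration; the dataset's actual generation settings are not given here.

```python
import random

def lorenz_rhs(state, p):
    """Classical Lorenz equations; p = (sigma, rho, beta)."""
    x1, x2, x3 = state
    sigma, rho, beta = p
    return (sigma * (x2 - x1), x1 * (rho - x3) - x2, x1 * x2 - beta * x3)

def rk4_step(state, p, dt):
    """One fourth-order Runge-Kutta integration step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz_rhs(state, p)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2), p)
    k3 = lorenz_rhs(shifted(state, k2, dt / 2), p)
    k4 = lorenz_rhs(shifted(state, k3, dt), p)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(rel_std, n_steps=1000, dt=0.01, seed=0):
    """Simulate one trajectory with randomly perturbed parameters (assumed model)."""
    rng = random.Random(seed)
    nominal = (10.0, 28.0, 8.0 / 3.0)  # textbook values, assumed here
    # Assumption: each parameter is scaled by (1 + rel_std * N(0, 1)).
    p = tuple(v * (1.0 + rel_std * rng.gauss(0.0, 1.0)) for v in nominal)
    state = (1.0, 1.0, 1.0)            # arbitrary initial state
    traj = [state]
    for _ in range(n_steps):
        state = rk4_step(state, p, dt)
        traj.append(state)
    return traj

traj = simulate(rel_std=0.05)
y = [s[0] for s in traj]               # measured output: first state only
labels = [(s[1], s[2]) for s in traj]  # x2 and x3, the states to estimate
```

An estimator only sees y (and, for the ETC system, the input u); the labels are what it must reconstruct.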

Some definitions and notation to understand the context

The state vector and the control input (if any) are denoted by x and u respectively. Both systems are defined up to the knowledge of an associated vector of parameters p involved in the model’s definition.

The very possibility of estimating the unmeasured state components xi, such as x2 and x3 in the datasets of both systems, relies on the existence of an associated map of the form:

xi(k) = Fi(y_past(k), p)

where y_past gathers the measurements acquired over a moving window spanning the past time instants:

(k-window, ..., k-1, k).

More precisely, the vector of features (used in the X features matrix) is built from the values of the measurements over the previously defined time interval, with an under-sampling that keeps one value out of every nJump values. Namely, when nJump=1 all the measurements are used, while when nJump=5 only one instant in five is kept.
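The windowing and under-sampling just described can be sketched as follows for a single measured signal. The function name build_features, the column ordering (oldest sample first) and the choice to always keep the most recent measurement y(k) are assumptions made for illustration; the dataset's exact layout may differ, and for the ETC system the measured input would presumably be stacked in the same way.

```python
import numpy as np

def build_features(y, window, n_jump):
    """Stack under-sampled past windows of a 1-D signal into a feature matrix.

    The row for instant k holds y[k - lag] for lag = ..., 2*n_jump, n_jump, 0,
    with lag <= window (hypothetical layout, oldest sample first).
    """
    y = np.asarray(y, dtype=float)
    lags = np.arange(0, window + 1, n_jump)[::-1]  # largest lag first, lag 0 last
    instants = np.arange(window, len(y))           # instants k with a full past window
    idx = instants[:, None] - lags[None, :]        # one row of sample indices per k
    return y[idx]

# Toy signal 0, 1, ..., 9 so that each entry equals its own time index.
y = np.arange(10.0)
X = build_features(y, window=4, n_jump=2)
print(X.shape)  # number of usable instants, number of kept samples per window
print(X[0])     # samples kept for the first usable instant k = 4
```

With window=4, setting n_jump=1 keeps all five samples of each window, whereas n_jump=2 keeps three of them.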

Based on the previous definitions, the features vector and the label to be identified are schematically shown in the figure below.

Description

This is the main file containing the dictionary of the dataset that can be used as a benchmark for the design of nonlinear state estimators via Machine Learning.

The file contains a dictionary that can be accessed using the pickle.load command:

import pickle
with open('data.pkl', 'rb') as f:
    data = pickle.load(f)

The list of keys of the data dictionary is the following:

[('etc', 0.0, 'x2'),
 ('etc', 0.0, 'x3'),
 ('etc', 0.05, 'x2'),
 ('etc', 0.05, 'x3'),
 ('etc', 0.1, 'x2'),
 ('etc', 0.1, 'x3'),
 ('lorentz', 0.0, 'x2'),
 ('lorentz', 0.0, 'x3'),
 ('lorentz', 0.05, 'x2'),
 ('lorentz', 0.05, 'x3'),
 ('lorentz', 0.1, 'x2'),
 ('lorentz', 0.1, 'x3')]

Where each key is a triplet of values representing

  • The system being considered: Possible values are etc or lorentz
  • The relative standard deviation of the system’s parameters: Possible values are 0, 0.05 or 0.1
  • The state component to be estimated: Possible values are x2 or x3.
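Since every key combines one value from each of the three items above, the full list of keys can be regenerated programmatically, for example to loop over all benchmark cases. This is a small sketch; the variable names are illustrative and not part of the dataset.

```python
from itertools import product

systems = ['etc', 'lorentz']
param_stds = [0.0, 0.05, 0.1]
targets = ['x2', 'x3']

# Cartesian product, in the same order as the list of keys shown above.
keys = list(product(systems, param_stds, targets))

for system, param_std, target in keys:
    print(f"{system:8s} param_std={param_std:<5} target={target}")
```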

Note that the noise level can be chosen by the user, with the corresponding noise added to the feature matrices.
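One possible way to add such noise is sketched below. It assumes zero-mean additive Gaussian noise whose standard deviation noise_level is expressed in the same units as the features; the helper add_noise is hypothetical and not shipped with the dataset.

```python
import numpy as np

def add_noise(X, noise_level, seed=0):
    """Return a noisy copy of X (assumed model: additive zero-mean Gaussian noise)."""
    rng = np.random.default_rng(seed)
    return X + noise_level * rng.standard_normal(X.shape)

# Toy feature matrix standing in for an Xtrain or Xtest array.
X = np.zeros((4, 3))
X_noisy = add_noise(X, noise_level=0.05)
print(X_noisy.shape)
```

Fixing the seed keeps the noisy features reproducible across runs, which helps when comparing estimators.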

Once a key k is chosen from the above list, the corresponding value data[k] is itself a dictionary giving access to the (X, y) pairs for training and test, namely:

data[k]['Xtrain'], data[k]['Xtest'], data[k]['ytrain'], data[k]['ytest']
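As an illustration of how these four arrays might be used in a benchmark, the sketch below fits a linear least-squares baseline and reports the test error. It runs on a synthetic stand-in dictionary named entry that has the same four fields; with the real file, entry would be replaced by data[k]. The linear model is only a baseline chosen here for brevity, not a method prescribed by the dataset.

```python
import numpy as np

# Synthetic stand-in with the same four fields as data[k] (assumption for the demo).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
entry = {}
for split, n in (('train', 200), ('test', 50)):
    X = rng.standard_normal((n, 3))
    entry['X' + split] = X
    entry['y' + split] = X @ w_true

def fit_linear(X, y):
    """Least-squares fit of y ~ [X, 1] @ w (last coefficient is the intercept)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict(w, X):
    """Evaluate the fitted linear model on new features."""
    return np.column_stack([X, np.ones(len(X))]) @ w

w = fit_linear(entry['Xtrain'], entry['ytrain'])
residual = predict(w, entry['Xtest']) - entry['ytest']
rmse = np.sqrt(np.mean(residual ** 2))
print(f"test RMSE: {rmse:.3e}")
```

On the real data a nonlinear regressor would normally be expected to do better, since the maps Fi are nonlinear in general.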

Finally, to get an idea of the size of the datasets, the following script can be used:

print(data[('etc', 0.0, 'x2')]['Xtrain'].shape)
print(data[('etc', 0.0, 'x2')]['Xtest'].shape)
print(data[('lorentz', 0.0, 'x2')]['Xtrain'].shape)
print(data[('lorentz', 0.0, 'x2')]['Xtest'].shape)

which produces the following results:

(136000, 30)
(136000, 30)
(44000, 5)
(44000, 5)