g2sys

API Documentation


This documentation concerns the scripting version of g2sys. For the Streamlit version, refer to the videos provided in the g2sys-dedicated section and in the sections that follow it.

1 The g2sys Fit arguments dictionary

Following the same scheme as the plars module, g2sys takes its arguments gathered in a dictionary. For instance, the following script shows how the fit method can be called:

import pandas as pd

from mizopol.g2sys_api import fit

# Load the working dataframe
df = pd.read_csv("datasets/zema.csv", index_col=0)

# set the arguments dictionary 
args = dict(deg=3,
            d=20,
            nd=2,
            list_of_c=['PS4', 'PS6'],
            nModes=20,
            nModels=10,
            nfeat_max=25,
            index_range=(0.0, 0.2),
            include_plots=True,
            recursive=True,
)

# Fit a g2sys model 
dic_solutions, dic_figs, cpu = fit(df, args)

In this script, one first loads the working dataframe, sets the arguments dictionary described below, and then runs the fit method of the g2sys_api module. Notice that, as described in the plars documentation, the dictionary only needs to contain the (keyword, value) pairs the user intends to set; all other parameters are used behind the scenes with their default values.
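The way unset parameters fall back to their defaults can be pictured with a minimal sketch. The DEFAULTS dictionary and the resolve_args helper are hypothetical names introduced for illustration only; the values are copied from Table 1, and how g2sys actually merges defaults internally is not documented here.

```python
# Hypothetical illustration: not part of the mizopol API.
# Default values copied from Table 1 of this documentation.
DEFAULTS = {
    "deg": 1, "window": 200, "nModels": 10, "nModes": 10, "eps": 5e-2,
    "nBatch": 25, "eta": 50, "d": 0, "nd": 0, "recursive": False,
    "nSlices": 20, "nSelect": 10, "nfeat_max": 20, "include_plots": False,
    "index_range": (0.0, 0.25), "th_monomials": 1e-4,
}


def resolve_args(user_args):
    """Return the full argument set: user values override the defaults."""
    if "list_of_c" not in user_args:
        # list_of_c is user-defined in Table 1, so it has no fallback value
        raise ValueError("list_of_c has no default and must be provided")
    return {**DEFAULTS, **user_args}


full = resolve_args({"deg": 3, "d": 20, "nd": 2, "list_of_c": ["PS4", "PS6"]})
print(full["deg"], full["window"], full["index_range"])
```

Here deg takes the user value 3, while window and index_range keep the defaults listed in Table 1.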

As the g2sys module relies on the plars module to parsimoniously identify polynomial relationships, one can recognize the parameters nModels, nModes and deg, which already belong to the arguments dictionary of plars (window is not set here because the user is satisfied with its default value).

Notice also the presence of the list of labels ['PS4', 'PS6'] for which models are sought. This is because g2sys is designed to work in batch mode on a set of targets.

The following table provides a list of available parameters for the g2sys module.

Table 1: Possible entries in the args dictionary for g2sys.

| Parameter | Type | Used for | Default |
|---|---|---|---|
| deg | int | Degree of the polynomial to be identified | 1 |
| window | int | Number of samples per window (window width) | 200 |
| nModels | int | Number of sampled windows for alignment evaluation | 10 |
| nModes | int | Number of selected monomials per window | 10 |
| eps | float | Precision for the final least squares solution | 5e-2 |
| nBatch | int | Number of windows used to determine monomial contributions | 25 |
| eta | float | Quantile used to compute the error dataframe | 50 |
| d | int | Amount of delay used (multiple of the sampling time) | 0 |
| nd | int | Number of delayed instances per sensor to be used | 0 |
| list_of_c | list[str] | List of sensor labels to model | user-defined |
| recursive | boolean | If True, dynamic relationships are looked for, otherwise static ones | False |
| nSlices | int | Number of random slices used in the search for relevant columns to include | 20 |
| nSelect | int | Number of columns to be selected at each slice | 10 |
| nfeat_max | int | Maximum number of columns included before polynomial expansion | 20 |
| include_plots | boolean | If True, fitting plots are provided | False |
| index_range | tuple(float, float) | Training interval, as fractions of the dataset's length | (0.0, 0.25) |
| th_monomials | float | Threshold for the inclusion of monomials in the solution | 1e-4 |
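To make the roles of d and nd concrete, here is a minimal sketch of how delayed copies of a sensor could be generated and labeled. It assumes that the j-th delayed instance is the signal shifted by j*d samples, which is consistent with the labels PS6(k-20), PS6(k-40) and PS6(k-60) obtained with d=20 and nd=3 in the monomials_contrib example of this documentation. The delayed_columns helper is a hypothetical name, and the actual feature construction inside g2sys may differ (for instance in how the current sample of the target is handled).

```python
# Hypothetical helper, not part of the mizopol API.
def delayed_columns(series, d, nd):
    """Map each sensor to its current and delayed copies.

    series: dict {sensor_label: list of samples}
    Returns a dict such as {"PS6(k)": [...], "PS6(k-20)": [...], ...}.
    The first j*d samples of a delayed copy are None because no past
    value is available for them.
    """
    out = {}
    for name, values in series.items():
        values = list(values)
        out[f"{name}(k)"] = values
        for j in range(1, nd + 1):
            lag = j * d
            pad = min(lag, len(values))
            out[f"{name}(k-{lag})"] = [None] * pad + values[:len(values) - pad]
    return out


cols = delayed_columns({"PS6": [0, 1, 2, 3, 4, 5]}, d=2, nd=2)
for label, values in cols.items():
    print(label, values)
```

Each delayed copy keeps the length of the original signal, so all columns stay aligned on the same time index k.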
Warning

Regarding the list_of_c argument, notice that it must always be a list: even when a single label, say c, is targeted, the fit method has to be called with the list [c].
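A small guard can prevent passing a bare string by mistake. The sketch below is a hypothetical helper (as_label_list is not part of the mizopol API) that wraps a single label into a list and checks that every label is a column of the working dataframe.

```python
# Hypothetical helper, not part of the mizopol API.
def as_label_list(targets, columns):
    """Return targets as a list of labels, checking they exist in columns."""
    labels = [targets] if isinstance(targets, str) else list(targets)
    missing = [c for c in labels if c not in columns]
    if missing:
        raise KeyError(f"labels not found in dataframe columns: {missing}")
    return labels


columns = ["PS4", "PS5", "PS6"]  # stand-in for list(df.columns)
single = as_label_list("PS4", columns)
several = as_label_list(["PS4", "PS6"], columns)
print(single, several)
```

The returned list can then be stored under the list_of_c key of the arguments dictionary.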

2 The fit method

2.1 Input arguments

Table 2: Input arguments for the fit method of g2sys.

| Parameter | Type | Description | Default |
|---|---|---|---|
| df | pandas dataframe | The working dataframe for training | user-defined |
| args | dict | The dictionary of arguments (see Section 1) | user-defined |

2.2 Returned values

Table 3: Values returned by the fit method of g2sys.

| Parameter | Type | Description |
|---|---|---|
| dic_solutions | dict | Dictionary whose keys are the elements of list_of_c and whose values are plars solutions, as described in the plars documentation |
| dic_figs | dict | Dictionary whose keys are the elements of list_of_c and whose values are plotly figures representing the fitting result. The figures are present only if include_plots is set to True in the arguments of fit; otherwise, None is returned |
| cpu | tuple | The local and distant computation times, as described in the plars documentation |
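Since figures may be absent, code that consumes dic_figs should not assume they exist. The sketch below is a hypothetical helper (available_figures is not part of the mizopol API) that tolerates both readings of the description above: dic_figs itself being None, or individual entries being None.

```python
# Hypothetical helper, not part of the mizopol API.
def available_figures(dic_figs):
    """Return {label: figure} restricted to the figures actually produced."""
    if not dic_figs:  # covers None and an empty dictionary
        return {}
    return {label: fig for label, fig in dic_figs.items() if fig is not None}


# Stand-in objects are used here instead of real plotly figures.
with_plots = available_figures({"PS4": "fig-PS4", "PS6": None})
without_plots = available_figures(None)
print(with_plots, without_plots)
```

With real results, each figure kept by the helper is a plotly figure and can be displayed with its show() method.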

Executing the script shown in Section 1 produces the following intermediate log during the fitting:

python -m test_g2sys
g2sys ---- test of fit
--> PS4              error = 0.12 |                      align : 0.967 |  nfeat = 6 / 2024
--> PS6              error = 0.07 |                      align : 0.995 |  nfeat = 3 / 2024

For each label, the log shows the fitting error, the alignment between the label and its feature-based prediction, and the number of monomials used relative to the total number of eligible ones.
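As a plausibility check on the total of 2024 eligible monomials, one can count the monomials of total degree at most deg in n variables, constant term included, which is the binomial coefficient C(n + deg, deg). This counting rule is an assumption, not something stated by the source; under it, 2024 corresponds to 21 candidate variables at deg=3. The n_monomials helper is a hypothetical name.

```python
from math import comb


# Hypothetical helper, not part of the mizopol API.
def n_monomials(n_vars, deg):
    """Count monomials of total degree <= deg in n_vars variables (constant included)."""
    return comb(n_vars + deg, deg)


total = n_monomials(21, 3)
print(total)  # 2024
```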

3 The predict method

Once a dictionary of solutions, say dic_solutions, indexed by the items of list_of_c, is available, each solution can be used to predict the associated label on a new dataframe df using the following script:

from mizopol.g2sys_api import predict

y, ypred, (cpu1, cpu2) = predict(df, dic_solutions['PS4'])

Here y is simply df['PS4'].values, while ypred contains its predicted values based on the solution dic_solutions['PS4']. As for the computation times, cpu1 is the time as seen by the user and cpu2 is the host-side time; cpu2 is generally lower, since cpu1 also includes the communication and server warm-up delays.
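Because predict handles one solution at a time, predicting every label of list_of_c requires a loop. The sketch below is a hypothetical wrapper (predict_all is not part of the mizopol API). It receives the prediction function as an argument, assumes only the call signature and return shape shown above, and records the difference cpu1 - cpu2 as the communication and warm-up overhead.

```python
# Hypothetical wrapper, not part of the mizopol API.
def predict_all(df, dic_solutions, predict_fn):
    """Apply predict_fn(df, solution) to every label of dic_solutions."""
    results = {}
    for label, sol in dic_solutions.items():
        y, ypred, (cpu1, cpu2) = predict_fn(df, sol)
        results[label] = {
            "y": y,
            "ypred": ypred,
            "overhead": cpu1 - cpu2,  # user-side time minus host-side time
        }
    return results


# Stand-in for mizopol.g2sys_api.predict so that the sketch runs on its own.
fake_predict = lambda df, sol: ([1.0, 2.0], [1.1, 1.9], (1.0, 0.4))
results = predict_all(None, {"PS4": {}, "PS6": {}}, fake_predict)
print(sorted(results))
```

With the real module, the same wrapper would be called as predict_all(df, dic_solutions, predict) after importing predict from mizopol.g2sys_api.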

4 The monomials_contrib method

As in the case of the plars module, the g2sys module provides a method that computes the contributions of the different monomials inside a solution. The following script shows how the solution associated with the sensor PS6 is used to compute the contributions of the different monomials over a working dataframe df:

import pandas as pd

from mizopol.g2sys_api import fit, monomials_contrib

df = pd.read_csv("datasets/zema.csv", index_col=0)
    
args = dict(deg=1,
            d=20,
            nd=3,
            list_of_c=['PS4', 'PS6'],
            nModes=20,
            nModels=10,
            nfeat_max=25,
            index_range=(0.0, 0.2),
            include_plots=False,
            recursive=True,
            only_train=False,
            th_monomial=1e-2,
            )

dic_solutions, dic_figs, cpu = fit(df, args)

sol = dic_solutions['PS6']
df_contrib, (cpu1, cpu2) = monomials_contrib(df, sol, win=200, nBatch=25)

print(df_contrib)
print(f'fitting error: {sol["error"]} | card = {sol["card"]}')
print(f'cpu all = {cpu1:1.3} | cpu distant = {cpu2:1.3}')

Notice that the meaning of the input arguments win and nBatch is exactly the same as the one provided in the plars documentation. The results of the previous script are shown below:

--> PS4              error = 0.17 |     align : 0.990 |  nfeat = 6 / 29
--> PS6              error = 0.04 |     align : 0.999 |  nfeat = 4 / 29

      Monomial  Contribution       std
0  (PS6(k-20))     -0.404991  0.001812
1  (PS6(k-40))      0.403470  0.001734
2     (PS5(k))      0.096090  0.000403
3  (PS6(k-60))     -0.095449  0.000386

fitting error: 0.042079054813895324 | card = 4
cpu all = 1.11 | cpu distant = 0.552

These results show that, because of the presence of delays, it is not obvious how to evaluate the importance of a particular sensor (as opposed to a monomial) in the solution, since a sensor might participate through several delayed terms. This is why the g2sys module also provides the sensors_contrib method, which is described in the next section.
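To illustrate the idea of grouping delayed terms by sensor, the sketch below sums the absolute monomial contributions per sensor and normalizes the totals. This is one plausible aggregation rule, not necessarily the one implemented by sensors_contrib; the aggregate_by_sensor helper, the label pattern and the equal split of multi-sensor monomials are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches a sensor label followed by "(k)" or "(k-<delay>)", e.g. "PS6(k-20)".
SENSOR_TERM = re.compile(r"([A-Za-z_]\w*)\(k(?:-\d+)?\)")


# Hypothetical helper, not part of the mizopol API.
def aggregate_by_sensor(monomial_contribs):
    """Sum |contribution| per sensor and normalize the totals to 1.

    monomial_contribs: dict {monomial_label: contribution}
    A monomial involving several sensors is split equally among them
    (an arbitrary choice made for this sketch).
    """
    totals = defaultdict(float)
    for monomial, contrib in monomial_contribs.items():
        sensors = sorted(set(SENSOR_TERM.findall(monomial)))
        for sensor in sensors:
            totals[sensor] += abs(contrib) / len(sensors)
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {sensor: value / norm for sensor, value in sorted(totals.items())}


# Values copied from the monomials_contrib output shown above.
contribs = {
    "(PS6(k-20))": -0.404991,
    "(PS6(k-40))": 0.403470,
    "(PS5(k))": 0.096090,
    "(PS6(k-60))": -0.095449,
}
shares = aggregate_by_sensor(contribs)
print(shares)
```

On these values the three delayed PS6 terms add up to a share of about 0.904, against about 0.096 for PS5.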

5 The sensors_contrib method

The following script fits g2sys models for the PS4 and PS6 sensors, then computes, for each of the two resulting models, the contributions of the sensors:

import pandas as pd

from mizopol.g2sys_api import fit, sensors_contrib

df = pd.read_csv("datasets/zema.csv", index_col=0)

args = dict(deg=1,
            d=20,
            nd=3,
            list_of_c=['PS4', 'PS6'],
            nModes=20,
            nModels=10,
            nfeat_max=25,
            index_range=(0.0, 0.2),
            include_plots=False,
            recursive=True,
            only_train=False,
            th_monomial=1e-2,
            )

dic_solutions, dic_figs, cpu = fit(df, args)

for c in args['list_of_c']:
    sol = dic_solutions[c]
    #---------------------------------------------------
    df_contrib, (cpu1, cpu2) = sensors_contrib(df, sol)
    #---------------------------------------------------
    print('sensors contribution in ', c)
    print(df_contrib)
    print(f'fitting error: {sol["error"]} | card = {sol["card"]}')
    print(f'cpu all = {cpu1:1.3} | cpu distant = {cpu2:1.3}')
    print('----')

This script produces the following output:

--> PS4              error = 0.07 |     align : 0.990 |  nfeat = 7 / 29
--> PS6              error = 0.06 |     align : 0.999 |  nfeat = 4 / 29

sensors contribution in  PS4
      contrib
PS4  0.739079
PS5  0.133793
PS6  0.127128

fitting error: 0.06515230111229599 | card = 7
cpu all = 1.25 | cpu distant = 0.671
----

sensors contribution in  PS6
      contrib
PS5  0.095567
PS6  0.904433

fitting error: 0.05536495997206507 | card = 4
cpu all = 1.22 | cpu distant = 0.525
----