plars

API Documentation


Recall that the objective of plars is to fit a polynomial \(P\) such that

\[y\approx P(x)\]

where \(y\) is a label while \(x\) is a vector of features.

1 (Instance | Fit) dictionaries

Before detailing the parameters of the plars call, let us examine the simplest way to request a fit of a plars model using the default parameters.

from mizopol.plars_api import fit 

# Set the default parameters for the plars instance
dic_plars = dict()

# Set the default parameters for the fit method 
dic_plars_fit = dict()

# run the fit method
sol, cpu = fit(X, y, dic_plars=dic_plars, dic_plars_fit=dic_plars_fit)

When empty dictionaries are provided, the default values are used. Each time an eligible (key, value) pair is defined inside a dict() instruction, the provided value replaces the corresponding default.
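
The override behaviour described above can be pictured as a plain dictionary merge. The following is only a sketch: the default values are copied from Table 1, while the helper resolve_params is hypothetical and is not part of the mizopol API, whose internal merging logic is not shown on this page.

```python
# Defaults copied from Table 1 of this page; the helper below is hypothetical.
DEFAULT_PLARS = {
    "deg": 1, "window": 200, "nModels": 10, "nModes": 10,
    "eps": 5e-2, "nBatch": 25, "eta": 50,
}


def resolve_params(user_dict, defaults=DEFAULT_PLARS):
    """Return the defaults with any user-provided (key, value) pairs applied."""
    unknown = set(user_dict) - set(defaults)
    if unknown:
        # Only eligible keys may override a default.
        raise KeyError(f"Ineligible keys: {sorted(unknown)}")
    return {**defaults, **user_dict}


# An empty dictionary leaves every default untouched,
# while provided keys replace the corresponding defaults only.
print(resolve_params({}))
print(resolve_params({"deg": 3, "window": 1000}))
```

Rejecting unknown keys is a design choice of this sketch; it makes a misspelled parameter name fail loudly instead of being silently ignored.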

Table 1 defines the possible (key, value) pairs that can be used in the setting of the dic_plars dictionary:

1.1 dic_plars dictionary’s arguments:

Table 1: Possible entries in the dic_plars dictionary.

| Parameter | Type | Used for | Default |
|---|---|---|---|
| deg | int | Degree of the polynomial to be identified | 1 |
| window | int | Number of samples per window (window width) | 200 |
| nModels | int | Number of sampled windows for alignment evaluation | 10 |
| nModes | int | Number of selected monomials per window | 10 |
| eps | float | Precision for the final least squares solution | 5e-2 |
| nBatch | int | Number of windows used to determine monomial contributions | 25 |
| eta | float | Quantile (as a percentage) used to compute the error dataframe | 50 |

With regard to Table 1, the following comments are worth making:

  1. The maximum number of monomials is limited to nModels * nModes by construction of the algorithm.

  2. Since eta represents a quantile value, it is generally taken from the set eta \(\in \{50, 80, 90, 95, 98, 99, 100\}\).

  3. While the deg parameter may be any integer, keep in mind that when the number of sensors is high, large values of deg may lead to a long computation time because of the resulting unreasonably high number of candidate monomials.

  4. In the interface modules defined in the publicly available GitHub repository, which serve as an intermediary with the core modules developed locally (see the deployment figure in the previous section), dic_plars and dic_plars_fit are plain dictionaries. The type of each entry is nevertheless enforced in the distant FastAPI service through pydantic model declarations. This helps return meaningful error messages when the type of a parameter is not legal. Some parameters also have bounds on their values that, when violated, trigger a comprehensive error message.

  5. Increasing nModels increases the chance of capturing all the important modes (monomials) that contribute here and there in the dataset to construct the correct vector of labels. Taking too small a value therefore risks skipping important correlations, while taking too high a value needlessly increases the computation time.

  6. As for nModes, it is somewhat linked to the presumed complexity of the model. It can be chosen once the two parameters nModels and window have been fixed.

  7. The last comments regarding nModels and nModes might suggest that the choice is quite difficult and requires a high level of expertise. In fact, the default values give quite good results in the majority of cases and, unless extreme values are taken, the results are not very sensitive to this choice. A typical workflow consists in trying the default values first, then increasing or decreasing them to see whether this makes any significant change in the quality of the fit.
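
To get a feeling for comments 1 and 3, the sizes involved can be estimated before calling the solver. The sketch below assumes that the candidate set contains every monomial of total degree at most deg, constant term included; this is an assumption, although it agrees with the value nfeat = 330 reported in Section 2.5 for 7 features and deg=4. Both helper names are hypothetical.

```python
from math import comb


def n_candidate_monomials(n_features, deg):
    """Count monomials of total degree <= deg in n_features variables (constant included)."""
    return comb(n_features + deg, deg)


def max_selected_monomials(nModels, nModes):
    """Upper bound on the number of retained monomials stated in comment 1."""
    return nModels * nModes


print(n_candidate_monomials(7, 4))      # 330
print(n_candidate_monomials(50, 4))     # 316251
print(max_selected_monomials(10, 10))   # 100
```

Going from 7 to 50 sensors at deg=4 multiplies the number of candidates by roughly a thousand, which is the situation comment 3 warns about.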

1.2 dic_plars_fit dictionary’s arguments:

Table 2: Possible entries in the dic_plars_fit dictionary.

| Parameter | Type | Used for | Default |
|---|---|---|---|
| th_monomial | float | Threshold to keep a candidate monomial | 1e-4 |
| colNames | list[str] | Names given to the \(x\)-components when creating the resulting dataframes | None |
| decouple | boolean | Whether to avoid updating the coefficients of the previously selected modes when new ones are selected | False |
| compute_contributions | boolean | Whether to compute the contributions of the monomials for later display | False |
| nfeats | int | Maximum number of sensors to involve in the solution | None |

With regard to Table 2, the following comments help to better understand the impact of the choices of the dictionary entries:

  1. Notice that if compute_contributions is set to False, the parameter th_monomial has no effect. Indeed, in order to use this threshold, the contributions of the monomials have to be computed.

  2. The colNames entry is used to refer to the columns of the features matrix X. If the default value None is used, the standard names x1, x2, … xn are used. Giving meaningful names that speak to end users familiar with the sensors can therefore be important when presenting the results.

  3. The nfeats parameter can be helpful when the number of sensors involved in the problem is very high, making the number of monomials impractically high for relatively high deg. In such a case, setting nfeats to a reasonable value forces the solver to first select the most important sensors before applying the polynomial transformation. Notice however that this selection process comes with a cost, so a value of nfeats different from the default None should be used only when necessary.
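
Comments 1 and 2 can be illustrated with a few lines of Python. This is a sketch of the documented behaviour only: the three helpers are hypothetical and are not part of the mizopol package.

```python
def default_col_names(n_features):
    """Standard names x1, x2, ..., xn used when colNames is None."""
    return [f"x{i + 1}" for i in range(n_features)]


def resolve_col_names(col_names, n_features):
    """Return the user names if given (checking their count), else the standard ones."""
    if col_names is None:
        return default_col_names(n_features)
    if len(col_names) != n_features:
        raise ValueError("colNames must have one entry per column of X")
    return list(col_names)


def threshold_is_active(dic_plars_fit):
    """th_monomial only matters when the contributions are computed (comment 1)."""
    return bool(dic_plars_fit.get("compute_contributions", False))


print(resolve_col_names(None, 3))
print(threshold_is_active({"th_monomial": 1e-3}))
```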

1.3 Using non default dictionaries

Based on the previous sections, it is now possible to rewrite the script of Section 1 using non-default values for some of the entries of the two dictionaries:

from mizopol.plars_api import fit 

# Override some default parameters of the plars instance
dic_plars = dict(deg=3, window=1000)

# Override some default parameters of the fit method
dic_plars_fit = dict(th_monomial=1e-3, compute_contributions=True)

# run the fit method
sol, cpu = fit(X, y, dic_plars=dic_plars, dic_plars_fit=dic_plars_fit)

In this way, the corresponding default values are replaced by the ones provided by the user.
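
Since the types of the entries are enforced on the server side (comment 4 of Section 1.1), a local pre-check can catch an obvious mistake before any request is sent. The sketch below is hypothetical: the expected types are copied from Table 1 and Table 2, check_types is not a mizopol function, and the actual pydantic models and value bounds are not shown on this page.

```python
# Expected types copied from Table 1 and Table 2; check_types is a hypothetical helper.
EXPECTED_TYPES = {
    "deg": int, "window": int, "nModels": int, "nModes": int,
    "eps": float, "nBatch": int, "eta": float,
    "th_monomial": float, "colNames": list, "decouple": bool,
    "compute_contributions": bool, "nfeats": int,
}


def check_types(dic):
    """Return a list of readable problems; an empty list means the dict looks fine."""
    problems = []
    for key, value in dic.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"{key}: unknown parameter")
            continue
        if value is None and key in ("colNames", "nfeats"):
            continue  # None is the documented default for these two entries
        if expected is float:
            # Accept integers where a float is expected, but never booleans.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected is int:
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_types(dict(deg=3, window=1000)))
print(check_types(dict(deg="3")))
```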

2 Fitting a plars model

2.1 Importing the fit method

The fit method can be imported via:

from mizopol.plars_api import fit 

2.2 Input arguments

Table 3: Input arguments of the fit method of the plars_api module.

| Parameter | Type | Used for | Default |
|---|---|---|---|
| X | list[list[float]] | The features matrix | user-defined |
| y | list[float] | The vector of labels | user-defined |
| dic_plars | dict | Parameters of the plars instance (see Table 1) | user-defined |
| dic_plars_fit | dict | Parameters of the fit method (see Table 2) | user-defined |

  1. X and y are NumPy ndarray variables (a matrix and a vector, respectively).

  2. For dic_plars and dic_plars_fit, see Table 1 and Table 2 of Section 1.

  3. All the input arguments of fit are mandatory, although the user may pass the default (empty) dictionaries as arguments. This choice was made intentionally to remind the user that these dictionaries exist and that their default values are not necessarily the ones to use.
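
Because X and y are sent to a distant service, it can be worth checking locally that their shapes are consistent before calling fit. The helper below is a hypothetical sketch, not part of the mizopol API; it only relies on len() and therefore works with nested lists as well as with two-dimensional NumPy arrays.

```python
def check_xy(X, y):
    """Check that X is a non-empty rectangular matrix with one row per label in y."""
    n_rows = len(X)
    if n_rows == 0:
        raise ValueError("X must contain at least one sample")
    if n_rows != len(y):
        raise ValueError(f"X has {n_rows} rows but y has {len(y)} labels")
    n_cols = len(X[0])
    if any(len(row) != n_cols for row in X):
        raise ValueError("all rows of X must have the same number of features")
    return n_rows, n_cols


# Two samples with three features each, and two labels.
print(check_xy([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0.5, 1.5]))
```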

2.3 Example of use

import numpy as np
from mizopol.plars_api import predict, fit, monomials_contrib

nt = 20000
nx = 7
X = 1.0 + np.random.randn(nt, nx)
y = 12 * X[:, 0] + 10.3 * X[:, 1] * X[:, 2] - 2 * X[:, 1] ** 4 - 12.0

dic_plars = dict(deg=4, nModes=10, nModels=6, window=100, eps=5e-2)

# Comment the colNames argument | try nfeats = 5
dic_plars_fit = dict(
    compute_contributions=False,
    colNames=[f'S{i + 1}' for i in range(X.shape[1])],
    nfeats=None,
)

sol, cpu = fit(X, y, dic_plars=dic_plars, dic_plars_fit=dic_plars_fit)

In the following section, we take a closer look at the arguments sol and cpu returned by the fit method.

2.4 Returned arguments

The fit method returns two arguments:

  • sol: A dictionary containing the solution and some corresponding fitting results that are detailed below.

  • cpu: a tuple (cpu[0], cpu[1]) such that

    • cpu[0]: the computation time from the user’s perspective. Namely, this includes the communication with the endpoint, potentially the time needed by the server to warm up the docker image, and the time needed to serialize and send back the results.

    • cpu[1]: the computation time needed on the server side, which gives faithful information about the efficiency of the algorithm, setting aside all the delays induced by the cloud and the distant deployment. This information can be useful for evaluating a local use of the MizoPol package, which might be possible under certain circumstances.
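
The difference between the two entries gives an estimate of the overhead induced by the distant deployment. A minimal sketch, in which cpu_breakdown is a hypothetical helper and the numerical values are made up for illustration:

```python
def cpu_breakdown(cpu):
    """Split the (user, server) timing tuple into server time and overhead."""
    total, server = cpu
    overhead = total - server
    return {
        "total": total,
        "server": server,
        "overhead": overhead,
        # Share of the user-perceived time that is not spent in the algorithm.
        "overhead_share": overhead / total if total > 0 else 0.0,
    }


# Illustrative values only: 2.0 s seen by the user, 0.5 s spent on the server.
print(cpu_breakdown((2.0, 0.5)))
```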

As for the first argument sol, it is a dictionary whose content is detailed in the following table:

2.4.1 The sol dictionary

Table 4: The content of the sol dictionary returned by the fit method of the mizopol.plars_api module.

| Parameter | Type | Description |
|---|---|---|
| nfeat | int | The number of candidate monomials before selection |
| indices | list[int] | The indices of the selected monomials among the generated polynomial features |
| powers | list[list[int]] | The matrix of powers that defines the selected monomials |
| coefs | list[float] | The associated vector of coefficients |
| card | int | The cardinality of the solution (number of retained monomials) |
| error | float | The value of the eta quantile of the error |
| cpu | float | The computation time (in seconds) |
| cols | list[str] | Column names in accordance with the matrix of powers |
| df_contrib | pandas dataframe | Dataframe showing the normalized contributions of the selected monomials in reconstructing the label (available only if compute_contributions is set to True in the dic_plars_fit dictionary) |
| df_sol | pandas dataframe | Summary dataframe showing more detailed statistics of the monomial contributions together with their associated coefficients in the solution (available only if compute_contributions is set to True in the dic_plars_fit dictionary) |
| dfe_train | pandas dataframe | Dataframe showing the percentiles of the error |
| eta | int | The eta used to produce the normalized percentiles of the error |
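
The entries powers, coefs and cols are enough to rebuild the fitted polynomial by hand: each row of powers gives the exponent of every feature in one monomial, and coefs gives its weight. The sketch below illustrates this reading of Table 4 on a hand-written dictionary whose values are the exact coefficients of the generating polynomial of Section 2.3. The helper names are hypothetical, and the sketch ignores the bias_to_add entry that appears in the printed output of Section 2.5 but is not documented in Table 4.

```python
def monomial_value(x, powers_row):
    """Evaluate one monomial: the product of x[j] ** powers_row[j] over all features."""
    value = 1.0
    for xj, p in zip(x, powers_row):
        value *= xj ** p
    return value


def evaluate(sol, x):
    """Evaluate the polynomial described by sol['powers'] and sol['coefs'] at x."""
    return sum(
        c * monomial_value(x, row) for c, row in zip(sol["coefs"], sol["powers"])
    )


def monomial_name(powers_row, cols):
    """Readable name of a monomial, e.g. (S2)^4 or (S2)(S3); the constant is '1'."""
    parts = []
    for name, p in zip(cols, powers_row):
        if p == 1:
            parts.append(f"({name})")
        elif p > 1:
            parts.append(f"({name})^{p}")
    return "".join(parts) or "1"


# Hand-written example mimicking the structure of sol (illustrative values).
sol_example = {
    "powers": [
        [0, 4, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
    ],
    "coefs": [-2.0, 10.3, 12.0, -12.0],
    "cols": ["S1", "S2", "S3", "S4", "S5", "S6", "S7"],
}

x = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
print(evaluate(sol_example, x))
print([monomial_name(row, sol_example["cols"]) for row in sol_example["powers"]])
```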

2.5 Example of returned results

Let us examine the sol and cpu returned by the script of Section 2.3 by executing the following script:

for k, value in sol.items():
    print(k)
    print(value)
    print('---')

print(f'cpu all = {cpu[0]:1.3} | cpu distant = {cpu[1]:1.3}')

Below are the printed messages:

nfeat
330
---
indices
[0, 1, 204, 16]
---
powers
[[0, 4, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]
---
coefs
[-1.9999998958050205, 10.299963148684332, 11.999881943977648, -11.999786769721386]
---
card
4
---
error
3.671939849565719e-06
---
cpu
0.11779189109802246
---
cols
['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7']
---
bias_to_add
0.0
---
df_contrib
   Monomial  Contribution       std
0    (S2)^4     -0.333409  0.088181
1  (S2)(S3)      0.236548  0.029214
2      (S1)      0.232246  0.016285
3         1     -0.197798  0.000000
---
df_sol
   S1  S2  S3  S4  S5  S6  S7  Contribution       std      coefs
0   0   4   0   0   0   0   0     -0.333409  0.088181  -2.000000
1   0   1   1   0   0   0   0      0.236548  0.029214  10.299963
2   1   0   0   0   0   0   0      0.232246  0.016285  11.999882
3   0   0   0   0   0   0   0     -0.197798  0.000000 -11.999787
---
dfe_train
         Error
50%   0.000001
80%   0.000002
90%   0.000003
95%   0.000004
98%   0.000004
99%   0.000005
100%  0.000008
---
eta
95

3 Computing monomial contributions

Previously, it has been shown that when fitting a plars model on some training dataset with the field compute_contributions set to True in the dic_plars_fit dictionary, the contributions of the monomials retained in the solution are automatically computed.

Now, given a fitted solution sol returned by the fit method, it can be useful to compute the contributions of the monomials contained in the solution on a new dataframe. This can reveal several kinds of information:

  1. If the contributions of the monomials in the new data are far from their contributions in the training data, this might indicate a change of context between the training data and the new data.

  2. Sometimes, when the residual of the relationship is higher over a period of time inside the new data, computing the change in the contributions of the monomials within the incriminated period can tell a lot about the kind of fault that lies behind the increase in the residual.
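
The comparison suggested in these two items can be sketched with plain dictionaries mapping monomial names to contributions. The training values below are taken from the df_contrib printed in Section 2.5; the values for the new data, the tolerance tol and both helper names are hypothetical and only serve the illustration.

```python
def contribution_shift(train, new):
    """Absolute change of each monomial's contribution between two datasets."""
    return {name: abs(new.get(name, 0.0) - value) for name, value in train.items()}


def drifting_monomials(train, new, tol=0.05):
    """Monomials whose contribution moved by more than tol, largest shift first."""
    shifts = contribution_shift(train, new)
    flagged = [name for name, shift in shifts.items() if shift > tol]
    return sorted(flagged, key=lambda name: -shifts[name])


# Contributions on the training data (from the df_contrib of Section 2.5).
train = {"(S2)^4": -0.333409, "(S2)(S3)": 0.236548, "(S1)": 0.232246, "1": -0.197798}
# Made-up contributions on some new data, for illustration only.
new = {"(S2)^4": -0.45, "(S2)(S3)": 0.24, "(S1)": 0.11, "1": -0.20}

print(drifting_monomials(train, new))
```

In practice the two dictionaries could be built from the Monomial and Contribution columns of the dataframes returned by fit and monomials_contrib.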

3.1 Importing the monomials_contrib method

This is done using the monomials_contrib method, which can be imported using

from mizopol.plars_api import monomials_contrib

3.2 Input arguments

Table 5: Input arguments of the monomials_contrib method of the plars_api module.

| Parameter | Type | Used for | Default |
|---|---|---|---|
| df | pandas dataframe | The working dataframe | user-defined |
| sol | dict | Solution returned by the fit method | user-defined |
| win | int | The window used in evaluating the contribution by random sampling | 200 |
| nBatch | int | Number of sampled windows used in the evaluation | 25 |

The method is called as follows:

df_contrib, (cpu1, cpu2) = monomials_contrib(df, sol, win=200, nBatch=25)

3.3 Returned arguments

The monomials_contrib method returns:

  • df_contrib: a pandas dataframe taking the same form as the one contained in the fitted sol (provided that the compute_contributions field is set to True in the dic_plars_fit dictionary)

  • The tuple cpu containing the computation times (user and distant cpu) as previously explained in Section 2.4.

4 The predict method

Once a solution dictionary, say sol, is returned by the fit method, it can be used to predict the label for a given features matrix X. This is done by the predict method, as shown in the following script:

from mizopol.plars_api import predict

ypred, (cpu1, cpu2) = predict(X, sol)

where ypred is the vector of predicted labels and (cpu1, cpu2) is the pair of computation times (user and distant) explained in Section 2.4.
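
Once predictions are available, the quality of the fit on new data can be summarized with the same percentile levels as those listed for eta in Section 1.1. The sketch below uses raw absolute residuals and a nearest-rank percentile. Both choices are assumptions, since the normalization applied by the library to produce dfe_train is not described on this page, and the helper names are hypothetical.

```python
from math import ceil


def percentile(values, q):
    """Nearest-rank percentile of values for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def error_percentiles(y, ypred, etas=(50, 80, 90, 95, 98, 99, 100)):
    """Percentiles of the absolute residuals |y - ypred| at the given levels."""
    residuals = [abs(a - b) for a, b in zip(y, ypred)]
    return {eta: percentile(residuals, eta) for eta in etas}


# Made-up labels and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.25, 3.0, 3.0]
print(error_percentiles(y_true, y_hat))
```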