PwPol

API documentation

Recall that the objective of the pwpol module is the following:

Given a set of sensors \(s_j\), \(j=1,\dots,n_s\) and a target sensor \(s_i\), find a set of \(n_r\) polynomials \(P_\sigma\), \(\sigma=1,\dots,n_r\), such that the following residual: \[ e_i = \min_{\sigma=1}^{n_r}\Bigl\vert s_i-P_\sigma(\{s_j\}_{j\neq i})\Bigr\vert \tag{1}\] is small.

The principle is similar to the previously presented modules plars and g2sys, the main call needs a dictionary of arguments to be procided which is described in the following section.

1 The `pwpol` arguments dictionary

Here again, as the plars module lies in the heart of all the methods proposed in the Mizopol package, the dictionary unavoidably contains the entries of the plars dictionary and the plars fit dictionary. In particular, the following keys are present that have been already presented in the plars documentation:

deg, nModels, nModes, window, eps, eta coming from the plars’s instance dictionary.
decoule, nfeats, compute_contributions, th_monomial, coming from the plars fit dictionary.

Additional arguments are needed that are linked to the search for the piece-wise polynomial representations.

For the sake of completeness, all the parameters are included in the table below, the old as well as the new ones.

Table 1: Possible entries in the pwpol module args dictionary.

Parameter	Type	Used for	Default
`deg`	`int`	The degree of the polynomial to be identified	`1`
`window`	`int`	number of samples per window (window width)	`200`
`nModels`	`int`	Number of sampled window for alignement evaluation	`10`
`nModes`	`int`	Number of selected monomials per window	`10`
`eps`	`float`	precision for the final least squares solution	`5e-2`
`nBatch`	`int`	Number of window used to determine monomials contributions	`25`
`eta`	`float`	The quantile used to compute the error dataframe	`50`
`th_monomial`	`float`	Threshold to keep a candidate monomial	`1e-4`
`decouple`	`boolean`	Whether to avoid updating the coefficients of the previously selected modes when new ones are selected	`False`
`compute_contributions`	`boolean`	Whether to compute the contributions of monomial for later displaying	`False`
`nfeats`	`int`	Maximum number of sensors to to involve in the solution	`None`
`th`	`float`	Initial value of the precision threshold used to admit a candidate polynomial during the search	`0.1`
`Nguess`	`int`	The number of regions centers that are randomly sampled when searching for regions in the features space	`5`
`Niter`	`int`	The number of solutions for the same centers to account for the randomness of the `plars` solutions	`5`
`ncoef_max`	`int`	The maximum cumulative number of coefficients included in all the retained polynomials	`5000`
`expansion_rate`	`float`	Expansion rate used to increase the precision threshold, initially at `th` after unsuccessful rounds	`1.2`
`ratio_untreated`	`float`	Ratio of the dataset that can be left without being fitted by any polynomial	`0.01`
`dx`	`float`	Initial size of the regions around the randomy sampled center in the space of normalized features.	`0.02`

2 The `fit` method

2.1 Input arguments

Table 2: Input arguments for the fit method of the pwpol module.

Parameter	Type	Used for	Default
`df`	`pandas dataframe`	The dataframe used in the fit	user-defined
`colX`	`list[str]`	The list of columns in `df` to be used as features	user-defined
`coly`	`str`	The column in `df` to be used as label	user-defined
`args`	`dict`	The arguments dictionary discussed in Section 1	user-defined
`plot_conv`	`bool`	If set to `True` a plotly figure is provided showing the convergence log of the solution	`False`

2.2 Returned arguments

Table 3: Input arguments for the fit method of the pwpol module.

Parameter	Type	Description
`model`	`dict`	Dictionary containing all the information defining the model (see below for a complete list)
`cpu`	`tuple`	The tuple of local and distant computation times

The following table enumerates all the fields contained in the returned model.

Table 4: Description of the model dictionary returned by the fit method of the pwpol module.

Parameter	Type	Description
`solutions`	`list[dict]`	The list of `plars` solution representing the polynomials retained in the fitted piece-wise polynomial model
`centers`	`list[list[float]]`	The matrix of centers, each row is a center of a region that was fitted by the polynomial having the same index in `solutions`
`rayons`	`list[float]`	The list of radius of the region around the centers where the polynomial has been identified
`populations`	`list[int]`	The list of integers representing the number of samples contained in the region of the same index
`dens`	`list[float]`	The vector of normalizing coefficients applied to the columns of the training dataframe `df` used for the fit
`log`	`dict`	The log dictionary containing some information regarding the fitting process
`colX`	`list[str]`	The argument `colX` used in the fit (see Table 2)
`coly`	`str`	The argument `coly` used in the fit (see Table 2)
`th`	`float`	The argument `th` used in the fit (see Table 2)
`args`	`dict`	The argument `args` used in the fit (see Table 2)
`deg`	`int`	The argument `deg` used in the arguments dictionary (see Table 1)
`fig_conv`	`plotly \| None`	The convergence figure in case the parameter `plot_conv` of the call is set to `True`
`df_sensitivity`	`pandas dataframe \| None`	The sensitivity dataframe in case `compute_contributions` is set to `True`

2.3 Fit example

from mizopol.pwpol_api import fit

df = pd.read_csv('datasets/Zema.csv', index_col=0).iloc[::5]
coly = 'PS3'
colX = [c for c in df.columns if c != coly]

args = dict(
    th=0.05,
    deg=3,
    window=200,
    compute_contributions=True,
    th_monomial=1e-3,
)

model, (cpu1, cpu2) = fit(df=df[::10], colX=colX, coly=coly,
                            args=args, plot_conv=True)
# let us recover the solutions field fo the model
if 'fig_conv' in model:
    model['fig_conv'].show()

print('\n')
print('arguments of the call:')
print(model['args'], '\n')
print('---------')
print('Sensitivity dataframe:')
print(model['df_sensitivity'], '\n')
print(f'cpu total = {cpu1:1.3} | cpu distance {cpu2:1.3} \n')

print('The list of keys:')
print(model.keys())

Results:

Treated   0% | #rows=   5109 | #models =   0 | #coeffs =   0, | th=0.050
Treated  85% | #rows=    807 | #models =   1 | #coeffs =   4, | th=0.050
Treated  92% | #rows=    427 | #models =   2 | #coeffs =   7, | th=0.050
Treated  95% | #rows=    256 | #models =   3 | #coeffs =  14, | th=0.050
Treated  97% | #rows=    201 | #models =   4 | #coeffs =  33, | th=0.050
Treated  97% | #rows=    201 | #models =   4 | #coeffs =  33, | th=0.050
Treated  97% | #rows=    201 | #models =   4 | #coeffs =  33, | th=0.050
Treated  97% | #rows=    201 | #models =   4 | #coeffs =  33, | th=0.050
Treated  97% | #rows=    201 | #models =   4 | #coeffs =  33, | th=0.050
Treated  98% | #rows=    127 | #models =   5 | #coeffs =  43, | th=0.104


arguments of the call:
{'deg': 3, 'window': 200, 'nModes': 10, 'nModels': 10, 'eps': 0.05, 'nBatch': 25, 'eta': 95, 'dx': 0.02, 'th': 0.05, 'Niter': 5, 'Nguess': 5, 'ncoeff_max': 5000, 'ratio_untreated': 0.01, 'from_dataframe': True, 'th_monomial': 0.001, 'compute_contributions': True, 'decouple': False, 'nfeats': None, 'expansion_rate': 1.2, 'th_percentile': 99.0, 'colX': ['PS1', 'PS2', 'PS4', 'PS5', 'PS6', 'EPS1'], 'coly': 'PS3'} 

---------
Sensitivity dataframe:
      Contribution
PS2       0.347512
PS1       0.293670
EPS1      0.173660
PS5       0.094850
PS4       0.059131
PS6       0.027503
1         0.003675 

cpu total = 32.8 | cpu distance 32.3 

The list of keys:
dict_keys(['solutions', 'centers', 'rayons', 'populations', 'dens', 'log', 'colX', 'coly', 'th', 'args', 'deg', 'inds', 'cpu', 'fig_conv', 'df_sensitivity'])

Notice also since we asked for the convergence plot, the execution of the previous script shows the following plotly figure

The convergence plot produced by the `pwpol` fitting call. Notice that the print `th` is the last one obtained during the fit and resulting from the expansion when necessary. The value of the initial `th` can be read inside the `model['args']` field.

3 The `predict` method

Once a model is obtained through the fit method, it can be used on a new dataframe to produce prediction. However:

Prediction of a pwpol model

As the model is implicit, the residual defined by Equation 1 can be computed only if the true label y is available. In the absence of the label, only the values of all the polynomials involved in the model can be computed and this does not deliver any residual as each of these value is legitimate candidate as value of the label.

3.1 The `options` dictionary

The availability of the label y is provided to the predict method through the options dictionary (that will be used as one of the input arguments of the predict method) that might be empty or showing one or both of the following fields:

options = None
options = {'y': y, 'n':5}
options = {'n': 5}
options = {'y': y}

in which \(y\) is the true label contained in the dataframe. the n fields might be used to define the number of polynomials to be used. Indeed, as it has been shown in the results of the fit process above, it might happen (and it generally happens) that the last polynomials are used in order to handle a small amount of remaining samples making them not necessarily mandatory for a good fit. that is why the n field in the option dictionary has been introduced.

If the value is None (default value), no residual is computed and the prediction of all the polynomials involved in the model are computed in a matrix Ypred having as many columns as there are polynomials in the model.

3.2 The input arguments

Now that we introduced the option dictionary we can move on to present the input argument of the predict method.

Table 5: Input arguments for the predict method of the pwpol module.

Parameter	Type	Description	Default
`df`	`pandas dataframe`	The dataframe to be used in the prediction	user-defined
`model`	`dict`	The `pwpol` model that is returned by the `fit` method	user-defined
`options`	`dict`	The options dictionary	None

3.3 The returned arguments

Table 6: The returned items by the predict method of the pwpol module.

Parameter	Type	Description
`Ypred`	`list[list[float]]`	The matrix of prediction by all the polynomials in the `model`
`ypred`	`list[float] \| None`	The predicted label in case the `options['y']` is provided, else, None is returned
`dfe`	`pandas dataframe \| None`	The normalized percentile of error in case the `options['y']` is provided, else, None is returned
`ind_polynomial`	`list[int] \| None`	The vector of index of the closest polynomial in case the `options['y']` is provided, else, None is returned
`cpu`	`tuple`	The tuple providing the local and the distant computation times.

3.4 `predict` example

df = pd.read_csv('datasets/Zema.csv', index_col=0).iloc[::5]
coly = 'PS3'
colX = [c for c in df.columns if c != coly]

args = dict(th=0.1,deg=2,window=2000,compute_contributions=True,th_monomial=1e-3)

model, (_,_) = fit(df=df, colX=colX, coly=coly, args=args, plot_conv=False)

options = dict(y = df[coly].values)

#-----------------------------------------------------------------------------
Ypred, ypred, dfe, ind_polynomial, (cpu1, cpu2) = predict(df, model, options)
#-----------------------------------------------------------------------------

plot = True
if plot:
    fig = go.Figure()
    t = np.linspace(0,1, len(ypred))
    fig.add_trace(go.Scatter(x=t, y=ypred, name='predicted'))
    fig.add_trace(go.Scatter(x=t, y=options['y'], name='true'))
    fig.show()

print('percentiles of errors \n', dfe)
print(f"cpu all {cpu1:1.3} | cpu distant {cpu2:1.3}")
print('log')
print('------')
for key, value in model['log'].items():
    print(key, value)
print('------')
print('cardinalities: ', [sol['card'] for sol in model['solutions']])
print('------')
print('Thresholds \n', 'initial = ', model['args']['th'], '| final = ', model['th'])
print('colX = ', model['colX'])

Results:

Treated   0% | #rows=  51084 | #models =   0 | #coeffs =   0, | th=0.100
Treated  96% | #rows=   2077 | #models =   1 | #coeffs =  10, | th=0.100
Treated  96% | #rows=   2077 | #models =   1 | #coeffs =  10, | th=0.100
Treated  97% | #rows=   1535 | #models =   2 | #coeffs =  17, | th=0.120
percentiles of errors 

          Error
50%   0.018472
80%   0.043386
90%   0.059735
95%   0.076548
98%   0.124325
99%   0.207965
100%  1.419767

cpu all 2.44 | cpu distant 0.14

log
------
remain [100, 4, 4, 3]
nb_models [0, 1, 1, 2]
nb_coeff [0, 10, 10, 17]
cpu 87.91623163223267
------
cardinalities: [10, 7]
------
Thresholds 
 initial =  0.1 | final =  0.12
colX =  ['PS1', 'PS2', 'PS4', 'PS5', 'PS6', 'EPS1']

1 The pwpol arguments dictionary

2 The fit method