| BACKGROUND |
[Alignment independent descriptors] [Shape descriptors] [Principal Component Analysis] [Partial Least Squares] [Variable Selection]
1. Alignment independent descriptors
1.1. Introduction.
In many fields one of the crucial steps is to describe the structure of chemical compounds. The biological activity of a compound is often a linear function directly dependent of its free energy of binding with a biological receptor, and therefore it depends upon the presence in the ligand of some groups capable to make efficient non-covalent bonds with the receptor. Since these interactions depend critically on the distances between the interacting groups, any description of the structure of the ligands should consider not only the presence or absence of the appropriate chemical groups, but also consider their relative spatial position.
The importance of the position of the substituents in the space justified the development of the so-called 3D QSAR, in which the structural description of the compounds describes also the positions most favorable for the interaction in terms of a "molecular interaction field" (MIF). In the original approaches (1), the molecular interaction field was simulated using a chemical probe that samples the possible chemical interactions at regular, definite positions, using a 3D grid around the compounds.
Unfortunately, in order to build descriptors of use in QSAR, these descriptors must be consistent through a series of compounds and this implies to align the compounds, using exactly the same 3D grid for all of them. When the compounds in the series are rather similar it is easy to find a set of hypothetical anchor points, and attempt a superimposition. When the compounds are not so similar the superimposition is not obvious. To worsen things, experience gained trough the analysis of series of ligand-receptor complexes solved by X-ray crystallography shows that, often the superimposition is not simple, and sometimes it is even paradoxical (2,3). So far, a wide variety of methods of superimposing compounds has been proposed, none of them being completely satisfactory (4).
Today it is commonly accepted that the superimposition is the most difficult and subjective step in any 3D QSAR analysis. Usually, it is the most time-consuming step but, since the results depends completely on the accuracy of the superimposition, it cannot be overlooked. Indeed, the difficulties associated to this step are the reason behind the decision of many pharmaceutical companies to dismiss the otherwise powerful 3D QSAR methods. The aim of this work is to develop a method of describing compounds in the 3D space without previously superimpose them. In addition, such a description should also be
| Relevant | It should represent properties of the compounds that are important for discriminating between active and inactive compounds. |
| Interpretable | The descriptors should be formulated in diagrams or plots that can be understood and interpreted chemically. |
This section describes the GRid INdependent Descriptors (GRIND)(5), as implemented in program ALMOND 3.0. These are descriptors aimed to solve the above problems integrated in a friendly software environment.
1.2. Initial assumptions
1.3. GRid INdependent Descriptors (GRIND)
Based on the above initial assumptions we have developed a procedure which transforms the information included in the MIF and generates from it a handful of informative variables, independent of the location of the molecules within the grid. These descriptors have probed useful in 3D QSAR analysis, but are applicable in different fields, like:
etc..
The procedure of transformation involves two main steps: Field filtering and MACC2 encoding.
Field filtering
Before applying any mathematical transformation on the MIF variables, the MIF itself should be simplified. It should be borne in mind that MIFs often contain hundreds of thousand of variables and the information they bear is largely redundant.
|
Aspect of a typical molecular interaction field. Favorable interactions (negative) are represented in blue. Unfavorable interactions (positive) are represented in yellow. The information is largely redundant |
![]() |
| Positions of the MIF showing more favorable interactions with an OH probe | ![]() |
The objective of filtering is extracting from the grid field, for each object, a reduced subset of informative positions or nodes. This is made by extracting the set of nodes meeting at best two requirements:
The rationale behind this algorithm is to try to represent the independent pharmacophoric groups by which the ligand can interact. Both criteria are important: considering only the values of the field might overlook independent separate regions of interaction which are important for the activity. The relative importance of both criteria can be adjusted in the algorithm using a tunable parameter.
The algorithm used for the selection is an adaptation of the Fedorov algorithm, classically used in D-optimal design, in which the target scoring function has been replaced by a simple function representing both the values of the field and the Euclidean distance between the points.
![]() |
Nodes of the field represented above selected by the filtering. Three regions of points were selected. In this case, these regions map neatly the positions with higher field values. |
Maximum Auto and Cross Covariance (MACC) transform
The use of autocorrelation (ACC) transform in QSAR is not new (8). However, the application of the classical ACC in 3D QSAR produces extremely complex correlograms, very difficult to interpret. Moreover, since variables are generated as the sum of many variables there is no way to trace back the nodes that participate in each variable. Therefore our approach is slightly different. The MACC transform represents in the autocorrelograms only the maximum value of the products of the two i and j field values, found at each different rij distance.
The transform starts partitioning all the possible distances rij inside the box in a definite set of discrete distances. In order to do this, the program defines a "smoothing window" (SW) width in terms of grid units. When comparing two interactions between two couples of nodes, one separated by rij and other separated by r'ij, they are compared only if the difference between rij and r'ij is smaller than the SW. Then, for each possible couple of field values i and j, the product is computed and compared with any other interaction in the same range of distances (as defined by the SW value). For each distance considered only one value is stored, the maximum one, and the spatial position of the nodes generating this interaction is stored as well.
Field values were normalized by dividing the value of the energy of interaction by a constant representing a theoretical maximum of energy for this probe. Therefore, the values of the fields are normalized approximately between 0 and 1 and, as a consequence, their products must vary approximately between 0 and 1.
For each set of MIFs considered, the MACC transform computes the auto-correlograms and the corresponding cross-correlograms, i.e. in the case of two MIFs A and B, the MACC transform will compute the autocorrelograms AA and BB and the crosscorrelogram AB. The procedure generates a different set of correlograms for each object in the series. These can be represented individually for every object in the series or all together. The series correlograms can look messy in some series but are useful to visualize the variation of different peaks of the correlograms in the series. The interpretation of the profiles is much better if the points are colored according to the chemical families of the compounds or according to their biological activity.
![]() |
Example of correlogram representing a series
of compounds. The correlograms for every compounds in the series is
represented simultaneously. Here the colors represent the activity of the compounds (blue inactive, red, active) |
GRIND variables are not completely independent of the alignment of the molecules during the GRID analysis. It should be borne in mind that the GRID analysis extracts a discrete set of samples from a field that is continuous. Slight displacements of the molecules into the cage will produce changes in the sampling set. Even if the transform can make the information sampled independent of the alignment, it cannot correct the problems of the sampling itself!. However, these differences are minimized by an adequate sampling, i.e. by using smaller grid spacing. In our test, a grid spacing of 0.5Å or smaller gave the best results (directive NPLA=2).
![]() |
Autocorrelogram corresponding to the above
filtered field. The peaks here represent distances between the regions.
In ALMOND these plots are interactive, clicking on the points with the mouse shows the corresponding node-node interactions in the grid plots (next plot) |
In the autocorrelograms, each peak represents a couple of intense interactions separated by a certain distance. The position of the peak in the horizontal axis expresses the distance at which this interaction was found. In the crosscorrelograms, the meaning is the same but the interactions represented are between nodes of different MIFs.
Compounds interacting with the same receptor can have some peaks in common, expressing the internal geometrical relationships of receptor regions with which the ligands establish non-covalent bonds. One of the unique features of the MACC transform is that it is possible to trace back the variables that generated this "most intense" interaction. Indeed, the program ALMOND is able to represent in the space these interactions as lines linking the couples of points involved.
| After the interaction of the User on the autocorrelogram, the grid plot shows the nodes which have produced the selected peaks, as red lines connecting couples of nodes. | ![]() |
The correlograms obtained can be used to generate an X matrix of descriptor variables. Each variable is associated to a certain interaction distance, between couples of points belonging the same MIF or different MIFs. Each correlogram is, per se, a block of X variables.
In ALMOND, the standard procedure involves three GRID probes: the hydrophobic probe (DRY), the carbonyl oxygen (O) and the amide nitrogen (N1). These probes represent strong non-covalent interactions commonly found in the receptors.
When three probes are used, the metric describing the series includes up to six blocks of X variables:
| auto-correlograms | DRY | DRY |
| O | O | |
| N1 | N1 | |
| cross-correlograms | DRY | O |
| DRY | N1 | |
| O | N1 |
The assembly of all these variables, usually less that one hundred per compound, constitutes the standard molecular description that can be used directly for the chemometric analysis. The X matrix looks like that.
| compound 1 | DRY-DRY | O-O | N1-N1 | DRY-O | DRY-N1 | O-N1 |
| compound 2 | DRY-DRY | O-O | N1-N1 | DRY-O | DRY-N1 | O-N1 |
| compound 3 | DRY-DRY | O-O | N1-N1 | DRY-O | DRY-N1 | O-N1 |
| ... | ||||||
| compound n | DRY-DRY | O-O | N1-N1 | DRY-O | DRY-N1 | O-N1 |
The X matrix can be scaled using one of the following methods:
| Raw | No transformation is performed |
| Remove baseline | Often, interactions at certain distance ranges are found only for some compounds in the series. The associated variables take a certain value for these compounds and zero for all the others. Therefore, these variables show a large variance that can be detrimental for the modeling. The Remove baseline scaling removes from the X matrix all the variables that take zero value for one or more compounds. |
| Normalize block-wise | Values within each block is normalized between 0.5 and 2.0, using the maximum value of the product of interaction found into this block. |
It is difficult to fully explain the activity of chemical compounds without considering their shape. In the context of ligand-receptor interaction, a good shape complementarity between the ligand and the receptor is generally required for binding. The molecular shape is not considered explicitely in the filtered fields since only grid nodes with negative energy are conserved. For this the reason, a special shape probe, i.e. TIP, has been designed.
2.1. Molecular shape
The positive part of a MIF is a good starting point for the analysis of the shape of a molecule, either for visualisation or for QSAR modelling. The grid nodes situated at the limit of the positive field define an iso-surface which follows the outline of the molecule.
| Example of a molecular interaction field at an energy value of 1 kcal/mol. The shape of the molecule is followed by a yellow iso-surface. | ![]() |
![]() |
Unfortunatelly, incorporating all the grid nodes of this surface is not appropriate for MACC2 computation. As for any GRID fields, a filtering procedure is required in order to select the most relevant grid nodes. In this case, the energy of the field cannot be used as a filtering criterion since all the nodes on the surface have the same energy value. Instead, we measure surface curvature which is an execelent property for shape description.
An interesting property of the molecular surface, which could be exploited to extract relevant regions, is the local curvature. It can be defined as the rate of change of direction of the surface. Local curvature is very interesting as a molecular shape descriptor, because many of the substituent which protrude from the molecular scaffold are characterised by having highly convex regions. Therefore, the characterization of such convexity can indicate the position and spatial extent of chemical groups and of the molecule as a whole. The rationale of this characterization can be found here [9]
ALMOND 3.3 implements local surface characterization. All the procedure was embedded in a "GRID probe-like" interface, in order to make easier its integration within the GRIND formalism.
2.2. curvature and GRIND
The shape field is computed in few steps:
| The iso-surface has been colored according to its curvature, convex regions can be easily distinguished from concave region. the TIP patches (blue crosses) highlight regions of the molecular structure very likely to play a major role (either positive or negative) in the fitting of the structure into the binding pocket. | ![]() |
![]() |
In the same way as the original GRIND auto-correlograms, the TIP-TIP correlogram indicate the distance between two separated TIP patches. Each bin value is a result of the product of two scaled curvature value. Cross-correlograms indicate the distance between specific area of interaction (e.g. an hydrogen-bond acceptor site) and the protruding parts of the molecule. Cross-distances between TIP patches and other sites of interaction give some indications about the local shape differences between the compounds. From a QSAR point of view, when a GRIND variable involving the TIP patch has a negative weight on the PLS model, one can suspect of a steric hindrance near to the TIP patch. Conversely, a positive weight on the PLS model is an indication of a stabilizing shape feature.
3. Principal Components Analysis (PCA).
3.1. Theory.
Principal Components Analysis (PCA) is a technique extremely useful to "summarize" all the information contained in the X-matrix and put it in a form understandable by human beings.
The PCA works by decomposing the X-matrix as the product of two smaller matrices:
X = TP' + E
The information not contained in these matrices remains as "unexplained X-variance" in a residual matrix (E) which has exactly the same dimensionality as the original X-matrix.
The PCs, among many others, have two interesting properties:
In PCA, the User can decide how many PCs should be extracted (the number of significant components, i.e. the dimensionality of the model). Each new PC extracted increases further the amount of information (variance) explained by the model. However, usually the first four of five PCs explain more than 90% of the X-variance. There is not a simple nor unique criterion to decide how many PC to extract and two kinds of considerations should be taken into account. From a theoretical point of view, it is possible to use cross-validation techniques to decide the number of PCs to include. From a practical point of view it does not matter to extract a large number of PCs if the User has no way to interpret the results.
3.2. Uses of PCA
PCA is extremely useful for understanding the distribution of the objects, and knowing how different are one from another and why they are different. Moreover the PCA can be used to highlight the variables which contain similar information, or the variables which contains completely independent information.
The best way to extract information from the PCA is to plot the matrices obtained:
2D and 3D scores plot
These plots represent the relative position of the objects in the space (two-dimensional or three-dimensional) of the principal components. They are useful to identify clusters of objects and single objects that behave in a peculiar way (outliers).
Moreover, the position of the objects in the plots may serve to interpret the PCs. The first PCs try to explain the maximum amount of variation and therefore, when there are clusters of objects, to distinguish among them. In this context, the PC can be interpreted as a compendium of the distinctive features of the objects in these clusters.
2D and 3D loading plots
These plots represents the original variables in the space (two-dimensional or three-dimensional) of the principal components.
Remember that the PC are obtained as linear combinations of the original X-variables. The loading of a single variable indicates how much this variable participates in defining the PC (the squares of the loadings indicate their percentage in the PC). Variables contributing very little to the PCs have small loading values and are plotted around the center of the plot. On the other hand, the variables which contribute most are plotted around the borders of the plot.
Loading plots allow to test the "homogeneity" of the contributions of the X-variables to the model. When there are more than one single field group of variables, it is useful to plot the loadings highlighting the variables belonging to one of them, in order to understand how each field contributes to the whole description of the objects.
4. Partial Least Squares (PLS)
4.1. Regression models.
In the previous section we have described PCA, a technique of multivariate analysis that deals only with the X-variables. The Partial Least Squares (PLS) analysis is a regression technique and its goal is to explain one or more dependent variables (Y's) in terms of a number of explanatory variables (predictors, X's).
Y = f(X) + E
It is possible to build many different models that fulfill the equation. Different methods produce models that "fit" the Y's more or less accurately. Among them, the best one will be able to calculate Y values that correspond to the experimental ones, even for molecules not included in building the model. These models are "predictive" and can be used to calculate reliable estimations of Y values for new molecules, prior to their availability.
It is important to notice that the Y's variables, like any other experimental variable, contain error. The models will try to fit the Y as much as possible and, if we try to improve too much the fitting, the model will explain also the noise!. This phenomenon, called overfitting, is very dangerous, because overfitted models seem to be very good, but they often prove to be useless to predict the Y's of objects not included in the available data set (training set).
The predictive ability of a model is attributed to the existence of a "true" relationship between the measured X properties (usually field interaction energies) and the Y property measurements (usually the activity of a drug). Even if the 3D-QSAR models are rough and are not able to explain to a full extent the activity values, they are extremely valuable to identify the structural features that contribute most to the activity. These can be regarded as important in the context of the ligand-receptor interaction, and therefore they can give hints about how they interact and suggest new compounds.
The most sophisticated PLS 3D-QSAR models are subject to the same limitations of any regression model, and the aphorism "No regression model is better than the series it was obtained from" is always applicable. It is not possible that a model can provide information about the influence on the activity of areas which were not different enough in the series. Unfortunately no design method for 3D-QSAR have been reported so far and a good series design in this area continue to be a challenge. Nevertheless the User should be aware of the lack of series design and understand the limitations of the resulting models.
4.2. PLS modeling.
In 3D-QSAR often the X-matrix contains much less objects (molecules) than variables. In these situations, the classical regression technique, Multiple Linear Regression (MLR) is completely useless. There are many reasons, but, among others:
In fact, the only regression method than can deal with the kind of X-matrices used in 3D-QSAR is Partial Least Squares (PLS). PLS works decomposing the X-matrix as the product of two smaller matrices, much like PCA does:
The main difference is that PCA obtains the PCs that represent at best the structure of the X-matrix and PLS obtains the LVs under two constrains:
The PLS algorithm originally proposed by Wold works as follows. For each component it is calculated:
w' = u'X / (u'u)
w = w / ||w|| (w is normalized to length 1)
t = Xw / (w'w)
c' = t'Y / (t't)
u = Yc / (c'c)
And these steps are repeated until t convergence. Then it is calculated:
p' = t'X / (t't) (calculates the X-loadings)
E = X - tp' (updates the X-matrix)
F = Y - tc' (updates the Y-matrix)
Where (the lowercase letters represent the corresponding vectors from this matrices):
X X-matrix.
Y Y-matrix (can contain more than one single variables).
W Weight matrix which expresses the relationship between the X's and the Y's.
T X-score matrix, in which the objects are described in terms of the LVs.
C Y-weights, which describe the relationship between the X-scores and the Y-variables.
E The X-residuals. Variance that remains unexplained in the X-matrix.
F The Y-residuals. Variance that remains unexplained in the Y-matrix.
P X-loading matrix, the LVs described in terms of the original X-variables.
At the end, the PLS model satisfies:
X = TW' + E X's are decomposed in X-scores (T) and X-weights (W)
Y = UC' + F Y's are decomposed in Y-scores (U) and Y-weights (C)
U = T + H X-scores correlates with Y-scores (the inner relation)
Notice that the X-loadings (P) are calculated at the end of each iteration, to best approximate the X-matrix, and don't participate directly in the model building but to update the X-matrix.
The LVs share some important properties with the PCs:
As in the PCA, the User can select the number of LVs to maintain in the model, but in PLS, selecting the correct dimensionality is of critical importance. When too many LVs are included a serious overfitt will result and the model will have little or no validity. To check how many LVs to include it is strictly necessary to test the predictive ability of the model taking into account different number of components.
4.3. Tests of predictive ability. Cross-validation.
The evaluation of the predictive ability of the PLS models is important to :
The predictive ability of a model is usually evaluated using cross-validation (CV). It works building reduced models (models for which some of the objects were removed) and using them to predict the Y-variables of the objects held out. Then the Y predicted is compared with the Y experimental, and so, for each model dimensionality the following indexes are computed:
SDEP Standard Deviation of Errors of Prediction.
![]()
Q^2 Predictive correlation coefficient
![]()
Y : Experimental value
Y' : Predicted value
: Average value
N : Number of objects
The CV technique is very valuable because it performs an "internal validation" of the model and obtains an estimation of the predictive ability without the help of external data sets. This is particularly important in QSAR studies, where the number of objects available is usually small, and it is not affordable to remove objects from the learning data set.
One of the main inconveniences of CV is that there is not a general agreement on how to build the reduced groups and on the criterion to decide how many objects to keep. It is clear that the objects should be deleted once and only once over the model ensemble, but apart from this there are different approaches:
Leave One Out Models are built keeping one object at a time out of the analysis and repeating the procedure until all the objects are kept out once.
Leave Two Out Models are built keeping two objects at a time out of the analysis and repeating the procedure until all the objects are kept out once in all the combinations of two.
Fixed Groups The objects are assigned in a fixed way to N groups, each one containing an equal (or nearly equal) number of objects. Then models are built keeping one of this groups out of the analysis until all the objects were kept one once.
Random Groups The objects are assigned in a random way to N groups, each one containing an equal (or nearly equal) number of objects. Then models are built keeping one of this groups out of the analysis until all the objects were kept one once. The formation of the groups and the validation is repeated M times.
In order to choose the most appropriate CV method we have to consider the peculiarities of the QSAR data sets, in which the objects are always clustered. In this circumstance, the LOO or LTO methods have no chance to remove the structure of the data; most of the information contained in the object or couple of objects removed is anyway inside the model, kept in others objects of the same cluster, thus leading to overoptimistic results if this is also true that the SDEP is obtained in a reproducible way. On the other hand, the random group approach can produce a much better, but more conservative, estimation of the real predictive ability. In other words: the uncertainly of future predictions is numerically worse but much more reliable. The lower is the number of groups, the harder is the validation criterion.
In this latter approach, the number of groups should be fixed in such a way that there are real chance that complete clusters are removed from the analysis in the reduced models. Also, the procedure should be repeated many times, in order to obtain stable results. For this procedure it can be computed the Standard Deviation of SDEP, which gives an estimate of the dispersion of the SDEP values obtained from different runs.
There are also some more details of the computation that may affect the results of CV. When some objects are removed from the data set to build the reduced models, there are two possibilities: to recalculate the average and weights of each variable for the new, reduced data set or to use the original averages and weights. The first approach is more time consuming, but the calculations are more accurate. This method may introduce bias and the reduced model may require one more LV, but only if the groups are formed only once. When the groups are formed in a random way the problem is removed.
Another way to evaluate the predictive ability of the model is to use an external prediction set. In this approach the objects in the original data set are split up into two groups from the very beginning of the analysis. The first one, the learning set, will be used to build the PLS model. The other, the prediction set, will be used to compare their experimental Y-values with the predictions made by the PLS model. There is no doubt that this technique is more realistic to test the predictive ability. However it can be argued that the results depend critically upon how many and which objects are assigned to each group. Also, the data sets in QSAR, often contains too few objects and it is not possible to remove objects from the analysis without a loss of information.
4.4. Uses of PLS.
As for PCA, the best way to examine the information from PLS is to plot the matrices obtained. To fully understand and diagnose a PLS model one should look carefully all the available plots.
2D T-U scores plot
This plot represent objects in the space of X-scores (T) against the Y-scores (U). From this plot the User can have a clear idea of the correlation between the X's and the Y's obtained in the model for each one of the LV. The plot of the first LV is by far the most informative and contains the main relationship between activities and structural descriptors. This information is completely lost is one looks only to the "predicted vs experimental" plot.
The plot can be useful to identify influential objects or clusters of objects (outliers). Usually these objects don't correlate in the first component. Then, the second or third LV is devoted to fit them, and they appear in the T-U scores plot for this LV as some (few) objects completely distinct from the rest of the objects. Accordingly, the T-U score plots for non-significant components show this behavior.
2D and 3D loading plots
These plots represents original variables in the space (two-dimensional or three-dimensional) of the latent variables (P).
Remember that the LVs are obtained as linear combinations of the original X-variables. The loading of a single variable indicates how much of this variable is included in the LV.
Variables contributing very little to the LVs have small loading values and are plotted around the center of the plot. On the other hand, the variables which contribute most are plotted at the borders of the plot.
Loading plots allow to test the "homogeneity" of the contributions of the X-variables to the model. When there are more than one single field group of variables, it is useful to plot the loadings highlighting the variables in one of them, to understand how each field contributes to the whole PLS model.
2D and 3D weight plots
These plots represent original variables in the space (two-dimensional or three-dimensional) of the weights (W).
The weights (W) represent the coefficients that multiply the X's to best fit the Y's. Therefore, we can say that the loadings represent better the first constraint used to build the PLS model (the representation of the X-matrix) while the weights represent better the second constraint used to build the PLS model (the fitting of the Y's). Variables with high weights are important for the fitting of the Y's while variables with low weights (those in the center of the plot) are not so important. When there are more than one single field group of variables, it is useful to plot the weights highlighting the variables in one of them, to understand how each field contributes to the whole PLS model, from the point of view of the fitting.
5.1. Concepts.
The conceptual model underlying 3D-QSAR is that the large number of variables included in the X-matrix somewhat captures the dominating effects due to changes in structure over the actual molecules. However, we should be aware that a large number of variables in the X-matrix have no relationship with the activity and introduce only noise in the description of the molecules.
It should be considered that any X-variable, even if it does not contribute to explain the Y-variables, certainly contribute to the structure of the X-matrix. As the solution provided by PLS has the constrain of explaining the structure of the X-matrix, this structure only makes more difficult to find a solution satisfying both constraints. It has been reported that PLS is unable to obtain a good model when a single explicative variable is hidden in the middle of many others, even if this variable is highly correlated to the Y. One of the possible solutions to this problem is to remove from the data set all the noisy variables, but it is not so simple to define a criterion nor a methodology to distinguish noise from information. The FFD procedure was developed just to handle this problem.
5.2. The FFD procedure
In few words, the FFD procedure is a method for detecting variables increasing the predictive ability of PLS models. The models obtained by using only variables selected by FFD are more predictive than the PLS on all variables.
The FFD procedure involve the following steps:
First step. Initial PLS model.
The first step is to build a PLS model using all the variables and to select the dimensionality that produces the most predictive model.
Second step, evaluation of the variables.
In the second step FFD builds a large number of "reduced models" similar to the complete model but removing some variables. The predictive ability of each model is evaluated using CV and, from these values, FFD relates the predictive ability of the model with the presence or absence of each X-variable.
In this step, one of the main problem, is to find the most efficient way to evaluate the individual effect of each variable in the predictive ability of the models. The strategy used by FFD is to make a "design matrix" following a Fractional Factorial Design scheme (from here, the name of the method). In this matrix, each experience is a "variable combination" from which a reduced PLS model is built, and the measured response is the calculated predictive ability for this model. The matrix contains as many columns as many variables there are in the X-matrix (p), which can take a value of (+), expressing that in this experience the corresponding X-variable is used or a value of (-), expressing that in this variable combination (experience) the corresponding X-variable is not used.
| x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | xp | |
| model 1 | - | - | - | - | + | - | - | - | - | + | |
| model 2 | + | - | - | - | - | + | + | + | - | - | |
| model 3 | + | + | - | - | + | - | - | + | + | + | |
| ... | |||||||||||
| model N-1 | + | + | + | - | - | + | + | + | + | + | |
| Model N | + | + | + | + | + | + | + | + | + | + |
Once the N models were obtained and the SDEP for each one is calculated, it is possible to calculate how the presence or absence of a given variable affects the SDEP value (the effect of this variable), and therefore their predictive ability
where:
E : The effect of a variable .
SDEP+ : The average SDEP for all the models that include the variable.
SDEP- : The average SDEP for all the models that don't include the variable.
N : The total number of models.
Unfortunately, in FFD, the main effects of the main variables are confounded with some of their interaction terms, and we have the risk of selecting as relevant a variable that is not. In order to minimize this possibility, FFD introduces in the design matrix some "dummy" or "ghost" variables: variables that appear in the design matrix but that have no equivalent in the X-matrix. The effect of these variables, is then compared with the effect of "real variables", on the basis of a Student-t-tailoring at the 95% confidence level. The comparison is performed as follows:
The average effect of the dummies (V) is computed:

where:
ED : Effect of the dummy variable D.
ND : Number of dummy variables included in the design matrix.
tcritical : T critical for the 95% of confidence.
The effect of each variables (E) is compared with the average effect of the dummies (V). From this comparison the variables are labeled as fixed, excluded or uncertain:
E < V : This variable is "uncertain".
E > V and E < 0 : This variable decreases SDEP. Fixed.
E < V and E > 0 : This variable increases SDEP. Excluded.
The fixed variables have a positive effect on the predictive ability of the model, while the excluded variables have a negative effect. The effect of the uncertain, either positive or negative, can't be distinguished from the spurious effects which appear for the dummies. The number of dummies included in the matrix is also important. In general, the larger is the number of dummies, the better the decision about the relevance of the variables, but in our experience, a ratio 4:1 between the real and the dummies variables is a good compromise.
Please note that the dummies are necessary just because of the confoundings present in the FFD. In full factorial designs, the specific effect of a single variable can be evaluated unambiguously, however, at the expense of an enormous increase in the computation time. A cheaper alternative consists in "folding-over" the design, that means repeating all the combinations in the design matrix after switching the signs. This new design has the main effects unconfounded with two-variables interaction effects and, therefore, the individual effects of the variables can be evaluated in a more reliable fashion.
| x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | xp | |
| model 1 | - | - | - | - | + | - | - | - | - | + | |
| model 2 | + | - | - | - | - | + | + | + | - | - | |
| model 3 | + | + | - | - | + | - | - | + | + | + | |
| ... | |||||||||||
| model N-1 | + | + | + | - | - | + | + | + | + | + | |
| Model N | + | + | + | + | + | + | + | + | + | + | |
| model N+1 | + | + | + | + | - | + | + | + | + | - | |
| model N+2 | - | + | + | + | + | - | - | - | + | + | |
| model N+3 | - | - | + | + | - | + | + | - | - | - | |
| ... | |||||||||||
| model 2N-1 | - | - | - | + | + | - | - | - | - | - |
Third step . Remove undesirable variables.
The last step consist in removing from the data set the variables that don't improve the predictive ability of the model. It is clear that all the variables labeled as "excluded variables" should be removed from the analysis, while those labeled as "fixed" will be kept. In the method it is possible to choose between removing or maintaining the uncertain variables, but we suggest to remove them.
However, in our experience, when the elimination of variables is forced too much it might appear overfitting in the model. Also, from a theoretical point of view, force the variable selection too much can break the PLS formalism, by destroying completely the structure of the X-variables. The User should compare the risks of keeping in the model:
Our advice to be conservative and to apply always a "mild" variable selection.
The whole procedure can be repeated several times. In each step, a first PLS model is obtained and variable selection is applied to the data. In the next step, the reduced data file is used to obtain a new PLS model and to select variables again. Usually, after two selection of variables, the number of variables remains more or less unchanged. Also, the considerations stated above can be applied here: never force the variable selection too much.
Additional remarks:
FFD is a procedure which works evaluating the predictive ability of the models using CV and, therefore, it is extremely sensitive to a misuse of the CV. In particular, one should make sure that the data set don't contains outliers nor dangerous clustering. If not, there is the risk to select the variables that better predict the outliers, obtaining a model of no general validity.
The variable selection can make models more predictive, but when the initial model is very little or no predictive at all, the FFD method cannot improve their quality, but it would derive chance correlations. Our advice is to never apply variable selection to models with low or negative Q2. In any case, the User should be aware that under these conditions, there is a high probability of obtaining spurious results, and he should try at his own risk.