|
Performs partial least squares regression for one or more response variables and one or more predictor variables.
Y — Array of size ny by h containing the values of the responses, where ny ≥ NOBS is the number of rows of Y and h is the number of response variables. (Input)
X — Array of size nx by p containing the values of the predictor variables, where nx ≥ NOBS is the number of rows of X and p is the number of predictor variables. (Input)
COEF — Array of size SIZE(IXIND) by SIZE(IYIND) containing the final PLS regression coefficient estimates. (Output)
NOBS — Positive
integer specifying the number of observations to be used in the analysis.
(Input)
Default :
NOBS =
min(size (Y,1), size (X,1)).
IYIND — Array
containing column indices of Y specifying which response variables to use
in the analysis. MAXVAL(IYIND) ≤
h. (Input)
Default: IYIND = 1, 2, …,
h.
IXIND — Array
containing column indices of X specifying which predictor variables to
use in the analysis. MAXVAL(IXIND) ≤
p. (Input)
Default: IXIND = 1, 2,…,
p.
NCOMPS — The
number of PLS components to fit. NCOMPS ≤ SIZE(IXIND).
(Input)
Default: NCOMPS = size (IXIND).
Note: If CV = .TRUE., models with 1 up to NCOMPS components are tested using cross-validation. The model with the lowest predicted residual sum of squares is reported.
CV — Logical.
If .TRUE.,
the routine performs K-fold cross validation to select the number of
components. If .FALSE., the routine
fits only the model specified by NCOMPS.
(Input)
Default: CV = .TRUE.
K — Integer
specifying the number of folds to use in K-fold cross validation.
K must be
between 1 and NOBS,
inclusive. K is ignored if CV = .FALSE.
(Input)
Default: K = 5.
Note: If NOBS/K <= 3, the routine performs leave-one-out cross validation as opposed to K-fold cross validation.
SCALE —
Logical. If .TRUE., Y and
X are centered and scaled to have mean 0 and standard deviation of
1. If .FALSE. Y and
X are centered only. (Input)
Default: SCALE = .FALSE.
IPRINT — Printing
option. (Input)
Default: IPRINT = 0.
IPRINT |
Action |
0 |
No printing is performed. |
1 |
Prints final results only. |
2 |
Prints final results and intermediate results. |
YHAT — Array of size NOBS by h containing the predicted values for the response variables using the final values of the coefficients. (Output)
RESIDS — Array of size NOBS by h containing residuals of the final fit for each response variable. (Output)
SE — Array of size p by h containing the standard errors of the PLS coefficients. (Output)
PRESS — Array of size NCOMPS by h providing the predicted residual error sum of squares obtained by cross-validation for each model of size j = 1, … , NCOMPS components. The argument PRESS is ignored if CV = .FALSE. (Output)
XSCRS — Array of size NOBS by NCOMPS containing X-scores. (Output)
YSCRS — Array of size NOBS by NCOMPS containing Y-scores. (Output)
XLDGS — Array of size p by NCOMPS containing X-loadings. (Output)
YLDGS — Array of size h by NCOMPS containing Y-loadings. (Output)
WTS — Array of size p by NCOMPS containing the weight vectors. (Output)
Generic: CALL PLSR (X, Y, COEF [,…])
Specific: The specific interface names are S_PLSR and D_PLSR.
Routine PLSR
performs partial least squares regression for a response matrix , and a set of p explanatory variables,
. PLSR
finds linear combinations of the predictor variables that have highest
covariance with Y. In so doing, PLSR
produces a predictive model for Y using components (linear combinations)
of the individual predictors. Other names for these linear combinations
are scores, factors, or latent variables. Partial least squares regression
is an alternative method to ordinary least squares for problems with many,
highly collinear predictor variables. For further discussion see, for
example, Abdi (2009), and Frank and Friedman (1993).
In Partial Least Squares (PLS), a score, or component matrix, T, is selected to represent both X and Y as in,
and
The matrices P and Q are the least squares solutions of X and Y regressed on T.
That is,
and
The columns of T in the above relations are often called X-scores, while the columns of P are the X-loadings. The columns of the matrix U in Y = UQT + G are the corresponding Y scores, where G is a residual matrix and Q as defined above contains the Y-loadings.
Restricting T to be linear in X , the problem is to find a set of weight vectors (columns of W) such that T = XW predicts both X and Y reasonably well.
Formally, where each
wj is a column vector of
length p,
M ≤ p is the number of components, and where
the m-th partial least squares (PLS) component wm solves:
where and
is the Euclidean norm. For further details see Hastie, et. al.,
pages 80-82 (2001).
That is, wm is the vector which
maximizes the product of the squared correlation between Y and Xα
and the variance of Xα, subject to being orthogonal to each previous
weight vector left multiplied by S. The PLS regression coefficients arise
from
Algorithms to solve the above optimization problem include NIPALS (nonlinear iterative partial least squares) developed by Herman Wold (1966, 1985) and numerous variations, including the SIMPLS algorithm of de Jong (1993). Subroutine PLSR implements the SIMPLS method. SIMPLS is appealing because it finds a solution in terms of the original predictor variables, whereas NIPALS reduces the matrices at each step. For univariate Y it has been shown that SIMPLS and NIPALS are equivalent (the score, loading, and weights matrices will be proportional between the two methods).
If CV=.TRUE., PLSR searches for the best number of PLS components using K-fold cross-validation. That is, for each M = 1, 2,…, p, PLSR estimates a PLS model with M components using all of the data except a hold-out set of size roughly equal to NOBS/K. Using the resulting model estimates, PLSR predicts the outcomes in the hold-out set and calculates the predicted residual sum of squares (PRESS). The procedure then selects the next hold-out sample and repeats for a total of K times (i.e., folds). For further details see Hastie, et. al., pages 241-245 (2001).
For each response variable, PLSR returns results for the model with lowest PRESS. The best model (the number of components giving lowest PRESS), generally will be different for different response variables.
When requested via the optional argument, SE, PLSR calculates modifed jackknife estimates of the standard errors as described in Martens and Martens (2000).
1. PLSR defaults to leave-one-out cross-validation when there are too few observations to form K folds in the data. The user is cautioned that there may be too few observations to make strong inferences from the results:
2. Informational errors
Type Code
2 1 For response #, residuals converged in # components, while # is the requested number of components.
3. This implementation of PLSR does not handle missing values. The user should remove missing values in the data. The user should removes missing data or NaN’s from the data input.
The following artificial data set is provided in de Jong (1993).
,
The first call to PLSR fixes the number of components to 3 for both response variables, and the second call sets cv = .true. in order to perform K-fold cross validation. Note that because n is small, PLSR performs leave-one-out (LOO) cross–validation.
use plsr_int
use umach_int
implicit none
integer, parameter :: n=4, p=3, h=2
integer :: ncomps, iprint, nout
logical :: cvflag
real(kind(1e0)) :: x(n,p), y(n,h), yhat(n,h), coef(p,h), se(p,h)
data x/-4.0, -4.0, 4.0, 4.0, 2.0, -2.0, 2.0, -2.0, 1.0, -1.0, &
-1.0, 1.0/
data y/430.0, -436.0, -361.0, 367.0, -94.0, 12.0, -22.0, 104.0/
! Print out informational error.
call erset(2, 1, 0)
call umach(2,nout)
cvflag = .false.
iprint = 1
ncomps = 3
write(nout,*) 'Example 1a: no cross-validation, request', &
ncomps, 'components.'
call plsr(y, x, coef, ncomps=ncomps, cv=cvflag, yhat=yhat, &
iprint=iprint, se=se)
write(nout,*)
write(nout,*) 'Example 1b: cross-validation '
call plsr(y, x, coef, iprint=iprint, yhat=yhat, se=se)
end
Example 1a: no cross-validation, request 3 components.
PLS Coeff
1 2
1 0.7 10.2
2 17.2 -29.0
3 398.5 5.0
Predicted Y
1 2
1 430.0 -94.0
2 -436.0 12.0
3 -361.0 -22.0
4 367.0 104.0
Std. Errors
1 2
1 131.5 5.1
2 263.0 10.3
3 526.0 20.5
*** ALERT ERROR 1 from s_plsr. For response 2, residuals converged in 2
*** components, while 3 is the requested number of components.
Example 1b: cross-validation
Cross-validated results for response 1:
Comp PRESS
1 542903.8
2 830049.8
3 830049.8
The best model has 1 component(s).
Cross-validated results for response 2:
Comp PRESS
1 5079.6
2 1263.4
3 1263.4
The best model has 2 component(s).
PLS Coeff
1 2
1 15.9 12.7
2 49.2 -23.9
3 371.1 0.6
Predicted Y
1 2
1 405.8 -97.8
2 -533.3 -3.5
3 -208.8 2.2
4 336.4 99.1
Std. Errors
1 2
1 134.1 7.1
2 269.9 3.8
3 478.5 19.5
*** ALERT ERROR 1 from s_plsr. For response 2, residuals converged in 2
*** components, while 3 is the requested number of components.
The data, as appears in S. Wold, et.al. (2001), is a single response variable, the “free energy of the unfolding of a protein”, while the predictor variables are 7 different, highly correlated measurements taken on 19 amino acids.
use plsr_int
use umach_int
implicit none
integer, parameter :: n=19, p=7, h=1
integer :: ncomps, iprint, nout
logical :: cvflag, scale
real(kind(1e0)) :: x(n,p), y(n,h), yhat(n,h), coef(p,h), se(p,h)
data x/0.23, -0.48, -0.61, 0.45, -0.11, -0.51, 0.0, 0.15, 1.2, &
1.28, -0.77, 0.9, 1.56, 0.38, 0.0, 0.17, 1.85, 0.89, &
0.71, 0.31, -0.6, -0.77, 1.54, -0.22, -0.64, 0.0, 0.13, &
1.8, 1.7, -0.99, 1.23, 1.79, 0.49, -0.04, 0.26, 2.25, &
0.96, 1.22, -0.55, 0.51, 1.2, -1.4, 0.29, 0.76, 0.0, &
-0.25, -2.1, -2.0, 0.78, -1.6, -2.6, -1.5, 0.09, -0.58, &
-2.7, -1.7, -1.6, 254.2, 303.6, 287.9, 282.9, 335.0, &
311.6, 224.9, 337.2, 322.6, 324.0, 336.6, 336.3, 366.1, &
288.5, 266.7, 283.9, 401.8, 377.8, 295.1, 2.126, 2.994, &
2.994, 2.933, 3.458, 3.243, 1.662, 3.856, 3.35, 3.518, &
2.933, 3.86, 4.638, 2.876, 2.279, 2.743, 5.755, 4.791, &
3.054, -0.02, -1.24, -1.08, -0.11, -1.19, -1.43, 0.03, &
-1.06, 0.04, 0.12, -2.26, -0.33, -0.05, -0.31, -0.4, &
-0.53, -0.31, -0.84, -0.13, 82.2, 112.3, 103.7, 99.1, &
127.5, 120.5, 65.0, 140.6, 131.7, 131.5, 144.3, 132.3, &
155.8, 106.7, 88.5, 105.3, 185.9, 162.7, 115.6/
data y/8.5, 8.2, 8.5, 11.0, 6.3, 8.8, 7.1, 10.1, 16.8, 15.0, &
7.9, 13.3, 11.2, 8.2, 7.4, 8.8, 9.9, 8.8, 12.0/
call umach(2, nout)
cvflag = .false.
iprint = 2
ncomps = 7
scale = .true.
write(nout,*) 'Example 2a: no cross-validation, request', &
ncomps,'components.'
call plsr(y, x, coef, ncomps=ncomps, cv=cvflag, scale=scale, &
iprint=iprint, yhat=yhat, se=se)
write(nout,*)
write(nout,*) 'Example 2b: cross-validation '
call plsr(y, x, coef, scale=scale, iprint=iprint, &
yhat=yhat, se=se)
end
Example 2a: no cross-validation, request 7 components.
Standard PLS Coefficients
1 -5.459
2 1.668
3 0.625
4 1.430
5 -2.550
6 4.862
7 4.859
PLS Coeff
1 -20.04
2 4.63
3 1.42
4 0.09
5 -7.27
6 20.90
7 0.46
Predicted Y
1 9.38
2 7.30
3 8.09
4 12.02
5 8.79
6 6.76
7 7.24
8 10.44
9 15.79
10 14.35
11 8.41
12 9.95
13 11.52
14 8.64
15 8.23
16 8.40
17 11.12
18 8.97
19 12.39
Std. Errors
1 3.599
2 2.418
3 0.812
4 3.214
5 1.641
6 3.326
7 3.529
Corrected Std. Errors
1 13.21
2 6.71
3 1.84
4 0.20
5 4.68
6 14.30
7 0.33
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 39.7
2 42.8
3 58.3
4 65.1
5 68.2
6 75.1
7 75.5
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.1
2 96.3
3 97.4
4 97.9
5 98.2
6 98.3
7 98.4
Example 2b: cross-validation
Cross-validated results for response 1:
Comp PRESS
1 0.669
2 0.675
3 0.885
4 1.042
5 2.249
6 1.579
7 1.194
The best model has 1 component(s).
Standard PLS Coefficients
1 0.1486
2 0.1639
3 -0.1492
4 0.0617
5 0.0669
6 0.1150
7 0.0691
PLS Coeff
1 0.5453
2 0.4546
3 -0.3384
4 0.0039
5 0.1907
6 0.4942
7 0.0065
Predicted Y
1 9.18
2 7.97
3 7.55
4 10.48
5 8.75
6 7.89
7 8.44
8 9.47
9 11.76
10 11.80
11 7.37
12 11.14
13 12.65
14 9.96
15 8.61
16 9.27
17 13.47
18 11.33
19 10.71
Std. Errors
1 0.06312
2 0.07055
3 0.06272
4 0.03911
5 0.03364
6 0.06386
7 0.03955
Corrected Std. Errors
1 0.2317
2 0.1957
3 0.1423
4 0.0025
5 0.0959
6 0.2745
7 0.0037
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 39.7
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.1
PHONE: 713.784.3131 FAX:713.781.9260 |