Chapter 2: Regression

RSTEP

Builds multiple linear regression models using forward selection, backward selection, or stepwise selection.

Required Arguments

COV — NVAR by NVAR matrix containing the variance-covariance matrix or sum of squares and crossproducts matrix.   (Input)
Only the upper triangle of COV is referenced.

NOBS — Number of observations.   (Input)

AOV — Vector of length 13 containing statistics relating to the analysis of variance for the final model in this invocation.   (Output)

I        AOV(I)

1        Degrees of freedom for regression

2        Degrees of freedom for error

3        Total degrees of freedom

4        Sum of squares for regression

5        Sum of squares for error

6        Total sum of squares

7        Regression mean square

8        Error mean square

9        F-statistic

10       p-value

11       R-squared (in percent)

12       Adjusted R-squared (in percent)

13       Estimated standard deviation of the model error

14       Mean of the response (dependent) variable

15       Coefficient of variation (in percent)
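The entries of AOV are tied together by the usual analysis of variance identities. The sketch below fills the derived entries from the degrees of freedom and sums of squares. It uses textbook formulas, not code taken from RSTEP; the module and routine names (aov_relations, fill_aov) are hypothetical, and the p-value in AOV(10) is left untouched because it needs an F distribution function. The input numbers are the rounded values printed for step 1 of Example 2, so the results agree with that printed output only up to rounding.

```fortran
module aov_relations
  implicit none
contains
  ! Fill derived AOV entries from AOV(1), AOV(2), AOV(4) and AOV(5).
  ! Textbook identities only; this is an illustration, not RSTEP itself.
  subroutine fill_aov(aov)
    real, intent(inout) :: aov(13)
    aov(3)  = aov(1) + aov(2)                  ! total degrees of freedom
    aov(6)  = aov(4) + aov(5)                  ! total sum of squares
    aov(7)  = aov(4) / aov(1)                  ! regression mean square
    aov(8)  = aov(5) / aov(2)                  ! error mean square
    aov(9)  = aov(7) / aov(8)                  ! F-statistic
    aov(11) = 100.0 * aov(4) / aov(6)          ! R-squared (percent)
    ! Adjusted R-squared (percent): compares error and total mean squares
    aov(12) = 100.0 * (1.0 - aov(8) / (aov(6) / aov(3)))
    aov(13) = sqrt(aov(8))                     ! est. std. dev. of model error
  end subroutine fill_aov
end module aov_relations

program aov_demo
  use aov_relations
  implicit none
  real :: aov(13)

  aov = 0.0
  aov(1) = 1.0       ! regression degrees of freedom
  aov(2) = 11.0      ! error degrees of freedom
  aov(4) = 1831.9    ! regression sum of squares (rounded, from the output)
  aov(5) = 883.9     ! error sum of squares (rounded, from the output)
  call fill_aov(aov)

  print '(a, f10.3)', 'F-statistic        ', aov(9)
  print '(a, f10.3)', 'R-squared (%)      ', aov(11)
  print '(a, f10.3)', 'Adj. R-squared (%) ', aov(12)
  print '(a, f10.3)', 'Est. std. dev.     ', aov(13)
end program aov_demo
```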

COEF — NVAR - 1 by 5 matrix containing statistics relating to the regression coefficients for the final model in this invocation.   (Output)
The rows correspond to the NVAR - 1 variables with LEVEL(I) nonnegative, i.e., all variables but the dependent variable. The rows are in the same order as the variables in COV except that the dependent variable is excluded. Each row corresponding to a variable not in the model contains the statistics that would apply if that variable were added to the model.

Col.     Description

1        Coefficient estimate

2        Estimated standard error of the coefficient estimate

3        t-statistic for the test that the coefficient is zero

4        p-value for the two-sided t test

5        Variance inflation factor. The square of the multiple correlation coefficient for the I-th regressor regressed on all the others can be obtained from COEF(I, 5) by the formula 1.0 - 1.0/COEF(I, 5).
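As a small worked illustration of the formula for column 5, the sketch below converts variance inflation factors into squared multiple correlation coefficients. The names vif_tools and r2_from_vif are hypothetical; the input values are the variance inflation factors printed at step 0 of Example 1. A factor of 282.5, for instance, corresponds to a squared multiple correlation of roughly 0.9965.

```fortran
module vif_tools
  implicit none
contains
  ! Squared multiple correlation of a regressor on the other regressors,
  ! recovered from its variance inflation factor: 1.0 - 1.0/VIF.
  elemental function r2_from_vif(vif) result(r2)
    real, intent(in) :: vif
    real :: r2
    r2 = 1.0 - 1.0 / vif
  end function r2_from_vif
end module vif_tools

program vif_demo
  use vif_tools
  implicit none
  ! Column 5 of COEF as printed at step 0 of Example 1
  real :: vif(4) = [38.5, 254.4, 46.9, 282.5]

  print '(4f10.5)', r2_from_vif(vif)
end program vif_demo
```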

COVS — NVAR by NVAR matrix that results after COV has been swept on the columns corresponding to the variables in the model.   (Output, if INVOKE = 0 or 1; input/output, if INVOKE = 2 or 3)
The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of COVS corresponding to the independent variables in the final model and multiplying the elements of this matrix by AOV(8). If COV is not needed, COV and COVS can occupy the same storage locations.

Optional Arguments

INVOKE — Invocation option.   (Input)
Default: INVOKE = 0.

INVOKE   Action

0        This is the only invocation of RSTEP for this variance-covariance matrix. Initialization, stepping, and wrap-up computations are performed.

1        This is the first invocation of RSTEP, and additional calls to RSTEP will be made. Initialization and stepping are performed.

2        This is an intermediate invocation of RSTEP and stepping is performed.

3        This is the final invocation of RSTEP and stepping is performed.

NVAR — Number of variables.   (Input)
Default: NVAR = size (COV,2).

LDCOV — Leading dimension of COV exactly as specified in the dimension statement in the calling program.   (Input)
Default: LDCOV = size (COV,1).

LEVEL — Vector of length NVAR containing levels of priority for variables entering and leaving the regression.   (Input)
LEVEL(I) = -1 means the I-th variable is the dependent variable. LEVEL(I) = 0 means the I-th variable is never to enter into the model. Other variables must be assigned a positive value to indicate their level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can leave the model only after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step.

NFORCE — Variables with levels 1, 2, …, NFORCE are forced into the model as the independent variables.   (Input)
Default: NFORCE = 0.

NSTEP — Step length option.   (Input)
For nonnegative NSTEP, NSTEP steps are taken. NSTEP = -1 means stepping continues until completion.
Default: NSTEP = -1.

ISTEP — Stepping option.   (Input)
Default: ISTEP = -1.

ISTEP    Action

-1       An attempt is made to remove a variable from the model (backward step). A variable is removed if its p-value exceeds POUT. During initialization, all candidate independent variables enter the model.

1        An attempt is made to add a variable to the model (forward step). A variable is added if its p-value is less than PIN. During initialization, only the forced variables enter the model.

0        A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization.

PIN — Largest p-value for entering variables.   (Input)
Variables with p-values less than PIN may enter the model. A common choice is PIN = 0.05.
Default: PIN = .05.

POUT — Smallest p-value for removing variables.   (Input)
Variables with p-values greater than POUT may leave the model. POUT must be greater than or equal to PIN. A common choice is POUT = 0.10 (or 2 * PIN).
Default: POUT = .10.
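The roles of PIN and POUT can be sketched as two small selection functions operating on the p-values in column 4 of COEF. This is an illustration under stated assumptions, not RSTEP's implementation: it assumes that among competing variables the one with the largest p-value is the removal candidate and the one with the smallest p-value is the entry candidate, and it ignores LEVEL and NFORCE. The names step_choice, pick_removal, and pick_entry are hypothetical.

```fortran
module step_choice
  implicit none
contains
  ! Backward step: index of the in-model variable with the largest
  ! p-value exceeding POUT, or 0 if no variable qualifies.
  integer function pick_removal(pval, in_model, pout) result(idx)
    real, intent(in)    :: pval(:)
    logical, intent(in) :: in_model(:)
    real, intent(in)    :: pout
    integer :: i
    real    :: worst
    idx = 0
    worst = pout
    do i = 1, size(pval)
      if (in_model(i) .and. pval(i) > worst) then
        worst = pval(i)
        idx = i
      end if
    end do
  end function pick_removal

  ! Forward step: index of the out-of-model variable with the smallest
  ! p-value below PIN, or 0 if no variable qualifies.
  integer function pick_entry(pval, in_model, pin) result(idx)
    real, intent(in)    :: pval(:)
    logical, intent(in) :: in_model(:)
    real, intent(in)    :: pin
    integer :: i
    real    :: best
    idx = 0
    best = pin
    do i = 1, size(pval)
      if ((.not. in_model(i)) .and. pval(i) < best) then
        best = pval(i)
        idx = i
      end if
    end do
  end function pick_entry
end module step_choice

program step_demo
  use step_choice
  implicit none
  ! p-values printed at step 0 of Example 1 (all four variables in the model)
  real    :: p_back(4) = [0.0709, 0.5012, 0.8963, 0.8437]
  logical :: all_in(4) = .true.
  ! p-values printed after step 2 of Example 2 (variables 1 and 4 in the model)
  real    :: p_fwd(4)  = [0.0000, 0.0517, 0.0697, 0.0000]
  logical :: in_fwd(4) = [.true., .false., .false., .true.]

  print *, 'Backward candidate:', pick_removal(p_back, all_in, 0.10)
  print *, 'Forward candidate: ', pick_entry(p_fwd, in_fwd, 0.05)
end program step_demo
```

With the default thresholds, the first call selects variable 3 for removal, as in step 1 of Example 1, and the second returns 0 because neither remaining p-value is below PIN, which is why Example 2 stops after two steps.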

TOL — Tolerance used in determining linear dependence.   (Input)
TOL = 100 * AMACH(4) is a common choice. See documentation for AMACH in the Reference Material.
Default: TOL = 1.e-5 for single precision and 2.d-14 for double precision.

IPRINT — Printing option.   (Input)
Default: IPRINT = 0.

IPRINT   Action

0        No printing is performed.

1        Printing is performed on the final invocation.

2        Printing is performed after each step and on the final invocation.

SCALE — Vector of length NVAR containing the initial diagonal entries in COV.   (Output, if INVOKE = 0 or 1; input, if INVOKE = 2 or 3)

HIST — Vector of length NVAR containing the recent history of variables.   (Output, if INVOKE = 0 or 1; input/output, otherwise)

HIST(I)                  Meaning

k > 0   I-th variable was added to the model during the k-th step.

k < 0   I-th variable was deleted from the model during the k-th step.

0          I-th variable has never been in the model.

0.5       I-th variable was added into the model during initialization.

IEND — Completion indicator.   (Output)

IEND   Meaning

0          Additional steps may be possible.

1          No additional steps are possible.

LDCOEF — Leading dimension of COEF exactly as specified in the dimension statement in the calling program.   (Input)
Default: LDCOEF = size (COEF,1).

LDCOVS — Leading dimension of COVS exactly as specified in the dimension statement in the calling program.   (Input)
Default: LDCOVS = size (COVS,1).

FORTRAN 90 Interface

Generic: CALL RSTEP (COV, NOBS, AOV, COEF, COVS [,…])

Specific: The specific interface names are S_RSTEP and D_RSTEP.

FORTRAN 77 Interface

Single: CALL RSTEP (INVOKE, NVAR, COV, LDCOV, LEVEL, NFORCE, NSTEP, ISTEP, NOBS, PIN, POUT, TOL, IPRINT, SCALE, HIST, IEND, AOV, COEF, LDCOEF, COVS, LDCOVS)

Double: The double precision name is DRSTEP.

Description

Routine RSTEP builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. The routine RSTEP is designed so that the user can monitor, and perhaps change, the variables added (deleted) to (from) the model after each step. In this case, multiple calls to RSTEP (with INVOKE = 1, 2, 2, ..., 3) are made. Alternatively, RSTEP can be invoked once (with INVOKE = 0) in order to perform the stepping until a final model is selected.
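The multiple-call pattern might look like the sketch below, which takes one forward step per call on the data set of Example 1 so that the model can be inspected between steps. It is an untested illustration that requires the IMSL library. It assumes that the optional arguments can be passed by keyword under the names listed above, that SCALE, HIST, and COVS are simply carried from one call to the next, and that a final call with INVOKE = 3 and NSTEP = 0 performs the wrap-up without further stepping.

```fortran
      USE RSTEP_INT

      IMPLICIT   NONE
      INTEGER    NVAR
      PARAMETER  (NVAR=5)
!
      INTEGER    IEND, NOBS
      REAL       AOV(13), COEF(NVAR,5), COV(NVAR,NVAR), &
                 COVS(NVAR,NVAR), HIST(NVAR), SCALE(NVAR)
!                                 Corrected SSCP matrix from Example 1
      DATA COV/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
!
      NOBS = 13
!                                 First call: initialize, one forward step
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, INVOKE=1, NSTEP=1, &
                  ISTEP=1, SCALE=SCALE, HIST=HIST, IEND=IEND)
!                                 Intermediate calls: one step at a time
      DO WHILE (IEND .EQ. 0)
!                                 Inspect (or act on) the current model
         PRINT *, 'R-squared (percent) so far: ', AOV(11)
         CALL RSTEP (COV, NOBS, AOV, COEF, COVS, INVOKE=2, NSTEP=1, &
                     ISTEP=1, SCALE=SCALE, HIST=HIST, IEND=IEND)
      END DO
!                                 Final call: no further steps requested
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, INVOKE=3, NSTEP=0, &
                  ISTEP=1, IPRINT=1, SCALE=SCALE, HIST=HIST, IEND=IEND)
!
      END
```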

Levels of priority can be assigned to the candidate independent variables. All variables with a priority level of 1 must enter the model before any variable with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc.

Variables can also be forced into the model. If equal levels of priority are to be assumed, the levels of priority can all be set to 1.

Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum of squares and crossproducts matrix for the independent and dependent variables corrected for the mean is input for COV. Routine CORVC in Chapter 3, “Correlation” can be used to compute the corrected sum of squares and crossproducts. Routine RORDM in Chapter 19, “Utilities,” can be used to reorder this matrix, if required. Other possibilities are

1.         The intercept is not in the model. A raw (uncorrected) sum of squares and crossproducts matrix for the independent and dependent variables is required for COV. NOBS must be set to one greater than the number of observations. IMSL routine MXTXF (IMSL MATH/LIBRARY) can be used to compute the raw sum of squares and crossproducts matrix.

2.         An intercept is to be a candidate variable. A raw (uncorrected) sum of squares and crossproducts matrix for the constant regressor (= 1), independent and dependent variables is required for COV. In this case, COV contains one additional row and column corresponding to the constant regressor. This row/column contains the sum of squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in COV are the same as in the previous case. NOBS must be set to one greater than the number of observations.

The stepwise regression algorithm is due to Efroymson (1960). Routine RSTEP uses sweeps of COV to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed by Goodnight (1979) is used. A description of the stepwise algorithm is given also by Kennedy and Gentle (1980, pages 335-340). The advantage of stepwise model building over all possible regressions (see routine RBEST) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R-squared) for any subset size of independent variables.
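To make the role of the sweep concrete, here is a minimal sketch of a textbook sweep on one pivot, applied to the 2 by 2 block of the Example 1 matrix formed by variable 4 and the dependent variable. It is not RSTEP's internal code, and RSTEP's sign and storage conventions (for example, referencing only the upper triangle) may differ. The names sweep_op and sweep are hypothetical. After sweeping on variable 4, the off-diagonal entry in the pivot row holds its coefficient estimate and the remaining diagonal entry holds the error sum of squares.

```fortran
module sweep_op
  implicit none
contains
  ! Sweep the square matrix A on pivot K (textbook form, full storage).
  subroutine sweep(a, k)
    real, intent(inout) :: a(:,:)
    integer, intent(in) :: k
    real    :: d, b
    integer :: i

    d = a(k,k)
    a(k,:) = a(k,:) / d               ! scale the pivot row
    do i = 1, size(a, 1)
      if (i /= k) then
        b = a(i,k)
        a(i,:) = a(i,:) - b * a(k,:)  ! eliminate column K from row I
        a(i,k) = -b / d               ! store the swept column entry
      end if
    end do
    a(k,k) = 1.0 / d                  ! pivot becomes its reciprocal
  end subroutine sweep
end module sweep_op

program sweep_demo
  use sweep_op
  implicit none
  real :: a(2,2)

  ! Variable 4 and the dependent variable from the Example 1 matrix
  a = reshape([3362.00, -2481.70, -2481.70, 2715.76], [2, 2])
  call sweep(a, 1)

  print '(a, f10.4)', 'Coefficient of variable 4: ', a(1,2)
  print '(a, f10.2)', 'Error sum of squares:      ', a(2,2)
end program sweep_demo
```

The two printed values should be close to the coefficient estimate (-0.738) and error sum of squares (883.9) reported at step 1 of Example 2, where variable 4 is the only variable in the model.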

Comments

1.         Workspace may be explicitly provided, if desired, by use of R2TEP/DR2TEP. The reference is:

CALL R2TEP (INVOKE, NVAR, COV, LDCOV, LEVEL, NFORCE, NSTEP, ISTEP, NOBS, PIN, POUT, TOL, IPRINT, SCALE, HIST, IEND, AOV, COEF, LDCOEF, COVS, LDCOVS, SWEPT, IWK)

The additional arguments are as follows:

SWEPT — Work vector of length NVAR with information to indicate the independent variables in the model.   (Output)
SWEPT(I) = 1.0 indicates that independent variable I is in the model. Otherwise, SWEPT(I) = -1.0. Routine RSUBM can be called with the arguments COVS and SWEPT to obtain the part of COVS pertaining to the current model.

IWK — Integer work vector of length 2 * NVAR.

2.         Informational errors

Type   Code   Description

3      1      Based on TOL, there are linear dependencies among the variables to be forced.

4      2      No variables entered the model. Elements of AOV are set to NaN.

Example 1

Both examples use a data set from Draper and Smith (1981, pages 629-630). A corrected sum of squares and crossproducts matrix for this data is given in the DATA statement and can be computed using routine CORVC in Chapter 3, “Correlation”. The first four columns are for the independent variables and the last column is for the dependent variable. Here, RSTEP is invoked using the backward stepping option.

 

      USE RSTEP_INT

      IMPLICIT   NONE
      INTEGER    LDCOEF, LDCOV, LDCOVS, NVAR
      PARAMETER  (NVAR=5, LDCOEF=NVAR, LDCOV=NVAR, LDCOVS=NVAR)
!
      INTEGER    IEND, IPRINT, LEVEL(NVAR), NOBS
      REAL       AOV(13), COEF(LDCOEF,5), COV(LDCOV,NVAR), &
                 COVS(LDCOVS,NVAR), HIST(NVAR), SCALE(NVAR)
!                                 Corrected sum of squares and
!                                 crossproducts matrix
      DATA COV/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
      DATA LEVEL/4*1, -1/
!
      NOBS   = 13
!                                 Print after each step; ISTEP takes its
!                                 default of -1 (backward stepping)
      IPRINT = 2
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, IPRINT=IPRINT)
!
      END

Output

 

BACKWARD ELIMINATION
STEP 0:  4 variable(s) entered.

Dependent  R-squared   Adjusted  Est. Std. Dev.
Variable   (percent)  R-squared  of Model Error
       5      98.238     97.356           2.446

              * * * Analysis of Variance * * *
                       Sum of        Mean             Prob. of
Source           DF     Squares      Square  Overall F  Larger F
Regression        4      2667.9       667.0    111.480    0.0000
Error             8        47.9         6.0
Total            12      2715.8

              * * * Inference on Coefficients * * *
               (Conditional on the Selected Model)
               Coef.    Standard                Prob. of    Variance
Variable    Estimate       Error  t-statistic   Larger t   Inflation
       1       1.551      0.7448        2.082     0.0709        38.5
       2       0.510      0.7238        0.704     0.5012       254.4
       3       0.102      0.7547        0.135     0.8963        46.9
       4      -0.144      0.7091       -0.204     0.8437       282.5

STEP 1 :  Variable 3 removed.
Dependent  R-squared   Adjusted  Est. Std. Dev.
Variable   (percent)  R-squared  of Model Error
       5      98.234     97.645           2.309

                 * * * Analysis of Variance * * *
                         Sum of        Mean             Prob. of
Source           DF     Squares      Square  Overall F  Larger F
Regression        3      2667.8       889.3    166.835    0.0000
Error             9        48.0         5.3
Total            12      2715.8

                  * * * Inference on Coefficients * * *
                   (Conditional on the Selected Model)
               Coef.    Standard                Prob. of    Variance
Variable    Estimate       Error  t-statistic   Larger t   Inflation
       1       1.452      0.1170       12.410     0.0000        1.07
       2       0.416      0.1856        2.242     0.0517       18.78
       4      -0.237      0.1733       -1.365     0.2054       18.94

          * * * Statistics for Variables Not in the Model * * *
               Coef.    Standard  t-statistic   Prob. of    Variance
Variable    Estimate       Error     to enter   Larger t   Inflation
       3       0.102      0.7547        0.135     0.8963       46.87

STEP 2 :  Variable 4 removed.

Dependent  R-squared   Adjusted  Est. Std. Dev.
Variable   (percent)  R-squared  of Model Error
       5      97.868     97.441           2.406

                 * * * Analysis of Variance * * *
                         Sum of        Mean             Prob. of
Source           DF     Squares      Square  Overall F  Larger F
Regression        2      2657.9      1328.9    229.502    0.0000
Error            10        57.9         5.8
Total            12      2715.8

                  * * * Inference on Coefficients * * *
                   (Conditional on the Selected Model)
               Coef.    Standard                Prob. of    Variance
Variable    Estimate       Error  t-statistic   Larger t   Inflation
       1       1.468      0.1213       12.105     0.0000        1.06
       2       0.662      0.0459       14.442     0.0000        1.06

          * * * Statistics for Variables Not in the Model * * *
               Coef.    Standard  t-statistic   Prob. of    Variance
Variable    Estimate       Error     to enter   Larger t   Inflation
       3       0.250      0.1847        1.354     0.2089        3.14
       4      -0.237      0.1733       -1.365     0.2054       18.94

* * * Backward Elimination Summary * * *
        Variable    Step Removed
               3             1
               4             2

Additional Examples

Example 2

This example uses the data set in Example 1. Here, RSTEP is invoked using the forward selection option (ISTEP = 1).

 

      USE RSTEP_INT

      IMPLICIT   NONE
      INTEGER    LDCOEF, LDCOV, LDCOVS, NVAR
      PARAMETER  (NVAR=5, LDCOEF=NVAR, LDCOV=NVAR, LDCOVS=NVAR)
!
      INTEGER    IEND, IPRINT, ISTEP, LEVEL(NVAR), NOBS
      REAL       AOV(13), COEF(LDCOEF,5), COV(LDCOV,NVAR), &
                 COVS(LDCOVS,NVAR), HIST(NVAR), SCALE(NVAR)
!                                 Corrected sum of squares and
!                                 crossproducts matrix
      DATA COV/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
      DATA LEVEL/4*1, -1/
!                                 Forward steps only
      ISTEP  = 1
      NOBS   = 13
!                                 Print after each step
      IPRINT = 2
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, ISTEP=ISTEP, IPRINT=IPRINT)
!
      END

Output

 

FORWARD SELECTION
STEP 0:  No variables entered.

          * * * Statistics for Variables Not in the Model * * *
                  Coef.    Standard  t-statistic   Prob. of    Variance
   Variable    Estimate       Error     to enter   Larger t   Inflation
          1       1.869      0.5264        3.550     0.0046           1
          2       0.789      0.1684        4.686     0.0007           1
          3      -1.256      0.5984       -2.098     0.0598           1
          4      -0.738      0.1546       -4.775     0.0006           1

STEP 1 :  Variable 4 entered.

Dependent  R-squared   Adjusted  Est. Std. Dev.
Variable   (percent)  R-squared  of Model Error
       5      67.454     64.496           8.964

                 * * * Analysis of Variance * * *
                         Sum of        Mean             Prob. of
Source           DF     Squares      Square  Overall F  Larger F
Regression        1      1831.9      1831.9     22.799    0.0006
Error            11       883.9        80.4
Total            12      2715.8

                  * * * Inference on Coefficients * * *
                   (Conditional on the Selected Model)
               Coef.    Standard                Prob. of    Variance
Variable    Estimate       Error  t-statistic   Larger t   Inflation
       4      -0.738      0.1546       -4.775     0.0006        1.00

          * * * Statistics for Variables Not in the Model * * *
               Coef.    Standard  t-statistic   Prob. of    Variance
Variable    Estimate       Error     to enter   Larger t   Inflation
       1       1.440      0.1384       10.403     0.0000        1.06
       2       0.311      0.7486        0.415     0.6867       18.74
       3      -1.200      0.1890       -6.348     0.0001        1.00

    STEP 2 :  Variable 1 entered.

Dependent  R-squared   Adjusted  Est. Std. Dev.
Variable   (percent)  R-squared  of Model Error
       5      97.247     96.697           2.734

                 * * * Analysis of Variance * * *
                         Sum of        Mean             Prob. of
Source           DF     Squares      Square  Overall F  Larger F
Regression        2      2641.0      1320.5    176.636    0.0000
Error            10        74.8         7.5
Total            12      2715.8

                  * * * Inference on Coefficients * * *
                   (Conditional on the Selected Model)
               Coef.    Standard                Prob. of    Variance
Variable    Estimate       Error  t-statistic   Larger t   Inflation
       1       1.440      0.1384       10.403     0.0000        1.06
       4      -0.614      0.0486      -12.622     0.0000        1.06

          * * * Statistics for Variables Not in the Model * * *
               Coef.    Standard  t-statistic   Prob. of    Variance
Variable    Estimate       Error     to enter   Larger t   Inflation
       2       0.416      0.1856        2.242     0.0517       18.78
       3      -0.410      0.1992       -2.058     0.0697        3.46

* * * Forward Selection Summary * * *
       Variable    Step Entered
              1             2
              4             1

Example 3

For an extended version of Example 2 that in addition computes the intercept and standard error for the final model from RSTEP, see “Example 2” for routine RSUBM.


