Nested logit with corrections for endogenous sampling

The sample is said to be endogenous if the probability that an individual is included in the sample depends on the choice that this individual has made. In that case, the exogenous sample maximum likelihood (ESML) estimator is no longer appropriate, and corrections must be made. See Bierlaire, Bolduc and McFadden (2008).

This is illustrated in this example.

author:: Michel Bierlaire, EPFL
date:: Sun Apr 9 18:25:03 2023

import numpy as np
import biogeme.biogeme_logging as blog
import biogeme.biogeme as bio
from biogeme import models
from biogeme.expressions import Beta
from biogeme.nests import OneNestForNestedLogit, NestsForNestedLogit

See the data processing script: Data preparation for Swissmetro.

from swissmetro_data import (
    database,
    CHOICE,
    SM_AV,
    CAR_AV_SP,
    TRAIN_AV_SP,
    TRAIN_TT_SCALED,
    TRAIN_COST_SCALED,
    SM_TT_SCALED,
    SM_COST_SCALED,
    CAR_TT_SCALED,
    CAR_CO_SCALED,
)

logger = blog.get_screen_logger(level=blog.INFO)
logger.info('Example b14nested_endogenous_sampling.py')

Example b14nested_endogenous_sampling.py

Parameters to be estimated.

ASC_CAR = Beta('ASC_CAR', 0, None, None, 0)
ASC_TRAIN = Beta('ASC_TRAIN', 0, None, None, 0)
ASC_SM = Beta('ASC_SM', 0, None, None, 1)
B_TIME = Beta('B_TIME', 0, None, None, 0)
B_COST = Beta('B_COST', 0, None, None, 0)
MU = Beta('MU', 1, 1, 10, 0)

In this example, we assume that the three modes exist, and that the sampling protocol is choice-based. The probability that a respondent who chose alternative i belongs to the sample is R_i.

R_TRAIN = 4.42e-2
R_SM = 3.36e-3
R_CAR = 7.5e-3

The correction terms are the natural logarithms of these quantities.

correction = {1: np.log(R_TRAIN), 2: np.log(R_SM), 3: np.log(R_CAR)}
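To see why the logarithms of the sampling rates are the natural correction, it helps to look at a plain logit model first. The sketch below is illustrative only and is not part of the Biogeme script: the utilities in `V_demo` are made-up numbers and `softmax` is a hypothetical helper. It checks that the probability of a choice among sampled individuals, R_i P_i / sum_j R_j P_j, coincides with a logit whose utilities are shifted by ln R_i.

```python
import math


def softmax(values):
    """Logit probabilities from a dict {alternative: utility}."""
    largest = max(values.values())  # subtract the max for numerical stability
    exps = {i: math.exp(v - largest) for i, v in values.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Sampling rates from the example; the utilities are arbitrary numbers.
R = {1: 4.42e-2, 2: 3.36e-3, 3: 7.5e-3}
V_demo = {1: -0.5, 2: 0.3, 3: 0.1}

# Choice probabilities in the population.
P = softmax(V_demo)

# Bayes' rule: probability of each choice among sampled individuals.
norm = sum(R[j] * P[j] for j in P)
P_sample = {i: R[i] * P[i] / norm for i in P}

# Same quantity, obtained by adding ln(R_i) to each utility.
P_shifted = softmax({i: V_demo[i] + math.log(R[i]) for i in V_demo})

for i in P:
    print(i, round(P[i], 4), round(P_sample[i], 4), round(P_shifted[i], 4))
```

The last two columns printed are identical, while the first differs: train is heavily over-represented in the sample because its sampling rate is the largest.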

Definition of the utility functions.

V1 = ASC_TRAIN + B_TIME * TRAIN_TT_SCALED + B_COST * TRAIN_COST_SCALED
V2 = ASC_SM + B_TIME * SM_TT_SCALED + B_COST * SM_COST_SCALED
V3 = ASC_CAR + B_TIME * CAR_TT_SCALED + B_COST * CAR_CO_SCALED

Associate utility functions with the numbering of alternatives.

V = {1: V1, 2: V2, 3: V3}

Associate the availability conditions with the alternatives.

av = {1: TRAIN_AV_SP, 2: SM_AV, 3: CAR_AV_SP}

Definition of nests. Only the non-trivial nests must be defined. A trivial nest is a nest containing exactly one alternative. In this example, we create a nest for the existing modes, that is train (1) and car (3).

existing = OneNestForNestedLogit(
    nest_param=MU, list_of_alternatives=[1, 3], name='existing'
)

nests = NestsForNestedLogit(choice_set=list(V), tuple_of_nests=(existing,))

The following elements do not appear in any nest and are assumed each to be alone in a separate nest: {2}. If it is not the intention, check the assignment of alternatives to nests.
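For readers who want to see what such a nest structure implies numerically, here is a minimal sketch of the textbook nested logit probabilities, written with the standard library only. It is not Biogeme code: `nested_logit_probs`, the utilities in `V_demo`, and the value 1.6 for the nest parameter are assumptions made for illustration, and availability conditions are ignored. The top-level scale is normalized to 1, and alternative 2 is treated as a trivial nest with parameter 1.

```python
import math


def nested_logit_probs(V, nests):
    """Nested logit probabilities, top-level scale normalized to 1.

    V: dict {alternative: utility}
    nests: list of (mu_m, [alternatives]); each alternative is in one nest.
    """
    # Logsum of each nest: (1 / mu_m) * ln sum_j exp(mu_m * V_j)
    logsums = [
        math.log(sum(math.exp(mu * V[j]) for j in alts)) / mu
        for mu, alts in nests
    ]
    denom = sum(math.exp(s) for s in logsums)
    probs = {}
    for (mu, alts), s in zip(nests, logsums):
        p_nest = math.exp(s) / denom  # probability of choosing the nest
        within = sum(math.exp(mu * V[j]) for j in alts)
        for i in alts:
            # P(i) = P(nest) * P(i | nest)
            probs[i] = p_nest * math.exp(mu * V[i]) / within
    return probs


V_demo = {1: -0.5, 2: 0.3, 3: 0.1}

# Nest 'existing' = {1, 3} with an illustrative mu; alternative 2 alone.
p_nested = nested_logit_probs(V_demo, [(1.6, [1, 3]), (1.0, [2])])

# With mu = 1 in every nest, the model collapses to multinomial logit.
p_logit = nested_logit_probs(V_demo, [(1.0, [1, 3]), (1.0, [2])])

print(p_nested)
print(p_logit)
```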

The choice model is a nested logit, with corrections for endogenous sampling. We first obtain the expression of the Gi function for nested logit.

Gi = models.get_mev_for_nested(V, av, nests)

Then we calculate the MEV log probability, accounting for the correction.

logprob = models.logmev_endogenous_sampling(V, Gi, av, correction, CHOICE)
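The following sketch shows, under stated assumptions, the kind of quantity this step computes: a logit over the terms V_i + ln G_i + ln R_i. It is a plain-Python illustration of the formula, not Biogeme's implementation. The helpers `log_gi_nested` and `corrected_log_prob`, the utilities in `V_demo`, and the nest parameter 1.6 are all hypothetical, and availability conditions are ignored.

```python
import math


def log_gi_nested(V, nests):
    """ln G_i for a nested logit with top-level scale 1.

    ln G_i = (mu_m - 1) * V_i + (1 / mu_m - 1) * ln sum_{j in m} exp(mu_m * V_j)
    """
    out = {}
    for mu, alts in nests:
        log_sum = math.log(sum(math.exp(mu * V[j]) for j in alts))
        for i in alts:
            out[i] = (mu - 1.0) * V[i] + (1.0 / mu - 1.0) * log_sum
    return out


def corrected_log_prob(V, log_gi, correction, chosen):
    """Log of a logit over V_i + ln G_i + correction_i."""
    h = {i: V[i] + log_gi[i] + correction[i] for i in V}
    largest = max(h.values())  # log-sum-exp with the max subtracted
    log_denom = largest + math.log(sum(math.exp(x - largest) for x in h.values()))
    return h[chosen] - log_denom


R = {1: 4.42e-2, 2: 3.36e-3, 3: 7.5e-3}  # sampling rates from the example
V_demo = {1: -0.5, 2: 0.3, 3: 0.1}  # arbitrary illustrative utilities
log_gi = log_gi_nested(V_demo, [(1.6, [1, 3]), (1.0, [2])])

no_corr = {i: 0.0 for i in V_demo}
log_r = {i: math.log(R[i]) for i in V_demo}

# Population probabilities (no correction) and in-sample probabilities.
p_plain = {i: math.exp(corrected_log_prob(V_demo, log_gi, no_corr, i)) for i in V_demo}
p_corr = {i: math.exp(corrected_log_prob(V_demo, log_gi, log_r, i)) for i in V_demo}

print(p_plain)
print(p_corr)
```

With the sampling rates known, the corrected probabilities equal R_i P_i / sum_j R_j P_j, that is, the probability of each choice conditional on being in the sample.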

Create the Biogeme object.

the_biogeme = bio.BIOGEME(database, logprob)
the_biogeme.modelName = 'b14nested_endogenous_eampling'

Biogeme parameters read from biogeme.toml.

Estimate the parameters.

results = the_biogeme.estimate()

As the model is not too complex, we activate the calculation of second derivatives. If you want to change it, change the name of the algorithm in the TOML file from "automatic" to "simple_bounds"
*** Initial values of the parameters are obtained from the file __b14nested_endogenous_eampling.iter
Cannot read file __b14nested_endogenous_eampling.iter. Statement is ignored.
As the model is not too complex, we activate the calculation of second derivatives. If you want to change it, change the name of the algorithm in the TOML file from "automatic" to "simple_bounds"
Optimization algorithm: hybrid Newton/BFGS with simple bounds [simple_bounds]
** Optimization: Newton with trust region for simple bounds
Iter.         ASC_CAR       ASC_TRAIN          B_COST          B_TIME              MU     Function    Relgrad   Radius      Rho
    0               1              -1            0.47              -1               2      8.1e+03       0.15        1     0.56    +
    1               0            -1.1          0.0036              -2             1.6      6.1e+03      0.068       10      1.1   ++
    2            -1.5              -3           -0.95           -0.12             2.2      5.4e+03      0.036       10     0.71    +
    3            -1.4            -2.9           -0.91           -0.46             1.9      5.3e+03      0.015    1e+02      1.3   ++
    4            -1.2            -2.8           -0.98           -0.86             1.7      5.2e+03     0.0048    1e+03      1.1   ++
    5            -1.1            -2.8              -1           -0.97             1.6      5.2e+03    0.00035    1e+04        1   ++
    6            -1.1            -2.8              -1           -0.97             1.6      5.2e+03      2e-06    1e+04        1   ++
Results saved in file b14nested_endogenous_eampling.html
Results saved in file b14nested_endogenous_eampling.pickle

print(results.short_summary())

Results for model b14nested_endogenous_eampling
Nbr of parameters:              5
Sample size:                    6768
Excluded data:                  3960
Final log likelihood:           -5202.916
Akaike Information Criterion:   10415.83
Bayesian Information Criterion: 10449.93
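The two information criteria in this summary can be recomputed from the other reported figures using the usual definitions, AIC = 2K - 2LL and BIC = K ln(N) - 2LL. The snippet below is only a consistency check on the numbers printed above; the variable names are chosen here for illustration.

```python
import math

# Values copied from the summary above.
n_params = 5
sample_size = 6768
final_loglike = -5202.916

aic = 2 * n_params - 2 * final_loglike
bic = n_params * math.log(sample_size) - 2 * final_loglike

print(f'AIC: {aic:.2f}')
print(f'BIC: {bic:.2f}')
```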

pandas_results = results.get_estimated_parameters()
pandas_results

               Value  Rob. Std err  Rob. t-test
ASC_CAR    -1.127438      0.060630   -18.595448
ASC_TRAIN  -2.768608      0.080535   -34.377505
B_COST     -0.999353      0.064141   -15.580617
B_TIME     -0.974561      0.110774    -8.797759
MU          1.630982      0.062661    26.028460
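The reported t-tests are the ratio of each estimate to its robust standard error, so they test each parameter against 0. For the nest parameter, the comparison that tells nested logit apart from multinomial logit is against 1, the lower bound of MU, at which the nest collapses. A small sketch of that calculation, using the values in the table and variable names chosen here for illustration:

```python
# Values copied from the table above.
mu_hat = 1.630982
mu_rob_se = 0.062661

# t-test against 0, as reported in the table.
t_against_zero = mu_hat / mu_rob_se

# t-test against 1, the value at which the nest collapses to logit.
t_against_one = (mu_hat - 1.0) / mu_rob_se

print(f't-test (MU = 0): {t_against_zero:.2f}')
print(f't-test (MU = 1): {t_against_one:.2f}')
```

The statistic against 1 is roughly 10, so under the usual asymptotic approximation the estimate is well away from the logit case. Because 1 is on the boundary of the admissible range, this figure is better read as indicative than as an exact test.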

Total running time of the script: (0 minutes 0.383 seconds)
