Nested logit with corrections for endogenous sampling
A sample is said to be endogenous if the probability that an individual is included in the sample depends on the choice that this individual has made. In that case, the exogenous sample maximum likelihood (ESML) estimator is no longer appropriate, and corrections must be made. See Bierlaire, Bolduc, McFadden (2008).
This is illustrated in this example.
Michel Bierlaire, EPFL Sat Jun 21 2025, 17:13:33
import biogeme.biogeme_logging as blog
import numpy as np
from IPython.core.display_functions import display
from biogeme.biogeme import BIOGEME
from biogeme.expressions import Beta
from biogeme.models import get_mev_for_nested, logmev_endogenous_sampling
from biogeme.nests import NestsForNestedLogit, OneNestForNestedLogit
from biogeme.results_processing import get_pandas_estimated_parameters
See the data processing script: Data preparation for Swissmetro.
from swissmetro_data import (
    CAR_AV_SP,
    CAR_CO_SCALED,
    CAR_TT_SCALED,
    CHOICE,
    SM_AV,
    SM_COST_SCALED,
    SM_TT_SCALED,
    TRAIN_AV_SP,
    TRAIN_COST_SCALED,
    TRAIN_TT_SCALED,
    database,
)
logger = blog.get_screen_logger(level=blog.INFO)
logger.info('Example b14nested_endogenous_sampling.py')
Example b14nested_endogenous_sampling.py
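Parameters to be estimated. The constant asc_sm is fixed to 0 (last argument set to 1) for normalization, and the nest parameter is bounded between 1 and 10.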
asc_car = Beta('asc_car', 0, None, None, 0)
asc_train = Beta('asc_train', 0, None, None, 0)
asc_sm = Beta('asc_sm', 0, None, None, 1)
b_time = Beta('b_time', 0, None, None, 0)
b_cost = Beta('b_cost', 0, None, None, 0)
nest_parameter = Beta('nest_parameter', 1, 1, 10, 0)
In this example, we assume that the three modes exist, and that the sampling protocol is choice-based. The probability that a respondent who chose alternative i belongs to the sample is R_i.
R_TRAIN = 4.42e-2
R_SM = 3.36e-3
R_CAR = 7.5e-3
The correction terms are the logarithms of these quantities.
correction = {1: np.log(R_TRAIN), 2: np.log(R_SM), 3: np.log(R_CAR)}
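To see why the logarithm of R_i is a natural correction, note that, by Bayes' rule, the probability of observing choice i among sampled respondents is proportional to P_i R_i. For a plain logit model, this is the same as adding ln R_i to each utility. The sketch below checks this identity numerically using only the standard library. The utility values and the helper `logit` are hypothetical, introduced for illustration, and are not taken from the Swissmetro data.

```python
import math

# Sampling probabilities from the example, keyed by alternative id.
R = {1: 4.42e-2, 2: 3.36e-3, 3: 7.5e-3}
correction = {i: math.log(r) for i, r in R.items()}

# Hypothetical utilities, for illustration only.
v = {1: -0.5, 2: 0.3, 3: 0.1}


def logit(utilities):
    """Logit probabilities, computed with the usual max shift for stability."""
    shift = max(utilities.values())
    denominator = sum(math.exp(u - shift) for u in utilities.values())
    return {i: math.exp(u - shift) / denominator for i, u in utilities.items()}


# Choice probabilities in the population.
p_population = logit(v)

# Bayes' rule: probability of each choice, given that the respondent is sampled.
weighted = {i: p_population[i] * R[i] for i in v}
total = sum(weighted.values())
p_sample_bayes = {i: w / total for i, w in weighted.items()}

# Same quantity, obtained by shifting each utility by ln R_i.
p_sample_shifted = logit({i: v[i] + correction[i] for i in v})

for i in v:
    print(i, p_sample_bayes[i], p_sample_shifted[i])
```

The two columns printed for each alternative agree up to floating-point error, which is why the corrections enter the model as additive terms.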
Definition of the utility functions.
v_train = asc_train + b_time * TRAIN_TT_SCALED + b_cost * TRAIN_COST_SCALED
v_swissmetro = asc_sm + b_time * SM_TT_SCALED + b_cost * SM_COST_SCALED
v_car = asc_car + b_time * CAR_TT_SCALED + b_cost * CAR_CO_SCALED
Associate utility functions with the numbering of alternatives.
v = {1: v_train, 2: v_swissmetro, 3: v_car}
Associate the availability conditions with the alternatives.
av = {1: TRAIN_AV_SP, 2: SM_AV, 3: CAR_AV_SP}
Definition of nests. Only the non-trivial nests must be defined. A trivial nest is a nest containing exactly one alternative. In this example, we create a nest for the existing modes, that is train (1) and car (3).
existing = OneNestForNestedLogit(
    nest_param=nest_parameter, list_of_alternatives=[1, 3], name='existing'
)
nests = NestsForNestedLogit(choice_set=list(v), tuple_of_nests=(existing,))
The following elements do not appear in any nest and are assumed each to be alone in a separate nest: {2}. If it is not the intention, check the assignment of alternatives to nests.
The choice model is a nested logit, with corrections for endogenous sampling. We first obtain the expression of the Gi function for nested logit.
probability_generating_function = get_mev_for_nested(v, av, nests)
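As a rough illustration of what such a function evaluates, the sketch below computes ln G_i numerically for a nested logit, assuming the standard generating function G(y) = sum over nests of (sum over j in the nest of y_j^mu_m)^(1/mu_m) with the overall scale normalized to 1 and y_j = exp(V_j). It is not Biogeme's implementation: the helper `log_gi_nested` and the utility values are hypothetical, and availability conditions are ignored.

```python
import math


def log_gi_nested(v, nests):
    """Return ln G_i for each alternative of a nested logit model.

    v: dict mapping alternative id to utility value.
    nests: list of (mu_m, list of alternative ids); alternatives that are
    not listed are treated as alone in their nest, so that ln G_i = 0.
    """
    log_gi = {i: 0.0 for i in v}
    for mu_m, members in nests:
        log_sum = math.log(sum(math.exp(mu_m * v[j]) for j in members))
        for i in members:
            log_gi[i] = (mu_m - 1.0) * v[i] + (1.0 / mu_m - 1.0) * log_sum
    return log_gi


# Hypothetical utilities; train (1) and car (3) share the nest 'existing'.
v = {1: -0.5, 2: 0.3, 3: 0.1}
log_gi = log_gi_nested(v, nests=[(1.6, [1, 3])])

# MEV choice probabilities: P_i is proportional to exp(V_i + ln G_i).
terms = {i: math.exp(v[i] + log_gi[i]) for i in v}
total = sum(terms.values())
probabilities = {i: t / total for i, t in terms.items()}
print(probabilities)
```

With a nest parameter equal to 1, every ln G_i is zero and the model collapses to a plain logit, which is a convenient sanity check.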
Then we calculate the MEV log probability, accounting for the correction.
log_probability = logmev_endogenous_sampling(
    v, probability_generating_function, av, correction, CHOICE
)
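The corrected log probability can be sketched numerically as follows, assuming it takes the form ln P(i) = V_i + ln G_i + w_i - ln sum_j exp(V_j + ln G_j + w_j), where w_i is the correction term of alternative i. The function `log_prob_endogenous` and all numerical inputs except the sampling probabilities are hypothetical; this is an illustration of the formula, not the code used by Biogeme, and availability conditions are ignored.

```python
import math


def log_prob_endogenous(v, log_gi, correction, chosen):
    """Log of the corrected MEV probability of the chosen alternative."""
    terms = {i: v[i] + log_gi[i] + correction[i] for i in v}
    # Log-sum-exp with a max shift for numerical stability.
    shift = max(terms.values())
    log_denominator = shift + math.log(
        sum(math.exp(t - shift) for t in terms.values())
    )
    return terms[chosen] - log_denominator


# Hypothetical utilities and ln G_i values, for illustration only.
v = {1: -0.5, 2: 0.3, 3: 0.1}
log_gi = {1: -0.2, 2: 0.0, 3: 0.15}

# Correction terms: logs of the sampling probabilities used in the example.
correction = {1: math.log(4.42e-2), 2: math.log(3.36e-3), 3: math.log(7.5e-3)}

log_probs = {i: log_prob_endogenous(v, log_gi, correction, i) for i in v}
print(log_probs)
```

Adding the same constant to every correction term leaves the result unchanged, so only the ratios between the sampling probabilities matter.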
Create the Biogeme object.
the_biogeme = BIOGEME(database, log_probability)
the_biogeme.model_name = 'b14nested_endogenous_sampling'
Biogeme parameters read from biogeme.toml.
Estimate the parameters.
results = the_biogeme.estimate()
*** Initial values of the parameters are obtained from the file __b14nested_endogenous_sampling.iter
Parameter values restored from __b14nested_endogenous_sampling.iter
Starting values for the algorithm: {'asc_train': -2.768565181199355, 'b_time': -0.9745266807210456, 'b_cost': -0.999260565958647, 'nest_parameter': 1.6310201876602868, 'asc_car': -1.1274305042219672}
As the model is rather complex, we cancel the calculation of second derivatives. If you want to control the parameters, change the algorithm from "automatic" to "simple_bounds" in the TOML file.
Optimization algorithm: hybrid Newton/BFGS with simple bounds [simple_bounds]
** Optimization: BFGS with trust region for simple bounds
Iter. asc_train b_time b_cost nest_parameter asc_car Function Relgrad Radius Rho
0 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.017 -1.6e+03 -
1 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.0087 -1.1e+03 -
2 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.0043 -5.5e+02 -
3 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.0022 -2.5e+02 -
4 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.0011 -1.2e+02 -
5 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.00054 -56 -
6 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.00027 -27 -
7 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 0.00014 -13 -
8 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 6.8e-05 -6 -
9 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 3.4e-05 -2.5 -
10 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.6e-05 1.7e-05 -0.75 -
11 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.5e-05 1.7e-05 0.13 +
12 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.4e-05 1.7e-05 0.79 +
13 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 9.7e-06 1.7e-05 0.64 +
14 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 1.1e-05 1.7e-05 0.72 +
15 -2.8 -0.97 -1 1.6 -1.1 5.2e+03 5.3e-06 1.7e-05 0.73 +
Optimization algorithm has converged.
Relative gradient: 5.30365030058552e-06
Cause of termination: Relative gradient = 5.3e-06 <= 6.1e-06
Number of function evaluations: 27
Number of gradient evaluations: 11
Number of hessian evaluations: 0
Algorithm: BFGS with trust region for simple bound constraints
Number of iterations: 16
Proportion of Hessian calculation: 0/5 = 0.0%
Optimization time: 0:00:00.405032
Calculate second derivatives and BHHH
File b14nested_endogenous_sampling~00.html has been generated.
File b14nested_endogenous_sampling~00.yaml has been generated.
print(results.short_summary())
Results for model b14nested_endogenous_sampling
Nbr of parameters: 5
Sample size: 6768
Excluded data: 3960
Final log likelihood: -5202.916
Akaike Information Criterion: 10415.83
Bayesian Information Criterion: 10449.93
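The two information criteria can be recomputed from the other figures in the summary, using the usual definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The short check below uses the values reported above; the variable names are chosen for illustration.

```python
import math

n_parameters = 5
sample_size = 6768
final_log_likelihood = -5202.916

# Akaike and Bayesian information criteria from their standard definitions.
aic = 2 * n_parameters - 2 * final_log_likelihood
bic = n_parameters * math.log(sample_size) - 2 * final_log_likelihood

print(f'AIC: {aic:.2f}')
print(f'BIC: {bic:.2f}')
```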
pandas_results = get_pandas_estimated_parameters(estimation_results=results)
display(pandas_results)
Name Value Robust std err. Robust t-stat. Robust p-value
0 asc_train -2.768592 0.080536 -34.376951 0.0
1 b_time -0.974563 0.110775 -8.797674 0.0
2 b_cost -0.999335 0.064140 -15.580480 0.0
3 nest_parameter 1.631009 0.062667 26.026438 0.0
4 asc_car -1.127430 0.060631 -18.595046 0.0
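The t-statistics in this table test each parameter against 0. For the nest parameter, a comparison with 1 is often more informative, because the nested logit reduces to a plain logit when the nest parameter equals 1. The sketch below recomputes both statistics from the rounded values in the table, so the last digits may differ slightly from the reported ones.

```python
# Values copied from the table above.
nest_parameter = 1.631009
robust_std_err = 0.062667

# Reported statistic: test against 0.
t_stat_vs_zero = nest_parameter / robust_std_err

# Test against 1, the value at which the model collapses to logit.
t_stat_vs_one = (nest_parameter - 1.0) / robust_std_err

print(f't-stat against 0: {t_stat_vs_zero:.2f}')
print(f't-stat against 1: {t_stat_vs_one:.2f}')
```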
Total running time of the script: (0 minutes 2.059 seconds)