Biogeme
The core routines of Biogeme.
biogeme.biogeme module
Implementation of the main Biogeme class that combines the database and the model specification.
 author
Michel Bierlaire
 date
Tue Mar 26 16:45:15 2019

class biogeme.biogeme.BIOGEME(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, removeUnusedVariables=True, displayUsedVariables=False, suggestScales=True, missingData=99999)
Bases: object
Main class that combines the database and the model specification.
It works in two modes: estimation and simulation.

__init__(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, removeUnusedVariables=True, displayUsedVariables=False, suggestScales=True, missingData=99999)
Constructor
 Parameters
database (biogeme.database.Database) – choice data.
formulas (biogeme.expressions.Expression, or dict(biogeme.expressions.Expression)) – expression or dictionary of expressions that define the model specification. Each expression is applied to each entry of the database. The keys of the dictionary provide a name for each formula. In estimation mode, two formulas are needed, with the keys ‘loglike’ and ‘weight’. If only one formula is provided, it is associated with the label ‘loglike’. If no formula is labeled ‘weight’, the weight of each observation is assumed to be 1.0. In simulation mode, the labels of the formulas are used as labels in the resulting database.
userNotes (str) – these notes will be included in the report file.
numberOfThreads (int) – multithreading can be used for estimation. This parameter defines the number of threads to be used. If the parameter is set to None, the number of available threads is calculated using cpu_count(). Ignored in simulation mode. Default: None.
numberOfDraws (int) – number of draws used for Monte Carlo integration. Default: 1000.
seed (int) – seed used for the pseudorandom number generation. It is useful only when each run should generate the exact same result. If None, a new seed is used at each run. Default: None.
skipAudit (bool) – if True, does not check the validity of the formulas. This may save a significant amount of time for large models and large data sets. Default: False.
removeUnusedVariables (bool) – if True, all variables not used in the expression are removed from the database. Default: True.
displayUsedVariables (bool) – if True, displays all the variables used in the formulas. Default: False.
suggestScales (bool) – if True, Biogeme suggests the scaling of the variables in the database. Default: True. See also biogeme.database.Database.suggestScaling()
missingData (float) – if a variable has this value, the data is assumed to be missing and an exception is raised. Default: 99999.
 Raises
biogemeError – an audit of the formulas is performed. If a formula has issues, an error is detected and an exception is raised.
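
The labeling rules described for the formulas parameter can be sketched in a few lines. This is only an illustration of the documented behavior, not Biogeme's code: normalize_formulas is a hypothetical helper, and plain strings stand in for biogeme.expressions.Expression objects.

```python
def normalize_formulas(formulas, loglike_name="loglike", weight_name="weight"):
    """Hypothetical helper mimicking the documented labeling rules."""
    # A single formula is associated with the label 'loglike'.
    if not isinstance(formulas, dict):
        formulas = {loglike_name: formulas}
    labeled = dict(formulas)
    # If no formula is labeled 'weight', each observation has weight 1.0.
    weight = labeled.get(weight_name, 1.0)
    return labeled, weight


# Strings are placeholders for real expressions (assumption).
labeled, weight = normalize_formulas("logprob")
print(labeled, weight)
```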

algoParameters
Parameters to be transferred to the optimization algorithm

algorithm
Optimization algorithm

bestIteration
Store the best iteration found so far.

bootstrap_results
Results of the bootstrap calculation.

bootstrap_time
Time needed to calculate the bootstrap standard errors

calculateInitLikelihood()
Calculate the value of the log likelihood function
The default values of the parameters are used.
 Returns
value of the log likelihood.
 Return type
float.

calculateLikelihood(x, scaled, batch=None)
Calculates the value of the log likelihood function
 Parameters
x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the value is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable. Default: True
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None
 Returns
the calculated value of the log likelihood
 Return type
float.
 Raises
ValueError – if the length of the list x is incorrect.
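
The effect of the scaled and batch arguments can be illustrated with a small stand-alone function. It is a sketch under stated assumptions, not Biogeme's implementation: log_likelihood is a hypothetical name, the per-observation contributions are given directly as a list, and the batch is drawn by simple random sampling without replacement.

```python
import random


def log_likelihood(contributions, scaled, batch=None, rng=None):
    """Sum per-observation log likelihood contributions (illustrative only)."""
    if batch is not None:
        # The share of the data must be strictly between 0 and 1.
        if not 0.0 < batch < 1.0:
            raise ValueError("batch must be strictly between 0 and 1")
        rng = rng or random.Random()
        size = max(1, round(batch * len(contributions)))
        contributions = rng.sample(contributions, size)
    total = sum(contributions)
    # Scaling divides by the number of observations actually used, so that
    # values obtained with different sample sizes are comparable.
    return total / len(contributions) if scaled else total


print(log_likelihood([-1.0, -2.0, -3.0], scaled=False))
print(log_likelihood([-1.0, -2.0, -3.0], scaled=True))
```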

calculateLikelihoodAndDerivatives(x, scaled, hessian=False, bhhh=False, batch=None)
Calculate the value of the log likelihood function and its derivatives.
 Parameters
x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the results are divided by the number of observations.
hessian (bool) – if True, the hessian is calculated. Default: False.
bhhh (bool) – if True, the BHHH matrix is calculated. Default: False.
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None
 Returns
f, g, h, bh where
f is the value of the function (float)
g is the gradient (numpy.array)
h is the hessian (numpy.array)
bh is the BHHH matrix (numpy.array)
 Return type
tuple float, numpy.array, numpy.array, numpy.array
 Raises
ValueError – if the length of the list x is incorrect
biogemeError – if the norm of the gradient is not finite, an error is raised.
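
For readers unfamiliar with the BHHH matrix: it is commonly defined as the sum, over observations, of the outer product of each observation's gradient with itself. The sketch below shows that definition with plain lists. It is not Biogeme's code, and bhhh_matrix is a hypothetical name.

```python
def bhhh_matrix(gradients):
    """Sum of outer products g_n g_n^T of per-observation gradients."""
    n = len(gradients[0])
    matrix = [[0.0] * n for _ in range(n)]
    for g in gradients:
        for i in range(n):
            for j in range(n):
                matrix[i][j] += g[i] * g[j]
    return matrix


# Two observations, two parameters (made-up numbers).
print(bhhh_matrix([[1.0, 2.0], [3.0, 4.0]]))
```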

calculateNullLoglikelihood(avail)
Calculate the log likelihood of the null model that predicts equal probability for each alternative
 Parameters
avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative.
 Returns
value of the log likelihood
 Return type
float
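
A null model that assigns equal probability to each alternative gives every observation the probability 1 / (number of available alternatives). The following sketch computes the resulting log likelihood from 0/1 availability indicators. It assumes unit weights and that the chosen alternative is always available; null_log_likelihood is a hypothetical name, not part of Biogeme.

```python
import math


def null_log_likelihood(availability):
    """availability: one list of 0/1 indicators per observation."""
    total = 0.0
    for row in availability:
        n_available = sum(1 for a in row if a)
        # log(1 / n_available) for an equal-probability model.
        total -= math.log(n_available)
    return total


# Observation 1 has three available alternatives, observation 2 has two.
print(null_log_likelihood([[1, 1, 1], [1, 1, 0]]))
```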

changeInitValues(betas)
Modifies the initial values of the parameters in all formulas
 Parameters
betas (dict(string:float)) – dictionary where the keys are the names of the parameters, and the values are the new value for the parameters.

checkDerivatives(verbose=False)
Verifies the implementation of the derivatives.
It compares the analytical version with the finite differences approximation.
 Parameters
verbose (bool) – if True, the comparisons are reported. Default: False.
 Return type
tuple.
 Returns
f, g, h, gdiff, hdiff where
f is the value of the function,
g is the analytical gradient,
h is the analytical hessian,
gdiff is the difference between the analytical and the finite differences gradient,
hdiff is the difference between the analytical and the finite differences hessian,
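
The kind of comparison described above can be sketched for the gradient as follows. This is a generic forward-difference check, not Biogeme's implementation; the function names and the step size are assumptions.

```python
def finite_difference_gradient(f, x, eps=1e-7):
    """Forward-difference approximation of the gradient of f at x."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps
        grad.append((f(shifted) - fx) / eps)
    return grad


def check_gradient(f, analytical_gradient, x):
    """Return f(x), the analytical gradient and its difference with finite differences."""
    g = analytical_gradient(x)
    g_fd = finite_difference_gradient(f, x)
    gdiff = [a - b for a, b in zip(g, g_fd)]
    return f(x), g, gdiff


# Toy function with a known gradient.
fx, g, gdiff = check_gradient(
    lambda x: x[0] ** 2 + 3.0 * x[1],
    lambda x: [2.0 * x[0], 3.0],
    [1.0, 2.0],
)
print(fx, g, gdiff)
```

Small entries in gdiff suggest that the analytical derivatives are consistent with the function values.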

columnForBatchSamplingWeights
Name of the column defining weights for batch sampling in stochastic optimization.

confidenceIntervals(betaValues, intervalSize=0.9)
Calculate confidence intervals on the simulated quantities
 Parameters
betaValues (list(dict(str: float))) – list of parameter values to be used in the calculations. Typically, it is a sample drawn from a distribution.
intervalSize (float) – size of the reported confidence interval, expressed as a fraction between 0 and 1. If it is denoted by s, the interval is calculated for the quantiles (1-s)/2 and (1+s)/2. The default (0.9) corresponds to quantiles for the confidence interval [0.05, 0.95].
 Returns
two pandas data frames ‘left’ and ‘right’ with the same dimensions. Each row corresponds to a row in the database, and each column to a formula. ‘left’ contains the left value of the confidence interval, and ‘right’ the right value
Example:
    import biogeme.results as res

    # biogeme is assumed to be a BIOGEME instance created beforehand.
    # Read the estimation results from a file
    results = res.bioResults(pickleFile='myModel.pickle')
    # Retrieve the names of the beta parameters that have been estimated
    betas = biogeme.freeBetaNames
    # Draw 100 realizations of the distribution of the estimators
    b = results.getBetasForSensitivityAnalysis(betas, size=100)
    # Simulate the formulas using the nominal values
    # (betaValues is assumed to be a dict of parameter values defined beforehand)
    simulatedValues = biogeme.simulate(betaValues)
    # Calculate the confidence intervals for each formula
    left, right = biogeme.confidenceIntervals(b, 0.9)
 Return type
tuple of two Pandas dataframes.
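
The mapping from the interval size s to the quantiles (1-s)/2 and (1+s)/2 can be illustrated on a list of simulated values for one quantity. This is a sketch only: the helper names are hypothetical, and the linear-interpolation quantile rule is an assumption that may differ from the rule Biogeme applies.

```python
def interval_quantiles(interval_size=0.9):
    """Quantile levels (1 - s) / 2 and (1 + s) / 2 for an interval of size s."""
    return (1.0 - interval_size) / 2.0, (1.0 + interval_size) / 2.0


def quantile(sorted_values, q):
    """Empirical quantile with linear interpolation (assumed rule)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def confidence_interval(simulated, interval_size=0.9):
    """Left and right bounds of the interval for one simulated quantity."""
    ordered = sorted(simulated)
    q_left, q_right = interval_quantiles(interval_size)
    return quantile(ordered, q_left), quantile(ordered, q_right)


# One simulated value per draw of the parameters (made-up numbers).
left, right = confidence_interval([float(v) for v in range(101)], 0.9)
print(left, right)
```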

createLogFile(verbosity=3)
Creates a log file with the messages produced by Biogeme.
The name of the file is the name of the model with the extension .log
 Parameters
verbosity (int) –
types of messages to be captured
0: no output
1: warnings
2: only general information
3: more verbose
4: debug messages
Default: 3.

database
biogeme.database.Database object

drawsProcessingTime
Time needed to generate the draws.

estimate(bootstrap=0, algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)
Estimate the parameters of the model.
 Parameters
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix. If the number is 0, bootstrapping is not applied. Default: 0.
algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton’s algorithm with simple bounds.
algoParameters (dict) – parameters to transfer to the optimization algorithm
 Returns
object containing the estimation results.
 Return type
biogeme.bioResults
Example:
    import biogeme.biogeme as bio

    # database and logprob are assumed to be defined beforehand.
    # Create an instance of biogeme
    biogeme = bio.BIOGEME(database, logprob)
    # Give a name to the model
    biogeme.modelName = 'mymodel'
    # Estimate the parameters
    results = biogeme.estimate()
 Raises
biogemeError – if no expression has been provided for the likelihood

formulas
Dictionary containing Biogeme formulas of type biogeme.expressions.Expression. The keys are the names of the formulas.

generateHtml
Boolean variable, True if the HTML file with the results must be generated.

generatePickle
Boolean variable, True if the pickle file with the results must be generated.

getBoundsOnBeta(betaName)
Returns the bounds on the parameter as defined by the user.
 Parameters
betaName (string) – name of the parameter
 Returns
lower bound, upper bound
 Return type
tuple
 Raises
biogemeError – if the name of the parameter is not found.

initLogLike
Initial value of the likelihood function

lastSample
Keeps track of the sample of data used to calculate the stochastic gradient / hessian

likelihoodFiniteDifferenceHessian(x)
Calculate the hessian of the log likelihood function using finite differences.
May be useful when the analytical hessian has numerical issues.
 Parameters
x (list(float)) – vector of values for the parameters.
 Returns
finite differences approximation of the hessian.
 Return type
numpy.array
 Raises
ValueError – if the length of the list x is incorrect
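
A finite-difference hessian can be obtained from function values alone. The sketch below uses a standard central-difference formula. It illustrates the general technique rather than the scheme Biogeme uses, and the function name and step size are assumptions.

```python
def finite_difference_hessian(f, x, h=1e-4):
    """Central-difference approximation of the hessian of f at x."""
    n = len(x)

    def shifted(i, si, j, sj):
        # Evaluate f after moving coordinates i and j by si*h and sj*h.
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return f(y)

    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hessian[i][j] = (
                shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
                - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)
            ) / (4.0 * h * h)
    return hessian


# Toy function: f(x) = x0^2 * x1 + x1^3, whose hessian at (1, 2) is [[4, 2], [2, 12]].
H = finite_difference_hessian(lambda x: x[0] ** 2 * x[1] + x[1] ** 3, [1.0, 2.0])
print(H)
```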

logger
Logger that controls the output of messages to the screen and log file. Type: class biogeme.messaging.bioMessage.

loglike
Object of type biogeme.expressions.Expression calculating the formula for the loglikelihood

loglikeName
Keyword used for the name of the loglikelihood formula. Default: ‘loglike’

loglikeSignatures
Internal signature of the formula for the loglikelihood.

missingData
code for missing data

modelName
Name of the model. Default: ‘biogemeModelDefaultName’

monteCarlo
True if one of the expressions involves a Monte Carlo integration.

nullLogLike
Log likelihood of the null model

numberOfThreads
Number of threads used for parallel computing. Default: the number of available CPUs.

oldsimulate(theBetaValues=None)
Applies the formulas to each row of the database. This is the old implementation. To be removed in future versions.
 Parameters
theBetaValues (dict(str, float)) – values of the parameters to be used in the calculations. If None, the default values are used. Default: None.
 Returns
a pandas data frame with the simulated value. Each row corresponds to a row in the database, and each column to a formula.
 Return type
Pandas data frame
 Raises
biogemeError – if the number of parameters is incorrect

optimizationMessages
Information provided by the optimization algorithm after completion.

optimize(startingValues=None)
Calls the optimization algorithm. The function self.algorithm is called.
 Parameters
startingValues (list(float)) – starting point for the algorithm
 Returns
x, messages
x is the solution generated by the algorithm,
messages is a dictionary reporting information about the run of the algorithm
 Return type
numpy.array, dict(str:object)
 Raises
biogemeError – an error is raised if no algorithm is specified.

quickEstimate(algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)
Estimate the parameters of the model. Same as estimate, except that any extra calculation is skipped (initial loglikelihood, t-statistics, etc.)
 Parameters
algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton’s algorithm with simple bounds.
algoParameters (dict) – parameters to transfer to the optimization algorithm
 Returns
object containing the estimation results.
 Return type
biogeme.bioResults
Example:
    import biogeme.biogeme as bio

    # database and logprob are assumed to be defined beforehand.
    # Create an instance of biogeme
    biogeme = bio.BIOGEME(database, logprob)
    # Give a name to the model
    biogeme.modelName = 'mymodel'
    # Estimate the parameters
    results = biogeme.quickEstimate()
 Raises
biogemeError – if no expression has been provided for the likelihood

saveIterations
If True, the current iterate is saved after each iteration, in a file named __[modelName].iter, where [modelName] is the name given to the model. If such a file exists, the starting values for the estimation are replaced by the values saved in the file.

setRandomInitValues(defaultBound=100.0)
Modifies the initial values of the parameters in all formulas, using randomly generated values. The value is drawn from a uniform distribution on the interval defined by the bounds.
 Parameters
defaultBound (float) – If the upper bound is missing, it is replaced by this value. If the lower bound is missing, it is replaced by the opposite of this value. Default: 100.
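
The rule for missing bounds can be sketched as follows. This is an illustration of the documented behavior rather than Biogeme's code: random_init_values is a hypothetical helper, and the bounds are passed as a dictionary of (lower, upper) pairs in which None marks a missing bound.

```python
import random


def random_init_values(bounds, default_bound=100.0, rng=None):
    """Draw one uniform value per parameter within its bounds."""
    rng = rng or random.Random()
    values = {}
    for name, (lower, upper) in bounds.items():
        # A missing lower bound is replaced by -default_bound,
        # a missing upper bound by +default_bound.
        if lower is None:
            lower = -default_bound
        if upper is None:
            upper = default_bound
        values[name] = rng.uniform(lower, upper)
    return values


# Made-up parameter names and bounds.
values = random_init_values(
    {"b1": (None, None), "b2": (0.0, 1.0), "b3": (None, 0.0)},
    rng=random.Random(1),
)
print(values)
```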

simulate(theBetaValues=None)
Applies the formulas to each row of the database.
 Parameters
theBetaValues (dict(str, float)) – values of the parameters to be used in the calculations. If None, the default values are used. Default: None.
 Returns
a pandas data frame with the simulated value. Each row corresponds to a row in the database, and each column to a formula.
 Return type
Pandas data frame
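Example:
    import biogeme.results as res

    # biogeme is assumed to be a BIOGEME instance created beforehand.
    # Read the estimation results from a file
    results = res.bioResults(pickleFile='myModel.pickle')
    # Simulate the formulas using the nominal values
    # (betaValues is assumed to be a dict of parameter values defined beforehand)
    simulatedValues = biogeme.simulate(betaValues)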
 Raises
biogemeError – if the number of parameters is incorrect

usedVariables
Set of variables used in the formulas.

userNotes
User notes

validate(estimationResults, validationData)
Perform out-of-sample validation.
The function performs the following tasks:
each slice defines a validation set (the slice itself) and an estimation set (the rest of the data),
the model is reestimated on the estimation set,
the estimated model is applied on the validation set,
the value of the log likelihood for each observation is reported.
 Parameters
estimationResults (biogeme.results.bioResults) – results of the model estimation based on the full data.
validationData (list(tuple(pandas.DataFrame, pandas.DataFrame))) – list of pairs of data sets, each containing an estimation set and a validation set
 Returns
a list containing as many items as slices. Each item is the result of the simulation on the validation set.
 Return type
list(pandas.DataFrame)
 Raises
biogemeError – An error is raised if the database is structured as panel data.
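
The structure expected for validationData, a list of (estimation set, validation set) pairs, can be illustrated with plain lists standing in for pandas data frames. The helper make_validation_pairs is hypothetical, and assigning every n-th row to a slice is just one possible way to split the data.

```python
def make_validation_pairs(rows, n_slices):
    """Build (estimation, validation) pairs, one per slice."""
    pairs = []
    for k in range(n_slices):
        # The slice itself is the validation set...
        validation = rows[k::n_slices]
        # ...and the rest of the data is the estimation set.
        estimation = [r for i, r in enumerate(rows) if i % n_slices != k]
        pairs.append((estimation, validation))
    return pairs


pairs = make_validation_pairs(list(range(10)), 5)
for estimation, validation in pairs:
    print(len(estimation), validation)
```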

weight
Object of type biogeme.expressions.Expression calculating the weight of each observation in the sample.

weightName
Keyword used for the name of the weight formula. Default: ‘weight’

weightSignatures
Internal signature of the formula for the weight.


class biogeme.biogeme.negLikelihood(like, like_deriv, scaled)
Bases: biogeme.algorithms.functionToMinimize
Provides the value of the function to be minimized, as well as its derivatives. To be used by the optimization package.

batch
Value between 0 and 1 defining the size of the batch, that is, the share of the data that should be used to approximate the log likelihood.

bhhhv
BHHH matrix

f(batch=None)
Calculate the value of the function
 Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
 Returns
value of the function
 Return type
float

f_g(batch=None)
Calculate the value of the function and the gradient
 Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
 Returns
value of the function and the gradient
 Return type
tuple float, numpy.array

f_g_bhhh(batch=None)
Calculate the value of the function, the gradient and the BHHH matrix
 Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
 Returns
value of the function, the gradient and the BHHH
 Return type
tuple float, numpy.array, numpy.array

f_g_h(batch=None)
Calculate the value of the function, the gradient and the Hessian
 Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
 Returns
value of the function, the gradient and the Hessian
 Return type
tuple float, numpy.array, numpy.array

fv
value of the function

gv
vector with the gradient

hv
second derivatives matrix

like
function calculating the log likelihood

like_deriv
function calculating the log likelihood and its derivatives.

recalculate
True if the log likelihood must be recalculated

scaled
if True, the value of the log likelihood is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable.

setVariables(x)
Set the values of the variables for which the function has to be calculated.
 Parameters
x (numpy.array) – values

x
Vector of values of the unknown parameters
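
The general idea behind a class of this kind, turning a log likelihood to be maximized into a function to be minimized, can be sketched as follows. This is a simplified stand-in, not the actual negLikelihood class: NegatedLogLikelihood is a hypothetical name, the scaled and batch options are omitted, and the use of recalculate as a cache flag is an assumption based on the attribute descriptions above.

```python
class NegatedLogLikelihood:
    """Simplified, hypothetical stand-in for a function to minimize."""

    def __init__(self, like, like_deriv):
        self.like = like              # returns the log likelihood at x
        self.like_deriv = like_deriv  # returns (log likelihood, gradient) at x
        self.x = None
        self.recalculate = True
        self.fv = None
        self.gv = None

    def setVariables(self, x):
        # New variables invalidate the stored values.
        self.x = list(x)
        self.recalculate = True

    def f(self):
        if self.recalculate:
            # Minimizing the negative log likelihood maximizes the log likelihood.
            self.fv = -self.like(self.x)
            self.gv = None
            self.recalculate = False
        return self.fv

    def f_g(self):
        if self.recalculate or self.gv is None:
            value, gradient = self.like_deriv(self.x)
            self.fv = -value
            self.gv = [-g for g in gradient]
            self.recalculate = False
        return self.fv, self.gv


# Toy concave "log likelihood" with its maximum at x = 1.
obj = NegatedLogLikelihood(
    lambda x: -(x[0] - 1.0) ** 2,
    lambda x: (-(x[0] - 1.0) ** 2, [-2.0 * (x[0] - 1.0)]),
)
obj.setVariables([3.0])
print(obj.f(), obj.f_g())
```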
