Biogeme
The core routines of Biogeme.
biogeme.biogeme module
Implementation of the main Biogeme class that combines the database and the model specification.
- author
Michel Bierlaire
- date
Tue Mar 26 16:45:15 2019
- class biogeme.biogeme.BIOGEME(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, suggestScales=True, missingData=99999)[source]
Bases: object
Main class that combines the database and the model specification.
It works in two modes: estimation and simulation.
- __init__(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, suggestScales=True, missingData=99999)[source]
Constructor
- Parameters
database (biogeme.database.Database) – choice data.
formulas (biogeme.expressions.Expression, or dict(biogeme.expressions.Expression)) – expression or dictionary of expressions that define the model specification. Each expression is applied to each entry of the database. The keys of the dictionary provide a name for each formula. In estimation mode, two formulas are needed, with the keys ‘loglike’ and ‘weight’. If only one formula is provided, it is associated with the label ‘loglike’. If no formula is labeled ‘weight’, the weight of each observation is assumed to be 1.0. In simulation mode, the label of each formula is used as the label of the corresponding column of the resulting database.
userNotes (str) – these notes will be included in the report file.
numberOfThreads (int) – multi-threading can be used for estimation. This parameter defines the number of threads to be used. If the parameter is set to None, the number of available threads is calculated using cpu_count(). Ignored in simulation mode. Default: None.
numberOfDraws (int) – number of draws used for Monte-Carlo integration. Default: 1000.
seed (int) – seed used for the pseudo-random number generation. It is useful only when each run should generate the exact same result. If None, a new seed is used at each run. Default: None.
skipAudit (bool) – if True, the validity of the formulas is not checked. This may save a significant amount of time for large models and large data sets. Default: False.
suggestScales (bool) – if True, Biogeme suggests the scaling of the variables in the database. Default: True. See also biogeme.database.Database.suggestScaling()
missingData (float) – if a variable has this value, the data item is assumed to be missing and an exception is triggered. Default: 99999.
- Raises
biogemeError – an audit of the formulas is performed. If a formula has issues, an error is detected and an exception is raised.
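The labelling rules for the formulas argument in estimation mode can be sketched in plain Python. The helper normalize_formulas below is hypothetical and is not part of Biogeme; strings stand in for biogeme.expressions.Expression objects, and the default weight is represented by the float 1.0.

```python
def normalize_formulas(formulas):
    """Return a dict of labelled formulas following the rules above.

    Hypothetical helper for illustration only; not part of Biogeme.
    """
    if not isinstance(formulas, dict):
        # A single formula is associated with the label 'loglike'.
        formulas = {'loglike': formulas}
    labelled = dict(formulas)
    # Without a 'weight' formula, every observation has weight 1.0.
    labelled.setdefault('weight', 1.0)
    return labelled


single = normalize_formulas('logprob')
both = normalize_formulas({'loglike': 'logprob', 'weight': 'w'})
```

Here single carries the default weight, while both keeps the weight formula supplied by the user.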
- algoParameters
Parameters to be transferred to the optimization algorithm
- algorithm
Optimization algorithm
- bestIteration
Store the best iteration found so far.
- bootstrap_results
Results of the bootstrap calculation.
- bootstrap_time
Time needed to calculate the bootstrap standard errors
- calculateInitLikelihood()[source]
Calculate the value of the log likelihood function
The default values of the parameters are used.
- Returns
value of the log likelihood.
- Return type
float.
- calculateLikelihood(x, scaled, batch=None)[source]
Calculates the value of the log likelihood function
- Parameters
x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the value is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable. Default: True
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None
- Returns
the calculated value of the log likelihood
- Return type
float.
- Raises
ValueError – if the length of the list x is incorrect.
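The effect of the scaled and batch arguments can be illustrated with a small sketch. It assumes that the log likelihood is a plain sum of log probabilities over observations; the function log_likelihood, its rng argument and the input list are illustrative and do not reproduce Biogeme's implementation.

```python
import math
import random


def log_likelihood(probabilities, scaled=True, batch=None, rng=None):
    """Sum the log of the chosen-alternative probabilities.

    probabilities: one probability per observation (illustrative input).
    scaled: if True, divide by the number of observations actually used.
    batch: share of the data to sample, strictly between 0 and 1, or None.
    """
    if batch is not None:
        if not 0.0 < batch < 1.0:
            raise ValueError('batch must be strictly between 0 and 1')
        rng = rng or random.Random()
        size = max(1, round(batch * len(probabilities)))
        probabilities = rng.sample(probabilities, size)
    total = sum(math.log(p) for p in probabilities)
    return total / len(probabilities) if scaled else total


probs = [0.5, 0.25, 0.8, 0.6]
full = log_likelihood(probs, scaled=False)
per_obs = log_likelihood(probs, scaled=True)
half = log_likelihood(probs, scaled=True, batch=0.5, rng=random.Random(0))
```

Because the scaled value is an average per observation, per_obs and half are on the same scale even though they use different numbers of observations.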
- calculateLikelihoodAndDerivatives(x, scaled, hessian=False, bhhh=False, batch=None)[source]
Calculate the value of the log likelihood function and its derivatives.
- Parameters
x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the results are divided by the number of observations.
hessian (bool) – if True, the hessian is calculated. Default: False.
bhhh (bool) – if True, the BHHH matrix is calculated. Default: False.
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None
- Returns
f, g, h, bh where
f is the value of the function (float)
g is the gradient (numpy.array)
h is the hessian (numpy.array)
bh is the BHHH matrix (numpy.array)
- Return type
tuple float, numpy.array, numpy.array, numpy.array
- Raises
ValueError – if the length of the list x is incorrect
biogemeError – if the norm of the gradient is not finite, an error is raised.
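The BHHH matrix is the sum, over observations, of the outer product of each observation's gradient with itself. The sketch below computes it from made-up per-observation gradients; bhhh_matrix is a hypothetical name, and Biogeme performs this calculation internally.

```python
def bhhh_matrix(scores):
    """Sum of outer products of the per-observation gradients (scores)."""
    n = len(scores[0])
    bh = [[0.0] * n for _ in range(n)]
    for g in scores:
        for i in range(n):
            for j in range(n):
                bh[i][j] += g[i] * g[j]
    return bh


# Two parameters, three observations (made-up numbers).
scores = [[1.0, 0.0], [0.5, -1.0], [-2.0, 2.0]]
bh = bhhh_matrix(scores)
```

The result is symmetric and positive semidefinite by construction, which is why it is often used as a substitute for the hessian.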
- calculateNullLoglikelihood(avail)[source]
Calculate the log likelihood of the null model that predicts equal probability for each alternative
- Parameters
avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative. If None, all alternatives are always available.
- Returns
value of the log likelihood
- Return type
float
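Under the null model, an observation with J available alternatives contributes log(1/J) to the log likelihood. A minimal sketch, assuming unweighted observations and using lists of 0/1 flags in place of the availability expressions (null_log_likelihood is a hypothetical name):

```python
import math


def null_log_likelihood(availability):
    """Log likelihood when each available alternative is equally likely.

    availability: one list of 0/1 flags per observation, one flag per
    alternative (illustrative stand-in for the availability expressions).
    """
    total = 0.0
    for flags in availability:
        n_available = sum(1 for a in flags if a)
        # Each available alternative has probability 1 / n_available.
        total += -math.log(n_available)
    return total


avail = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
null_ll = null_log_likelihood(avail)
```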
- changeInitValues(betas)[source]
Modifies the initial values of the parameters in all formulas.
- Parameters
betas (dict(string:float)) – dictionary where the keys are the names of the parameters, and the values are the new value for the parameters.
- checkDerivatives(verbose=False)[source]
Verifies the implementation of the derivatives.
It compares the analytical version with the finite differences approximation.
- Parameters
verbose (bool) – if True, the comparisons are reported. Default: False.
- Return type
tuple.
- Returns
f, g, h, gdiff, hdiff where
f is the value of the function,
g is the analytical gradient,
h is the analytical hessian,
gdiff is the difference between the analytical and the finite differences gradient,
hdiff is the difference between the analytical and the finite differences hessian,
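The comparison between analytical and finite-difference derivatives can be sketched on a toy function. The forward-difference scheme and the step size below are assumptions made for illustration; they are not necessarily the ones used by Biogeme.

```python
def finite_difference_gradient(f, x, step=1e-6):
    """Forward finite-difference approximation of the gradient of f at x."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        grad.append((f(shifted) - fx) / step)
    return grad


def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def analytical_gradient(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


x0 = [1.0, 2.0]
g = analytical_gradient(x0)
# Entries close to zero indicate that the analytical gradient is consistent.
gdiff = [a - b for a, b in zip(g, finite_difference_gradient(f, x0))]
```

A large entry in gdiff would point to an error in the analytical derivative of the corresponding parameter.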
- columnForBatchSamplingWeights
Name of the column defining weights for batch sampling in stochastic optimization.
- confidenceIntervals(betaValues, intervalSize=0.9)[source]
Calculate confidence intervals on the simulated quantities
- Parameters
betaValues (list(dict(str: float))) – array of parameters values to be used in the calculations. Typically, it is a sample drawn from a distribution.
intervalSize (float) – size of the reported confidence interval, expressed as a share between 0 and 1. If it is denoted by s, the interval is calculated for the quantiles (1-s)/2 and (1+s)/2. The default (0.9) corresponds to the quantiles 0.05 and 0.95.
- Returns
two pandas data frames ‘left’ and ‘right’ with the same dimensions. Each row corresponds to a row in the database, and each column to a formula. ‘left’ contains the left value of the confidence interval, and ‘right’ the right value
Example:
    # Read the estimation results from a file
    results = res.bioResults(pickleFile='myModel.pickle')
    # Retrieve the names of the beta parameters that have been estimated
    betas = biogeme.freeBetaNames()
    # Draw 100 realizations of the distribution of the estimators
    b = results.getBetasForSensitivityAnalysis(betas, size=100)
    # Simulate the formulas using the nominal (estimated) values
    betaValues = results.getBetaValues()
    simulatedValues = biogeme.simulate(betaValues)
    # Calculate the confidence intervals for each formula
    left, right = biogeme.confidenceIntervals(b, 0.9)
- Return type
tuple of two Pandas dataframes.
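The relation between intervalSize and the reported quantiles can be sketched as follows. The quantile levels come from the description above; the linear interpolation rule and both function names are assumptions for illustration only.

```python
def interval_quantiles(interval_size):
    """Quantile levels (1-s)/2 and (1+s)/2 for an interval of size s."""
    return (1.0 - interval_size) / 2.0, (1.0 + interval_size) / 2.0


def empirical_quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Simulated values of one formula for one row of the database, one value
# per draw of the parameters (made-up numbers).
simulated = [float(v) for v in range(101)]
q_left, q_right = interval_quantiles(0.9)
left = empirical_quantile(simulated, q_left)
right = empirical_quantile(simulated, q_right)
```

In the actual method, this calculation is repeated for every row of the database and every formula, which yields the two data frames.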
- createLogFile(verbosity=3)[source]
Creates a log file with the messages produced by Biogeme.
The name of the file is the name of the model with the extension .log.
- Parameters
verbosity (int) –
types of messages to be captured
0: no output
1: warnings
2: only general information
3: more verbose
4: debug messages
Default: 3.
- database
biogeme.database.Database object
- drawsProcessingTime
Time needed to generate the draws.
- estimate(recycle=False, bootstrap=0, algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)[source]
Estimate the parameters of the model.
- Parameters
recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix. If the number is 0, bootstrapping is not applied. Default: 0.
algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton algorithm with simple bounds.
algoParameters (dict) – parameters to transfer to the optimization algorithm
- Returns
object containing the estimation results.
- Return type
biogeme.results.bioResults
Example:
    # Create an instance of biogeme
    biogeme = bio.BIOGEME(database, logprob)
    # Give a name to the model
    biogeme.modelName = 'mymodel'
    # Estimate the parameters
    results = biogeme.estimate()
- Raises
biogemeError – if no expression has been provided for the likelihood
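The behaviour of the recycle argument can be pictured with a small stand-alone sketch. The function estimate_or_recycle, the file name and the fake estimation routine are hypothetical; only the rule "read the pickle file if it exists, otherwise estimate" is taken from the description above.

```python
import os
import pickle
import tempfile


def estimate_or_recycle(pickle_file, run_estimation, recycle=False):
    """Reuse pickled results when recycle is True and the file exists."""
    if recycle and os.path.exists(pickle_file):
        with open(pickle_file, 'rb') as handle:
            return pickle.load(handle)
    results = run_estimation()
    with open(pickle_file, 'wb') as handle:
        pickle.dump(results, handle)
    return results


calls = []


def fake_estimation():
    # Stand-in for an expensive estimation run.
    calls.append(1)
    return {'beta_time': -0.5}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'mymodel.pickle')
    first = estimate_or_recycle(path, fake_estimation, recycle=True)
    second = estimate_or_recycle(path, fake_estimation, recycle=True)
```

The second call returns the stored results without running the estimation again.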
- files_of_type(extension, all_files=False)[source]
Identify the list of files with a given extension in the local directory
- Parameters
extension (str) – extension of the requested files (without the dot): ‘pickle’, or ‘html’
all_files (bool) – if all_files is False, only files containing the name of the model are identified. If all_files is True, all files with the requested extension are identified.
- Returns
list of files with the requested extension.
- Return type
list(str)
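The selection rule can be sketched with the standard library. The function list_files_of_type is a hypothetical stand-in that takes the directory and the model name as explicit arguments, whereas the method above uses the local directory and the name stored in the object; the file names are made up.

```python
import os
import tempfile


def list_files_of_type(folder, model_name, extension, all_files=False):
    """List files in folder with the extension, optionally for one model."""
    suffix = '.' + extension
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith(suffix) and (all_files or model_name in name)
    )


with tempfile.TemporaryDirectory() as folder:
    for name in ('mymodel.pickle', 'mymodel~00.pickle', 'other.pickle',
                 'mymodel.html'):
        open(os.path.join(folder, name), 'w').close()
    mine = list_files_of_type(folder, 'mymodel', 'pickle')
    everything = list_files_of_type(folder, 'mymodel', 'pickle',
                                    all_files=True)
```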
- formulas
Dictionary containing Biogeme formulas of type biogeme.expressions.Expression. The keys are the names of the formulas.
- freeBetaNames()[source]
Returns the names of the parameters that must be estimated
- Returns
list of names of the parameters
- Return type
list(str)
- generateHtml
Boolean variable, True if the HTML file with the results must be generated.
- generatePickle
Boolean variable, True if the pickle file with the results must be generated.
- getBoundsOnBeta(betaName)[source]
Returns the bounds on the parameter as defined by the user.
- Parameters
betaName (string) – name of the parameter
- Returns
lower bound, upper bound
- Return type
tuple
- Raises
biogemeError – if the name of the parameter is not found.
- initLogLike
Initial value of the log likelihood function.
- lastSample
Keeps track of the sample of data used to calculate the stochastic gradient / hessian.
- likelihoodFiniteDifferenceHessian(x)[source]
Calculate the hessian of the log likelihood function using finite differences.
May be useful when the analytical hessian has numerical issues.
- Parameters
x (list(float)) – vector of values for the parameters.
- Returns
finite differences approximation of the hessian.
- Return type
numpy.array
- Raises
ValueError – if the length of the list x is incorrect
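A finite-difference hessian can be obtained by differencing the gradient, as in the sketch below. Whether Biogeme differences the gradient or the function itself, and which step size it uses, is not stated here, so the scheme and the names are assumptions.

```python
def finite_difference_hessian(gradient, x, step=1e-6):
    """Approximate the hessian by forward differences of the gradient."""
    g0 = gradient(x)
    n = len(x)
    hessian = [[0.0] * n for _ in range(n)]
    for j in range(n):
        shifted = list(x)
        shifted[j] += step
        gj = gradient(shifted)
        for i in range(n):
            hessian[i][j] = (gj[i] - g0[i]) / step
    # Symmetrize, since the exact hessian is symmetric.
    return [[0.5 * (hessian[i][j] + hessian[j][i]) for j in range(n)]
            for i in range(n)]


def gradient(x):
    # Gradient of f(x) = x0**2 + 3 * x0 * x1.
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


h = finite_difference_hessian(gradient, [1.0, 2.0])
```

For this quadratic function the exact hessian is [[2, 3], [3, 0]], and the approximation agrees with it up to rounding error.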
- loglike
Object of type biogeme.expressions.Expression calculating the formula for the loglikelihood.
- loglikeName
Keyword used for the name of the loglikelihood formula. Default: ‘loglike’
- loglikeSignatures
Internal signature of the formula for the loglikelihood.
- missingData
code for missing data
- modelName
Name of the model. Default: ‘biogemeModelDefaultName’
- monteCarlo
True if one of the expressions involves a Monte-Carlo integration.
- nullLogLike
Log likelihood of the null model
- numberOfDraws
Number of draws for Monte-Carlo integration.
- numberOfThreads
Number of threads used for parallel computing. Default: the number of available CPUs.
- optimizationMessages
Information provided by the optimization algorithm after completion.
- optimize(startingValues=None)[source]
Calls the optimization algorithm. The function self.algorithm is called.
- Parameters
startingValues (list(float)) – starting point for the algorithm
- Returns
x, messages
x is the solution generated by the algorithm,
messages is a dictionary reporting information about the run of the algorithm.
- Return type
numpy.array, dict(str:object)
- Raises
biogemeError – an error is raised if no algorithm is specified.
- quickEstimate(algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)[source]
Estimate the parameters of the model. Same as estimate, except that any extra calculation is skipped (initial log likelihood, t-statistics, etc.).
- Parameters
algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton algorithm with simple bounds.
algoParameters (dict) – parameters to transfer to the optimization algorithm
- Returns
object containing the estimation results.
- Return type
biogeme.results.bioResults
Example:
    # Create an instance of biogeme
    biogeme = bio.BIOGEME(database, logprob)
    # Give a name to the model
    biogeme.modelName = 'mymodel'
    # Estimate the parameters
    results = biogeme.quickEstimate()
- Raises
biogemeError – if no expression has been provided for the likelihood
- saveIterations
If True, the current iterate is saved after each iteration in a file named __[modelName].iter, where [modelName] is the name given to the model. If such a file exists, the starting values for the estimation are replaced by the values saved in the file.
- setRandomInitValues(defaultBound=100.0)[source]
Modifies the initial values of the parameters in all formulas, using randomly generated values. The value is drawn from a uniform distribution on the interval defined by the bounds.
- Parameters
defaultBound (float) – If the upper bound is missing, it is replaced by this value. If the lower bound is missing, it is replaced by the opposite of this value. Default: 100.
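The drawing rule can be sketched with the standard random module. The function random_init_value and the parameter names are hypothetical; a missing bound is represented by None.

```python
import random


def random_init_value(lower, upper, default_bound=100.0, rng=random):
    """Draw uniformly between the bounds, filling in missing bounds."""
    if lower is None:
        lower = -default_bound
    if upper is None:
        upper = default_bound
    return rng.uniform(lower, upper)


rng = random.Random(42)
# (lower, upper) bounds per parameter; None means the bound is missing.
bounds = {'beta_time': (None, 0.0), 'beta_cost': (None, None)}
init = {name: random_init_value(lo, hi, rng=rng)
        for name, (lo, hi) in bounds.items()}
```

With these bounds, beta_time is drawn from [-100, 0] and beta_cost from [-100, 100].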
- simulate(theBetaValues=None)[source]
Applies the formulas to each row of the database.
- Parameters
theBetaValues (dict(str, float)) – values of the parameters to be used in the calculations. If None, the default values are used. Default: None.
- Returns
a pandas data frame with the simulated value. Each row corresponds to a row in the database, and each column to a formula.
- Return type
Pandas data frame
Example:
    # Read the estimation results from a file
    results = res.bioResults(pickleFile='myModel.pickle')
    # Simulate the formulas using the nominal (estimated) values
    betaValues = results.getBetaValues()
    simulatedValues = biogeme.simulate(betaValues)
- Raises
biogemeError – if the number of parameters is incorrect
- userNotes
User notes
- validate(estimationResults, validationData)[source]
Perform out-of-sample validation.
The function performs the following tasks:
each slice defines a validation set (the slice itself) and an estimation set (the rest of the data),
the model is re-estimated on the estimation set,
the estimated model is applied on the validation set,
the value of the log likelihood for each observation is reported.
- Parameters
estimationResults (biogeme.results.bioResults) – results of the model estimation based on the full data.
validationData (list(tuple(pandas.DataFrame, pandas.DataFrame))) – list of estimation and validation data sets
- Returns
a list containing as many items as slices. Each item is the result of the simulation on the validation set.
- Return type
list(pandas.DataFrame)
- Raises
biogemeError – An error is raised if the database is structured as panel data.
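The structure expected for validationData, a list of (estimation set, validation set) pairs with one pair per slice, can be sketched with plain lists standing in for the pandas data frames. The function make_validation_data and the interleaved slicing rule are assumptions for illustration.

```python
def make_validation_data(rows, n_slices):
    """Return a list of (estimation, validation) pairs, one per slice."""
    slices = [rows[i::n_slices] for i in range(n_slices)]
    pairs = []
    for k, validation in enumerate(slices):
        # The estimation set is everything outside the current slice.
        estimation = [row for j, s in enumerate(slices) if j != k
                      for row in s]
        pairs.append((estimation, validation))
    return pairs


rows = list(range(10))
validation_data = make_validation_data(rows, n_slices=5)
```

Each observation appears in exactly one validation set, so every observation is predicted by a model that was estimated without it.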
- weight
Object of type biogeme.expressions.Expression calculating the weight of each observation in the sample.
- weightName
Keyword used for the name of the weight formula. Default: ‘weight’
- weightSignatures
Internal signature of the formula for the weight.
- biogeme.biogeme.logger = <biogeme.messaging.bioMessage object>
Logger that controls the output of messages to the screen and log file. Type: class biogeme.messaging.bioMessage.
- class biogeme.biogeme.negLikelihood(like, like_deriv, scaled)[source]
Bases: functionToMinimize
Provides the value of the function to be minimized, as well as its derivatives. To be used by the optimization package.
- batch
Value between 0 and 1 defining the size of the batch, that is, the share of the data that should be used to approximate the log likelihood.
- bhhhv
BHHH matrix
- f(batch=None)[source]
Calculate the value of the function
- Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
- Returns
value of the function
- Return type
float
- f_g(batch=None)[source]
Calculate the value of the function and the gradient
- Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
- Returns
value of the function and the gradient
- Return type
tuple float, numpy.array
- f_g_bhhh(batch=None)[source]
Calculate the value of the function, the gradient and the BHHH matrix
- Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
- Returns
value of the function, the gradient and the BHHH
- Return type
tuple float, numpy.array, numpy.array
- f_g_h(batch=None)[source]
Calculate the value of the function, the gradient and the Hessian
- Parameters
batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None
- Returns
value of the function, the gradient and the Hessian
- Return type
tuple float, numpy.array, numpy.array
- fv
value of the function
- gv
vector with the gradient
- hv
second derivatives matrix
- like
function calculating the log likelihood
- like_deriv
function calculating the log likelihood and its derivatives.
- recalculate
True if the log likelihood must be recalculated
- scaled
if True, the value of the log likelihood is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable.
- setVariables(x)[source]
Set the values of the variables for which the function has to be calculated.
- Parameters
x (numpy.array) – values
- x
Vector of unknown parameters values
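The role of this class, turning a log likelihood to be maximized into a function to be minimized, can be sketched as follows. NegLikelihoodSketch is a simplified, hypothetical stand-in: the call signatures assumed for like and like_deriv are illustrative, and the caching controlled by recalculate is omitted.

```python
class NegLikelihoodSketch:
    """Turn a log likelihood to maximize into a function to minimize."""

    def __init__(self, like, like_deriv, scaled):
        self.like = like              # x, scaled, batch -> log likelihood
        self.like_deriv = like_deriv  # x, scaled, batch -> (value, gradient)
        self.scaled = scaled
        self.x = None

    def setVariables(self, x):
        self.x = list(x)

    def f(self, batch=None):
        return -self.like(self.x, self.scaled, batch)

    def f_g(self, batch=None):
        value, gradient = self.like_deriv(self.x, self.scaled, batch)
        return -value, [-g for g in gradient]


# Toy concave "log likelihood" with its maximum at x = 3.
def like(x, scaled, batch):
    return -(x[0] - 3.0) ** 2


def like_deriv(x, scaled, batch):
    return like(x, scaled, batch), [-2.0 * (x[0] - 3.0)]


objective = NegLikelihoodSketch(like, like_deriv, scaled=False)
objective.setVariables([1.0])
value, gradient = objective.f_g()
```

Both the value and the gradient change sign, so a minimization algorithm applied to the objective moves toward the maximum of the log likelihood.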