Database
The database management module.
biogeme.database module
Implementation of the class Database, wrapping a pandas dataframe for specific services to Biogeme
 author
Michel Bierlaire
 date
Tue Mar 26 16:42:54 2019

class biogeme.database.Database(name, pandasDatabase)
Bases: object
Class that contains and prepares the database.

__init__(name, pandasDatabase)
Constructor
 Parameters
name (string) – name of the database.
pandasDatabase (pandas.DataFrame) – data stored in a pandas data frame.
 Raises
biogemeError – if the audit function detects errors.

addColumn(expression, column)
Add a new column in the database, calculated from an expression.
 Parameters
expression (biogeme.expressions.Expression) – expression to evaluate
column (string) – name of the column to add
 Returns
the added column
 Return type
numpy.Series
 Raises
ValueError – if the column name already exists.

buildPanelMap()
Sorts the data so that the observations of each individual are contiguous, and builds a map that identifies the range of indices of the observations of each individual.
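The sorting and mapping step can be pictured with a plain-Python sketch. It is not Biogeme's implementation: the function name `build_panel_map`, the list-of-dicts row format and the `(first, last)` tuple layout are assumptions made for illustration only.

```python
def build_panel_map(rows, id_column):
    """Sort rows by individual and map each id to its (first, last) row index."""
    # sorted() is stable, so the observations of one individual keep their order.
    ordered = sorted(rows, key=lambda row: row[id_column])
    panel_map = {}
    for index, row in enumerate(ordered):
        individual = row[id_column]
        # Keep the first index already recorded, extend the last index.
        first, _ = panel_map.get(individual, (index, index))
        panel_map[individual] = (first, index)
    return ordered, panel_map


rows = [
    {"ID": 2, "time": 10.0},
    {"ID": 1, "time": 12.5},
    {"ID": 2, "time": 11.0},
]
ordered, panel_map = build_panel_map(rows, "ID")
```

After the call, the two observations of individual 2 are adjacent and the map records the index range of each individual.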

checkAvailabilityOfChosenAlt(avail, choice)
Check if the chosen alternative is available for each entry in the database.
 Parameters
avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative.
choice (biogeme.expressions.Expression) – expression for the chosen alternative.
 Returns
numpy series of bool, with one entry per row of the database, containing True if the chosen alternative is available, False otherwise.
 Return type
numpy.Series
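A minimal plain-Python sketch of this check follows. It assumes, for illustration only, that availability has already been evaluated into 0/1 flags keyed by alternative id and that the choice is a list of alternative ids; the function name `chosen_alternative_is_available` is hypothetical and does not belong to Biogeme.

```python
def chosen_alternative_is_available(avail, choice):
    """Return one bool per row: True if the chosen alternative is available.

    avail  -- dict mapping each alternative id to a list of 0/1 flags (one per row)
    choice -- list with the id of the chosen alternative in each row
    """
    return [
        choice_id in avail and avail[choice_id][row] != 0
        for row, choice_id in enumerate(choice)
    ]


avail = {1: [1, 1, 0], 2: [1, 0, 1]}
choice = [1, 2, 1]
check = chosen_alternative_is_available(avail, choice)
```

Rows where the result is False are the ones that would typically need to be inspected or excluded before estimation.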

choiceAvailabilityStatistics(avail, choice)
Calculates the number of times each alternative is chosen and the number of times it is available.
 Parameters
avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative.
choice (biogeme.expressions.Expression) – expression for the chosen alternative.
 Returns
for each alternative, a tuple containing the number of times it is chosen and the number of times it is available.
 Return type
dict(int: (int, int))
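The returned structure can be illustrated with a short plain-Python tally. This is a sketch under stated assumptions, not the library's code: availability is taken as pre-evaluated 0/1 flags keyed by alternative id, and the name `choice_availability_statistics` is hypothetical.

```python
def choice_availability_statistics(avail, choice):
    """Return {alternative: (times chosen, times available)}.

    avail  -- dict mapping each alternative id to a list of 0/1 flags (one per row)
    choice -- list with the id of the chosen alternative in each row
    """
    stats = {}
    for alternative, flags in avail.items():
        chosen = sum(1 for c in choice if c == alternative)
        available = sum(1 for flag in flags if flag != 0)
        stats[alternative] = (chosen, available)
    return stats


avail = {1: [1, 1, 0, 1], 2: [1, 0, 1, 1]}
choice = [1, 1, 2, 2]
stats = choice_availability_statistics(avail, choice)
```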

count(columnName, value)
Counts the number of observations that have a specific value in a given column.
 Parameters
columnName (string) – name of the column.
value (float) – value that is sought.
 Returns
Number of times that the value appears in the column.
 Return type
int

data
Pandas data frame containing the data.

descriptionOfNativeDraws()
Describes the draws available natively in Biogeme.
 Returns
dict, where the keys are the names of the draws and the values are their descriptions.
Example of output:
{'UNIFORM': 'Uniform U[0, 1]', 'UNIFORM_ANTI': 'Antithetic uniform U[0, 1]', 'NORMAL': 'Normal N(0, 1) draws'}
 Return type
dict

dumpOnFile()
Dumps the database in a CSV formatted file.
 Returns
name of the file
 Return type
string

excludedData
Number of observations removed by the function biogeme.Database.remove().

fullData
Pandas data frame containing the full data. Useful when batches of the sample are used for approximating the log likelihood.

fullIndividualMap
Complete map identifying the range of observations for each individual in a panel data context. None if data is not panel. Useful when batches of the sample are used to approximate the log likelihood function.

generateDraws(types, names, numberOfDraws)
Generate draws for each variable.
 Parameters
types (dict) – A dict indexed by the names of the variables, describing the types of draws. Each of them can be a native type or any type defined by the function database.setRandomNumberGenerators
names (list of strings) – the list of names of the variables that require draws to be generated.
numberOfDraws (int) – number of draws to generate.
 Returns
a 3-dimensional table with draws. The 3 dimensions are
number of individuals
number of draws
number of variables
 Return type
numpy.array
Example:
    types = {'randomDraws1': 'NORMAL_MLHS_ANTI',
             'randomDraws2': 'UNIFORM_MLHS_ANTI',
             'randomDraws3': 'UNIFORMSYM_MLHS_ANTI'}
    theDrawsTable = myData.generateDraws(types,
                                         ['randomDraws1', 'randomDraws2', 'randomDraws3'],
                                         10)
 Raises
biogemeError – if a type of draw is unknown.
biogemeError – if the output of the draw generator does not have the requested dimensions.
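To make the layout of the returned table concrete, here is a plain-Python sketch that builds a table indexed as `table[individual][draw][variable]`. It only mirrors the three dimensions listed above. The function `generate_draws`, the generator names and the use of nested lists instead of a numpy array are assumptions for illustration; they are not Biogeme's draw generators.

```python
import random


def generate_draws(generators, names, sample_size, number_of_draws, seed=0):
    """Build a table indexed as table[individual][draw][variable]."""
    rng = random.Random(seed)
    return [
        [[generators[name](rng) for name in names] for _ in range(number_of_draws)]
        for _ in range(sample_size)
    ]


# Hypothetical generators: each takes a random.Random instance and returns one draw.
generators = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "uniform": lambda rng: rng.random(),
}
table = generate_draws(generators, ["normal", "uniform"],
                       sample_size=4, number_of_draws=10)
shape = (len(table), len(table[0]), len(table[0][0]))
```

Here `shape` follows the order given above: individuals first, then draws, then variables.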

getNumberOfObservations()
Reports the number of observations in the database.
Note that it returns the same value regardless of whether the database contains panel data or not.
 Returns
Number of observations.
 Return type
int
See also: getSampleSize()

getSampleSize()
Reports the size of the sample.
If the data is cross-sectional, it is the number of observations in the database. If the data is panel, it is the number of individuals.
 Returns
Sample size.
 Return type
int
See also: getNumberOfObservations()
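The difference between the two quantities can be sketched in plain Python. The helper names `number_of_observations` and `sample_size`, and the list-of-dicts row format, are assumptions for illustration and not part of Biogeme.

```python
def number_of_observations(rows):
    """Number of rows, whether or not the data is panel."""
    return len(rows)


def sample_size(rows, panel_column=None):
    """Number of rows for cross-sectional data, number of individuals for panel data."""
    if panel_column is None:
        return len(rows)
    return len({row[panel_column] for row in rows})


rows = [{"ID": 1}, {"ID": 1}, {"ID": 2}, {"ID": 2}, {"ID": 2}]
n_obs = number_of_observations(rows)
n_cross_sectional = sample_size(rows)
n_panel = sample_size(rows, panel_column="ID")
```

With five observations from two individuals, the sample size is 5 when the data is treated as cross-sectional and 2 when it is treated as panel.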

individualMap
Map identifying the range of observations for each individual in a panel data context. None if data is not panel.

isPanel()
Tells if the data is panel or not.
 Returns
True if the data is panel.
 Return type
bool

logger
Logger that controls the output of messages to the screen and log file. Type: class biogeme.messaging.bioMessage.

name
Name of the database. Used mainly for the file name when dumping data.

numberOfDraws
Number of draws generated by the function Database.generateDraws. Value 0 if this function is not called.

panel(columnName)
Defines the data as panel data.
 Parameters
columnName (string) – name of the column that identifies individuals.
 Raises
biogemeError – if the data are not sorted properly, that is, if the observations of any one individual are not consecutive.
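The sorting requirement amounts to checking that each identifier appears in a single consecutive block. A minimal plain-Python sketch of such a check, with the hypothetical name `individuals_are_contiguous` (not a Biogeme function):

```python
def individuals_are_contiguous(ids):
    """Return True if all rows of each individual are consecutive."""
    seen = set()
    previous = object()  # sentinel that equals no real identifier
    for current in ids:
        if current != previous:
            if current in seen:
                # This individual already appeared in an earlier block.
                return False
            seen.add(current)
            previous = current
    return True


sorted_ok = individuals_are_contiguous([1, 1, 2, 2, 3])
not_sorted = individuals_are_contiguous([1, 2, 1])
```

In the second call, individual 1 reappears after individual 2, which is the situation the error above describes.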

panelColumn
Name of the column identifying the individuals in a panel data context. None if data is not panel.

remove(expression)
Removes from the database all entries such that the value of the expression is not 0.
 Parameters
expression (biogeme.expressions.Expression) – expression to evaluate
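The rule "remove when the expression is not 0" can be sketched in plain Python as follows. The expression is modelled as an ordinary function of a row, and the names `remove_entries` and `exclude`, as well as the columns PURPOSE and AGE, are hypothetical examples.

```python
def remove_entries(rows, expression):
    """Keep the rows where the expression evaluates to 0; count the others."""
    kept = [row for row in rows if expression(row) == 0]
    excluded = len(rows) - len(kept)
    return kept, excluded


def exclude(row):
    # Non-zero (here 1) when the row must be removed.
    return int(row["PURPOSE"] != 1 or row["AGE"] < 18)


rows = [
    {"PURPOSE": 1, "AGE": 30},
    {"PURPOSE": 3, "AGE": 45},
    {"PURPOSE": 1, "AGE": 6},
]
kept, excluded = remove_entries(rows, exclude)
```

The count of removed rows plays the role that the excludedData attribute has in the class.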

sampleIndividualMapWithReplacement(size=None)
Extract a random sample of the individual map from a panel data database, with replacement.
Useful for bootstrapping.
 Parameters
size (int) – size of the sample. If None, a sample of the same size as the database will be generated. Default: None.
 Returns
pandas dataframe with the sample.
 Return type
pandas.DataFrame
 Raises
biogemeError – if the database is not in panel mode.
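For panel data, bootstrapping resamples individuals rather than single observations. The following plain-Python sketch assumes, for illustration, a map from individual id to a `(first, last)` index range; the function name, the `seed` argument and the use of ValueError instead of biogemeError are assumptions, not Biogeme's API.

```python
import random


def sample_individuals_with_replacement(panel_map, size=None, seed=None):
    """Draw individuals with replacement and return (id, first, last) tuples."""
    if panel_map is None:
        raise ValueError("the data is not in panel mode")
    rng = random.Random(seed)
    individuals = list(panel_map)
    if size is None:
        size = len(individuals)
    drawn = rng.choices(individuals, k=size)
    return [(individual, *panel_map[individual]) for individual in drawn]


panel_map = {1: (0, 1), 2: (2, 4), 3: (5, 5)}
sample = sample_individuals_with_replacement(panel_map, seed=42)
```

Each drawn individual carries its whole range of observations, so an individual selected twice contributes all of its rows twice.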

sampleWithReplacement(size=None)
Extract a random sample from the database, with replacement.
Useful for bootstrapping.
 Parameters
size (int) – size of the sample. If None, a sample of the same size as the database will be generated. Default: None.
 Returns
pandas dataframe with the sample.
 Return type
pandas.DataFrame
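A bootstrap sample of this kind can be sketched with the standard library. This is an illustration only: rows are held in a plain list, and the function name `sample_with_replacement` and its `seed` argument are assumptions.

```python
import random


def sample_with_replacement(rows, size=None, seed=None):
    """Draw `size` rows with replacement (default: as many as in `rows`)."""
    rng = random.Random(seed)
    if size is None:
        size = len(rows)
    return rng.choices(rows, k=size)


rows = list(range(10))
bootstrap = sample_with_replacement(rows, seed=123)
larger = sample_with_replacement(rows, size=25, seed=123)
```

Because the draws are made with replacement, some rows usually appear several times and others not at all, and the requested size may exceed the size of the database.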

sampleWithoutReplacement(samplingRate, columnWithSamplingWeights=None)
Replace the data set by a sample for stochastic algorithms.
 Parameters
samplingRate (float) – the proportion of data to include in the sample.
columnWithSamplingWeights (string) – name of the column with the sampling weights. If None, each row has equal probability.
 Raises
biogemeError – if the structure of the database has been modified since last sample.
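The equal-probability case can be sketched in plain Python. The sketch deliberately ignores sampling weights and returns the sample instead of replacing the data set; the function name and the `seed` argument are assumptions for illustration.

```python
import random


def sample_without_replacement(rows, sampling_rate, seed=None):
    """Return a sample containing a proportion `sampling_rate` of the rows."""
    rng = random.Random(seed)
    size = round(len(rows) * sampling_rate)
    # random.sample never selects the same position twice.
    return rng.sample(rows, size)


rows = list(range(20))
batch = sample_without_replacement(rows, 0.25, seed=1)
```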

scaleColumn(column, scale)
Multiply an entire column by a scale value
 Parameters
column (string) – name of the column
scale (float) – value of the scale. All values of the column will be multiplied by that scale.

setRandomNumberGenerators(rng)
Defines user-defined random number generators.
 Parameters
rng (dict) – a dictionary of generators. The keys of the dictionary are the names of the generators, and must be different from the predefined generators in Biogeme: NORMAL, UNIFORM and UNIFORMSYM. Each element of the dictionary is a tuple with two elements: the function generating the draws, and a string describing the type of draws. The function takes two arguments: the number of series to generate (typically, the size of the database), and the number of draws per series.
Example:
    import numpy as np

    def logNormalDraws(sampleSize, numberOfDraws):
        # Exponential of standard normal draws
        return np.exp(np.random.randn(sampleSize, numberOfDraws))

    def exponentialDraws(sampleSize, numberOfDraws):
        # Inverse transform: -log(U), with U uniform on [0, 1)
        return -1.0 * np.log(np.random.rand(sampleSize, numberOfDraws))

    # We associate these functions with a name
    rng = {'LOGNORMAL': (logNormalDraws, 'Draws from lognormal distribution'),
           'EXP': (exponentialDraws, 'Draws from exponential distributions')}
    myData.setRandomNumberGenerators(rng)
 Raises
ValueError – if a reserved keyword is used for a user-defined draw.
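The reserved-keyword check can be pictured as follows. This sketch only covers the three names listed above (the actual set of reserved names in Biogeme may be larger), and `check_generator_names` is a hypothetical helper, not a Biogeme function.

```python
# Assumption: only the three predefined names mentioned in the text are reserved.
RESERVED = {"NORMAL", "UNIFORM", "UNIFORMSYM"}


def check_generator_names(rng):
    """Raise ValueError if a user-defined generator uses a reserved name."""
    for name in rng:
        if name in RESERVED:
            raise ValueError(f"{name} is a reserved keyword for draws")
    return dict(rng)


accepted = check_generator_names({"LOGNORMAL": (None, "Draws from lognormal distribution")})

try:
    check_generator_names({"NORMAL": (None, "My own normal draws")})
    rejected = False
except ValueError:
    rejected = True
```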

split(slices)
Prepare estimation and validation sets for validation.
 Parameters
slices (int) – number of slices
 Returns
list of estimation and validation data sets
 Return type
list(tuple(pandas.DataFrame, pandas.DataFrame))
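The shape of the returned list can be illustrated with a plain-Python sketch of a k-fold style split. How Biogeme actually assigns rows to slices (for example, whether it shuffles first) is not stated here, so the interleaved assignment below is an assumption, as is the function name `split_rows`.

```python
def split_rows(rows, slices):
    """Return a list of (estimation, validation) pairs, one per slice."""
    # Assumption: rows are assigned to slices in an interleaved fashion.
    folds = [rows[i::slices] for i in range(slices)]
    result = []
    for i, validation in enumerate(folds):
        estimation = [row for j, fold in enumerate(folds) if j != i for row in fold]
        result.append((estimation, validation))
    return result


rows = list(range(10))
pairs = split_rows(rows, 5)
```

Each pair uses one slice for validation and the remaining slices for estimation, so every row is used for validation exactly once.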

suggestScaling(columns=None, reportAll=False)
Suggest a scaling of the variables in the database.
For each column, \(\delta\) is the difference between the largest and the smallest value, or one if the difference is smaller than one. The order of magnitude is evaluated as a power of 10. The suggested scale is the inverse of this value.
\[s = \frac{1}{10^{\lfloor \log_{10} \delta \rceil}}\]
where \(\lfloor x \rceil\) is the integer closest to \(x\).
 Parameters
columns (list(str)) – list of columns to be considered. If None, all of them will be considered.
reportAll (bool) – if False, remove entries where the suggested scale is 1, 0.1 or 10
 Returns
A Pandas dataframe where each row contains the name of the variable and the suggested scale s. Ideally, the column should be multiplied by s.
 Return type
pandas.DataFrame
 Raises
biogemeError – if a variable in columns is unknown.
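The scaling rule described above can be worked through for a single column in plain Python. This is a sketch of the formula only, not Biogeme's code, and the function name `suggested_scale` is hypothetical.

```python
import math


def suggested_scale(values):
    """Inverse of the power of 10 closest to the range of the values."""
    # delta is the range of the column, or one if the range is smaller than one.
    delta = max(max(values) - min(values), 1.0)
    return 1.0 / 10 ** round(math.log10(delta))


scale_cost = suggested_scale([0.0, 2500.0])   # range 2500, log10 is about 3.4
scale_share = suggested_scale([0.1, 0.4])     # range below one, delta is 1
```

For a column ranging from 0 to 2500, the logarithm rounds to 3, so the suggested scale is 0.001; a column whose range is below one is left unscaled.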

sumFromDatabase(expression)
Calculates the value of an expression for each entry in the database, and returns the sum.
 Parameters
expression (biogeme.expressions.Expression) – expression to evaluate
 Returns
sum of the expressions over the database.
 Return type
float
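The relationship between evaluating an expression row by row and summing the results can be sketched in plain Python. Expressions are modelled as ordinary functions of a row, and the names `values_from_database` and `sum_from_database` are illustrative stand-ins, not the Biogeme methods themselves.

```python
def values_from_database(rows, expression):
    """Evaluate the expression once per row."""
    return [expression(row) for row in rows]


def sum_from_database(rows, expression):
    """Sum of the expression over all rows."""
    return sum(values_from_database(rows, expression))


rows = [{"TIME": 10.0, "COST": 2.0}, {"TIME": 20.0, "COST": 3.5}]
total_cost = sum_from_database(rows, lambda row: row["COST"])
weighted = values_from_database(rows, lambda row: row["TIME"] * row["COST"])
```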

theDraws
Draws for Monte Carlo integration

typesOfDraws
Types of draws for Monte Carlo integration

userRandomNumberGenerators
Dictionary containing user-defined random number generators. Defined by the function Database.setRandomNumberGenerators, which checks that reserved keywords are not used. Each element of the dictionary is a tuple with two elements: (0) the function generating the draws, and (1) a string describing the type of draws.

valuesFromDatabase(expression)
Evaluates an expression for each entry of the database.
 Parameters
expression (biogeme.expressions.Expression) – expression to evaluate
 Returns
numpy series, with one entry per row of the database, containing the calculated quantities.
 Return type
numpy.Series

variables
Names of the headers of the database so that they can be used as an object of type biogeme.expressions.Expression. Initialized by _generateHeaders()
