biogeme.database module
Module managing the data and the interactions with Pandas.
Implementation of the class Database, wrapping a pandas dataframe for specific services to Biogeme
- author:
Michel Bierlaire
- date:
Tue Mar 26 16:42:54 2019
- class biogeme.database.Database(name, pandas_database)[source]
Bases:
object
Class that contains and prepares the database.
- Parameters:
name (str)
pandas_database (pd.DataFrame)
- DefineVariable(name, expression)[source]
Warning
This function is deprecated. Use
define_variable()
instead.
- Return type:
- Parameters:
name (str)
expression (Expression)
- addColumn(expression, column)[source]
Warning
This function is deprecated. Use
add_column()
instead.
- Return type:
Series
- Parameters:
expression (Expression)
column (str)
- add_column(expression, column)[source]
Add a new column in the database, calculated from an expression.
- Parameters:
expression (biogeme.expressions.Expression) – expression to evaluate
column (string) – name of the column to add
- Returns:
the added column
- Return type:
pandas.Series
- Raises:
ValueError – if the column name already exists.
BiogemeError – if the database is empty.
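The behaviour described above can be illustrated with plain pandas. This is only a sketch under assumptions: `add_column_sketch` is a hypothetical helper, pandas arithmetic stands in for a Biogeme expression, and `ValueError` stands in for `BiogemeError`.

```python
import pandas as pd


def add_column_sketch(df: pd.DataFrame, values: pd.Series, column: str) -> pd.Series:
    """Add `values` to `df` under the name `column`, refusing to overwrite."""
    if df.empty:
        # The documented method raises BiogemeError; ValueError stands in here.
        raise ValueError("the database is empty")
    if column in df.columns:
        raise ValueError(f"column {column} already exists")
    df[column] = values
    return df[column]


data = pd.DataFrame({"time": [30.0, 45.0], "cost": [2.0, 3.5]})
# Plain pandas arithmetic plays the role of the Biogeme expression.
generalized = add_column_sketch(data, data["cost"] + 0.5 * data["time"], "gen_cost")

# A second call with the same column name is rejected.
try:
    add_column_sketch(data, data["cost"], "gen_cost")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```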
- buildPanelMap()[source]
Warning
This function is deprecated. Use
build_panel_map()
instead.
- Return type:
None
- build_panel_map()[source]
Sorts the data so that the observations for each individual are contiguous, and builds a map that identifies the range of indices of the observations of each individual.
- Return type:
None
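A minimal pandas sketch of the idea, not the library's implementation: `build_panel_map_sketch` and the column names `first` and `last` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def build_panel_map_sketch(df: pd.DataFrame, panel_column: str):
    """Sort rows so each individual's observations are contiguous, then
    record the first and last row position of each individual."""
    # A stable sort keeps the original order of observations within an individual.
    sorted_df = df.sort_values(by=panel_column, kind="stable").reset_index(drop=True)
    positions = pd.Series(
        np.arange(len(sorted_df)), index=sorted_df[panel_column].to_numpy()
    )
    grouped = positions.groupby(level=0)
    panel_map = pd.DataFrame({"first": grouped.min(), "last": grouped.max()})
    return sorted_df, panel_map


data = pd.DataFrame({"person": [2, 1, 2, 1, 3], "choice": [1, 0, 1, 1, 0]})
sorted_data, individual_map = build_panel_map_sketch(data, "person")
```

Here `individual_map` has one row per individual, giving the inclusive range of row positions in `sorted_data`.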
- checkAvailabilityOfChosenAlt(avail, choice)[source]
Warning
This function is deprecated. Use
check_availability_of_chosen_alt()
instead.
- Return type:
Series
- Parameters:
avail (dict[int, Expression])
choice (Expression)
- check_availability_of_chosen_alt(avail, choice)[source]
Check if the chosen alternative is available for each entry in the database.
- Parameters:
avail (dict of int: biogeme.expressions.Expression) – dict associating the ID of each alternative with the expression that evaluates its availability condition.
choice (biogeme.expressions.Expression) – expression for the chosen alternative.
- Returns:
pandas series of bool, with one entry per row of the database, containing True if the chosen alternative is available, False otherwise.
- Return type:
pandas.Series
- Raises:
BiogemeError – if the chosen alternative does not appear in the availability dict
BiogemeError – if the database is empty.
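The check can be pictured with plain pandas columns in place of Biogeme expressions. This is a sketch under that assumption; the function name and the column names are hypothetical, and `ValueError` stands in for `BiogemeError`.

```python
import pandas as pd


def chosen_alt_available_sketch(df, avail_columns, choice_column):
    """avail_columns maps each alternative ID to a 0/1 availability column."""
    unknown = ~df[choice_column].isin(list(avail_columns))
    if unknown.any():
        raise ValueError("a chosen alternative is missing from the availability dict")
    result = pd.Series(False, index=df.index)
    for alt_id, column in avail_columns.items():
        # True where this alternative is chosen and its availability is non-zero.
        result |= (df[choice_column] == alt_id) & (df[column] != 0)
    return result


data = pd.DataFrame(
    {"choice": [1, 2, 2, 1], "av_1": [1, 1, 0, 0], "av_2": [1, 0, 1, 1]}
)
is_available = chosen_alt_available_sketch(data, {1: "av_1", 2: "av_2"}, "choice")
```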
- check_segmentation(segmentation_tuple)[source]
Check that the segmentation covers the complete database
- Parameters:
segmentation_tuple (DiscreteSegmentationTuple) – object describing the segmentation
- Returns:
number of observations per segment.
- Return type:
dict(str: int)
- choiceAvailabilityStatistics(avail, choice)[source]
Warning
This function is deprecated. Use
choice_availability_statistics()
instead.
- Return type:
dict[tuple[int, int]]
- Parameters:
avail (dict[int, Expression])
choice (Expression)
- choice_availability_statistics(avail, choice)[source]
Calculates the number of times each alternative is chosen and the number of times it is available.
- Parameters:
avail (dict of int: biogeme.expressions.Expression) – dict associating the ID of each alternative with the expression that evaluates its availability condition.
choice (biogeme.expressions.Expression) – expression for the chosen alternative.
- Returns:
for each alternative, a tuple containing the number of times it is chosen, and the number of times it is available.
- Return type:
dict(int: (int, int))
- Raises:
BiogemeError – if the database is empty.
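As a rough illustration of the returned structure, the counts can be reproduced with pandas columns standing in for the expressions. The helper name and column names below are assumptions.

```python
import pandas as pd


def choice_availability_counts(df, avail_columns, choice_column):
    """Return {alternative ID: (times chosen, times available)}."""
    return {
        alt_id: (
            int((df[choice_column] == alt_id).sum()),
            int((df[column] != 0).sum()),
        )
        for alt_id, column in avail_columns.items()
    }


data = pd.DataFrame(
    {"choice": [1, 2, 2, 1], "av_1": [1, 1, 0, 0], "av_2": [1, 0, 1, 1]}
)
stats = choice_availability_counts(data, {1: "av_1", 2: "av_2"}, "choice")
```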
- count(column_name, value)[source]
Counts the number of observations that have a specific value in a given column.
- Parameters:
column_name (string) – name of the column.
value (float) – value that is searched.
- Returns:
Number of times that the value appears in the column.
- Return type:
int
- data
Pandas data frame containing the data.
- define_variable(name, expression)[source]
Insert a new column in the database and define it as a variable.
- Return type:
- Parameters:
name (str)
expression (Expression)
- descriptionOfNativeDraws()[source]
Warning
This function is deprecated. Use
description_of_native_draws()
instead.
- dumpOnFile()[source]
Warning
This function is deprecated. Use
dump_on_file()
instead.
- Return type:
str
- dump_on_file()[source]
Dumps the database in a CSV formatted file.
- Returns:
name of the file
- Return type:
string
- excludedData
Number of observations removed by the function
biogeme.Database.remove()
- extract_rows(a_range)[source]
Create a database object using only some rows
- Parameters:
a_range (Iterable[int]) – specify the desired range of rows.
- Return type:
- Returns:
the reduced database
- fullData
Pandas data frame containing the full data. Useful when batches of the sample are used for approximating the log likelihood.
- fullIndividualMap
complete map identifying the range of observations for each individual in a panel data context. None if data is not panel. Useful when batches of the sample are used to approximate the log likelihood function.
- generateDraws(types, names, number_of_draws)[source]
Warning
This function is deprecated. Use
generate_draws()
instead.
- Return type:
ndarray
- Parameters:
types (dict[str, RandomNumberGeneratorTuple])
names (list[str])
number_of_draws (int)
- generateFlatPanelDataframe(save_on_file=False, identical_columns=None)[source]
Warning
This function is deprecated. Use
generate_flat_panel_dataframe()
instead.
- Return type:
DataFrame
- Parameters:
save_on_file (bool)
identical_columns (list[str] | None)
- generate_draws(draw_types, names, number_of_draws)[source]
Generate draws for each variable.
- Parameters:
draw_types (dict) – A dict indexed by the names of the variables, describing the draws. Each of them can be a native type or any type defined by the function set_random_number_generators().
Native types:
'UNIFORM': Uniform U[0, 1],
'UNIFORM_ANTI': Antithetic uniform U[0, 1],
'UNIFORM_HALTON2': Halton draws with base 2, skipping the first 10,
'UNIFORM_HALTON3': Halton draws with base 3, skipping the first 10,
'UNIFORM_HALTON5': Halton draws with base 5, skipping the first 10,
'UNIFORM_MLHS': Modified Latin Hypercube Sampling on [0, 1],
'UNIFORM_MLHS_ANTI': Antithetic Modified Latin Hypercube Sampling on [0, 1],
'UNIFORMSYM': Uniform U[-1, 1],
'UNIFORMSYM_ANTI': Antithetic uniform U[-1, 1],
'UNIFORMSYM_HALTON2': Halton draws on [-1, 1] with base 2, skipping the first 10,
'UNIFORMSYM_HALTON3': Halton draws on [-1, 1] with base 3, skipping the first 10,
'UNIFORMSYM_HALTON5': Halton draws on [-1, 1] with base 5, skipping the first 10,
'UNIFORMSYM_MLHS': Modified Latin Hypercube Sampling on [-1, 1],
'UNIFORMSYM_MLHS_ANTI': Antithetic Modified Latin Hypercube Sampling on [-1, 1],
'NORMAL': Normal N(0, 1) draws,
'NORMAL_ANTI': Antithetic normal draws,
'NORMAL_HALTON2': Normal draws from Halton base 2 sequence,
'NORMAL_HALTON3': Normal draws from Halton base 3 sequence,
'NORMAL_HALTON5': Normal draws from Halton base 5 sequence,
'NORMAL_MLHS': Normal draws from Modified Latin Hypercube Sampling,
'NORMAL_MLHS_ANTI': Antithetic normal draws from Modified Latin Hypercube Sampling.
For an updated description of the native types, call the function description_of_native_draws().
names (list of strings) – the list of names of the variables that require draws to be generated.
number_of_draws (int) – number of draws to generate.
- Returns:
a 3-dimensional table with draws. The 3 dimensions are
number of individuals
number of draws
number of variables
- Return type:
numpy.array
Example:
types = {
    'randomDraws1': 'NORMAL_MLHS_ANTI',
    'randomDraws2': 'UNIFORM_MLHS_ANTI',
    'randomDraws3': 'UNIFORMSYM_MLHS_ANTI',
}
theDrawsTable = my_data.generate_draws(
    types, ['randomDraws1', 'randomDraws2', 'randomDraws3'], 10
)
- Raises:
BiogemeError – if a type of draw is unknown.
BiogemeError – if the output of the draw generator does not have the requested dimensions.
- Parameters:
draw_types (dict[str, str])
names (list[str])
number_of_draws (int)
- Return type:
ndarray
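To make the shape of the returned table concrete, here is a small NumPy sketch that does not use Biogeme. The registry `GENERATORS`, the helper names, and the particular antithetic construction (mirroring the first half of the draws) are assumptions for illustration, not a description of how the library generates its draws.

```python
import numpy as np


def uniform_anti(sample_size, number_of_draws, rng):
    """Antithetic uniform draws: the second half of each row mirrors the first."""
    if number_of_draws % 2 != 0:
        raise ValueError("this sketch needs an even number of draws")
    half = rng.random((sample_size, number_of_draws // 2))
    return np.concatenate([half, 1.0 - half], axis=1)


def normal(sample_size, number_of_draws, rng):
    """Independent standard normal draws."""
    return rng.standard_normal((sample_size, number_of_draws))


# Hypothetical registry: type name -> generator function.
GENERATORS = {"UNIFORM_ANTI": uniform_anti, "NORMAL": normal}


def draws_table_sketch(draw_types, names, sample_size, number_of_draws, seed=0):
    rng = np.random.default_rng(seed)
    per_variable = []
    for name in names:
        generator = GENERATORS[draw_types[name]]
        draws = generator(sample_size, number_of_draws, rng)
        if draws.shape != (sample_size, number_of_draws):
            raise ValueError(f"generator for {name} returned shape {draws.shape}")
        per_variable.append(draws)
    # Axis order: individuals, draws, variables.
    return np.stack(per_variable, axis=2)


types = {"xi_time": "NORMAL", "xi_cost": "UNIFORM_ANTI"}
table = draws_table_sketch(types, ["xi_time", "xi_cost"], sample_size=5, number_of_draws=10)
```

The last axis follows the order of `names`, so `table[:, :, 1]` holds the draws for `xi_cost`.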
- generate_flat_panel_dataframe(save_on_file=False, identical_columns=None)[source]
Generate a flat version of the panel data
- Parameters:
save_on_file (bool) – if True, the flat database is saved on file.
identical_columns (tuple(str)) – tuple of columns that contain the same values for all observations of the same individual. Default: empty list.
- Returns:
the flattened database, in Pandas format
- Return type:
pandas.DataFrame
- Raises:
BiogemeError – if the database is not panel
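One way to picture a flat panel layout is one row per individual, with the observation number added to the column names. The pandas sketch below assumes a hypothetical naming scheme (`1_time`, `2_time`, ...) and is not taken from the library's source.

```python
import pandas as pd


def flatten_panel_sketch(df, id_column, identical_columns=()):
    """Return one row per individual; varying columns are repeated per observation."""
    identical_columns = list(identical_columns)
    varying = [c for c in df.columns if c != id_column and c not in identical_columns]
    # Number the observations of each individual: 1, 2, 3, ...
    numbered = df.assign(_obs=df.groupby(id_column).cumcount() + 1)
    wide = numbered.pivot(index=id_column, columns="_obs", values=varying)
    wide.columns = [f"{obs}_{name}" for name, obs in wide.columns]
    # Columns that are constant within an individual are kept once.
    fixed = (
        df[[id_column, *identical_columns]]
        .drop_duplicates(subset=id_column)
        .set_index(id_column)
    )
    return fixed.join(wide)


data = pd.DataFrame(
    {"person": [1, 1, 2], "age": [30, 30, 40], "time": [10.0, 12.0, 20.0]}
)
flat = flatten_panel_sketch(data, "person", identical_columns=["age"])
```

Individuals with fewer observations than the maximum get missing values in the unused columns.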
- generate_segmentation(variable, mapping=None, reference=None)[source]
Generate a segmentation tuple for a variable.
- Parameters:
variable (biogeme.expressions.Variable or string) – Variable object or name of the variable
mapping (dict(int: str)) – mapping associating values of the variable to names. If incomplete, default names are provided.
reference (str) – name of the reference category. If None, an arbitrary category is selected as reference.
- Return type:
- getNumberOfObservations()[source]
Warning
This function is deprecated. Use
get_number_of_observations()
instead.
- Return type:
int
- getSampleSize()[source]
Warning
This function is deprecated. Use
get_sample_size()
instead.
- Return type:
int
- get_number_of_observations()[source]
Reports the number of observations in the database.
Note that it returns the same value whether or not the database contains panel data.
- Returns:
Number of observations.
- Return type:
int
See also: get_sample_size()
- get_sample_size()[source]
Reports the size of the sample.
If the data is cross-sectional, it is the number of observations in the database. If the data is panel, it is the number of individuals.
- Returns:
Sample size.
- Return type:
int
See also: get_number_of_observations()
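The distinction between the two quantities can be shown with plain pandas. The column name `person` is an assumption for this sketch.

```python
import pandas as pd

data = pd.DataFrame(
    {"person": [1, 1, 2, 3, 3, 3], "choice": [1, 2, 1, 1, 2, 2]}
)

# Number of observations: one per row, panel or not.
number_of_observations = len(data)

# Sample size for cross-sectional data: again one per row.
sample_size_cross_sectional = len(data)

# Sample size for panel data: one per individual.
sample_size_panel = data["person"].nunique()
```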
- individualMap
map identifying the range of observations for each individual in a panel data context. None if data is not panel.
- isPanel()[source]
Warning
This function is deprecated. Use
is_panel()
instead.
- Return type:
bool
- is_panel()[source]
Tells if the data is panel or not.
- Returns:
True if the data is panel.
- Return type:
bool
- mdcev_count(list_of_columns, new_column)[source]
For the MDCEV models, we calculate the number of alternatives that are chosen, that is the number of columns with a non-zero entry.
- Parameters:
list_of_columns (list[str]) – list of columns containing the quantity of each good.
new_column (str) – name of the new column where the result is stored
- Return type:
None
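The calculation amounts to counting non-zero entries across the listed columns. A pandas sketch, with hypothetical column names:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "food": [2.0, 0.0, 1.5],
        "leisure": [0.0, 0.0, 3.0],
        "travel": [1.0, 0.0, 0.5],
    }
)
goods = ["food", "leisure", "travel"]

# For each row, count the goods consumed in a non-zero quantity.
data["n_chosen"] = (data[goods] != 0).sum(axis=1)
```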
- mdcev_row_split(a_range=None)[source]
For the MDCEV model, we generate a list of Database objects, each of them associated with a different row of the database.
- Parameters:
a_range (Optional[Iterable[int]]) – specify the desired range of rows.
- Return type:
list[Database]
- Returns:
list of rows, each in a Database format
- name
Name of the database. Used mainly for the file name when dumping data.
- number_of_draws
Number of draws generated by the function Database.generate_draws. Value 0 if this function is not called.
- panel(column_name)[source]
Defines the data as panel data
- Parameters:
column_name (string) – name of the column that identifies individuals.
- Raises:
BiogemeError – if the data are not sorted properly, that is, if the observations of an individual are not consecutive.
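The sorting condition can be tested with a short pandas check. This is a sketch of the condition only; `observations_are_contiguous` is a hypothetical helper, not part of the library.

```python
import pandas as pd


def observations_are_contiguous(ids: pd.Series) -> bool:
    """True if all rows of each individual form one uninterrupted block."""
    # A new block starts wherever the identifier differs from the previous row.
    number_of_blocks = int((ids != ids.shift()).sum())
    return number_of_blocks == ids.nunique()


sorted_ids = pd.Series([1, 1, 2, 2, 2, 3])
scattered_ids = pd.Series([1, 2, 1, 3])

sorted_ok = observations_are_contiguous(sorted_ids)
scattered_ok = observations_are_contiguous(scattered_ids)
```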
- panelColumn
Name of the column identifying the individuals in a panel data context. None if data is not panel.
- remove(expression)[source]
Removes from the database all entries such that the value of the expression is not 0.
- Parameters:
expression (biogeme.expressions.Expression) – expression to evaluate
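Note the direction of the condition: rows where the expression is non-zero are the ones removed. A pandas sketch, with a boolean comparison standing in for the Biogeme expression and a hypothetical `age` column:

```python
import pandas as pd

data = pd.DataFrame({"age": [15, 32, 47, 12], "choice": [1, 2, 1, 2]})

# Exclusion condition evaluated for each row: non-zero means "remove".
exclude = (data["age"] < 18).astype(int)

kept = data[exclude == 0]
# Analogue of the number of excluded observations.
excluded_count = int((exclude != 0).sum())
```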
- sampleIndividualMapWithReplacement(size=None)[source]
Warning
This function is deprecated. Use
sample_individual_map_with_replacement()
instead.
- Return type:
DataFrame
- Parameters:
size (int | None)
- sampleWithReplacement(size=None)[source]
Warning
This function is deprecated. Use
sample_with_replacement()
instead.
- Return type:
DataFrame
- Parameters:
size (int | None)
- sample_individual_map_with_replacement(size=None)[source]
Extract a random sample of the individual map from a panel data database, with replacement.
Useful for bootstrapping.
- Parameters:
size (int) – size of the sample. If None, a sample of the same size as the database will be generated. Default: None.
- Returns:
pandas dataframe with the sample.
- Return type:
pandas.DataFrame
- Raises:
BiogemeError – if the database is not in panel mode.
- sample_with_replacement(size=None)[source]
Extract a random sample from the database, with replacement.
Useful for bootstrapping.
- Parameters:
size (int) – size of the sample. If None, a sample of the same size as the database will be generated. Default: None.
- Returns:
pandas dataframe with the sample.
- Return type:
pandas.DataFrame
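Sampling with replacement on a data frame can be sketched with `DataFrame.sample`. The wrapper below is a hypothetical stand-in that mirrors the documented default for `size`; the `seed` argument is an addition for reproducibility.

```python
import pandas as pd


def sample_with_replacement_sketch(df, size=None, seed=None):
    """Draw `size` rows with replacement; default is the size of the data."""
    if size is None:
        size = len(df)
    return df.sample(n=size, replace=True, random_state=seed)


data = pd.DataFrame({"choice": [1, 2, 1, 3], "cost": [2.0, 3.5, 1.0, 4.0]})
bootstrap = sample_with_replacement_sketch(data, seed=42)
small = sample_with_replacement_sketch(data, size=2, seed=42)
```

In a bootstrap, the model would be re-estimated on each such sample to obtain an empirical distribution of the estimates.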
- scaleColumn(column, scale)[source]
Warning
This function is deprecated. Use
scale_column()
instead.
- Parameters:
column (str)
scale (float)
- scale_column(column, scale)[source]
Multiply an entire column by a scale value
- Parameters:
column (string) – name of the column
scale (float) – value of the scale. All values of the column will be multiplied by that scale.
- setRandomNumberGenerators(rng)[source]
Warning
This function is deprecated. Use
set_random_number_generators()
instead.
- Parameters:
rng (dict[str, tuple[Callable[[int, int], ndarray], str]])
- set_random_number_generators(rng)[source]
Defines user-defined random number generators.
- Parameters:
rng (dict) – a dictionary of generators. The keys of the dictionary are the names of the generators, and must be different from the pre-defined generators in Biogeme (see generate_draws() for the list). The elements of the dictionary are tuples. The first element of each tuple is a function that takes two arguments, the number of series to generate (typically, the size of the database) and the number of draws per series, and returns the array of numbers. The second element is a description.
Example:
import numpy as np

def logNormalDraws(sample_size, number_of_draws):
    return np.exp(np.random.randn(sample_size, number_of_draws))

def exponentialDraws(sample_size, number_of_draws):
    return -1.0 * np.log(np.random.rand(sample_size, number_of_draws))

# We associate these functions with a name
rng = {
    'LOGNORMAL': (logNormalDraws, 'Draws from lognormal distribution'),
    'EXP': (exponentialDraws, 'Draws from exponential distributions'),
}
my_data.set_random_number_generators(rng)
- Raises:
ValueError – if a reserved keyword is used for user-defined draws.
- Parameters:
rng (dict[str, RandomNumberGeneratorTuple])
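The reserved-keyword check can be pictured as follows. This sketch runs without Biogeme; the set `RESERVED` is an abbreviated stand-in for the native types, and `register_generators_sketch` is a hypothetical helper.

```python
import numpy as np

# Abbreviated stand-in for the list of native draw types.
RESERVED = {"UNIFORM", "NORMAL", "UNIFORM_HALTON2"}


def register_generators_sketch(registry, rng):
    """Add user generators to `registry` after checking for reserved names."""
    for name in rng:
        if name in RESERVED:
            raise ValueError(f"{name} is a reserved keyword for draws")
    registry.update(rng)


def exponential_draws(sample_size, number_of_draws):
    # Inverse transform sampling; 1 - U lies in (0, 1], so the log is finite.
    return -np.log(1.0 - np.random.rand(sample_size, number_of_draws))


registry = {}
register_generators_sketch(registry, {"EXP": (exponential_draws, "Exponential draws")})
function, description = registry["EXP"]
sample = function(4, 100)

# A name that clashes with a native type is rejected.
try:
    register_generators_sketch(registry, {"NORMAL": (exponential_draws, "clash")})
    clash_rejected = False
except ValueError:
    clash_rejected = True
```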
- split(slices, groups=None)[source]
Prepare estimation and validation sets for validation.
- Parameters:
slices (int) – number of slices
groups (str) – name of the column that defines the ID of the groups. Data belonging to the same groups will be maintained together.
- Returns:
list of estimation and validation data sets
- Return type:
list(tuple(pandas.DataFrame, pandas.DataFrame))
- Raises:
BiogemeError – if the number of slices is less than two
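The structure of the result (one estimation set and one validation set per slice) can be sketched as a simple k-fold split. The helper below is an assumption for illustration: it ignores the `groups` option, assumes a unique row index, and uses `ValueError` in place of `BiogemeError`.

```python
import numpy as np
import pandas as pd


def split_sketch(df, slices, seed=0):
    """Return a list of (estimation, validation) data frames, one per slice."""
    if slices < 2:
        raise ValueError("the number of slices must be at least two")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    result = []
    for fold in np.array_split(shuffled, slices):
        # Rows of this fold are held out for validation.
        validation = df.iloc[np.sort(fold)]
        # All remaining rows are used for estimation.
        estimation = df.drop(index=validation.index)
        result.append((estimation, validation))
    return result


data = pd.DataFrame({"choice": np.arange(10) % 3, "cost": np.linspace(1.0, 5.5, 10)})
folds = split_sketch(data, slices=3)
```

Each row appears in exactly one validation set and in the estimation sets of all other slices.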
- suggestScaling(columns=None, report_all=False)[source]
Warning
This function is deprecated. Use
suggest_scaling()
instead.
- Parameters:
columns (list[str] | None)
report_all (bool)
- suggest_scaling(columns=None, report_all=False)[source]
Suggest a scaling of the variables in the database.
For each column, \(\delta\) is the difference between the largest and the smallest value, or one if the difference is smaller than one. The level of magnitude is evaluated as a power of 10. The suggested scale is the inverse of this value.
\[s = \frac{1}{10^{|\log_{10} \delta|}}\]
where \(|x|\) is the integer closest to \(x\).
- Parameters:
columns (list(str)) – list of columns to be considered. If None, all of them will be considered.
report_all (bool) – if False, remove entries where the suggested scale is 1, 0.1 or 10
- Returns:
A Pandas dataframe where each row contains the name of the variable and the suggested scale s. Ideally, the column should be multiplied by s.
- Return type:
pandas.DataFrame
- Raises:
BiogemeError – if a variable in
columns
is unknown.
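The formula can be evaluated directly with NumPy. The sketch below follows the definition given above; the function name and the example columns are assumptions, and the filtering done by `report_all` is not reproduced.

```python
import numpy as np
import pandas as pd


def suggested_scale(values: pd.Series) -> float:
    """Scale s = 1 / 10**round(log10(delta)), with delta at least one."""
    delta = max(float(values.max() - values.min()), 1.0)
    magnitude = int(np.round(np.log10(delta)))
    return 10.0 ** (-magnitude)


data = pd.DataFrame(
    {
        "income": [12000.0, 95000.0, 40000.0],  # delta = 83000, log10 about 4.92
        "age": [18.0, 80.0, 35.0],  # delta = 62, log10 about 1.79
        "female": [0.0, 1.0, 1.0],  # delta = 1, log10 = 0
    }
)
scales = {column: suggested_scale(data[column]) for column in data.columns}
```

With these values, the income column would be multiplied by 1e-5 and the age column by 0.01, while the indicator column is left unchanged.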
- theDraws
Draws for Monte-Carlo integration
- typesOfDraws
Types of draws for Monte Carlo integration
- userRandomNumberGenerators: dict[str, RandomNumberGeneratorTuple]
Dictionary containing user-defined random number generators. Defined by the function Database.set_random_number_generators, which checks that reserved keywords are not used. Each element of the dictionary is a tuple with two elements: (0) the function generating the draws, and (1) a string describing the type of draws.
- valuesFromDatabase(expression)[source]
Warning
This function is deprecated. Use
values_from_database()
instead.
- Return type:
Series
- Parameters:
expression (Expression)
- values_from_database(expression)[source]
Evaluates an expression for each entry of the database.
- Parameters:
expression (biogeme.expressions.Expression) – expression to evaluate
- Returns:
pandas series, with one entry per row of the database, containing the calculated quantities.
- Return type:
pandas.Series
- Raises:
BiogemeError – if the database is empty.
- variables
names of the headers of the database so that they can be used as an object of type biogeme.expressions.Expression. Initialized by _generateHeaders()
- verify_segmentation(segmentation)[source]
Verifies if the definition of the segmentation is consistent with the data
- Parameters:
segmentation (DiscreteSegmentationTuple) – definition of the segmentation
- Raises:
BiogemeError – if the segmentation is not consistent with the data.
- Return type:
None
- class biogeme.database.EstimationValidation(estimation, validation)[source]
Bases:
NamedTuple
- Parameters:
estimation (DataFrame)
validation (DataFrame)
- estimation: DataFrame
Alias for field number 0
- validation: DataFrame
Alias for field number 1
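Because the class is a named tuple, its fields can be read by name, by position, or by unpacking. The stand-in class below is defined locally so that the sketch runs without Biogeme; it mirrors the two documented fields.

```python
from typing import NamedTuple

import pandas as pd


class EstimationValidationSketch(NamedTuple):
    """Local stand-in with the same two fields as the documented class."""

    estimation: pd.DataFrame
    validation: pd.DataFrame


data = pd.DataFrame({"choice": [1, 2, 1, 2], "cost": [2.0, 3.5, 1.0, 4.0]})
pair = EstimationValidationSketch(estimation=data.iloc[:3], validation=data.iloc[3:])

# Access by name, by position, or by unpacking.
by_name = pair.estimation
by_position = pair[1]
estimation_set, validation_set = pair
```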
- biogeme.database.logger = <Logger biogeme.database (WARNING)>
Logger that controls the output of messages to the screen and log file.