biogeme.database
Examples of use of several functions.
This page is intended for programmers who need to see how the functions of the module are used. The examples only illustrate the syntax. They do not correspond to any meaningful model.
Michel Bierlaire Sun Jun 29 2025, 02:30:19
import numpy as np
import pandas as pd
from IPython.core.display_functions import display
from biogeme.database import (
    Database,
    PanelDatabase,
    check_availability_of_chosen_alt,
    choice_availability_statistics,
)
from biogeme.exceptions import BiogemeError
from biogeme.expressions import Variable, exp
from biogeme.segmentation import DiscreteSegmentationTuple, verify_segmentation
from biogeme.version import get_text
Version of Biogeme.
print(get_text())
biogeme 3.3.1 [2025-09-03]
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)
We set the seed so that the outcome of random operations is always the same.
np.random.seed(90267)
Create a database from a pandas data frame.
df = pd.DataFrame(
    {
        'Person': [1, 1, 1, 2, 2],
        'Exclude': [0, 0, 1, 0, 1],
        'Variable1': [1, 2, 3, 4, 5],
        'Variable2': [10, 20, 30, 40, 50],
        'Choice': [1, 2, 3, 1, 2],
        'Av1': [0, 1, 1, 1, 1],
        'Av2': [1, 1, 1, 1, 1],
        'Av3': [0, 1, 1, 1, 1],
    }
)
my_data = Database('test', df)
print(my_data)
biogeme database test
verify_segmentation: checks that the segmentation covers the complete database. A segmentation is a partition of the dataset based on the value of one of the variables. For instance, we can segment on the Choice variable.
correct_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 3: 'Alt. 3'}
correct_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=correct_mapping
)
If the segmentation is well-defined, the function returns the size of each segment in the database.
verify_segmentation(dataframe=my_data.dataframe, segmentation=correct_segmentation)
incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2'}
incorrect_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=incorrect_mapping
)
If the segmentation is incorrect, an exception is raised.
try:
    verify_segmentation(
        dataframe=my_data.dataframe, segmentation=incorrect_segmentation
    )
except BiogemeError as e:
    print(e)
The following entries are missing in the segmentation: {np.float64(3.0)}.
another_incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 4: 'Does not exist'}
another_incorrect_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=another_incorrect_mapping
)
try:
    verify_segmentation(
        dataframe=my_data.dataframe, segmentation=another_incorrect_segmentation
    )
except BiogemeError as e:
    print(e)
The following entries are missing in the segmentation: {np.float64(3.0)}. Segmentation entries do not exist in the data: {4}.
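The two error messages above come from comparing the values found in the data with the keys of the mapping. The following is a minimal plain-Python sketch of that comparison, not Biogeme's implementation. The function name `segment_sizes` and the use of `ValueError` are assumptions made for illustration.

```python
from collections import Counter


def segment_sizes(values, mapping):
    """Return the size of each segment, or raise ValueError on a mismatch.

    `values` are the observed entries of the segmentation variable, and
    `mapping` associates each expected value with a segment name.
    (Hypothetical helper, written for illustration only.)
    """
    observed = set(values)
    expected = set(mapping)
    missing = observed - expected  # present in the data, absent from the mapping
    unknown = expected - observed  # present in the mapping, absent from the data
    problems = []
    if missing:
        problems.append(f'missing in the segmentation: {sorted(missing)}')
    if unknown:
        problems.append(f'not present in the data: {sorted(unknown)}')
    if problems:
        raise ValueError('; '.join(problems))
    counts = Counter(values)
    return {mapping[value]: counts[value] for value in mapping}


choice = [1, 2, 3, 1, 2]  # the 'Choice' column of the example
sizes = segment_sizes(choice, {1: 'Alt. 1', 2: 'Alt. 2', 3: 'Alt. 3'})
print(sizes)

try:
    segment_sizes(choice, {1: 'Alt. 1', 2: 'Alt. 2', 4: 'Does not exist'})
except ValueError as error:
    print(error)
```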
check_availability_of_chosen_alt: checks whether the chosen alternative is available for each entry in the database.
Av1 = Variable('Av1')
Av2 = Variable('Av2')
Av3 = Variable('Av3')
Choice = Variable('Choice')
avail = {1: Av1, 2: Av2, 3: Av3}
result = check_availability_of_chosen_alt(database=my_data, avail=avail, choice=Choice)
print(result)
[False True True True True]
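The result can be checked by hand: for each row, look up the availability indicator of the alternative that was chosen. The following plain-Python sketch reproduces that logic on the example data. It is an illustration only and does not use Biogeme; the function name is hypothetical.

```python
def chosen_alternative_is_available(choice, availability):
    """For each row, report whether the chosen alternative is available.

    `choice` lists the chosen alternative per row, and `availability`
    maps each alternative to its per-row availability indicators.
    (Hypothetical helper, written for illustration only.)
    """
    return [
        availability[alternative][row] != 0
        for row, alternative in enumerate(choice)
    ]


# Columns of the example data frame.
choice = [1, 2, 3, 1, 2]
availability = {
    1: [0, 1, 1, 1, 1],  # Av1
    2: [1, 1, 1, 1, 1],  # Av2
    3: [0, 1, 1, 1, 1],  # Av3
}

flags = chosen_alternative_is_available(choice, availability)
print(flags)  # [False, True, True, True, True]
```

The first entry is False because alternative 1 is chosen in the first row while Av1 is 0 there.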
choice_availability_statistics: calculates, for each alternative, the number of times it is chosen and the number of times it is available.
statistics = choice_availability_statistics(
    database=my_data, avail=avail, choice=Choice
)
for alternative, choice_available in statistics.items():
    print(
        f'Alternative {alternative} is chosen {choice_available.chosen} times '
        f'and available {choice_available.available} times'
    )
Alternative 1.0 is chosen 2 times and available 4.0 times
Alternative 2.0 is chosen 2 times and available 5.0 times
Alternative 3.0 is chosen 1 times and available 4.0 times
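These counts can be reproduced by hand. The sketch below, in plain Python and independent of Biogeme, counts for each alternative how often it appears in the Choice column and how often its availability indicator is nonzero. The `ChosenAvailable` tuple and the function name are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical container mirroring the two attributes printed above.
ChosenAvailable = namedtuple('ChosenAvailable', ['chosen', 'available'])


def choice_and_availability_counts(choice, availability):
    """Count, per alternative, how often it is chosen and how often available."""
    return {
        alternative: ChosenAvailable(
            chosen=sum(1 for c in choice if c == alternative),
            available=sum(1 for flag in flags if flag != 0),
        )
        for alternative, flags in availability.items()
    }


# Columns of the example data frame.
choice = [1, 2, 3, 1, 2]
availability = {
    1: [0, 1, 1, 1, 1],  # Av1
    2: [1, 1, 1, 1, 1],  # Av2
    3: [0, 1, 1, 1, 1],  # Av3
}

counts = choice_and_availability_counts(choice, availability)
for alternative, stats in counts.items():
    print(alternative, stats.chosen, stats.available)
```

Note that the two counts are independent: alternative 1 is counted as chosen twice even though it is unavailable in one of those rows.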
Suggest a scaling of the variables in the database.
display(my_data.dataframe.columns)
Index(['Person', 'Exclude', 'Variable1', 'Variable2', 'Choice', 'Av1', 'Av2',
'Av3'],
dtype='object')
suggested_scaling = my_data.suggest_scaling()
display(suggested_scaling)
Column Scale Largest
3 Variable2 0.01 50.0
It is also possible to obtain the suggested scaling for selected variables only.
suggested_scaling = my_data.suggest_scaling(columns=['Variable1', 'Variable2'])
display(suggested_scaling)
Column Scale Largest
1 Variable2 0.01 50.0
scale_column: multiplies an entire column by a scale value.
Before.
display(my_data.dataframe)
Person Exclude Variable1 Variable2 Choice Av1 Av2 Av3
0 1.0 0.0 1.0 10.0 1.0 0.0 1.0 0.0
1 1.0 0.0 2.0 20.0 2.0 1.0 1.0 1.0
2 1.0 1.0 3.0 30.0 3.0 1.0 1.0 1.0
3 2.0 0.0 4.0 40.0 1.0 1.0 1.0 1.0
4 2.0 1.0 5.0 50.0 2.0 1.0 1.0 1.0
my_data.scale_column('Variable2', 0.01)
After.
display(my_data.dataframe)
Person Exclude Variable1 Variable2 Choice Av1 Av2 Av3
0 1.0 0.0 1.0 0.1 1.0 0.0 1.0 0.0
1 1.0 0.0 2.0 0.2 2.0 1.0 1.0 1.0
2 1.0 1.0 3.0 0.3 3.0 1.0 1.0 1.0
3 2.0 0.0 4.0 0.4 1.0 1.0 1.0 1.0
4 2.0 1.0 5.0 0.5 2.0 1.0 1.0 1.0
define_variable: adds a new column to the database, calculated from an expression.
Variable1 = Variable('Variable1')
Variable2 = Variable('Variable2')
expression = exp(0.5 * Variable2) / Variable1
result = my_data.define_variable(name='NewVariable', expression=expression)
print(my_data.dataframe['NewVariable'].tolist())
[1.051271096376024, 0.5525854590378239, 0.38727808090942767, 0.30535068954004246, 0.2568050833375483]
display(my_data.dataframe)
Person Exclude Variable1 Variable2 Choice Av1 Av2 Av3 NewVariable
0 1.0 0.0 1.0 0.1 1.0 0.0 1.0 0.0 1.051271
1 1.0 0.0 2.0 0.2 2.0 1.0 1.0 1.0 0.552585
2 1.0 1.0 3.0 0.3 3.0 1.0 1.0 1.0 0.387278
3 2.0 0.0 4.0 0.4 1.0 1.0 1.0 1.0 0.305351
4 2.0 1.0 5.0 0.5 2.0 1.0 1.0 1.0 0.256805
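The values of NewVariable can be verified row by row with the standard library. The sketch below evaluates exp(0.5 * Variable2) / Variable1 on the scaled Variable2 column shown above. It only illustrates the arithmetic and does not use Biogeme.

```python
import math

# Columns of the database after Variable2 has been scaled by 0.01.
variable1 = [1.0, 2.0, 3.0, 4.0, 5.0]
variable2 = [0.1, 0.2, 0.3, 0.4, 0.5]

# Row-wise evaluation of exp(0.5 * Variable2) / Variable1.
new_variable = [math.exp(0.5 * v2) / v1 for v1, v2 in zip(variable1, variable2)]

for value in new_variable:
    print(f'{value:.6f}')
```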
remove: removes from the database all entries such that the value of the expression is not 0.
exclude = Variable('Exclude')
my_data.remove(exclude)
display(my_data.dataframe)
Person Exclude Variable1 Variable2 Choice Av1 Av2 Av3 NewVariable
0 1.0 0.0 1.0 0.1 1.0 0.0 1.0 0.0 1.051271
1 1.0 0.0 2.0 0.2 2.0 1.0 1.0 1.0 0.552585
2 2.0 0.0 4.0 0.4 1.0 1.0 1.0 1.0 0.305351
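The effect of the removal can be sketched in plain Python as a filter that keeps only the rows where the expression evaluates to 0. This is an illustration on a subset of the example columns, not Biogeme's implementation; the function name is hypothetical.

```python
def remove_rows(rows, column):
    """Keep only the rows where `column` is 0, dropping all the others.

    (Hypothetical helper, written for illustration only.)
    """
    return [row for row in rows if row[column] == 0]


# A subset of the columns of the example database.
rows = [
    {'Person': 1, 'Exclude': 0, 'Variable1': 1},
    {'Person': 1, 'Exclude': 0, 'Variable1': 2},
    {'Person': 1, 'Exclude': 1, 'Variable1': 3},
    {'Person': 2, 'Exclude': 0, 'Variable1': 4},
    {'Person': 2, 'Exclude': 1, 'Variable1': 5},
]

kept = remove_rows(rows, 'Exclude')
print([row['Variable1'] for row in kept])  # [1, 2, 4]
```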
bootstrap_sample: extracts a random sample from the database, with replacement. Useful for bootstrapping.
One bootstrap sample
bootstrap_sample = my_data.bootstrap_sample()
display(bootstrap_sample.dataframe)
Person Exclude Variable1 Variable2 Choice Av1 Av2 Av3 NewVariable
0 1.0 0.0 1.0 0.1 1.0 0.0 1.0 0.0 1.051271
1 1.0 0.0 1.0 0.1 1.0 0.0 1.0 0.0 1.051271
2 1.0 0.0 1.0 0.1 1.0 0.0 1.0 0.0 1.051271
Another bootstrap sample
bootstrap_sample = my_data.bootstrap_sample()
display(bootstrap_sample.dataframe)
Person Exclude Variable1 Variable2 Choice Av1 Av2 Av3 NewVariable
0 1.0 0.0 2.0 0.2 2.0 1.0 1.0 1.0 0.552585
1 2.0 0.0 4.0 0.4 1.0 1.0 1.0 1.0 0.305351
2 1.0 0.0 2.0 0.2 2.0 1.0 1.0 1.0 0.552585
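Sampling with replacement means that each draw is made from the full set of rows, so a row may appear several times in the sample and others not at all. The following standard-library sketch illustrates the idea on three rows. It is not Biogeme's implementation, the function name is hypothetical, and its random draws will differ from the samples shown above.

```python
import random


def bootstrap_rows(rows, rng):
    """Draw len(rows) rows at random, with replacement.

    (Hypothetical helper, written for illustration only.)
    """
    return rng.choices(rows, k=len(rows))


# A subset of the columns of the example database after the removal.
rows = [
    {'Person': 1, 'Variable1': 1.0},
    {'Person': 1, 'Variable1': 2.0},
    {'Person': 2, 'Variable1': 4.0},
]

rng = random.Random(90267)  # fixed seed so that the draws are reproducible
sample = bootstrap_rows(rows, rng)
for row in sample:
    print(row)
```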
If the database is organised for panel data, where several observations are available for each individual, the database must be flattened so that each row corresponds to an individual.
my_panel_data = PanelDatabase(database=my_data, panel_column='Person')
flattened_dataframe, largest_group = my_panel_data.flatten_database(
    missing_data='999999'
)
print(f'The size of the largest group of data per individual is {largest_group}')
The size of the largest group of data per individual is 2
The names of the columns of the flat dataframe are the names of the original columns, with a suffix. For each variable column in the original DataFrame (excluding the column identifying the individuals), the output contains multiple columns named columnname__panel__XX, where XX is the zero-padded observation index (starting at 01). Additionally, for each observation index, a relevant___panel__XX column indicates whether the observation is relevant (1) or padded with a missing value (0).
print('The columns of the flat dataframe are:')
for col in flattened_dataframe.columns:
    print(f'\t{col}')
display(flattened_dataframe)
The columns of the flat dataframe are:
Person
Exclude
Av2
relevant___panel__01
Exclude__panel__01
Variable1__panel__01
Variable2__panel__01
Choice__panel__01
Av1__panel__01
Av2__panel__01
Av3__panel__01
NewVariable__panel__01
relevant___panel__02
Exclude__panel__02
Variable1__panel__02
Variable2__panel__02
Choice__panel__02
Av1__panel__02
Av2__panel__02
Av3__panel__02
NewVariable__panel__02
Person Exclude Av2 ... Av2__panel__02 Av3__panel__02 NewVariable__panel__02
0 1.0 0.0 1.0 ... 1.0 1.0 0.552585
1 2.0 0.0 1.0 ... 999999 999999 999999
[2 rows x 21 columns]
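The flattening logic can be sketched in plain Python: group the rows by individual, find the largest group, and write each observation into its own set of suffixed columns, padding the shorter groups with the missing value. This is a simplified illustration under assumptions, not Biogeme's implementation. The function name is hypothetical, the missing value is an integer here, and the leading unsuffixed columns that appear in the output above (Exclude, Av2) are not reproduced.

```python
from collections import defaultdict


def flatten_panel(rows, panel_column, missing=999999):
    """Return one flat row per individual and the size of the largest group.

    (Hypothetical helper, written for illustration only.)
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[panel_column]].append(row)
    largest_group = max(len(observations) for observations in groups.values())
    columns = [name for name in rows[0] if name != panel_column]

    flat_rows = []
    for individual, observations in groups.items():
        flat = {panel_column: individual}
        for index in range(largest_group):
            suffix = f'{index + 1:02d}'  # zero-padded, starting at 01
            relevant = index < len(observations)
            flat[f'relevant___panel__{suffix}'] = int(relevant)
            for name in columns:
                value = observations[index][name] if relevant else missing
                flat[f'{name}__panel__{suffix}'] = value
        flat_rows.append(flat)
    return flat_rows, largest_group


# A subset of the columns of the example database after the removal.
rows = [
    {'Person': 1, 'Variable1': 1.0, 'Choice': 1},
    {'Person': 1, 'Variable1': 2.0, 'Choice': 2},
    {'Person': 2, 'Variable1': 4.0, 'Choice': 1},
]

flat_rows, largest_group = flatten_panel(rows, 'Person')
print(largest_group)
for flat in flat_rows:
    print(flat)
```

Person 2 has a single observation, so its columns with suffix 02 hold the missing value and its relevant___panel__02 entry is 0.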