Note

Go to the end to download the full example code.

biogeme.database¶

Examples of use of several functions.

This is designed for programmers who need examples of use of the functions of the module. The examples are designed to illustrate the syntax. They do not correspond to any meaningful model.

Michel Bierlaire Sun Jun 29 2025, 02:30:19

import numpy as np
import pandas as pd
from IPython.core.display_functions import display

from biogeme.database import (
    Database,
    PanelDatabase,
    check_availability_of_chosen_alt,
    choice_availability_statistics,
)
from biogeme.exceptions import BiogemeError
from biogeme.expressions import Variable, exp
from biogeme.segmentation import DiscreteSegmentationTuple, verify_segmentation
from biogeme.version import get_text

Version of Biogeme.

print(get_text())

biogeme 3.3.1 [2025-09-03]
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)

We set the seed so that the outcome of random operations is always the same.

np.random.seed(90267)

Create a database from a pandas data frame.

df = pd.DataFrame(
    {
        'Person': [1, 1, 1, 2, 2],
        'Exclude': [0, 0, 1, 0, 1],
        'Variable1': [1, 2, 3, 4, 5],
        'Variable2': [10, 20, 30, 40, 50],
        'Choice': [1, 2, 3, 1, 2],
        'Av1': [0, 1, 1, 1, 1],
        'Av2': [1, 1, 1, 1, 1],
        'Av3': [0, 1, 1, 1, 1],
    }
)
my_data = Database('test', df)
print(my_data)

biogeme database test

check_segmentation: checks that the segmentation covers the complete database. A segmentation is a partition of the dataset based on the value of one of the variables. For instance, we can segment on the Choice variable.

correct_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 3: 'Alt. 3'}
correct_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=correct_mapping
)

If the segmentation is well-defined, the function returns the size of each segment in the database.

verify_segmentation(dataframe=my_data.dataframe, segmentation=correct_segmentation)

incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2'}
incorrect_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=incorrect_mapping
)

If the segmentation is incorrect, an exception is raised.

try:
    verify_segmentation(
        dataframe=my_data.dataframe, segmentation=incorrect_segmentation
    )
except BiogemeError as e:
    print(e)

The following entries are missing in the segmentation: {np.float64(3.0)}.

another_incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 4: 'Does not exist'}
another_incorrect_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=another_incorrect_mapping
)

try:
    verify_segmentation(
        dataframe=my_data.dataframe, segmentation=another_incorrect_segmentation
    )
except BiogemeError as e:
    print(e)

The following entries are missing in the segmentation: {np.float64(3.0)}. Segmentation entries do not exist in the data: {4}.

checkAvailabilityOfChosenAlt: check if the chosen alternative is available for each entry in the database. %%

Av1 = Variable('Av1')
Av2 = Variable('Av2')
Av3 = Variable('Av3')
Choice = Variable('Choice')
avail = {1: Av1, 2: Av2, 3: Av3}
result = check_availability_of_chosen_alt(database=my_data, avail=avail, choice=Choice)
print(result)

[False  True  True  True  True]

choiceAvailabilityStatistics: calculates the number of time an alternative is chosen and available.

statistics = choice_availability_statistics(
    database=my_data, avail=avail, choice=Choice
)
for alternative, choice_available in statistics.items():
    print(
        f'Alternative {alternative} is chosen {choice_available.chosen} times '
        f'and available {choice_available.available} times'
    )

Alternative 1.0 is chosen 2 times and available 4.0 times
Alternative 2.0 is chosen 2 times and available 5.0 times
Alternative 3.0 is chosen 1 times and available 4.0 times

Suggest a scaling of the variables in the database %%

display(my_data.dataframe.columns)

Index(['Person', 'Exclude', 'Variable1', 'Variable2', 'Choice', 'Av1', 'Av2',
       'Av3'],
      dtype='object')

suggested_scaling = my_data.suggest_scaling()
display(suggested_scaling)

      Column  Scale  Largest
3  Variable2   0.01     50.0

It is possible to obtain the scaling for selected variables

suggested_scaling = my_data.suggest_scaling(columns=['Variable1', 'Variable2'])
display(suggested_scaling)

      Column  Scale  Largest
1  Variable2   0.01     50.0

scale_column: divide an entire column by a scale value %% Before.

display(my_data.dataframe)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
   1.0      0.0        1.0       10.0     1.0  0.0  1.0  0.0
   1.0      0.0        2.0       20.0     2.0  1.0  1.0  1.0
   1.0      1.0        3.0       30.0     3.0  1.0  1.0  1.0
   2.0      0.0        4.0       40.0     1.0  1.0  1.0  1.0
   2.0      1.0        5.0       50.0     2.0  1.0  1.0  1.0

my_data.scale_column('Variable2', 0.01)

After.

display(my_data.dataframe)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
   1.0      0.0        1.0        0.1     1.0  0.0  1.0  0.0
   1.0      0.0        2.0        0.2     2.0  1.0  1.0  1.0
   1.0      1.0        3.0        0.3     3.0  1.0  1.0  1.0
   2.0      0.0        4.0        0.4     1.0  1.0  1.0  1.0
   2.0      1.0        5.0        0.5     2.0  1.0  1.0  1.0

define_variable: add a new column in the database, calculated from an expression. %%

Variable1 = Variable('Variable1')
Variable2 = Variable('Variable2')
expression = exp(0.5 * Variable2) / Variable1
result = my_data.define_variable(name='NewVariable', expression=expression)
print(my_data.dataframe['NewVariable'].tolist())

[1.051271096376024, 0.5525854590378239, 0.38727808090942767, 0.30535068954004246, 0.2568050833375483]

display(my_data.dataframe)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
   1.0      0.0        1.0        0.1     1.0  0.0  1.0  0.0     1.051271
   1.0      0.0        2.0        0.2     2.0  1.0  1.0  1.0     0.552585
   1.0      1.0        3.0        0.3     3.0  1.0  1.0  1.0     0.387278
   2.0      0.0        4.0        0.4     1.0  1.0  1.0  1.0     0.305351
   2.0      1.0        5.0        0.5     2.0  1.0  1.0  1.0     0.256805

remove: removes from the database all entries such that the value of the expression is not 0. %%

exclude = Variable('Exclude')
my_data.remove(exclude)
display(my_data.dataframe)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
   1.0      0.0        1.0        0.1     1.0  0.0  1.0  0.0     1.051271
   1.0      0.0        2.0        0.2     2.0  1.0  1.0  1.0     0.552585
   2.0      0.0        4.0        0.4     1.0  1.0  1.0  1.0     0.305351

sample_with_replacement: extracts a random sample from the database, with replacement. Useful for bootstrapping.

One bootstrap sample

bootstrap_sample = my_data.bootstrap_sample()
display(bootstrap_sample.dataframe)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
   1.0      0.0        1.0        0.1     1.0  0.0  1.0  0.0     1.051271
   1.0      0.0        1.0        0.1     1.0  0.0  1.0  0.0     1.051271
   1.0      0.0        1.0        0.1     1.0  0.0  1.0  0.0     1.051271

Another bootstrap sample

bootstrap_sample = my_data.bootstrap_sample()
display(bootstrap_sample.dataframe)

   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
   1.0      0.0        2.0        0.2     2.0  1.0  1.0  1.0     0.552585
   2.0      0.0        4.0        0.4     1.0  1.0  1.0  1.0     0.305351
   1.0      0.0        2.0        0.2     2.0  1.0  1.0  1.0     0.552585

If the database is organised for panel data, where several observations are available for each individual, the database must be flattened so that each row corresponds to an individual

my_panel_data = PanelDatabase(database=my_data, panel_column='Person')
flattened_dataframe, largest_group = my_panel_data.flatten_database(
    missing_data='999999'
)
print(f'The size of the largest group of data per individual is {largest_group}')

The size of the largest group of data per individual is 2

The name of the columns of the flat dataframe are the name of the original columns, with a suffix. For each variable column in the original DataFrame (excluding the column identifying the individuals), the output contains multiple columns named columnname__panel__XX, where XX is the zero-padded observation index (starting at 01). Additionally, for each observation index, a relevant_XX column indicates whether the observation is relevant (1) or padded with a missing value (0).

print('The columns of the flat dataframe are:')
for col in flattened_dataframe.columns:
    print(f'\t{col}')
display(flattened_dataframe)

The columns of the flat dataframe are:
        Person
        Exclude
        Av2
        relevant___panel__01
        Exclude__panel__01
        Variable1__panel__01
        Variable2__panel__01
        Choice__panel__01
        Av1__panel__01
        Av2__panel__01
        Av3__panel__01
        NewVariable__panel__01
        relevant___panel__02
        Exclude__panel__02
        Variable1__panel__02
        Variable2__panel__02
        Choice__panel__02
        Av1__panel__02
        Av2__panel__02
        Av3__panel__02
        NewVariable__panel__02
   Person  Exclude  Av2  ...  Av2__panel__02  Av3__panel__02  NewVariable__panel__02
0     1.0      0.0  1.0  ...             1.0             1.0                0.552585
1     2.0      0.0  1.0  ...          999999          999999                  999999

[2 rows x 21 columns]

Total running time of the script: (0 minutes 0.204 seconds)

Gallery generated by Sphinx-Gallery