biogeme.database

Examples of use of several functions.

This page is designed for programmers who need examples of how to use the functions of the module. The examples are designed to illustrate the syntax; they do not correspond to any meaningful model.

author:

Michel Bierlaire

date:

Thu Nov 16 18:36:59 2023

import biogeme.version as ver
import pandas as pd
import numpy as np
import biogeme.database as db
from biogeme.expressions import Variable, exp, bioDraws
from biogeme.expressions import TypeOfElementaryExpression
from biogeme.native_draws import description_of_native_draws, RandomNumberGeneratorTuple
from biogeme.segmentation import DiscreteSegmentationTuple
from biogeme.exceptions import BiogemeError

Version of Biogeme.

print(ver.get_text())
biogeme 3.2.14 [2024-08-05]
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)

We set the seed so that the outcome of random operations is always the same.

np.random.seed(90267)

Create a database from a pandas data frame.

df = pd.DataFrame(
    {
        'Person': [1, 1, 1, 2, 2],
        'Exclude': [0, 0, 1, 0, 1],
        'Variable1': [1, 2, 3, 4, 5],
        'Variable2': [10, 20, 30, 40, 50],
        'Choice': [1, 2, 3, 1, 2],
        'Av1': [0, 1, 1, 1, 1],
        'Av2': [1, 1, 1, 1, 1],
        'Av3': [0, 1, 1, 1, 1],
    }
)
my_data = db.Database('test', df)
print(my_data)
biogeme database test:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1         10       1    0    1    0
1       1        0          2         20       2    1    1    1
2       1        1          3         30       3    1    1    1
3       2        0          4         40       1    1    1    1
4       2        1          5         50       2    1    1    1

values_from_database: evaluates an expression for each entry of the database. It takes an expression as argument, and returns a numpy array, whose length equals the number of entries in the database, containing the calculated quantities.

Variable1 = Variable('Variable1')
Variable2 = Variable('Variable2')
expr = Variable1 + Variable2
result = my_data.values_from_database(expr)
print(result)
[11. 22. 33. 44. 55.]

check_segmentation: checks that the segmentation covers the complete database. A segmentation is a partition of the dataset based on the value of one of the variables. For instance, we can segment on the Choice variable.

correct_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 3: 'Alt. 3'}
correct_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=correct_mapping
)

If the segmentation is well defined, the function returns the size of each segment in the database.

my_data.check_segmentation(correct_segmentation)
{'Alt. 1': np.int64(2), 'Alt. 2': np.int64(2), 'Alt. 3': np.int64(1)}
incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2'}
incorrect_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=incorrect_mapping
)

If the segmentation is incorrect, an exception is raised.

try:
    my_data.check_segmentation(incorrect_segmentation)
except BiogemeError as e:
    print(e)
Variable Choice takes the value 3 [1 times], and it does not define any segment.
another_incorrect_mapping = {1: 'Alt. 1', 2: 'Alt. 2', 4: 'Does not exist'}
another_incorrect_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=another_incorrect_mapping
)
try:
    my_data.check_segmentation(another_incorrect_segmentation)
except BiogemeError as e:
    print(e)
Variable Choice does not take the value 4 representing segment "Does not exist"
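
If the values taken by the segmentation variable are not known in advance, a covering mapping can be built directly from the data. The following lines are a small sketch that is not part of the original example; the labels are arbitrary.

observed_values = sorted(my_data.data['Choice'].unique())
automatic_mapping = {int(value): f'Alt. {int(value)}' for value in observed_values}
automatic_segmentation = DiscreteSegmentationTuple(
    variable='Choice', mapping=automatic_mapping
)
my_data.check_segmentation(automatic_segmentation)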

check_availability_of_chosen_alt: checks whether the chosen alternative is available for each entry in the database.

Av1 = Variable('Av1')
Av2 = Variable('Av2')
Av3 = Variable('Av3')
Choice = Variable('Choice')
avail = {1: Av1, 2: Av2, 3: Av3}
result = my_data.check_availability_of_chosen_alt(avail, Choice)
print(result)
[False  True  True  True  True]
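
The boolean array returned by the function can be used directly with pandas to inspect the problematic entries. This is a small sketch, not part of the original example.

number_of_issues = int((~result).sum())
print(f'{number_of_issues} entry(ies) where the chosen alternative is unavailable')
print(my_data.data[~result])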

choice_availability_statistics: calculates, for each alternative, the number of times it is chosen and the number of times it is available.

my_data.choice_availability_statistics(avail, Choice)
{np.float64(1.0): (2, np.float64(4.0)), np.float64(2.0): (2, np.float64(5.0)), np.float64(3.0): (1, np.float64(4.0))}
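
As a complementary illustration, not part of the original example, the statistics can be turned into the share of times each alternative is chosen among the observations where it is available. The first element of each tuple is the number of times the alternative is chosen, and the second the number of times it is available.

stats = my_data.choice_availability_statistics(avail, Choice)
shares = {
    alternative: chosen / available
    for alternative, (chosen, available) in stats.items()
}
print(shares)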

suggest_scaling: suggests a scaling of the variables in the database.

my_data.data.columns
Index(['Person', 'Exclude', 'Variable1', 'Variable2', 'Choice', 'Av1', 'Av2',
       'Av3'],
      dtype='object')
my_data.suggest_scaling()
      Column  Scale  Largest
3  Variable2   0.01       50


my_data.suggest_scaling(columns=['Variable1', 'Variable2'])
      Column  Scale  Largest
1  Variable2   0.01       50


scale_column: multiplies an entire column by a scale value.

Before.

my_data.data
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1         10       1    0    1    0
1       1        0          2         20       2    1    1    1
2       1        1          3         30       3    1    1    1
3       2        0          4         40       1    1    1    1
4       2        1          5         50       2    1    1    1


my_data.scale_column('Variable2', 0.01)

After.

my_data.data
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1        0.1       1    0    1    0
1       1        0          2        0.2       2    1    1    1
2       1        1          3        0.3       3    1    1    1
3       2        0          4        0.4       1    1    1    1
4       2        1          5        0.5       2    1    1    1


add_column: adds a new column to the database, calculated from an expression.

Variable1 = Variable('Variable1')
Variable2 = Variable('Variable2')
expression = exp(0.5 * Variable2) / Variable1
# expression = Variable2 * Variable1
result = my_data.add_column(expression, 'NewVariable')
print(my_data.data['NewVariable'].tolist())
[1.0512710963760241, 0.5525854590378239, 0.38727808090942767, 0.30535068954004246, 0.25680508333754826]
my_data.data
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0     1.051271
1       1        0          2        0.2       2    1    1    1     0.552585
2       1        1          3        0.3       3    1    1    1     0.387278
3       2        0          4        0.4       1    1    1    1     0.305351
4       2        1          5        0.5       2    1    1    1     0.256805
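
As a sanity check, not part of the original example, the new column can be compared with the same quantity computed directly with pandas and numpy.

manual = np.exp(0.5 * my_data.data['Variable2']) / my_data.data['Variable1']
print(bool(np.allclose(manual, my_data.data['NewVariable'])))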


split: shuffles the data and splits it into slices. For each slice, an estimation set and a validation set are generated. The validation set is the slice itself; the estimation set is the rest of the data.

dataSets = my_data.split(3)
for i in dataSets:
    print("==========")
    print("Estimation:")
    print(type(i[0]))
    print(i[0])
    print("Validation:")
    print(i[1])
==========
Estimation:
<class 'pandas.core.frame.DataFrame'>
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0     1.051271
3       2        0          4        0.4       1    1    1    1     0.305351
4       2        1          5        0.5       2    1    1    1     0.256805
Validation:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
1       1        0          2        0.2       2    1    1    1     0.552585
2       1        1          3        0.3       3    1    1    1     0.387278
==========
Estimation:
<class 'pandas.core.frame.DataFrame'>
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
1       1        0          2        0.2       2    1    1    1     0.552585
2       1        1          3        0.3       3    1    1    1     0.387278
4       2        1          5        0.5       2    1    1    1     0.256805
Validation:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0     1.051271
3       2        0          4        0.4       1    1    1    1     0.305351
==========
Estimation:
<class 'pandas.core.frame.DataFrame'>
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
1       1        0          2        0.2       2    1    1    1     0.552585
2       1        1          3        0.3       3    1    1    1     0.387278
0       1        0          1        0.1       1    0    1    0     1.051271
3       2        0          4        0.4       1    1    1    1     0.305351
Validation:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
4       2        1          5        0.5       2    1    1    1     0.256805
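
A typical use of these slices is out-of-sample validation: a model is estimated on each estimation set and evaluated on the corresponding validation set. The following sketch, not part of the original example, only computes a simple statistic on both parts to illustrate the loop structure.

for the_slice in dataSets:
    estimation, validation = the_slice[0], the_slice[1]
    print(
        f'Mean of Variable1 - estimation: {estimation["Variable1"].mean():.3f}, '
        f'validation: {validation["Variable1"].mean():.3f}'
    )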

count: counts the number of observations that have a specific value in a given column.

For instance, count the number of entries for individual 1.

my_data.count('Person', 1)
np.int64(3)

remove: removes from the database all entries for which the value of the expression is not 0.

exclude = Variable('Exclude')
my_data.remove(exclude)
my_data.data
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0     1.051271
1       1        0          2        0.2       2    1    1    1     0.552585
3       2        0          4        0.4       1    1    1    1     0.305351
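
The argument of remove can be any expression, not only a single variable. The following sketch, not part of the original example, applies a compound exclusion condition to a copy of the data, so that my_data itself is left untouched.

temporary_database = db.Database('test_copy', my_data.data.copy())
compound_exclusion = exclude + (Variable1 > 2)
temporary_database.remove(compound_exclusion)
print(temporary_database.data)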


dump_on_file: dumps the database into a CSV-formatted file.

my_data.dump_on_file()
'test_dumped.dat'

The content of the file can be displayed from the shell, for instance with the command cat test_dumped.dat.
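
Alternatively, the file can be read back from Python. In this sketch, which is not part of the original example, the separator is inferred by pandas, as the exact format of the dumped file is not specified here.

print(pd.read_csv('test_dumped.dat', sep=None, engine='python').head())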

generate_draws: generates draws for each variable. It takes as arguments a dict indexed by the names of the variables, describing the types of draws (each of them can be a native type, or any type defined with set_random_number_generators), the list of names of the variables that require draws to be generated, and the number of draws. It returns a 3-dimensional table of draws. The 3 dimensions are:

1. number of individuals
2. number of draws
3. number of variables

List of native draw types and their descriptions.

description_of_native_draws()
{'UNIFORM': 'Uniform U[0, 1]', 'UNIFORM_ANTI': 'Antithetic uniform U[0, 1]', 'UNIFORM_HALTON2': 'Halton draws with base 2, skipping the first 10', 'UNIFORM_HALTON3': 'Halton draws with base 3, skipping the first 10', 'UNIFORM_HALTON5': 'Halton draws with base 5, skipping the first 10', 'UNIFORM_MLHS': 'Modified Latin Hypercube Sampling on [0, 1]', 'UNIFORM_MLHS_ANTI': 'Antithetic Modified Latin Hypercube Sampling on [0, 1]', 'UNIFORMSYM': 'Uniform U[-1, 1]', 'UNIFORMSYM_ANTI': 'Antithetic uniform U[-1, 1]', 'UNIFORMSYM_HALTON2': 'Halton draws on [-1, 1] with base 2, skipping the first 10', 'UNIFORMSYM_HALTON3': 'Halton draws on [-1, 1] with base 3, skipping the first 10', 'UNIFORMSYM_HALTON5': 'Halton draws on [-1, 1] with base 5, skipping the first 10', 'UNIFORMSYM_MLHS': 'Modified Latin Hypercube Sampling on [-1, 1]', 'UNIFORMSYM_MLHS_ANTI': 'Antithetic Modified Latin Hypercube Sampling on [-1, 1]', 'NORMAL': 'Normal N(0, 1) draws', 'NORMAL_ANTI': 'Antithetic normal draws', 'NORMAL_HALTON2': 'Normal draws from Halton base 2 sequence', 'NORMAL_HALTON3': 'Normal draws from Halton base 3 sequence', 'NORMAL_HALTON5': 'Normal draws from Halton base 5 sequence', 'NORMAL_MLHS': 'Normal draws from Modified Latin Hypercube Sampling', 'NORMAL_MLHS_ANTI': 'Antithetic normal draws from Modified Latin Hypercube Sampling'}
random_draws1 = bioDraws('random_draws1', 'NORMAL_MLHS_ANTI')
random_draws2 = bioDraws('random_draws2', 'UNIFORM_MLHS_ANTI')
random_draws3 = bioDraws('random_draws3', 'UNIFORMSYM_MLHS_ANTI')

We build an expression that involves the three random variables.

x = random_draws1 + random_draws2 + random_draws3
dict_of_draws = x.dict_of_elementary_expression(TypeOfElementaryExpression.DRAWS)
types = {name: expression.drawType for name, expression in dict_of_draws.items()}
print(types)
{'random_draws1': 'NORMAL_MLHS_ANTI', 'random_draws2': 'UNIFORM_MLHS_ANTI', 'random_draws3': 'UNIFORMSYM_MLHS_ANTI'}

Generation of the draws.

the_draws_table = my_data.generate_draws(
    types, ['random_draws1', 'random_draws2', 'random_draws3'], 10
)
the_draws_table
array([[[-0.5605896 ,  0.17260212, -0.35933972],
        [-0.13811324,  0.53162299,  0.85919231],
        [ 1.82908818,  0.04835596,  0.13484656],
        [ 0.38628367,  0.78402643, -0.65819673],
        [ 1.1505448 ,  0.6848117 ,  0.68832765],
        [ 0.5605896 ,  0.82739788,  0.35933972],
        [ 0.13811324,  0.46837701, -0.85919231],
        [-1.82908818,  0.95164404, -0.13484656],
        [-0.38628367,  0.21597357,  0.65819673],
        [-1.1505448 ,  0.3151883 , -0.68832765]],

       [[-0.64437973,  0.2080917 , -0.46801586],
        [ 0.00796208,  0.12040568, -0.10798735],
        [-1.6477843 ,  0.99704115,  0.21589817],
        [-1.19741369,  0.83479683,  0.95349066],
        [ 0.45912504,  0.6264498 , -0.29536472],
        [ 0.64437973,  0.7919083 ,  0.46801586],
        [-0.00796208,  0.87959432,  0.10798735],
        [ 1.6477843 ,  0.00295885, -0.21589817],
        [ 1.19741369,  0.16520317, -0.95349066],
        [-0.45912504,  0.3735502 ,  0.29536472]],

       [[ 0.79534986,  0.59656342,  0.03577992],
        [ 0.94867479,  0.35276091, -0.77071189],
        [-1.04456302,  0.88297895,  0.44668319],
        [ 0.15248012,  0.43261985,  0.51955543],
        [-0.35858978,  0.33330463, -0.98648324],
        [-0.79534986,  0.40343658, -0.03577992],
        [-0.94867479,  0.64723909,  0.77071189],
        [ 1.04456302,  0.11702105, -0.44668319],
        [-0.15248012,  0.56738015, -0.51955543],
        [ 0.35858978,  0.66669537,  0.98648324]]])
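
The shape of the array reflects the three dimensions described above. With the 3 remaining entries in the database, 10 draws and 3 random variables, it should be (3, 10, 3).

print(the_draws_table.shape)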

set_random_number_generators: defines user-defined random number generators. It takes as argument a dictionary of generators. The keys of the dictionary are the names of the generators, and must be different from the pre-defined generators in Biogeme: NORMAL, UNIFORM and UNIFORMSYM. The elements of the dictionary are functions that take two arguments: the number of series to generate (typically, the size of the database) and the number of draws per series.

We first define functions returning draws, given the number of observations and the number of draws.

A lognormal distribution.

def log_normal_draws(sample_size: int, number_of_draws: int) -> np.ndarray:
    return np.exp(np.random.randn(sample_size, number_of_draws))

An exponential distribution.

def exponential_draws(sample_size: int, number_of_draws: int) -> np.ndarray:
    return -1.0 * np.log(np.random.rand(sample_size, number_of_draws))
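
As a quick sanity check, not part of the original example, we can verify the shape and the sign of the generated draws. Note that these calls consume values from numpy's global random stream.

lognormal_sample = log_normal_draws(3, 5)
exponential_sample = exponential_draws(3, 5)
print(lognormal_sample.shape, bool(np.all(lognormal_sample > 0)))
print(exponential_sample.shape, bool(np.all(exponential_sample >= 0)))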

We associate these functions with a name in a dictionary.

rnd_dict = {
    'LOGNORMAL': RandomNumberGeneratorTuple(
        generator=log_normal_draws, description='Draws from lognormal distribution'
    ),
    'EXP': RandomNumberGeneratorTuple(
        generator=exponential_draws, description='Draws from exponential distributions'
    ),
}
my_data.set_random_number_generators(rnd_dict)

We can now generate draws from these distributions.

random_draws1 = bioDraws('random_draws1', 'LOGNORMAL')
random_draws2 = bioDraws('random_draws2', 'EXP')
x = random_draws1 + random_draws2
the_draws = x.dict_of_elementary_expression(TypeOfElementaryExpression.DRAWS)
the_types = {name: expression.drawType for name, expression in the_draws.items()}
the_draws_table = my_data.generate_draws(
    draw_types=the_types, names=['random_draws1', 'random_draws2'], number_of_draws=10
)
print(the_draws_table)
[[[2.15336577 0.35541854]
  [0.92036707 0.38330687]
  [1.35125462 2.83842826]
  [0.27817501 0.46249413]
  [0.5007549  0.6961861 ]
  [1.11902088 1.05840875]
  [0.6539865  0.15909907]
  [0.11955894 0.38886736]
  [0.60108954 0.40525196]
  [3.93153651 0.35868107]]

 [[4.60723253 0.18021421]
  [1.27062239 2.2373742 ]
  [2.73460167 1.17203962]
  [5.61600938 1.8920716 ]
  [2.54756523 0.07930524]
  [0.77284243 2.56028383]
  [5.16153268 0.59225528]
  [0.58972275 0.67940422]
  [0.88324351 0.63497716]
  [3.67625403 3.030641  ]]

 [[2.24536739 0.70518133]
  [0.46930501 0.67990918]
  [4.86579395 0.4097506 ]
  [2.14129298 0.8086017 ]
  [0.20614091 0.06963184]
  [0.2096891  0.02382351]
  [1.70933977 0.78170648]
  [0.63660909 1.83653019]
  [1.14977308 0.75890102]
  [0.26832114 4.20117546]]]

sample_with_replacement: extracts a random sample from the database, with replacement. Useful for bootstrapping.

my_data.sample_with_replacement()
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
3       2        0          4        0.4       1    1    1    1     0.305351
3       2        0          4        0.4       1    1    1    1     0.305351
1       1        0          2        0.2       2    1    1    1     0.552585


my_data.sample_with_replacement(6)
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
3       2        0          4        0.4       1    1    1    1     0.305351
0       1        0          1        0.1       1    0    1    0     1.051271
0       1        0          1        0.1       1    0    1    0     1.051271
0       1        0          1        0.1       1    0    1    0     1.051271
0       1        0          1        0.1       1    0    1    0     1.051271
1       1        0          2        0.2       2    1    1    1     0.552585


panel: defines the data as panel data. Takes as argument the name of the column that identifies individuals.

my_panel_data = db.Database('test', df)

The data is not considered panel data yet.

my_panel_data.is_panel()
False
my_panel_data.panel('Person')

Now it is panel.

print(my_panel_data.is_panel())
True
print(my_panel_data)
biogeme database test:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0     1.051271
1       1        0          2        0.2       2    1    1    1     0.552585
2       2        0          4        0.4       1    1    1    1     0.305351
Panel data
   0  1
1  0  1
2  2  2

When draws are generated for panel data, a set of draws is generated per person, not per observation.

random_draws1 = bioDraws('random_draws1', 'NORMAL')
random_draws2 = bioDraws('random_draws2', 'UNIFORM_HALTON3')

We build an expression that involves the two random variables.

x = random_draws1 + random_draws2
the_draws = x.dict_of_elementary_expression(TypeOfElementaryExpression.DRAWS)
types = {name: expression.drawType for name, expression in the_draws.items()}
the_draws_table = my_panel_data.generate_draws(
    types, ['random_draws1', 'random_draws2'], 10
)
print(the_draws_table)
[[[-1.57792232  0.7037037 ]
  [ 0.10870961  0.14814815]
  [ 0.05140378  0.48148148]
  [ 1.800922    0.81481481]
  [-1.85148982  0.25925926]
  [ 0.87938314  0.59259259]
  [ 1.353763    0.92592593]
  [-0.46741631  0.07407407]
  [-1.09546279  0.40740741]
  [-0.09265338  0.74074074]]

 [[ 1.92991243  0.18518519]
  [-0.29388122  0.51851852]
  [-0.49084943  0.85185185]
  [ 0.2439256   0.2962963 ]
  [ 0.42498657  0.62962963]
  [-2.72496968  0.96296296]
  [ 2.0755831   0.01234568]
  [ 0.44793057  0.34567901]
  [-0.13185245  0.67901235]
  [-1.04344227  0.12345679]]]
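
The first dimension of the array is now the number of individuals (2), not the number of observations (3). With 10 draws and 2 random variables, the expected shape is (2, 10, 2).

print(the_draws_table.shape)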

get_number_of_observations: reports the number of observations in the database. Note that it returns the same value irrespective of whether the database contains panel data or not.

my_data.get_number_of_observations()
3
my_panel_data.get_number_of_observations()
3

get_sample_size: reports the size of the sample. If the data is cross-sectional, it is the number of observations in the database. If the data is panel, it is the number of individuals.

my_data.get_sample_size()
3
my_panel_data.get_sample_size()
2

sample_individual_map_with_replacement: extracts a random sample of the individual map from a panel database, with replacement. Useful for bootstrapping.

my_panel_data.sample_individual_map_with_replacement(10)
   0  1
2  2  2
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1
2  2  2
1  0  1

