Sampling

Module in charge of functionalities related to the sampling of alternatives. The code is organized into sub-modules.

biogeme.sampling_of_alternatives.choice_set_generation module

Module in charge of functionalities related to the choice set generation

For the main sample, all alternatives except the last one must be used: 0 to J-1. For MEV models, the approximation of the sum capturing the nests requires another sample not based on the choice.

author:

Michel Bierlaire

date:

Fri Oct 27 12:50:06 2023

class biogeme.sampling_of_alternatives.choice_set_generation.ChoiceSetsGeneration(context)[source]

Bases: object

Class in charge of generating the choice sets for each individual.

Parameters:

context (SamplingContext) –

__init__(context)[source]

Constructor

Parameters:

context (SamplingContext) – contains all the information that is needed to perform the sampling of alternatives.

define_new_variables(database)[source]

Create the new variables

Parameters:

database (Database) – database, in Biogeme format.

get_attributes_from_expression(expression)[source]

Extract the names of the attributes of alternatives from an expression

Return type:

set[str]

Parameters:

expression (Expression) –

process_row(individual_row)[source]

Process one row of the individual database

Parameters:

individual_row (Series) – row corresponding to one individual

Return type:

dict

Returns:

a dictionary containing the data for the extended row

sample_and_merge(recycle=False)[source]

Loops over the individuals and generates a choice set for each of them

Parameters:

recycle (bool) – if True and the data file already exists, it is not re-created.

Return type:

Database

Returns:

database for Biogeme

biogeme.sampling_of_alternatives.generate_model module

Generation of models estimated with samples of alternatives

author:

Michel Bierlaire

date:

Fri Sep 22 12:14:59 2023

class biogeme.sampling_of_alternatives.generate_model.GenerateModel(context)[source]

Bases: object

Class in charge of generating the Biogeme expression for the log likelihood function

Parameters:

context (SamplingContext) –

__init__(context)[source]

Constructor

Parameters:

context (SamplingContext) – contains all the information that is needed to perform the sampling of alternatives.

generate_utility(prefix, suffix)[source]

Generate the utility function for one alternative

Parameters:
  • prefix (str) – prefix to add to the attributes

  • suffix (str) – suffix to add to the attributes

Return type:

Expression

get_cross_nested_logit()[source]

Returns the expression for the log likelihood of the cross-nested logit model

Return type:

Expression

get_logit()[source]

Returns the expression for the log likelihood of the logit model

Return type:

Expression

get_nested_logit(nests)[source]

Returns the expression for the log likelihood of the nested logit model

Parameters:

nests (NestsForNestedLogit) – A tuple containing as many items as nests. Each item is also a tuple containing two items:

  • an object of type biogeme.expressions.expr.Expression representing the nest parameter,

  • a list containing the identifiers of the alternatives belonging to the nest.

Example:

nesta = MUA, [1, 2, 3]
nestb = MUB, [4, 5, 6]
nests = nesta, nestb

Return type:

Expression

biogeme.sampling_of_alternatives.sampling_context module

Defines a class that characterizes the context in which sampling of alternatives is applied

author:

Michel Bierlaire

date:

Wed Sep 6 14:38:31 2023

class biogeme.sampling_of_alternatives.sampling_context.CrossVariableTuple(name, formula)[source]

Bases: NamedTuple

A cross variable is a variable that involves socio-economic attributes of the individuals, and attributes of the alternatives. It can only be calculated after the sampling has been made.

Parameters:
formula: Expression

Alias for field number 1

name: str

Alias for field number 0

class biogeme.sampling_of_alternatives.sampling_context.SamplingContext(the_partition, sample_sizes, individuals, choice_column, alternatives, id_column, biogeme_file_name, utility_function, combined_variables, mev_partition=None, mev_sample_sizes=None, cnl_nests=None)[source]

Bases: object

Class gathering the data needed to perform an estimation with samples of alternatives

Parameters:
  • the_partition (Partition) – partition used for the sampling.

  • sample_sizes (Iterable[int]) – number of alternatives to draw from each segment.

  • individuals (DataFrame) – Pandas data frame containing all the individuals as rows. One column must contain the choice of each individual.

  • choice_column (str) – name of the column containing the choice of each individual.

  • alternatives (DataFrame) – Pandas data frame containing all the alternatives as rows. One column must contain a unique ID identifying the alternatives. The other columns contain variables to include in the data file.

  • id_column (str) – name of the column containing the IDs of the alternatives.

  • utility_function (Expression) – definition of the generic utility function.

  • combined_variables (list[CrossVariableTuple]) – definition of interaction variables.

  • mev_partition (Optional[Partition]) – If a second choice set needs to be sampled for the MEV terms, the corresponding partition is provided here.

  • biogeme_file_name (str) –

  • mev_sample_sizes (Iterable[int] | None) –

  • cnl_nests (NestsForCrossNestedLogit | None) –

__eq__(other)

Return self==value.

__init__(the_partition, sample_sizes, individuals, choice_column, alternatives, id_column, biogeme_file_name, utility_function, combined_variables, mev_partition=None, mev_sample_sizes=None, cnl_nests=None)
Parameters:
  • the_partition (Partition) –

  • sample_sizes (Iterable[int]) –

  • individuals (DataFrame) –

  • choice_column (str) –

  • alternatives (DataFrame) –

  • id_column (str) –

  • biogeme_file_name (str) –

  • utility_function (Expression) –

  • combined_variables (list[CrossVariableTuple]) –

  • mev_partition (Partition | None) –

  • mev_sample_sizes (Iterable[int] | None) –

  • cnl_nests (NestsForCrossNestedLogit | None) –

Return type:

None

alternatives: DataFrame
biogeme_file_name: str
check_expression(expression)[source]

Verifies if the variables contained in the expression can be found in the databases

Return type:

None

Parameters:

expression (Expression) –

check_mev_partition()[source]

Check if the partition is a partition of the MEV alternatives. It does not need to cover the full choice set

Return type:

None

check_partition()[source]

Check if the partition is truly a partition. If not, an exception is raised

Raises:
  • BiogemeError – if some elements are present in more than one subset.

  • BiogemeError – if the size of the union of the subsets does not match the expected total size

  • BiogemeError – if an alternative in the partition does not appear in the database of alternatives

  • BiogemeError – if a segment is empty

  • BiogemeError – if the number of sampled alternatives in a stratum is incorrect, that is, zero or larger than the stratum size.

Return type:

None
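The checks listed above can be illustrated with a small standalone sketch. It is not the Biogeme implementation: the function name check_partition_sketch is hypothetical, plain Python sets stand in for the Partition object, and ValueError stands in for BiogemeError.

```python
def check_partition_sketch(subsets, sample_sizes, all_ids):
    """Validate strata against the full set of alternative IDs.

    Hypothetical helper for illustration only. Raises ValueError
    (a stand-in for BiogemeError) on the first failed check.
    """
    seen = set()
    for subset, size in zip(subsets, sample_sizes):
        if not subset:
            raise ValueError("a segment is empty")
        overlap = seen & subset
        if overlap:
            raise ValueError(f"present in more than one subset: {sorted(overlap)}")
        unknown = subset - all_ids
        if unknown:
            raise ValueError(f"not in the database of alternatives: {sorted(unknown)}")
        if size <= 0 or size > len(subset):
            raise ValueError(f"invalid sample size {size} for a stratum of {len(subset)}")
        seen |= subset
    # Every alternative must be covered by exactly one subset.
    if len(seen) != len(all_ids):
        raise ValueError("the union of the subsets does not match the total size")


# A valid configuration passes silently.
check_partition_sketch([{1, 2}, {3}], [1, 1], {1, 2, 3})
```

The sketch stops at the first violation. Whether the library reports violations in this order is not stated in this reference.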

check_valid_alternatives(set_of_ids)[source]
Check if the IDs in the set are indeed valid alternatives. Typically used to check if a nest is well defined.

Parameters:

set_of_ids (set[int]) – set of identifiers to check

Raises:

BiogemeError – if at least one id is invalid.

Return type:

None

choice_column: str
cnl_nests: Optional[NestsForCrossNestedLogit] = None
combined_variables: list[CrossVariableTuple]
id_column: str
individuals: DataFrame
mev_partition: Optional[Partition] = None
mev_sample_sizes: Optional[Iterable[int]] = None
reporting()[source]

Summarizes the configuration specified by the context object.

Return type:

None

sample_sizes: Iterable[int]
the_partition: Partition
utility_function: Expression
class biogeme.sampling_of_alternatives.sampling_context.StratumTuple(subset, sample_size)[source]

Bases: NamedTuple

A stratum is an element of a partition of the full choice set, combined with the number of alternatives that must be sampled.

Parameters:
  • subset (set[int]) –

  • sample_size (int) –

sample_size: int

Alias for field number 1

subset: set[int]

Alias for field number 0

biogeme.sampling_of_alternatives.sampling_of_alternatives module

Module in charge of functionalities related to the sampling of alternatives

author:

Michel Bierlaire

date:

Thu Sep 7 10:14:54 2023

class biogeme.sampling_of_alternatives.sampling_of_alternatives.SamplingOfAlternatives(context)[source]

Bases: object

Class dealing with the various methods needed to estimate models with samples of alternatives

Parameters:

context (SamplingContext) –

__init__(context)[source]

Constructor

Parameters:

context (SamplingContext) – contains all the information that is needed to perform the sampling of alternatives.

sample_alternatives(chosen)[source]

Performs the sampling of alternatives

Parameters:

chosen (int) – ID of the chosen alternative, that must be included in the choice set.

Return type:

DataFrame

Returns:

data frame containing a sample of alternatives. The first one is the chosen alternative

Raises:

BiogemeError – if the chosen alternative is unknown.
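As a rough illustration of stratified sampling that always keeps the chosen alternative and places it first, consider the sketch below. It is written under assumptions and does not reproduce the library's code: Stratum mirrors StratumTuple, sample_with_chosen is a hypothetical name, a list of IDs is returned instead of a DataFrame, ValueError stands in for BiogemeError, and letting the chosen alternative occupy one slot of its own stratum is an assumption made for the example.

```python
import random
from typing import NamedTuple


class Stratum(NamedTuple):
    """Local stand-in for StratumTuple: a subset and its sample size."""

    subset: set
    sample_size: int


def sample_with_chosen(strata, chosen, rng):
    """Return sampled IDs, with the chosen alternative in first position."""
    if not any(chosen in stratum.subset for stratum in strata):
        raise ValueError(f"unknown alternative: {chosen}")
    sampled = [chosen]
    for stratum in strata:
        if chosen in stratum.subset:
            # Assumption: the chosen alternative uses one slot of its stratum.
            pool = sorted(stratum.subset - {chosen})
            sampled += rng.sample(pool, stratum.sample_size - 1)
        else:
            sampled += rng.sample(sorted(stratum.subset), stratum.sample_size)
    return sampled


strata = [Stratum({1, 2, 3}, 2), Stratum({4, 5, 6, 7}, 2)]
choice_set = sample_with_chosen(strata, chosen=5, rng=random.Random(0))
```

Here choice_set contains four distinct IDs, two from each stratum, and starts with 5.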

sample_mev_alternatives()[source]

Performs the sampling of alternatives for the MEV terms. Here, the chosen alternative is ignored.

Return type:

DataFrame

Returns:

data frame containing a sample of alternatives

biogeme.sampling_of_alternatives.sampling_of_alternatives.generate_segment_size(sample_size, number_of_segments)[source]

Calculates the size of each segment so that the sizes are as close to each other as possible and sum to the full sample size.

Parameters:
  • sample_size (int) – total size of the sample

  • number_of_segments (int) – number of segments

Returns:

list of length number_of_segments, containing the segment sizes

Return type:

list[int]
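One common way to obtain such sizes is integer division with the remainder spread over the first segments. The sketch below shows that idea only. The name segment_sizes_sketch is hypothetical, and which segments receive the extra unit in the library's own function is not specified in this reference.

```python
def segment_sizes_sketch(sample_size, number_of_segments):
    """Split sample_size into number_of_segments sizes differing by at most 1."""
    base, remainder = divmod(sample_size, number_of_segments)
    # The first `remainder` segments each receive one extra element.
    return [base + 1 if i < remainder else base for i in range(number_of_segments)]


sizes = segment_sizes_sketch(10, 3)
```

With a sample of 10 and 3 segments, the sketch yields [4, 3, 3], which sums to 10.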