biogeme.database.container module

DataContainer: Responsible for holding and safely manipulating the Biogeme dataset stored as a Pandas DataFrame.

Michel Bierlaire Wed Mar 26 19:30:57 2025

class biogeme.database.container.Database(name, dataframe, use_jit=True)

Bases: object

Encapsulates a pandas DataFrame for Biogeme, providing safe access and basic operations such as checking for emptiness, scaling, and column manipulation.

Parameters:
  • name (str)

  • dataframe (pd.DataFrame)

  • use_jit (bool)

DefineVariable(name, expression)

Warning

This function is deprecated. Use define_variable() instead.

This method evaluates a Biogeme expression row by row on the database and creates a new column in the internal DataFrame with the results.

Parameters:
  • name (str) – Name of the new column to be added.

  • expression (Expression) – Biogeme expression to evaluate for each row.

Return type:

Variable

add_column(column, values)

Adds a new column to the dataset

Parameters:
  • column (str) – name of the new column

  • values (Series) – a pandas Series with the same length as the data

Raises:

ValueError – if column already exists or lengths mismatch

Return type:

None
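The two error conditions documented for add_column() can be illustrated on a plain pandas DataFrame. This is a minimal sketch, not the Biogeme implementation: the helper name add_column_checked is hypothetical, and assigning the values by position (ignoring the index of the Series) is an assumption.

```python
import pandas as pd


def add_column_checked(df: pd.DataFrame, column: str, values: pd.Series) -> None:
    """Hypothetical stand-in mirroring the documented contract of add_column()."""
    if column in df.columns:
        raise ValueError(f"Column '{column}' already exists")
    if len(values) != len(df):
        raise ValueError(
            f"Length mismatch: {len(values)} values for {len(df)} rows"
        )
    # Assumption: values are assigned by position, not aligned on the index.
    df[column] = values.to_numpy()


df = pd.DataFrame({"travel_time": [12.0, 30.5, 7.25]})
add_column_checked(df, "cost", pd.Series([2.5, 4.0, 1.0]))

# Adding the same column again violates the contract.
try:
    add_column_checked(df, "cost", pd.Series([0.0, 0.0, 0.0]))
except ValueError as error:
    print(f"Rejected: {error}")
```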

bootstrap_sample()

Returns a bootstrap sample of the data.
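In the usual statistical sense, a bootstrap sample contains as many rows as the original dataset, drawn with replacement. The sketch below illustrates that definition only. Whether bootstrap_sample() follows exactly this scheme is an assumption, and the helper bootstrap_indices is hypothetical.

```python
import random


def bootstrap_indices(n_rows, seed=None):
    """Draw n_rows row positions with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rows = ["r0", "r1", "r2", "r3", "r4"]
sample_idx = bootstrap_indices(len(rows), seed=42)

# Some rows may appear several times, others not at all.
sample = [rows[i] for i in sample_idx]
print(sample)
```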

column_exists(column)

Check if a column exists in the data

Return type:

bool

Parameters:

column (str)

property data_jax: Array

Returns the data as a JAX array.

property dataframe: DataFrame

Returns a reference to the internal DataFrame.

define_variable(name, expression)

This method evaluates a Biogeme expression row by row on the database and creates a new column in the internal DataFrame with the results.

Parameters:
  • name (str) – Name of the new column to be added.

  • expression (Expression) – Biogeme expression to evaluate for each row.

Return type:

Variable

classmethod dummy_database()
Return type:

Database

extract_rows(rows)

Extracts selected rows from the database.

Parameters:

rows (list[int]) – list of rows to extract

Return type:

Database

Returns:

the new database with the selected rows.

extract_slice(indices)

Create a new Database instance containing only a subset of the data.

This is useful to maintain consistency across estimation and validation datasets by slicing the original draws array according to the provided indices.

Parameters:

indices (Index) – The indices used to extract the subset of draws.

Return type:

Database

Returns:

A new Database instance containing the sliced draws.

generate_segmentation(variable, mapping=None, reference=None)

Generate a segmentation tuple for a variable.

Parameters:
  • variable (Variable | str) – Variable object or name of the variable

  • mapping (dict[int, str] | None) – mapping associating values of the variable to names. If incomplete, default names are provided.

  • reference (str | None) – name of the reference category. If None, an arbitrary category is selected as reference.

Return type:

DiscreteSegmentationTuple

get_column(column)

Returns the values of a column

Return type:

Series

Parameters:

column (str)

is_empty()

Returns True if the data container is empty

Return type:

bool

is_panel()
Return type:

bool

num_columns()

Returns the number of columns in the dataset

Return type:

int

num_rows()

Returns the number of rows in the dataset

Return type:

int

panel(column_name)
Parameters:

column_name (str)

register_listener(callback)
Parameters:

callback (Callable[[Index], None])

remove(exclude_condition)

Removes rows from the database that satisfy a given condition.

This method evaluates a Biogeme expression row by row on the database. All rows where the expression evaluates to a truthy value are removed.

Parameters:

exclude_condition (Expression | float | int | bool) – A Biogeme expression that returns a boolean-like value for each row in the dataset. Rows where the result is True (nonzero) will be excluded.

remove_column(column)

Removes a column from the dataset

Parameters:

column (str)

remove_rows(condition)

Removes all rows where the condition is True

Parameters:

condition (Series) – Boolean Series with the same length as the data
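The effect of a Boolean condition can be previewed directly in pandas before it is passed to remove_rows(). The sketch below assumes that the removal is equivalent to keeping the rows where the condition is False. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [15, 32, 47, 12],
        "choice": [1, 2, 1, 3],
    }
)

# One Boolean entry per row: True marks a row to be removed.
condition = df["age"] < 18

# Assumption: removing the flagged rows amounts to keeping the complement.
kept = df[~condition]
print(kept)
```

Checking len(condition) == len(df) beforehand is a cheap guard, since the documentation requires a Series with the same length as the data.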

reset_indices()
Return type:

None

scale_column(column, scale)

Scales all values in a given column

Parameters:
  • column (str) – name of the column to scale

  • scale (float) – scalar to multiply the column values by

Raises:

BiogemeError – if the column is not found

suggest_scaling(columns=None, report_all=False)

Suggest a scaling of the variables in the database.

For each column, \(\delta\) is the difference between the largest and the smallest value, or one if that difference is smaller than one. The order of magnitude of \(\delta\) is evaluated as a power of 10, and the suggested scale is the inverse of that power:

\[s = \frac{1}{10^{|\log_{10} \delta|}}\]

where \(|x|\) denotes the integer closest to \(x\) (rounding, not the absolute value).

Parameters:
  • columns (list[str] | None) – list of columns to be considered. If None, all of them will be considered.

  • report_all (bool) – if False, remove entries where the suggested scale is 1, 0.1 or 10

Return type:

DataFrame

Returns:

A pandas DataFrame where each row contains the name of a variable and its suggested scale s. Ideally, the corresponding column should be multiplied by s.

Raises:

BiogemeError – if a variable in columns is unknown.
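The scaling rule can be checked by hand with a few lines of standard Python. This is a sketch of the formula as documented, not the library code: the function name suggested_scale is hypothetical, and Python's round() resolves exact .5 ties to the nearest even integer, which may differ from how Biogeme treats ties.

```python
import math


def suggested_scale(values):
    """Sketch of the documented rule: s = 1 / 10**round(log10(delta))."""
    delta = max(values) - min(values)
    if delta < 1:
        delta = 1  # differences smaller than one are treated as one
    magnitude = round(math.log10(delta))  # integer closest to log10(delta)
    return 1 / 10**magnitude


income = [0, 12_500, 25_000]  # delta = 25000, log10(delta) is about 4.4
share = [0.2, 0.45, 0.7]      # delta = 0.5, replaced by 1

print(suggested_scale(income))  # 0.0001
print(suggested_scale(share))   # 1.0
```

With report_all=False, a column such as share would not appear in the report, because its suggested scale is 1.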

verify_segmentation(segmentation)

Verifies if the definition of the segmentation is consistent with the data

Parameters:

segmentation (DiscreteSegmentationTuple) – definition of the segmentation

Raises:

BiogemeError – if the segmentation is not consistent with the data.

Return type:

None

biogeme.database.container.logger = <Logger biogeme.database.container (DEBUG)>

Logger that controls the output of messages to the screen and log file.