Shared Training Infrastructure

This page documents the reusable infrastructure that resolves TE experiment identity, prepares immutable artifact folders, initializes shared Lightning components, and maintains family/program registries.

Shared training utilities for TE run identity, artifacts, and registries.

class scripts.training.shared_training_infrastructure.ExperimentIdentity(model_family, model_type, run_name)[source]

Logical experiment identity resolved from a training configuration.

Parameters:
  • model_family (str)

  • model_type (str)

  • run_name (str)
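
The class is a small immutable value object over three strings. A minimal stand-in sketch, assuming standard frozen-dataclass semantics (the field values shown are illustrative, not a real configuration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentIdentity:
    """Logical experiment identity resolved from a training configuration."""
    model_family: str
    model_type: str
    run_name: str

identity = ExperimentIdentity(
    model_family="mlp",
    model_type="point_regression",
    run_name="baseline_v1",
)
print(identity.run_name)  # → baseline_v1
```

Because the dataclass is frozen, attempting to reassign a field after construction raises an error, which matches the "resolved identity" contract: once resolved, the identity does not change.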

class scripts.training.shared_training_infrastructure.ModelParameterSummary(backbone_name, trainable_parameter_count, frozen_parameter_count, total_parameter_count)[source]

Trainable, frozen, and total parameter counts for one backbone.

Parameters:
  • backbone_name (str)

  • trainable_parameter_count (int)

  • frozen_parameter_count (int)

  • total_parameter_count (int)
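
The three counts can be derived by partitioning a model's parameters on their `requires_grad` flag. A sketch with stand-in parameter objects (the real code presumably iterates a torch `nn.Module`'s parameters, where `numel()` is a method; here `numel` is a plain attribute on a stand-in):

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass(frozen=True)
class ModelParameterSummary:
    backbone_name: str
    trainable_parameter_count: int
    frozen_parameter_count: int
    total_parameter_count: int

def summarize_parameters(backbone_name, parameters):
    # Partition element counts on the requires_grad flag.
    trainable = sum(p.numel for p in parameters if p.requires_grad)
    frozen = sum(p.numel for p in parameters if not p.requires_grad)
    return ModelParameterSummary(
        backbone_name=backbone_name,
        trainable_parameter_count=trainable,
        frozen_parameter_count=frozen,
        total_parameter_count=trainable + frozen,
    )

# Stand-ins for tensor parameters: numel is the element count.
params = [
    SimpleNamespace(numel=1024, requires_grad=True),
    SimpleNamespace(numel=256, requires_grad=False),
]
summary = summarize_parameters("mlp_backbone", params)
print(summary.total_parameter_count)  # → 1280
```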

class scripts.training.shared_training_infrastructure.RunArtifactIdentity(artifact_kind, model_family, run_name, run_instance_id, output_directory)[source]

Physical artifact identity used for immutable output folders.

Parameters:
  • artifact_kind (str)

  • model_family (str)

  • run_name (str)

  • run_instance_id (str)

  • output_directory (Path)

scripts.training.shared_training_infrastructure.load_training_config(config_path=DEFAULT_CONFIG_PATH)[source]

Load and validate a YAML training configuration.

Parameters:

config_path (str | Path) – Absolute or project-relative configuration path.

Returns:

Parsed training configuration dictionary.

Return type:

dict[str, Any]
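
The "absolute or project-relative" contract can be honored with `pathlib` before any parsing happens. A hedged sketch of just the path-resolution step (the helper name is hypothetical, `Path.cwd()` stands in for however the module locates the project root, and the YAML parsing itself is elided):

```python
from pathlib import Path

# Assumption: the current working directory stands in for the project root.
PROJECT_ROOT = Path.cwd()

def resolve_config_path(config_path):
    """Accept an absolute path as-is; anchor a relative one at the project root."""
    path = Path(config_path)
    return path if path.is_absolute() else PROJECT_ROOT / path

print(resolve_config_path("/tmp/train.yaml"))     # absolute path: returned unchanged
print(resolve_config_path("configs/train.yaml"))  # relative path: anchored at PROJECT_ROOT
```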

scripts.training.shared_training_infrastructure.resolve_experiment_identity(training_config)[source]

Resolve the logical experiment identity from the training config.

Parameters:

training_config (dict[str, Any]) – Parsed training configuration dictionary.

Returns:

Normalized model family, model type, and run name.

Return type:

ExperimentIdentity
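
Normalization here plausibly means trimming whitespace and lower-casing the configured names. A sketch under that assumption, returning a plain tuple where the module returns an `ExperimentIdentity` (the config keys shown are illustrative, not the module's actual schema):

```python
def resolve_identity_fields(training_config):
    """Pull family/type/run name out of the config and normalize them."""
    def normalize(value):
        return str(value).strip().lower()

    model_section = training_config.get("model", {})
    return (
        normalize(model_section.get("family", "")),
        normalize(model_section.get("type", "")),
        normalize(training_config.get("run", {}).get("name", "")),
    )

config = {"model": {"family": " MLP ", "type": "Point"}, "run": {"name": "Baseline_V1"}}
print(resolve_identity_fields(config))  # → ('mlp', 'point', 'baseline_v1')
```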

scripts.training.shared_training_infrastructure.prepare_output_artifact_training_config(training_config, artifact_kind=RUN_OUTPUT_ARTIFACT_KIND, run_name_suffix=None, run_instance_id=None)[source]

Attach output-artifact metadata to a cloned training configuration.

Parameters:
  • training_config (dict[str, Any]) – Source training configuration.

  • artifact_kind (str) – Artifact family such as training run or validation check.

  • run_name_suffix (str | None) – Optional suffix appended to the logical run name.

  • run_instance_id (str | None) – Optional explicit immutable run instance identifier.

Returns:

Cloned training configuration enriched with artifact metadata under the metadata section.

Return type:

dict[str, Any]
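
The clone-and-enrich step can be sketched with `copy.deepcopy` plus a generated instance id. The key names under `metadata`, the id format, and the config shape are all assumptions for illustration:

```python
import copy
import uuid
from datetime import datetime, timezone

def prepare_output_artifact_config(training_config, artifact_kind="training_run",
                                   run_name_suffix=None, run_instance_id=None):
    """Return a deep copy so the caller's config is never mutated."""
    enriched = copy.deepcopy(training_config)
    run_name = enriched.get("run", {}).get("name", "run")
    if run_name_suffix:
        run_name = f"{run_name}_{run_name_suffix}"
    if run_instance_id is None:
        # Timestamp plus a short random token keeps artifact folders unique.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        run_instance_id = f"{stamp}-{uuid.uuid4().hex[:8]}"
    enriched.setdefault("metadata", {})["artifact"] = {
        "artifact_kind": artifact_kind,
        "run_name": run_name,
        "run_instance_id": run_instance_id,
    }
    return enriched

source = {"run": {"name": "baseline"}}
prepared = prepare_output_artifact_config(source, run_name_suffix="smoke")
print(prepared["metadata"]["artifact"]["run_name"])  # → baseline_smoke
print("metadata" in source)                          # → False (source untouched)
```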

scripts.training.shared_training_infrastructure.resolve_run_artifact_identity(training_config)[source]

Resolve the full physical artifact identity for one prepared config.

Parameters:

training_config (dict[str, Any])

Return type:

RunArtifactIdentity
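
One plausible immutable-folder layout derived from the identity fields; the `outputs` root and the `kind/family/run__instance` scheme are assumptions, not the module's documented layout:

```python
from pathlib import Path

def build_output_directory(artifact_kind, model_family, run_name, run_instance_id,
                           outputs_root=Path("outputs")):
    """Compose kind/family/run__instance so every run instance gets a fresh folder."""
    return outputs_root / artifact_kind / model_family / f"{run_name}__{run_instance_id}"

path = build_output_directory("training_run", "mlp", "baseline",
                              "20240101T000000Z-ab12cd34")
print(path)  # outputs/training_run/mlp/baseline__20240101T000000Z-ab12cd34
```

Encoding the instance id into the leaf folder name is what makes the folder immutable in practice: a re-run with a fresh id can never collide with an earlier artifact.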

scripts.training.shared_training_infrastructure.create_datamodule_from_training_config(training_config)[source]

Instantiate the TE LightningDataModule from the training config.

Parameters:

training_config (dict[str, Any])

Return type:

TransmissionErrorDataModule

scripts.training.shared_training_infrastructure.create_regression_backbone_from_training_config(training_config, input_feature_dim)[source]

Instantiate the regression backbone declared in the config.

Parameters:
  • training_config (dict[str, Any]) – Parsed training configuration dictionary.

  • input_feature_dim (int) – Input dimension resolved from the prepared dataset.

Returns:

Configured regression backbone.

Return type:

nn.Module

scripts.training.shared_training_infrastructure.create_regression_module_from_training_config(training_config, regression_backbone, input_feature_dim, target_feature_dim, normalization_statistics)[source]

Wrap a configured backbone in the shared Lightning regression module.

Parameters:
  • training_config (dict[str, Any])

  • regression_backbone (Module)

  • input_feature_dim (int)

  • target_feature_dim (int)

  • normalization_statistics (NormalizationStatistics)

Return type:

TransmissionErrorRegressionModule

scripts.training.shared_training_infrastructure.initialize_training_components(training_config)[source]

Build the datamodule, backbone, module, and normalization bundle.

Parameters:

training_config (dict[str, Any]) – Parsed training configuration dictionary.

Returns:

Fully initialized training components ready for fit or validation work.

Return type:

tuple[TransmissionErrorDataModule, nn.Module, TransmissionErrorRegressionModule, NormalizationStatistics]

scripts.training.shared_training_infrastructure.fetch_first_batch(datamodule, split_name='train')[source]

Fetch the first batch from one requested dataloader split.

Parameters:
  • datamodule (TransmissionErrorDataModule) – Initialized datamodule providing the split dataloaders.

  • split_name (str) – Name of the dataloader split to fetch from; defaults to 'train'.

Return type:

dict[str, Any]
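
Fetching the first batch typically reduces to `next(iter(loader))`. A sketch with a plain-list stand-in for the dataloader; the split-to-loader mapping and the batch keys are assumptions:

```python
def fetch_first_batch(datamodule, split_name="train"):
    """Look up the requested split's dataloader and pull one batch from it."""
    loader_factories = {
        "train": datamodule.train_dataloader,
        "val": datamodule.val_dataloader,
        "test": datamodule.test_dataloader,
    }
    if split_name not in loader_factories:
        raise ValueError(f"unknown split: {split_name!r}")
    return next(iter(loader_factories[split_name]()))

class FakeDataModule:
    # Stand-in: real dataloaders yield collated batch dictionaries.
    def train_dataloader(self):
        return [{"inputs": [[0.0, 1.0]], "targets": [[0.5]]}]
    val_dataloader = train_dataloader
    test_dataloader = train_dataloader

batch = fetch_first_batch(FakeDataModule(), "train")
print(sorted(batch))  # → ['inputs', 'targets']
```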

scripts.training.shared_training_infrastructure.validate_batch_dictionary(batch_dictionary, input_feature_dim, target_feature_dim)[source]

Validate the structural contract of a collated point batch.

Parameters:
  • batch_dictionary (dict[str, Any]) – Batch emitted by the datamodule collate function.

  • input_feature_dim (int) – Expected final input feature dimension.

  • target_feature_dim (int) – Expected final target feature dimension.

Returns:

Small structural summary of the validated batch.

Return type:

dict[str, Any]
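
The structural contract can be checked by asserting key presence and trailing feature dimensions. A stdlib sketch with nested lists in place of tensors (the batch keys are illustrative):

```python
def validate_batch_dictionary(batch_dictionary, input_feature_dim, target_feature_dim):
    """Check keys and final feature dimensions, then return a small summary."""
    for key, expected_dim in (("inputs", input_feature_dim),
                              ("targets", target_feature_dim)):
        if key not in batch_dictionary:
            raise KeyError(f"batch is missing required key {key!r}")
        rows = batch_dictionary[key]
        if any(len(row) != expected_dim for row in rows):
            raise ValueError(f"{key!r} rows must have final dimension {expected_dim}")
    return {
        "batch_size": len(batch_dictionary["inputs"]),
        "input_feature_dim": input_feature_dim,
        "target_feature_dim": target_feature_dim,
    }

batch = {"inputs": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], "targets": [[1.0], [2.0]]}
print(validate_batch_dictionary(batch, 3, 1))
# → {'batch_size': 2, 'input_feature_dim': 3, 'target_feature_dim': 1}
```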

scripts.training.shared_training_infrastructure.build_common_metrics_snapshot(training_config, config_path, output_directory, datamodule, parameter_summary, runtime_config, best_model_path, validation_metric_list, test_metric_list)[source]

Build the canonical metrics snapshot stored with a training artifact.

Parameters:
  • training_config (dict[str, Any])

  • config_path (str | Path)

  • output_directory (Path)

  • datamodule (TransmissionErrorDataModule)

  • parameter_summary (ModelParameterSummary)

  • runtime_config (dict[str, object])

  • best_model_path (str)

  • validation_metric_list (list[dict[str, object]])

  • test_metric_list (list[dict[str, object]])

Return type:

dict[str, object]

scripts.training.shared_training_infrastructure.save_yaml_snapshot(snapshot_dictionary, output_path)[source]

Persist one YAML snapshot to disk, creating parent folders as needed.

Parameters:
  • snapshot_dictionary (dict[str, Any])

  • output_path (Path)

Return type:

None
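
The persist step is mkdir-then-write. A sketch using `json.dump` as a stand-in serializer to stay stdlib-only; the real function writes YAML:

```python
import json
import tempfile
from pathlib import Path

def save_snapshot(snapshot_dictionary, output_path):
    """Create missing parent folders, then serialize the snapshot to disk."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(snapshot_dictionary, handle, indent=2, sort_keys=True)

with tempfile.TemporaryDirectory() as tmp:
    # Parent folder "artifacts" does not exist yet; save_snapshot creates it.
    target = Path(tmp) / "artifacts" / "metrics.json"
    save_snapshot({"val_loss": 0.042}, target)
    print(target.exists())  # → True
```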

scripts.training.shared_training_infrastructure.save_training_config_snapshot(training_config, output_directory)[source]

Persist the effective training configuration inside an artifact folder.

Parameters:
  • training_config (dict[str, Any])

  • output_directory (Path)

Return type:

None

scripts.training.shared_training_infrastructure.save_common_metrics_snapshot(metrics_snapshot_dictionary, output_directory)[source]

Persist the common metrics snapshot inside an artifact folder.

Parameters:
  • metrics_snapshot_dictionary (dict[str, Any])

  • output_directory (Path)

Return type:

None

scripts.training.shared_training_infrastructure.update_family_registry(metrics_snapshot_dictionary)[source]

Update the family leaderboard and latest-family-best snapshots.

Parameters:

metrics_snapshot_dictionary (dict[str, Any]) – Common metrics snapshot for one completed training artifact.

Returns:

Selected best entry for the model family after update.

Return type:

dict[str, Any]
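
A family leaderboard update usually means inserting the new snapshot and re-selecting the best entry. A sketch under that assumption, with an in-memory registry dict, a lower-is-better selection metric, and illustrative key names (the real registry is persisted to disk):

```python
def update_family_registry(registry, metrics_snapshot, metric_name="val_loss"):
    """Append the snapshot to its family's list and return the family's best entry."""
    family = metrics_snapshot["model_family"]
    entries = registry.setdefault(family, [])
    entries.append(metrics_snapshot)
    # Lower is better for a loss-style selection metric.
    return min(entries, key=lambda entry: entry[metric_name])

registry = {}
update_family_registry(registry, {"model_family": "mlp", "run_name": "a", "val_loss": 0.08})
best = update_family_registry(registry, {"model_family": "mlp", "run_name": "b", "val_loss": 0.05})
print(best["run_name"])  # → b
```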

scripts.training.shared_training_infrastructure.update_program_registry(best_registry_entry)[source]

Update the program-wide best-solution registry entry.

Parameters:

best_registry_entry (dict[str, Any])

Return type:

dict[str, Any]

scripts.training.shared_training_infrastructure.save_run_metadata_snapshot(training_config, output_directory)[source]

Persist the resolved artifact identity inside an output directory.

Parameters:
  • training_config (dict[str, Any])

  • output_directory (Path)

Return type:

None