Transmission Error Dataset Family Reference

Purpose

This document is the canonical reference for the three transmission-error dataset roots stored under data/:

  • data/original_dataset/

  • data/simplified_dataset/

  • data/polished_dataset/

The three roots share the same experimental origin but serve different purposes. They must not be treated as interchangeable file formats.

Dataset Lineage

original_dataset
├── simplified_dataset
└── polished_dataset
    └── generated by generate_polished_dataset.py

original_dataset contains the raw test-rig recordings. Both derived datasets come from those recordings:

  • simplified_dataset is the established legacy curve dataset retained for compatibility and historical artifact reproduction;

  • polished_dataset is a direct row-level export produced by data/generate_polished_dataset.py.

A complete, path-adapted copy of the generator is maintained in the repository at scripts/datasets/generate_polished_transmission_error_dataset.py. The two implementations have identical processing logic and differ only in their default path block.

The repository defaults to polished_dataset in config/datasets/transmission_error_dataset.yaml. The shared loader supports both schemas through the polished_dataset and simplified_dataset selectors.

Verified Inventory

The audit performed on June 20, 2026 found:

Dataset             CSV files  Approximate size  Direction representation
original_dataset    975        11.498 GiB        Both validity channels in each raw file
simplified_dataset  969        2.605 GiB         Forward and backward curves in one CSV
polished_dataset    1,938      6.737 GiB         One CSV per direction

The 969 canonical operating conditions are distributed evenly:

  • 323 conditions at nominal 25 degC;

  • 323 conditions at nominal 30 degC;

  • 323 conditions at nominal 35 degC.

polished_dataset therefore contains:

  • 969 files under backward/;

  • 969 files under forward/;

  • 323 files for each direction and nominal temperature combination.
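The per-direction and per-temperature counts can be re-checked with a short script. The following is a minimal sketch, not the audit tool itself: the function name count_polished_files is hypothetical, and it assumes the direction/temperature/speed/file layout described in this document.

```python
from collections import Counter
from pathlib import Path


def count_polished_files(root):
    """Count CSV files per (direction, temperature folder) under a polished root."""
    counts = Counter()
    for csv_path in Path(root).rglob("*.csv"):
        # Assumed layout: <root>/<direction>/<temperature>/<speed>/<file>.csv
        parts = csv_path.relative_to(root).parts
        if len(parts) == 4:
            direction, temperature = parts[0], parts[1]
            counts[(direction, temperature)] += 1
    return counts
```

Against the inventory above, each of the six (direction, temperature) keys would be expected to report 323 files, for a total of 1,938.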

Original Dataset

Role

Use data/original_dataset/ when preprocessing, validity-window extraction, zeroing, signal interpretation, or provenance must be reconstructed.

Raw Structure

The raw tree is grouped by nominal temperature and motor speed:

data/original_dataset/
  Test_25deg/
    1000rpm/
      1000.0rpm100.0Nm25.0deg.csv
  Test_30deg/
  Test_35deg/

The CSV files are semicolon-delimited and have no header row. The support material in the same directory includes:

  • Info_DataStructure.pptx, which describes the rig and signal layout;

  • TE.m, which demonstrates directional filtering and TE computation;

  • NumDiff.m, which contains the historical MATLAB numerical derivative;

  • fft_fun.m, which contains the historical harmonic helper.

Raw Columns Used By The Polished Export

Column numbers below are one-based, matching MATLAB and the presentation:

Raw column  Generator name  Meaning
2           theta_enc_deg   Cumulative, common-zeroed Renishaw input-side encoder position in degrees
3           q_enc_deg       Cumulative, common-zeroed Renishaw output-side encoder position in degrees
4           tau_load_nm     Measured load/output-side Manner torque in Nm
5           valid_fw        Forward valid-window flag
6           valid_bw        Backward valid-window flag
8           temp_deg_c      Measured tested-reducer oil temperature in degrees Celsius
11          q_abs_deg       Raw absolute Renishaw output-side encoder position in degrees

The presentation identifies the Renishaw devices as absolute encoders, but columns 2 and 3 are cumulative multi-turn signals after common software zeroing. They are not the unchanged single-turn absolute readings.
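Because the raw files are headerless and the column numbers are one-based, an off-by-one index is an easy mistake when reading them from Python. The sketch below shows one way to map the one-based numbers to named values. The function name read_raw_rows is hypothetical, and the code assumes every field parses with float() using a decimal point.

```python
import csv

# One-based raw column numbers from the table above, keyed by generator name.
RAW_COLUMNS = {
    "theta_enc_deg": 2,
    "q_enc_deg": 3,
    "tau_load_nm": 4,
    "valid_fw": 5,
    "valid_bw": 6,
    "temp_deg_c": 8,
    "q_abs_deg": 11,
}


def read_raw_rows(lines):
    """Yield one dict of named floats per semicolon-delimited raw row."""
    for fields in csv.reader(lines, delimiter=";"):
        if not fields:
            continue  # skip blank lines
        # Subtract 1 to convert MATLAB-style column numbers to Python indices.
        yield {name: float(fields[col - 1]) for name, col in RAW_COLUMNS.items()}
```

The lines argument can be an open file object (opened with newline="") or any iterable of strings.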

Validity Windows

The test procedure runs each operating condition in both motion directions. The PLC activates the corresponding validity channel while the load-side absolute encoder traverses the selected revolution:

  • raw column 5 selects forward rows;

  • raw column 6 selects backward rows.

The polished generator accepts every nonzero flag value. It does not merge directions and does not retain transient rows outside the selected windows.
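The selection rule can be sketched as a simple filter. This is an illustration under stated assumptions, not the generator's code: rows are assumed to be dicts keyed by the generator names valid_fw and valid_bw, and select_direction is a hypothetical helper.

```python
FLAG_BY_DIRECTION = {"forward": "valid_fw", "backward": "valid_bw"}


def select_direction(rows, direction):
    """Keep rows whose flag for the given direction is nonzero, preserving order."""
    flag = FLAG_BY_DIRECTION[direction]
    # Any nonzero flag value counts as valid; rows outside the window are dropped.
    return [row for row in rows if row[flag] != 0]
```

Calling the helper once per direction keeps the two windows separate, which mirrors the rule that directions are never merged.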

Simplified Dataset

Role

data/simplified_dataset/ remains the compatibility source for legacy five-feature training and evaluation workflows.

Structure And Schema

Each operating condition has one comma-delimited CSV:

data/simplified_dataset/
  Test_25degree/
    1000rpm/
      1000.0rpm100.0Nm25.0deg.csv

Each file contains both directions:

Poisition_Output_Reducer_Fw,Transmission_Error_Fw,Position_Output_Reducer_Bw,Transmission_Error_Bw

The misspelling Poisition_Output_Reducer_Fw is present in the source files and is intentionally supported by scripts/datasets/transmission_error_dataset.py.

The current loader turns the 969 files into 1,938 directional curve samples. It parses nominal speed, torque, and temperature from the path, sorts each direction by reducer-output position, and exposes direction as an explicit model feature.
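The split into directional curves can be illustrated as follows. This is a minimal sketch and not the code in scripts/datasets/transmission_error_dataset.py: the function name split_directions is hypothetical, the correctly spelled header is accepted only as a fallback, and the handling of empty cells is an assumption about files whose two curves differ in length.

```python
import csv

# The misspelled header is the one present in the source files.
FORWARD_POSITION_HEADERS = ("Poisition_Output_Reducer_Fw", "Position_Output_Reducer_Fw")


def split_directions(lines):
    """Return {'forward': [(position, te), ...], 'backward': [...]}, sorted by position."""
    reader = csv.DictReader(lines)
    fw_pos = next(h for h in FORWARD_POSITION_HEADERS if h in (reader.fieldnames or []))
    columns = {
        "forward": (fw_pos, "Transmission_Error_Fw"),
        "backward": ("Position_Output_Reducer_Bw", "Transmission_Error_Bw"),
    }
    curves = {"forward": [], "backward": []}
    for row in reader:
        for direction, (pos_col, te_col) in columns.items():
            # Assumption: a shorter curve is padded with empty cells, which are skipped.
            if row[pos_col] and row[te_col]:
                curves[direction].append((float(row[pos_col]), float(row[te_col])))
    return {direction: sorted(points) for direction, points in curves.items()}
```

Each file therefore yields two samples, which is how 969 files become 1,938 directional curves.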

Polished Dataset

Role

data/polished_dataset/ preserves valid time-ordered rows and adds measured speed, load torque, and oil temperature to every exported sample. It is useful for temporal modeling, preprocessing audits, signal-level analysis, and future loaders that need the measured operating state instead of only nominal filename metadata.

Direction-Separated Structure

data/polished_dataset/
  backward/
    25degree/
      1000rpm/
        1000.0rpm100.0Nm25.0deg.csv
  forward/
    25degree/
      1000rpm/
        1000.0rpm100.0Nm25.0deg.csv

Direction is encoded by the top-level folder and is not repeated as a CSV column.

The filename format is:

<nominal_speed>.0rpm<nominal_torque>.0Nm<nominal_temperature>.0deg.csv

Folder and filename values describe nominal test setpoints. The CSV columns contain measured or derived sample-level values and therefore need not equal the nominal values exactly.
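A loader that needs the nominal setpoints can recover them from the path. The sketch below is one possible parser, with parse_polished_path and the returned key names chosen here for illustration. It assumes POSIX-style paths given relative to data/polished_dataset/.

```python
import re
from pathlib import PurePosixPath

NUMBER = r"\d+(?:\.\d+)?"
FILENAME_PATTERN = re.compile(
    rf"^(?P<speed>{NUMBER})rpm(?P<torque>{NUMBER})Nm(?P<temperature>{NUMBER})deg\.csv$"
)


def parse_polished_path(relative_path):
    """Return the direction and nominal setpoints encoded in a polished path."""
    path = PurePosixPath(relative_path)
    direction = path.parts[0]  # direction comes from the top-level folder
    if direction not in ("forward", "backward"):
        raise ValueError(f"unexpected direction folder: {direction!r}")
    match = FILENAME_PATTERN.match(path.name)
    if match is None:
        raise ValueError(f"unexpected filename: {path.name!r}")
    return {
        "direction": direction,
        "nominal_speed_rpm": float(match["speed"]),
        "nominal_torque_nm": float(match["torque"]),
        "nominal_temperature_deg_c": float(match["temperature"]),
    }
```

The returned values are setpoints only; measured speed, torque, and temperature still have to be read from the CSV columns.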

Verified CSV Schema

Every polished file has exactly:

theta,theta_dot,tau_load,T,theta_TE

Column     Unit  Classification                    Verified meaning
theta      deg   Derived from a measured position  Input-side cumulative Renishaw angle divided by 81 and wrapped to [0, 360); an output-equivalent reducer angle, not the raw absolute motor position
theta_dot  rpm   Derived                           Motor/input-side speed calculated from consecutive input-side position samples at 0.25 ms
tau_load   Nm    Measured                          Signed load/output-side Manner torque from raw column 4
T          degC  Measured                          Tested-reducer oil temperature from raw column 8
theta_TE   deg   Derived from measured positions   Transmission error after output-side zeroing correction

The schema above corrects two potentially misleading shorthand descriptions:

  • theta originates from the motor/input-side absolute encoder system, but the exported value is common-zeroed, cumulative, ratio-scaled, and wrapped;

  • theta_TE is not measured by a dedicated TE sensor. It is calculated from the two measured encoder positions.

Generation Equations

Constants:

gear_ratio = 81
sample_time = 0.00025 s

For rows selected by the relevant direction flag:

theta_rad = radians(input_encoder_cumulative_deg) / gear_ratio
theta = degrees(theta_rad modulo 2*pi)

The first theta_dot value uses the difference between the first two selected samples. Every subsequent value uses the current-minus-previous difference:

dtheta_rad_s[i] = (theta_rad[i] - theta_rad[i - 1]) / sample_time
theta_dot[i] = degrees(dtheta_rad_s[i]) / 6 * gear_ratio

Because theta_rad was divided by the gear ratio and the speed expression multiplies it back, theta_dot is a motor/input-side speed in rpm.
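The theta and theta_dot equations above can be transcribed into plain Python as a sketch. The function name compute_theta_and_speed is hypothetical, the input is assumed to hold only the rows already selected for one direction, and at least two samples are required.

```python
import math

GEAR_RATIO = 81
SAMPLE_TIME_S = 0.00025


def compute_theta_and_speed(input_encoder_cumulative_deg):
    """Return (theta in deg, theta_dot in rpm) for the selected rows of one direction."""
    theta_rad = [math.radians(v) / GEAR_RATIO for v in input_encoder_cumulative_deg]
    theta = [math.degrees(t % (2 * math.pi)) for t in theta_rad]
    theta_dot = []
    for i in range(len(theta_rad)):
        # The first sample reuses the difference between the first two samples.
        j = max(i, 1)
        dtheta_rad_s = (theta_rad[j] - theta_rad[j - 1]) / SAMPLE_TIME_S
        # deg/s divided by 6 gives rpm; the gear ratio maps back to the input side.
        theta_dot.append(math.degrees(dtheta_rad_s) / 6 * GEAR_RATIO)
    return theta, theta_dot
```

As a worked check, an input-side step of 1.5 deg per 0.25 ms sample is 6000 deg/s, which corresponds to 1000 rpm at the motor.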

The output-side zeroing offset uses the first three raw samples:

raw_offset = radians(mean(q_abs_deg[0:3]) - mean(q_enc_deg[0:3]))
q_offset = atan2(sin(raw_offset), cos(raw_offset))

The generator then applies its retained cluster correction:

if q_offset < -0.002 rad: q_offset += 0.0044 rad
if q_offset >  0.002 rad: q_offset -= 0.00415 rad

Finally:

q_not_zeroed_rad = radians(output_encoder_cumulative_deg) + q_offset
theta_TE = degrees(q_not_zeroed_rad - theta_rad)
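The offset and transmission-error equations can be sketched in the same way. Both function names are hypothetical. The two threshold conditions are applied in the order listed above, which is an assumption about how the generator sequences them, and theta_rad here is the unwrapped input angle already divided by the gear ratio.

```python
import math
from statistics import mean


def compute_q_offset(q_abs_deg, q_enc_deg):
    """Wrapped output-side offset in rad from the first three raw samples."""
    raw_offset = math.radians(mean(q_abs_deg[:3]) - mean(q_enc_deg[:3]))
    # Wrap the offset into (-pi, pi].
    q_offset = math.atan2(math.sin(raw_offset), math.cos(raw_offset))
    # Retained cluster correction, with thresholds and shifts in radians.
    if q_offset < -0.002:
        q_offset += 0.0044
    if q_offset > 0.002:
        q_offset -= 0.00415
    return q_offset


def compute_theta_te(output_encoder_cumulative_deg, theta_rad, q_offset):
    """Return theta_TE in degrees for paired output positions and unwrapped theta_rad."""
    return [
        math.degrees(math.radians(q) + q_offset - t)
        for q, t in zip(output_encoder_cumulative_deg, theta_rad)
    ]
```

Note that the offset is computed from the first three raw samples of the file, while theta_TE is evaluated only on the rows selected for one direction.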

Direction And Sign Conventions

  • forward files have positive mean theta_dot;

  • backward files have negative mean theta_dot;

  • tau_load is signed measurement data;

  • filename torque is a nominal nonnegative setpoint magnitude.

For this dataset, forward torque samples commonly carry the opposite sign from backward samples. Consumers must not replace measured tau_load with the unsigned filename value.
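The speed-sign convention lends itself to a quick sanity check when loading a file. The helper below is a hypothetical sketch: it takes the direction from the folder name and tests only the sign of the mean theta_dot, deliberately ignoring torque.

```python
from statistics import mean


def speed_sign_matches_direction(theta_dot, direction):
    """True when mean theta_dot is positive for forward or negative for backward."""
    if direction not in ("forward", "backward"):
        raise ValueError(f"unexpected direction: {direction!r}")
    average = mean(theta_dot)
    return average > 0 if direction == "forward" else average < 0
```

A False result would suggest a misplaced file or a path-parsing error rather than a reason to relabel the direction from the data.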

Full-Population Audit Results

All 1,938 polished CSV files were parsed during the June 20, 2026 audit:

  • expected headers: 1,938 of 1,938;

  • numeric data rows: 75,585,373;

  • files with malformed rows: 0;

  • files with non-finite values: 0;

  • empty files: 0;

  • minimum rows in one file: 10,799;

  • maximum rows in one file: 194,401.
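A per-file version of these checks might look like the sketch below. It is not the audit script itself: the function name audit_polished_lines and the report keys are chosen here for illustration, and a row counts as malformed when it does not contain exactly five values that parse with float().

```python
import csv
import math

EXPECTED_HEADER = ["theta", "theta_dot", "tau_load", "T", "theta_TE"]


def audit_polished_lines(lines):
    """Return header status, valid row count, and malformed and non-finite row counts."""
    reader = csv.reader(lines)
    header = next(reader, None)
    report = {"header_ok": header == EXPECTED_HEADER, "rows": 0, "malformed": 0, "non_finite": 0}
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        try:
            values = [float(field) for field in fields]
        except ValueError:
            report["malformed"] += 1
            continue
        if len(values) != len(EXPECTED_HEADER):
            report["malformed"] += 1
        elif not all(math.isfinite(v) for v in values):
            report["non_finite"] += 1
        else:
            report["rows"] += 1
    return report
```

Passing an open file object (opened with newline="") for each polished CSV and summing the reports would give population-level totals comparable to the list above.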

Observed full-population ranges:

Column     Minimum           Maximum
theta      0.0000010596 deg  359.9999985254 deg
theta_dot  -4203.5883 rpm    4201.8795 rpm
tau_load   -1837.3690 Nm     1845.9344 Nm
T          23.85 degC        37.97 degC
theta_TE   -0.12790174 deg   0.11253454 deg

The instantaneous theta_dot extrema show that the numerical derivative can contain excursions beyond the nominal speed. Use the filename for the nominal condition and the column for the measured/derived sample-level speed.

Raw-To-Polished Verification

The raw inventory contains 975 CSV files. The generator explicitly ignores six known duplicate or connection files, leaving 969 source conditions:

200.0rpm0.0Nm25.0deg1.csv
200.0rpm100.0Nm25.0deg1.csv
800.0rpm200.0Nm25.0deg.csv
1100.0rpm100.0Nm30.0deg_collegamento.csv
1600.0rpm100.0Nm30.0degCollegamiento.csv
1600.0rpm100.0Nm30.0degcollegamento2.csv

The retained corrected 800 rpm source uses the _1.csv suffix. Export filenames are normalized and omit that suffix.
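The ignore list and the suffix normalization can be sketched as one small helper. The function name export_filename is hypothetical, and the code assumes the retained file carries a literal _1 immediately before the .csv extension; the exact retained filename is not stated in this document.

```python
# The six raw files the generator is documented to ignore.
IGNORED_RAW_FILES = frozenset({
    "200.0rpm0.0Nm25.0deg1.csv",
    "200.0rpm100.0Nm25.0deg1.csv",
    "800.0rpm200.0Nm25.0deg.csv",
    "1100.0rpm100.0Nm30.0deg_collegamento.csv",
    "1600.0rpm100.0Nm30.0degCollegamiento.csv",
    "1600.0rpm100.0Nm30.0degcollegamento2.csv",
})


def export_filename(raw_name):
    """Return None for ignored raw files, otherwise the normalized export filename."""
    if raw_name in IGNORED_RAW_FILES:
        return None
    if raw_name.endswith("_1.csv"):
        # Assumption: the corrected source differs only by this suffix.
        return raw_name[: -len("_1.csv")] + ".csv"
    return raw_name
```

With 975 raw files and six ignored names, this filter leaves the 969 source conditions reported in the inventory.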

A deterministic formula check sampled 27 raw files across all three temperatures and the minimum, median, and maximum speed folders. It recomputed both directions of each file, covering 54 polished outputs and 4,082,398 rows. The maximum absolute difference was exactly zero for all five exported columns.

Choosing The Correct Dataset

Use original_dataset when:

  • reconstructing preprocessing or zeroing;

  • validating DataValid (validity-flag) behavior;

  • inspecting signals not present in the derived datasets;

  • auditing experimental provenance.

Use simplified_dataset when:

  • reproducing legacy repository training or TE curve evaluation;

  • working with one TE curve per direction and operating condition;

  • relying on the simplified_dataset selector and its loader compatibility.

Use polished_dataset when:

  • running new repository training through the default selector;

  • preserving the time order of valid samples;

  • using measured torque and temperature at sample level;

  • developing temporal or sequence-aware loaders;

  • auditing the direct encoder-to-TE transformation.

Reproducing The Polished Export

The standalone script resolves its default paths relative to its own location:

data/generate_polished_dataset.py

Run it with:

python data/generate_polished_dataset.py

The repository-integrated copy uses data/original_dataset/ as input and output/generated_polished_dataset/ as output:

conda run --no-capture-output -n pinns_env python scripts/datasets/generate_polished_transmission_error_dataset.py

Both versions show a tqdm progress bar by default and leave existing output files untouched unless OVERWRITE_EXISTING_FILES = True.
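The overwrite protection amounts to a guard evaluated before each file is written. The following sketch shows the idea only: should_write is a hypothetical helper, and how the generator actually consults OVERWRITE_EXISTING_FILES is not documented here.

```python
from pathlib import Path

# Mirrors the documented flag name; False is assumed to be the protective setting.
OVERWRITE_EXISTING_FILES = False


def should_write(output_path, overwrite=OVERWRITE_EXISTING_FILES):
    """Return True when the output is missing or overwriting is explicitly enabled."""
    return overwrite or not Path(output_path).exists()
```

Under this guard, rerunning an interrupted export would skip files that already exist and write only the missing ones.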

Usage Constraints

  • Do not point the current simplified-dataset loader at polished_dataset; it expects the four-column simplified schema, not the five-column polished schema.

  • Do not infer direction from torque sign; use the forward/ or backward/ path.

  • Do not interpret filename metadata as measured sample-level values.

  • Do not call theta the unchanged absolute motor position.

  • Do not call theta_TE a directly sensed channel.

  • Preserve the validity-window and zeroing logic when creating future derivations.