Transmission Error Dataset Family Reference

Purpose

This document is the canonical reference for the three transmission-error dataset surfaces stored under data/:

data/original_dataset/
data/simplified_dataset/
data/polished_dataset/

The three roots share the same experimental origin but serve different purposes. They must not be treated as interchangeable file formats.

Dataset Lineage

original_dataset
├── simplified_dataset
└── polished_dataset
    └── generated by generate_polished_dataset.py

original_dataset contains the raw test-rig recordings. Both derived datasets come from those recordings:

simplified_dataset is the established legacy curve dataset retained for compatibility and historical artifact reproduction;
polished_dataset is a direct row-level export produced by data/generate_polished_dataset.py.

A complete copy of the generator, adapted to the repository paths, is maintained at scripts/datasets/generate_polished_transmission_error_dataset.py. The two implementations have identical processing logic and differ only in their default path block.

The repository defaults to polished_dataset in config/datasets/transmission_error_dataset.yaml. The shared loader supports both schemas through the polished_dataset and simplified_dataset selectors.

Verified Inventory

The audit performed on June 20, 2026 found:

Dataset	CSV files	Approximate size	Direction representation
`original_dataset`	975	11.498 GiB	Both validity channels in each raw file
`simplified_dataset`	969	2.605 GiB	Forward and backward curves in one CSV
`polished_dataset`	1,938	6.737 GiB	One CSV per direction

The 969 canonical operating conditions are distributed evenly:

323 conditions at nominal 25 degC;
323 conditions at nominal 30 degC;
323 conditions at nominal 35 degC.

polished_dataset therefore contains:

969 files under backward/;
969 files under forward/;
323 files for each direction and nominal temperature combination.
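
The expected counts follow from 3 x 323 = 969 conditions and 2 x 969 = 1,938 directional files. A minimal sketch for re-checking them on disk is shown below. It assumes the direction/temperature/speed folder layout described under Polished Dataset; the function name count_polished_files is hypothetical and is not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_polished_files(root):
    """Count CSV files per (direction, temperature folder) under a polished root."""
    root = Path(root)
    counts = Counter()
    # Assumed layout: <root>/<direction>/<temperature>/<speed>/<file>.csv
    for csv_path in root.glob("*/*/*/*.csv"):
        direction, temperature_folder = csv_path.relative_to(root).parts[:2]
        counts[(direction, temperature_folder)] += 1
    return counts
```

On the audited tree, each of the six (direction, temperature) keys would be expected to report 323 files.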

Original Dataset

Role

Use data/original_dataset/ when preprocessing, validity-window extraction, zeroing, signal interpretation, or provenance must be reconstructed.

Raw Structure

The raw tree is grouped by nominal temperature and motor speed:

data/original_dataset/
  Test_25deg/
    1000rpm/
      1000.0rpm100.0Nm25.0deg.csv
  Test_30deg/
  Test_35deg/

The CSV files are semicolon-delimited and have no header row. The support material in the same directory includes:

Info_DataStructure.pptx, which describes the rig and signal layout;
TE.m, which demonstrates directional filtering and TE computation;
NumDiff.m, which contains the historical MATLAB numerical derivative;
fft_fun.m, which contains the historical harmonic helper.

Raw Columns Used By The Polished Export

Column numbers below are one-based, matching MATLAB and the presentation:

Raw column	Generator name	Meaning
2	`theta_enc_deg`	Cumulative, common-zeroed Renishaw input-side encoder position in degrees
3	`q_enc_deg`	Cumulative, common-zeroed Renishaw output-side encoder position in degrees
4	`tau_load_nm`	Measured load/output-side Manner torque in Nm
5	`valid_fw`	Forward valid-window flag
6	`valid_bw`	Backward valid-window flag
8	`temp_deg_c`	Measured tested-reducer oil temperature in degrees Celsius
11	`q_abs_deg`	Raw absolute Renishaw output-side encoder position in degrees

The presentation identifies the Renishaw devices as absolute encoders, but columns 2 and 3 are cumulative multi-turn signals after common software zeroing. They are not the unchanged single-turn absolute readings.
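
The one-based column numbers translate to zero-based indices when the raw files are read from Python. The sketch below illustrates that mapping for semicolon-delimited, headerless rows. It is not the generator's code: read_raw_columns is a hypothetical helper, and it assumes the numeric fields use a decimal point.

```python
import csv

# One-based raw column numbers, using the generator names from the table above.
RAW_COLUMNS = {
    "theta_enc_deg": 2,
    "q_enc_deg": 3,
    "tau_load_nm": 4,
    "valid_fw": 5,
    "valid_bw": 6,
    "temp_deg_c": 8,
    "q_abs_deg": 11,
}


def read_raw_columns(lines):
    """Parse semicolon-delimited, headerless raw rows into named float lists."""
    columns = {name: [] for name in RAW_COLUMNS}
    for row in csv.reader(lines, delimiter=";"):
        if not row:
            continue  # ignore blank lines
        for name, one_based in RAW_COLUMNS.items():
            # Subtract one to convert the MATLAB-style index to a Python index.
            columns[name].append(float(row[one_based - 1]))
    return columns
```

Any iterable of text lines works as input, including an open file handle.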

Validity Windows

The test procedure runs each operating condition in both motion directions. The PLC activates the corresponding validity channel while the load-side absolute encoder traverses the selected revolution:

raw column 5 selects forward rows;
raw column 6 selects backward rows.

The polished generator accepts every nonzero flag value. It does not merge directions and does not retain transient rows outside the selected windows.
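
The selection rule can be sketched as follows. The helper name split_by_direction is hypothetical; the only behavior taken from the description above is that any nonzero flag selects a row, that the two directions stay separate, and that unflagged rows are dropped.

```python
def split_by_direction(samples, valid_fw, valid_bw):
    """Split time-ordered samples into forward and backward valid windows."""
    # Any nonzero flag value marks the row as valid for that direction.
    forward = [s for s, flag in zip(samples, valid_fw) if flag != 0]
    backward = [s for s, flag in zip(samples, valid_bw) if flag != 0]
    return forward, backward
```

Rows with both flags at zero appear in neither output, which matches the statement that transient rows are not retained.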

Simplified Dataset

Role

data/simplified_dataset/ remains the compatibility source for legacy five-feature training and evaluation workflows.

Structure And Schema

Each operating condition has one comma-delimited CSV:

data/simplified_dataset/
  Test_25degree/
    1000rpm/
      1000.0rpm100.0Nm25.0deg.csv

Each file contains both directions:

Poisition_Output_Reducer_Fw,Transmission_Error_Fw,Position_Output_Reducer_Bw,Transmission_Error_Bw

The misspelling Poisition_Output_Reducer_Fw is present in the source files and is intentionally supported by scripts/datasets/transmission_error_dataset.py.

The current loader turns the 969 files into 1,938 directional curve samples. It parses nominal speed, torque, and temperature from the path, sorts each direction by reducer-output position, and exposes direction as an explicit model feature.
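
A minimal sketch of the per-file split is shown below. It is not the code in scripts/datasets/transmission_error_dataset.py: split_simplified_csv is a hypothetical name, parsing of the nominal metadata from the path is omitted, and skipping empty cells is an assumption about how unequal curve lengths might be stored.

```python
import csv
import io

# Column names as they appear in the source files, including the
# "Poisition" misspelling on the forward position column.
DIRECTION_COLUMNS = {
    "forward": ("Poisition_Output_Reducer_Fw", "Transmission_Error_Fw"),
    "backward": ("Position_Output_Reducer_Bw", "Transmission_Error_Bw"),
}


def split_simplified_csv(csv_text):
    """Return one position-sorted list of (position, TE) pairs per direction."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    curves = {}
    for direction, (position_col, te_col) in DIRECTION_COLUMNS.items():
        points = [
            (float(row[position_col]), float(row[te_col]))
            for row in rows
            # Assumption: cells that are empty or missing are skipped.
            if row.get(position_col) and row.get(te_col)
        ]
        curves[direction] = sorted(points, key=lambda point: point[0])
    return curves
```

Each file therefore yields two samples, one per dictionary key, which is how 969 files become 1,938 directional curves.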

Polished Dataset

Role

data/polished_dataset/ preserves valid time-ordered rows and adds measured speed, load torque, and oil temperature to every exported sample. It is useful for temporal modeling, preprocessing audits, signal-level analysis, and future loaders that need the measured operating state instead of only nominal filename metadata.

Direction-Separated Structure

data/polished_dataset/
  backward/
    25degree/
      1000rpm/
        1000.0rpm100.0Nm25.0deg.csv
  forward/
    25degree/
      1000rpm/
        1000.0rpm100.0Nm25.0deg.csv

Direction is encoded by the top-level folder and is not repeated as a CSV column.

The filename format is:

<nominal_speed>.0rpm<nominal_torque>.0Nm<nominal_temperature>.0deg.csv

Folder and filename values describe nominal test setpoints. The CSV columns contain measured or derived sample-level values and therefore need not equal the nominal values exactly.
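
The nominal setpoints and the direction can be recovered from a path as sketched below. The regular expression follows the filename format above; parse_polished_path and the returned key names are hypothetical, and the sketch assumes POSIX-style separators with the direction folder three levels above the file.

```python
import re
from pathlib import PurePosixPath

# Mirrors <nominal_speed>.0rpm<nominal_torque>.0Nm<nominal_temperature>.0deg.csv
POLISHED_NAME = re.compile(r"(\d+)\.0rpm(\d+)\.0Nm(\d+)\.0deg\.csv")


def parse_polished_path(path_text):
    """Extract direction and nominal setpoints from a polished CSV path."""
    path = PurePosixPath(path_text)
    match = POLISHED_NAME.fullmatch(path.name)
    if match is None or len(path.parts) < 4:
        raise ValueError(f"not a polished dataset path: {path_text}")
    # Layout: .../<direction>/<temperature>/<speed>/<file>.csv
    direction = path.parts[-4]
    if direction not in ("forward", "backward"):
        raise ValueError(f"unexpected direction folder: {direction}")
    speed, torque, temperature = (int(group) for group in match.groups())
    return {
        "direction": direction,
        "nominal_speed_rpm": speed,
        "nominal_torque_nm": torque,
        "nominal_temperature_degc": temperature,
    }
```

The returned values are setpoints only; the measured operating state still has to be read from the CSV columns.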

Verified CSV Schema

Every polished file has exactly:

theta,theta_dot,tau_load,T,theta_TE

Column	Unit	Classification	Verified meaning
`theta`	deg	Derived from a measured position	Input-side cumulative Renishaw angle divided by `81` and wrapped to `[0, 360)`; an output-equivalent reducer angle, not the raw absolute motor position
`theta_dot`	rpm	Derived	Motor/input-side speed calculated from consecutive input-side position samples at `0.25 ms`
`tau_load`	Nm	Measured	Signed load/output-side Manner torque from raw column 4
`T`	degC	Measured	Tested-reducer oil temperature from raw column 8
`theta_TE`	deg	Derived from measured positions	Transmission error after output-side zeroing correction

This corrects two potentially misleading shorthand descriptions:

theta originates from the motor/input-side absolute encoder system, but the exported value is common-zeroed, cumulative, ratio-scaled, and wrapped;
theta_TE is not measured by a dedicated TE sensor. It is calculated from the two measured encoder positions.

Generation Equations

Constants:

gear_ratio = 81
sample_time = 0.00025 s

For rows selected by the relevant direction flag:

theta_rad = radians(input_encoder_cumulative_deg) / gear_ratio
theta = degrees(theta_rad modulo 2*pi)
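
As an illustration, the two expressions above can be written in Python as follows. This is a sketch of the stated formula, not the generator's source; the function name output_equivalent_angle is hypothetical.

```python
import math

GEAR_RATIO = 81


def output_equivalent_angle(theta_enc_deg):
    """Return (theta_rad, theta) for one cumulative input-side angle in degrees."""
    # Scale the cumulative input-side angle down to the reducer output side.
    theta_rad = math.radians(theta_enc_deg) / GEAR_RATIO
    # Python's % with a positive modulus maps negative angles into [0, 2*pi).
    theta = math.degrees(theta_rad % (2 * math.pi))
    return theta_rad, theta
```

For example, a cumulative input angle of 81 x 370 = 29,970 deg corresponds to 370 deg at the output, which wraps to approximately 10 deg. The unwrapped theta_rad is kept because the later speed and TE expressions use it.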

The first theta_dot value uses the difference between the first two selected samples. Every subsequent value uses the current-minus-previous difference:

dtheta_rad_s[0] = (theta_rad[1] - theta_rad[0]) / sample_time
dtheta_rad_s[i] = (theta_rad[i] - theta_rad[i - 1]) / sample_time    for i >= 1
theta_dot[i] = degrees(dtheta_rad_s[i]) / 6 * gear_ratio

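Dividing degrees per second by 6 converts to rpm, since one revolution is 360 deg and one minute is 60 s. Because theta_rad was divided by the gear ratio and the speed expression multiplies it back, theta_dot is a motor/input-side speed in rpm.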
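
A short sketch of this backward-difference speed estimate is given below, assuming theta_rad is the unwrapped, ratio-scaled angle from the previous step. The function name motor_speed_rpm is hypothetical.

```python
import math

GEAR_RATIO = 81
SAMPLE_TIME_S = 0.00025


def motor_speed_rpm(theta_rad):
    """Backward-difference motor speed in rpm from unwrapped output-side angles."""
    if len(theta_rad) < 2:
        raise ValueError("at least two samples are required")
    rates = [
        (theta_rad[i] - theta_rad[i - 1]) / SAMPLE_TIME_S
        for i in range(1, len(theta_rad))
    ]
    # The first sample reuses the difference between the first two samples.
    rates.insert(0, rates[0])
    # rad/s -> deg/s -> rpm (divide by 6), then scale back to the input side.
    return [math.degrees(rate) / 6 * GEAR_RATIO for rate in rates]
```

A motor turning steadily at 1000 rpm advances the output-side angle by 6000 / 81 deg per second, so a constant-step input of that size should return approximately 1000 for every sample.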

The output-side zeroing offset uses the first three raw samples:

raw_offset = radians(mean(q_abs_deg[0:3]) - mean(q_enc_deg[0:3]))
q_offset = atan2(sin(raw_offset), cos(raw_offset))

The generator then applies its retained cluster correction:

if q_offset < -0.002 rad: q_offset += 0.0044 rad
if q_offset >  0.002 rad: q_offset -= 0.00415 rad
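
The offset computation and the cluster correction can be sketched as below. The thresholds and corrections are the constants listed above. The sketch evaluates the two conditions in sequence exactly as written; whether the generator uses independent if statements or an if/elif pair is an assumption here. The function name output_zeroing_offset is hypothetical.

```python
import math


def output_zeroing_offset(q_abs_deg, q_enc_deg):
    """Return the corrected output-side zeroing offset in radians."""
    if len(q_abs_deg) < 3 or len(q_enc_deg) < 3:
        raise ValueError("the offset needs the first three raw samples")
    mean_abs = sum(q_abs_deg[:3]) / 3
    mean_enc = sum(q_enc_deg[:3]) / 3
    raw_offset = math.radians(mean_abs - mean_enc)
    # Wrap the offset into (-pi, pi] so whole revolutions cancel out.
    q_offset = math.atan2(math.sin(raw_offset), math.cos(raw_offset))
    # Cluster correction, applied in the order given in the text.
    if q_offset < -0.002:
        q_offset += 0.0044
    if q_offset > 0.002:
        q_offset -= 0.00415
    return q_offset
```

The atan2 wrap means that a difference of exactly one revolution between the absolute and cumulative readings yields an offset of approximately zero.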

Finally:

q_not_zeroed_rad = radians(output_encoder_cumulative_deg) + q_offset
theta_TE = degrees(q_not_zeroed_rad - theta_rad)
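
For a single sample, the final step can be illustrated as follows. This is a sketch of the stated formula with a hypothetical function name; q_offset_rad is the corrected offset from the previous step.

```python
import math

GEAR_RATIO = 81


def transmission_error_deg(theta_enc_deg, q_enc_deg, q_offset_rad):
    """Transmission error in degrees for one pair of cumulative encoder angles."""
    theta_rad = math.radians(theta_enc_deg) / GEAR_RATIO
    q_not_zeroed_rad = math.radians(q_enc_deg) + q_offset_rad
    return math.degrees(q_not_zeroed_rad - theta_rad)
```

With an input-side angle of 8100 deg (100 deg at the output), an output-side angle of 100.05 deg, and a zero offset, the result is approximately 0.05 deg.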

Direction And Sign Conventions

forward files have positive mean theta_dot;
backward files have negative mean theta_dot;
tau_load is signed measurement data;
filename torque is a nominal nonnegative setpoint magnitude.

For this dataset, forward torque samples commonly carry the opposite sign from backward samples. Consumers must not replace measured tau_load with the unsigned filename value.
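
A consumer can sanity-check a loaded file against the speed convention with a sketch like the one below. The helper name is hypothetical, and the check deliberately uses mean theta_dot rather than torque sign.

```python
from statistics import fmean


def direction_matches_speed_sign(direction, theta_dot_rpm):
    """Check that the mean speed sign agrees with the direction folder name."""
    mean_speed = fmean(theta_dot_rpm)
    if direction == "forward":
        return mean_speed > 0
    if direction == "backward":
        return mean_speed < 0
    raise ValueError(f"unknown direction: {direction}")
```

The direction itself should still come from the folder name; this check only flags files whose contents disagree with it.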

Full-Population Audit Results

All 1,938 polished CSV files were parsed during the June 20, 2026 audit:

expected headers: 1,938 of 1,938;
numeric data rows: 75,585,373;
files with malformed rows: 0;
files with non-finite values: 0;
empty files: 0;
minimum rows in one file: 10,799;
maximum rows in one file: 194,401.
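
The per-file checks behind these figures can be approximated with the sketch below. It is not the audit script itself: audit_polished_csv and its report keys are hypothetical, and the sketch assumes comma-delimited files with the five-column header shown under Verified CSV Schema.

```python
import csv
import math

EXPECTED_HEADER = ["theta", "theta_dot", "tau_load", "T", "theta_TE"]


def audit_polished_csv(lines):
    """Count valid, malformed, and non-finite rows in one polished CSV."""
    reader = csv.reader(lines)
    header = next(reader, None)
    report = {
        "header_ok": header == EXPECTED_HEADER,
        "rows": 0,
        "malformed": 0,
        "non_finite": 0,
    }
    for row in reader:
        if not row:
            continue  # ignore blank lines
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            report["malformed"] += 1
            continue
        if len(values) != len(EXPECTED_HEADER):
            report["malformed"] += 1
        elif not all(math.isfinite(value) for value in values):
            report["non_finite"] += 1
        else:
            report["rows"] += 1
    return report
```

A file is empty in the sense used above when header_ok is true and rows is zero.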

Observed full-population ranges:

Column	Minimum	Maximum
`theta`	`0.0000010596 deg`	`359.9999985254 deg`
`theta_dot`	`-4203.5883 rpm`	`4201.8795 rpm`
`tau_load`	`-1837.3690 Nm`	`1845.9344 Nm`
`T`	`23.85 degC`	`37.97 degC`
`theta_TE`	`-0.12790174 deg`	`0.11253454 deg`

The instantaneous theta_dot extrema show that the numerical derivative can contain excursions beyond the nominal speed. Use the filename for the nominal condition and the column for the measured/derived sample-level speed.

Raw-To-Polished Verification

The raw inventory contains 975 CSV files. The generator explicitly ignores six known duplicate or connection files, leaving 969 source conditions:

0rpm0.0Nm25.0deg1.csv
0rpm100.0Nm25.0deg1.csv
0rpm200.0Nm25.0deg.csv
0rpm100.0Nm30.0deg_collegamento.csv
0rpm100.0Nm30.0degCollegamiento.csv
0rpm100.0Nm30.0degcollegamento2.csv

The retained corrected 800 rpm source uses the _1.csv suffix. Export filenames are normalized and omit that suffix.

A deterministic formula check sampled 27 raw files across all three temperatures and the minimum, median, and maximum speed folders. It compared both directions, covering 54 polished outputs and 4,082,398 rows. The maximum absolute difference was exactly zero for all five exported columns.

Choosing The Correct Dataset

Use original_dataset when:

reconstructing preprocessing or zeroing;
validating DataValid behavior;
inspecting signals not present in the derived datasets;
auditing experimental provenance.

Use simplified_dataset when:

reproducing legacy repository training or TE curve evaluation;
working with one TE curve per direction and operating condition;
relying on the simplified_dataset selector for compatibility with existing configurations.

Use polished_dataset when:

running new repository training through the default selector;
preserving the time order of valid samples;
using measured torque and temperature at sample level;
developing temporal or sequence-aware loaders;
auditing the direct encoder-to-TE transformation.

Reproducing The Polished Export

The standalone script, data/generate_polished_dataset.py, resolves its default paths relative to its own location. Run it with:

python data/generate_polished_dataset.py

The repository-integrated copy uses data/original_dataset/ as input and output/generated_polished_dataset/ as output:

conda run --no-capture-output -n pinns_env python scripts/datasets/generate_polished_transmission_error_dataset.py

Both versions leave existing output files untouched unless OVERWRITE_EXISTING_FILES = True, and both show a tqdm progress bar by default.

Usage Constraints

Do not point the current simplified-dataset loader at polished_dataset; the loader expects the four-column simplified schema, not the five-column polished schema.
Do not infer direction from torque sign; use the forward/ or backward/ path.
Do not interpret filename metadata as measured sample-level values.
Do not call theta the unchanged absolute motor position.
Do not call theta_TE a directly sensed channel.
Preserve the validity-window and zeroing logic when creating future derivations.
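
The first constraint can be enforced with a small header guard before any file is handed to a loader. The sketch below is an illustration only: identify_schema is a hypothetical helper, and it returns the selector names used in this document.

```python
POLISHED_HEADER = ("theta", "theta_dot", "tau_load", "T", "theta_TE")
SIMPLIFIED_HEADER = (
    "Poisition_Output_Reducer_Fw",  # misspelling is present in the source files
    "Transmission_Error_Fw",
    "Position_Output_Reducer_Bw",
    "Transmission_Error_Bw",
)


def identify_schema(header_line):
    """Map a CSV header line to the dataset selector it belongs to."""
    columns = tuple(cell.strip() for cell in header_line.strip().split(","))
    if columns == POLISHED_HEADER:
        return "polished_dataset"
    if columns == SIMPLIFIED_HEADER:
        return "simplified_dataset"
    raise ValueError(f"unrecognized header: {header_line!r}")
```

Raw files have no header row, so passing their first line to this guard raises ValueError rather than silently matching a derived schema.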