Hello,
First of all, thank you for your excellent work on MonoDiff9D and for making the code publicly available! I am a graduate student researching 6D pose estimation and have been trying to reproduce the results from your paper.
I've encountered a puzzling situation that I hope you could help clarify.
The Core Issue:
Your provided pretrained checkpoint (epoch_210.pth) works perfectly. When I run the evaluation script (test.py) with your checkpoint, I can successfully reproduce the high-quality results reported in your paper for the REAL275 dataset.
However, when I train the model from scratch using the exact same codebase and configuration, my trained models consistently fail to match the reported translation accuracy. The rotation metric (10°) comes out nearly on par with your results, but all translation-related metrics are significantly lower. This happens whether I train for 210 epochs or for 300+.
Here are the results from my self-trained model, for comparison with your paper's Table I (REAL275):
- 3D IoU@50: 21.5
- 3D IoU@75: 3.2
- 10 cm: 28.8
- 10°: 59.0
- 10°, 10 cm: 15.8
Verified Steps:
Environment: My setup is built precisely from your environment.yaml file. Since your checkpoint runs correctly, my environment and evaluation pipeline should be correct.
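For completeness, this is how I recreated the environment (I assume a conda-style environment.yaml; the environment name below is a placeholder, not necessarily the one declared in your file):

```shell
# Recreate the environment exactly from the repo's spec
conda env create -f environment.yaml
# Activate it -- "monodiff9d" is a placeholder; use the name declared in the yaml
conda activate monodiff9d
```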
Data Integrity: I generated the depth maps using the prescribed DINOv2-NYU head. I performed a rigorous numerical comparison between my generated .npy files and the samples you provided. The results confirmed they are highly consistent (correlation > 0.9999, max absolute difference ~1e-2).
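In case it is useful, this is roughly the check I ran on each pair of depth maps; the helper `compare_depth_maps` is my own sketch, not something from your codebase:

```python
import numpy as np

def compare_depth_maps(mine_path, ref_path):
    """Compare two depth maps stored as .npy files.

    Returns the Pearson correlation and the maximum absolute
    difference -- the two statistics quoted above.
    """
    mine = np.load(mine_path).astype(np.float64).ravel()
    ref = np.load(ref_path).astype(np.float64).ravel()
    corr = np.corrcoef(mine, ref)[0, 1]
    max_abs_diff = float(np.max(np.abs(mine - ref)))
    return corr, max_abs_diff
```

Running this over my generated files against your provided samples gave correlation > 0.9999 and a max absolute difference on the order of 1e-2 in every case.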
My Question:
Could you please advise if there are any crucial details about the training procedure that might differ from the public code? For instance, specific learning rate schedules, weight initializations, or other hyperparameters that were used to train the successful epoch_210.pth model?
Any insight you could provide would be incredibly helpful in understanding this gap.
Thank you again for your time and for this fantastic contribution to the community!