Add deterministic_output option#1559
Conversation
jbarlow83
left a comment
There was a problem hiding this comment.
Another thing that needs to be done is to arrange for a DeterministicExecutor. It should work like
StandardExecutor._execute, except it needs to ensure the task_finished function is called in task order, not in completion order.
In practical terms this means e.g. even if some later pages finish OCR earlier, finalizing the page will be held back and done in order. (This is a MapReduce; we need to enforce ordering on the Reduce.) This will ensure the output is ordered in a consistent way.
It will be slower than StandardExecutor so it should not be the default.
| set_pikepdf_as_editor=False, update_docinfo=False, strict=False | ||
| ) as meta_original, | ||
| pdf.open_metadata() as meta_pdf, | ||
| pdf.open_metadata(set_pikepdf_as_editor=not pdf_save_settings.get("deterministic_id", False)) as meta_pdf, |
There was a problem hiding this comment.
We should set pikepdf as editor here. It's still deterministic.
| text_xobj_name = Name.random(prefix="OCR-") | ||
| if self.context.options.deterministic_output: | ||
| # Use a stable name per page for deterministic output | ||
| text_xobj_name = Name(f"/OCR-{page_num:06d}") |
There was a problem hiding this comment.
I should bind QPDFObjectHandle::getUniqueResourceName in pikepdf to make this easier.
We can't count on any particular name being available (in particular, if OCRmyPDF is being reused on the same file in some weird way, or if the input file is merged from multiple files generated by OCRmyPDF).
Without have getUniqueResourceName, just make a prefix, appending a counter, and keep incrementing until there's no conflict with an existing name.
| pdfmark['/Producer'] = f'pikepdf {PIKEPDF_VERSION}' | ||
| pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc)) | ||
| if not options.deterministic_output: | ||
| pdfmark['/ModDate'] = encode_pdf_date(datetime.now(timezone.utc)) |
There was a problem hiding this comment.
In addition to leaving ModDate undisturbed, we should use the environment variables used to control deterministic builds when compiling.
https://blog.conan.io/2019/09/02/Deterministic-builds-with-C-C++.html
In brief, if set ModDate to SOURCE_DATE_EPOCH which would be an environment variable holding a Unix epoch timestamp (seconds since 1970). If ZERO_AR_DATE is 1, then set ModDate to 0 (1970). That allows the user to control the timestamp.
refs #864
Note that this is partially written by GPT-5.
Is this the right approach? Should I add tests for other documents?