Conversation
a4f6689 to
2641629
Compare
WalkthroughAdds time-bounded indexing and optional async execution across the backend: new CLI flags ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
8bdd083 to
d086f36
Compare
i think some docstrings, about a counter, were outdated i add more details to help understand the logic
d086f36 to
0aa2509
Compare
0aa2509 to
c3a0cb2
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves the document indexing workflow by adding time-bounded batch selection, crash-safe checkpointing for restartability, and making bulk indexing runnable asynchronously via Celery (including a Django admin trigger).
Changes:
- Added
lower_time_bound/upper_time_boundfiltering to build an indexable document queryset. - Updated the indexer batching logic to support crash-safe checkpointing (cache last successful
updated_at). - Made the
indexmanagement command enqueue a Celery batch task and added an admin page to trigger indexing.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| src/backend/core/utils.py | Adds build_indexable_documents_queryset() to filter documents by updated/deleted timestamps. |
| src/backend/core/tests/test_utils.py | Adds test coverage for the new time-bound queryset builder. |
| src/backend/core/tests/test_services_search_indexers.py | Updates tests to the new index(queryset, ...) signature. |
| src/backend/core/tests/conftest.py | Adds helper to create documents with controlled updated_at values for time-bound tests. |
| src/backend/core/tests/commands/test_index.py | Expands command tests for bounds, crash-safe mode, and checkpoint recovery behavior. |
| src/backend/core/templates/runindexing.html | Introduces an admin HTML form to trigger indexing with parameters. |
| src/backend/core/tasks/search.py | Removes old search indexing tasks module. |
| src/backend/core/tasks/indexing.py | Adds new Celery tasks module for single/batch indexing + debounce/throttle. |
| src/backend/core/signals.py | Switches signal-triggered indexing to the new tasks module. |
| src/backend/core/services/search_indexers.py | Adds checkpoint constant, updates batching implementation, and extends index() API with crash-safe mode. |
| src/backend/core/models.py | Adds RunIndexing model used to expose the admin entry point. |
| src/backend/core/migrations/0032_runindexing.py | Creates the RunIndexing model via migration. |
| src/backend/core/management/commands/index.py | Updates the command to accept bounds/crash-safe mode and enqueue the Celery batch task. |
| src/backend/core/admin.py | Registers RunIndexing admin and implements the POST handler to call the command. |
| docs/commands/index.md | Adds documentation for CLI/admin/Make usage and parameters. |
| Makefile | Extends make index to pass optional parameters through to manage.py index. |
| CHANGELOG.md | Adds an entry describing the indexing command improvements. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
94b921a to
da5556a
Compare
030084f to
980befa
Compare
980befa to
f71cbe0
Compare
There was a problem hiding this comment.
Actionable comments posted: 18
♻️ Duplicate comments (1)
src/backend/core/admin.py (1)
19-50: 🧹 Nitpick | 🔵 TrivialMonkey-patching
admin.site.get_app_listis fragile; prefer a customAdminSiteor an admin index template override.This rebinds
get_app_liston the defaultAdminSiteinstance at import time, which:
- Silently breaks if another app tries the same trick (last import wins).
- Is sensitive to import ordering (if
admin.pyis imported before all other apps register, the original captured reference could change semantics in edge cases).- Injects a pseudo-"app" that Django’s permission framework never checked —
has_module_perms: Trueis set unconditionally, so the menu item shows even for staff users who have no business running indexing.Consider subclassing
AdminSite, or simply linking to/admin/core/run-indexing/from a smalladmin/index.htmltemplate override guarded by a permission check. At minimum, gate thecommands_appentry onrequest.user.is_superuser(or a specific permission) so non-privileged staff don't see a button they (hopefully, per the fix above) cannot use.♻️ Minimal hardening
def custom_get_app_list(self, request, app_label=None): """Add custom commands to the app list.""" app_list = _original_get_app_list(request, app_label) + if not request.user.is_superuser: + return app_list + # Add Commands app with Run Indexing command🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/core/admin.py` around lines 19 - 50, The current monkey-patch of admin.site.get_app_list in custom_get_app_list is fragile and exposes a Commands entry to all staff; change this by either (A) creating a custom AdminSite subclass and installing it instead of rebinding admin.site.get_app_list, or (B) at minimum gate the commands_app addition inside custom_get_app_list on a permission check (e.g., request.user.is_superuser or a specific permission) so the pseudo-app is only appended for authorized users; update references to custom_get_app_list, admin.site.get_app_list and the commands_app dict accordingly and remove the import-time rebinding if you implement the AdminSite subclass approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 17-20: Move the changelog entry "✨(backend) improve indexing
command" out of the dated v4.8.3 section and place it under the top-level
"[Unreleased]" heading; specifically, locate the block containing "✨(backend)
improve indexing command" with its subpoints "checkpoint recovery",
"asynchronicity", and "admin command trigger" and cut/paste it beneath the
"[Unreleased]" header so the entry is no longer listed under the v4.8.3 dated
release.
In `@docs/commands/index.md`:
- Around line 64-65: Update the documentation text to use the correct field name
`updated_at` (underscore) instead of `updated.at`; specifically edit the
sentence that describes saving the last document's timestamp into the
`bulk-indexer-checkpoint` cache and its use as the `lower-time-bound` so it
reads "updated_at" where the field name is referenced.
- Line 5: The top-level sentence in docs/commands/index.md is misleading; update
the description to state the command runs synchronously by default and only
dispatches to Celery when the --async flag is passed (matching the argparse
default async_mode=False in the Command implementation in
src/backend/core/management/commands/index.py); reference the --async flag and
the Command class/async_mode setting to ensure docs and code behavior are
consistent.
- Around line 22-28: Update the example CLI invocation to use the correct flag
name: replace the incorrect `--async_mode` with `--async` so it matches the
argparse definition (`dest="async_mode"`); ensure any other examples showing the
async flag use `--async` (not `--async_mode`) to avoid "unrecognized arguments"
errors when running the `index` command.
- Around line 34-36: Fix the broken ordered list and inconsistent button label:
update the numbered steps in docs/commands/index.md so they read 1, 2, 3, 4
(insert the missing step 2 or renumber sequentially) and make the button text in
the docs match the actual submit value used in the template (runindexing.html) —
either change the docs to say "Run Indexing" or update the template's submit
value to "Run Indexing Command" so both the markdown steps and the template
submit button value are consistent.
- Around line 57-60: Replace the incorrect heading `## --async_mode` with the
correct CLI flag name and heading level by changing it to `### --async` to match
other parameter headings; fix the description line ("Runs asynchronously is
async_mode==True.") to a clear grammatical sentence such as "Runs asynchronously
when the --async flag is set." and keep the type as "Boolean flag" and default
"False" unchanged.
In `@src/backend/core/admin.py`:
- Around line 274-286: After validating RunIndexingForm, stop re-reading
request.POST and use form.cleaned_data for batch_size, lower_time_bound, and
upper_time_bound when calling call_command("index"); replace
int(request.POST.get("batch_size")) and
convert_to_isoformat(request.POST.get(...)) with the form.cleaned_data values
(e.g., form.cleaned_data["batch_size"], form.cleaned_data["lower_time_bound"],
form.cleaned_data["upper_time_bound"]) so you rely on the form's coercion and
cross-field validation; remove or stop using convert_to_isoformat for these
parameters if the form already returns the correct datetime/typed values.
- Around line 271-291: The run_indexing_view currently lacks any
authentication/permission checks allowing unauthenticated users to trigger
indexing; restrict access by decorating or checking permissions before
processing POST: add a decorator such as `@staff_member_required` or
`@login_required` plus a permission check (e.g. user.is_superuser or Django's
`@permission_required`("core.run_indexing")) on run_indexing_view (or enforce
inside the function by validating request.user and returning
HttpResponseForbidden), ensure the RunIndexingForm handling remains the same,
and import the appropriate decorators (or permission utilities) and use them on
run_indexing_view to prevent unauthorized POSTs.
- Around line 287-291: The view currently renders the template directly after
handling the POST which allows form resubmission on refresh; change the POST
handling to follow Post/Redirect/Get: after calling messages.success(request,
_("Indexing triggered!")) perform a redirect (e.g., return
redirect(request.path) or return redirect(reverse(...))) instead of falling
through to template rendering, and likewise after messages.error(...) redirect
back to the form page so both the success and error POST branches return a
redirect; also ensure django.shortcuts.redirect (and django.urls.reverse if
used) are imported and that RunIndexingForm() remains used for GET.
In `@src/backend/core/forms.py`:
- Around line 33-40: The cross-field validation in the form's clean() currently
raises forms.ValidationError which creates a non-field error; change it to
attach the error to the specific field by calling
self.add_error("upper_time_bound", ...) when lower_time_bound >
upper_time_bound, and wrap the message with gettext_lazy (use _(...)) to keep
i18n consistent; update the clean method where lower_time_bound and
upper_time_bound are compared to use self.add_error instead of raise.
In `@src/backend/core/management/commands/index.py`:
- Around line 65-76: Wrap the async dispatch call to
batch_document_indexer_task.apply_async in a try/except block (similar to the
sync branch) so enqueue failures are caught and re-raised as a clean
CommandError; specifically, catch the concrete broker-related exceptions (e.g.,
kombu.exceptions.OperationalError) or a general Exception around the apply_async
call, log the failure via logger.error, and raise CommandError with a clear
message including the original exception details; keep the logger.info("Document
indexing task sent to worker") only on success.
In `@src/backend/core/models.py`:
- Around line 875-892: The time-bounded filter in filter_updated_at() currently
checks only updated_at and ancestors_deleted_at, which misses restored documents
because restore() clears/moves ancestors_deleted_at without bumping updated_at;
change the restore path (restore()/descendant restore logic) to also update a
field that the filter can use (either increment updated_at or set a dedicated
indexation timestamp like index_updated_at on soft-delete and restore), and
update filter_updated_at() to include that indexation timestamp in its Q
conditions so restored documents fall inside bounded re-index runs.
In `@src/backend/core/services/search_indexers.py`:
- Around line 163-167: The code only logs the checkpoint when crash_safe_mode is
true; instead persist the last successful checkpoint value
serialized_batch[-1]["updated_at"] into a durable cache/store so the indexing
run can resume; modify the block that currently calls logger.info(...) (the
crash_safe_mode branch) to also write that timestamp into your caching layer
(e.g., cache.set("search_indexer:checkpoint",
serialized_batch[-1]["updated_at"]) or use the existing cache client in this
module/service) and ensure any error writing the checkpoint is handled/logged so
it won't break the indexing loop.
In `@src/backend/core/templates/runindexing.html`:
- Around line 11-13: The paragraph text in runindexing.html ("This command
triggers the indexing of all documents within the specified time bound.") is a
hardcoded user-facing string and must be wrapped for translation to match the
template's i18n convention (the submit button already uses {% translate %});
update the <p> element to use either {% translate %} or {% blocktranslate %} so
the text is internationalized (keep {% blocktranslate %} if you later add
punctuation/formatting).
In `@src/backend/core/tests/commands/test_index.py`:
- Around line 169-175: The test calls call_command("index", ...) with mixed
types for time bounds—lower_time_bound passed as a datetime
(documents[3].updated_at) while upper_time_bound is a string (.isoformat())—so
make the test realistic by passing both bounds as strings just like the CLI
would; update the call_command invocation in the test to convert
lower_time_bound to a string (e.g., documents[3].updated_at.isoformat()) so both
lower_time_bound and upper_time_bound are ISO-formatted strings and the mocked
FindDocumentIndexer.push behavior remains unchanged.
In `@src/backend/core/tests/conftest.py`:
- Around line 34-37: The test fixture is mutating global logger propagation
(logging.getLogger(logger_name).propagate = True) without restoring it, which
can leak into other tests; change the fixture to record the original propagate
values for each logger from settings.LOGGING, set propagate True for the test,
yield caplog, and then in a finally/teardown restore each logger's propagate to
its saved value so original state is preserved (reference settings.LOGGING,
logging.getLogger(logger_name).propagate, and caplog).
In `@src/backend/core/tests/test_models_documents.py`:
- Around line 1709-1787: The tests (e.g.,
test_models_documents_manager_time_filter_oldest_updated_at,
test_models_documents_manager_time_filter_newest_updated_at,
test_models_documents_manager_time_filter_both_bounds, and
test_utils_build_indexable_documents_queryset_ancestors_deleted_at) currently
construct naive datetimes with datetime(...) and pass them to
DocumentFactory(updated_at=...) and
models.Document.objects.filter_updated_at(...); update those datetime(...) calls
to produce timezone-aware datetimes (for example datetime(...,
tzinfo=timezone.utc) or use django.utils.timezone.make_aware) so the
DateTimeField comparisons are consistent across timezones; ensure every
lower_time_bound and upper_time_bound and any updated_at / ancestors_deleted_at
factory args are changed accordingly and imports adjusted (timezone or
make_aware) where used.
In `@src/backend/impress/urls.py`:
- Around line 15-19: The URL for run_indexing_view is currently registered under
"admin/core/run-indexing/" which risks colliding with the Django admin app
namespace; move the route out of the `core` namespace by changing the
urlpatterns entry that references run_indexing_view so it uses a non-conflicting
path such as "admin/run-indexing/" (update the urlpatterns list containing
run_indexing_view and admin.site.urls), or alternatively register the view via a
custom AdminSite.get_urls() override so the endpoint is properly namespaced
under the admin site (refer to run_indexing_view and admin.site.urls when making
the change).
---
Duplicate comments:
In `@src/backend/core/admin.py`:
- Around line 19-50: The current monkey-patch of admin.site.get_app_list in
custom_get_app_list is fragile and exposes a Commands entry to all staff; change
this by either (A) creating a custom AdminSite subclass and installing it
instead of rebinding admin.site.get_app_list, or (B) at minimum gate the
commands_app addition inside custom_get_app_list on a permission check (e.g.,
request.user.is_superuser or a specific permission) so the pseudo-app is only
appended for authorized users; update references to custom_get_app_list,
admin.site.get_app_list and the commands_app dict accordingly and remove the
import-time rebinding if you implement the AdminSite subclass approach.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e277ab42-da6c-44f6-a14c-ed1e160cc90d
📒 Files selected for processing (16)
CHANGELOG.mdMakefiledocs/commands/index.mdsrc/backend/core/admin.pysrc/backend/core/factories.pysrc/backend/core/forms.pysrc/backend/core/management/commands/index.pysrc/backend/core/models.pysrc/backend/core/services/search_indexers.pysrc/backend/core/tasks/search.pysrc/backend/core/templates/runindexing.htmlsrc/backend/core/tests/commands/test_index.pysrc/backend/core/tests/conftest.pysrc/backend/core/tests/test_models_documents.pysrc/backend/core/tests/test_services_search_indexers.pysrc/backend/impress/urls.py
crash-save mode consist in indexing documents in ascending updated_at order and save the last document.update_at. This allows resuming indexing from the last successful batch in case of a crash. Signed-off-by: charles <charles.englebert@protonmail.com>
we need to be able to specify time bounds on the bulk index to allow recovering from checkpoint after crash
i am documenting the index command Signed-off-by: charles <charles.englebert@protonmail.com>
we want to run the indexing from the admin. in `dmin/core/runindexing/`is a form to do so. Signed-off-by: charles <charles.englebert@protonmail.com> ✨(backend) add async_mode flag the command must be killable. this adds a async_mode flag to preserve async feature and allow running sync. Signed-off-by: charles <charles.englebert@protonmail.com>
the index command has new params to add on the make command Signed-off-by: charles <charles.englebert@protonmail.com>
f71cbe0 to
c75f9b7
Compare
There was a problem hiding this comment.
Actionable comments posted: 8
♻️ Duplicate comments (1)
CHANGELOG.md (1)
12-13:⚠️ Potential issue | 🟡 MinorRemove trailing whitespace.
Lines 12 and 13 have trailing whitespace after "checkpoint recovery" and "asynchronicity", which is inconsistent with line 14. This should be cleaned up for consistency.
🧹 Proposed fix to remove trailing whitespace
- ✨(backend) improve indexing command - - checkpoint recovery - - asynchronicity + - checkpoint recovery + - asynchronicity - admin command trigger🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CHANGELOG.md` around lines 12 - 13, Remove the trailing whitespace at the end of the lines containing the list items "checkpoint recovery" and "asynchronicity" in CHANGELOG.md; update those two lines so they end immediately after the text (no spaces or tabs) to match the formatting of the surrounding lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/commands/index.md`:
- Line 58: The documentation line contains a leftover diff marker ("+-
**description:** When set, dispatches the indexing job to a Celery worker...");
locate that exact string in docs/commands/index.md and remove the leading '+' so
the line becomes a normal Markdown bullet (e.g., "- **description:** When set,
...") or simply "**description:** ..." depending on surrounding list style,
ensuring no stray diff markers remain.
In `@src/backend/core/admin.py`:
- Around line 17-50: Replace the fragile monkey-patch of admin.site.get_app_list
by creating a proper AdminSite subclass that overrides get_app_list (move
custom_get_app_list logic into that override) and register it as the default
site via AppConfig.default_site (or instantiate and use your subclass when
configuring the app), or alternatively implement a normal admin entry for the
RunIndexing command by registering a ModelAdmin/proxy model or adding a custom
admin view via admin.site.get_urls so the "Commands" app appears through
Django's standard app-list machinery; remove the lambda rebind of
admin.site.get_app_list and the unused _self parameter in custom_get_app_list
once migrated.
- Around line 26-41: The Commands entry is always visible because
custom_get_app_list unconditionally appends the commands_app with
has_module_perms=True; update custom_get_app_list to only include or set
has_module_perms for the commands_app when the current request.user has staff
privileges (e.g., request.user.is_staff) or the specific permission you want to
gate; locate the commands_app dict and the custom_get_app_list function and
either skip appending commands_app for non-staff users or set its
"has_module_perms" to False when request.user.is_staff is False so the
"RunIndexing" link is hidden from non-staff.
In `@src/backend/core/forms.py`:
- Around line 17-22: The DateTimeField widgets lower_time_bound and
upper_time_bound need explicit input_formats to accept HTML5 "datetime-local"
values (e.g. "%Y-%m-%dT%H:%M" and "%Y-%m-%dT%H:%M:%S") and you must normalize
timezone awareness before passing results to filter_updated_at; update those
fields to include input_formats=["%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"] and
then, depending on your project USE_TZ setting, either convert the parsed naive
datetimes to aware datetimes using
django.utils.timezone.make_aware(timezone.get_current_timezone()) before calling
filter_updated_at or adjust filter_updated_at to accept naive values by
localizing them consistently.
In `@src/backend/core/services/search_indexers.py`:
- Around line 144-166: Crash-safe recovery currently orders documents only by
updated_at which is non-deterministic for tied timestamps and can cause
duplicate pushes at checkpoint boundaries; either confirm with the Find team
that the indexing endpoint used by push() is idempotent (upsert by id) and
document that guarantee, or make the code deterministic by changing the ordering
to include a secondary key (e.g., order_by("updated_at", "id")) and update the
checkpoint logic to resume with a compound predicate (updated_at >= last_time
AND id > last_id) so documents at the boundary are not re-pushed; update
serialize_document/get_batch_accesses_by_users_and_teams usage as needed to
return the last_id for checkpointing and adjust logger.info to include both
updated_at and id when emitting the checkpoint.
In `@src/backend/core/templates/runindexing.html`:
- Line 6: The form element in runindexing.html is relying on the default POST
target; update the <form> tag to include an explicit action attribute (for
example action="" or action="{% url 'run_indexing' %}") so the template is safe
for reuse/inclusion—modify the <form method="POST"> declaration in the template
to include that action value.
In `@src/backend/core/tests/test_admin_run_indexing.py`:
- Line 5: The test contains an unused mock return and an unnecessary import:
remove the HttpResponse import and drop the assignment
mock_call_command.return_value = HttpResponse("Mocked render") in
test_admin_run_indexing.py (also remove the duplicated lines at 35-37); the view
under test (run_indexing_view) does a redirect and does not consume
call_command's return value, so simply mock call_command without setting a
return_value and keep assertions about call invocation and redirect behavior.
In `@src/backend/core/tests/test_models_documents.py`:
- Around line 1699-1796: Remove the redundant import "from core.factories import
DocumentFactory" and update the tests to use the already-imported factories
module (i.e., factories.DocumentFactory) so all factory usage is consistent;
locate references to DocumentFactory in the new test functions (e.g.,
test_models_documents_manager_time_filter_* and
test_utils_build_indexable_documents_queryset_ancestors_deleted_at) and replace
them with factories.DocumentFactory, then delete the duplicate import statement.
---
Duplicate comments:
In `@CHANGELOG.md`:
- Around line 12-13: Remove the trailing whitespace at the end of the lines
containing the list items "checkpoint recovery" and "asynchronicity" in
CHANGELOG.md; update those two lines so they end immediately after the text (no
spaces or tabs) to match the formatting of the surrounding lines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c2856f8d-9c81-4714-943a-4025c2b2022a
📒 Files selected for processing (17)
CHANGELOG.mdMakefiledocs/commands/index.mdsrc/backend/core/admin.pysrc/backend/core/factories.pysrc/backend/core/forms.pysrc/backend/core/management/commands/index.pysrc/backend/core/models.pysrc/backend/core/services/search_indexers.pysrc/backend/core/tasks/search.pysrc/backend/core/templates/runindexing.htmlsrc/backend/core/tests/commands/test_index.pysrc/backend/core/tests/conftest.pysrc/backend/core/tests/test_admin_run_indexing.pysrc/backend/core/tests/test_models_documents.pysrc/backend/core/tests/test_services_search_indexers.pysrc/backend/impress/urls.py
| ### `--async` | ||
| - **type:** Boolean flag | ||
| - **default:** `False` | ||
| +- **description:** When set, dispatches the indexing job to a Celery worker instead of running it synchronously. |
There was a problem hiding this comment.
Stray + diff marker leftover in rendered documentation.
The line literally starts with +- **description:** — a leftover from a previously applied suggested diff that was pasted verbatim. Markdown will render this with a leading +.
📝 Proposed fix
-+- **description:** When set, dispatches the indexing job to a Celery worker instead of running it synchronously.
+- **description:** When set, dispatches the indexing job to a Celery worker instead of running it synchronously.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| +- **description:** When set, dispatches the indexing job to a Celery worker instead of running it synchronously. | |
| - **description:** When set, dispatches the indexing job to a Celery worker instead of running it synchronously. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/commands/index.md` at line 58, The documentation line contains a
leftover diff marker ("+- **description:** When set, dispatches the indexing job
to a Celery worker..."); locate that exact string in docs/commands/index.md and
remove the leading '+' so the line becomes a normal Markdown bullet (e.g., "-
**description:** When set, ...") or simply "**description:** ..." depending on
surrounding list style, ensuring no stray diff markers remain.
| # Customize the default admin site's get_app_list method | ||
| _original_get_app_list = admin.site.get_app_list | ||
|
|
||
|
|
||
| def custom_get_app_list(_self, request, app_label=None): | ||
| """Add custom commands to the app list.""" | ||
| app_list = _original_get_app_list(request, app_label) | ||
|
|
||
| # Add Commands app with Run Indexing command | ||
| commands_app = { | ||
| "name": _("Commands"), | ||
| "app_label": "commands", | ||
| "app_url": "#", | ||
| "has_module_perms": True, | ||
| "models": [ | ||
| { | ||
| "name": _("Run indexing"), | ||
| "object_name": "RunIndexing", | ||
| "admin_url": "/admin/run-indexing/", | ||
| "view_only": False, | ||
| "add_url": None, | ||
| "change_url": None, | ||
| } | ||
| ], | ||
| } | ||
|
|
||
| app_list.append(commands_app) | ||
| return app_list | ||
|
|
||
|
|
||
| # Monkey-patch the admin site | ||
| admin.site.get_app_list = lambda request, app_label=None: custom_get_app_list( | ||
| admin.site, request, app_label | ||
| ) |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Monkey-patching admin.site.get_app_list is fragile; prefer a AdminSite subclass or clean override.
Directly reassigning admin.site.get_app_list to a lambda:
- Is not composable: if another module (now or in the future) does the same thing, only the last patch wins, and the order becomes import-order-dependent.
- Makes the override hard to discover and to test in isolation.
- Captures
_original_get_app_listat import time; any later rebinding by another app is invisible to this wrapper.
Cleaner alternatives:
- Subclass
AdminSite, overrideget_app_list, and setdefault_siteviaAppConfig.default_site. - Or add a proper
ModelAdmin/entry forRunIndexing(e.g. a tiny proxy model or a dedicated admin view registered viaadmin.site.get_urls) so the "Commands" entry comes through the normal app-list machinery.
Also, custom_get_app_list(_self, request, app_label=None) never uses _self; you can drop it and the lambda’s admin.site argument:
♻️ Suggested simplification
-def custom_get_app_list(_self, request, app_label=None):
+def custom_get_app_list(request, app_label=None):
"""Add custom commands to the app list."""
app_list = _original_get_app_list(request, app_label)
...
app_list.append(commands_app)
return app_list
-# Monkey-patch the admin site
-admin.site.get_app_list = lambda request, app_label=None: custom_get_app_list(
- admin.site, request, app_label
-)
+admin.site.get_app_list = custom_get_app_list🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/admin.py` around lines 17 - 50, Replace the fragile
monkey-patch of admin.site.get_app_list by creating a proper AdminSite subclass
that overrides get_app_list (move custom_get_app_list logic into that override)
and register it as the default site via AppConfig.default_site (or instantiate
and use your subclass when configuring the app), or alternatively implement a
normal admin entry for the RunIndexing command by registering a ModelAdmin/proxy
model or adding a custom admin view via admin.site.get_urls so the "Commands"
app appears through Django's standard app-list machinery; remove the lambda
rebind of admin.site.get_app_list and the unused _self parameter in
custom_get_app_list once migrated.
| commands_app = { | ||
| "name": _("Commands"), | ||
| "app_label": "commands", | ||
| "app_url": "#", | ||
| "has_module_perms": True, | ||
| "models": [ | ||
| { | ||
| "name": _("Run indexing"), | ||
| "object_name": "RunIndexing", | ||
| "admin_url": "/admin/run-indexing/", | ||
| "view_only": False, | ||
| "add_url": None, | ||
| "change_url": None, | ||
| } | ||
| ], | ||
| } |
There was a problem hiding this comment.
"Commands" entry is always shown, regardless of staff status.
custom_get_app_list unconditionally appends the commands_app with "has_module_perms": True. Any user reaching the admin index page will see the "Run indexing" link, even if they are not staff (the view itself is now correctly staff-gated, so clicking it will redirect to login — but the entry is still visible). Consider gating visibility on request.user.is_staff (or a more specific permission) so non-staff users don't see a link they can't use.
🔒 Suggested fix
def custom_get_app_list(_self, request, app_label=None):
"""Add custom commands to the app list."""
app_list = _original_get_app_list(request, app_label)
+ if not request.user.is_staff:
+ return app_list
+
# Add Commands app with Run Indexing command
commands_app = { ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/admin.py` around lines 26 - 41, The Commands entry is always
visible because custom_get_app_list unconditionally appends the commands_app
with has_module_perms=True; update custom_get_app_list to only include or set
has_module_perms for the commands_app when the current request.user has staff
privileges (e.g., request.user.is_staff) or the specific permission you want to
gate; locate the commands_app dict and the custom_get_app_list function and
either skip appending commands_app for non-staff users or set its
"has_module_perms" to False when request.user.is_staff is False so the
"RunIndexing" link is hidden from non-staff.
| lower_time_bound = forms.DateTimeField( | ||
| required=False, widget=forms.TextInput(attrs={"type": "datetime-local"}) | ||
| ) | ||
| upper_time_bound = forms.DateTimeField( | ||
| required=False, widget=forms.TextInput(attrs={"type": "datetime-local"}) | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Look at Django's default DATETIME_INPUT_FORMATS and project overrides
rg -nP 'DATETIME_INPUT_FORMATS|USE_TZ\s*=' --type=py -C1
# Check how the index command's datetime.fromisoformat() handles the produced string
rg -nP 'fromisoformat|isoformat\(' src/backend/core --type=py -C2Repository: suitenumerique/docs
Length of output: 50376
🏁 Script executed:
# Search for fromisoformat usage more specifically
rg -n 'fromisoformat' src/backend --type=py -B2 -A2Repository: suitenumerique/docs
Length of output: 1019
🏁 Script executed:
# Look for admin.py and how it uses the form
fd -type f -name 'admin.py' src/backendRepository: suitenumerique/docs
Length of output: 234
🏁 Script executed:
# Search for DATETIME_INPUT_FORMATS customization
rg -n 'DATETIME_INPUT_FORMATS' src/backend --type=py -A3Repository: suitenumerique/docs
Length of output: 45
🏁 Script executed:
# Find the index command that uses these bounds
rg -n 'lower_time_bound|upper_time_bound' src/backend --type=py -B2 -A2Repository: suitenumerique/docs
Length of output: 22766
🏁 Script executed:
# Find Django version requirement
find . -name 'requirements*.txt' -o -name 'pyproject.toml' -o -name 'setup.py' | xargs grep -l 'django' 2>/dev/null | head -5Repository: suitenumerique/docs
Length of output: 92
🏁 Script executed:
# Check for tests of the admin form itself with datetime submission
rg -n 'RunIndexingForm|test.*admin.*indexing' src/backend/core/tests --type=py -lRepository: suitenumerique/docs
Length of output: 45
🏁 Script executed:
# Look at actual admin.py to see the context around form usage
cat -n src/backend/core/admin.py | sed -n '270,295p'Repository: suitenumerique/docs
Length of output: 1256
🏁 Script executed:
# Check Django version in pyproject.toml
cat src/backend/pyproject.toml | grep -A 20 -B 5 'django'Repository: suitenumerique/docs
Length of output: 3191
🏁 Script executed:
# Look for any tests of the RunIndexingForm with actual datetime-local input
rg -n 'RunIndexingForm|run_indexing_view' src/backend/core/tests --type=py -B 3 -A 5Repository: suitenumerique/docs
Length of output: 1649
🏁 Script executed:
# Check if there's a settings file that customizes DATETIME_INPUT_FORMATS
find src/backend -name 'settings.py' -o -name 'settings_*' | xargs cat 2>/dev/null | rg -A 5 'DATETIME_INPUT_FORMATS' || echo "No DATETIME_INPUT_FORMATS found"Repository: suitenumerique/docs
Length of output: 95
🏁 Script executed:
# Get the full test file to see how the form is tested
cat -n src/backend/core/tests/test_admin_run_indexing.pyRepository: suitenumerique/docs
Length of output: 1928
🏁 Script executed:
# Check Django's default DATETIME_INPUT_FORMATS to understand what it accepts
python3 << 'EOF'
from django.conf import settings
import django
# Try to get Django's default input formats
print("Python code to understand Django's DATETIME_INPUT_FORMATS")
# In Django 3.2+, the defaults include ISO 8601 formats
# We need to know which version of Django is being used
# Version constraint is: django<6.0.0 (from pyproject.toml)
# So it could be Django 3.2, 4.0, 4.1, 4.2, or 5.x
# Let's check what formats modern Django supports
try:
import django
print(f"Django version in environment: {django.get_version() if hasattr(django, 'get_version') else 'unknown'}")
except:
pass
# The key issue: does DateTimeField accept "2024-02-01T12:00" format?
# Django's default DATETIME_INPUT_FORMATS (since 3.2) are:
# '%Y-%m-%d %H:%M:%S',
# '%Y-%m-%d %H:%M:%S.%f',
# '%Y-%m-%d %H:%M',
# Note: NO T-separated format by default
print("Django defaults do NOT include T-separated ISO format by default")
print("HTML5 datetime-local submits: 2024-02-01T12:00")
print("Django defaults expect: 2024-02-01 12:00")
print("\nThis is a parsing mismatch!")
EOFRepository: suitenumerique/docs
Length of output: 182
🏁 Script executed:
# Look at the complete forms.py file to see all form definitions
cat -n src/backend/core/forms.pyRepository: suitenumerique/docs
Length of output: 1684
🏁 Script executed:
# Search for any test that actually submits datetime values to the form
rg -n 'lower_time_bound.*=' src/backend/core/tests --type=py -B 2 -A 2 | head -50Repository: suitenumerique/docs
Length of output: 4398
🏁 Script executed:
# Check if there's a custom DateTimeField or widget in the codebase
rg -n 'class.*DateTimeField|class.*DateTimeInput' src/backend --type=pyRepository: suitenumerique/docs
Length of output: 45
Add explicit input_formats to DateTimeField to accept HTML5 datetime-local format.
The HTML5 datetime-local widget submits values as YYYY-MM-DDTHH:MM[SS] (T-separated, no timezone), but Django's default DATETIME_INPUT_FORMATS only include space-separated formats (%Y-%m-%d %H:%M[:%S]). Form validation will fail with "Enter a valid date/time" when users submit via the HTML5 widget.
Also ensure consistent timezone handling: form fields produce naive datetimes (no tzinfo); these are then passed to filter_updated_at() which compares against tz-aware updated_at values. Either use USE_TZ=False or explicitly handle timezone awareness in the form and downstream queries.
Suggested fix
lower_time_bound = forms.DateTimeField(
required=False,
widget=forms.TextInput(attrs={"type": "datetime-local"}),
input_formats=["%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"]
)
upper_time_bound = forms.DateTimeField(
required=False,
widget=forms.TextInput(attrs={"type": "datetime-local"}),
input_formats=["%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"]
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/forms.py` around lines 17 - 22, The DateTimeField widgets
lower_time_bound and upper_time_bound need explicit input_formats to accept
HTML5 "datetime-local" values (e.g. "%Y-%m-%dT%H:%M" and "%Y-%m-%dT%H:%M:%S")
and you must normalize timezone awareness before passing results to
filter_updated_at; update those fields to include
input_formats=["%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"] and then, depending on
your project USE_TZ setting, either convert the parsed naive datetimes to aware
datetimes using
django.utils.timezone.make_aware(timezone.get_current_timezone()) before calling
filter_updated_at or adjust filter_updated_at to accept naive values by
localizing them consistently.
| if crash_safe_mode: | ||
| queryset = queryset.order_by("updated_at") | ||
|
|
||
| for documents_batch in itertools.batched(queryset.iterator(), batch_size): | ||
| doc_paths = [doc.path for doc in documents_batch] | ||
| last_id = documents_batch[-1].id | ||
| accesses_by_document_path = get_batch_accesses_by_users_and_teams(doc_paths) | ||
|
|
||
| serialized_batch = [ | ||
| self.serialize_document(document, accesses_by_document_path) | ||
| for document in documents_batch | ||
| if document.content or document.title | ||
| ] | ||
|
|
||
| if serialized_batch: | ||
| self.push(serialized_batch) | ||
| count += len(serialized_batch) | ||
| if not serialized_batch: | ||
| continue | ||
|
|
||
| self.push(serialized_batch) | ||
| count += len(serialized_batch) | ||
|
|
||
| if crash_safe_mode: | ||
| logger.info( | ||
| "Indexing checkpoint: %s.", | ||
| serialized_batch[-1]["updated_at"], | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect the Find push endpoint contract (upsert vs create-only) via docs or comments
rg -nH -C3 -e 'indexing|upsert|POST.*documents' -g '!**/node_modules/**' --iglob '*.md' --iglob '*.py'Repository: suitenumerique/docs
Length of output: 24864
🏁 Script executed:
# Find the FindDocumentIndexer.push implementation
rg -nH --type py "def push" src/backend/core/services/search_indexers.py -A 20Repository: suitenumerique/docs
Length of output: 2767
🏁 Script executed:
# Check the FindDocumentIndexer class structure to understand the push endpoint
rg -nH --type py "class FindDocumentIndexer" src/backend/core/services/search_indexers.py -A 50Repository: suitenumerique/docs
Length of output: 4082
🏁 Script executed:
# Look for the Find API request handler to see if it's upsert or create-only
rg -nH --type py "requests\.(post|put)" src/backend/core/services/search_indexers.py -B 5 -A 5Repository: suitenumerique/docs
Length of output: 1769
🏁 Script executed:
# Check test_index_with_crash_recovery to see what idempotency assumptions it validates
sed -n '100,200p' src/backend/core/tests/commands/test_index.py | cat -nRepository: suitenumerique/docs
Length of output: 5116
Crash-safe recovery relies on undocumented push idempotency.
order_by("updated_at") is non-deterministic when multiple documents share the same timestamp. With inclusive lower_time_bound on resume, documents at the checkpoint boundary will be re-pushed on restart. The test confirms this behavior (documents[3] is re-indexed after recovery), but push() idempotency is neither documented in code nor verified in practice.
Before deploying this feature, confirm with the Find team that the indexing endpoint is idempotent (upsert by id, not create-only). If uncertain, add a secondary ordering key (order_by("updated_at", "id")) and advance the checkpoint by updated_at >= X AND id > last_id to eliminate duplicate pushes at boundaries.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/services/search_indexers.py` around lines 144 - 166,
Crash-safe recovery currently orders documents only by updated_at which is
non-deterministic for tied timestamps and can cause duplicate pushes at
checkpoint boundaries; either confirm with the Find team that the indexing
endpoint used by push() is idempotent (upsert by id) and document that
guarantee, or make the code deterministic by changing the ordering to include a
secondary key (e.g., order_by("updated_at", "id")) and update the checkpoint
logic to resume with a compound predicate (updated_at >= last_time AND id >
last_id) so documents at the boundary are not re-pushed; update
serialize_document/get_batch_accesses_by_users_and_teams usage as needed to
return the last_id for checkpointing and adjust logger.info to include both
updated_at and id when emitting the checkpoint.
|
|
||
| {% block content %} | ||
|
|
||
| <form method="POST" > |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Nit: be explicit about the form target.
Relying on the default (posting to the current URL) works, but an explicit action="" (or {% url 'run_indexing' %}) makes the template safer against future re-use/inclusion.
✏️ Suggested tweak
-<form method="POST" >
+<form method="POST" action="{% url 'run_indexing' %}">Also the two HTMLHint warnings (doctype-first, input-requires-label) are false positives here — the doctype comes from admin/base_site.html and labels are rendered by {{ form.as_p }}.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| <form method="POST" > | |
| <form method="POST" action="{% url 'run_indexing' %}"> |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/templates/runindexing.html` at line 6, The form element in
runindexing.html is relying on the default POST target; update the <form> tag to
include an explicit action attribute (for example action="" or action="{% url
'run_indexing' %}") so the template is safe for reuse/inclusion—modify the <form
method="POST"> declaration in the template to include that action value.
|
|
||
| from unittest.mock import patch | ||
|
|
||
| from django.http import HttpResponse |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Minor: unused mock return value and unused HttpResponse import.
mock_call_command.return_value = HttpResponse("Mocked render") has no effect — run_indexing_view never uses the return value of call_command; it redirects on success. The line and the HttpResponse import can be dropped for clarity.
🧹 Suggested cleanup
from unittest.mock import patch
-from django.http import HttpResponse
-
import pytest
@@
with patch("core.admin.call_command") as mock_call_command:
- mock_call_command.return_value = HttpResponse("Mocked render")
response = client.post("/admin/run-indexing/", {"batch_size": batch_size})Also applies to: 35-37
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/tests/test_admin_run_indexing.py` at line 5, The test
contains an unused mock return and an unnecessary import: remove the
HttpResponse import and drop the assignment mock_call_command.return_value =
HttpResponse("Mocked render") in test_admin_run_indexing.py (also remove the
duplicated lines at 35-37); the view under test (run_indexing_view) does a
redirect and does not consume call_command's return value, so simply mock
call_command without setting a return_value and keep assertions about call
invocation and redirect behavior.
| def test_models_documents_manager_time_filter_no_filters(): | ||
| """Test time_filter with no filters returns all documents.""" | ||
| factories.DocumentFactory.create_batch(3) | ||
|
|
||
| queryset = models.Document.objects.filter_updated_at() | ||
|
|
||
| assert queryset.count() == 3 | ||
|
|
||
|
|
||
| def test_models_documents_manager_time_filter_oldest_updated_at(): | ||
| """ | ||
| Test filtering by oldest_updated_at includes documents updated after or at | ||
| lower_time_bound. | ||
| """ | ||
| lower_time_bound = datetime(2024, 2, 1, tzinfo=base_timezone.utc) | ||
|
|
||
| DocumentFactory(updated_at=lower_time_bound - timedelta(days=30)) | ||
| document_at_boundary = DocumentFactory(updated_at=lower_time_bound) | ||
| document_recent = DocumentFactory(updated_at=lower_time_bound + timedelta(days=15)) | ||
|
|
||
| queryset = models.Document.objects.filter_updated_at( | ||
| lower_time_bound=lower_time_bound | ||
| ) | ||
|
|
||
| assert queryset.count() == 2 | ||
| assert sorted(queryset.values_list("pk", flat=True)) == sorted( | ||
| [document_at_boundary.pk, document_recent.pk] | ||
| ) | ||
|
|
||
|
|
||
| def test_models_documents_manager_time_filter_newest_updated_at(): | ||
| """Test filtering by newest_updated_at includes documents updated before timestamp.""" | ||
| upper_time_bound = datetime(2024, 2, 1, tzinfo=base_timezone.utc) | ||
|
|
||
| document_old = DocumentFactory(updated_at=upper_time_bound - timedelta(days=30)) | ||
| document_at_boundary = DocumentFactory(updated_at=upper_time_bound) | ||
| document_too_recent = DocumentFactory( | ||
| updated_at=upper_time_bound + timedelta(days=15) | ||
| ) | ||
|
|
||
| queryset = models.Document.objects.filter_updated_at( | ||
| upper_time_bound=upper_time_bound | ||
| ) | ||
|
|
||
| assert queryset.count() == 2 | ||
| assert document_old in queryset | ||
| assert document_at_boundary in queryset | ||
| assert document_too_recent not in queryset | ||
|
|
||
|
|
||
| def test_models_documents_manager_time_filter_both_bounds(): | ||
| """Test filtering with both oldest and newest bounds.""" | ||
| lower_time_bound = datetime(2024, 2, 1, tzinfo=base_timezone.utc) | ||
| upper_time_bound = lower_time_bound + timedelta(days=30) | ||
|
|
||
| document_too_early = DocumentFactory( | ||
| updated_at=lower_time_bound - timedelta(days=10) | ||
| ) | ||
| document_in_window = DocumentFactory( | ||
| updated_at=lower_time_bound + timedelta(days=5) | ||
| ) | ||
| other_document_in_window = DocumentFactory( | ||
| updated_at=lower_time_bound + timedelta(days=15) | ||
| ) | ||
| document_too_late = DocumentFactory( | ||
| updated_at=upper_time_bound + timedelta(days=10) | ||
| ) | ||
|
|
||
| queryset = models.Document.objects.filter_updated_at( | ||
| lower_time_bound=lower_time_bound, upper_time_bound=upper_time_bound | ||
| ) | ||
|
|
||
| assert queryset.count() == 2 | ||
| assert document_too_early not in queryset | ||
| assert document_in_window in queryset | ||
| assert other_document_in_window in queryset | ||
| assert document_too_late not in queryset | ||
|
|
||
|
|
||
| def test_utils_build_indexable_documents_queryset_ancestors_deleted_at(): | ||
| """Test filtering includes documents with ancestors_deleted_at in range.""" | ||
| lower_time_bound = datetime(2024, 2, 1, tzinfo=base_timezone.utc) | ||
|
|
||
| document_ancestor_deleted_in_range = DocumentFactory( | ||
| updated_at=lower_time_bound - timedelta(days=30), | ||
| ancestors_deleted_at=lower_time_bound + timedelta(days=15), | ||
| ) | ||
| document_updated_before_range = DocumentFactory( | ||
| updated_at=lower_time_bound - timedelta(days=30) | ||
| ) | ||
|
|
||
| queryset = models.Document.objects.filter_updated_at( | ||
| lower_time_bound=lower_time_bound | ||
| ) | ||
|
|
||
| assert queryset.count() == 1 | ||
| assert document_ancestor_deleted_in_range in queryset | ||
| assert document_updated_before_range not in queryset |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
LGTM — comprehensive coverage of filter_updated_at semantics.
Good tests covering: no-filter passthrough, lower/upper/both bounds with boundary inclusion (gte/lte), and the ancestors_deleted_at OR-branch. Timezone-aware datetimes are used consistently, addressing earlier feedback.
One tiny nit (optional): since factories is already imported at line 23, the duplicate from core.factories import DocumentFactory at line 24 can be dropped and the new tests can use factories.DocumentFactory like the rest of the file for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/backend/core/tests/test_models_documents.py` around lines 1699 - 1796,
Remove the redundant import "from core.factories import DocumentFactory" and
update the tests to use the already-imported factories module (i.e.,
factories.DocumentFactory) so all factory usage is consistent; locate references
to DocumentFactory in the new test functions (e.g.,
test_models_documents_manager_time_filter_* and
test_utils_build_indexable_documents_queryset_ancestors_deleted_at) and replace
them with factories.DocumentFactory, then delete the duplicate import statement.
Purpose
to index all documents we need to be able to restart from last successful batch in case of crash. to do so we need to be able to specify time bounds on the bulk index to allow recovering from checkpoint after crash and asynchronously delegate it to celery worker.
Proposal
lower_time_boundandupper_time_boundargsbatch_document_indexer_taskExternal contributions
Thank you for your contribution! 🎉
Please ensure the following items are checked before submitting your pull request:
git commit --signoff(DCO compliance)git commit -S)<gitmoji>(type) title description## [Unreleased]section (if noticeable change)