huggingface: add bucket scanning#5017
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit e4f90eb. Configure here.
unsmith
left a comment
There was a problem hiding this comment.
Looks solid to me, left a couple of small non-blocking comments.
|
|
||
| // parseNextLink extracts the URL with rel="next" from a Link header value. | ||
| // Example: <https://huggingface.co/api/...?cursor=abc>; rel="next" | ||
| func parseNextLink(header string) string { |
There was a problem hiding this comment.
Just noting that this new function doesn't have any unit tests.
| if !s.skipAllBuckets { | ||
| if err := s.fetchAndCacheBuckets(orgCtx, org); err != nil { | ||
| orgCtx.Logger().Error(err, "Failed to fetch buckets for organization") | ||
| continue |
There was a problem hiding this comment.
This isn't something specific to this PR, but the pattern here of moving to the next key on error means that prior errors silently skip downstream checks that might be ok. If it's valid to (for example) check buckets if spaces happens to fail, then this could be broken up into separate checks that don't cancel each other.
| if !s.skipAllBuckets { | ||
| if err := s.fetchAndCacheBuckets(userCtx, user); err != nil { | ||
| userCtx.Logger().Error(err, "Failed to fetch buckets for user") | ||
| continue |
amanfcp
left a comment
There was a problem hiding this comment.
Approving to unblock. Clean extension that mirrors the existing huggingface patterns well: the download-timeout handling is correct and tested, path escaping is tested, error aggregation is thread-safe, and docs/man are updated.
A few non-blocking follow-ups:
1. Guard the BucketID slash-split (bucket.go:154-155). strings.Split(bucket.BucketID, "/")[1] panics if the API returns an id without a /, and it runs synchronously in enumerate(), so it crashes the whole scan. Use SplitN + a length check. Mirrors pre-existing cacheRepoInfo, but worth fixing for the new code.
2. Unit-test parseNextLink. Hand-rolled RFC 5988 parser; the naive Split(header, ",") mis-parses a next URL whose cursor contains a comma. A small table test is cheap insurance.

HF recently shipped storage buckets (Xet-backed object storage)
--bucket <namespace/name>flag, plus--include-buckets/--ignore-buckets/--skip-all-buckets;--org/--userscans pick up buckets automaticallyhandlers.HandleFileTested against a real public bucket (results identical to
trufflehog filesystemon the same file), and a planted canary AWS key comes back as a verified finding with correct bucket metadata.cc @dxa4481
Note
Medium Risk
New external HF API and file download path with token auth; follows existing S3-style patterns but increases network surface and scan scope for org/user runs.
Overview
Adds Hugging Face storage bucket support to the
huggingfacesource so TruffleHog can scan object storage alongside models, datasets, and spaces.CLI and config: New
--bucket,--include-buckets,--ignore-buckets, and--skip-all-bucketsflags (plus proto/engine wiring). Org/user scans enumerate buckets automatically unless skipped. Validation now accepts bucket as a scan target.Scan behavior: Buckets are not git repos. The source lists files via the HF tree API (with Link pagination), downloads each file through a dedicated client (no whole-request timeout on body reads), skips files over 250MB, and chunks content via
handlers.HandleFilewith HuggingFace bucket metadata—similar to the S3 source path.API client:
GetBucket,ListBucketsByAuthor,ListBucketFiles, andDownloadBucketFilewith path-segment escaping for resolve URLs.Docs (
README, man page) and minor comment fixes onScanHuggingfaceare updated.Reviewed by Cursor Bugbot for commit 5a76b32. Bugbot is set up for automated code reviews on this repo. Configure here.