implement per-repo enumeration for full commit history #1

Closed
opened 2026-05-03 16:15:22 +00:00 by grenade · 0 comments
Owner

The current GitHub Search-based backfill is hard-capped at 1000 results per query (most-recent 1000 commits + most-recent 1000 issues/PRs). Older history is unreachable through /search/commits?q=author:grenade.

To recover the rest, add a new EventSource impl that enumerates the user's owned/contributed repos and walks per-repo /repos/{owner}/{repo}/commits?author=<user> — that endpoint has no 1000-cap.

Approach

  • New source module: crates/moments-data/src/github_repo.rs
  • Enumerate repos via /user/repos?affiliation=owner,collaborator,organization_member&visibility=all&per_page=100 (paginated).
  • Per repo, fetch /repos/{owner}/{repo}/commits?author=<user>&per_page=100 (paginated, no Search-API cap).
  • Dedup by SHA, key as github-commit:<sha> (matches GithubSearchSource's scheme — same row, idempotent upsert).
  • Visibility from each repo's private field at enumeration time.
  • Long poll interval (weekly?): this is an exhaustive backfill, so runs after the first are mostly no-ops.
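
The steps above could be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: `GithubPages`, `FakeApi`, `Repo`, `CommitRow`, `drain_pages` and `backfill` are hypothetical names, the real `EventSource` trait is not shown in this issue, and the HTTP transport is hidden behind a trait so the logic runs without network access. It also assumes a page shorter than `per_page` marks the end of a listing; a real client would more likely follow GitHub's `Link: rel="next"` header.

```rust
use std::collections::HashMap;

pub const PER_PAGE: usize = 100;

/// Hypothetical repo summary; `private` comes from the enumeration response.
#[derive(Debug, Clone)]
pub struct Repo {
    pub full_name: String, // "owner/name"
    pub private: bool,
}

/// Hypothetical row shape; the real moments-data types are not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct CommitRow {
    pub key: String,   // "github-commit:<sha>"
    pub repo: String,  // repo the SHA was first seen in
    pub private: bool, // visibility captured at enumeration time
}

/// Hypothetical transport seam. Pages are 1-based, as in the GitHub REST API.
pub trait GithubPages {
    /// Stands in for GET /user/repos?affiliation=...&visibility=all&per_page=100&page=N
    fn repos_page(&self, page: u32) -> Vec<Repo>;
    /// Stands in for GET /repos/{owner}/{repo}/commits?author=<user>&per_page=100&page=N
    fn commit_shas_page(&self, repo: &str, author: &str, page: u32) -> Vec<String>;
}

/// Fetch pages until one comes back short (or empty).
fn drain_pages<T>(mut fetch: impl FnMut(u32) -> Vec<T>) -> Vec<T> {
    let mut all = Vec::new();
    let mut page = 1;
    loop {
        let items = fetch(page);
        let n = items.len();
        all.extend(items);
        if n < PER_PAGE {
            break;
        }
        page += 1;
    }
    all
}

/// Same key scheme as the Search-based source, so upserts hit the same row.
pub fn commit_key(sha: &str) -> String {
    format!("github-commit:{sha}")
}

/// Walk every repo, then every authored commit, deduplicating by SHA.
pub fn backfill(api: &impl GithubPages, author: &str) -> HashMap<String, CommitRow> {
    let mut rows = HashMap::new();
    for repo in drain_pages(|p| api.repos_page(p)) {
        for sha in drain_pages(|p| api.commit_shas_page(&repo.full_name, author, p)) {
            // Assumption: the first repo to report a SHA wins; forks repeating it are skipped.
            rows.entry(commit_key(&sha)).or_insert_with(|| CommitRow {
                key: commit_key(&sha),
                repo: repo.full_name.clone(),
                private: repo.private,
            });
        }
    }
    rows
}

/// In-memory stand-in used to exercise the sketch.
pub struct FakeApi {
    pub repos: Vec<Repo>,
    pub commits: HashMap<String, Vec<String>>, // repo full_name -> SHAs
}

fn page_of<T: Clone>(items: &[T], page: u32) -> Vec<T> {
    let start = (page as usize - 1) * PER_PAGE;
    items.iter().skip(start).take(PER_PAGE).cloned().collect()
}

impl GithubPages for FakeApi {
    fn repos_page(&self, page: u32) -> Vec<Repo> {
        page_of(&self.repos, page)
    }
    fn commit_shas_page(&self, repo: &str, _author: &str, page: u32) -> Vec<String> {
        self.commits.get(repo).map(|c| page_of(c, page)).unwrap_or_default()
    }
}
```

Calling `backfill(&api, "grenade")` with a `FakeApi` returns one row per distinct SHA. Which repo's visibility should win when a SHA appears in both a public repo and a private fork is left open here; the sketch simply keeps the first one seen.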

Cost

O(repos × commits/repo ÷ page size) calls. Roughly 100 repos × 1000 commits each at 100/page = 10 pages per repo, so ~1000 API calls. Well within the 5000/hr authed budget but slow. Mostly a one-time cost.
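
The arithmetic behind that estimate can be written out as a small helper. This is only a back-of-the-envelope sketch with hypothetical function names. It counts one request per page, adds the repo-enumeration pages, and ignores any trailing empty-page probe a client might make.

```rust
/// Requests needed to list `items` results at `per_page` each.
/// An empty listing still costs one request to find out it is empty.
pub fn pages_for(items: u64, per_page: u64) -> u64 {
    ((items + per_page - 1) / per_page).max(1)
}

/// Estimated total requests: repo enumeration plus the per-repo commit walk.
pub fn estimate_calls(repos: u64, commits_per_repo: u64, per_page: u64) -> u64 {
    let enumeration = pages_for(repos, per_page);
    let commit_walk = repos * pages_for(commits_per_repo, per_page);
    enumeration + commit_walk
}

/// True if the whole run fits inside a single hourly rate-limit window.
pub fn fits_in_one_hour(calls: u64, budget_per_hour: u64) -> bool {
    calls <= budget_per_hour
}
```

With the figures above, `estimate_calls(100, 1000, 100)` comes to 1001 requests (1 enumeration page plus 100 × 10 commit pages), which `fits_in_one_hour(1001, 5000)` confirms is inside the authenticated budget.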

When

Deferred. The Search-API source already gives a meaningful starting set (the most-recent 1000 of each kind). Revisit if the gap shows up as missing entries on the timeline that the user actually cares about.


Reference: grenade/moments#1