Refactor GitHub ingestion to use GraphQL API for efficiency #2
Context
We already use the GitHub GraphQL API for repo discovery (repositoriesContributedTo). The REST-based commit fetching in github-repo is the slowest part of ingestion: it makes sequential per-repo HTTP calls, each paginating through the commits endpoint. Extending GraphQL usage to other parts of the GitHub ingestion pipeline would yield further efficiency gains.
Where GraphQL would help
1. Commit fetching per repo (biggest win)
Current: github-repo calls GET /repos/{owner}/{repo}/commits?author={user} sequentially for each repo, paginating up to 100 pages per repo. Hundreds of repos × multiple pages = many REST round trips.
GraphQL alternative:
    repository(owner, name) {
      defaultBranchRef {
        target {
          ... on Commit {
            history(author: { id: ... }, since: ..., first: 100, after: $cursor) {
              nodes { oid message author { date } }
              pageInfo { hasNextPage endCursor }
            }
          }
        }
      }
    }
Gains:
- since filtering at the query level
2. Repo discovery (already done)
Already using repositoriesContributedTo via GraphQL. Could also replace the REST /user/repos call with GraphQL viewer { repositories(...) } to unify discovery into a single GraphQL query.
3. Replace github-search commit/issue ingestion
Current: REST Search API (/search/commits, /search/issues) with a hard 1000-result cap per query type.
GraphQL alternative: Since github-repo already walks each repo's commits with no cap, and we now discover all contributed repos via GraphQL, github-search commits are fully redundant. The issue/PR ingestion could move to per-repo GraphQL queries. This removes the 1000-result ceiling for issues/PRs as well.
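As a rough sketch of how a per-repo issue path could avoid the 1000-result cap, the snippet below walks a cursor-paginated connection to the end. Everything named here (Page, ISSUES_QUERY, collect_all) is hypothetical and not taken from the existing code. The query text, including the filterBy: { createdBy: ... } argument, is an assumption to check against the GitHub GraphQL schema. The actual HTTP call is left to a caller-supplied closure so the loop stays independent of the client.

```rust
/// One page of a GraphQL connection: the nodes plus its `pageInfo` fields.
/// Hypothetical type, for illustration only.
pub struct Page<T> {
    pub nodes: Vec<T>,
    pub has_next_page: bool,
    pub end_cursor: Option<String>,
}

/// Assumed query text for one repo's issues opened by `$login`.
/// Field and argument names should be verified against the schema.
pub const ISSUES_QUERY: &str = r#"
query($owner: String!, $name: String!, $login: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor, filterBy: { createdBy: $login }) {
      nodes { number title createdAt }
      pageInfo { hasNextPage endCursor }
    }
  }
}"#;

/// Collects every node of a connection by feeding each `endCursor` back in
/// as the next `after` value until `hasNextPage` is false.
/// `fetch_page` receives the current cursor (None for the first page).
pub fn collect_all<T, E>(
    mut fetch_page: impl FnMut(Option<&str>) -> Result<Page<T>, E>,
) -> Result<Vec<T>, E> {
    let mut all = Vec::new();
    let mut cursor: Option<String> = None;
    loop {
        let page = fetch_page(cursor.as_deref())?;
        all.extend(page.nodes);
        match page.end_cursor {
            // Continue only when the server reports more pages and gave a cursor.
            Some(next) if page.has_next_page => cursor = Some(next),
            _ => return Ok(all),
        }
    }
}
```

In real use, the closure would POST ISSUES_QUERY with the cursor as a variable and parse the response into a Page. Pull requests would probably need their own query and possibly client-side author filtering, which this sketch does not cover.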
Where GraphQL does NOT help
github (Events API): The user activity event stream (/users/{user}/events) has no GraphQL equivalent. This source must remain REST-based.
Suggested approach
- [...] (reqwest::Client::post to the /graphql endpoint with auth headers)
- [...] history queries (biggest performance win)
- [...] viewer.repositories + user.repositoriesContributedTo
- github-search can be retired entirely once github-repo covers commits and a new GraphQL path covers issues/PRs per repo
Rate limiting note
The GitHub GraphQL API has a point-based rate limit (5,000 points/hour) rather than a request-count limit. Complex queries cost more points. The batching strategy needs to balance batch size against point cost; 5-10 repos per request is likely the sweet spot.
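To make the batch-size trade-off concrete, here is a minimal sketch of one way batching could work: split the repo list into fixed-size chunks and emit one GraphQL document per chunk, with each repository field given an alias (r0, r1, ...) so several repos can share a request. The function names and the alias scheme are illustrative assumptions, not existing code. The sketch requests only the first page per repo, and it assumes owner and repo names need no string escaping.

```rust
/// Builds one GraphQL document requesting commit history for every repo in
/// `repos`. Each `repository` field gets an alias (r0, r1, ...) so the fields
/// do not collide. Assumes names contain no characters that need escaping.
pub fn batched_history_query(repos: &[(&str, &str)], author_id: &str, since: &str) -> String {
    let mut query = String::from("query {\n");
    for (i, (owner, name)) in repos.iter().enumerate() {
        query.push_str(&format!(
            "  r{}: repository(owner: \"{}\", name: \"{}\") {{\n",
            i, owner, name
        ));
        query.push_str("    defaultBranchRef { target { ... on Commit {\n");
        query.push_str(&format!(
            "      history(author: {{ id: \"{}\" }}, since: \"{}\", first: 100) {{\n",
            author_id, since
        ));
        query.push_str("        nodes { oid message author { date } }\n");
        query.push_str("        pageInfo { hasNextPage endCursor }\n");
        query.push_str("      }\n");
        query.push_str("    } } }\n");
        query.push_str("  }\n");
    }
    query.push_str("}\n");
    query
}

/// Splits `repos` into chunks of at most `batch_size` and builds one query
/// per chunk. A `batch_size` of 0 is treated as 1.
pub fn batched_queries(
    repos: &[(&str, &str)],
    batch_size: usize,
    author_id: &str,
    since: &str,
) -> Vec<String> {
    repos
        .chunks(batch_size.max(1))
        .map(|chunk| batched_history_query(chunk, author_id, since))
        .collect()
}
```

Keeping batch_size as a parameter would let the 5-10 range be tuned against the point cost GitHub actually reports. Repos with more than one page of history would still need follow-up requests using their own endCursor.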