Refactor GitHub ingestion to use GraphQL API for efficiency #2

Open
opened 2026-05-06 02:43:37 +00:00 by grenade · 0 comments
Owner

Context

We already use the GitHub GraphQL API for repo discovery (repositoriesContributedTo). The REST-based commit fetching in github-repo is the slowest part of ingestion — it makes sequential per-repo HTTP calls, each paginating through the commits endpoint. There are efficiency gains available by extending GraphQL usage to other parts of the GitHub ingestion pipeline.

Where GraphQL would help

1. Commit fetching per repo (biggest win)

Current: github-repo calls GET /repos/{owner}/{repo}/commits?author={user} sequentially for each repo, paginating up to 100 pages per repo. Hundreds of repos × multiple pages = many REST round trips.

GraphQL alternative:

repository(owner, name) {
  defaultBranchRef {
    target {
      ... on Commit {
        history(author: { id: ... }, since: ..., first: 100, after: $cursor) {
          nodes { oid message author { date } }
          pageInfo { hasNextPage endCursor }
        }
      }
    }
  }
}

Gains:

  • Request only the fields we need (oid, message, author date) instead of full commit payloads — smaller responses, less bandwidth
  • Can batch multiple repos into a single GraphQL request using aliases, reducing round trips dramatically:
    {
      repo0: repository(owner: "org1", name: "repo1") { ...commitHistory }
      repo1: repository(owner: "org2", name: "repo2") { ...commitHistory }
      ...
    }
    
  • Native since filtering at the query level
  • Cursor-based pagination (more reliable than page-number pagination)
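The aliased batching described above could be generated along these lines. This is a minimal sketch using only the standard library, not the project's actual code: `RepoPage`, `gql_string`, and `build_batched_history_query` are hypothetical names, and the author node ID and `since` timestamp are assumed to be supplied by the caller. Each alias carries its own cursor, because repos in one batch will not finish paginating at the same time.

```rust
/// Quote and escape a value for use as a GraphQL string literal.
fn gql_string(value: &str) -> String {
    let escaped = value.replace('\\', "\\\\").replace('"', "\\\"");
    format!("\"{escaped}\"")
}

/// A repository plus the cursor to resume from (`None` for the first page).
/// Hypothetical type, introduced only for this sketch.
pub struct RepoPage<'a> {
    pub owner: &'a str,
    pub name: &'a str,
    pub cursor: Option<&'a str>,
}

/// Build one GraphQL document that fetches a page of commit history for
/// every repo in the batch, using aliases `repo0`, `repo1`, ...
pub fn build_batched_history_query(
    repos: &[RepoPage<'_>],
    author_id: &str,
    since: &str,
) -> String {
    let mut query = String::from("{\n");
    for (index, repo) in repos.iter().enumerate() {
        // Only emit `after:` when we are resuming a previous page.
        let after = match repo.cursor {
            Some(cursor) => format!(", after: {}", gql_string(cursor)),
            None => String::new(),
        };
        query.push_str(&format!(
            "  repo{index}: repository(owner: {owner}, name: {name}) {{\n",
            owner = gql_string(repo.owner),
            name = gql_string(repo.name),
        ));
        query.push_str("    defaultBranchRef { target { ... on Commit {\n");
        query.push_str(&format!(
            "      history(author: {{ id: {author} }}, since: {since}, first: 100{after}) {{\n",
            author = gql_string(author_id),
            since = gql_string(since),
        ));
        query.push_str("        nodes { oid message author { date } }\n");
        query.push_str("        pageInfo { hasNextPage endCursor }\n");
        query.push_str("      }\n    } } }\n  }\n");
    }
    query.push_str("}\n");
    query
}
```

Repos whose `hasNextPage` comes back false can be dropped from the next batch, and the remaining ones re-queued with their `endCursor`.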

2. Repo discovery (already done)

Already using repositoriesContributedTo via GraphQL. Could also replace the REST /user/repos call with GraphQL viewer { repositories(...) } to unify into a single GraphQL discovery query.

3. Replace github-search commit/issue ingestion

Current: REST Search API (/search/commits, /search/issues) with a hard 1000-result cap per query type.

GraphQL alternative: Since github-repo already walks each repo's commits with no cap, and we now discover all contributed repos via GraphQL, github-search commits are fully redundant. The issue/PR ingestion could move to per-repo GraphQL queries:

repository(owner, name) {
  issues(filterBy: { createdBy: $user }, first: 100) { ... }
  pullRequests(first: 100) { ... }
}

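This removes the 1000-result ceiling for issues/PRs as well. Note that pullRequests, unlike issues, takes no author filter argument, so PR results would need to be filtered by author on the client side.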
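Getting past the ceiling depends on following `pageInfo` until `hasNextPage` is false, since `first: 100` on its own returns only one page. A minimal sketch of that loop, assuming a hypothetical `Page` type and a caller-supplied `fetch_page` closure that performs the actual GraphQL request:

```rust
/// One page of a GraphQL connection: the nodes plus the `pageInfo` fields.
/// Hypothetical type, introduced only for this sketch.
pub struct Page<T> {
    pub nodes: Vec<T>,
    pub has_next_page: bool,
    pub end_cursor: Option<String>,
}

/// Follow cursors until the connection is exhausted, collecting every node.
/// `fetch_page` receives the cursor to pass as `after` (`None` on the first call).
pub fn collect_all<T, E>(
    mut fetch_page: impl FnMut(Option<&str>) -> Result<Page<T>, E>,
) -> Result<Vec<T>, E> {
    let mut items = Vec::new();
    let mut cursor: Option<String> = None;
    loop {
        let page = fetch_page(cursor.as_deref())?;
        items.extend(page.nodes);
        match (page.has_next_page, page.end_cursor) {
            // Keep going only when the server says there is more AND gave a cursor.
            (true, Some(next)) => cursor = Some(next),
            _ => break,
        }
    }
    Ok(items)
}
```

Stopping when `endCursor` is missing, even if `hasNextPage` is true, keeps a malformed response from causing an infinite loop.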

Where GraphQL does NOT help

  • github (Events API): The user activity event stream (/users/{user}/events) has no GraphQL equivalent. This source must remain REST-based.
  • Gitea: Gitea has no GraphQL API (open request since 2019, never implemented: https://github.com/go-gitea/gitea/issues/8258). Gitea ingestion must stay REST-only.

Suggested approach

  1. Add a shared GraphQL client helper (just a thin wrapper around reqwest::Client::post to the /graphql endpoint with auth headers)
  2. Replace per-repo REST commit fetching with batched GraphQL history queries (biggest performance win)
  3. Optionally unify repo discovery into a single GraphQL query combining viewer.repositories + user.repositoriesContributedTo
  4. Consider whether github-search can be retired entirely once github-repo covers commits and a new GraphQL path covers issues/PRs per repo
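For step 1, the helper mostly needs to assemble the right URL, headers, and JSON body. The sketch below shows only that assembly and uses the standard library so it stands alone; in the real code `reqwest` and a JSON library would presumably handle sending and serialization. `GraphqlRequest`, `build_request`, and the `moments-ingest` user agent are hypothetical, and the URL assumes github.com rather than a GitHub Enterprise host.

```rust
/// Hypothetical description of the POST a thin GraphQL helper would send.
pub struct GraphqlRequest {
    pub url: String,
    pub headers: Vec<(String, String)>,
    pub body: String,
}

/// Escape a string for embedding inside a JSON string literal.
fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be written as \u00XX.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Assemble the request for a query with no variables.
pub fn build_request(token: &str, query: &str) -> GraphqlRequest {
    GraphqlRequest {
        url: "https://api.github.com/graphql".to_string(),
        headers: vec![
            ("Authorization".to_string(), format!("Bearer {token}")),
            ("Content-Type".to_string(), "application/json".to_string()),
            // Placeholder value; GitHub expects some User-Agent on API requests.
            ("User-Agent".to_string(), "moments-ingest".to_string()),
        ],
        body: format!("{{\"query\":\"{}\"}}", json_escape(query)),
    }
}
```

Keeping request assembly separate from sending makes the helper easy to unit test without network access.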

Rate limiting note

The GitHub GraphQL API uses a point-based rate limit (5,000 points/hour) rather than a request count, and complex queries cost more points. The batching strategy needs to balance batch size against point cost; 5-10 repos per request is likely the sweet spot.
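One way to tune the batch size empirically is to add the `rateLimit` field to each batched query and read back the reported cost. The sketch below is illustrative only: `plan_batches` and `batches_affordable` are hypothetical helpers, and the arithmetic assumes that every batch costs roughly what the last one did.

```rust
/// Selection to add at the top level of each batched query so the response
/// reports what the query cost and how much budget is left.
pub const RATE_LIMIT_SELECTION: &str = "rateLimit { cost remaining resetAt }";

/// Split the discovered repos into batches of at most `batch_size` items.
/// A batch size of 0 is treated as 1 to avoid a panic in `chunks`.
pub fn plan_batches<T: Clone>(repos: &[T], batch_size: usize) -> Vec<Vec<T>> {
    repos
        .chunks(batch_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Estimate how many more batches fit in the remaining point budget,
/// given the cost observed for the previous batch.
pub fn batches_affordable(remaining_points: u32, cost_per_batch: u32) -> u32 {
    remaining_points / cost_per_batch.max(1)
}
```

If `batches_affordable` drops below the number of batches still planned, the ingester can wait until `resetAt` instead of running into the limit mid-run.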

Reference: grenade/moments#2