feat: add analytics subcommand for mbox sender analysis

Adds a new `analytics` subcommand that analyzes Google Takeout mbox files
to identify top senders by message count. Designed for efficient processing
of large files (60GB+) with minimal memory usage.

Features:
- Streams files line-by-line with 1MB buffer (never loads entire file)
- Extracts sender email addresses from From: headers
- Counts messages per sender and displays top N (default 10)
- Shows progress output every 10,000 messages
- No Gmail API access needed

Usage:
  cull-gmail analytics <MBOX_FILE> [-n TOP]

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-16 15:12:33 +02:00
parent aee4bc2eaa
commit 285a42a7a3
3 changed files with 176 additions and 1 deletions


@@ -53,8 +53,9 @@ Get started with cull-gmail in minutes using the built-in setup command:
- **Flexible configuration**: Support for file-based config, environment variables, and ephemeral tokens
- **Safety first**: Dry-run mode by default, interactive confirmations, and timestamped backups
- **Label management**: List and inspect Gmail labels for rule planning
- **Message operations**: Query, filter, and perform batch operations on Gmail messages
- **Rule-based automation**: Configure retention rules with time-based filtering and automated actions
- **Mbox analysis**: Analyze Google Takeout exports to identify top senders (efficient streaming, no API needed)
- **Token portability**: Export/import OAuth2 tokens for containerized and CI/CD environments
### Running the optional Gmail integration test
@@ -201,9 +202,12 @@ cull-gmail [OPTIONS] [COMMAND]
### Commands
- `init`: Initialize configuration and OAuth2 credentials
- `labels`: List available Gmail labels
- `messages`: Query and operate on messages
- `rules`: Configure and run retention rules
- `analytics`: Analyze mbox files for sender statistics
- `token`: Export and import OAuth2 tokens
## Command Reference
@@ -370,6 +374,70 @@ cull-gmail rules run --execute --skip-trash
cull-gmail rules run --execute --skip-delete
```
### Analytics Command
Analyze Google Takeout mbox files to identify top senders by message count.
**Note**: This command does NOT require Gmail API access. It streams the local mbox file line by line through a 1MB buffer and never loads the entire file into memory, which makes it suitable for analyzing large exports (60GB+).
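The streaming approach can be illustrated with a short Python sketch. This is not the tool's implementation; the function name `count_senders` and its parameters are hypothetical. It assumes every message starts with a `From ` separator line, that body lines beginning with `From ` are escaped (as mbox writers normally do), and it ignores `From:` headers folded across several lines.

```python
from collections import Counter
from email.utils import parseaddr


def count_senders(path, top=10, buffer_size=1024 * 1024):
    """Return the `top` most frequent sender addresses in an mbox file.

    Hypothetical helper: reads the file line by line through a fixed-size
    buffer, so memory use depends on the number of distinct senders rather
    than on the size of the file.
    """
    counts = Counter()
    in_headers = False   # True while reading a message's header block
    seen_from = False    # True once this message's From: header was counted

    with open(path, "rb", buffering=buffer_size) as fh:
        for raw in fh:
            if raw.startswith(b"From "):
                # mbox separator line: a new message begins here.
                in_headers = True
                seen_from = False
                continue
            if not in_headers:
                continue  # body line, nothing to do
            if raw in (b"\n", b"\r\n"):
                in_headers = False  # blank line ends the header block
                continue
            if not seen_from and raw[:5].lower() == b"from:":
                value = raw[5:].decode("utf-8", errors="replace").strip()
                _, address = parseaddr(value)
                if "@" in address:
                    counts[address.lower()] += 1
                seen_from = True

    return counts.most_common(top)
```

Only the counter grows with the input, which is what keeps memory use small even for very large exports.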
#### Syntax
```bash
cull-gmail analytics [OPTIONS] <MBOX_FILE>
```
#### Arguments
- `<MBOX_FILE>`: Path to mbox file to analyze (typically from Google Takeout)
#### Options
- `-n, --top <TOP>`: Number of top senders to display [default: 10]
#### Examples
**Show top 10 senders from a Google Takeout mbox**:
```bash
cull-gmail analytics ~/takeout/All\ mail\ Including\ Spam\ and\ Trash.mbox
```
**Show top 20 senders**:
```bash
cull-gmail analytics -n 20 ~/takeout/All\ mail.mbox
```
**Example Output**:
```
[INFO] Scanned 1234567 messages total.
Top 10 senders:
45678 newsletter@example.com
23456 promotions@example.com
18901 notifications@example.com
12345 support@example.com
9876 marketing@example.com
8765 updates@example.com
7654 alerts@example.com
6543 digests@example.com
5432 reports@example.com
4321 announcements@example.com
```
#### Use Cases
- Identify top email senders in your mailbox before configuring rules
- Analyze historical email patterns from a full account export
- Find unexpected high-volume senders for further investigation
- Plan email retention policies based on actual sender frequency
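As a rough illustration of the rule-planning use case, the sketch below parses a saved report and turns high-volume senders into Gmail `from:` search terms. The helper names and the `min_count` threshold are hypothetical, and the parser assumes each sender line has the form `<count> <address>` as in the example output.

```python
def parse_top_senders(report):
    """Extract (count, address) pairs from saved analytics output.

    Hypothetical helper: lines that are not `<count> <address>` (such as the
    summary and heading lines) are skipped.
    """
    senders = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and "@" in parts[1]:
            senders.append((int(parts[0]), parts[1]))
    return senders


def sender_queries(senders, min_count=10000):
    """Build a Gmail `from:` search term for each sender at or above min_count."""
    return [f"from:{address}" for count, address in senders if count >= min_count]


if __name__ == "__main__":
    report = """[INFO] Scanned 1234567 messages total.
Top 10 senders:
45678 newsletter@example.com
23456 promotions@example.com
9876 marketing@example.com
"""
    for query in sender_queries(parse_top_senders(report)):
        print(query)
```

The resulting terms use Gmail's standard `from:` operator, so they can be reviewed by hand before being used in any query or rule.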
#### Getting a Google Takeout mbox File
1. Visit [Google Takeout](https://takeout.google.com)
2. Select "Gmail" and choose the desired email account
3. Select export format "Standard" (generates .mbox files)
4. Download the export (it can be very large and may be split into multiple parts)
5. Extract/combine the mbox files if needed
6. Use `cull-gmail analytics` on the mbox file
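If an export arrives as several mbox files, one possible way to combine them (step 5) is plain concatenation. The sketch below assumes each part is itself a complete mbox file whose first line is a `From ` separator; the function name `combine_mbox` is hypothetical. It copies in fixed-size chunks so large parts are never held in memory.

```python
def combine_mbox(parts, destination, buffer_size=1024 * 1024):
    """Concatenate mbox files into `destination`, one chunk at a time.

    Hypothetical helper: assumes every part is a complete mbox file. A
    newline is added after any part that does not end with one, so the next
    part's `From ` separator always starts on its own line.
    """
    with open(destination, "wb") as out:
        for part in parts:
            last_byte = b"\n"  # an empty part needs no extra newline
            with open(part, "rb") as src:
                while chunk := src.read(buffer_size):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")
```

For example, `combine_mbox(["part1.mbox", "part2.mbox"], "combined.mbox")` would produce a single file that can then be passed to `cull-gmail analytics`; the file names here are placeholders.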
## Gmail Query Syntax
The `-Q, --query` option supports Gmail's powerful search syntax: