Agency/docs/decisions/003-public-data-only.md


# ADR 003 — Collectors may only use publicly accessible data
**Date:** 2026-05-11
**Status:** Accepted

---
## Context
As we began designing data collectors, a choice emerged: should the system use
authenticated APIs (requiring API keys, admin approval, or platform registration),
or restrict itself strictly to publicly accessible data?

This decision has architectural, social, and philosophical consequences that reach
beyond mere technical convenience.

---
## Decision
**Collectors may only use publicly accessible data: no API keys, authentication,
or platform permission required. If the data isn't public, it isn't in scope.**

This is one of the [core design principles](../ARCHITECTURE.md#public-data-only).
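One way a collector codebase might enforce this rule mechanically is a guard that runs before every outbound request. The sketch below is illustrative only: the function name `assert_public_request`, the exception class, and the deny-lists of header and query-parameter names are assumptions, not part of this project, and a name-based check like this can only catch obvious credentials.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical deny-lists. The names below are illustrative assumptions,
# not an exhaustive or project-defined set.
CREDENTIAL_HEADERS = {"authorization", "proxy-authorization", "cookie", "x-api-key"}
CREDENTIAL_PARAMS = {"access_token", "api_key", "apikey", "token"}


class CredentialedRequestError(ValueError):
    """Raised when a request would carry credentials."""


def assert_public_request(url, headers=None):
    """Raise if the URL or headers appear to carry credentials.

    Returns None when nothing suspicious is found.
    """
    parts = urlsplit(url)

    # Reject https://user:password@host/... style URLs.
    if parts.username or parts.password:
        raise CredentialedRequestError("URL embeds user credentials")

    # Reject query strings such as ?access_token=...
    for name, _value in parse_qsl(parts.query, keep_blank_values=True):
        if name.lower() in CREDENTIAL_PARAMS:
            raise CredentialedRequestError(f"query parameter {name!r} looks like a credential")

    # Reject headers that typically carry keys, tokens, or sessions.
    for name in headers or {}:
        if name.lower() in CREDENTIAL_HEADERS:
            raise CredentialedRequestError(f"header {name!r} carries credentials")
```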

---

## Arguments for this approach
**1. Independence from platform gatekeepers**

Any system that relies on API keys can be shut down by the platform that issued them.
A revoked key, a change to the terms of service, or an unresponsive forum admin can
silently break a collector. Public data requires no permission, so there is no
credential to revoke.

**2. Anyone can run it immediately**

A core goal of this project is forkability. If running the system requires registering
API keys across five platforms, the barrier to adoption becomes significant. Restricting
to public data means anyone can clone the repo and run it: no setup, no approvals,
no accounts.

**3. Alignment with open source values**

The communities this system is designed to serve, OSArch and others like it, are
built on the principle that open contribution should be visible and accessible. A system
that measures open participation should itself be open in how it operates. Requiring
platform credentials to observe public activity is a contradiction.

**4. Trust and transparency**

If the system only uses data that anyone could read manually in a browser, there is no
hidden data access. Community members can verify exactly what is being collected by
looking at the collector code and then visiting the same public URLs themselves.
This makes the system auditable in the most literal sense.

**5. Resilience and longevity**

APIs change, get deprecated, and disappear. Public web interfaces tend to be more stable
over time because they serve a broader audience. A system built on public data is less
brittle than one built on private API contracts.

**6. No surveillance creep**

API keys often grant access to more data than is needed: private messages, user emails,
internal activity. Restricting to public data draws a hard line: the system can only see
what the community has already chosen to make visible. This is not a limitation; it is
a safeguard.

---
## Consequences
- GitHub unauthenticated API (60 requests/hour per IP address) is acceptable; authenticated API is not
- Vanilla forum API (requires admin key) is out of scope; RSS feeds are in scope
- Open Collective public pages are in scope; their private API endpoints are not
- Git history on public repositories is in scope and requires no credentials
- Rate limiting must be handled gracefully, because public endpoints are more constrained
- Some participation signals may be unavailable or lower fidelity as a result;
  this is an acceptable tradeoff for the independence and trust it buys
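As a rough illustration of handling rate limits gracefully, a collector could decide how long to wait from the response status and headers before retrying. The sketch below assumes the `retry-after`, `x-ratelimit-remaining`, and `x-ratelimit-reset` headers that GitHub documents for its REST API; the function name `seconds_until_retry` and the 60-second fallback are hypothetical choices, and `Retry-After` values in HTTP-date form are not handled.

```python
import time

# Assumed fallback when the server gives no usable hint.
DEFAULT_BACKOFF_SECONDS = 60.0


def seconds_until_retry(status, headers, now=None):
    """Return how many seconds to wait before retrying, or 0.0 if not rate limited."""
    h = {k.lower(): v for k, v in headers.items()}

    # 429 always means "slow down". A 403 is treated as a rate limit only
    # when the headers say so; otherwise it is a genuine permission error.
    rate_limited = status == 429 or (
        status == 403
        and ("retry-after" in h or h.get("x-ratelimit-remaining") == "0")
    )
    if not rate_limited:
        return 0.0

    now = time.time() if now is None else now
    try:
        if "retry-after" in h:
            # Delay given directly in seconds.
            return max(0.0, float(h["retry-after"]))
        if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
            # Reset time is given as UTC epoch seconds.
            return max(0.0, float(h["x-ratelimit-reset"]) - now)
    except ValueError:
        pass  # Unparseable header value: fall through to the default.
    return DEFAULT_BACKOFF_SECONDS
```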

---

## What this rules out (and why that's acceptable)
| Source | Status | Reason |
|---|---|---|
| Vanilla forum API | Out of scope | Requires admin-issued API key |
| GitHub authenticated API | Out of scope | Requires personal access token |
| Matrix private rooms | Out of scope | Not public by definition |
| Open Collective private data | Out of scope | Requires authentication |

The signal may be less complete than it would be with full API access. That is a
deliberate tradeoff. A less complete signal that anyone can verify and run is more
valuable to this project than a more complete signal that depends on platform permission.
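To make the tradeoff concrete: where the forum API is ruled out, a public RSS feed can still supply basic participation signals such as post titles, links, and timestamps. The following is a minimal sketch using only the Python standard library; the sample feed, its URLs, and the `parse_feed` helper are hypothetical, and a real forum feed may carry different or additional elements.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 document standing in for a public forum feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example forum</title>
    <item>
      <title>First discussion</title>
      <link>https://forum.example.org/discussion/1</link>
      <pubDate>Mon, 04 May 2026 09:30:00 +0000</pubDate>
    </item>
    <item>
      <title>Second discussion</title>
      <link>https://forum.example.org/discussion/2</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Extract title, link, and publication time from each RSS item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        items.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                # RSS dates use the RFC 822 format; missing dates stay None.
                "published": parsedate_to_datetime(pub_date) if pub_date else None,
            }
        )
    return items


if __name__ == "__main__":
    for entry in parse_feed(SAMPLE_FEED):
        print(entry["published"], entry["title"], entry["link"])
```

Everything the sketch reads is visible to anyone who opens the same feed URL in a browser, which is the property this decision is meant to preserve.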