Writing telemetry.py

Telemetry is written from the defender’s chair, not the attacker’s ego.

Reference the existing ROA poisoning telemetry for structure.

Control-plane playbook scenarios need similar treatment.

The defender’s perspective problem

Attackers know what they’re doing. Defenders don’t.

When Scarlet Semaphore creates a fraudulent ROA (Phase 2, Action 2.1), they know:

  • This is deliberate

  • This enables the hijack in phase 3

  • Timing is coordinated

  • Cover story is prepared

Defenders see:

  • ROA creation log entry

  • Timestamp

  • Account username

  • IP address (possibly anonymised via TOR/VPN)

  • Prefix and AS number

That’s it. No context. No explanation. No “THIS IS AN ATTACK” label.

Telemetry must generate what defenders would actually observe, not what would be convenient for detection.

Event structure for control-plane attacks

From the playbook Phase 2 Action 2.1 (fraudulent ROA creation):

def generate_roa_creation_event(clock, scenario_name):
    """
    Generate RPKI ROA creation event.
    
    This is what RIR audit logs would show.
    It looks routine unless you know to check whether
    this account should be creating ROAs for this prefix.
    """
    return {
        "event_type": "rpki.roa_change",
        "timestamp": clock.now(),
        "source": {
            "feed": "rpki-ca",
            "observer": "ripe_ncc",  # or ARIN, APNIC, depending on prefix allocation
        },
        "attributes": {
            "change_type": "created",
            "prefix": "203.0.113.0/24",
            "origin_as": 64513,  # Attacker's AS, NOT victim's
            "max_length": 25,
            "actor": "admin_backup",  # Compromised account
            "actor_ip": "185.220.101.45",  # TOR exit node, suspicious but not proof
            "actor_location": "RU",  # Misleading geolocation
            "previous_roa": None,  # No ROA existed before (or victim's was deleted)
        },
        "scenario": {
            "name": scenario_name,
            "attack_step": "roa_creation",  # Only for training mode
        }
    }
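
A minimal usage sketch, assuming the same clock and bus objects the phase generators further down receive; the scenario name string is illustrative:

# Sketch: publish the event onto the scenario bus.
event = generate_roa_creation_event(clock, "the_poisoned_registry")
bus.publish(event)  # in non-training runs, strip the "scenario" block first (see "What NOT to include in telemetry")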

Signals defenders plausibly receive

For Phase 1 (reconnaissance), defenders receive almost nothing:

  • Maybe HTTPS queries to public RPKI validators (stat.ripe.net, the Cloudflare RPKI API)

  • These queries are completely normal; thousands happen daily

  • No alerts, no logs worth noticing

For Phase 2 (ROA poisoning), defenders might receive:

  • ROA creation audit log (IF they’re collecting RIR logs, most aren’t)

  • Validator state change notification (30-90 minutes later, IF monitoring validators)

  • No BGP events yet (announcement hasn’t happened)

For Phase 3 (hijack), defenders should receive:

  • BGP UPDATE announcement

  • RPKI validation result (returns VALID, confusingly)

  • Traffic flow changes (IF NetFlow monitored)

  • Service degradation alerts

  • Route flapping noise (attacker-generated)
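
One way to keep that asymmetry honest in code is to declare, per phase, which feeds are even allowed to emit events. A rough sketch: the phase keys match the action names used in register() below; the feed names "rpki-validator" and "service-monitoring" are assumptions, the rest appear elsewhere in this file.

# Sketch: which telemetry feeds can plausibly emit events in each phase.
# The mapping is illustrative, not part of the scenario engine.
PLAUSIBLE_FEEDS_BY_PHASE = {
    "reconnaissance": {"webserver-logs"},
    "fraudulent_roa": {"rpki-ca", "rpki-validator"},
    "hijack": {"bgp-monitor", "rpki-validator", "netflow-collector", "service-monitoring"},
}

def feed_is_plausible(phase, event):
    """Reject events claiming a feed defenders couldn't realistically have in this phase."""
    return event["source"]["feed"] in PLAUSIBLE_FEEDS_BY_PHASE[phase]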

Ambiguity and overlap

In The Poisoned Registry, the attackers deliberately create ambiguity:

Misdirection 1: Blame automation

Telemetry should allow this interpretation:

{
    "event_type": "rpki.roa_change",
    "attributes": {
        "actor": "automation-bot",  # Not obviously human
        "change_reason": "scheduled_maintenance",  # Plausible
        ...
    }
}

Defenders seeing this might conclude “automation error” rather than “deliberate attack”.

Misdirection 2: Timing during maintenance

If ROA creation happens during declared maintenance window:

{
    "event_type": "maintenance.scheduled",
    "timestamp": clock.now() - 3600,  # One hour before ROA change
    "attributes": {
        "maintenance_type": "rpki_roa_updates",
        "scheduled_by": "noc_team",
    }
}

# Then later...
{
    "event_type": "rpki.roa_change",
    "timestamp": clock.now(),
    "attributes": {
        "change_type": "created",
        ...
    }
}

Correlation makes the attack look like legitimate maintenance gone wrong.

Misdirection 3: Multiple concurrent issues

Generate overlapping events:

# At t=600, several things happen at once:
- BGP hijack announcement
- Unrelated router high CPU alert
- Disk space warning on logging server
- BGP session flap on different peer

Which one is the attack? All of them look important.

Realistic misinterpretation allowance

Telemetry should permit defenders to reach wrong conclusions that are reasonable given the data.

Reasonable wrong conclusion 1: Configuration error

ROA created for wrong prefix by mistake:

{
    "event_type": "rpki.roa_change",
    "attributes": {
        "change_type": "created",
        "prefix": "203.0.113.0/24",  # Victim's prefix
        "origin_as": 64513,  # Attacker's AS
        "comment": "Bulk update from spreadsheet row 47",  # Suggests copy-paste error
    }
}

Defender interpretation: “Someone fat-fingered the spreadsheet during bulk ROA update.”

This is plausible. These errors happen. It’s wrong, but reasonably wrong.

Reasonable wrong conclusion 2: Vendor/automation issue

{
    "event_type": "rpki.roa_change",
    "attributes": {
        "actor": "rpki-automation-v2.1",
        "change_type": "created",
        "triggered_by": "api_call",
        "api_client": "netops-tooling",
    }
}

Defender interpretation: “Automation created ROA, probably from configuration management system. Bug in our tooling?”

Also plausible. Automation does unexpected things.

Reasonable wrong conclusion 3: Insider error

{
    "event_type": "rpki.roa_change",
    "attributes": {
        "actor": "admin_backup",  # Legitimate account name
        "actor_ip": "192.0.2.100",  # Internal IP (if attacker pivoted through internal system)
        "timestamp_utc": "2024-12-20T15:30:00Z",  # Business hours
    }
}

Defender interpretation: “Bob from NOC team made ROA change during normal work hours from office network. Probably legitimate, maybe check with Bob?”

Plausible until they actually check with Bob and he says “wasn’t me.”

Early telemetry should be boring

Phase 1 generates almost nothing interesting:

def generate_phase1_reconnaissance(clock, bus):
    """
    Phase 1 is nearly invisible.
    Generate events that look like routine RPKI queries.
    """
    # Query to public validator
    bus.publish({
        "event_type": "https.request",
        "timestamp": clock.now(),
        "source": {"feed": "webserver-logs", "observer": "stat.ripe.net"},
        "attributes": {
            "method": "GET",
            "uri": "/data/rpki-validation/data.json?resource=203.0.113.0/24",
            "client_ip": "185.220.101.1",  # TOR exit
            "user_agent": "curl/7.68.0",
        }
    })

This is so boring it wouldn’t trigger alerts at RIPE. Thousands of similar queries happen daily.

If Phase 1 looks suspicious, you’ve made it unrealistic.
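
If you want the recon query to blend in, one option is to surround it with a handful of unrelated validator lookups from other clients. A sketch reusing the event shape above; the client IPs, queried resource, and user agent are illustrative:

import random

def generate_background_validator_queries(clock, bus, count=5):
    """
    Sketch: benign validator lookups from unrelated clients, so the
    attacker's query is one log line among many.
    """
    for _ in range(count):
        bus.publish({
            "event_type": "https.request",
            "timestamp": clock.now() + random.randint(-1800, 1800),  # scattered over an hour
            "source": {"feed": "webserver-logs", "observer": "stat.ripe.net"},
            "attributes": {
                "method": "GET",
                "uri": f"/data/rpki-validation/data.json?resource=198.51.100.0/2{random.randint(2, 4)}",
                "client_ip": f"198.51.100.{random.randint(1, 254)}",
                "user_agent": "python-requests/2.28.1",
            },
        })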

Late telemetry should be confusing

Phase 3 generates many simultaneous events:

def generate_phase3_hijack(clock, bus):
    """
    Phase 3 generates multiple overlapping signals.
    Some are attack, some are noise, some are cascading effects.
    """
    # The actual hijack
    bus.publish({
        "event_type": "bgp.update",
        "timestamp": clock.now(),
        "attributes": {
            "prefix": "203.0.113.128/25",
            "origin_as": 64513,
            "as_path": [3333, 64513],
        }
    })
    
    # RPKI validation (confusingly returns VALID)
    bus.publish({
        "event_type": "rpki.validation",
        "timestamp": clock.now() + 2,
        "attributes": {
            "prefix": "203.0.113.128/25",
            "origin_as": 64513,
            "validation_state": "VALID",  # Because fraudulent ROA exists
            "roa_found": True,
        }
    })
    
    # Service degradation (5 minutes later, cascading effect)
    bus.publish({
        "event_type": "service.degraded",
        "timestamp": clock.now() + 300,
        "attributes": {
            "service": "web_frontend",
            "latency_p99": 5000,  # 5 second latency, was 100ms
            "error_rate": 0.15,  # 15% errors
        }
    })
    
    # Route flapping noise (attacker-generated)
    for i in range(20):
        bus.publish({
            "event_type": "bgp.flap",
            "timestamp": clock.now() + 60 + (i * 30),
            "attributes": {
                "prefix": f"10.{i}.0.0/16",  # Unrelated prefixes
                "flap_count": 10 + i,
            }
        })
    
    # Monitoring alerts (from multiple systems, arriving over a couple of
    # minutes rather than in the same second)
    alerts = [
        "BGP peer session reset",
        "High CPU on router-r1",
        "Disk space low on log-server",
        "NetFlow export delayed",
    ]
    for i, alert in enumerate(alerts):
        bus.publish({
            "event_type": "alert.triggered",
            "timestamp": clock.now() + 120 + (i * 45),  # staggered, not simultaneous
            "attributes": {"message": alert, "severity": "warning"},
        })

Defenders seeing this stream must:

  • Identify which event is the attack

  • Distinguish attack from noise

  • Correlate across events (hijack + validation + service degradation)

  • Ignore red herrings (unrelated alerts)

If correlation is obvious, telemetry is too clean.
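
One small technique that helps: never schedule cascading effects at exact offsets. A minimal jitter helper, as a sketch; the name and spread value are assumptions, not part of the framework:

import random

def jittered(base_offset, spread=0.2):
    """Sketch: smear an event offset by +/- spread so effects don't line up perfectly."""
    return base_offset + random.uniform(-spread, spread) * base_offset

# e.g. service degradation lands "roughly five minutes" after the hijack:
# bus.publish({..., "timestamp": clock.now() + jittered(300), ...})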

Allowing reasonable misinterpretation

Perfect clarity is a bug. Defenders should have multiple plausible interpretations.

For the fraudulent ROA at t=300:

  • Interpretation A: Legitimate maintenance. “ROA created during maintenance window by automation. Probably routine.”

  • Interpretation B: Configuration error. “Wrong prefix in spreadsheet during bulk update. Human error.”

  • Interpretation C: Automation bug. “RPKI tooling created unexpected ROA. Software bug.”

  • Interpretation D: Compromised credentials. “Account used, but actor IP is TOR exit node. Possible compromise?”

  • Interpretation E: Insider threat. “Legitimate account from internal IP. Possible malicious insider?”

All five are reasonable given limited data. Only D and E are correct, but A/B/C are plausible enough that defenders might investigate them first.

Telemetry should support all interpretations. Defenders must gather additional evidence to rule out wrong ones.
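
As a sketch, here is a single event whose attributes keep all five readings open; every value is drawn from the examples earlier in this section:

ambiguous_roa_event = {
    "event_type": "rpki.roa_change",
    "timestamp": clock.now(),  # falls inside a declared maintenance window (A)
    "attributes": {
        "change_type": "created",
        "prefix": "203.0.113.0/24",  # victim's prefix
        "origin_as": 64513,  # attacker's AS
        "comment": "Bulk update from spreadsheet row 47",  # spreadsheet error? (B)
        "triggered_by": "api_call",  # automation bug? (C)
        "actor": "admin_backup",  # legitimate account name: insider? (E)
        "actor_ip": "185.220.101.45",  # TOR exit node: compromised credentials? (D)
    },
}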

What NOT to include in telemetry

  • No attack labels. Real logs don’t label themselves (see the sketch after this list).

  • No intent attribution. Defenders infer intent from behaviour, not from metadata.

  • No perfect correlation keys. Real events don’t link themselves conveniently.

  • No defender instructions. Let defenders figure out what to investigate next.
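
For the first point, one pattern is to strip the training-only scenario block just before publishing whenever the run is not an explicit training exercise. A minimal sketch, assuming the event shape from generate_roa_creation_event above and a hypothetical training_mode flag:

def publish_for_defenders(bus, event, training_mode=False):
    """
    Sketch: drop attack-step labels unless the run is explicitly a
    training exercise. `training_mode` is a hypothetical flag, not part
    of the scenario framework.
    """
    if not training_mode:
        event = {k: v for k, v in event.items() if k != "scenario"}
    bus.publish(event)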

“Early boring, late confusing” as a design pattern

Structure telemetry generation by phase:

def register(event_bus, clock, scenario_name):
    # Phase 1: Mostly silent (t=0 to t=300)
    @event_bus.on(lambda e: e['entry']['action'] == 'reconnaissance')
    def phase1_quiet(event):
        # Generate minimal, boring telemetry
        pass
    
    # Phase 2: Subtle but detectable (t=300 to t=600)
    @event_bus.on(lambda e: e['entry']['action'] == 'fraudulent_roa')
    def phase2_subtle(event):
        # Generate ROA change log
        # Defenders COULD catch this IF they're monitoring audit logs
        pass
    
    # Phase 3: Loud but confusing (t=600 to t=900)
    @event_bus.on(lambda e: e['entry']['action'] == 'hijack')
    def phase3_chaos(event):
        # Generate hijack + noise + cascading failures
        # Many simultaneous events, correlation required
        pass

This models how The Poisoned Registry operation actually progresses: invisible preparation, subtle compromise, chaotic exploitation.
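
Wiring it up is then a single call wherever the scenario loads its telemetry module; a sketch, assuming the event_bus and clock objects used throughout this file:

# Sketch: called by the scenario loader. The name string is illustrative.
register(event_bus, clock, scenario_name="the_poisoned_registry")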

Testing your telemetry

Before running the scenario, ask:

  • Does telemetry come from realistic sources? “rpki-ca”, “bgp-monitor”, “netflow-collector” are plausible. “attack-detector-9000” is not.

  • Could events be legitimately misinterpreted? If ROA creation can only mean “attack”, telemetry is too clean.

  • Is timing realistic? RPKI shouldn’t propagate instantly. Cascading failures shouldn’t happen simultaneously.

  • Are there red herrings? If every event is attack-related, defenders aren’t being tested.

  • Would real defenders see this? If the scenario assumes perfect logging/monitoring, it’s aspirational, not realistic.

The goal is telemetry that defenders must interpret, not telemetry that interprets itself.
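
A couple of these checks are mechanical enough to automate. A rough sketch over a captured event stream; the function and the feed allow-list are assumptions, not an existing test:

PLAUSIBLE_FEEDS = {"rpki-ca", "rpki-validator", "bgp-monitor", "netflow-collector", "webserver-logs"}

def check_telemetry_realism(events):
    """
    Sketch: flag event streams that label themselves or claim implausible
    sources. `events` is a list of event dicts like those produced above.
    """
    problems = []
    for event in events:
        if "scenario" in event:
            problems.append(f"{event['event_type']}: carries training-only scenario labels")
        feed = event.get("source", {}).get("feed")
        if feed and feed not in PLAUSIBLE_FEEDS:
            problems.append(f"{event['event_type']}: implausible feed '{feed}'")
    return problems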