# Yahoo Groups Archive Team Summary

This torrent contains a compressed, parsed dump of the data collected by Archive Team when Yahoo! Groups shut down. See [their wiki](https://wiki.archiveteam.org/index.php/Yahoo!_Groups) or `./docs/Yahoo! Groups - Archiveteam.pdf` for more information.

## Data Format

* Data is [Zstandard](https://github.com/facebook/zstd)-compressed [JSON Lines](https://jsonlines.org/) (`.jsonl.zst`), with one message per line.
* Messages are split into one file per year, such as `2001.jsonl.zst`. Messages whose year could not be parsed are in `unknown.jsonl.zst`.
* All JSON data collected by Archive Team is included unchanged; the only addition is a `parsed` key holding a parsed version of the email.
* Each JSON line has the following schema:

```typescript
interface MessageLine {
    "subject": string,
    "rawEmail": string, // the raw email used to generate "parsed"
    "parsed": { // adding this key is the only modification to Archive Team's dump
        "headerLines": Array<{key: string, line: string}>, // email headers
        "text": string, // full parsed text from the email
        "subject": string,
        "date": string, // ISO timestamp
        "to": string,
        "from": string,
        "messageId": string,
        "html": string,
    },
    "postDate": string,
    "from": string,
    "topicId": number,
    "spamInfo": {reason: string, string: boolean},
    "canDelete": boolean,
    "replyTo": string,
    "senderId": string,
    "nextInTime": number,
    "userId": number,
    "prevInTime": number,
    "prevInTopic": number,
    "headers": {[name: string]: string},
    "authorName": string,
    "numMessagesInTopic": number,
    "msgSnippet": string, // partial parsed message from Archive Team; "parsed.text" is more complete
    "contentTrasformed": boolean,
    "msgId": number,
    "nextInTopic": number,
    "systemMessage": boolean,
}
```
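
As a rough illustration of the schema, the Python sketch below decodes a single line and pulls out a few fields. The helper name `summarize_line` and the sample record are hypothetical and exist only for this example; the key names come from the schema. Falling back to `msgSnippet` when `parsed` is missing or empty is an assumption about how incomplete records might look, not something the dump guarantees.

```python
import json


def summarize_line(line: str) -> dict:
    """Decode one JSONL line and return a few commonly used fields."""
    record = json.loads(line)
    parsed = record.get("parsed") or {}
    return {
        "msgId": record.get("msgId"),
        "topicId": record.get("topicId"),
        "subject": parsed.get("subject") or record.get("subject"),
        "date": parsed.get("date"),
        # Assumption: fall back to Archive Team's snippet if there is no parsed text.
        "text": parsed.get("text") or record.get("msgSnippet", ""),
    }


# Hypothetical record, made up for illustration only.
sample = json.dumps({
    "msgId": 7,
    "topicId": 3,
    "subject": "Hello",
    "msgSnippet": "Hi all",
    "parsed": {
        "subject": "Hello",
        "date": "2001-02-03T04:05:06.000Z",
        "text": "Hi all, welcome.",
    },
})
print(summarize_line(sample)["text"])
```

Using `.get()` throughout keeps the sketch tolerant of records that lack some keys, at the cost of returning `None` for them.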

## Processing Examples

Any tool that can read JSON Lines should work. The commands below are examples using `zstdcat`, `jq`, and GNU `parallel`.

```bash
# Print the parsed text of every message from 2001.
# --long=31 lets zstd decompress data that uses a window of up to 2 GiB.
zstdcat --long=31 2001.jsonl.zst | jq .parsed.text
```

```bash
# Print the parsed text of every message from 2001, running jq on multiple CPU cores.
# Output order is not guaranteed to match the order in the file.
zstdcat -T0 --long=31 2001.jsonl.zst | parallel --line-buffer --pipe --roundrobin jq .parsed.text
```
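
For processing beyond what `jq` handles comfortably, the decompressed stream can be piped into a script. The following is a minimal Python sketch that counts messages per `topicId` using only the standard library. The function name `count_by_topic` and the script name `count_topics.py` are hypothetical, and silently skipping blank or malformed lines is an assumption made for this example.

```python
import json
import sys
from collections import Counter
from typing import Iterable


def count_by_topic(lines: Iterable[str]) -> Counter:
    """Count messages per topicId, skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Assumption: malformed lines are skipped, not fatal.
        counts[record.get("topicId")] += 1
    return counts


if __name__ == "__main__" and not sys.stdin.isatty():
    # Hypothetical usage:
    #   zstdcat --long=31 2001.jsonl.zst | python count_topics.py
    for topic_id, n in count_by_topic(sys.stdin).most_common(10):
        print(topic_id, n)
```

Reading from standard input keeps the script free of third-party dependencies, since decompression stays in `zstdcat`, and it processes one line at a time rather than loading a whole year into memory.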
