~sircmpwn/lists.sr.ht#149: Ability to import gzipped mbox files

The mailing list system Mailman offers its archives as gzipped mbox files. It would be convenient to import them directly. One example is the openwrt-devel mailing list.

Status: REPORTED
Submitter: ~aparcar
Assigned to: No-one
Submitted: 1 year, 3 months ago
Updated: 1 year, 1 month ago
Labels: No labels applied.

~nabijaczleweli 1 year, 1 month ago

The problems with allowing compressed input are many, but, as with the other limits, I'll let Drew speak to them.

However, writing a pipermail scraper was relatively trivial:

url="$1"
url="${url%/}"

curl -SL "$url" | python3 -c '
from html.parser import HTMLParser
import sys

class PipermailHTMLParser(HTMLParser):
    td_in_row = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.td_in_row = 0
        elif tag == "td":
            self.td_in_row += 1
        elif tag == "a":
            if self.td_in_row == 3:
                print(dict(attrs).get("href"))

PipermailHTMLParser().feed(sys.stdin.read())
' > filelist  #'

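# Download every archive file listed on the index page.
awk  "{print \"$url/\" \$0}" filelist | xargs wget
# Decompress the gzipped ones in place; plain-text archives are left alone.
grep '\.gz$'                 filelist | xargs gunzip
# Concatenate the archives, in index order, into <list-name>.mbox.
sed  's/\.gz$//'             filelist | xargs cat > "$(echo "$url" | awk -F/ '{print $NF}').mbox"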

Running this with "https://lists.infradead.org/pipermail/openwrt-devel/" or "https://alioth-lists.debian.net/pipermail/pkg-zfsonlinux-devel" (note that the archives aren't always gzipped, as is currently the case with pkg-zfsonlinux-devel's August archive) produced nice uniform 218M and 15M mboxes.
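
Because some monthly archives are served gzipped and others as plain text, the decompress-and-concatenate step could also be sketched in Python, sniffing the gzip magic bytes instead of trusting the file extension. This is only an illustration under that assumption; `open_archive` and `concat_archives` are hypothetical names, not part of the script above or of lists.sr.ht.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_archive(path):
    """Open a monthly archive for binary reading, gunzipping if needed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")


def concat_archives(paths, out_path):
    """Append every archive in `paths`, in order, to one mbox at `out_path`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open_archive(path) as src:
                shutil.copyfileobj(src, out)
```

Like the `cat` in the shell version, this copies bytes verbatim, so it relies on each archive ending in a newline for the next "From " separator to start on its own line.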

Running them through mbox-split, as linked from the bottom of the /~u/l/settings/import-export page, produced nice importable 10M chunks.
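
For readers curious what such a split involves, here is a rough Python sketch of size-limited chunking on message boundaries. It is not the mbox-split tool mentioned above and makes no claim to match its behaviour; `split_mbox`, `out_prefix` and the chunk naming are made up for illustration, and it assumes a line starting with "From " only ever appears as a message separator.

```python
def split_mbox(src_path, out_prefix, max_bytes=10 * 1024 * 1024):
    """Split an mbox into numbered chunks of at most `max_bytes` each.

    Chunks only break at "From " separator lines, so a single message
    larger than `max_bytes` still ends up whole, in a chunk of its own.
    Returns the list of chunk paths written.
    """
    chunks = []
    out = None
    written = 0        # bytes written to the current chunk
    message = []       # lines of the message being collected
    message_size = 0

    def flush():
        nonlocal out, written, message, message_size
        if not message:
            return
        # Start a new chunk if there is none yet or this message would overflow it.
        if out is None or written + message_size > max_bytes:
            if out is not None:
                out.close()
            path = f"{out_prefix}.{len(chunks):03d}"
            out = open(path, "wb")
            chunks.append(path)
            written = 0
        out.writelines(message)
        written += message_size
        message, message_size = [], 0

    with open(src_path, "rb") as src:
        for line in src:
            if line.startswith(b"From "):
                flush()  # the previous message is complete
            message.append(line)
            message_size += len(line)
    flush()
    if out is not None:
        out.close()
    return chunks
```

Calling `split_mbox("openwrt-devel.mbox", "openwrt-devel.mbox")` would write `openwrt-devel.mbox.000`, `openwrt-devel.mbox.001` and so on, each no larger than the default 10 MiB unless a single message exceeds it.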

~sircmpwn: would this be a good inclusion into /contrib?
