mailing list archive as mbox
Alexander Best
alexbestms at wwu.de
Mon Mar 8 01:24:21 UTC 2010
Giorgos Keramidas wrote on 2010-03-07:
> On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best
> <alexbestms at wwu.de> wrote:
> > Dan Nelson wrote on 2010-03-07:
> >> In the last episode (Mar 07), Alexander Best said:
> >> > hi there,
> >> > what are the steps i need to perform to get a copy of the entire
> >> > mailing list archive of, let's say, freebsd-current@ in mbox format?
> >> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
> >> where you can download weekly gzipped archives of all the mailing
> >> lists since their creation.
> > thanks for the hint, but it would take hours to download all those
> > gzipped files, extract them and merge them.
> > i really need ALL the messages of a mailing list. of course i could
> > use the gzipped files you mentioned if i had some script for
> > downloading, extracting and merging all those files for me.
> It's relatively easy to hack one.
wow!!! thanks a billion. that's a great script. i pointed the vars containing
ftp sites at mirrors near me, which give me better download speed, and will run
the script for freebsd-current@ tonight (~850 archives to pull).
thanks again. great job. :-)
alex
> You can get a list of year names from the /archive/ directory itself
> with curl(1) and a small amount of Python plumbing around curl:
> >>> from subprocess import Popen as popen, PIPE
> >>> import re
> >>> yre = re.compile(r'^d.*\s(\d+)$')
> >>> devnull = open("/dev/null", "w")
> >>> def years():
> ...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
> ...     ylist = []
> ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
> ...         m = yre.match(line)
> ...         if m:
> ...             ylist.append(int(m.group(1)))
> ...     return ylist
> ...
> >>> years()
> [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
>  2004, 2005, 2006, 2007, 2008, 2009, 2010]
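The parsing half of years() can be tried without touching the FTP server. The following is only a Python 3 sketch of that step: the name parse_years and the sample listing text are made up for illustration, and it assumes the server returns "ls -l"-style lines, as the regular expression in the quoted session implies.

```python
import re

# Same pattern as yre in the quoted session: a directory line ("d...")
# whose last whitespace-separated field consists only of digits.
YEAR_RE = re.compile(r'^d.*\s(\d+)$')


def parse_years(listing):
    """Return the all-digit directory names found in an FTP listing."""
    years = []
    for line in listing.splitlines():
        m = YEAR_RE.match(line.rstrip())
        if m:
            years.append(int(m.group(1)))
    return years


# Hypothetical listing text, invented for this example only.
SAMPLE_LISTING = (
    "drwxr-xr-x   2 ftp  ftp   512 Jan  1  2009 1994\r\n"
    "drwxr-xr-x  40 ftp  ftp  1024 Jan  1  2009 1995\r\n"
    "-rw-r--r--   1 ftp  ftp   120 Jan  1  2009 README\r\n"
)

print(parse_years(SAMPLE_LISTING))
```

Keeping the parsing separate from the curl call makes it possible to check the regular expression against saved listing text before pointing the script at a mirror.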
> Then you can grab a list of the freebsd-current archives by looping
> through the list of years and looking for the list of files that match
> the pattern:
>     ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+.freebsd-current.gz)
> Using a pipe to parse the output of curl you can collect a list of all
> the files that match this pattern, e.g.:
> >>> def yearfiles(year):
> ...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
> ...     curl = "curl -o /dev/stdout %s/" % base
> ...     flist = []
> ...     fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')
> ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
> ...         m = fre.match(line)
> ...         if m:
> ...             flist.append("%s/%s" % (base, m.group(1)))
> ...     return flist
> ...
> >>> yearfiles(1994)
> []
> >>> yearfiles(1995)
> ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>  ...]
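For lists other than freebsd-current@, the file-matching step can take the list name as a parameter. This is a rough Python 3 sketch, not the quoted script itself: archive_urls, its listname argument and the sample listing are hypothetical, and the pattern is a slightly stricter variant (escaped dots, anchored at the end of the line) of the fre expression above.

```python
import re

BASE = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive"


def archive_urls(listing, year, listname="freebsd-current"):
    """Build full URLs for the '<digits>.<listname>.gz' files in a listing."""
    # Digits, a literal dot, the list name and ".gz" at the end of the line.
    pattern = re.compile(r'(?<!\d)(\d+\.%s\.gz)\s*$' % re.escape(listname))
    base = "%s/%4d/%s" % (BASE, year, listname)
    urls = []
    for line in listing.splitlines():
        m = pattern.search(line)
        if m:
            urls.append("%s/%s" % (base, m.group(1)))
    return urls


# Hypothetical listing text, invented for this example only.
SAMPLE_LISTING = (
    "-rw-r--r--  1 ftp  ftp  41234 Mar  8  2010 19950101.freebsd-current.gz\n"
    "-rw-r--r--  1 ftp  ftp  51234 Mar  8  2010 19950226.freebsd-current.gz\n"
    "-rw-r--r--  1 ftp  ftp    100 Mar  8  2010 README\n"
)

for url in archive_urls(SAMPLE_LISTING, 1995):
    print(url)
```

Swapping BASE for a nearby mirror only works if the mirror uses the same directory layout, which is an assumption worth checking by hand first.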
> Concatenating the file lists of all years and fetching each one of
> them with curl is then trivial:
> >>> ylist = years()
> >>> ylist
> [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
> 2004, 2005, 2006, 2007, 2008, 2009, 2010]
> >>> flist = []
> >>> for y in ylist:
> ...     f = yearfiles(y)
> ...     flist = flist + f
> ...
> >>> len(flist)
> 785
> Once you have the list of all the remote gzipped files, you can loop
> through the list of files once more and fetch them locally. I'm only
> going to fetch the first two files here, but feel free to fetch all of
> them in your version of the script:
> >>> flist = flist[:2]
> >>> flist
> ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
> 'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']
> >>> import os
> >>> from subprocess import call
> >>> def getfile(url):
> ...     out = os.path.basename(url)
> ...     retcode = call(["curl", "-o", out, url], stderr=devnull)
> ...     if retcode == 0:
> ...         print "fetched %s" % url
> ...     return tuple([url, out, retcode])
> ...
> >>> map(getfile, flist)
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
> ...
> [('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>   '19950101.freebsd-current.gz', 0),
>  ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz',
>   '19950226.freebsd-current.gz', 0)]
> A slightly hackish script that collects all this into a more usable
> whole but lacks LOTS of error checking is the following:
> #!/usr/bin/env python
> # Python 2 script: it relies on print statements and a list-returning map().
>
> from subprocess import call, Popen as popen, PIPE
> import os
> import re
> import sys
>
> # Sink for curl's progress output, opened for writing.
> devnull = open("/dev/null", "w")
> yre = re.compile(r'^d.*\s(\d+)$')
> fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')
>
> def years():
>     # List /archive/ and keep the directories whose names are all digits.
>     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
>     ylist = []
>     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>         m = yre.match(line)
>         if m:
>             ylist.append(int(m.group(1)))
>     return ylist
>
> def yearfiles(year):
>     # List one year's freebsd-current directory and build full archive URLs.
>     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
>     curl = "curl -o /dev/stdout %s/" % base
>     flist = []
>     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>         m = fre.match(line)
>         if m:
>             flist.append("%s/%s" % (base, m.group(1)))
>     return flist
>
> def getfile(url):
>     # Download one archive into the current directory; keep curl's exit code.
>     out = os.path.basename(url)
>     retcode = call(["curl", "-o", out, url], stderr=devnull)
>     if retcode == 0:
>         print "fetched %s" % url
>     return tuple([url, out, retcode])
>
> if __name__ == "__main__":
>     print "Fetching year list."
>     ylist = years()
>     if len(ylist) == 0:
>         print "No yearly archives found."
>         sys.exit(1)
>
>     print "Fetching file lists for %d years." % len(ylist)
>     flist = []
>     for y in ylist:
>         f = yearfiles(y)
>         flist = flist + f
>     if len(flist) == 0:
>         print "No archives found."
>         sys.exit(1)
>
>     print "Fetching %d archives." % len(flist)
>     fresult = map(getfile, flist)
>
>     # Split the results by curl exit code (0 means success).
>     fok = [fentry[1] for fentry in fresult if fentry[2] == 0]
>     ferr = [fentry[1] for fentry in fresult if fentry[2] != 0]
>     if len(fok) > 0:
>         print ""
>         print "Successfully downloaded %d archives" % len(fok)
>         for f in fok:
>             print " %s" % f
>     if len(ferr) > 0:
>         print ""
>         print "Failed to download %d archives" % len(ferr)
>         for f in ferr:
>             print " %s" % f
> Running this with a couple of lines to limit the FTP connections a bit
> and fetch only parts of the freebsd-current mail archives produces the
> following output on my laptop:
> keramida@kobe:/tmp$ python foo.py
> Fetching year list.
> Fetching file lists for 3 years.
> Fetching 5 archives.
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950305.freebsd-current.gz
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950312.freebsd-current.gz
> fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950319.freebsd-current.gz
>
> Successfully downloaded 5 archives
>  19950101.freebsd-current.gz
>  19950226.freebsd-current.gz
>  19950305.freebsd-current.gz
>  19950312.freebsd-current.gz
>  19950319.freebsd-current.gz
> Without the limiting code that I removed from the example, it will try
> to fetch all the archive files for all 17 years.
> Then you can simply type:
>     gzip -cd *.freebsd-current.gz > freebsd-current.mbox
> to produce a single UNIX mbox file with all the messages.
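The merge step can also be done from Python, for instance when the number of archives makes the shell glob unwieldy. This is a minimal Python 3 sketch under the assumption that the downloaded files sit in one directory and that sorting their names gives the wanted order (the YYYYMMDD prefixes seen above sort chronologically); merge_archives is a made-up helper name.

```python
import glob
import gzip
import shutil


def merge_archives(pattern, out_path):
    """Decompress every file matching pattern, in sorted name order,
    and write the concatenated result to out_path.

    Returns the list of input files that were merged.
    """
    names = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for name in names:
            # Stream each archive so large files never sit fully in memory.
            with gzip.open(name, "rb") as src:
                shutil.copyfileobj(src, out)
    return names


# Example call, equivalent in intent to the gzip command quoted above:
#   merge_archives("*.freebsd-current.gz", "freebsd-current.mbox")
```

The output name must not match the input pattern, otherwise a second run would try to read the mbox file as a gzip archive.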