How to download an entire (active) phpbb forum

download phpbb wget

One of the forums that I frequent (and have added a LOT of quality content to) seems to be having problems with its server. I am not confident in the admins' ability to sort out these problems, and when I talked to one of them he mentioned that they don't back the data up.

As a complete fallback in case something goes horrifically wrong, I want to download the entire forum. I am aware that I can't download the DB or the PHP files etc. I just want to make a locally browsable copy of the entire forum.

This means I could (when I have time) transfer the posts to the new site should they be starting fresh (on purpose or not).

Are there any tools that would allow this?

Side note: Obviously it's really important that I can browse it locally, which would be very difficult if each of the links still points to 'http://www.thesite.com/forum/specific_page.php' rather than '/forum/specific_page.php'.

Best Answer

I am doing this right now. Here's the command I'm using:

wget -k -m -E -p -np -R 'memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0*' -o log.txt http://www.example.com/forum/

I wanted to strip out those pesky session id things (sid=blahblahblah). They seem to get added automatically by the index page, and then get attached to all the links in a virus-like fashion. Except for one squirreled away somewhere - which links to a plain index.php which then continues with no sid= parameter. (Perhaps there's a way to force the recursive wget to start from index.php - I don't know).

I have also excluded some other pages that lead to a lot of cruft being saved. In particular memberlist.php and viewtopic.php where p= is specified can create thousands of files!
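For anyone who wants to adapt the command, here is a rough annotated sketch of the same invocation. The option descriptions are my reading of the wget manual; the names reject_list and mirror_forum are just illustrative, and the URL in the usage comment is the placeholder from above, not a real forum.

```shell
#!/bin/sh
# Sketch only: the same wget call as above, with each option annotated.
# reject_list and mirror_forum are made-up names for illustration.

# Comma-separated patterns for files wget should not keep (passed to -R).
# Single quotes stop the shell from expanding the * characters itself.
reject_list='memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0*'

mirror_forum() {
    # -k   convert links in saved pages so they work when browsed locally
    # -m   mirror mode: recursive download with timestamping
    # -E   add an .html extension to saved HTML pages that lack one
    # -p   also fetch page requisites such as images and stylesheets
    # -np  never climb above the starting directory
    # -R   reject files whose names match these patterns
    # -o   write wget's messages to log.txt instead of the terminal
    wget -k -m -E -p -np -R "$reject_list" -o log.txt "$1"
}

# Usage (placeholder URL, not run automatically):
#   mirror_forum http://www.example.com/forum/
```

Keeping the patterns in one quoted variable makes it easier to add or drop an exclusion without retyping the whole command.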

Due to this bug in wget (http://savannah.gnu.org/bugs/?20808), it will still download an astounding number of those useless files, especially the viewtopic.php?p= ones, before simply deleting them. So this is going to burn a lot of time and bandwidth.
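To get a feel for how much was fetched only to be thrown away, the log can be searched afterwards. This is a small sketch that assumes wget writes a line of the form "Removing <file> since it should be rejected." for each such file; that wording may differ between wget versions, so check your own log.txt first. The function name count_rejected is made up for illustration.

```shell
#!/bin/sh
# Sketch only: count files that wget downloaded and then deleted.
# Assumption: the log has one "... since it should be rejected." line
# per removed file; adjust the pattern if your wget words it differently.

count_rejected() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at zero even when there are no matches.
    grep -c 'since it should be rejected' "$1" || true
}

# Usage, once the mirror has finished:
#   count_rejected log.txt
```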
