Ask HN: How can I back up an old vBulletin forum without admin access?
90 points by spike021 10 months ago | 71 comments
I'm part of a car community where our vBulletin forum contains a wealth of information.

What we've run into:

1. Admin/owner is nearly / absolutely unreachable, which causes a variety of issues. Mainly we cannot even request a traditional backup of the database underneath the forum software.

2. As with anything, the forums don't get much active engagement other than older forum regulars. However, Google searches easily find useful posts for things like DIY maintenance, modification installs, test data from driving with ECU tunes, track day experiences, etc.

3. It's easy to point people on other social networks at posts by their URL, but due to neglect the website constantly has problems, making access increasingly complicated and inconsistent.

Ideally, it'd be nice to find a way to scrape everything as closely as possible into a manageable database.

Even more ideally, if we could convert said scraped data into a format that is easily publishable to a new platform, that would be handy. Even if the new platform is static and simply renders the old threads.

I can't imagine we are the only forum that is experiencing problems like this with most forums probably dying in the last decade.

Has anyone gone through this kind of archival process with vBulletin before?

Thanks.




Backing up to WARC, HTML, whatever is great for posterity but not much more than that.

Assuming you're a member of the organization and therefore licensed to use the content (but merely unable to access it): Purely hypothetically speaking, if an admin is this MIA and obviously not on top of the job, the odds are probably high that they've neglected maintenance. Old PHP server running out-of-date PHP applications... not the most secure combo in the world. I wouldn't be surprised if there were some magic strings you could send to the server to get it to regurgitate the contents of the database in a more developer-friendly, strongly-typed fashion which you could import to MyBB or XenForo and continue chugging along.


The exact same thing happened to a phpBB bass guitar forum I was a member of about 15 years ago. The owner just disappeared and we knew the bills weren’t being paid.

Hypothetically, I found a remote code execution vulnerability in that version of phpBB, read the configuration file to get the MySQL details, and then used mysqldump to download the database.

You can find exploits by just Googling or looking in Metasploit. They’re usually pretty simple query string things.

We set up a clone of the forum with a similar name just in time, then emailed the user database to tell them about the new site when the old one disappeared.

Sadly this caught the previous owner’s attention and he sent a cease and desist, despite his version of the site not existing any longer. So, we wiped the database, and then all the users just signed up again. It still lives on as https://www.basschat.co.uk/

Hypothetically.


Amazing story. I hypothetically consider you a hero.

>Sadly this caught the previous owner’s attention and he sent a cease and desist, despite his version of the site not existing any longer.

Wow, what an ass. I just can't understand the thought process of some people.


Even though I didn't keep the software maintained, I wouldn't be happy that sensitive member information (email addresses, login info) was leaked (unknown destination) by the community. I would feel some responsibility for that information getting out, and at least feel the need to reach out to past members and anyone using that data. If I wanted to be an "ass" then I would have taken harsher legal routes.

I would just scrape the site. I would even consider anonymizing names, as everyone should have the right to delete old content IMO.
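One way the anonymizing idea could be sketched is keyed hashing, so the same member always maps to the same label without the archive carrying the real name. The `pseudonym` helper, the `member_` prefix, and the key are all hypothetical; this is a rough illustration, not a privacy guarantee (quoted names inside post bodies would need separate handling).

```python
import hashlib
import hmac


def pseudonym(username, secret_key):
    """Map a username to a stable label using HMAC-SHA256.

    secret_key is a hypothetical value the archivist keeps private; without it,
    the labels cannot be recomputed from a list of known usernames.
    """
    # Normalize so "Spike" and " spike " end up with the same label.
    normalized = username.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return f"member_{digest[:10]}"


if __name__ == "__main__":
    key = b"replace-with-a-private-random-key"  # assumption: chosen by the archivist
    print(pseudonym("Spike", key))
```

Using a keyed hash rather than a plain hash is a deliberate choice here: usernames are easy to guess, so an unkeyed hash could be reversed by hashing a list of candidates.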

After scraping, I would then watch the analytics. Anything which seems popular could then be given more attention. For example, I would probably create a design for the site itself, then create dedicated pages for anything popular. The pages could be curated "this is what you came for" and a link to the original pages.

Forums probably don't come back. You could start out with some minimal community features such as comments to see if anyone is willing to bite. A Discord server or similar could be another option. Kind of depends on the demographic I guess. Old people like forums. ;)


What you're suggesting could get OP in trouble if caught. Not sure it's worth it, but then again I'm not that into cars.


It’s okay as they were “purely hypothetically speaking”


And the site is called "hacker news" ;)


He should have just said "..in Minecraft."


That is self-evident.


The format you want is WARC. Even the Library of Congress uses it. There are many, many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own due diligence.


Thanks for the links and info. Totally fair on my not sharing those in my original post. I was coming at it thinking in terms of still making it useful to others. WARC seems like a good first step. I'm just not sure what the intermediate steps would be to get something usable like a vBulletin -- basically with the intention of being able to continue sharing the archived stuff with users who may not be as technical and only know how to consume from a forum format if that makes sense.

Thanks again.


> I'm just not sure what the intermediate steps would be to get something usable like a vBulletin…

Once you have an archive, you can convert that unstructured data to structured data. For example, if I look at https://www.vbulletin.org/forum/showthread.php?t=326241, the thread title and hierarchy are in <table class="navheader">, posts are in <div id="posts">, etc. I see an old project (https://github.com/IanLondon/detectorist-scraper) that may be a useful place to start, and I imagine there have been other similar efforts.

Once you have a structured representation (in a database, in JSON/XML files, etc.), you can decide whether to use it to build a static site, to import it into other forum software, etc.
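A minimal sketch of that unstructured-to-structured step, using only Python's standard library. It assumes each post body sits in an element whose id starts with `post_message_` inside the `<div id="posts">` container; that prefix is an assumption about the theme's markup and should be checked against the real pages. The `PostExtractor` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class PostExtractor(HTMLParser):
    """Collect the text of every element whose id starts with id_prefix."""

    def __init__(self, id_prefix="post_message_"):  # prefix is an assumption
        super().__init__(convert_charrefs=True)
        self.id_prefix = id_prefix
        self.posts = []        # list of {"post_id": str, "text": str}
        self._depth = 0        # nesting depth inside the current post, 0 = outside
        self._chunks = []
        self._post_id = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._depth and tag == "br":
                self._chunks.append("\n")   # keep line breaks as newlines
            return
        if self._depth:
            self._depth += 1
            return
        element_id = dict(attrs).get("id") or ""
        if element_id.startswith(self.id_prefix):
            self._depth = 1
            self._post_id = element_id[len(self.id_prefix):]
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._chunks).strip()
            self.posts.append({"post_id": self._post_id, "text": text})

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up sample markup, not copied from a real forum page.
SAMPLE = """
<div id="posts">
  <div id="post_message_101">Torque the lug nuts to spec.<br>Then re-check.</div>
  <div id="post_message_102">Thanks, <b>that</b> worked.</div>
</div>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE):
        print(post)
```

The depth counter relies on tags being balanced. Old forum themes often emit unclosed tags, so a more forgiving parser may be needed on real pages.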


You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems though with wanting to browse interconnected links in a forum format, but try this as a first step.

One potential option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them in Python using the warctools module, and maybe BeautifulSoup, to parse and extract the data embedded in the WARC archives into your own "fresh" HTML web server.

https://github.com/internetarchive/warctools
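To show roughly what reading a WARC involves, here is a hand-rolled sketch that walks the records of an uncompressed WARC stream using only the standard library. It deliberately does not use the warctools API, and it skips validation and most edge cases, so a maintained library is the safer choice for real work. `iter_warc_records`, `http_payload`, and `make_record` are hypothetical helper names, and the sample records are synthetic.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if not line.strip():
            continue                    # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length says how many bytes of record body follow.
        body = stream.read(int(headers["content-length"]))
        yield headers, body


def http_payload(body):
    """Drop the HTTP status line and headers stored in a response record."""
    _, _, payload = body.partition(b"\r\n\r\n")
    return payload


def make_record(uri, html):
    """Build a synthetic response record, only for trying out the reader."""
    http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + html
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(http)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + http + b"\r\n\r\n"


if __name__ == "__main__":
    data = make_record("https://example.com/showthread.php?t=1", b"<html>one</html>")
    for headers, body in iter_warc_records(io.BytesIO(data)):
        if headers.get("warc-type") == "response":
            print(headers["warc-target-uri"], http_payload(body))
```

Most crawlers write `.warc.gz` files; opening those with `gzip.open(path, "rb")` and passing the result as `stream` should work, but that is untested here.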


Please link to the forum, then we at ArchiveTeam will save it to archive.org.


Related:

An Introduction to the WARC File

https://news.ycombinator.com/item?id=39183670


One liner:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --execute robots=off --wait=0.2 --domains example.com https://example.com


Two options:

1. Scrape it with wget or httrack or similar tool.

2. If the owner's not really around it’s probably behind on its security patches, and there are some relatively recent-ish vB exploits that would let you gain code execution and take a backup the “extremely illegal way” of the entire database, site, etc.

I recommend 1, but 2 is amusing to ponder briefly over a coffee ;)


Just to be perfectly clear to anyone who doesn't yet know, in the US, option #2 potentially involves federal prison, for real.

Someone in a situation like this would be well-advised to instead direct creative energy at making option #1 work for you.

(Example: Some kind of crawling/archiving, to capture a copy while you can, and then you can develop good scraping of the data out of your copy at your leisure, whether it ends up being scraped from archived HTML, JSON, XML, or whatever. There's a chance you need a bespoke/tweaked crawler, to avoid missing data, such as posts that are for some reason reachable by human in a Web browser, but not by a particular crawler, and then the consideration to keep in mind is to try to avoid doing something that might look like the bad kind of "hacking" to a non-technical person, even though you aren't bypassing authentication&authorization nor exploiting any vulnerabilities.)


> Just to be perfectly clear to anyone who doesn't yet know, in the US, option #2 potentially involves federal prison, for real.

HN has a worldwide reader base. In what countries would this not be illegal?


I've seen 2 done after the admin became uncooperative and kind of held the community forum hostage.

They even went a step further, deleted all the old posts out of the database and set up a redirect to the new forum hosting.

Very illegal, but very effective at solving the problem.


No, deleting the original posts is not problem solving.


We (community for a video game) lost over a decade of accumulated community content due to an unreachable owner. This happened as I was considering scraping, but did not get a chance to implement it. Internet Archive has been a godsend - a lot of public content that was served in a text-based format is available from there. Due to a PHP misconfiguration, even a bunch of binary files were archived because they were being served with PHP errors.


I would make a read-only archive first, using `wget --mirror`.

This will fix relative paths, download assets, etc and can be published as-is on a new site. I'm ignoring copyright questions in the interest of archiving fragile data.

Then I'd use an HTML parser against the local archive to extract the individual posts, if the additional work was justified.
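Before parsing, it may help to sort the mirrored files into threads. The sketch below groups file names by thread ID and page number so each thread can be processed in order. It assumes the mirror kept URL-style names such as `showthread.php?t=123&page=2.html`; the `page=` parameter and both helper names are assumptions to check against the actual mirror.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming: "showthread.php?t=<id>" with an optional "page=<n>" parameter.
THREAD_RE = re.compile(r"showthread\.php\?(?:.*&)?t=(\d+)")
PAGE_RE = re.compile(r"[?&]page=(\d+)")


def thread_key(filename):
    """Return (thread_id, page) for a mirrored thread page, or None otherwise."""
    match = THREAD_RE.search(filename)
    if not match:
        return None
    page = PAGE_RE.search(filename)
    return int(match.group(1)), (int(page.group(1)) if page else 1)


def group_thread_pages(paths):
    """Map thread_id -> list of paths, ordered by page number."""
    threads = defaultdict(list)
    for path in map(Path, paths):
        key = thread_key(path.name)
        if key:
            thread_id, page = key
            threads[thread_id].append((page, path))
    return {
        thread_id: [path for _, path in sorted(pages, key=lambda item: item[0])]
        for thread_id, pages in threads.items()
    }


if __name__ == "__main__":
    # "./mirror" is a placeholder for wherever the mirror was saved.
    for thread_id, pages in group_thread_pages(Path("./mirror").rglob("*")).items():
        print(thread_id, [p.name for p in pages])
```

Grouping first keeps the later HTML parsing simple: each thread's pages can be fed to the parser in order and the posts appended to one record.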


It has been a VERY long time since I've done this with `wget`, but here's what I pulled out of my notes.

I used to use this to mirror Slashdot when I was travelling for work and had limited/no internet:

```

  wget --ignore-length --no-remove-listing --no-check-certificate --recursive --page-requisites --span-hosts --convert-links --no-verbose --tries=3 --timeout=60 --level=1 --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0" --html-extension --exclude-directories=my --directory-prefix="./slashdot/" -e robots=off --cookies --load-cookies slashdot-cookies.txt --keep-session-cookies http://slashdot.org/
```

I suspect that still works, just needs the user-agent updated.


An acquaintance did basically this a few years ago on a vBulletin.

My memory says there was a project to be a vB->Discourse converter that didn't rely on DB access, but I can't immediately find it.


> there was a project to be a vB->Discourse converter

God, why? I can't be the only person to find Discourse inferior to vBulletin.


vBulletin is commercial software; Discourse is open source.

While I'd agree that Discourse isn't ideal, it's one of the better options available in the open source world.


I'd take myBB over Discourse.


You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl

It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.


vBulletin stores its database credentials in a settings file inside the web root. With the correct validation, or access to a company email address, you should be able to get access to the web root from the web host. From there, find the settings PHP file that contains the database credentials, read them out, and use them to log in to the database. I don't know why the other comments aren't suggesting this as a first step. You should not need a scraper to read the content back out. I've performed this procedure many times with vBulletin and similar software like phpBB and SMF, admittedly a decade back at this point.


> This is a gem to help extract data from vBulletin Forums, specifically those which you have no control over.

https://github.com/lloydpick/vbulletin

This is a very old tool, it’s hard to say if it will work; then again, seems very relevant too so worst case it could provide an inspiration.


You think the forum for your car make/model is bad? The one for mine is in the same state, but it also has an expired SSL certificate and broken outgoing email, so it's slowly getting deindexed from Google and you can't sign up for an account (the verification email never arrives), which means there's no way to access the attachments, etc. It's sad, honestly.

I ran a car forum (sold to VerticalScope 15+ years ago) and it's still chugging along on the same version of PunBB that I had it on when I left, so it seems that even the "experts" haven't found a simple way to migrate between forum software.


Over a decade ago I helped quite a few people migrate their forums off places like Proboards, ActiveBoards, and many other "free" forum hosts to their own hosts using phpBB/SimpleMachinesForum etc.; many such hosts had highly customized forum software and no ability to download the database in any usable format. Copies of my converters might still be floating around on the Internet. At least one of these free hosts used something fairly similar to vBulletin, IIRC.

The process is in principle not difficult: scrape the site (I recommend a dedicated scraper for that), then go through and extract everything relevant into a SQL database formatted the way your target forum software expects. The hardest part was recovering BBCode formatting in a usable fashion. Unfortunately my converters were written back when I didn't understand HTML parsing terribly well, so they're a hodgepodge of ugly regexes and handrolled string parsing.
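As a rough illustration of the two steps described above, the sketch below converts a small subset of post HTML back to BBCode with the standard-library parser and stores the result in SQLite. The tag mapping covers only a few tags, and the `posts` table layout is hypothetical; a real import would have to match the schema of the target forum software.

```python
import sqlite3
from html.parser import HTMLParser

# One-to-one mappings; any other tag is dropped and only its text is kept.
SIMPLE_TAGS = {"b": "b", "strong": "b", "i": "i", "em": "i", "u": "u"}


class HtmlToBBCode(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._open_links = []   # whether each open <a> produced a [url] tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SIMPLE_TAGS:
            self.out.append(f"[{SIMPLE_TAGS[tag]}]")
        elif tag == "br":
            self.out.append("\n")
        elif tag == "a":
            href = attrs.get("href")
            self._open_links.append(bool(href))
            if href:
                self.out.append(f"[url={href}]")
        elif tag == "img" and attrs.get("src"):
            self.out.append(f"[img]{attrs['src']}[/img]")

    def handle_endtag(self, tag):
        if tag in SIMPLE_TAGS:
            self.out.append(f"[/{SIMPLE_TAGS[tag]}]")
        elif tag == "a" and self._open_links and self._open_links.pop():
            self.out.append("[/url]")

    def handle_data(self, data):
        self.out.append(data)


def html_to_bbcode(html):
    parser = HtmlToBBCode()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


def store_posts(conn, posts):
    """Insert posts into a hypothetical 'posts' table, converting bodies to BBCode."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " post_id INTEGER PRIMARY KEY, thread_id INTEGER,"
        " author TEXT, body_bbcode TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
        [(p["post_id"], p["thread_id"], p["author"], html_to_bbcode(p["html"]))
         for p in posts],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, [{"post_id": 1, "thread_id": 10, "author": "alice",
                        "html": "Use <b>5W-30</b>, see <a href='https://example.com/diy'>this DIY</a>."}])
    print(conn.execute("SELECT body_bbcode FROM posts").fetchone()[0])
```

Quotes, code blocks, smilies, and attachments are the parts this sketch leaves out, and they are likely where most of the real effort goes.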


Modern HTML parsers are still a hodgepodge of ugly regexes and hand-rolled string parsing.


This reminds me of one of my first programming gigs: the owner of a shop lost the password to his online storefront and wanted to move to Shopify, so I had to write a Python scraper to save everything for him and upload it into Shopify.


When some of my favorite Google Groups forums were going away, I wrote a Perl scraper that started grabbing materials from my groups. Eventually Google perceived it as unwanted or suspicious traffic and shut off access to Google for the entire company of 2000 I worked for at the time. Fortunately this was on a timer, but I was sweating bullets.


Did they block the IP?


I've backed up a forum[1] by crawling it using wget and creating a WARC file.

I hosted it again by writing a Python script[2] that serves responses from that WARC file, and put it behind nginx with caching enabled.

[1] https://forums.empiresmod.com/index.php

[2] https://gitlab.com/thexa4/warc-server

[2, deb package] https://gitlab.com/thexa4/warc-server/-/jobs/5213679726/arti...


In the spirit of the name of this website, there have been plenty of RCE reports describing how to hack/crack a vBulletin. If the owner is not there, I guess he also doesn't run any software updates?

Though this suggestion might not be acceptable in the eyes of many.


If you are looking for a place to host this data, I can gladly help you to bring it to https://gearhead.town (a Lemmy instance that I set up to migrate the reddit car communities to the Fediverse)


You should be able to determine who the hosting company is, and offer to pay to keep the site up.

If the hosting company is paid they will make and keep a backup for you but under the permissions/access of the original owner.

If the hosting company gets permission to add you as admin to the site from the site owner, who may not be in touch with you, but may respond to the hosting company, then, (since you are paying the hosting company they will be happy to keep you around) you are home free.


I wrote something 10 years ago that scrapes a vBulletin forum into a Rails app and exposes a UI so the data is accessible. Happy to share the code with you if you'd like.


   1. Scrape it. Plenty of options here.
   2. Set up a new forum.
      I think the current state of the art is Discourse, but I could be wrong.
   3. Automate the recreation of posts and threads with some backend script
      (will depend on which software you picked).
      On each post, add a link to the original.
   4. Tell everyone about your superduper clone, move the old-timers over.
   5. ...
   6. Profit.


vBulletin just had a major release 5 months ago. It's not abandoned, and it's certainly superior to the UI of Discourse, especially to a community which is very familiar with vBulletin.


Fair enough, I honestly don't know the field. The point is not to change software as much as domain/instance, really.


There is only one admin? I was an admin on about a half dozen vB sites back in the day (none of them mine) and we had a lot of redundancy there, and they were mostly just to BS with people. I find it surprising a forum of any notable size would have an admin running it as the single point of failure. That’s disappointing.

It sounds like you have some options here. Best of luck.


I wonder if it wouldn't be best legally and practically to just have archive.org scrape it for you and link people to that.


I'd assume it'd be best legally and linking would probably work. But it wouldn't be as discoverable. Relying on sharing links only would work if everyone's connected, but unfortunately that's not usually the case. Especially these days, with various groups spread across separate social media (Discord, Facebook groups, etc.).


Maybe reach out to Archive Team, their mission as far as I understand it is to try to preserve stuff like this.


Is it on archive.org (the Wayback Machine)? That would help in case the online version of the site suddenly disappears. The owner could have passed away, or the email address may no longer be checked, etc.


I guess your best chance is to use something like https://archivebox.io/.


Download the whole website as it is and host it as a static site somewhere else.

    wget --mirror


I think vBulletin has an archive view, similar to how mailing list archives look. Not sure what the subpath is.


Where was the vBulletin board hosted, and do you have any kind of shell access to the server?


Unfortunately I don't have that knowledge or access.


Think before trying out unlawful exploits like others suggested, please. In a liberal world, offering money might do the trick. Not always, but worth the mention.


Maybe HTTrack may be useful. It can copy a full website into a folder, including resources: https://www.httrack.com/


Would Internet Brands buy it?


You mean, "would Internet Brands kill it in a more annoying way?"


Zilvia?


I don't know about vBulletin, only Invision Power Board, sorry dude.


It seems to me like you don’t own this data. If you want to preserve this data, try again with the owner?


Unfortunately the owner doesn't make themselves available to be contacted for anything at this point. They were more involved years ago.

As far as who owns the data, everything mentioned above is community-owned at least. So if anything it'd be up to individual community members, I'd think, to be OK with it. Since the forum is fairly dead, many of those original members aren't even active either.


Depends on where the data is stored. If the owner of the domain lives in the EU or a similar jurisdiction, the data will be covered by database rights for 15 years. There is no such thing as community-owned. Each post might be owned by the person who posted it, but often that is not the case.

EDIT: so of course you should dump the data, wget -r and dump it.


Ahhh. Sorry, hadn't thought of rights that way since that isn't my responsibility in my day to day work. Thanks for the explanation.


Why do you think they're making themselves unreachable? Is it a mental health thing, or about priorities, or something else?


I don't really like making assumptions honestly.

My best guess is that the model of car is "older" now, and they've subsequently moved onto other interests, including possibly the newer model, which I heard (but can't confirm) this person built a similar site for.

The only real changes to the forum in the past 3-4 years are more ads and many more performance/reliability problems, plus bots.


They might just view it as old, passive income now. If it's a niche and not likely that profitable, you could ask about buying or profit sharing and work on improving and maintaining it. Chances are, if it's not a priority for them, they're well aware it could use improvement and maintenance, but just never get around to it. Yet nor do they want to give it up.

I've run a forum for 20+ years and always wish I had time to do it justice, but other projects generally have better prospects, are more interesting or are more profitable. I'd at least be curious about a pitch that gave it fresh eyes but without losing the entire revenue stream or risking backend or inviting legal issues. In case any of that applies here and is useful to you.


I believe others have definitely reached out to the owner previously just for help with the performance/reliability problems. Not 100% sure on whether they've offered to buy the site or not. Either way everything I've heard is the owner either isn't responsive at all or very much doesn't have the desire to dig into any of the issues.

My guess is also probably the passive income side being the priority, especially since they've definitely increased the ads at least since I first joined the forum about a decade ago.

Thanks for the input. It's appreciated. Unfortunately as you can probably understand it's a relatively simple but complicated problem.


Ads don't make much and forums can be a bit of a pain, so the owner is likely just trying to recoup something for the historic hassles. If monetisation is not ads, it's subscriptions and a site owner can be reluctant to foist costs on regular users (who've contributed a lot of content).

Their reluctance to fix issues could be lack of time or wariness of breaking something and creating more work. The reluctance to sell could be because they don't know how to price it and don't want to regret selling it. In many instances, a web property is something you've dreamed of making bigger and you'd hate to give it up only to watch someone else take it further.

Scraping might just invite a legal hassle. I gather that this is not the route you're interested in, but personally I'd suggest trying to find a casual way to trade messages with the owner, find out their pain points and go from there. There would likely be a path where they can maintain some control but get help and relinquish the stranglehold.


Sometimes people just don’t care.

I moved on, please don’t bother me about it and let the website live or die.

That’s about the answer I got the last time one of my favorite websites was due to be deleted.



