Postgres and the Ultimate Hitchhiker’s Guide Part One

So, after some remarkably easy setup with xml2sql and PostgreSQL 8.3beta1, I’ve finally loaded all mainspace articles and templates into a database system. Now the hard part starts.

In order to generate the HTML for each article, I need to have a copy of the wikitext for that article. Since going through each article individually, loading it into a file, and then using the hacked-together parser I found on the resulting file is terribly slow, I’m pulling down the wikitext for every article and storing it in a flat text file for later parsing.

The interesting part about this is that Postgres likes to run the query, cache it to the disk, then replay it to the client. This is remarkably inefficient for my query, which returns about 2 million rows that total about 8 GB. Solution? Postgres cursors!

A cursor is essentially a way to tell Postgres to run a query, but don’t actually run the query. Then, using the FETCH command, the server will dynamically execute the query and return an arbitrary number of rows, without putting the entire query on disk. Now that’s efficient (or better suited for my hardware, anyway).

So right now, my hackish Perl script is fetching about 1,000 articles every 5-10 seconds and pushing them to disk. Should be done in no time…



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: