Archive for the ‘wikipedia’ Category

Announcement: New perlwikipedia maintainer

December 8, 2007

Well, I finally bit the bullet today and stepped down as maintainer of Perlwikipedia, my MediaWiki bot framework. My successor is ST47, a fellow admin on enwiki who serves on the Bot Approvals Group and has more bots than I have fingers.

I can’t say that it hasn’t been a long time coming, but I think that ST47 will do a much better job as maintainer than I did. He’s enthusiastic about Wikipedia, is a great Perl hacker, and has written more bolt-on enhancements to Perlwikipedia than there are original lines of code.

In any case, I believe we’ll see a brand-spankin’-new Perlwikipedia release in the near future, one that’s more shiny and can do your dishes.

~alex

Censorship = Awesome

November 28, 2007

I received an interesting email today from someone who is helping a friend use my closed proxy server. They said the friend couldn’t access the server even after I had given them login credentials. Naturally, I SSH-ed into the Toolserver, ran a wget against my domain name, and it worked. So either China had discovered the proxy and blacklisted it, or there was some other problem I couldn’t even begin to comprehend.

So, I checked my domain against a free site that determines whether a domain is accessible from China. The test came back saying that it was inaccessible from Beijing, but perfectly fine from Seattle. Guess where this is going.

Apparently, China discovered my proxy and blacklisted the domain name. Just to check, I tested the checker against a secondary Dyndns.org domain that I maintain for redundancy. The test worked fine.

Ain’t censorship grand? I have a feeling that I’m going to need to disclose my domain names only via email from now on.

~alex

Wikipedia’s Tor Problem

November 4, 2007

Today I noticed an essay on Tor. This essay, while very interesting, brought something slightly disturbing to my attention: that another admin had performed a bot-assisted blocking run on “suspected” Tor servers. This isn’t new; admins have been doing these sorts of runs under the radar for a while. Go back and look at the CharlotteWebb RFAR. Still, this was a fairly large run, and I wasn’t comfortable assuming there was no collateral damage.

So, I went ahead and wrote several Perl scripts to grab all pages linking to Template:Tor, the standard template used to notify editors that an IP is a blocked Tor server. This list went into a Postgres database. Then another script checked every IP and determined whether it was really blocked. Only about 12 of the 740 IPs carrying the template weren’t blocked, which is nothing major; I removed the template from those IPs’ talk pages and moved on. Finally, I used a Python script distributed with Tor to get a list of all exit nodes that can access the Wikipedia servers, and that list also went into Postgres. Now for the problematic part.
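If you’re curious what that check boils down to, here’s a rough sketch of the idea, written against the MediaWiki API rather than being my actual scripts, with the Postgres bookkeeping left out. The parameters and limits are just what I’d reach for off the top of my head, so double-check them before relying on any of it:

    #!/usr/bin/perl
    # Sketch only: list the User talk pages that transclude Template:Tor,
    # then ask the API whether each IP is actually blocked right now.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use JSON qw(decode_json);
    use URI::Escape qw(uri_escape);

    my $api = 'http://en.wikipedia.org/w/api.php';
    my $ua  = LWP::UserAgent->new( agent => 'tor-block-audit/0.1' );

    sub api_get {
        my %params = @_;
        my $query  = join '&', map { "$_=" . uri_escape( $params{$_} ) } keys %params;
        my $res    = $ua->get("$api?format=json&$query");
        die $res->status_line unless $res->is_success;
        return decode_json( $res->decoded_content );
    }

    # Pages embedding Template:Tor in the User talk namespace (ns 3).
    # A real run would follow the API's continuation parameter past 500 results.
    my $embeds = api_get(
        action      => 'query',
        list        => 'embeddedin',
        eititle     => 'Template:Tor',
        einamespace => 3,
        eilimit     => 500,
    );

    for my $page ( @{ $embeds->{query}{embeddedin} } ) {
        ( my $ip = $page->{title} ) =~ s/^User talk://;
        next unless $ip =~ /^\d+\.\d+\.\d+\.\d+$/;    # skip non-IP talk pages

        # Any active blocks on this IP?
        my $blocks  = api_get( action => 'query', list => 'blocks', bkip => $ip );
        my $blocked = @{ $blocks->{query}{blocks} } ? 'blocked' : 'NOT blocked';
        print "$ip: $blocked\n";
    }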

I ran an SQL query that took the blocked IPs marked as Tor nodes and checked whether they were actually still Tor nodes. The list of supposed Tor nodes contained roughly 740 IP addresses. Want to know how many were really Tor nodes?

87.

That’s right, there are currently 653 IP addresses that were, at one point, probably Tor nodes, but now they aren’t. 653 innocent IP addresses. Now, to put this in context, let’s examine how many REAL Tor nodes are blocked.

I used the block-checking script to check the list of actual, live Tor nodes that could access Wikipedia. There were 1553 Tor exit nodes when I ran the query. Guess how many were blocked.

269.

To save you the math, that means we are NOT preventing 82.7% of Tor exit nodes (1284 of the 1553) from accessing Wikipedia. That’s a great statistic, considering that Wikipedia’s policy on Tor is to disable editing access for Torified users.
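For reference, the comparisons behind those numbers boil down to a couple of joins. The table and column names below are made up for the example rather than copied from my database:

    -- Hypothetical schema, not my real one:
    --   tor_tagged(ip)  IPs whose talk pages carry Template:Tor
    --   tor_exits(ip)   live Tor exit nodes that can reach the Wikipedia servers

    -- Supposed Tor nodes that really are current exit nodes (the 87 above):
    SELECT count(*)
    FROM tor_tagged t
    JOIN tor_exits  e USING (ip);

    -- Supposed Tor nodes that are no longer exit nodes (the 653):
    SELECT count(*)
    FROM tor_tagged
    WHERE ip NOT IN (SELECT ip FROM tor_exits);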

Now, this isn’t a perfect study. I’m not taking into account rangeblocks (which, as far as I know, don’t show up on Special:Ipblocklist for an IP inside the blocked range), autoblocks, or other things I can’t scan for. All this means is that we have a relatively huge hole through which users can “abuse.” However, I highly doubt they will.

People need to stop taking WP:OP so seriously if they aren’t going to enforce it. I can’t begin to count how many open proxies and Tor nodes I’ve seen blocked that have since been closed or switched to a different IP. Meanwhile, the IP is still blocked, usually for periods of 5 years or more. If you block a proxy, you need to follow up on it! Administrators can’t just assume Tor nodes have static IPs; I, for one, operate a middle Tor node (read: a node that doesn’t allow traffic out, except to other Tor nodes) on a dynamic IP address. We need to start taking more responsibility for our blocks and stop issuing fire-and-forget blocks that will, at some point when the IP changes, affect legitimate users.

I’m probably going to start testing the waters to see how the community would react to a TawkerbotTorA clone. Perhaps now that we’re seeing more adminbots, people will finally realize that adminbots are useful for some tasks. Based on what I’ve seen in my study, a bot would certainly be more effective and accurate than some administrators.

~alex

Chinese firewall evasion project

October 18, 2007

So about a month ago, a user contacted me, asking if I could convert a hardblock I enacted on an open proxy to a softblock. In accordance with Wikipedia policy, I denied the request. A short while later, I realized that the user was an established editor who would be a great loss to China-related articles. This is where I had the proverbial “lightbulb moment.”

If I could set up password-protected, SSL-enabled proxy servers, then users could contact me to use the servers and edit Wikipedia without violating policy! Now, about three weeks later, I’ve kicked off the WikiProject on closed proxies, a project designed to coordinate efforts between users who operate closed proxy servers. Currently, we have two members (myself and ST47, another admin), with a third user who has expressed interest. I’m the only proxy sysop at the moment, but hopefully we can get more online. Sometime in the next week or so, once we have more than one server up and everything is running smoothly, we’ll start accepting email-based applications for server accounts.

The proxy server I run is powered by Apache 2.2.6: mod_ssl handles encryption, a somewhat-working mod_rewrite rule denies account creation, and Apache’s built-in Basic authentication, backed by a flat htpasswd file, handles logins.
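For anyone thinking of running a similar box, the setup boils down to something like the configuration below (I’m showing a simple reverse-proxy arrangement here). The hostname, file paths, and the exact rewrite rule are placeholders rather than a copy of my real config, so treat it as a starting point:

    <VirtualHost *:443>
        ServerName wiki-proxy.example.org          # placeholder domain

        SSLEngine on
        SSLCertificateFile    /etc/apache2/ssl/proxy.crt
        SSLCertificateKeyFile /etc/apache2/ssl/proxy.key

        # Everything requires a login from a flat htpasswd file.
        <Location />
            AuthType Basic
            AuthName "Closed proxy"
            AuthUserFile /etc/apache2/proxy.htpasswd
            Require valid-user
        </Location>

        # Crude mod_rewrite block on the account-creation form.
        RewriteEngine On
        RewriteCond %{QUERY_STRING} type=signup [NC]
        RewriteRule ^ - [F]

        # Relay everything else to the real site.
        ProxyPass        / http://en.wikipedia.org/
        ProxyPassReverse / http://en.wikipedia.org/
    </VirtualHost>

Accounts get added with the stock htpasswd utility: htpasswd -c /etc/apache2/proxy.htpasswd someuser creates the file along with the first user, and dropping the -c adds more users later.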

I don’t believe I have any legal issues to worry about unless users start using the proxy to post libel and other bad things, and even then I don’t think a case against me would hold up in court.

Now all I need to do is wait for open registration, and hope that men in riot gear with big guns, black vans, and helicopters don’t show up at my house…

-alex

Postgres and the Ultimate Hitchhiker’s Guide Part One

October 10, 2007

So, after some remarkably easy setup with xml2sql and PostgreSQL 8.3beta1, I’ve finally loaded all mainspace articles and templates into a database system. Now the hard part starts.

In order to generate the HTML for each article, I need a copy of that article’s wikitext. Going through the articles one at a time (querying each one, writing it to its own file, and running the hacked-together parser I found over the result) is terribly slow, so instead I’m pulling down the wikitext for every article in one pass and storing it all in a flat text file for later parsing.

The interesting part about this is that Postgres likes to run the query, cache it to the disk, then replay it to the client. This is remarkably inefficient for my query, which returns about 2 million rows that total about 8 GB. Solution? Postgres cursors!

A cursor is essentially a way to hand Postgres a query without having it execute the whole thing up front. Then, using the FETCH command, the server runs the query incrementally and returns an arbitrary number of rows at a time, without materializing the entire result set. Now that’s efficient (or better suited to my hardware, anyway).
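In DBI terms, the fetch loop ends up looking roughly like this. The connection details and the table and column names are my guesses at a stock MediaWiki-style schema, so adjust them for whatever xml2sql actually produced:

    #!/usr/bin/perl
    # Sketch of a cursor-based dump loop with DBI and DBD::Pg.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=wikidump', 'wiki', 'secret',
        { AutoCommit => 0, RaiseError => 1 } );    # cursors need a transaction

    # Declare the cursor; nothing is executed or spooled to disk yet.
    $dbh->do(q{
        DECLARE article_cur CURSOR FOR
        SELECT page_title, old_text
        FROM   page
        JOIN   revision ON rev_page = page_id AND rev_id = page_latest
        JOIN   text     ON old_id   = rev_text_id
        WHERE  page_namespace = 0
    });

    open my $out, '>', 'articles.txt' or die "articles.txt: $!";

    while (1) {
        # Each FETCH makes the server produce just the next batch of rows.
        my $rows = $dbh->selectall_arrayref('FETCH 1000 FROM article_cur');
        last unless @$rows;
        for my $row (@$rows) {
            my ( $title, $wikitext ) = @$row;
            print {$out} "== $title ==\n$wikitext\n";    # crude flat-file format
        }
    }

    close $out;
    $dbh->do('CLOSE article_cur');
    $dbh->commit;
    $dbh->disconnect;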

So right now, my hackish Perl script is fetching about 1,000 articles every 5-10 seconds and pushing them to disk. Should be done in no time…

-alex

Perlwikipedia version 1.0

September 8, 2007

Well, after finally remembering that I own a blog, here’s an announcement: The Perlwikipedia development team is pleased to announce that Perlwikipedia version 1.0 has been released! Perlwikipedia is a MediaWiki framework written in Perl, which can be used to develop bots and other tools that need to edit or get information from any MediaWiki-based site. You can download a copy of the framework from http://code.google.com/p/perlwikipedia.
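If you want a quick taste, a minimal bot script looks roughly like the following. I’m writing the method names from memory, so check them against the documentation shipped with the release, and obviously swap in your own account:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Perlwikipedia;

    my $bot = Perlwikipedia->new();
    $bot->set_wiki( 'en.wikipedia.org', 'w' );         # host and script path
    $bot->login( 'ExampleBot', 'example-password' );   # placeholder credentials

    # Fetch a page's wikitext, tweak it, and write it back.
    my $text = $bot->get_text('User:ExampleBot/sandbox');
    $text .= "\n\nEdited with Perlwikipedia 1.0.";
    $bot->edit( 'User:ExampleBot/sandbox', $text, 'Testing Perlwikipedia 1.0' );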

-alex