Wikipedia talk: Database download

From open-encyclopedia.com - the free encyclopedia.

Please note that questions about the database download are more likely to be answered on the wikitech-l mailing list than on this talk page.

Wikipedia as XML? / Download one article at time

I want to write software that displays information from Wikipedia. I don't want to do a huge database download; I just want it to have access to an up-to-date small piece of Wikipedia (a single article, in wiki source), more like a browser. How can I do that? --Alexandre Van de Sande 01:09, 2 Aug 2004 (UTC)

Wouldn't it be nice to have a download of Wikipedia based on an XML language? ~cpb 2004-04-26

Maybe Special:Export would help? Angela. 16:51, Aug 12, 2004 (UTC)

To access any article in XML, one at a time, link to:

http://en.wikipedia.org/wiki/Special:Export/Title_of_the_article

or go to

Special:Export
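
For anyone who wants to script this, here is a minimal Python sketch built on the URL form given above. The function names export_url and fetch_article_xml are made up for illustration, and I am assuming the export is served as UTF-8; treat it as a starting point rather than a tested client.

```python
from urllib.parse import quote
from urllib.request import urlopen

# URL form quoted in this thread; everything else here is illustrative.
EXPORT_BASE = "http://en.wikipedia.org/wiki/Special:Export/"


def export_url(title: str) -> str:
    """Build the Special:Export URL for one article title."""
    # Wiki URLs use underscores in place of spaces; percent-encode the rest,
    # leaving ':' and '/' alone so namespaces and subpages stay readable.
    return EXPORT_BASE + quote(title.strip().replace(" ", "_"), safe="/:")


def fetch_article_xml(title: str, timeout: float = 30.0) -> str:
    """Download the XML export of a single article (needs network access)."""
    with urlopen(export_url(title), timeout=timeout) as response:
        # Assumption: the export is UTF-8 encoded.
        return response.read().decode("utf-8")
```

Calling fetch_article_xml("Database download") would request one page at a time, which matches the "tiny piece, like a browser" use case asked about above.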

Database on CD?

Has Wikipedia considered selling a set of CDs, say once a year, to fund the hypothetical non-profit organization? If production were contracted out for a few thousand sets of 10-100(?) CDs, prices could be kept fairly reasonable.
~ender 2003-04-15 04:01 MST

Consider that anyone can sell Wikipedia on CDs, in books, and so on, with no obligation to pay any fee to Wikipedia. So Wikipedia is in no privileged position to sell its database: a publisher would make books quicker, and a CD seller would make CD-ROMs faster (and maybe cheaper), than the Wikimedia Foundation could by hiring someone. The best outcome is that anyone who makes money from Wikipedia content is nice enough to donate part of the profit to Wikimedia. --Alexandre Van de Sande 11:10, 19 Aug 2004 (UTC)

I think a 'WikiLive' DVD would simplify things for most users. It would avoid novice users having to install and configure complicated PHP and PHP-MySQL server 'thingies' and would run on any computer.

A basic setup could be a stripped down Knoppix CD, with PHP-MySQL requirements pre-installed and configured to optimise running off the DVD. Firefox can be adapted to provide the best possible viewing and browsing experience.

One of the open source search-engines could be used to search for material on the CD/DVD/hard drive.

Beyond the PHP-SQL-MediaWiki engine, which should be able to install completely and EASILY, possibly as a Knoppix based operating system, there are database dumps in specific languages. Big ones, like English, most people would presumably keep/swap/buy on DVD/CD, but small ones could be downloaded and installed to this OS, if desired. Having this OS installed to your hard drive, you could then download additional databases of Wiki material off the net and run them on your hard drive, and then maybe even share them on an intranet (off the internet).

--Jkh.gr 16:41, 26 Sep 2004 (UTC)

Experimental Mirror of Wikipedia

I've been attempting to set up a copy of Wikipedia on one of my servers for experimenting and testing. I want to use the real data to experiment with the Wikipedia code and to examine the data structure more closely. I have in mind altering the code for use in another project that is something of a "People Data Store". Example: "Quotes" are made by people, and people have biographies that relate them to other people, places, and events in time. This could also apply to many other works of people, such as "Lyrics", "Books", "Articles", "Film", and "Programming code". Lots of possibilities. In many ways it is much like an encyclopedia, just more (for lack of a better way of expressing it) factual and concrete. 8-)

My problem: for a couple of days now I've been attempting to download the data dump of the encyclopedia and history from the download page. Unfortunately, instead of a gzip, tar, or zip file, all I get is the text of the data dump in my browser. Is there some other way to get the current data and history files? I don't care about the size, but a file is much more useful than many MB of text in a browser window. Also, is there some method of data replication that is used to keep other copies current? Any help with this would be much appreciated. Thanks (albrown AT chook DOT com or al AT thetinfoilhat DOT com)

Your browser appears to be helpfully un-gzipping the data for you. If this is a problem (ie, you don't want to take up that much hard disk space just for the dump), try a less intelligent program. ;) "wget" is a nice command-line web/ftp file fetcher; I think there's a version compiled for windows. (Google it.) Keep in mind that the SQL dump will be equally effective zipped or unzipped; you have to read it back into the database or write your own program to suck the data out of the SQL commands. --Brion 19:01 Oct 2, 2002 (UTC)
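
If wget isn't handy, the same "less intelligent" download can be sketched in a few lines of Python. This is only an illustration: save_raw is a made-up name, and the point is simply that urllib hands back the bytes exactly as the server sent them, without un-gzipping anything.

```python
import shutil
from urllib.request import urlopen


def save_raw(url: str, dest_path: str, chunk_size: int = 1 << 20) -> int:
    """Stream url to dest_path byte-for-byte; return the number of bytes written."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj copies in chunks, so a multi-gigabyte dump never has
        # to fit in memory, and nothing is decompressed along the way.
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()
```

The resulting file stays compressed on disk, so it can be decompressed later or piped straight into the database import.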

There are times when I actively hate the latest IE. Is ftp an option then? I tried to connect to ftp.wikipedia.com and didn't get very far as "anon".

Get Mozilla!
Try this instead: make a link to the file you want to download by putting it in brackets, e.g. The Internet Movie Database ([http://www.imdb.com The Internet Movie Database]), then right click on the link and "save target as."
IE for me has been particularly contrary and addlebrained; I can only assume that that's what you're using too. Best, --KQ 22:20 Oct 2, 2002 (UTC)

I was able to get the files by using wget. Thanks for all the help. I will remember the trick about making a link in brackets. This is odd, though: it is the first and only time I have ever had IE do this (I even tried Netscape 4.7, that brain-dead clunky ......), and both were unzipping the file into the page. Normally I can click on any download link and get a message asking whether I want to save the file. Oh well, got the files, and thanks much again for the help. Al Brown 23:21 CST Oct 2, 2002.

CVSup/Rsync, and why it would be useful

For those who would like to keep their local copy in sync, I would suggest setting up a CVSup server ( http://www.cvsup.org/ ). A snapshot can be made a number of times per day (say, 4 times a day), and people can very efficiently synchronize to the latest version of Wikipedia. This is a lot faster and saves a lot of bandwidth compared to downloading a complete tarball every time. The server does not even have to run on the Wikipedia server itself, though that seems the logical choice. CVSup is very efficient. If the Wikipedia dumps can be tagged with a version number similar to RCS, the syncing will probably be blazingly fast. -- Tim Hemel

I don't think that would work very well with the dumps. The bzip2-compressed versions are not going to be cleanly diffable, and if I leave them as text (~380 megabytes for English current revisions, a few gigabytes for old revisions; I'm reluctant to have them sitting around uncompressed), they're still not going to work that well in CVS if I understand its storage system correctly. Each line of the dump is an SQL INSERT statement for about 500 pages, and the slightest change to any of them (including cache invalidation timestamps) would cause the whole line to be sucked out and replaced. --Brion 18:33 25 May 2003 (UTC)
I'm not sure how applicable cvsup would be, but I think rsync is worth considering. Rsync doesn't use diff for computing deltas, so I don't think the "long lines with small changes" problem applies. As for rsyncing compressed files, there are techniques to do that, such as http://svana.org/kleptog/rgzip.html or http://lists.samba.org/archive/rsync/2002-October/004035.html (merely two results from a brief Google search, I'm sure more research on the topic would be fruitful). If people feel this would be worth pursuing, let me know. Neilc 11:30, 7 Aug 2004 (UTC)
gzip does have an --rsyncable option (see http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff and gzip --help). bzip2 doesn't seem to. See http://www.debianplanet.org/node.php?id=524 for a discussion of pros and cons (for w:Debian). I don't know how relevant the server load problem is for d.wp .

See also http://lists.debian.org/debian-devel/2001/10/msg02187.html About efficiency in part (reportedly): http://rsync.samba.org/tech_report/

More later.

Mr. Jones 05:07, 14 Aug 2004 (UTC)
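
To make the "long lines with small changes" problem from Brion's comment concrete, here is a toy Python sketch. It assumes the layout he describes (one INSERT line holding about 500 pages); the table columns, page names, and timestamps are invented, and difflib stands in for any line-based diff tool.

```python
import difflib


def make_dump(timestamps):
    """Build a toy dump: one INSERT line holding every (id, title, timestamp) row."""
    rows = ",".join(f"({i},'Page_{i}','{ts}')" for i, ts in enumerate(timestamps))
    return [f"INSERT INTO cur VALUES {rows};\n"]


def changed_line_bytes(old_lines, new_lines):
    """Bytes a line-based diff would have to carry (the sum of all '+' lines)."""
    diff = difflib.unified_diff(old_lines, new_lines, lineterm="")
    return sum(len(line) - 1 for line in diff
               if line.startswith("+") and not line.startswith("+++"))


old = make_dump(["20030525000000"] * 500)
new_stamps = ["20030525000000"] * 500
new_stamps[250] = "20030525183300"      # touch a single row's timestamp
new = make_dump(new_stamps)

resent = changed_line_bytes(old, new)
total = sum(len(line) for line in new)
print(f"line diff re-sends {resent} of {total} bytes")
```

One changed timestamp forces the whole line to be replaced, which is why a line-oriented tool gains little here, while a block-oriented approach like the one Neilc describes for rsync is not tied to line boundaries.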

Does rsync work for anyone? I enter the rsync command from this page and it downloads the file just as wget does. If I then run the same command again, it completes the file very quickly. BUT, when I rename the 20041030_cur_table.sql.bz2 I downloaded last week to 20041106_cur_table.sql.bz2 and run the rsync command in the same directory, it downloads at normal speed. Are the cur_table.sql.bz2 and upload.tar files usable with rsync? Or is there no bandwidth difference between wget and rsync for those files? Thanks for your answers. Jiff Daniels

Wikipedia in DICT format

Moved from Wikipedia:Village pump on Thursday, July 10th, 02003.

I wonder if someone thought about making dict files of the Wikipedia. It would be cool to have the Wikipedia wherever I am, independent of an internet connection. (Okay, I still need my laptop for this...) dict seems a good way to achieve this. I'm willing to spend some time hacking a Python script that can create the dict files from the SQL stuff. But I'd like to know if other people are interested in this as well, or maybe there's someone who already did this job... :) --Guaka 22:38 5 Jul 2003 (UTC)

Doesn't Tombraider achieve this? CGS 22:40 5 Jul 2003 (UTC).

Dear Wikipedians! I very much enjoy my TomeRaider Wikipedia edition from December 2003, and I dream of downloading a current version. At that time it was 180 MB with 180,000 articles. Now there are 360,000. Please!!!! Thousands of PDA friends will be grateful to you! The German Wikipedia for TomeRaider has been available for download since 1 September 2004, at 217 MB with 180,000 articles. Vlad

You mean Tomeraider? No... First of all, tomeraider is shareware. And AFAICS it is totally not meant to convert the wikipedia into the dict format. Guaka 02:37 6 Jul 2003 (UTC)
Ha ha :) I know it's not meant to convert files to dict format, but it does what you want - view files on the go without a net connection. CGS 20:28 6 Jul 2003 (UTC)
Another thing is... Tomeraider is non-free software. This is already enough reason not to use it. But even if I wanted to, I couldn't because I run GNU/Linux. Guaka 16:06 7 Jul 2003 (UTC)
If it's the right tool for the job, swallow your pride and run it through Wine. CGS 22:15 7 Jul 2003 (UTC).
I guess you paid for the PDA hardware. So why is $20 for good software a no-go? I chose TomeRaider because it was the best option at the time (and it may still be, not sure). Some people write software for a living, and if they are good at it, I hope they continue doing so. Just because they earn some money, they are not necessarily a second Bill. Not that I wouldn't prefer GNU software that is equally cross-platform, fast, and economical with PDA space; it just doesn't seem a matter of higher principle to me. Erik Zachte 22:54, 18 Mar 2004 (UTC)

Erm, excuse me if I'm missing something, but wouldn't it be silly to view Wikipedia on non-free software after we go through so much trouble to make sure that the content is under the GFDL? If the content is free but the medium is not, then the company that produces it controls the content, albeit in an indirect fashion. The company could go out of business and render Tomeraider files useless, etc. At any rate, I would be interested in seeing a GPL'ed Python script that could accomplish this task, especially since I'm a beginning programmer and I'm interested in learning Python. And I'm a beginning Linux user who doesn't have a clue how to use Wine, fix problems with a program running in Wine, or anything particularly complex at all. --Nelson 23:41 8 Jul 2003 (UTC)

I fully agree with that Nelson. We just need to have a name now, so that we have a page for it. Or maybe this project would better fit on the Meta Wikipedia? Guaka 00:10 10 Jul 2003 (UTC)

Try the following Perl script for generating the Dict database. Change the DBI->connect to have the correct values for username and password instead of dbuser and dbpass.

#!/usr/bin/perl -w

use strict;
use DBI();

sub article2dict {
  my ($title, $text) = @_;

  $title =~ s/_/ /g;
  $text =~ s/\r//g;
  $text =~ s/^/  /mg;

  print "$title\n";
  print $text;
  print "\n\n";
}

# Connect to the database.
my $dbh = DBI->connect("DBI:mysql:database=wikipedia;host=localhost",
		       "dbuser", "dbpass",
		       {'RaiseError' => 1});

# Now retrieve data from the table.
my $sth = $dbh->prepare("SELECT cur_title, cur_text FROM cur " .
			"WHERE cur_namespace = 0 ORDER BY cur_title");
$sth->execute();
while (my $ref = $sth->fetchrow_hashref()) {
  article2dict($ref->{'cur_title'}, $ref->{'cur_text'});
}
$sth->finish();

# Disconnect from the database.
$dbh->disconnect();

wik2dict.py

I finally wrote something: wik2dict.py. It tries to create reasonably laid-out dict articles. It can also automatically fetch the database dumps. There are some requirements, though. And currently it is only version 0.2. So beware.

I would appreciate it if someone (possibly someone at Wikimedia?) could run the script regularly and make the dict files available for everyone to download. Too bad they can't be included in Debian though ("GFDL is non-free"). However, the script itself could probably be included in contrib :) G-u-a-k-@ 17:50, 27 Jul 2004 (UTC)


Moved from Wikipedia:Village pump:

Offsite backup of dumps and mailing lists

I have some questions regarding downloading the database dumps. On the page it says last dump made July 13. Does that mean what I think it means (i.e. if I download the English and non-English tarballs I only have revisions up to the 13th?). Also, as I understand it, I would only have to download the cur tarballs from here on in (if I saved the old ones), is this correct? I figure having an extra backup of the database can't hurt...especially after last night :). Addendum: should I also download the mailing list archives (from what I gather, they're separate from the dumps)? Geez, another question: is it safe to assume images are not included with the dumps? -- Notheruser 15:42 28 Jul 2003 (UTC)

Ok, I think I've found most of the answers to my above questions; I'll list them here in case anyone else was curious. The database hadn't been backed up since July 13 at the time, but, currently, it is now updated until August 1. You have to download the cur and old files to completely backup the English Wikipedia (don't forget about otherlanguages.tar for a full backup). The mailing lists are archived offsite, so they seem safe and images are currently not backed up (about 1GB worth of files). -- Notheruser 18:53, 2 Aug 2003 (UTC)

If one downloads the old database for English and then imports it into a MySQL database, one finds that it is larger than the 4.1 GB limit most operating systems put on the table size. Of course, I can go in and edit the SQL myself, but it would be better if this were done at the source. Would it be possible to have the dump write out more than one table, each less than 4.1 GB in size? -- RayKiddy 20:01, 13 Sep 2003 (UTC)

If your OS still has a 4 GB file limit, you really need a new OS. :) Multiple tables doesn't make any sense, as it wouldn't be usable. I'd recommend (well, I'd really recommend getting yourself a modern Linux or FreeBSD or something) creating the table as InnoDB and making sure your configuration is set up to use <4GB files for the innodb space (as it can use multiple files). --Brion

Incremental wikipedia updates?

from village pump

Once the full Wikipedia is downloaded, can smaller periodic updates covering new stuff and changes be obtained and used to synch the local? --Ted Clayton 04:26, 13 Sep 2003 (UTC)

No, you can't. I've been thinking the same thing myself. I think we need to:
  • Allow incremental updates for all types of download
  • Allow bulk image downloads
  • Package a stripped-down version of the old table in with the cur dumps, where the revision history (users, times, comments etc.) is included, but the old text itself is not
  • Develop a method of compressing the old table so that the similarity between adjacent revisions can be used to full advantage
-- Tim Starling 04:38, Sep 13, 2003 (UTC)
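
On the last point in the list above, a toy Python sketch can show how much there is to gain from similarity between adjacent revisions. Everything here is made up for illustration: the revisions are random words with one word changed per edit, and zlib stands in for whatever compressor might actually be used.

```python
import random
import zlib


def make_revisions(count=20, words=300, seed=0):
    """Generate toy revisions: each differs from the previous one by a single word."""
    rng = random.Random(seed)
    vocab = [f"word{n}" for n in range(500)]
    text = [rng.choice(vocab) for _ in range(words)]
    revisions = []
    for _ in range(count):
        text[rng.randrange(words)] = rng.choice(vocab)
        revisions.append(" ".join(text).encode("utf-8"))
    return revisions


revisions = make_revisions()

# Each revision compressed on its own, as if stored row by row.
separately = sum(len(zlib.compress(rev)) for rev in revisions)

# All revisions compressed as one stream, so repeats across revisions are found.
together = len(zlib.compress(b"".join(revisions)))

print(f"separately: {separately} bytes, together: {together} bytes")
```

The single-stream total comes out smaller because the compressor can refer back to the previous revision instead of encoding near-identical text again. Real article histories are larger and messier than this toy data, so the actual gain would need measuring.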

Would it be easier to have incremental updates on something like a subscription basis? The server packages dailies or weeklies and shoots them out to everyone on the list? During off hours, mass-mail fashion?

Can you suggest sources or search-terms for table manipulations treatments, as background for stripping and compressing? --Ted Clayton 03:14, 14 Sep 2003 (UTC)

I'm going to continue this on wikitech-l, because it's very much on-topic there. See Wikipedia:Mailing lists for more information. -- Tim Starling 12:48, Sep 14, 2003 (UTC)
Also Wikitech-l thread on incremental backup
Wouldn't it be a great idea to provide split database dumps, one package with only article, and one package with articles, talk, users and all. This would reduce the spreading of Wikipedia userpages to forks. — Sverdrup 08:24, 18 Mar 2004 (UTC)

Daily tarballs of older Non-English Wikipedias

These have not yet been upgraded and are running on UseMod-wiki. The software and data are included together in a single tarball.

<-- this is broken, I think: The German, Polish, and Esperanto wiki trees are also available for live update via rsync. -->

I removed the above section as it is no longer true and the links are dead. Angela. 14:49, Feb 21, 2004 (UTC)

wiki-table to html-table problems in script wiki2static

Hi, is there an update of wiki2static script which converts the sql-dump to a html file structure? This script has problems with conversion of the wiki table definitions to html. For this reason, articles containing tables aren't readable.

Hi, I haven't worked on the script for a while, so I didn't put the new table syntax into wiki2static. I'll see if I can do it in the following days. Alfio 14:58, 16 Mar 2004 (UTC)

What is this text?

I have the Spanish version in a SQL database and the old_text entries are a bunch of symbolic nonsense. For instance, the entry for Andorra is "UAN1 Eר*g@`]Յ;q[Ď#D\O;@"qJ#(}8*4RȞ12BQip+IBun=:T6R_aPq:tN ]D2A5J $D#z\:7Hb̀ ?=3nEW!a.ZGC /U>EXw.7bzyqGҀ)td0e;b<*ap6k/ )oQB"

I don't know if this is related, but when I loaded up the .sql file I got the following error: "ERROR 1064 at line 269: You have an error in your SQL syntax..."

It's compressed text, I would guess. See the section "Format Change" at http://download.wikimedia.org/ Mr. Jones 12:55, 9 Aug 2004 (UTC)
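
If the guess above is right, something like the following Python sketch might recover the text. It rests on an assumption that is NOT confirmed anywhere in this thread: that old_text holds a raw DEFLATE stream with no zlib or gzip header. The function name and the default encoding are likewise only illustrative; check the "Format Change" notes before relying on it.

```python
import zlib


def inflate_old_text(blob: bytes, encoding: str = "utf-8") -> str:
    """Decompress a raw-DEFLATE blob (no zlib or gzip header) and decode it."""
    # ASSUMPTION: the column holds headerless DEFLATE data.
    # wbits=-15 tells zlib to expect exactly that.
    return zlib.decompress(blob, -15).decode(encoding)
```

If zlib raises an error on a real row, the data is in some other format (or the row was never compressed), and the assumption above does not hold for that dump.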

Lost Connection Error using MySQL

I tried importing the en.sql db into MySQL 4.1.2-alpha and got this error:

$ mysql wiki -u root -p < en.sql
Enter password: *******
ERROR 2013 at line 13147: Lost connection to MySQL server during query

Ran it again, got the same error after about 20 GB had been processed.

--

If I remember correctly, these errors occur because the data you are importing is sent over the connection in packets. Typically, you configure a limit on the size of these packets in the server to avoid denial-of-service conditions. However, by doing that, you are also limiting the amount of data that can be put in one field, because each field has to be sent in one packet. It could be that one wiki page contains so much data that the server thinks you are trying to cause a denial of service, and therefore it kicks you off.

The same thing happens when you are running mozilla's bugzilla; the packet size limits the size of attachments that users can insert into the database.

Tinus 22:19, 10 Aug 2004 (UTC)
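
To see how large the packets in a given dump actually get, one could measure its longest line. This Python sketch assumes the layout described elsewhere on this page (one INSERT statement per line); longest_line_bytes is an invented name. The MySQL setting I believe corresponds to the limit described above is max_allowed_packet, but that is my reading rather than something stated in this thread.

```python
def longest_line_bytes(dump_path: str) -> int:
    """Return the size in bytes of the longest line in an SQL dump."""
    longest = 0
    with open(dump_path, "rb") as dump:
        for line in dump:          # binary mode: measure bytes, not characters
            longest = max(longest, len(line))
    return longest
```

If the number reported is larger than the server's configured packet limit, raising the limit above it before retrying the import would be the obvious thing to try.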

meta?

Is this information on Meta somewhere? It looks as though this page was once at m:Meta:Database dowload (back when the Meta: namespace there was called Wikipedia: also)... but it's gone now. +sj+ 05:38, 16 Jul 2004 (UTC)

I don't think it was on Meta. Links on Meta to Wikipedia:Database download will lead to this page via a redirect from Database download. Angela. 07:20, 16 Jul 2004 (UTC)

freecache.org

Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be experimented on for files < 1Gb & > 5Mb -Alterego

Sample blocked crawler email

for some odd reason the arrow on the link http://en.wikipedia.org/wiki/Wikipedia:Database_download in the page is not showing properly. i think it's something to do with the <i> and the wiki stylesheet, or maybe some extra bug in internet explorer... Vbs 09:03, 23 Jul 2004 (UTC)

Joining SQL Download Files

What do I use to join the split SQL dump files?

See the Joining SQL Dump Files thread on wikitech-l. Angela. 20:23, Aug 2, 2004 (UTC)
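
For the record, if the parts are plain byte-wise slices of one file (an assumption; the wikitech-l thread is the authority on how they were actually split), joining them is just concatenation in the right order. A Python sketch, with join_parts and the part names being hypothetical:

```python
import shutil
from pathlib import Path


def join_parts(part_paths, dest_path):
    """Concatenate the given files, in the order listed, into dest_path."""
    with open(dest_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as chunk:
                # Copy in chunks so large parts never have to fit in memory.
                shutil.copyfileobj(chunk, out)
    return Path(dest_path).stat().st_size
```

Something like join_parts(["dump.part1", "dump.part2"], "dump.sql") would rebuild the file; the order of the list matters, so sort the part names carefully.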


403 Forbidden

Why don't I have access to http://download.wikimedia.org/archives/en/old_table.sql.bz2 ? Who maintains the page at http://download.wikimedia.org? Who looks (or look) after the download server? Mr. Jones 12:23, 11 Aug 2004 (UTC)

Try reporting it at SourceForge or on the Wikitech-l mailing list. Angela. 16:51, Aug 12, 2004 (UTC)

Titles only download

Is it possible to download all article titles as a single compressed file? Pjacobi 17:53, 11 Aug 2004 (UTC)

The article titles of the English Wikipedia are available. Download allentitlesinns0.gz from download.wikimedia.org/archives/en/. It's possible other languages will be added in future. Angela. 00:28, Oct 5, 2004 (UTC)

How big is the uncompressed wikipedia?

How big is the uncompressed 20040811_old_table.sql.bz2 from en? Less than 30gb, I would guess, given that it was reported to be 18gb fairly recently, and starts out at about 9gb. What is the compression ratio? It seems that the suggested bzip2.exe for windows (used with XP pro) does not work (it's much slower than unxutils' one, which does the same thing, i.e. produces a file of over 30gb without stopping, it just does it faster). I'll have to see if I can do it under Debian. Mr. Jones 21:56, 13 Aug 2004 (UTC)

OK, so the 18Gb is the size of all database files when compressed. I'll clarify the text. Still, the question remains. Mr. Jones 04:33, 14 Aug 2004 (UTC)

The answer is, for the record, that the size of the decompressed old table for the en database of 8-8-2004 is about 40Gb. Mr. Jones 14:13, 14 Aug 2004 (UTC)
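
For anyone who wants to answer this for a later dump without needing 40 GB of free disk, the decompressed size can be counted on the fly. A Python sketch using the standard bz2 module; uncompressed_size is an invented name and the 1 MB chunk size is arbitrary.

```python
import bz2


def uncompressed_size(bz2_path: str, chunk_size: int = 1 << 20) -> int:
    """Count the decompressed bytes of a .bz2 file without writing them to disk."""
    total = 0
    with bz2.open(bz2_path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Dividing the result by the size of the .bz2 file on disk gives the compression ratio asked about above. Expect it to take a while on a multi-gigabyte dump, since everything still has to be decompressed.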

BitTorrent

One idea is to use a distributed downloading system. In such a system, multiple computers running the client help each other download faster. I recommend BitTorrent, an open-source distributed downloading system. However, to be most effective, BitTorrent should be used on large files that are frequently downloaded. --Ixfd64 06:25, 2004 Aug 15 (UTC)

Database Dump Compression Format

Is there anywhere I can download the dump files that have been compressed with something other than bz2!? maybe gzip or zip?

Using Special Export

Is there any way to use http://en.wikipedia.org/wiki/Special:Export/ to return a nearest match or Wikipedia's search results page?

bunzip2 in WinXP

Using bunzip for WindowsXP I am unable to unzip the current DB for en. I have had the problem for around three months and was wondering if anyone else has the same issue, or knows a solution. I've tried other programs which handle bunzip files such as WinRAR and I get the same error. I also get an incorrect MD5 sum. The correct one should be "7a70559f2089155f441c322f6c565cc5" and mine is "1d423915d294592237f4450ded3b386b"  :


C:\Documents and Settings\*\Desktop>bunzip2 2004*

bunzip2: Caught a SIGSEGV or SIGBUS whilst decompressing, which probably indicates that the compressed data is corrupted. Input file = 20041023_cur_table.sql.bz2, output file = 20041023_cur_table.sql

It is possible that the compressed file(s) have become corrupted. You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to *attempt* to recover data from undamaged sections of corrupted files.

bunzip2: Deleting output file 20041023_cur_table.sql, if it exists.


Thanks. - Alterego @ 1:29AM on 10-29-04

Which version of bunzip2.exe are you using? Some versions (e.g. that at http://unxutils.sf.net last time I looked) can't handle files > 2GB. Mr. Jones 13:25, 5 Dec 2004 (UTC)
I am using "Version 0.1p12 29 Aug 1997". Admittedly a bit old lol. Do you know a better one? --Alterego 9:52 12/5/04
Well, that's not going to work, is it? :-) The latest version is 1.0.2. Try the link on the article page (press Ctrl+F and type "bzip2"). Let us know how you get on. Mr. Jones 11:42, 8 Dec 2004 (UTC)
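
Since the MD5 sums quoted above don't match, it is worth ruling out a damaged download before blaming the decompressor. A small Python sketch for checking a file against a published sum; the function names are invented, and the file is read in chunks so multi-gigabyte dumps are not a problem.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_md5: str) -> bool:
    """Compare a file's MD5 with the sum listed on the download page."""
    return md5_of_file(path) == published_md5.strip().lower()
```

If matches_published returns False for the downloaded .bz2 file, the file itself differs from the one on the server and should be fetched again; no version of bunzip2 will fix that.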

Current size of database?

How much disk space is currently needed for the en.wikipedia.org DB once it's imported? Currently the SQL file is 52 GB, but on import the InnoDB database ends up exceeding 70 GB. Are there also any suggested MySQL settings for handling a DB this large?


Last Contributor: MrJones - Article Talk Page: Discussion - GNU FDL: Verbatim Source


Copyright © 2003-2004 Zeeshan Muhammad. All rights reserved. Legal notices. Part of the New Frontier Information Network.