Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include:

  • article content and template pages
  • article content with revision history (huge files)
  • article content including user pages and talk pages
  • redirect graph
  • page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
  • image metadata
  • site statistics

The above resources are available not only for Wikipedia, but also for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.
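
Even before any parsing, just iterating over the pages in the dump calls for a streaming parser rather than loading the whole file into memory. Below is a minimal sketch using only the Python standard library; the dump file name and the export-schema namespace URI are assumptions that you would adjust to match the dump you actually downloaded.

import bz2
import xml.etree.ElementTree as ET

# Assumed file name and export-schema namespace; both vary by dump.
DUMP = "enwiki-latest-pages-articles.xml.bz2"
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(path):
    """Stream (title, wikitext) pairs out of a MediaWiki XML dump."""
    with bz2.open(path, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)  # grab the <mediawiki> root element
        for event, elem in context:
            if event == "end" and elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext("{0}revision/{0}text".format(NS)) or ""
                yield title, text
                root.clear()  # drop finished pages so memory stays bounded

if __name__ == "__main__":
    for i, (title, text) in enumerate(iter_pages(DUMP)):
        print(title, len(text))
        if i == 2:
            break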

As Wikipedia readers will notice, the articles are very well formatted, and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers.

For example, below is an excerpt of wiki syntax for a page on data mining.

'''Data mining''' (the analysis step of the '''knowledge discovery in databases''' process,<ref name="Fayyad"> or KDD), 
a relatively young and interdisciplinary field of [[computer science]]<ref name="acm" />
{{cite web|url=http://www.sigkdd.org/curriculum.php |title=Data Mining Curriculum |
publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2011-10-28}}
</ref><ref name=brittanica>{{cite web | last = Clifton | first = Christopher | title = Encyclopedia Britannica: Definition 
of Data Mining | year = 2010 | url = http://www.britannica.com/EBchecked/topic/1056150/data-mining | 
accessdate = 2010-12-09}}</ref> is the process of discovering new patterns from large [[data set]]s 
involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]] and 
[[database system]]s.<ref name="acm"> The goal of data mining is to extract knowledge from a data set in a 
human-understandable structure<ref name="acm" /> and involves database and [[data management]], 
[[Data Pre-processing|data preprocessing]], [[statistical model|model]] and [[Statistical inference|inference]] 
considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing 
of found structure, [[Data visualization|visualization]] and [[Online algorithm|online updating]].<ref name="acm" />

I was epically worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on named entity extraction and n-gram extraction, I only wanted to extract the plain text. If we take the above junk and extract only the plain text, we get

Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young 
and interdisciplinary field of computer science is the process of discovering new patterns from large data sets 
involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. 
The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves 
database and data management, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of found structure, visualization and online updating.

and from this we can remove punctuation (except the sentence terminators .?!), convert to lower case and perform other text mining pre-processing steps. There are many, many Wikipedia parsers of varying quality. Some do not work at all, some work only on certain articles, some have been abandoned incomplete and some are slow as molasses.
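
That punctuation-and-case normalization is simple enough to sketch in a few lines of Python (exactly which characters count as punctuation here is my own assumption):

import re

def normalize(text):
    """Lower-case and strip punctuation, keeping the sentence terminators . ? !"""
    text = text.lower()
    # Replace everything that is not a word character, whitespace or . ? !
    text = re.sub(r"[^\w\s.?!]", " ", text)
    # Collapse the whitespace this leaves behind.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Data mining (the analysis step of the knowledge discovery "
                "in databases process, or KDD), a relatively young and "
                "interdisciplinary field of computer science..."))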

I was delighted to stumble upon Wikipedia Extractor, a Python library developed by Antonio Fuschetto of the Multimedia Laboratory, Dipartimento di Informatica, Università di Pisa, that extracts plain text from the Wikipedia XML dump file. The script is heavily object-oriented, and it is very easy to modify and extend for other purposes. For me, it is the easiest parser to use and yields the best quality output, although there are other options.
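
In the versions of the script I have used, each extracted article is wrapped in a <doc ...> element inside the output files, so picking the text back up afterwards is only a few lines of Python. The output directory and the exact attributes inside the <doc> tag are assumptions that vary between versions:

import glob
import re

# Assumes Wikipedia Extractor wrote its files into ./extracted/ and wraps
# each article as <doc ...> text </doc>; adjust both to your version.
DOC_RE = re.compile(r"<doc(?P<attrs>[^>]*)>(?P<body>.*?)</doc>", re.DOTALL)

def iter_articles(pattern="extracted/*"):
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for match in DOC_RE.finditer(f.read()):
                yield match.group("attrs").strip(), match.group("body").strip()

for attrs, body in iter_articles():
    print(attrs, len(body))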

Pros

  • Very easy to run; it’s just a Python script.
  • Yields high quality output; no stray wikisyntax garbage.
  • Highly object-oriented; easy to extend and embed in text mining projects.
  • Object-oriented style makes it easier to parallelize with lightweight processes (written by the user).
  • Allows specifying the maximum size of each produced file (good for sending to S3).
  • It is written in Python.

Cons

  • Far too slow. Python profilers show major overhead involved in regex search and replace, and string replacement.
  • It is not perfect, but it is one of the best I have seen. For some reason, wikilinks are converted to HTML links, and correcting this required modifying the source code (a sketch of the kind of substitution involved follows this list).
  • Retooling the package to work with Hadoop Streaming is not too difficult, but requires some work and grokery that should be easier.
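
For illustration, reducing wikilinks to their display text boils down to one substitution. This is not the Wikipedia Extractor code itself, just a sketch of the idea, and it ignores edge cases such as nested links, image links and category links:

import re

# [[target|label]] -> label, [[target]] -> target
WIKILINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")

def strip_wikilinks(text):
    return WIKILINK_RE.sub(r"\1", text)

print(strip_wikilinks("large [[data set]]s and [[statistical model|model]] inference"))
# -> large data sets and model inference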

Wikipedia Extractor is good for offline analysis, but users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable even on multiple cores. A Hadoop Streaming job using Wikipedia Extractor, combined with too much file I/O between Elastic MapReduce and S3, required 10 hours to complete on 15 c1.medium instances.
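
For anyone attempting the same retooling, the streaming job boils down to a small mapper that reads whole <page> records from stdin and emits one line of cleaned text per article. The sketch below is only an outline under two assumptions: that the input splits hand the mapper complete <page>...</page> blocks (for example via Hadoop's StreamXmlRecordReader), and that clean_markup() is replaced by a call into Wikipedia Extractor or whichever parser you embed.

#!/usr/bin/env python
"""Hadoop Streaming mapper sketch for Wikipedia pages."""
import re
import sys

PAGE_RE = re.compile(r"<page>.*?</page>", re.DOTALL)
TITLE_RE = re.compile(r"<title>(.*?)</title>")
TEXT_RE = re.compile(r"<text[^>]*>(.*?)</text>", re.DOTALL)

def clean_markup(wikitext):
    # Placeholder: plug Wikipedia Extractor (or any other parser) in here.
    return wikitext

def main():
    data = sys.stdin.read()  # one input split's worth of pages
    for page in PAGE_RE.finditer(data):
        title = TITLE_RE.search(page.group(0))
        text = TEXT_RE.search(page.group(0))
        if not title or not text:
            continue
        cleaned = clean_markup(text.group(1)).replace("\t", " ").replace("\n", " ")
        sys.stdout.write("%s\t%s\n" % (title.group(1), cleaned))

if __name__ == "__main__":
    main()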

Ken Weiner (@kweiner) recently re-introduced me to the Cloud9 package by Jimmy Lin (@lintool) of Twitter, which fills in some of these gaps. I avoided it at first because Java is not the first language I like to turn to. Cloud9 is written in Java and designed with Hadoop MapReduce in mind. There is a method within the package that explicitly extracts the body text of each Wikipedia article; this method calls the Bliki Wikipedia parsing library. One common problem with these Wikipedia parsers is that they often leave stray syntax in the output. Jimmy seems to wrap Bliki with his own code to do a better job of extracting high-quality, text-only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages and more.

Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer. An example job distributed with Cloud9, which simply counts the number of pages in the corpus, took approximately 15 minutes to run on 8 cores on an EC2 instance. A job that did more substantial processing required 3 hours to complete, and once the corpus was converted to sequence files, the same job took approximately 90 minutes to run.

Conclusion

I am looking forward to playing with Cloud9 some more… I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles, and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as AWS or for quick processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet-friendly processing.
