Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties, all freely available from the Wikimedia Foundation.
The above resources are available not only for Wikipedia but also for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.
As Wikipedia readers will notice, the articles are very well formatted, and this formatting is produced by a somewhat unusual markup language defined by the MediaWiki project. As Dirk Riehle stated:
There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers.
For example, below is an excerpt of the wiki syntax for a page on data mining.
'''Data mining''' (the analysis step of the '''knowledge discovery in databases''' process,<ref name="Fayyad"> or KDD), a relatively young and interdisciplinary field of [[computer science]]<ref name="acm" /> {{cite web|url=http://www.sigkdd.org/curriculum.php |title=Data Mining Curriculum | publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2011-10-28}} </ref><ref name=brittanica>{{cite web | last = Clifton | first = Christopher | title = Encyclopedia Britannica: Definition of Data Mining | year = 2010 | url = http://www.britannica.com/EBchecked/topic/1056150/data-mining | accessdate = 2010-12-09}}</ref> is the process of discovering new patterns from large [[data set]]s involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]] and [[database system]]s.<ref name="acm"> The goal of data mining is to extract knowledge from a data set in a human-understandable structure<ref name="acm" /> and involves database and [[data management]], [[Data Pre-processing|data preprocessing]], [[statistical model|model]] and [[Statistical inference|inference]] considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing of found structure, [[Data visualization|visualization]] and [[Online algorithm|online updating]].<ref name="acm" />
I was seriously worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on named entity extraction and n-gram extraction, I only wanted to extract the plain text. If we take the above junk and extract only the plain text, we would get
Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves database and data management, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structure, visualization and online updating.
and from this we can remove punctuation (except the sentence terminators .?!), convert to lower case and perform other text-mining pre-processing steps (a toy sketch of both the markup stripping and this pre-processing appears just below). There are many, many Wikipedia parsers of varying quality: some do not work at all, some work only on certain articles, some have been abandoned as incomplete, and some are slow as molasses.
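To make these two steps concrete, here is a toy sketch of my own (this is not how Wikipedia Extractor works internally) that strips the handful of constructs visible in the excerpt above and then applies the pre-processing just described. It is deliberately naive: nested templates and unbalanced <ref> tags, like the ones in the excerpt, break simple regexes, which is exactly why writing a real parser is so hard.

import re

def naive_strip_markup(wikitext):
    """Toy wikitext stripper; handles only the constructs seen in the excerpt above."""
    text = re.sub(r'<ref[^>]*/>', '', wikitext)                        # self-closing <ref ... />
    text = re.sub(r'<ref[^>]*>.*?</ref>', '', text, flags=re.DOTALL)   # <ref>...</ref> blocks
    text = re.sub(r'\{\{[^{}]*\}\}', '', text)                         # {{templates}} (non-nested only)
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]', r'\1', text)      # [[target|label]] -> label
    text = re.sub(r"'{2,}", '', text)                                  # '''bold''' and ''italic'' quotes
    return re.sub(r'\s+', ' ', text).strip()

def preprocess(text):
    """Lower-case and drop punctuation, keeping the sentence terminators . ? !"""
    text = re.sub(r'[^\w\s.?!]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

sample = "'''Data mining''' is the process of discovering new patterns from large [[data set]]s."
print(preprocess(naive_strip_markup(sample)))
# data mining is the process of discovering new patterns from large data sets.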
I was delighted to stumble upon Wikipedia Extractor, a Python script developed by Antonio Fuschetto (Multimedia Laboratory, Dipartimento di Informatica, Università di Pisa) that extracts plain text from the Wikipedia XML dump file. The script is heavily object-oriented, and it is very easy to modify and extend for other purposes. For me, it is the easiest parser to use and yields the best-quality output, although there are other options.
Pros
- Very easy to run; it’s just a Python script.
- Yields high quality output; no stray wikisyntax garbage.
- Highly object-oriented; easy to extend and embed in text mining projects.
- Object-oriented style makes it easier to parallelize with lightweight processes (written by the user).
- Allows specifying the maximum size of each produced file (good for sending to S3).
- It is written in Python.
Cons
- Far too slow. Python profilers show that most of the overhead is in regex search-and-replace and string replacement.
- Is not perfect, but one of the best I have seen. For some reason, wikilinks are converted to HTML links. Correcting this required modifying the source code (see the cleanup sketch after this list for an alternative).
- Retooling the package to work with Hadoop Streaming is not too difficult, but requires some work and grokery that should be easier.
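On the wikilink point above: if modifying the source is not appealing, the stray anchors can also be cleaned up after the fact. Here is a post-processing sketch of my own (not the fix I made to the source), assuming the links come out as ordinary <a ...>label</a> tags:

import re

def strip_html_links(text):
    """Replace <a href="...">label</a> anchors with just their label text."""
    return re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', text, flags=re.DOTALL)

print(strip_html_links('methods at the intersection of <a href="Machine learning">machine learning</a> and statistics'))
# methods at the intersection of machine learning and statistics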
Wikipedia Extractor is good for offline analysis, but most users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable even on multiple cores. A Hadoop Streaming job using Wikipedia Extractor, combined with too much file I/O between Elastic MapReduce and S3, required 10 hours to complete on 15 c1.medium instances.
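For anyone attempting the same retooling, the streaming side itself is simple: a Hadoop Streaming mapper is just a script that reads records on stdin and writes tab-separated key/value pairs on stdout. The awkward part is that the raw XML dump cannot be split line by line, so some pre-step has to flatten it first. A bare-bones sketch (not the job described above), assuming a hypothetical pre-step has already produced one "title<TAB>wikitext" record per line, with a placeholder cleaner standing in for the real extraction code:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch.
# Assumes input has been flattened to one "title<TAB>wikitext" record per line.
import re
import sys

def clean(wikitext):
    # Placeholder: call into your parser/cleaner of choice here.
    return re.sub(r'\s+', ' ', wikitext).strip()

for line in sys.stdin:
    try:
        title, wikitext = line.rstrip('\n').split('\t', 1)
    except ValueError:
        continue  # skip malformed records
    sys.stdout.write('%s\t%s\n' % (title, clean(wikitext)))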
Ken Weiner (@kweiner) recently re-introduced me to the Cloud9 package by Jimmy Lin (@lintool) of Twitter, which fills in some of these gaps. I avoided it at first because Java is not the first language I like to turn to. Cloud9 is written in Java and designed with Hadoop MapReduce in mind. There is a method within the package that explicitly extracts the body text of each Wikipedia article; this method calls the Bliki Wikipedia parsing library. One common problem with these Wikipedia parsers is that they often leave stray wiki syntax in the output. Jimmy seems to wrap Bliki with his own code to do a better job of extracting high-quality, text-only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages, and more.
Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer code. An example job distributed with Cloud9, which simply counts the number of pages in the corpus, took approximately 15 minutes to run on 8 cores on an EC2 instance. A job that did more substantial work required 3 hours to complete, and once the corpus was repackaged as sequence files, the same job took approximately 90 minutes to run.
Conclusion
I am looking forward to playing with Cloud9 some more… I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles, and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as AWS or for quick processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet-friendly processing.
You might want to check out Google/Freebase’s weekly WEX dumps. They’ve done a bunch of the grunt work and publish the results on a regular basis. In the past they’ve made them available on EC2, which would save you the bandwidth charges, although I’m not sure they still do that on a regular basis.
http://wiki.freebase.com/wiki/WEX
http://download.freebase.com/wex/latest/
Thanks for the link. I have always had trouble navigating the Freebase website.
Dear Ryan,
This is very cool indeed! One way to speed this up significantly is to use Wikihadoop with your existing Python code as the mapper. Wikihadoop, in contrast to the Cloud9 package, is able to stream the full bzip2-compressed XML dump files using Hadoop Streaming. I am happy to help you if you are stuck. You can find Wikihadoop at: https://github.com/whym/wikihadoop
Best,
Diederik
Thanks! We tried WikiHadoop but it did not seem very generalizable. The authors seemed to be familiar only with using it for diffing revisions. It could be an extremely powerful project if the documentation were better and if it were not restricted to Hadoop 0.21+.
Hi Ryan,
I am one of the authors and would love to get more detailed feedback on how we can improve this.
Best,
Diederik
Awesome! I will send you an email when I get a chance. WikiHadoop has the potential to be extremely useful. Cloud9 was great, but since Java is not my preferred language, it was a pain to set up at first.
Hi Ryan,
I’ve struggled a bit trying to get data from Wikipedia. For example, I’d love to get a plain-text data dump of all the wiki articles, edit histories and discussion pages on S&P 500 companies, or of all the articles under “Companies founded in year XYZ”.
It would be interesting to see whether older companies have longer articles, to do some kind of sentiment analysis on S&P 500 companies, or to look at how frequently changes are made and of what byte size.
Any tips on how I’d get started trying to get the data for this? I’m technical enough to know PHP, but I don’t know – this might be beyond me.
-David
After attempting to parse the Wikipedia dump myself, I ended up experimenting with DBpedia data (http://wiki.dbpedia.org/Downloads37) instead. The DBpedia data includes (after cleanup) Wikipedia article titles, abstracts, categories, redirects and disambiguations, which might be enough for my use.
I am just wondering why DBpedia did not extract the full-text article content, but only the abstracts.
As I am still halfway through playing with the DBpedia data, I cannot yet draw conclusions about whether it has enough info for me.
I expect to see more efforts in this space to make Wikipedia data more accessible to programmers, especially Python geeks.
Hey Ryan,
Not sure how it compares, but a while back I wrote some tokenizers/token filters for Lucene that work on Wikipedia. They aren’t perfect, but if you know Lucene, it may not be too hard to extend them for your needs. Naturally, you can then feed them into Lucene’s n-gram capabilities and other filters to build up what you need.
Hey Grant! Are these filters available online somewhere? I really should be using Lucene for all of this instead of reimplementing everything in Hadoop.
If you need a full DOM tree with all the information in a given article, try the Sweble parser at http://sweble.org
Wiki Extractor would be really cool IF it existed in Python 3 too. It’s quite a mess to convert.
Hi,
I have a Stack Overflow question posted about trying to use the Cloud9 process described here, but I can’t seem to get it to work. If anybody who has used this could cruise over to that post and give me any ideas of what I’m doing wrong, that would be much appreciated! Thanks. Post here:
http://stackoverflow.com/questions/35760657/extracting-wikipedia-article-text-with-cloud9-and-hadoop
Seth
Thanks for your hard work; it helped me save a lot of time on extraction. Originally I wanted to write Java code to do the extraction myself, but the performance was poor. Your work really is a contribution to everyone.