This past Tuesday I had the opportunity to present a short (well, somewhat long) talk on text mining at the Los Angeles R Users’ Group. Since I do most of my text mining in Python, I took this opportunity to discuss RPy2, an interface to R from Python. My slides are below:
Download/view slides here. Topics include:
- Using Python with R with an example using web mining.
- Web mining using pure R rather than Python.
Code for the demonstration is here:
- offtopic_demo.py is a pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account on the forum.
- RPy2_demo.py reads the forum data from disk and calls R from Python to perform some basic analysis.
- curljson_demo.R grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.
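To give a flavor of what calling R from Python looks like, here is a minimal sketch. It is not the code from RPy2_demo.py: the helper names `post_lengths` and `summarize_in_r` are made up for illustration, and the "basic analysis" is assumed to be a simple mean and median of post lengths. It assumes RPy2 is installed and R was built as a shared library.

```python
def post_lengths(posts):
    """Return the word count of each post (pure Python, no R needed)."""
    return [len(post.split()) for post in posts]


def summarize_in_r(lengths):
    """Compute the mean and median of the word counts by calling R via RPy2.

    Hypothetical helper for illustration; requires RPy2 and a shared-library R.
    """
    # Imported lazily so post_lengths() above still works without RPy2 or R.
    import rpy2.robjects as robjects

    r_vector = robjects.IntVector(lengths)  # Python list -> R integer vector
    r_mean = robjects.r["mean"]             # look up R's mean() function
    r_median = robjects.r["median"]         # look up R's median() function
    # R returns length-1 vectors; indexing with [0] extracts the scalar.
    return r_mean(r_vector)[0], r_median(r_vector)[0]
```

With a list of post strings in `posts`, `summarize_in_r(post_lengths(posts))` would hand the word counts to R and return the two statistics as a Python tuple.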
Running the code requires installing a few packages:
- twill, a Python package for scripted web browsing. twill is a wrapper around mechanize, so the mechanize package is required as well.
- BeautifulSoup, a Python package for HTML parsing.
- R, built as a shared library (configure with --enable-R-shlib); otherwise Python cannot call it.
- RPy2, the Python interface to R.
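Before running the demos, it can help to confirm that the Python-side packages are importable. The sketch below uses only the standard library. The import names in `REQUIRED` are assumptions (for example, newer BeautifulSoup releases are imported as `bs4`), so adjust them to your installation. It does not check whether R itself was built as a shared library.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumed top-level import names; adjust to match your installed versions.
REQUIRED = ["twill", "mechanize", "BeautifulSoup", "rpy2"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All Python-side packages were found.")
```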
To see the main talk of the evening, click here.
Some Recommended Books
Natural Language Processing
- Foundations of Statistical Natural Language Processing, Manning and Schuetze.
- Speech and Language Processing, Jurafsky and Martin.
- Natural Language Processing and Text Mining, Kao and Poteet.
Text Mining
- Practical Text Mining with Perl, Bilisoly. See my review of this book in the Journal of Statistical Software here, which is also excerpted on Amazon!
- Text Mining: Applications and Theory, Berry and Kogan (NEW).
- The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Feldman and Sanger.
- Mastering Regular Expressions, Friedl.
Data Mining
- Elements of Statistical Learning: Data Mining, Inference and Prediction. Hastie, Tibshirani and Friedman.
- Data Mining: Concepts and Techniques (recommended by @nealrichter). Han, Kamber and Pei.
- Data Mining: Practical Machine Learning Tools and Techniques [the fern book]. Witten and Frank.
- Introduction to Data Mining [the rock book]. Tan, Steinbach, Kumar.
Web Mining
- Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Liu.
- Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti.
- Mining Graph Data, Cook and Holder.
- Managing and Mining Graph Data, Aggarwal and Wang.
- Social Network Analysis: Methods and Applications, Wasserman and Faust.
Hi, it seems the links provided for the .py files are no longer valid. Is there any way they can be re-uploaded? Thanks!
Thanks for letting me know. The links seem to throw an HTTP 500 error. I’ll see if I can fix it tonight.