SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28, I had the pleasure of attending the SIAM Data Mining conference in Anaheim, on the Disneyland Resort grounds. Aside from KDD 2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks with more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get as much out of Disneyland as I could. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and girls dressed up as their favorite Disney princesses, screams “fun” rather than “business,” but I managed to make time for both.

The first two days each started with a plenary talk from an industry or research-lab speaker. After a coffee break, there were the usual breakout sessions, followed by lunch. During my free 90 minutes, I ran over to Disneyland or California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line again, get my food, eat it, and run back to the conference in 90 minutes, on a weekend. After lunch on the first two days came another plenary session followed by breakout sessions, and the evenings were reserved for poster sessions. Saturday hosted half-day and full-day workshops.

Below is my summary of the conference. Of course, such a summary is very high level; my description may miss things, or may not be entirely correct if I misunderstood a speaker.

Plenary Talks

Bharat Rao from Siemens provided the first plenary talk bright and early on the first day of the conference. I only got to see the first half, as I could not wake up. His talk was about privacy-preserving data mining in medicine using matrix factorization. Although privacy has become an important issue in data mining, I do not totally buy that it is entirely necessary. The idea is that observations should not be personally identifiable. I personally do not agree that such privacy measures are necessary when only a computer system is using the data, and not an individual person. Besides, with such massive amounts of data, someone digging through gigs and gigs of personally identifiable data to find one person’s records does not seem like a viable threat. My thoughts here are similar to my thoughts on the Netflix Prize dataset lawsuit.

The second plenary talk came from Noshir Contractor. The main point of his work seemed to be how to build effective teams using graphs and data about each of the candidates for such a team. This did not excite me in itself, but the data his team used, and some of what they learned from it, did. The first part of the talk discussed research into NSF grants and the types of collaboration that are more likely to lead to the awarding of such grants. His group found that women were more likely to be collaborators on awarded proposals and that multidisciplinary teams were more likely to be funded. Some analogous work involved the detection of “gold farmers” in the MMORPG EverQuest 2. Gold farming involves gathering virtual goods and selling them for real cash. Interestingly, Contractor’s group found that the graph signatures present in gold farming are remarkably similar to those present in drug trafficking. There were a few other interesting tidbits. The group found that a great number of players only play with friends and are somewhat disconnected from the rest of the game graph. Also, male-male and female-male graph links were very common, but female-female links were uncommon. Contractor hypothesized that the male-male relationships were obvious (men are more likely to play computer games) and that women often play the game because it is the only way for them to get time with their significant others.

The Friday morning talk, on transfer learning, came from Qiang Yang of the Hong Kong University of Science and Technology. Transfer learning in this context concerns how to adapt models developed in one domain to data from another domain. Transfer learning seems to be picking up steam in machine learning, but anybody with training in statistics can tell you that it really is just latent variable analysis. Of course, transfer learning applies more to learning classifiers than to building descriptive models of data. The speaker’s proposed method is called Transfer Component Analysis (TCA), which is similar in spirit to, of course, Principal Component Analysis (PCA). Yang found that semi-supervised TCA was useful for sentiment analysis in a transfer learning context. A common use of transfer learning is mapping a text classifier to an image classifier when we have few labeled instances in the image domain. We can then use unlabeled source data (text) in a semi-supervised way to create a better classifier in the image domain.
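
To make the shared-latent-space idea concrete, here is a minimal sketch. This is not Yang's algorithm (real TCA chooses components that also minimize the discrepancy between the source and target distributions in the latent space); plain PCA fit over both domains stands in as a crude approximation, and all of the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# All data here is synthetic: X_src/y_src are labeled source-domain
# features; X_tgt is unlabeled data from a shifted target domain.
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(500, 50))
y_src = (X_src[:, 0] > 0).astype(int)
X_tgt = rng.normal(0.5, 1.2, size=(200, 50))  # same features, shifted distribution

# Learn ONE latent space from both domains. Real TCA would pick
# components that also pull the two domains' distributions together
# in that space; PCA is only a stand-in for the intuition.
pca = PCA(n_components=10).fit(np.vstack([X_src, X_tgt]))

# Train on the projected source data, predict on the projected target data.
clf = LogisticRegression().fit(pca.transform(X_src), y_src)
print(clf.predict(pca.transform(X_tgt))[:10])
```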

The last plenary talk came from Susan Dumais of Microsoft Research, who discussed temporal dynamics and information retrieval. The talk basically discussed how to mine important concepts over time from data streams. One part of her research was discovering the staying power of certain words. Dumais has identified four distinct word behaviors based on how the density of a word’s usage changes over time: fast, hybrid, medium, and slow. Her research also studies how often people revisit certain webpages and why. Presumably revisits are an alternative measure of influence to the in-links and out-links used in PageRank (remember, Microsoft has its own anti-Google search engine). Studying the temporal behavior of web visits and keyword usage is important because current methods consider only a snapshot of the web with very little evolution. Dumais stated that a page can be modeled as a mixture of bags of words formed from the page’s changes over time. Such research is important because query relevance changes over time. For example, a query for “US Open” refers to golf at certain times of the year and tennis at others. The query “March Madness” should probably return ticket prices before the event, scores during the event, and Wikipedia or sports articles recapping it afterward.

Social Media

Social media has a session at pretty much every academic conference these days. The speakers in this session used social media data to test their hypotheses, and such talks are always interesting. One talk discussed a feature selection technique for social processes using data from Twitter; the method uses user-post relations (favorites, retweets, replies) and user-user relations (following, etc.). The second talk used heat diffusion models to model the diffusion, cascading, and propagation of ideas. The researchers were also interested in discovering or predicting the “tipping point” (or burst of activity, in their words) of a social phenomenon. Another talk discussed credibility in a social network and how credible and non-credible information spreads. This work particularly discussed rumors and fake events, such as the untimely death of Justin Bieber. Some of the questions investigated were: how can we filter these fake events out of the timeline, and how do such rumors spread? The final talk in this session was a bit of an odd duck: how to build a team using social network analysis. The purpose of that work was to balance skill sets in a team and enhance collaborative compatibility.

Pattern Mining

The Thursday afternoon session I attended had a very generic name, considering all of data mining is about finding patterns. Really, it should have been called “association rule mining.” Unfortunately, this session was fairly dry and was my least favorite of the conference. The one talk that really stood out to me discussed how to mine association rules out of long temporal events. Such association rules consisted of “episodes,” which were partial orders on the graph of the event. The association rules considered were basically motifs: subsequences of interesting events that occurred within a long event.
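
As a toy illustration of the flavor of this work (not the authors' method), here is a naive episode counter: it slides a window over one long event sequence and counts the ordered subsequences, the serial kind of episode. Real episode mining is far more careful, for example about not double-counting occurrences across overlapping windows; everything here, including the event sequence, is made up.

```python
from collections import Counter
from itertools import combinations

# A made-up event sequence; in the papers' setting this is one long event.
events = list("ABCABDABCABCEABC")

def count_serial_episodes(seq, length=3, window=5):
    """Count ordered subsequences ("serial episodes") of `length` events
    that occur together inside a sliding window of `window` events.
    Naive: occurrences in overlapping windows are counted repeatedly."""
    counts = Counter()
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        for idx in combinations(range(window), length):
            counts[tuple(win[i] for i in idx)] += 1
    return counts

print(count_serial_episodes(events).most_common(3))
# ('A', 'B', 'C') should surface as a frequent episode
```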

Kernels and Classification

The first two talks in this session discussed multi-label classification, which is distinct from multi-class classification. In multi-class classification, we have multiple classes and each instance can belong to one, and only one, class. In multi-label classification, each instance can belong to one or more classes/labels. Multi-label classification exploits correlation information among labels, whereas independent classifiers do not. The first talk discussed how to use multi-label classification when there are multiple objectives; for example, when buying a cell phone, we may want to minimize price and maximize battery life. The second talk discussed dimension reduction for multi-label classification and coupling feature selection with modeling. Another talk attempted to study the theoretical principles behind pruning and grafting in decision trees; the C4.5 software does pruning and grafting, but its theoretical properties are not well understood. The last talk discussed augmenting matrix factorization with graph information and other metadata prior to building a model. For example, in a movie recommendation problem, one factor would be the movie and another would be the user. These factors can be combined into a Bayesian model that scales better than existing methods.
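
The distinction between independent per-label classifiers and correlation-aware multi-label methods is easy to see in code. Here is a minimal scikit-learn sketch (my own illustration, not from the talks): a one-vs-rest model that ignores label correlations versus a classifier chain that feeds each label's prediction into the next classifier.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Synthetic data: each instance can carry several of 5 labels at once.
X, Y = make_multilabel_classification(n_samples=1000, n_classes=5,
                                      random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Independent one-vs-rest classifiers: label correlations are ignored.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)

# Classifier chain: each label's classifier also sees the previous
# labels, so correlations among labels can be exploited.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        random_state=0).fit(X_tr, Y_tr)

for name, model in [("independent", ovr), ("chain", chain)]:
    print(name, f1_score(Y_te, model.predict(X_te), average="micro"))
```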

Transfer Learning

As I mentioned earlier, the goal of transfer learning is to map a model used in one domain to another, similar domain. The classic example is classifying images using models trained on text data and some labeled images; both domains are reduced to a common set of concepts. The talks in this session mainly discussed advances in latent variable analysis. I kept finding myself confused and wondering, “why is this considered groundbreaking?” The work presented basically used existing models for transfer learning. The first few talks discussed using Latent Dirichlet Allocation (LDA) to map data into concepts, and the third talk discussed hierarchical Latent Dirichlet Allocation (hLDA), which can be used for taxonomies and hierarchies of concepts. Although transfer learning is very useful, I did not find this work to be all that novel. Using text and images as the source and target domains is not incredibly interesting; I think transfer learning could be revolutionary if it could be applied to two very different domains.
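
For a sense of how LDA serves as the “common set of concepts,” here is a minimal sketch (my own illustration with made-up documents, not any of the presented systems): fit one topic model over both domains, represent every document by its topic proportions, and train the classifier in that shared space.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up corpora: labeled source docs, unlabeled target docs.
source_docs = ["cheap flight deals", "hotel booking tips",
               "great camera review", "laptop battery life",
               "flight delayed again", "camera lens quality"]
source_labels = [0, 0, 1, 1, 0, 1]  # 0 = travel, 1 = electronics
target_docs = ["my phone screen broke", "train tickets to the coast"]

# Fit one topic model over BOTH domains so documents from either side
# map into the same "concept" (topic-proportion) space.
vec = CountVectorizer()
X_all = vec.fit_transform(source_docs + target_docs)
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(X_all)

def to_topics(docs):
    return lda.transform(vec.transform(docs))

# Train in concept space on the source domain, apply to the target.
clf = LogisticRegression().fit(to_topics(source_docs), source_labels)
print(clf.predict(to_topics(target_docs)))
```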

Full Day Workshop: Text Mining

Of course, if there is a text mining talk, I will attend it. The workshop was led by Michael W. Berry from the University of Tennessee, Knoxville. The keynote speaker was Malu Castellanos from Hewlett-Packard Labs. Malu’s talk was amazing. She discussed a live customer intelligence system that is used for intent and sentiment analysis on various channels. Working with text is not easy. She began with a discussion of the many challenges in sentiment analysis, including deceptive adjectives (despicable is negative, but Despicable Me is a proper noun that is not), dependency relations (wicked as slang for “good” vs. wicked witch), comparisons (x is better than y), spam, sarcasm, coreferences (use of the word it), special expressions and emoticons (LOL, ;-)), and context dependencies (a predictable movie is negative whereas predictable weather may be positive). What was particularly illuminating about Malu’s talk was that she was fairly candid about how complex HP’s sentiment analysis system is. The system does not use one model for sentiment. Different models handle different kinds of tweets, and based on their classifications, these tweets are ushered off to other models for further classification. For example, comparative statements are treated distinctly by the system. There may be a naive Bayes step that classifies the text as comparative or not, and then sends the tweet on for further processing. She also mentioned using special processing, such as linear programming and generalized additive models (GAMs), to take words such as “but” and “and” into account; GAMs seem rare in text mining. Some other features of the system include sentiment intensity (really good vs. good) and clustering similar words using temporal histograms (tomorrow and 2morrow have similar usage patterns).
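
The routing idea is worth a sketch. This is emphatically not HP's system, just a minimal illustration of the architecture Malu described: a first-stage classifier decides whether a tweet is comparative and routes it to a specialist model. The training data and the downstream stubs are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data for the routing step only.
texts = ["x is better than y", "a beats b hands down", "love this phone",
         "worst service ever", "c is worse than d", "great battery life"]
is_comparative = [1, 1, 0, 0, 1, 0]

router = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
router.fit(texts, is_comparative)

# Stubs standing in for the hypothetical downstream specialist models.
def comparative_sentiment(tweet):
    return ("comparative model", tweet)

def plain_sentiment(tweet):
    return ("default model", tweet)

def handle(tweet):
    """Route a tweet to the appropriate downstream sentiment model."""
    if router.predict([tweet])[0]:
        return comparative_sentiment(tweet)
    return plain_sentiment(tweet)

print(handle("x is way better than y"))
```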

The first talk was from David Skillicorn, who recently published a book about mining large datasets. He discussed how to pick the most interesting documents out of a corpus. The second talk was given by a brave undergraduate student on query expansion. He did a very good job, but what was strange about this talk was that it used… Latent Semantic Indexing (…from 1990…) rather than one of the more recent models such as LDA. This brings me to my first personal “weird moment” about this workshop: there was very little discussion of modern (post-2000) topic models. This is very strange to me; just a few months earlier, topic models were all the rage at KDD 2011. After the lunch break, there were talks about incremental online clustering of documents and discovery of patent trolls. The final sessions of the afternoon discussed extracting hierarchies to improve the performance of multi-label classifiers and automatically evaluating text summarizers. Only one of the presentations in this workshop seemed to be attached to a paper.
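
For readers who have not run into LSI: it factorizes a term-document matrix with a truncated SVD, and query expansion then amounts to finding a query term's nearest neighbors in the latent space. A minimal sketch with an invented five-document corpus (again, my illustration, not the student's system):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# An invented five-document corpus.
docs = ["data mining conference talk", "mining association rules",
        "text mining and topic models", "disneyland rides and parks",
        "theme park roller coaster"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
term_vecs = svd.components_.T          # terms x latent dimensions
term_vecs /= np.linalg.norm(term_vecs, axis=1, keepdims=True)

def expand(term, k=3):
    """Return the k terms closest to `term` in the latent space."""
    terms = list(vec.get_feature_names_out())
    i = terms.index(term)
    sims = term_vecs @ term_vecs[i]    # cosine similarity to the query term
    ranked = [terms[j] for j in np.argsort(-sims) if j != i]
    return ranked[:k]

print(expand("mining"))
```
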
I do not want to be too critical, because I am sure a lot of work goes into planning such events; I just found this workshop to be a bit weird. A lot of the methods used in the papers were quite old-fashioned for text mining (LSI, regression), and the applications were also quite old-school (patents and legal documents just scream the old-fashioned use of information retrieval… library cataloging). It also seemed like a disproportionate number of the speakers had a prior relationship with the workshop chair, and I am not used to a workshop with so few associated papers.

Concluding Thoughts

This was a data mining conference, so of course I enjoyed it. I must say, though, that the vibe was very different from some of the other conferences I have attended, like KDD and IJCAI. Most of the speakers came from overseas, and as someone with hearing loss, I found it very difficult to understand many of them. There also seemed to be very few people just attending the conference; the majority seemed to be presenting a talk or a poster, which is different from what I am used to. Because of that, the usual community feel was a bit missing. Additionally, there was no mention of Hadoop or R, which I found a bit concerning since every other conference I have been to has speakers who are proud to contribute to those open-source projects. And then there was the weird text mining workshop (it could have just been an off-year). Could it be because SIAM is a mathematics group? I am not sure. All in all, I still had a great time and learned a lot, as always.

Of course, my attendance would not have been possible without sponsorship and support from my company, GumGum. I attended this conference as part of my position as Data Scientist.

Disneyland

Of course, the elephant in every room of the conference was the fact that Disneyland was only a 5-10 minute walk away. I got a 2-day Park Hopper pass and spent my lunch hours and evenings at both Disneyland and California Adventure. It really is the Happiest Place on Earth; just being there, I forget about stress and the things that worry me. I had a great time walking around and watching all the kids have fun. At Disneyland I only went on a few rides: Space Mountain, Pirates of the Caribbean, Haunted Mansion, It’s a Small World, and the Disneyland Railroad (not really a ride). I also got to ride the Monorail for the first time. At California Adventure I only did the California Screamin’ roller coaster and Soarin’ Over California, which features my hometown (the part with the orange orchards). Unfortunately, I missed Tom Sawyer Island again. I will have to go there first next time!

[Photos: the view of California Adventure from my hotel room; a room just for kids; Surfer Goofy at the lobby entrance.]
