Data Scientist - Books, Links, Papers, Tools, Projects

By Trieu October 16, 2011

I am preparing and studying for a new job and for the new trends of the post-Web 2.0 era. I am still thinking about what I should do, study, and research to become a Data Scientist, a truly scientific job.

The trend now is that data is generated by massive numbers of users, and tons of it is everywhere: blogs, Facebook, YouTube, Twitter, ...
We have to deal with it every day. Your physical brain is designed to process a lot of news, information, and work at the same time, filtering out what is useful information, then the knowledge you should capture, and then the wisdom (http://www.systems-thinking.org/dikw/dikw.htm)
=> Stress, overload, ... or the limit of the biological brain.

So I am on the way to implementing my idea, the "My Second Brain" project: http://code.google.com/p/my-second-brain/

http://www.infogineering.net/data-information-knowledge.htm

As the name suggests, it should help me process tons of email, blogs, RSS feeds, and local news to find the keywords and the trends. That can save me the time of manually reading, classifying, and tagging the key information, so I can focus all my energy on doing cool things and on making decisions that improve my skills and my career.
To change the world, I should at least change my life first, and then share the results with everyone.

First, how to extract the content of local news and rank the best keywords? ==> http://code.google.com/p/boilerpipe/

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
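
Once the main text has been extracted, the keyword-ranking step can start as simply as counting terms. The following is a minimal sketch in Python using only the standard library. It is not boilerpipe (which is a Java library) and not necessarily how "My Second Brain" does it; the `rank_keywords` function, the tiny stop-word list, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "it", "from"}

def rank_keywords(text, top_n=5):
    """Return the top_n most frequent terms as (term, count) pairs.

    Terms are lowercase alphanumeric runs; stop words and terms
    shorter than three characters are ignored.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)

# Made-up sample text standing in for an extracted news article.
article = (
    "Data is everywhere. Data from blogs, data from Twitter, "
    "and news from blogs all need filtering."
)
print(rank_keywords(article, top_n=2))
```

Raw term frequency favors words that are common everywhere, so a real ranking would probably weight terms against a larger collection (for example with TF-IDF).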

Second, http://incubator.apache.org/opennlp/

OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.
OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.
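
To make the first two of those tasks concrete, here is a rough sketch of sentence detection and tokenization in Python. It uses plain regular expressions, not OpenNLP and not its maximum-entropy models, so it will fail on abbreviations such as "Dr." where a trained model would not. The function names `split_sentences` and `tokenize` are made up for this example.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: words (optionally with one internal apostrophe)
    and single punctuation marks become separate tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

for sentence in split_sentences("I read blogs. Do you? Yes!"):
    print(tokenize(sentence))
```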

Third, http://lucene.apache.org/java/docs/index.html
Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
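
The core idea behind full-text search is an inverted index: a map from each term to the documents that contain it. Below is a toy sketch of that idea in Python. It is not Lucene's API and leaves out everything that makes Lucene fast and useful (analyzers, scoring, on-disk segments); the `build_index` and `search` names and the sample documents are assumptions for illustration.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query (AND search)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up documents standing in for emails, blog posts and news items.
docs = {
    "mail-1": "Meeting notes about the search project",
    "blog-7": "Full-text search with an inverted index",
    "news-3": "Local news about the weather",
}
index = build_index(docs)
print(sorted(search(index, "search")))
```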

Fourth, http://mahout.apache.org/
The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries.

Fifth, the Google cloud and some tools.
To hook into the browsing work, Chrome extensions: http://code.google.com/chrome/extensions/overview.html
For private cloud storage, cheap and cool, Gmail: https://mail.google.com/

Sixth, Jetty, for running your personal service: http://jetty.codehaus.org/jetty/ , http://code.google.com/p/i-jetty/

Seventh, mobile, the way information is collected and consumed: http://www.phonegap.com/about
http://www.livestream.com/facebookeducation/video?clipId=pla_e86b0c30-8796-4b54-8c52-d43440f84068

Eighth, and finally, visualizing your personal information: http://mbostock.github.com/protovis/ , http://thejit.org/ , https://github.com/mbostock/d3

The big picture in one photo