In this article I’m going to explain the concept of “text mining”, an area in which the Language Technology Group conducts research.
The definition that Wikipedia gives of text mining is a useful first step towards understanding what the concept refers to. Text mining refers generally to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output.
Another definition is given by Marti Hearst in her article “What is Text Mining?”. She describes text mining as the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down.
A typical example in data mining is using consumer purchasing patterns to predict which products to place close together on shelves, or to offer coupons for, and so on. A related application is automatic detection of fraud, such as in credit card usage. Analysts look across huge numbers of credit card records to find deviations from normal spending patterns.
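To make the idea of "deviations from normal spending patterns" concrete, here is a minimal sketch in Python. It assumes a very simple notion of deviation (a z-score against the cardholder's own history) and uses invented amounts; the function name `flag_unusual` and the threshold are illustrative choices, not something taken from the sources. Real fraud detection systems rely on far richer models.

```python
from statistics import mean, stdev

def flag_unusual(amounts, threshold=2.0):
    """Return (index, amount) pairs whose z-score exceeds the threshold.

    Assumption: "normal spending" is summarized by the mean and sample
    standard deviation of the same list of amounts.
    """
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing deviates
    return [(i, a) for i, a in enumerate(amounts)
            if abs(a - mu) / sigma > threshold]

# Invented purchase history for one card (illustrative data only).
history = [23.5, 41.0, 18.2, 36.9, 27.4, 30.1, 950.0, 22.8]
flagged = flag_unusual(history)
print(flagged)
```

With this toy history only the 950.0 purchase is far enough from the rest to be flagged.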
To go deeper into what text mining is, we can refer to what Ronen Feldman and James Sanger explain in their book “The text mining handbook: advanced approaches in analyzing unstructured data”. Text mining is a new research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.
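The stages described above (preprocessing, an intermediate representation, then analysis) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the function names, and the three toy documents are invented here, and counting how many documents contain each pair of terms is just a crude stand-in for the association analysis the book describes, not its actual method.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}

def extract_terms(text):
    """Preprocessing: lowercase, tokenize, drop stopwords, keep unique terms."""
    return {t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS}

def cooccurrence_counts(documents):
    """Analysis: count the number of documents containing each term pair."""
    pair_counts = Counter()
    for text in documents:
        terms = sorted(extract_terms(text))  # intermediate representation
        pair_counts.update(combinations(terms, 2))
    return pair_counts

# Invented toy collection (illustrative data only).
docs = [
    "Text mining finds patterns in documents",
    "Data mining finds patterns in database records",
    "Documents contain unstructured text",
]
counts = cooccurrence_counts(docs)
for pair, n in counts.most_common(4):
    print(pair, n)
```

Pairs that appear together in several documents, such as "mining" and "patterns" here, are the kind of candidate pattern that a later analysis or visualization step would examine.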
Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections.
For text mining systems, preprocessing operations center on the identification and extraction of representative features for natural language documents. These preprocessing operations are responsible for transforming unstructured data stored in document collections into a more explicitly structured intermediate format, a concern that is not relevant for most data mining systems.
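As a rough illustration of what a "more explicitly structured intermediate format" can look like, the sketch below turns raw sentences into a document-term matrix of word counts. The choice of plain word counts as features, the tokenizer, and the names `tokenize` and `document_term_matrix` are assumptions made for this example; the sources do not prescribe a particular representation.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix), one row of term counts per document."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Invented example documents (illustrative data only).
docs = [
    "Text mining structures text.",
    "Data mining uses structured records.",
]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now a fixed-length numeric record, which is the kind of input that conventional data mining techniques already expect.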
Sources:
*Ronen Feldman and James Sanger. “The text mining handbook: advanced approaches in analyzing unstructured data”. Published in 2006, Cambridge University Press, 410 pages. Retrieved 22:15, 11 April, 2008 from http://books.google.com/books?hl=en&lr=&id=3PcEoz48RBcC&oi=fnd&pg=PR10&dq=the+text+mining+handbook&ots=dDUECB3_k4&sig=yoHE5tmdhlhZ2q8Qb9N0Spwywxc
*Marti Hearst. “What is Text Mining?”. (October 17, 2003). Retrieved 22:05, 11 April, 2008, from http://www.jaist.ac.jp/~bao/MOT-Ishikawa/FurtherReadingNo1.pdf
*Text mining. (2008, April 8). In Wikipedia, The Free Encyclopedia. Retrieved 11:52, April 12, 2008, from http://en.wikipedia.org/w/index.php?title=Text_mining&oldid=204166609