Text Mining
Text mining, also known as text analysis, is the process of transforming unstructured text data into meaningful and actionable information.
Edited at 2020-10-10 07:22:50
First mentioned in Feldman et al. [FD95] (Hotho, Nürnberger, and Paaß 2005)
Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text(Hotho, Nürnberger, and Paaß 2005)
The application of algorithms and methods from the fields of machine learning and statistics to texts with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly. Many authors use information extraction methods, natural language processing or some simple pre-processing steps in order to extract data from texts. Data mining algorithms can then be applied to the extracted data (see [NM02, Gai03]).(Hotho, Nürnberger, and Paaß 2005)
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.(Gupta and Lehal 2009)
refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text.(Gupta and Lehal 2009)
can work with unstructured or semi-structured data sets such as emails, full-text documents and HTML files etc.(Gupta and Lehal 2009)
knowledge discovery from text (KDT) (Hotho, Nürnberger, and Paaß 2005)
Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT)(Gupta and Lehal 2009)
Text Analytics(Ppts APomares)
Text Data Mining(Ppts APomares)
information retrieval, machine learning, statistics, computational linguistics and especially data mining.(Hotho, Nürnberger, and Paaß 2005)
Data Mining(Gupta and Lehal 2009)
tries to find interesting patterns from large databases.(Gupta and Lehal 2009)
Web mining explores interesting information and potential patterns from the contents of web pages, the information on accessing web page linkages, and the resources of e-commerce by using data mining techniques, which can help people extract knowledge, improve web site design, and develop e-commerce better.(Gupta and Lehal 2009)
Computational Linguistics(Gupta and Lehal 2009)
Information retrieval is the finding of documents which contain answers to questions and not the finding of answers itself [Hea99](Hotho, Nürnberger, and Paaß 2005)
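A minimal sketch of the idea that retrieval returns documents rather than answers, assuming a toy inverted index with boolean AND matching. The corpus, the document ids, and the function names `build_inverted_index` and `retrieve` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def retrieve(index, query):
    """Return ids of documents that contain every query token (boolean AND)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical toy corpus.
docs = {
    "d1": "text mining extracts patterns from text",
    "d2": "data mining finds patterns in databases",
    "d3": "information retrieval finds documents",
}
index = build_inverted_index(docs)
hits = retrieve(index, "mining patterns")
print(sorted(hits))
```

The result is a set of document ids; reading the documents to find the actual answer is left to the user, which is the distinction the definition draws.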
Natural Language Processing NLP
The general goal of NLP is to achieve a better understanding of natural language by use of computers [Kod99].(Hotho, Nürnberger, and Paaß 2005)
The goal of information extraction methods is the extraction of specific information from text documents. These are stored in database-like patterns (see [Wil97])(Hotho, Nürnberger, and Paaß 2005)
IE addresses the problem of transforming a corpus of textual documents into a more structured database. The database constructed by an IE module can then be provided to the KDD module for further mining of knowledge (Gupta and Lehal 2009)
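As a rough illustration of turning text into database-like records, the sketch below applies a single hand-written regular expression to a small corpus. The pattern ("<person> joined <company> in <year>"), the field names, and the sample sentences are all assumptions made for this example; real IE systems use far richer methods.

```python
import re

# Hypothetical extraction pattern: "<First Last> joined <Company> in <year>".
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined "
    r"(?P<company>[A-Z][A-Za-z]+) in (?P<year>\d{4})"
)

def extract_records(corpus):
    """Return one dict per match, i.e. rows of a small structured table."""
    records = []
    for doc_id, text in enumerate(corpus):
        for match in PATTERN.finditer(text):
            record = match.groupdict()
            record["year"] = int(record["year"])
            record["doc_id"] = doc_id
            records.append(record)
    return records

# Hypothetical toy corpus.
corpus = [
    "Ada Smith joined Acme in 2001. She later moved abroad.",
    "No hiring news today.",
    "Bob Jones joined Initech in 1999 and Carol White joined Acme in 2005.",
]
records = extract_records(corpus)
for row in records:
    print(row)
```

The resulting list of dictionaries plays the role of the structured database that could be handed to a later mining step.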
are necessary in order to analyze large quantities of data efficiently.(Hotho, Nürnberger, and Paaß 2005)
is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn" by the analysis of data sets.(Hotho, Nürnberger, and Paaß 2005)
Statistics has its grounds in mathematics and deals with the science and practice for the analysis of empirical data. It is based on statistical theory which is a branch of applied mathematics(Hotho, Nürnberger, and Paaß 2005)
Knowledge Discovery in Databases (KDD)
is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data(Hotho, Nürnberger, and Paaß 2005)
Text stream mining
Created by very large-scale interactions of individuals or by the structured creation of particular kinds of content by dedicated organizations, e.g. news-wire services (Reuters, AP) (Aggarwal, 2012)
Provide unprecedented challenges to data mining algorithms from an efficiency perspective(Aggarwal, 2012)
Ubiquitous in recent years because of the wide variety of applications in social networks and news collection. In general, these involve the continuous creation of massive streams(Aggarwal, 2012)
Social networks(Aggarwal, 2012)
Users continuously communicate with one another with the use of text messages
Interesting because text messages reflect user interests; the same applies to chat and email networks
News aggregator services, e.g. Google News(Aggarwal, 2012)
Receives news articles continuously over time
Web crawlers(Aggarwal, 2012)
Collect a large volume of documents from networks in a short time frame
Combines search results from major search engines like Google, Yahoo! and Bing
Methods for online summarizations need to be designed(Aggarwal, 2012)
There are estimates that 85% of business information lives in the form of text(Hotho, Nürnberger, and Paaß 2005)
As most information (over 80%) is stored as text, text mining is believed to have a high commercial potential value. (Gupta and Lehal 2009), (Tan 1999)
Humans have the ability to distinguish and apply linguistic patterns to text and humans can easily overcome obstacles that computers cannot easily handle such as slang, spelling variations and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer’s ability to process text in large volumes or at high speeds.(Gupta and Lehal 2009)
As the most natural form of storing information is text, text mining is believed to have a commercial potential higher than that of data mining(Tan 1999)
Cross Industry Standard Process for Data Mining (CRISP-DM)(Hotho, Nürnberger, and Paaß 2005)
(1) business understanding, (2) data understanding, (3) data preparation, (4) modelling, (5) evaluation, (6) deployment
For mining large document collections it is necessary to pre-process the text documents and store the information in a data structure, which is more appropriate for further processing than a plain text file(Hotho, Nürnberger, and Paaß 2005)
Filtering(Hotho, Nürnberger, and Paaß 2005)
remove words from the dictionary and thus from the documents.
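One common form of filtering is stop word removal. The sketch below assumes a tiny hand-picked stop word list and a simple letters-only tokenizer; both are illustrative choices, not a standard list or a recommended tokenizer.

```python
import re

# Illustrative stop word list (an assumption, not a standard resource).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The goal of text mining is to find patterns in the text.")
filtered = filter_stop_words(tokens)
print(filtered)
```

Removing these words from every document also removes them from the dictionary built over the collection, since the dictionary is just the set of remaining tokens.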
Lemmatization(Hotho, Nürnberger, and Paaß 2005)
Try to map verb forms to the infinitive and nouns to the singular form.
Stemming(Hotho, Nürnberger, and Paaß 2005)
Try to build the basic forms of words, i.e. strip the plural 's' from nouns, the 'ing' from verbs, or other affixes.
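The two affix rules mentioned above can be sketched as a deliberately naive suffix stripper. The length thresholds and the `naive_stem` name are assumptions made for this example; practical systems usually rely on a more careful rule set such as the Porter stemmer.

```python
def naive_stem(word):
    """Strip a trailing 'ing' or plural 's' using two crude rules."""
    word = word.lower()
    # Only strip 'ing' from longer words so that e.g. 'sing' is left alone.
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    # Strip a plural 's', but keep words ending in 'ss' such as 'class'.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

words = ["documents", "extracting", "class", "sing"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Note that stems need not be dictionary words: with these rules "mining" becomes "min", which is acceptable for indexing as long as related forms collapse to the same string.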
Linguistic(Hotho, Nürnberger, and Paaß 2005)
Part-of-speech tagging (POS)
determines the part of speech tag, e.g. noun, verb, adjective, etc. for each term.
Text chunking aims at grouping adjacent words in a sentence.
Word Sense Disambiguation (WSD)
Tries to resolve the ambiguity in the meaning of single words or phrases.
Parsing produces a full parse tree of a sentence.
Index Term Selection(Hotho, Nürnberger, and Paaß 2005)
In this case, only the selected keywords are used to describe the documents.
The most commonly used criterion is entropy
The entropy gives a measure of how well a word is suited to separate documents by keyword search.
The entropy can be seen as a measure of the importance of a word in the given domain context.
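A small sketch of an entropy-based term weight, under the assumption that the weight of a term t over a collection D is W(t) = 1 + (1 / log2 |D|) * sum over documents d of P(d, t) * log2 P(d, t), with P(d, t) = tf(d, t) divided by the total frequency of t in the collection. This particular formulation is assumed here for illustration. With it, a term spread evenly over all documents gets a weight near 0 and a term confined to one document gets a weight of 1.

```python
import math
from collections import Counter

def entropy_weights(docs):
    """Return {term: weight} using the assumed entropy formulation above."""
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents")
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = set().union(*counts)
    weights = {}
    for term in vocabulary:
        total = sum(c[term] for c in counts)
        acc = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total  # share of the term's occurrences in this document
                acc += p * math.log2(p)
        weights[term] = 1.0 + acc / math.log2(n_docs)
    return weights

# Hypothetical toy corpus: 'mining' occurs once in every document.
docs = ["text mining text", "data mining", "web mining"]
weights = entropy_weights(docs)
for term in sorted(weights):
    print(term, round(weights[term], 3))
```

Terms with a weight close to 0 are poor candidates for index terms because they do not help to tell documents apart.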
Principal Components Analysis (PCA)
Also known as the Karhunen-Loève procedure, eigenvector analysis, or empirical orthogonal functions, depending on the context in which it is used(Berry et al. 2008)
Recently it has been used primarily in statistical data analysis and image processing(Berry et al. 2008)
For text and data mining that focuses on covariance matrix analysis (COV)(Berry et al. 2008)
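To make the covariance-matrix view concrete, the sketch below computes the sample covariance matrix of a tiny document-by-attribute table and finds its leading eigenvector by power iteration. Power iteration, the all-equal starting vector, the fixed iteration count, and the toy data are assumptions chosen to keep the example dependency-free; this is not the algorithm described by the cited authors, and it can fail if the starting vector happens to be orthogonal to the leading eigenvector.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the columns of a list-of-lists data table."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def first_principal_component(cov, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v

# Hypothetical data: three documents, two perfectly correlated attributes.
rows = [[1, 2], [2, 4], [3, 6]]
cov = covariance_matrix(rows)
eigenvalue, component = first_principal_component(cov)
print(cov, eigenvalue, component)
```

Because the second attribute is always twice the first, all variance lies along a single direction and one component describes the data completely.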
Latent Semantic Indexing (LSI)
Given a database with M documents and N distinguishing attributes for relevancy ranking, let A denote the corresponding M-by-N document-attribute matrix model
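A minimal sketch of building such an M-by-N matrix A, assuming that the attributes are simply the distinct whitespace-separated terms and that each entry is a raw term count. Both choices are illustrative; other attribute definitions and weightings are possible. LSI would then apply a truncated singular value decomposition to A, which is not shown here.

```python
def document_attribute_matrix(docs):
    """Return (attributes, A) where A[i][j] counts attribute j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    attributes = sorted({token for tokens in tokenized for token in tokens})
    column = {token: j for j, token in enumerate(attributes)}
    matrix = [[0] * len(attributes) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for token in tokens:
            matrix[i][column[token]] += 1
    return attributes, matrix

# Hypothetical toy database with M = 3 documents.
docs = ["text mining text", "data mining", "web data"]
attributes, A = document_attribute_matrix(docs)
print(attributes)
for row in A:
    print(row)
```

Here M = 3 and N = 4, so each row of A describes one document and each column one attribute.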