Chapter 1
Introduction

1.1 The Basic Concepts

Compared with general data mining technology, which must handle various document formats (such as doc/docx files, PDF files, and HTML files), the greatest challenge in text data mining lies in analyzing and modeling unstructured natural language text. Two aspects should be emphasized here: first, text content is almost always unstructured, unlike databases and data warehouses, which are structured; second, text content is described in natural language, not purely by data, and non-text formats such as graphics and images are not considered. Of course, it is normal for a document to contain tables and figures, but in such documents the main body is still text. Therefore, text data mining is de facto an integrated technology combining natural language processing (NLP), pattern classification, and machine learning (ML).

The word "mining" usually carries the meanings of "discovery, search, induction, and refinement." Since discovery and refinement are necessary, the target results are often not obvious: they are hidden and concealed in the text, or cannot be found and summarized over a large collection. The adjectives "hidden" and "concealed" here apply not only to computer systems but also to human users. In either case, from the user's point of view, the hope is that the system can directly provide answers and conclusions to the questions of interest, instead of delivering numerous possible search results for the input keywords and leaving users to analyze them and find the required answers themselves, as in a traditional retrieval system.

Roughly speaking, text mining tasks can be classified into two types. In the first, the user's questions are clear and specific, but the user does not know the answers. For example, a user may want to determine, from many text sources, what kind of relationship someone has with certain organizations.
The other situation is when the user knows only the general aim but does not have specific and definite questions. For example, medical personnel may hope to discover the regularities of some diseases and their related factors from many case records. In this case, they may not have a specific disease or specific factors in mind, and the relevant data in their entirety need to be mined automatically by the system. Certainly, there is sometimes no obvious boundary between the two types.

Text mining technology has very important applications in many fields, such as the national economy, social management, information services, and national security, and the market demand is huge. For example, government departments can investigate the public will and understand public opinion in a timely and accurate manner by analyzing and mining microblogs, WeChat, SMSs (short message services), and other network information produced by ordinary people. In the field of finance or commerce, through the in-depth mining and analysis of extensive written material, such as news reports, financial reports, and online reviews, text mining can help predict the economic situation and stock market trends for a certain period. Electronics enterprises can assess their users' and the market's reactions to their products at any time and obtain data support for further improving product quality and providing personalized services. For national security and public security departments, text data mining technology is a useful tool for the timely discovery of social instability factors and for effectively controlling the resulting situations. In the field of medicine and public health, many phenomena, regularities, and conclusions can be discovered by analyzing medical reports, cases, records, and related documents and materials.
Text mining, as a research field crossing multiple technologies, originated from single techniques such as text classification, text clustering, and automatic text summarization. In the 1950s, text classification and clustering emerged as applications of pattern recognition. At that time, research was mainly driven by the needs of library and information classification, and classification and clustering were, of course, based on the topics and contents of texts. In 1958, H. P. Luhn proposed the concept of automatic summarization (Luhn 1958), which added new content to the field of text mining. In the late 1980s and early 1990s, with the rapid development and popularization of Internet technology, demand for new applications promoted the continuous development and growth of this field. The US government funded a series of research projects on information extraction, and in 1987, the US Defense Advanced Research Projects Agency (DARPA) initiated and organized the first Message Understanding Conference (MUC)[1] to evaluate the performance of this technology. In the subsequent 10 years, seven consecutive evaluations made information extraction technology a research hot spot in this field. Later, a series of social media-oriented text processing technologies, such as text sentiment analysis, opinion mining, and topic detection and tracking, emerged and developed rapidly. Today, this technical field is growing rapidly not only in theory and method but also in the form of system integration and applications.

[1] https://www-nlpir.nist.gov/related_projects/muc/

1.2 Main Tasks of Text Data Mining

As mentioned above, text mining is a domain that crosses multiple technologies and involves a wide range of content. In practical applications, it is usually necessary to combine several related technologies to complete an application task, and the execution of mining technology is usually hidden behind the application system.
For example, a question answering (Q&A) system often requires several components, such as question parsing, knowledge base search, inference and filtering of candidate answers, and answer generation. In the process of constructing the knowledge base, key technologies such as text clustering, classification, named entity recognition, relation extraction, and disambiguation are indispensable. Therefore, text mining is not a single technology but is usually an integrated application of several technologies. The following is a brief introduction to several typical text mining technologies.

(1) Text Classification

Text classification is a specific application of pattern classification technology. Its task is to assign a given text to one of a set of predefined text types. For example, according to the Chinese Library Classification (5th Edition),[2] all books are divided into 5 categories and 22 subcategories. On the front page of Sina,[3] the content is divided into categories such as news, finance, sports, entertainment, cars, blogs, video, and real estate. Automatically classifying a book or an article into a certain category according to its content is a challenging task. Chapter 5 of this book introduces text classification techniques in detail.

(2) Text Clustering

The purpose of text clustering is to divide a given text set into different categories. Generally, different results can be obtained by clustering from different perspectives. For example, based on content, a text set can be clustered into news, culture and entertainment, sports, finance, and so on, while based on the author's attitude, it can be grouped into a positive category (views expressing positive and supportive attitudes) and a negative category (views expressing negative and opposing attitudes).
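To make the classification task above concrete, the following is a minimal sketch of a bag-of-words multinomial Naive Bayes text classifier, one common baseline for this task. The training texts and the two category names ("sports" and "finance") are invented toy data for illustration, not examples from this book.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (word_list, label). Returns log priors and smoothed log likelihoods."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # label -> Counter of word occurrences
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    model = {"priors": {}, "likelihoods": {}, "vocab": vocab}
    for label, n in label_counts.items():
        model["priors"][label] = math.log(n / len(docs))
        total = sum(word_counts[label].values())
        # Laplace (add-one) smoothing so unseen words get a small nonzero probability
        model["likelihoods"][label] = {
            w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return model

def classify(model, words):
    """Pick the label maximizing log prior + sum of log likelihoods of known words."""
    best_label, best_score = None, float("-inf")
    for label, prior in model["priors"].items():
        score = prior + sum(model["likelihoods"][label][w]
                            for w in words if w in model["vocab"])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data with two Sina-style categories.
train = [
    ("the team won the match in the final game".split(), "sports"),
    ("the stock market rose and investors gained".split(), "finance"),
    ("the player scored a goal in the game".split(), "sports"),
    ("the bank raised interest rates for investors".split(), "finance"),
]
model = train_naive_bayes(train)
print(classify(model, "the match ended with a goal".split()))  # → sports
```

Real systems replace the toy corpus with large annotated collections and often richer features, but the predefined-category structure of the task is exactly as above.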
The basic difference between text clustering and text classification is that classification predefines the set of categories, and the classification process automatically assigns each given text to one of them, labeling it with a category tag. Clustering, by contrast, does not predefine the categories; a given document set is divided into groups that can be distinguished from each other based on certain criteria and evaluation indices. Many similarities exist between text clustering and text classification, and the adopted algorithms and models overlap, for example, in text representation models, distance functions, and the K-means algorithm. Chapter 6 of this book introduces text clustering techniques in detail.

[2] https://baike.baidu.com/item/中国图书馆图书分类法/1919634?fr=aladdin
[3] https://www.sina.com.cn/

(3) Topic Model

In general, every article has a topic and several subtopics, and a topic can be expressed by a group of words that are strongly correlated and that basically share the same concepts and semantics. We can consider each word to be associated with certain topics with certain probabilities, and in turn, each topic selects certain words with certain probabilities. Therefore, we can give the following simple formula:

    p(word_i | document_j) = Σ_k p(word_i | topic_k) × p(topic_k | document_j)    (1.1)

Thus, the probability of each word appearing in a document can be calculated. To mine the topics and concepts hidden behind the words of a text, a series of statistical models, called topic models, have been proposed. Chapter 7 of this book introduces topic models in detail.

(4) Text Sentiment Analysis and Opinion Mining

Text sentiment refers to the subjective information expressed by a text's author, that is, the author's viewpoint and attitude.
Therefore, the main tasks of text sentiment analysis, which is also called text orientation analysis or text opinion mining, include sentiment classification and attribute extraction. Sentiment classification can be regarded as a special type of text classification in which text is classified based on the subjective information it expresses, such as views and attitudes, or on judgments of its positive or negative polarity. For example, after a special event (such as the loss of communication with Malaysia Airlines flight MH370, UN Secretary-General Ban Ki-moon's attendance at China's military parade commemorating the 70th anniversary of the victory of the Anti-Fascist War, or talks between the South Korean and North Korean leaders), there is a large number of news reports and user comments on the Internet. How can we automatically capture and understand the various views (opinions) expressed in these news reports and comments? After a company releases a new product, it needs a timely understanding of users' evaluations and opinions (tendencies), together with data on users' age range, sex ratio, and geographical distribution drawn from their online comments, to help inform its next decisions. These are all tasks that can be completed by text sentiment analysis. Chapter 8 of this book introduces text sentiment analysis and opinion mining techniques.

(5) Topic Detection and Tracking

Topic detection usually refers to the mining and screening of text topics from numerous news reports and comments. Those topics that most people care about, pay attention to, and track are called hot topics. Hot topic discovery, detection, and tracking are important technological capabilities in public opinion analysis, social media computing, and personalized information services.
The forms of their application vary; for example, Hot Topics Today reports on what most attracted readers' attention among all the news events of that day, while Hot Topics 2018 lists the top news items that attracted the most attention among all the news events throughout 2018 (or from January 1, 2018, to some other specified date). Chapter 9 of this book introduces techniques for topic detection and tracking.

(6) Information Extraction

Information extraction refers to the extraction of factual information, such as entities, entity attributes, relations between entities, and events, from unstructured and semistructured natural language text (such as web news, academic documents, and social media) and its output as structured data (Sarawagi 2008). Typical information extraction tasks include named entity recognition, entity disambiguation, relation extraction, and event extraction.

In recent years, biomedical/medical text mining has attracted extensive attention. It refers to the analysis, discovery, and extraction of information from text in the fields of biology and medicine, for example, searching the biomedical literature to identify the factors or causes related to a certain disease, or analyzing a range of cases recorded by doctors to find the cause of certain diseases or the relationships between one disease and others. Compared with text mining in other fields, text mining in the biomedical/medical field faces many special problems, such as the multitude of technical terms and medical terminology in the text, including idioms and jargon used clinically and protein names coined by laboratories. In addition, text formats vary greatly with their source, such as medical records, laboratory tests, research papers, public health guidelines, or manuals.
Unique problems faced in this field include how to express and utilize common knowledge and how to obtain a large-scale annotated corpus.

Text mining technology has also been a hot topic in the financial field in recent years. For example, from the perspective of ordinary users or regulatory authorities, the operational status and social reputation of a financial enterprise can be analyzed through available materials such as financial reports, public reports, and user comments on social networks; from the perspective of an enterprise, forewarnings of possible risks may be found through the analysis of various internal reports, and credit risks can be controlled through the analysis of customer data.

It should be noted that the relation in information extraction usually refers to some semantic relation between two or more concepts, and relation extraction automatically discovers and mines the semantic relations between concepts. Event extraction is commonly used to extract the elements that make up events in a specific domain. The "event" mentioned here has a different meaning from that used in daily life. In daily life, how people describe events is consistent with their understanding of events: they refer to when, where, and what happened, and the thing that happened is often a complete story, including detailed descriptions of causes, processes, and results. By contrast, in event extraction, the "event" usually refers to a specific behavior or state expressed by a certain predicate framework. For example, "John meets Mary" is an event triggered by the predicate "meet." The event understood by ordinary people is a story, while the "event" in event extraction is just an action or state. Chapter 10 of this book introduces information extraction techniques.
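To illustrate the predicate-centered notion of "event" just described, the sketch below extracts a "meet" event like "John meets Mary" with a single hand-written pattern. The pattern and the output field names ("trigger," "agent," "patient") are invented for illustration; real extractors rely on syntactic parsing and trained models rather than one regular expression.

```python
import re

# A hypothetical pattern for simple "X meets Y"-style events: a capitalized
# subject, the trigger verb "meets"/"met", and a capitalized object.
MEET_PATTERN = re.compile(
    r"(?P<subject>[A-Z][a-z]+) (?P<trigger>meets|met) (?P<object>[A-Z][a-z]+)"
)

def extract_meet_events(text):
    """Return a list of event dicts for every 'meet' event found in the text."""
    return [
        {"trigger": m.group("trigger"),
         "agent": m.group("subject"),
         "patient": m.group("object")}
        for m in MEET_PATTERN.finditer(text)
    ]

print(extract_meet_events("John meets Mary at the station."))
# → [{'trigger': 'meets', 'agent': 'John', 'patient': 'Mary'}]
```

The output is exactly the structured-data view of the sentence that information extraction aims for: an action ("meet") with its participant roles, rather than a full story.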
(7) Automatic Text Summarization

Automatic text summarization, or automatic summarization for short, refers to technology that automatically generates summaries using natural language processing methods. In today's era of information overload, automatic summarization technology has very broad applications. For example, an information service department needs to automatically classify many news reports, form summaries of individual event reports, and recommend these reports to users who may be interested. Some companies or supervisory departments want a rough picture of the main content of the messages (SMS, microblog, WeChat, etc.) published by certain user groups. Automatic summarization technology is used in these situations. Chapter 11 of this book introduces automatic text summarization techniques.

1.3 Existing Challenges in Text Data Mining

The study of text mining techniques is a challenging task. First, the theoretical system of natural language processing has not yet been fully established. At present, text analysis is to a large extent only at the "processing" stage and is far from reaching the level of deep semantic understanding achieved by human beings. In addition, natural language is the most important tool used by human beings to express emotions, feelings, and thoughts, and thus texts often contain euphemism, disguise, and even metaphor, irony, and other rhetorical devices. This phenomenon is especially obvious in Chinese texts and presents many special difficulties for text mining. Many machine learning methods that achieve good results in other fields, such as image segmentation and speech recognition, are often difficult to apply to natural language processing. The main difficulties confronted in text mining include the following aspects.

(1) Noise and ill-formed expressions present great challenges to NLP

Natural language processing is usually the first step in text mining.
The main data source for text mining is the Internet, but compared with formal publications (such as newspapers, literary works, political and academic publications, and formal news articles broadcast by national and local government television and radio stations), online text contains a large number of ill-formed expressions. According to a random sampling survey of Internet news texts conducted by Zong (2013), the average length of Chinese words on the Internet is approximately 1.68 Chinese characters, and the average length of sentences is 47.3 Chinese characters, both shorter than the corresponding word and sentence lengths in normal written text. Relatively speaking, colloquial and even ill-formed expressions are widely used in online texts. This phenomenon is especially common in online chatting, where phrases such as "up the wall," "raining cats and dog," and so on can be found. The following is a typical microblog message:

//@XXXX://@YYYYY: Congratulations to the first prospective members of the Class of 2023 offered admission today under Stanford's restrictive early action program. https://stanford.io/2E7cfGF#Stanford2023

The above microblog message contains some special expressions. Such noise and ill-formed language phenomena greatly reduce the performance of natural language processing systems. For example, a Chinese word segmentation (CWS) system trained on corpora of normal texts such as the People's Daily and the Xinhua Daily can usually achieve an accuracy rate of more than 95%, even as high as 98%, but its performance on online text immediately drops below 90%.
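The segmentation scores quoted above and below are typically computed as precision, recall, and F1 over the word boundaries a segmenter predicts against a gold-standard segmentation. A minimal sketch of that computation follows; the example sentence and its two segmentations are invented toy data.

```python
def to_spans(words):
    """Convert a segmented word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def segmentation_f1(gold_words, pred_words):
    """F1 over word spans: a predicted word is correct only if both boundaries match."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy example: a gold segmentation vs. a prediction that merges the last two words.
gold = ["我们", "喜欢", "自然", "语言"]  # "we / like / natural / language"
pred = ["我们", "喜欢", "自然语言"]      # merges "natural" and "language"
print(round(segmentation_f1(gold, pred), 3))  # → 0.571
```

Two of the three predicted words match gold spans, giving precision 2/3 and recall 2/4, hence F1 ≈ 0.571; the figures cited in the text are this same metric averaged over large test sets.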
According to the experimental results of Zhang (2014), using a character-based Chinese word segmentation method with a maximum entropy (ME) classifier, even when the dictionary size is increased to more than 1.75 million entries (including common words and online terms), the word segmentation performance on microblog text, as measured by the F1-measure, only reaches approximately 90%. Usually, a Chinese parser can reach approximately 87% or more on normal text, but on online text its performance decreases by an average of 13 percentage points (Petrov and McDonald 2012). The online texts referred to by these figures are ordinary texts on the Internet and do not include the dialogues and chats found on microblogs, Twitter, or WeChat.

(2) Ambiguous expression and concealment of text semantics

Ambiguous expressions are common phenomena in natural language texts. For example, the word "bank" may refer to a financial bank or a river bank, and the word "Apple" may refer to the fruit or to a product such as an Apple iPhone or an Apple computer (a Mac or Macintosh). Many phenomena of syntactic ambiguity also exist. For example, the Chinese phrase "关于(guanyu, about)鲁迅(Lu Xun, a famous Chinese writer)的(de, auxiliary word)文章(wenzhang, articles)" can be understood as "关于【鲁迅的文章】(about Lu Xun's articles)" or "【关于鲁迅】的文章 (articles about Lu Xun)." Similarly, the English sentence "I saw a boy with a telescope" may be understood as "I saw [a boy with a telescope]," meaning I saw a boy who had a telescope, or "[I saw a boy] with a telescope," meaning I saw a boy by using a telescope. The correct parsing of such ambiguous expressions is a very challenging task in NLP.
However, regrettably, there are no fully effective methods to address these problems, and a large number of intentionally created "special expressions/tokens," such as the Chinese "words" "木有 (no)," "坑爹 (cheating)," and "奥特 (out/out-of-date)" and the English words "L8er (later)," "Adorbs (adorable)," and "TL;DR (too long, didn't read)," appear routinely in online dialogue texts. Sometimes, to avoid directly identifying certain events or persons, a speaker will deliberately phrase a sentence indirectly, for example, asking "May I know the age of the ex-wife of X's father's son?"

Consider the following news report:

Mr. Smith, who had been a policeman for more than 20 years, had experienced a multitude of hardships, had numerous achievements, and had been praised as a hero of solitary courage. However, no one ever thought that such a steely hero, who had made addicted users frightened and filled them with trepidation, had gone on a perilous journey for a small profit and shot himself at home last night in hatred.

For most readers, it is easy to understand the incident reported by this news item without much thought. However, if someone asks a text mining system the questions "What kind of policeman is Mr. Smith?" and "Is he dead?" based on this news, it will be difficult for any current system to give correct answers. The news story never directly states what kind of policeman Mr. Smith is but uses "addicted users" to hint to readers that he is an antidrug policeman, and uses "shot himself" to indicate that he committed suicide. This kind of information hidden in the text can only be mined by technology with deep understanding and reasoning ability, which is very difficult to achieve.

(3) Difficult collection and annotation of samples

At present, the mainstream text mining methods are machine learning methods based on large-scale datasets, including traditional statistical machine learning methods and deep learning (DL) methods.
These require a large collection of labeled training samples, but it is generally very difficult to collect and annotate such large-scale samples. On the one hand, much online content is difficult to obtain because of copyright or privacy issues, which prohibit its publication and sharing. On the other hand, even when data are easy to obtain, processing them is time-consuming and laborious because they often contain considerable noise and garbled characters, lack a uniform format, and have no standard criteria for annotation. In addition, the data usually belong to a specific field, and help from experts in that domain is necessary for annotation; without such help, it is impossible to annotate the data with high quality. If the field changes, the work of data collection, processing, and annotation has to start again, and many ill-formed language phenomena (including new online words, terms, and ungrammatical expressions) vary across domains and over time, which greatly limits the expansion of data scale and hampers the development of text mining technology.

(4) Difficulty of expressing the purpose and requirements of text mining

Text mining is unlike other theoretical problems in which an objective function is clearly established and an ideal answer is then obtained by optimizing the function and solving for the extremum. In many cases, we do not know what the results of text mining will be or how to use mathematical models to describe the expected results and conditions clearly. For example, we can extract frequently used "hot" words that represent the themes and stories of a set of texts, but how to organize them into story outlines (summaries) expressed in fluent natural language is not an easy task.
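The "hot word" extraction step just mentioned is the easy half of the problem and can be approximated by simple frequency counting after stopword removal. The sketch below uses an invented stopword list and toy news snippets; it yields the theme words but, as the text notes, says nothing about how to weave them into a fluent summary.

```python
from collections import Counter

# A tiny invented stopword list; real systems use large, language-specific lists.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}

def hot_words(texts, k=3):
    """Return the k most frequent non-stopword tokens across a text collection."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]

# Hypothetical toy reports about one news story.
reports = [
    "the flood in the south is the top story",
    "rescue teams respond to the flood",
    "the flood damages roads in the south",
]
print(hot_words(reports))  # the two most frequent are 'flood' and 'south'
```

Frequency counting recovers the theme ("flood," "south"), but turning such a word list into a readable outline is exactly the open problem described above.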
As another example, we know that there are some regular patterns and correlations hidden in many medical cases, but we do not know what those patterns and correlations are or how to describe them.

(5) Lack of intelligent methods for semantic representation and computation

Effectively constructing semantic computing models is a fundamental challenge that has long puzzled the fields that use NLP. Since the emergence of deep learning methods, word vector representations and various computing methods based on word vectors have played an important role in NLP. However, semantics in natural language are unlike pixels in images, which can be accurately represented by coordinates and grayscale values. Linguists, computational linguists, and scholars engaged in artificial intelligence research have long paid close attention to the core issues of how to define and represent the semantic meanings of words and how to achieve compositional computation from lexical semantics up to phrase, sentence, and, ultimately, paragraph and discourse semantics. To date, there are no convincing, widely accepted, and effective models or methods for semantic computing. At present, most semantic computing methods, including many methods for word sense disambiguation, word sense induction based on topic models, and word vector composition, are computational methods based on statistical probability. In a sense, statistical methods are "gambling methods" that choose high-probability events: in many cases, the event with the highest probability becomes the final answer. In fact, this is somewhat arbitrary, subjective, or even wrong.
Since the model for computing probabilities is built on manually collected samples, the actual situation (the test set) may not be completely consistent with the labeled samples, which inevitably means that some small-probability events become "fishes escaping from the net." Therefore, the gambling method, which always decides by probability, can solve most problems that are easy to count but cannot address events that occur with small probability, are hard to find, and occur with low frequency. Those small-probability events are always difficult problems to solve; that is, they are the greatest "enemy" faced in text mining and NLP.

In summary, text mining is a comprehensive application technology that integrates the challenges of various fields, such as NLP, ML, and pattern classification, and is sometimes combined with technologies for processing graphics, images, videos, and so on. Although the theoretical system of this field has not yet been established, its application prospects are extremely broad: text mining will surely become a research hot spot and will grow rapidly with the development of related technologies.

1.4 Overview and Organization of This Book

As mentioned in Sect. 1.1, text mining belongs to a research field combining NLP, pattern classification, ML, and other related technologies. Therefore, the technical methods used in this field also change with the development and transition of related technologies.

Reviewing the history of its development, which covers more than half a century, text mining methods can be roughly divided into two types: knowledge engineering-based methods and statistical learning methods. Before the 1980s, text mining was mainly based on knowledge engineering, which was consistent with the historical track of rule-based NLP and the then-mainstream expert systems dominated by syntactic pattern recognition and logical reasoning.
The basic idea of this approach is that domain experts manually collect and design logical rules for the given texts based on their empirical knowledge and common sense, and the texts are then analyzed and mined by inference algorithms using the designed rules. The advantage of this approach is that it makes use of experts' experience and common sense, there is a clear basis for each inference step, and the results are well explained. However, the problem is that it requires extensive human resources to distill and summarize knowledge from experience, and the performance of the system is constrained by the expert knowledge base (rules, dictionaries, etc.). When the system needs to be transplanted to new fields and tasks, much of the experience-based knowledge cannot be reused, so considerable time is usually needed to rebuild the system. Since the late 1980s, and particularly after 1990, with the rapid development and broad application of statistical machine learning, text mining methods based on statistical machine learning have gained obvious advantages in accuracy and stability and do not consume the same level of human resources. Especially in the era of big data on the Internet, given massive texts, manual methods are clearly not comparable to statistical learning methods in speed, scale, or coverage when processing data. Therefore, statistical machine learning methods have gradually become the mainstream in this field. Deep learning methods, that is, neural network-based ML methods, which have emerged in recent years, belong to the same class of methods; both can also be referred to as data-driven methods.
However, statistical learning methods also have their own defects; for example, supervised machine learning methods require many manually annotated samples, unsupervised models usually perform poorly, and the results of both supervised and unsupervised learning methods lack adequate interpretability. In general, knowledge engineering-based methods and statistical learning methods each have their own advantages and disadvantages. Therefore, in practical applications, system developers often combine the two, using the knowledge engineering-based method in some technical modules and the statistical learning method in others, so that the system achieves the strongest possible performance through the fusion of the two approaches. Considering technological maturity, knowledge engineering-based methods are relatively mature, and their performance ceiling is predictable. For statistical learning methods, with the continuous improvement of existing models and the continuous introduction of new ones, the performance of models and algorithms is gradually improving, and there is still great room for improvement, especially in large-scale data processing. Statistical learning methods are therefore in the ascendant. These are the reasons this book focuses on statistical learning methods.