It is suitable for students, researchers and practitioners interested in text data mining both as a learning text and as a reference book. Professors can readily use it for classes on text data mining or NLP.


Text Data Mining offers a thorough and detailed introduction to the fundamental theories and methods of text data mining, ranging from pre-processing (for both Chinese and English texts), text representation, and feature selection to text classification and text clustering. It also presents the predominant applications of text data mining, for example, topic modeling, sentiment analysis and opinion mining, topic detection and tracking, information extraction, and automatic text summarization.

Preface

With the rapid development and popularization of Internet and mobile communication technologies, text data mining has attracted much attention. In particular, with the wide use of new technologies such as cloud computing, big data, and deep learning, text mining has begun playing an increasingly important role in many application fields, such as opinion mining and medical and financial data analysis, showing broad application prospects.

Although I was supervising graduate students studying text classification and automatic summarization more than ten years ago, I did not have a clear understanding of the overall concept of text data mining and only regarded the research topics as specific applications of natural language processing. Professor Jiawei Han’s book Data Mining: Concepts and Techniques, published by Elsevier, Professor Bing Liu’s Web Data Mining, published by Springer, and other books have greatly benefited me. Every time I listen to their talks and discuss these topics with them face to face, I benefit immensely. I was inspired to write this book for the course Text Data Mining, which I was invited to teach to graduate students of the University of Chinese Academy of Sciences. At the end of 2015, I accepted the invitation and began to prepare the content design and selection of materials for the course. I had to study a large number of related papers, books, and other materials and began to think seriously about the rich connotation and extension of the term Text Data Mining. After more than a year’s study, I started to compile the courseware. With teaching practice, the outline of the concept gradually formed.

Rui Xia and Jiajun Zhang, two talented young people, helped me materialize my original writing plan. Rui Xia received his master’s degree in 2007, was admitted to the Institute of Automation, Chinese Academy of Sciences, and studied for his Ph.D. degree under my supervision.
He was engaged in sentiment classification and took it as the research topic of his Ph.D. dissertation. After he received his Ph.D. degree in 2011, his interests extended to opinion mining, text clustering and classification, topic modeling, event detection and tracking, and other related topics. He has published a series of influential papers in the field of sentiment analysis and opinion mining. He received the ACL 2019 outstanding paper award, and his paper on ensemble learning for sentiment classification has been cited more than 600 times. Jiajun Zhang joined our institute after he graduated from university in 2006 and studied in my group in pursuit of his Ph.D. degree. He mainly engaged in machine translation research, but he performed well in many research topics, such as multilanguage automatic summarization, information extraction, and human–computer dialogue systems. Since 2016, he has been teaching some parts of the course on Natural Language Processing in cooperation with me, such as machine translation, automatic summarization, and text classification, at the University of Chinese Academy of Sciences; this course is very popular with students. With the solid theoretical foundation of these two talented researchers and their keen scientific insights, I am gratified that many cutting-edge technical methods and research results could be verified, put into practice, and included in this book.

It took more than three years, from early 2016 to June 2019, to publish the Chinese version of this book. In these three years, most of our holidays, weekends, and other spare time were devoted to the writing of this book. Enduring the numerous modifications and even rewrites was truly painful, but we were also very happy. We started to translate the Chinese version into English in the second half of 2019. Some more recent topics, including BERT (bidirectional encoder representations from transformers), have been added to the English version.
As an interdisciplinary field spanning natural language processing and machine learning, text data mining faces the challenges of both domains and has broad application to the Internet and mobile communication devices. The topics and techniques presented in this book are all technical foundations needed to develop such practical systems and have attracted much attention in recent years. We hope that this book will provide a comprehensive understanding for students, professors, and researchers in related areas. However, I must admit that due to the limitation of the authors’ ability and breadth of knowledge, as well as the lack of time and energy, there must be some omissions or mistakes in the book. We will be very grateful if readers provide criticism, corrections, and any suggestions.

Chengqing Zong
Beijing, China
20 May 2020

  • Chengqing Zong is a professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences. He has served as a chair for many prestigious conferences, such as ACL-IJCNLP, IJCAI, IJCAI-ECAI, AAAI, and COLING, and as an associate editor for prestigious journals, such as TALLIP and Machine Translation. He is the President of the Asian Federation of Natural Language Processing and a member of the International Committee on Computational Linguistics.

  Contents

    1 Introduction

    1.1 The Basic Concepts

    1.2 Main Tasks of Text Data Mining

    1.3 Existing Challenges in Text Data Mining

    1.4 Overview and Organization of This Book

    1.5 Further Reading

    2 Data Annotation and Preprocessing

    2.1 Data Acquisition

    2.2 Data Preprocessing

    2.3 Data Annotation

    2.4 Basic Tools of NLP

    2.4.1 Tokenization and POS Tagging

    2.4.2 Syntactic Parser

    2.4.3 N-gram Language Model

    2.5 Further Reading

    3 Text Representation

    3.1 Vector Space Model

    3.1.1 Basic Concepts

    3.1.2 Vector Space Construction

    3.1.3 Text Length Normalization

    3.1.4 Feature Engineering

    3.1.5 Other Text Representation

