Big data mining methods and research progress in financial texts
This is a mind map about big data mining methods and research progress in financial texts. The main contents include big data analysis of financial texts, document feature extraction, document representation, network crawling, and manual collection.
Edited at 2022-06-18 11:39:52
Big data mining methods and research progress in financial texts
Big data analysis of financial texts
1. Research on textual information disclosed by listed companies
The textual information disclosed by listed companies reflects corporate disclosure behavior. It can also be used to measure a company's financial and operating condition, and it conveys management's confidence in future development to shareholders and the stock market. Among these texts, financial reports, conference call transcripts and prospectuses have been studied extensively, with a focus on the readability, tone and similarity of the text.
1 Research on the readability of disclosure texts of listed companies
Corporate disclosure text that is more readable conveys company information to investors more effectively. Li (2008) used the fog index to measure the readability of financial reports and found that companies with more readable financial reports have more persistent earnings. Lehavy et al (2011) pointed out that improving the readability of corporate financial reports reduces the dispersion of analysts' earnings forecasts and improves their accuracy. Guay et al (2016) found that companies with less readable financial reports mitigate the negative impact of poor readability through voluntary disclosure. Lo et al (2017) found that managers strategically manipulate the readability of financial reports to mislead or influence investors' evaluation of corporate performance. Bushee et al (2018) decomposed the linguistic complexity of conference call text into an information component and an obfuscation component, and found that the information component reduces information asymmetry, while the obfuscation component exacerbates it.
2 Research on the tone of disclosure texts of listed companies
In research on tone, scholars argue that the tone of company disclosure texts can be used to predict corporate performance and changes in the stock market. Mayew et al (2015) found that the tone of the MD&A section of financial reports contains incremental information and can predict a company's bankruptcy probability and its ability to continue operating. Davis et al (2015) found that a manager's optimism affects the manager's tone during conference calls. Allee & DeAngelis (2015) examined tone dispersion in conference calls and found that the tone dispersion of managers is related to the company's future performance and to managers' decisions, and that it also affects how analysts and investors perceive the information. Bochkay et al (2020) developed an extreme tone dictionary and found that after managers use extreme words in conference calls, the company's stock trading volume increases significantly and the stock price reaction is stronger. Jiang et al (2019) constructed a manager sentiment indicator based on company financial reports and conference call texts, and pointed out that this indicator effectively predicts stock returns, with predictive power that exceeds commonly used macroeconomic variables and investor sentiment measures.
3 Research on similarity of disclosure texts of listed companies
Similarity is also an important feature of the textual information disclosed by companies. On the one hand, the relationship between different companies can be studied based on the similarity of their financial reports. For example, Hoberg & Phillips (2010) found that the more similar the product descriptions in the financial reports of two companies are, the higher the likelihood of a merger or acquisition between them, and the better the merger performance. Hoberg & Phillips (2016) constructed a company-specific set of product market competitors based on the similarity of the product description sections of different companies' financial reports, and on this basis formed a time-varying industry classification standard. On the other hand, the similarity of financial reports across companies and across periods also provides an opportunity to study templated (boilerplate) corporate disclosure behavior (Brown & Tucker, 2011; Lang & Stice-Lawrence, 2015).
4 Research on the semantic features of listed companies’ disclosure texts
There is also a series of studies on the semantic characteristics of corporate disclosure texts. For example, Buehlmaier & Whited (2018) used the naive Bayes method to construct a financing constraint index based on the MD&A section, and found that companies subject to financing constraints have higher stock returns. Bochkay et al (2019) studied the information disclosure style of corporate CEOs based on conference calls and pointed out that CEOs' forward-looking disclosure and optimism decline over their tenure, that externally hired and inexperienced CEOs are more inclined to disclose forward-looking information, and that young CEOs show a greater degree of optimism in their disclosure. Hanley & Hoberg (2019) combined the LDA model with Word2Vec technology to extract risk-related semantic topics from bank annual reports, studied them together with investors' trading patterns, and found that newly revealed risk signals in the financial industry help in supervising the stability of the financial market.
2. Research on Financial Media Reports
1 Research on media attention and media reports
Some studies examine the impact of media reports on the stock market based on the number of news items published by financial media and on whether the news is positive or negative. Hillert et al (2014) conducted a study based on about 2.2 million articles published by 45 newspapers in the United States and found that media reports intensify investor bias, and that the predictability of earnings is stronger in companies with high media attention. Baloria & Heese (2018) pointed out that Democratic Party-affiliated companies threatened by slanted coverage from FNC (Fox News Channel) withheld negative news before the election and released it after the election, which shows that companies avoid negative media coverage to protect their reputational capital. Frank & Sanati found that the stock market overreacts to good news and underreacts to bad news: stock prices reverse after a positive news shock, while a negative news shock causes stock price drift. In addition, there is also literature that studies the predictive effect of media sentiment on future housing prices (Soo, 2018).
2 Research on economic policy uncertainty
Media reports also contain information on economic policies. Baker et al (2016) used text mining technology to construct an economic policy uncertainty (EPU) index based on representative media reports in multiple major economies. This index describes economic policy uncertainty continuously and quantitatively. Subsequently, Gulen & Ion (2016) used this index to study the impact of economic policy uncertainty on corporate investment and found that macroeconomic policy uncertainty affects the financial decisions of individual firms and inhibits corporate investment. Bonaime et al (2018) studied the impact of economic policy uncertainty on corporate M&A based on this index and found that an increase in economic policy uncertainty reduces the value and number of M&A transactions.
3 Research on media bias, rumors and fake news
There is also a local bias in financial media reports. Gurun & Butler (2012) pointed out that the media use fewer negative words when reporting news about local companies, and that the reason is that local companies spend more on advertising. They also found that abnormal local media bias is strongly associated with corporate stock valuations. In addition, company rumors and fake news also circulate in the market. Ahern & Sosyura (2015) studied rumors about corporate mergers and acquisitions and found that the media are more inclined to publish rumors about newsworthy companies. Rumors published by the media compete for investors' limited attention, leading to overreaction and a subsequent reversal of stock prices. Kogan et al (2019) found that fake news increases the trading volume and price volatility of the stocks involved. Once the fake news is exposed, the impact of all news on the platform that carried it on stock trading volume and price volatility is reduced.
3. Research on social network texts
With the rise of social networks, scholars have begun to include social networks such as Weibo and stock forums within the scope of research in the field of finance. The relationship between text information in social networks and the stock market is the focus of research in this field.
1 Research on text sentiment in social networks
In an early study, Antweiler & Frank (2004) used posts in the Yahoo Finance online forum as the research object and found that an attention index measured by the number of posts effectively predicts stock returns and market volatility, and that the sentiment divergence of posts is positively related to stock trading volume in the same period. A study based on articles and comments on the stock forum Seeking Alpha found that the tone of articles and comments predicts a company's future stock returns. Cookson & Niessner (2020) studied posts and user information on the US stock forum StockTwits and found that about half of investor disagreement is caused by different investment philosophies, and that investor disagreement effectively predicts abnormal trading volume in the stock market.
2 Research on strategic information disclosure
Companies can also release information strategically through social network platforms. For example, Blankespoor et al (2014) found that links to news published by companies on Twitter reduce the bid-ask spread of the company's stock and increase trading depth. Research also shows that social networks, as a channel for information dissemination, improve investors' ability to obtain information and reduce their information search costs. Jung et al (2018) analyzed the information released on Twitter by S&P 1500 companies and found that companies reduce the number of Twitter posts when there is bad news. This strategic disclosure behavior is more pronounced for companies with less sophisticated investors and for companies whose social media audience is broader.
4. Research on Search Index
As Internet technology continues to develop, the Internet search index has become an effective indicator of investors' attention to stocks. Da et al (2011) obtained the weekly Google search index of individual stocks and used the search frequency of a stock to measure investor attention directly. The study pointed out that the search index measures investor attention in a more timely manner, and that an increase in the search index predicts a rise in stock prices over the next two weeks and a reversal within one year. Chi & Shanthikumar (2017) used search location to examine the attention of investors in different regions to stocks and found that investors have a "local bias", that is, they are more inclined to pay attention to local stocks. This "local bias" also affects the market's reaction to earnings announcements. In addition, existing literature has also studied the relationship between the search index of specific keywords and the stock market. For example, Da et al (2015) determined the historical relationship between the search volume of 118 words and market returns in the same period, selected the 30 most negative words as search keywords, and used them to construct the FEARS index. They found that the index predicts short-term return reversals in the stock market, temporary increases in volatility, and mutual fund flows from stock funds into bond funds.
5. Research on other English texts
Some scholars have conducted research on other texts. For example, De Franco et al (2015) used analyst reports from 2002 to 2009 and found that more readable analyst reports increase stock trading volume. Hwang & Kim (2017) analyzed the impact of the readability of annual shareholder reports disclosed by closed-end investment companies on company value. The study found that in a relatively opaque information environment, when a company publishes an annual shareholder report with poor readability, investors develop negative emotions such as skepticism, which causes the company to trade at a discount and reduces its value. Green et al (2019) used employees' ratings of employers on Glassdoor and found that these ratings are related to the company's sales growth and profitability, and predict unexpected earnings one quarter later. Huang (2018) found that consumer product reviews on Amazon (Amazon.com) predict company stock returns. Chen et al (2019) studied the value impact of financial technology innovation on the financial industry based on U.S. patent text data. Ryans (2019) used the naive Bayes text classification method, combined with the company's future financial restatements and asset impairments, to divide inquiry letters into restatement, non-restatement, impairment and non-impairment inquiry letters, and studied the correlation between inquiry letters and financial reporting quality. Bandiera et al (2020) used CEO diaries as research text, divided CEOs into "leaders" and "managers" through machine learning algorithms, and studied the relationship between CEO behavior and corporate performance.
Document feature extraction
Currently, document feature extraction in the field of finance mainly covers four aspects: text readability, text sentiment, semantic relevance and text similarity.
1 Text readability
The readability of a text reflects how difficult it is for readers to understand the information in it. When readability is low, investors have difficulty understanding the information conveyed by the text's author, which in turn affects their investment behavior. Li (2008) applied the fog index (fog) to text analysis and pointed out that the smaller the fog index, the more readable the annual report. In addition, scholars also use the number of words in the annual report (You & Zhang, 2009) and the size of the electronic file of the annual report (Loughran) to measure the readability of the annual report. Most previous studies used the fog index to measure text readability (Li, 2010b; Lehavy et al, 2011), but this method has some problems. For example, if the words of each sentence in a text are randomly reordered, the article becomes completely unintelligible, yet the fog index calculated for the original text and for the reordered text is exactly the same (Jones & Shoemaker, 1994). In addition, Loughran & McDonald point out that the fog index has limitations when measuring the readability of business texts. As Loughran & McDonald (2016) proposed, when measuring the readability of corporate disclosure it is difficult to separate corporate complexity from annual report readability. A company with multiple businesses and a complex internal structure is likely to produce an annual report that is hard to read and understand simply because its business is complex. Therefore, when measuring the readability of a company's annual report, the complexity of the company's business should be taken into account.
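The fog index discussed above can be sketched in a few lines. This is a minimal illustration, assuming the standard Gunning fog definition, 0.4 × (average words per sentence + percentage of words with three or more syllables). The syllable counter is a rough vowel-group heuristic chosen for this sketch, and the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (a heuristic, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fog_index(text):
    """Gunning fog index: 0.4 * (words per sentence + percent of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # A word counts as "complex" here when the heuristic sees 3 or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    percent_complex = 100.0 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + percent_complex)

original = "Management anticipates considerable uncertainty. The company reported higher revenue."
shuffled = "uncertainty considerable anticipates Management. revenue higher reported company The."
print(fog_index(original) == fog_index(shuffled))
```

Because the index depends only on sentence length and word length, the original and the word-shuffled text receive the same score, which is the weakness noted by Jones & Shoemaker (1994).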
2 Text sentiment
3 Semantic relevance
Semantic relevance is the process of identifying the semantic features of a text based on a certain type of words. Specifically, a vocabulary list is first constructed from a certain type of keywords, then the frequency of the listed words in the document is calculated, and finally the semantic features of the text that relate to the keywords are identified. For example, Loughran et al (2009) identified companies' use of ethics-related terms based on keywords such as "corporate responsibility", "social responsibility" and "socially responsible". In addition, scholars can also use word embedding technology to handle semantic relevance based on the distance between word vectors in space (i.e., the similarity of syntax and semantics). For example, Li et al (2020) used Word2Vec technology to expand the keywords for different types of corporate culture.
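The keyword-list approach can be sketched as follows. This is a minimal illustration: the function name, the sample sentence and the choice to normalize by the total word count are assumptions of this sketch, not details taken from the cited studies.

```python
import re

def keyword_frequency(text, keywords):
    """Count occurrences of each keyword phrase and the hit rate per word of text."""
    lowered = text.lower()
    total_words = len(re.findall(r"[a-z]+", lowered))
    counts = {}
    for phrase in keywords:
        # Word boundaries keep "social responsibility" from matching inside longer tokens.
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        counts[phrase] = len(re.findall(pattern, lowered))
    rate = sum(counts.values()) / total_words if total_words else 0.0
    return counts, rate

ETHICS_TERMS = ["corporate responsibility", "social responsibility", "socially responsible"]
sample = "Our corporate responsibility report covers social responsibility."
counts, rate = keyword_frequency(sample, ETHICS_TERMS)
print(counts, round(rate, 3))
```

A document with a higher rate uses the listed vocabulary more intensively. Whether to normalize by word count, sentence count or not at all is a design choice that depends on the study.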
4 Text similarity
At present, many scholars use cosine similarity to measure the similarity of financial reports (Hoberg & Phillips, 2010, 2016; Wang Xiongyuan et al., 2018) and of patent texts (Kelly et al., 2018). Assume that the text vectors corresponding to texts $d_1$ and $d_2$ are $a = (w_{a1}, w_{a2}, \ldots, w_{an})$ and $b = (w_{b1}, w_{b2}, \ldots, w_{bn})$. Then the cosine similarity of texts $d_1$ and $d_2$ is calculated as follows:
$$\cos(d_1, d_2) = \frac{\sum_{i=1}^{n} w_{ai}\, w_{bi}}{\sqrt{\sum_{i=1}^{n} w_{ai}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{bi}^{2}}}$$
Here $n$ is the number of features, and $w_{ai}$ and $w_{bi}$ are the weights of feature $i$ in the two texts. When the weights are non-negative, the value of this formula lies between 0 and 1. The larger the value, the greater the similarity between the documents.
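As a small worked example of the cosine similarity described above, the sketch below uses raw word counts as the feature weights. The function name and the choice of raw counts (rather than, for instance, TF-IDF weights) are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists, using raw word counts as weights."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only features present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = "we sell cloud software to banks".split()
doc2 = "we sell cloud hardware to retailers".split()
print(cosine_similarity(doc1, doc2))
```

Texts that share no words score 0 and identical texts score 1, so the measure can be compared across document pairs of different lengths.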
summary
Summarizing the steps and methods of big data mining of financial texts shows that the main difference between big data analysis of Chinese and English texts lies in the text preprocessing process. Compared with English text big data, the preprocessing of Chinese text big data is more complicated. Before extracting features from Chinese text big data, researchers need to perform word segmentation. Moreover, Chinese words do not have clear formal rules for parts of speech, so part-of-speech identification relies mainly on the grammar and semantics of the text. In terms of text readability, although most scholars refer to the fog index when constructing Chinese readability indicators, they use different dictionaries of common words or complex words, so the resulting readability indicators are not the same.
Document representation
Text data is sparse, high-dimensional data that is difficult for computers to process. Therefore, after preprocessing the text data, it is necessary to represent the information in the document in a specific way to facilitate further analysis by researchers or computers. Document representation methods mainly include the word cloud, the bag-of-words (BOW) model, word embedding and topic models.
1 word cloud
Word cloud is a visualization technology for text big data. Text visualization refers to converting the more complex content and patterns in the text into visual symbols, which enables people to use their innate visual perception to quickly obtain the key information contained in the text. Word cloud technology can describe the frequency of words in text. When words appear more frequently, they will be presented in a larger and eye-catching form.
2 bag-of-words model
The bag-of-words model is a text vectorization method based on the assumption that the order of words in a text is not important. It treats a text as a collection of words and only counts the number of occurrences of each word. The model mainly includes the one-hot representation and the term frequency-inverse document frequency (TF-IDF) method. The one-hot representation is simple to operate. Suppose there are two documents, "Application of text big data in economics" and "Application of text big data in finance". Based on these two documents, the following vocabulary list can be constructed: ["text", "big data", "in", "economics", "finance", "in", "of", "application"] (the two "in" entries stand for two different words in the original Chinese titles). Encoding the documents in this order gives the bag-of-words vectors [1, 1, 1, 1, 0, 1, 1, 1] and [1, 1, 1, 0, 1, 1, 1, 1], where "1" and "0" indicate whether or not the word occurs in the document. However, not every word has the same chance of appearing in a document. In most texts only a handful of words are used frequently, and the vast majority are used rarely. Therefore, it is necessary to assign a weight to each word to better represent its role in the document. Loughran & McDonald (2011) used the TF-IDF method to calculate the weight of specific words in a document. Its basic formulas are as follows:
$$idf_i = \log\frac{N}{df_i} \qquad (1)$$
$$tf\text{-}idf_{i,j} = \begin{cases} \dfrac{1 + \log(tf_{i,j})}{1 + \log(a_j)} \cdot idf_i, & tf_{i,j} \geq 1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
In formula (1), $df_i$ is the number of documents containing word $i$, $N$ is the total number of documents in the collection, and $idf_i$ is the inverse document frequency. In formula (2), $tf_{i,j}$ is the total number of occurrences of word $i$ in the $j$-th document, $a_j$ is the number of words contained in the $j$-th document, and $tf\text{-}idf_{i,j}$ is the weight of word $i$ in the $j$-th document.
However, the bag-of-words model has the following problems. First, it ignores the order of words in the document and the semantic relationships between words, which may cause ambiguity. Second, the dimension of the vector depends on the number of words in the documents, so when the number of words is too large, the curse of dimensionality is likely to occur.
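The one-hot representation and the word weighting can be sketched as below. This is an illustration under stated assumptions: the two documents are simplified English token lists, and the weights assume the log-scaled form idf_i = log(N / df_i) and tf-idf_{i,j} = (1 + log tf_{i,j}) / (1 + log a_j) × idf_i, which matches the symbols df_i, N, tf_{i,j} and a_j defined above. All function names are hypothetical.

```python
import math

def build_vocabulary(docs):
    """Collect distinct tokens in order of first appearance."""
    vocab = []
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab.append(token)
    return vocab

def one_hot_vector(doc, vocab):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(doc)
    return [1 if token in present else 0 for token in vocab]

def tf_idf_weights(docs, vocab):
    """Log-scaled TF-IDF weights (assumed form, see the text before this block)."""
    n_docs = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    weights = []
    for doc in docs:
        a_j = len(doc)  # number of words in document j
        row = []
        for t in vocab:
            tf = doc.count(t)
            if tf == 0:
                row.append(0.0)
            else:
                idf = math.log(n_docs / df[t])
                row.append((1 + math.log(tf)) / (1 + math.log(a_j)) * idf)
        weights.append(row)
    return weights

docs = [
    ["text", "big data", "in", "economics", "application"],
    ["text", "big data", "in", "finance", "application"],
]
vocab = build_vocabulary(docs)
vectors = [one_hot_vector(d, vocab) for d in docs]
weights = tf_idf_weights(docs, vocab)
print(vocab)
print(vectors)
```

Words that occur in every document, such as "text" here, receive a weight of zero because their inverse document frequency is log(1) = 0. Only "economics" and "finance" distinguish the two documents.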
3 word embeddings
Word embedding is a technique that maps a high-dimensional space, whose dimension equals the number of all words, into a low-dimensional continuous vector space. Through word embedding, words are mapped to vectors in this low-dimensional space, and the distances and positions of the vectors represent the context of words in the document, their grammatical and semantic similarities, and their relationships with other words. In financial text analysis, Word2Vec is a commonly used word embedding technology, which includes the CBOW (continuous bag of words) and Skip-Gram neural network models. Through training, the neural network captures more contextual information between words and maps each word into a lower-dimensional, dense vector that contains more semantic information (Mikolov et al, 2013). In Word2Vec, word embedding vectors can capture analogical relationships between different words. The most classic example is "king - queen = man - woman", as shown in Figure 2.
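The analogy arithmetic can be illustrated with a toy example. The four-dimensional vectors below are hand-made for this sketch (they are not trained Word2Vec embeddings), and the function names are hypothetical. The point is only to show how vector differences encode a relationship.

```python
import math

# Hand-made toy vectors (assumed dimensions: royalty, male, female, fruit).
# Real embeddings are learned from a corpus and have far more dimensions.
TOY_VECTORS = {
    "king": [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))
```

With trained embeddings the match is approximate rather than exact, which is why the nearest neighbor by cosine similarity is taken instead of testing for equality.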
4 topic model
The most commonly used topic model is the LDA (latent Dirichlet allocation) model (Blei et al, 2003). The LDA model is an unsupervised machine learning method for extracting topic information from large-scale corpora. It assumes that document generation consists of two steps. In the first step, assuming that each document has a corresponding topic distribution, a topic is drawn from the topic distribution of the document. In the second step, assuming that each topic has a corresponding word distribution, a word is drawn from the word distribution of the topic chosen in the previous step. By iteratively fitting these two steps to each word in the document, the topic distribution of each document and the word distribution of each topic can be obtained. Huang et al (2018) pointed out that the LDA model has the following advantages. First, the model overcomes the limitations of manual coding and can classify a large number of text documents. Second, the LDA model provides reliable and replicable topic classification, which removes the subjectivity of manual text classification. Finally, the LDA model does not require researchers to specify classification rules and keywords in advance. However, the limitation of this model is that the number of topics must be preset, which introduces subjective human judgment into the choice of the number of topics. This in turn affects the generation of topics and the topic classification of texts.
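The two-step generative assumption can be sketched directly. This block only simulates how LDA assumes a document is produced. It does not fit a model, and the topic and word probabilities are invented for illustration.

```python
import random

# Hypothetical word distributions for two topics (values invented for this sketch).
TOPIC_WORDS = [
    {"loan": 0.5, "credit": 0.5},
    {"merger": 0.6, "acquisition": 0.4},
]

def generate_document(topic_dist, topic_word_dists, length, seed=0):
    """Simulate the LDA generative story for one document."""
    rng = random.Random(seed)
    topics = list(range(len(topic_dist)))
    words = []
    for _ in range(length):
        # Step 1: draw a topic from the document's topic distribution.
        topic = rng.choices(topics, weights=topic_dist, k=1)[0]
        # Step 2: draw a word from the chosen topic's word distribution.
        vocab = list(topic_word_dists[topic].keys())
        probs = list(topic_word_dists[topic].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words

print(generate_document([0.7, 0.3], TOPIC_WORDS, length=10))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the topic distribution of each document and the word distribution of each topic, with the number of topics fixed in advance by the researcher.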
web scraping
Because the volume of text is growing and text big data is difficult to obtain, most scholars use programming languages to crawl text big data directly from the web (Loughran & McDonald, 2014; Blankespoor et al, 2014). On the one hand, this method obtains text information in a timely manner. On the other hand, the format and content of the text can be organized programmatically for further analysis.
Hand collected
Disadvantages: This process requires a lot of time and labor costs
Text preprocessing
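After obtaining the corpus, researchers need to preprocess the text. This process mainly consists of five steps: document parsing, text positioning and data cleaning, word segmentation, part-of-speech tagging, and stop word removal.
1 Document parsing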
Under the information disclosure regulatory system, companies are required to publicly release electronic documents containing relevant information, either regularly or on an ad hoc basis. However, these documents are stored electronically only so that readers can read them on electronic devices. This does not mean that machines can process them automatically, that is, that they are "machine readable". In the computer field, electronic disclosure documents are collectively referred to as rich-format documents. These documents contain text paragraphs, tables, charts and other content modalities. They are usually organized into a hierarchical directory structure and are presented to readers through polished typesetting and formatting. In terms of format, the vast majority of disclosure documents required by the financial market are PDF files. Therefore, parsing rich-format documents to obtain the information inside them is often the first step in text preprocessing. Two aspects need attention when parsing PDF documents. On the one hand, generating a PDF document is not a reversible process. When a document is exported to PDF from an editor such as Word or Excel, the layout and other aspects of the visual presentation are preserved, but the structural information within the document is partially or completely lost. On the other hand, the parsed document is the basis for text analysis, and inaccurate PDF parsing may seriously affect subsequent analysis. Therefore, for financial text big data, document structure parsing tools must be selected carefully.
2 Text positioning and data cleaning
On the one hand, researchers need to use computer programs to locate text information. For example, the MD&A section is the research object of many scholars. Researchers can use regular expressions to locate the beginning and end of the MD&A section in the text of financial reports, and then extract the content of that section. On the other hand, researchers also need to clean the text by deleting content that is considered noise (Jiang et al, 2019), mainly advertisements, code such as hypertext markup language (HTML) and scripting language (JavaScript), and pictures.
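A minimal sketch of both steps follows. The section labels "Item 7", "Item 7A" and "Item 8" are an assumption based on the usual layout of U.S. annual reports and are not given in the text above. Real filings vary in wording and often repeat these headings in a table of contents, so the pattern would need adjusting in practice. The function names are hypothetical.

```python
import re

# Assumed section headings (typical U.S. annual report layout, not from the source text).
MDA_PATTERN = re.compile(
    r"item\s+7\.\s*management['’]s\s+discussion\s+and\s+analysis"
    r"(.*?)"
    r"(?=item\s+7a\.|item\s+8\.)",
    re.IGNORECASE | re.DOTALL,
)

def extract_mda(report_text):
    """Return the text between the MD&A heading and the next section heading, or None."""
    match = MDA_PATTERN.search(report_text)
    return match.group(1).strip() if match else None

def strip_html(raw):
    """Remove script/style blocks and HTML tags, then collapse whitespace."""
    no_scripts = re.sub(r"<(script|style)\b.*?</\1>", " ", raw,
                        flags=re.IGNORECASE | re.DOTALL)
    no_tags = re.sub(r"<[^>]+>", " ", no_scripts)
    return re.sub(r"\s+", " ", no_tags).strip()

report = (
    "Item 6. Selected Data. Item 7. Management's Discussion and Analysis "
    "Revenue grew. Item 7A. Quantitative Disclosures."
)
print(extract_mda(report))
print(strip_html("<p>Sales <b>rose</b></p><script>var x = 1;</script>"))
```

Regular expressions are adequate for simple markup like this. For heavily nested or malformed HTML a dedicated parser is usually the safer choice.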
3. Word segmentation of text
In English text, word segmentation is completed automatically because words are separated by spaces. In addition, words can be further normalized through lemmatization and stemming. However, there are no spaces between Chinese characters, and words are the smallest language units that can be used independently. Therefore, researchers need to perform specialized word segmentation on Chinese texts. At present, most scholars use the open source Python module "jieba" to segment corporate financial reports, annual performance briefings and stock forum posts. There are three difficulties in Chinese word segmentation: segmentation granularity, identification of ambiguous words and identification of new words. If the segmentation granularity is too fine, the meaning of words is easily destroyed. For example, "machine learning" is easily split into "machine" and "learning". For ambiguous words, an appropriate segmentation mode should be selected. For example, when using "jieba", the precise segmentation mode should be selected to improve accuracy. For new words (such as company names, product names and the names of key people), users can supply a custom dictionary to help the segmentation software identify them.
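The effect of segmentation granularity and of a custom dictionary can be illustrated with a small dictionary-based segmenter. This sketch uses forward maximum matching, a simple technique swapped in for illustration. It is not the algorithm that "jieba" uses, and the dictionaries are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

BASE_DICT = {"机器", "学习", "改变", "金融"}   # machine, learning, change, finance
CUSTOM_DICT = BASE_DICT | {"机器学习"}          # adds "machine learning" as one word

sentence = "机器学习改变金融"  # "machine learning changes finance"
print(forward_max_match(sentence, BASE_DICT))
print(forward_max_match(sentence, CUSTOM_DICT))
```

With the base dictionary "机器学习" is split into "机器" and "学习". After the term is added to the dictionary it is kept as one token, which is the role a user-defined dictionary plays in a production segmenter.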
4 part-of-speech tagging
Part of speech (noun, verb, connective and so on) is an important grammatical feature for identifying semantic information. Part-of-speech tagging marks the part of speech of each word after segmentation. Through part-of-speech tagging, computers can identify word types, resolve word ambiguities and then identify grammatical structures, which reduces the difficulty of semantic analysis. Part-of-speech tagging differs considerably between Chinese and English. English words follow stricter part-of-speech classification, and changes in part of speech show up as changes in word suffixes. For example, "-ing", "-ness" and "-ment" all provide specific cues for determining part of speech. However, Chinese words do not have clear formal rules for part of speech, and identification relies mainly on grammar and semantics, that is, "English emphasizes form combination, and Chinese emphasizes meaning combination."
5 stop word removal
To improve the accuracy of text mining, stop words in the text also need to be removed. Stop words are words that are important to the grammatical structure of a sentence but convey little meaning by themselves. They increase the dimensionality of text data and the cost of text analysis. In English texts, stop words mainly include articles ("the", "a"), conjunctions ("and", "or") and the verb "to be" (Gentzkow et al, 2019). In Chinese texts, however, stop words should be determined according to Chinese language habits. In addition to punctuation marks and special symbols, they also include connectives that express logical relationships ("and", "however", etc.) and slang. Stop words also need to be chosen according to the content of the study. For example, when studying the emotion of a text, retaining modal particles and specific punctuation marks helps measure the emotional intensity of the text.
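Stop word removal with a study-specific exception list can be sketched as follows. The stop word list is a short illustrative sample rather than a standard list, and the function name and `keep` parameter are hypothetical.

```python
import re

# Small illustrative list: articles, conjunctions, forms of "to be", and punctuation.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "was", "were", "be", "!", "?"}

def remove_stop_words(text, stop_words=STOP_WORDS, keep=()):
    """Tokenize into words and !/? marks, then drop stop words not listed in keep."""
    tokens = re.findall(r"[a-z]+|[!?]", text.lower())
    keep = set(keep)
    return [t for t in tokens if t in keep or t not in stop_words]

sentence = "The results are strong and the outlook is great!"
print(remove_stop_words(sentence))
print(remove_stop_words(sentence, keep={"!"}))
```

The second call keeps the exclamation mark, in line with the point that a sentiment study may want to retain punctuation that signals emotional intensity.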
Corpus acquisition
Research Background
In traditional empirical research in finance, the data used are mostly limited to structured data such as financial statements and stock market data. In the era of big data, advances in computer technology have continuously enriched data types. Researchers have begun to introduce unstructured text big data into finance research, mainly including disclosure texts of listed companies, financial media reports, social network texts, Internet search indexes, and P2P online lending texts, and to study the readability, tone, similarity, and semantic features of these texts. Such unstructured data accounts for a large proportion of companies' external disclosure and of stock market information, and its forms of transmission and expression are more diverse. By mining and analyzing these text sources, researchers can study the disclosure behavior behind the text, the emotion and tone of the text, and the market reaction to the text information, thereby providing richer research content and research perspectives for the field of finance.
Research motivation
Previous literature focused on introducing the main methods of text analysis but lacked a detailed introduction to the steps and methods of text big data mining.
This article introduces the steps and methods of text big data mining in detail, describing text corpus acquisition, preprocessing, document representation, and document feature extraction.
In the era of big data, the continuous improvement of computer technology has made data types richer. Text data has become data that computers can interpret and analyze, which makes it possible to study economic phenomena in non-traditional fields. Especially in China's high-context communication environment of "listen to words, listen to gongs, listen to sounds", text big data has high research value in the field of finance.
Secondly, this article introduces the main text information sources used for big data mining in the domestic and foreign financial literature, and sorts out domestic and foreign research progress on financial text big data according to these sources, in order to grasp the current research directions and key areas of text big data in the field of finance.
Among domestic researchers, Tang Guohao et al. (2016) compiled the domestic and foreign research progress of behavioral finance based on text sentiment analysis and summarized the main text analysis methods. Shen Yan et al. (2018) reviewed the application of English text big data analysis in economics and finance, supplemented by the Chinese text literature. Zhang Xueyong and Wu Yuling (2018) focused on foreign literature and sorted out research that uses network big data mining technology to analyze investor psychology and behavior in the field of asset pricing, covering four aspects: online news data, search engine data, social network data, and online forum data.
Based on the current research, future research prospects are proposed, hoping to help domestic researchers further expand the application of text big data in the fields of finance and economics.
Research status at home and abroad
Text analysis research has a long history. Jones & Shoemaker (1994) and Cole & Jones (2005) reviewed the literature on accounting text content and on management discussion and analysis (MD&A), respectively. Subsequently, Li (2010a) focused on large-sample text analysis in computational linguistics, natural language processing, and statistics, and surveyed research on corporate disclosure texts by theme. Later, Loughran & McDonald (2016) surveyed the text analysis literature and related methods in foreign accounting and finance. Guo et al. (2016) summarized the application of machine learning methods in big data analysis of financial text. Gentzkow et al. (2019) described the analysis methods of text big data and their applications in economics. Cong et al. (2019) described typical English text sources in financial markets and discussed the application of neural network models and generative statistical models in text analysis. Among domestic reviews of text analysis research, Tang Guohao et al. (2016) compiled the domestic and foreign research progress of behavioral finance based on text sentiment analysis and summarized the main text analysis methods. Shen Yan et al. (2018) reviewed the application of English text big data analysis in economics and finance, supplemented by the Chinese text literature. Zhang Xueyong and Wu Yuling (2018) focused mainly on foreign literature and sorted out research that uses network big data mining technology to analyze investor psychology and behavior in the field of asset pricing, covering four aspects: online news data, search engine data, social network data, and online forum data.
Research motivation
Prospects for Big Data Research in Finance Texts
This article summarizes and introduces the steps and methods of big data mining of financial texts, describing the text corpus acquisition, preprocessing process, document representation, and extraction of document features. In addition, this article sorts out the research content of domestic and foreign financial text big data based on different sources of text information. Judging from the existing literature, the mining and analysis of text big data is in a stage of vigorous development. This article believes that in the future, big data mining and analysis of financial texts can be further explored from the following aspects.
1 Enrich research content and develop more sources of textual information
The research content and information sources of text big data in the field of finance can be further refined and enriched. For example, for financial media reports, researchers can analyze not only the quantity and sentiment of media reports but also the type of event reported, thereby identifying reports related to corporate mergers and acquisitions, IPOs, financial fraud, personal news about executives, and so on, and studying the impact of different event types on companies and the stock market. For stock forums, researchers can build corporate networks based on investors' attention to companies and study the competitive relationships between companies. In terms of data sources, more text big data can be developed, for example WeChat public accounts, government work reports, State Council policy documents, court rulings, recruitment websites, performance revision announcements issued by enterprises, social responsibility assurance opinions, and internal control evaluation reports.
2 Use new text information extraction methods
At present, text analysis research in the field of finance still relies widely on the "bag of words" method, which cannot reflect contextual meaning. Yet the field of natural language processing (NLP) offers many newer analysis methods and tools that have not received enough attention in financial text analysis but have great potential. For example: (1) Named entity recognition (NER). NER is an important basic tool in NLP. It identifies named entities in the text to be processed, thereby extracting times, places, person names, institutions, currencies, percentages, and dates. Commonly used NER tools include Stanford NER. (2) Relation extraction. Supervised machine learning methods are usually used to extract the relationship between entity pairs from sentences containing those pairs and to analyze their co-occurrence. (3) Text summarization, which is the process of compressing text content using computer algorithms. The length of the summary depends on the compression rate. Cardinaels et al. (2018) pointed out that algorithm-based summaries are less positive than management's own disclosure summaries, and that algorithmic summaries lead investors to make more conservative estimates of corporate stock prices.
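Why the "bag of words" method cannot reflect contextual meaning can be shown with a short sketch. The whitespace tokenization and the two example sentences are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as unordered word counts (word order is discarded)."""
    return Counter(text.lower().split())


first = bag_of_words("profit rose while debt fell")
second = bag_of_words("debt rose while profit fell")

# The two sentences describe opposite situations, yet their
# bag-of-words representations are identical.
print(first == second)
```

Because the representation only counts words, any method built on it treats the two sentences as the same document.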
3 Introducing deep learning into text information research
Deep learning methods have developed rapidly in the field of NLP. Deep learning models mainly include convolutional neural networks (CNN), recurrent neural networks (RNN) and their variant the long short-term memory (LSTM) network, generative adversarial networks (GAN), reinforcement learning, and models such as BERT and XLNet that are currently popular in NLP. Introducing deep learning into text research will open up richer research content and improve the accuracy of text information extraction. For example, Cao et al. (2018) used a variant of the LSTM model to detect inconsistency errors in corporate disclosure texts. Deep learning differs substantially from traditional machine learning methods in feature representation and in the number of model parameters, as shown in Table 1. Heaton et al. (2016) pointed out that deep learning methods have the following advantages in financial research: first, the model considers as much data related to the prediction problem as possible; second, it can capture non-linear relationships among the input data and improve the in-sample fit; third, it can effectively avoid the over-fitting problem of shallow structures. In addition, once the training set grows to a certain size, the accuracy of information extraction by deep learning is significantly higher than that of traditional machine learning methods.
4 Construct a targeted Chinese sentiment dictionary
The dictionary method uses a preset dictionary to count the frequency of different categories of words in the text and combines these counts with an appropriate weighting method to extract text information. For sentiment classification of Chinese texts, however, the application of the dictionary method is still at an exploratory stage. Most scholars construct Chinese sentiment dictionaries by reference to existing English sentiment dictionaries and lexicons, so the resulting dictionaries lack the Chinese context. In addition, text information from different sources and categories has different characteristics of language use. For example, corporate annual reports contain many professional terms, while social network media use a great deal of slang and many emoticons (Yao Jiajian et al., 2019). Therefore, targeted Chinese sentiment dictionaries should be constructed for different kinds of Chinese text, and their content should be continuously verified and updated in future research.
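The counting step of the dictionary method can be sketched as follows. The word lists are hypothetical placeholders rather than entries from any published dictionary, and the equal weighting and the net tone formula (positive count minus negative count, divided by total words) are one common choice assumed here for illustration.

```python
from collections import Counter

# Hypothetical sentiment word lists, for illustration only.
POSITIVE = {"growth", "improve", "strong"}
NEGATIVE = {"loss", "decline", "risk"}


def net_tone(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Equal-weighted net tone: (positive - negative) / total word count."""
    counts = Counter(t.lower() for t in tokens)
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    total = sum(counts.values())
    return (pos - neg) / total if total else 0.0


tokens = "strong growth offset the loss in one segment".split()
score = net_tone(tokens)
print(score)
```

Swapping in a dictionary built for a specific text type (annual reports versus forum posts) changes only the two word sets, which is where the targeted construction discussed above enters.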
5 Improve text readability metrics
Currently, most scholars measure text readability with the Fog Index. However, word order and the logical relationships within the language are important factors affecting readability. A measure that considers only sentence length and the proportion of complex words, while ignoring word order and logic, cannot accurately capture how well readers understand the text. Newer readability indicators are now available. For example, the Bog Index in the StyleWriter software package (Bonsall IV et al., 2017) captures the plain-English features emphasized by linguists. Ren Hongda and Wang Kun (2018) pointed out that readability indicators measured with machine learning methods are more comprehensive and can also overcome natural language barriers. It is therefore foreseeable that more scholars will adopt new indicators in future research and will use machine learning or even deep learning models to construct more comprehensive and accurate readability indicators. Alternatively, statistics based on the tables in a document might yield more effective readability metrics. In previous research, tables in documents were generally deleted and only the content of text paragraphs was analyzed. However, the numerical information contained in tables is often more objective and easier to understand than textual information. As mentioned before, once more accurate document structure recognition technology has identified all tables in a document, the average number of tables per page, the relative proportion of numbers and text in the document, and the proportion of numbers appearing in tables versus text paragraphs can also serve as readability indicators.
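The Fog Index mentioned above combines exactly the two ingredients the text criticizes as insufficient: average sentence length and the share of complex words. A minimal sketch is given below, using the standard formula 0.4 × (words per sentence + 100 × complex words / words). The syllable counter is a rough vowel-group heuristic assumed for this example, and the usual exclusions (proper nouns, common suffixes) are ignored.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fog_index(text):
    """Gunning Fog Index: 0.4 * (avg sentence length + % complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # Complex words are approximated as words with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)


simple = "The cat sat. The dog ran."
dense = "The company reported higher revenue. Management expects continued profitability."
print(fog_index(simple), fog_index(dense))
```

Reordering the words within a sentence leaves the score unchanged, which illustrates why word order and logic are invisible to this measure.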
6 Improve the replicability of research
The unstructured nature of text big data makes converting it into structured data a complicated process. The conversion method affects the replicability of research, that is, whether others can reach consistent conclusions by following the research ideas and methods described in the article. Some scholars describe their text analysis methods and steps in detail in appendices to their papers (Huang et al., 2018; Yao Zhuang et al., 2019). In future text analysis, to improve replicability, authors should record in detail the document preprocessing process, the document representation, and the feature extraction methods. Whether they use the dictionary method or more complex machine learning and deep learning methods, researchers should disclose in detail the keywords, dictionaries, and specific ideas and algorithms that affect the research results.
Text big data mining steps and methods
Prospects for research methods and content of big data in finance texts
Text information sources in domestic research mainly include text information disclosed by listed companies (such as financial reports, performance briefing texts, and prospectuses), financial media reports, social network texts, Internet search indexes, and P2P online lending texts. The domestic literature studies corporate performance, the stock market, and the online lending market by extracting text features from these sources. Unlike in foreign research, the rise of domestic P2P online lending has made this type of text one of the focuses of domestic text analysis.
1 Research on textual information disclosed by listed companies
Regarding the annual reports of listed companies, Qiu Xinying et al. (2016) examined the relationship between annual report readability and analysts' interpretation of information. They found no significant relationship between annual report readability and analyst forecast quality, indicating that Chinese analysts fail to play their role of professionally interpreting information effectively. Zeng Qingsheng et al. (2018) studied the relationship between the tone of annual reports and insider trading after annual report disclosure, and found that corporate managers showed "duplicity" when preparing annual reports: a more positive annual report tone was accompanied by larger stock sales by managers. Wang Xiongyuan et al. (2018) studied the risk-related information in the MD&A section of annual reports and found that similarity in the content of the risk paragraphs across two consecutive annual reports can reduce audit fees. Regarding other texts disclosed by companies, Lin Le and Xie Deren (2016) pointed out that investors can identify the tone of management in the annual performance briefings of listed companies. Yan et al. (2019) found that an uncertain or negative tone in IPO prospectuses is significantly related to the volatility of initial and subsequent IPO stock returns, and reduces long-term stock returns.
2 Research on financial media reports
The China Securities Regulatory Commission stipulates that listed companies must publish important corporate information in the "seven newspapers and one publication": "Shanghai Securities News", "China Securities News", "Securities Times", "Financial Times", "Economic Daily", "China Reform News", "China Daily", and "Securities Market Weekly". In addition, China has online news media such as Baidu News, Sina Finance, and Hexun.com, which provide a wealth of media coverage for studying the Chinese stock market. You Jiaxing and Wu Jing (2012) used financial newspapers as research texts and found that the more extreme the media sentiment, whether high or low, the more serious the asset pricing bias. Wang Changyun and Wu Jiawei (2015) studied financial media reports and found that a decrease in the negative tone of the media increases the IPO underpricing rate, the proportion of IPO over-raised funds, and the proportion of underwriters' fees. Wang Jingyi and Huang Yiping (2018) studied the impact of online media sentiment on the online lending market. Media reports can also be divided into market-oriented and policy-oriented reports. Piotroski et al. (2017) pointed out that the conglomeration reform of Chinese media has made policy-oriented media reports focus more on political goals and market-oriented media reports focus more on commercial goals. You et al. (2018), studying the question from the perspective of information supervision, found that market-oriented media reports provide more information about enterprises than policy-oriented reports, and that only market-oriented media reports have a significant impact on corporate governance. In terms of measuring macroeconomic policy uncertainty, Huang & Luk (2020) used multiple newspapers in mainland China to construct a new, higher-frequency China EPU index. The study found that the new China EPU index can predict China's stock prices, employment, and output.
3 Research on social network texts
Based on Sina Weibo posts, He Xianjie et al. (2016) pointed out that the higher a company's level of corporate governance, the more likely it is to open a Weibo account and publish more company information. Stock forums such as the Oriental Fortune Stock Bar (Eastmoney Guba) and Snowball (Xueqiu) provide opportunities to study investor attention and investor sentiment in China's stock market. Based on posts in the Oriental Fortune Stock Bar, Huang et al. (2016) found that Chinese investors also exhibit a "local bias". This bias is especially pronounced for stocks of companies in underdeveloped areas, large companies, stocks outside the CSI 300 index, stocks with low trading volume, and stocks whose names indicate the location of the company. Sun Shuna and Sun Qian (2018) found that investor attention, measured from the self-selected stock lists of Snowball users, increases stock prices and trading volumes in the short term, but the effect gradually decays over time. In addition, Jiang, Liu & Yang (2019) showed that communication among investors in stock forums also affects stock returns.
4 Research on search index
In research on search indexes, some scholars have constructed investor attention indicators from Internet search indexes and studied the relationship between investor attention and asset pricing. For example, Yu Qingjin and Zhang Bing (2012) used the Baidu Index as an indicator of investor attention and found that investor attention exerts positive price pressure on current stock returns, but this pressure reverses in the short term. Other scholars use the search index of specific keywords. For example, Zeng Jianguang (2015) constructed an investor network security risk perception index based on the Baidu search index for "Yu'e Bao stolen". The study found that the stronger investors' perception of Internet security risks, the higher the risk compensation they require, and that the risk perception of mobile Internet investors is stronger than that of investors using computers.
5 Research on P2P online lending texts
In the context of China's financial reform and financial innovation, P2P online lending (peer-to-peer lending) has set off a new craze, and some scholars have studied the factors affecting the success rate of P2P online loans. Chen Xiao et al. (2018) found that readable loan descriptions convey positive information to investors and improve the success rate of borrowing. Peng Hongfeng and Lin Chuan (2018) analyzed the impact of the proportion of specific words in loan descriptions on the loan success rate. The study found that the proportions of positive tone words and financial words are positively related to the loan success rate, while the proportions of negative tone words, strong tone words, and weak tone words are negatively related to it. In addition, many scholars have studied loan interest rates, financing efficiency, and other topics based on P2P online lending texts.
In addition to the five textual data sources above, scholars have also extensively studied other Chinese texts such as analyst reports, annual report inquiry letters, and private meeting summary reports.