3.1 Theoretical knowledge
The Boltzmann machine is a Markov random field based on an energy model. The energy model uses an energy function to establish the functional relationship between the energy of the system in a given state and the probability of that state occurring, and thereby measures the probability distribution of a stochastic network. From statistical mechanics, any probability distribution can be transformed into an energy-based model. The Boltzmann machine uses the existing properties and learning procedures of the energy model to perform unsupervised maximum-likelihood learning of the data's probability distribution. The RBM adopts a layered neural network structure that restricts the connectivity of the Boltzmann machine: neurons within a layer are not connected, while neurons in adjacent layers are fully connected. For this reason, the RBM is often used as the front-end module of a deep neural network, handling common tasks such as feature extraction and dimensionality reduction on the input data.
The RBM model consists of two layers, a visible layer and a hidden layer. As shown in Fig. 1, the lower layer is the visible layer, containing m neurons v1, v2, ..., vm, recorded as v = (v1, v2, ..., vm); the upper layer is the hidden layer, containing n neurons h1, h2, ..., hn, recorded as h = (h1, h2, ..., hn). The model parameters include the bias of the visible layer b = (b1, b2, ..., bm), the bias of the hidden layer c = (c1, c2, ..., cn), and the weight matrix W of size n × m on the edges between the visible layer and the hidden layer.
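To make the energy-based formulation concrete, the following is a minimal Python/numpy sketch of the standard RBM energy function E(v, h) = −b·v − c·h − hᵀWv defined by these parameters; the language and the layer sizes are illustrative choices, not taken from the paper:

```python
import numpy as np

# Illustrative sizes only: m visible units, n hidden units.
m, n = 4, 3
rng = np.random.default_rng(0)

b = rng.normal(size=m)        # visible-layer bias b = (b1, ..., bm)
c = rng.normal(size=n)        # hidden-layer bias c = (c1, ..., cn)
W = rng.normal(size=(n, m))   # weights on visible-hidden edges, shape n x m

def energy(v, h):
    """Standard RBM energy: E(v, h) = -b.v - c.h - h^T W v."""
    return -b @ v - c @ h - h @ W @ v

v = rng.integers(0, 2, size=m)  # binary visible state
h = rng.integers(0, 2, size=n)  # binary hidden state
print(energy(v, h))
```

Lower-energy configurations correspond to higher-probability states, which is the relationship the energy function is meant to capture.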
3.2 Algorithm model
In a deep neural network, the design of the activation function is an important step. The activation function is usually applied to the output of the linear function z = ωx + b. It is introduced to add nonlinearity, and is usually required to be nonlinear and differentiable. Because the activation function is nonlinear, the change of the output value is easy to control; if the activation function were linear, stacked layers would collapse into a single linear mapping and the change of the output value could not be controlled. Common activation functions are as follows:
Sigmoid function:
The sigmoid function, also known as the logistic function, is defined as follows:
$$f(x)=\frac{1}{1+e^{-x}} \tag{1}$$
The sigmoid function was one of the earliest activation functions. It introduces nonlinearity by mapping the activation value into the interval between 0 and 1, where the value indicates the degree of activation: a value of 1 indicates full activation, and a value of 0 indicates full deactivation.
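A direct Python transcription of Eq. 1, shown only to make the mapping into (0, 1) concrete (numpy is an implementation choice, not from the paper):

```python
import numpy as np

def sigmoid(x):
    """Eq. 1: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```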
Tanh function: The tanh function, also known as the hyperbolic tangent function, is defined as follows:
$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \tag{2}$$
The tanh function is very similar in form to the sigmoid function; both are elongated "S" curves. However, the tanh function maps inputs to the range −1 to 1: an output value of 1 indicates full activation, and an output value of −1 indicates full deactivation. The curves of the tanh and sigmoid functions are shown in Fig. 2. From the relationship between x and y, it can be seen that the sigmoid curve saturates when |x| > 4 and tanh saturates when |x| > 2, which affects the convergence of the model parameters during training.
Because of the saturating nonlinearity of the tanh and sigmoid functions, vanishing gradients arise during gradient-based training. Therefore, when using these two activation functions, batch normalization (BN) is typically added to alleviate the saturation and vanishing-gradient problems.
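The saturation described above can be checked numerically: the derivatives σ′(x) = σ(x)(1 − σ(x)) and tanh′(x) = 1 − tanh²(x) are already near zero at the thresholds quoted in the text, which is what causes the gradient to vanish. A minimal sketch (sample points chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (0.0, 2.0, 4.0):
    d_sig = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid derivative
    d_tanh = 1.0 - np.tanh(x) ** 2           # tanh derivative
    print(f"x={x}: sigmoid'={d_sig:.4f}, tanh'={d_tanh:.4f}")
# At x=4 both derivatives are tiny, so gradient updates nearly stall.
```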
ReLU function: The rectified linear unit (ReLU) is an activation function commonly used in neural network structures. Its expression is shown in Formula 3:
$$f(x)=\max(x,0) \tag{3}$$
It can be seen from Eq. 3 that the function is divided into two parts at the origin: the left part is 0, and the right part is a straight line with slope 1, satisfying the nonlinearity condition. If x < 0, the output is 0; if x > 0, the output equals the input. This one-sided suppression makes the network's activations sparse, reducing the number of active neurons, which increases the model's ability to extract data features and improves the data-fitting effect.
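A direct sketch of Eq. 3, showing the one-sided suppression on a few sample inputs (the inputs are arbitrary examples):

```python
import numpy as np

def relu(x):
    """Eq. 3: 0 for negative inputs, identity for positive inputs."""
    return np.maximum(x, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```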
The softmax function is usually used in the output layer of a neural network as the output of the classifier. Each output value represents the probability of a class, and its functional form is as follows:
$$\sigma_{i}(z)=\frac{e^{z_{i}}}{\sum_{j=1}^{m} e^{z_{j}}} \tag{4}$$
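A sketch of Eq. 4; subtracting max(z) before exponentiating is a common numerical-stability step that does not change the result, and the input values below are arbitrary examples:

```python
import numpy as np

def softmax(z):
    """Eq. 4: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # shift for numerical stability (result unchanged)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # class probabilities, summing to 1
```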
The input to the hidden layer is computed from the input vector, the weights, and the biases. The calculation formula is:
$$u_{j}=\sum_{i=1}^{n} \omega_{ij}\, x_{i}+a_{j},\quad j=1,2,\dots,l \tag{5}$$
The output of the hidden layer is obtained by applying the activation function to the hidden layer's input. The calculation formula is:
$$y_{j}=f\left(u_{j}\right),\quad j=1,2,\dots,l \tag{6}$$
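Equations 5 and 6 together describe one forward pass through the hidden layer. A minimal sketch with arbitrary sizes n (inputs) and l (hidden units), using the sigmoid of Eq. 1 as f; all values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 5, 3                      # illustrative sizes: n inputs, l hidden units
x = rng.normal(size=n)           # input vector
omega = rng.normal(size=(n, l))  # weights omega_ij
a = rng.normal(size=l)           # biases a_j

u = omega.T @ x + a              # Eq. 5: u_j = sum_i omega_ij * x_i + a_j
y = 1.0 / (1.0 + np.exp(-u))     # Eq. 6: y_j = f(u_j), here f = sigmoid
print(y)
```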
In classification, the learned model computes posterior probabilities, and the class with the maximum posterior probability is taken as the classifier's result. According to Bayes' theorem, the Bayesian classifier is expressed in terms of the posterior probability:
$$y=\arg\max_{c_{k}} \frac{P\left(Y=c_{k}\right)\prod_{i} P\left(X^{(i)}=x^{(i)}\mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right)\prod_{i} P\left(X^{(i)}=x^{(i)}\mid Y=c_{k}\right)} \tag{7}$$
The parameters of a Bayesian classifier can be estimated by maximum likelihood estimation or by Bayesian estimation. Generally speaking, the naive Bayes algorithm is a relatively simple algorithm, and it can be combined with N-grams to represent different contexts.
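A minimal sketch of the decision rule in Eq. 7 for categorical features; the class names, priors, and likelihood tables below are toy values invented purely for illustration:

```python
import numpy as np

classes = ["c1", "c2"]
prior = {"c1": 0.6, "c2": 0.4}  # P(Y = c_k), toy values
# P(X^(i) = x^(i) | Y = c_k) for two binary features, toy values
likelihood = {
    "c1": [{0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6}],
    "c2": [{0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}],
}

def classify(x):
    """Eq. 7: pick the class with the largest posterior probability."""
    joint = {c: prior[c] * np.prod([likelihood[c][i][xi] for i, xi in enumerate(x)])
             for c in classes}
    z = sum(joint.values())                      # normalizer (the denominator)
    posterior = {c: joint[c] / z for c in classes}
    return max(posterior, key=posterior.get), posterior

print(classify((1, 0)))  # -> ('c2', {...}): c2 has the larger posterior
```

Note that the denominator is the same for every class, so in practice the argmax can be taken over the numerators alone; it is kept here to match Eq. 7.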
For linearly separable data, a linear classifier can be constructed by maximizing the hard margin. The corresponding learning algorithm is called the maximum-margin method, and this type of support vector machine is also called a hard-margin support vector machine. The feature space is X = R^n, and the output space is Y = {+1, −1}. The separating hyperplane is expressed as:
$$H: w\cdot x+b=0 \tag{8}$$
The input sample points closest to the separating hyperplane are called support vectors, and a support vector satisfies:
$$y_{i}\left(w\cdot x_{i}+b\right)-1=0 \tag{9}$$
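A minimal numerical check of Eqs. 8 and 9 with a hand-picked hyperplane and sample points, chosen only for illustration: points on the margin boundary satisfy y_i(w·x_i + b) − 1 = 0, while points farther from the hyperplane give a strictly positive value:

```python
import numpy as np

w = np.array([1.0, 0.0])  # hyperplane H: w.x + b = 0 (Eq. 8), toy parameters
b = -1.0

# (x_i, y_i) pairs; the first two lie exactly on the margin boundaries.
samples = [(np.array([2.0, 0.0]), +1),   # w.x + b =  1 -> support vector
           (np.array([0.0, 0.0]), -1),   # w.x + b = -1 -> support vector
           (np.array([4.0, 1.0]), +1)]   # farther from H -> not a support vector

for x, y in samples:
    slack = y * (w @ x + b) - 1.0        # Eq. 9: zero exactly for support vectors
    print(x, "support vector" if np.isclose(slack, 0.0) else "interior point")
```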
After the word-frequency matrix is obtained, the second step is to calculate the TF-IDF weight of each word, where TF stands for term frequency and IDF stands for inverse document frequency. Assume the number of documents in the corpus is |D|, the word whose TF-IDF weight is being calculated is w, the current document is Di, the total number of words in the document is |Di|, and the number of occurrences of w is |w|. Then TF and IDF are calculated as shown in Equations 10 and 11, respectively:
$$\mathrm{TF}_{w}=\frac{|w|}{\left|D_{i}\right|} \tag{10}$$
$$\mathrm{IDF}_{w}=\log\frac{|D|}{\left|\left\{D_{i} : w\in D_{i}\right\}\right|} \tag{11}$$
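A direct sketch of Eqs. 10 and 11 over a toy corpus (the documents are invented for illustration); real systems usually add smoothing to the IDF denominator to avoid division by zero, which is omitted here to match the formulas:

```python
import math

# Toy corpus: each document is a list of tokens.
corpus = [["deep", "learning", "model"],
          ["deep", "network"],
          ["bayes", "model", "model"]]

def tf_idf(w, doc, docs):
    tf = doc.count(w) / len(doc)            # Eq. 10: |w| / |D_i|
    df = sum(1 for d in docs if w in d)     # |{D_i : w in D_i}|
    idf = math.log(len(docs) / df)          # Eq. 11
    return tf * idf

print(tf_idf("model", corpus[2], corpus))   # weight of "model" in document 3
```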