Design of computer big data processing system based on genetic algorithm

Big data technology has undoubtedly accelerated data processing, made data an important reference in enterprise operation, and changed both the mode of enterprise operation and existing business models. With the development of computer technology and related equipment, more and more enterprises and organizations can begin to use big data processing technology. However, many small and medium-sized enterprises cannot afford the high cost of research and development or of leasing, and are gradually being eliminated from information-based business competition, so strategies are needed to help small and medium-sized enterprises out of this crisis. For fragmented big data obtained from different data sources, this paper adopts load-balancing technology to provide horizontal scalability of the service cluster and designs a separate system module for routine testing. The experimental results show that the improved big data processing system based on the genetic algorithm can better meet both the business and the application needs of small and medium-sized enterprises. The simulation results also show the advantages of the system: it is faster, more accurate, consumes less energy, places lower demands on equipment, and is more suitable for small and medium-sized enterprises. By studying the genetic algorithm and computer technology, this paper designs an effective big data processing system.


Introduction
With the continuous development of social networking, Internet of Things and multimedia technologies, we have witnessed an explosion in the amount, velocity and variety of data from sources such as mobile devices, sensors, social networking sites, electronic cameras, surveillance systems and more (Nath et al. 2022; Chen et al. 2020). According to IDC research, the global data volume was 4.4 ZB in 2013 and had grown to 44 ZB (roughly 44 trillion GB) by 2020. Big data has gradually penetrated all aspects of people's lives (Addo-Tenkorang and Helo 2016). Although big data has been applied in many fields, it is mostly used by large enterprises. Many small and medium-sized enterprises cannot afford the cost of independent research and development or of a complete introduction, so they can only use big data processing systems in the form of a lease (Al Nuaimi et al. 2015). This has many disadvantages. First, existing big data processing systems are mainly developed by large enterprises for their own use and are directly disconnected from the application requirements of small and medium-sized enterprises (Tamiminia et al. 2020). Second, small and medium-sized enterprises lack enough technical personnel to support an excessively large system, which wastes cost and resources. Finally, small and medium-sized enterprises that depend on others for a long time are exposed to monopoly problems. Therefore, it is necessary to optimize and innovate the existing big data processing system (Igbaria et al. 1995). Cloud-style IT services are designed to let people consume data center resources the way they use ''water'' and ''electricity.'' This technology has many features, such as wide-area interconnected access, adaptive services, rapid elastic expansion, resource pooling and pay-as-you-go billing, so it makes large-scale computing, storage and network resources easier to use and is of great help for large-scale data processing and analysis (Xiong 1992; Selwyn 2007). Against this background, this paper develops a big data computer processing system integrating a genetic algorithm and conducts in-depth analysis and research on it. The paper first introduces the basic principle of the genetic algorithm and its processing method, then applies it to the design of a big data processing computer system and tests the system. The test results show that the system designed in this paper fully meets the requirements of use, with the error value always within a controllable range, so rational use of the system is very beneficial to the development of small and medium-sized enterprises.

Related work
The literature has improved the modeling of network information flow and optimized the inhomogeneity analysis method to make it simpler and more accurate; in the actual modeling process, the selected individuals often neither mutate nor cross boundaries (Kim et al. 2011). The processed network model can more accurately describe the expression and information transmission process of dominant genes (Passalis et al. 2020). The literature uses the network structure entropy from complex network theory to describe network inhomogeneity in information flow; the entropy of the network structure represents the degree of order of the network in the information flow, that is, the differences within the network (Oloufa et al. 2004). Compared with the scale index based on fitting the network power distribution curve, the network structure entropy is calculated directly from the number of nodes in the network, and the connectivity of network nodes can measure the unevenness of the information flow, so it is both more accurate and simpler. The literature designs a genetic algorithm based on a dynamic self-organizing network (Tinos and Yang 2007). In order to evaluate the importance of a node effectively, a new definition of node importance in an exponential network is given, which takes into account a node's contribution to the objective-function ranking of its adjacent nodes and the number of adjacent nodes, while avoiding nodes with a fitness of 0 (Eledlebi et al. 2020). The literature proposes three topology update rules, double-new, single-new and selective deletion, so that the population structure of the genetic algorithm evolves dynamically as the algorithm evolves, which effectively improves its convergence performance (Meng et al. 2021). The literature optimizes the information storage mechanism based on large heap tree nodes to reduce cost and make it easier to apply to the field of cluster management (Delimitrou and Kozyrakis 2014); the optimized algorithm includes a node information storage strategy, a tree-topology-based heartbeat detection mechanism, and a fault recovery mechanism. In addition, against the background of columnar databases storing large-scale structured data, data compression technology is studied in depth, and a hybrid compression strategy based on columnar storage is proposed (Zhang and Li 2006).

Basic principles of genetic algorithms
The genetic algorithm adopts the evolutionary thinking of the biological world and is mainly based on the genetic mechanism of parental gene recombination during biological reproduction and on nature's ''survival of the fittest'' selection mechanism. It is a global optimization search algorithm whose two key features are parallelism and a global search of the solution space. The inherent parallelism of genetic algorithms enables them to search more effectively for the global optimum; the flow of the algorithm is shown in Fig. 1.
Assuming that the population size (that is, the number of individuals in the population) is n, the fitness value of individual i is f_i, and P_i is the probability that individual i is selected, then

P_i = f_i / Σ_{j=1}^{n} f_j.

Design of big data processing algorithms

The total number of tasks to be scheduled in the cloud computing environment is denoted M. The first L genes of a chromosome represent the tasks assigned to virtual machine V_k; as long as the constraints are met, the gene sequence of V_k can be arranged arbitrarily.
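As an illustration (a sketch of ours rather than code from the paper), roulette-wheel selection according to the probability P_i defined above can be written in C as follows; the population layout and the use of rand() are assumptions.

#include <stdlib.h>

/* Roulette-wheel selection: individual i is chosen with probability
   P_i = f_i / (f_1 + ... + f_n). */
int roulette_select(const double *f, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += f[i];                             /* denominator of P_i */
    double r = (double)rand() / RAND_MAX * total;  /* spin the wheel */
    double acc = 0.0;
    for (int i = 0; i < n; i++) {
        acc += f[i];
        if (r <= acc)
            return i;                              /* lands in slot i */
    }
    return n - 1;                                  /* guard against rounding */
}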
In the cloud computing environment, users' satisfaction with services can be measured against QoS criteria, with a weight assigned to each criterion, and a user satisfaction function is defined for each task Ti. The execution time of each task is determined by its workload and the processing capacity of the assigned resource, and the total time required to complete all M jobs is the sum of the execution times of the individual jobs. Let T_user be the time within which the user expects job Jm to complete; the user satisfaction function of completion time decreases as the actual time exceeds T_user. Let B_user be the bandwidth the user expects; the user satisfaction function of bandwidth rises with the bandwidth actually allocated. Assuming resources are billed per unit, the total cost of task Ti is the sum of the unit costs of the resources it occupies; let C_user be the cost the user expects, and the user satisfaction function of cost decreases as the actual cost exceeds C_user. Scheduling jobs in a cloud environment must take all four goals into account: for users, on the one hand, the less time and cost required to complete the work the better; on the other hand, the greater the bandwidth allocated to the working system and the higher the system stability and job completion rate the better. The fitness function for job scheduling is therefore an aggregate of these four satisfaction functions. For a chromosome individual with fitness f_i, the selection probability Q_i is computed in the same roulette-wheel form as P_i above. For the crossover probability Pc, this paper adopts an adaptive method to prevent an excessive Pc from damaging the structure of high-stability individuals.
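The original formulas are not reproduced in the source, so the following is only a plausible sketch of the quantities just described; the weights w_q, the satisfaction functions S, and the adaptive constants P_c1 and P_c2 are our assumptions. The total completion time and a weighted fitness could take the form

T_total = Σ_{m=1}^{M} t_m,
F = w_1 S_time + w_2 S_bw + w_3 S_cost + w_4 S_stab, with Σ_q w_q = 1,

with, for example, S_time(Jm) = min(1, T_user / t_m), S_bw = min(1, B / B_user) and S_cost = min(1, C_user / C), so that each satisfaction value rises toward 1 as the service meets the user's expectation. The classical adaptive crossover scheme matching the description is

P_c = P_c1 - (P_c1 - P_c2)(f' - f_avg) / (f_max - f_avg),  if f' >= f_avg,
P_c = P_c1,  otherwise,

where f' is the larger fitness of the two parents selected for crossover, so highly fit (high-stability) individuals are crossed over with a smaller probability.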

Simulation analysis of network characteristics
The scaling exponent is an approximate parameter obtained by fitting the power-law degree distribution curve of the network, and it can be used to describe the non-uniformity of scale-free networks. The higher the exponent, the faster the power-law distribution curve falls and the more evident the non-uniformity of the network. At present there are many methods for calculating the scaling exponent, such as direct nonlinear histogram regression, log-linear regression, Fourier transform solution, dynamic equation or Z-transform solution, and maximum likelihood estimation. The maximum likelihood idea is that the observed samples are seen precisely because the probability of seeing them is highest; the solution process usually first forms the log-likelihood function and then takes the partial derivative with respect to the parameters to obtain the parameter estimates. The specific process is as follows. The degree distribution of network nodes can be expressed as

p(x) = k x^(-m),

where x is the number of connected edges of a network node, k is a constant, and m is the non-uniform scaling exponent used to measure the flow of information in the network. After introducing the minimum number of connected edges x_min, the degree distribution can be written as

p(x) = ((m - 1) / x_min) (x / x_min)^(-m).

To estimate the scaling exponent m from the n sample values x_1, ..., x_n, the likelihood function is

L(m) = Π_{i=1}^{n} p(x_i),

and the log-likelihood function is

ln L(m) = n ln(m - 1) - n ln x_min - m Σ_{i=1}^{n} ln(x_i / x_min).

Setting dL/dm = 0, we find

m̂ = 1 + n / Σ_{i=1}^{n} ln(x_i / x_min),

where m̂ is the estimated value of m. It can be seen from this formula that if nodes have many connected edges, the scaling exponent is small, and the information flow network is then far from uniformly distributed. The probability distribution of node degree in the information flow network of the big data processing system is shown in Fig. 2, and the size distribution of each node degree in the information flow network of the computer big data processing system is shown in Fig. 3.
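A minimal sketch of this estimator in C is given below; the degree values in the example are hypothetical sample data, not measurements from the paper.

#include <math.h>
#include <stdio.h>

/* Maximum likelihood estimate of the scaling exponent:
   m = 1 + n / sum_i ln(x_i / x_min). */
double scaling_exponent(const double *deg, int n, double x_min) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += log(deg[i] / x_min);   /* accumulate ln(x_i / x_min) */
    return 1.0 + n / s;
}

int main(void) {
    double deg[] = {1, 1, 2, 2, 3, 5, 8, 21};   /* hypothetical node degrees */
    printf("estimated m = %.3f\n", scaling_exponent(deg, 8, 1.0));
    return 0;
}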
It can be seen from Figs. 2 and 3 that most nodes in the information flow network of the big data processing system establish only a few connection edges with other nodes, while a few nodes establish a large number of edge connections. Nodes with degrees in the range from 1 to 10 are the most numerous, and the regions with more connecting edges contain far fewer nodes.
Design and testing of computer big data processing system

System requirements analysis
The main functions of the computer big data processing system include data source access, data stream processing, support for custom data processing rules, etc. These modules will be introduced separately below.
• Access the data source: The system supports access to a variety of data sources, mainly online data sources, supplemented by offline data sources. Although the system is an online data stream processing platform, some applications need to process not only online data but also a small amount of offline data.
The system stores offline data in HDFS. For online data, the system divides data by topic and stores it in an orderly manner, and users can choose whether to allow data loss in order to improve application performance. Finally, once a user submits an online processing task for a data source, the data enters the next stage of the processing flow.
• Data stream processing: The processing logic of a data stream varies with the application scenario. To enable the system to support a variety of processing logic, common data processing operations have been encapsulated as functional components, so the user only needs to combine the required components flexibly and specify the topological relationships between them.
First of all, in the face of flexible business logic requirements, the versatility and ease of use of the system, as an online data stream processing platform, are very important.
Since the system supports multiple online tasks running at the same time, it usually processes multiple data sources at the same time, and different data sources produce data at different rates. System stability is therefore reflected in stable data source access and stable operation of online tasks, and the design gives special consideration to single points of failure to ensure the stability of the system service.
Second, for streaming data, the system not only needs to receive the data in a timely manner, but also needs enough computing power to process and store the data quickly and return the results to the application. Therefore, the requirements for the computing power and response speed of the central computing part of the system are relatively high.
Finally, due to the complexity of business logic and the ever-increasing amount of data to be processed, the computing power of the system will gradually reach a bottleneck. In order to improve system performance, the coupling between modules should be reduced as much as possible to facilitate performance expansion and system hardware upgrade.

System structure design
This project designs and implements a computerized big data processing system that aims to provide a low-power, low-cost, and relatively simple solution for small and medium-sized businesses or households. In the basic structure of the platform, the administrator checks the operation of the platform through the management PC and assigns test tasks; the user enters the IP address of the Master in the Client browser to access the embedded Web server, and the embedded Web server uses CGI to interact programmatically with the large database engine. Users can directly manage the clusters and database operations of the large database engine by operating the browser page. The physical structure topology of the platform is shown in Fig. 4. The platform is based on a large database engine: users interact with the platform through the Lighttpd embedded web server, while the administrator checks the status of the cluster and uploads data from the management PC via the command line. Figure 5 is a logical structure diagram of the platform.

System module design
UDP offers a flexibility that TCP does not have, because it guarantees only the integrity of each datagram; it is therefore especially necessary to decide which features the data transfer protocol needs and to implement only those required properties.
Scenarios that generate fragmented small data include real-time data generated by sensors, session data generated by instant messaging, and news headlines delivered by news applications. In some cases small data is generated continuously, and in others it arrives in bursts. The data volume of the stream itself is very small, so a lightweight transmission protocol with low cost and high reliability is required.
The priority communication protocol is proposed to handle the transmission of small fragmented data according to the requirements of the transmission module, improving real-time performance by using as few intermediate passes as possible.
As a protocol for small data transmission, its per-packet overhead accounts for a larger share of bandwidth than in other scenarios. Therefore, in order to improve data transmission efficiency, the necessary functions must be implemented in the most economical way.
The protocol format is divided into two parts: the packet header and the payload, and the payload is divided into two types: the data payload and the confirmation payload.
(1) Packet header: Stream number: bits 0-31, randomly generated by the sender when a data stream is created, used to distinguish data from different data streams.
Packet sequence number: bits 32-63, recording the sequence number of the data packet. In a data payload, the sequence number of the first packet is randomly generated by the sender and then incremented in the order the packets are sent; in an acknowledgment payload, the sequence number of the first packet is randomly generated by the receiver and then incremented in the order of the acknowledgment packets. An acknowledgment packet with a higher sequence number carries stronger acknowledgment authority, so acknowledgments that arrive late with smaller sequence numbers can be handled directly.
Version number: bits 64-71, used to record the protocol version number; the current version number is 0x01.
(2) Data payload: Data payload identification bit: bit 0 identifies this packet as a data payload.
Boundary field: bits 1-2, used to identify whether the data packet lies at a boundary of the stream. ''01'' marks the first packet of the data stream, ''10'' the last packet, and ''11'' means the data stream contains only this packet. Through the boundary field, the receiver can accurately judge the extent of the data stream and knows, once all data has been received, that no more is coming. Packet length: bits 3-13, 11 bits in total, with a maximum value of 2047. Since packets are divided according to the MTU, the application-layer payload is at least 548 bytes and at most 1472 bytes, so 11 bits are enough to cover all cases. The packet length here is the total length of the application-layer data of the current packet, excluding the packet header described above.
Stream length: the length here is not in bytes but the total number of packets in the stream. Data stream offset: indicates the position of the current packet within the data stream, counted in packets. Payload data: the application-layer data to be sent. Each data payload carries 13 bytes of additional overhead at the application layer, so the closer the amount of data in a packet is to the MTU, the higher the bandwidth utilization.
(3) Acknowledgment payload: The acknowledgment payload is extra overhead designed to compensate for packet loss and retransmission. Exploiting the small data volume, the priority communication protocol does not always require strict reliability: there is no need for a sliding window that checks every packet; instead, all data are sent and the acknowledgment payload confirms that all packets are complete. When reliable transmission is required, the receiver uses a window mechanism like TCP's to verify and acknowledge all data packets, loading the acknowledgment number and sequence number into the data frame; when reliable transmission is not required, the receiver only checks the first and last packets, and if both are confirmed to have arrived, it sends an acknowledgment packet confirming that all packets have arrived. These formats are sketched as C structures below.
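The following sketch renders the header and data payload formats described above as C structures. The field names are ours; the widths of the stream-length and stream-offset fields are assumptions chosen so that the fixed overhead matches the 13 bytes stated above.

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t stream_id;   /* bits 0-31: random ID per data stream */
    uint32_t seq_no;      /* bits 32-63: first value random, then incremented */
    uint8_t  version;     /* bits 64-71: protocol version, currently 0x01 */
} mini_header_t;          /* 9 bytes */

typedef struct {
    mini_header_t hdr;
    uint16_t flags_len;   /* bit 0: payload type (0 = data); bits 1-2:
                             boundary flags 01/10/11; bits 3-13: payload
                             length in bytes, at most 2047 */
    uint8_t  stream_len;  /* total packets in the stream (width assumed) */
    uint8_t  stream_off;  /* position of this packet (width assumed) */
    /* followed by 548-1472 bytes of application data */
} mini_data_packet_t;     /* 13 bytes of fixed overhead */
#pragma pack(pop)

With this 13-byte overhead, a packet carrying the maximum 1472-byte payload devotes 1472/(1472 + 13) ≈ 99.1% of its bytes to data, while a 100-byte payload manages only 100/113 ≈ 88.5%, which is why filling packets close to the MTU maximizes bandwidth utilization.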
Function definitions:

int mini_socket(): basic function that creates a socket; it takes no input parameters and returns the socket used when calling the other functions.

int mini_send(int sock, void *buf, size_t nbytes, char *host, int port, int mtu): data transmission function. Returns the number of bytes sent on success and -1 on failure. This function divides the data waiting to be transmitted into blocks according to the mtu parameter, then packages and sends them according to the priority communication protocol. Unlike UDP sendto, this function ensures that the data is delivered to the receiver correctly and completely.

int mini_recv(int sock, void *buf, size_t len): data receiving function. The first parameter sock is expected to be the return value of mini_socket; the second parameter buf is the buffer reserved for data; the third parameter len is the length of the buffer. Returns the number of bytes received on success and -1 on failure. This is the receiving function for one-to-one scenarios, including the trusted internal communication process, and is used together with mini_send(). It is suitable for scenarios with a limited and stable set of data sources.

int mini_listen(int sock, int backlog, void (*fun)(void *buf, size_t len, void *args)): data receiving function. The first parameter sock is the return value of mini_socket; the second parameter backlog is the maximum number of streams that can be maintained (-1 means no limit); the third parameter is a callback function. This function dynamically allocates space internally to hold the contents of each data stream. When the content of a data stream has been completely received, the callback function is triggered, and the address and length of the stream data are passed to it; after the callback returns, this space is freed. The user can bound the number of streams processed simultaneously via the backlog parameter; data streams beyond the limit are rejected. This function should be used when the state of the data sources is unpredictable.

int mini_bind(int sock, struct sockaddr *myaddr, int addrlen): binds the local address to the specified socket; this function is used by the server.

Figure 6 is the final design diagram of the computer big data processing system.
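A minimal sketch of how a sender and a receiver might use this interface follows; the host address, port, and callback body are hypothetical and only illustrate the calling sequence.

#include <stdio.h>
#include <string.h>

/* Prototypes of the interface described above. */
int mini_socket(void);
int mini_send(int sock, void *buf, size_t nbytes, char *host, int port, int mtu);
int mini_listen(int sock, int backlog, void (*fun)(void *buf, size_t len, void *args));

/* Hypothetical callback: invoked once a whole data stream has arrived. */
static void on_stream(void *buf, size_t len, void *args) {
    printf("received a %zu-byte stream\n", len);
}

int main(void) {
    /* Sender: transmit one small message reliably. */
    char host[] = "192.0.2.10";              /* hypothetical receiver address */
    char msg[]  = "sensor reading: 42";
    int s = mini_socket();
    if (mini_send(s, msg, strlen(msg), host, 9000, 1472) < 0)
        return 1;                            /* -1 signals a send failure */

    /* Receiver (normally a separate process): no stream limit,
       hand each completed stream to the callback. */
    int r = mini_socket();
    mini_listen(r, -1, on_stream);
    return 0;
}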
For fast transmission, it is necessary to avoid copying data from the kernel to the user mode. Therefore, tasks such as data feature extraction must be completed directly in the kernel. The processing location of the kernel network stack is shown in Fig. 7.
When data reaches the core network stack, the network layer directly reads the information, determines the forwarding node, modifies the packet header, and then sends the packet on directly. The entire process must be performed by modules loaded into the kernel rather than by regular user-mode programs.
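Purely as an illustration (the paper does not show its kernel module), one common way to process packets inside the Linux kernel network stack is a Netfilter hook registered at the pre-routing stage; the sketch below follows the standard Netfilter API on recent kernels.

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <net/net_namespace.h>

/* Inspect and, if needed, rewrite each packet at the pre-routing
   stage, before any copy to user space. */
static unsigned int hook_fn(void *priv, struct sk_buff *skb,
                            const struct nf_hook_state *state)
{
    struct iphdr *iph = ip_hdr(skb);
    if (!iph)
        return NF_ACCEPT;
    /* ... read fields, choose the forwarding node, modify the
       packet header here ... */
    return NF_ACCEPT;   /* let the (possibly modified) packet continue */
}

static struct nf_hook_ops ops = {
    .hook     = hook_fn,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init hook_init(void) { return nf_register_net_hook(&init_net, &ops); }
static void __exit hook_exit(void) { nf_unregister_net_hook(&init_net, &ops); }

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");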
Data storage: The role of this component is to persist data, which comes from two sources: one is the result of a data stream after it has passed through a series of processing logic; the other is raw data to be stored. The volume of the former is usually small, while the volume of the latter keeps growing.
HBase can automatically partition a growing table, and each region after partitioning contains a subset of the table's rows. An HBase table consists of regions: there is only one region at the beginning, but as the region grows beyond a set size threshold, it is split at a row boundary into two new regions of roughly equal size, and the number of regions increases accordingly. A table too large to fit on one server can thus be distributed across the cluster, and distributed access to table data likewise reduces the pressure of centralized access.
Therefore, in order to better cope with different storage requirements, this component supports both the traditional MySQL relational database and the key-value HBase database.
Statistical data: The purpose of this component is to compute statistics over one or more fields in the data stream. Common aggregation functions such as sum, count, average, maximum, and minimum are supported, and users can select one or more of them. For example, monthly turnover and transaction frequency may need to be analyzed statistically, covering both the statistic values and the related statistical information.
Data collection: The function of this component is to periodically extract from the data stream the fields to be analyzed. The data retrieval component is always the first step in the business processing logic: on the one hand, it reduces the amount of data that must be sent downstream, improving the processing efficiency of the overall task; on the other hand, it identifies field names and field types, facilitating further data processing.
Data filtering: Filtering is a common operation in data processing. The data filtering component of this system supports exact matching, fuzzy matching and range matching of fields, and supports logical operations such as OR and AND between multiple fields. For example, suppose a data stream contains three fields A, B, and C, whose types are string, integer, and string, respectively, and the filter rules are: the content of field A starts with ''a''; the value of field B is between 0 and 10; the content of field C is ''crazy''; and the three rules are combined with AND. Then only the data records that satisfy all three filter rules at the same time are output.
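A sketch of the three-field AND filter from this example is shown below; the record type and helper names are hypothetical.

#include <stdbool.h>
#include <string.h>

/* Hypothetical record type for the three-field example. */
typedef struct {
    const char *A;   /* string field */
    int         B;   /* integer field */
    const char *C;   /* string field */
} record_t;

/* AND of the three rules: A starts with "a", B lies in [0, 10],
   C equals "crazy". */
static bool match(const record_t *r) {
    return strncmp(r->A, "a", 1) == 0
        && r->B >= 0 && r->B <= 10
        && strcmp(r->C, "crazy") == 0;
}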

System test
As an external server, the data source access service must ensure stability, and a server restart must not affect normal client service. Client applications must add the supporting Jar package and then call the interface to transfer data. The server starts the service and begins recording syslogs, and at the same time starts the Kafka message queue service. The test process is shown in Table 1.

Table 1 Test process of the data source access service

No. | Test process | Expected result | Result
1 | (1) Call the client interface for data transmission (2) Print the server system log (3) Print the data in the message queue | (1) Print target logs and messages (2) The client has no abnormality | As expected
2 | (1) Continuously call the client interface (2) Restart the server | After restarting the server, the client can continue to be served | As expected
The following takes KafkaSpout, HdfsSpout, RegrexBolt, and FilterBolt as examples to introduce the test process, expected results, and test results, as shown in Table 2. The testing process of the other components is basically the same and is not repeated here.

Table 2 Component test process and results

Component | Test process | Expected result | Result
KafkaSpout | (1) Send a message to Kafka after starting the thread (2) Print the data sent by KafkaSpout | (1) Print message (2) The message is not repeated | As expected
HdfsSpout | (1) Preprocess HDFS data files (2) Print the data sent by HdfsSpout | (1) Print message (2) The message is not repeated | As expected
RegrexBolt | (1) Send data to the source component (2) Print the result of RegrexBolt processing | Correctly extract data, field names, types, etc. | As expected
FilterBolt | (1) Send data to the source component (2) Filter different fields (3) Print filter results | (1) Correct match (2) Correctly handle logical relationships | As expected
In this computer big data processing system, it is necessary to check the accuracy of the confidence c and the error bound e calculated by the system. Figure 8 shows the error bounds under different confidence levels for a total of 1000 queries generated from query templates 1 and 2.
As can be seen from Fig. 8, the actual error is always below the specified error bound. This occurs because many queries use the system's entire sample, which is larger than the sample size those queries actually require.
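The paper does not state the form of its error bound, but this behavior matches standard concentration bounds; under a Hoeffding-style assumption (ours, for illustration), the error bound at confidence c = 1 - δ over n samples scales as

e = sqrt(ln(2/δ) / (2n)),

so a query answered from the system's full sample, whose n exceeds what the query strictly requires, achieves an actual error below the specified bound.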

Conclusion
It can be seen from the study of the literature that the advantages of big data processing are significant, and big data processing systems are already applied in many fields. This paper focuses on small and medium-sized enterprises, which have low budgets and are at a disadvantage when introducing big data application equipment. By combining the genetic algorithm with computer big data technology, it designs a big data processing system suited to the scale of small and medium-sized enterprises, thereby better supporting their development. The system designed in this paper still has some shortcomings, which need to be further improved in subsequent design and development.
Funding The authors have not disclosed any funding.
Data Availability Data will be made available on request.

Declarations
Conflict of interest The authors declare that they have no conflict of interests.
Ethical approval This article does not contain any studies with human participants performed by any of the authors.
