3.1 ZigBee networking
This paper implements the ZigBee communication function based on Z-Stack, the ZigBee protocol stack introduced by TI. Z-Stack supports a variety of microcontrollers, including the CC2530 system-on-chip, the CC2520 used with the MSP430 series, and the LM3S9B96 in the Stellaris series. The protocol stack supports a variety of network topologies and is widely used in the ZigBee industry.
The protocol stack defines how the communication hardware and software cooperate across the layers of the hierarchy. On the sending side, the packets submitted by the user pass down through each protocol layer in turn; each layer's entity adds its own identifying information in a defined format until the packet reaches the physical layer, where it is transmitted over the physical link as a binary stream to the receiver. On the receiving side, the packet passes up through each protocol layer in turn; each layer's entity extracts, in the predefined format, the information to be processed at that layer, until the data finally reaches the application layer.
Fig. 1 shows the structure of a ZigBee network. It contains two key roles, the coordinator and the terminal node, which together constitute the simplest ZigBee communication setup. The internal network communicates wirelessly at 2.4 GHz; the external network consists of peripheral devices such as sensors and the Internet. Control of household appliances and environmental monitoring are achieved through interaction between the internal and external networks. The coordinator is the hub of the entire ZigBee network: it scans the current network conditions, chooses an appropriate channel and network ID, and then starts the ZigBee network; it also assists in configuring security parameters and application bindings within the network. In short, the coordinator is primarily responsible for starting and configuring the network. Once this work is complete, it can switch to the router role or leave the current network, and such a change has no impact on the network as a whole. The terminal node is not responsible for the overall operation of the network; it only needs to be able to sleep and wake, sleeping when idle to extend standby time and waking quickly when it receives a wake-up command from the coordinator.
3.2 Establishing a traffic model based on machine learning
In machine-learning traffic analysis, effective assessment requires data for support and training. This paper uses Cambridge University's Moore dataset as the training and test set for traffic classification. The dataset was collected with a high-performance network monitor that provides timestamps with a resolution better than 35 nanoseconds, and it consists of many objects, each described by a set of features. The data were manually classified in large quantity; each object in each dataset represents a single TCP flow between a client and a server. The features of each object include a classification derived elsewhere and many derived features used as inputs to probabilistic classification techniques. The feature information is derived from header information alone, while the classification classes are derived using content-based analysis.
The dataset contains 10 sub-datasets, totaling 377,536 flows and 249 features. The 11 traffic types involved include WWW, FTP, DATABASE, P2P, SERVICE, MAIL, ATTACK, etc. The characteristics of each subset are shown in Table 1:
Table 1 Dataset characteristics
| Data subset | Duration (s) | Number of streams |
|-------------|--------------|-------------------|
| entry01     | 1821.8       | 24863             |
| entry02     | 1696.7       | 23801             |
| entry03     | 1724.1       | 22932             |
| entry04     | 1784.1       | 22285             |
| entry05     | 1794.9       | 21648             |
| entry06     | 1658.5       | 19384             |
| entry07     | 1739.2       | 55835             |
| entry08     | 1665.9       | 55494             |
| entry09     | 1664.5       | 66248             |
| entry10     | 1613.4       | 65036             |
Each sub-dataset has a different number of streams and a different duration. The streams it contains are TCP streams, so they have clear start and end markers. Each stream corresponds to a traffic type, so classification models can be trained with machine learning to classify real traffic.
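As a concrete starting point, the following is a minimal loading sketch. The file names, the ARFF format, and the column layout (features followed by a class label) are assumptions about how the sub-datasets are stored locally, not details given above:

```python
import pandas as pd
from scipy.io import arff

frames = []
for i in range(1, 11):
    # hypothetical file names: entry01.arff ... entry10.arff
    data, _meta = arff.loadarff(f"entry{i:02d}.arff")
    frames.append(pd.DataFrame(data))

moore = pd.concat(frames, ignore_index=True)
X = moore.iloc[:, :-1]                      # header-derived features
y = moore.iloc[:, -1].str.decode("utf-8")   # traffic class (WWW, MAIL, ...)
```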
3.3 Data preprocessing
The dataset itself is not always perfect. Datasets may mix data types, such as text, numbers, time series, and continuous and discrete values. The quality of the data may also be poor: noise, anomalies, missing values, incorrect entries, inconsistent dimensions, duplicates, skewed distributions, or too much or too little data. For the data to fit the model and match its requirements, the Moore dataset must be preprocessed: records that are inaccurate or inappropriate for the model are detected and then corrected or removed. Data preprocessing methods include removing unique attributes, handling missing values, attribute encoding, data standardization and regularization, feature selection, principal component analysis, and so on.
In machine learning, most algorithms, such as logistic regression, support vector machines (SVM), and the k-nearest neighbors algorithm, can only process numeric data and cannot process text. In sklearn, apart from the algorithms designed for text, all algorithms require numeric arrays or matrices as training input and cannot accept text-based data. Some of the data in this dataset contains the characters Y and N, which machine learning algorithms cannot process directly; attribute encoding maps Y to 1 and N to 0. During a network connection the maximum segment size may be unknown, so the dataset marks it with a '?', and some continuous features therefore contain '?' values. These are filled with the feature mean plus Gaussian white noise.
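A minimal sketch of these two steps, reusing the feature frame from the loading sketch above; the assumption that the Y/N flags arrive as plain strings and the noise scale of 0.01 standard deviations are both illustrative choices:

```python
import numpy as np
import pandas as pd

def preprocess(features: pd.DataFrame, noise_scale: float = 0.01,
               seed: int = 0) -> pd.DataFrame:
    """Attribute-encode Y/N flags and fill '?' gaps with mean + Gaussian noise."""
    df = features.replace({"Y": 1, "N": 0, "?": np.nan})
    df = df.apply(pd.to_numeric, errors="coerce")
    rng = np.random.default_rng(seed)
    for col in df.columns[df.isna().any()]:
        n_missing = int(df[col].isna().sum())
        mean = df[col].mean()
        std = df[col].std()
        if not np.isfinite(std) or std == 0:
            std = 1.0                       # guard against degenerate columns
        # fill each gap with the column mean plus low-amplitude white noise
        df.loc[df[col].isna(), col] = mean + rng.normal(
            0.0, noise_scale * std, size=n_missing)
    return df

X = preprocess(X)
```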
From Figure 2, it can be seen that the standard deviation and mean of some features in the dataset are unusually large, on the order of 10^17 and 10^15. For such feature data, data regularization is used: each individual sample is scaled to unit norm. The specific process is as follows:
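The unit-norm scaling is the standard per-sample L2 normalization: each sample vector $x$ is divided by its Euclidean length, so every sample ends up with norm 1:

$$x' = \frac{x}{\lVert x \rVert_2}, \qquad \lVert x \rVert_2 = \sqrt{\sum_{j=1}^{p} x_j^{2}}$$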
This paper fills and replaces the abnormal feature values, normalizes the data, and then recomputes the statistics. Figure 3 describes the dataset with several statistical features, including the standard deviation, mean, 25th percentile, and median.
3.4 Data feature processing
When data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm and model for training. Exploratory analysis of the data reveals that too many features have been introduced. To model and analyze with these features directly, the original features must be screened further, retaining only the important ones. Generally, features are selected from two perspectives:
Whether a feature diverges: if a feature does not diverge, for example if its variance is close to zero, then the samples barely differ on this feature. Most of its values may be identical, or even the entire feature may take a single value, in which case the feature contributes nothing to distinguishing samples.
Relevance of features to the target: features that are highly relevant to the target should be selected. Apart from the variance method, every other method described in this paper is concerned with correlation.
According to the form of feature selection, there are three feature selection methods:
Filter: a filtering method that scores each feature according to divergence or correlation, sets a threshold or the number of features to select, and selects features accordingly.
Wrapper: a wrapping method that selects or excludes several features at a time according to an objective function, such as recursive feature elimination, which uses a base model for multiple rounds of training; after each round, the features with the smallest weight coefficients are eliminated, and the next round trains on the reduced feature set (see the sketch after this list).
Embedded: an embedded method that first trains a machine learning algorithm or model to obtain the weight coefficient of each feature, then selects features by coefficient from large to small. It is similar to the Filter method, but the quality of the features is determined by training.
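As an illustration of the Wrapper approach, here is a minimal sketch using scikit-learn's recursive feature elimination; the base estimator, step size, and target feature count are illustrative assumptions, not choices stated in this paper:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Each round trains the base model, drops the `step` features with the
# smallest weight coefficients, and retrains on the remaining features.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=30,   # assumed target count
          step=5)
X_wrapped = rfe.fit_transform(X, y)  # X, y as in the loading sketch above
```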
To explore how different algorithms perform in the model, this paper compares several feature selection algorithms to obtain the best traffic classification model.
3.4.1 Variance filtering
To select the optimal hyperparameter, one can draw a learning curve to find the model's best point. However, this takes a lot of time and the improvement to the model is limited. In this paper, variance filtering with a threshold of 0.001 is used first to eliminate features that are obviously unneeded, after which a better feature selection method continues to reduce the feature count. Variance filtering removes the features whose variance is below the threshold, leaving 240 features.
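A minimal sketch of this step with scikit-learn, continuing the variable names of the earlier sketches:

```python
from sklearn.feature_selection import VarianceThreshold

# Drop every feature whose variance falls below 0.001; per the text above,
# this leaves 240 features.
selector = VarianceThreshold(threshold=0.001)
X_var = selector.fit_transform(X)
```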
After variance filtering, the next step is to select meaningful features that are related to the target label and can therefore provide useful information. Features unrelated to the label simply waste computing memory and may add noise to the model. Here, three common methods can be used to assess the correlation between features and labels: chi-square, F-test, and mutual information.
3.4.2 Chi-square filtering
Chi-square filtering is a correlation filter designed for discrete labels. The chi-square test computes a chi-square statistic between each non-negative feature and the label and ranks the features from high to low by that statistic. Combined with the scoring criterion, the K highest-scoring features are selected, removing the features most likely to be independent of the label and irrelevant to classification. In addition, if the chi-square test detects that all values in a feature are identical, it suggests applying variance filtering first. However, the choice of K is closely tied to model performance, so the best K must be searched for.
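A sketch of chi-square filtering with SelectKBest; K = 200 is a placeholder to be tuned, and since chi2 requires non-negative input the features are first scaled to [0, 1]:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X_pos = MinMaxScaler().fit_transform(X_var)   # chi2 needs non-negative values
X_chi2 = SelectKBest(chi2, k=200).fit_transform(X_pos, y)  # K is a placeholder
```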
The F-test, also known as ANOVA or the test of homogeneity of variance, is a filtering method that captures the linear relationship between each feature and the label. It can be used for regression or classification: F-test classification applies to data whose labels are discrete variables, and F-test regression to data whose labels are continuous variables. The output statistics can be used directly to decide what K to set. Note that the F-test is only well behaved when the data follow a normal distribution, so the data should be transformed toward normality before F-test filtering. In essence, the F-test looks for a linear relationship between two sets of data, under the null hypothesis that no significant linear relationship exists. It returns two statistics, F and p. As with chi-square filtering, we select features whose p-values are below 0.05 or 0.01 as significantly linearly related to the label, while features with p-values above 0.05 or 0.01 are considered to have no significant linear relationship with the label and should be deleted.
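A sketch of F-test filtering: f_classif returns the F and p statistics described above, and the features with p below 0.05 are kept:

```python
import numpy as np
from sklearn.feature_selection import f_classif

F, p = f_classif(X_var, y)
keep = p < 0.05        # features significantly linearly related to the label
X_f = X_var[:, keep]
```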
Mutual information is a filtering method that captures any relationship, linear or non-linear, between each feature and the label. Like the F-test, it can be used for both regression and classification, with mutual-information classification and mutual-information regression variants. Both are used in the same way and with the same parameters as the F-test, but the mutual information method is more powerful: the F-test can only find linear relationships, while mutual information can find relationships of any kind. Mutual information does not return statistics analogous to p or F values; it returns an estimate of the mutual information between each feature and the target, taking values in [0, 1], where 0 indicates the two variables are independent and 1 indicates they are fully correlated.
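A corresponding sketch with mutual information; the k value is again a placeholder to be tuned:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# A score of 0 means the feature is independent of the label; higher scores
# mean more shared information, whether the relationship is linear or not.
X_mi = SelectKBest(mutual_info_classif, k=150).fit_transform(X_var, y)
```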
3.4.3 Lasso
The Lasso algorithm minimizes the residual sum of squares subject to the sum of the absolute values of the model coefficients being less than a constant. For variable selection it outperforms stepwise regression, principal component regression, ridge regression, partial least squares, and similar methods, and it better overcomes the shortcomings of traditional approaches to model selection. Lasso regression is a regularization method and a form of compressed (shrinkage) estimation: by constructing a penalty function it obtains a more refined model, compressing some coefficients and setting others exactly to zero. It thus retains the advantage of subset shrinkage and is a biased estimator suited to data with multicollinearity. Because some regression coefficients become exactly zero, the Lasso method can be used directly for feature selection, and it is widely applicable to model improvement and selection. Model selection is essentially a search for a sparse representation of the model, which can be accomplished by optimizing a loss-plus-penalty problem. The advantage of the Lasso method is that it compensates for the deficiencies of least squares estimation and the locally optimal estimates of stepwise regression, selects features well, and effectively resolves multicollinearity among features. Its objective function can be expressed as:
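$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

where $\lambda \ge 0$ sets the strength of the L1 penalty: the larger $\lambda$, the more coefficients $\beta_j$ are driven exactly to zero. As an illustration of this Embedded approach, the following sketch keeps the features whose Lasso coefficients remain non-zero; the alpha value is an assumption (LassoCV can choose it by cross-validation), and the label is numerically encoded because Lasso is a regression model:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import LabelEncoder, StandardScaler

y_num = LabelEncoder().fit_transform(y)        # Lasso needs a numeric target
X_std = StandardScaler().fit_transform(X_var)  # L1 penalties are scale-sensitive
sfm = SelectFromModel(Lasso(alpha=0.01)).fit(X_std, y_num)
X_lasso = sfm.transform(X_std)                 # keeps non-zero-coefficient features
```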
3.5 Model Training
The training process of machine learning first defines a loss function, feeds in input samples, and obtains predictions by forward propagation. Comparing the predictions with the true samples gives the loss value; backpropagation then updates the weights, iterating until the loss is small and the accuracy reaches the desired value. The parameters at that point are those required by the model, i.e., the ideal model has been built. This paper divides the dataset into a training set and a test set in an 8:2 ratio. First, the training data are used to train a preliminary model; then the test data are used to evaluate it and check for overfitting.
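A minimal end-to-end sketch of the 8:2 split and the training check described above; the final classifier here is an illustrative choice, not the paper's stated model:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 8:2 split, stratified so every traffic class appears in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_lasso, y, test_size=0.2, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", accuracy_score(y_tr, model.predict(X_tr)))
print("test accuracy: ", accuracy_score(y_te, model.predict(X_te)))
# A large gap between train and test accuracy signals overfitting.
```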