Only Header: a reliable encrypted traffic classification framework without privacy risk

Encrypted traffic classification plays a critical role in network management, providing appropriate Quality-of-Service and Network Intrusion Detection. Conventional port-based and deep packet inspection approaches cannot classify encrypted traffic effectively. Methods based on machine learning can classify encrypted traffic by extracting statistical features of the flow. However, they require manual extraction of features. Recent studies show that the approaches based on deep learning are compelling for the task. They can automatically learn raw traffic features without manual feature extraction. However, these studies still take the payload of encrypted traffic as the model input, which may cause privacy risks. Besides, a massive encrypted payload causes great storage pressure on traffic classification. In this paper, we propose a reliable encrypted traffic classification framework by only using the flow header called Only Header, which avoids privacy risks and achieves lightweight storage. Firstly, we introduce a twice segmentation mechanism to dilute the interference traffic and increase the weight of effective traffic. Then, we use capsule neural networks (CapsNet) to learn spatial and byte features of the flow header. The Only Header’s effectiveness is compared with other methods using two public datasets, including ISCX VPN-nonVPN and ISCX Tor-nonTor datasets. The experimental results demonstrate that the Only Header outperforms the state-of-the-art encrypted traffic classification methods.


Introduction
Internet traffic classification aims to classify traffic by protocol type or behavior, and it has become a fundamental analytical technique for advanced network management. As a countermeasure against increasingly severe network threats, traffic classification can identify malicious behaviors and stop threats from spreading in time (Viegas et al. 2017). In addition, with the rapid development of network technology and the gradual rise of novel applications, traffic classification can also improve network resource utilization by providing precise knowledge of traffic types. Hence, traffic classification is crucial to network management, especially for Network Intrusion Detection (NID) and Quality-of-Service (QoS).
Contact: Dan Du, dudan@iie.ac.cn
In recent decades, plain-text network transfer has become a vulnerability with severe consequences, challenging both the regular use of the network and users' privacy. Therefore, more and more applications adopt secure protocols and tools such as SSL, VPN, and Tor to protect their traffic from being tapped by man-in-the-middle attacks (Gai et al. 2017). Meanwhile, to bypass detection by security software such as firewalls, malware uses encryption techniques to hide its communication content. Traffic encryption has thus become a standard practice adopted by both benign network applications and malware, for different purposes. Unfortunately, encrypted traffic poses a challenge to network management: the application-layer payload cannot be inspected, so traditional traffic classification approaches no longer work (Cao et al. 2014).
Recently, deep learning has performed well for encrypted traffic classification. On the one hand, many studies take the first N bytes (such as 784, 900, or 1024) of encrypted traffic as the model input. They then use a convolutional neural network (CNN), a stacked autoencoder (SAE), or other models to extract traffic features and achieve service and application classification (Lotfollahi et al. 2020; Yao et al. 2022; Shapira and Shavitt 2019; Zou et al. 2018). However, these studies directly touch the application payload, which can easily cause privacy problems (Taylor et al. 2014). In addition, since a large amount of application payload is used as a model feature, such methods put considerable pressure on data storage. On the other hand, some studies propose to learn sequence features, such as the packet sequence or message sequence of a flow (Shen et al. 2017; Yao et al. 2022; Shapira and Shavitt 2019). However, these methods are greatly affected by the environment and user habits (Fu et al. 2016) and therefore have low robustness.
To tackle the problems mentioned above, we propose a reliable encrypted traffic classification framework without privacy risk, Only Header. It uses only the flow header (shown in Fig. 1) as the model's input, which avoids privacy risks and reduces data storage pressure. In more detail, the model first extracts headers and splits traffic with a twice segmentation mechanism during preprocessing, to dilute the interference traffic and increase the weight of effective traffic. Then, it learns the spatial features and byte features of the flow header using CapsNet, which takes the location of fixed strings and the order between packets into consideration. Finally, the traffic is classified by a fully connected softmax layer. To demonstrate the effectiveness of the Only Header, we perform experiments on encrypted traffic identification, regular and VPN traffic classification, and regular and Tor traffic classification on the ISCX VPN-nonVPN dataset and the ISCX Tor-nonTor dataset. The experimental results demonstrate that our proposed model outperforms the state-of-the-art classification approaches. This paper is a further expansion and deepening of the previous research work (Cui et al. 2019).
Fig. 1 Packet header is the 16-byte record (packet) header in the libpcap file format definition, a very basic format for saving captured network data. Ethernet header is the 14-byte header of the Ethernet frame. IP header is the 20-byte header of the IPv4 packet. TCP/UDP header is the header of the TCP/UDP packet, comprising 20 or 8 bytes. CRC is a 4-byte cyclic redundancy check used to detect any in-transit corruption of data.
The main contributions of this paper are summarized as follows:
- We propose a reliable encrypted traffic classification framework without privacy risk, Only Header. It uses only the flow header as the model's input, which avoids privacy risks and reduces data storage pressure.
- We propose a novel encrypted traffic classification model using CapsNet. The model is effective because it takes into consideration both the location of fixed strings and the order between packets, which remain effective features of the traffic.
- We introduce a twice segmentation mechanism to increase the weight of effective traffic, which yields higher accuracy than the traditional traffic representations based on packets and flows.
- We evaluate the framework against the state-of-the-art methods on the publicly available ISCX VPN-nonVPN and ISCX Tor-nonTor datasets. The experimental results demonstrate the proposed model's effectiveness, measured by the accuracy of encrypted traffic identification, regular and VPN traffic classification, and regular and Tor traffic classification.

Related work
Traffic classification has attracted extensive attention from academia and industry and has produced abundant results (Dainotti et al. 2012; Velan et al. 2015). However, with the widespread adoption of encryption, port-based methods (Dainotti et al. 2012; Karagiannis et al. 2004; Moore and Papagiannaki 2005; Madhukar and Williamson 2006) and deep packet inspection (DPI) methods (Chen et al. 2010; Yeganeh et al. 2012; Sen et al. 2004; Bonfiglio et al. 2007; Korczynski and Duda 2014) are no longer suitable for encrypted traffic classification. Recently, methods based on machine learning (ML-based) and methods based on deep learning (DL-based) have shown effective classification results, since they can identify encrypted traffic by mining and learning its statistical features. In this section, we outline specific ML-based and DL-based methods for encrypted traffic classification.

ML-based methods for encrypted classification
ML-based methods extract statistical features such as packet size and duration from traffic samples and then use appropriate ML algorithms to learn these features for encrypted traffic classification. These methods mainly include two parts: feature extraction and model selection. In feature extraction, Moore et al. (2013) propose almost 250 flow or packet features for encrypted classification. Okada et al. (2011) analyze 49 flow features of encrypted and non-encrypted traffic and obtain strongly correlated features such as mean packet size, inter-arrival time (IAT), and transfer time. In general, although time-related features have outstanding classification capability, they show poor robustness (Velan et al. 2015). Therefore, unless the traffic classifier is designed for a specific network, time-related features easily make its performance unstable.
The machine learning models used in encrypted traffic classification mainly include supervised and semi-supervised learning models. Okada et al. (2011) propose an encrypted classification method based on the estimation of traffic features, called EFM, and combine it with several supervised learning models (SVM, Naive Bayes, C4.5) to achieve application classification of encrypted traffic. Arndt and Zincir-Heywood (2011) compare C4.5, k-means, and the Multi-Objective Genetic Algorithm (MOGA) in encrypted classification. C4.5 shows the best robustness, while MOGA shows the lowest false positive rate. Bar-Yanai et al. (2010) propose a real-time classification model for encrypted traffic that combines the k-means and KNN algorithms, taking advantage of the low complexity of k-means and the accuracy of KNN. Zhang et al. (2012) propose an improvement to the k-means algorithm that uses the harmonic mean to reduce the impact of random initial clustering scores, which increases the accuracy of k-means for encrypted traffic classification.
Almost without exception, the ML-based methods above share a common disadvantage: they rely heavily on feature selection. This process requires comprehensive prior knowledge of the field, so essential features may be lost. Moreover, these methods are difficult to transfer to a new scenario.

DL-based methods for encrypted classification
Deep learning is an effective way to solve the problem of feature design. It can automatically select features from the raw traffic during training instead of extracting features manually (Goodfellow et al. 2016). In previous studies, DL-based methods usually take the raw traffic data as input, which includes the underlying protocol layers and the upper application data. Specifically, Wang (2015) extracts the first 1000 bytes of a TCP flow and uses a stacked autoencoder (SAE) to achieve encrypted protocol classification. Another study selects the first 784 bytes of the raw traffic and then uses a one-dimensional convolutional neural network (1dCNN) to learn the spatial features for encrypted service classification. Lotfollahi et al. (2020) use the IP header and the first 1480 bytes of the IP packet payload as the input of CNN and SAE models to achieve encrypted service and application classification. Zou et al. (2018) combine CNN and Long Short-Term Memory (LSTM), using CNN to learn in-packet features of the first 784 bytes of a single packet and LSTM to learn inter-packet features of any three consecutive packets. Other similar studies achieve comparably high classification accuracy (Yao et al. 2019; Zeng et al. 2019; Cui et al. 2019). All of these methods extract the first N bytes of encrypted traffic and then learn the spatial, sequence, and byte features of the traffic through suitable deep learning models.
Other methods learn features from the time sequence of traffic (such as the packet length sequence or the message sequence), using models such as Markov chains and LSTM. Yao et al. (2022) regard an encrypted network flow as a time sequence and build an attention-based LSTM model to learn the flow's sequence features. Shapira and Shavitt (2019) create images from the sequence features of packet size and arrival time and use a CNN to learn the images' spatial features.
However, the methods based on raw traffic bytes suffer from privacy problems and massive storage pressure. First, these methods all use the encrypted application payload as one of the model's features. Because of the encryption algorithm, the application payload is irregular ciphertext and does not contain useful features, while taking it as a feature increases the pressure on data storage. Second, Taylor et al. (2014) show that although most applications currently use encryption protocols to protect user data, 80% of applications have both encrypted and unencrypted connections. Developers usually weigh the importance and cost of data: information such as passwords and locations is transmitted in encrypted form, while other information is still transmitted in plain text. As a result, 78% of applications have privacy issues.
In addition, the methods based on the time sequence of traffic are unstable and poor in robustness. Although these methods do not involve traffic application payload, the flow's sequence features are easily affected by network performance and user habits, resulting in large differences (Fu et al. 2016). Therefore, such methods are less robust.
To design an encrypted traffic classification method that is robust and does not involve user privacy, we focus on using deep learning models to mine the differences in the flow header between traffic categories. On the one hand, this avoids privacy problems. On the other hand, the flow header is lightweight, which reduces data storage pressure.

The proposed model
In this section, we present Only Header, a reliable encrypted traffic classification framework without privacy risk. It uses CapsNet to learn the spatial features and byte features of the flow header. The details of the proposed model are shown in Fig. 2. It consists of two parts: extracting the flow header and training CapsNet for encrypted classification. Finally, the fine-tuned model is applied to encrypted traffic identification, regular and VPN traffic classification, and regular and Tor traffic classification.

Extracting flow header
Extracting the flow header is the first part of the Only Header and consists of three steps: extracting the header, twice segmentation, and flow padding. We advocate extracting only the flow header from the raw traffic, which avoids privacy risk and reduces data storage pressure. In addition, we introduce a twice segmentation mechanism to dilute the interference traffic and increase the weight of effective traffic. Together, these steps accomplish header byte extraction, traffic segmentation, traffic cleaning, and traffic standardization.

Extracting header
The application payload easily causes privacy risks and puts enormous pressure on data storage, so we extract only the header of every packet in the flow, including the packet header, IP header, and TCP/UDP header. In addition, we delete the IP addresses in the packet to avoid model overfitting (Lotfollahi et al. 2020). When datasets are collected in a limited environment with a limited number of hosts, the IP address is highly likely to become an important feature for distinguishing classes. In a real environment, however, the IP address is merely a numerical label assigned to a device on a network that uses the Internet Protocol for communication, so it cannot be used as a feature to distinguish traffic classes.
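As an illustration of this step, the following minimal Python sketch builds the retained header bytes of one packet. It is a sketch under stated assumptions, not the authors' implementation: it assumes Ethernet II framing without VLAN tags, a 20-byte IPv4 header without options, and a fixed 20-byte transport slice (as the packet segmentation section specifies), zero-padded when the segment is shorter. The function name `extract_header` and the constants are hypothetical.

```python
PCAP_RECORD_LEN = 16  # libpcap per-packet record header
ETHERNET_LEN = 14     # Ethernet II header, skipped (assumes no VLAN tag)
IPV4_LEN = 20         # IPv4 header without options (assumption)
TRANSPORT_LEN = 20    # first 20 bytes of the TCP/UDP segment


def extract_header(record_header: bytes, frame: bytes) -> bytes:
    """Return record header + IPv4 header without addresses + transport slice."""
    if len(record_header) != PCAP_RECORD_LEN:
        raise ValueError("expected a 16-byte pcap record header")
    ip = frame[ETHERNET_LEN:ETHERNET_LEN + IPV4_LEN]
    if len(ip) < IPV4_LEN:
        raise ValueError("frame too short for an IPv4 header")
    # Bytes 12..19 of the IPv4 header hold the source and destination
    # addresses; keeping only the first 12 bytes deletes both.
    ip_without_addresses = ip[:12]
    start = ETHERNET_LEN + IPV4_LEN
    transport = frame[start:start + TRANSPORT_LEN]
    # Assumption: zero-pad when fewer than 20 transport bytes are present.
    transport = transport.ljust(TRANSPORT_LEN, b"\x00")
    return record_header + ip_without_addresses + transport
```

Under these assumptions each packet contributes 16 + 12 + 20 = 48 bytes. Note that for UDP the 20-byte slice extends 12 bytes past the 8-byte UDP header, so a stricter variant could zero those bytes instead.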

Flow Segmentation
DL-based methods need to divide the continuous traffic into discrete units according to a particular granularity. Raw traffic P is a set of packets of different sizes, denoted as:
$$P = \{p_1, p_2, \ldots, p_{|P|}\} \tag{1}$$
where $|P|$ is the number of packets in P and $p_i$ is the i-th packet in P, which is defined as:
$$p_i = (x_i, b_i, t_i) \tag{2}$$
where $x_i$ is the five-tuple (source IP, source port, destination IP, destination port, transport layer protocol) of the i-th packet, $b_i$ is the byte length of the i-th packet, and $t_i$ is the start time of the i-th packet.
Fig. 2 Framework of the Only Header for encrypted traffic classification
Raw traffic is first segmented by flow, because the flow is frequently used in current traffic classification studies (Dainotti et al. 2012). A flow F is a group of packets in P that have the same five-tuple. The flow in this paper refers specifically to a bi-directional flow, that is, the source IP address and source port can be interchanged with the destination IP address and destination port. F is defined as:
$$F = \{p_1, p_2, \ldots, p_n\} \tag{3}$$
where $n \le |P|$ is the number of packets in F.
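The flow segmentation above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the `Packet` type, its field names, and the helper names are hypothetical. The direction-independent key orders the two endpoints so that both directions of a conversation fall into the same bi-directional flow.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    length: int   # byte length of the packet
    time: float   # start time of the packet


def flow_key(p: Packet):
    """Five-tuple with endpoints ordered, so both directions share one key."""
    a = (p.src_ip, p.src_port)
    b = (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.protocol)


def split_flows(packets):
    """Group packets into bi-directional flows, each ordered by start time."""
    flows = defaultdict(list)
    for p in sorted(packets, key=lambda q: q.time):
        flows[flow_key(p)].append(p)
    return dict(flows)
```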

Packet Segmentation
Actual network traffic usually contains a massive number of small flows that are unrelated to the traffic class, such as SNMP, DNS, and ARP, and these hinder effective classification. Because larger flows carry the main activity of a communication and contain less unrelated traffic, we propose a packet segmentation step to dilute unrelated flows and increase the weight of valid flows. It splits a flow into consecutive segments by setting a maximum number of packets per segment. $G_i$ denotes the i-th traffic segment in F and is defined as:
$$G_i = \{p_1, p_2, \ldots, p_m\}, \quad m \le C \tag{4}$$
where m is the number of packets in $G_i$, and C is the maximum number of packets, defined as:
$$C = \left\lfloor \frac{L_{sample}}{L_{header}} \right\rfloor \tag{5}$$
where $L_{sample}$ denotes the byte length of a sample, and $L_{header}$ denotes the total byte length of the packet header, the IP header (with the 4-byte source IP address and the 4-byte destination IP address deleted), and the TCP/UDP header. Since the TCP header is 20 bytes and the UDP header is 8 bytes, we uniformly select the first 20 bytes of the TCP/UDP packet in order to preserve the maximum header information. Thus, the maximum total header length is 48 bytes. The reason for this setting is that we hope to make full use of $G_i$ to predict the whole flow accurately. In our view, the more packets $G_i$ has, the more representative it is. Thus, we let C take its maximum value.
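The arithmetic behind the maximum packet count can be checked with a short sketch. It assumes a 784-byte sample (the size used in the flow padding step) and the 48-byte header length derived above (16 + 12 + 20); the function names are hypothetical.

```python
def max_packets(sample_len: int = 784, header_len: int = 48) -> int:
    """Largest number of whole per-packet headers that fit in one sample."""
    return sample_len // header_len


def segment_flow(flow, c=None):
    """Split a flow (a list of packets) into consecutive segments of at most c packets."""
    if c is None:
        c = max_packets()
    return [flow[i:i + c] for i in range(0, len(flow), c)]
```

With these defaults a flow of 40 packets yields three segments of 16, 16, and 8 packets.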

Flow padding
Neural networks require a fixed input size, so each traffic segment produced by the steps above is unified to 784 bytes. As in Lotfollahi et al. (2020) and Yao et al. (2022), we use 0x00 as the padding value. A 0x00 value does not change the parameters of the neural network and therefore does not introduce bias to the classification results. When $G_i$ is larger than 784 bytes, only the first 784 bytes are retained. Otherwise, 0x00 bytes are appended to extend it to 784 bytes. In addition, to make the traffic a normative input to the following model, we reshape the 784 bytes into a 28*28 matrix.
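A minimal sketch of the padding and reshaping step is given below. The helper names are hypothetical, and plain nested lists stand in for whatever tensor type the training code would actually use.

```python
def pad_sample(data: bytes, size: int = 784) -> bytes:
    """Truncate to `size` bytes, or append 0x00 bytes up to `size`."""
    return data[:size].ljust(size, b"\x00")


def to_matrix(sample: bytes, side: int = 28):
    """Reshape a side*side byte string into a side x side matrix of ints."""
    if len(sample) != side * side:
        raise ValueError("sample length must equal side * side")
    return [list(sample[r * side:(r + 1) * side]) for r in range(side)]
```

For example, a full segment of 16 packets with 48 header bytes each gives 768 bytes, so the last 16 entries of the matrix are padding zeros.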

Training CapsNet for encrypted classification
Training the CapsNet model is the second part of the Only Header. We use CapsNet, which consists of convolution operations and dynamic routing, to classify the 28*28 traffic matrices. The model structure is shown in Fig. 3. Unlike traditional neural networks, the inputs and outputs of capsules are vectors instead of scalars. The length of a vector indicates the probability that the traffic belongs to a class, and its direction indicates attributes of the features such as size and position. In addition, compared to CNN, CapsNet no longer uses pooling operations. Pooling reduces connection parameters and refines features, but it also discards some necessary information, including accurate location information.

Convolution operation
The CapsNet model reads the 28*28*1 traffic matrices produced by the preprocessing above. Their values range from 0 to 255, so we first normalize the matrices to limit the value range to [0,1]. In the ReLU Conv1 layer, a convolution with stride 1 is performed on the traffic matrix using 256 convolution kernels of size 9*9, generating 256 traffic feature matrices of size 20*20.
Subsequently, the second convolutional layer (PrimaryCaps) serves as the input of the capsules and constructs the tensor structure. Specifically, we perform 8 differently weighted Conv2d operations on the 256 traffic feature matrices, each applying 32 convolution kernels of size 9*9 with a stride of 2. As a result, 6*6*32 vectors with a dimension of 8 are generated. Each vector is a new capsule unit formed by 8 common convolution units. The length of a capsule indicates the probability that the traffic belongs to a class. The direction of the capsule indicates the attributes of the traffic (location of fixed strings, the order between packets).
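The layer sizes quoted above follow from the standard output-size formula for a convolution without padding, which the short sketch below evaluates. The absence of padding is an assumption consistent with the stated 20*20 and 6*6 sizes; the function name is hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1


conv1_side = conv_out(28, 9, 1)             # Conv1: 28 -> 20
primary_side = conv_out(conv1_side, 9, 2)   # PrimaryCaps: 20 -> 6
num_capsules = primary_side * primary_side * 32  # 8-dimensional capsules
```

This gives 20*20 feature matrices after Conv1 and 6*6*32 = 1152 primary capsules.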

Dynamic routing
The third layer, DigitCaps, propagates and updates the input capsules. The capsule processing is divided into two steps: linear combination and routing. For the linear combination, the output activity vector $u_i$ of a lower-layer capsule is multiplied by a weight matrix $W_{ij}$ to obtain a prediction vector $\hat{u}_{j|i}$, and the total input $s_j$ of a higher-layer capsule is the weighted sum of the prediction vectors, given by
$$\hat{u}_{j|i} = W_{ij} u_i, \qquad s_j = \sum_i c_{ij}\, \hat{u}_{j|i} \tag{6}$$
where $c_{ij}$ is a coupling coefficient determined by the iterative dynamic routing process. For the dynamic routing mechanism, to find the most suitable path between a capsule's output and the next layer's input, $c_{ij}$ in (6) is updated by
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{7}$$
where $b_{ij}$ is the logarithmic prior probability that capsule i is coupled to capsule j. The length of a capsule's output vector lies in [0,1], indicating the probability of a certain class. Thus, a squashing function is used to compress the vectors, defined as follows:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|} \tag{8}$$
where $v_j$ is the output vector of capsule j and $s_j$ is its total input. $W_{ij}$ and the other convolution parameters of the entire network are updated through the loss function. We use the margin loss commonly used in SVM as the loss function, defined as:
$$L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2 \tag{9}$$
where c is the predicted class and $T_c$ is an indicator function that equals 1 if c is correct and 0 otherwise. $m^+$ is the upper boundary of $\|v_c\|$, $m^-$ is the lower boundary, and $\lambda$ is the regularization strength. We adopt reconstruction loss to avoid overfitting.
Hyperparameter tuning is essential to achieving the best classification performance. We train our model using the Adam algorithm with a batch size of 128 examples. The learning rate is set to 0.001, and the number of training epochs is set to 100. Moreover, Table 1 describes the main parameters of each layer in our model. We also apply threefold cross-validation on the datasets to validate the results.
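To make the routing step concrete, here is a small pure-Python sketch of the squashing function, routing-by-agreement, and the margin loss. It is a didactic sketch, not the authors' TensorFlow implementation. Several details are assumptions taken from the original CapsNet formulation (Sabour et al. 2017) because this section does not state them: the agreement update that adds the dot product of a prediction vector and the capsule output to the routing logit, three routing iterations, and the values m+ = 0.9, m- = 0.1, and lambda = 0.5. The prediction vectors are passed in directly as `u_hat[i][j]`, and all function names are hypothetical.

```python
import math


def squash(s):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(row):
    """Numerically stable softmax over one row of routing logits."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from lower capsule i to upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start at zero
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients
        v = []
        for j in range(n_out):
            # Weighted sum of prediction vectors, then squash.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # Assumed agreement update (Sabour et al. 2017).
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v


def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over classes; lengths[c] is the capsule length of class c."""
    total = 0.0
    for c, length in enumerate(lengths):
        t_c = 1.0 if c == target else 0.0
        total += t_c * max(0.0, m_plus - length) ** 2
        total += lam * (1.0 - t_c) * max(0.0, length - m_minus) ** 2
    return total
```

When all lower capsules predict the same direction for one upper capsule, its coupling coefficients grow across iterations and its output length approaches 1, which is the behavior the routing mechanism is meant to produce.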

Dataset
The most critical condition for training deep learning models is the availability of large, representative datasets. However, the lack of available datasets is an essential factor hindering traffic classification. To demonstrate the effectiveness of the proposed method, we use the ISCX VPN-nonVPN dataset (Draper-Gil et al. 2016) and the ISCX Tor-nonTor dataset (Lashkari et al. 2017) to evaluate the Only Header. The ISCX VPN-nonVPN dataset provides 150 raw traffic files, including 7 kinds of conventional encrypted pcap files (chat, streaming, etc.) and 7 kinds of VPN pcap files (VPNchat, VPNstreaming, etc.). The ISCX Tor-nonTor dataset provides 85 raw traffic files, including 8 kinds of conventional traffic [...].
In the ISCX VPN-nonVPN dataset, the authors label the 150 traffic files according to specific applications rather than services, which makes some traffic files ambiguous. In particular, browsing is HTTPS traffic generated when browsing or executing a task that involves a browser (Draper-Gil et al. 2016). We cannot be sure whether certain traffic files, such as hangoutVoIP, belong to browsing or to Voip. Therefore, we delete the browsing and VPNbrowsing labels, changing 14 classes to 12 classes. In addition, because the nonTor traffic in the ISCX Tor-nonTor dataset is derived from the ISCX VPN-nonVPN dataset, we use only its Tor traffic. Finally, we obtain three types of encrypted traffic: regular encrypted traffic, VPN traffic, and Tor traffic.
According to (5), the maximum number of packets in packet segmentation is set to 16. Application and the total number of samples are shown in Table 2.

Experimental environment
In this paper, we use Python 3 and TensorFlow as the software framework, running on a 64-bit Ubuntu 16.04 OS. The server is a DELL R720 with 16 CPU cores and 128 GB of memory. An Nvidia GM204GL GPU is used as the accelerator.

Evaluation metric
We use accuracy (Acc), precision (Pre), recall (Rec), and F-measure (F1) to evaluate the proposed method's ability to identify network traffic. Accuracy evaluates the overall effect of the method. Precision and recall reflect the recognition performance of the method on each class. F-measure is a combined index of precision and recall.
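The paper does not spell out the formulas, so the following sketch uses the standard definitions, computed per class from a confusion matrix: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two. The function name and the row/column convention are assumptions made for illustration.

```python
def per_class_metrics(confusion):
    """confusion[t][p] counts samples of true class t predicted as class p.

    Returns (accuracy, [(precision, recall, f1) for each class]).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    report = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[t][k] for t in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp                       # truly k, predicted other
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        report.append((pre, rec, f1))
    return accuracy, report
```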

Traffic preprocessing evaluation
In extracting flow header, we observe that application payload can cause privacy risk and cannot be regarded as effective features in encrypted traffic classification. Therefore, we only extract the flow header to avoid privacy trouble and reduce data storage pressure. Moreover, we introduce the twice segmentation mechanism to split traffic. The segmentation mechanism performs flow segmentation and packet segmentation on the traffic. It can achieve the purposes of diluting the proportion of unrelated traffic and increasing effective traffic weight. In order to evaluate the above traffic preprocessing methods, we perform encrypted traffic identification, regular and VPN traffic classification, regular and Tor traffic classification on the ISCX VPN-nonVPN and ISCX Tor-nonTor datasets. The detailed task description is shown in Table 3.

Analysis of the flow header
For the flow header analysis, we apply t-SNE dimensionality reduction to visualize regular encrypted traffic, VPN traffic, and Tor traffic. The results are shown in Fig. 4. The t-SNE visualization performs well regardless of whether the application payload is included. However, in the visualization of all data, some sample points of regular encrypted traffic and VPN traffic are not easy to separate, while the visualization of the flow header performs better. Intuitively, the flow header is more conducive to traffic classification. Because of the encryption algorithm, the application payload is randomized ciphertext. When all flow data are embedded with t-SNE, the payloads of the three categories show no difference, so the samples are difficult to aggregate into three clusters. When only the flow header is used, they aggregate easily into three clusters, because the flow header is not encrypted and its values are strongly correlated with the network environment and application type.
Next, we analyze the impact of using only the flow header on data storage. First, we count the packet size distribution of regular encrypted traffic, VPN traffic, and Tor traffic, shown in Fig. 5. Figure 5a depicts the packet size distribution for each class, with each class containing 100 packets randomly selected from the entire transmission conversation. Figure 5b depicts the proportion of packet sizes in different intervals for all transmission conversations in the six classes. Except for Email traffic, traffic is usually transmitted in large packets (around 1500 bytes). In addition, packets of 1280-1514 bytes account for the highest percentage, 31.9%, and packets larger than 79 bytes account for 74.7%. This indicates that most packets contain application payload.
More intuitively, we compare the sizes of the three types of encrypted traffic with and without the application payload, as shown in Table 4. All data are 11-15 times larger than the flow header. Therefore, if the application payload cannot generate effective features, extracting only the flow header significantly reduces data storage pressure during encrypted traffic classification.
To verify the performance of the flow header, we perform three groups of experiments, whose tasks are detailed in Table 3. According to Table 5, using only the flow header does not reduce the classification performance in any of the three tasks. On the contrary, the classification results of the Only Header are as good as those obtained with all data in every task, and even better in regular and Tor traffic classification (Exp 3). When the application payload is retained, the large amount of randomized encrypted payload cannot serve as effective features for classification. When it is removed, the flow header has a regular field distribution, which is more conducive to mining the spatial features and byte features that differ between classes. Therefore, this process not only maintains classification accuracy but also avoids privacy risk and greatly reduces data storage pressure.

Analysis of twice segmentation mechanism
We propose the twice segmentation mechanism to increase the weight of effective traffic. To evaluate it, we perform the three groups of tasks listed in Table 3. Moreover, to verify that CapsNet is more suitable than 1dCNN for traffic classification tasks, we compare both models on regular and VPN traffic classification (Exp 2). The results are shown in Table 6 and Fig. 6. There, flow means that the whole flow is the identification object for classification, that is, only the flow segmentation step of the twice segmentation is performed. Twice segmentation means that the traffic obtained after both flow segmentation and packet segmentation is the identification object. As Table 6 shows, in encrypted traffic identification (Exp 1), both the accuracy and the F-measure reach 99.9% with or without the twice segmentation mechanism. In regular and VPN traffic classification (Exp 2), the twice segmentation mechanism improves the accuracy by 2.0% and the F-measure by 3.8%. In regular and Tor traffic classification (Exp 3), it increases the accuracy by 0.9% and the F-measure by 1.5% over raw flow.
Figure 6 describes the performance of CapsNet and 1dCNN on regular and VPN traffic classification. With the 1dCNN model, 10 kinds of traffic (all except the Chat and Email classes) achieve better results using twice segmentation. Besides, the F-measures of two classes, File and Voip, are below 80% using flow. In contrast, twice segmentation improves the F-measure, making each class reach more than 90%. With the CapsNet model, all kinds of traffic perform better using twice segmentation than using the traditional flow, and each of them is above 97%. Moreover, for most classes of traffic, CapsNet performs better than 1dCNN whether or not twice segmentation is applied. Compared to the other combinations, the F-measure of each class achieves its best value with twice segmentation and the CapsNet model.
In summary, our proposed twice segmentation mechanism has shown better experimental results in encrypted traffic classification. In addition, no matter whether we execute packet segmentation or not, CapsNet shows higher accuracy and F-measure than 1dCNN.

Baseline experiments comparison
In this subsection, we perform the three experiments listed in Table 3 to evaluate the Only Header and compare the results with baseline methods on the ISCX VPN-nonVPN and ISCX Tor-nonTor datasets. Because the accuracy and F-measure for encrypted traffic identification reach 99.9%, and the baseline methods can also achieve 99% accuracy, we do not compare this task further.

Comparison on regular and VPN traffic classification
We compare the Only Header with baseline methods for regular and VPN traffic classification. To evaluate its effectiveness, we apply the preprocessing above to the raw traffic in the ISCX VPN-nonVPN dataset. The experiment shows that the precision and recall of each class are high, as reported in Table 7. The F-measure of 9 classes (all except Chat, Email, and Voip) reaches 99%. In addition, the F-measure of the VPN traffic classes is better than that of the regular encrypted traffic classes, indicating that the Only Header performs especially well in VPN traffic classification. The row-normalized confusion matrix of the Only Header for regular and VPN traffic classification is shown in Fig. 7. All of the traffic classes on the diagonal show a deep blue color, indicating the effective classification ability of the Only Header for regular and VPN traffic classification.
Compared with the deep-learning baseline methods for encrypted traffic service classification on the ISCX VPN-nonVPN dataset, Table 8 reports that the accuracy and F-measure of the Only Header are 7.2% and 6.2% higher, respectively, than those of the CNN-LSTM of Zou et al. (2018), which is the best method to our knowledge. In short, the Only Header performs better and meets the standard of practical application.

Comparison on regular and Tor traffic classification
We compare the Only Header with baseline methods for regular and Tor traffic classification.
[Baseline table row recovered from extraction: Zou et al. (2018) | CNN and LSTM | Flow | 92 | 91]
[...] flow and then uses CNN to learn the spatial features of the image to build an encrypted traffic classification model.
Tor traffic only supports the encrypted links and TCP flow over the Internet. It is complicated to trace and analyze its traffic (Lashkari et al. 2017). We use regular encrypted traffic in the ISCX VPN-nonVPN dataset and Tor traffic in the ISCX Tor-nonTor dataset to implement regular and Tor traffic classification. As shown in Table 9, we observe that all kinds of traffic can reach more than 99% except Chat and Email. The accuracy of classification is 99.3%. Besides, the precision and recall of both TorEmail and TorVoIP are 100%. The confusion matrix of the Only Header with rows normalized for regular and Tor traffic classification is shown in Fig. 8. As the figure shows, all of the traffic classes on the diagonal show the deeper blue color, indicating the effective classification ability of the Only Header for regular and Tor traffic classification.
Compared with other approaches, Table 10 shows that the accuracy of the Only Header is 15.3% and 13.6% higher than that of the baseline methods, and its F-measure is 15.2% and 34.27% higher. Therefore, the Only Header greatly improves Tor traffic classification.

Conclusion
Based on an analysis of current research on encrypted traffic classification, this paper proposes a reliable encrypted traffic classification framework without privacy risk. It uses the CapsNet model to learn the spatial and byte features of the flow header, which avoids privacy problems and reduces data storage pressure. Besides, the Only Header is more suitable for encrypted traffic classification tasks than other methods because it takes into account the location of fixed strings and the order between packets. Meanwhile, the Only Header increases the weight of effective traffic through a twice segmentation mechanism, which yields higher accuracy than traditional traffic representations such as packet and flow.
The experimental results show that our study yields significant improvements over the state-of-the-art methods on the ISCX VPN-nonVPN and ISCX Tor-nonTor datasets.