Illegal activity detection on bitcoin transactions using deep learning

Forensic investigations increasingly leverage artificial intelligence (AI/ML) to identify illegal activities on bitcoin. Bitcoin transactions have a native graph (network) structure, which is complex yet informative. However, machine learning applications on bitcoin have given limited attention to developing end-to-end deep learning frameworks designed to exploit the bitcoin graph structure. To identify illegal transactions on bitcoin, the current paper extracts nineteen features from the bitcoin network and proposes a deep learning-based graph neural network model using spectral graph convolutions and transaction features. The proposed model is compared with two state-of-the-art techniques, viz., a graph attention network (GAT2) and an extreme gradient boosted decision tree (XGBOOST) trained on convoluted features, for classification of illegal transactions on bitcoin. To understand the efficacy of the proposed model, a dataset was collected consisting of 13310125 transactions of 2059 entities having 3152202 bitcoin account addresses and belonging to 28 categories of users. Two sets of experiments are performed on the datasets: labeling transactions as legal or illegal (binary classification) and identifying the originator of the transaction as one of the twenty-eight types of entities (multi-class classification). For fast and accurate decisions, binary classification is appropriate, and for pinpointing the category of bitcoin users, a multi-class classifier is suitable. On both tasks, the proposed models achieved a maximum of 92% accuracy, validating the methodology and the suitability of the model for real-world deployment.


Background
Since its inception in 2009, bitcoin has been mired in controversy for providing a haven for criminal activity. Many kinds of unlawful users take cover under the secrecy or anonymity provided by bitcoin, and revealing such elements is crucial for forensic or cybercrime investigations (Fang et al. 2022). Prominent critics of bitcoin argue that it is anti-social and anti-transparency, as its anonymity and security create obstacles for law enforcement seeking to follow dubious exchanges (Min et al. 2019). Since bitcoin's remarkable growth in transactions during 2012-2016, several kinds of clients, viz. mixing services, betting sites, trading exchanges, financial specialists, speculators, and autonomous mining enterprises (Hu et al. 2019), have entered the bitcoin ecosystem. The period from 2012 onwards saw the emergence of Ponzi schemes, money laundering, frauds (Kumar et al. 2021a), misappropriations, extortion (Kumar et al. 2021b; Paquet-Clouston et al. 2019), and tax evasion (Conti et al. 2022) schemes that exploited the cover of anonymity afforded by the bitcoin cryptocurrency to misdirect the audit trail. It was estimated that in 2017, BTCs worth $770 million were exchanged for unlawful purposes (Lee et al. 2020), a quarter of bitcoin users were malicious, and 46% of all bitcoin activity was illicit (Sean et al. 2019). The main bottlenecks in forensic operations are: limited research on feature engineering to identify each category of commercial users; limited options for deanonymizing bitcoin users and labeling transactions by category of commercial user; and machine learning work restricted to identifying a small set of commercial user categories, even though new categories of commercial users have emerged in recent times that older research did not target. There is a need for baseline data, models, and predictive results for identifying each category of commercial users.
Secondly, there is a lack of ground-truth labeled datasets for machine learning and a need for an updated study given the fast-changing technology landscape; finally, open-source work in bitcoin forensic tool development is needed.
Feature engineering is fundamental to ML applications, and the possibilities for feature engineering and extraction on bitcoin are vast because of the broad classes of metadata associated with the blockchain. In the last decade, AI and deep learning have found applications in image recognition, object localization, and signal and speech processing. However, AI has made limited progress in domains like cryptocurrencies because of the absence of benchmark public datasets (Table 3) (Wei et al. 2018), because research groups lack the organizational infrastructure to handle the full blockchain data, and because of the absence of ground-truth data on the identities of bitcoin users. Apart from these issues, pseudo-anonymity in digital currencies allows users to transact with one another through hash addresses or keys. These keys are reusable and can be created and discarded without limit, which complicates techniques for attributing transactions to a user.
Existing studies predominantly focus on feature engineering/extraction followed by supervised learning (see Table 2) to identify illicit activities on bitcoin. Such investigations require deanonymizing methods (see Sect. 2.3) to connect numerous keys to a single entity. Even the most popular deanonymizing procedures have limits because of the use of mixing services (Herrera-Joancomartí 2014; Gaihre et al. 2018). Besides, to reduce the computational complexity of AI models, the target of interest has been confined to restricted classes of illegal users, and the time interval over which blockchain data was gathered for feature engineering has been restricted to shorter ranges. Due to such factors, the techniques used for inference after training did not generalize well. These issues limit the efficacy of forensic investigations. Deviating from existing methodologies for detecting illegal activities on bitcoin, the current paper proposes supervised learning approaches on the network structure of bitcoin transactions.

Contributions
- This paper represents, to our knowledge, the largest study conducted on the identification of suspicious bitcoin addresses: the dataset collected for this paper contains 13310125 transactions of 2059 entities having 3152202 bitcoin addresses and belonging to 28 categories. We carried out extensive data collection and labeled each data sample based on information scraped from multiple sources.
- Released the dataset and scripts to motivate further research.
- To address these gaps, our work identifies nineteen different features for separating different users (usage history, usage frequency, and usage quantity).
- Feature engineering to identify the characteristic transaction patterns observed in illegal activities.
- Proposed supervised learning models for detecting illegal activities on bitcoin. Due to the graph structure of the data, existing machine learning models (XGBOOST and GAT2) were underfitting (the difficulty of applying machine learning to graph-structured data is known and noted in our literature review), so a graph neural network-based method was used.
- Conducted extensive experiments on transaction graphs from 2009-2020 to validate the proposed approach and highlighted future work.

The current research is of significant importance to the development of bitcoin forensic investigation tools. In our opinion, the research community will benefit from knowledge of fast and accurate graph neural network methods trained using graph features (the adjacency matrix) and our proposed features. Further, the memory needed is less than for XGBOOST and GAT2 (the preferred existing methods). The methodology relies on deep learning on the transaction graph of the bitcoin network: handcrafted feature engineering is minimized, and a data-driven approach to learning features from the transaction graph is used. An additional benefit of the proposed approach (Fig. 1 shows the steps of ML on the bitcoin system) is that it processes the transaction graph directly and avoids the preprocessing step of deanonymizing addresses. Although the transaction graph of bitcoin has been used in previous studies to track the flow of funds (Pinna et al. 2018) or in case studies investigating individual scams (Kumar et al. 2021b; Greaves and Au 2015; Bistarelli et al. 2018), large-scale modeling of illegal activities using the transaction graph was uncharted.

Outline
The remaining parts of the paper are organized as follows: Preliminaries (see Sects. 2.1, 2.2, and 2.3) and state-of-the-art research (see Sect. 2.4) in bitcoin forensics are described in Sect. 2. Dataset collection and preprocessing, along with the methodology, are outlined in Sect. 3. Section 4 gives the mathematical model for the proposed semi-supervised learning approach for detecting illegal activities. The experimental study and discussion are in Sect. 5, followed by lessons learnt and future work in Sect. 6.

Bibliographic studies
The data structure and reference implementation of bitcoin and its principal concepts, for example, blocks, blockchain, transactions, input keys, output keys, service providers on bitcoin, and deanonymization, are described in Sects. 2.1, 2.2, and 2.3. This is followed by a critical review of published research on identifying illicit users (see Sect. 2.4), research gaps in public databases (see Sect. 2.5), and the machine learning models used in the literature (see Sect. 2.6).

Data structure and reference implementation of bitcoin
Bitcoin transactions are appended to "blocks" and recorded after consensus on a distributed public ledger, the "blockchain." Each transaction has several inputs (senders) and outputs (recipients). The metadata 2 associated with blocks, transactions, inputs, and outputs gives scope for further examination. A single bitcoin user can generate numerous addresses for sending and receiving BTCs, which creates an obstacle to investigating bitcoin users. Deanonymizing procedures provide a way to overcome this issue.

2 https://github.com/blockchain-etl.

Bitcoin cryptocurrency and Deanonymization
A block is a set of transactions T = {t_1, t_2, ..., t_n}. For each t_i ∈ T, there is a 3-tuple (t_s, I_{t_i}, O_{t_i}), where t_s denotes the UNIX timestamp of t_i, and I, O denote the addresses of the inputs (senders) and outputs (beneficiaries) of t_i, respectively (Zarpelão et al. 2019). Each t_i can have several inputs and outputs, i.e., I_{t_i} = {i_1, i_2, ..., i_n} and O_{t_i} = {o_1, o_2, ..., o_n}. Each bitcoin user u_i ∈ U, where U = {u_1, u_2, u_3, ..., u_n}, can hold numerous addresses and perform various transactions. For sending bitcoins (BTCs), u_i can generate a new address for every transaction t_i.
The task of a deanonymizing function f(·) is to consolidate all addresses produced by u_i across all transactions, i.e., f(u_i) = {i_{u_i}^{t_1}, ..., o_{u_i}^{t_n}}, where i_{u_i}^{t_1} is an address created by u_i to send BTCs in t_1 and o_{u_i}^{t_n} is an address produced by u_i to receive BTCs in t_n. Deanonymizing is a non-trivial procedure because of the complexity and variety of the bitcoin network (Herrera-Joancomartí 2014; Gaihre et al. 2018). Functions described in the literature can be categorized as heuristic-based (Conti et al. 2022; Pinna et al. 2018; Nakamoto 2019; Zarpelão et al. 2019; Bistarelli et al. 2018), distributed network-based (Ermilov et al. 2017), and AI-based (Liu et al. 2020). Heuristic-based functions are popular and widely used in bitcoin research papers.
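As a concrete illustration, the widely used multi-input heuristic (all input addresses of one transaction are assumed to belong to the same user) can be sketched with a union-find structure. The transaction representation below (a list of input-address lists) and the function names are ours for illustration, not any published tool's API:

```python
class UnionFind:
    """Disjoint-set structure for merging addresses into entities."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_addresses(transactions):
    """Multi-input heuristic: all input addresses of a single
    transaction are assumed to be controlled by the same user."""
    uf = UnionFind()
    for tx_inputs in transactions:
        for addr in tx_inputs:
            uf.find(addr)              # register the address
        for addr in tx_inputs[1:]:
            uf.union(tx_inputs[0], addr)
    clusters = {}                      # group addresses by cluster root
    for addr in uf.parent:
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())
```

For example, transactions with input sets {a1, a2} and {a2, a3} collapse into one entity {a1, a2, a3}, while an unrelated input b1 stays separate.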

Bibliographic studies on detecting illegal activities in bitcoin cryptocurrency
Existing methodologies in bitcoin forensics can be grouped into three categories (see Fig. 2), with AI-based techniques being the most popular. An advantage in investigating digital currencies is that transaction records are maintained on a distributed ledger, the "blockchain," which is openly accessible for examination. The sheer volume of the blockchain, however, presents problems in analyzing it; studies in the literature overcome this by restricting the time interval studied or limiting the sites examined.
The literature on detecting criminal operations has focused on deanonymizing entities (Liu et al. 2020; Nerurkar et al. 2020; Wei et al. 2018; Jourdan et al. 2018), detecting botnets (Zarpelão et al. 2019), and identifying illicit transactions (Lee et al. 2020). Table 1 summarizes the systems used in these investigations. Feature extraction is the most fundamental part of bitcoin investigations centered on detecting unlawful activity or illegal users; the various methodologies used are summarized in Table 2.

Databases utilized in published bibliographic bitcoin studies
The availability of standard datasets is a critical issue in studying bitcoin. The entire blockchain from inception to 08 May 2020 at 13:21:33 GMT was 298GB. Due to storage, computational, and time complexity, most researchers (excluding surveys (Di Francesco et al. 2019; Maesa et al. 2016; Alqassem et al. 2018; Mauro et al. 2018)) have focused on limited categories of illicit users and shorter periods (Table 3). Table 4 gives the popular ML models for bitcoin studies. From Table 1 it is evident that implementing a robust and secure illicit-user detection system is a major concern for privacy and security in bitcoin. Existing works have not focused on the broad range of criminal operations conducted on bitcoin. Furthermore, existing datasets are unsuitable for AI as their sources are restricted. Though supervised learning may achieve high prediction accuracy, it is inapplicable when financial data have no predefined class labels, and bitcoin blockchain transactions are unlabeled; hence, no state-of-the-art methods can be applied to them unless preceded by labeling. For the purpose of labeling, the current work uses multiple sources of data such as bitcoin explorers, bitcoin forums, news articles, and blogs (Gang et al. 2019; Kou et al. 2014). In this regard, Sect. 3 describes the data collection methodology used to overcome the problem of data availability in open datasets. Section 4 discusses the proposed classifier that can distinguish a wide range of unlawful users on bitcoin. Figure 3 illustrates the procedure for bitcoin dataset collection and preprocessing (Sect. 3.1) and the construction of the transaction graph from bitcoin data, which are described further below.

Database cleaning
The bitcoin database was constructed using the public BigQuery repository. 3 All blocks and transactions from 03 Jan 2009 12:45:05 GMT to 08 May 2020 13:21:33 GMT were present in the database. A transaction graph was constructed from the transactions occurring within each calendar year. The transaction graph is a tuple G = (V, E), where V is a (finite) set of vertices and E is a finite collection of edges; E contains elements from the union of the one- and two-element subsets of V. In the bitcoin transaction graph G, V is the set of transaction hashes, and E represents the interactions between users through the exchange of bitcoins; the amount of BTC transferred is an attribute of E. Table 5 gives the notation used to graphically illustrate a typical transaction graph G. Multiple accounts belonging to a single user need to be identified; for this purpose, multi-input heuristic clustering (Di Francesco et al. 2019; Maesa et al. 2016) was used via an API 4 (Janda 2016). This helped in synthesizing a ground-truth labeled database for semi-supervised learning (see Table 6) having 13310125 transactions of 2059 entities with 3152202 accounts belonging to 28 categories. The transaction graphs were directed, acyclic, and consisted of valid (as defined in Anceaume et al. (2016)) and coinbase transactions. Descriptions of the users in Table 6 are given in Sect. 2.2; "unclassified" refers to users not falling into the other 27 categories.
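The graph construction described above can be sketched as follows, with transaction hashes as vertices and BTC-weighted directed edges created when an output of one transaction is spent by another. The input record format (`inputs` as (previous-tx, BTC) pairs) is an assumption for illustration, not the BigQuery schema:

```python
def build_transaction_graph(transactions):
    """Build a directed transaction graph G = (V, E).
    V: transaction hashes; E: (src_tx, dst_tx) -> total BTC moved,
    with an edge whenever an output of src_tx is spent by dst_tx.
    `transactions` maps tx hash -> {"inputs": [(prev_tx, btc), ...]}."""
    nodes = set(transactions)
    edges = {}
    for tx_hash, tx in transactions.items():
        for prev_tx, btc in tx["inputs"]:
            if prev_tx is None:        # coinbase input: no predecessor
                continue
            nodes.add(prev_tx)
            key = (prev_tx, tx_hash)
            edges[key] = edges.get(key, 0.0) + btc
    return nodes, edges

# toy example: t2 spends two outputs of t1; t3 spends t2 plus a coinbase
txs = {"t2": {"inputs": [("t1", 0.5), ("t1", 0.2)]},
       "t3": {"inputs": [("t2", 0.6), (None, 0.0)]}}
nodes, edges = build_transaction_graph(txs)
```

Keeping the BTC amount as an edge attribute matches the definition of E above, and the resulting graph is directed and acyclic because a transaction can only spend outputs of earlier transactions.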
The count of transactions per category is given in Table 7. 35% of the transactions were created by illegal entities, viz. dark market, blacklist, gambling, criminals, mixer, Ponzi, ransom, sextortionists, laundering, scams, and bomb threats (Tables 8, 9). Table 9 gives the features extracted from the bitcoin blockchain for each transaction (tx). The features are:

- txsize: the size of this transaction (tx) in bytes
- txvirtualsize: the virtual transaction size (differs from size for SegWit tx)
- txinputs_count: total inputs to a tx
- txoutputs_count: total outputs to a tx
- txinput_val: total BTCs sent by inputs in a tx
- txoutput_val: total BTCs sent to outputs in a tx
- txfee: the fee paid by this transaction
- Min_received: minimum BTCs received in a tx by outputs
- Max_received: maximum BTCs received in a tx by outputs
- Avg_received: average BTCs received in a tx by outputs
- Total_received: total BTCs received in a tx by outputs
- Stdev_received: standard deviation of BTCs received in a tx by outputs
- Var_received: variance of BTCs received in a tx by outputs
- Min_sent: minimum BTCs sent in a tx by inputs
- Max_sent: maximum BTCs sent in a tx by inputs
- Avg_sent: average BTCs sent in a tx by inputs
- Total_sent: total BTCs sent in a tx by inputs
- Stdev_sent: standard deviation of BTCs sent in a tx by inputs
- Var_sent: variance of BTCs sent in a tx by inputs

Figure 5 illustrates the methodology adopted to test the efficacy of the proposed model (see Sect. 4) using the bitcoin dataset. Table 10 gives the graph measurements: order and size of each graph in the bitcoin transaction dataset.
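The nineteen per-transaction features listed above can be computed directly from the BTC amounts of a transaction's inputs and outputs. A minimal sketch (the function name and arguments are ours, not the paper's pipeline; the paper extracts these values from its blockchain database):

```python
from statistics import mean, pstdev, pvariance

def transaction_features(inputs_btc, outputs_btc, size, vsize, fee):
    """Compute the nineteen per-transaction features from the BTC
    amounts of the inputs and outputs plus basic tx metadata."""
    return {
        "txsize": size,
        "txvirtualsize": vsize,
        "txinputs_count": len(inputs_btc),
        "txoutputs_count": len(outputs_btc),
        "txinput_val": sum(inputs_btc),
        "txoutput_val": sum(outputs_btc),
        "txfee": fee,
        "Min_received": min(outputs_btc),
        "Max_received": max(outputs_btc),
        "Avg_received": mean(outputs_btc),
        "Total_received": sum(outputs_btc),
        "Stdev_received": pstdev(outputs_btc),
        "Var_received": pvariance(outputs_btc),
        "Min_sent": min(inputs_btc),
        "Max_sent": max(inputs_btc),
        "Avg_sent": mean(inputs_btc),
        "Total_sent": sum(inputs_btc),
        "Stdev_sent": pstdev(inputs_btc),
        "Var_sent": pvariance(inputs_btc),
    }

# a toy transaction: two inputs of 1.0 and 3.0 BTC, two outputs of 2.0 BTC
f = transaction_features([1.0, 3.0], [2.0, 2.0], size=250, vsize=180,
                         fee=0.0001)
```

Population statistics (`pstdev`, `pvariance`) are used here; whether the paper uses population or sample statistics is not stated.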

Illegal activity detection on bitcoin transactions using deep learning
Mathematical models for the proposed supervised learning approach, viz. a deep learning-based graph neural network model using spectral graph convolutions (see Sect. 4.1), are given (denoted Models-IIIA and IIIB). Two state-of-the-art approaches are also described, viz. an ensemble of boosted decision trees (XGBOOST, Model-II) and a GAT2 model (with GAT layers) proposed by Veličković et al. (Model-I), for detecting illegal activities in the bitcoin transaction graph, along with the steps followed to train them on the dataset. Figure 6 illustrates the difference between the proposed approach and the others (A = adjacency matrix of the bitcoin transaction dataset, X = feature matrix of transactions).

GCN-based classifier (Models-IIIA and IIIB)
The architecture of the model contains two stacked graph convolutional network (GCN) layers followed by a single fully connected layer. A bitcoin transaction graph G(V, E), with V the node set of transactions T_1, T_2, ..., T_n and E the dyad set representing BTC transfers between accounts, has a binary adjacency matrix as given by Eq. (1) and a node feature matrix as given by Eq. (2). The node feature matrix is formed by stacking horizontally the m-dimensional feature vector (m ≥ 1) of each node (see Table 9). The GCN has the following steps:

1. Apply the convolutional filter Â constructed from the new adjacency matrix (Eq. (3)) and the new degree matrix (Eq. (4)) via the operation given in Eq. (5).
2. Use the propagation rule (see Algorithm 1) for the graph convolution layer as given by Eq. (6); H^(1) is the matrix of activations of the first layer as given by Eq. (7), θ^(0) is the trainable weight matrix of layer 0, and σ is a nonlinear activation function.
3. Feed the convoluted representations of the vertices (Eq. (8)) to a standard fully connected layer.
4. Apply the sigmoid function row-wise on the fully connected last layer of the GCN.
5. Compute the cross-entropy loss on the known node labels.
6. Back-propagate the loss and update the weight matrix θ^(0).
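The filter construction and propagation steps above follow the standard spectral GCN rule, Â = D̃^(-1/2)(A + I)D̃^(-1/2) and H^(1) = σ(Â X θ^(0)). A minimal NumPy sketch on a toy graph (shapes and values are illustrative, not the paper's data):

```python
import numpy as np

def gcn_layer(A, H, theta):
    """One spectral graph-convolution step:
    H' = ReLU( D~^{-1/2} (A + I) D~^{-1/2} H theta )."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # new degree vector
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized filter A^
    return np.maximum(A_hat @ H @ theta, 0.0)  # sigma = ReLU

# toy 3-node path graph, 2-dimensional node features, 2 output units
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.eye(3, 2)
theta = np.ones((2, 2))
H1 = gcn_layer(A, X, theta)
```

Stacking two such layers and feeding the result to a dense layer with a row-wise sigmoid reproduces the architecture described above.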
The convoluted representation of a node is produced by the propagation rule by aggregating the labeled and unlabeled neighbors of the node (Algorithm 1). During training, back-propagation of the supervised binary cross-entropy loss updates the weights θ^(0) shared across all nodes. This loss depends on the latent feature representations of the labeled nodes, which in turn depend on both labeled and unlabeled nodes; thus, the learning becomes semi-supervised.

Fig. 7 Example of laplacian smoothing where a curve is shown before and after the smoothing operation is applied

Relation of proposed model with laplacian smoothing
The proposed method performs operations equivalent to a laplacian smoothing operation. Consider a curve with points ..., p_{i−1}, p_i, p_{i+1}, ... as shown in Fig. 7, where laplacian smoothing is applied to bring each point closer to the weighted average of its neighbors using Eqs. (9) and (10).
For a graph G = (V, E) with adjacency matrix A such that a_{ij} = 1 iff |i − j| = 1, and degree matrix D = diag(d_1, d_2, ..., d_n), the smoothing operation can be rewritten in matrix form. The matrix L = D − A is the graph laplacian, and its two normalized versions are given by Eqs. (13) and (14). The laplacian smoothing operation is given by Eq. (15), with a smoothing parameter as given by Eq. (16) that controls the strength of smoothing. Setting γ = 0, laplacian smoothing reduces to the identity mapping, the limiting case of the non-linear mapping learned by the GCN. The laplacian smoothing computes the local average of each vertex as its new representation; after smoothing over node neighborhoods, nodes that share neighbors tend to have similar feature representations or labels.
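Since Eqs. (9)-(16) are referenced but not reproduced here, the standard forms of these operations, consistent with the description above, are:

```latex
% Pointwise smoothing of a point toward its neighbors:
\hat{p}_i = (1-\gamma)\, p_i + \gamma \sum_{j \in N(i)} w_{ij}\, p_j
% Graph Laplacian and its two normalized versions:
L = D - A, \qquad
L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}, \qquad
L_{\mathrm{rw}} = D^{-1} L
% Matrix form of Laplacian smoothing with parameter gamma:
\hat{X} = X - \gamma\, D^{-1} L X
        = \bigl((1-\gamma) I + \gamma\, D^{-1} A\bigr) X
```

At γ = 0 this is the identity, and at γ = 1 it replaces each vertex's representation by the local average D^{-1}AX of its neighbors, matching the discussion above.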

Complexity analysis
For an input graph with N nodes and feature vectors of length C, the adjacency matrix A lies in ℝ^{N×N}, and the input feature matrix is of size ℝ^{N×C}. For a graph convolution layer producing an output V_out ∈ ℝ^{N×F}, the time complexity is O(N²CF). The main limitations are feature engineering and feature extraction: the model depends on extracting features about each transaction from the blockchain database. This extraction process is time-consuming, so for our work we built a database of transactions and features from the start of the blockchain until the time of writing. Any user of the proposed approach will have to perform the database collection and feature extraction procedure (as given in Sect. 3). These limitations are common to other machine learning-based models and are not a major drawback, as availability of data is not a problem, at least for the bitcoin blockchain, where everything is public.

Experimental study
Experiments comparing the efficacy of the proposed model with the others are elaborated (see Sect. 5.1), along with the performance metrics (see Sect. 5.2). The results of the comparative study are discussed in Sect. 5.3.

Description of experiment
The proposed model is evaluated on twelve datasets of the bitcoin transaction graph (one for each year from 2009-2020). The graph datasets were split 8:1:1 into training, test, and development sets. Hyper-parameters of the models are listed below, and scripts, datasets, and notes for reproducibility are available at the Github repo. 5 Models-IIIA and IIIB were implemented using the StellarGraph Python library, and Model-II was implemented using the XGBOOST Python library. List of final model hyper-parameters:

- Model-I: 1. layer-1 size = 16 2. layer-2 size = 16 3. layer-1 activation = relu 4. layer-2 activation = relu 5. dropout = 0.5 6. optimizer = adam 7. loss = categorical cross-entropy 8. epochs = 200 9. early stopping patience = 50 10. restore-best-weights
- Model-II: 1. learning_rate = 0.0001 2. n_estimators = 100 3. max_depth = 1 4. colsample_bylevel = 1 5. colsample_bytree = 1 6. subsample = 1
- Models-IIIA and IIIB: 1. learning_rate = 0.0001 2. n_estimators = 100 3. max_depth = 1 4. colsample_bylevel = 1 5. colsample_bytree = 1 6. subsample = 1

The models were tuned using the GridSearchCV function of scikit-learn. A grid of hyper-parameters was created, and the best setting was identified by comparison on the scoring metric, the weighted f1-score: the f1-score is calculated for each label, and the average is taken weighted by support (the number of true instances of each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall. The hyper-parameters used in the grid search, along with the search range and the final value selected, are:

1. layer-1 size (units in the first layer; range 2-40) = 16
2. layer-2 size (units in the second layer; range 2-40) = 16
3. layer-1 activation (nonlinear activation function in the first layer; range relu, linear) = relu
4. layer-2 activation (nonlinear activation function in the second layer; range relu, linear) = relu
5. dropout (fraction of units to drop out; range 0.1-0.75) rate = 0.5

The other hyper-parameters, viz. optimizer, loss, epochs, early stopping patience, and restore-best-weights, were fixed from previous publications.
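The tuning procedure described above can be sketched with scikit-learn's GridSearchCV and the weighted F1 scorer; the estimator, data, and parameter grid below are stand-ins for illustration, not the paper's models or search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# imbalanced stand-in data (90% "legal", 10% "illegal")
X, y = make_classification(n_samples=200, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)

param_grid = {"max_depth": [1, 3, 5]}   # illustrative grid
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid,
                      scoring="f1_weighted",  # support-weighted F1
                      cv=5)
search.fit(X, y)
best = search.best_params_              # best setting found on the grid
```

The `"f1_weighted"` scorer is exactly the metric described above: per-label F1 averaged with support weights, which accounts for label imbalance.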

Metrics
Given the true positives t_p, true negatives t_n, type I errors f_p, and type II errors f_n obtained from observing (ŷ_i, y_i), accuracy (A) is defined by Eq. (17).
Loss (see Eq. (18)) was calculated over m observations in the dataset as the average of the loss for each observation i, given the prediction ŷ^(i) and actual label y^(i).
Precision can be thought of as the ability of a classification model to identify only the relevant data points and is calculated by Eq. (19).
This is a class level metric. To get an overall precision or macro-precision (Macro-P), we average out the precision of each class as given by Eq. (20).
Recall is regarded as a model's ability to find all the data points of interest and is calculated by Eq. (21).
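For reference, since Eqs. (17)-(24) are referenced but not reproduced here, the standard definitions of these metrics (equation numbers as in the paper; k denotes the number of classes) are:

```latex
A = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}                       % Eq. (17)
\mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[
  y^{(i)}\log \hat{y}^{(i)}
  + \bigl(1-y^{(i)}\bigr)\log\bigl(1-\hat{y}^{(i)}\bigr)\Bigr]    % Eq. (18)
P = \frac{t_p}{t_p + f_p}, \qquad
\text{Macro-}P = \frac{1}{k}\sum_{c=1}^{k} P_c                    % Eqs. (19), (20)
R = \frac{t_p}{t_p + f_n}, \qquad
\text{Macro-}R = \frac{1}{k}\sum_{c=1}^{k} R_c                    % Eqs. (21), (22)
F_1 = \frac{2\,P\,R}{P + R}, \qquad
\text{Macro-}F_1 = \frac{1}{k}\sum_{c=1}^{k} F_{1,c}              % Eqs. (23), (24)
```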
The recall is also a class-level metric. The overall recall or macro-recall (Macro-R) is the average of the recall of each class, as given by Eq. (22). The F_1-score is the harmonic mean of precision and recall and is calculated by Eq. (23); the Macro-F_1-score is given by Eq. (24). Class imbalance was handled in multiple ways, including training with a large batch size (the model is fit using a batch size of 256 instead of the default of 128; this is important to ensure that each batch has a decent chance of containing a few positive samples, since with too small a batch size there would likely be no fraudulent transactions to learn from), class weighting, and oversampling. Additionally, the weights and bias of the neural network's final layer were initialized using the Xavier normal initializer. As false positives (marking a legal transaction as illegal) are less serious than false negatives (marking an illegal transaction as legal) in our problem, we give a higher weight to the few positive examples in the data. Finally, resampling the database to increase the minority samples and training batch-wise helped improve model performance.
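One common recipe for the class weighting mentioned above is inverse-frequency weighting; a sketch (the exact weights used in the paper are not specified):

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency class weights: weight_c = n / (k * count_c),
    so rarer classes contribute more to the loss."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 90 legal (0) vs 10 illegal (1) transactions: the minority class
# receives a proportionally larger weight in the loss
y = np.array([0] * 90 + [1] * 10)
w = balanced_class_weights(y)
```

The resulting dictionary can be passed as a per-class weight to most training APIs (e.g. a `class_weight` argument), penalizing missed illegal transactions more heavily than misflagged legal ones.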

Experimental results and discussion
Accuracy of Models-I, II, IIIA, and IIIB on the classification task is given in Table 11. Accuracy on the 2012-2020 datasets is below 50% as the number of classification targets in the dataset increases. Macro-P, macro-R, and macro-F1 of Models-I, II, IIIA, and IIIB on the test set are given in Table 12.
Performance of Models-IIIA and IIIB is 20-25% higher than that of Models-I and II. Hence, it is concluded that performance on the classification task is improved by using GCN-based features together with transaction features. Misclassification loss of Models-I, II, IIIA, and IIIB on the classification task is given in Table 13. As hyper-parameter tuning on the development set was not performed for these models, the development-set loss is not given in Table 13. The loss of Models-II, IIIA, and IIIB is lower than that of Model-I on all datasets except 2010; hence, their performance metrics were higher.
The RAM utilization of Models-I, II, IIIA, and IIIB on the classification task is given in Table 14. The RAM requirement is below 10GB, making the models suitable for deployment on a desktop or mobile device.
To improve performance on the 2012-2020 datasets, the classification task was modified from multi-class to binary (legal or illegal) by merging the legal labels into one class and the illegal labels into another. The illegal labels merged were dark market, blacklist, gambling, criminals, mixer, Ponzi, ransom, sextort, laundering, scams, and bomb. Similarly, the legal labels merged were pools, exchange, trading, paymentgateway, wallets, p2plender, faucet, donations, p2pmarkets, unclassified, videosharing, miner, bond, explorer, cybersec, affiliatemarketing, and microworker. Table 15 gives the accuracy of Models-I, II, IIIA, and IIIB on these datasets.
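The label merging can be expressed as a simple mapping over the category strings listed above:

```python
# categories treated as illegal, per the list above
ILLEGAL = {"dark market", "blacklist", "gambling", "criminals", "mixer",
           "Ponzi", "ransom", "sextort", "laundering", "scams", "bomb"}

def to_binary(label):
    """Collapse the 28 entity categories to a legal/illegal target
    (1 = illegal, 0 = legal) for the binary classification task."""
    return 1 if label in ILLEGAL else 0

labels = ["mixer", "exchange", "Ponzi", "pools"]
binary = [to_binary(l) for l in labels]   # -> [1, 0, 1, 0]
```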
Models-IIIA and IIIB use additional GCN-based features together with transaction features to improve accuracy on the dataset by 20-30% compared to Models-I and II. Overall, the accuracy of binary classification improved by 45-55% compared to multi-class classification. Misclassification loss of Models-I, II, IIIA, and IIIB on the binary classification task is given in Table 16.
With a lower loss in binary classification than in multi-class classification, the accuracy of the task improves. The RAM utilization of Models-I, II, IIIA, and IIIB on the binary classification task is given in Table 17. Less RAM is consumed for binary classification than for multi-class classification.
Based on the performance of the two approaches, viz. multi-class and binary classification, an appropriate model can be selected as per the requirement. For fast and accurate decisions, binary classification is appropriate, and for pinpointing the category of bitcoin users, a multi-class classifier is suitable.

Summary of results
- Performance of Models-IIIA and IIIB is 20-25% higher than that of Models-I and II, through the use of GCN-based features along with transaction features.
- RAM utilization of Models-I, II, IIIA, and IIIB is below 10GB, making them suitable for deployment on a desktop or mobile device.
- Accuracy of binary classification improved by 45-55% compared to multi-class classification.
- RAM consumption for binary classification is 55-75% lower than for multi-class classification.

Conclusion and future works
In the bitcoin literature surveyed, major works focus on prediction of price and volatility (Liu et al. 2021; Sebastião and Godinho 2021), and limited attention has been given to forensic tool development. Bitcoin forensic tools rely on artificial intelligence for tracking illegal and legal transactions on the blockchain. To address the challenge posed by the high volume of transactions, the current paper proposes a deep learning-based graph neural network model using spectral graph convolutions and transaction features for identifying illegal transactions and labeling transactions by their originator type. The model leverages the transaction graph of bitcoin and the features of transactions. It was observed through experiments that supervised learning is challenging due to the diverse types of entities present on the bitcoin network; the classifier faced difficulty in identifying discriminative features from the data. The classification task was divided into identifying whether transactions were legal or illegal (binary classification) and classifying transactions into one of the twenty-eight types of users (multi-class classification). On the multi-class classification task, the models obtained the following accuracies: Model-I (12-83%), Model-II (22-90%), Model-IIIA (22-89%), and Model-IIIB (22-92%). On the binary classification task, the accuracies were: Model-I (60-82%), Model-II (60-85%), Model-IIIA (60-85%), and Model-IIIB (59-82%). From the performance of the models, it was observed that Models-I and II had lower performance, which the proposed models improved upon, concluding that performance was improved by using GCN-based features along with transaction features. The performance improvement also indicates that vanilla models, i.e., models using only GCN features or only transaction features, will have comparatively lower performance than models that use GCN-based features along with transaction features.
Based on the performance of the two models, viz. multi-class and binary classification as per the requirement, an appropriate model can be selected. For fast and accurate decisions, binary classification is appropriate, and for pinpointing the category of bitcoin users, the multi-class classifier is suitable.
Feature engineering of additional features from the blockchain to improve accuracy is proposed as future work. To encourage further exploration in bitcoin illegal transaction detection, the datasets and scripts are openly accessible on the Github repository.

Data Availability Enquiries about data availability should be directed to the authors.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

Ethical Standards
The authors declare that they have complied with ethical standards of the journal during their research.