3.2 Flow of the Framework
Task 1 – Extract VAT return data
According to the industry-standard machine learning life cycle, this task falls under the data gathering phase. This step involves collecting data, identifying additional data sources, and integrating the data obtained from those sources. The goal of this step is to identify the different data sources and to surface any data-related problems, since data can be collected from sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle, because the quantity and quality of the collected data determine the quality of the output. In general, the more good-quality data we collect, the more accurate the classification or prediction tends to be.
Task 2 – Aggregate Data
After collecting the data, we need to prepare it for the subsequent steps. In the machine learning life cycle, this task falls under the data preparation phase, in which we place the data in a suitable database or set of files and prepare it for use in training. In our framework, for each VAT dealer, we aggregate all continuous numerical variables obtained from the returns. In this study, the summary values are calculated for each individual VAT vendor over a period of 6 years. This effectively allows the algorithm to work with a view of the vendor's behaviour over the whole period, as opposed to monthly or yearly scrutiny. During this task we also conduct data pre-processing and exploratory data analysis.
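The per-vendor aggregation described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the record layout, the field names (`output_vat`, `input_vat`), and the choice of sums and means as summary values are all assumptions, since the text does not specify which statistics were computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-period return records:
# (vendor_id, period, output_vat, input_vat). Field names are assumptions.
returns = [
    ("V001", "2015-01", 1200.0, 900.0),
    ("V001", "2015-02", 1000.0, 950.0),
    ("V002", "2015-01", 300.0, 800.0),
    ("V002", "2015-02", 500.0, 700.0),
]


def aggregate_by_vendor(records):
    """Collapse per-period returns into one summary row per vendor."""
    grouped = defaultdict(list)
    for vendor_id, _period, output_vat, input_vat in records:
        grouped[vendor_id].append((output_vat, input_vat))

    summary = {}
    for vendor_id, rows in grouped.items():
        outputs = [row[0] for row in rows]
        inputs = [row[1] for row in rows]
        # Sums and means are illustrative choices of "summary values".
        summary[vendor_id] = {
            "n_returns": len(rows),
            "total_output_vat": sum(outputs),
            "total_input_vat": sum(inputs),
            "mean_output_vat": mean(outputs),
            "mean_input_vat": mean(inputs),
        }
    return summary


vendor_summary = aggregate_by_vendor(returns)
```

Each vendor ends up as a single row of features, which is the unit the clustering algorithm operates on in the later tasks.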
Task 3 – Normalize Data
This task is normally undertaken in the data preparation phase of the machine learning life cycle. During data preparation we use a technique called normalization, or standardization, to rescale the input and output variables before training a neural network model. The purpose of scaling the dataset when training a neural network model is to bring the data to a mean close to zero. The literature indicates that normalization can improve the performance of the model (Sola & Sevilla, 1997). Normalizing the data generally speeds up learning and leads to faster convergence: mapping data to values around 0 trains much faster than mapping them to intervals far from 0 or using un-normalized raw data.
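One common way to obtain a mean close to zero is z-score standardization, sketched below with NumPy. The text does not say which scaling method was used, so the choice of z-scores, the function name `standardize`, and the sample values are assumptions made for illustration.

```python
import numpy as np


def standardize(X):
    """Rescale each column to zero mean and unit variance (z-score)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero spread; divide by 1 so they map to 0
    # instead of raising a division-by-zero warning.
    std[std == 0.0] = 1.0
    return (X - mean) / std, mean, std


# Hypothetical aggregated features, one row per vendor.
X = np.array([[2200.0, 1850.0], [800.0, 1500.0], [5000.0, 4100.0]])
X_scaled, col_mean, col_std = standardize(X)
```

The returned `col_mean` and `col_std` should be kept so that any data scored later is rescaled with the same parameters as the training data.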
Task 4 – ANN-SOM Algorithm
Formally, this stage is about selecting an appropriate machine learning algorithm, which is an iterative process. During this study, we identified multiple machine learning algorithms applicable to our data and to the VAT fraud detection challenge. As alluded to previously, we use an unsupervised learning approach, which is appropriate for unlabelled data. The algorithms we evaluated were K-means and Self-Organizing Maps (SOM). Riveros et al. (2019) found that a model trained with SOM outperformed a model trained with K-means, and that the SOM improved the detection of patients with vertebral problems (Riveros, Cardenas, & Pico, 2019). Likewise, after several iterations comparing the performance of SOM and K-means, we chose the ANN-SOM algorithm.
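For readers unfamiliar with the SOM, the following is a minimal NumPy sketch of the standard online training rule: find the best matching unit, then pull it and its grid neighbours toward the sample. It is not the implementation used in the study. The grid size, iteration count, learning rate, neighbourhood width, decay schedules, and the synthetic two-cluster data are all assumed values.

```python
import numpy as np


def train_som(X, rows=5, cols=5, n_iter=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, d)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    weights = rng.normal(size=(rows, cols, X.shape[1]))
    # Grid coordinates of every node, used for neighbourhood distances.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )

    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3  # shrinking neighbourhood width

        x = X[rng.integers(len(X))]           # one random training sample
        # Best matching unit: the node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)

        # Gaussian neighbourhood centred on the BMU, measured on the grid.
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights


# Synthetic stand-in for standardized vendor features (two loose clusters).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(3.0, 0.3, size=(100, 2)),
])
som = train_som(X)
```

Unlike K-means, the neighbourhood term updates nodes that are adjacent on the grid together, which is what gives the SOM its topology-preserving map.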
Task 5 – Train and Test Model
This stage is concerned with creating a model from the data given to it. At this stage we split the dataset into a training set and a test set, using 80% of the data for training and 20% for testing. The training process is unsupervised. The held-out test set is then used to evaluate the model. These two steps are repeated a number of times in order to improve the performance of the model (Riveros, Cardenas, & Pico, 2019).
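The 80/20 split can be sketched as a seeded random shuffle. The text does not say whether the split was random or ordered, so the shuffle, the function name `split_train_test`, and the placeholder data are assumptions.

```python
import numpy as np


def split_train_test(X, test_fraction=0.2, seed=0):
    """Shuffle the rows of X and split them into (train, test) parts."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx]


# Placeholder feature matrix: 100 vendors, 2 features each.
X = np.arange(200, dtype=float).reshape(100, 2)
X_train, X_test = split_train_test(X)
```

Fixing the seed keeps the split reproducible across the repeated train-and-evaluate rounds.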
Task 6 – Optimize Model
A model's first results are not its last. The objective of the optimization, or tuning, that we conducted was to improve the performance of the model. Tuning a model involves changing hyperparameters such as the learning rate or the optimizer (Bennett & Parrado-Hernandez, 2006). The purpose of tuning and improving the model is repeatability and efficiency: others should be able to reproduce the steps taken to improve performance. Additionally, we optimized the model to reduce training time.
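A simple and reproducible way to tune hyperparameters is an exhaustive grid search, sketched below. The text does not state which search strategy was used, so this is only an illustration. `toy_objective` is a stand-in with a known minimum; in practice it would be replaced by a function that trains the model with the given settings and returns an error measured on the test set.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return the one with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(learning_rate, sigma):
    """Stand-in objective (hypothetical), minimal at learning_rate=0.5, sigma=1.0."""
    return (learning_rate - 0.5) ** 2 + (sigma - 1.0) ** 2


best_params, best_score = grid_search(
    toy_objective,
    {"learning_rate": [0.1, 0.5, 1.0], "sigma": [0.5, 1.0, 2.0]},
)
```

Because the grid is written down explicitly, the search can be rerun by someone else and will visit the same candidates in the same order.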
Task 7 – Deploy Model
The aim of this stage is to ensure that the model functions properly after deployment. Models should be deployed in such a way that they can be used for inference and can be updated regularly (Bennett & Parrado-Hernandez, 2006).
Task 8 – VAT Audit Case Selection
The cohort of VAT vendors whose return declarations have been identified by the SOM as suspicious ends up in the “funnel” for further scrutiny. This step consists of human verification. It is a general audit of the cases selected for further scrutiny, in contrast with the Investigative Audit, in which cases are audited by a specialist auditor.
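The text does not specify how the SOM marks a vendor as suspicious. One widely used criterion, assumed here, is the quantization error: the distance from a vendor's feature vector to its nearest SOM node, with vendors above a chosen percentile sent to the funnel. The function names, the toy 2x2 codebook, the vendor values, and the percentile are all hypothetical.

```python
import numpy as np


def quantization_errors(X, codebook):
    """Distance from each row of X to its nearest SOM node."""
    X = np.asarray(X, dtype=float)
    nodes = np.asarray(codebook, dtype=float).reshape(-1, X.shape[1])
    dists = np.linalg.norm(X[:, None, :] - nodes[None, :, :], axis=-1)
    return dists.min(axis=1)


def flag_for_audit(X, codebook, percentile=95.0):
    """Flag rows whose quantization error exceeds the given percentile."""
    errors = quantization_errors(X, codebook)
    threshold = np.percentile(errors, percentile)
    return errors > threshold, errors, threshold


# Hypothetical trained 2x2 map and five standardized vendor vectors;
# the last vendor lies far from every node.
codebook = np.array([[[0.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]])
vendors = np.array([[0.1, 0.0], [0.9, 1.0], [1.0, 0.1], [0.0, 0.9], [6.0, 6.0]])
flags, errors, threshold = flag_for_audit(vendors, codebook, percentile=80.0)
```

The percentile acts as a capacity control: lowering it sends more vendors to human verification, raising it sends fewer.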
Task 9 – Investigative Audit, Criminal Investigation and Enforcement
Investigative audits differ from other tax audits in that they are conducted by a centralised specialist team. Task 9 is undertaken on the basis of the results of the audits conducted in Task 8, where audit officers have identified evidence of serious fraud.
Task 10 – Tax Compliance
The tax compliance task involves the scrutiny of compliance-related attributes such as filing returns on time, making timely payments, completing returns accurately, and registering with the tax authority in good time, among others.
Task 11 – Voluntary Compliance
The aim of the VAT fraud detection AI framework is to increase voluntary compliance. The level and frequency of audit activity will be dictated by the availability of staff resources. The advantage of the AI framework suggested here is that it helps ensure that the available staff resources are deployed judiciously, with the twin objectives of maximizing both revenue collection and voluntary compliance by VAT dealers.
The “filter” or “funnel” denoted by the number 8 symbolizes the audit process, which involves a detailed human verification and validation of lading. This in turn assists in the independent verification of financial records such as sales invoices, purchase invoices, customs documents, and bank cash deposits. However, the scope of the human verification is limited to the subset of taxpayers that have been flagged as anomalies by the SOM algorithm we propose. Once human verification has confirmed the presence of suspicious VAT declarations, such cases are dealt with in Step 9, which depicts the work performed by the investigative audit, criminal investigation, and enforcement teams on confirmed cases. We envisage that the effectiveness and efficiency of this AI-assisted compliance framework will enhance the detection of suspicious VAT vendors. Consequently, we anticipate that tax compliance will improve as the fear of detection increases (Step 10). Voluntary compliance will be a consequence of an improved, effective, and efficient AI-based case selection technique (Step 11).