Combating Phone Harassment through Voice Analysis Filtration of Anonymous Reports

Given the increasing popularity of smartphones as all-in-one computing devices for corporate work and everyday personal use, it is no wonder that mobile devices have become the most appealing attack surface for today's cyber criminals. In this context, obscene or harassing phone calls can be one of the most stressful and frightening invasions of privacy a person experiences, and mobile security has therefore become increasingly important in mobile computing. Various applications block spam calls by SIM card number, using a spam database that identifies the source of incoming calls. Unfortunately, their effectiveness is limited: tracking and blocking a SIM card number is usually pointless, because spam callers constantly change their numbers. With this in mind, we present a new concept in which fraudsters are recognized by their voices, even in a noisy environment and from only a few seconds of speech, since a person can change his number several times but cannot change his voice. We use several algorithms and techniques, including speaker verification, speaker identification, forensic speaker recognition (FSR), spectrogram masking, voice filtering, Mel-Frequency Cepstral Coefficients (MFCC), and a combination of the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). Moreover, the system does not require any personal information from its users, so their safety is also preserved. The findings of this study will be useful for lawyers, law enforcement agencies, and judges in court for recognizing suspects.


Introduction
In our proposed system, we identify spam callers with a voice recognition method that includes both voice identification and voice verification techniques [1]. We also use a simple technique for separating vocals from a noisy environment, and a spectrogram masking network [2] to separate the voice of a target speaker from the multi-speaker signals in the call recording. Our system only works for unknown numbers, which helps to prevent anyone from complaining against an innocent person for no reason.
When a call from an unknown number arrives on the user's phone, it is automatically recorded by our system. If the user wants to turn off call recording for personal or other reasons, the system allows him/her to do so at any time; in that case, the system will not save any call recordings. If the user does not turn off call recording, our system continues its process.
In essence, the proposed system has two scenarios.

First Scenario
This scenario occurs only when the unknown caller's voice does not exist in the database. After receiving a call from the unknown number, if the user does not turn off automatic call recording, it means he/she may be willing to report the caller. For that reason, once the call is over, the user is asked whether he/she wants to report the caller. If the answer is "No," the process ends; if the answer is "Yes," a list of voice clips is displayed, sorted by duration. These voice clips are extracted from the call recording after the noise has been removed.
The user then picks from the sorted list the specific voice that he/she believes was a danger to him/her. If the user is unsure which voice to pick, he/she can play the voice clips to make sure. Since the voice is reported and recorded in the database, the user will be warned by a notification that the call may be fraudulent whenever the threatening caller calls again from any number. If a spam caller's voice does not already exist in our database, no notification appears in the notification bar; but if the user thinks the call was a threat to him/her, he/she can still report the voice.

Second Scenario
This scenario occurs when the unknown caller's voice already exists in the database. After the call from the unknown number is received, a notification appears in the notification bar within a few minutes. It contains the number of spam reports previously filed against this specific voice by other victims, and warns the user to be wary of the spam caller. From this notification the user can get an idea about that person and take the necessary steps to be safe or more observant. The number of spam reports reflects the true extent of the fraudster's offences. After the call has ended, if the user is also willing to report that person, he/she just needs to pick the specific voice from the sorted list, as described in the first scenario.

Proposed Model
The model we propose for speaker recognition includes both speaker identification and speaker verification. These are two major applications of speaker recognition technologies and methodologies [3]. If the speaker claims a certain identity and the voice is used to verify this claim, this is called verification or authentication [4]. Identification, on the other hand, is the task of determining an unknown speaker's identity [5]. In a sense, speaker verification is a 1:1 match, in which the voice of a speaker is matched against one particular template, while speaker identification is a 1:N match, in which the voice is compared with several templates [6].
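The difference between the 1:1 and the 1:N match can be illustrated with a minimal sketch. It assumes that each voice has already been reduced to a fixed-length embedding vector and that templates are compared by cosine similarity; the function names, the threshold of 0.8 and the toy 3-dimensional vectors are hypothetical and are not taken from the cited methods.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(test_embedding, claimed_template, threshold=0.8):
    """1:1 match: does the test voice match the single claimed template?"""
    return cosine_similarity(test_embedding, claimed_template) >= threshold

def identify(test_embedding, templates, threshold=0.8):
    """1:N match: return the best-matching enrolled voice, or None."""
    best_id, best_score = None, -1.0
    for speaker_id, template in templates.items():
        score = cosine_similarity(test_embedding, template)
        if score > best_score:
            best_id, best_score = speaker_id, score
    # Reject the best candidate if even it is not similar enough.
    return best_id if best_score >= threshold else None

# Hypothetical toy embeddings; real systems use far larger vectors.
templates = {"voice_A": [1.0, 0.0, 0.0], "voice_B": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]
```

Here `verify(probe, templates["voice_A"])` checks a single claim, whereas `identify(probe, templates)` searches the whole set of enrolled voices, which is the situation the system faces when an unknown caller has to be matched against the database.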
Figures 1 and 2 explain our approach in detail. Our system only works if there is an incoming call from an unknown number; otherwise it yields no results. Also, if the unknown caller's voice does not exist in the system's database, no notification is sent by our system.
If the voice of a spam caller already exists in our database, then two events can take place.

1. A notification is sent to the user about the spam caller if his voice has previously been reported more than 5 times.
2. If his voice has fewer than 5 reports, a notification is sent only to those users who complained.
Otherwise, no notification is sent by our system, so that no one can harass people unjustly. This precaution is taken to protect innocent people from defamation.
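The notification rule above can be written as a small decision function. This is only a sketch: the function name, the `reporters` set and the user identifiers are hypothetical, and the case of exactly 5 reports, which the text does not specify, is handled here by an explicitly labeled assumption.

```python
REPORT_THRESHOLD = 5  # from the text: more than 5 reports notifies every user

def should_notify(report_count, reporters, user_id):
    """Decide whether `user_id` is warned about a matched voice.

    report_count: number of reports stored against the voice
    reporters:    set of user ids who reported this voice (hypothetical field)
    """
    if report_count > REPORT_THRESHOLD:
        return True                    # widely reported: warn every user
    if report_count < REPORT_THRESHOLD:
        return user_id in reporters    # warn only earlier complainants
    # Exactly 5 reports is not specified in the text. This sketch ASSUMES the
    # conservative reading and treats it like the "fewer than 5" case.
    return user_id in reporters
```

For example, `should_notify(3, {"u1"}, "u2")` returns `False`, so a voice with only a few reports is never announced to users who did not report it themselves.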
A user can report an incoming call if he/she feels harassed, threatened, tormented, humiliated, embarrassed or otherwise victimised, whether or not he/she receives a spam notification about the unknown caller.
When a user wants to report a call, some hidden processes take place that are not disclosed to the user. In figure 1, phases 1 and 4 are shown on the phone screen to inform the user about the corresponding actions, but phases 2 and 3 are hidden from the user. The system performs these hidden tasks to extract the specific voice from a noisy environment and from multiple speakers.
The processes shown in figure 1 are almost identical to those indicated in figure 2, but figure 2 contains 3 additional phases, called phase 5, phase 6 and phase 7. Here, phase 7 appears on the user's screen, while phases 5 and 6 are hidden processes that lead to a warning message being sent to the user.
The following sections describe the working principle of these phases.

Phase 1: Reporting threat call
Immediately after a phone call in which there is a threat of physical harm or violence, the user should report the spam call through our system. In this way, he/she can be shielded from this spam caller from then on, because whenever the spam caller calls the user from any number, he/she will be notified about this spammer. It will also help others to be wary of this spammer.

Phase 2: Extract vocals from noise
In the field of automatic or semi-automatic speaker recognition, background noise is one of the main causes of performance degradation in various applications of digital speech processing [7]. We therefore need to reduce background noise, which helps to improve the intelligibility and quality of a speech signal.
Recently, the REpeating Pattern Extraction Technique (REPET) was proposed to separate the repeating background from the non-repeating foreground [8,9]. The fundamental idea is to identify the repeating audio elements, compare them to repeating models derived from them, and extract the repeating patterns through time-frequency masking [10]. While the original REPET (and its extensions) assumes that repetitions happen periodically [11], REPET-SIM, a generalization of the method that uses a similarity matrix, was further proposed to handle structures where repetitions can also happen intermittently [12].
The only assumption is that the repeating background is dense and low-ranked, while the non-repeating foreground is sparse and varied [10].
Repetitions occur in background noise such as car horns, construction work, crying babies and industrial machinery, all of which have repeated patterns. For this reason, we use this algorithm in our proposed system to extract vocals from background noise.
We obtained the following result by implementing this method. Figure 4 shows the mixture in which voices and noise are combined. The wiggly lines in the upper part of the spectrogram are due to the vocal component. Our objective is to separate this component from the accompanying background. After separation, the vocals and the background noise are split into two slices.
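The idea behind similarity-based repeating pattern extraction can be sketched in a few lines. The following is a simplified illustration and not the authors' implementation: it assumes the audio has already been converted to a magnitude spectrogram (a list of time frames), uses cosine similarity between frames, and takes the median over the `k` most similar frames as the repeating model; the function names and the value of `k` are hypothetical. A full system would apply the mask to the complex short-time Fourier transform and invert it back to audio.

```python
import math
from statistics import median

def cosine(a, b):
    """Cosine similarity between two frames (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def repet_sim_mask(frames, k=3):
    """Soft background mask for a magnitude spectrogram.

    frames: list of time frames, each a list of non-negative magnitudes.
    Returns a mask of the same shape with values in [0, 1]; values near 1
    mark time-frequency bins explained by the repeating background.
    """
    n = len(frames)
    mask = []
    for i, frame in enumerate(frames):
        # Rank all other frames by similarity to frame i and keep the top k.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(frame, frames[j]), reverse=True)
        similar = [frames[j] for j in ranked[:k]] + [frame]
        row = []
        for f, value in enumerate(frame):
            # The median across similar frames keeps what repeats and
            # discards sparse outliers; it may not exceed the mixture.
            repeating = min(median(s[f] for s in similar), value)
            row.append(repeating / value if value > 0 else 0.0)
        mask.append(row)
    return mask

def separate(frames, k=3):
    """Split a magnitude spectrogram into background and foreground slices."""
    mask = repet_sim_mask(frames, k)
    background = [[m * v for m, v in zip(mr, fr)] for mr, fr in zip(mask, frames)]
    foreground = [[(1 - m) * v for m, v in zip(mr, fr)] for mr, fr in zip(mask, frames)]
    return background, foreground
```

In a toy spectrogram where every frame contains the same background pattern and a single frame carries extra energy in one bin, `separate` assigns the shared pattern to the background slice and the extra energy to the foreground slice, which mirrors the dense, repeating background versus sparse foreground assumption stated above.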

Phase 3: Separation of spam caller from multiple speakers
The next phase separates the voice of the spam caller from multi-speaker signals by making use of a reference signal from the target speaker. This process is presented in [13]. One way to deal with this issue is to first apply a speech separation system to the noisy audio in order to separate the voices of the different speakers. If the noisy signal contains N speakers, this approach yields N outputs, with a potential additional output for the noise [14].
This approach can easily be extended to more than one speaker of interest by repeating the process in turn for the reference recording of each target speaker [13].

Phase 4: Saving voice in database
In phases 2 and 3, the target voice has already been isolated. In this phase, the system saves this specific voice in the database for further actions. Whenever this person calls the user from any number, the system matches his voice against the saved one (see Phase 5) and sends a notification to the user that this person may be harmful or dangerous to him/her.

Phase 5: Voice recognition in database
Identifying a person from speech samples of forensic quality is challenging. In this phase, the caller's voice is checked against our database for a match. For this purpose we use a method for forensic speaker recognition proposed in [15]. There, each speaker's voice is recorded in both clean and noisy environments, through a microphone and through a mobile channel, and the method has shown low equal error rates (EER) even with very short test samples. This diversity facilitates its use in forensic experimentation. The Gaussian mixture model-universal background model (GMM-UBM) is used for speaker modeling, and Mel-Frequency Cepstral Coefficients are extracted as features [16].
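How a GMM-UBM system scores a test recording can be sketched as a log-likelihood ratio between a speaker model and the universal background model. The sketch below assumes that MFCC feature vectors have already been extracted and that both models are diagonal-covariance GMMs with known parameters; the function names and the toy 2-dimensional models are hypothetical, and the training step (in practice the speaker model is usually adapted from the UBM) is omitted.

```python
import math

def gmm_log_likelihood(features, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in features:
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            component_logs.append(log_p)
        # Log-sum-exp over mixture components for numerical stability.
        peak = max(component_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in component_logs))
    return total / len(features)

def llr_score(features, speaker_gmm, ubm):
    """Log-likelihood ratio: positive values favour the speaker model."""
    return gmm_log_likelihood(features, *speaker_gmm) - gmm_log_likelihood(features, *ubm)

# Hypothetical toy models given as (weights, means, variances).
ubm = ([0.5, 0.5], [[0.0, 0.0], [4.0, 4.0]], [[4.0, 4.0], [4.0, 4.0]])
speaker = ([1.0], [[1.0, 1.0]], [[0.5, 0.5]])
```

A test recording whose features lie close to the speaker model, such as `[[1.0, 1.1], [0.9, 1.0]]`, receives a positive score, while features that the background model explains better receive a negative one; the decision is made by comparing the score with a threshold.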

Phase 6: Creating Spam reports
Whenever a user reports a spam voice, the system saves that voice in the database and creates a profile for the corresponding spammer, in which the number of spam reports is stored. If a voice has already been reported by a user, it already has an individual profile with its corresponding spam reports. Hence, if anyone reports this spam voice again, no new profile is created; instead, the system increases the number of reports against this person.

Phase 7: Sending spam notification
A notification is sent when the spam caller's voice already exists in the database. In this scenario, after the call from the unknown number is received, a notification appears in the notification bar within a few minutes. It contains the number of spam reports previously filed against this specific voice by other victims, and warns the user to be wary of the spam caller. From this notification the user can get an idea about that person and take the necessary steps to be safe or more observant. After the call has ended, if the user is also willing to report that person, he/she just needs to pick the specific voice from the sorted list, as explained earlier.
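The report bookkeeping of Phase 6 and the notification content of Phase 7 can be sketched with a small in-memory structure. This is an illustration under stated assumptions rather than the system's actual storage layer: the class name, the field names and the `voice_id` key (assumed to be the identifier returned by the voice match in Phase 5) are hypothetical.

```python
class SpamRegistry:
    """In-memory stand-in for the voice database (hypothetical structure)."""

    def __init__(self):
        self.profiles = {}  # voice_id -> {"reports": int, "reporters": set}

    def report(self, voice_id, user_id):
        """Create a profile on the first report, otherwise increment the count."""
        profile = self.profiles.setdefault(
            voice_id, {"reports": 0, "reporters": set()})
        profile["reports"] += 1
        profile["reporters"].add(user_id)
        return profile["reports"]

    def notification_text(self, voice_id):
        """Text for the notification bar, or None if the voice is unknown."""
        profile = self.profiles.get(voice_id)
        if profile is None:
            return None
        return f"Possible spam caller: {profile['reports']} report(s) against this voice"
```

Reporting the same voice twice leaves a single profile whose counter is 2, and `notification_text` returns `None` for a voice that has never been reported, which corresponds to the first scenario in which no notification appears.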

Results
Our proposed system was completed in several steps, so the results of this research are presented step by step.

Extracting vocals from noisy environment
Tables 1, 2, and 3 show the results for signal-to-distortion ratio (SDR, in dB) and overall perceptual score (OPS), for the stereo voice estimates (sim) and stereo noise estimates (noi), for all techniques, for subway noise, cafeteria noise and square noise, respectively [10]. Here,
- Algorithm 5 is based on a first constrained ICA that estimates the mixing parameters of the target source, followed by Wiener filtering to enhance the separation results [17].
- Algorithm 8 is based on a first estimation of the noise from the unvoiced segments, followed by DUET [18] and spectral subtraction to refine the results, and a minimum-statistics-based adaptive procedure to refine the noise estimate [19].
- Baseline is based on a first estimation of the Time Differences Of Arrival (TDOA) of the sources, followed by a maximum likelihood target and noise variance estimation under a diffuse noise model, and multichannel Wiener filtering [20]; this is the baseline algorithm proposed by SiSEC.
As we can see, REPET-SIM is nearly always better than Algorithm 8 and the Baseline, and performs as well as Algorithm 5. This makes sense because REPET-SIM models only the noise [10].

Multi-Vocals Separation
In [13], the authors demonstrate the effectiveness of using a discriminatively-trained speaker encoder to condition the speech separation task. Such a system is more applicable to real scenarios because it does not require prior knowledge of the number of speakers and avoids the permutation problem. The VoiceFilter model trained on the LibriSpeech data set also shows that the speech recognition word error rate (WER) in the two-speaker scenario decreases from 55.9% to 23.4%, while the WER in single-speaker situations stays about the same.
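For readers unfamiliar with the metric, WER is the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. A minimal sketch follows; the function name is hypothetical, the reference is assumed to be non-empty, and the example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("the call was recorded", "the call recorded")` counts one deleted word out of four reference words.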

Speaker Recognition
In [15], the speech of 40 speakers recorded over the mobile channel is used to validate the proposed method. Recordings over mobile channels suffer from low bandwidth and low-quality devices. The training set consists of recordings of a one-paragraph statement, averaging 30 s, and the test set of recordings averaging 10 s, made over the mobile channel through smooth voice recording. Figure 6 shows the DET curves of the experiments. Figure 6 indicates that good performance was achieved with mobile channel recordings, with a recognition rate of approximately 97.8% and an EER of 1.98% [15].
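The EER quoted above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from trial scores is given below; the function name and the toy scores are hypothetical and are unrelated to the data of [15].

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With genuine scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.1, 0.2, 0.3, 0.6]`, one genuine trial and one impostor trial are misclassified at the balancing threshold, so both error rates are 25%.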

Conclusion
This paper has presented a solution for reducing mobile telephone harassment by filtering anonymous reports using voice recognition and data analysis. We present a new concept whereby a fraudster's voice is recognized from a few seconds of speech, even in a noisy environment, because a SIM card number can be changed many times but a person's voice cannot. It will reduce harassment, threats, torment, humiliation, embarrassment and victimization by phone calls. Our system overcomes the limitations of existing applications and provides more security than other applications, as it does not require any personal information from its users.