f-Slip: an efficient privacy-preserving data publishing framework for 1:M microdata with multiple sensitive attributes

Privacy-preserving data publishing is the process of releasing an anonymized dataset for various analysis and research purposes. Earlier, researchers dealt with datasets under the assumption that each individual has only one record (1:1 dataset), an assumption that does not hold in many applications. Later, many researchers turned to datasets in which an individual has multiple records (1:M dataset). In this paper, a model, f-slip, is proposed that addresses various attacks in the 1:M setting, namely the Background Knowledge (bk) attack, the Multiple Sensitive Attribute correlation attack (MSAcorr), the Quasi-identifier correlation attack (QIcorr), the Non-membership correlation attack (NMcorr) and the Membership correlation attack (Mcorr), and provides solutions for them. In f-slip, anatomization is performed to divide the raw table into two sub-tables: (1) quasi-identifiers and (2) sensitive attributes. The correlation of the sensitive attributes is computed to anonymize them without breaking the linking relationship. Further, the quasi-identifier table is divided and k-anonymity is implemented on it. An efficient anonymization technique, frequency-slicing (f-slicing), is also developed to anonymize the sensitive attributes. The novel approach in the f-slip model is the slicing of records according to the frequency of occurrence of sensitive attribute values in each sub-table. The workload experiment proves that the f-slip model remains consistent as the number of records increases. Extensive experiments performed on the real-world dataset Informs prove that the f-slip model outstrips state-of-the-art techniques in terms of utility loss and efficiency and also achieves an optimal balance between privacy and utility.


Introduction
Various organizations and institutions publish their data for research, analysis, policy and decision-making purposes to make the data available to the public and private sectors. Due to the increase in digital transformation, Electronic Health Records (EHR) have also increased enormously. An EHR includes complete information on the patient's diseases, symptoms, demographics, diagnosis codes, test reports, allergies, physicians and bill reports (Chu et al. 2021; Orna et al. 2020; Manoharan and Samuel 2020). The usage of Electronic Health Records was 18% in 2001, increased to 72% in 2012 and is expected to rise to 90% by the end of the era (Stephen and Michael 2016; Shakya and Lalitpur 2020). The data released by the health sectors for analysis and research purposes may hold the personal information of an individual, such as explicit identifiers (e.g., name, SSN), quasi-identifiers (e.g., age, sex, race) and sensitive attributes (e.g., disease, symptoms, salary). Publishing such data with private and personal information leads to a privacy breach, and thus the individual's privacy is compromised. Initially, only the explicit identifiers were removed from the data before publishing, under the assumption that the microdata were then secure. It is well known that just removing the explicit identifiers is insufficient and may lead to linking attacks. Adversaries can connect the quasi-identifier attributes with available external sources to re-identify a particular individual. Around 87% of the population in the USA was identified from published medical records using quasi-identifiers such as gender, zip code and date of birth (Latanya 2002a, b). Several privacy algorithms and models have been proposed to overcome privacy breaches (Abdul and Sungchang 2020; Zhe et al. 2018; Rashad and Azhar 2018). Health sectors and organizations anonymize their microdata with the existing privacy algorithms and models to protect individuals from various privacy breaches.
The threats in the health sectors have been dealt with by various techniques proposed by the computer science and health informatics communities (Indumathi and Amala 2021). The anonymization techniques aim at privacy-preserved data publishing with strong privacy and less information loss (Ismail and Ammar 2020; Jinwen et al. 2020; Tehsin et al. 2021).
Earlier, researchers carried out their work in privacy-preserving data publishing assuming that the dataset has a single sensitive attribute. However, a dataset may have multiple sensitive attributes (MSA). In a real-world dataset, each individual may have multiple records with multiple sensitive attributes. For example, a patient might have visited a particular hospital numerous times for various diseases (e.g., hypertension, bronchitis, diabetes). Each time the patient visits the hospital for a different disease, a record is inserted into the database (e.g., three visits for hypertension, a single visit for bronchitis and a couple of visits for diabetes). Hence, a patient can have multiple records with multiple sensitive attributes (MSA). Most researchers have not concentrated on the above scenario. Thus, this work focuses on 1:M microdata with MSA for Electronic Health Records.
The paper is organized as follows. The related works of various researchers are discussed in Sect. 2. The motivation and challenges are elaborated with a sample dataset in Sect. 3. Section 4 elucidates the 1:M multiple sensitive attribute attacks with corresponding scenarios. The contributions and the preliminaries with various definitions are discussed in Sects. 5 and 6. The f-slip model is explained elaborately in a step-by-step manner in Sect. 7. Section 8 outlines the algorithms of the various processes in the f-slip model with a clear description. The experimental evaluation and result analysis are explained and depicted through various graphs in Sect. 9, and Sect. 10 concludes the paper with future directions and limitations.

Related works
This section evaluates the privacy anonymization methods and models for publishing multiple sensitive attributes and 1:M microdata. Various anonymization methods and models have been proposed for MSA. k-anonymity (Latanya 2002a, b; Khaled and Fida 2008) was proposed to prevent the re-identification of individuals. However, k-anonymity cannot restrict sensitive attribute disclosure and does not defend against the reverse attack. (a,k)-anonymity was proposed with k-anonymity as its base: the distribution frequency of each sensitive attribute value in every equivalence class should not be greater than 1/a (Xiangwen et al. 2017). The distribution of sensitive attributes reduces the competence of anonymity, so the l-diversity model (Ashwin et al. 2006) was proposed, considering the diversity of the sensitive attributes. Further considering the above problems, an algorithm, t-closeness, was proposed with the measure that "the distance between the distribution of the sensitive attributes in the equivalence class and the sensitive attribute distribution in the whole table should not be greater than the threshold t" (Ninghui et al. 2007). t-closeness fails to protect the privacy of infrequent values. Thus, b-likeness was proposed with stronger constraints to achieve good privacy (Athanasios et al. 2020). When the existing privacy models were applied to the incremental data releasing model, they led to excess privacy leakage; thus, the m-signature model was proposed (Junqing et al. 2018). However, the m-signature model is limited by its time complexity. An anatomy technique, Sensitive Label Privacy Preservation with Anatomization (SLPPA), and the (a, b, c, d) model were proposed to prevent record linkage, table linkage, attribute linkage and probabilistic attacks. A "mean-square contingency coefficient" metric was applied to divide the table and avoid re-identification during anatomization.
The (a, b, c, d) model was applied on two datasets, named adult and census, and was limited to a single sensitive attribute. SLPPA comprises two processes: (1) table division and (2) group division (Lin et al. 2021). The above models and algorithms concentrated on 1:1 microdata and a single sensitive attribute.
The main focus of anonymization is to transform the data to balance both privacy and loss of information. Various anonymization techniques have been designed for privacy-preserving data publishing with MSA. SLAMSA is a privacy preservation approach for MSA. SLAMSA uses an anatomization technique that avoids the generalization of quasi-identifier attributes, leading to less information loss. However, as SLAMSA anatomizes the original table into multiple tables, it causes complexity and utility loss when publishing the tables. SLAMSA was implemented on the Cleveland Foundation Heart Disease and Hungarian Institute of Cardiology datasets; however, it is vulnerable to demographic attacks (Shyamala and Christopher 2016). The KCi-Slice concept considers different thresholds for different sensitive attributes. It prevents similarity attacks by applying semantic l-diversity. KCi-Slice reduces utility loss, enhances privacy for multiple sensitive attributes and was tested on the adult dataset (Lakshmipathi et al. 2018).
The distributional model was proposed by setting a threshold, p-sensitive, on multiple sensitive attributes. A set of rules is fixed for the distribution of sensitive attribute values. The sensitive attributes are categorized as primary sensitive attributes and contributory sensitive attributes. In the distributional model, the sensitive attributes are divided into sub-tables without following the anatomy concept. Also, the distributional model is not fixed; it can be changed according to different models (Widodo and Wahyu 2018). A novel method called overlapped slicing with bucketization was proposed for privacy-preserving data publishing with multiple sensitive attributes. In overlapped slicing, sensitive attributes are anonymized by applying permutation in each bucket. The discernibility value metric was used to measure utility, a comparison was made with two existing methods and the approach was tested on the adult dataset. The overlapped slicing model lagged in dissociating the relationship between quasi-identifiers and sensitive attributes (Widodo et al. 2019). The privacy and security level of each sensitive attribute differs according to the different requirements of sensitivity. The Lsl-diversity model was proposed with three greedy algorithms: maximal-bucket first (MBF), maximal single-dimension-capacity first (MSDCF) and maximal multi-dimension-capacity first (MMDCF). These three algorithms helped greatly in reducing information loss. However, there is a slight increase in runtime as the volume of data increases (Yuelei and Haiqi 2020).
An effective approach, (p,k)-Angelization, was proposed to anonymize multiple sensitive attributes. (p,k)-Angelization eradicates the background join and non-membership attacks and yields a balance between privacy and utility. The approach makes a one-to-one correspondence between the quasi-identifiers and sensitive attributes in the buckets (Adeel et al. 2018a, b). The "(c,k)-anonymization" is an advancement of (p,k)-Angelization, which enhanced the one-to-one correspondence to a one-to-many correspondence to provide improved privacy and increased utility. (c,k)-anonymization also thwarts "fingerprint correlation attacks" (Razaullah et al. 2020). Both (p,k)-Angelization and (c,k)-anonymization can be applied only to 1:1 microdata. Various other models have been proposed and implemented on multiple sensitive attributes (Wang et al. 2018; Jayapradha et al. 2020; Rong et al. 2020). The papers discussed so far have implemented their works on 1:1 microdata, considering that datasets have only one record per person.
Later, a few researchers paid attention to 1:M datasets. A method called (k,k^m)-anonymity was proposed, but it leads to unexpected information distortion. (k,k^m)-anonymity framed the 1:M problem as a "multi-objective optimization problem" and handled both relational and transactional data (Poulis et al. 2013). A method, (k,l)-diversity, was proposed to address the disclosure risk in privacy-preserving data publishing of 1:M datasets, along with a 1:M generalization algorithm. Unfortunately, it fails to prevent information loss (Qiyuan et al. 2017). A hybrid method, l-anatomy, was proposed to ensure the privacy of individuals in 1:M datasets. Though l-anatomy performed well in terms of utility, its computational complexity increased and it was limited to a single sensitive attribute (Adeel et al. 2018a, b). A bidirectional personalized generalization model was proposed to satisfy higher privacy and less utility loss for multi-record datasets. This model resists the bidirectional chain attack by using a hierarchical generalization strategy. Though the model performed well, it was limited to a single sensitive attribute and leads to information loss due to generalization (Xinning and Zhiping 2020). The QIAB-IMSB algorithm was proposed to anonymize set-valued datasets. Vertical partitioning is performed to partition the table. In the QIAB-IMSB algorithm, k-anonymity is applied to the quasi-identifier bucket and (k,l)-diversity to the multiple sensitive attribute bucket. The algorithm resists the sensitive linking attack by adopting hierarchical generalization, and the accuracy of the data was compared using classification models (Jayapradha and Prakash 2021). As per the survey, the widely used real-world datasets for 1:M privacy-preserving data publishing are Informs and YouTube.

Motivation and challenges
In a real-world scenario, 1:M datasets are more common than 1:1 datasets. Apart from health care, there are various domains where users can possess multiple records. An individual might post multiple pictures and statuses on the same account in a social network such as Facebook, Twitter or Foursquare. Likewise, a person can purchase various items on different days with the same membership card in a supermarket. Only a few researchers looked into the above scenario in their earlier work. Later, many researchers took the problem in hand and developed various privacy models and anonymization algorithms. However, those models and algorithms were not able to resist several disclosures and attacks.
Consider the 1:M dataset shown in Table 1. It is a sample dataset that comprises patients' records with multiple sensitive attributes. In Table 1, patients have multiple records with different disease codes; the patient Alan has two records in the dataset with two different disease codes. Table R_T consists of explicit identifiers EI = {ei_1, ei_2, ei_3, …, ei_n}, quasi-identifiers QI = {qi_1, qi_2, qi_3, …, qi_s} and sensitive attributes SA = {sa_1, sa_2, sa_3, …, sa_h}. The explicit identifiers are given just to identify the patient and will be removed during data publication. The quasi-identifiers comprise general information that can be associated with publicly available datasets to re-identify an individual. Sensitive attributes contain confidential information that the individual does not want disclosed to the public. Therefore, sensitive attributes need to be protected from intruders.
3.1 Challenge 1 (failure of 1:1 privacy models on 1:M)

When the existing privacy models of the 1:1 dataset are applied to the 1:M dataset, they might cause privacy breaches due to the multiple records of an individual; privacy models designed for 1:1 datasets can no longer be applied to 1:M datasets. In Table 2, 2-anonymity has been applied to protect the data against various privacy breaches. However, the dataset is not well protected. Though the records of Liu, Dalia, Alan, Helen, Tony and Tom are generalized by forming equivalence classes, the patients can easily be re-identified. If an intruder knows the quasi-identifier values of Alan, i.e., 13, M, 55,000, then the intruder can quickly identify the sensitive values of Alan. Since only the first two equivalence classes of Table 2 suit the above criteria, the intruder can infer the values of the sensitive attributes of Alan with 100% confidence: his salary is 10,201, the poverty line is high, education is 9th and the disease codes are ⟨V22, V90⟩. In Table 2, the records of the individuals are not aggregated. As the records of an individual have the same quasi-identifier values with different sensitive attribute values, the intruder can easily re-identify a specific person and infer an individual's complete information. In Table 2, the unique ids (U_ID) of Alan and Dalia are in group 2. After exploring Table 2, the intruder can infer that U_ID 2 is generalized in groups 1 and 2; thus, the quasi-identifier values of patient U_ID 3 can be inferred from the QI values of U_ID 2, i.e., U_ID 3 should be under 20, female and with a zip code in the range 60,000-70,000. In Table 2, the quasi-identifier information of U_ID 2 becomes background knowledge for an intruder to re-identify U_ID 3 and opens a path for several attacks, as listed in Table 3.
Due to the implementation of the 1:1 dataset privacy model on the 1:M dataset, an intruder can gain knowledge from the published data, which causes various correlation attacks such as background knowledge (bk) attack, Multiple Sensitive Attribute correlation attack (MSAcorr), Quasi-identifier correlation attack (QIcorr), Membership correlation attack (Mcorr) and Non-Membership correlation attack (NMcorr).

3.2 Challenge 2 (individual condition fingerprint array identification)
In the 1:M dataset, each individual will have multiple records with multiple sensitive attribute (MSA) values and common quasi-identifier values. To anonymize the 1:M dataset (challenge 1), the MSA values of an individual are grouped. The grouped MSA values alone form an Individual Condition Fingerprint Array identification [ICFA]. An intruder can use this ICFA to re-identify the person along with all the sensitive values. It cannot be assured that every individual's ICFA in the dataset will form a unique bucket. A technique, frequency-slicing (f-slicing), has been introduced to deal with the different fingerprint array lengths. To our knowledge, though existing systems have dealt with sensitive attribute fingerprint buckets, they have not adopted any technique that handles the size of the fingerprint array to protect against high utility loss and privacy breaches; as a result, adversaries can easily break the Individual Condition Fingerprint Array to obtain an individual's records. The above two challenges make privacy-preserving data publishing of the 1:M dataset very complex with respect to an optimal balance between utility and privacy. Achieving high privacy with less information loss in the 1:M dataset is always a challenge, and this has been addressed in this paper.

1:M multiple sensitive attribute attacks with corresponding scenarios
The proposed model f-slip anonymizes the 1:M dataset with multiple sensitive attributes and guarantees intensified privacy with minimum loss of information. The proposed approach has been evaluated against the various correlation attacks listed in Table 3 by implementing it on a real-world dataset. Five privacy breach cases are discussed over the 2-anonymity published data in Table 2, and each case is explained. The implementation of 1:1 dataset privacy models on the 1:M dataset cannot resist the following attacks: bk, MSAcorr, QIcorr, Mcorr and NMcorr. The different cases of the above correlation attacks are explained as follows.
Case 1 An intruder can infer the complete details of an individual's sensitive attributes if he possesses strong background knowledge about the individual. If an intruder knows the basic information of the person, i.e., Lisa is a female, her age is below 20 and she is from zip code 60,000, and also possesses strong background knowledge such as that Lisa has completed her bachelors and earns a decent salary, then he can easily infer the complete information of Lisa, which leads to the Background Knowledge (bk) attack.

Case 2 A Multiple Sensitive Attribute correlation attack (MSAcorr) can occur with the help of background knowledge (bk). For example, if an intruder knows that Alan has not studied much and earns less with a high poverty line, the intruder can easily infer Alan's salary and also that his disease codes are ⟨V22, V90⟩ from Table 2. Just by knowing one sensitive attribute value, the values of the other sensitive attributes can be inferred.
Case 3 If an intruder has background knowledge about the quasi-identifier values of an individual, he can correlate the QI values with the values of the sensitive attributes to perform a Quasi-identifier correlation attack (QIcorr). If an intruder can correlate or map the sensitive attribute information with the assistance of background knowledge (bk) and QI attributes such as age, zip code and sex, he can infer the sensitive values with high confidence.
Case 4 If an intruder has background knowledge that Dalia is not poor and did not complete any degree courses, he can find the QI values of the individual Dalia from Table 2. With the above information and the QI values, the intruder can easily infer that Dalia belongs to equivalence classes 2 and 3. So, Table 2 fails to guarantee privacy and leads to a Non-membership correlation attack (NMcorr).
Case 5 If an intruder can infer the existence of an individual along with the complete information, a Membership correlation attack (Mcorr) happens. If the intruder knows the QI values of Helen, i.e., 30, F, 68,000, and possesses background knowledge such as that Helen's education is higher and her poverty is very low, the intruder can easily infer that Helen falls in equivalence classes 3 and 4 and can conclude which record belongs to Helen with the help of the sensitive attribute values.

Contribution
During 1:M dataset publishing, the patients' records need to be protected with less information loss. The anonymization methods and models implemented earlier to balance privacy and utility were limited by various factors such as dimensionality, technique and methodology. The published 1:M dataset becomes ineffective when the quasi-identifiers and sensitive attributes are generalized into larger intervals. Moreover, anatomization was performed without considering the correlation between attributes, which could break the linking relationship with complete loss of information. The significant contributions of the work are as follows.
- A thorough study has been done on existing privacy-preserving techniques and models for the 1:M dataset with multiple sensitive attributes to balance privacy and utility.
- An anatomization (Def. 1) is performed based on the correlation between the attributes, which substantially avoids breaking the linking relationship while preserving the privacy and utility of the dataset.
- After anatomization, the Individual Condition Fingerprint Array identification [ICFA] (Def. 2) is framed and an anonymization method, "f-slicing" (Def. 5), is implemented on the sensitive attributes of the 1:M dataset. f-slicing is performed based on the frequency of occurrence of the ICFA.
- Based on f-slicing, a privacy model, "frequency-slicing with intensified privacy (f-slip)," is proposed to achieve less information loss with intensified privacy.
- The experiments performed on the proposed model prove that f-slip successfully accomplishes intensified privacy with minimum loss of information.
Definition 1: Anatomy (Xiaokui and Yufei 2006). Anatomy is a method used to partition the original table into various sub-tables. It disconnects the correlation between the quasi-identifiers and sensitive attributes. The main concept behind anatomy is to partition the table, apply different techniques on the quasi-identifier table (QI_T) and the sensitive attribute table (SA_T′), and join them together. Both QI_T and SA_T′ are assigned a unique id for reference; however, it is removed while publishing.
The QI_T has the representation

QI_T = {Unique_ID, qi_1, qi_2, qi_3, …, qi_s}

The SA_T′ has the representation

SA_T′ = {Unique_ID, sa_1′, sa_2′, sa_3′, …, sa_h′}

The SA_T′ of a multi-record individual (an individual having multiple records) with multiple sensitive attributes has the representation

SA_T′ = {Unique_ID, sa_11′ ∪ sa_12′ ∪ sa_13′, sa_21′ ∪ sa_22′ ∪ sa_23′, sa_31′ ∪ sa_32′ ∪ sa_33′, …, sa_h1′ ∪ sa_h2′ ∪ sa_h3′}

where sa_11′ ∪ sa_12′ ∪ sa_13′ is an aggregation of the multiple sensitive values of an individual. In our approach, the QI_T is further divided into sub-tables based on its categorical and numerical values.

Definition 2: Individual Condition Fingerprint Array (ICFA). The grouped multiple sensitive attribute values of an individual, aggregated from the individual's multiple records, form the Individual Condition Fingerprint Array.

Definition 3: Equivalence Class. For a multi-record dataset R_T, the tuples with the same quasi-identifier values in R_T form an equivalence class.
Definition 4: k-anonymity The published data are said to satisfy k-anonymity if and only if the information of each individual in the published data cannot be distinguished from at least k-1 individuals whose information also appears in the published table.
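As a concrete illustration of the definition, the following minimal sketch (our own illustration; the table layout and the helper name `is_k_anonymous` are hypothetical, not the paper's implementation) checks whether a table satisfies k-anonymity over a chosen set of quasi-identifier columns:

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(count >= k for count in groups.values())

# Toy generalized table: (age, sex, zip, disease code)
table = [
    ("<20", "F", "6****", "V22"),
    ("<20", "F", "6****", "V90"),
    ("30-40", "M", "5****", "250"),
    ("30-40", "M", "5****", "401"),
]
print(is_k_anonymous(table, qi_indices=(0, 1, 2), k=2))  # True
```

Each row of the toy table is indistinguishable from at least one other row on the quasi-identifiers, so it is 2-anonymous but not 3-anonymous.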
Definition 5: f-slicing. f-slicing is an approach to uniformly partition high-dimensional data based on the similarity between multiple attributes in the dataset, using a mode frequency value f and forming equivalence classes of size f containing the similarly grouped attribute tuples.
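A minimal sketch of the partitioning idea in Definition 5 (the function name is ours; the full model additionally derives f from the mode frequency of the ICFA buckets and treats the categorical and numerical sub-tables differently, as described in Sect. 7.4):

```python
def f_slice(tuples, f):
    """Sort attribute tuples so similar ones are adjacent, then cut the
    ordered list into equivalence classes of size f."""
    ordered = sorted(tuples)
    return [ordered[i:i + f] for i in range(0, len(ordered), f)]

buckets = [("422", "490"), ("250",), ("422", "490"), ("250",)]
print(f_slice(buckets, 2))
```

Here identical fingerprint buckets end up in the same equivalence class of size f = 2.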
7 Frequency-slicing with intensified privacy (f-slip) model

After analyzing various existing models, the f-slip model has been proposed with an algorithm and an architecture framework. It is observed that little work has been carried out on the 1:M dataset with multiple sensitive attributes. It has been proved that the proposed model resists various attacks, which are explained clearly in the experimental evaluation section. The goal of the f-slip model is to ensure less information loss with intensified privacy during privacy-preserved data publishing of the 1:M dataset with multiple sensitive attributes. The proposed model performs the following steps: (1) finding the correlation between the sensitive attributes, (2) pre-processing and aggregating the multi-records into a single record, (3) anatomizing both QI and SA, (4) implementing k-anonymity on QI, (5) f-slicing on the ICFA and (6) merging QI and SA.

Correlation of sensitive attributes
In the f-slip model, the first step is to find the correlation between the sensitive attributes. The purpose of calculating this correlation is to measure the dependency among the different sensitive attributes. If anatomization were performed by simply splitting the table into two or three sub-tables without computing the correlation, the linking relationship would break. If the table is divided into multiple tables without computing the correlation between the sensitive attributes, there is a high chance that unrelated attributes are grouped together, which might lead to high information loss. In Table 1, there are four sensitive attributes: salary, poverty, education and disease code. Poverty, education and disease code are categorical, and salary is numerical. Two correlation metrics are used to cover both categorical and numerical attributes. Cramer's V is a metric used to measure the correlation between the categorical sensitive attributes.
Cramer's V is computed as

V = √(chi² / (TS × min(nr − 1, nc − 1)))

where chi² = chi-square statistic, TS = total sample size, nr = number of rows, nc = number of columns.
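The statistic can be computed directly from an observed contingency table; the sketch below is a generic, stdlib-only implementation of Cramer's V using the quantities named above (chi², TS, nr, nc), not the paper's code:

```python
import math

def cramers_v(contingency):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    total = sum(sum(row) for row in contingency)          # TS
    row_sums = [sum(row) for row in contingency]
    col_sums = [sum(col) for col in zip(*contingency)]
    # chi-square statistic over observed vs. expected counts
    chi2 = sum(
        (obs - row_sums[i] * col_sums[j] / total) ** 2
        / (row_sums[i] * col_sums[j] / total)
        for i, row in enumerate(contingency)
        for j, obs in enumerate(row)
    )
    nr, nc = len(contingency), len(contingency[0])
    return math.sqrt(chi2 / (total * min(nr - 1, nc - 1)))

print(cramers_v([[10, 0], [0, 10]]))  # 1.0 (perfect association)
```

A value near 1 indicates strongly correlated categorical attributes, which f-slip places in the same sub-table.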
One-Way ANOVA is a metric used to measure the correlation between numerical and categorical attributes.
The F statistic is computed as

F = MNSQ_B / MNSQ_W

where MNSQ_B = mean square between samples = SSS_B / (ng − 1) and MNSQ_W = mean square within samples = SSS_W / (n_tob − ng), with SSS_B = sum of squares between samples, SSS_W = sum of squares within samples, ng = number of groups and n_tob = total number of observations. The between-sample sum of squares is

SSS_B = Σ_ng n_ng (ȳ_ng − ȳ)²    (5)

where n_ng is the size of group ng, ȳ_ng is the mean of group ng and ȳ is the grand mean. The correlation between the sensitive attributes salary, poverty, education and disease code is then calculated.
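Using the same definitions, the F statistic can be sketched as follows (a generic one-way ANOVA implementation for illustration, with the numerical attribute's values grouped by the levels of a categorical attribute; not the paper's code):

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    ng = len(groups)                                   # number of groups
    n_tob = sum(len(g) for g in groups)                # total observations
    grand = sum(sum(g) for g in groups) / n_tob        # grand mean
    sss_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)  # Eq. (5)
    sss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    mnsq_b = sss_b / (ng - 1)                          # mean square between
    mnsq_w = sss_w / (n_tob - ng)                      # mean square within
    return mnsq_b / mnsq_w

# e.g. salary values grouped by the levels of a categorical attribute
print(anova_f([[1, 2], [3, 4]]))  # 8.0
```

A large F indicates that the numerical attribute varies strongly across the categorical levels, i.e., the two attributes are dependent.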
As per the correlation metrics, we anatomize the sensitive attribute table (SA_T′) into two sub-tables, SA_T1 and SA_T2, each with highly correlated attributes. As per Table 4, poverty and salary are highly correlated, so SA_T1 is formed with poverty and salary, and SA_T2 with education and disease code.

Pre-processing and aggregation of multi-records
In the 1:M dataset, the records of an individual are distributed. Since the records are widely distributed, an Individual Condition Fingerprint Array (ICFA) (Definition 2) cannot be formed directly. The several records of an individual are therefore aggregated into a single record R_T*, as shown in Table 5. The different sensitive attribute values of an individual are aggregated to form the ICFA. An intruder might use the ICFA along with the quasi-identifier values to find the complete information of an individual. For example, if an intruder knows Tom's quasi-identifier values (age 32, gender M, zip code 56,000) and also knows that he has completed his masters and earns a good salary with a low poverty line, the intruder can easily infer that the last three records of Table 2 belong to Tom and can find his complete details.
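The aggregation step can be sketched as follows (the record layout, field order and function name are our assumptions for illustration, not the paper's implementation):

```python
from collections import defaultdict

def aggregate_records(visits):
    """Aggregate an individual's visit records into one row whose sorted
    disease codes form the Individual Condition Fingerprint Array (ICFA)."""
    by_person = defaultdict(list)
    for uid, qi, disease_code in visits:
        by_person[(uid, qi)].append(disease_code)
    # sorting makes identical condition sets map to the same fingerprint bucket
    return {uid: (qi, tuple(sorted(set(codes))))
            for (uid, qi), codes in by_person.items()}

visits = [
    (1, ("13", "M", "55000"), "V22"),
    (1, ("13", "M", "55000"), "V90"),
    (2, ("18", "F", "60000"), "250"),
]
print(aggregate_records(visits)[1][1])  # ('V22', 'V90')
```

After aggregation, person 1's two visits collapse into the single fingerprint bucket ('V22', 'V90').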
As per the sample dataset, poverty, salary and education do not have multiple values per individual record. Hence, the ICFA is formed only for the disease code, which has various values. Most of the existing systems have not given importance to preserving the ICFA, which leads to privacy breaches. As the sensitive attributes in the 1:M dataset are diversified, the length of the Individual Condition Fingerprint Array will differ between individuals. After aggregating the records, duplicate records are deleted as a pre-processing step and missing values are filled with the average values of the attributes. Mean(Age(E_qc)) denotes the mean value of age in an equivalence class; similarly, the mean value is calculated for each equivalence class. For example,

Mean(Zip code(E_qc)) = (60,000 + 55,000 + 56,000) / 3 = 57,000

where Mean(Zip code(E_qc)) is the mean value of zip code in an equivalence class. The k-anonymity model (Def. 4) is implemented on QI_T(num) and QI_T(cat) separately, and a unique id is generated for all the tuples, as shown in Tables 6 and 7. Finally, QI_T(num) and QI_T(cat) are merged to form the QI_T(final) table. As per our knowledge, we are the first to anatomize the quasi-identifier table in the 1:M dataset to reduce the loss of information (Fig. 1).
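The mean-based generalization of the numerical quasi-identifiers can be sketched as follows (an illustrative helper, not the paper's code):

```python
def generalize_numeric(equivalence_class):
    """Replace every numeric QI value in an equivalence class by the class mean."""
    mean = sum(equivalence_class) / len(equivalence_class)
    return [mean] * len(equivalence_class)

print(generalize_numeric([60000, 55000, 56000]))  # [57000.0, 57000.0, 57000.0]
```

All tuples in the equivalence class then share the same generalized zip code value, 57,000.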
7.4 f-slicing and merging of QI and SA

The corresponding illustrations are shown in Figs. 2 and 3, respectively. The disease codes are generalized, and a sample is shown in Fig. 1. The related diseases for the disease codes are mapped using the ICD9X dictionary. As the ICD9X dictionary cannot be imported in Python 3, a new dictionary has been framed according to the dataset with the help of the ICD9X dictionary. Since there are several disease codes available, the values are generalized and replaced in SA_T′. The ICFA buckets of disease codes are arranged alphabetically, and the number of occurrences of each bucket is stored to calculate the frequency of occurrence of each ICFA bucket and set the value of f. SA_T1 is sorted with respect to education and disease code, and equivalence classes are formed according to the calculated frequency. A group id (GID) is allotted to all the equivalence classes created with similar records. As per SA_T1, the frequency f = 2; since the disease codes are categorical, the maximum similarities between the fingerprint buckets in an equivalence class are taken. A few disease codes contain the letter V; therefore, V's value is taken as 4, and thus Alan's disease codes have been converted to 422 and 490, as shown in Table 9b. For SA_T2, the value of f = 3; since the salary is numerical, the average of the equivalence class is taken, as shown in Table 8b. A group id (GID) is also created for the equivalence classes formed according to the calculated frequency. The unique_id is created just for reference because the records get shuffled during anonymization. The novel approach in the f-slip model is the slicing of records according to the frequency of occurrence of sensitive attribute values in each sub-table. In existing works, records in the anatomized table are not grouped (sliced) together according to the frequency of occurrence, and the partitioned tables are sliced with a fixed variable.
In contrast, the f-slip model anonymizes the partitioned table with a dynamic variable f, a process named "f-slicing." After the anonymization of SA_T1 and SA_T2, both tables are merged along with the GID. The complete framework of the f-slip model is depicted in Fig. 4. The anonymized tables that need to be disclosed with privacy preservation are shown in Table 10. The privacy-preserving data publishing of the f-slip model discloses the tables in three partitions: (a) QI_T(final), (b) sensitive attribute SA_T1 and (c) sensitive attribute SA_T2. As the sensitive attributes are anonymized using f-slicing, mapping an individual's details through Table 10a, b, c is very difficult.
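The f-slicing step described above can be sketched as follows. This is a simplified sketch under our own assumptions (the paper's fingerprint-bucket similarity is approximated here by a per-class majority representative, and the function and parameter names are ours): sorted sensitive-attribute records are partitioned into equivalence classes of size f, each class is replaced by a representative value, and a GID is attached.

```python
from statistics import mean

def f_slice(records, f, numeric=False):
    """Partition sorted SA records into equivalence classes of size f,
    anonymize each class, and attach a group id (GID)."""
    sliced = []
    for gid, start in enumerate(range(0, len(records), f), start=1):
        eqc = records[start:start + f]
        if numeric:
            # numerical SA (e.g. salary): replace by the class average
            rep = mean(eqc)
        else:
            # categorical SA (e.g. disease code): a shared representative value,
            # standing in for the max-similarity fingerprint bucket
            rep = max(set(eqc), key=eqc.count)
        sliced.extend((gid, rep) for _ in eqc)
    return sliced

# numerical sub-table with f = 3: each class of three salaries shares one average
salaries = sorted([30000, 32000, 31000, 60000, 62000, 61000])
print(f_slice(salaries, f=3, numeric=True))
```

Because f is chosen per sub-table from the observed frequencies rather than fixed globally, the same routine serves both SA_T1 (categorical, f = 2) and SA_T2 (numerical, f = 3).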

f-slip algorithm
The main objective of the f-slip algorithm is to provide a balance between privacy and utility. To the best of our knowledge, the existing systems have implemented anatomization only in the sensitive attribute table for 1:M datasets, whereas f-slip performs anatomization in both the quasi-identifier and sensitive attribute tables. The process steps of the proposed model f-slip are split into four algorithms (1, 2, 3 and 4) and outlined for understanding purposes. In Algorithm 1, the relational table R_T is passed as input in line 1. The multi-records of an individual are aggregated in line 2. The aggregated table R_T* is anatomized into the quasi-identifier table and sensitive attribute table in line 3. Further, the quasi-identifier table is k-anonymized by passing the value of the k parameter in line 4. The correlation between the sensitive attributes is computed, and the attributes are anonymized based on the correlation in lines 5 and 6. Finally, the anonymized quasi-identifier and sensitive attribute tables are merged and returned in lines 7 and 8. The implementation of k-anonymity to anonymize the quasi-identifier table is depicted in Algorithm 2. The quasi-identifier table and k parameter are passed as input parameters in line 1. QI_T is anatomized into QI_T(num) and QI_T(cat) in line 2. Both QI_T(num) and QI_T(cat) are anonymized separately by implementing k-anonymity and merged together in lines 3-6. In Algorithm 3, the correlation between the sensitive attributes is calculated. In line 1, the partitioned table SA_T' is passed as an input to the correlation function. To find the correlation between sensitive attributes, the attributes are categorized into categorical and numerical in line 2. Cramer's V is applied on the categorical attributes in line 3, and one-way ANOVA is implemented on the categorical and numerical attribute pairs, with the results stored in variable D in lines 4-8. The anonymization of the sensitive attribute table is outlined in Algorithm 4.
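The Cramer's V computation named in Algorithm 3 can be illustrated as follows. This is a generic, self-contained implementation of the standard statistic, not the authors' code: the chi-square statistic is computed from the contingency table of two categorical attributes and normalized to [0, 1].

```python
from collections import Counter
from math import sqrt

def cramers_v(x, y):
    """Cramer's V association between two categorical attributes,
    derived from the chi-square statistic of their contingency table."""
    n = len(x)
    px, py = Counter(x), Counter(y)          # marginal counts
    pxy = Counter(zip(x, y))                 # joint (contingency) counts
    chi2 = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n     # count expected under independence
            observed = pxy.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    # normalize: V in [0, 1], 1 = perfect association
    return sqrt(chi2 / (n * (min(len(px), len(py)) - 1)))

# perfectly associated attributes -> V = 1.0
edu = ["hs", "hs", "phd", "phd"]
code = ["a", "a", "b", "b"]
print(round(cramers_v(edu, code), 3))  # 1.0
```

A value near 0 indicates that two sensitive attributes can be anatomized into separate sub-tables without breaking a meaningful linking relationship; strongly associated attributes should stay together.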
The aggregated table SA_T and correlation table D are passed as input parameters in line 1. Based on the correlation, the SA table is anatomized and generalized in lines 2-4. The generalized tables are sorted to find the f parameter in lines 5 and 6. The process of setting the frequency mode for both categorical and numerical attributes is explained in lines 7-15. The f-slicing is performed on both SA sub-tables based on the frequency of occurrences of sensitive attributes, and a group id is allotted in lines 16-23. Finally, the merging of the f-sliced sub-tables is performed in line 24. The functions used in the algorithms are described in Table 11. Fig. 5 depicts the complete workflow of the f-slip algorithm.
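The anatomization step of Algorithm 1 (line 3) can be sketched as follows, using the QI and SA attribute names of the Informs dataset from the experimental section; the function name and record layout are our own assumptions, not the authors' code.

```python
QI_ATTRS = ["age", "racex", "sex", "marry"]                       # quasi-identifiers
SA_ATTRS = ["income", "poverty", "education", "condition_code"]   # sensitive attributes

def anatomize(records):
    """Split the aggregated relational table R_T* into a quasi-identifier
    table and a sensitive attribute table, linked by a generated unique id."""
    qi_table, sa_table = [], []
    for uid, rec in enumerate(records, start=1):
        qi_table.append({"uid": uid, **{a: rec[a] for a in QI_ATTRS}})
        sa_table.append({"uid": uid, **{a: rec[a] for a in SA_ATTRS}})
    return qi_table, sa_table

record = {"age": 34, "racex": 1, "sex": "f", "marry": 1,
          "income": 52000, "poverty": 2, "education": 12, "condition_code": 250}
qi, sa = anatomize([record])
```

The generated uid plays the role of the unique id kept for reference: after the sub-tables are anonymized and shuffled independently, it is the only remaining link between them.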

Experimental evaluation and result analysis
This section presents the experimental evaluation and result analysis of the proposed model on a real-world dataset in terms of execution time and data utility. The result analysis of the proposed model is compared with the existing 1:M privacy models through graphs. The implementation of the f-slip model is carried out in Python 3, and the experimental evaluation and result analysis are executed on the Windows 10 operating system with 8 GB memory. The implementation and evaluation are performed on the real-world dataset Informs (https://sites.google.com/site/informsdataminingcontest/). In the Informs dataset, {age, racex, sex, marry} are taken as QI attributes and {income, poverty, education, condition_code} as sensitive attributes. During the pre-processing phase, the duplicate records are removed and the missing values are replaced with the average values of the attributes. There are 244,321 records in the Informs dataset. After aggregation and pre-processing, the total number of records in the dataset is 42,999. Since the dataset is composed of both categorical and numerical attributes, the utility is measured for each separately. The primary challenges faced in f-slip are making the dictionary for condition_code and fixing the value of f for SA_T1* and SA_T2*.
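The pre-processing step described above (dropping duplicates, then imputing missing numeric values with the attribute average) can be sketched as follows; this is a minimal illustration under our own assumptions about the record layout, not the authors' pipeline.

```python
def preprocess(rows, attr):
    """Remove exact duplicate records, then replace missing values of
    `attr` with the attribute's average over the non-missing rows."""
    # 1. de-duplicate on the full record
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # 2. impute missing values of `attr` with the attribute average
    present = [r[attr] for r in unique if r[attr] is not None]
    avg = sum(present) / len(present)
    for r in unique:
        if r[attr] is None:
            r[attr] = avg
    return unique

rows = [{"age": 30}, {"age": 30}, {"age": None}, {"age": 50}]
print(preprocess(rows, "age"))  # duplicate dropped, None imputed with 40.0
```

In practice this would be applied per numeric attribute before the aggregation of an individual's multiple records.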

Utility loss
The utility of anonymized data is measured for categorical and numerical attributes separately. The metric Normalized Certainty Penalty (NCP) (Qiyuan et al. 2017) is used to measure the information loss of the categorical attributes in the anonymized quasi-identifier and sensitive attribute tables. The utility metric for numeric data is Numeric Information Loss (NIL), which is newly framed by customizing the iloss metric (Xinning and Zhiping 2020). NIL is used to measure the information loss of numerical attributes in the anonymized quasi-identifier and sensitive attribute tables.
NCP = |e| / |n|, where |e| is the number of nodes covered by the generalized value and |n| is the total number of nodes.
The output of NCP ranges from 0 to 1. A value of 0 means no information loss, and the NCP output is directly proportional to the information loss: as it increases, so does the loss. The NCP for each node is calculated as per Eq. 10. For example, the attribute poverty has five values, and those five values are generalized into four nodes; the NCP calculation for each node is shown in Fig. 6.
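The per-node NCP of Eq. 10 can be illustrated with a small sketch; the specific coverage counts below are illustrative assumptions, since the paper defers the exact per-node figures to Fig. 6.

```python
def ncp_node(covered, total):
    """NCP of a generalized node (Eq. 10): number of leaf values the
    generalized value covers, divided by the total number of values."""
    return covered / total

# poverty has five leaf values; a node generalizing two of them loses 2/5
print(ncp_node(2, 5))   # 0.4
# a node covering all five values is fully generalized: maximal loss
print(ncp_node(5, 5))   # 1.0
```

Averaging these per-node penalties over all generalized nodes yields the attribute-level information loss figures reported below.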
For a detailed explanation of the NCP measure, the information loss in condition_code is calculated as follows. Figure 7a depicts the generalization hierarchy for condition_code, and Fig. 7b shows the generalization of node 927. Though the dataset has condition codes 1-1038, they are not continuous; many condition_code values in between are missing. In Fig. 7, we can observe that in node B the range of values is 140-239, so the number of codes present in the range should be 99; however, only 94 values are present and five values are missing in between. Therefore, the range of values cannot be fixed, as it is not consistent for each node. The NCP for node Q is calculated as per Eq. 12. The average sum of information loss for the three condition nodes is 0.062, which is very low. The average information loss of the categorical attributes is 14.43%.
The information loss of all records in the relational table R_T for numerical attributes is calculated using the metric NIL, where R_T is the relational table, old(R_T) is the original data before anonymization, new(R_T) is the new data after anonymization, unique(R_T) is the set of unique values in the numerical attribute and n is the number of tuples in R_T. The total information loss in the numerical attributes is 7.97% as per the metric NIL, as shown in Eq. 14. NCP and NIL have been used to measure the information loss on categorical and numerical attributes, respectively. For the Informs dataset, we have analyzed the QI-information loss and SA-information loss with varying sizes of the dataset, as shown in Fig. 8, to evaluate whether the data utility is stable as the size of the dataset (n) increases. A series of data subsets is randomly selected from the whole dataset in intervals of 5000, and the f-slip algorithm has been run for the different sizes of the dataset. It is clearly shown that RMR has a high information loss for the different sizes of the dataset, and 1:M Mondrian has 30% information loss for the 5 k sample dataset. As the size of the dataset increases, the information loss becomes lower and consistent. The f-slip model results in very low information loss, and the loss is consistent as the size of the dataset increases for both the QI and SA tables.
The existing model RMR results in 66% QI-NCP when setting the value d = 0.66. RMR has high information loss in the quasi-identifiers, as it concentrates more on preserving the sensitive attributes. As shown in Fig. 8, RMR has a high utility loss compared to the 1:M Mondrian and f-slip models. The information loss in f-slip is approximately 9% and 4% lower than 1:M Mondrian in QI and SA, respectively. In Fig. 8, the y-axis represents the percentage of information loss and the x-axis represents the number of records; N (*1000) denotes the number of records in thousands, and the graph is plotted in intervals of 5 k. The QI-information loss is computed with varying parameter k and depicted in Fig. 9a, which shows that RMR has high utility loss in all cases. As the proposed model does not have a k parameter in SA, the information loss in QI alone is computed by varying the parameter k. In 1:M Mondrian, both QI-NCP and SA-NCP are sensitive to the parameter k; thus, there is variation in QI-utility loss. The f-slip model does not implement generalization in the QI table, so there is less information loss, and it remains consistent as the k parameter increases. The information loss in the sensitive attributes is calculated by varying the parameter f, as shown in Fig. 9b. RMR and 1:M Mondrian do not have an f parameter, so their information loss percentage based on the number of records is plotted in Fig. 9b. The sensitive attribute sub-tables have been executed by fixing different frequency values for different numbers of records. As shown in Fig. 9b, though the value of f varies, the information loss is very low and consistent. Thus, the proposed work proves that f-slip can work for higher-dimensional datasets with different frequency values.

Execution time
The efficiency of the proposed model has been evaluated by comparing its execution time with RMR and 1:M Mondrian. The execution time does not include the pre-processing steps. The proposed model has been executed for different sizes (n) of the dataset and compared with RMR and 1:M Mondrian. The captured results are clearly depicted in Fig. 10. The f-slip is much more efficient than RMR. The average execution time of RMR is approximately

Conclusion and future directions
The study presents work on privacy-preserving data publishing for 1:M datasets. An efficient model named f-slip has been proposed to address various attacks such as the Background Knowledge (bk) attack, Multiple Sensitive attribute correlation attack (MSAcorr), Quasi-identifier correlation attack (QIcorr), Non-membership correlation attack (NMcorr) and Membership correlation attack (Mcorr). Anatomization is performed to partition the original microdata into two tables: (a) quasi-identifier and (b) sensitive attribute. After partitioning, k-anonymity has been implemented on the quasi-identifier table. Based on the correlation among the sensitive attributes, the sensitive attribute table is partitioned. An anonymization method, f-slicing, was proposed to anonymize the sensitive attributes, where the parameter f is fixed based on the occurrences of ICFA. The parameter f can be fixed dynamically according to the dimensionality of the dataset. Extensive experiments have been performed on the real-world dataset Informs to show that the f-slip model performs better than RMR and 1:M Mondrian in terms of efficiency and information loss by varying the size of the datasets. There are also a few limitations in the f-slip model. In the Informs dataset, the quasi-identifiers of all the records of an individual have the same value, whereas they may change over time, e.g., age increases and zip code may also change. Also, in the Informs dataset, only condition_code has multiple values, whereas salary, poverty and education share the same value; thus, ICFA is formed only for condition_code.
When extending the work so that all the sensitive attributes may have different values, and therefore all the sensitive attribute values can be included in ICFA, the parameter f needs to be carefully chosen and fixed.