A Study on the Trading Price Estimation Algorithm for Healthcare Transaction

Background: While more attention has been paid of late to utilization plans for big data in the healthcare sector worldwide, few scholars have addressed the value estimation of healthcare data. Accordingly, this study proposes a reasonable price estimation algorithm that can be applied to bidirectional exchange in healthcare data platforms.

Methods: This study incorporates three methodologies for data valuation, namely cost-based, market-based, and impact-based approaches. The cost-based approach calculates the value of data based on the costs associated with data creation, management, and utilization. The market-based approach evaluates it by comparing the market price of a service similar to the data. Finally, the impact-based approach estimates the data value with an emphasis on improving future revenue generation and productivity as an effect of using the data.

Results: The trading price of healthcare data is determined by the sum of two prices: the fundamental price and the dynamic price. The fundamental price can be further subdivided into the beginning value, complexity value, and network value. The beginning value is determined in proportion to the physical file size of the data, and the fundamental price is estimated by adding to the beginning value the complexity value and network value, which reflect the qualitative value of the data (within 20% of the beginning value). First, the complexity value can increase if more personal information, more information relevant to the national health insurance system, and more recent and long-term information are included in the dimensions of identification, material, and time information inherent in healthcare data. Second, the network value reflects whether the data can be well linked with data not only from the healthcare sector but also from other fields and sectors. The higher the match rate between the attribute value keywords of the data and the healthcare search keywords of journals of excellence and portal services, the higher the value given. Finally, the dynamic price reflects real-time preferences for the data and changes in data supply and demand as the actual exchange proceeds through healthcare data trading. To this end, the dynamic value is determined within the upper and lower 5% band of the previous month's trading price based on the number of monthly views of the data, the number of downloads of summary data, and the number of actual purchases, and this is reflected in the next month's trading price.

Conclusions: If the algorithm for estimating the trading price of healthcare data proposed in this study is applied to actual data trades, it would expand the transactions of healthcare data from both the supply and demand sides. Also, as actual data exchanges proceed and actual data trades accumulate, continuing studies on the weighting parameters are needed to better reflect reality; such studies would enable the assignment of additional values or penalties.

academic journals of excellence selected by the National Research Foundation of Korea (NRF) for one year were derived, and the top 30 keywords related to healthcare for the past one year were derived from the internet portal services, and in this way, a total of 100 top keywords were determined. When the match rate of the developed keywords and the keywords of the attribute value of the healthcare data exceeds 10%-20%, an additional value up to 10% can be added to the beginning value.

On the other hand, the dynamic price reflects the real-time preferences of data consumers and similar changes in the data supply and demand as the actual trading proceeds through data exchange after the fundamental price of healthcare data is estimated. In the process, based on several factors (views of the data, downloads of summary data, actual purchases), if any of these falls within the upper 25% of the distribution within a month, an additional value of 1%-5% is added to the previous month's price. Conversely, if it falls within the lower 25% of the distribution, the price drops by 1%-5%, and if it falls within the middle 50% of the distribution, that is, 26%-74% of the distribution, the price of the previous month is carried over as the price at the beginning of the next month.

According to previous studies, cost-based approaches, market-based approaches, and impact-based approaches have been mainly adopted as techniques for the evaluation of data values [5,6,7]. First, the cost-based approach is a method of calculating the value of data based on the costs associated with data creation, management, and utilization. On the other hand, the market-based approach evaluates the data value by comparing the price of a service similar to the data in a market where there are suppliers and consumers of the data. Finally, the impact-based approach estimates the data value with an emphasis on improving future revenue generation and productivity as an effect of using the data.

Since a trading market for healthcare data has not yet formed in South Korea, the data service charge of public institutions is set primarily by cost-based methods. For example, the cost estimation for

the press release material from the Financial Security Institute [10], the number of participating institutions and registered data has increased sharply within 10 days of launch. However, a reasonable standard for data price estimation has not yet been established. For now, it is thought that the Exchange accepts the initial quotation price from the data provider as it is.

While the service charge for healthcare data has been estimated based on cost by public service providers, this study incorporates all three methodologies for the data valuation, that is, it incorporates cost-based, market-based, and impact-based approaches. Accordingly, the transaction price estimation algorithm of healthcare data is analyzed by dividing it into two stages: the fundamental price and the dynamic price.

According to the National Data Map of DATA.GO.KR [11], the healthcare sector is classified into two categories: "health insurance" and "medical care." The former is classified according to the proportion of serviced data, such as health insurance (7.12%), prescription drugs (3.31%),

As seen in Table 1, according to HIRA [12], the healthcare big data of the relevant institution can be largely divided into six categories: healthcare treatment information, medication information, treatment materials information, healthcare resources information, healthcare service quality evaluation information, and non-benefit information. The data are primarily for insurance review, and data on healthcare treatment and medication, in direct relation to healthcare charge, account for the main portion; a wide range of information, such as treatment materials and healthcare resources, is included in the big data.

In addition, the National Health Insurance Service [9] provides cohort DB services collected by year and province. It is divided into four types of DBs: health examination cohort, elderly cohort, infants and children screening cohort, and workplace women cohort. The DB is generally composed of information on qualifications, healthcare treatment, disease history and health examination, and long-term care facilities for the period.

As can be seen from the above, the various healthcare data serviced by the public and private sectors differ considerably in structure and format. Furthermore, due to regulations such as personal information protection and institutional approval processes, there are difficulties in accessing and using the data. To address these limitations, previous studies have investigated the utilization of distributed research networks and common data models [13]. The distributed research network manages distributed data only, without sharing the original data between consumers, which reduces the difficulties from restrictions on the use of healthcare data.

Therefore, for efficient utilization of data from multiple hospitals for research, the implementation of a common data model, in which data are saved and managed in a shared format, is essential.

In this case, the healthcare data are highly standardized for utilization, facilitating links with other healthcare data, which adds higher value to the healthcare data. These properties of healthcare data are reflected in the "network value," which will be discussed in more detail in Chapter IV.

The estimation of the trading price of healthcare data is largely divided into two stages: estimation of the fundamental price and the dynamic price. In this study, an algorithm for estimating the optimal price was analyzed for the two stages. This algorithm encompasses the three approaches of data value discussed previously: a cost-based approach, a market-based approach, and an impact-based approach.

In this section, fundamental price estimation is first analyzed for the price estimation of healthcare data (database). The fundamental price consists of three components: the beginning value, complexity value, and network value. The estimation method for each value is presented below.

Healthcare data are likely to be in the form of an Excel or text file. It is reasonable that the beginning value for the fundamental price estimation of healthcare data is estimated in proportion to the simple size of the file. In the case of the NHIS data mentioned above, a price of 10,000 Korean won per 1GB is charged when using USB. Therefore, with reference to this charge, the beginning value can be applied as shown in Equation (1)

where BV is the beginning value.
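A minimal Python sketch of a size-proportional beginning value may help fix ideas. The rate of 10 Korean won per kilobyte is an assumption inferred from the 4,150KB case discussed in the text (41,500 Korean won); it is not a figure confirmed for Equation (1), so it is exposed as a parameter. The names `KRW_PER_KB` and `beginning_value` are hypothetical.

```python
# Assumed rate: inferred from the 4,150 KB -> 41,500 KRW example in the text,
# not taken from Equation (1) itself.
KRW_PER_KB = 10


def beginning_value(size_kb, krw_per_kb=KRW_PER_KB):
    """Return a beginning value (KRW) proportional to the file size in KB."""
    if size_kb < 0:
        raise ValueError("file size must be non-negative")
    return size_kb * krw_per_kb


bv = beginning_value(4150)  # the 4,150 KB case from the text
```

Keeping the rate as an argument makes it straightforward to substitute a different reference charge if a more reasonable benchmark becomes available.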

For example, if the size of the healthcare data is 4,150KB (kilobytes), the beginning value is set to 41,500 Korean won, roughly equivalent to 40 USD. However, it is necessary to consider that, because the beginning value is estimated based on the data service charge of a public institution with a cost-based approach, the value can be lower than the price set by a private market. In addition, if there is a more reasonable case of reference for the beginning value estimation, it can be reflected and applied for estimation.

The second element of healthcare data fundamental price, the complexity value, evaluates the value of the complexity and importance of healthcare data, as shown in Equation (2) below. The value is set by estimating and reflecting the qualitative value of the data to the beginning value discussed above:

where CV is the complexity value and the three factors indicate the identification information dimension, material information (or depth) dimension, and time information (or width) dimension of the healthcare data, respectively. Examining these one by one, the first factor represents the aggregate level of identification information, and a different weight is assigned for each case. The level of identification information is classified into four levels of person/county/province/nationwide, as can be seen in Equation (3):

where the four coefficients represent the proportions of applicable identification information in the entire dataset, so that the four proportions sum to 1. In addition, person, county, province, and nationwide are additional weights given to the data in units of person, county, province, and nation, respectively. Regarding personal level information, an additional weight of 20% was given. This is because the largest cost is required at the stage of the statistical process where a personal identifier must be anonymized because it carries sensitive individual information. A weight of 10% is given for data in county units, 4% for province units, and no additional weight for data in nationwide units. This is because if personal level data are available, regional and higher-level aggregate data can be developed and constructed with ease.

For example, if all the data items are composed of personal information, the complexity value is set by adding an additional weight (person = 0.2) of 20% to the beginning value. For individual hospitals or clinics where identification information is not a person, an additional value can be given on a county basis. This is because, although not at the same level of sensitivity as personal information, the comparative analysis by county is likely to include more useful information compared to aggregate information in the unit of province. However, more empirical studies are needed for this type of additional weighting parameter system in the future. That is, the cost of de-identifying the identifier of personal information may be different for each statistical agency and for each type of statistic. In addition, the level of accessibility of users to aggregate data, such as county and province, can be very different for each case.
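The identification dimension described above can be sketched as a proportion-weighted sum of the four level weights stated in the text (20%, 10%, 4%, 0%). This is an illustrative Python sketch, not the exact form of Equation (3); the dictionary keys and the function name `identification_weight` are hypothetical.

```python
# Additional weights per identification level, as stated in the text.
ID_WEIGHTS = {"person": 0.20, "county": 0.10, "province": 0.04, "nationwide": 0.0}


def identification_weight(shares):
    """Proportion-weighted identification weight.

    `shares` maps each level to its share of the dataset; shares must sum to 1.
    """
    unknown = set(shares) - set(ID_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown identification levels: {sorted(unknown)}")
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(shares.get(level, 0.0) * w for level, w in ID_WEIGHTS.items())


all_person = identification_weight({"person": 1.0})
mixed = identification_weight({"person": 0.5, "county": 0.5})
```

With every item at the personal level the weight is the full 20%; a dataset split evenly between personal and county level items lands halfway between 20% and 10%.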

As shown in Equation (4), the content of material information is classified into four levels: five major benefits (benefit5), other benefits (benefit), non-benefit, and others,

where the second factor represents the material information dimension of healthcare data (and a different weight is assigned to each case); the four coefficients represent the proportions of applicable material information in the entire dataset (summing to 1); and benefit5, benefit, non-benefit, and others represent items related to the five major benefits in national health insurance, items related to other benefits, non-benefit-related items, and items other than national health insurance.

First, the five major benefit-related items refer to the five diseases that account for major portions of the national health insurance payment, such as cancer. Other benefit-related items refer to the remaining benefit items, excluding the five major benefit items. Non-benefit-related items refer to those excluded from national health insurance benefits, such as cosmetic and dental procedures for esthetic purposes. Finally, all items not included in the above three categories were classified as "others". The healthcare resources and healthcare service quality evaluation information in Table 1 can be classified into this category. In summary, this enables the provision of additional values to data attributes related to the actual prevalence of diseases and the associated costs for the public.

As can be seen from Equation (5), a weight or penalty is assigned depending on whether the data are up to date. If data within the last three years are included, an additional value (10%) can be assigned owing to the update status of the data. However, if there are no data within the last three years, a penalty (-5%) can be given. This is because old data are less likely to be relevant in contemporary times.

where the third factor represents the time information dimension of healthcare data.
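The recency rule can be illustrated with a short Python sketch. It assumes that "within the last three years" means the reference year and the two preceding years, and that one recent observation is enough to earn the additional value; both readings are assumptions, as are the names used.

```python
def time_weight(data_years, reference_year, window=3):
    """Return +0.10 if any observation year falls in the last `window` years,
    otherwise the -0.05 penalty described in the text."""
    recent = any(reference_year - window < y <= reference_year for y in data_years)
    return 0.10 if recent else -0.05


updated = time_weight(range(2010, 2020), reference_year=2020)  # includes 2018-2019
stale = time_weight(range(2005, 2015), reference_year=2020)    # ends in 2014
```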

The time series span of the healthcare data is important. Because the cross-sectional information (depth) of the data has already been reflected in the material dimension of Equation (4)

The network value, the third and final component of healthcare data fundamental price, is a value that reflects whether the data can be well linked with data from other fields, such as engineering, psychology, and social sciences. For example, even when there are two sets of healthcare data with the same beginning value and complexity value, if one set of data is more likely to flexibly connect with data from other research fields, a higher value can be assigned to this data compared to the other data set.

where NV is the network value, BV is the beginning value, and KCI is the Korea Citation Index.

In fact, a certain amount of network value can be considered already present in the material and time information of the data analyzed in the complexity value described above. That is, if the proportion of items related to the benefits of national health insurance is high in consideration of the importance of diseases such as cancer, academic interest from other fields such as psychology and social science generally increases, and accordingly, the expanded application of healthcare data is highly likely to increase. In addition, this section also reflects the value of the network that healthcare data can have.

As can be seen from Equation (6) data, an appropriate additional value is given in proportion to the match rate.

In addition, the match rate can be applied differently for the humanities, social sciences, science and technology, and healthcare fields. For example, if the keyword match rate in the humanities and social sciences fields exceeds 10%, an additional weight of 10% is given, whereas in the case of science and technology and healthcare, an additional weight of 10% is given only when the match rate exceeds 20%. However, as the actual application cases for the applicable healthcare data are accumulated, more realistic numbers and appropriate classification methods can be applied for these match rates.
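The field-specific thresholds above can be sketched in Python as follows. The sketch assumes the match rate is the share of a dataset's attribute keywords that appear in the reference keyword list, which the text does not define precisely, and it applies the 10% additional weight to the beginning value. The field keys and function names are hypothetical.

```python
# Thresholds stated in the text; the field labels are hypothetical keys.
THRESHOLDS = {
    "humanities_social": 0.10,
    "science_technology": 0.20,
    "healthcare": 0.20,
}
BONUS_RATE = 0.10  # additional weight relative to the beginning value


def match_rate(attribute_keywords, reference_keywords):
    """Share of the dataset's attribute keywords found in the reference list.

    The choice of denominator is an assumption; the text does not specify it.
    """
    attrs = {k.strip().lower() for k in attribute_keywords}
    refs = {k.strip().lower() for k in reference_keywords}
    if not attrs:
        return 0.0
    return len(attrs & refs) / len(attrs)


def network_value(beginning_value, rate, field):
    """Return the additional value if the match rate exceeds the field threshold."""
    if rate > THRESHOLDS[field]:
        return beginning_value * BONUS_RATE
    return 0.0


rate = match_rate(["Cancer", "insurance", "age", "sex"],
                  ["cancer", "vaccine", "insurance"])
```

Under these assumptions, a 15% match rate would earn the additional value in the humanities and social sciences fields but not in healthcare, where the threshold is 20%.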

The fundamental price of healthcare data, summarized by Equation (7), consists of three components: beginning value, complexity value, and network value. The complexity value and network value, which estimate the potential qualitative value that the data may have, are added to the beginning value:

$$FP = BV + CV + NV \tag{7}$$

where FP is fundamental price, BV is beginning value, CV is complexity value, and NV is network value.

is somewhat different from the definition above. That is, after the fundamental price of healthcare data is set, the dynamic price of this study refers to a price element that can change dynamically over time, as actual trades are made through data exchange.

Dynamic price is estimated by reflecting changes in the supply and demand of the market and preferences of data consumer groups, such as the intensity of real-time demand for the data with fundamental price value set, and variations in the supply of similar healthcare data. Due to the nature of healthcare data, the actual data preference expressed from the user side rather than the provider side is likely to lead to a dynamic price change. Accordingly, dynamic price is estimated for a specific period (e.g., one month) based on the number of views of the data, the number of downloads of summary data, and the number of actual purchases, as shown in Equation (8), where DP is the dynamic price.

As can be seen from Equation (8), an additional value of 1% is given to the number of data views and the number of summary data downloads, while an additional value of 3% is given to the

where P is the trading price, FP is the fundamental price, DP is the dynamic price, and BV is the beginning value.
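The weighting in Equation (8) can be sketched in Python under a few stated assumptions: each indicator is compared with the same indicator across all listed datasets for the month, the remaining 3% weight is attached to actual purchases (consistent with the 5% band), and the adjustment is applied to the previous month's trading price. The exact form of Equations (8) and (9) is not reproduced here, and all names are hypothetical.

```python
from statistics import quantiles

# Weights stated in the text: 1% for views, 1% for summary downloads,
# 3% for purchases, so the total adjustment stays within +/-5%.
WEIGHTS = {"views": 0.01, "summary_downloads": 0.01, "purchases": 0.03}


def quartile_signal(value, market_values):
    """Return +1 for the upper 25%, -1 for the lower 25%, 0 for the middle 50%.

    `market_values` is the same indicator across all listed datasets
    (an assumption about what "the distribution" refers to).
    """
    q1, _median, q3 = quantiles(market_values, n=4, method="inclusive")
    if value > q3:
        return 1
    if value < q1:
        return -1
    return 0


def dynamic_rate(counts, market):
    """Sum the weighted quartile signals of the three indicators."""
    return sum(
        weight * quartile_signal(counts[name], market[name])
        for name, weight in WEIGHTS.items()
    )


def next_month_price(previous_price, counts, market):
    """Apply the monthly adjustment to the previous month's trading price."""
    return previous_price * (1.0 + dynamic_rate(counts, market))


# Hypothetical month: 100 listed datasets with indicator counts 1..100.
market = {name: list(range(1, 101)) for name in WEIGHTS}
counts = {"views": 90, "summary_downloads": 50, "purchases": 90}
rate = dynamic_rate(counts, market)
```

In this hypothetical month the dataset is in the upper quartile for views and purchases and in the middle range for summary downloads, so the adjustment is 1% + 3% = 4% of the previous month's price.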

That is, the fundamental price is determined at the time of data exchange registration (time 0), the dynamic price varies with the trading volume of the previous period (or previous month), and the trading price of the next period (or next month) is updated accordingly. In addition, the

distribution in terms of dynamic price estimation. If the upper and lower bands set at dynamic price estimation are increased, this time of 7.5 years will decrease, and when the bands are decreased, the time will increase. For example, if the upper and lower bands are decreased to 3%, it is estimated to take about 12.6 years for the data value to fall to 1%. This is similar to applying the principle of bond price estimation to the value estimation of healthcare data. That is, one of the most fundamental characteristics of a financial debt contract, such as bonds, is the maturity of liability. After 3, 5, or 10 years of maturity, the value of the bond expires. Until maturity, bond prices vary according to the principle of supply and demand in the bond's secondary market.
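The two horizons quoted above (about 7.5 years at a 5% band and about 12.6 years at a 3% band) are consistent with compounding the maximum monthly decrease until 1% of the starting price remains. The Python sketch below shows that arithmetic, assuming the price falls by the full band every month; the function name is hypothetical.

```python
import math


def months_to_fraction(monthly_drop, target_fraction=0.01):
    """Months until a price that falls by `monthly_drop` each month
    reaches `target_fraction` of its starting value."""
    # Solve (1 - monthly_drop) ** n = target_fraction for n.
    return math.log(target_fraction) / math.log(1.0 - monthly_drop)


years_at_5_percent = months_to_fraction(0.05) / 12
years_at_3_percent = months_to_fraction(0.03) / 12
```

Because the number of months scales with 1 / log(1 - band), narrowing the band lengthens the paid period and widening it shortens the period, as the text describes.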

When these characteristics are applied to healthcare data, after a paid period (e.g., 5, 7, or 10 9 years), on the basis of prior agreement with the data providers, the healthcare data can be provided 10 free of charge.

The characteristic of the bond price structure is that when one reference bond price is determined, a certain liquidity premium or risk premium is applied in conjunction with the price,

10-year KTB, which have the same characteristics except maturity period, will be determined by adding only a certain liquidity premium to the 3-year KTB price. This is because in the case of KTB, there is little risk premium, regardless of whether the maturity is 3-year or 10-year, and only compensation for liquidity constraints is required.

When this principle is applied to the price estimation of healthcare

On the other hand, dynamic price reflects the real-time preferences of data consumers and similar changes in the data supply and demand as the actual trading proceeds through data exchange after the fundamental price of healthcare data is estimated. In the process, based on the number of views of the data, the number of downloads of summary data, and the number of actual purchases, if any of these falls within the upper 25% of the distribution within a month, an additional value of 1%-5% is given to the previous month's trading price. Conversely, if it falls within the lower 25% of the distribution, the price drops by 1%-5% of the previous month's trading price, and if it falls within the middle 50% of the distribution, that is, 26%-74% of the distribution, the price of the previous month is carried over as the trading price at the beginning of the next month.

In addition, after a certain time has elapsed, the demand for old healthcare data is bound to decrease continuously. When the equation of dynamic price estimation in this study is applied to this type of old healthcare data, it is highly probable that the price decreases monthly by about 5% of the price of the previous month. In this case, it is estimated to take about 7.5 years for the price of the data to fall to 1% of the fundamental price. When considering characteristics such as the expiration period of healthcare data value, after a certain paid period (e.g., 5, 7, or 10 years), on the basis of prior agreement with the data providers, the healthcare data can be provided free of

Non-benefit information
Non-benefit healthcare cost, non-benefit items (309 items), certifying documentation provision charge (31 items), minimum and maximum cost per healthcare institution and others