In this paper, we described a method of using open data to obtain the SHI of an individual from their address or postal code. We provided example Python code in the Google Colaboratory environment so that other researchers can replicate the workflow, and to demonstrate use of the respective open data APIs. These software tools are all open source and free to use. An SHI dataset created by this method can be readily linked with clinical datasets via the postal code or address, and the merged dataset can then be used for data analysis.
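The linkage step can be illustrated with a minimal sketch, assuming the SHI dataset has been loaded as a mapping from postal code to SHI and the clinical dataset as a list of records. The field names (`patient_id`, `postal_code`, `shi`) and all values below are hypothetical. Singapore postal codes have six digits, so the sketch restores any leading zero lost when a code was stored as a number.

```python
def pad_postal_code(raw):
    """Return the postal code as a six-character string, restoring leading zeros."""
    return str(raw).strip().zfill(6)


def link_shi(clinical_rows, shi_by_postal_code):
    """Attach an 'shi' field to each clinical row; None marks an unmatched code."""
    linked = []
    for row in clinical_rows:
        merged = dict(row)  # copy so the input rows are not modified
        key = pad_postal_code(row["postal_code"])
        merged["shi"] = shi_by_postal_code.get(key)
        linked.append(merged)
    return linked


# Hypothetical data for illustration only.
shi_by_postal_code = {"088123": 3, "560123": 4}
clinical_rows = [
    {"patient_id": "A", "postal_code": 88123},     # leading zero lost on import
    {"patient_id": "B", "postal_code": "560123"},
    {"patient_id": "C", "postal_code": "999999"},  # absent from the SHI dataset
]

linked = link_shi(clinical_rows, shi_by_postal_code)
unmatched = [row["patient_id"] for row in linked if row["shi"] is None]
```

Keeping unmatched records with an explicit `None`, rather than dropping them, makes it easy to review how many patients could not be linked before analysis.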
In describing these methods, we hope to increase awareness of the potential of open data among health services and epidemiology researchers. The main advantages of open data are convenience of use, containment of research costs, and, in the case of government-sourced open data, better comprehensiveness. Previous healthcare research has also successfully leveraged local open data [26–29]. Furthermore, use of APIs improves recency and reduces the chance of error, as no human curation is required. In this study, we discovered a small number of cases where a wrong SHI had been assigned in the original dataset.
Conversely, use of open data means that researchers are limited to the data fields provided by the source. We encountered this limitation in the current methodological description: private residential addresses were determined through a process of elimination, as no open dataset exists for them. For the current SHI workflow, this assumption is reasonable given that public and private housing in Singapore are essentially mutually exclusive, but the limitation may constrain other applications of open data locally. We note that in the current SHI application, relying on open data reduced our ability to differentiate private residential addresses into condominium (SHI 6) and landed property (SHI 7). However, we were able to identify destitute homes, which the original methodology could not. Future researchers may consider assigning a separate code (e.g. SHI 0) to this group of patients, or grouping them together with rental flat occupants. We were also able to identify residential nursing homes, and future researchers may wish to treat these as a separate group, especially for research involving older residents. Information on whether these are voluntary or private nursing homes is also available from the MOH HealthHub data source, if added granularity is desired.
Researchers using open data must trust the data source for veracity and completeness. This is a reasonable assumption for local government-sourced open data, given a stated commitment to providing timely and high-quality data [30]. Commercially sourced or community-sourced open data may not carry such a commitment. In this study, the government-sourced open data returned valid results in the vast majority of cases, but did contain a very small number of data errors. For example, the OneMAP API returned a NIL postal code for some buildings. We clarified these cases with the data administrators of the relevant authority and were able to resolve them by corroboration with other public government data sources. As with any data source, researchers making use of open data need to perform validation checks prior to use.
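One such validation check can be sketched as follows, assuming the API results have already been collected into a list of records. The record layout (`building`, `postal_code`) is hypothetical and is not the actual API response format. The sketch only flags records whose postal code is not a six-digit string, such as the NIL values described above. Flagged records would still need to be resolved against another data source.

```python
import re

# A valid Singapore postal code is exactly six digits.
SIX_DIGITS = re.compile(r"\d{6}")


def split_valid_and_flagged(records):
    """Separate records with a six-digit postal code from those needing review."""
    valid, flagged = [], []
    for record in records:
        code = str(record.get("postal_code", "")).strip()
        if SIX_DIGITS.fullmatch(code):
            valid.append(record)
        else:
            flagged.append(record)
    return valid, flagged


# Hypothetical records for illustration only.
records = [
    {"building": "Block 1", "postal_code": "560123"},
    {"building": "Block 2", "postal_code": "NIL"},
    {"building": "Block 3", "postal_code": ""},
    {"building": "Block 4"},  # field missing altogether
]

valid, flagged = split_valid_and_flagged(records)
```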
In this study, we validated the SHI obtained from open data against the original SHI dataset by Wong TH et al. The two showed near-perfect agreement, which suggests that the open data version is practically equivalent to the manually curated version. We do, however, acknowledge that both methods have shortcomings. The open data workflow revealed a small number of buildings for which the SHI computed by Wong et al. was wrong. On the other hand, correct postal codes could not be retrieved via open data for a small number of addresses. Researchers contemplating either method of determining the SHI should weigh these factors.
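A comparison of this kind can be sketched as below. The sketch assumes agreement is summarised with Cohen's kappa, which is an assumption made here for illustration and is not stated in this section. Both datasets are assumed to be mappings from postal code to SHI, and all values are hypothetical. It also lists the postal codes where the two sources disagree, so that they can be reviewed manually.

```python
import math
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of the marginal proportions, summed over categories.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return math.nan  # kappa is undefined when only one category is used
    return (observed - expected) / (1 - expected)


def compare_shi(shi_open, shi_original):
    """Return kappa over shared postal codes and the codes that disagree."""
    shared = sorted(shi_open.keys() & shi_original.keys())
    kappa = cohens_kappa([shi_open[c] for c in shared],
                         [shi_original[c] for c in shared])
    differing = [c for c in shared if shi_open[c] != shi_original[c]]
    return kappa, differing


# Hypothetical values for illustration only.
shi_open = {"088123": 3, "560123": 4, "310045": 2, "640501": 5}
shi_original = {"088123": 3, "560123": 4, "310045": 2, "640501": 4}
kappa, differing = compare_shi(shi_open, shi_original)
```

A disagreement does not show which source is wrong, so each listed postal code would need to be checked against a third source.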
Other considerations for use of open data include research data governance. At present, use of open data does not require Institutional Review Board (IRB) approval, as local IRBs do not have jurisdiction over data in the public domain. However, users of open data need to be familiar with the licenses under which the data are provided and with the acceptable terms of use. For example, the Singapore Open Data License [31] for data.gov.sg allows commercial and non-commercial use, but prevents users from assuming patent, trademark, or design rights.
Readers should also be aware that data in the public domain is not necessarily open data. In the current context of SHI determination, other information on property classification might be freely and publicly available on property agency websites. However, such websites are intended for human use rather than automated querying, and generally do not offer APIs. While data may still be obtained programmatically from such websites using web scraping software, this may not be what the site owners intend and may be perceived as malicious online behaviour. Use of web scraping tools is beyond the scope of this article, but we encourage fellow researchers to review the terms of use and the robots.txt file (a file describing acceptable use of automated web page retrieval for a given website) when interacting with web data sources that are not explicitly identified as open data.
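As a minimal illustration of reviewing a robots.txt file, the sketch below uses Python's standard `urllib.robotparser` module to test whether a given URL may be retrieved by an automated client. The robots.txt content, the user agent name, and the `example.org` URLs are hypothetical. Passing this check does not replace reading the site's terms of use.

```python
from urllib import robotparser


def is_fetch_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/
"""

listing_allowed = is_fetch_allowed(
    EXAMPLE_ROBOTS_TXT, "research-bot", "https://example.org/listings/123"
)
about_allowed = is_fetch_allowed(
    EXAMPLE_ROBOTS_TXT, "research-bot", "https://example.org/about"
)
```

Against a live site, the same parser can retrieve the file itself with `set_url()` followed by `read()`. The rules are supplied as text here so that the example runs without network access.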
The validity, strengths and limitations of SHI as an SES marker are beyond the scope of this study. The SHI can potentially be incorporated into composite indices using methods such as Principal Component Analysis. This approach of constructing building-level property-value indices as an SES marker could potentially be employed in a similar fashion outside Singapore.