1. Schmidt, B., Hildebrandt, A.: Next-generation sequencing: big data meets high performance computing. Drug Discovery Today (2017)
2. Terrizzano, I.G., Schwarz, P.M., Roth, M., Colino, J.E.: Data wrangling: The challenging yourney from the wild to the lake. In: CIDR (2015)
3. Mernik, M., Heering, J., Sloane, A.M.: When and how to develop domain-specific languages. ACM computing surveys (CSUR) 37(4), 316–344 (2005)
4. Dyer, R., Nguyen, H.A., Rajan, H., Nguyen, T.N.: Boa: Ultra-large-scale software repository and source-code mining. ACM Transactions on Software Engineering and Methodology (TOSEM) 25(1), 7 (2015)
5. Dyer, R., Nguyen, H.A., Rajan, H., Nguyen, T.N.: Boa: Ultra-large-scale software repository and source-code mining. ACM Transactions on Software Engineering and Methodology (TOSEM) 25(1), 7 (2015)
6. Deus, H.F., Correa, M.C., Stanislaus, R., Miragaia, M., Maass, W., De Lencastre, H., Fox, R., Almeida, J.S.: S3ql: A distributed domain specific language for controlled semantic integration of life sciences data. BMC bioinformatics 12(1), 285 (2011)
7. Prlic´, A., Yates, A., Bliven, S.E., Rose, P.W., Jacobsen, J., Troshin, P.V., Chapman, M., Gao, J., Koh, C.H., Foisy, S., et al.: Biojava: an open-source framework for bioinformatics in 2012. Bioinformatics 28(20), 2693–2695 (2012)
8. Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., et al.: The bioperl toolkit: Perl modules for the life sciences. Genome research 12(10), 1611–1618 (2002)
9. Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al.: Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11), 1422–1423 (2009)
10. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
11. Hadoop and MongoDB.
https://www.mongodb.com/hadoop-and-mongodb
12. Genomics England. https://www.genomicsengland.co.uk/
13. Turnbull, C., Scott, R.H., Thomas, E., Jones, L., Murugaesu, N., Pretty, F.B., Halai, D., Baple, E., Craig, C., Hamblin, A., et al.: The 100000 genomes project: Bringing whole genome sequencing to the nhs. BMJ: British Medical Journal (Online) 361 (2018)
14. Taylor, R.C.: An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 11 Suppl 12, 1 (2010)
15. Mahadik, K., Wright, C., Zhang, J., Kulkarni, M., Bagchi, S., Chaterji, S.: Sarvavid: A domain specific language for developing scalable computational genomics applications. In: Proceedings of the 2016 International Conference on Supercomputing. ICS ’16, pp. 34–13412. ACM, New York, NY, USA (2016). doi:10.1145/2925426.2926283. http://doi.acm.org/10.1145/2925426.2926283
16. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of molecular biology 215(3), 403–410 (1990)
17. Leo, S., Santoni, F., Zanetti, G.: Biodoop: bioinformatics on hadoop. In: Parallel Processing Workshops, 2009. ICPPW’09. International Conference On, pp. 415–422 (2009). IEEE
18. Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., Heljanko, K.: Hadoop-bam: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28(6), 876–877 (2012). doi:10.1093/bioinformatics/bts054
19. Sadasivam, G.S., Baktavatchalam, G.: A novel approach to multiple sequence alignment using hadoop data grids. In: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud. MDAC ’10, pp. 2–127. ACM, New York, NY, USA (2010). doi:10.1145/1779599.1779601. http://doi.acm.org/10.1145/1779599.1779601
20. Langmead, B., Hansen, K.D., Leek, J.T.: Cloud-scale RNA-sequencing scientific data analysis. In: Proceedings of the 4th ACM Workshop on Scientific Cloud Computing, pp. 13–20 (2013). ACM
21. Islam, M.J., Sharma, A., Rajan, H.: A cyberinfrastructure for big data transportation engineering. Journal of Big Data Analytics in Transportation (2019). doi:10.1007/s42421-019-00006-8
22. Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., Kasprzyk, A.: Biomart–biological queries made easy. BMC genomics 10(1), 22 (2009)
23. Drost, H.-G., Paszkowski, J.: Biomartr: genomic data retrieval with r. Bioinformatics 33(8), 1216–1217 (2017)
24. Koonin, E.V., Wolf, Y.I.: Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic acids research 36(21), 6688–6719 (2008)
25. Dede, E., Govindaraju, M., Gunter, D., Canon, R.S., Ramakrishnan, L.: Performance evaluation of a mongodb and hadoop platform for
scientific data analysis. In: Proceedings of the 4th ACM Workshop on Scientific Cloud Computing, pp. 13–20 (2013). ACM
26. Chodorow, K.: MongoDB: the Definitive Guide: Powerful and Scalable Data Storage. " O’Reilly Media, Inc.", ??? (2013)
27. Pruitt, K.D., Tatusova, T., Maglott, D.R.: Ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35(suppl_1), 61–65 (2006)
28. Rajan, H.: Bridging the digital divide in data science. In: SPLASH/SPLASH-I’17: The ACM SIGPLAN Conference on Systems, Programming, Languages and Applications: Software for Humanity (2017)
29. Generic Feature Format Version 3. http://gmod.org/wiki/GFF3