Design of logical model
The design of Tilde started with developing a domain model that could be used to describe all concepts pertinent to the GeoNet data use cases. While doing that, we explored open standards and vocabularies used to describe time series data.
The Open Geospatial Consortium (OGC) is an international consortium that develops and publishes standards for geospatial and location-based services, including TimeseriesML (https://www.ogc.org/standard/tsml/) and SensorML (https://www.ogc.org/standard/sensorml/). We explored both, but neither could fully model our metadata requirements: TimeseriesML was too generic and SensorML too sensor oriented. In addition, the budget and skill sets required to develop an OGC-compliant system were not available to us within a workable timeframe.
The WOVOdat vocabulary and schema (Venezky and Newhall, 2007), on the other hand, were too specific to volcano data to cover the multi-hazard use cases of Tilde. Furthermore, the VMG, our largest volcano data user, does not use the WOVOdat vocabulary and schema, so incorporating parts of them into how Tilde handles volcano data offered no advantage.
We therefore opted for a logical model and vocabulary developed in house. The model is designed to rationalize, map out and name all the common concepts needed to describe GeoNet time series data, and to define how their entities are related. This decision has benefits and drawbacks, which we touch on later.
The Tilde logical model is based on the concepts of domains, series, “data objects” and their attributes (Fig. 1). A “series” is composed of all observations recorded by a sensor through time and processed to obtain measurements of a specific parameter. Each data object describes an individual observation at a specific point in time and includes a timestamp, an observation value, an observation error and a quality control flag. The quality control flag is not yet in use, though work to populate it is underway. Different series are grouped under the same domain. A domain is directly related to the type of discipline and sensor used to record the data object (for example, environmental sensor). A series is uniquely identified by a number of attributes that include: a) the station identifier, where the sensor is installed or manual observations are made; b) a sensor code identifier, used to distinguish between different sensors at the same location, or to differentiate changes in the same type of sensor through time, when those changes are significant and might impact the time series; c) a series name, used to describe what is measured by the data object (for example, water temperature); d) a method, used to describe the methodology used to collect the measurements or their time resolution (for example, a 15-minute sampling rate or the flyspec collection system); e) an aspect, used to differentiate between features of the same time series when further disambiguation is necessary.
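As a reading aid, the entities described above can be sketched as two small Python data classes. This is only an illustration of the logical model, not the Tilde implementation: the class names, field names and example values (including the station code XX001) are hypothetical and chosen for readability.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SeriesKey:
    """Attributes that together identify one series (illustrative names)."""
    domain: str       # discipline and sensor type, e.g. "envirosensor"
    station: str      # where the sensor is installed or observations are made
    sensor_code: str  # distinguishes sensors at the same station
    name: str         # what is measured, e.g. "water-temperature"
    method: str       # collection methodology or time resolution
    aspect: str       # further disambiguation, when needed


@dataclass(frozen=True)
class DataObject:
    """One observation at a specific point in time."""
    timestamp: datetime
    value: float
    error: Optional[float] = None
    qc_flag: Optional[str] = None  # part of the model, not yet in use


# Hypothetical example values, for illustration only.
key = SeriesKey("envirosensor", "XX001", "01",
                "water-temperature", "snapshot", "nil")
series = {
    key: [DataObject(datetime(2023, 1, 1, tzinfo=timezone.utc), 12.5, 0.1)]
}
```

Because the key is immutable and hashable, it can index a dictionary of observations, which mirrors the idea that the attribute combination uniquely identifies a series.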
Additional attributes describe the station metadata and the properties of all data objects in a series. Figure 1 provides a simplified snapshot of the Tilde logical model and some additional properties. A detailed and complete description of all properties is provided in the Tilde API documentation (GeoNet, 2022).
Figures 2 and 3 provide real-world examples of how the logical model has been applied to two types of volcano monitoring time series. We provide one example for continuous data and one for manually collected data to show the fundamental entities used to identify a time series and how they apply in different scenarios.
Figure 2 illustrates the entities of a time series generated by a datalogger measuring the temperature of a fumarole. The datalogger is part of the “envirosensor” domain (GeoNet, 2023; GNS Science, 2018). The datalogger is installed at a station, and the sensors in the nearby fumarole. The datalogger records a snapshot of the temperature every 10 minutes, together with the maximum and minimum values over each 10-minute interval. Different sensors are identified with the sensor code attribute. Each sensor thus generates a uniquely identifiable time series that can be grouped under the same station and name. The same station often has other types of sensors measuring different environmental observations.
Figure 3 illustrates a similar use case, but for the manually collected data domain (or “manualcollect”; GNS Science, 1954). In this case, the sensor code attribute is used to distinguish “spring-temperature” observations taken with a thermocouple sensor at slightly different points in the same area (station), for example spring source and spring outflow. For “manualcollect” time series, the aspect attribute is rarely used, but it can differentiate observations derived from multiple samples taken at the same time from the same feature, for example chloride concentration from spring water where all samples and analyses provide slightly different but equally valid values.
A complete description of the Tilde domain model and all its entities is provided through the Tilde API documentation (GeoNet, 2022).
System Architecture
A detailed description of the Tilde system architecture is outside the scope of this paper, but a simplified overview is provided in Fig. 4.
The majority of GeoNet data processing systems run on Amazon Web Services (AWS) cloud infrastructure. Software, configurations and databases developed in house are maintained on GitHub, and systems integration and deployments are orchestrated by several processes and infrastructure as code. GeoNet data storage relies on the AWS Simple Storage Service (S3), and applications generally use scheduled tasks or messaging systems and queues to run processes. A metadata database common to all GeoNet sensor networks (GNS Science, 2019) describes the instrument and station metadata that all applications rely on.
GeoNet data collection and processing systems, separate from Tilde, handle the continuous data feeds (at sampling rates that depend on the data domain) and generate time series observations. Similarly, the VMG curates and processes manually collected data separately from Tilde. Tilde time series metadata attributes are automatically extracted from the common metadata database (GNS Science, 2019), which is also the primary source for the systems that perform data processing.
Once the continuous (or sampled) data processing is completed, each time series data point is formatted as a JSON structure compatible with Tilde and loaded into an S3 “spool” bucket. A Tilde ingest service then consumes the data point, associates it with the correct time series metadata using the unique identifying attributes of the time series, and stores it on S3. Tilde uses the files stored on S3 as a form of database. This proves to be both cost effective and scalable (it supports many concurrent users). However, as time series grow, the many accumulated individual data points must be routinely repackaged into monthly and yearly files (a process referred to as “bundling”) to keep performance acceptable. The current design limits Tilde to time series with a sampling interval of 15 seconds or longer. Future evolutions of Tilde could relax this limit if a sensible use case arises.
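The idea behind bundling can be illustrated with a minimal Python sketch that groups individual data points by calendar month. This is not the Tilde bundling code: the function name, the (timestamp, value) tuple layout and the use of UTC month boundaries are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bundle_by_month(points):
    """Group (timestamp, value) pairs by the first day of their UTC month.

    Timestamps are expected to be timezone-aware datetimes.
    """
    bundles = defaultdict(list)
    for ts, value in points:
        month_start = ts.astimezone(timezone.utc).strftime("%Y-%m-01")
        bundles[month_start].append((ts, value))
    # Order the bundles chronologically and the rows inside each bundle
    # by timestamp, so that each bundle can be written out sequentially.
    return {
        start: sorted(rows, key=lambda row: row[0])
        for start, rows in sorted(bundles.items())
    }


# Hypothetical data points arriving out of order.
points = [
    (datetime(2023, 2, 1, 0, 0, tzinfo=timezone.utc), 2.0),
    (datetime(2023, 1, 31, 23, 50, tzinfo=timezone.utc), 1.0),
]
monthly = bundle_by_month(points)
```

Reading one monthly file instead of thousands of single-point objects is what keeps query performance acceptable as a series grows.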
We anticipate future evolution of the Tilde system and of its code, storage and frontend solutions. Tilde has therefore been designed to support versioning and upgrades whilst allowing continuity and migration of downstream processes. At the time of writing, Tilde is at version 3.
Storage
Tilde time series bundles are stored in Comma Separated Values (CSV) format, a widely used, open and flexible format. To enable machine learning applications, and following recommendations for big datasets on cloud infrastructures, we adopted a self-documenting naming convention for the time series bundles. Time series objects in the S3 Tilde archive follow Hive-style partitioning in their names, so that the time series attributes appear explicitly in the name of each archived S3 object.
Bundled Tilde time series data are made available through the AWS Open Data Program (GeoNet, 2022A) and can be accessed under the s3://geonet-open-data/time-series/tilde/ prefix. Time series are organized by version, domain, station, time series name, sensor code, time series parameters (method and aspect) and time.
The naming convention for a time series bundle is as follows (the name is split across bullet points for readability):
- time-series/tilde/[v]/domain=[domainkey]/
- station=[stationkey]/name=[namekey]/sensorcode=[sensorcodekey]/
- method=[methodkey]/aspect=[aspectkey]/
- start=[YYYY-MM-DD]/
- [domainkey].[stationkey].[namekey].[sensorcodekey].[methodkey].[aspectkey].[unit].[errorunit].[YYYY-MM-DD]T00:00:00Z.csv.gz
In this naming style, “v” is the Tilde system version; domain, station, name, sensorcode, method and aspect have already been discussed. “start” is the start of the time window covered by the bundle. We chose to slice the data monthly as a compromise among the range of sampling rates of the time series types we currently handle. The remaining part of the S3 object name repeats this information and adds the time series unit attributes.
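To make the convention concrete, the following Python sketch assembles an object name from its attributes. The function name is ours, and the example values (station XX001, the “v3” version token and the unit strings) are hypothetical; the actual tokens should be checked against the bucket contents.

```python
from datetime import date


def bundle_key(version, domain, station, name, sensorcode, method, aspect,
               unit, errorunit, start):
    """Assemble a bundle object name following the convention listed above."""
    day = start.isoformat()  # YYYY-MM-DD
    prefix = (
        f"time-series/tilde/{version}/domain={domain}/"
        f"station={station}/name={name}/sensorcode={sensorcode}/"
        f"method={method}/aspect={aspect}/start={day}/"
    )
    # The file name repeats the attributes and adds the unit attributes.
    filename = ".".join([
        domain, station, name, sensorcode, method, aspect,
        unit, errorunit, f"{day}T00:00:00Z.csv.gz",
    ])
    return prefix + filename


# Hypothetical attribute values, for illustration only.
example = bundle_key("v3", "envirosensor", "XX001", "water-temperature",
                     "01", "snapshot", "nil", "degC", "degC",
                     date(2023, 1, 1))
```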
Hive-style naming allows programs and applications to easily select and slice the time series observations of interest. Access through the geonet-open-data S3 bucket (GeoNet, 2022A) is the recommended and most effective mechanism for bulk downloads of large volumes of data.
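As an illustration of this kind of selection, the Python sketch below reads the name=value segments of an object name into a dictionary and filters a list of names by attribute. The helper names and the example object names are hypothetical, and the sketch works on plain strings rather than on a live S3 listing.

```python
def parse_partitions(key):
    """Return the name=value path segments of an object name as a dict."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            field, _, value = segment.partition("=")
            parts[field] = value
    return parts


def select(keys, **wanted):
    """Keep only the names whose partitions match every requested value."""
    return [
        key for key in keys
        if all(parse_partitions(key).get(field) == value
               for field, value in wanted.items())
    ]


# Hypothetical object names, shortened for readability.
keys = [
    "time-series/tilde/v3/domain=envirosensor/station=XX001/start=2023-01-01/a.csv.gz",
    "time-series/tilde/v3/domain=envirosensor/station=XX002/start=2023-01-01/b.csv.gz",
]
january_xx001 = select(keys, station="XX001", start="2023-01-01")
```

Because every attribute is spelled out in the name, this selection needs no separate index or catalogue.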
Application Programming Interface
The Tilde frontend interface is an Application Programming Interface (API). By default, the Tilde API returns information or data in JavaScript Object Notation (JSON) format. JSON is a standard text-based format for representing structured data; it is human readable but oriented to programmatic use.
The Tilde API supports three endpoints that are used to query timeseries data, statistics and metadata.
The data endpoint (GeoNet, 2022B) is the mechanism to query time series data. The Tilde API allows flexibility in how time series data are queried, with options to slice by time interval, observation name, method and so on. Some of the attributes (sensorCode, method and aspect) can also be wildcarded by using the “-” symbol, allowing requestors to download all available time series for a given station and series name without having to know in advance all attributes of the data they are interested in.
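The wildcard behaviour can be read as a simple matching rule, sketched below in Python. This is a local illustration of the semantics described above, not the server-side implementation; the function name, the dictionary layout and the example values are hypothetical.

```python
WILDCARD = "-"
EXACT_FIELDS = ("domain", "station", "name")
WILDCARD_FIELDS = ("sensorCode", "method", "aspect")


def matches(request, series):
    """True if a series satisfies a request, honouring "-" wildcards."""
    exact_ok = all(request[f] == series[f] for f in EXACT_FIELDS)
    wild_ok = all(request[f] in (WILDCARD, series[f]) for f in WILDCARD_FIELDS)
    return exact_ok and wild_ok


# Hypothetical catalogue of two series at the same station.
catalogue = [
    {"domain": "envirosensor", "station": "XX001", "name": "water-temperature",
     "sensorCode": "01", "method": "snapshot", "aspect": "nil"},
    {"domain": "envirosensor", "station": "XX001", "name": "water-temperature",
     "sensorCode": "02", "method": "max", "aspect": "nil"},
]
request = {"domain": "envirosensor", "station": "XX001",
           "name": "water-temperature",
           "sensorCode": "-", "method": "-", "aspect": "-"}
found = [s for s in catalogue if matches(request, s)]
```

With all three optional attributes wildcarded, the request selects every series for the station and series name.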
The stats endpoint (GeoNet, 2022C) is the mechanism to query statistics on time series observations, such as first and last observation, number of records, maximum, minimum and mean values, and some other simple statistical calculations. This endpoint is useful for a quick inspection of the time series of interest and can be used to assess the observations or run simple, “pre-canned” analyses on them.
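For readers who want to reproduce this kind of summary on data they have already downloaded, a minimal Python sketch follows. It is a local calculation, not the endpoint itself, and the function name and dictionary keys are hypothetical rather than the field names returned by the API.

```python
from datetime import datetime, timezone
from statistics import fmean


def summarise(observations):
    """Summarise a non-empty list of (timestamp, value) pairs."""
    ordered = sorted(observations, key=lambda obs: obs[0])
    values = [value for _, value in ordered]
    return {
        "first": ordered[0][0],   # timestamp of the earliest observation
        "last": ordered[-1][0],   # timestamp of the latest observation
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": fmean(values),
    }


# Hypothetical observations, for illustration only.
observations = [
    (datetime(2023, 1, 1, 0, 20, tzinfo=timezone.utc), 3.0),
    (datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc), 1.0),
    (datetime(2023, 1, 1, 0, 10, tzinfo=timezone.utc), 2.0),
]
summary = summarise(observations)
```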
The dataSummary endpoint (GeoNet, 2022D) is a flexible mechanism to query and explore which time series are available. The endpoint can be used to query which domains are available, which stations are available for a certain domain, which sensorCodes are available for a certain station and so on. It also provides an overview of the time interval covered, the number of series and the associated stations, together with the specific time interval covered by each time series name.
Each endpoint returns a standard HTTP response status code to indicate successful requests (HTTP status code 200), malformed requests (HTTP status code 400) and so on.
Although the API can serve bulk downloads and large data requests, we encourage those interested in large volumes of data to use the time series bundles stored on S3 in the GeoNet AWS Open Data bucket (GeoNet, 2022).
Graphical User Interface
The Tilde graphical interface is built as a Data Discovery GUI (GeoNet, 2022E), following key requirements discovered during the design research phase. The Data Discovery GUI allows users to easily find what datasets are available in Tilde and for what time periods, view the time series on an interactive graph, and download data as a CSV file or in JSON format.
The Data Discovery GUI consists of 1) a data table with relevant domain and time series information; 2) a map with station icons; and 3) a query builder with dropdown selections that allow the user to select the time frame of interest and a data aggregation function (Fig. 5). Once the query elements are selected, the user can view the time series on a graph, download the data in JSON or CSV format, and view how the same request would be written against the Tilde API in their own application. The design phase clearly showed that data users interested in the API needed a mechanism to quickly understand how an intuitive graphical selection translates into a programmatic request.
The testing phase of the Tilde Data Discovery GUI was crucial in uncovering design flaws that made it difficult for users to perform tasks and request the data they were looking for. The iterative approach used to develop the Tilde GUI allowed us to adjust the GUI design accordingly.
We limited the amount of data that can be viewed in the GUI to optimize its performance. When a GUI plotting query is too large, the user receives a message recommending the API or the AWS Open Data access mechanism instead.
The Tilde GUI has been designed in a modular way to allow for the creation of peril-specific web pages on the GeoNet website. This allows our developers to reuse the elements created for the Data Discovery GUI to build dedicated data discovery pages for specific domains. Web pages dedicated to volcano monitoring will be developed in the future.
Tutorials
Increased usage of the low and medium sample rate data is at the heart of Tilde development. As such, alongside detailed API documentation, we have developed a series of tutorials that showcase how the API can be used from Python and Bash. These are publicly available in the GeoNet GitHub data-tutorials repository (GeoNet, 2020) under a dedicated section for Tilde. We encourage anyone using the tutorials to contribute to the repository and provide input if they wish. In this way, GeoNet is encouraging data users to also support each other.
GeoNet has recently started producing Data Blogs (https://www.geonet.org.nz/news; filter for Data Blogs) to illustrate the diversity of GeoNet data, showcase interesting use cases and show by example how to interact with the data. To date, several data blogs have focused on data that can be accessed via Tilde, many of them with a volcano data focus. These include analysing some of the spring-based water height and temperature monitoring (GeoNet, 2023A), accessing manually collected data (GeoNet, 2024) and using a lake level monitoring site on Lake Taupo, installed to capture tsunami generated within the lake in Taupo caldera (GeoNet, 2023B). More data blogs will be produced, building on Tilde and its data delivery opportunities, among other topics.
End users were at the heart of the Tilde design and implementation. Features such as the generation of an API request for a given GUI data search were added based on user feedback and significantly lower the barrier to programmatic data access for some users. Combining data access via a GUI, an API and GeoNet AWS Open Data supports a wide range of use cases, from very specific data analyses to large machine-learning initiatives.