4.1 Unsupervised hierarchical clustering and top trends
Because the WSS and average silhouette methods identified the proper number of clusters for our dataset as four and five respectively, we generated clusters of tweets for both findings. However, the leading trends identified did not vary between four and five clusters as illustrated in the word clouds below. Word clouds are basic and intuitive tools that allow us to evaluate text results for insight [30].
The word cloud on the left shows the largest cluster when only four clusters were generated. The word cloud on the right shows the largest cluster when five were generated. We repeated this trend analysis throughout the cluster creation, and the leading identified trends did not change. Regardless of the number of clusters created, the top mentioned term continued to be “data science”. It was closely followed by “machine learning”, after which the mention counts of subsequent terms dropped off more steeply than between the first and second most mentioned terms. The mention analysis of trending topics is illustrated in the following figure.
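The cluster-count selection described at the start of this section can be illustrated with a small sketch. The analysis in this paper was carried out in R and its code is not reproduced here, so the Python below is only an illustration: the toy 2-D points, the Euclidean distance, and the average-linkage rule are all assumptions, not the settings used on the tweet corpus. It cuts one hierarchical clustering at several values of k and reports the within-cluster sum of squares (WSS) and the average silhouette width for each cut.

```python
from itertools import combinations
from math import dist

def agglomerate(points, k):
    """Naive average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def wss(points, clusters):
    """Total squared distance of every point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        coords = [points[i] for i in members]
        centroid = tuple(sum(c) / len(coords) for c in zip(*coords))
        total += sum(dist(p, centroid) ** 2 for p in coords)
    return total

def average_silhouette(points, clusters):
    """Mean silhouette width; needs at least two clusters and distinct points."""
    scores = []
    for ci, members in enumerate(clusters):
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            a = sum(dist(points[i], points[j])
                    for j in members if j != i) / (len(members) - 1)
            b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

results = {}
for k in range(2, 6):
    clusters = agglomerate(points, k)
    results[k] = (wss(points, clusters), average_silhouette(points, clusters))
    print(k, round(results[k][0], 2), round(results[k][1], 3))

# The silhouette criterion picks the k with the largest average width.
best_k = max(results, key=lambda k: results[k][1])
```

WSS always shrinks as k grows, so it is read by looking for an elbow, whereas the silhouette width has a maximum. The two criteria can therefore disagree, as they did for the tweet data (four versus five clusters).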
4.2 A Small Number of Cybersecurity Mentions within the IoT Tweets
Across the trend analysis, what was most concerning was the lack of cybersecurity topics in the list of top mentioned terms. As illustrated in the following pie chart, only 12% of the 684,503 tweets mentioned any of the following stemmed cybersecurity-related terms: cyber, secure, hack, vulnerability, risk, exploit, breach, malware, virus, ransomware, spyware, worm, trojan, encrypt or phishing.
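A minimal sketch of how such a share could be computed is shown below. It is written in Python for illustration and is not the code used in this study. The stem prefixes (for example `secur` for "secure" and `phish` for "phishing") are an assumption about how the listed terms were stemmed, and the sample tweets are invented.

```python
import re

# Assumed stem prefixes derived from the term list above (hypothetical).
CYBER_STEMS = [
    "cyber", "secur", "hack", "vulnerab", "risk", "exploit", "breach",
    "malware", "virus", "ransomware", "spyware", "worm", "trojan",
    "encrypt", "phish",
]

# Match any word that starts with one of the stems, ignoring case.
PATTERN = re.compile(r"\b(?:" + "|".join(CYBER_STEMS) + r")\w*", re.IGNORECASE)

def mentions_cybersecurity(text):
    """Return True if the tweet text contains a word built on any stem."""
    return PATTERN.search(text) is not None

# Invented example tweets, not taken from the collected corpus.
sample_tweets = [
    "New #IoT sensors for smart farming",
    "Securing #IoT devices against ransomware",
    "#IoT and #MachineLearning in healthcare",
    "Is your smart lock hackable? #iot #cybersecurity",
]

share = sum(mentions_cybersecurity(t) for t in sample_tweets) / len(sample_tweets)
print(f"{share:.0%} of sample tweets mention a cybersecurity term")
```

Prefix matching is deliberately loose: a stem such as `worm` would also match unrelated words like "wormhole", so a real analysis may need a stop list or a proper stemmer.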
When tweets did mention cybersecurity terms, the topics of the three most retweeted conversations included an industry roundtable discussion [31], a reference to an opinion article about the risk of AI on military technology [32], and a reference to an article on the risk of AI on national security [33]. Among the most retweeted tweets discussing cybersecurity, the top three are each a technology being touted to secure IoT implementations.
4.3 Content-based Analysis of Industries within the IoT Tweets
The dearth of cybersecurity-related discussion within the collection of IoT-related tweets is all the more concerning because the top mentioned industry was healthcare. Previous research identified healthcare as one of the less influential industries mentioned in research papers on IoT [34]. Our research and this paper are one effort toward shifting that claim. The top ten mentioned industries are depicted in Fig. 7. It is not surprising to see healthcare leading the mentions, as many countries are still experiencing the Covid-19 pandemic. Of the tweets collected based upon the inclusion of #iot, 4% referenced Covid-19. Recent research has discussed the relationship between digital twins, IoT, and contact tracing technology [35], which could be utilized to help understand the behavior of a pandemic. After healthcare, the second most mentioned industry within the IoT tweets is commerce, followed by financial.
4.4 Network analysis and relationship identification
A network analysis was also performed on the relationships between trends and industries. Fundamental parameters of a network are its number of nodes, otherwise known as the network size, and the number of edges [36]. We are surrounded by naturally connected structures and networks [37]. Industries and technology trends are no different, as we confirm with this analysis regarding the health industry connection to all the top identified IoT trends.
To construct the network graph in Fig. 8, the tweets’ metadata labels were cast as nodes into two tables. The first table listed every industry and trend term (the nodes) along with a unique identifier. The second table was a long list of edges: each row held an industry node, a corresponding trend node, and a weight column giving the number of tweets identified as matching both labels. Utilizing the network and igraph libraries in R, we plotted the node and edge relationships as the data visualization in Fig. 8. In this network graph, the most mentioned industry, healthcare, is highlighted as a green node. Red lines, which indicate relationships, are drawn from it to each of the yellow trending terms whenever both labels co-exist in the metadata of a single tweet, which we created during our preprocessing. As the image indicates, all trend terms are found in the network of healthcare tweets. As Fig. 4 indicated, serverless was the least mentioned trending term, yet it too co-occurs within tweets that reference healthcare.
4.5 Sentiment analysis of commercial technology providers within the IoT tweets
There are many technology providers which have solutions, offer services, or offer platforms to solve IoT opportunities. We performed a content-based analysis of technology vendors within the IoT space. To determine the list of IoT vendors to analyze, we utilized two 2020 research reports by Gartner [38–39]. We utilized the sentimentr library to determine the sentiment scores of industry technology providers.
We plotted the technology provider names on a chart divided into four quadrants. The x-axis is the z-score of the sentiment of the tweets in which a vendor is mentioned. The z-score is found by first determining the sentiment of all tweets that mention the commercial technologies, then calculating their average and standard deviation. The z-score for a given technology vendor is then calculated by subtracting the population’s average sentiment from the sentiment of the tweets mentioning that vendor and dividing the difference by the standard deviation, so that it expresses how many standard deviations the vendor lies from the population’s average sentiment. The y-axis measures the number of times an IoT technology provider is mentioned in our corpus of tweets.
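As an illustration of the two chart coordinates, the following Python sketch computes a mention count and a sentiment z-score per vendor. It is not the sentimentr-based R code used in this study: the vendor names and sentiment scores are invented, and the use of the population (rather than sample) standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Invented (vendor, tweet sentiment) pairs; real scores would come from a
# sentiment library applied to each tweet that mentions a vendor.
tweet_scores = [
    ("VendorA", 0.4), ("VendorA", 0.2),
    ("VendorB", 0.0), ("VendorB", -0.2),
    ("VendorC", 0.1),
]

# Population statistics over all vendor-mentioning tweets.
all_scores = [score for _, score in tweet_scores]
mu = mean(all_scores)
sigma = pstdev(all_scores)  # assumption: population standard deviation

# Group the scores by vendor.
by_vendor = defaultdict(list)
for vendor, score in tweet_scores:
    by_vendor[vendor].append(score)

# x-coordinate: z-score of the vendor's mean sentiment.
# y-coordinate: number of tweets mentioning the vendor.
placement = {
    vendor: {"z": (mean(scores) - mu) / sigma, "mentions": len(scores)}
    for vendor, scores in by_vendor.items()
}

for vendor, coords in placement.items():
    print(vendor, round(coords["z"], 2), coords["mentions"])
```

A positive z places a vendor to the right of the average-sentiment line and a negative z to the left, so the sign of the z-score together with the mention count determines the quadrant.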
In general, if a vendor is placed in the upper right area of the chart, they are widely mentioned and the sentiment of the tweets that mention them is above average. If a vendor is found in the bottom left area of the chart, they are both lower in popularity and lower in sentiment positivity within this collection of tweets. Any vendors having fewer than ten mentions within the tweets were removed from the plot. The dashed blue lines represent the average mentions and average sentiment scores. The average sentiment of all tweets mentioning these IoT solution vendors is slightly positive. Use caution when reviewing the chart, as the y-axis is intentionally logarithmic. The logarithmic axis allows the data to pull slightly apart, as though zooming in, for the vendors who have fewer mentions. The vendor placement can be viewed in Fig. 9.
Amazon’s AWS has the most mentions and the most positive sentiment among the vendors mentioned within the IoT tweets. AWS IoT Core can connect IoT devices to AWS cloud services, and AWS offers an IoT SDK for development in languages such as Java, JavaScript, or Python. The AWS IoT Core product supports message brokering for these protocols [40]:
- Message Queuing and Telemetry Transport (MQTT)
- MQTT over WebSocket Secure (WSS)
- Hypertext Transfer Protocol Secure (HTTPS)
- Long Range Wide Area Network (LoRaWAN)
Davra is within the bottom left area of the plot. They have fewer mentions in the analysis and the tweets that do mention them tend to have a lower sentiment than average across all of the analyzed technology vendors. Davra offers an IoT Platform that has features such as access control to both devices and services, service management features including edge, cloud, Kubernetes, or container deployments, as well as supporting many different IoT device protocols and data storage capabilities [41].
4.6 Predictive modeling based upon our IoT tweet metadata factors
Naïve Bayes has been utilized to accurately forecast crime activities including arson, burglary, and theft [42]. Biology researchers have successfully applied naïve Bayes modeling to determine the presence of links in protein interaction networks, although anomaly detection was utilized to increase the accuracy [43]. In our research, we utilize naïve Bayes models to understand relationships between the IoT trends, the sentiment of the content, industries, and IoT technology providers.
Using a naïve Bayes model with trend type as the dependent factor and sentiment as the independent variable, we found that, given a tweet labeled with the trending topic data science, there is a 66.7% probability that the sentiment of the tweet is positive. Tweets labeled with the IoT trend of natural language processing (NLP) scored second-highest in positive sentiment probability at 57.1%. The table below lists the conditional probabilities found by the model.
Table 1
Trending IoT tweet topics having the highest probability of positive sentiment are highlighted in this conditional probability table
Trends (below) | anger | anticipation | disgust | fear | joy | negative | positive | sadness | surprise | trust |
AI | 0.000 | 0.214 | 0.107 | 0.071 | 0.000 | 0.107 | 0.357 | 0.036 | 0.000 | 0.107 |
BigData | 0.149 | 0.064 | 0.000 | 0.064 | 0.128 | 0.106 | 0.234 | 0.000 | 0.064 | 0.191 |
DataScience | 0.000 | 0.222 | 0.000 | 0.000 | 0.000 | 0.000 | 0.667 | 0.000 | 0.111 | 0.000 |
DeepLearning | 0.100 | 0.100 | 0.000 | 0.000 | 0.100 | 0.200 | 0.300 | 0.000 | 0.100 | 0.100 |
JavaScript | 0.000 | 0.222 | 0.000 | 0.000 | 0.111 | 0.000 | 0.444 | 0.000 | 0.000 | 0.222 |
MachineLearning | 0.045 | 0.136 | 0.000 | 0.091 | 0.000 | 0.136 | 0.455 | 0.000 | 0.000 | 0.136 |
NLP | 0.000 | 0.143 | 0.143 | 0.000 | 0.000 | 0.143 | 0.571 | 0.000 | 0.000 | 0.000 |
Python | 0.000 | 0.000 | 0.000 | 0.500 | 0.000 | 0.500 | 0.000 | 0.000 | 0.000 | 0.000 |
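Each row of Table 1 is a conditional distribution P(sentiment | trend), which a naïve Bayes model estimates from label counts. The Python sketch below shows that estimation step only; it is not the R model used in this study. The counts are toy values chosen to be consistent with the DataScience and Python rows, because the underlying counts are not reported here, and no smoothing is applied (an assumption suggested by the zero entries in the table).

```python
from collections import Counter, defaultdict

# Toy (trend, sentiment) observations; hypothetical counts that happen to
# reproduce two rows of Table 1, not the study's actual data.
observations = (
    [("DataScience", "positive")] * 6
    + [("DataScience", "anticipation")] * 2
    + [("DataScience", "surprise")] * 1
    + [("Python", "fear")] * 1
    + [("Python", "negative")] * 1
)

def conditional_table(pairs):
    """Estimate P(sentiment | trend) as relative frequencies within each trend."""
    counts = defaultdict(Counter)
    for trend, sentiment in pairs:
        counts[trend][sentiment] += 1
    return {
        trend: {s: c / sum(sentiment_counts.values())
                for s, c in sentiment_counts.items()}
        for trend, sentiment_counts in counts.items()
    }

table = conditional_table(observations)
for trend, row in table.items():
    print(trend, {s: round(p, 3) for s, p in row.items()})
```

Rows built from very few observations, such as the two-tweet Python row in this toy example, yield coarse probabilities like 0.5, which is worth keeping in mind when comparing rows.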
A second naïve Bayes model was created to help with understanding which factors affect the prediction of tweets being retweeted. The industry and trend factors had little effect on a tweet being retweeted. However, using words that conveyed the sentiment of either fear or joy would improve the probability of retweet to 13.0% and 12.4% respectively. A third naïve Bayes model was used to predict which trending term an IoT tweet may be about. Using the factors of favorite, industry type, retweet, and IoT vendor name, we could predict the trend a tweet was referencing with an accuracy of 63.9%.