This research follows the SMI steps framework described by Choi et al. (2020) for social media-based BI research. The process consists of four phases: “Data collection”, “Data preprocessing”, “Data analysis”, and “Validation & Interpretation”.
In accordance with this framework, the initial step was the extraction of Twitter data. As previously mentioned, a particular set of startup accounts was targeted: information technology (IT) startups founded by Portuguese nationals or headquartered in Portugal, selling products or services based on machine learning (ML) approaches and following a B2B business model. Thus, our analysis centers on eight startups from the Sifted 2020 Portugal startups list: AttentiveMobile, Codacy, DefinedCrowd, Feedzai, Prodsmart, Talkdesk, Unbabel, and Virtuleap.
After the extraction, the data were cleaned and the corpus was prepared (Data preprocessing), after which a topic modeling (TM) technique was applied (Data analysis). Finally, the TM results were evaluated and interpreted (Validation & Interpretation). In this last step, the topic modeling results are compared with the startups’ funding rounds, creating a diachronic profile for each startup. To that end, the funding rounds of each startup were also collected from Crunchbase [5] and related to the startup’s life cycle phase. Our approach is illustrated in Fig. 1.
The features that define a startup depend on the life cycle phase the company is in. In the beginning, startups are innovative companies with limited resources; during growth, they show an above-average rate of increase in customers and revenue; and finally, hyper-scalability and a high company valuation characterize a mature startup. Startups therefore change over their life cycle, and the definition of a startup depends on the particular phase of the company’s evolution (Skala 2019). The startup life cycle is thus a complex concept and, as stated by Paschen (2017), it involves two different but connected perspectives that are fundamental for the company’s success: its maturity, regarding the stage of development of a product or service, and the funding rounds, that is, its capability to attract investment.
Based on the related literature, we consider that the startup life cycle can be viewed from two main perspectives. The first closely follows the concepts found in Wang et al. (2016) and regards the creation of a mature product to solve a real problem: the maturity evolution. The second concerns the startup’s financing: the funding rounds. In a funding round, a startup opens its shareholder structure to third parties, usually business angels or venture capital firms, to secure the investment that allows it to grow (Paschen 2017). To illustrate a startup’s financing milestones and evolution, we propose a life cycle model based on these two dimensions. We believe that the Funding and Product Evolution Model (FPEM), depicted in Fig. 2, illustrates the maturation of a startup with respect to time and revenue in a typical success scenario.
For the model, the names in the funding rounds dimension are based on the Crunchbase Glossary [6], and, in the maturity evolution dimension, the phases describe the startup’s product stages based on the work of Wang et al. (2016) and Paschen (2017). The proposed model, FPEM, encompasses four key phases, named after the funding round categories: the preseed phase, the seed phase, the early phase, and the late phase. To create the model, we first mapped the phases to the existing funding types, since these are measurable, which is essential for marking when a transition occurs. Then, we connected the product maturity evolution with each of the rounds. A phase transition therefore occurs with a funding round of a higher rank than the previous one, implying a scale-up for the company and an evolution in product maturity. Typically, startups receive new funding when their product has evolved and created value for the company. However, every type of funding round can happen more than once throughout a company’s life. Notice that, for each phase, the association between maturity dimensions and funding rounds is relatively straightforward.
In the preseed phase, there is only the conceptualization of a potential, innovative solution for a concrete problem. Funding is therefore usually very limited (typically below $150K) because it finances only an idea. These funding rounds are known as angel or preseed rounds and are generally used to jump-start the company, providing the cash needed to build a prototype. According to Wang et al. (2016), in this phase the startup is in its learning stage. Next, in the seed phase, a prototype or, at least, a proof of concept already exists, supporting seed funding that can reach $2M. This round is used to build a market-ready product that incorporates the novelty proposed by the startup in the previous phase. In the early phase, the company already has a functional product and is prepared to scale in the market. In this phase, the startup evolves into the so-called growing stage (Wang et al. 2016). The early funding rounds, also called Series A and Series B, can range between $1M and $30M. Lastly, in the late phase, a mature product is already established, and the corresponding funding, also called the Series C round, usually starts at around $10M and has no upper limit.
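The transition rule described above (a phase change happens only when a funding round of a higher rank than any previous one occurs) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the round-type label strings, the function name `phase_transitions`, and the funding history in `example_rounds` are hypothetical and were chosen only to mirror the round names used in the text.

```python
# Rank of each FPEM phase, in the order given in the text.
PHASE_RANK = {"preseed": 0, "seed": 1, "early": 2, "late": 3}

# Funding round type -> FPEM phase, following the round names in the text.
# The exact label strings are illustrative assumptions.
ROUND_TO_PHASE = {
    "angel": "preseed",
    "preseed": "preseed",
    "seed": "seed",
    "series a": "early",
    "series b": "early",
    "series c": "late",
}


def phase_transitions(rounds):
    """Return (date, phase) pairs marking each move to a higher-ranked phase.

    `rounds` is a list of (date, round_type) tuples sorted by date.
    Repeated or lower-ranked rounds do not trigger a transition.
    """
    transitions = []
    current_rank = -1  # no phase reached yet
    for date, round_type in rounds:
        phase = ROUND_TO_PHASE[round_type.lower()]
        rank = PHASE_RANK[phase]
        if rank > current_rank:
            transitions.append((date, phase))
            current_rank = rank
    return transitions


# Hypothetical funding history (not taken from the dataset).
example_rounds = [
    ("2015-03", "Seed"),
    ("2016-06", "Seed"),      # same rank as before: no transition
    ("2017-09", "Series A"),
    ("2018-11", "Series B"),  # still the early phase: no transition
    ("2020-02", "Series C"),
]
print(phase_transitions(example_rounds))
```

Because repeated rounds of the same type are ignored, the sketch is consistent with the observation that every type of funding round can happen more than once without implying a new phase.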
The relations described above between product maturity and funding rounds, which make up the proposed life cycle model, are validated by the topic model we obtained, whose results are discussed in Section 4. These relations make it possible to relate each of the four FPEM phases to the topics uncovered from the tweets posted by the startups throughout their existence.
3.1. Dataset
The dataset consists of 15 577 tweets extracted from the chosen Portuguese startups’ Twitter accounts. The extraction date was January 10, 2021, and the data cover every tweet posted by each startup since its Twitter profile was created. The tweets were extracted with the Twitter API method “GET statuses/user_timeline”, called through the tweepy library (Roesslein 2020) with each company account’s username. The analysis focuses on the last six years, where most of the posts are concentrated: from January 2015 to December 2020, that is, 72 months. To support the examination of the startups’ activity over time, Table 1 describes the startups’ Twitter accounts.
Table 1
Company Name | Founded Date | First Tweet Date | Followers | Number of Tweets | Tweets per month |
AttentiveMobile | 2016 | 08/02/2018 | 1 115 | 695 | 20.44 |
Codacy | 2012 | 02/10/2013 | 2 796 | 1 640 | 22.78 |
DefinedCrowd | 2015 | 04/02/2016 | 1 674 | 1 258 | 21.69 |
Feedzai | 2011 | 23/10/2015 | 2 630 | 3 177 | 51.24 |
Prodsmart | 2012 | 04/12/2012 | 897 | 211 | 2.93 |
Talkdesk | 2011 | 26/06/2019 | 6 586 | 3 211 | 178.39 |
Unbabel | 2013 | 17/11/2013 | 3 510 | 2 615 | 36.32 |
Virtuleap | 2018 | 29/08/2016 | 791 | 2 765 | 53.17 |
Table 1 presents, for each company, the date of the first available tweet, the number of followers, the number of tweets since January 2015, and the average number of tweets per month. This last value is computed over the 72 months of analysis, or over the number of months since the first available tweet if that date is later than January 2015. Additionally, the table shows the startup’s founding year, collected from Crunchbase.
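As a worked example of the tweets-per-month column, the sketch below recomputes the value for three rows of Table 1. The month-counting convention (whole-month difference between the month of the first tweet and December 2020, ignoring the day) is an assumption: it is not stated in the text, but it reproduces the tabulated values. The function names are illustrative.

```python
from datetime import date

WINDOW_START = date(2015, 1, 1)
WINDOW_END = date(2020, 12, 31)


def months_between(start, end):
    """Whole-month difference between two dates, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def tweets_per_month(n_tweets, first_tweet):
    """Average monthly tweets over the window, or since the first tweet if later."""
    if first_tweet <= WINDOW_START:
        months = 72  # full analysis window, January 2015 to December 2020
    else:
        # Assumed convention, inferred from the values in Table 1.
        months = months_between(first_tweet, WINDOW_END)
    return round(n_tweets / months, 2)


# Values taken from Table 1 (dates there are written day/month/year).
print(tweets_per_month(1640, date(2013, 10, 2)))  # Codacy
print(tweets_per_month(695, date(2018, 2, 8)))    # AttentiveMobile
print(tweets_per_month(3211, date(2019, 6, 26)))  # Talkdesk
```

Under this convention the three calls give 22.78, 20.44, and 178.39, matching the corresponding rows of Table 1.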
Figure 3 shows each startup’s number of tweets distributed over the chosen time window. Some startups post regularly, while others show peaks of higher activity. In this context, regularly means with the same temporal cadence, which is the case for five of the eight companies under analysis, namely AttentiveMobile, DefinedCrowd, Feedzai, Talkdesk, and Unbabel. In particular, the Talkdesk account has the highest number of tweets per month.
However, not every startup has tweets dating back to the beginning of 2015. For AttentiveMobile, DefinedCrowd, Feedzai, and Talkdesk, the date of the first available tweet is more recent (Table 1 and Fig. 3). This may be because the company was founded later or because older tweets were deliberately deleted. In particular, Feedzai and Talkdesk are the “oldest” startups, dating from 2011, but their overall number of posts is not that high, which might suggest that they have deleted some of their oldest tweets.
Codacy, Prodsmart, and Virtuleap do not post regularly, and, of these three, Virtuleap is the only one whose activity does not cover the full 72-month analysis window. Codacy and Virtuleap show a peak in 2016 and 2017, respectively. From then on, both post regularly but with considerably fewer tweets per month. Notably, Prodsmart shows a considerably lower degree of Twitter activity and is the only company that does not post in every month.
3.2. Text preprocessing
To identify the topics of the tweets posted by the startups, we aggregated the dataset by month, resulting in a corpus (that is, a set of documents in which each document has an id and the corresponding text) of 72 documents, one for each month in the time scope of the analysis. Within each document, the id is the month and year of the tweets. This corpus was then cleaned, retaining the vocabulary that accurately represents the startups’ content, and transformed into a document-term matrix for model training.
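The monthly aggregation can be illustrated with a short Python sketch. It assumes each tweet is available as a (timestamp, text) pair; the function name `build_monthly_corpus`, the "YYYY-MM" id format, and the sample tweets are illustrative choices, not details taken from the study.

```python
from collections import defaultdict
from datetime import datetime


def build_monthly_corpus(tweets):
    """Group tweet texts by calendar month.

    `tweets` is an iterable of (created_at, text) pairs; the result maps a
    "YYYY-MM" document id to the concatenated text of that month.
    """
    by_month = defaultdict(list)
    for created_at, text in tweets:
        by_month[created_at.strftime("%Y-%m")].append(text)
    # Sort by id so the documents follow chronological order.
    return {doc_id: " ".join(texts) for doc_id, texts in sorted(by_month.items())}


# Made-up tweets, for illustration only.
sample = [
    (datetime(2015, 1, 5), "Join our webinar on fraud detection"),
    (datetime(2015, 1, 20), "New blog post on machine learning"),
    (datetime(2015, 2, 3), "See you at the hackathon in Lisbon"),
]
corpus = build_monthly_corpus(sample)
print(list(corpus))
```

Applied to tweets spanning January 2015 to December 2020, with at least one tweet in every month, this grouping would yield the 72 documents described above.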
To ensure adequate preprocessing of the tweets, we first studied the techniques applied in similar studies, concluding that the literature supports the need for a preprocessing phase as preparation for achieving coherent topics. Table 2 presents the techniques that have been applied in the existing literature.
Table 2
Literature preprocessing techniques usage
Preprocessing technique | H. J. Choi and Park (2019) | Alash and Al-sultany (2020) | Doogan et al. (2020) | Hidayatullah et al. (2018) | Yang and Zhang (2018) |
Lowercase transformation | X | | X | | X |
HTML tags elimination | X | X | | X | X |
URL elimination | X | X | X | X | X |
Hashtag treatment | X | X | | | |
Remove punctuation and digits | | | X | X | X |
Remove Stop Words | | X | X | X | X |
Lemmatization | | | X | | |
Stemming | | | | X | X |
N-Grams | | X | X | | |
TF-IDF | | | X | | |
Remove extra white spaces | X | X | X | X | X |
Remove terms with higher frequency | X | X | X | X | X |
Remove terms with less frequency | X | X | X | X | X |
The most used techniques are URL elimination, extra white space elimination, and exclusion of the terms with the highest or lowest frequency. HTML tag elimination and stop word removal are also commonly applied.
Since white spaces, URLs, and punctuation carry no information relevant to topic identification, they were removed from the documents. Next, lowercase transformation and lemmatization were performed. Excluding a set of stopwords, in this case those from the Natural Language Toolkit (Bird et al. 2009), helps to focus the model on the relevant words that might define the text’s meaning. To this set we added the startups’ names and Twitter tags such as “RT”, which marks a retweet. The goal of lemmatization is to convert every word to a common base form, providing coherence to the set of words and, consequently, to the topics. This was achieved with the TextBlob library (Loria 2020). CountVectorizer, from the Python library scikit-learn (Pedregosa et al. 2011), vectorizes the text and allows some preprocessing customization, such as the use of n-grams and the exclusion of terms. The n-grams used ranged from 1 to 2, that is, unigrams and bigrams, to capture terms that may appear together, for example the bigram “Machine Learning”. Then, the terms that appear fewer than twice were excluded to prevent possible errors and misspellings. Lastly, the terms that appear in at least 80% of the tweets were excluded, since such highly frequent terms contribute little to topic characterization.
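To make the effect of these steps concrete, the following dependency-free Python sketch mimics them on a toy corpus. It is a stand-in written with the standard library only, not the TextBlob and CountVectorizer pipeline used in the study. Several simplifications are assumptions: lemmatization is omitted, the stopword set is a tiny illustrative one, "appears fewer than twice" is read as a document-frequency threshold, and the 80% cut-off is applied to documents.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the study uses the NLTK list plus the
# startups' names and Twitter tags such as "rt".
STOPWORDS = {"the", "a", "on", "our", "to", "rt", "is"}


def tokenize(text):
    """Drop URLs, lowercase, strip punctuation and digits, remove stopwords."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens, max_n=2):
    """Return all unigrams up to `max_n`-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, min_df=2, max_df=0.8):
    """Count terms per document, keeping terms found in at least `min_df`
    documents and in less than a `max_df` fraction of all documents."""
    counts = [Counter(ngrams(tokenize(doc))) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    vocab = sorted(t for t, df in doc_freq.items()
                   if df >= min_df and df / len(docs) < max_df)
    return vocab, [[c[t] for t in vocab] for c in counts]


# Made-up documents, for illustration only.
docs = [
    "RT Machine learning is the future https://t.co/abc",
    "Our machine learning webinar",
    "Webinar on fraud",
    "Fraud report",
    "Cloud report 2020",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
print(matrix[0])
```

On this toy corpus only the terms shared by two documents survive, including the bigram "machine learning", while one-off terms such as "future" and "cloud" are dropped.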
3.3. Topic modeling
Owing to its success in the literature on Twitter topic analysis, the topic modeling method employed here was Latent Dirichlet Allocation (LDA) (Blei et al. 2002). The first step is to transform the corpus into a document-term matrix, where each term is either a word or a bigram. For that, we use the frequency of occurrence of each word or bigram in the document’s text and apply the LDA algorithm to the resulting matrix, using the Python library gensim (Rehurek et al. 2011).
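For reference, the generative process that LDA assumes can be summarized as follows. The notation is the one commonly used in the LDA literature and is introduced here only for illustration; it is not defined elsewhere in this paper. Here $K$ is the number of topics, $D$ the number of documents (the monthly documents of the corpus), $\phi_k$ the term distribution of topic $k$, $\theta_d$ the topic distribution of document $d$, $z_{d,n}$ the topic assigned to the $n$-th term of document $d$, $w_{d,n}$ the observed term, and $\alpha$ and $\beta$ the Dirichlet hyperparameters.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Fitting the model amounts to inferring $\theta_d$ and $\phi_k$ from the observed term counts in the document-term matrix; the highest-probability terms of each $\phi_k$ are the ones used to characterize a topic.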
Since the number of topics must be given as input to the algorithm, we performed a coherence test to determine the advisable number of topics for the modeling. Figure 4 suggests that five might be the most reliable number of topics, given its higher coherence value. Note that the coherence measure used here was c_v, which is one of the options available in gensim.
Thus, the topic model created has five topics, each one characterized by the relevant terms presented in Table 3, with all the terms showing a similar distribution within each topic.
Table 3
Topic | Terms |
Fintech and ML | future, talk, fintech, banking, reality, money2020, lisbon, project, hackathon, machinelearning |
Business Operations | business, cloud, opentalk2020, learn, covid19, service, solution, webinar, customer service, brand |
Bank and Funding | bank, webinar, cloud, leader, learn, read, account, report, meet, partner |
Product/Service R&D | cloud, learn, product, read, industry, innovation, boost, service, webinar, lisbon |
IT | review, codereview, analysis, learning, websummit, machinelearning, machine learning, security, staticanalysis, lisbon |
The name chosen for the first topic is “Fintech and ML” because it encapsulates “fintech,” “machine learning,” and “banking,” as well as one event in this domain: “money2020”. The second topic is “Business Operations” since it presents terms corresponding to concerns typical of a company’s operations, such as “customer service,” “brand,” “solution,” and “covid19”. It also includes “opentalk2020”, a Talkdesk event on customer service subjects. “Bank and Funding” is the third topic, supported by the terms “bank,” “leader,” “report,” and “partner,” while the fourth is “Product/Service R&D”, supported by terms like “innovation,” “learning,” and “boost.” Lastly, “IT” (Information Technology) is the fifth topic, associated with software-related terms, such as code and security, and with the most significant technology event, the Websummit.
[5] www.crunchbase.com
[6] https://support.crunchbase.com/hc/en-us/articles/115010458467-Glossary-of-Funding-Types