Thoughts on women entrepreneurship: an application of market basket analysis with Google Trends data

This paper focuses on the popularity and awareness of keywords in Google Trends data related to the entrepreneurship of women in a global and cross-regional setting by using market basket analysis. Google Trends is a digital data platform that provides a time series index of the volume of queries users enter into Google in a given geographic area. It is a widely used tool for gathering information on search behaviour, and it has been applied to a variety of topics. Market basket analysis identifies items that appear or are used together and the frequency of these co-occurrences. Such a technique is appropriate for finding hidden associations between items, which is also crucial in assessing individuals' thoughts on a specific topic. This study contributes to the literature as the first study to use market basket analysis on Google Trends data in the context of women's entrepreneurship. The results of the analysis are interpreted through the lens of gender-responsive strategies, equality, efficiency and social justice in different country and region contexts.

ship 1985-2013 indicated that there is a survey method dominance as data collection method. Short et al. (2010), in their interview-based research on editorial board members of Organizational Research Methods, indicated difficulties related to measurement techniques, sampling and analysis techniques. According to the editorial board members, unsophisticated measurement tools and data gathering by survey methods only permit traditional statistical techniques to be used. Kyrö and Kansikas (2005) reported the dominance of relatively classical statistical techniques such as means, standard deviations, t-tests and linear correlations. Two years later, Dean et al. (2007) reported that the entrepreneurship field had a number of areas where new methodological practices were needed. Short et al. (2010) also revealed that even though there are some new approaches in quantitative data collection and analysis, there is still a need to conduct more empirical research using new analytical procedures in the entrepreneurship field. While experimental designs Kraus et al. (2016), structural equation modeling, and multilevel modeling are the three most frequently mentioned research methods reflecting current entrepreneurship researchers' interests, these researchers would also like to gain new insights into data mining Dean et al. (2007).
Data mining, as a newly developing research technique for analyzing data, is defined 'as the process of discovering "hidden message," patterns and knowledge within large amounts of data' Luan (2002), which are often called big data. Although the use of big data and related analysis techniques is quite new in the entrepreneurship field, there are some studies in the management and economics fields that highlight the opportunities and challenges of big data. Gutmann et al. (2018) examined what big data means to economic history and forthcoming sources of big data. Sheng and Wang (2017) focused on the growing awareness of big data's business value and the managerial changes led by the data-driven approach. Mesquita et al. (2018) aimed to explain the role of big data in marketing research. Amankwah-Amoah and Adomako (2019) shed light on the decisions and processes that facilitate or hamper firms' ability to harness big data to mitigate the causes of business failure. Association rules involve the use of machine learning methods to analyze data for patterns, co-occurrences or records. In addition, this method discovers the rules that determine how or why certain items or records are connected. Since association rules, as one of the data mining methods, are useful and easy to understand, there have been many successful business applications, including finance, telecommunication, marketing, retailing, and web analysis Xu and Zhang (2009); Insani and Soemitro (2016); Kronberger and Affenzeller (2011); Woo and Xu (2011).
Building on this literature review and the methodological gaps in entrepreneurship research, this study aims to bring a new methodological approach into entrepreneurship research, within the women entrepreneurship context.

The reason to choose women entrepreneurship as a context
Entrepreneurship has often been touted as a viable alternative for women to participate in economic activities.
Having your own business, being your own boss, having autonomy in decision making and flexible working schedules are some features associated with entrepreneurship as a career option for women. Women entrepreneurship is a growing trend in both developing and developed countries. Although many countries make every effort to increase their level of women entrepreneurship, certain regions and countries are more successful in achieving this goal than others. A central debate in explanations of these differences between countries concerns two main sets of factors: (I) environmental and (II) individualistic Boz Semerci and Cimen (2017). Environmental factors are composed of economic, governmental (political), educational and cultural indicators.
Many studies empirically validate the impact of environmental or institutional factors on entrepreneurship Lim et al. (2016); Nguyen (2020). In these studies, scholars distinguish formal and informal institutions in the entrepreneurship environment North (1990). Accordingly, government policies and procedures and financial and non-financial (educational or advisory) assistance to businesses are related to formal institutions, whereas socio-cultural conditions concern informal institutions.
More specifically, entrepreneurial conditions such as the availability of financial resources, debt and equity, the encouragement of new and growing firms by governmental policies, the presence and quality of government programs, ease of access to the available physical resources and/or the availability of intellectual property rights that protect new and growing firms are formal institutions in the environmental aspect of entrepreneurship. They reveal the external factors imposed onto the community. In contrast, cultural and social norms that encourage entrepreneurship, the view that becoming an entrepreneur is a desirable career choice and/or highly valued perceptions of innovation represent the informal institutions in the environmental aspect of entrepreneurship. Finally, individualistic factors consist of the psychological profile of individuals. Motivation level, education and skills, personality, risk perception and/or social networks indicate the individual aspect of entrepreneurship Cuervo (2005); Canedo et al. (2014). Although previous studies on women entrepreneurship have explained the variety of these concepts and their effects on women's entrepreneurial decisions, most of them have approached the topic from the same methodological point of view. This methodological shortcoming can be explained in two main aspects: sampling and analysis technique.
In sampling, most research on the intentions, preferences and opinions of individuals regarding women's entrepreneurship uses cross-sectional survey studies, structured questionnaires Ahl (2006) and in-depth interviews Jännäri and Kovalainen (2015). On the other hand, the emergence of big data provides new opportunities for monitoring and modeling attitudes toward social and economic issues. Specifically, trend analysis using online search data provides valuable input for evaluating changes in public opinion and perception, which can be thought of as a proxy for the level of public knowledge and awareness of specific terms Lineman et al. (2015).
Technically, many studies in the women entrepreneurship field are based on general linear model-based techniques. These techniques are bounded by assumptions (such as linearity and normality) and provide results based on standard errors, correlation coefficients, and regression coefficients. More specifically, Ahl (2006), in her study related to women entrepreneurship, suggests new research directions that produce more and richer aspects of women's entrepreneurship. Ahl (2006) has also noted the dominance of correlation tests, t-tests, multiple regression, MANOVA, ANOVA, factor analysis, clustering, discriminant analysis, and logit models as analysis techniques in the women entrepreneurship field. These findings are consistent with previous investigations of research methods in women entrepreneurship (e.g., Brush (1992); Coviello and McAuley (1999)). More recently, Kuckertz and Prochotta (2018) surveyed current topics and methods in entrepreneurship by collecting data from current researchers. Therefore, this study aims to examine a new methodological approach and explore its usage in the women entrepreneurship context.

The reason to apply regional approach in women entrepreneurship context
The gender gap in entrepreneurial activities still remains a challenge for all countries, including developed, developing and emerging ones. The past 30 years of research have shown that there are stylized facts related to women entrepreneurship. The most apparent one is that there are relatively fewer women entrepreneurs compared to men Minniti and Naudé (2010). The main reason for this is not higher rates of failure among women but rather that fewer women start businesses. For instance, in only 6 of 48 economies are women's total early-stage entrepreneurial activity (TEA) rates among adults aged 18-64 higher than or equal to those of men, according to the Global Entrepreneurship Monitor (GEM) Adult Population Survey, 2018 Global Entrepreneurship Monitor (GEM) (2018). Why are these rates low? From the environmental point of view, the economic, legal and socio-cultural characteristics of a society can be the reason behind the low rate of women entrepreneurship. GEM reports since 1999 have indicated that entrepreneurship does not exist uniformly in the world, but rather has "extraordinary variation." One reason for the variation among regions is the difference in the level of development. In developing countries, the prevalence of women entrepreneurship is higher than in developed countries Minniti and Naudé (2010). The Europe and North America regions of the globe have experienced considerable industrial restructuring in the last three decades, which also helps to develop new entrants and small firms Baptista et al. (2008). Women in such developed economies are more likely to start businesses, while those in less developed economies have lower rates of entrepreneurship activities. The reasons for these varied women entrepreneurship rates become meaningful when a regional evaluation is used. However, most research on women entrepreneurship is very "Western-centric." Studies from Latin America, Asia, the Middle East and Eastern Europe are only now emerging Brush and Cooper (2012).
Hence, a cross-regional analysis is well-suited to understand and analyze entrepreneurship dynamics of women, and this study at hand employs a regional approach.
In this vein, this study aims to examine a new methodological approach and explore its usage in the women entrepreneurship context by using market basket analysis on Google Trends data. It contributes to the literature in several ways. First, to the best of the authors' knowledge, this study is the first to apply market basket analysis to Google Trends data in the women entrepreneurship context. Second, this research analyses the trending search topics related to women entrepreneurship in a given region for the period January 2009-November 2019. Third, the findings of this research are important as they reveal the connectivity of keywords related to women entrepreneurship and their inter-associations as consecutives or descendants. Last but not least, analyzing the popularity and awareness of keywords related to the entrepreneurship of women in a global and cross-regional setting provides a comprehensive empirical ground for practical suggestions. The following sections detail the Google Trends dataset and the analytical procedure used in the current study.

Data
Google Trends has become very popular and has been widely used in assessing public opinion on many kinds of subjects. It is one of the digital data platforms that provides an index of the volume of queries users enter into the Google search engine based on a given geographic area (country, state/province and city levels) and time. Since Google is the most widely used search engine in the world Report Nielsen (2008), it provides compilations of big data related to people's search terms. The search index is a compilation of all Internet queries submitted to Google's search engine since 2004. The index for each query term is calculated as the search volume for a specific query in a given geographical location divided by the total number of queries in that region at a given point in time. Thus, the index is always a percentage from 0 to 100 Wu and Brynjolfsson (2009).
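The index construction described above can be sketched in a few lines. This is only an illustration: the function name and the toy counts are invented, and the final rescaling so that the peak share equals 100 is an assumption about how the 0 to 100 range is obtained, not something stated in the text.

```python
def relative_search_index(query_counts, total_counts):
    """Share of a query in all searches per period, rescaled to 0-100.

    Assumption: the period with the highest share is mapped to 100.
    """
    shares = [q / t for q, t in zip(query_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        # No searches for the query at all: the index is 0 everywhere.
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Invented counts for three periods in one region.
query_counts = [50, 200, 10]      # searches for the keyword
total_counts = [1000, 2000, 500]  # all searches in the region
index = relative_search_index(query_counts, total_counts)
```

Because each count is divided by the total search volume first, the index reflects relative interest rather than raw volume, so regions with very different numbers of users remain comparable.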
For this study, we investigate whether the keywords indicating attitudes and opinions about women entrepreneurship in the world were popular between January 2009 and November 2019. To determine the keywords, a computer-based indexing system that depends on frequency analysis of text was used. The source text combined widely used entrepreneurship scales related to 'entrepreneurial intention' (developed by Liñán (2007)) and 'entrepreneurial attitudes' (developed by Kolvereid (1996)) with other scales related to women entrepreneurship that are used to reveal supporting and hindering factors in this context, such as 'gender stereotypes' (developed by Van Egmond et al. (2010)), 'work family conflict' (developed by Marshall and Barnett (1993)) and 'social support' (developed by Flood (2005)). All these scales were combined and written as text. Moreover, the scale on the measurement of environmental incentives (such as governmental support, education and training, and cultural and social norms, used in the GEM Report, Global Entrepreneurship Monitor (GEM) (2018)) was also included in the text. In the text mining process, stop words such as 'the', 'at', 'and', 'or' etc. were excluded from the keyword data. Twenty-four keywords were obtained, and the results are presented with a 'word cloud' visualization (Fig. 1).
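A minimal sketch of frequency-based keyword extraction with stop-word removal is given below. It is not the indexing system used in the study: the stop-word list, the sample text and the function name are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "at", "and", "or", "a", "an", "of", "to", "in", "is", "i"}


def keyword_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens in `text`, skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Invented stand-in for the combined scale items (not the study's text).
sample_text = (
    "I intend to start a firm. Training and social support help the firm. "
    "Networking is an opportunity, and training reduces the risk of failure."
)

frequencies = keyword_frequencies(sample_text)
top_keywords = frequencies.most_common(2)  # the two most frequent keywords
```

In practice the most frequent remaining tokens would be reviewed by the researchers before being used as search terms, since raw frequency alone does not guarantee relevance.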
The geographical regions were selected on the basis of the Global Entrepreneurship Monitor (GEM) categorization. GEM is the world's foremost research on entrepreneurship; it assesses the national level of entrepreneurial activities annually and provides a large-scale database for international comparison Bosma (2013). The countries were grouped according to their geographical location, which yields four regions with forty-nine countries:
• Europe and North America
• East and South Asia
• Latin America and Caribbean
• Middle East and Africa
First, in the data extraction process, after the time range and keywords to be handled in the Google Trends data were determined for each region, some transformations were carried out. Separate data files were created for each of the four regions. The columns of each new data file contain the keywords that were searched on Google, and the rows contain the countries within each geographical region.
Then, for each country, it was investigated whether each term was popular at least three times in the specified period. Popularity is defined as a binary variable (recoded as 0 or 1) which takes a value of 1 when the index is equal to or greater than 75. The final version of the data was prepared for association rule mining. The RStudio program (version 4.1.1) and the arulesViz, igraph and Rgraphviz packages were used for statistical computing and graphical representation of the extracted rules.
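The recoding step can be illustrated as follows. This sketch reflects one reading of the text, namely that a term is coded 1 for a country when its index reaches 75 or more in at least three periods; the function names and the toy index values are invented for illustration.

```python
POPULARITY_CUTOFF = 75  # index threshold stated in the text
MIN_OCCURRENCES = 3     # "at least three times" in the study period


def is_popular(index_series, cutoff=POPULARITY_CUTOFF,
               min_occurrences=MIN_OCCURRENCES):
    """Return 1 if the index reaches the cutoff often enough, else 0."""
    hits = sum(1 for value in index_series if value >= cutoff)
    return 1 if hits >= min_occurrences else 0


def build_region_matrix(trends):
    """Map {country: {keyword: [index values]}} to {country: {keyword: 0/1}}."""
    return {
        country: {kw: is_popular(series) for kw, series in by_keyword.items()}
        for country, by_keyword in trends.items()
    }


def to_transactions(matrix):
    """Turn each country row into the set of keywords coded 1."""
    return {
        country: {kw for kw, flag in row.items() if flag == 1}
        for country, row in matrix.items()
    }


# Invented index values for two hypothetical countries.
trends = {
    "Country A": {"social media": [80, 90, 75, 10], "training": [76, 20, 30, 99]},
    "Country B": {"social media": [100, 75, 75, 74], "training": [75, 75, 75, 0]},
}
matrix = build_region_matrix(trends)
transactions = to_transactions(matrix)
```

Each country then plays the role of one "basket" and the keywords coded 1 are its items, which is the input shape that association rule mining expects.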

Analytical procedure
Data mining methods that analyze the co-occurrence of events are called association rules. These methods present the rules for events that occur together, along with the calculated probability values. Association rules are an approach that supports the analysis of past data, the determination of association behaviors within these data, and the conduct of future studies. Association rules were first introduced by Agrawal et al. (1993). Market basket analysis is an example of the application of association rules, which have been very popular since then in data mining applications Frawley et al. (1991). The purpose of the association rules used in market basket analysis is to find the relationships between the items that customers purchase during the shopping process and to determine the buying habits of the customers. Thus, vendors gain an effective and profitable marketing and sales opportunity thanks to the relationships and habits discovered.
Some of the algorithms used in the literature for association rule analysis are Apriori, Carma, Sequence, GRI, Eclat and FP-Growth. Since the Apriori algorithm is the most popular and widely used of these algorithms in data mining, we use it in our study. In order to implement the Apriori algorithm, the minimum support and minimum confidence values must be set by the user. Rules that satisfy these minimum requirements are called strong rules Han and Kamber (2006). In some sources, these rules are also called interesting rules.
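To make the role of the minimum support threshold concrete, the following is a minimal pure-Python sketch of the frequent-itemset stage of Apriori. It is not the R implementation used in the study; the function name and the toy transactions are invented, and the rule-generation stage that applies the confidence threshold is omitted.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        previous = list(current)
        candidates = {a | b for a, b in combinations(previous, 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
    return frequent


# Invented transactions: each set is the popular keywords of one country.
transactions = [
    {"social media", "firm", "training"},
    {"social media", "firm"},
    {"social media", "training"},
    {"social media", "firm", "training"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.75)
```

The prune step is what distinguishes Apriori from brute-force enumeration: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without counting them.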
An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found within the data. A consequent is an item found in combination with the antecedent. An association rule for the products A and B can be written as "[A → B]." This means that when A occurs, B will most likely occur as well. The left side of the expression (which may contain more than one item) is called the antecedent and the right side is called the consequent (also referred to as the result, consecutive or descendant). If the expression on the left side is true, the expression on the right side is also true Hand et al. (2001). The importance of a rule can be measured by two different measures: support and confidence. The support level of this association rule is given below:

$$\text{Support}(A \rightarrow B) = \frac{n(A \cup B)}{N} \qquad (1)$$

In the formula, n(A ∪ B) represents the number of transactions where A and B occur together, and N represents the total number of transactions. A value of one means that A and B occur together in every transaction in the examined dataset, and a value of zero means that A and B do not occur together in any transaction in the dataset. On the other hand, the confidence value indicates what percentage of the transactions involving A include B as well. The confidence level of this association rule is given below:

$$\text{Confidence}(A \rightarrow B) = \frac{n(A \cup B)}{n(A)} \qquad (2)$$

The formula given above, unlike the support formula, contains the total number of observations of A, n(A), in the denominator. A value of one indicates that each transaction containing A also contains B.
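There is a third metric, called the lift value, which is also popular for discovering interesting rules in association rule mining. The lift value is given in Eq. 3:

$$\text{Lift}(A \rightarrow B) = \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)} = \frac{N \cdot n(A \cup B)}{n(A)\, n(B)} \qquad (3)$$

Here n(A) and n(B) denote the number of transactions containing A and B, respectively, and Support(B) = n(B)/N.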
If the lift value is greater than 1, the items in the antecedent and the consequent are complementary, i.e., there is a positive dependency structure between the values found in the antecedent and the consequent parts. If the lift value is between 0 and 1, the items in the antecedent and the consequent can be regarded as substitutes, i.e., there is a negative dependency structure between the values found in the antecedent and the consequent parts. If the lift value equals 1, there is no dependency between the antecedent and consequent items in the association rule. A high lift value indicates that the items in the antecedent and consequent parts have a strong positive effect on each other.
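The three measures discussed above (support, confidence and lift) can be computed directly from transaction counts. The sketch below uses invented transactions and hypothetical function names, and it assumes that the antecedent and the consequent each occur in at least one transaction so that no division by zero arises.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    n_total = len(transactions)
    n_a = count_containing(transactions, antecedent)
    n_b = count_containing(transactions, consequent)
    n_ab = count_containing(transactions, set(antecedent) | set(consequent))
    support = n_ab / n_total               # share of all transactions
    confidence = n_ab / n_a                # share of transactions with A
    lift = confidence / (n_b / n_total)    # confidence relative to base rate of B
    return support, confidence, lift


# Invented transactions for illustration only.
transactions = [
    {"firm", "training"},
    {"firm", "training"},
    {"firm"},
    {"training"},
    {"network"},
]
support, confidence, lift = rule_metrics(transactions, {"firm"}, {"training"})
```

In this toy example the rule [firm → training] has a lift slightly above 1, so the two terms co-occur a little more often than they would if they were independent.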
It is possible to generate too many rules even in a medium-sized data set. However, rules that do not cover a substantial part of the dataset are not worth examining. Therefore, it is necessary to list not every rule but only those rules that meet certain requirements. Association rules attempt to reveal co-occurring records consisting of two or more elements. If rules are created by analyzing all possible sets of elements, there may be so many rules that they make little sense. Hence, the strength of a calculated association rule is measured by two main parameters: support and confidence. Minimum confidence and minimum support values therefore limit the number of rules. The highest possible support and confidence value is 100% (or 1), indicating that the relevant rule holds in the entire data set.
Before the algorithm can be used, the user needs to decide on the minimum support and minimum confidence values. Keeping these values high results in a low number of rules. Likewise, keeping these values too low results in too many rules. In other words, the minimum support and minimum confidence values control the number of rules. Table 1 presents the number of rules obtained with different thresholds for the support and confidence levels. In this study, we take the confidence and support cut-off level as 85%.
As can be seen in Table 1, the number of rules calculated by the algorithm increases as the lower limits for the support and confidence values decrease. In order to address this disadvantage, many different approaches have been studied in the literature. Tan et al. (2004) conducted a comprehensive study to determine the right objective measure for association rule mining. Zhang et al. (2008) suggested using fuzzy numbers and a fuzzy threshold value to determine the rules on a different basis from the Apriori algorithm. Selvi and Tamilarasi (2009) introduced an automated association rule mining technique based on cumulative support thresholds. Bao et al. (2022) tried to overcome this disadvantage by introducing new interestingness measures to the field of association rule mining. Debates in this field of study are still ongoing and offer potential for future studies.
To reveal the interesting rules in our study, the number of rules was calculated for different values of the minimum support and confidence levels, as provided in Table 1. According to Table 1, and accompanied by expert opinion, it is concluded that the most appropriate level is 85%. Additionally, the application of the new approaches suggested in the literature presented above should be considered in future studies in order to determine other appropriate levels of support and confidence for the data used.
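The sensitivity of the rule count to the chosen thresholds can be illustrated with a small brute-force sketch. The transactions and the resulting counts are invented and do not reproduce the study's table; the sketch assumes rules with a single-item consequent and a possibly empty antecedent, and it enumerates all itemsets directly, which is only feasible for a handful of items.

```python
from itertools import combinations


def count_rules(transactions, min_support, min_confidence):
    """Count rules X -> {y} that meet both thresholds (X may be empty)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    def supp(itemset):
        # The empty set is contained in every transaction, so supp({}) == 1.
        return sum(1 for t in transactions if itemset <= t) / n

    count = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = supp(itemset)
            if s == 0 or s < min_support:
                continue
            for consequent in itemset:
                antecedent = itemset - {consequent}
                if s / supp(antecedent) >= min_confidence:
                    count += 1
    return count


# Invented transactions: each set is the popular keywords of one country.
transactions = [
    {"social media", "firm", "training"},
    {"social media", "firm"},
    {"social media", "training"},
    {"social media", "firm", "training"},
]
# Same cut-off for support and confidence, from strict to lenient.
rule_counts = {
    cutoff: count_rules(transactions, cutoff, cutoff)
    for cutoff in (1.0, 0.75, 0.5)
}
```

Even with three items the count grows quickly as the cut-off is lowered, which is why a single cut-off has to be justified rather than chosen arbitrarily.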
In this study, the important rules are obtained by the Apriori algorithm. To do this, R, a popular programming language and software environment for statistical computing and graphical representation of the rules, is used through RStudio. The rules that are interpreted in the main manuscript are shown in bold.

Results
The results are given in different forms in Table 2 and Figs. 2 and 3. The market basket analysis revealed 38 rules in total (Table 2). The support and confidence levels of each rule are given for the four regions. These rules give us the probabilities of terms being searched together and their order of occurrence. For the Europe and North America region, 12 association rules are determined according to the market basket analysis of online search keywords within the context of women entrepreneurship. The results in Table 2 indicate that the most popular terms are [social media] (rule 1), [training] (rule 5) and [firm] (rule 9), respectively.
In Table 2, the search term [Social media] is the most common search term without any antecedent (this rule's antecedent is empty and is given as {} in Table 2, rule 1).
A rule with an empty antecedent means that, no matter what other terms are involved, the item in the consequent will appear with the probability given by the rule's confidence (which equals its support).
Hence, its support and confidence values of 1.000 mean that the term [Social media] is popular above the support and confidence cut-off of 85% for this region. In other words, for the Europe and North America region all the transactions contain the term [Social media] without any antecedent. However, this does not guarantee that this term will be observed in every association rule, since the support and confidence limits are taken into account for each transaction when calculating rules.
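Why confidence equals support for a rule with an empty antecedent can be shown in one line. The derivation below uses n(·) for the number of transactions containing an itemset and N for the total number of transactions; the only added observation is that every transaction contains the empty set, so n({}) = N.

```latex
\begin{align*}
\text{Support}(\{\} \rightarrow B) &= \frac{n(B)}{N}, \\
\text{Confidence}(\{\} \rightarrow B) &= \frac{n(B)}{n(\{\})} = \frac{n(B)}{N}
  = \text{Support}(\{\} \rightarrow B).
\end{align*}
```

For [Social media] in the Europe and North America region, a support of 1.000 corresponds to n(B) = N, so the confidence is 1.000 as well.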
According to rule 2, the antecedent term is [Firm]  [Training] is another popular term with a 95% of probability   Table 2. Each region is represented in a different panel. The left-hand side (LHS) denotes the antecedent terms, while the right-hand side (RHS) represents the consequent term of the rule. Each of the circles represents an association rule for the region.
The colour of a circle shows the confidence value of the rule, with darker circles having higher confidence values than others. Also, the larger the circle, the higher the support value of the rule Hahsler and Chelluboina (2011).
For the Europe and North America region, four of the 12 association rules have [social media] as a descendant, four have [training] as a descendant and four have [firm] as a descendant term. However, [social media] has higher support and confidence levels as a descendant (right-hand side) term, as suggested by the larger sizes and darker colours of the circles in panel 'a' of Fig. 2. Higher confidence values (or darker colours) indicate that [social media] awareness is highly likely once awareness of [firm] and/or [training] is achieved. Figure 2 also shows the association rules with confidence and support values less than but close to the cut-off level of 85%. Figure 3 summarizes the market basket analysis rules with a graph-based visualization, in which each region is represented in a different panel. In this graph, search terms and rules are connected to each other with directed arrows. This graph makes it possible to examine how the rules are composed of individual terms and how the rules are associated. The colour of a circle indicates the confidence level and the size of the circle gives the support value of the rule, similar to Fig. 2.
For the Europe and North America region, social media, firm and training are likely to be seen as both antecedent and consequent terms of the rules. Among two-term associations, the probability of co-occurrence of [ Table 2). All association rules have confidence and support values of 1.000, indicating no hierarchy within the rules. The probability of co-occurrence of the terms, that is, the support value of the rules, is the same at 90.9% in this region. For all the rules, except the first one, the size of the circles is identical (Figs. 2 and 3). It is also seen in Fig. 3

Discussion
This study was conducted to explore a new methodological approach in the women entrepreneurship context. It examined women entrepreneurship concepts by using market basket analysis on regional Google Trends data. Market basket analysis is one of the methodological approaches used for working with big data. It identifies items that appear or are used together and the probability of these co-occurrences and relations. Such a technique is appropriate for finding non-obvious and hidden associations between items, which is also crucial in assessing individuals' thoughts on a specific topic.
The varying results confirmed the need for a regional analysis of the popularity of terms associated with women entrepreneurship. Social media is the most popular trend term in Europe and North America. Several studies have examined the impact of the internet, specifically social media, on entrepreneurial practices. Kamberidou (2013) mentioned that many European projects have focused on the benefits and opportunities of information and communication technologies in the business activities of women. Duffy and Pruchniewska (2017), who interviewed 22 independently employed female professionals in America, suggested that there is a sharp rise in women entrepreneurship in the digital context in the USA. Moreno et al. (2015) also helps to explain the match between social media and entrepreneurship in this region.
In East and South Asia, the four popular pillars of women entrepreneurship appeared to be opportunity, network/networking, firm, and failure. Indeed, for the context of Europe and North America, the relationships between networks and the creation of new firms Johannisson and Ramirez-Pasillas (2001), networks and firm survival (as the antonym of failure) Reese and Aldrich (1995); Ingram and Baum (1997), and networks and opportunity, specifically the opportunity perception or recognition of individuals Arenius and De Clercq (2005), are well documented in the literature. Our study shows that these associations may be valid for East and South Asia as well. One unique feature of the East and South Asia region is the high popularity of the keyword [opportunity] and its associations with other women entrepreneurship concepts. Awareness of opportunity is a fact on its own, but it is also a consequent conditional on firm, network, and even failure. Awareness of opportunity is also an antecedent of awareness of firm, network and failure.
According to De Vita et al. (2014), access to networks influences female entrepreneurship in developing countries. This is also valid for entrepreneurship in general. According to the study of Kantis et al. (2002), networks are effective in identifying business opportunities for over 70% of the entrepreneurs consulted in the regions of Latin America and East Asia. However, there are regional differences. East Asian entrepreneurs have networks in the business world, such as colleagues and business contacts, while entrepreneurs in Latin America have social networks, such as relatives, friends, and acquaintances. This may be the reason why the keyword network is linked with the keyword failure in the Latin America and the Caribbean region in our study.
In  [network/networking] in the Middle East and Africa region. This is an expected finding as the women entrepreneurs in the Middle East region are mostly in the informal sector and their firms are less likely to use Information and Communication Technology (ICT) tools Mathew (2010).
These findings give researchers an opportunity to note the popular search terms in each region. We suggest that further research on women entrepreneurship should focus on these varied trends and associations in each region. Moreover, these results, which reveal the popular terms that individuals have searched for on the Google search engine, would be helpful in interpreting the findings of other empirical studies related to the environmental or individual motives of women entrepreneurship.
Association rule mining is a machine learning method that analyzes the co-occurrence of events and establishes relationships between data. One of its advantages over many other machine learning algorithms is that it works well with categorical data. Second, association rules provide simpler and more descriptive results that enable researchers to retrieve the information embedded in the data. Moreover, in contrast to black-box machine learning algorithms (such as neural networks, support-vector machines or deep learning methods), association rule mining generates more human-readable results and is explainable by nature. Last but not least, this algorithm allows different cut-off values for the support and confidence levels, which can be chosen according to researchers' expertise.

Conclusion
In spite of the growing interest in women entrepreneurship research, there are several methodological limitations that obscure the focus of the research and obstruct the advancement of knowledge in the area. Accordingly, this study is important in terms of its statistical technique, data and findings, which help to reveal opinions toward women's entrepreneurship in the world. However, like all research, this study should also be considered in light of its limitations. The first limitation concerns the search terms and the time frame. Even though widely used scales were used to produce the keywords searched within the Google Trends data set, they may not capture all searched terms related to women entrepreneurship. Moreover, the time frame was limited to 10 years in this study. Further studies can be conducted with a larger pool of keywords and for a longer period of time. This would help to illustrate the global context of women entrepreneurship more comprehensively. The second limitation concerns the variation in internet usage across countries. These variations can be seen in Appendix-1, and further studies can use them as weights for countries or regions.
Lastly, contrary to our expectations, the terms fertility, work-family conflict, family support and mobbing, which are more closely related to women entrepreneurship, were not found to be trend topics within the searched regions and the chosen time period. This may be related to gender differences in internet usage, which favor males. Abraham et al. (2010) reported that the rate of internet usage among women is still lower than that among men, particularly in developing countries. Men outnumber women among adult internet users in all regions (Asia Pacific, Europe and Latin America) except North America.