The limited deployment of road sensors restricts the effectiveness of traffic-disrupting event detection. In this context, Twitter is becoming a popular platform for people to share events that affect daily life. In this paper, a novel dictionary formation method and a new feature generation approach are proposed to build an integrated machine learning (ML) framework for detecting traffic events. The proposed combinatorial feature generation approach (CFGA) uncovers associations among the keywords of a tweet and extracts keyword sets that are correlated with the collected data. Such keyword sets are denoted as set phrases. A set phrase may comprise one or more words of a tweet. These set phrases may be used as keywords for event-related data collection or for further analysis. Frequently occurring set phrases are identified using the notion of support, which signifies the percentage of tweets containing the relevant keywords. Since the nature of events varies, a hardcoded support threshold would not be beneficial. Instead, support is treated as a hyper-parameter and tuned to find the threshold value used to obtain the set phrases. This process builds a database of frequently occurring set phrases that can signal traffic-related events when used with an ML classifier. The results suggest that, if a suitable support is chosen, the proposed CFGA increases the accuracy of supervised classification models for extracting traffic information from Twitter data. The classification results obtained using the proposed approach outperform their existing counterparts in terms of precision, recall, and F-measure.
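The notion of support described above can be illustrated with a minimal sketch. This is not the CFGA implementation itself. The function name `frequent_set_phrases`, the `max_size` parameter, the whitespace tokenization, and the sample tweets are all assumptions made for illustration. Support is expressed here as a fraction of tweets rather than a percentage.

```python
from collections import Counter
from itertools import combinations


def frequent_set_phrases(tweets, min_support, max_size=2):
    """Return {set phrase: support} for keyword sets meeting min_support.

    Hypothetical helper: a set phrase is modeled as a sorted tuple of
    distinct lowercase tokens that co-occur in a tweet.
    """
    n = len(tweets)
    if n == 0:
        return {}
    counts = Counter()
    for text in tweets:
        # Assumption: naive whitespace tokenization, no stop-word removal.
        tokens = sorted(set(text.lower().split()))
        for size in range(1, max_size + 1):
            # Each keyword combination is counted at most once per tweet.
            counts.update(combinations(tokens, size))
    # Support = fraction of tweets that contain every keyword in the set.
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


# Illustrative data only, not taken from the paper.
tweets = [
    "accident on highway causing heavy traffic",
    "heavy traffic near downtown",
    "road closed after accident",
    "lovely weather today",
]
phrases = frequent_set_phrases(tweets, min_support=0.5)
```

Raising `min_support` shrinks the resulting set of phrases and lowering it admits rarer combinations, which is why the threshold is better tuned per event type than fixed in advance.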