The complaint monitoring system developed in this study consists of five main stages: data preprocessing, data annotation, location extraction, complaint type classification, and data visualization. Figure 1 shows a flowchart of the proposed system. Each stage is explained in the following subsections.
Data Preparation
The initial input data comes from Twitter and was obtained using Tweepy, an open-source Python library that accesses the Twitter API directly, using a personal access token for authentication [5]. Data collection took place from December 2020 to March 2021 and targeted tweets about Surabaya City, Indonesia. Surabaya was chosen because it is one of the largest cities in Indonesia and its population continues to grow. The city government therefore makes serious efforts to accommodate its citizens by providing public service facilities, with the aim of creating a sustainable urban environment [18]. The transformer-based complaint monitoring system targets complaint tweets in Surabaya to assist the city government in managing public complaints.
A total of 8,500 tweets were collected over four months from several official accounts on Twitter. The credibility of a Twitter account is associated with a large number of followers, especially when the tweet comes from a popular user such as the official Twitter account of a government-owned company [19]. Table 1 lists some of the official Twitter accounts used in this study, and Table 2 lists the keywords used to filter complaint tweets.
Table 1 Official Twitter Account Details Used in This Study
Username | Description | Number of Tweets | Number of Followers
@e100ss | The official Twitter account of Radio Suara Surabaya, a radio station that presents national news and news for Surabaya and the surrounding areas. | 535,600 | 975,700
@sits_dishubsby | The official Twitter account of the Surabaya City Transportation Agency for the traffic control system. | 127,300 | 254,300
@SapawargaSby | The Surabaya city government's official Twitter account, managed by the Surabaya City Communication and Information Service. | 29,600 | 105,000
@MNCPlayID | MNC Play's official Twitter account. MNC Play is one of Indonesia's internet and subscription cable TV providers. | 127,600 | 51,900
@MNCPlaySBY | MNC Play's official Twitter account for the Surabaya and Sidoarjo areas. | 2,772 | 1,229
@IndiHome | IndiHome's official Twitter account. IndiHome is a provider of internet, home telephone, and interactive television services in Indonesia. | 931,600 | 256,000
@pln_123 | The official Twitter account of PT PLN, a government-owned company that handles all aspects and problems of electricity in Indonesia. | 1,600,000 | 618,300
@PDAMSurabaya | The official Twitter account of an Indonesian government-owned company engaged in distributing clean water, especially for Surabaya City. | 18,600 | 31,700
Table 2 Explanation of Complaint Types, along with Keywords and Sample Tweets for Each Complaint Type
Complaint Type | Description | Twitter Account for Crawling Data | Search Keywords | Sample Tweets
Road Damage | Complaints about damaged roads or traffic conditions. | @e100ss, @sits_dishubsby, @SapawargaSby | “jalan rusak”, “jalan berlubang” | “Kondisi jalan rusak parah di kalianak surabaya, mobil2 pada patah as roda” ("The road in Kalianak, Surabaya, is badly damaged, and cars are breaking their wheel axles")
Internet | Complaints about internet or wifi problems. | @MNCPlayID, @MNCPlaySBY, @IndiHome | “internet error”, “wifi lemot”, “wifi tidak bisa”, “internet lambat” | “apakah daerah bandung sedang ganguan? soalnya ini internet ga nyala dari pagi” ("Is there a disruption in the Bandung area? The internet has not been working since this morning")
Water Quality | Complaints about water problems caused by disruptions at the government-owned clean water company. | @PDAMSurabaya | “PDAM mati”, “PDAM”, “gangguan pdam”, “air tidak mengalir”, “pdam tidak keluar air” | “Mohon info ini air mati di daerah Wisma Lidah Kulon dan sekitarnya sampai kapan dan jam berapa? Terima kasih” ("Please let us know until when and what time the water will be off in the Wisma Lidah Kulon area and its surroundings. Thank you")
Power Outages | Complaints about electricity not working or a building being without electricity. | @pln_123 | “mati listrik”, “listrik mati”, “mati lampu”, “pemadaman listrik”, “pengaduan pln”, “pln mati”, “pemadaman bergilir” | “Daerah ketintang permai listriknya padam, tolong segera ditangani” ("The electricity is out in the Ketintang Permai area, please take care of it immediately")
Non-complaint | Tweets that do not express consumer dissatisfaction with a service. | - | - | “min.. bs tolong di cek lampu padam di rumah kami.” ("min.. can you please check the lights that are out at our house.")
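The keyword filtering described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the keywords are copied from Table 2 (comma-joined entries are assumed to be separate keywords), while the `COMPLAINT_KEYWORDS` structure and the `matched_keywords` function are hypothetical names introduced here. The Tweepy call that fetches the tweets is deliberately left out.

```python
# Keywords copied from Table 2; the dictionary layout is an assumption.
COMPLAINT_KEYWORDS = {
    "Road Damage": ["jalan rusak", "jalan berlubang"],
    "Internet": ["internet error", "wifi lemot", "wifi tidak bisa",
                 "internet lambat"],
    "Water Quality": ["pdam mati", "pdam", "gangguan pdam",
                      "air tidak mengalir", "pdam tidak keluar air"],
    "Power Outages": ["mati listrik", "listrik mati", "mati lampu",
                      "pemadaman listrik", "pengaduan pln", "pln mati",
                      "pemadaman bergilir"],
}


def matched_keywords(text):
    """Return {complaint type: [keywords found]} for one tweet (case-insensitive)."""
    lowered = text.lower()
    hits = {}
    for complaint_type, keywords in COMPLAINT_KEYWORDS.items():
        found = [kw for kw in keywords if kw in lowered]
        if found:
            hits[complaint_type] = found
    return hits


sample = "Kondisi jalan rusak parah di kalianak surabaya, mobil2 pada patah as roda"
print(matched_keywords(sample))  # {'Road Damage': ['jalan rusak']}
```

A keyword match only selects candidate tweets; the final complaint type is still assigned by the annotation and classification stages.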
Data Preprocessing
The data obtained from the crawling process is then preprocessed. Because the collection of tweet text is large, it must be cleaned to avoid variations that can result in inconsistent data [5], [20]. Preprocessing removes parts that are not needed, such as URL links, mentions (marked with "@"), and retweet markers (usually written as "RT"). Tokenization divides the tweet text into smaller units called tokens [5]. Case folding converts the tweet text to lowercase [21].
Text normalization converts non-standard or informal language into standard forms. Slang refers to informal language commonly used in online conversations, such as on Twitter. Slang words are formed from a term, an abbreviation, or a combination of both, and Twitter users favor them in online communication because they are short and easy to type. Slang makes it difficult for machines to analyze and understand the meaning of words [22]. For example, in informal Indonesian, words are shortened by removing vowels, so "tidak" (no) becomes "tdk" [23]. Slang words therefore need to be converted into their standard forms. Tweets that have passed through the preprocessing stage are stored in a log format.
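The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the regular expressions and the `preprocess` function are hypothetical, and the slang dictionary contains only the one mapping mentioned in the text ("tdk" to "tidak"); a real system would load a much larger dictionary.

```python
import re

# Hypothetical slang dictionary; only "tdk" -> "tidak" comes from the text.
SLANG = {"tdk": "tidak"}


def preprocess(tweet, slang=SLANG):
    """Clean, case-fold, tokenize, and normalize one tweet."""
    text = re.sub(r"https?://\S+", " ", tweet)      # remove URL links
    text = re.sub(r"@\w+", " ", text)               # remove mentions
    text = re.sub(r"\bRT\b", " ", text)             # remove the retweet marker
    text = text.lower()                             # case folding
    tokens = re.findall(r"\w+", text)               # simple word tokenization
    return [slang.get(tok, tok) for tok in tokens]  # slang normalization


print(preprocess("RT @pln_123: Listrik tdk nyala https://t.co/abc"))
# ['listrik', 'tidak', 'nyala']
```

The retweet marker is removed before case folding so that the pattern can match the uppercase "RT" without also deleting ordinary lowercase words.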
Data Annotation
The data annotation process is carried out to obtain the labeled data needed to train the supervised models [15]. A total of 1,627 tweets were annotated. A single annotator, the fourth author of this study, performed the annotation in three steps. First, the tweets are labeled with location labels using the BIO tag format to build a model for location extraction. The prefix B- (Begin) marks the first word of an entity, and the prefix I- (Inside) marks the subsequent words of the same entity [24], [25]. The location labels used in this study are LOC, GPE, BLD, HWYMSE, MSE, NPL, TIME, DATE, and OBJ. The O (Other) label indicates that the word is not part of any entity. The location labels are defined in Table 3.
Second, tweets are categorized into five categories to build a complaint type classification model. The five categories come from various industries and areas of activity to represent common public complaints in Indonesia regarding public facility services. Each complaint type label is defined in Table 2. Third, tweets are labeled with relation labels to build a relation extraction model. Relation extraction aims to extract the relations between the entities that have been identified. The relation labels are Highway-Position, Street-Place, StartingPoint-Destination, and Other. A detailed explanation of the relation labels can be seen in Table 4.
Table 3 Entity Labels For Location Extraction in Complaint Monitoring System
Entity Label | Description | Word or Phrase Example
LOC (Location) | Non-GPE locations, such as street names | Kertajaya, Gubeng
GPE (Geographical Entity) | City name, country name | Surabaya, Malang, Blitar
BLD (Building) | Building name | Taman Pelangi, Taman Ekspresi (Rainbow Park, Expression Park)
NPL (Natural Place) | Natural place name | Sungai Brantas, Gunung Bromo (Brantas River, Mount Bromo)
HWYMSE (Highway Measurement) | Kilometer markers on the highway | Km 20, KM 120
OBJ (Object) | Terms for things, not people | Truk, Mobil, Motor (Trucks, Cars, Motorcycles)
MSE (Measurement) | The unit of measure for an object, for example, the strength of an earthquake | 1 Km, 7.2 SR, 25 cm
TIME | Time units smaller than a day or date | 15.35, 16:20
DATE | Absolute date or period | 7-February-2023, 1/1/2021
O (Other) | Other entities besides location, date, and time | Saya, dimana, syukurlah (I, Where, Thank Goodness)
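The BIO format used for the location labels can be illustrated with a short sketch. The `to_bio` function and the example sentence are hypothetical and only show how entity spans map to B-, I-, and O tags; the entity labels themselves (BLD, GPE) are taken from Table 3.

```python
def to_bio(tokens, entities):
    """Convert entity spans into BIO tags.

    entities is a list of (start, end_exclusive, label) token spans.
    """
    tags = ["O"] * len(tokens)          # words outside any entity
    for start, end, label in entities:
        tags[start] = "B-" + label      # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # following words of the same entity
    return tags


# Hypothetical sentence: "pothole at Taman Pelangi Surabaya"
tokens = ["jalan", "berlubang", "di", "taman", "pelangi", "surabaya"]
entities = [(3, 5, "BLD"), (5, 6, "GPE")]
print(list(zip(tokens, to_bio(tokens, entities))))
# [('jalan', 'O'), ('berlubang', 'O'), ('di', 'O'),
#  ('taman', 'B-BLD'), ('pelangi', 'I-BLD'), ('surabaya', 'B-GPE')]
```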
Table 4 Types of Relations in The Manual Labeling Process for Relation Extraction
Relation Label | Description | Labeling Example
Highway-Position | The relationship between LOC (the position of a place name on the highway) and MSE (kilometer unit on the highway). | Jalan berlubang tol Gempol pada km20 arah Surabaya (A hollow road on the Gempol Highway (LOC) at km20 (MSE) towards Surabaya)
Street-Place | Relationship between LOC (street name) and LOC (place name) | Jalan darmo aja Surabaya jalan utama yaa rusak sebelum TL darmo (Darmo road (LOC-Street name), Surabaya, the main road, was damaged before TL Darmo (LOC-Place name))
StartingPoint-Destination | Relationship between LOC (name of the place as starting point) and LOC (name of the place as destination) | Poris ke Green Lake jalannya sebagian ada yang rusak. (Poris (LOC-Starting Point) to Green Lake (LOC-Destination), there are some roads damaged.)
Other | No relationship | Internet untuk daerah Keputih masih gangguan ya min? (Internet for the Keputih area, is there still a problem?)
Location Extraction
Location entities in complaint tweets can be extracted using a Named Entity Recognition (NER) task [16], [26]. The NER models used are transformer-based because such models have been shown to perform very well thanks to the self-attention mechanism [12]. The NER models are BERT and XLNet, which are trained on the complaint tweet data to learn entities such as location, geographic entity, building, road measurement, natural place, time, date, object, measurement, and other entities.
a. BERT model
BERT is trained to learn each word from both its left and right context, so it can understand a text based on its entire surroundings [16]. BERT consists of a stack of encoder blocks, each of which is transformer-based. The BERT input is a token sequence with a maximum length of 512, represented as vectors. For each input, a special [CLS] token is added at the beginning of the sequence, and a special [SEP] token divides the sequence into segments, which indicate whether a token comes from sentence A or sentence B. A position embedding is also added to each token, so the input representation of a token is the sum of its token, segment, and position embeddings. After being represented as vectors, the input passes through the self-attention layer and the neural network of each block. A prediction layer is stacked on top of BERT's final text representation to predict the possible location entity label for each token [27].
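The input construction described above can be sketched in plain Python. This is an illustration only: `build_bert_input` and `input_representation` are hypothetical helper names, the tokens are assumed to be already split into subword units, and the embedding vectors in the usage example are toy values rather than learned parameters.

```python
MAX_LEN = 512  # maximum input length stated in the text


def build_bert_input(tokens_a, tokens_b=None, max_len=MAX_LEN):
    """Return (tokens, segment_ids, position_ids) for one or two sentences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)                  # sentence A
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)     # sentence B
    # Naive truncation; a real pipeline would keep the final [SEP].
    tokens, segment_ids = tokens[:max_len], segment_ids[:max_len]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def input_representation(token_vec, segment_vec, position_vec):
    """Element-wise sum of the three embeddings for a single token."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


tokens, segments, positions = build_bert_input(["listrik", "padam"],
                                               ["di", "ketintang"])
print(tokens)     # ['[CLS]', 'listrik', 'padam', '[SEP]', 'di', 'ketintang', '[SEP]']
print(segments)   # [0, 0, 0, 0, 1, 1, 1]
print(positions)  # [0, 1, 2, 3, 4, 5, 6]
```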
b. XLNet model
XLNet uses a Permutation Language Model (PLM) to combine the advantages of autoencoding and autoregressive methods. BERT is an autoencoding method: certain words in the input sentence are masked, and the model restores them. GPT is an autoregressive method, which uses the transformer decoder to predict the output. PLM randomly permutes the order of the words to generate sequences and covers the last few words. The covered words are then predicted autoregressively by considering the preceding words in the permuted order. XLNet also uses the recurrence mechanism and relative position encoding of Transformer-XL, which can record hidden-state memory from each permutation and encode relative positions consistently across permutations. Thus, XLNet can enrich contextual information for long sentences by representing each token according to the semantics of the sentence [14].
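The permutation idea can be made concrete with a small sketch. It samples one factorization order and lists, for each of the last few positions in that order, which words are visible as context. The function name and the example tweet are hypothetical; an actual XLNet implementation realizes this through attention masks rather than by reordering the input, so the code below only illustrates the visibility rule.

```python
import random


def permutation_context(tokens, num_predict, seed=0):
    """Sample a factorization order and list the context of each predicted token.

    num_predict must be at least 1.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)                        # random factorization order
    targets = order[-num_predict:]            # only the last few are predicted
    examples = []
    for position in targets:
        step = order.index(position)
        visible = sorted(order[:step])        # positions earlier in the order
        examples.append((tokens[position], [tokens[i] for i in visible]))
    return order, examples


order, examples = permutation_context(["air", "pdam", "mati", "sejak", "pagi"], 2)
for target, context in examples:
    print("predict", target, "given", context)
```

Because the order is sampled, each target can see words on both its left and its right in the original sentence, while the prediction itself stays autoregressive.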
Complaint Type Classification
The Convolutional Neural Network (CNN) and Convolutional Long Short-Term Memory (CLSTM) are the methods used to classify the complaint type of each complaint tweet. Classification is carried out after the location entity has been identified because it concerns the intent of the event information, which must contain at least one location entity [10]. Details of the hyperparameter settings used in this study are shown in Table 5.
a. Convolutional Neural Network (CNN)
A CNN consists of multiple layers in a neural network, each with many features. In the convolutional layer, convolution is carried out for each filter with a particular kernel size. The pooling layer is applied together with the convolution to obtain the maximum value from each kernel. Dropout is then applied to drop a portion of the features, which helps prevent overfitting. The cross-entropy loss function is used because it gives a clear decision boundary in a classification task, which helps assess the predictions of a classification model.
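The convolution, max pooling, and cross-entropy steps can be sketched without any deep learning library. This is a simplified illustration on a scalar sequence with a single filter; in the actual model the input would be a matrix of word embeddings and there would be many filters. All function names are hypothetical.

```python
import math


def conv1d(sequence, kernel):
    """Valid 1-D convolution (cross-correlation) with a single filter."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]


def max_over_time(feature_map):
    """Max pooling over the whole feature map of one filter."""
    return max(feature_map)


def cross_entropy(logits, target_index):
    """Softmax cross-entropy loss for one example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


feature_map = conv1d([1, 0, 2, 3, 1], [1, -1])
print(feature_map)                 # [1, -2, -1, 2]
print(max_over_time(feature_map))  # 2
```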
b. Convolutional Long Short-Term Memory (CLSTM)
The first stage of complaint type classification is N-gram feature extraction via one-dimensional convolution, in which a filter vector slides over a sequence and detects features at different positions. The N-gram features were obtained from Word2Vec embeddings pre-trained on Indonesian Wikipedia, converted into convolutional features, and fed into the LSTM. The CLSTM model consists of one convolutional layer and one LSTM layer, regularized via dropout and an L2 regularization term on the softmax weights. The loss function is cross-entropy, and the LSTM is adopted because it can capture long-term dependencies between words in a sentence [28].
The CNN or CLSTM model assigns each tweet to one of five classes: power outages, damaged roads, water quality, internet, and non-complaints. Tweets with the non-complaint label are used only for performance evaluation and do not proceed to the data visualization stage. Tweets with a complaint label (power outages, damaged roads, water quality, or internet) proceed to both the data visualization and performance evaluation stages. Precision, recall, and F-measure values are used to evaluate classifier performance.
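The evaluation metrics can be computed per class as in the sketch below. The function name and the toy label lists are hypothetical, and the F-measure is assumed to be the balanced F1 score (the harmonic mean of precision and recall); the text does not state how per-class scores are averaged.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels for illustration only, not results from the study.
y_true = ["internet", "internet", "power", "power"]
y_pred = ["internet", "power", "power", "power"]
precision, recall, f1 = per_class_scores(y_true, y_pred, "power")
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 1.0 0.8
```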
Table 5 Hyperparameter Settings for Complaint Monitoring System
Parameter | Description | Value
Epochs | The number of passes that must be completed by the algorithm in processing the training data [21]. | 40
Layers | The number of neurons in the output layer of a given input [29]. | 2
Learning Rate | Parameter that controls the update step used to minimize the loss function. | 1e-3
Embedding Size | The size of the vector used to represent each word embedding. | 300
Dropout | Regularization technique of a neural network. | 0.5
Loss Function | Function to assess the predictions of the classification model. | Cross-entropy
Data Visualization
The Twitter data is visualized on a website built with Laravel Framework 5.8.8, PHP 7.4.13, a MySQL database, and the Google Maps V3 API, which displays a map of the complaint locations. The location of a complaint is identified using Named Entity Recognition (NER). Once NER has recognized the named entities, the relations between them are extracted. Additional gazetteer data was obtained from openstreetmap.org (OSM) and limited to Surabaya, Indonesia. The OSM output is an XML file, from which the location id, city, location address, location name, latitude, and longitude are extracted with the help of the xmltree Python library.
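Extracting gazetteer entries from an OSM XML export can be sketched as follows. This example uses Python's standard `xml.etree.ElementTree` module, which is an assumption and may differ from the library the study used. The sample XML follows the general OSM node and tag layout, but its ids, tags, and coordinates are placeholders for illustration, not real data, and `parse_gazetteer` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# Placeholder OSM snippet; values are illustrative, not real coordinates.
OSM_SAMPLE = """<osm version="0.6">
  <node id="1" lat="-7.25" lon="112.75">
    <tag k="name" v="Taman Pelangi"/>
    <tag k="addr:city" v="Surabaya"/>
  </node>
  <node id="2" lat="-7.30" lon="112.70"/>
</osm>"""


def parse_gazetteer(xml_text):
    """Return one dict per named OSM node."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("node"):
        tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
        if "name" not in tags:
            continue  # unnamed nodes cannot be matched against tweet text
        rows.append({
            "id": node.get("id"),
            "name": tags["name"],
            "city": tags.get("addr:city"),
            "address": tags.get("addr:street"),
            "lat": float(node.get("lat")),
            "lon": float(node.get("lon")),
        })
    return rows


rows = parse_gazetteer(OSM_SAMPLE)
print(rows[0]["name"], rows[0]["lat"], rows[0]["lon"])  # Taman Pelangi -7.25 112.75
```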
The location data obtained also yields information about the source (starting location), destination (final location), and way (central location), although not all data has a way. A graph is used to determine which nodes are the source, the destination, and the way. The source corresponds to the root of the graph, the destination to the end of a graph branch, and the way to a node that connects the root to the final node.
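The graph roles described above can be sketched with a simple rule on directed edges: a node without incoming edges is treated as the source, a node without outgoing edges as the destination, and any node with both as a way. This rule and the `classify_nodes` function are an illustrative reading of the description, not the authors' implementation; "Jalan Tengah" is a made-up intermediate location, while Poris and Green Lake come from the example in Table 4.

```python
def classify_nodes(edges):
    """Assign a role to each node of a directed graph.

    edges is a list of (from_node, to_node) pairs.
    """
    has_incoming = {dst for _, dst in edges}
    has_outgoing = {src for src, _ in edges}
    roles = {}
    for node in has_incoming | has_outgoing:
        if node not in has_incoming:
            roles[node] = "source"        # root of the graph
        elif node not in has_outgoing:
            roles[node] = "destination"   # end of a branch
        else:
            roles[node] = "way"           # connects root and final node
    return roles


print(classify_nodes([("Poris", "Jalan Tengah"), ("Jalan Tengah", "Green Lake")]))
print(classify_nodes([("Poris", "Green Lake")]))  # no way node in this case
```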
After the location data is converted into a graph, geocoding is carried out by converting ambiguous addresses into numerical geographic coordinates (latitude and longitude), which can be used to place markers on a map or to give the position of an address on a map [30]. The geocoded locations are then marked on the map, with the marker color differentiated by complaint type.
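A minimal sketch of the geocoding and marker step is shown below, assuming geocoding is a lookup of the extracted location name in the gazetteer. The gazetteer entry uses placeholder coordinates, and the color assignment per complaint type is hypothetical, since the text does not state which colors are used. The returned dictionary is only one possible shape for the data passed to the map front end.

```python
# Placeholder gazetteer entry; coordinates are illustrative, not real.
GAZETTEER = {"taman pelangi": (-7.25, 112.75)}

# Hypothetical color scheme; the actual colors are not given in the text.
MARKER_COLORS = {
    "Road Damage": "red",
    "Internet": "blue",
    "Water Quality": "green",
    "Power Outages": "yellow",
}


def geocode(location_name, gazetteer=GAZETTEER):
    """Look up (latitude, longitude) for a location name, or None."""
    return gazetteer.get(location_name.strip().lower())


def build_marker(location_name, complaint_type):
    """Return marker data for the map, or None if geocoding fails."""
    coords = geocode(location_name)
    if coords is None:
        return None
    lat, lng = coords
    return {
        "lat": lat,
        "lng": lng,
        "color": MARKER_COLORS.get(complaint_type, "gray"),
        "title": complaint_type + ": " + location_name,
    }


print(build_marker("Taman Pelangi", "Power Outages"))
```

Locations that cannot be resolved return `None`, so they can be skipped or reviewed manually instead of being placed at a wrong position.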