This section focuses on the finer details of using Python for news scraping. We describe how we conducted news searches, retrieved article text, applied data-filtering techniques, and analyzed cultural perspectives on topics such as vaccine hesitancy, mandatory vaccines, and compulsory vaccination measures.
3.1 News Scraping with Python
This section describes our data-gathering approach, our data sources, and the resulting datasets. Our approach relies on well-known Python libraries such as BeautifulSoup and Scrapy to gather and analyze data from online sources. Our news scraping activity with Python revolves around three core activities:
- News research
- News text body retrieval
- Article filtering and dataset creation
The first step was to retrieve a list of articles related to the research topic from the internet. For this reason, we decided to use the popular Google News package from the PyPI repository, which allows for quick queries on the Google News webpage by keywords, publication date, language, and region.
The query output is a JSON file containing information about each matching result, such as the article title, the language, the publication date, and the web link. A sample result entry is shown below:
{"title": "From cowpox to mumps: people have always had a problem with vaccination - The Conversation", "title_detail": {"type": "text/plain", "language": null, "base": "", "value": "From cowpox to mumps: people have always had a problem with vaccination - The Conversation"},
"published": "Wed, 19 Feb 2020 08:00:00 GMT", "published_parsed": [2020, 2, 19, 8, 0, 0, 2, 50, 0]
While testing the package, we discovered that the number of search results is capped at 100 articles per query. We therefore issued multiple queries, combining different keywords, countries, and publication dates. Although the query accepts a country parameter, setting it to a specific country does not guarantee that the results come from that country; for this reason, we added the country name directly to the search query. To streamline the analysis, we restricted the search to English and did not use local languages.
For the search query, we adopted the following skeleton:
'vaccine covid {kword} {cname} after:{start_date} before:{end_date}'
where:
kword is a keyword among: [hesitancy, mandatory, compulsory]
cname is a country name among: [Austria, Belgium (BE), Bulgaria, Croatia, Republic of Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany (DE), Greece, Hungary, Ireland, Italy (IT), Latvia, Lithuania, Luxembourg, Malta, Netherlands (NL), Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden, UK, USA]
start_date and end_date define monthly date ranges from November 2019 to June 2022 (32 months)
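A small sketch of how these query strings can be assembled from the skeleton and parameters above is shown below; variable and function names are illustrative and not taken from the original pipeline.

from datetime import date

KEYWORDS = ['hesitancy', 'mandatory', 'compulsory']
COUNTRIES = ['Austria', 'Belgium', 'Bulgaria']  # ...plus the remaining countries listed above (29 in total)

def month_starts(first=date(2019, 11, 1), last=date(2022, 7, 1)):
    """Yield the first day of every month from first up to and including last."""
    current = first
    while current <= last:
        yield current
        current = date(current.year + current.month // 12, current.month % 12 + 1, 1)

months = list(month_starts())  # 33 month boundaries -> 32 monthly windows

queries = [
    # before: uses the first day of the following month, so each window covers one month
    f'vaccine covid {kword} {cname} after:{start:%Y-%m-%d} before:{end:%Y-%m-%d}'
    for kword in KEYWORDS
    for cname in COUNTRIES
    for start, end in zip(months, months[1:])
]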
3.1.1 Results
Combining all keywords, countries, and monthly date ranges gives a total of 3 × 29 × 32 = 2,784 queries.
Out of the 66,462 articles we extracted, only 24,001 (36%) were unique due to repetition across queries.
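The deduplication criterion is not spelled out above; the sketch below shows one plausible approach, keeping only the first occurrence of each article link across query results.

def deduplicate(entries):
    """Keep the first occurrence of each article link across all query results."""
    seen, unique = set(), []
    for entry in entries:
        link = entry.get('link')
        if link not in seen:
            seen.add(link)
            unique.append(entry)
    return unique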
Retrieving the article text body can be a tricky task. Many approaches rely on web scraping techniques that access the web page source code to extract text information. The main challenge with web scraping is that it requires a specific implementation per website, and sometimes the desired text is embedded in more complex data structures or JavaScript objects.
For this reason, we decided to rely on news-please, another popular Python library, to handle article body retrieval.
The library successfully downloaded the text of 22,036 out of 24,001 articles (92%).
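A minimal sketch of the body retrieval step is shown below; the URL is a placeholder, since in the actual pipeline the links come from the Google News results.

from newsplease import NewsPlease

# Placeholder URL; in practice the links come from the query results above.
article = NewsPlease.from_url('https://www.example.com/some-news-article')

if article is not None:
    print(article.title)
    print(article.maintext)  # extracted article body text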
3.2 Data Filtering
We noticed that some articles had issues with the downloaded text. During the data filtering process, we found that for several articles the retrieved body was not the article itself but a placeholder message, returned when the article was no longer available or news-please could not download it automatically, such as: "Why did this happen? Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information, you can review our Terms of Service and Cookie Policy."
For this reason, we filtered out every article containing words like Java, JavaScript, cookie, and browser.
The filtering process removed 887 samples, reducing the dataset to 21,149 articles (96% of the downloaded texts).
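A sketch of this keyword filter is shown below; the use of pandas and the column name maintext are assumptions about tooling, not taken from the original.

import pandas as pd

ERROR_MARKERS = ['java', 'javascript', 'cookie', 'browser']

def drop_error_pages(df: pd.DataFrame, text_col: str = 'maintext') -> pd.DataFrame:
    """Remove rows whose text contains any of the error-page markers (case-insensitive)."""
    pattern = '|'.join(ERROR_MARKERS)
    mask = df[text_col].str.contains(pattern, case=False, na=False)
    return df[~mask]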
As a further text-processing step, we removed exceptionally long articles, i.e., those surpassing the 90th percentile in word count and character count. The distributions of characters and words per article are visualized in Figs. 1 and 2.
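The percentile filter can be expressed as in the sketch below; again, pandas and the column name are assumptions rather than details from the original.

import pandas as pd

def drop_long_articles(df: pd.DataFrame, text_col: str = 'maintext') -> pd.DataFrame:
    """Drop articles above the 90th percentile in character count or word count."""
    n_chars = df[text_col].str.len()
    n_words = df[text_col].str.split().str.len()
    keep = (n_chars <= n_chars.quantile(0.9)) & (n_words <= n_words.quantile(0.9))
    return df[keep]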
3.2.1 Results
After applying this filter, 18,838 articles remain (78% of the 24,001 unique articles). We also found that some articles were unrelated to the coronavirus (e.g., articles on other topics containing ads or short news items about the pandemic). We therefore filtered by article content: the title and body text must each contain at least one of the terms vaccine, COVID-19, or coronavirus.
While this filter successfully removed off-topic articles, it significantly reduced the dataset size, leaving 10,563 articles (48% of the 22,036 downloaded articles).
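A sketch of this content filter is given below; the column names and the exact matching rule (case-insensitive substring match in both title and body) are assumptions consistent with the description above.

import pandas as pd

TOPIC_TERMS = ['vaccine', 'covid-19', 'coronavirus']

def keep_on_topic(df: pd.DataFrame) -> pd.DataFrame:
    """Keep articles whose title and body each mention at least one topic term."""
    pattern = '|'.join(TOPIC_TERMS)
    in_title = df['title'].str.contains(pattern, case=False, na=False)
    in_body = df['maintext'].str.contains(pattern, case=False, na=False)
    return df[in_title & in_body]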