Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM

Document identifier: oai:DiVA.org:ltu-77090
Access full text here: 10.3390/asi2040037
Keywords: Social Sciences, Media and Communications, Information Systems, Social aspects, Web scraping, Web crawling, CRISP-DM, Text mining, Relevant literature search, Research methodology, Information systems
Publication year: 2019
Relevant Sustainable Development Goals (SDGs):
SDG 12 Responsible consumption and production
The SDG label(s) above have been assigned by OSDG.ai


Scientific web repositories are the central online locations where academic papers are stored and maintained. Because the information and metadata in these repositories are unstructured or semi-structured, literature analysis for scholarly writing becomes a challenge. Applying CRISP-DM offers a way to address this challenge by formulating an augmented process for relevant literature search. However, almost no repository offers a straightforward method for extracting metadata for the preliminary data processing that the CRISP-DM process requires. Additionally, most repositories do not follow open access standards. Up to the time this paper was published, work on augmented relevant literature search had seen methodological progress only; the underlying methods could not be applied on a larger scale, given the constraints on data access to open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology, with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a role-play case study is conducted. The approach works toward alleviating these restrictions and encouraging wider adoption of the augmented analysis process for relevant literature search within the research community.
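
As an illustration of the kind of metadata extraction the abstract refers to, the following is a minimal sketch and not the paper's implementation. It assumes that a repository landing page exposes bibliographic fields as `citation_*` meta tags, a convention many scholarly sites follow for indexing; whether a given repository does so has to be checked case by case. The names `CitationMetaParser`, `extract_citation_metadata`, and `SAMPLE_PAGE` are hypothetical, and the sample page is hand-written from the fields of this record rather than fetched from a real site.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from a page.

    Assumption: the repository exposes bibliographic metadata through
    citation_* meta tags. Pages that do not will yield an empty result.
    """

    def __init__(self):
        super().__init__()
        self.records = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Keep only bibliographic tags; a field such as citation_author
        # may repeat, so values are stored as lists.
        if name.startswith("citation_") and content:
            self.records.setdefault(name, []).append(content.strip())


def extract_citation_metadata(html):
    """Return a dict mapping each citation_* field to its list of values."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.records


# Hand-written sample page (hypothetical markup, values taken from this record).
SAMPLE_PAGE = """
<html><head>
<meta name="citation_title" content="Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM">
<meta name="citation_author" content="Hossam Hassanien">
<meta name="citation_publication_date" content="2019">
<meta name="citation_doi" content="10.3390/asi2040037">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

if __name__ == "__main__":
    for field, values in extract_citation_metadata(SAMPLE_PAGE).items():
        print(field, values)
```

In a real pipeline the HTML would come from an HTTP request to the repository, subject to its terms of use and robots.txt, and the extracted fields would feed the data preparation phase of CRISP-DM before any text mining is applied.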


Hossam Hassanien

Luleå tekniska universitet; Digitala tjänster och system