Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM

Document identifier: oai:DiVA.org:ltu-77090
Access full text here: 10.3390/asi2040037
Keywords: Social Sciences, Media and Communications, Information Systems, Social aspects, Web scraping, Web crawling, CRISP-DM, Text mining, Relevant literature search, Research methodology, Information systems
Publication year: 2019
Relevant Sustainable Development Goals (SDGs):
SDG 12 Responsible consumption and production
The SDG label(s) above have been assigned by OSDG.ai

Abstract:

Scientific web repositories are central online locations where academic papers are stored and maintained. Because the information and metadata within these repositories are unstructured or semi-structured, literature analysis for scholarly writing becomes a challenge. Applying CRISP-DM offers a way to address this challenge by formulating an augmented process for a relevant literature search. However, almost no repository provides a straightforward method for extracting metadata for the preliminary data processing that the CRISP-DM process requires. Additionally, most repositories do not follow open access standards. Up to the time this paper was published, the topic of augmented, relevant literature search had seen methodological progress only; the underlying methods could not be applied on a larger scale because data access was constrained to open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a role-play case study is conducted. The approach works toward alleviating these restrictions and encourages wider adoption of the augmented analysis process for a relevant literature search within the research community.

Authors

Hossam Hassanien

Luleå tekniska universitet; Digitala tjänster och system