Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM
Document identifier: oai:DiVA.org:ltu-77090
Access full text here:
10.3390/asi2040037
Keyword: Social Sciences,
Media and Communications,
Information Systems, Social aspects,
Samhällsvetenskap,
Medie- och kommunikationsvetenskap,
Systemvetenskap, informationssystem och informatik med samhällsvetenskaplig inriktning,
Web scraping,
Web crawling,
CRISP-DM,
Text mining,
Relevant literature search,
Research methodology,
Information systems,
Informationssystem
Publication year: 2019
Abstract: Scientific web repositories are central online locations where academic papers are stored and maintained. Because the information and metadata within these repositories are unstructured or semi-structured, literature analysis for scholarly writing becomes a challenge. Applying CRISP-DM offers a way to address this challenge by formulating an augmented process for relevant literature search. However, almost no repositories offer a straightforward method for extracting metadata for the preliminary data processing that the CRISP-DM process requires. Additionally, most repositories do not follow open access standards. Up to the time this paper was published, the topic of augmented relevant literature search had seen methodological progress only; the underlying methods could not be applied on a larger scale, given data access constraints on open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a role-play case study is conducted. The approach works toward alleviating these restrictions and encourages wider adoption of the augmented analysis process for relevant literature search within the research community.
Authors
Hossam Hassanien
Luleå tekniska universitet; Digitala tjänster och system
Record metadata
header:
identifier: oai:DiVA.org:ltu-77090
datestamp: 2021-04-19T12:56:04Z
setSpec: SwePub-ltu
metadata:
mods:
@attributes:
version: 3.7
recordInfo:
recordContentSource: ltu
recordCreationDate: 2019-12-06
identifier:
http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-77090
10.3390/asi2040037
2-s2.0-85094949071
titleInfo:
@attributes:
lang: eng
title: Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM
abstract: Scientific web repositories are central online locations where academic papers are stored and maintained. Because the information and metadata within these repositories are unstructured or semi-structured, literature analysis for scholarly writing becomes a challenge. Applying CRISP-DM offers a way to address this challenge by formulating an augmented process for relevant literature search. However, almost no repositories offer a straightforward method for extracting metadata for the preliminary data processing that the CRISP-DM process requires. Additionally, most repositories do not follow open access standards. Up to the time this paper was published, the topic of augmented relevant literature search had seen methodological progress only; the underlying methods could not be applied on a larger scale, given data access constraints on open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a role-play case study is conducted. The approach works toward alleviating these restrictions and encourages wider adoption of the augmented analysis process for relevant literature search within the research community.
subject:
@attributes:
lang: eng
authority: uka.se
topic:
Social Sciences
Media and Communications
Information Systems, Social aspects
@attributes:
lang: swe
authority: uka.se
topic:
Samhällsvetenskap
Medie- och kommunikationsvetenskap
Systemvetenskap, informationssystem och informatik med samhällsvetenskaplig inriktning
@attributes:
lang: eng
topic: web scraping
@attributes:
lang: eng
topic: web crawling
@attributes:
lang: eng
topic: CRISP-DM
@attributes:
lang: eng
topic: text mining
@attributes:
lang: eng
topic: relevant literature search
@attributes:
lang: eng
topic: research methodology
@attributes:
lang: eng
authority: ltu
topic: Information systems
genre: Research subject
@attributes:
lang: swe
authority: ltu
topic: Informationssystem
genre: Research subject
language:
languageTerm: eng
genre:
publication/journal-article
ref
note:
Published
1
name:
@attributes:
type: personal
authority: ltu
namePart:
Hassanien
Hossam
role:
roleTerm: aut
affiliation:
Luleå tekniska universitet
Digitala tjänster och system
nameIdentifier:
hooeld
0000-0002-1095-8437
originInfo:
dateIssued: 2019
publisher: MDPI
relatedItem:
@attributes:
type: host
titleInfo:
title: Applied System Innovation
identifier: 2571-5577
part:
detail:
@attributes:
type: volume
number: 2
@attributes:
type: issue
number: 4
@attributes:
type: artNo
number: 37
location:
url: http://ltu.diva-portal.org/smash/get/diva2:1376063/FULLTEXT01.pdf
accessCondition: gratis
physicalDescription:
form: electronic
typeOfResource: text