Extraction Of Citation Data From Websites Based On Visual Cues
Date Issued
2016
Author(s)
Repke, Tim
Abstract
This master’s thesis proposes a system for extracting meta-information, specifically
citation data, from webpages. Machine learning models such as artificial neural
networks and random forests are trained to classify the elements of a given webpage
based on visual cues. The visual properties of elements are analysed in detail in order
to derive meaningful numerical features for classification. After post-processing
filters are applied, the system recalls up to 80% of the desired data at
a precision of up to 90%. Relying purely on visual cues, however, has its limitations
for the robust extraction of some of the citation data. Possible approaches to making
this extraction more robust are discussed at the end.
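
The recall and precision figures quoted in the abstract can be read with the help of a small worked example. The sketch below is not taken from the thesis: the function name `precision_recall` and the element identifiers are hypothetical, chosen only so that the numbers land close to the reported upper bounds.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for two collections of item identifiers."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision: share of the extracted items that are actually desired.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the desired items that were actually extracted.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data (assumption, not from the thesis): IDs of page elements
# that a classifier labelled as citation data, and the IDs that truly are.
extracted = {"e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9"}
desired = {"e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e10", "e11"}

precision, recall = precision_recall(extracted, desired)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case, eight of the nine extracted elements are correct and eight of the ten desired elements are found, so both values sit near the upper bounds stated in the abstract.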
File(s)
Name
RepkeMScThesis.pdf
Size
421.28 KB
Format
Adobe PDF
Checksum
MD5: a3ce7dc256e7a0cdc8a8aaac62f97dc6