Repke, Tim
2024-10-14
2016
https://knowledge.hpi.de/handle/123456789/1948

In this master’s thesis, a system for extracting meta-information, specifically citation data, from webpages is proposed. Machine learning models such as artificial neural networks and random forests are trained to classify elements on a given webpage based on visual cues. Visual properties of elements are analysed in detail in order to derive meaningful numerical features for classification. After applying sensible post-processing filters, the system is able to recall up to 80% of the desired data at a precision of up to 90%. Relying purely on visual cues, however, has its limitations for the robust extraction of some of the citation data. Possible approaches to make that extraction more robust are discussed at the end.

Extraction Of Citation Data From Websites Based On Visual Cues
mastersthesis