Please use this identifier to cite or link to this item: https://research.matf.bg.ac.rs/handle/123456789/1666
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Pajić, Vesna | en_US |
| dc.contributor.author | Vujičić Stanković, Staša | en_US |
| dc.contributor.author | Pajić, Miloš | en_US |
| dc.date.accessioned | 2025-03-14T13:31:13Z | - |
| dc.date.available | 2025-03-14T13:31:13Z | - |
| dc.identifier.uri | https://research.matf.bg.ac.rs/handle/123456789/1666 | - |
| dc.description.abstract | The use of PDF documents in Natural Language Processing (NLP) has become an almost daily activity for researchers in computational linguistics and related fields. Extracting plain text from PDF documents with existing software tools leads to severe distortion of sentence and paragraph structure, which is a serious problem for linguistically oriented research. In this paper, we present a novel algorithm for recovering sentences and paragraphs from PDF documents, called the Sentence Recovery Algorithm, or SR algorithm. The algorithm takes plain text extracted from a PDF document as input and attempts to recover sentences from it. It takes into account cases such as misinterpreted ends of lines, interruption of a sentence by tables or figures, problems caused by hyphenation, and so on. Besides describing and evaluating the algorithm, we present a use case for processing scientific articles originally given in PDF format, implemented in the Java programming language. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Beograd : Filološki fakultet, Univerzitetska biblioteka "Svetozar Marković", Zajednica visokoškolskih biblioteka Srbije | en_US |
| dc.relation.ispartof | Infotheca: Journal for Digital Humanities | en_US |
| dc.title | An Algorithm for Sentence Recovery from PDF Files | en_US |
| dc.type | Article | en_US |
| dc.contributor.affiliation | Informatics and Computer Science | en_US |
| dc.relation.issn | 2217-9461 | en_US |
| dc.relation.firstpage | 42 | en_US |
| dc.relation.lastpage | 55 | en_US |
| dc.relation.volume | 15 | en_US |
| dc.relation.issue | 2 | en_US |
| item.grantfulltext | none | - |
| item.cerifentitytype | Publications | - |
| item.openairecristype | http://purl.org/coar/resource_type/c_18cf | - |
| item.openairetype | Article | - |
| item.fulltext | No Fulltext | - |
| item.languageiso639-1 | en | - |
| crisitem.author.dept | Informatics and Computer Science | - |
| crisitem.author.orcid | 0000-0002-7200-3724 | - |
Appears in Collections: Research outputs

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.