Content-aware compression for big textual data analysis

dc.check.embargoformat: Not applicable [en]
dc.check.info: No embargo required [en]
dc.check.opt-out: Not applicable [en]
dc.check.reason: No embargo required [en]
dc.check.type: No Embargo Required
dc.contributor.advisor: Herbert, John [en]
dc.contributor.advisor: Sreenan, Cormac J. [en]
dc.contributor.author: Dong, Dapeng
dc.contributor.funder: Higher Education Authority [en]
dc.contributor.funder: European Regional Development Fund [en]
dc.date.accessioned: 2016-06-07T08:22:39Z
dc.date.available: 2016-06-07T08:22:39Z
dc.date.issued: 2016
dc.date.submitted: 2016
dc.description.abstract: A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers.
The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements. [en]
dc.description.sponsorship: Higher Education Authority Programme for Research in Third-Level Institutions Cycle 5 & European Regional Development Fund (Telecommunications Graduate Initiative program) [en]
dc.description.status: Not peer reviewed [en]
dc.description.version: Accepted Version
dc.format.mimetype: application/pdf [en]
dc.identifier.citation: Dong, D. 2016. Content-aware compression for big textual data analysis. PhD Thesis, University College Cork. [en]
dc.identifier.endpage: 137 [en]
dc.identifier.uri: https://hdl.handle.net/10468/2697
dc.language.iso: en [en]
dc.publisher: University College Cork [en]
dc.rights: © 2016, Dapeng Dong. [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/ [en]
dc.subject: Compression [en]
dc.subject: Hadoop [en]
dc.subject: Content-aware [en]
dc.subject: MapReduce [en]
dc.subject: Big data [en]
dc.subject: Textual data [en]
dc.thesis.opt-out: false
dc.title: Content-aware compression for big textual data analysis [en]
dc.type: Doctoral thesis [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD (Science) [en]
ucc.workflow.supervisor: j.herbert@cs.ucc.ie
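
Illustrative note: the abstract describes Content-aware Partial Compression (CaPC) as compressing only informational content and leaving functional content intact, so that existing libraries keep working on the compressed data. The Python sketch below shows that general idea on comma-separated records. It is not the scheme from the thesis: the dictionary coding, the hexadecimal codes, the choice of comma and newline as the functional content, and all function names are assumptions made for illustration only.

```python
import csv
import io
from collections import Counter

# Assumption: in this toy format the delimiter and newline are the
# "functional" content and every field value is "informational" content.
DELIMITER = ","
NEWLINE = "\n"


def build_codebook(text):
    """Assign a short hexadecimal code to each distinct non-empty field."""
    codebook = {}
    for line in text.split(NEWLINE):
        for field in line.split(DELIMITER):
            if field and field not in codebook:
                codebook[field] = format(len(codebook), "x")
    return codebook


def compress(text, codebook):
    """Replace field values by their codes; delimiters and newlines stay."""
    return NEWLINE.join(
        DELIMITER.join(codebook.get(f, f) for f in line.split(DELIMITER))
        for line in text.split(NEWLINE)
    )


def decompress(text, codebook):
    """Invert compress() using the same codebook."""
    reverse = {code: field for field, code in codebook.items()}
    return NEWLINE.join(
        DELIMITER.join(reverse.get(f, f) for f in line.split(DELIMITER))
        for line in text.split(NEWLINE)
    )


sample = (
    "alice,login,2016-06-07\n"
    "bob,login,2016-06-07\n"
    "alice,logout,2016-06-07"
)
codebook = build_codebook(sample)
compressed = compress(sample, codebook)

# An unmodified CSV parser still sees the same rows and columns, because
# the functional content was never touched.
rows = list(csv.reader(io.StringIO(compressed)))

# Analysis runs directly on the codes; only the result keys are decoded.
reverse = {code: field for field, code in codebook.items()}
code_counts = Counter(row[0] for row in rows)
user_counts = {reverse[code]: n for code, n in code_counts.items()}
```

In this sketch the per-user count is computed on the compressed records, and only the final keys are mapped back to their original values. A real scheme would also need to handle quoted fields, values that collide with codes, and distribution of the codebook to worker nodes, none of which are attempted here.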
Files
Original bundle
Name: Abstract.pdf
Size: 107.09 KB
Format: Adobe Portable Document Format
Description: Abstract

Name: DongD_PhD2016.pdf
Size: 6.59 MB
Format: Adobe Portable Document Format
Description: Full Text E-thesis
License bundle
Name: license.txt
Size: 5.62 KB
Format: Item-specific license agreed upon to submission