Content-aware partial compression for textual big data analysis in Hadoop

dc.contributor.authorDong, Dapeng
dc.contributor.authorHerbert, John
dc.date.accessioned2018-02-13T12:29:30Z
dc.date.available2018-02-13T12:29:30Z
dc.date.issued2017-06-29
dc.date.updated2018-02-13T12:23:29Z
dc.description.abstractA substantial amount of information in companies and on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. Compression, as an effective means to reduce data size, has been employed by many emerging data analytic platforms, for which the main purpose of data compression is to save storage space and reduce data transmission cost over the network. Since general purpose compression methods endeavour to achieve higher compression ratios by leveraging data transformation techniques and contextual data, this context-dependency forces the access to the compressed data to be sequential. Processing such compressed data in parallel, as is desirable in a distributed environment, is extremely challenging. This work proposes techniques for more efficient textual big data analysis with an emphasis on content-aware compression schemes suitable for the Hadoop analytic platform. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of public and private real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.en
dc.description.statusPeer revieweden
dc.description.versionAccepted Versionen
dc.format.mimetypeapplication/pdfen
dc.identifier.citationDong, D. and Herbert, J. (2017) 'Content-aware Partial Compression for Textual Big Data Analysis in Hadoop', IEEE Transactions on Big Data, 4(4), pp. 459-472. doi: 10.1109/TBDATA.2017.2721431en
dc.identifier.doi10.1109/TBDATA.2017.2721431
dc.identifier.endpage472en
dc.identifier.issn2332-7790
dc.identifier.issued4
dc.identifier.journaltitleIEEE Transactions on Big Dataen
dc.identifier.startpage459en
dc.identifier.urihttps://hdl.handle.net/10468/5452
dc.identifier.volume4
dc.language.isoenen
dc.publisherInstitute of Electrical and Electronics Engineers (IEEE)en
dc.rights© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.en
dc.subjectAlgorithm design and analysisen
dc.subjectBig Dataen
dc.subjectData analysisen
dc.subjectData compressionen
dc.subjectDistributed databasesen
dc.subjectOrganizationsen
dc.subjectServersen
dc.subjectCompressionen
dc.subjectDistributed File Systemen
dc.subjectMapReduceen
dc.titleContent-aware partial compression for textual big data analysis in Hadoopen
dc.typeArticle (peer-reviewed)en
Files
Original bundle
Name:
3210.pdf
Size:
1.22 MB
Format:
Adobe Portable Document Format
Description:
Accepted version