Content-aware partial compression for textual big data analysis in Hadoop


dc.contributor.author Dong, Dapeng
dc.contributor.author Herbert, John
dc.date.accessioned 2018-02-13T12:29:30Z
dc.date.available 2018-02-13T12:29:30Z
dc.date.issued 2017-06-29
dc.identifier.citation Dong, D. and Herbert, J. (2017) 'Content-aware Partial Compression for Textual Big Data Analysis in Hadoop', IEEE Transactions on Big Data, In Press, doi: 10.1109/TBDATA.2017.2721431 en
dc.identifier.startpage 1 en
dc.identifier.endpage 14 en
dc.identifier.issn 2332-7790
dc.identifier.uri http://hdl.handle.net/10468/5452
dc.identifier.doi 10.1109/TBDATA.2017.2721431
dc.description.abstract A substantial amount of information in companies and on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. Compression, as an effective means of reducing data size, has been employed by many emerging data analytic platforms, where the main purpose of compression is to save storage space and reduce data transmission cost over the network. Since general-purpose compression methods endeavour to achieve higher compression ratios by leveraging data transformation techniques and contextual data, this context dependency forces access to the compressed data to be sequential. Processing such compressed data in parallel, as is desirable in a distributed environment, is extremely challenging. This work proposes techniques for more efficient textual big data analysis with an emphasis on content-aware compression schemes suitable for the Hadoop analytic platform. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of public and private real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements. en
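dc.description.note The abstract's key contrast (context-dependent general-purpose compression forces sequential access, whereas record-independent compression lets split-based map tasks decode records in isolation) can be illustrated with a minimal sketch. This is not the paper's scheme; the class, method names, and sample record below are hypothetical, and each record is simply deflated with no shared dictionary or cross-record context.

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class RecordCompressor {

        // Compress one record with no context carried over from other records,
        // so any record can be decompressed on its own by any map task.
        static byte[] compressRecord(String record) {
            byte[] input = record.getBytes(StandardCharsets.UTF_8);
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            deflater.end();
            return out.toByteArray();
        }

        // Decompress a single record without needing any preceding records,
        // unlike a whole-file gzip stream, which must be read sequentially.
        static String decompressRecord(byte[] compressed) throws Exception {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
            }
            inflater.end();
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws Exception {
            byte[] c = compressRecord("2017-06-29,sensor-42,21.7");
            System.out.println(decompressRecord(c));
        }
    } en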
dc.format.mimetype application/pdf en
dc.language.iso en en
dc.publisher Institute of Electrical and Electronics Engineers (IEEE) en
dc.rights © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. en
dc.subject Algorithm design and analysis en
dc.subject Big Data en
dc.subject Data analysis en
dc.subject Data compression en
dc.subject Distributed databases en
dc.subject Organizations en
dc.subject Servers en
dc.subject Compression en
dc.subject Distributed File System en
dc.subject MapReduce en
dc.title Content-aware partial compression for textual big data analysis in Hadoop en
dc.type Article (peer-reviewed) en
dc.internal.authorcontactother John Herbert, Computer Science, University College Cork, Cork, Ireland. +353-21-490-3000 Email: j.herbert@cs.ucc.ie en
dc.internal.availability Full text available en
dc.date.updated 2018-02-13T12:23:29Z
dc.description.version Accepted Version en
dc.internal.rssid 425602913
dc.description.status Peer reviewed en
dc.identifier.journaltitle IEEE Transactions on Big Data en
dc.internal.copyrightchecked No !!CORA!! en
dc.internal.licenseacceptance Yes en
dc.internal.IRISemailaddress j.herbert@cs.ucc.ie en
dc.internal.bibliocheck In Press, February 2018. Update citation details, page numbers, add volume, issue en

