Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata
2017 (English). In: Proceedings - International Conference on Distributed Computing Systems, 2017, p. 2525-2528. Conference paper, Published paper (Refereed)
Abstract [en]
Hadoop is a popular system for storing, managing, and processing large volumes of data, but it has bare-bones internal support for metadata, as metadata is a bottleneck and less means more scalability. The result is a scalable platform with rudimentary access control that is neither user- nor developer-friendly. Also, metadata services that are built on Hadoop, such as SQL-on-Hadoop, access control, data provenance, and data governance are necessarily implemented as eventually consistent services, resulting in increased development effort and more brittle software. In this paper, we present a new project-based multi-tenancy model for Hadoop, built on a new distribution of Hadoop that provides a distributed database backend for the Hadoop Distributed Filesystem's (HDFS) metadata layer. We extend Hadoop's metadata model to introduce projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience. As our metadata service is backed by a transactional database, developers can easily extend metadata by adding new tables and ensure the strong consistency of extended metadata using both transactions and foreign keys.
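The abstract's closing claim, that extended metadata can stay strongly consistent through transactions and foreign keys, can be illustrated with a minimal sketch. It is not the Hopsworks schema or API: it uses Python's built-in `sqlite3` as a stand-in for the distributed database, and the table names (`inodes`, `dataset_tags`) and helper functions are hypothetical, chosen only for illustration.

```python
import sqlite3


def open_metadata_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for a transactional metadata store."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- Hypothetical core metadata table (one row per file).
        CREATE TABLE inodes (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- Hypothetical extension table added by a developer.
        -- The foreign key ties every tag to an existing inode, and
        -- ON DELETE CASCADE removes tags when the inode is deleted.
        CREATE TABLE dataset_tags (
            inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,
            tag      TEXT NOT NULL,
            PRIMARY KEY (inode_id, tag)
        );
        """
    )
    return conn


def add_file_with_tag(conn: sqlite3.Connection, inode_id: int, name: str, tag: str) -> None:
    """Insert a file and its tag atomically: both rows commit or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO inodes (id, name) VALUES (?, ?)", (inode_id, name))
        conn.execute(
            "INSERT INTO dataset_tags (inode_id, tag) VALUES (?, ?)", (inode_id, tag)
        )


def delete_file(conn: sqlite3.Connection, inode_id: int) -> None:
    """Delete a file; the foreign key cascade removes its extended metadata."""
    with conn:
        conn.execute("DELETE FROM inodes WHERE id = ?", (inode_id,))


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in one of the two sketch tables."""
    if table not in ("inodes", "dataset_tags"):
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

In this sketch, a tag can never outlive or predate the file it describes: inserting a tag for a missing inode raises `sqlite3.IntegrityError`, and deleting the inode removes its tags in the same transaction. An eventually consistent design would have to reconcile such mismatches after the fact.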
Place, publisher, year, edition, pages
2017. p. 2525-2528
Keywords [en]
Data Management, Dynamic Roles, Hadoop, Multi-tenancy, Access control, Data flow analysis, Information management, Metadata, Data provenance, Distributed database, Metadata services, Strong consistency, Transactional database, Distributed computer systems
National Category
Natural Sciences
Identifiers
URN: urn:nbn:se:ri:diva-30835
DOI: 10.1109/ICDCS.2017.41
Scopus ID: 2-s2.0-85027275789
ISBN: 9781538617915 (print)
OAI: oai:DiVA.org:ri-30835
DiVA, id: diva2:1139370
Conference
37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, 5 June 2017 through 8 June 2017
2017-09-07, 2017-09-07, 2023-05-22. Bibliographically approved