Block-distributed gradient boosted trees
RISE - Research Institutes of Sweden, ICT, SICS. ORCID iD: 0000-0002-8180-7521
Amazon Web Services, US.
KTH Royal Institute of Technology, Sweden.
2019 (English). In: SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, Inc., 2019, p. 1025-1028. Conference paper, Published paper (Refereed).
Abstract [en]

The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for high-dimensional data. In order to allow for scalability across both the data point and feature dimensions, and reduce communication cost, we propose block-distributed GBTs. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the block-distributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve multiple orders of magnitude reduction in communication cost for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for high-dimensional data, and allow more cost-effective scale-out without the need for expensive network communication. 
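The abstract's core idea, partitioning the training data across both rows (data points) and columns (features), can be illustrated with a minimal sketch. This is not the paper's implementation: the block-grid helpers, their names, and the toy dataset below are assumptions for illustration only.

```python
# Illustrative sketch (NOT the paper's implementation): assigning the
# nonzero entries of a sparse dataset to an R x C grid of blocks, so that
# workers can scale with both the number of rows and the number of features.

def block_id(row, col, n_rows, n_cols, r_blocks, c_blocks):
    """Map a nonzero entry to the (row-block, column-block) that owns it."""
    rb = min(row * r_blocks // n_rows, r_blocks - 1)
    cb = min(col * c_blocks // n_cols, c_blocks - 1)
    return rb, cb

def partition(nonzeros, n_rows, n_cols, r_blocks, c_blocks):
    """Distribute sparse (row, col, value) triples across the block grid.

    Only nonzeros are stored, so each block's cost tracks the data's
    sparsity rather than the full n_rows x n_cols dimensions.
    """
    blocks = {(rb, cb): [] for rb in range(r_blocks) for cb in range(c_blocks)}
    for row, col, val in nonzeros:
        key = block_id(row, col, n_rows, n_cols, r_blocks, c_blocks)
        blocks[key].append((row, col, val))
    return blocks

# A toy sparse dataset: 4 rows, 6 features, only 5 nonzero entries.
data = [(0, 1, 1.0), (1, 4, 2.0), (2, 0, 0.5), (3, 5, 1.5), (3, 2, 3.0)]
blocks = partition(data, n_rows=4, n_cols=6, r_blocks=2, c_blocks=2)
```

In a row-distributed setup, adding features forces every worker to hold (and communicate statistics for) every column; in the block-distributed grid above, each worker touches only its own row and column range, which is the scalability property the paper targets.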

Place, publisher, year, edition, pages
Association for Computing Machinery, Inc., 2019. p. 1025-1028
Keywords [en]
Communication Efficiency, Distributed Systems, Gradient Boosted Trees, Scalability, Clustering algorithms, Cost effectiveness, Cost reduction, Efficiency, Forestry, Information retrieval, Learning algorithms, Machine learning, Click-through rate, Communication cost, Feature dimensions, High dimensional data, Massive data sets, Network communications, Trees (mathematics)
National Category
Natural Sciences
Identifiers
URN: urn:nbn:se:ri:diva-40613
DOI: 10.1145/3331184.3331331
Scopus ID: 2-s2.0-85073774997
ISBN: 9781450361729 (print)
OAI: oai:DiVA.org:ri-40613
DiVA, id: diva2:1373052
Conference
42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, 21-25 July 2019
Available from: 2019-11-26. Created: 2019-11-26. Last updated: 2019-12-04. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Vasiloudis, Theodore
