Publications (5 of 5)
Varisteas, G. & Brorsson, M. (2016). Palirria: accurate on-line parallelism estimation for adaptive work-stealing. Concurrency and Computation, 28(2), 472-491
Palirria: accurate on-line parallelism estimation for adaptive work-stealing
2016 (English). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 28, no. 2, p. 472-491. Article in journal (Refereed). Published.
Abstract [en]

We present Palirria, a self-adapting work-stealing scheduling method for nested fork/join parallelism that estimates the number of utilizable workers and self-adapts accordingly. The estimation mechanism is optimized for accuracy, minimizing the requested resources without degrading performance. We implemented Palirria for both the Linux and Barrelfish operating systems and evaluated it on two platforms: a 48-core Non-Uniform Memory Access (NUMA) multiprocessor and a simulated 32-core system. Compared with the state of the art, we observed higher accuracy in estimating resource requirements, which leads to improved resource utilization and performance on par with or better than executing with fixed resource allotments.
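The self-adaptation idea in the abstract can be illustrated with a toy allotment policy. The sketch below is hypothetical and does not reproduce Palirria's estimator (which inspects worker deques in a specific, accuracy-optimized way); it only shows the shape of the decision: grow the allotment when every worker has surplus queued work, shrink it when most workers are idle.

```python
def estimate_workers(task_counts, current_workers, max_workers):
    """Toy adaptive allotment: grow when every worker's deque holds
    surplus tasks, shrink when most deques are empty.

    task_counts: queued-task count per worker (illustrative input;
    Palirria's real estimator is different). Returns the new allotment.
    """
    busy = sum(1 for n in task_counts if n > 0)
    if busy == len(task_counts):            # all workers have queued work
        return min(current_workers + 1, max_workers)
    if busy <= len(task_counts) // 2:       # majority of deques are empty
        return max(current_workers - 1, 1)
    return current_workers                  # keep the current allotment
```

A runtime would call such a policy periodically and request or release workers from the OS accordingly; the point of an accurate estimator is to avoid requesting cores the program cannot utilize.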

Place, publisher, year, edition, pages
John Wiley and Sons Ltd, 2016
Keywords
adaptive, load balancing, multicore, parallel, resource management, runtime, scheduler, task, work-stealing, workload, Computer operating systems, Resource allocation, Scheduling, Multi core, Runtimes, Network management
National Category
Natural Sciences
Identifiers
urn:nbn:se:ri:diva-41869 (URN)10.1002/cpe.3630 (DOI)2-s2.0-84956796768 (Scopus ID)
Available from: 2019-12-12. Created: 2019-12-12. Last updated: 2020-12-01. Bibliographically approved.
Muddukrishna, A., Jonsson, P. & Brorsson, M. (2015). Characterizing task-based OpenMP programs. PLOS ONE, 10(4), Article ID e0123545.
Characterizing task-based OpenMP programs
2015 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 10, no. 4, article id e0123545. Article in journal (Refereed). Published.
Abstract [en]

Programmers struggle to understand the performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely used Barcelona OpenMP Tasks Suite. Using our method, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools.
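The per-task (rather than per-thread) measurement the abstract argues for can be sketched with a simple timing wrapper. The names below (`profiled_task`, `task_times`) are illustrative inventions, not the paper's tooling, and wall-clock timing stands in for the paper's richer per-task instruction profiles.

```python
import time
from collections import defaultdict

# Durations collected per task type, the task-centric view that
# thread-based profilers do not provide.
task_times = defaultdict(list)

def profiled_task(name):
    """Decorator recording wall-clock duration of every task instance."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            task_times[name].append(time.perf_counter() - start)
            return result
        return inner
    return wrap

@profiled_task("fib")
def fib(n):
    # Stand-in for a recursive OpenMP task: each call is one task instance.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

After running `fib(10)`, `task_times["fib"]` holds one sample per task instance, from which per-task distributions (count, mean, outliers) can be characterized.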

Place, publisher, year, edition, pages
Public Library of Science, 2015
Keywords
Article, automation, computer interface, computer program, controlled study, data analysis, data processing, information processing, information system, program cost effectiveness, human, software, task performance, Humans, Task Performance and Analysis
National Category
Natural Sciences
Identifiers
urn:nbn:se:ri:diva-41878 (URN)10.1371/journal.pone.0123545 (DOI)2-s2.0-84929498034 (Scopus ID)
Available from: 2019-12-12. Created: 2019-12-12. Last updated: 2021-06-14. Bibliographically approved.
Muddukrishna, A., Jonsson, P. & Brorsson, M. (2015). Locality-aware task scheduling and data distribution for OpenMP programs on NUMA systems and manycore processors. Scientific Programming, 2015, Article ID 981759.
Locality-aware task scheduling and data distribution for OpenMP programs on NUMA systems and manycore processors
2015 (English). In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, Vol. 2015, article id 981759. Article in journal (Refereed). Published.
Abstract [en]

Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load balancing at the expense of locality and ignore NUMA node/manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from reasoning about NUMA system/manycore processor architecture details by delegating data distribution to the runtime system, and uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and identify that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies, while still providing an architecture-oblivious approach for programmers.
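A minimal sketch of the two ingredients named in the abstract: runtime-managed data distribution and dependence-guided, locality-aware task placement. The policies below (round-robin chunk placement, majority vote over a task's input chunks) are simplified stand-ins chosen for illustration, not the paper's actual runtime.

```python
def distribute_chunks(num_chunks, num_nodes):
    """Round-robin placement of data chunks over NUMA nodes, one simple
    policy a runtime could apply on the programmer's behalf."""
    return {c: c % num_nodes for c in range(num_chunks)}

def schedule_task(task_input_chunks, chunk_home):
    """Pick the node holding most of a task's input chunks, so the task
    runs near its data and stalls less on remote accesses."""
    votes = {}
    for chunk in task_input_chunks:
        node = chunk_home[chunk]
        votes[node] = votes.get(node, 0) + 1
    return max(votes, key=votes.get)
```

The key design point is that the programmer names only the task's data dependences; the mapping from data to nodes and from tasks to nodes stays inside the runtime, which is what makes the approach architecture-oblivious.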

Place, publisher, year, edition, pages
IOS Press, 2015
Keywords
Application programming interfaces (API), Architectural design, Benchmarking, Computer architecture, Multiprocessing systems, Network management, Scheduling, Scheduling algorithms, Software architecture, Architectural knowledge, Data distribution, Improve performance, Many-core processors, Non uniform data, Performance degradation, Processor architectures, Scheduling techniques, Multitasking
National Category
Natural Sciences
Identifiers
urn:nbn:se:ri:diva-41881 (URN)10.1155/2015/981759 (DOI)2-s2.0-84947272497 (Scopus ID)
Available from: 2019-12-12. Created: 2019-12-12. Last updated: 2020-12-01. Bibliographically approved.
Bhatti, M. K., Oz, I., Popov, K., Muddukrishna, A. & Brorsson, M. (2014). Noodle: A heuristic algorithm for task scheduling in MPSoC architectures. In: Proceedings - 2014 17th Euromicro Conference on Digital System Design, DSD 2014. Paper presented at the 17th Euromicro Conference on Digital System Design, DSD 2014, 27-29 August 2014 (pp. 667-670). Institute of Electrical and Electronics Engineers Inc., Article ID 6927309.
Noodle: A heuristic algorithm for task scheduling in MPSoC architectures
2014 (English). In: Proceedings - 2014 17th Euromicro Conference on Digital System Design, DSD 2014, Institute of Electrical and Electronics Engineers Inc., 2014, p. 667-670, article id 6927309. Conference paper, Published paper (Refereed).
Abstract [en]

Task scheduling is crucial for the performance of parallel applications. Given dependence constraints between tasks, their arbitrary sizes, and the bounded resources available for execution, optimal task scheduling is an NP-hard problem; proposed scheduling algorithms are therefore based on heuristics. This paper presents a novel heuristic algorithm, called the Noodle heuristic, which differs from existing list scheduling techniques in the way it assigns task priorities. We conduct an extensive experimental evaluation to validate Noodle on task graphs taken from the Standard Task Graph (STG) set. Results show that Noodle produces schedules that are within 12% (worst case) of the optimal schedule for 2-, 4-, and 8-core systems. We also compare Noodle with existing scheduling heuristics and perform a comparative analysis of its performance.
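For context, the list scheduling family the abstract refers to can be sketched generically: repeatedly pick the ready task with the highest priority and place it on the earliest-free core. Since Noodle's novelty is precisely how priorities are assigned, the priority function below is left pluggable rather than guessed; all names are illustrative.

```python
def list_schedule(tasks, deps, durations, num_cores, priority):
    """Generic list scheduler over a task DAG.

    deps: edges (a, b) meaning b depends on a. priority: pluggable
    function from task to priority value (Noodle would supply its own).
    Returns the makespan of the produced schedule.
    """
    preds = {t: set() for t in tasks}
    for a, b in deps:
        preds[b].add(a)
    core_free = [0] * num_cores        # time each core becomes free
    finish = {}                        # finish time of scheduled tasks
    remaining = set(tasks)
    while remaining:
        # Ready = all predecessors already scheduled.
        ready = [t for t in remaining if preds[t].issubset(finish)]
        t = max(ready, key=priority)   # highest-priority ready task
        c = min(range(num_cores), key=lambda i: core_free[i])
        start = max([core_free[c]] + [finish[p] for p in preds[t]])
        finish[t] = start + durations[t]
        core_free[c] = finish[t]
        remaining.remove(t)
    return max(finish.values())
```

Different priority functions (task duration, critical-path length, or Noodle's assignment) plug into the same loop, which is why the paper can compare heuristics on a common footing.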

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2014
Keywords
Directed Acyclic Graph (DAG), List Scheduling, Multiprocessor System-on-Chip(MPSoC), Parallel Computing, Algorithms, Application specific integrated circuits, Computational complexity, Directed graphs, Heuristic algorithms, Microprocessor chips, Multiprocessing systems, Multitasking, Optimization, Parallel processing systems, Scheduling, System-on-chip, Comparative analysis, List-scheduling, MPSoC architectures, Multiprocessor system on chips, Optimal schedule, Parallel application, Scheduling heuristics, Scheduling algorithms
National Category
Engineering and Technology
Identifiers
urn:nbn:se:ri:diva-46489 (URN)10.1109/DSD.2014.71 (DOI)2-s2.0-84928812217 (Scopus ID)9781479957934 (ISBN)
Conference
17th Euromicro Conference on Digital System Design, DSD 2014, 27-29 August 2014
Available from: 2020-08-24. Created: 2020-08-24. Last updated: 2023-05-25. Bibliographically approved.
Muddukrishna, A., Jonsson, P. A., Vlassov, V. & Brorsson, M. (2013). Locality-aware task scheduling and data distribution on NUMA systems. In: Lecture Notes in Computer Science, Vol. 8122, pp. 156-170. Paper presented 16-18 September 2013, Canberra, ACT.
Locality-aware task scheduling and data distribution on NUMA systems
2013 (English). In: Lecture Notes in Computer Science, 2013, Vol. 8122, p. 156-170. Conference paper, Published paper (Refereed).
Abstract [en]

Modern parallel computer systems exhibit Non-Uniform Memory Access (NUMA) behavior. For best performance, any parallel program therefore has to match data allocation and scheduling of computations to the memory architecture of the machine. Done manually, this is a tedious process, and since each system has its own peculiarities, it also leads to programs that are not performance-portable. We propose the use of a data distribution scheme in which NUMA hardware peculiarities are abstracted away from the programmer and data distribution is delegated to a runtime system, which is generated once for each machine. In addition, we propose using task data dependence information, made possible by the OpenMP 4.0RC2 proposal, to guide the scheduling of OpenMP tasks and further reduce data stall times. We demonstrate the viability and performance of our proposals on a four-socket AMD Opteron machine with eight NUMA nodes. We identify that both data distribution and locality-aware task scheduling improve performance compared to default policies, while still providing an architecture-oblivious approach for the programmer.
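The task data dependence information mentioned in the abstract (OpenMP 4.0 `depend` clauses) lets a runtime derive a task graph before any task runs. Below is a simplified sketch of that derivation, assuming each task declares the addresses it reads (`in`) and writes (`out`); the function name and the reduction to read-after-write and write-after-write edges are illustrative simplifications, not the paper's runtime.

```python
def build_dependences(task_clauses):
    """Derive task-to-task edges from OpenMP-style depend clauses.

    task_clauses: per task, a pair (ins, outs) of address lists.
    A task depends on the last task that wrote any address it reads
    (read-after-write) or rewrites (write-after-write).
    """
    last_writer = {}   # address -> id of the task that last wrote it
    edges = set()
    for tid, (ins, outs) in enumerate(task_clauses):
        for addr in ins:
            if addr in last_writer:
                edges.add((last_writer[addr], tid))
        for addr in outs:
            if addr in last_writer:
                edges.add((last_writer[addr], tid))
            last_writer[addr] = tid
    return sorted(edges)
```

With these edges the runtime knows, for each task, which data it will touch and which tasks produced that data, which is exactly the information a locality-aware scheduler needs to place the task near its inputs.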

Keywords
Data allocation, Data dependence, Data distribution, Data distribution schemes, Individual systems, Non uniform memory access, Parallel computer systems, Parallel program, Electric equipment, Memory architecture, Multitasking, Scheduling, Scheduling algorithms, Application programming interfaces (API)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:ri:diva-48578 (URN)10.1007/978-3-642-40698-0_12 (DOI)2-s2.0-84883296523 (Scopus ID)9783642406973 (ISBN)
Conference
16-18 September 2013, Canberra, ACT
Available from: 2020-10-08. Created: 2020-10-08. Last updated: 2020-12-01. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-9637-2065