Autonomic managers are the main architectural building blocks for constructing self-management capabilities of computing systems and applications. One of the major challenges in developing self-managing applications is the robustness of the management elements that form autonomic managers. We believe that transparent handling of the effects of resource churn (joins/leaves/failures) on management should be an essential feature of a platform for self-managing large-scale dynamic distributed applications, because it facilitates the development of robust autonomic managers and hence improves the robustness of self-managing applications. This feature can be achieved by providing a robust management element abstraction that hides churn from the programmer. In this paper, we present a generic approach to achieving robust services that is based on finite state machine replication with dynamic reconfiguration of replica sets. We contribute a decentralized algorithm that maintains the set of nodes hosting service replicas in the presence of churn. We use this approach to implement robust management elements as robust services that can operate despite churn. Our decentralized algorithm uses peer-to-peer replica placement schemes to automate replicated state machine migration in order to tolerate churn, and it exploits the lookup and failure detection facilities of a structured overlay network to manage the set of active replicas. Using the proposed approach, we can achieve a long-running, highly available service, without human intervention, in the presence of resource churn. To validate and evaluate our approach, we have implemented a prototype that includes the proposed algorithm.
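For concreteness, a minimal sketch of the kind of replica-set maintenance this enables is shown below. It is not the paper's algorithm: the overlay `lookup` function and the `migrate` callback are hypothetical stand-ins for the structured overlay network's lookup/failure-detection facilities and for the replica hand-over step.

```python
# Minimal sketch (not the paper's algorithm): maintaining the replica set of a
# replicated-state-machine service on a structured overlay. `lookup(key)` and
# `migrate(key, old, new)` are hypothetical primitives assumed to be provided
# by the overlay and the replication layer, respectively.

REPLICATION_DEGREE = 5
RING_SIZE = 2**32  # size of the overlay identifier space

def replica_keys(service_id: int) -> list[int]:
    """Symmetric-replication style placement: spread replica keys evenly
    around the identifier ring, derived only from the service id."""
    step = RING_SIZE // REPLICATION_DEGREE
    return [(service_id + i * step) % RING_SIZE for i in range(REPLICATION_DEGREE)]

def current_replica_set(service_id, lookup):
    """Resolve each replica key to the node currently responsible for it."""
    return [lookup(k) for k in replica_keys(service_id)]

def reconcile(service_id, configured, lookup, migrate):
    """Periodic maintenance: if churn changed who is responsible for a replica
    key, migrate that replica of the state machine to the new responsible node."""
    for key, node in zip(replica_keys(service_id), configured):
        responsible = lookup(key)       # overlay lookup masks joins/leaves/failures
        if responsible != node:
            migrate(key, node, responsible)  # hand over replica state and configuration
```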
Parallel computing systems promise higher performance for computationally intensive applications. Since programs for parallel systems consist of tasks that can be executed simultaneously, task scheduling becomes crucial for the performance of these applications. Given dependence constraints between tasks, their arbitrary sizes, and the bounded resources available for execution, optimal task scheduling is considered an NP-hard problem. Therefore, proposed scheduling algorithms are based on heuristics. This paper presents a novel list scheduling heuristic, called the Noodle heuristic. Noodle is a simple yet effective scheduling heuristic that differs from existing list scheduling techniques in the way it assigns task priorities. The priority mechanism of Noodle maintains proportionate fairness among all ready tasks belonging to all paths within a task graph. We conduct an extensive experimental evaluation of the Noodle heuristic with task graphs taken from the Standard Task Graph (STG) set. Our experimental study includes results for task graphs comprising 50, 100, and 300 tasks per graph and execution scenarios with 2-, 4-, 8-, and 16-core systems. We report results for the average Schedule Length Ratio (SLR) obtained by varying the Communication to Computation cost Ratio (CCR). We also analyse results for different degrees of parallelism and numbers of edges in the task graphs. Our results demonstrate that Noodle produces schedules that are within a maximum of 12% (in the worst case) of the optimal schedule for 2-, 4-, and 8-core systems. We also compare Noodle with existing scheduling heuristics and perform a comparative analysis of its performance. Noodle outperforms existing heuristics in terms of average SLR values.
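As background, the sketch below shows a generic list-scheduling skeleton under the usual assumptions (a task becomes ready when all its predecessors finish; the earliest-free core is chosen). The `priority` callable is a placeholder; Noodle's contribution is precisely its fairness-based priority assignment, which is not reproduced here, and communication costs are omitted for brevity.

```python
# Generic list-scheduling skeleton (not Noodle itself): a pluggable priority
# function picks among ready tasks, and each picked task is placed on the core
# that becomes free earliest.

def list_schedule(tasks, deps, cost, num_cores, priority):
    """tasks: list of task ids; deps: dict task -> set of predecessor ids;
    cost: dict task -> execution time; priority: callable(task) -> sortable key."""
    tasks = list(tasks)
    indegree = {t: len(deps.get(t, ())) for t in tasks}
    ready = [t for t in tasks if indegree[t] == 0]
    cores = [0.0] * num_cores            # time at which each core becomes free
    finish = {}                          # task -> finish time
    schedule = []
    while ready:
        ready.sort(key=priority)         # Noodle would rank by its fairness-based priority here
        task = ready.pop(0)
        core = min(range(num_cores), key=cores.__getitem__)
        start = max(cores[core],
                    max((finish[p] for p in deps.get(task, ())), default=0.0))
        finish[task] = start + cost[task]
        cores[core] = finish[task]
        schedule.append((task, core, start))
        for succ, preds in deps.items():  # release successors whose predecessors are all done
            if task in preds:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
    return schedule
```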
In systems with a complex many-core cache hierarchy, exploiting data locality can significantly reduce the execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimized for locality. However, these hardware optimizations are of little use if software scheduling does not arrange the execution in a manner that exploits the locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for performance on multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinct phases, namely the working task group formation phase (WTG-FP) and the working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks, while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs by varying three major performance parameters, namely (1) communication to computation ratio (CCR) between 0.1 and 1.0, (2) application size, i.e., task graphs comprising 50, 100, and 300 tasks per graph, and (3) number of cores, with 2-, 4-, 8-, and 16-core execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces the overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.
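To make the two-phase idea concrete, here is an illustrative sketch, not the actual LeTS algorithm: tasks that touch common data blocks are greedily grouped (the spirit of WTG-FP), and groups are then ordered so that groups sharing data run close together in time (the spirit of WTG-OP). The `task_data` mapping is a hypothetical input describing which data blocks each task accesses.

```python
# Illustrative two-phase sketch (not the actual LeTS algorithm): group tasks by
# shared data, then order the groups to keep reused data close in time.

def form_groups(task_data):
    """task_data: dict task -> set of data blocks it reads/writes.
    Greedily place a task into the first group it shares a block with."""
    groups = []
    for task, blocks in task_data.items():
        for group in groups:
            if group["blocks"] & blocks:          # shared data -> same working group
                group["tasks"].append(task)
                group["blocks"] |= blocks
                break
        else:
            groups.append({"tasks": [task], "blocks": set(blocks)})
    return groups

def order_groups(groups):
    """Greedy ordering: after the current group, pick the unscheduled group that
    shares the most data blocks with it, shortening the reuse distance."""
    if not groups:
        return []
    ordered, remaining = [groups[0]], groups[1:]
    while remaining:
        last = ordered[-1]["blocks"]
        nxt = max(remaining, key=lambda g: len(g["blocks"] & last))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered
```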
Task scheduling is crucial for the performance of parallel applications. Given dependence constraints between tasks, their arbitrary sizes, and the bounded resources available for execution, optimal task scheduling is considered an NP-hard problem. Therefore, proposed scheduling algorithms are based on heuristics. This paper presents a novel heuristic algorithm, called the Noodle heuristic, which differs from existing list scheduling techniques in the way it assigns task priorities. We conduct an extensive experimental evaluation to validate Noodle on task graphs taken from the Standard Task Graph (STG) set. Results show that Noodle produces schedules that are within a maximum of 12% (in the worst case) of the optimal schedule for 2-, 4-, and 8-core systems. We also compare Noodle with existing scheduling heuristics and perform a comparative analysis of its performance.
With the proliferation of multicore processors, there is an urgent need for tools and methodologies supporting the parallelization of existing applications. In this paper, we present a novel tool for aiding programmers in parallelizing programs. The tool, Embla, is based on the Valgrind framework and allows the user to discover the data dependences in a sequential program, thereby exposing opportunities for parallelization. Embla performs an off-line dynamic analysis and records dependences as they arise during program execution. It reports an optimistic view of parallelizable sequences, ignoring dependences that do not arise during execution. Moreover, since the tool instruments the machine code of the program, it is largely language independent. Since Embla finds the dependences that occur for particular executions, the confidence one would assign to its results depends on whether different executions yield different (bad) or largely the same (good) dependences. We present a preliminary investigation into this issue using 84 different inputs to the SPEC CPU 2006 benchmark 403.gcc. The results indicate a strong correlation between coverage and the dependences found; executing the entire program is likely to reveal all dependences.
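The core recording idea behind such a dynamic analysis can be sketched as follows; the trace format is hypothetical and this is not Embla's implementation, which works on instrumented machine code rather than an explicit access trace.

```python
# Minimal sketch of dynamic dependence recording (hypothetical trace format):
# replay a stream of memory accesses and record a dependence whenever a location
# written by one source line is later read or overwritten by another line.

def record_dependences(trace):
    """trace: iterable of (kind, address, line) with kind in {'R', 'W'}.
    Returns the set of (writer_line, dependent_line) pairs observed."""
    last_writer = {}          # shadow memory: address -> source line of last write
    dependences = set()
    for kind, addr, line in trace:
        writer = last_writer.get(addr)
        if writer is not None and writer != line:
            dependences.add((writer, line))   # read-after-write or write-after-write
        if kind == 'W':
            last_writer[addr] = line
    return dependences

# Example: line 2 writes a location, line 5 reads it, line 7 overwrites it,
# so lines 5 and 7 depend on line 2 and cannot safely run before it.
trace = [('W', 0x1000, 2), ('R', 0x1000, 5), ('W', 0x1000, 7)]
print(record_dependences(trace))   # e.g. {(2, 5), (2, 7)}
```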
As multicore systems evolve by increasing the number of parallel execution units, parallel programming models have been introduced to exploit parallelism in applications. The task-based programming model uses task abstractions to specify parallel tasks and schedules them onto processors at runtime. To increase efficiency and achieve the highest performance, one needs to identify which runtime configuration is required and how processor cores must be shared among tasks. Exploring the design space of all possible scheduling and runtime options, especially for large input data, is infeasible and therefore calls for statistical modeling. Regression-based modeling determines the effects of multiple factors on a response variable and makes predictions based on statistical analysis. In this work, we propose a regression-based modeling approach to predict task-based program performance for different scheduling parameters with variable data size. We execute a set of task-based programs while varying the runtime parameters and conduct a systematic measurement of the factors influencing execution time. Our approach uses executions with different configurations for a set of input data and derives regression models to predict execution time for larger input data. Our results show that the regression models provide accurate predictions for validation inputs, with a mean error rate as low as 6.3%, and of 14% on average across the four task-based programs.
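A hedged illustration of the general approach is given below: an ordinary least-squares model of execution time as a function of runtime parameters, fitted on measurements for small inputs and used to extrapolate to a larger input. The feature set and the measurements are invented for illustration; the paper's models and error figures do not come from this sketch.

```python
# Sketch of regression-based performance modeling: fit execution time against
# runtime parameters (here: thread count and input size), then predict the time
# for a larger, unmeasured input. All numbers below are made up.
import numpy as np

# Columns: number of worker threads, input size; response: measured time (s).
observations = np.array([
    [2, 1_000, 4.1],
    [4, 1_000, 2.3],
    [8, 1_000, 1.4],
    [2, 2_000, 8.0],
    [4, 2_000, 4.5],
    [8, 2_000, 2.7],
])
X = np.column_stack([np.ones(len(observations)),        # intercept term
                     observations[:, 0], observations[:, 1]])
y = observations[:, 2]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)            # ordinary least squares

def predict(threads, size):
    return coef @ np.array([1.0, threads, size])

print(predict(8, 4_000))   # predicted time for a larger input on 8 threads
```

A simple linear model is used here only to show the workflow; in practice one would typically regress on transformed features (for example size divided by thread count) to better match how task-based runtimes scale.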
An efficient (un)marshaling framework is presented. It is designed for distributed applications implemented in languages such as C++. A marshaler/unmarshaler pair converts arbitrary structured data between its host and network representations. This technology can also be used for persistent storage. Our framework simplifies the design of efficient and flexible marshalers. Network latency is reduced by the concurrent execution of (un)marshaling and network operations. The framework is used in Mozart, a distributed programming system that implements Oz, a multi-paradigm concurrent language. Mozart, including the implementation of the framework, is available at http://www.mozart-oz.org.
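As a toy illustration of what marshaling arbitrary structured data involves, the sketch below linearizes nested (and possibly shared) structures, emitting a back-reference when a substructure has already been visited. It is unrelated to Mozart's actual wire format and is only meant to show the structure-preservation problem a marshaler must solve.

```python
# Toy graph-aware marshaler (illustrative only): substructures are assigned
# indices on first visit so that shared data and cycles are encoded once and
# thereafter by reference.

def marshal(value, seen=None, out=None):
    """Flatten nested lists (a stand-in for arbitrary structured data) into a
    linear token stream suitable for writing to a byte stream."""
    seen = {} if seen is None else seen
    out = [] if out is None else out
    if id(value) in seen:
        out.append(('ref', seen[id(value)]))       # back-reference to an already marshaled node
    elif isinstance(value, list):
        seen[id(value)] = len(seen)
        out.append(('list', len(value)))
        for item in value:
            marshal(item, seen, out)
    else:
        out.append(('atom', value))
    return out

shared = [1, 2]
record = [shared, shared]          # the shared sublist is emitted once, then referenced
print(marshal(record))
# [('list', 2), ('list', 2), ('atom', 1), ('atom', 2), ('ref', 1)]
```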
We discuss a parallel implementation of an agent-based simulation. Our approach allows a sequential simulator to be adapted for large-scale simulation on a cluster of workstations. We target discrete-time simulation models that capture the behavior of Web users and Web sites. Web users are connected with each other in a graph resembling a social network, and Web sites are connected in a similar graph. Users are stateful entities. At each time step, they exhibit behavior such as visiting bookmarked sites, exchanging information about Web sites in the "word-of-mouth" style, and updating bookmarks. We study the real-world phenomenon of emergent aggregate behavior of the Internet population. The system distributes data among workstations, which allows large-scale simulations that are infeasible on a stand-alone computer. The model properties cause traffic between workstations proportional to partition sizes, and network latency is hidden by the concurrent simulation of multiple users. The system is implemented in Mozart, which provides multithreading, dataflow variables, component-based software development, and network transparency. Currently we can simulate up to 1 million Web users on 100 million Web sites using a cluster of 16 computers, which takes a few seconds per simulation step; for a problem of the same size, parallel simulation offers speedups between 11 and 14.
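A simplified sketch of the per-step structure is shown below; it is illustrative only (the Mozart implementation is not reproduced), and the `state.act` behavior function is a hypothetical stand-in for the user behavior model.

```python
# Simplified sketch of a distributed discrete-time step: users are partitioned
# across workers; each step, every worker simulates its own users, and the
# resulting cross-partition messages (e.g. "word-of-mouth" recommendations)
# are exchanged before the next step begins.

def partition(users, num_workers):
    """Static partitioning of integer user ids across workers."""
    return {w: [u for u in users if u % num_workers == w] for w in range(num_workers)}

def simulate_step(worker_users, state, inbox):
    """Simulate one step for one partition; return messages keyed by destination user.
    `state.act` is a hypothetical behavior model yielding (friend, site) pairs."""
    outbox = {}
    for user in worker_users:
        for friend, site in state.act(user, inbox.get(user, [])):
            outbox.setdefault(friend, []).append(site)   # recommendation sent to a friend
    return outbox

def run(users, state, num_workers, steps):
    parts = partition(users, num_workers)
    inbox = {}
    for _ in range(steps):
        outboxes = [simulate_step(parts[w], state, inbox) for w in range(num_workers)]
        inbox = {}                                        # merging stands in for the network exchange
        for out in outboxes:
            for user, msgs in out.items():
                inbox.setdefault(user, []).extend(msgs)
```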
We discuss a parallel implementation of an agent-based simulation. Our approach allows a sequential simulator to be adapted for large-scale simulation on a cluster of workstations. We target discrete-time simulation models that capture the behavior of the WWW. We study the real-world phenomenon of emergent aggregate behavior of the Internet population. The system distributes data among workstations, which allows large-scale simulations that are infeasible on a stand-alone computer. The model properties cause traffic between workstations proportional to partition sizes, and network latency is hidden by the concurrent simulation of multiple users. The system is implemented in Mozart, which provides multithreading, dataflow variables, component-based software development, and network transparency. Currently we can simulate up to 10^6 Web users on 10^4 Web sites using a cluster of 16 computers, which takes a few seconds per simulation step; for a problem of the same size, parallel simulation offers speedups between 11 and 14.
The research goal is to develop a large-scale agent-based simulation environment to support implementations of Internet simulation applications. Small-world (SW) graphs are used to model Web sites and the social networks of Internet users. Each vertex represents the identity of a simple agent. In order to cope with scalability issues, we have to consider distributed parallel processing. The focus of this paper is to present two parallel-distributed algorithms for the construction of a particular type of SW graph called the Beta-model. The first algorithm serializes the graph construction, while the second constructs the graph in parallel.
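For reference, a sequential sketch of Beta-model construction in the Watts-Strogatz style (ring lattice plus probabilistic rewiring) is given below, assuming that is the intended Beta-model; it shows what the two parallel-distributed algorithms have to construct, not the algorithms themselves.

```python
# Sequential reference sketch of Beta-model (Watts-Strogatz style) construction:
# start from a ring lattice where each vertex connects to its k nearest
# neighbours, then rewire each edge with probability beta to a random vertex.
import random

def beta_model(n, k, beta, seed=0):
    rng = random.Random(seed)
    edges = set()
    for v in range(n):                      # ring lattice: k/2 neighbours on each side
        for j in range(1, k // 2 + 1):
            edges.add((v, (v + j) % n))
    rewired = set()
    for (u, v) in edges:
        if rng.random() < beta:             # rewire the far endpoint at random
            w = rng.randrange(n)
            while w == u or (u, w) in rewired or (w, u) in rewired:
                w = rng.randrange(n)
            rewired.add((u, w))
        else:
            rewired.add((u, v))
    return rewired

print(len(beta_model(n=20, k=4, beta=0.2)))   # small example: 40 edges before collapsing duplicates
```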