This chapter also incidentally exposes the reader to a range of effective tools and performance evaluation techniques that will inform their own analyses. As an example, consider these passages from the analysis of the performance characteristics of GTC, a 3D particle-in-cell code for studying magnetic confinement plasmas. After describing how the code decomposes the computational domain, the authors go on to make explicit the hardware implications of the decomposition -- a step that is probably unnecessary for the professional computationalist, but which will nevertheless prove very valuable for the many other HPC professionals thinking about the next generation of hardware and software and planning the investments that will get us there:
This particle-in-cell calculation uses a one-dimensional domain decomposition across the toroidal computational grid, causing each processor to exchange data with its two neighbors as particles cross the left and right boundaries. Therefore, on average each MPI task communicates with 4 neighbors, so much simpler interconnect networks will be adequate for its requirements.
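To make the quoted decomposition concrete, here is a minimal sketch of a periodic one-dimensional decomposition around a torus. It is not taken from GTC: the function names, the use of a toroidal angle in radians, and the equal-width domains are all assumptions made for illustration.

```python
import math


def ring_neighbors(rank, num_domains):
    """Return the (left, right) neighbor ranks on a periodic 1D decomposition.

    Hypothetical helper: the torus wraps around, so rank 0 and the last
    rank are neighbors of each other.
    """
    if num_domains < 1 or not 0 <= rank < num_domains:
        raise ValueError("rank must satisfy 0 <= rank < num_domains")
    left = (rank - 1) % num_domains
    right = (rank + 1) % num_domains
    return left, right


def owning_domain(toroidal_angle, num_domains):
    """Map a particle's toroidal angle (radians) to the domain that owns it.

    Assumes equal-width domains covering [0, 2*pi); angles outside that
    range are wrapped back onto the torus first.
    """
    width = 2.0 * math.pi / num_domains
    wrapped = toroidal_angle % (2.0 * math.pi)
    # The final modulo guards against rounding at the upper boundary.
    return int(wrapped // width) % num_domains
```

A particle that drifts past a domain's left or right boundary changes its `owning_domain` to one of the two ranks returned by `ring_neighbors`, which is why only nearest-neighbor exchange is needed along this dimension.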
After looking fairly closely at the characteristics of each of the five applications, the authors bring the whole set of results together and draw broad conclusions about these codes and their potential for successful petascale deployment based on the evidence gathered. It's a great chapter to open the book with. An early example of a chapter that makes a deep dive into a specific application domain is chapter six on the numerical prediction of "high-impact" local weather.
This chapter exemplifies one of the real strengths of the book: each domain-specific chapter provides enough context and detail to bring along non-experts in the domain, while still managing to cover enough detail that the reader achieves a working knowledge of the high-level drivers in scientific applications. As a result, readers leave these chapters knowing enough about why an application does what it does to think critically about the impact that specific petascale hardware and software features will have on performance.
In chapter six, for example, the authors briefly introduce operational weather forecasting and describe the current state of the practice: forecasting at a resolution of 25 km. To motivate increasing this resolution, the authors describe what will be needed both to expand the geographic area covered by predictions and to resolve features of high local interest, such as thunderstorms and tornadoes: Even at km horizontal grid spacing, important weather systems that are directly responsible for meteorological hazards including thunderstorms, heavy precipitation and tornadoes cannot be directly resolved because of their small sizes.
For individual storm cells, horizontal grid resolutions of at least 1 km are generally believed to be necessary, while even higher resolutions are needed to resolve less organized storms and the internal circulation within the storm cells. The primary difference lies with the memory and memory access speeds. And this analysis continues for data distribution, load balancing issues, scalability, and so on. There is also discussion of specific architectures and approaches for reaching the petascale, with emphasis on how these will influence application design and optimization.
For example, chapter 8 considers reaching beyond the petascale by functionally dividing amenable computations onto separate supercomputers, an approach the authors call "distributed petascale computing." Later chapters, for example chapter 21, discuss a specific annotation-based approach for performance portability, an important topic to address if indeed petascale architectures end up being as diverse as many expect.
In chapter 13 the authors frame the discussions about massive concurrency and enormous computer systems in a different way: distributing applications over tens or hundreds of thousands of processors is going to create a driver for applications to recover from hardware faults. This chapter discusses FT-MPI, a fault-tolerant MPI implementation, along with a diskless checkpointing approach that, combined, can provide the application developer a good starting point for developing more robust applications.
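The review does not reproduce the chapter's scheme, so the following is only a sketch of one well-known form of diskless checkpointing, assumed here for illustration: each process keeps its checkpoint in memory, and a spare process holds the element-wise sum of all checkpoints. The function names are hypothetical, and a real implementation would typically compute the checksum with a collective reduction rather than a Python loop.

```python
def make_checksum(states):
    """Element-wise sum of every process's in-memory checkpoint.

    `states` is a list of equal-length lists of numbers, one per process
    (a stand-in for the local application data being checkpointed).
    """
    return [sum(values) for values in zip(*states)]


def recover_lost_state(states, checksum, failed):
    """Rebuild the checkpoint of the failed process from the survivors.

    The entry states[failed] is ignored, as if that memory were lost.
    Assumes exactly one failure and at least one surviving process.
    """
    survivors = [s for i, s in enumerate(states) if i != failed]
    if not survivors:
        raise ValueError("need at least one surviving process")
    return [c - sum(values) for c, values in zip(checksum, zip(*survivors))]
```

With floating-point data the recovered values can differ from the originals by rounding error, which is one reason such schemes are evaluated inside iterative solvers that tolerate small perturbations.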
In any discussion of these technologies, the question of performance overhead will naturally and properly arise, and the authors present results in the context of a PCG algorithm. In fact, throughout this book the focus stays on performance -- not just the numbers, but what drives the numbers to fall out the way they do, and, perhaps just as important, why we need the performance in the first place.
High-performance computing will enable breakthrough science and engineering in the 21st century. May this book inspire you to solve computational grand challenges that will help our society, protect our environment, and improve our understanding in fundamental ways, all through the efficient use of petascale computing. David A. Bader has been a pioneer in the field of high-performance computing for problems in bioinformatics and computational genomics. He has co-authored over 80 articles in peer-reviewed journals and conferences, and his main areas of research are in parallel algorithms, combinatorial optimization, and computational biology and genomics.
Petascale Computing: Algorithms and Applications / Edition 1
Abstract
After a decade in which HEC (high-end computing) capability was dominated by the rapid pace of improvements to CPU clock frequency, the performance of next-generation supercomputers is increasingly differentiated by varying interconnect designs and levels of integration. Understanding the trade-offs of these system designs, in the context of high-end numerical simulations, is a key step towards making effective petascale computing a reality.
A novel aspect of our study is the emphasis on full applications, with real input data at the scale desired by computational scientists in their unique domain. We examine five candidate ultra-scale applications, representing a broad range of algorithms and computational structures. Our work includes the highest concurrency experiments to date on five of our six applications, including 32K processor scalability for two of our codes, and describes several successful optimization strategies.
Overall results indicate that our evaluated codes have the potential to effectively utilize petascale resources; however, several applications will require reengineering to incorporate the additional levels of parallelism necessary to utilize the vast concurrency of upcoming ultra-scale systems.
However, harnessing such extreme computing power will require an unprecedented degree of parallelism both within the scientific applications and at all levels of the underlying architectural platforms.
Unlike a decade ago, when the trend of HEC (high-end computing) systems was clearly towards building clusters of commodity components, today one sees a much more diverse set of HEC models. Increasing concerns over power efficiency are likely to further accelerate recent trends towards architectural diversity through new interest in customization and tighter system integration. Power dissipation concerns are also driving high-performance computing (HPC) system architectures from the historical trend of geometrically increasing clock rates towards geometrically increasing core counts (multicore), leading to daunting levels of concurrency for future petascale systems.
Understanding the trade-offs of these computing paradigms, in the context of high-end numerical simulations, is a key step towards making effective petascale computing a reality. The main contribution of this work is to quantify these trade-offs by examining the effectiveness of various architectural models for HEC with respect to absolute performance and scalability across a broad range of key scientific domains. A novel aspect of our effort is the emphasis on full applications, with real input data at the scale desired by computational scientists in their unique domain, which builds on our previous efforts [20, 21, 5, 19] and complements a number of other related studies [6, 18, 31, 11].
This work represents one of the most comprehensive performance evaluation studies to date on modern HEC platforms. Additionally, we implement several optimizations for the HyperCLaw AMR calculation, and show significantly improved performance and scalability on the X1E vector platform, compared with previously published studies. Overall, we believe that these comprehensive evaluation efforts lead to more efficient use of community resources in both current installations and in future designs.
Table 1. Minimum latency for the XT3 torus. There is a nominal additional latency of 50ns per hop through the torus. There is an additional latency of up to 69ns per.
The experiments in this work were conducted under AIX 5. Several experiments were also conducted on the 1, node 12, processor Power5-based Purple system at Lawrence Livermore National Laboratory. Jacquard contains dual-processor nodes and runs Linux 2. Each node contains two 2. The Opteron uses a single-instruction multiple-data (SIMD) floating-point unit accessed via the SSE2 instruction set extensions, and can execute two double-precision floating-point operations per cycle.
The processor possesses an on-chip DDR memory controller as well as tightly-integrated switch interconnection via HyperTransport. Each node of the Cray XT3 contains a single, dual-core 2. However, the second FPU is not an independent unit, and can only be used with special SIMD instructions, thus making it difficult for the core to perform at close to peak except for specially hand-tuned code portions.
Our experiments primarily examine performance in coprocessor mode, where one core is used for computation and the second is dedicated to communication. Additionally, several experiments were conducted using 32K processors of BGW in virtual node mode, where both cores are used for both computation and communication. Finally, the global interrupt network provides fast barriers and interrupts with a system-wide constant latency of 1.
Vector processors expedite uniform operations on independent data sets by exploiting regularities in the computational structure. The X1E computational core, called the single-streaming processor (SSP), contains two stage vector pipes running at 1. Each SSP contains 32 vector registers holding 64 double-precision words, and operates at 4.
The X1E interconnect is hierarchical, with subsets of 16 SMP nodes (each containing 8 MSPs) connected via a crossbar; these subsets are connected in a 4D-hypercube topology. We examine: GTC, a magnetic fusion application that uses the particle-in-cell approach to solve nonlinear gyrophase-averaged Vlasov-Poisson equations; Cactus, an astrophysics framework for high-performance computing that evolves Einstein's equations from the Theory of General Relativity; ELBM3D, a Lattice-Boltzmann code to study turbulent fluid flow; PARATEC, a first-principles materials science code that solves the Kohn-Sham equations of density functional theory to obtain electronic wave functions; and HyperCLaw, an adaptive mesh refinement (AMR) framework for solving the hyperbolic conservation laws of gas dynamics via a higher-order Godunov method.
Table 1. These codes are candidate ultra-scale applications with the potential to fully utilize leadership-class computing systems, and represent a broad range of algorithms and computational structures. Examining these varied computational methodologies across a set of modern supercomputing platforms allows us to study the performance trade-offs of different architectural balances and topological interconnect approaches.
To study the topological connectivity of communication for each application, we utilize the IPM tool, recently developed at LBNL. IPM is an application profiling layer that allows us to noninvasively gather the communication characteristics of these codes as they are run in a production environment. By recording statistics on these message exchanges we can form an undirected graph which describes the topological connectivity required by the application. We use the IPM data to create representations of the topological connectivity of communication for each code as a matrix in which each point indicates message exchange, with color-coded intensity, between two given processors, highlighting the vast range of communication requirements within our application suite.
IPM is also used to collect statistics on MPI utilization and the sizes of point-to-point messages, allowing us to quantify the fraction of messages that are bandwidth- or latency-bound. The dividing line between these two regimes is an aggressively designed bandwidth-delay product of 2Kb. See [14, 25] for a more detailed explanation of the application communication characteristics, data collection methodology, and bandwidth-delay product thresholding. Experimental results show either strong scaling (where the problem size remains fixed regardless of concurrency), or weak scaling (where the problem size grows with concurrency such that the per-processor computational requirement remains fixed), whichever is appropriate for a given application's large-scale simulation.
Note that these applications have been designed and highly optimized on superscalar platforms; thus, we describe newly devised optimizations for the vector platforms where appropriate. The GFLOPS value is computed by dividing a valid baseline flop-count by the measured wall-clock time of each platform; thus the ratio between the computational rates is the same as the ratio of runtimes across the evaluated systems.
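The connectivity matrix described above can be sketched in a few lines. This is not IPM's interface or output format: the `(source, destination, bytes)` record layout and both function names are assumptions chosen only to show how an undirected graph falls out of logged point-to-point messages.

```python
def connectivity_matrix(messages, num_ranks):
    """Build a symmetric matrix of bytes exchanged between each pair of ranks.

    `messages` is an iterable of hypothetical (source, destination, nbytes)
    records. Traffic in both directions is folded together, which is what
    makes the resulting graph undirected.
    """
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for source, destination, nbytes in messages:
        matrix[source][destination] += nbytes
        if source != destination:
            matrix[destination][source] += nbytes
    return matrix


def communication_partners(matrix, rank):
    """Count the distinct other ranks that `rank` exchanged any data with."""
    return sum(1 for other, volume in enumerate(matrix[rank])
               if volume > 0 and other != rank)
```

Plotting the matrix as an image, with cell intensity proportional to volume, gives the kind of picture the text refers to; `communication_partners` gives the per-rank degree of the graph.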
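As a rough illustration of the thresholding idea, the sketch below splits a list of message sizes at a cutoff and reports the latency-bound share. The default of 2048 bytes is an assumed reading of the "2Kb" figure in the text, and treating messages at or below the cutoff as latency-bound is likewise an assumption; the cited references give the actual methodology.

```python
def latency_bound_fraction(message_sizes, threshold_bytes=2048):
    """Return the fraction of messages at or below the size threshold.

    Messages this small are treated as latency-bound and larger ones as
    bandwidth-bound. The 2048-byte default is an assumption, not a value
    confirmed by the source.
    """
    sizes = list(message_sizes)
    if not sizes:
        return 0.0
    small = sum(1 for size in sizes if size <= threshold_bytes)
    return small / len(sizes)
```

For example, `latency_bound_fraction([64, 1024, 4096, 1_000_000])` counts two of the four messages as latency-bound.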
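The GFLOPS definition above is simple enough to check numerically. In this sketch the function name and the sample numbers are made up for illustration; the point is only that, with a shared baseline flop count, the ratio of rates equals the inverse ratio of wall-clock times.

```python
def gflops(baseline_flops, wall_seconds):
    """Computational rate in GFLOPS from a fixed baseline flop count.

    The same `baseline_flops` is used for every platform, regardless of
    how many operations that platform actually executed.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return baseline_flops / wall_seconds / 1e9


# Hypothetical numbers: one baseline count, two platforms' runtimes.
baseline = 2e12
rate_a = gflops(baseline, 100.0)  # platform A took 100 s
rate_b = gflops(baseline, 250.0)  # platform B took 250 s
speedup = rate_a / rate_b         # matches 250.0 / 100.0 up to rounding
```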