WorldCat Identities

Gupta, Anoop

Overview
Works: 53 works in 107 publications in 1 language and 840 library holdings
Roles: Author
Classifications: QA76.58, 004.35
Publication Timeline
Most widely held works by Anoop Gupta
Parallel computer architecture : a hardware/software approach by David E Culler( Book )
17 editions published between 1998 and 2006 in English and held by 460 WorldCat member libraries worldwide
The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques ...
Parallelism in production systems by Anoop Gupta( Book )
9 editions published between 1984 and 1987 in English and Undetermined and held by 215 WorldCat member libraries worldwide
The performance impact of data reuse in parallel dense Cholesky factorization by Edward Rothberg( Book )
4 editions published in 1992 in English and held by 14 WorldCat member libraries worldwide
Abstract: "This paper explores performance issues for several prominent approaches to parallel dense Cholesky factorization. The primary focus is on issues that arise when blocking techniques are integrated into parallel factorization approaches to improve data reuse in the memory hierarchy. We first consider panel-oriented approaches, where sets of contiguous columns are manipulated as single units. These methods represent natural extensions of the column-oriented methods that have been widely used previously. On machines with memory hierarchies, panel- oriented methods significantly increase the achieved performance over column-oriented methods
Parallel ICCG on a hierarchical memory multiprocessor : addressing the triangular solve bottleneck by Edward Rothberg( Book )
4 editions published in 1990 in English and held by 12 WorldCat member libraries worldwide
Abstract: "The incomplete Cholesky conjugate gradient (ICCG) algorithm is a commonly used iterative method for solving large sparse systems of equations. In this paper, we study the parallel solution of sparse triangular systems of equations, the most difficult aspect of implementing the ICCG method on a multiprocessor. We focus on shared- memory multiprocessor architectures with deep memory hierarchies. On such architectures we find that previously proposed parallelization approaches result in little or no speedup. The reason is that these approaches cause significant increases in the amount of memory system traffic as compared to a sequential approach. Increases of as much as a factor of 10 on four processors were observed
An evaluation of left-looking, right-looking and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines by Edward Rothberg( Book )
4 editions published in 1991 in English and held by 12 WorldCat member libraries worldwide
We also find that the overall approach (left-looking, right- looking, or multifrontal) is less important for performance than the particular set of primitives used to implement the approach."
Fast sparse matrix factorization on modern workstations by Edward Rothberg( Book )
4 editions published in 1989 in English and held by 12 WorldCat member libraries worldwide
The performance of workstation-class machines has experienced a dramatic increase in the recent past. Relatively inexpensive machines which offer 14 MIPS and 2 MFLOPS performance are now available, and machines with even higher performance are not far off. One important characteristic of these machines is that they rely on a small amount of high-speed cache memory for their high performance. In this paper, we consider the problem of Cholesky factorization of a large sparse positive definite system of equations on a high performance workstation. We find that the major factor limiting performance is the cost of moving data between memory and the processor. We use two techniques to address this limitation: we decrease the number of memory references and we improve cache behavior to decrease the cost of each reference. When run on benchmarks from the Harwell-Boeing Sparse Matrix Collection, the resulting factorization code is almost three times as fast as SPARSPAK on a DECstation 3100. We believe that the issues brought up in this paper will play an important role in the effective use of high performance workstations on large numerical problems.
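The two techniques named here, fewer memory references and better cache behavior, are easiest to see on a dense kernel. The sketch below tiles a matrix-matrix multiply so each tile is reused while it is cache-resident; the block size BS and the kernel itself are generic illustrations of the blocking idea, not the sparse factorization code the paper describes.

    /* Illustrative sketch only: classic loop tiling (blocking) of a dense
     * matrix-matrix multiply.  BS is a hypothetical block size chosen to
     * keep one tile of A and B resident in the cache. */
    #include <stddef.h>

    #define BS 64  /* assumed block size; tune to the cache */

    static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* C += A * B for n x n row-major matrices, processed in BS x BS tiles
     * so each cached tile is reused many times before being evicted. */
    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    for (size_t i = ii; i < min_sz(ii + BS, n); i++)
                        for (size_t k = kk; k < min_sz(kk + BS, n); k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < min_sz(jj + BS, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }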
A comparative evaluation of nodal and supernodal parallel sparse matrix factorization : detailed simulation results by Edward Rothberg( Book )
2 editions published in 1990 in English and held by 10 WorldCat member libraries worldwide
Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations by Edward Rothberg( Book )
3 editions published in 1990 in English and held by 10 WorldCat member libraries worldwide
The result is greatly increased factorization performance. We present experimental results from executions of our codes on the Silicon Graphics 4D/380 multiprocessor. Using eight processors, we find that the supernodal parallel code achieves a computation rate of approximately 40 MFLOPS when factoring a range of benchmark matrices. This is more than twice as fast as the parallel nodal code developed at the Oak Ridge National Laboratory running on the SGI 4D/380."
Parallel execution of OPS5 in QLISP by H. G. Okuno( Book )
3 editions published in 1987 in English and held by 9 WorldCat member libraries worldwide
Temporal, processor, and spatial locality in multiprocessor memory references by Anant Agarwal( Book )
4 editions published between 1988 and 1989 in English and held by 8 WorldCat member libraries worldwide
The performance of cache-coherent multiprocessors is strongly influenced by locality in the memory reference behavior of parallel applications. While the notions of temporal and spatial locality in uniprocessor memory references are well understood, the corresponding notions of locality in multiprocessors and their impact on multiprocessor cache behavior are not clear. A locality model suitable for multiprocessor cache evaluation is derived by viewing memory references as streams of processor identifiers directed at specific cache/memory blocks. This viewpoint differs from the traditional uniprocessor approach that uses streams of addresses to different blocks emanating from specific processors. Our view is based on the intuition that cache coherence traffic in multiprocessors is largely determined by the number of processors accessing a location, the frequency with which they access the location, and the sequence in which their accesses occur. The specific locations accessed by each processor, the time order of access to different locations, and the size of the working set play a smaller role in determining the cache coherence traffic, although they still influence intrinsic cache performance. Looking at traces from the viewpoint of a memory block leads to a new notion of reference locality for multiprocessors, called processor locality.
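A minimal sketch of the per-block viewpoint described above: treat a trace as a stream of (processor, block) pairs and, for each block, count how many processors touch it and how often the touching processor changes. The trace format, table sizes, and metric below are assumptions chosen for illustration, not the paper's methodology.

    /* Hypothetical sketch: read a trace of "<proc> <block>" pairs from stdin
     * and report, per block, the number of distinct sharers and the number
     * of processor switches -- a rough proxy for coherence traffic. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_BLOCKS 1024   /* assumed number of cache/memory blocks */
    #define MAX_PROCS  64     /* assumed number of processors */

    int main(void)
    {
        int      last_proc[MAX_BLOCKS];               /* last processor to touch block */
        uint64_t switches[MAX_BLOCKS] = {0};          /* processor changes per block */
        uint8_t  touched[MAX_BLOCKS][MAX_PROCS] = {{0}};
        for (int b = 0; b < MAX_BLOCKS; b++) last_proc[b] = -1;

        int proc, block;
        while (scanf("%d %d", &proc, &block) == 2) {
            if (block < 0 || block >= MAX_BLOCKS || proc < 0 || proc >= MAX_PROCS)
                continue;
            touched[block][proc] = 1;
            if (last_proc[block] != -1 && last_proc[block] != proc)
                switches[block]++;                    /* low processor locality */
            last_proc[block] = proc;
        }

        for (int b = 0; b < MAX_BLOCKS; b++) {
            if (last_proc[b] == -1) continue;         /* block never referenced */
            int sharers = 0;
            for (int p = 0; p < MAX_PROCS; p++) sharers += touched[b][p];
            printf("block %d: %d sharer(s), %llu processor switch(es)\n",
                   b, sharers, (unsigned long long)switches[b]);
        }
        return 0;
    }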
An efficient block-oriented approach to parallel sparse Cholesky factorization by Edward Rothberg( Book )
1 edition published in 1992 in English and held by 8 WorldCat member libraries worldwide
Abstract: "This paper explores the use of a sub-block decomposition strategy for parallel sparse Cholesky factorization, in which the sparse matrix is decomposed into rectangular blocks. Such a strategy has enormous theoretical scalability advantages over a more traditional column-oriented decomposition for large parallel machines. However, little progress has been made in producing a practical sub-block method. This paper describes and evaluates an approach that is both simple and efficient."
Implementation of production systems on message passing computers : techniques, simulation results and analysis by Milind Tambe( Book )
2 editions published in 1989 in English and held by 7 WorldCat member libraries worldwide
Measurements on production systems by Anoop Gupta( Book )
2 editions published in 1983 in English and held by 6 WorldCat member libraries worldwide
HEXT, a hierarchical circuit extractor by Anoop Gupta( Book )
1 edition published in 1982 in English and held by 4 WorldCat member libraries worldwide
Implementing OPS5 production systems on DADO by Anoop Gupta( Book )
1 edition published in 1984 in English and held by 4 WorldCat member libraries worldwide
Two papers on circuit extraction by Anoop Gupta( Book )
3 editions published in 1982 in English and held by 4 WorldCat member libraries worldwide
The first paper describes the design, implementation and performance of a flat edge-based circuit extractor for NMOS circuits. The extractor is able to work on large and complex designs, can handle arbitrary geometry, and outputs a comprehensive wirelist. Measurements show that the run time of the edge-based algorithm used is linear in the size of the circuit, with low implementation overheads. The extractor is capable of analyzing a circuit with 20,000 transistors in less than 30 minutes of CPU time on a VAX 11/780. The high performance of the extractor has changed the role that a circuit extractor plays in the design process, as it is now possible to extract a chip a number of times during the same session. The second paper describes the algorithms, implementation, and performance of a hierarchical circuit extractor for NMOS designs. The input to the circuit extractor is a description of the layout of the chip, and its output is a hierarchical wirelist describing the circuit. The extractor is divided into two parts, a front-end and a back-end. The front-end analyzes the CIF description of a layout and partitions it into a set of non-overlapping rectangular regions called windows; redundant windows are recognized and are extracted only once. The back-end analyzes each unique window found by the front-end. The back-end determines the electrical circuit represented by the window, and computes an interface that is later used to combine the window with others that are adjacent. The paper also presents a simple analysis of the expected performance of the algorithm, and the results of running the extractor on some real chip designs.
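The front-end's treatment of redundant windows can be pictured as keying each window by a canonical description of its contents and extracting only unseen keys. The sketch below uses string keys and a linear table purely for illustration; it is not the extractor's actual data structure.

    /* Hypothetical illustration of recognizing redundant windows: each window
     * is reduced to a canonical description of the geometry it contains, and
     * the back-end is invoked only for descriptions not seen before. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_UNIQUE 1024

    static const char *seen[MAX_UNIQUE];
    static int nseen = 0;

    /* Returns 1 if this window must be extracted, 0 if a prior result can be reused. */
    static int window_is_new(const char *canonical)
    {
        for (int i = 0; i < nseen; i++)
            if (strcmp(seen[i], canonical) == 0)
                return 0;
        if (nseen < MAX_UNIQUE)
            seen[nseen++] = canonical;
        return 1;
    }

    int main(void)
    {
        /* Toy layout of four windows; two contain identical geometry. */
        const char *windows[] = { "poly:0,0,4,2|diff:1,0,2,2",
                                  "poly:0,0,8,2",
                                  "poly:0,0,4,2|diff:1,0,2,2",
                                  "metal:0,0,6,6" };
        for (int i = 0; i < 4; i++)
            printf("window %d: %s\n", i,
                   window_is_new(windows[i]) ? "extract" : "redundant, reuse result");
        return 0;
    }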
Parallel Computer Architecture
2 editions published in 1998 in English and held by 3 WorldCat member libraries worldwide
ACE, a circuit extractor by Anoop Gupta( Book )
1 edition published in 1982 in English and held by 3 WorldCat member libraries worldwide
SPLASH : Stanford Parallel Applications for Shared-Memory by Jaswinder Pal Singh( Book )
2 editions published between 1991 and 1992 in English and held by 2 WorldCat member libraries worldwide
This report was replaced by the updated version CSL-TR-92-526
Revision to "Memory consistency and event ordering in scalable shared-memory multiprocessors" by Kourosh Gharachorloo( Book )
1 edition published in 1993 in English and held by 2 WorldCat member libraries worldwide
In addition, our previous work on the implementation and performance of various memory models is unaffected by this change."
 
Audience level: 0.76 (from 0.00 for Parallel C ... to 0.94 for Parallel e ...)
Alternative Names
Anoop Gupta
Languages
English (69)