Gupta, Anoop
Overview
Works:  18 works in 91 publications in 1 language and 841 library holdings 

Genres:  Conference papers and proceedings 
Roles:  Author 
Classifications:  QA76.58, 004.35 
Publication Timeline
Most widely held works by Anoop Gupta
Parallel computer architecture : a hardware/software approach by David E Culler (Book)
22 editions published between 1998 and 2007 in English and held by 452 WorldCat member libraries worldwide
The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data-parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architectures across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques …
Parallelism in production systems by Anoop Gupta (Book)
12 editions published between 1984 and 1987 in English and Undetermined and held by 214 WorldCat member libraries worldwide
Fast sparse matrix factorization on modern workstations by Edward Rothberg (Book)
7 editions published in 1989 in English and held by 15 WorldCat member libraries worldwide
The performance of workstation-class machines has experienced a dramatic increase in the recent past. Relatively inexpensive machines which offer 14 MIPS and 2 MFLOPS performance are now available, and machines with even higher performance are not far off. One important characteristic of these machines is that they rely on a small amount of high-speed cache memory for their high performance. In this paper, we consider the problem of Cholesky factorization of a large sparse positive definite system of equations on a high performance workstation. We find that the major factor limiting performance is the cost of moving data between memory and the processor. We use two techniques to address this limitation: we decrease the number of memory references, and we improve cache behavior to decrease the cost of each reference. When run on benchmarks from the Harwell-Boeing Sparse Matrix Collection, the resulting factorization code is almost three times as fast as SPARSPAK on a DECStation 3100. We believe that the issues brought up in this paper will play an important role in the effective use of high performance workstations on large numerical problems
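The factorization the abstract refers to can be illustrated with a minimal dense, unblocked Cholesky sketch (the paper works on large sparse matrices with cache-aware code; this toy version only shows the underlying algorithm, not the authors' implementation):

```python
import numpy as np

def cholesky(A):
    """Factor a symmetric positive-definite A as L @ L.T (dense, unblocked)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        # Diagonal entry: subtract contributions of columns already computed.
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L
```

Each inner step streams over previously computed columns, which is exactly the memory traffic the paper's techniques aim to reduce.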
The performance impact of data reuse in parallel dense Cholesky factorization by Edward Rothberg (Book)
5 editions published in 1992 in English and held by 14 WorldCat member libraries worldwide
Abstract: "This paper explores performance issues for several prominent approaches to parallel dense Cholesky factorization. The primary focus is on issues that arise when blocking techniques are integrated into parallel factorization approaches to improve data reuse in the memory hierarchy. We first consider panel-oriented approaches, where sets of contiguous columns are manipulated as single units. These methods represent natural extensions of the column-oriented methods that have been widely used previously. On machines with memory hierarchies, panel-oriented methods significantly increase the achieved performance over column-oriented methods
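The panel-oriented idea the abstract describes can be sketched as a right-looking Cholesky that factors a panel of contiguous columns and then applies one blocked update to the trailing matrix, reusing the panel while it is cache-resident (a dense illustration under that assumption, not the paper's code):

```python
import numpy as np

def panel_cholesky(A, b=2):
    """Right-looking Cholesky operating on panels of b contiguous columns."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Factor the panel's diagonal block (unblocked Cholesky).
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Triangular solve finishes the panel below the diagonal block.
            A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T
            # One blocked rank-b update of the trailing matrix: the panel is
            # reused (n - e) times while it stays in cache -- the data-reuse
            # advantage over column-at-a-time updates.
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```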
Parallel ICCG on a hierarchical memory multiprocessor : addressing the triangular solve bottleneck by Edward Rothberg (Book)
6 editions published in 1990 in English and held by 13 WorldCat member libraries worldwide
Abstract: "The incomplete Cholesky conjugate gradient (ICCG) algorithm is a commonly used iterative method for solving large sparse systems of equations. In this paper, we study the parallel solution of sparse triangular systems of equations, the most difficult aspect of implementing the ICCG method on a multiprocessor. We focus on shared memory multiprocessor architectures with deep memory hierarchies. On such architectures we find that previously proposed parallelization approaches result in little or no speedup. The reason is that these approaches cause significant increases in the amount of memory system traffic as compared to a sequential approach. Increases of as much as a factor of 10 on four processors were observed."
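The triangular-solve bottleneck the abstract studies can be made concrete with a small sketch: sequential forward substitution, plus the classic level-set scheduling that earlier parallelization proposals used to expose concurrency (dense toy version of a sparse idea; not the paper's code):

```python
import numpy as np

def lower_solve(L, b):
    """Sequential forward substitution for a lower-triangular system L x = b."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def level_sets(L):
    """Group rows into levels: rows in one level depend only on rows in
    earlier levels, so each level could be solved concurrently."""
    n = L.shape[0]
    level = [0] * n
    for i in range(n):
        deps = [level[j] + 1 for j in range(i) if L[i, j] != 0]
        level[i] = max(deps, default=0)
    sets = {}
    for i, lv in enumerate(level):
        sets.setdefault(lv, []).append(i)
    return [sets[lv] for lv in sorted(sets)]
```

The paper's observation is that, on deep memory hierarchies, this kind of scheduling can still lose to the sequential solve because it multiplies memory traffic.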
Parallel execution of OPS5 in QLISP by Hiroshi G Okuno (Book)
5 editions published in 1987 in English and held by 12 WorldCat member libraries worldwide
An evaluation of left-looking, right-looking and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines by Edward Rothberg (Book)
4 editions published in 1991 in English and held by 11 WorldCat member libraries worldwide
Abstract (excerpt): "We also find that the overall approach (left-looking, right-looking, or multifrontal) is less important for performance than the particular set of primitives used to implement the approach."
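The left-looking/right-looking distinction the excerpt mentions can be shown in miniature with two dense column-Cholesky loops that differ only in when updates are applied (an illustrative sketch, not the paper's sparse implementations):

```python
import numpy as np

def left_looking(A):
    """Left-looking: before column j is finalized, gather updates INTO it
    from all previously computed columns."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        col = A[j:, j].copy()
        for k in range(j):                  # pull updates from columns k < j
            col -= L[j, k] * L[j:, k]
        L[j:, j] = col / np.sqrt(col[0])
    return L

def right_looking(A):
    """Right-looking: as soon as column j is computed, push its updates OUT
    to the trailing submatrix."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for j in range(n):
        A[j:, j] /= np.sqrt(A[j, j])
        A[j + 1:, j + 1:] -= np.outer(A[j + 1:, j], A[j + 1:, j])
    return np.tril(A)
```

Both compute the same factor; what differs is the memory access pattern, which is why the choice of update primitives, rather than the overall approach, dominates performance on hierarchical-memory machines.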
A comparative evaluation of nodal and supernodal parallel sparse matrix factorization : detailed simulation results by Edward Rothberg (Book)
2 editions published in 1990 in English and held by 10 WorldCat member libraries worldwide
In this paper we consider the problem of factoring a large sparse system of equations on a modestly parallel shared-memory multiprocessor with a nontrivial memory hierarchy. Using detailed multiprocessor simulation, we study the behavior of the parallel sparse factorization scheme developed at the Oak Ridge National Laboratory. We then extend the Oak Ridge scheme to incorporate the notion of supernodal elimination. We present detailed analyses of the sources of performance degradation for each of these schemes. We measure the impact of interprocessor communication costs, processor load imbalance, overheads introduced in order to distribute work, and cache behavior on overall parallel performance. For the three benchmark matrices which we study, we find that the supernodal scheme gives a factor of 1.7 to 2.7 performance advantage for 8 processors and a factor of 0.9 to 1.6 for 32 processors. The supernodal scheme exhibits higher performance due mainly to the fact that it executes many fewer memory operations and produces fewer cache misses. However, the natural task grain size for the supernodal scheme is much larger than that of the Oak Ridge scheme, making effective distribution of work more difficult, especially when the number of processors is large
Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations by Edward Rothberg (Book)
3 editions published in 1990 in English and held by 9 WorldCat member libraries worldwide
The result is greatly increased factorization performance. We present experimental results from executions of our codes on the Silicon Graphics 4D/380 multiprocessor. Using eight processors, we find that the supernodal parallel code achieves a computation rate of approximately 40 MFLOPS when factoring a range of benchmark matrices. This is more than twice as fast as the parallel nodal code developed at the Oak Ridge National Laboratory running on the SGI 4D/380."
Temporal, processor, and spatial locality in multiprocessor memory references by Anant Agarwal (Book)
7 editions published between 1988 and 1989 in English and Undetermined and held by 9 WorldCat member libraries worldwide
The performance of cache-coherent multiprocessors is strongly influenced by locality in the memory reference behavior of parallel applications. While the notions of temporal and spatial locality in uniprocessor memory references are well understood, the corresponding notions of locality in multiprocessors and their impact on multiprocessor cache behavior are not clear. A locality model suitable for multiprocessor cache evaluation is derived by viewing memory references as streams of processor identifiers directed at specific cache/memory blocks. This viewpoint differs from the traditional uniprocessor approach that uses streams of addresses to different blocks emanating from specific processors. Our view is based on the intuition that cache coherence traffic in a multiprocessor is largely determined by the number of processors accessing a location, the frequency with which they access the location, and the sequence in which their accesses occur. The specific locations accessed by each processor, the time order of access to different locations, and the size of the working set play a smaller role in determining the cache coherence traffic, although they still influence intrinsic cache performance. Looking at traces from the viewpoint of a memory block leads to a new notion of reference locality for multiprocessors, called processor locality. In this paper, we study the temporal, spatial, and processor locality in the memory reference patterns of three parallel applications. Based on the observed locality, we then reflect on the expected cache behavior of the three applications.
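The block's-eye view of a trace described above can be sketched directly: group references by memory block, record the stream of processor IDs touching each block, and count runs of consecutive same-processor accesses. Fewer runs per access means higher processor locality and less coherence traffic. (This metric is an illustration suggested by the abstract, not necessarily the paper's exact measure.)

```python
from collections import defaultdict

def per_block_processor_runs(trace, block_size=4):
    """trace: iterable of (processor_id, address) pairs.
    Returns {block_number: run count}, where a run is a maximal stretch of
    consecutive accesses to that block by the same processor."""
    streams = defaultdict(list)
    for pid, addr in trace:
        streams[addr // block_size].append(pid)
    runs = {}
    for blk, pids in streams.items():
        # One run to start, plus one more for every processor switch.
        runs[blk] = 1 + sum(1 for a, b in zip(pids, pids[1:]) if a != b)
    return runs
```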
An efficient block-oriented approach to parallel sparse Cholesky factorization by Edward Rothberg (Book)
2 editions published in 1992 in English and held by 9 WorldCat member libraries worldwide
Abstract: "This paper explores the use of a subblock decomposition strategy for parallel sparse Cholesky factorization, in which the sparse matrix is decomposed into rectangular blocks. Such a strategy has enormous theoretical scalability advantages over a more traditional column-oriented decomposition for large parallel machines. However, little progress has been made in producing a practical subblock method. This paper describes and evaluates an approach that is both simple and efficient."
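The scalability advantage of a 2D subblock decomposition can be seen in a toy block-cyclic mapping: with blocks spread over a pr x pc processor grid, any one block-column touches at most pr processors rather than all P of them (an illustrative mapping under that assumption; the paper's actual decomposition may differ):

```python
def block_owner(i, j, pr, pc):
    """Block-cyclic 2D mapping: block (i, j) lives on processor
    (i mod pr, j mod pc) of a pr x pc grid."""
    return (i % pr, j % pc)

def procs_touching_column(j, nb, pr, pc):
    """Processors owning at least one of the nb blocks in block-column j.
    At most pr processors -- O(sqrt(P)) for a square grid -- versus all P
    in a 1D column-oriented decomposition."""
    return {block_owner(i, j, pr, pc) for i in range(nb)}
```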
Implementation of production systems on message passing computers : techniques, simulation results and analysis by Milind Tambe (Book)
3 editions published in 1989 in English and held by 8 WorldCat member libraries worldwide
Measurements on production systems by Anoop Gupta (Book)
3 editions published in 1983 in English and held by 7 WorldCat member libraries worldwide
Implementing OPS5 production systems on DADO by Anoop Gupta (Book)
2 editions published in 1984 in English and held by 5 WorldCat member libraries worldwide
HEXT, a hierarchical circuit extractor by Anoop Gupta (Book)
1 edition published in 1982 in English and held by 4 WorldCat member libraries worldwide
Two papers on circuit extraction by Anoop Gupta (Book)
4 editions published in 1982 in English and held by 4 WorldCat member libraries worldwide
The first paper describes the design, implementation and performance of a flat edge-based circuit extractor for NMOS circuits. The extractor is able to work on large and complex designs, it can handle arbitrary geometry, and it outputs a comprehensive wirelist. Measurements show that the run time of the edge-based algorithm used is linear in the size of the circuit, with low implementation overheads. The extractor is capable of analyzing a circuit with 20,000 transistors in less than 30 minutes of CPU time on a VAX 11/780. The high performance of the extractor has changed the role that a circuit extractor plays in the design process, as it is now possible to extract a chip a number of times during the same session.

The second paper describes the algorithms, implementation, and performance of a hierarchical circuit extractor for NMOS designs. The input to the circuit extractor is a description of the layout of the chip, and its output is a hierarchical wirelist describing the circuit. The extractor is divided into two parts, a frontend and a backend. The frontend analyzes the CIF description of a layout and partitions it into a set of non-overlapping rectangular regions called windows; redundant windows are recognized and are extracted only once. The backend analyzes each unique window found by the frontend. The backend determines the electrical circuit represented by the window, and computes an interface that is later used to combine the window with others that are adjacent. The paper also presents a simple analysis of the expected performance of the algorithm, and the results of running the extractor on some real chip designs
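The redundancy-exploiting idea of the hierarchical extractor (partition the layout into windows, extract each unique window only once) can be modeled in a few lines. Here the layout is just a 2-D grid of cells and "extracting" a window merely records its contents; both are stand-ins for the real geometric analysis:

```python
def extract_windows(layout, win=2):
    """Partition a square layout grid into win x win windows, analyzing each
    unique window contents only once (a toy model of the frontend/backend
    split, not the extractor's actual algorithm)."""
    seen = {}          # window contents -> result of analyzing it once
    placements = []    # (row, col, analysis result) for every window position
    n = len(layout)
    for r in range(0, n, win):
        for c in range(0, n, win):
            w = tuple(row[c:c + win] for row in layout[r:r + win])
            if w not in seen:
                seen[w] = w        # stand-in for the per-window circuit analysis
            placements.append((r, c, seen[w]))
    return seen, placements
```

On regular layouts (memory arrays, PLAs) most windows repeat, so the expensive per-window work is done far fewer times than there are window positions.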
ACE, a circuit extractor by Anoop Gupta (Book)
1 edition published in 1982 in English and held by 3 WorldCat member libraries worldwide
Real-Time Knowledge-Based Systems by Thomas Laffey (Book)
2 editions published in 1989 in English and held by 2 WorldCat member libraries worldwide
Associated Subjects
Artificial intelligence; Cache memory; Computer architecture; Computer storage devices; Conjugate gradient methods--Data processing; Data structures (Computer science); Electric circuit analysis; Equations, Simultaneous--Numerical solutions--Data processing; Expert systems (Computer science); Factorization (Mathematics); Factorization (Mathematics)--Data processing; Integrated circuits; Integrated circuits--Testing; LISP (Computer program language); Mathematics--Data processing; Matrices; Microcomputer workstations; Multiprocessors; OPS5 (Computer program language); Parallel computers; Parallel processing (Electronic computers); Real-time data processing; Sparse matrices; Sparse matrices--Data processing