WorldCat Identities

Gupta, Anoop

Overview
Works: 20 works in 72 publications in 2 languages and 818 library holdings
Genres: Conference papers and proceedings 
Roles: Author
Classifications: QA76.58, 004.35
Publication Timeline
Most widely held works by Anoop Gupta
Parallel computer architecture : a hardware/software approach by David E Culler( Book )

18 editions published between 1998 and 2006 in English and held by 449 WorldCat member libraries worldwide

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques ...
Parallelism in production systems by Anoop Gupta( Book )

8 editions published between 1986 and 1987 in English and Undetermined and held by 208 WorldCat member libraries worldwide

The performance impact of data reuse in parallel dense Cholesky factorization by Edward Rothberg( Book )

4 editions published in 1992 in English and held by 13 WorldCat member libraries worldwide

Abstract: "This paper explores performance issues for several prominent approaches to parallel dense Cholesky factorization. The primary focus is on issues that arise when blocking techniques are integrated into parallel factorization approaches to improve data reuse in the memory hierarchy. We first consider panel-oriented approaches, where sets of contiguous columns are manipulated as single units. These methods represent natural extensions of the column-oriented methods that have been widely used previously. On machines with memory hierarchies, panel-oriented methods significantly increase the achieved performance over column-oriented methods ..."
Fast sparse matrix factorization on modern workstations by Edward Rothberg( Book )

5 editions published in 1989 in English and held by 12 WorldCat member libraries worldwide

The performance of workstation-class machines has experienced a dramatic increase in the recent past. Relatively inexpensive machines which offer 14 MIPS and 2 MFLOPS performance are now available, and machines with even higher performance are not far off. One important characteristic of these machines is that they rely on a small amount of high-speed cache memory for their high performance. In this paper, we consider the problem of Cholesky factorization of a large sparse positive definite system of equations on a high performance workstation. We find that the major factor limiting performance is the cost of moving data between memory and the processor. We use two techniques to address this limitation: we decrease the number of memory references, and we improve cache behavior to decrease the cost of each reference. When run on benchmarks from the Harwell-Boeing Sparse Matrix Collection, the resulting factorization code is almost three times as fast as SPARSPAK on a DECStation 3100. We believe that the issues brought up in this paper will play an important role in the effective use of high performance workstations on large numerical problems.
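The two techniques the abstract names, fewer memory references and better cache behavior, are commonly realized by blocking the factorization so each loaded block is reused many times. As a hedged illustration only (this is not the paper's sparse code, and the block size `b` is a hypothetical tuning parameter), here is a minimal blocked dense Cholesky in Python with NumPy:

```python
import numpy as np

def blocked_cholesky(A, b=2):
    """Right-looking Cholesky with square blocking: each panel loaded
    into cache participates in many updates before being evicted."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        kb = min(b, n - k)
        # Factor the diagonal block in place.
        A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
        L_kk = A[k:k+kb, k:k+kb]
        if k + kb < n:
            # Triangular solve for the panel below the diagonal block:
            # panel <- panel @ inv(L_kk)^T
            A[k+kb:, k:k+kb] = np.linalg.solve(L_kk, A[k+kb:, k:k+kb].T).T
            # Trailing-matrix update; the panel is reused across the
            # whole trailing submatrix, which is the data-reuse win.
            panel = A[k+kb:, k:k+kb]
            A[k+kb:, k+kb:] -= panel @ panel.T
    return np.tril(A)
```

The result agrees with an unblocked factorization; only the memory access pattern differs.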
Parallel ICCG on a hierarchical memory multiprocessor : addressing the triangular solve bottleneck by Edward Rothberg( Book )

4 editions published in 1990 in English and held by 11 WorldCat member libraries worldwide

Abstract: "The incomplete Cholesky conjugate gradient (ICCG) algorithm is a commonly used iterative method for solving large sparse systems of equations. In this paper, we study the parallel solution of sparse triangular systems of equations, the most difficult aspect of implementing the ICCG method on a multiprocessor. We focus on shared-memory multiprocessor architectures with deep memory hierarchies. On such architectures we find that previously proposed parallelization approaches result in little or no speedup. The reason is that these approaches cause significant increases in the amount of memory system traffic as compared to a sequential approach. Increases of as much as a factor of 10 on four processors were observed ..."
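The parallelization approaches the abstract refers to typically rest on level scheduling: rows of the triangular system are grouped into levels so that every row in a level depends only on rows in earlier levels, and a whole level can be solved in parallel. A minimal dense-indexed sketch of level construction (assuming a NumPy lower-triangular matrix; real ICCG codes operate on compressed sparse structures):

```python
import numpy as np

def level_schedule(L):
    """Group rows of a lower-triangular solve into dependency levels.
    Row i depends on row j whenever L[i, j] != 0 for j < i; its level
    is one more than the deepest level it depends on."""
    n = L.shape[0]
    level = [0] * n
    for i in range(n):
        deps = [level[j] + 1 for j in range(i) if L[i, j] != 0]
        level[i] = max(deps, default=0)
    # Collect rows by level, in increasing level order.
    levels = {}
    for i, lv in enumerate(level):
        levels.setdefault(lv, []).append(i)
    return [levels[k] for k in sorted(levels)]
```

A diagonal matrix collapses to a single fully parallel level, while a dense chain of dependencies degenerates to one row per level, which is where the speedup (and the memory-traffic cost the paper measures) is decided.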
An evaluation of left-looking, right-looking and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines by Edward Rothberg( Book )

4 editions published in 1991 in English and held by 11 WorldCat member libraries worldwide

"... We also find that the overall approach (left-looking, right-looking, or multifrontal) is less important for performance than the particular set of primitives used to implement the approach."
A comparative evaluation of nodal and supernodal parallel sparse matrix factorization : detailed simulation results by Edward Rothberg( Book )

2 editions published in 1990 in English and held by 10 WorldCat member libraries worldwide

Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations by Edward Rothberg( Book )

3 editions published in 1990 in English and held by 9 WorldCat member libraries worldwide

"... The result is greatly increased factorization performance. We present experimental results from executions of our codes on the Silicon Graphics 4D/380 multiprocessor. Using eight processors, we find that the supernodal parallel code achieves a computation rate of approximately 40 MFLOPS when factoring a range of benchmark matrices. This is more than twice as fast as the parallel nodal code developed at the Oak Ridge National Laboratory running on the SGI 4D/380."
Parallel execution of OPS5 in QLISP by Hiroshi G Okuno( Book )

3 editions published in 1987 in English and held by 9 WorldCat member libraries worldwide

An efficient block-oriented approach to parallel sparse Cholesky factorization by Edward Rothberg( Book )

1 edition published in 1992 in English and held by 8 WorldCat member libraries worldwide

Abstract: "This paper explores the use of a sub-block decomposition strategy for parallel sparse Cholesky factorization, in which the sparse matrix is decomposed into rectangular blocks. Such a strategy has enormous theoretical scalability advantages over a more traditional column-oriented decomposition for large parallel machines. However, little progress has been made in producing a practical sub-block method. This paper describes and evaluates an approach that is both simple and efficient."
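The scalability advantage of a sub-block (2-D) decomposition over a column (1-D) decomposition comes from spreading each block row and block column across a grid of processors, so no single processor serializes the updates to a column. A toy sketch of the standard block-cyclic owner maps (the function names are illustrative, not from the paper):

```python
def block_owner(i, j, pr, pc):
    """Owner of block (i, j) under a 2-D block-cyclic map on a
    pr-by-pc processor grid: processor (i mod pr, j mod pc)."""
    return (i % pr, j % pc)

def column_owner(j, p):
    """1-D column-cyclic map for comparison: a whole block column
    lives on one processor, so all updates to it serialize there."""
    return j % p
```

Under the 2-D map, the blocks of any one block row are shared by `pc` processors rather than scattered one-per-column, which is the source of the communication-volume advantage the abstract alludes to.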
Implementation of production systems on message passing computers : techniques, simulation results and analysis by Milind Tambe( Book )

2 editions published in 1989 in English and held by 7 WorldCat member libraries worldwide

Temporal, processor, and spatial locality in multiprocessor memory references by Aditya Agarwal( Book )

4 editions published between 1988 and 1989 in English and held by 7 WorldCat member libraries worldwide

The performance of cache-coherent multiprocessors is strongly influenced by locality in the memory reference behavior of parallel applications. While the notions of temporal and spatial locality in uniprocessor memory references are well understood, the corresponding notions of locality in multiprocessors and their impact on multiprocessor cache behavior are not clear. A locality model suitable for multiprocessor cache evaluation is derived by viewing memory references as streams of processor identifiers directed at specific cache/memory blocks. This viewpoint differs from the traditional uniprocessor approach that uses streams of addresses to different blocks emanating from specific processors. Our view is based on the intuition that cache coherence traffic in multiprocessors is largely determined by the number of processors accessing a location, the frequency with which they access the location, and the sequence in which their accesses occur. The specific locations accessed by each processor, the time order of access to different locations, and the size of the working set play a smaller role in determining the cache coherence traffic, although they still influence intrinsic cache performance. Looking at traces from the viewpoint of a memory block leads to a new notion of reference locality for multiprocessors, called processor locality.
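The per-block viewpoint can be made concrete with a toy trace analysis. The sketch below is a simplified stand-in for the paper's model, with illustrative names throughout: it regroups a trace of (processor, address) pairs into per-block processor streams, and scores locality as the fraction of consecutive accesses to a block made by the same processor:

```python
from collections import defaultdict

def per_block_processor_streams(trace, block_size=64):
    """View a shared-memory trace as, for each memory block, the
    stream of processor ids that touch it."""
    streams = defaultdict(list)
    for pid, addr in trace:
        streams[addr // block_size].append(pid)
    return streams

def processor_locality(streams):
    """Fraction of consecutive accesses to a block made by the same
    processor; high values suggest little coherence traffic."""
    same = total = 0
    for stream in streams.values():
        for a, b in zip(stream, stream[1:]):
            total += 1
            same += (a == b)
    return same / total if total else 1.0
```

A trace in which each processor works on a block for a while before another takes over scores high; fine-grained interleaving on shared blocks scores low, matching the intuition that ownership changes, not raw access counts, drive coherence traffic.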
Measurements on production systems by Anoop Gupta( Book )

2 editions published in 1983 in English and held by 6 WorldCat member libraries worldwide

Implementing OPS5 production systems on DADO by Anoop Gupta( Book )

1 edition published in 1984 in English and held by 4 WorldCat member libraries worldwide

Parallelism in production systems : the sources and the expected speed-up by Anoop Gupta( Book )

1 edition published in 1984 in English and held by 4 WorldCat member libraries worldwide

HEXT, a hierarchical circuit extractor by Anoop Gupta( Book )

1 edition published in 1982 in English and held by 4 WorldCat member libraries worldwide

Two papers on circuit extraction by Anoop Gupta( Book )

4 editions published in 1982 in English and held by 4 WorldCat member libraries worldwide

The first paper describes the design, implementation and performance of a flat edge-based circuit extractor for NMOS circuits. The extractor is able to work on large and complex designs, it can handle arbitrary geometry, and it outputs a comprehensive wirelist. Measurements show that the run time of the edge-based algorithm used is linear in the size of the circuit, with low implementation overheads. The extractor is capable of analyzing a circuit with 20,000 transistors in less than 30 minutes of CPU time on a VAX 11/780. The high performance of the extractor has changed the role that a circuit extractor plays in the design process, as it is now possible to extract a chip a number of times during the same session.

The second paper describes the algorithms, implementation, and performance of a hierarchical circuit extractor for NMOS designs. The input to the circuit extractor is a description of the layout of the chip, and its output is a hierarchical wirelist describing the circuit. The extractor is divided into two parts, a front-end and a back-end. The front-end analyzes the CIF description of a layout and partitions it into a set of non-overlapping rectangular regions called windows; redundant windows are recognized and are extracted only once. The back-end analyzes each unique window found by the front-end. The back-end determines the electrical circuit represented by the window, and computes an interface that is later used to combine the window with others that are adjacent. The paper also presents a simple analysis of the expected performance of the algorithm, and the results of running the extractor on some real chip designs.
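The front-end's key saving, recognizing redundant windows and extracting each unique one only once, can be sketched as canonicalize-and-hash. This is a hedged toy illustration, not the extractor's actual data structures: a window is represented here simply as a collection of geometry tuples, and two windows are "redundant" when their canonical forms match.

```python
def unique_windows(windows):
    """Deduplicate layout windows: geometrically identical windows
    are recognized and extracted only once, then reused by index."""
    seen = {}    # canonical form -> index into the unique list
    uniq = []
    for w in windows:
        key = tuple(sorted(w))   # canonical form of the geometry
        if key not in seen:
            seen[key] = len(uniq)
            uniq.append(w)
    return uniq, [seen[tuple(sorted(w))] for w in windows]
```

The back-end would then run once per entry of `uniq`, with the index list recording which extraction result each original window reuses.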
ACE, a circuit extractor by Anoop Gupta( Book )

1 edition published in 1982 in English and held by 3 WorldCat member libraries worldwide

Real-Time Knowledge-Based Systems by Thomas Laffey( Book )

2 editions published in 1989 in English and held by 2 WorldCat member libraries worldwide

Common link by Anoop Gupta( Book )

2 editions published in 2012 in German and English and held by 1 WorldCat member library worldwide

 
Audience Level
Audience level: 0.65 (from 0.61 for Parallel c ... to 0.84 for Temporal, ...)

Alternative Names
Anoop Gupta

Languages
English (70)

German (1)
