WorldCat Identities

Gupta, Anoop

Overview
Works: 20 works in 72 publications in 2 languages and 812 library holdings
Genres: Conference papers and proceedings 
Roles: Author
Classifications: QA76.58, 004.35
Publication Timeline
Most widely held works by Anoop Gupta
Parallel computer architecture : a hardware/software approach by David E Culler( Book )

19 editions published between 1998 and 2006 in English and held by 448 WorldCat member libraries worldwide

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architectures across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software …
Parallelism in production systems by Anoop Gupta( Book )

9 editions published between 1986 and 1987 in English and Undetermined and held by 209 WorldCat member libraries worldwide

The performance impact of data reuse in parallel dense Cholesky factorization by Edward Rothberg( Book )

4 editions published in 1992 in English and held by 13 WorldCat member libraries worldwide

Abstract: "This paper explores performance issues for several prominent approaches to parallel dense Cholesky factorization. The primary focus is on issues that arise when blocking techniques are integrated into parallel factorization approaches to improve data reuse in the memory hierarchy. We first consider panel-oriented approaches, where sets of contiguous columns are manipulated as single units. These methods represent natural extensions of the column-oriented methods that have been widely used previously. On machines with memory hierarchies, panel-oriented methods significantly increase the achieved performance over column-oriented methods …"
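As a point of reference for the column-oriented methods the abstract mentions, here is a minimal left-looking dense Cholesky sketch in pure Python (illustrative only; the paper's panel-oriented and blocked variants reorganize this loop nest to improve cache reuse, and a production code would use BLAS-level kernels):

```python
import math

def cholesky_colwise(A):
    """Column-oriented (left-looking) dense Cholesky: A = L * L^T.
    A is a symmetric positive definite matrix given as a list of lists;
    returns the lower-triangular factor L. Column j is formed by first
    subtracting the contributions of all previously computed columns."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: remove contributions of columns 0..j-1, then sqrt
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        # subdiagonal entries of column j
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```

Each inner sum re-reads earlier columns from memory; panel- and block-oriented methods exist precisely to amortize those reads across many updates.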
Fast sparse matrix factorization on modern workstations by Edward Rothberg( Book )

4 editions published in 1989 in English and held by 11 WorldCat member libraries worldwide

The performance of workstation-class machines has experienced a dramatic increase in the recent past. Relatively inexpensive machines which offer 14 MIPS and 2 MFLOPS performance are now available, and machines with even higher performance are not far off. One important characteristic of these machines is that they rely on a small amount of high-speed cache memory for their high performance. In this paper, we consider the problem of Cholesky factorization of a large sparse positive definite system of equations on a high performance workstation. We find that the major factor limiting performance is the cost of moving data between memory and the processor. We use two techniques to address this limitation: we decrease the number of memory references, and we improve cache behavior to decrease the cost of each reference. When run on benchmarks from the Harwell-Boeing Sparse Matrix Collection, the resulting factorization code is almost three times as fast as SPARSPAK on a DECStation 3100. We believe that the issues brought up in this paper will play an important role in the effective use of high performance workstations on large numerical problems.
An evaluation of left-looking, right-looking and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines by Edward Rothberg( Book )

4 editions published in 1991 in English and held by 11 WorldCat member libraries worldwide

Abstract: "In this paper we present a comprehensive analysis of the performance of a variety of sparse Cholesky factorization methods on hierarchical-memory machines. We investigate methods that vary along two different axes. Along the first axis, we consider three different high-level approaches to sparse factorization: left-looking, right-looking, and multifrontal. Along the second axis, we consider the implementation of each of these high-level approaches using different sets of primitives. The primitives vary based on the structures they manipulate. One important structure in sparse Cholesky factorization is a single column of the matrix. We first consider primitives that manipulate single columns …"
Parallel ICCG on a hierarchical memory multiprocessor : addressing the triangular solve bottleneck by Edward Rothberg( Book )

4 editions published in 1990 in English and held by 11 WorldCat member libraries worldwide

Abstract: "The incomplete Cholesky conjugate gradient (ICCG) algorithm is a commonly used iterative method for solving large sparse systems of equations. In this paper, we study the parallel solution of sparse triangular systems of equations, the most difficult aspect of implementing the ICCG method on a multiprocessor. We focus on shared-memory multiprocessor architectures with deep memory hierarchies. On such architectures we find that previously proposed parallelization approaches result in little or no speedup. The reason is that these approaches cause significant increases in the amount of memory system traffic as compared to a sequential approach. Increases of as much as a factor of 10 on four processors were observed."
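The triangular-solve bottleneck described above is commonly attacked with level scheduling, sketched below in pure Python (a standard technique given for orientation; the paper evaluates its own parallelization approaches against memory-traffic costs):

```python
def level_schedule(deps):
    """Level scheduling for a parallel sparse lower-triangular solve L x = b.
    deps[i] lists the column indices j < i where L[i][j] is nonzero, i.e.
    the rows that must be solved before row i. Rows assigned to the same
    level are mutually independent and can be solved in parallel; the
    levels themselves must execute in order."""
    level = [0] * len(deps)
    for i, d in enumerate(deps):
        # a row's level is one past the deepest row it depends on
        level[i] = 1 + max((level[j] for j in d), default=-1)
    schedule = {}
    for i, lv in enumerate(level):
        schedule.setdefault(lv, []).append(i)
    return [schedule[lv] for lv in sorted(schedule)]
```

A long dependency chain yields many small levels and little parallelism, which is exactly the regime where, per the abstract, extra memory traffic can erase any speedup.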
A comparative evaluation of nodal and supernodal parallel sparse matrix factorization : detailed simulation results by Edward Rothberg( Book )

2 editions published in 1990 in English and held by 10 WorldCat member libraries worldwide

Parallel execution of OPS5 in QLISP by Hiroshi G Okuno( Book )

3 editions published in 1987 in English and held by 9 WorldCat member libraries worldwide

Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations by Edward Rothberg( Book )

3 editions published in 1990 in English and held by 9 WorldCat member libraries worldwide

Abstract: "In this paper we look at the problem of factoring large sparse systems of equations on high-performance multiprocessor workstations. While these multiprocessor workstations are capable of very high peak floating point computation rates, most existing sparse factorization codes achieve only a small fraction of this potential. A major limiting factor is the cost of memory accesses performed during the factorization. In this paper, we describe a parallel factorization code which utilizes the supernodal structure of the matrix to reduce the number of memory references. We also propose enhancements that significantly reduce the overall cache miss rate …"
Temporal, processor, and spatial locality in multiprocessor memory references by Anant Agarwal( Book )

4 editions published between 1988 and 1989 in English and held by 8 WorldCat member libraries worldwide

The performance of cache-coherent multiprocessors is strongly influenced by locality in the memory reference behavior of parallel applications. While the notions of temporal and spatial locality in uniprocessor memory references are well understood, the corresponding notions of locality in multiprocessors and their impact on multiprocessor cache behavior are not clear. A locality model suitable for multiprocessor cache evaluation is derived by viewing memory references as streams of processor identifiers directed at specific cache/memory blocks. This viewpoint differs from the traditional uniprocessor approach that uses streams of addresses to different blocks emanating from specific processors. Our view is based on the intuition that cache coherence traffic in multiprocessors is largely determined by the number of processors accessing a location, the frequency with which they access the location, and the sequence in which their accesses occur. The specific locations accessed by each processor, the time order of access to different locations, and the size of the working set play a smaller role in determining the cache coherence traffic, although they still influence intrinsic cache performance. Looking at traces from the viewpoint of a memory block leads to a new notion of reference locality for multiprocessors, called processor locality. In this paper, we study the temporal, spatial, and processor locality in the memory reference patterns of three parallel applications. Based on the observed locality, we then reflect on the expected cache behavior of the three applications.
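The block-centric view described above can be made concrete with a small trace analysis, sketched here in pure Python (the trace format is a hypothetical one chosen for illustration, not the paper's tooling): for each memory block, measure the runs of consecutive accesses made by the same processor.

```python
def processor_locality(trace):
    """Per-block run lengths of consecutive same-processor accesses.
    trace is a list of (processor_id, block_id) pairs in program order.
    Longer average runs mean higher processor locality: a block stays in
    one processor's cache longer before coherence traffic moves it."""
    runs = {}   # block_id -> list of completed run lengths
    last = {}   # block_id -> (processor of current run, current run length)
    for p, b in trace:
        if b in last and last[b][0] == p:
            last[b] = (p, last[b][1] + 1)          # extend the current run
        else:
            if b in last:
                runs.setdefault(b, []).append(last[b][1])  # close old run
            last[b] = (p, 1)                       # start a new run
    for b, (p, n) in last.items():                 # flush unfinished runs
        runs.setdefault(b, []).append(n)
    return runs
```

A trace dominated by runs of length 1 signals poor processor locality and, by the paper's intuition, heavy coherence traffic regardless of classical temporal locality.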
An efficient block-oriented approach to parallel sparse Cholesky factorization by Edward Rothberg( Book )

1 edition published in 1992 in English and held by 8 WorldCat member libraries worldwide

Abstract: "This paper explores the use of a sub-block decomposition strategy for parallel sparse Cholesky factorization, in which the sparse matrix is decomposed into rectangular blocks. Such a strategy has enormous theoretical scalability advantages over a more traditional column-oriented decomposition for large parallel machines. However, little progress has been made in producing a practical sub-block method. This paper describes and evaluates an approach that is both simple and efficient."
Implementation of production systems on message passing computers : techniques, simulation results and analysis by Milind Tambe( Book )

2 editions published in 1989 in English and held by 7 WorldCat member libraries worldwide

Measurements on production systems by Anoop Gupta( Book )

2 editions published in 1983 in English and held by 6 WorldCat member libraries worldwide

HEXT, a hierarchical circuit extractor by Anoop Gupta( Book )

1 edition published in 1982 in English and held by 4 WorldCat member libraries worldwide

Two papers on circuit extraction by Anoop Gupta( Book )

4 editions published in 1982 in English and held by 4 WorldCat member libraries worldwide

The first paper describes the design, implementation and performance of a flat edge-based circuit extractor for NMOS circuits. The extractor is able to work on large and complex designs, it can handle arbitrary geometry, and outputs a comprehensive wirelist. Measurements show that the run time of the edge-based algorithm used is linear in the size of the circuit, with low implementation overheads. The extractor is capable of analyzing a circuit with 20,000 transistors in less than 30 minutes of CPU time on a VAX 11/780. The high performance of the extractor has changed the role that a circuit extractor plays in the design process, as it is now possible to extract a chip a number of times during the same session. The second paper describes the algorithms, implementation, and performance of a hierarchical circuit extractor for NMOS designs. The input to the circuit extractor is a description of the layout of the chip, and its output is a hierarchical wirelist describing the circuit. The extractor is divided into two parts, a front-end and a back-end. The front-end analyzes the CIF description of a layout and partitions it into a set of non-overlapping rectangular regions called windows; redundant windows are recognized and are extracted only once. The back-end analyzes each unique window found by the front-end. The back-end determines the electrical circuit represented by the window, and computes an interface that is later used to combine the window with others that are adjacent. The paper also presents a simple analysis of the expected performance of the algorithm, and the results of running the extractor on some real chip designs.
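The front-end's recognition of redundant windows can be sketched as a simple deduplication pass (pure-Python illustration with a hypothetical hashable window representation; the real extractor operates on CIF geometry):

```python
def unique_windows(windows):
    """Recognize redundant windows so each unique layout region is
    extracted only once. windows is a list of hashable window
    descriptions (hypothetical encoding of a region's geometry).
    Returns (unique, index_map) where index_map[i] points each original
    window at its representative in unique."""
    seen = {}     # window description -> index into unique
    unique = []
    for w in windows:
        if w not in seen:
            seen[w] = len(unique)
            unique.append(w)
    return unique, [seen[w] for w in windows]
```

Only the entries in `unique` need the expensive back-end analysis; repeated windows reuse the already-computed circuit and interface, which is where the hierarchical approach saves its work.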
Implementing OPS5 production systems on DADO by Anoop Gupta( Book )

1 edition published in 1984 in English and held by 4 WorldCat member libraries worldwide

Parallelism in production systems : the sources and the expected speed-up by Anoop Gupta( Book )

1 edition published in 1984 in English and held by 4 WorldCat member libraries worldwide

ACE, a circuit extractor by Anoop Gupta( Book )

1 edition published in 1982 in English and held by 3 WorldCat member libraries worldwide

Real-Time Knowledge-Based Systems by Thomas Laffey( Book )

2 editions published in 1989 in English and held by 2 WorldCat member libraries worldwide

The Common Link: Meaning-Making in Algebra and the Visual Arts by Anoop Gupta( )

1 edition published in 2012 in German and held by 0 WorldCat member libraries worldwide

Audience level: 0.64 (from 0.61 for Parallel c ... to 0.87 for The Common ...)

Alternative Names
Anoop Gupta

Languages
English (70)

German (1)

Covers
Parallelism in production systems