Kung, H. T.
Overview
Works: 107 works in 299 publications in 2 languages and 2,827 library holdings

Genres: Conference papers and proceedings
Roles: Author, Editor, Author of introduction
Publication Timeline
Most widely held works by H. T. Kung

Traffic management for high-speed networks by H. T. Kung
10 editions published in 1997 in English and held by 2,035 WorldCat member libraries worldwide

VLSI systems and computations by H. T. Kung (Book)
20 editions published in 1981 in English and Italian and held by 376 WorldCat member libraries worldwide
The papers in this book were presented at the CMU Conference on VLSI Systems and Computations, held October 19-21, 1981 in Pittsburgh, Pennsylvania. The conference was organized by the Computer Science Department, Carnegie-Mellon University, and was partially supported by the National Science Foundation and the Office of Naval Research. These proceedings focus on the theory and design of computational systems using VLSI. Until very recently, integrated-circuit research and development were concentrated in the device physics and fabrication design disciplines and in the integrated-circuit industry itself. Within the last few years, a community of researchers has been growing to address issues closer to computer science: the relationship between computing structures and the physical structures that implement them; the specification and verification of computational processes implemented in VLSI; the use of massively parallel computing made possible by VLSI; the design of special-purpose computing architectures; and the changes in general-purpose computer architecture that VLSI makes possible. It is likely that the future exploitation of VLSI technology depends as much on structural and design innovations as on advances in fabrication technology. The book is divided into nine sections: Invited Papers. Six distinguished researchers from industry and academia presented invited papers. Models of Computation. The papers in this section deal with abstracting the properties of VLSI circuits into models that can be used to analyze the chip area, time or energy required for a particular computation

An optimality theory of concurrency control for databases by H. T. Kung (Book)
11 editions published between 1979 and 1980 in English and held by 19 WorldCat member libraries worldwide
A concurrency control mechanism (or a scheduler) is the component of a database system that safeguards the consistency of the database in the presence of interleaved accesses and update requests. We formally show that the performance of a scheduler, i.e., the amount of parallelism that it supports, depends explicitly upon the amount of information that is available to the scheduler. We point out that most previous work on concurrency control is simply concerned with specific points of the basic tradeoff between performance and information. In fact, several of these approaches are shown to be optimal for the amount of information that they use. (Author)

Why systolic architectures? by H. T. Kung (Book)
4 editions published between 1981 and 1982 in English and Undetermined and held by 10 WorldCat member libraries worldwide

A systolic algorithm for integer GCD computation by R. P. Brent (Book)
8 editions published between 1982 and 1984 in English and held by 10 WorldCat member libraries worldwide
Abstract: "We show that the greatest common divisor of two n-bit integers (given in the usual binary representation) can be computed in time O(n) on a linear array of O(n) identical systolic cells, each of which is a finite-state machine with connections to its nearest neighbours."
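The systolic design builds on the binary GCD algorithm, whose steps (parity tests, shifts, subtractions) are local enough for finite-state cells. A minimal sequential sketch of binary GCD for reference; this is the underlying arithmetic, not the systolic array itself:

```python
def binary_gcd(a, b):
    """Binary GCD: uses only shifts, parity tests, and subtraction --
    the kind of bit-level steps a systolic cell can perform locally."""
    if a == 0:
        return b
    if b == 0:
        return a
    shift = 0
    while (a | b) & 1 == 0:      # factor out common powers of 2
        a >>= 1
        b >>= 1
        shift += 1
    while a & 1 == 0:            # make a odd
        a >>= 1
    while b != 0:
        while b & 1 == 0:        # discard factors of 2 in b (not common)
            b >>= 1
        if a > b:
            a, b = b, a          # keep a <= b
        b -= a                   # classic subtractive step
    return a << shift            # restore the common power of 2
```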

The area-time complexity of binary multiplication by R. P. Brent (Book)
5 editions published in 1979 in English and held by 9 WorldCat member libraries worldwide
We consider the problem of performing multiplication of n-bit binary numbers on a chip. Let A denote the chip area, and T the time required to perform multiplication. Using a model of computation which is a realistic approximation to current and anticipated VLSI technology, we show that (A/A₀)(T/T₀)^(2α) ≥ n^(1+α) for all α ∈ (0, 1), where A₀ and T₀ are positive constants which depend on the technology but are independent of n. The exponent 1+α is the best possible. A consequence is that binary multiplication is 'harder' than binary addition if AT^(2α) is used as a complexity measure for any α ≥ 0. (Author)

An efficient parallel garbage collection system and its correctness proof by H. T. Kung (Book)
3 editions published in 1977 in English and held by 9 WorldCat member libraries worldwide
An efficient system to perform garbage collection in parallel with list operations is proposed and its correctness is proven. The system consists of two independent processes sharing a common memory. One process is performed by the list processor (LP) for list processing and the other by the garbage collector (GC) for marking active nodes and collecting garbage nodes. The system is derived by using both correctness and efficiency arguments. Assuming that memory references are indivisible, the system satisfies the following properties: No critical sections are needed in the entire system. The time to perform the marking phase by the GC is independent of the size of memory, but depends only on the number of active nodes. Nodes on the free list need not be marked during the marking phase by the GC. Minimal overhead is introduced to the LP. Only two extra bits for encoding four colors are needed for each node. Efficiency results show that the parallel system is usually significantly more efficient in terms of storage and time than the sequential stack algorithm. (Author)
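The coloring idea can be illustrated with a sequential sketch. The node layout (car/cdr fields, one color per node) is hypothetical, and only three of the four colors are shown; the actual system interleaves these phases with the list processor, which is where the fourth color and the correctness proof matter:

```python
# Illustrative sketch only: a sequential mark/sweep over cons cells.
WHITE, GRAY, BLACK = 0, 1, 2   # the paper encodes four colors in two bits

def mark(nodes, roots):
    """Marking phase: cost proportional to the number of active
    (reachable) nodes, not to the size of memory."""
    stack = list(roots)
    for r in roots:
        nodes[r]["color"] = GRAY
    while stack:
        n = stack.pop()
        nodes[n]["color"] = BLACK
        for child in (nodes[n]["car"], nodes[n]["cdr"]):
            if child is not None and nodes[child]["color"] == WHITE:
                nodes[child]["color"] = GRAY
                stack.append(child)

def sweep(nodes):
    """Collect unmarked (white) nodes as garbage; reset marks."""
    free = []
    for i, nd in enumerate(nodes):
        if nd["color"] == WHITE:
            free.append(i)
        else:
            nd["color"] = WHITE
    return free
```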

All algebraic functions can be computed fast by H. T. Kung (Book)
5 editions published in 1976 in English and Undetermined and held by 9 WorldCat member libraries worldwide
The expansions of algebraic functions can be computed 'fast' using the Newton Polygon Process and any 'normal' iteration. Let M(j) be the number of operations sufficient to multiply two j-th degree polynomials. It is shown that the first N terms of an expansion of any algebraic function defined by an n-th degree polynomial can be computed in O(n M(N)) operations, while the classical method needs O(N^n) operations. Among the numerous applications of algebraic functions are symbolic mathematics and combinatorial analysis. Reversion, reciprocation, and n-th root of a polynomial are all special cases of algebraic functions
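The speedup comes from Newton-style iterations in which the number of correct series terms doubles per step. A sketch for the simplest special case, power-series reciprocation (float coefficients; the helper names are illustrative, not from the paper):

```python
def series_mul(a, b, n):
    """Product of power series a and b, truncated to n terms."""
    return [sum(a[i] * b[k - i]
                for i in range(k + 1)
                if i < len(a) and k - i < len(b))
            for k in range(n)]

def series_reciprocal(f, n):
    """First n terms of 1/f via Newton's iteration g <- g*(2 - f*g).
    Each step doubles the number of correct terms, so the total cost
    is dominated by the final multiplications: O(M(n))."""
    assert f[0] != 0
    g = [1.0 / f[0]]
    prec = 1
    while prec < n:
        prec = min(2 * prec, n)
        fg = series_mul(f, g, prec)
        t = [(2.0 if k == 0 else 0.0) - fg[k] for k in range(prec)]
        g = series_mul(g, t, prec)
    return g
```

For example, the reciprocal of 1 - x is the geometric series 1 + x + x² + ..., so every computed coefficient should be 1.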

Comprehensive evaluation of a two-dimensional configurable array by Universidade de São Paulo (Book)
4 editions published between 1989 and 1990 in English and Undetermined and held by 9 WorldCat member libraries worldwide
Abstract: "This paper presents the evaluation of a highly configurable architecture for two-dimensional (2D) arrays of powerful processors. The evaluation is based on an array of Warp cells, a powerful processor developed at Carnegie Mellon and manufactured by General Electric, and uses real application programs. The evaluation covers the areas of configurability, array survivability, and performance degradation. The software and algorithms developed for the evaluation are also discussed. The results, based on simulations of small and medium-size arrays (up to 16×16), show that a high degree of configurability and array survivability can be achieved with little impact on program performance."

Let's design algorithms for VLSI systems by H. T. Kung (Book)
3 editions published in 1979 in English and held by 9 WorldCat member libraries worldwide

I/O complexity: the red-blue pebble game by Jiawei Hong (Book)
3 editions published in 1981 in English and held by 8 WorldCat member libraries worldwide
In this paper, the red-blue pebble game is proposed to model the input-output complexity of algorithms. Using the pebble game formulation, a number of lower bound results for the I/O (input/output) requirement are proven. For example, it is shown that to perform the n-point FFT (or the ordinary n×n matrix multiplication algorithm) with a device of O(S) memory, at least Ω(n log n / log S) (or Ω(n³/√S), respectively) time is needed for the I/O. Similar results are obtained for algorithms for several other problems. All of the lower bounds presented are the best possible in the sense that they are achievable by certain decomposition schemes. The results in this paper provide insight into the difficult task of balancing I/O and computation in special-purpose system design. For example, for the n-point FFT, the I/O lower bound implies that an S-point device achieving a speedup ratio O(log S) over the conventional O(n log n) implementation is all that one can hope for
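The matching upper bound for matrix multiplication comes from a blocked decomposition: with b×b blocks and b ≈ √S, total traffic scales as n³/√S. A sketch that counts element loads under an idealized model (assuming b divides n and one resident C block per pass; this is an illustration of the decomposition scheme, not the paper's proof):

```python
def blocked_matmul_loads(n, b):
    """Element loads from slow memory for a blocked n×n matrix multiply
    with b×b blocks (b must divide n). Each of the (n/b)^3 block-level
    multiply-adds loads one block of A and one of B; each block of C is
    loaded once and kept resident across its inner k-loop."""
    nb = n // b
    loads_ab = 2 * nb**3 * b * b   # = 2*n^3/b, the dominant term
    loads_c = nb * nb * b * b      # = n^2
    return loads_ab + loads_c
```

Doubling the block size (i.e., quadrupling the fast memory S) halves the dominant term, consistent with the Ω(n³/√S) bound.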

Parallel algorithms for solving triangular linear systems with small parallelism by L. Hyafil (Book)
6 editions published between 1974 and 1975 in English and held by 8 WorldCat member libraries worldwide
The problem of solving triangular linear systems of size n on a parallel computer with small parallelism is considered. Assume that the time is measured by the number of parallel steps of any arithmetic operation. It is shown that the problem can be done in time n²/k + O(n) with k processors for k ≤ O(n), in O(n^(2-r) log n) with O(n^r) processors for 1 < r < 3/2, and in O(n^(1-r/3) log² n) with O(n^r) processors for 3/2 ≤ r ≤ 3. The results are obtained by using two principles of reducing parallel algorithms with large parallelism to parallel algorithms with small parallelism. The two principles for the reduction are expected to be useful for other problems
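The n²/k + O(n) schedule distributes the inner products of ordinary forward substitution over k processors. A sequential reference implementation of the underlying computation (the parallel variant splits each sum across the processors):

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular matrix L with nonzero
    diagonal. The inner product at step i has i terms; summing it
    with k processors takes about i/k steps, giving the n^2/k + O(n)
    total parallel time."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))  # parallelizable sum
        x[i] = (b[i] - s) / L[i][i]
    return x
```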

Systolic algorithms for the CMU Warp processor by H. T. Kung (Book)
4 editions published in 1984 in English and held by 8 WorldCat member libraries worldwide
The prototype has 10 cells, each of which is capable of performing 10 million floating-point operations per second (10 MFLOPS) and is built on a single board using only off-the-shelf components. This 10-cell processor, for example, can process 1024-point complex FFTs at a rate of one FFT every 600 μs. Under program control, the same processor can perform many other primitive computations in signal, image and vision processing, including two-dimensional convolution and complex matrix multiplication, at a rate of 100 MFLOPS. Together with another processor capable of performing divisions and square roots, the processor can also efficiently carry out a number of difficult matrix operations such as solving covariance linear systems, a crucial computation in real-time adaptive signal processing. This paper outlines the architecture of the Warp processor and describes how the signal processing tasks are implemented on the processor.

Fault-tolerance and two-level pipelining in VLSI systolic arrays by H. T. Kung (Book)
4 editions published in 1983 in English and held by 8 WorldCat member libraries worldwide
This paper addresses two important issues in systolic array designs: fault-tolerance and two-level pipelining. The proposed 'systolic' fault-tolerant scheme maintains the original data flow pattern by bypassing defective cells with a few registers. As a result, many of the desirable properties of systolic arrays (such as local and regular communication between cells) are preserved. Two-level pipelining refers to the use of pipelined functional units in the implementation of systolic cells. This paper addresses the problem of efficiently utilizing pipelined units to increase the overall system throughput. We show that both of these problems can be reduced to the same mathematical problem of incorporating extra delays on certain data paths in originally correct systolic designs. We introduce the mathematical notion of a cut which enables us to handle this problem effectively. The results obtained by applying the techniques described in this paper are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of failures (with the addition of very little hardware) while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. However, adding delays to systolic arrays with cycles typically induces a significant decrease in throughput. In response to this, we have derived a new class of systolic algorithms in which the data cycle around a ring of processing cells

Synchronized and asynchronous parallel algorithms for multiprocessors by H. T. Kung (Book)
3 editions published in 1976 in English and held by 8 WorldCat member libraries worldwide
Parallel algorithms for multiprocessors are classified into synchronized and asynchronous algorithms. Important characteristics with respect to the design and analysis of the two types of algorithms are identified and discussed. Several examples of the two types of algorithms are considered in depth

The complexity of parallel evaluation of linear recurrences by L. Hyafil (Book)
6 editions published between 1974 and 1975 in English and held by 7 WorldCat member libraries worldwide
The concept of computers such as C.mmp and ILLIAC IV is to achieve computational speedup by performing several operations simultaneously with parallel processors. This type of computer organization is referred to as a parallel computer. In this paper, the authors prove upper bounds on speedups achievable by parallel computers for a particular problem, the solution of first-order linear recurrences. The authors consider this problem because it is important in practice and also because it is simply stated, so that one might obtain some insight into the nature of parallel computation by studying it
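The structural fact behind parallel evaluation of a first-order recurrence x_i = a_i·x_{i-1} + b_i is that each step is an affine map and composition of affine maps is associative, so the prefix compositions can be computed by a parallel scan in O(log n) steps. A sketch of this reformulation (folded sequentially here for clarity):

```python
def combine(f, g):
    """Compose affine maps: (combine(f, g))(x) = g(f(x)).
    Associativity of this operation is what enables a parallel scan."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def recurrence_prefix(coeffs, x0):
    """Evaluate x_i = a_i*x_{i-1} + b_i for coeffs = [(a_1,b_1), ...]
    via prefix compositions of affine maps; x_i = prefix_i(x0)."""
    prefixes = []
    acc = (1, 0)                 # identity map
    for f in coeffs:
        acc = combine(acc, f)    # a parallel scan computes all of these
        prefixes.append(acc)
    return [a * x0 + b for a, b in prefixes]
```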

Special-purpose devices for signal and image processing: an opportunity in VLSI by H. T. Kung (Book)
3 editions published in 1980 in English and held by 7 WorldCat member libraries worldwide
Based on the systolic array approach, new designs of special-purpose devices for filtering, correlation, convolution, and discrete Fourier transform are proposed and discussed. It is argued that because of the high degrees of simplicity, regularity and concurrency inherent in these designs, their VLSI implementation will be cost effective. (Author)
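A systolic convolution array keeps one filter weight per cell and streams inputs and partial sums past them; functionally, the array computes an ordinary FIR filter. A functional model of the computation (the cell-level data movement is only indicated in comments, not simulated):

```python
def systolic_fir(x, w):
    """Functional model of a weight-stationary systolic FIR array:
    computes y[i] = sum_j w[j] * x[i-j]. In the array, each j below
    corresponds to one cell holding weight w[j], with inputs and
    partial sums pulsing through the cells one step per clock tick."""
    k = len(w)
    y = []
    for i in range(len(x)):
        s = 0
        for j in range(k):            # cell j contributes w[j]*x[i-j]
            if 0 <= i - j < len(x):
                s += w[j] * x[i - j]
        y.append(s)
    return y
```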

MISE, Machine for In-System Evaluation of custom VLSI chips by Roberto Bisiani (Book)
4 editions published in 1982 in English and held by 7 WorldCat member libraries worldwide
This paper identifies some of the key research problems that one encounters in specifying, designing, testing and demonstrating a custom chip in relation to the application system in which it will be used, and proposes a system called MISE (Machine for In-System Evaluation) as a solution to the issues raised

Systolic (VLSI) arrays for relational database operations by H. T. Kung (Book)
2 editions published in 1980 in English and held by 7 WorldCat member libraries worldwide

Numerically stable solution of dense systems of linear equations using mesh-connected processors by A. Bojanczyk (Book)
2 editions published in 1981 in English and held by 7 WorldCat member libraries worldwide
Related Identities
 United States Office of Naval Research
 National Research Council (U.S.)
 United States Air Force Office of Scientific Research
 Sproull, Robert F.
 Steele, Guy L., 1954-
 McKay, Gordon
 National Academy of Sciences
 Harvard University
 Commission on Physical Sciences, Mathematics, and Applications
 Division on Engineering and Physical Sciences
Associated Subjects
Algebraic functions; Algorithms; Array processors; Binary system (Mathematics); Computational complexity; Computer architecture; Computer programming; Computer programming--Management; Computers--Access control; Computers--Circuits; Database management; Data transmission systems; Debugging in computer science; Digital electronics; Electronics; Engineering; Fault-tolerant computing; Graph theory; Image processing; Integrated circuits; Integrated circuits--Large scale integration; Integrated circuits--Very large scale integration; Linear programming; Microelectronics; Microprocessors; Multiprocessors; Parallel processing (Electronic computers); Polynomials; Signal processing; Systolic array circuits; Telecommunication; Telecommunication--Traffic--Management