In this paper, we present our work on evaluating knowledge base systems (KBSs) for use in large OWL applications. To this end, we have developed the Lehigh University Benchmark (LUBM). The benchmark is intended to evaluate knowledge base systems with respect to extensional queries over a large dataset that commits to a single realistic ontology. LUBM features an OWL ontology modeling the university domain, synthetic OWL data generation that can scale to an arbitrary size, fourteen test queries representing a variety of properties, and a set of performance metrics. We describe the components of the benchmark and the rationale for its design. Based on the benchmark, we have conducted an evaluation of four KBSs. To our knowledge, no previous experiment has been conducted with data at the scale used here. The smallest dataset consists of 15 OWL files totaling 8MB, while the largest consists of 999 files totaling 583MB. We evaluated two memory-based systems (OWLJessKB and memory-based Sesame) and two systems with persistent storage (database-based Sesame and DLDB-OWL). We present the results of the experiment and discuss the performance of each system. In particular, we conclude that existing systems need to place greater emphasis on scalability.