From 1990 to 1993, a series of tests on the numerical reliability of data analysis systems was carried out. The tests are based on L. Wilkinson's "Statistics Quiz". Systems under test included BMDP, Data Desk, Excel, GLIM, ISP, SAS, SPSS, S-PLUS and STATGRAPHICS. The results show considerable problems even in basic features of well-known systems. For all our test exercises, the computational solutions are well known. The omissions and failures observed here raise suspicions about what happens in less well-understood problem areas of computational statistics. We cannot take the results of data analysis systems at face value, but have to submit them to a large amount of informed inspection. Quality awareness still needs improvement.
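To give the flavour of such tests, here is a small Python sketch (ours, not taken from the test suite itself) of the kind of numerical fragility these benchmarks probe: the textbook one-pass variance formula cancels catastrophically for large, nearly equal values, while the centred two-pass formula stays exact.

    import numpy as np

    # Large, nearly equal values: exactly representable in float64,
    # but deadly for the one-pass shortcut sum(x^2) - (sum x)^2 / n.
    x = np.array([1e15 + 1, 1e15 + 2, 1e15 + 3])
    n = x.size

    one_pass = (np.sum(x * x) - np.sum(x) ** 2 / n) / (n - 1)
    two_pass = np.sum((x - np.mean(x)) ** 2) / (n - 1)

    print(one_pass)   # dominated by rounding error, far from the truth
    print(two_pass)   # 1.0, the exact sample variance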
NetWork is an experiment in distributed computing. The idea is to make use of idle time on personal workstations while retaining their advantages of immediate and guaranteed availability. NetWork wants to make use of otherwise idle resources only. The performance criterion of NetWork is the net work done per unit time, not computing time or other measures of resource utilization. The NetWork model provides corresponding programming primitives for distributed computing. An implementation of a distributed asynchronous neural net serves as a test application.
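The programming model can be sketched as follows (a Python analogy of ours, not the original Macintosh implementation): independent work packets go out to volunteer workers, any of which may vanish at any moment; unanswered packets are simply re-posted, so a lost worker costs retransmission, never correctness, and only otherwise idle capacity is relied upon.

    import queue
    import random
    import threading
    import time

    def flaky_worker(inbox, outbox):
        # A volunteer workstation: it serves packets while idle, but its
        # owner may reclaim it at any time, silently dropping the packet.
        while True:
            item = inbox.get()
            if item is None:                  # shutdown signal
                return
            tid, x = item
            if random.random() < 0.3:         # machine reclaimed: packet lost
                continue
            outbox.put((tid, x * x))          # the unit of "net work" done

    def run(jobs, n_workers=4):
        inbox, outbox = queue.Queue(), queue.Queue()
        for _ in range(n_workers):
            threading.Thread(target=flaky_worker, args=(inbox, outbox),
                             daemon=True).start()
        pending = dict(enumerate(jobs))
        results = {}
        while pending:
            for tid, x in pending.items():    # (re-)post unanswered packets
                inbox.put((tid, x))
            time.sleep(0.1)                   # give the workers one round
            while not outbox.empty():
                tid, y = outbox.get()
                results[tid] = y
                pending.pop(tid, None)
        for _ in range(n_workers):
            inbox.put(None)                   # shut the workers down
        return [results[i] for i in range(len(jobs))]

    print(run([1, 2, 3, 4, 5]))               # -> [1, 4, 9, 16, 25]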
Recent changes in software technology have opened new possibilities for statistical computing. Conditions for creating efficient and reliable extensible systems have been greatly improved by programming languages and systems which provide dynamic loading and type safety across module boundaries, even at run time. We introduce Voyager, an extensible data analysis system based on Oberon, which tries to exploit some of these possibilities.
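The mechanism can be illustrated outside Oberon as well. The following Python sketch (an analogy of ours, with a hypothetical plugin name) shows the two ingredients alluded to above: loading a named module while the system is running, and verifying the interface it is expected to provide. Oberon performs such checks type-safely at load time; the Python analogy can only test them at run time.

    import importlib

    def load_plugin(name):
        # Load a module by name while the system is running...
        mod = importlib.import_module(name)
        # ...and verify the interface we expect before handing it out.
        if not callable(getattr(mod, "analyze", None)):
            raise TypeError(f"module {name!r} does not provide analyze(data)")
        return mod

    # Hypothetical use: load_plugin("myplots").analyze(data) would pull
    # in an extension compiled long after the core system shipped.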
How do we draw a distribution on the line? We give a survey of some well-known and some recent proposals to present such a distribution, based on sample data. We claim: a diagnostic plot is only as good as the hard statistical theory supporting it. To make this precise, one has to ask for the underlying functionals, study their stochastic behaviour, and ask for the natural metrics associated with a plot. We try to illustrate this point of view with some examples.
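One classical instance of this point of view (our illustration, not necessarily one of the examples in the paper) is the empirical distribution function: the underlying functional is F itself, the natural metric is sup|F_n - F|, and the Dvoretzky-Kiefer-Wolfowitz inequality turns that metric into a simultaneous confidence band one can actually draw.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.sort(rng.normal(size=100))
    n = x.size
    Fn = np.arange(1, n + 1) / n

    # DKW: P(sup|F_n - F| > eps) <= 2 exp(-2 n eps^2), so a simultaneous
    # 1 - alpha band has half-width eps = sqrt(log(2/alpha) / (2n)).
    eps = np.sqrt(np.log(2 / 0.05) / (2 * n))

    plt.step(x, Fn, where="post", label="$F_n$")
    plt.step(x, np.clip(Fn + eps, 0, 1), where="post", ls="--",
             label="95% DKW band")
    plt.step(x, np.clip(Fn - eps, 0, 1), where="post", ls="--")
    plt.legend()
    plt.show()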
If we want software that can be adapted to our needs in the long run, extensibility is a key requirement. For a long time, extensibility has been in conflict with stability and/or efficiency. This situation has changed with recent software technologies. The tools provided by software technology, however, must be complemented by a design which exploits their facilities for extensibility. We illustrate this using Voyager, a portable data analysis system based on Oberon.
The excess mass approach is a general approach to statistical analysis. It can be used to formulate a probabilistic model for clustering and can be applied to the analysis of multi-modality. Intuitively, a mode is present where an excess of probability mass is concentrated. This intuitive idea can be formalized directly by means of the excess mass functional. There is no need for intervening steps such as an initial density estimation. The excess mass measures the local difference between a given distribution and a reference model, usually the uniform distribution. It defines a functional which can be estimated efficiently from the data and can be used to test for multi-modality.
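In the standard formulation (a sketch of the definitions in the usual notation, not a quotation from the paper): for a distribution F, a level λ > 0 and a class of candidate sets, the excess mass is

\[
  E_{\mathcal{C}}(\lambda)
    = \sup_{C \in \mathcal{C}} \bigl\{\, F(C) - \lambda \, \mathrm{Leb}(C) \,\bigr\},
  \qquad \lambda > 0 .
\]

If F has a density f and the class contains all measurable sets, this equals \(\int (f(x) - \lambda)_+ \, dx\), the mass of f above a horizontal cut at height λ. On the line one restricts the class to unions of at most k intervals, plugs in the empirical distribution F_n to obtain statistics E_{n,k}(λ), and uses

\[
  \Delta_n = \max_{\lambda > 0} \bigl( E_{n,2}(\lambda) - E_{n,1}(\lambda) \bigr)
\]

as a test statistic: a large Δ_n means that two intervals capture substantially more excess mass than one interval can, which is evidence for more than one mode.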
We identify some of the requirements for document integration of software components in statistical computing, and try to give a general idea of how to cope with them in an implementation.
Bertin's permutation matrices are simple and effective tools for the graphical analysis of data matrices or tables. We discuss some abstractions which help in understanding Bertin's strategies and can be used in an interactive system.
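One such abstraction: the information in the table is invariant under row and column permutations, so the analyst's task is to choose permutations that make structure visible. A minimal Python sketch (one possible automation; Bertin's own procedure was manual and graphical) using a common seriation heuristic:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random((8, 6))
    X[:4, :3] += 2.0                       # hidden block to be revealed

    # Order rows and columns by the leading singular vectors; any
    # criterion that yields permutations would serve the same purpose.
    u, s, vt = np.linalg.svd(X - X.mean())
    rows = np.argsort(u[:, 0])
    cols = np.argsort(vt[0])

    print(np.round(X[np.ix_(rows, cols)], 1))   # block is now contiguous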
An overview of the production process of cDNA microarrays on glass chips for gene expression analysis.
Demonstration of NetWork, a system for asynchronous distributed computing in a non-guaranteed environment.
An introduction to integrated documents in statistics. Integrated documents allow the seamless integration of interactive statistics and data analysis components in 'live' documents while keeping the full computational power needed for simulation or resampling.