gsbin-@... wrote:
original article: http://www.egroups.com/group/recursive-partitioning/?start=64
> We are trying to locate papers that describe the methods used in Rpart for
> n-fold cross-validation for regression trees. Any suggestions?
> Sincerely,
> Greg Binns
Have you looked at the 2 tech. reports that come with the RPART
distribution? The N-fold cross-validation for regression trees is the
same as that for classification trees (or Poisson regression/survival
trees).
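For readers who have not seen the procedure written out, here is a minimal sketch of n-fold cross-validation. It is a generic illustration and not RPART's actual code: `kfold_indices` and `cv_mse` are hypothetical names, and the "model" is just the training-set mean, i.e. a root-only regression tree.

```python
import random

def kfold_indices(n, n_folds, seed=0):
    """Shuffle 0..n-1 and deal the indices into n_folds nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cv_mse(y, n_folds, seed=0):
    """Cross-validated MSE of a root-only regression tree (predicts the training mean)."""
    folds = kfold_indices(len(y), n_folds, seed)
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        # Fit on everything except the held-out fold.
        train = [y[i] for i in range(len(y)) if i not in held]
        pred = sum(train) / len(train)
        # Score on the held-out fold only.
        sq_err += sum((y[i] - pred) ** 2 for i in held_out)
    return sq_err / len(y)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cv_mse(y, n_folds=3))
```

Replacing the mean with a tree fitted to `train` gives the usual cross-validated error; the fold bookkeeping does not depend on whether the response is numerical or categorical.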
We are trying to locate papers that describe the methods used in Rpart for
n-fold cross-validation for regression trees. Any suggestions?
Sincerely,
Greg Binns
I've been approved as an editor for the Machine Learning category of the
Open Directory Project:
http://dmoz.org/Computers/Artificial_Intelligence/Machine_Learning
Major portals and search engines that use the directories include:
* Netscape
* AltaVista
* AOL Search
* Direct Hit
* Dogpile
* EuroSeek
* HotBot
* Lycos
Please email your suggestions and site submissions to me. You can also
contribute by becoming an editor. Thank you for your attention.
--
Tjen-Sien Lim
editor@...
www.Recursive-Partitioning.com
In response to Patane's query: In developing auto-clustering algorithms
I found it was difficult to compare my results with previous results
because the labeling was always different, even when there was general
agreement in the way clustering proceeded.
I have found that the most effective test in the early development of these
algorithms is to construct 2-dimensional test patterns with various
cluster shapes and degrees of overlapping. Then you can easily validate
the algorithm with a 2-D plotting program. Once you have verified the
functionality this way you can go on to higher dimensions.
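A minimal sketch of such 2-D test patterns follows. It covers only round Gaussian clusters whose overlap is controlled by one spread parameter; `make_blobs`, the centers, and the spread values are illustrative assumptions, not anything from the original post.

```python
import random

def make_blobs(centers, spread, n_per_cluster, seed=0):
    """Return (points, labels): isotropic Gaussian clusters around the given 2-D centers."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels

centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
# Small spread: clearly separated clusters. Large spread: heavy overlap.
well_separated, labels = make_blobs(centers, spread=0.5, n_per_cluster=100)
overlapping, _ = make_blobs(centers, spread=2.0, n_per_cluster=100)
```

Because the true labels are known, the points can be colored by true label in one plot and by the algorithm's assignment in another, which sidesteps the label-permutation problem when comparing by eye.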
John Day
PhD Candidate
Florida Tech
I am a Ph.D. student at the University of Catania (Italy) and I am
developing an automatic tool for clustering. It is based on Vector
Quantization and, given the learning patterns and the desired error (for
example, in terms of Mean Squared Error), it automatically calculates the
codebook that achieves that error with the least number of codewords. I need
some previous work to make some comparisons. Can anybody help me? Thanks.
Giuseppe Patane'
The answer may be NO! I'm going to discuss the case where all attributes
are categorical. The exhaustive search algorithm (described in Breiman,
Friedman, Olshen & Stone, 1984) tends to select a categorical attribute with
many levels as the split variable. For a categorical attribute with c
levels, you need to evaluate up to 2^{c-1} - 1 possible splits. So, the
more levels the attribute has, the more likely it is to be selected as
the split variable just by chance.
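To make the count concrete, the short sketch below evaluates 2^{c-1} - 1 and enumerates the splits for a small attribute. The function names are hypothetical; each split is listed once by always keeping the first level on the left-hand side.

```python
from itertools import combinations

def n_binary_splits(c):
    """Number of distinct binary splits of a categorical attribute with c levels."""
    return 2 ** (c - 1) - 1

def binary_splits(levels):
    """Enumerate each split once by always keeping the first level on the left."""
    first, rest = levels[0], levels[1:]
    splits = []
    # k extra levels join `first`; k == len(rest) would leave the right side empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(l for l in rest if l not in extra)
            splits.append((left, right))
    return splits

for c in (2, 3, 5, 10):
    print(c, n_binary_splits(c))
```

A 2-level attribute offers a single split while a 10-level attribute offers 511, so the best of the latter has many more chances to look good on noise.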
On the other hand, CHAID and its derivatives (Kass, 1980; Hawkins & Kass,
1982; Biggs, de Ville & Suen, 1991) tend to select categorical attributes
with few levels. The algorithm penalizes categorical attributes with many
levels too severely. The adjustment proposed by Biggs et al. (1991) seems
to be the least conservative, however.
QUEST, CRUISE, and PLUS also tend to select categorical attributes with few
levels when all categorical attributes are "equally informative" with
respect to the dependent variable. This is an artifact of Pearson's
chi-square test for independence in a 2-way contingency table.
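For reference, the textbook Pearson statistic and its degrees of freedom can be computed as in the sketch below. This shows only the statistic itself; how each package turns it into a variable-selection rule is not described in this post, and `pearson_chi2` is a hypothetical helper that assumes every row and column total is positive.

```python
def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for a 2-way table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(col_tot) - 1)
    return stat, df

# Rows are classes, columns are attribute levels.
print(pearson_chi2([[30, 20], [20, 30]]))
```

With two classes, an attribute with c levels gives c - 1 degrees of freedom, so attributes with different numbers of levels are judged against different reference distributions.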
Hence, users of classification tree methods should exercise caution in
interpreting the resulting tree diagram when the categorical attributes
have varying numbers of levels. The selection bias won't occur when all categorical
attributes have the same number of levels. There won't be any serious bias
when all attributes are numerical and they have roughly comparable numbers
of distinct values.
The case of mixed attributes (numerical and categorical) is more
complicated and I haven't studied it deeply. My preliminary simulation
results (not for citation yet) can be downloaded from
http://www.recursive-partitioning.com/plus/split.pdf
Thank you for your attention. I'd welcome any discussion/comment.
--
Tjen-Sien Lim
tslim@...
www.Recursive-Partitioning.com
I've discovered that the splits produced by AnswerTree CHAID and
KnowledgeSEEKER Cluster are different on 6 out of 7 data sets I've tried.
On one data set (US voting records data), I've confirmed that AnswerTree
CHAID reports the wrong split (I'm coding my own CHAID clone). I'm
wondering if anyone on the list has compared the two software packages. Thanks.
--
Tjen-Sien Lim
tslim@...
www.Recursive-Partitioning.com
Dear all,
Could anyone on the list point me to any work that relates the number of
instances in a dataset to the size (depth, number of folds, etc.) of
the tree induced from this dataset?
Thanks for all.
********************************************************************
Iñaki Inza
Computer Sciences and Artificial Intelligence Department
University of the Basque Country
P.O. Box 649
E-20080 Donostia - San Sebastian
Basque Country
Spain
Telephone number: (+34) 943018000 (ext. 5106)
FAX number: (+34) 943219306
e-mail: ccbincai@...
http://www.sc.ehu.es/ccwbayes/inaki.htm
********************************************************************
I'm wondering if anyone on the list is aware of papers describing
applications of tree methods to pharmacogenomics. Thanks.
--
Tjen-Sien Lim
tslim@...
www.Recursive-Partitioning.com
I'm wondering if anyone on the list has an implementation of the CHAID
algorithm in C/C++/Fortran and is willing to share it. The treedisc.sas
SAS macro I'm using is just too slow for simulation. If there's no free
source code available, I'm going to try to code the algorithm from
scratch myself. Thanks.
--
Tjen-Sien Lim (608) 262-8181 (Voice)
Dept. of Statistics (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison limt@...
1210 West Dayton Street http://www.stat.wisc.edu/~limt
Madison, WI 53706
Data mining consultant jobs for banks in Frankfurt/Munich are listed at:
(1) www.siemens.de
(2) Jobs & Karriere
(3) Jobbörse
(4) Search text: "data mining"
Hans-Peter Neeb
---------------------------------------------------------
Siemens Business Services
SBS FS D CRM Data Warehouse/ Data Mining
Lyoner Strasse 27 Postfach 71 07 61
D-60528 Frankfurt D-60497 Frankfurt
Phone +49 69 6682 - 1444
Fax +49 69 6682 - 1829
Mobile 0172 - 524 99 44
E-Mail hans-peter.neeb@...
http://www.sbs.de
http://www.siemens.com/sbs/en/offerings/financial/Offerings/crm/index.html
I've run simulations comparing 4 classification tree methods in terms
of selection probability of split variable at the root node. I
consider a very simple situation with only 2 categorical covariates
and 2 classes. For each tree method, the tree is grown and then pruned
back. The variable that is selected as the split variable at the root
node is recorded for each Monte Carlo iteration, provided the tree is
not pruned all the way back to the root node.
To avoid problems with selecting the best tree by cross-validation, I
prune the trees using an independent pruning data set.
I'd be interested in hearing your opinions/comments.
Thank you for your attention.
--
Tjen-Sien Lim (608) 262-8181 (Voice)
Dept. of Statistics (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison limt@...
1210 West Dayton Street http://www.stat.wisc.edu/~limt
Madison, WI 53706
Design:
======
Size of training data file = 500
Size of pruning data file = 250
Monte Carlo iterations = 2000
2 classes
2 categorical covariates: X1 (2 levels), X2 (10 levels)
Pr(Class = 1) = Pr(Class = 2) = 0.5
Pr(Class=1 | X1=1) = c1
Pr(Class=1 | X1=2) = 1 - c1
Pr(Class=1 | X2=1) = ... = Pr(Class=1 | X2=5) = c2
Pr(Class=1 | X2=6) = ... = Pr(Class=1 | X2=10) = 1 - c2
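One way to generate data that satisfies this design is sketched below. It rests on two assumptions not stated in the post: X1 and X2 have uniform marginal distributions, and they are conditionally independent given the class. Under those assumptions Bayes' rule gives Pr(X1=1 | Class=1) = c1 and Pr(X2 in 1..5 | Class=1) = c2. The function name `simulate` is hypothetical.

```python
import random

def simulate(n, c1, c2, rng):
    """Draw n cases (cls, x1, x2) with X1 and X2 conditionally independent given the class."""
    data = []
    for _ in range(n):
        cls = 1 if rng.random() < 0.5 else 2
        p1 = c1 if cls == 1 else 1 - c1          # Pr(X1 = 1 | class)
        x1 = 1 if rng.random() < p1 else 2
        p2 = c2 if cls == 1 else 1 - c2          # Pr(X2 in 1..5 | class)
        x2 = rng.randint(1, 5) if rng.random() < p2 else rng.randint(6, 10)
        data.append((cls, x1, x2))
    return data

rng = random.Random(1)
train = simulate(500, 0.7, 0.7, rng)   # training data file
prune = simulate(250, 0.7, 0.7, rng)   # independent pruning data file
```

Setting c1 = c2 = 0.5 makes both covariates pure noise, which corresponds to the first block of the results.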
Results:
=======
                     Selection Probability
                     ===============================================
                     Exhaustive           CRUISE         PLUS
c1   c2   Covariate  Search      QUEST    (interaction)  (Option 4)
===================================================================
0.5  0.5  X1         0.0185      0.2710   0.2730         0.2805
          X2         0.5990      0.3175   0.3210         0.3335
          Root node  0.3825      0.4115   0.4060         0.3860
0.6  0.6  X1         0.4410      0.7810   0.7595         0.7815
          X2         0.5550      0.2155   0.2360         0.2160
          Root node  0.0040      0.0035   0.0045         0.0025
0.7  0.7  X1         0.4870      0.8005   0.5510         0.6140
          X2         0.5130      0.1995   0.4480         0.3815
          Root node  0.0000      0.0000   0.0010         0.0045
0.8  0.8  X1         0.5040      0.8270   0.1430         0.4335
          X2         0.4960      0.1730   0.8570         0.5665
          Root node  0.0000      0.0000   0.0000         0.0000
0.9  0.9  X1         0.5010      0.8690   0.0010         0.4625
          X2         0.4990      0.1310   0.9990         0.5375
          Root node  0.0000      0.0000   0.0000         0.0000
Conclusions:
===========
1. All methods are equally bad in failing to prune the tree all the
way back to the root node when the covariates are just noise.
2. Exhaustive search (e.g., CART(r)) favors categorical covariates
with many levels when all covariates are just noise. When the
covariates are "equally informative", exhaustive search selects
them with roughly equal probabilities.
3. QUEST favors categorical covariates with fewer levels when the
covariates are "equally informative".
4. PLUS (Option 4) also favors categorical covariates with fewer
levels but not as severely as QUEST. When the covariates have a very
strong association with the class variable, PLUS selects them with
almost equal probabilities.
5. The behavior of CRUISE (interaction detection option) is
puzzling. CRUISE favors categorical covariates with fewer levels
when the association is weak. However, it favors categorical
covariates with many levels when the association is strong.
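The noise case in conclusion 2 can be reproduced in spirit with the small sketch below. It is not the code behind the table: it uses the Gini criterion as a stand-in for exhaustive search, looks only at the root split, does no pruning, and assumes uniformly distributed covariates, so its numbers are not comparable with the tabulated ones. All function names are hypothetical.

```python
import random

def gini_gain(left, right):
    """Decrease in Gini impurity for a split; left/right are (n_class1, n_class2) counts."""
    def gini(a, b):
        n = a + b
        return 0.0 if n == 0 else 1.0 - (a / n) ** 2 - (b / n) ** 2
    nl, nr = sum(left), sum(right)
    n = nl + nr
    parent = gini(left[0] + right[0], left[1] + right[1])
    return parent - nl / n * gini(*left) - nr / n * gini(*right)

def best_gain(counts):
    """Best gain over all 2^(c-1) - 1 binary splits; counts[k] = (n1, n2) for level k."""
    c = len(counts)
    tot1 = sum(a for a, _ in counts)
    tot2 = sum(b for _, b in counts)
    best = 0.0
    for mask in range(1, 2 ** (c - 1)):      # the last level always stays on the right
        l1 = sum(counts[k][0] for k in range(c - 1) if mask >> k & 1)
        l2 = sum(counts[k][1] for k in range(c - 1) if mask >> k & 1)
        best = max(best, gini_gain((l1, l2), (tot1 - l1, tot2 - l2)))
    return best

def x2_win_rate(n=500, iters=200, seed=0):
    """Fraction of pure-noise samples in which X2 (10 levels) beats X1 (2 levels)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(iters):
        cnt1 = [[0, 0] for _ in range(2)]
        cnt2 = [[0, 0] for _ in range(10)]
        for _ in range(n):
            cls = rng.randrange(2)           # class is independent of both covariates
            cnt1[rng.randrange(2)][cls] += 1
            cnt2[rng.randrange(10)][cls] += 1
        wins += best_gain(cnt2) > best_gain(cnt1)
    return wins / iters

print(x2_win_rate())
```

X2 gets 511 candidate splits against a single one for X1, so its best split usually shows the larger impurity decrease even though neither covariate carries any information.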
I'd like to announce the availability of my Polytomous Logistic
Regression Trees software. The program is named PLUS (Polytomous
Logistic regression trees with Unbiased Splits). PLUS is freeware. It
is implemented in a set of Fortran 90 routines. The current version is
1.0 (Beta).
The program accepts numerical/continuous as well as categorical
variables. Missing covariate values are allowed. If a test data set is
available, an estimate of the misclassification error rate will be
provided.
The executables are available on the following platforms:
- Digital Alpha (Digital UNIX 4.0)
- Sun SPARCstation/Ultra (Sun Solaris 2.6)
- Pentium (Linux)
- Pentium (Windows 95/98/NT) (coming soon)
For download and further information, please visit the following site:
http://www.recursive-partitioning.com/plus/software.html
Note that the software is still only a beta version and so expect some
bugs. Comments, suggestions, and especially bug reports are most
appreciated.
Thank you for your attention.
--
Tjen-Sien Lim (608) 262-8181 (Voice)
Dept. of Statistics (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison limt@...
1210 West Dayton Street http://www.stat.wisc.edu/~limt
Madison, WI 53706
Hi
Please submit any news item you would like considered for inclusion in
the October 1999 Classification Society of North America
(http://www.pitt.edu/~csna/) newsletter.
Thanks
Bill
--
William D. Shannon, Ph.D.
Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences
Assistant Professor of Biostatistics
Division of Biostatistics
Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO 63110
Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon
Washington University School of Medicine in St. Louis
Postdoctoral Position -- Second Announcement
A postdoctoral position in biostatistics is immediately available in the
Division of General Medical Sciences, Department of Medicine, Washington
University School of Medicine in St. Louis (http://medschool.wustl.edu/). This
position involves working collaboratively with Dr. William Shannon
(http://osler.wustl.edu/~shannon) on statistical clustering and classification
analysis (50%), and providing biostatistical consulting support (50%) to
academic biomedical researchers. Research possibilities include developing new
methods for improving classification and regression tree models, application of
tree-based models to genetic epidemiology, and developing wavelet-based and
other strategies for cluster analysis of gene chip data.
The successful candidate will have a recent PhD in biostatistics (or statistics
with a focus on applications), and have some background in statistical cluster
and classification analysis. Strong computing skills and proficiency in UNIX
are necessary. This position is for 2-3 years, and salary is based on NIH
specified postdoctoral salaries.
If interested please send me a cover letter, cv, reprints, and list of
references by mail, email or fax. (Please do not have letters of reference
sent.)
Bill Shannon
--
William D. Shannon, Ph.D.
Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences
Assistant Professor of Biostatistics
Division of Biostatistics
Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO 63110
Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon
I've found a weird phenomenon with Shelby Haberman's breast cancer
survival data (can be downloaded from the UCI Machine Learning
Repository). The QUEST algorithm yields only the root node (number of
terminal nodes = 1) when I set the minimum sample size at 1. However,
when I change the minimum sample size to 5, I get a tree with 18
terminal nodes. Both trees are obtained with N-fold cross-validation
(or jackknife), 0-SE rules, proportional priors, and equal costs.
CART(r), on the other hand, yields the same tree with minimum sample
size 1 or 5. C5.0/See5 gives the same tree as CART(r) with 3 terminal
nodes. My experimental classification tree (based on chi-square tests)
also gives exactly the same tree.
Has anyone observed the same problem with other tree variants? If
minimum sample size has a big impact, then interpretation of the tree
diagram would be much more difficult.
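As a reminder of what the SE rules do, here is a minimal sketch of k-SE subtree selection from a pruning sequence. The function name and the numbers are made up for illustration; they are not the Haberman results and this is not QUEST's or CART's code.

```python
def select_subtree(cv_error, cv_se, n_leaves, k_se=0.0):
    """Index of the smallest subtree whose CV error is within k_se SEs of the minimum."""
    best = min(range(len(cv_error)), key=lambda i: cv_error[i])
    threshold = cv_error[best] + k_se * cv_se[best]
    eligible = [i for i in range(len(cv_error)) if cv_error[i] <= threshold]
    return min(eligible, key=lambda i: n_leaves[i])

# Hypothetical pruning sequence: the CV errors are nearly flat across sizes.
n_leaves = [1, 3, 8, 18]
cv_error = [0.265, 0.250, 0.255, 0.248]
cv_se = [0.020, 0.020, 0.020, 0.020]

print(n_leaves[select_subtree(cv_error, cv_se, n_leaves, k_se=0.0)])
print(n_leaves[select_subtree(cv_error, cv_se, n_leaves, k_se=1.0)])
```

When the cross-validated error curve is this flat, the 0-SE rule can jump between very different tree sizes after a small perturbation of the estimates, which is one possible reading of the behaviour described above.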
Thanks.
--
Tjen-Sien Lim (608) 262-8181 (Voice)
Dept. of Statistics (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison limt@...
1210 West Dayton Street http://www.stat.wisc.edu/~limt
Madison, WI 53706
Here are two job announcements I am posting for a colleague. If interested
please contact Dan Weaver at weaver@...
Applied Mathematician
Genomica Corporation seeks an individual with a strong background in
mathematical statistics and data analysis. This individual must have a
desire to interact with a multidisciplinary team involved in the
development of new computational and mathematical methods relating to the
analysis of gene and protein expression experiments. Other areas of
collaboration include the analysis of complex genetic disorders. The
successful candidate will be a member of the research department and will
be encouraged to publish and present original scientific results. A Ph.D.
in mathematical statistics or a related field is required.
Quantitative/Statistical Geneticist
Genomica Corporation seeks a quantitative/statistical geneticist with
significant experience in the analysis of human genetic disorders.
Must be well versed in both the theory and practice of genetic analysis.
New challenges will arise as a consequence of the availability of large
numbers of single nucleotide polymorphisms in both public and private
databases and their application to the analysis of complex genetic
disorders. The successful candidate will be a part of a multidisciplinary
team that is responsible for the development of new methods and software
for genetic analysis and for determining the applicability of third-party
algorithms to this area. The successful candidate will also be responsible
for collaborations with academic and industrial partners in the analysis of
genetic disorders. The successful candidate will be a member of the
research department and will be encouraged to publish and present original
scientific results. A Ph.D. in statistical genetics or a related field is
required.
--
William D. Shannon, Ph.D.
Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences
Assistant Professor of Biostatistics
Division of Biostatistics
Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO 63110
Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon
Machine Learning
Volume 36, Issue 3, September 1999
An Efficient Extension to Mixture Techniques for Prediction and
Decision Trees
Fernando C. Pereira, Yoram Singer
pp. 183-199
General and Efficient Multisplitting of Numerical Attributes
Tapio Elomaa, Juho Rousu
pp. 201-244
Opitz, D. and Maclin, R. (1999) "Popular Ensemble Methods: An Empirical
Study", Volume 11, pages 169-198.
Available in HTML, PDF, PostScript and compressed PostScript.
For quick access go to <http://www.jair.org/abstracts/opitz99a.html>
Abstract: An ensemble consists of a set of individually trained
classifiers (such as neural networks or decision trees) whose
predictions are combined when classifying novel instances. Previous
research has shown that an ensemble is often more accurate than any of
the single classifiers in the ensemble. Bagging (Breiman, 1996c) and
Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively
new but popular methods for producing ensembles. In this paper we
evaluate these methods on 23 data sets using both neural networks and
decision trees as our classification algorithm. Our results clearly
indicate a number of conclusions. First, while Bagging is almost
always more accurate than a single classifier, it is sometimes much
less accurate than Boosting. On the other hand, Boosting can create
ensembles that are less accurate than a single classifier --
especially when using neural networks. Analysis indicates that the
performance of the Boosting methods is dependent on the
characteristics of the data set being examined. In fact, further
results show that Boosting ensembles may overfit noisy data sets, thus
decreasing its performance. Finally, consistent with previous
studies, our work suggests that most of the gain in an ensemble's
performance comes in the first few classifiers combined; however,
relatively large gains can be seen up to 25 classifiers when Boosting
decision trees.
For those of you interested, the Census Bureau has many many
Mathematical Statistician and Survey Statistician positions at
various grades that we are trying to fill. You can go to:
http://www.census.gov/hrd/www/index.html for more general info.
HOWEVER, I HAVE SPECIFIC POSITIONS I'M SEEKING TO FILL. See below.
I'm building a team in the Administrative Records Evaluation and Linkage
Group, which I head up. Thus, I'm HIRING! I have six positions as
Mathematical or Survey Statistician at the Bureau in Suitland, MD;
pay ranges from $30,000 to $75,000 per year. We have several nationwide
projects underway, and I'm looking for persons skilled in Data
Warehousing, Data Mining, Record Linkage, Administrative records,
sampling, and/or survey coverage. I am building the staff for cutting
edge research in administrative records as only the Census Bureau can
do!
For more info, I can be reached at:
Dean H. Judson, Ph.D., Mathematical Statistician and Group Leader,
Administrative Records Evaluation and Linkage Group
Planning, Research and Evaluation Division
U.S. Bureau of the Census
Washington, DC 20233
My email is: Dean.H.Judson@...
My work phone is: 301-457-4222
Please feel free to pass this email along to anyone who might
find it appealing. THANKS!
MARS AND CART OCTOBER SEMINARS PRESENTED BY SALFORD SYSTEMS
An Introduction to Next Generation Regression Modeling with MARS™
(October 13 in San Francisco and October 18 in New York)
An Introduction to Decision-Tree Modeling with CART®
(October 19 in New York)
Advanced CART® and Tree-Hybrid Modeling Techniques
(October 20 in New York)
COURSE DESCRIPTIONS
An Introduction to Next Generation Regression Modeling with MARS™
October 13 in San Francisco and October 18 in New York
Step into the next generation of regression modeling with MARS
(Multivariate Adaptive Regression Splines)! Be the first on your block to
get answers about a new high-speed regression tool that provides superior
predictive - what is MARS? How does it work? What applications is it
best suited for? How can it help you develop more accurate models for
predicting continuous and binary outcomes?
Learn how to:
· plan and execute MARS analyses
· refine your models using key control parameters
· use MARS to improve your existing models
· hybridize MARS and CART to gain an even better fit
An Introduction to Decision-Tree Modeling with CART®
October 19 in New York
Discover the power of tree-structured modeling during this popular one-day
seminar by Dan Steinberg, one of the leading experts in decision-tree
technology and applications. This seminar is geared towards data analysts
and modelers who are interested in understanding both the conceptual and
practical basis of decision-tree methodology -- what it is, why it works,
how it has been used, and how it can help you better understand your data.
Explore the practical use and application of decision trees in solving real
world, complex modeling challenges and learn about:
· decision-tree fundamentals
· decision-tree applications
· how to build and interpret CART trees
· how to use advanced options (splitting rules, priors, costs, bagging,
ARCing)
Advanced CART® and Tree-Hybrid Modeling Techniques
October 20 in New York
Sharpen your decision-tree expertise during this one-day advanced course
for analysts with prior knowledge of tree algorithms. Using real-world
examples, learn how to:
· hybridize decision trees with logistic regression and neural nets
· combine multiple trees via bagging, boosting, and varying priors
· explore alternative splitting rules and their strengths and weaknesses
· grow and prune with misclassification costs
· emerging topics in decision trees
MORE INFO
For more information on our training courses or to download a free demo of
MARS or CART visit:
http://www.salford-systems.com
I'm wondering if someone has an electronic copy (in PostScript or
PDF) of the following paper. Our local library doesn't have the 1995
proceedings and it'll take about 2 weeks to order a copy through
interlibrary loan. Thanks.
@InProceedings{fks-eafmsdt-95,
  author = "Truxton Fulton and Simon Kasif and Steven Salzberg",
  title = "Efficient algorithms for finding multi-way splits for
           decision trees",
  booktitle = "Proc. 12th International Conference on Machine Learning",
  publisher = "Morgan Kaufmann",
  year = "1995",
  pages = "244--251",
}
Machine Learning Methods for Ecological Applications
Edited by
Alan H. Fielding
Dept. of Biological Sciences, Manchester Metropolitan University, UK
The last 25 years have seen a tremendous growth in the application of
statistical and modelling techniques to ecological problems. This
expansion has been accelerated by the increasing availability of
software, books and computing power. However, the suitability of some
of these approaches to data analysis, in a relatively knowledge-poor
discipline such as ecology, can be questioned on grounds of
appropriateness and robustness. One reason for these concerns is that
many ecological problems are at best poorly defined and most lack
algorithmic solutions. Machine learning methods offer the potential
for a different approach to these difficult problems.
One definition of machine learning is that it is concerned with
inducing knowledge from data, where the data could be patterns in a
game of chess or patterns in the species composition of natural
communities. Unfortunately ecologists have little experience of these
relatively recent and novel approaches to understanding data. This is
a problem that is made more complex because there is no simple
taxonomy of machine learning methods and there are relatively few
examples in the mainstream ecological literature to encourage
exploration.
This is the first text aimed at introducing machine learning methods
to a readership of professional ecologists. All but one of the
chapters have been written by ecologists and biologists who highlight
the application of a particular method to a particular class of
problem. Examples include the identification of species, optimal mate
choice, predicting species distributions and modelling landscape
features. A group of experienced machine learning workers, who have
become interested in environmental problems, have written a chapter
that demonstrates how machine learning methods can be used to discover
equations that describe the dynamic behaviour of ecological
systems. The final chapter reviews `real learning', offering the
potential for greater dialogue between the biological and machine
learning communities.
Kluwer Academic Publishers, Boston
Hardbound, ISBN 0-412-84190-8
August 1999, 280 pp.
NLG 255.00 / USD 125.00 / GBP 81.25
Contents and Contributors
Contributors. Preface. Acknowledgements. 1. An introduction to machine
learning methods; A. Fielding. 2. Artificial neural networks for
pattern recognition; L. Boddy, C.W. Morris. 3. Tree-based methods;
J.F. Bell. 4. Genetic Algorithms I; J.N.R. Jeffers. 5. Genetic
Algorithms II; D.R.B. Stockwell. 6. Cellular automata;
D. Dunkerley. 7. Equation discovery with ecological applications;
S. Džeroski, et al. 8. How should accuracy be measured?
A. Fielding. 9. Real learning; B. Stevens-Wood. Author Index.
Subject Index.
The following new article in Machine Learning could be of interest.
===
UI - 215PC-0003
AU - Merz CJ
TI - Using correspondence analysis to combine classifiers
SO - Machine Learning 1999 Jul;36(1-2):33-58
MH - Classification
MH - Correspondence analysis
MH - Multiple models
MH - Combining estimates
MH - Algorithm
AB - Several effective methods have been developed recently for improving
predictive performance by generating and combining multiple learned
models. The general approach is to create a set of learned models
either by applying an algorithm repeatedly to different versions of
the training data, or by applying different learning algorithms to
the same data. The predictions of the models are then combined
according to a voting scheme. This paper focuses on the task of
combining the predictions of a set of learned models. The method
described uses the strategies of stacking and Correspondence
Analysis to model the relationship between the learning examples and
their classification by a collection of learned models. A nearest
neighbor method is then applied within the resulting representation
to classify previously unseen examples. The new algorithm does not
perform worse than, and frequently performs significantly better
than other combining techniques on a suite of data sets.
[References: 37]
PT - Article
I'd like to hear your opinions/comments/discussions on the following
issue. One selling point of tree-structured methods is that they can
"provide insight and understanding into the predictive structure of
the data" (quote from Breiman, Friedman, Olshen & Stone, 1984). Tree
methods enable us to understand our data better by looking at the tree
diagram and studying the split variables and split points.
Skeptics want to know what guarantee there is that the
explanation/theory/story provided by the tree diagram is neither wrong nor
misleading. What can we tell skeptics to assure them that the tree
diagram is useful?
One way is to simulate a tree model and generate some simulated data
from the true tree model. If the tree method can estimate the true
model reasonably well, then perhaps we can convince
skeptics. (Professor Douglas Hawkins has done this kind of
simulation.) What else can we do?
Thanks.
--
Tjen-Sien Lim (608) 262-8181 (Voice)
Dept. of Statistics (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison limt@...
1210 West Dayton Street http://www.stat.wisc.edu/~limt
Madison, WI 53706
I'd like to solicit information about the causes of missing values based
on your experiences analyzing data. Some examples that I've
encountered or can think up are:
1. In a survey, missing values can mean that the respondents:
- refuse to answer
- don't know
- don't remember
2. Missing values can mean "Not Applicable" (or a skip pattern). For
example:
- in medical sciences, a particular blood test is not ordered by
the physician and hence the value for the test is "missing"
- only men can experience prostate problems
- only women can get pregnant
3. In agricultural sciences, the plants/crops or animals could die
during the course of the experiment and hence the "yield" variable
would be "missing".
4. In an ecological study, the cows chew the plastic bags used to trap
insects on the field and hence no insect data could be obtained
from those bags.
Other examples? Please respond to me directly and I'll summarize to
the list. Thank you.
--
Tjen-Sien Lim (608) 262-8181 (Voice)
Dept. of Statistics (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison limt@...
1210 West Dayton Street http://www.stat.wisc.edu/~limt
Madison, WI 53706
http://osler.wustl.edu/~shannon/csna/news.latest.html
Hi,
The above link is to the most recent issue of the Classification Society of
North America's (CSNA) Newsletter which may be of interest to some readers. If
you are unfamiliar with the CSNA I encourage you to take a look at their
homepage: http://www.pitt.edu/~csna/
Bill Shannon
--
William D. Shannon, Ph.D.
Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences
Assistant Professor of Biostatistics
Division of Biostatistics
Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO 63110
Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon