recursive-partitioning · Recursive Partitioning
#65 From: "Tjen-Sien Lim" <tslim@...>
Date: Sat Feb 5, 2000 3:35 am
Subject: [RP] Re: Rpart questions
 
gsbin-@... wrote:
original article: http://www.egroups.com/group/recursive-partitioning/?start=64
> We are trying to locate papers that describe the methods used in Rpart for
> n-fold cross-validation for regression trees.  Any suggestions?
> Sincerely,
> Greg Binns

Have you looked at the two technical reports that come with the RPART
distribution? The N-fold cross-validation for regression trees is the
same as that for classification trees (or Poisson regression/survival
trees).

#64 From: GSBInc@...
Date: Wed Feb 2, 2000 12:13 pm
Subject: [RP] Rpart questions
 
We are trying to locate papers that describe the methods used in Rpart for
n-fold cross-validation for regression trees.  Any suggestions?
Sincerely,
Greg Binns

#63 From: "Tjen-Sien Lim" <tslim@...>
Date: Wed Feb 2, 2000 2:03 am
Subject: [RP] FYI: dmoz.org Open Directory Project -- Machine Learning
 
I've been approved as an editor for the Machine Learning category of
the Open Directory Project

      http://dmoz.org/Computers/Artificial_Intelligence/Machine_Learning

Major portals and search engines that use the directories include:

* Netscape
* AltaVista
* AOL Search
* Direct Hit
* Dogpile
* EuroSeek
* HotBot
* Lycos

Please email your suggestions and site submissions to me. You can also
contribute by becoming an editor. Thank you for your attention.

--
Tjen-Sien Lim
editor@...
www.Recursive-Partitioning.com

#62 From: "John Day" <jday@...>
Date: Wed Jan 26, 2000 3:25 pm
Subject: [RP] Autoclustering
 
In response to Patane's query: in developing auto-clustering algorithms,
I found it difficult to compare my results with previous results because
the labeling was always different, even when there was general agreement
in the way clustering proceeded.
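
One common way around the labeling problem is to compare two clusterings pair by pair instead of label by label. The sketch below computes the Rand index, which is not something the post itself proposes; the function name and the toy labelings are made up for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair "agrees" when both labelings put the two points in the same
    cluster, or both put them in different clusters. The actual label
    values never matter, only who is grouped with whom.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Same grouping, completely different label names.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = ["c", "c", "a", "a", "b", "b"]
print(rand_index(run_1, run_2))  # 1.0
```

The pairwise loop is quadratic in the number of points, which is fine for small test patterns but would need a contingency-table formulation for large data sets.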

I have found that the most effective test in the early development of these
algorithms is to construct 2-dimensional test patterns with various
cluster shapes and degrees of overlap. You can then easily validate
the algorithm with a 2-D plotting program. Once you have verified the
functionality this way, you can go on to higher dimensions.
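
A minimal sketch of such test patterns, assuming Gaussian blobs and a noisy ring as the shapes; the function names, centers, and spreads are illustrative choices, not taken from the post. The overlap is controlled by the spread relative to the distance between centers.

```python
import math
import random

def make_blobs_2d(centers, spread, n_per_cluster, seed=0):
    """Gaussian blobs around the given 2-D centers; larger spread means more overlap."""
    rng = random.Random(seed)
    points, labels = [], []
    for k, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(k)
    return points, labels

def make_ring_2d(center, radius, noise, n, seed=0):
    """Points scattered around a circle, a shape that centroid-based methods handle poorly."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.gauss(0.0, noise)
        points.append((center[0] + r * math.cos(angle),
                       center[1] + r * math.sin(angle)))
    return points

centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
well_separated, true_labels = make_blobs_2d(centers, spread=1.0, n_per_cluster=50)
overlapping, _ = make_blobs_2d(centers, spread=4.0, n_per_cluster=50)
ring = make_ring_2d((5.0, 3.0), radius=12.0, noise=0.5, n=100)
```

The returned point lists can be written to a file or passed to any 2-D plotting program, with the true labels kept aside for checking the clustering result.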

John Day
PhD Candidate
Florida Tech

#61 From: "Giuseppe Patane'" <gpatane@...>
Date: Tue Jan 25, 2000 7:14 pm
Subject: [RP] Automatic clustering
 
I am a Ph.D. student at the University of Catania (Italy) and I am
developing an automatic tool for clustering. It is based on Vector
Quantization: given the learning patterns and the desired error (for
example, in terms of Mean Squared Error), it automatically calculates the
codebook that achieves that error with the fewest codewords. I need
some previous work to compare against. Can anybody help me? Thanks.
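
For comparison purposes, a simple baseline for this problem can be sketched as follows: run the generalized Lloyd (k-means) iteration for an increasing number of codewords and stop at the first codebook whose MSE meets the target. This is not the tool described above, and because Lloyd's iteration is a heuristic, the result is not guaranteed to be the true minimum codebook size. All names are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(codebook, p):
    """Index of the codeword closest to pattern p."""
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k], p))

def lloyd(patterns, k, iterations=25, seed=0):
    """Generalized Lloyd iteration started from k randomly chosen patterns."""
    rng = random.Random(seed)
    codebook = rng.sample(patterns, k)
    for _ in range(iterations):
        cells = [[] for _ in range(k)]
        for p in patterns:
            cells[nearest(codebook, p)].append(p)
        for idx, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[idx] = tuple(sum(c) / len(cell) for c in zip(*cell))
    return codebook

def mean_squared_error(patterns, codebook):
    total = sum(squared_distance(p, codebook[nearest(codebook, p)]) for p in patterns)
    return total / len(patterns)

def grow_codebook(patterns, target_mse, max_codewords=None):
    """Try 1, 2, 3, ... codewords and return the first codebook meeting the target."""
    limit = max_codewords or len(patterns)
    for k in range(1, limit + 1):
        codebook = lloyd(patterns, k)
        if mean_squared_error(patterns, codebook) <= target_mse:
            return codebook
    return None  # target not reached within the allowed codebook size

patterns = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
codebook = grow_codebook(patterns, target_mse=0.01)
print(len(codebook))  # 2
```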

Giuseppe Patane'

#60 From: Tjen-Sien Lim <tslim@...>
Date: Mon Jan 10, 2000 5:18 am
Subject: [RP] Can you trust the splits produced by classification trees?
 
The answer may be NO! I'm going to discuss the case where all attributes
are categorical. The exhaustive search algorithm (described in Breiman,
Friedman, Olshen & Stone, 1984) tends to select a categorical attribute with
many levels as the split variable. For a categorical attribute with c
levels, you need to evaluate up to 2^{c-1} - 1 possible splits. So, the
more levels an attribute has, the more likely it is to be selected as
the split variable just by chance.
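
The count 2^{c-1} - 1 can be checked by enumeration. In the sketch below (function names are illustrative), the first level is pinned to the left branch so that a split and its mirror image are counted once, and the split that leaves the right branch empty is excluded.

```python
from itertools import combinations

def count_binary_splits(c):
    """Number of distinct binary splits of a categorical attribute with c levels."""
    return 2 ** (c - 1) - 1

def enumerate_binary_splits(levels):
    """List every (left, right) partition of the levels into two non-empty groups."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # size == len(rest) would put every level on the left, so it is skipped.
    for size in range(len(rest)):
        for subset in combinations(rest, size):
            left = (first,) + subset
            right = tuple(lvl for lvl in rest if lvl not in subset)
            splits.append((left, right))
    return splits

print(count_binary_splits(2))    # 1
print(count_binary_splits(10))   # 511
print(enumerate_binary_splits("abc"))
# [(('a',), ('b', 'c')), (('a', 'b'), ('c',)), (('a', 'c'), ('b',))]
```

A 10-level attribute therefore gets 511 chances to look good on a given sample, against a single chance for a 2-level attribute.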

On the other hand, CHAID and its derivatives (Kass, 1980; Hawkins & Kass,
1982; Biggs, de Ville & Suen, 1991) tend to select categorical attributes
with few levels. The algorithm penalizes categorical attributes with many
levels too severely. The adjustment proposed by Biggs et al. (1991) seems
to be the least conservative, however.

QUEST, CRUISE, and PLUS also tend to select categorical attributes with few
levels when all categorical attributes are "equally informative" with
respect to the dependent variable. This is an artifact of Pearson's
chi-square test for independence in a 2-way contingency table.

Hence, users of classification tree methods should exercise caution in
interpreting the resulting tree diagram when the categorical attributes
have varying numbers of levels. The selection bias won't occur when all categorical
attributes have the same number of levels. There won't be any serious bias
when all attributes are numerical and they have roughly comparable numbers
of distinct values.

The case of mixed attributes (numerical and categorical) is more
complicated and I haven't studied it deeply. My preliminary simulation
results (not for citation yet) can be downloaded from

    http://www.recursive-partitioning.com/plus/split.pdf

Thank you for your attention. I'd welcome any discussion/comment.

--
Tjen-Sien Lim
tslim@...
www.Recursive-Partitioning.com

#59 From: Tjen-Sien Lim <tslim@...>
Date: Mon Dec 13, 1999 6:05 am
Subject: [RP] SPSS AnswerTree CHAID vs. KnowledgeSEEKER Cluster
 
I've discovered that the splits produced by AnswerTree CHAID and
KnowledgeSEEKER Cluster are different on 6 out of 7 data sets I've tried.
On one data set (US voting records data), I've confirmed that AnswerTree
CHAID reports the wrong split (I'm coding my own CHAID clone). I'm
wondering if anyone on the list has compared the two software packages. Thanks.

--
Tjen-Sien Lim
tslim@...
www.Recursive-Partitioning.com

#58 From: Innaki Inza Cano <ccbincai@...>
Date: Thu Dec 9, 1999 10:39 am
Subject: [RP] Number of instances and tree size
 
Dear all,

Could anyone on the list point me to any work that relates the number of
instances in a dataset to the size (depth, number of folds, etc.) of
the tree induced from this dataset?

Thanks to all.

********************************************************************
Iñaki Inza
Computer Sciences and Artificial Intelligence Department
University of the Basque Country
P.O. Box 649
E-20080 Donostia - San Sebastian
Basque Country
Spain

Telephone number: (+34) 943018000 (ext. 5106)
FAX number: (+34) 943219306
e-mail: ccbincai@...
http://www.sc.ehu.es/ccwbayes/inaki.htm
********************************************************************

#57 From: Tjen-Sien Lim <tslim@...>
Date: Fri Dec 11, 1998 12:54 am
Subject: [RP] Pharmacogenomics?
 
I'm wondering if anyone on the list is aware of papers describing
applications of tree methods to pharmacogenomics. Thanks.

--
Tjen-Sien Lim
tslim@...
www.Recursive-Partitioning.com

#56 From: "Tjen-Sien Lim" <limt@...>
Date: Mon Nov 29, 1999 10:25 pm
Subject: [RP] CHAID code?
 
I'm wondering if anyone on the list has an implementation of the CHAID
algorithm in C/C++/Fortran and is willing to share it. The treedisc.sas
SAS macro I'm using is just too slow for simulation. If there's no free
source code available, I'm going to try to code the algorithm from
scratch myself. Thanks.

--
Tjen-Sien Lim                (608) 262-8181 (Voice)
Dept. of Statistics          (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison   limt@...
1210 West Dayton Street      http://www.stat.wisc.edu/~limt
Madison, WI 53706

#55 From: hans-peter.neeb@...
Date: Fri Nov 19, 1999 11:57 am
Subject: [RP] JOB: Data Mining at SIEMENS
 
Data mining consultant positions for banks in Frankfurt/Munich can be found at:

    (1) www.siemens.de
    (2) Jobs & Karriere (Jobs & Careers)
    (3) Jobbörse (Job Market)
    (4) Search text: "data mining"

    Hans-Peter Neeb

    ---------------------------------------------------------

    Siemens Business Services
    SBS FS D CRM Data Warehouse/ Data Mining
    Lyoner Strasse 27 Postfach 71 07 61
    D-60528 Frankfurt D-60497 Frankfurt

    Phone +49 69 6682 - 1444
    Fax +49 69 6682 - 1829
    Mobile 0172 - 524 99 44

    E-Mail hans-peter.neeb@...

    http://www.sbs.de
    http://www.siemens.com/sbs/en/offerings/financial/Offerings/crm/index.html

#54 From: Tjen-Sien Lim <limt@...>
Date: Wed Nov 17, 1999 2:17 am
Subject: [RP] Bias in split variable selection
 
I've run simulations comparing 4 classification tree methods in terms
of selection probability of split variable at the root node. I
consider a very simple situation with only 2 categorical covariates
and 2 classes. For each tree method, the tree is grown and then pruned
back. The variable that is selected as the split variable at the root
node is recorded for each Monte Carlo iteration, provided the tree is
not pruned all the way back to the root node.

To avoid problems with selecting the best tree by cross-validation, I
prune the trees using an independent pruning data set.

I'd be interested in hearing your opinions/comments.

Thank you for your attention.

--
Tjen-Sien Lim                (608) 262-8181 (Voice)
Dept. of Statistics          (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison   limt@...
1210 West Dayton Street      http://www.stat.wisc.edu/~limt
Madison, WI 53706



Design:
======

Size of training data file = 500
Size of pruning data file = 250
Monte Carlo iterations = 2000

2 classes
2 categorical covariates: X1 (2 levels), X2 (10 levels)

Pr(Class = 1) = Pr(Class = 2) = 0.5

Pr(Class=1 | X1=1) = c1
Pr(Class=1 | X1=2) = 1 - c1

Pr(Class=1 | X2=1) = ... = Pr(Class=1 | X2=5) = c2
Pr(Class=1 | X2=6) = ... = Pr(Class=1 | X2=10) = 1 - c2


Results:
=======

                                     Selection Probability
                         ================================================
                         Exhaustive               CRUISE          PLUS
 c1    c2    Covariate     Search      QUEST   (interaction)   (Option 4)
=========================================================================

0.5   0.5   X1             0.0185     0.2710      0.2730         0.2805
            X2             0.5990     0.3175      0.3210         0.3335
            Root node      0.3825     0.4115      0.4060         0.3860

0.6   0.6   X1             0.4410     0.7810      0.7595         0.7815
            X2             0.5550     0.2155      0.2360         0.2160
            Root node      0.0040     0.0035      0.0045         0.0025

0.7   0.7   X1             0.4870     0.8005      0.5510         0.6140
            X2             0.5130     0.1995      0.4480         0.3815
            Root node      0.0000     0.0000      0.0010         0.0045

0.8   0.8   X1             0.5040     0.8270      0.1430         0.4335
            X2             0.4960     0.1730      0.8570         0.5665
            Root node      0.0000     0.0000      0.0000         0.0000

0.9   0.9   X1             0.5010     0.8690      0.0010         0.4625
            X2             0.4990     0.1310      0.9990         0.5375
            Root node      0.0000     0.0000      0.0000         0.0000



Conclusions:
===========

1. All methods are equally bad in failing to prune the tree all the
    way back to the root node when the covariates are just noise.

2. Exhaustive search (e.g., CART(r)) favors categorical covariates
    with many levels when all covariates are just noise. When the
    covariates are "equally informative", exhaustive search selects
    them with roughly equal probabilities.

3. QUEST favors categorical covariates with fewer levels when the
    covariates are "equally informative".

4. PLUS (Option 4) also favors categorical covariates with fewer
    levels, but not as severely as QUEST. When the covariates have a very
    strong association with the class variable, PLUS selects them with
    almost equal probabilities.

5. The behavior of CRUISE (interaction detection option) is
    puzzling. CRUISE favors categorical covariates with fewer levels
    when the association is weak. However, it favors categorical
    covariates with many levels when the association is strong.
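
The exhaustive-search column can be approximated with a short Monte Carlo sketch. It makes several assumptions not stated in the post: X1 and X2 are drawn independently given the class, the splitting criterion is the Gini decrease, and no pruning is done, so the "Root node" row has no counterpart here. The sample size and iteration count are also reduced to keep the run short, and all function names are hypothetical.

```python
import random
from itertools import combinations

def simulate_sample(n, c1, c2, rng):
    """Draw (class, x1, x2) rows matching the conditional probabilities in the design."""
    rows = []
    for _ in range(n):
        cls = 1 if rng.random() < 0.5 else 2
        p1 = c1 if cls == 1 else 1 - c1   # Pr(X1 = 1 | class)
        p2 = c2 if cls == 1 else 1 - c2   # Pr(X2 in 1..5 | class)
        x1 = 1 if rng.random() < p1 else 2
        x2 = rng.randint(1, 5) if rng.random() < p2 else rng.randint(6, 10)
        rows.append((cls, x1, x2))
    return rows

def gini(n1, n2):
    n = n1 + n2
    if n == 0:
        return 0.0
    p = n1 / n
    return 2 * p * (1 - p)

def best_gini_gain(rows, column, levels):
    """Largest Gini decrease over all 2^(c-1) - 1 binary splits of one covariate."""
    counts = {lvl: [0, 0] for lvl in levels}
    for row in rows:
        counts[row[column]][row[0] - 1] += 1
    tot1 = sum(c[0] for c in counts.values())
    tot2 = sum(c[1] for c in counts.values())
    n = tot1 + tot2
    parent = gini(tot1, tot2)
    best = 0.0
    first, rest = levels[0], levels[1:]
    for size in range(len(rest)):
        for subset in combinations(rest, size):
            left = (first,) + subset
            l1 = sum(counts[lvl][0] for lvl in left)
            l2 = sum(counts[lvl][1] for lvl in left)
            r1, r2 = tot1 - l1, tot2 - l2
            gain = (parent
                    - ((l1 + l2) / n) * gini(l1, l2)
                    - ((r1 + r2) / n) * gini(r1, r2))
            best = max(best, gain)
    return best

def root_selection_counts(c1, c2, n=200, iterations=100, seed=1):
    """Count how often each covariate gives the best root split."""
    rng = random.Random(seed)
    wins = {"X1": 0, "X2": 0}
    for _ in range(iterations):
        rows = simulate_sample(n, c1, c2, rng)
        g1 = best_gini_gain(rows, 1, [1, 2])
        g2 = best_gini_gain(rows, 2, list(range(1, 11)))
        wins["X1" if g1 >= g2 else "X2"] += 1
    return wins

noise_wins = root_selection_counts(0.5, 0.5)
print(noise_wins)
```

With c1 = c2 = 0.5 both covariates are pure noise, so any preference for X2 in the printed counts comes from the number of candidate splits alone (511 for X2 against 1 for X1).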

#53 From: Tjen-Sien Lim <limt@...>
Date: Sat Oct 30, 1999 10:02 pm
Subject: [RP] ANN: Polytomous Logistic Regression Trees Version 1.0 (Beta)
 
I'd like to announce the availability of my Polytomous Logistic
Regression Trees software. The program is named PLUS (Polytomous
Logistic regression trees with Unbiased Splits). PLUS is freeware. It
is implemented in a set of Fortran 90 routines. The current version is
1.0 (Beta).

The program accepts numerical/continuous as well as categorical
variables. Missing covariate values are allowed. If a test data set is
available, an estimate of the misclassification error rate will be
provided.

The executables are available on the following platforms:

    - Digital Alpha (Digital UNIX 4.0)
    - Sun SPARCstation/Ultra (Sun Solaris 2.6)
    - Pentium (Linux)
    - Pentium (Windows 95/98/NT) (coming soon)

For download and further information, please visit the following site:

    http://www.recursive-partitioning.com/plus/software.html

Note that the software is still only a beta version, so expect some
bugs. Comments, suggestions, and especially bug reports are most
appreciated.

Thank you for your attention.

--
Tjen-Sien Lim                (608) 262-8181 (Voice)
Dept. of Statistics          (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison   limt@...
1210 West Dayton Street      http://www.stat.wisc.edu/~limt
Madison, WI 53706

#52 From: William Shannon <shannon@...>
Date: Fri Oct 15, 1999 4:22 pm
Subject: [RP] CSNA Newsletter
 
Hi

Please submit any news item you would like considered for inclusion in
the October 1999 Classification Society of North America
(http://www.pitt.edu/~csna/) newsletter.

Thanks
Bill

--
William D. Shannon, Ph.D.

Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences

Assistant Professor of Biostatistics
Division of Biostatistics

Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO   63110

Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon

#51 From: William Shannon <shannon@...>
Date: Fri Oct 15, 1999 4:46 pm
Subject: [RP] Postdoc Position
 
Washington University School of Medicine in St.  Louis

Postdoctoral Position -- Second Announcement


A postdoctoral position in biostatistics is immediately available in the
Division of General Medical Sciences, Department of Medicine, Washington
University School of Medicine in St.  Louis (http://medschool.wustl.edu/).  This
position involves working collaboratively with Dr.  William Shannon
(http://osler.wustl.edu/~shannon) on statistical clustering and classification
analysis (50%), and providing biostatistical consulting support (50%) to
academic biomedical researchers.  Research possibilities include developing new
methods for improving classification and regression tree models, application of
tree-based models to genetic epidemiology, and developing wavelet-based and
other strategies for cluster analysis of gene chip data.

The successful candidate will have a recent PhD in biostatistics (or statistics
with a focus on applications), and have some background in statistical cluster
and classification analysis.  Strong computing skills and proficiency in UNIX
are necessary.  This position is for 2-3 years, and salary is based on NIH
specified postdoctoral salaries.

If interested please send me a cover letter, cv, reprints, and list of
references by mail, email or fax.  (Please do not have letters of reference
sent.)


Bill Shannon

--
William D. Shannon, Ph.D.

Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences

Assistant Professor of Biostatistics
Division of Biostatistics

Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO   63110

Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon

#50 From: Tjen-Sien Lim <limt@...>
Date: Mon Oct 11, 1999 12:58 am
Subject: [RP] effect of minimal sample size in a node
 
I've found a weird phenomenon with Shelby Haberman's breast cancer
survival data (can be downloaded from the UCI Machine Learning
Repository). The QUEST algorithm yields only the root node (number of
terminal nodes = 1) when I set the minimum sample size at 1. However,
when I change the minimum sample size to 5, I get a tree with 18
terminal nodes. Both trees are obtained with N-fold cross-validation
(or jackknife), 0-SE rules, proportional priors, and equal costs.
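
For readers unfamiliar with the k-SE rule: among the pruned subtrees, it picks the smallest one whose cross-validated error is within k standard errors of the minimum, so the 0-SE rule simply takes the minimum-error tree. The sketch below uses made-up error estimates, not the actual Haberman results, to show how the selected tree can jump when the estimates are close together.

```python
def select_by_se_rule(candidates, k=0.0):
    """Pick a subtree by the k-SE rule.

    candidates: list of (n_terminal_nodes, cv_error, cv_standard_error).
    Returns the smallest tree whose CV error is at most
    (minimum CV error) + k * (standard error of that minimum).
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + k * best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    return min(eligible, key=lambda c: c[0])

# Hypothetical cross-validation estimates for three pruned subtrees.
candidates = [
    (1, 0.265, 0.025),
    (3, 0.250, 0.024),
    (18, 0.245, 0.024),
]
print(select_by_se_rule(candidates, k=0.0))  # (18, 0.245, 0.024)
print(select_by_se_rule(candidates, k=1.0))  # (1, 0.265, 0.025)
```

When the error curve is this flat, a small change in the estimates (for example from a different minimum node size) is enough to move the 0-SE choice between very different tree sizes.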

CART(r), on the other hand, yields the same tree with minimum sample
size 1 or 5. C5.0/See5 gives the same tree as CART(r) with 3 terminal
nodes. My experimental classification tree (based on chi-square tests)
also gives exactly the same tree.

Has anyone observed the same problem with other tree variants? If
minimum sample size has a big impact, then interpretation of the tree
diagram would be much more difficult.

Thanks.

--
Tjen-Sien Lim                (608) 262-8181 (Voice)
Dept. of Statistics          (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison   limt@...
1210 West Dayton Street      http://www.stat.wisc.edu/~limt
Madison, WI 53706

#49 From: William Shannon <shannon@...>
Date: Tue Oct 5, 1999 2:56 pm
Subject: [RP] jobs
 
Here are two job announcements I am posting for a colleague. If interested
please contact Dan Weaver at weaver@...


Applied Mathematician

Genomica Corporation seeks an individual with a strong background in
mathematical statistics and data analysis. This individual must have a
desire to interact with a multidisciplinary team involved in the
development of new computational and mathematical methods relating to the
analysis of gene and protein expression experiments. Other areas of
collaboration include the analysis of complex genetic disorders. The
successful candidate will be a member of the research department and will
be encouraged to publish and present original scientific results. A Ph.D.
in mathematical statistics or a related field is required.


Quantitative/Statistical Geneticist

Genomica Corporation seeks a quantitative/statistical geneticist with
significant experience in the analysis of human genetic disorders.
Must be well versed in both the theory and practice of genetic analysis.
New challenges will arise as a consequence of the availability of large
numbers of single nucleotide polymorphisms in both public and private
databases and their application to the analysis of complex genetic
disorders. The successful candidate will be a part of a multidisciplinary
team that is responsible for the development of new methods and software
for genetic analysis and for determining the applicability of third-party
algorithms to this area. The successful candidate will also be responsible
for collaborations with academic and industrial partners in the analysis of
genetic disorders. The successful candidate will be a member of the
research department and will be encouraged to publish and present original
scientific results. A Ph.D. in statistical genetics or a related field is
required.


--
William D. Shannon, Ph.D.

Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences

Assistant Professor of Biostatistics
Division of Biostatistics

Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO   63110

Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon

#48 From: Tjen-Sien Lim <limt@...>
Date: Mon Oct 4, 1999 3:37 am
Subject: [RP] more on boosting/bagging (classification trees)
 
I've computed the improvement (%) over the error rate of CART(r),
C5.0, and RIPPER by boosting, bagging, or arcing. The data sets used
in the experiment and other results are available at

    http://www.recursive-partitioning.com/mv.html

Negative numbers mean that boosting/bagging/arcing yields a less
accurate classifier.

It appears that the proprietary boosting algorithm in C5.0 is more
"robust" than the bagging or arcing algorithms implemented in CART(r).


            ARCed      Bagged       Boosted        Boosted     Boosted
Data     CART(r)     CART(r)     C5.0 Tree     C5.0 Rules      RIPPER
=====================================================================

adt       -9.15        -6.34         -0.67          -2.05        4.24
att       -1.27        -0.51         -0.25           0.25        2.99
ban       43.25        33.44         28.57          29.84       18.30
bcw       41.30        21.67         42.08          39.01       24.89
bio       15.24        15.24         -3.60          10.07       -2.74
bld       16.62        15.80         16.33          14.93        4.36
bos        6.97         8.61         18.11          18.78       14.86
bpr       19.81        30.35         31.01          30.68       37.78
cmc       -8.55        -6.80          2.77           4.37       10.66
crx        5.37         7.38         10.34          16.44        4.79
der       23.04        30.65         67.65          63.65       50.00
dna       18.81        20.00         36.62          40.36       30.96
ech        2.24         1.68          8.18           4.30        7.29
edu        0.93         2.55         -0.22           0.69        0.00
hab      -27.17       -12.45          0.00          -1.42       -1.20
hco      -22.49         5.33         -2.50           1.23       -5.61
hea        5.88         4.52         30.63          25.21       23.81
hep       22.75        24.89         22.67           9.95       -0.55
hin       -7.53        -3.94         11.26           8.90        0.00
hur       -1.32        -3.95         -0.58           9.94        7.69
hyp      -82.94       -34.80        -20.82         -12.38      -30.01
imp       41.20        21.03         26.03          31.60       52.27
inf      -34.55       -26.82          2.19          14.97      -13.33
lbw      -15.85       -10.98          4.58          14.76       12.74
led       -2.66         1.33          4.87           7.08        0.00
pid       -1.63         2.86          8.24           2.32       -4.38
sat       35.86        30.92         44.59          32.22       29.25
seg       42.83        41.15         45.19          57.92       55.46
smo      -26.23       -21.97        -11.80          -8.20        0.00
tae      -19.09       -13.39         -7.84          12.81       29.44
usn       14.70        13.26         15.73          13.84       -0.36
veh       23.13        21.17         16.90          23.75       15.48
vot       -7.94        13.61         50.54          28.60      -50.12
wav       39.31        36.90         40.53          37.23       34.08


Mean       4.44         7.72         15.80          17.11       10.68
Median     3.81         6.35         10.80          14.30        6.04
Min      -82.94       -34.80        -20.82         -12.38      -50.12
Max       43.25        41.15         67.65          63.65       55.46
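
The post does not spell out how the improvement is computed. A natural reading, assumed here, is the relative reduction in error rate, 100 * (base - ensemble) / base. The sketch below uses hypothetical error rates, not values from the experiment.

```python
from statistics import mean, median

def relative_improvement(base_error, ensemble_error):
    """Percent reduction in error rate relative to the single classifier (assumed definition)."""
    return 100.0 * (base_error - ensemble_error) / base_error

# Hypothetical (single classifier, ensemble) error rates for three data sets.
error_pairs = [(0.20, 0.15), (0.10, 0.12), (0.30, 0.30)]
improvements = [relative_improvement(b, e) for b, e in error_pairs]

print([round(v, 2) for v in improvements])  # [25.0, -20.0, 0.0]
print(round(mean(improvements), 2), round(median(improvements), 2))
```

Under this definition a data set with a very small base error rate can produce a large negative percentage from a small absolute change, which is worth keeping in mind when reading the Min row.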

#47 From: Tjen-Sien Lim <limt@...>
Date: Fri Sep 17, 1999 3:31 am
Subject: [RP] FYI: 2 new trees papers
 
Machine Learning
Volume 36, Issue 3, September 1999


An Efficient Extension to Mixture Techniques for Prediction and
Decision Trees
Fernando C. Pereira, Yoram Singer
pp. 183-199

General and Efficient Multisplitting of Numerical Attributes
Tapio Elomaa, Juho Rousu
pp. 201-244

#46 From: Tjen-Sien Lim <limt@...>
Date: Thu Sep 9, 1999 7:35 pm
Subject: [RP] FYI: Popular Ensemble Methods: An Empirical Study (JAIR)
 
Opitz, D. and Maclin, R. (1999)  "Popular Ensemble Methods: An Empirical
    Study", Journal of Artificial Intelligence Research, Volume 11, pages 169-198.
    Available in HTML, PDF, PostScript and compressed PostScript.

    For quick access go to <http://www.jair.org/abstracts/opitz99a.html>

    Abstract: An ensemble consists of a set of individually trained
    classifiers (such as neural networks or decision trees) whose
    predictions are combined when classifying novel instances.  Previous
    research has shown that an ensemble is often more accurate than any of
    the single classifiers in the ensemble.  Bagging (Breiman, 1996c) and
    Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively
    new but popular methods for producing ensembles.  In this paper we
    evaluate these methods on 23 data sets using both neural networks and
    decision trees as our classification algorithm.  Our results clearly
    indicate a number of conclusions.  First, while Bagging is almost
    always more accurate than a single classifier, it is sometimes much
    less accurate than Boosting.  On the other hand, Boosting can create
    ensembles that are less accurate than a single classifier --
    especially when using neural networks.  Analysis indicates that the
    performance of the Boosting methods is dependent on the
    characteristics of the data set being examined.  In fact, further
    results show that Boosting ensembles may overfit noisy data sets, thus
    decreasing its performance.  Finally, consistent with previous
    studies, our work suggests that most of the gain in an ensemble's
    performance comes in the first few classifiers combined; however,
    relatively large gains can be seen up to 25 classifiers when Boosting
    decision trees.
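
As a reminder of what Bagging does mechanically, here is a minimal sketch: each base classifier is trained on a bootstrap resample and predictions are combined by majority vote. The base learner is a one-dimensional threshold stump chosen only to keep the example short; the paper uses neural networks and decision trees. All names and the toy data are illustrative.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D threshold rule: predict `left` if x <= threshold, else `right`."""
    best = None
    thresholds = sorted({x for x, _ in sample})
    labels = sorted({y for _, y in sample})
    for threshold in thresholds:
        for left in labels:
            for right in labels:
                errors = sum((left if x <= threshold else right) != y
                             for x, y in sample)
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(data, n_models=25, seed=0):
    """Train n_models stumps on bootstrap resamples; combine them by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_stump(bootstrap))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Toy data: class 0 for x in 0..9, class 1 for x in 10..19.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
predict = bagging(data)
print(predict(0), predict(19))
```

Boosting differs in that later classifiers are trained on reweighted data that emphasizes the examples earlier classifiers got wrong, which is where the sensitivity to noise discussed in the abstract comes from.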

#45 From: dean_h_judson@...
Date: Wed Sep 8, 1999 6:49 pm
Subject: [RP] Job opp for math. and comp. statisticians
 
For those of you interested, the Census Bureau has many many
Mathematical Statistician and Survey Statistician positions at
various grades that we are trying to fill.  You can go to:

http://www.census.gov/hrd/www/index.html for more general info.
HOWEVER, I HAVE SPECIFIC POSITIONS I'M SEEKING TO FILL. See below.

I'm building a team in the Administrative Records Evaluation and
Linkage Group, which I head up.  Thus, I'm HIRING! I have six positions as
Mathematical or Survey Statistician at the Bureau in Suitland, MD,
pay ranges from $30,000 to $75,000 per year.  We have several nationwide
projects underway, and I'm looking for persons skilled in Data
Warehousing, Data Mining, Record Linkage, Administrative records,
sampling, and/or survey coverage.  I am building the staff for cutting
edge research in administrative records as only the Census Bureau can
do!

For more info, I can be reached at:
Dean H. Judson, Ph.D.,  Mathematical Statistician and Group Leader,
Administrative Records Evaluation and Linkage Group
Planning, Research and Evaluation Division
U.S. Bureau of the Census
Washington, DC 20233

My email is: Dean.H.Judson@...
My work phone is: 301-457-4222

Please feel free to pass this email along to anyone who might
find it appealing.  THANKS!

#44 From: Kerry Martin <kerry@...>
Date: Tue Sep 7, 1999 11:09 pm
Subject: [RP] Upcoming MARS and CART One-Day Seminars by Salford Systems
 
MARS AND CART OCTOBER SEMINARS PRESENTED BY SALFORD SYSTEMS

An Introduction to Next Generation Regression Modeling with MARS™
(October 13 in San Francisco and October 18 in New York)

An Introduction to Decision-Tree Modeling with CART®
(October 19 in New York)

Advanced CART® and Tree-Hybrid Modeling Techniques
(October 20 in New York)


COURSE DESCRIPTIONS

An Introduction to Next Generation Regression Modeling with MARS™
October 13 in San Francisco and October 18 in New York

Step into the next generation of regression modeling with MARS
(Multivariate Adaptive Regression Splines)!  Be the first on your block to
get answers about a new high-speed regression tool that provides superior
predictive accuracy: what is MARS?  How does it work?   What applications is it
best suited for?  How can it help you develop more accurate models for
predicting continuous and binary outcomes?

Learn how to:
· plan and execute MARS analyses
· refine your models using key control parameters
· use MARS to improve your existing models
· hybridize MARS and CART to gain an even better fit

An Introduction to Decision-Tree Modeling with CART®
October 19 in New York

Discover the power of tree-structured modeling during this popular one-day
seminar by Dan Steinberg, one of the leading experts in decision-tree
technology and applications.  This seminar is geared towards data analysts
and modelers who are interested in understanding both the conceptual and
practical basis of decision-tree methodology -- what it is, why it works,
how it has been used, and how it can help you better understand your data.

Explore the practical use and application of decision trees in solving real
world, complex modeling challenges and learn about:
· decision-tree fundamentals
· decision-tree applications
· how to build and interpret CART trees
· how to use advanced options (splitting rules, priors, costs, bagging, ARCing)

Advanced CART® and Tree-Hybrid Modeling Techniques
October 20 in New York

Sharpen your decision-tree expertise during this one-day advanced course
for analysts with prior knowledge of tree algorithms.   Using real-world
examples, learn how to:
· hybridize decision trees with logistic regression and neural nets
· combine multiple trees via bagging, boosting, and varying priors
· explore alternative splitting rules and their strengths and weaknesses
· grow and prune with misclassification costs
· survey emerging topics in decision trees

MORE INFO

For more information on our training courses or to download a free demo of
MARS or CART visit:

http://www.salford-systems.com

#43 From: Tjen-Sien Lim <limt@...>
Date: Tue Sep 7, 1999 2:04 am
Subject: [RP] looking for a paper
 
I'm wondering if someone has an electronic copy (in PostScript or
PDF) of the following paper. Our local library doesn't have the 1995
proceedings and it'll take about 2 weeks to order a copy through
interlibrary loan. Thanks.


@InProceedings{fks-eafmsdt-95,
   author =       "Truxton Fulton and Simon Kasif and Steven Salzberg",
   title =        "Efficient algorithms for finding multi-way splits for
                  decision trees",
   booktitle =    "Proc. 12th International Conference on Machine
                  Learning",
   publisher =    "Morgan Kaufmann",
   year =         "1995",
   pages =        "244--251",
}

#42 From: Tjen-Sien Lim <limt@...>
Date: Thu Aug 26, 1999 6:29 pm
Subject: [RP] New book: Machine Learning Methods for Ecological Applications
 
Machine Learning Methods for Ecological Applications


Edited by
Alan H. Fielding
Dept. of Biological Sciences, Manchester Metropolitan University, UK


The last 25 years have seen a tremendous growth in the application of
statistical and modelling techniques to ecological problems. This
expansion has been accelerated by the increasing availability of
software, books and computing power. However, the suitability of some
of these approaches to data analysis, in a relatively knowledge-poor
discipline such as ecology, can be questioned on grounds of
appropriateness and robustness. One reason for these concerns is that
many ecological problems are at best poorly defined and most lack
algorithmic solutions. Machine learning methods offer the potential
for a different approach to these difficult problems.

One definition of machine learning is that it is concerned with
inducing knowledge from data, where the data could be patterns in a
game of chess or patterns in the species composition of natural
communities. Unfortunately ecologists have little experience of these
relatively recent and novel approaches to understanding data. This is
a problem that is made more complex because there is no simple
taxonomy of machine learning methods and there are relatively few
examples in the mainstream ecological literature to encourage
exploration.

This is the first text aimed at introducing machine learning methods
to a readership of professional ecologists. All but one of the
chapters have been written by ecologists and biologists who highlight
the application of a particular method to a particular class of
problem.  Examples include the identification of species, optimal mate
choice, predicting species distributions and modelling landscape
features. A group of experienced machine learning workers, who have
become interested in environmental problems, have written a chapter
that demonstrates how machine learning methods can be used to discover
equations that describe the dynamic behaviour of ecological
systems. The final chapter reviews `real learning', offering the
potential for greater dialogue between the biological and machine
learning communities.



Kluwer Academic Publishers, Boston

      Hardbound, ISBN 0-412-84190-8
      August 1999, 280 pp.
      NLG 255.00 / USD 125.00 / GBP 81.25


Contents and Contributors
Contributors. Preface. Acknowledgements. 1. An introduction to machine
learning methods; A. Fielding. 2. Artificial neural networks for
pattern recognition; L. Boddy, C.W. Morris. 3. Tree-based methods;
J.F. Bell. 4. Genetic Algorithms I; J.N.R. Jeffers. 5. Genetic
Algorithms II; D.R.B. Stockwell. 6. Cellular automata;
D. Dunkerley. 7. Equation discovery with ecological applications;
S. Džeroski, et al. 8. How should accuracy be measured?
A. Fielding. 9. Real learning; B. Stevens-Wood. Author Index.
Subject Index.

#41 From: Tjen-Sien Lim <limt@...>
Date: Mon Aug 23, 1999 10:23 pm
Subject: [RP] FYI: Using correspondence analysis to combine classifiers
 
The following new article in Machine Learning could be of interest.

===

UI  - 215PC-0003
AU  - Merz CJ
TI  - Using correspondence analysis to combine classifiers
SO  - Machine Learning 1999 Jul;36(1-2):33-58
MH  - Classification
MH  - Correspondence analysis
MH  - Multiple models
MH  - Combining estimates
MH  - Algorithm
AB  - Several effective methods have been developed recently for improving
       predictive performance by generating and combining multiple learned
       models. The general approach is to create a set of learned models
       either by applying an algorithm repeatedly to different versions of
       the training data, or by applying different learning algorithms to
       the same data. The predictions of the models are then combined
       according to a voting scheme. This paper focuses on the task of
       combining the predictions of a set of learned models. The method
       described uses the strategies of stacking and Correspondence
       Analysis to model the relationship between the learning examples and
       their classification by a collection of learned models. A nearest
       neighbor method is then applied within the resulting representation
       to classify previously unseen examples. The new algorithm does not
       perform worse than, and frequently performs significantly better
       than other combining techniques on a suite of data sets.
       [References: 37]
PT  - Article

#40 From: Tjen-Sien Lim <limt@...>
Date: Sun Aug 15, 1999 3:00 am
Subject: [RP] usefulness of the tree diagram
 
I'd like to hear your opinions/comments/discussions on the following
issue. One selling point of tree-structured methods is that they can
"provide insight and understanding into the predictive structure of
the data" (quote from Breiman, Friedman, Olshen & Stone, 1984). Tree
methods enable us to understand our data better by looking at the tree
diagram and studying the split variables and split points.

Skeptics want to know what guarantee there is that the
explanation/theory/story provided by the tree diagram is neither wrong
nor misleading. What can we tell skeptics to assure them that the tree
diagram is useful?

One way is to simulate a tree model and generate some simulated data
from the true tree model. If the tree method can estimate the true
model reasonably well, then perhaps we can convince
skeptics. (Professor Douglas Hawkins has done this kind of
simulation.) What else can we do?
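
A minimal sketch of such a simulation, under assumptions that are mine
and not taken from any published study: the true model is a single
split on x1 at 0.5, x2 is pure noise, the response has Gaussian error,
and the "tree method" is just an exhaustive least-squares search for
one split. All names and parameter values are hypothetical.

```python
import random


def simulate(n, rng, split=0.5, noise=0.5):
    """Draw n cases from a known one-split tree; y depends only on x1."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        mean = 1.0 if x1 <= split else 3.0
        rows.append((x1, x2, mean + rng.gauss(0.0, noise)))
    return rows


def best_split(rows):
    """Exhaustive least-squares search for a single split (a stump).

    Returns (variable index, cut point) of the best split found.
    """
    n = len(rows)
    total = sum(r[2] for r in rows)
    best = None  # (score, variable index, cut point)
    for var in (0, 1):
        ordered = sorted(rows, key=lambda r: r[var])
        left_sum = 0.0
        for i in range(1, n):
            left_sum += ordered[i - 1][2]
            if ordered[i - 1][var] == ordered[i][var]:
                continue  # cannot cut between tied values
            right_sum = total - left_sum
            # Maximizing this score is equivalent to minimizing the
            # residual sum of squares of the two-node fit.
            score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
            cut = (ordered[i - 1][var] + ordered[i][var]) / 2
            if best is None or score > best[0]:
                best = (score, var, cut)
    return best[1], best[2]


def recovery_rate(reps=200, n=200, tol=0.05, seed=1):
    """Fraction of simulated data sets where the fitted split picks x1
    and lands within tol of the true cut point 0.5."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        var, cut = best_split(simulate(n, rng))
        if var == 0 and abs(cut - 0.5) <= tol:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("fraction recovering the true split:", recovery_rate())
```

The recovery rate can then be tracked as the noise level rises or the
sample size shrinks, which gives skeptics a concrete picture of when
the diagram can and cannot be trusted. A deeper true tree would need a
recursive version of the search and some way of comparing tree
structures, which this sketch does not attempt.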

Thanks.

--
Tjen-Sien Lim                (608) 262-8181 (Voice)
Dept. of Statistics          (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison   limt@...
1210 West Dayton Street      http://www.stat.wisc.edu/~limt
Madison, WI 53706

#39 From: Tjen-Sien Lim <limt@...>
Date: Sun Aug 8, 1999 12:41 am
Subject: [RP] Q: cause of missing values
 
I'd like to solicit information about the causes of missing values,
based on your experiences analyzing data. Some examples that I've
encountered or can think up are:

1. In a survey, missing values can mean that the respondents:
       - refuse to answer
       - don't know
       - don't remember

2. Missing values can mean "Not Applicable" (or a skip pattern). For
    example:
       - in medical sciences, a particular blood test is not ordered by
         the physician and hence the value for the test is "missing"
       - only men can experience prostate problems
       - only women can get pregnant

3. In agricultural sciences, the plants/crops or animals could die
    during the course of the experiment and hence the "yield" variable
    would be "missing".

4. In an ecological study, the cows chew the plastic bags used to trap
    insects in the field and hence no insect data could be obtained
    from those bags.

Other examples? Please respond to me directly and I'll summarize to
the list. Thank you.

--
Tjen-Sien Lim                (608) 262-8181 (Voice)
Dept. of Statistics          (209) 882-7914 (Fax)
Univ. of Wisconsin-Madison   limt@...
1210 West Dayton Street      http://www.stat.wisc.edu/~limt
Madison, WI 53706

#38 From: limt@...
Date: Fri Aug 6, 1999 5:59 pm
Subject: [RP] Re: C5.0 and QSAR data sets
 
<7obka8$oac-@egroups.com> wrote:
original article: http://www.egroups.com/group/recursive-partitioning/?start=37
> I'm looking for people having experience in C5.0 for QSAR applications.
> Thanks
>
> Abdel Laoui


I don't know about C5.0 for QSAR (Quantitative Structure-Activity
Relationships), but there's a classification tree program designed
specifically for QSAR:

http://www.msi.com/solutions/products/cerius2/modules/c2csar.html

#37 From: abdelazize.laoui@...
Date: Thu Aug 5, 1999 9:08 am
Subject: [RP] C5.0 and QSAR data sets
 
I'm looking for people having experience in C5.0 for QSAR applications.
Thanks

Abdel Laoui

#36 From: William Shannon <shannon@...>
Date: Fri Jul 30, 1999 7:27 pm
Subject: [RP] CSNA Newsletter
 
http://osler.wustl.edu/~shannon/csna/news.latest.html

Hi,

The above link is to the most recent issue of the Classification Society of
North America's (CSNA) Newsletter, which may be of interest to some readers. If
you are unfamiliar with the CSNA, I encourage you to take a look at their
homepage: http://www.pitt.edu/~csna/

Bill Shannon

--
William D. Shannon, Ph.D.

Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences

Assistant Professor of Biostatistics
Division of Biostatistics

Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO   63110

Phone: 314-454-8356
Fax: 314-454-5113
e-mail: shannon@...
web page: http://osler.wustl.edu/~shannon
