P.S. When you say "but I see it as managing errors rather than reducing them"
- I see it as the opposite.
An SLM's main aim is to drive the absolute error count as low as possible - i.e. to reduce errors.
If the best way to do that is to make one word/phrase not work at all, then the SLM may end up doing exactly that.
There are certainly many observable cases of, say, one strange word occurring just 1 or 2 times in 100000 cases - and the SLM may not work AT ALL on that word.
- Is that really so bad?
If you want to 'manage' errors - Saying "this error is 3 times as bad as this other error", or whatever - Then that is managing errors, and uses information the SLM can't really know.
- That's where your design and tweaking comes in...
But the SLM is all about trying to minimise the absolute error count. Anything more 'wise' than that is up to you.
From: vuids@yahoogroups.com [mailto:vuids@yahoogroups.com] On Behalf Of Peter Nann
Sent: Friday, 3 July 2009 9:51 AM
To: vuids@yahoogroups.com
Subject: RE: [vuids] Re: Recognition-Problem
Everything you say is true.
Like I said, you have to consider what you are trying to do, and it is critical to consider how important it is to you to be able to get the 'rare' ones right.
You might sacrifice 3 "New Yorks" to make 1 "New Yerk" work - That is your call.
You might sacrifice 10 "Sales" calls for 1 "Sal Less" call.
But you are probably pushing things if you sacrifice 100 "New Yorks" for 1 "New Yerk".
(Weighting 2 similar sounding things like this is basically an exercise in sacrificing one for the other)
And disambiguation strategies? - Sure, that affects what your performance targets should be in the steps prior to the disambiguation.
But we weren't talking about that. That could change everything.
Also like I said - Your examples are why I stated that starting with weights based on population is really only a very rough start.
The risk with doing this blindly is exactly the sort of thing you are worried about - Low runners might be completely clobbered and you might not know about it unless you test it properly.
But the harsh reality is:
Do you want to put out a system that gets 10% of requests wrong, or 5% of requests wrong?
Probably the latter, and that means sacrificing (performance on) the minority for the good of the majority.
And as usual, there is no single answer. You have to consider exactly what you are trying to do and design accordingly.
From: vuids@yahoogroups.com [mailto:vuids@yahoogroups.com] On Behalf Of Bruce Papazian
Sent: Friday, 3 July 2009 3:21 AM
To: vuids@yahoogroups.com; vuids@yahoogroups.com
Subject: RE: [vuids] Re: Recognition-Problem
>> Interesting paper. I've always wondered if such SLMs truly improve accuracy
- Yes they do.
Hi Peter,
This may be true, and I'll have to take your word for it. I would certainly agree that these techniques can lead to a greater percentage of calls being processed, but I see it as managing errors rather than reducing them (which is improved accuracy), and I think this can be more or less appropriate, depending on the application. I see the process as trading off the probability of recognizing some items correctly over others to increase automation rates, but the result may not be acceptable from an individual user's perspective if he/she can't get the app to work for things that are not common but still valuable.
Let me offer an example from a call director application perspective to make my point. Suppose an app is set up to transfer calls to people and departments within a company. All departments and employee names are in the grammar, and the prompt is something like "say the name of the person or department you want to reach." I would say it would be reasonable for the caller to expect that he/she should be able to reach all people and departments.
Now suppose "Sales" is a department and the CEO's name is "Sal Less", and you want to make it work better, so you decide to weight the grammar via some real data. In the real data there are many more calls to Sales than to Sal Less, so through this process a new grammar is built where "Sales" is now weighted more heavily than "Sal Less", and it is deployed. Now Sales is recognized correctly more often, and the automation rate is up, but fewer calls get through to the CEO than used to. I can see (and have seen) where some constituents would not see this change as an improvement, even though the automation rate could have gone up.
Taken to the extreme, let's repeat the process a few more times. As things that were infrequently requested to start with get requested less often because people learn that they don't work very well, they get weighted lower and lower, which, in the limit, is effectively taking them out of the grammar. You may now have a great automation rate, but you've changed the problem you are solving, and the job you are doing for the users, and the app is no longer meeting expectations.
Sorry to be belaboring this point, but I think it is worth noting when these techniques are being considered. A long time ago I was involved with a project where city names had to be recognized. Performance data showed there were confusable pairs like Austin and Boston we had to deal with. Rather than tuning via a weighting strategy, we chose to disambiguate via a dialog change where we would ask for the state name whenever one of the confusable pairs was recognized, and we put the city with the state in the grammar so that it would recognize properly if people decided to say the city and state after hearing the disambiguation prompt.
The dialog went something like this:
>What city?
Boston
>Was that Boston Massachusetts, or Austin Texas?
Boston Massachusetts
This made things work better without biasing towards the more frequent request, and it worked for confusable pairs that were equivalent in frequency of request.
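For anyone who wants to see the shape of that dialog logic, here is a minimal sketch in Python. It is not the code from the original project; the table, the function names and the way the follow-up grammar is represented are all hypothetical, and a real system would express this in its own dialog and grammar formats.

```python
# Hypothetical table of confusable cities. Each entry lists the full
# "city state" forms that are offered back to the caller.
CONFUSABLE = {
    "Boston": ["Boston Massachusetts", "Austin Texas"],
    "Austin": ["Boston Massachusetts", "Austin Texas"],
}


def disambiguation_prompt(recognised_city):
    """Return a follow-up prompt if the city is in a confusable pair, else None."""
    options = CONFUSABLE.get(recognised_city)
    if options is None:
        return None  # not confusable: accept the recognition result as is
    return "Was that " + ", or ".join(options) + "?"


def follow_up_grammar(recognised_city):
    """Phrases the follow-up step should accept (city with state), or an empty list."""
    return list(CONFUSABLE.get(recognised_city, []))


prompt = disambiguation_prompt("Boston")
grammar = follow_up_grammar("Boston")
print(prompt)
print(grammar)
```

Note that neither city is favoured here: the same prompt is produced whichever member of the pair was recognized, which is why the approach also works for pairs that are requested about equally often.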
Interesting discussion.
>> or just bias the app to the more frequently requested items.
- Yes it certainly does bias, that is exactly how the higher accuracy is achieved.
>> If a test set is biased the same way as the models, then I can see why system results will look better,
- Yep. That's why real data is your friend, fake data can be your enemy. There's no point optimising a system toward fake data, but real data is another story...
>> but if the test set represents all possible items equally does it get better?
- That is a 'fake' testset. Performance would be guaranteed WORSE with the weighting, on such a fake testset. See the above point. Optimising on fake data is folly...
>> And if you look at individual items, does the performance on the low frequency items go down using such methods?
- Yes it would go down. Possibly a lot for 'rare items' that are similar to 'hugely common' items.
That's why weighting based on things like population is a start, but you really want to then know how the whole solution performs _in_the_real_world_
Is it acceptable that some minor cities might be really hard to recognise?
If it's really important to you that the town of "New Yerk" with population 50 is recognised at all, then you had better carefully consider the weight of it and other similar sounding cities, and test with real data.
You might test it like this:
A) Get 100 people saying "New York"
B) Get 100 people saying "New Yerk"
- Adjust the weightings until they were both, say, 95% right.
If you did this, you _WOULD_BE_CRAZY_.
It is far, far, FAR more important that "New York" performs better.
So you might adjust weights such that "New York" was 99.5% right (if you are lucky), and "New Yerk" might be only 70% right,
- But the OVERALL performance on REAL DATA, would almost certainly be better like this.
It's a numbers game. And it's a harsh reality.
But you can't really argue against the numbers...
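To put rough numbers on this, here is a small Python sketch. The per-item accuracies (95%/95% versus 99.5%/70%) are the ones from the example above; the traffic mix of 999 "New York" requests for every 1 "New Yerk" request is purely an assumption for illustration, as are all the names in the code.

```python
def overall_accuracy(traffic, accuracy):
    """Fraction of all requests recognised correctly, given per-item
    request counts and per-item recognition accuracy."""
    total = sum(traffic.values())
    return sum(traffic[item] * accuracy[item] for item in traffic) / total


# Per-item accuracies from the example: equal tuning vs. favouring "New York".
balanced = {"New York": 0.95, "New Yerk": 0.95}
skewed = {"New York": 0.995, "New Yerk": 0.70}

# ASSUMPTION: 999 "New York" requests for every 1 "New Yerk" request.
real_traffic = {"New York": 999, "New Yerk": 1}
# A 'fake' test set that represents both items equally.
uniform_test = {"New York": 1, "New Yerk": 1}

real_balanced = overall_accuracy(real_traffic, balanced)
real_skewed = overall_accuracy(real_traffic, skewed)
fake_balanced = overall_accuracy(uniform_test, balanced)
fake_skewed = overall_accuracy(uniform_test, skewed)

print("real traffic:", real_balanced, real_skewed)
print("uniform test:", fake_balanced, fake_skewed)
```

Under that assumed mix, the skewed weighting gets roughly 99.5% of real requests right against 95% for the balanced one, while on the uniform test set it drops to 84.75% against 95%. Change the traffic mix and the conclusion can change with it, which is the reason to measure on real data.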
Now if you were REALLY SMART, the grammars would not just be weighted on 'population', but preferably by some other factors...
For example, biased more toward cities near your current location, because fairly short trips (?) are probably much more likely than long trips... Depending on the app.
- That's the sort of thing Google love to use to improve their apps...
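As a rough illustration of what a location-biased weighting could look like, here is a Python sketch. Everything in it is an assumption: the exponential decay with distance, the 300 km scale, the function names and the two made-up cities. It only shows the idea of combining population with proximity, not how any real product does it.

```python
import math


def location_weight(population, distance_km, scale_km=300.0):
    """Hypothetical weight: population discounted exponentially by distance
    from the caller. scale_km controls how quickly far-away cities fade."""
    return population * math.exp(-distance_km / scale_km)


def normalise(weights):
    """Scale the weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Made-up cities: (population, distance from the caller in km).
cities = {
    "Near Town": (50_000, 20.0),
    "Far City": (500_000, 3000.0),
}

weights = normalise(
    {name: location_weight(pop, dist) for name, (pop, dist) in cities.items()}
)
print(weights)
```

With these made-up numbers the small nearby town ends up weighted well above the much larger but distant city. Whether that is what you want depends on the app, and it needs testing against real data just like a population-only weighting.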
From: vuids@yahoogroups.com [mailto:vuids@yahoogroups.com] On Behalf Of Bruce Papazian
Sent: Thursday, 2 July 2009 2:30 AM
To: vuids@yahoogroups.com
Subject: Re: [vuids] Re: Recognition-Problem
Interesting paper. I've always wondered if such SLMs truly improve accuracy or just bias the app to the more frequently requested items. If a test set is biased the same way as the models, then I can see why system results will look better, but if the test set represents all possible items equally does it get better? And if you look at individual items, does the performance on the low frequency items go down using such methods?
At 01:55 PM 6/30/2009, you wrote:
Here is a pointer to a paper that describes how to weight the grammar by population:BPIdesign
http://phil.shinn.googlepages.com/DesigningLanguageModelsforVoicePorta.pdf
--- In vuids@yahoogroups.com, "vuiwoz" <vuiwoz@...> wrote:
>
> Hi,
> I have an issue with recognition of cities (without state):
> There are about 50,000 cities to recognize (including synonyms) - and in my experience recognition should be at least about 70-75% without any fine-tuning (except the lexicon) - but so far it is far lower. The lexical transcriptions we use are hand-crafted, so this shouldn't be the problem.
>
> Does anyone have experience with this and know which parameters can be set to improve recognition? (Maybe preprocessing - sample frequency, volume)
>
> Thanks in advance for any hints...
>
6 Stonecutters Path
Harvard, MA 01451
brucepapazian@...
978-835-3124