
Elo Based On Win/loss (Or Anything Based On Win/loss) Is Silly


167 replies to this topic

#121 Grits N Gravy

    Member

  • PipPipPipPipPipPip
  • 287 posts

Posted 11 November 2013 - 08:14 AM

The problem isn't inherently with Elo systems, it's the implementation. Specifically, the lack of segregation of players based on Elo scores. In all likelihood they are putting people on teams together whose Elo scores are within 1 standard deviation of each other. If your Elo score is the mode, the most commonly occurring score, then you can expect to be matched against 60% of the player base.

You can run into a lot of problems with this approach. First and foremost, you get inaccurate forecasting from Elo scores. At its heart, an Elo score is a precise way of saying how likely one person is to beat another; i.e. a person with 200 more Elo points than his opponent should win 75% of the time. When you allow teams to be composed of people with largely disparate scores, you end up with rapid swings in Elo scores, due to players being easily over- and under-ranked and then having the score over- or under-corrected.
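The 200-point claim can be checked against the standard Elo curve. A quick sketch using the classic chess formula (whether MWO uses exactly this scale is an assumption, so treat the numbers as illustrative):

```python
# Standard Elo expected-score formula (logistic curve, base 10, scale 400).
# This is the classic chess formula; MWO's exact implementation is not
# published, so the numbers are illustrative.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(round(expected_score(1700, 1500), 2))  # a 200-point edge gives ~0.76
```

A 200-point gap works out to roughly a 76% expected win rate, which is where the "win 75% of the time" figure comes from.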

This basically destroys the predictive accuracy of Elo, as the 200-point difference in scores no longer accurately predicts that the higher-ranked team will win 75% of the time. Not to mention that an individual's Elo score is adjusted based on his team's average Elo vs. the opponents' average Elo. So your score is inherently inaccurate, as the adjustments to your score are somewhat arbitrary: you may be either over- or under-rewarded for your performance in any given match based on your team's average Elo score.

This can lead developers to make false assessments about the performance of the matchmaking system, depending on what metrics they are looking at to judge success. Looking at something like the distribution of Elo scores might give you the false sense that the matchmaker is functioning well. If you have a normal distribution of Elo scores you might assume all is well, because the scores match the typical distribution of skill. What could really be happening is that the inherent inaccuracy of the system creates a feedback loop for the majority of players, thereby sticking them within 1.5 standard deviations of the mode score. So you end up with bad matchmaking that looks good on paper.

This may also lead a developer to believe that he needs a larger tolerance in matchmaking spread, which compounds the issue. An inaccurate Elo system leads to a greater range of scores as a function of the fluctuation of players' scores, which means the standard deviation is much larger. Thus any matchmaking criterion is much wider than it needs to be.

An accurate Elo system will tend to concentrate scores around the mode score. Thus we end up with a smaller standard deviation, which means inherently tighter matchmaking that functions just as fast as the large-standard-deviation case, because the populations of scores around the mode are greater.

From the published distributions of Elo scores you can see the range of scores and the standard deviation open way up after 50 matches. That's mostly a function of the inaccuracy of the matchmaking system. So you can see the feedback loop develop: it tells us we need wide matchmaking spreads, which makes it self-perpetuating.
[Image: published distribution of Elo scores]

What you really should keep your eye on is the frequency and variance of score changes (more is bad), and how accurately Elo is predicting outcomes. If both are off, and you're committed to Elo, you tune your system by lowering the maximum amount of points won in a match and tightening the match- and team-building criteria.
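The "maximum amount of points won in a match" knob is the K-factor in a standard Elo update. A minimal sketch of that tuning (the K values and ratings here are illustrative assumptions, not MWO's):

```python
# Standard Elo update: the K-factor caps how many points change hands per
# match, so lowering K damps the score swings described above.
# All numbers here are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating: float, opponent: float, won: int, k: float = 32) -> float:
    """New rating after one match; won is 1 for a win, 0 for a loss."""
    return rating + k * (won - expected_score(rating, opponent))

# A 1500 player upsets a 1700 player:
print(round(update(1500, 1700, 1, k=32), 1))  # ~1524.3: a 24-point jump
print(round(update(1500, 1700, 1, k=8), 1))   # ~1506.1: the same upset, damped
```

Halving or quartering K trades faster seating for smaller per-match swings, which is exactly the frequency-and-variance tradeoff described above.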

Edited by Grits N Gravy, 11 November 2013 - 08:21 AM.


#122 Kunae

    Member

  • PipPipPipPipPipPipPipPipPip
  • 4,303 posts

Posted 11 November 2013 - 08:34 AM

Joseph Mallan, on 11 November 2013 - 07:12 AM, said:

This too! Though it could be argued that is part of Assisting. :D

You often don't have to actually shoot anyone, or hit anyone if you do, to maneuver the enemy where you want them. The game has no way of tracking this behavior, and it's not really possible for it to.

#123 Joseph Mallan

    ForumWarrior

  • PipPipPipPipPipPipPipPipPipPipPipPipPipPipPip
  • FP Veteran - Beta 1
  • FP Veteran - Beta 1
  • 35,216 posts
  • Google+: Link
  • Facebook: Link
  • LocationMallanhold, Furillo

Posted 11 November 2013 - 08:40 AM

Kunae, on 11 November 2013 - 08:34 AM, said:

You often don't have to actually shoot anyone, or hit anyone if you do, to maneuver the enemy where you want them. The game has no way of tracking this behavior, and it's not really possible for it to.

I would say at least the Mech in question would have to use the 'R' key to indicate he is drawing the enemy into friendly fire, much as we do with missiles. I can't tell you how many times I get money and XP for having target lock while some missiles damage my target!

Edited by Joseph Mallan, 11 November 2013 - 08:40 AM.


#124 MischiefSC

    Member

  • PipPipPipPipPipPipPipPipPipPipPipPip
  • The Benefactor
  • The Benefactor
  • 16,697 posts

Posted 11 November 2013 - 01:26 PM

Victor Morson, on 11 November 2013 - 12:04 AM, said:


Buzzz... and that is where you're entirely wrong. The true flaw behind everything you keep saying.

It's not a reward, no, but it is an attempt to measure skill for the intent of, gasp, placing those players in matches with players of a similar skill. This is the entire purpose of ELO, even if the execution is entirely lacking.

You keep talking about how your odds even out over time, and that is precisely the problem. The small % bump from your horrendous lack of skill / amazing displays of skill will get lost in a sea of drops that you won or lost based on a bum MM roll. With every bum roll, that number keeps on averaging out until it means absolutely nothing.

If you're trying to put pilots with others of equal skill you need to track the pilots and not the entire team. Any ELO system using the core logic you are putting forth is less than worthless for the goal of balancing matches.



Yes and no. They can make the biggest difference, yes, but ultimately one Highlander will not save a team of Locust pilots calling people tryhards for telling them not to charge the hill.

Again I do remind you the majority of your major ELO streaks have been in 4 mans, which would equally steamroll you if you happened to roll in alone on the other side against your teammates.


I really have to ask, Victor: do you understand that for your point to be valid, statistics and probability theory would have to be wrong? It's a bold assertion and one I'd be willing to entertain - you've got a pretty tough sell though. Admittedly there are millions of people who think Twilight was a great love story, so I'm not entirely against challenging logical principles, but you're going to need something a little stronger than your personal opinions. Do you have any mathematical basis for saying that once you get into 12 or more people it's impossible to ever untangle one person's contributions or performance?

The entire point of statistics and probability theory is that not only can your personal win/loss rate in a team of 12 be used to identify your precise performance, but I could do it with you in a team of 1,200 - I'd just need a lot more matches to sort your performance out. 12, 12,000, 12 million players: I could still sort you out with enough matches and telemetry, and win/loss would be the most accurate way to do it. I'd just need more matches the larger the pool size.

Let's take this further. If I had complete telemetry for you over hundreds or even thousands of matches, I could not only identify your impact on win/loss for a team of random or semi-random players, but I could identify details of your personality - what sort of shopper you are, your age and financial situation, whether you have a family or not. I could even give a decent prediction of how you'd vote in elections, all within a decent margin of error. The more telemetry I have and the more games you play, the more I can sort you out of the mix, and the more precision with which I can do so. Do you focus first on helping teammates or targeting enemies? Do you prefer targets who've fired on you or ones who've attacked your allies? Do you focus on the strongest or the weakest first? How often do you change mechs, loadouts and weapons? What sort of weapons do you prefer, what sort of mechs? How often do you buy new things, how often do you spend C-bills or XP? Telemetry - give me enough of it and I'll show you a clearer picture of exactly what sort of person you are than your mom could.

What do you think Google does? Facebook? The NSA? It's the exact same principle.

You and your performance are the only stable thing between all the matches you play. The wide variety of variables in pug matches in MWO is why it takes hundreds of matches to accurately seat someone into a pretty close Elo rating. If you were in a stable 12man, the same 12 people every game, I could do so in a couple dozen matches but it'd be strongly skewed by the consistent data points of your teammates. Your performance with random teammates would vary by quite a bit - hence why pug matches are *more* accurate for identifying your performance without a stable team. You are the only consistent variable.

Win/loss is the only reliable factor from which to draw data. Everything else is easily skewed. I've gone over why prior.

This isn't my opinion, it's a mathematical process. I've shown you the math in prior posts. Show me the math that says your contribution in pugs is impossible to identify. I'd love to see it - it flies in the face of the whole field of statistical analysis. Again, not trying to be sarcastic here but this isn't a debate of opinions. This is you saying that an entire field of mathematical analysis and human behavior is invalid. That's a pretty big assertion. Elo for win/loss works in teams of 12 for the exact same reason any statistical analysis of a random sampling to track patterns of a single consistent data point works. That consistent data point can be plotted and its impact on the samples refined over time to accurately track impact. More samples, longer time, better accuracy.
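The claim that one consistent player can be recovered from team win/loss alone can be sketched with a toy Monte Carlo simulation. The 4-point edge and match counts below are illustrative assumptions, not MWO data:

```python
# Toy simulation: the tracked player nudges his team's win probability from
# 50% to 54% (one plausible translation of an 8.33% share of a 12-man team
# into a modest edge). With few matches the edge is buried in noise; with
# enough matches it separates cleanly from 50%.
import random

def observed_win_rate(edge: float, matches: int, seed: int = 1) -> float:
    """Fraction of wins for a team whose 50% baseline is shifted by `edge`."""
    rng = random.Random(seed)
    wins = sum(rng.random() < 0.5 + edge for _ in range(matches))
    return wins / matches

print(observed_win_rate(0.04, 20))    # noisy: could land anywhere near 0.5
print(observed_win_rate(0.04, 5000))  # thousands of matches pin the edge down
```

This is the core of the statistical argument: more samples shrink the noise around the one consistent data point.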

Joseph Mallan, on 11 November 2013 - 06:08 AM, said:

Your contributions were assisting kills and spotting. An assist should count towards Elo at 10 up to 75% of what a kill gets you. Tracking assists is a valuable bit of data on your performance.

Kunae, on 11 November 2013 - 07:03 AM, said:

Actually, his contributions were making the enemy move/maneuver in a way to put them out of position.

Joseph Mallan, on 11 November 2013 - 07:12 AM, said:

This too! Though it could be argued that is part of Assisting. :D


You guys are talking about double-dipping. That would skew results. The reward for helping your team win is improving the odds of a win - if you get an Elo bump modifier for the acts that contribute to the win in addition to the win itself, you're going to throw the results way out of whack. Now you've got all sorts of bonuses without an equivalent negative. You'd also have to identify every single behavior that in some way contributes to a win and weight it relative to the others. To then make it balanced, so that everyone doesn't just perpetually climb in Elo, you'd have to identify every single action and behavior that reduces the odds of a win and weight them as well.

Why not just.... base it on the result? The win/loss? It's a 1/0 result and thus incredibly reliable and extremely difficult to cheat on.

#125 Victor Morson

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • Elite Founder
  • 6,370 posts
  • LocationAnder's Moon

Posted 11 November 2013 - 01:41 PM

Hauser, on 11 November 2013 - 04:07 AM, said:

The whole point of Mischief's argument is that you are the only constant in your team. The effects of your teammates and your opponents cancel each other out in the long run. The matchmaker can botch the match either way. So the only thing that remains, that isn't averaged out, is your own influence.


Which would be fine, if there weren't so many chaotic variables that can't be accounted for.

To be fair, this is a snowball effect. If ELO was working properly to begin with, you would always be a constant on a team of people with similar track records, and thus, you'd be the constant in a somewhat stable environment.

The fact is every time you launch into the matchmaker, you are at the mercy of dozens of random factors. The biggest and most notable of these factors are wild weight mismatches, followed immediately by randomly rolling great players, horrible players, and premade lances.

In the end, that data is far, far too muddy to mine. A lone pilot simply cannot control those tides. Any battles they win/lose because of their abilities will get lost in a sea of those decided by dice rolls, and ultimately won't really push their ratio more than a few percent in any given direction. However, remember that ELO calculates in 4-mans too, so you can also artificially raise your ELO by running with a good team.

-

I think three things pretty much prove my point on this, from a practical standpoint:

1 - Even at prime time when ELO splitting should be easy, trial 'mechs run by cadets are still constantly thrown against people from the 10 best units in the game. Routinely.

2 - A bad player could literally drive an unarmed mech in 100 games, and still manage to only impact their winning % by the tiniest of margins, if at all. Frankly, with half the terrible Frankenbuilds we see in serious games, that's just about happening now.

3 - Almost all exclusively solo-dropping players end up in the 45-55% bracket. How much of that is the constant variable of the pilot, and how much of that is pure luck of team rolls? It's impossible to determine this.

-

Frankly half the people here talking about how one person can consistently make enough difference to impact their ELO in a 12-man game are ludicrous. Those wins/losses you cause will merely get thrown in that muddy data puddle with all the other out of control factors. Ultimately a guy who impacts matches 5-10% of the time is going to end up evenly ranked with a guy who's lucked out more often than not and rolled good teams.

The bottom line is, why track a stat that is so difficult to get an accurate reading from? Average damage done per match, for example, is a highly trackable stat that helps ID that pilot's capabilities far, far more and that's just one stat of many. How about accuracy? Average kills per map? Average points captured per map? How many assists they have?

Those things are within the pilot's control. Those things are an effective way at ranking pilots and matching them against each other and none of it relies on entirely random elements.

Now Elo can handle that just fine. It only changes your score when proven wrong, and it does so proportionally to the expected outcome. So even if fair matches only show up 1% of the time, you will get to the right Elo score.

Now you're better off making an argument against the matchmaker. But that should be fixed around UI 2.0: it will remove tonnage matching and replace it with a weight limit. So the matchmaker will only have to look at Elo, which should be easier.

Hauser, on 11 November 2013 - 04:07 AM, said:

Grouping up is an effective way to win; as such it should be reflected in your Elo rating. It doesn't require much piloting skill, but we're measuring one's ability to win games.


Which continues to be the wrong thing to measure.

Bottom line is the purpose of ELO is to match pilots with pilots of a similar skill level, and win/loss is the worst stat to do that by. We can all see that ELO, as it works right now... doesn't. It's a proven failure of a system through and through.

EDIT: I won't ask anyone to run an unarmed mech or suicide to prove this theory, but if anyone owns a Locust, why not put some worthless guns on it and run it for 20 drops and see what your W/L ratio is? In a 'mech that can't really dent the outcome of anything, I bet you still end up within 45-55%, depending on how much you beat the odds one way or the other.

In fact anyone with a positive w/l ratio that drives the Locust in general proves my point.

#126 MischiefSC

    Member

  • PipPipPipPipPipPipPipPipPipPipPipPip
  • The Benefactor
  • The Benefactor
  • 16,697 posts

Posted 11 November 2013 - 01:41 PM

Grits N Gravy, on 11 November 2013 - 08:14 AM, said:

The problem isn't inherently with Elo systems, it's the implementation. Specifically, the lack of segregation of players based on Elo scores. In all likelihood they are putting people on teams together whose Elo scores are within 1 standard deviation of each other. If your Elo score is the mode, the most commonly occurring score, then you can expect to be matched against 60% of the player base.

You can run into a lot of problems with this approach. First and foremost, you get inaccurate forecasting from Elo scores. At its heart, an Elo score is a precise way of saying how likely one person is to beat another; i.e. a person with 200 more Elo points than his opponent should win 75% of the time. When you allow teams to be composed of people with largely disparate scores, you end up with rapid swings in Elo scores, due to players being easily over- and under-ranked and then having the score over- or under-corrected.

This basically destroys the predictive accuracy of Elo, as the 200-point difference in scores no longer accurately predicts that the higher-ranked team will win 75% of the time. Not to mention that an individual's Elo score is adjusted based on his team's average Elo vs. the opponents' average Elo. So your score is inherently inaccurate, as the adjustments to your score are somewhat arbitrary: you may be either over- or under-rewarded for your performance in any given match based on your team's average Elo score.

This can lead developers to make false assessments about the performance of the matchmaking system, depending on what metrics they are looking at to judge success. Looking at something like the distribution of Elo scores might give you the false sense that the matchmaker is functioning well. If you have a normal distribution of Elo scores you might assume all is well, because the scores match the typical distribution of skill. What could really be happening is that the inherent inaccuracy of the system creates a feedback loop for the majority of players, thereby sticking them within 1.5 standard deviations of the mode score. So you end up with bad matchmaking that looks good on paper.

This may also lead a developer to believe that he needs a larger tolerance in matchmaking spread, which compounds the issue. An inaccurate Elo system leads to a greater range of scores as a function of the fluctuation of players' scores, which means the standard deviation is much larger. Thus any matchmaking criterion is much wider than it needs to be.

An accurate Elo system will tend to concentrate scores around the mode score. Thus we end up with a smaller standard deviation, which means inherently tighter matchmaking that functions just as fast as the large-standard-deviation case, because the populations of scores around the mode are greater.

From the published distributions of Elo scores you can see the range of scores and the standard deviation open way up after 50 matches. That's mostly a function of the inaccuracy of the matchmaking system. So you can see the feedback loop develop: it tells us we need wide matchmaking spreads, which makes it self-perpetuating.
[Image: published distribution of Elo scores]

What you really should keep your eye on is the frequency and variance of score changes (more is bad), and how accurately Elo is predicting outcomes. If both are off, and you're committed to Elo, you tune your system by lowering the maximum amount of points won in a match and tightening the match- and team-building criteria.


You, sir, get it.

Here's the issue you're fighting with, though - if you reduce the point impact per match, you extend the total matches required to accurately rate a given player. The biggest problem right now is that you'd need something in the neighborhood of 400 to 500 matches to accurately seat someone at an Elo of 1800, because the narrowing band of available opponents of comparable skill gives you diminishing returns on your wins and steeper declines on your losses.
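As a rough illustration of the seating problem, here is a deterministic sketch of a strong player climbing under a capped K-factor. All parameters (K=32, true strength 1800, starting at 1300, always facing equal-rated opponents) are assumptions; it gives an idealized lower bound, and the match-to-match noise plus the narrowing opponent pool described above stretch the real figure out considerably:

```python
# Idealized climb: each match the rating drifts by the *expected* Elo change
# against an equal-rated opponent. Real results are noisy and the matchmaker
# narrows the opponent pool, so actual seating takes far longer.

def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def matches_to_seat(true_elo=1800.0, start=1300.0, k=32.0, gap=50.0) -> int:
    """Matches until the rating drifts within `gap` of the player's true Elo."""
    rating, matches = start, 0
    while true_elo - rating > gap:
        p_win = expected_score(true_elo, rating)  # true win chance vs equal-rated foe
        rating += k * (p_win - 0.5)               # expected points gained per match
        matches += 1
    return matches

print(matches_to_seat())  # a few dozen in this noise-free model
```

The gap between this idealized lower bound and the 400-500 matches observed in practice is exactly the diminishing-returns effect at the top of the ladder.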

Second, the matchmaker does weight balancing as well, so it exacerbates the narrow range of options at particularly high and low Elo, as it further narrows the people available at any given time in any band.

This is why you actually DO want faster, even inappropriate, jumps in the early part of a player's Elo climb - statistically it's actually better for players to be ranked higher than they deserve, as it fattens the higher Elo bands, even if just artificially, and allows higher-Elo players to settle more quickly into an appropriate rank. Less skilled players incorrectly rated as high Elo will inevitably settle into an appropriate rank, and an extra 10 or 20 matches for them to do so because they are inaccurately rated higher isn't a big deal. Having a highly skilled player incorrectly rated at a lower Elo, and thus negatively skewing the results of numerous other players, is a bigger problem.

Does that make sense? I agree completely that matchmaking would be better served by tightening the band it pulls from for matches, but to do so effectively and still have good matchmaking you've got to fatten the higher and lower bands enough to populate matches. Otherwise you're inevitably going to have high- and low-ranked outliers used to offset imbalances in otherwise standard matches. This makes progression in higher bands incredibly slow, almost impossibly so. At the same time it keeps low-Elo players in middle Elo rankings far longer than they should be. It squeezes everyone to the middle.

The mixing of 4-man and pug Elo exacerbates this, especially when someone's Elo gets carried into high Elo bands by a skilled team and he then pugs, throwing the weighting for that whole match off for everyone. He takes a disproportionate hit when he can't pull the weight the MM incorrectly believes he can, and the weighting for the match being off throws off the results for everyone. If it were isolated it wouldn't be a big deal, but given the preponderance of 4-mans who sometimes pug, this is going to be a constant disruption in accurate prediction. If I were trying to use the resulting data to sell a product marketing projection, I wouldn't show it to anyone who could understand the figures. They're still going to be accurate-ish, but even predicting the margin of error would be tough.

#127 Victor Morson

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • Elite Founder
  • 6,370 posts
  • LocationAnder's Moon

Posted 11 November 2013 - 01:43 PM

MischiefSC, on 11 November 2013 - 01:41 PM, said:

You, sir, get it.


What I don't get is your insistence on pulling data from something so incredibly random. Individual pilot skill is at the very bottom of things that will impact a match in the first place: I'd say a bad weight mismatch or pilot grouping changes the outcome immeasurably more, to the point you can never be sure. You're pulling bad data for the goal of determining individual pilot matchmaking.

EDIT: Long story short: when you are talking statistical analysis, my point is that your margin of error is huge. That's the long and short of it. Other data would provide a far, far smaller range for mistakes.

Edited by Victor Morson, 11 November 2013 - 01:45 PM.


#128 MischiefSC

    Member

  • PipPipPipPipPipPipPipPipPipPipPipPip
  • The Benefactor
  • The Benefactor
  • 16,697 posts

Posted 11 November 2013 - 02:01 PM

Victor Morson, on 11 November 2013 - 01:41 PM, said:


Which would be fine, if there weren't so many chaotic variables that can't be accounted for.

To be fair, this is a snowball effect. If ELO was working properly to begin with, you would always be a constant on a team of people with similar track records, and thus, you'd be the constant in a somewhat stable environment.

The fact is every time you launch into the matchmaker, you are at the mercy of dozens of random factors. The biggest and most notable of these factors are wild weight mismatches, followed immediately by randomly rolling great players, horrible players, and premade lances.

In the end, that data is far, far too muddy to mine. A lone pilot simply cannot control those tides. Any battles they win/lose because of their abilities will get lost in a sea of those decided by dice rolls, and ultimately won't really push their ratio more than a few percent in any given direction. However, remember that ELO calculates in 4-mans too, so you can also artificially raise your ELO by running with a good team.

-

I think three things pretty much prove my point on this, from a practical standpoint:

1 - Even at prime time when ELO splitting should be easy, trial 'mechs run by cadets are still constantly thrown against people from the 10 best units in the game. Routinely.

2 - A bad player could literally drive an unarmed mech in 100 games, and still manage to only impact their winning % by the tiniest of margins, if at all. Frankly, with half the terrible Frankenbuilds we see in serious games, that's just about happening now.

3 - Almost all exclusively solo-dropping players end up in the 45-55% bracket. How much of that is the constant variable of the pilot, and how much of that is pure luck of team rolls? It's impossible to determine this.

-

Frankly half the people here talking about how one person can consistently make enough difference to impact their ELO in a 12-man game are ludicrous. Those wins/losses you cause will merely get thrown in that muddy data puddle with all the other out of control factors. Ultimately a guy who impacts matches 5-10% of the time is going to end up evenly ranked with a guy who's lucked out more often than not and rolled good teams.

The bottom line is, why track a stat that is so difficult to get an accurate reading from? Average damage done per match, for example, is a highly trackable stat that helps ID that pilot's capabilities far, far more and that's just one stat of many. How about accuracy? Average kills per map? Average points captured per map? How many assists they have?

Those things are within the pilot's control. Those things are an effective way at ranking pilots and matching them against each other and none of it relies on entirely random elements.

Now Elo can handle that just fine. It only changes your score when proven wrong, and it does so proportionally to the expected outcome. So even if fair matches only show up 1% of the time, you will get to the right Elo score.

Now you're better off making an argument against the matchmaker. But that should be fixed around UI 2.0: it will remove tonnage matching and replace it with a weight limit. So the matchmaker will only have to look at Elo, which should be easier.



Which continues to be the wrong thing to measure.

Bottom line is the purpose of ELO is to match pilots with pilots of a similar skill level, and win/loss is the worst stat to do that by. We can all see that ELO, as it works right now... doesn't. It's a proven failure of a system through and through.

EDIT: I won't ask anyone to run an unarmed mech or suicide to prove this theory, but if anyone owns a Locust, why not put some worthless guns on it and run it for 20 drops and see what your W/L ratio is? In a 'mech that can't really dent the outcome of anything, I bet you still end up within 45-55%, depending on how much you beat the odds one way or the other.

In fact anyone with a positive w/l ratio that drives the Locust in general proves my point.


All you've done is make assumptions.

Show me the math. Not 'it's obvious' or 'anyone can see' but the math. Show me the statistical analysis that shows that if I put 1 person in a semi-random group of 12 people and track 100 games that player was in, I can't identify his performance. Find me the statistical representation that shows it's impossible. I've given you countless examples and shown you the math in several posts. 8.33% of your team's performance is *plenty* of impact for me to chart. Any mechanical engineer would laugh in your face if you told him 8.33% isn't enough of an impact to quantify in 300, 400, 500 tests.

Victor, you're wrong. I get that you don't *feel* wrong, but your feelings are not trustworthy. Nobody's are. That's why math doesn't run on feelings, and why it's used for things like this.

Damage, assists, kills, all can be gamed. Easily. They are not trustworthy metrics for how a player impacts the probability that his team will win vs another team. That's why in chess Elo is based off win/loss, not pieces taken or how many moves or even point value of pieces taken vs left. It's win/loss because that's the only accurate indicator of the critical end result -

How likely your performance is to influence the odds of a win or a loss.

Matches in MW:O are not incredibly random. I work in a field that involves pulling signal out of stats far more random than the factors that go into a player helping his team win a match.

I get that it feels random to you. It's not. There are a lot of factors, but they're not random. No dice rolls. It's all behaviorally driven and falls within incredibly narrow parameters. Given general human parameters, most of your fellow players fall within the same narrow range of skill. It's the extremes on both ends that are more and more statistically uncommon.

The issue here isn't the matchmaker. It's negativity bias and confirmation bias and the very human behaviors that they drive. You feel like the consequences of a match are beyond your ability to influence. They are not. You feel like the system is punitive and you are paying for the mistakes of others. You are not.

Go back over your posts - how often do you reference wins you got because other players made up for your mistakes? You won't see that reference from you or anyone else. What you will see is references from many people to how they got done wrong. Again this isn't a criticism but an identification of normal human behavior. Situations where you feel like you were wronged or forced to pay for the mistakes of another or a reward you feel you were due was withheld have a lot more weight in the human psyche and memory.

Your memory and perception are not trustworthy or reliable. Math is. Statistical analysis is, within a certain margin of error. Win/loss is the most reliable 1/0 factor that can be tracked between games for a single player. It doesn't *feel* fair. Fair things rarely do. When we as people ask for 'fairness' or 'mercy' we actually mean 'weighted in our favor'. That's alright, and it's normal and natural. We don't weight times we won by luck nearly as heavily as we weight times we lost because of someone else's mistake.

At this point though you need to show the math proving that 8.333% is not sufficient to differentiate a data point over hundreds of samples, especially if that data point is the only universal factor between the samples. You can't, because even 1% influence is significant over hundreds of samples.
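The sample-size claim can be put on a back-of-envelope footing with the normal approximation to the binomial. The edge values below are illustrative assumptions; the point is how the required match count scales:

```python
# How many matches to detect a player's edge in raw win/loss?
# Normal approximation: the observed win rate must sit about z standard
# errors above the 50% baseline. Edge sizes here are assumptions.
import math

def matches_needed(edge: float, z: float = 2.0, p: float = 0.5) -> int:
    """Matches for a true win rate of p + edge to sit z standard errors from p."""
    return math.ceil((z / edge) ** 2 * p * (1 - p))

print(matches_needed(0.04))  # 625: a 4-point edge shows in a few hundred matches
print(matches_needed(0.01))  # 10000: smaller edges need quadratically more
```

So an influence worth a few percentage points of win probability is indeed chartable over hundreds of matches, while smaller edges need quadratically more samples.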

Not trying to be harsh here, Victor, but there is no way to win this argument. You can say that it's not enjoyable - you can say that you don't feel adequately rewarded for losing matches when you did well. You can bring a lot of good points out of what you're saying, and at the end of the day, if a game isn't fun the math isn't very relevant.

Win/loss via Elo though is the only trustworthy metric for weighting players for ranking. That's not an opinion, just a mathematical observation. You can debate how that metric is weighted in the matchmaker, how the +/- of points and k-values in the Elo need to change to give better results, how they should be valued vs tonnage, a lot of things to argue. The value of win/loss isn't one of them.

#129 Victor Morson

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • Elite Founder
  • 6,370 posts
  • LocationAnder's Moon

Posted 11 November 2013 - 03:40 PM

View PostMischiefSC, on 11 November 2013 - 02:01 PM, said:

Show me the math. Not 'it's obvious' or 'anyone can see' but the math. Show me the statistical analysis that shows that if I put 1 person in a semi-random group of 12 people and track 100 games that player was in, I can't identify his performance. Find me the statistical representation that shows it's impossible. I've given you countless examples, shown you the math in several posts. 8.33% of your team's performance is *plenty* of impact for me to chart. Any mechanical engineer would laugh in your face if you told him 8.33% isn't enough of an impact to quantify in 300, 400, 500 tests.


Any mechanical engineer would laugh in your face for suggesting using something with such muddy numbers as the basis of your system.

Effectively all you are doing is saying that over 200 dice throws, things will even out except for the person throwing them and whatever minor impact that might have on the proceedings. You are tracking data that is entirely out of people's control, unless they "stack the deck" and refuse to run solo, and trying to use that to justify performance.

You aren't even considering other factors. What if a light 'mech pilot is doing these drops? Are they supposed to have any major impact in the game? For every cap win or save that a good light pilot can make, they will run into 20 instances out of their control. While you could argue that this averages out over time, again, this is extremely muddy.

I am saying that pilot skill is one of the smallest factors buried under layers of factors. When all is said and done they will cancel out the information gleaned.

The only time people pull, again, from such muddy sources in the real world is when they are trying to skew the numbers. Numbers don't lie but they can be twisted through intent or incompetence very easily. I could go out and find you, say, tons of numbers that show Global Warming doesn't exist, but all of those numbers were pulled from {Scrap} data sources in the first place to get the desired result.

Tracking win/loss in a game where pilots are unlikely to have a definitive say in the outcome 99.999% of the time is garbage.

In fact, you keep going on the assumption that it would happen enough to add an 8% dip, because you and a few others have this mentality that every pilot makes a change to the outcome. It almost never, ever happens despite the rambo-esque stories being shared here, all of which have been debunked.

Long story short, that +4% win/loss stat could be anything from piloting to having an abnormally high win/loss count. Over a reasonable sample period (say, 500 games) there will be pilots on both sides of that fence through no action of their own. By your own admission this would take nearly 2,000 games to even START to show anything usable.

You know what? If you track things ACTUALLY under the pilots control, you could probably get them into a properly matched game in under 100.

So again, even if you were theoretically correct (and I firmly believe you are not because, again, the constant you are counting on is an almost non-existent factor) it's still the worst possible way to go about it.

EDIT: I will bet you ten bucks that nearly every pilot that has played 500+ games without playing 4 mans has a nearly 50/50 ratio and that their involvement probably counts for less than one half of one percent of that number. While that is a value you could use, the margin of error is still wider than the Grand Canyon.

EDIT 2: Also I would like to point out scenarios in which one pilot CAN make a difference, themselves, are entirely random and just as likely to be seized upon by another pilot of equal skill level. This even further randomizes the elements surrounding your constant.

Long story short, tracking win/loss to sort pilots into a game is dumb and it doesn't work. If you believe it does work, maybe look around at all the ELO threads or go play some games and note the experience of the pilots in it. The way ELO is handled and the way you endorse is, quite frankly, impractical at the very best.

PS: If someone did spend time gaming assists/kills they would... uh.. get into a higher ELO and face harder enemies? Like you said this isn't a reward system, it's a way to sort pilots effectively. These scenarios are a non-issue. In general, these factors are far, far more telling about a pilot's typical potential.
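
How much of a ±4% win rate after 500 games could be pure luck can actually be put in numbers. A quick sketch (coin-flip players, illustrative figures only, not MWO data):

```python
import math
import random

games = 500
# Standard error of a coin-flip win rate over `games` matches.
se = math.sqrt(0.5 * 0.5 / games)
print(round(se, 3))  # -> 0.022

# Simulate 1000 perfectly average players: how many drift 4+ points
# from 50% by luck alone?
random.seed(1)
rates = [sum(random.random() < 0.5 for _ in range(games)) / games
         for _ in range(1000)]
outliers = sum(abs(r - 0.5) >= 0.04 for r in rates)
print(outliers)  # on the order of 70 of the 1000
```

So a ±4% deviation after 500 games sits near the edge of what chance alone produces - which is the margin-of-error question being argued here.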

Edited by Victor Morson, 11 November 2013 - 03:48 PM.


#130 Ghogiel

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • CS 2021 Gold Champ
  • CS 2021 Gold Champ
  • 6,852 posts

Posted 11 November 2013 - 03:58 PM

Actually Victor doesn't seem to get it.

All other factors besides a player's own ability to win matches are random, and thus they average out, leaving only the thing highlighted: the player's own ability to win matches.

#131 Grits N Gravy

    Member

  • PipPipPipPipPipPip
  • 287 posts

Posted 11 November 2013 - 04:01 PM

View PostMischiefSC, on 11 November 2013 - 01:41 PM, said:


You, sir, get it.

Here's the issue you're fighting with though - if you reduce the point impact per match you extend the total matches required to accurately predict a result on a given player. The biggest problem right now is that you'd need something in the neighborhood of 400 to 500 matches to accurately seat someone at an 1800 Elo because the narrowing band of available opponents of comparable skill gives you diminishing returns on your wins and increased decline on your losses.

Second, the matchmaker does weight balancing as well, so it exacerbates the narrow range of options at particularly high and low Elo, as it narrows the people available at any given time in any band.

This is why you actually DO want faster, even inappropriate jumps up in Elo in the early part of a player's Elo progression - statistically it's actually better for players to be ranked higher than they deserve as it fattens the higher Elo bands, even if just artificially, and allows higher Elo players to more quickly settle into an appropriate rank. Less skilled players incorrectly rated as high Elo will inevitably settle into an appropriate rank, and an extra 10 or 20 matches for them to do so because they are inaccurately reflected higher isn't a big deal. Having a highly skilled player incorrectly reflected as a lower Elo, and thus negatively skewing the results of numerous other players, is a bigger problem.

Does that make sense? I agree completely that matchmaking would be better served by tightening the band it pulls from for matches, but to do so effectively and still have good matchmaking you've got to fatten the higher and lower bands enough to populate matches. Otherwise you're inevitably going to have high and low ranked outliers used to offset imbalances in otherwise standard matches. This makes progression in higher bands incredibly slow, almost impossibly so. At the same time it pushes low Elo players into middle Elo rankings far longer than they should be. It squeezes everyone to the middle.

The mixing of 4man and pug Elo exacerbates this, especially when someone's Elo gets carried into high Elo bands by a skilled team and he then pugs, throwing the weighting for that whole match off for everyone. He takes a disproportionate hit when he can't pull the weight the MM incorrectly believes he can, and the weighting for the match being off throws off the results for everyone. If it were isolated it wouldn't be a big deal, but given the preponderance of 4mans who sometimes pug, this is going to be a constant disruption in accurate prediction. If I were trying to use the resulting data of that to sell a product marketing projection I wouldn't show it to anyone who could understand the figures. They're still going to be accurate-ish but even predicting the margin of error would be tough.


The issue isn't the convergence time; it's that there is no convergence, and the system basically functions as if there were no matchmaking system. This occurs because the K factor remains way too high for far too long and the spread of the matches is too wide. So you end up with a large portion of your games at the maximum rate of change possible. From what I've read it's 50 points per match max. Thus in 6 matches it's possible to get to 2σ. Which means you have the majority of players with Elo scores that bounce between μ and ±1.5σ. Scores will not stabilize in that environment.

Which is why using the aggregate mean of wins/total games played can lead to a false belief that matchmaking is even functional, as a binomial system with matchmaking and without should both have a mean win rate of 50%.


It's a false dichotomy to say that tighter criteria lead to longer convergence and queue times.
Large K factors vastly reduce the population density of all the scores. They flatten out the distribution, which means σ tends to be artificially large. Thus if you want to match people in a timely manner there has to be great variation in their scores. So you end up in a feedback loop where you need wider scores to keep queue times reasonable, which is what we have now.

With smaller K factors, the population density of scores around μ increases to the point that σ becomes much smaller. The match and team making system self-corrects better with a smaller σ and K factor. The system doesn't end up in a wild state of constant flux. Which is better for players as it provides more matches and teams of players much closer to their Elo scores.

Right now there are too many players of wildly differing skill levels on the same teams. That largely cancels out the effect of any type of matchmaking. It gets to the point where, if Elo scores are so varied within a team or from team to team, you are better off with a more stringent weight balance, as this will produce a better quality match.

In the end, Elo systems really aren't that great for getting players quality matches. In some of the best case scenarios they are accurate about 30% of the time. When it comes down to it, the best system for getting players quality matches is to use a weight system that discounts tonnage based on the number of 3 and 4 man groups.
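
The instability described above is easy to illustrate with a toy random walk - one player of exactly average skill, tracked under a large and a small K factor (a sketch, not the real matchmaker):

```python
import random

def rating_spread(k, matches=2000, seed=7):
    """One player of exactly average skill against equal opponents:
    track the Elo random walk and return the rating's spread over the
    last 500 matches. Against equal opponents each result moves the
    rating by +/- k/2."""
    random.seed(seed)
    r, tail = 1500.0, []
    for i in range(matches):
        r += k * ((random.random() < 0.5) - 0.5)  # win: +k/2, loss: -k/2
        if i >= matches - 500:
            tail.append(r)
    return max(tail) - min(tail)

# A tenfold K means a tenfold spread: the score keeps bouncing instead
# of settling, which flattens the distribution.
print(rating_spread(50) > rating_spread(5))  # -> True
```

With identical luck, the k=50 walk wanders exactly ten times as far as the k=5 walk, which is the flattening-of-the-distribution effect in miniature.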

Edited by Grits N Gravy, 11 November 2013 - 04:03 PM.


#132 Ghogiel

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • CS 2021 Gold Champ
  • CS 2021 Gold Champ
  • 6,852 posts

Posted 11 November 2013 - 04:12 PM

View PostGrits N Gravy, on 11 November 2013 - 04:01 PM, said:


It's a false dichotomy to say that tighter criteria lead to longer convergence and queue times.

When they tightened the Elo ranges the match maker could use to place players in matches the wait times did increase dramatically for higher level players. To the point even I couldn't get in a game during certain hours of the day.

#133 Grits N Gravy

    Member

  • PipPipPipPipPipPip
  • 287 posts

Posted 11 November 2013 - 04:32 PM

View PostGhogiel, on 11 November 2013 - 04:12 PM, said:

When they tightened the Elo ranges the match maker could use to place players in matches the wait times did increase dramatically for higher level players. To the point even I couldn't get in a game during certain hours of the day.

It's because the system is inherently broken as implemented. Adjusting spreads is a band-aid solution. Ever wonder about the math that derived your score, so high it can't find a match of 23 other players? Unless you're in the top 100 players there really shouldn't be a problem finding matches. The only reason there is one is that the scores get so divergent as a function of poor implementation. The solution is counterintuitive, i.e. tighter matchmaking close to the mean and wider criteria outside 1σ of the mean score.

#134 MischiefSC

    Member

  • PipPipPipPipPipPipPipPipPipPipPipPip
  • The Benefactor
  • The Benefactor
  • 16,697 posts

Posted 11 November 2013 - 06:10 PM

View PostVictor Morson, on 11 November 2013 - 03:40 PM, said:


Any mechanical engineer would laugh in your face for suggesting using something with such muddy numbers as the basis of your system. […]


So I notice no math in your response. Just assumptions. No engineers have laughed in my face, and given that I deal with this for a living, it would have come up.

Show me the math Victor. Show me the math that says 8.333% is not statistically relevant, especially when drawn out over 300, 400, 500 samples.

You can't. All you've done is list a bunch of speculated opinions based on the idea that, well, it's just too tough to calculate. You've got absolutely no data to back you up.

As to the dice throwing comment: I'm saying that if you throw 11 six-sided dice and one die that's always a 5, or even just one die that's 8-sided, then after 200 throws you will consistently and clearly end up with a higher result than if you rolled 12 six-sided dice 200 times. That you don't believe this is why I listed the psychological behaviors that drive your decision to ignore this fact. Now we're into persistence of belief. Think about it. This is also the part where you accuse me of the same thing even though I've been listing mathematical evidence for my position and yours is based entirely on an opinion which cannot be supported.

Your behavior influences the win/loss of your team by ~8.333%. Not 0.0001%. This includes the mech you drive and how you drive it. Everything. Everything together that you do fits into that 8.333%.

At this point I'm going to have to go with 'show me the math'. I've reiterated and re-confirmed all the proofs I've put forward. You have a baseless belief that the numbers are 'too muddy'. What does that look like, statistically? The inability to pick a single detail out?

Back to the dice. You're the 8-sided die in with 11 six-sided dice. Your results influence the totals over time, and the more times you roll, the more clearly those results show. If you're the caliber of an 8-sided die, then after 100 rolls you'll put the total at about 4300, not the 4200 that 12 six-sided dice would average. If you can see each die's total each throw but can't see how many sides it has, then you can go back over those 100 rolls and see which of those 12 dice consistently threw the higher numbers.

In fact, the die that you represent could be any size from 1d4 to 1d12 - after 100 rolls I could tell you exactly what size the die was by the numbers it consistently threw.

Show me the math, Victor. Show me statistical data that mathematically shows that 8.333% is not enough to plot performance after several hundred samples. The other 11 factors are all like you in their way - they equate to 8.333% of the team's performance, but just like you they will vary by a couple percent each match depending on skill, mech and other factors. You, however, are the consistent factor in all of your games, and thus while the other factors balance out due to probability theory over time, your performance can be safely charted.
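
The dice argument is simple to simulate. A short sketch (slot 0 plays the part of the 8-sided die; the numbers are illustrative):

```python
import random

def find_odd_die(rolls=100, seed=3):
    """Roll 11 six-sided dice plus one eight-sided die (slot 0) and
    identify the odd die out purely from its accumulated total."""
    random.seed(seed)
    sides = [8] + [6] * 11          # slot 0 is the 'd8 player'
    totals = [0] * 12
    for _ in range(rolls):
        for i, s in enumerate(sides):
            totals[i] += random.randint(1, s)
    return totals.index(max(totals))

print(find_odd_die())  # almost always 0: E[d8] = 4.5 vs E[d6] = 3.5
```

With 100 rolls the d8's expected total (about 450) sits far outside the noise of any d6's total (about 350 with a spread of roughly 17), so the per-slot totals pick it out reliably.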

#135 Roadbeer

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • Elite Founder
  • 8,160 posts
  • LocationWazan, Zion Cluster

Posted 11 November 2013 - 06:16 PM



#136 MischiefSC

    Member

  • PipPipPipPipPipPipPipPipPipPipPipPip
  • The Benefactor
  • The Benefactor
  • 16,697 posts

Posted 11 November 2013 - 06:33 PM

View PostGrits N Gravy, on 11 November 2013 - 04:01 PM, said:


The issue isn't the convergence time; it's that there is no convergence and the system basically functions as if there were no matchmaking system. […]


The K-value is ~5 points per match though. Also we used to play with very tight weight balancing and no Elo - it was a disaster for anyone who wasn't a good player or in a 4man. I know that when I ran 4mans in the old system my win rate was a little north of 80%. That wasn't uncommon.

Tighter criteria leading to longer queue times isn't a false dichotomy - the more restrictions on finding a match, the longer it takes, unless the pool is big enough. Right now the pool isn't big enough.

I absolutely agree that in the long term you want to scale k-value back but for the moment there are not enough players and not enough spread in Elo bands to facilitate that without making progression in Elo too difficult for people who only play 5 or 10 matches a day.

The perception of wide shifts in player ability is also incorrect - most people fall, skill-wise, into the same middle band. Slicing that more finely would help, but for a lot of players mech build is more important than trigger skill. Most people aren't that good. Thus you've got a fat middle band and a sharply tapered top and bottom, which is pretty consistent in any competitive sport or skill. The key is having enough people in the tapered ends to fill out their matches without having to pull people from the middle.

A better goal is getting that middle band cut more precisely to get more homogeneous matches for the middle 80% of players who probably fill a swath closer to 20% of the total possible Elo bands - the middle. When you have people in the lower-middle getting pulled into matches with people who are in the top 10% of players you skew their Elo score. 1 match in 50 isn't significant but 1 in 5 certainly is. This is part of the problem with a combined premade + pug Elo. Premade teams tend to significantly out-perform pugs, meaning a player in a premade team will have an Elo score significantly out of the band of a pure pug. While he may represent it when dropping in his premade he does NOT represent it when pugging. The result is that his presence while pugging in a match skews the matchmaking and how Elo awards points for win/loss based on an incorrect assumption of his skills.

Make sense? ~5 points/match is a good choice and while I'd argue that higher would be good for anyone with less than 300 or 400 drops in a weight class (remember each weight class has a separate Elo score for you) it's good once you're pretty close to your goal. Weight is reasonably important but not critical - not as much as being in a premade or pugging is. Right now what's going to impact matchmaking the most IMO is people with a high Elo from dropping often in premades who then pug. They're weighted higher than their solo skill would reasonably represent.

In the 2k matches I had prior to the introduction of Elo my first 400, 500 matches had me with a win-rate of about 35%. Then I started dropping with friends and my win rate skyrocketed to an average of over 60% by the time Elo came out, even factoring in my prior pugging losses. Most times I dropped with friends we won four out of five matches. Now, since Elo has come out, my average is at about 55%. My matches are far more challenging now and the caliber of people I play with and against far, far better.

It's still got a ways to go though. I'd say split premade/pug Elo and tighten matching to a hard limit on tonnage, say 150 tons max difference with a target goal of 66% of matches within 50 tons. Second criteria being Elo matching first in band and widening instead of shooting for an overall total score via high and low mixing. While it'd give longer queue times after a couple hundred matches for everyone it should put the bulk of players in a better matching environment - the big issue being the top and bottom 10% who likely can't fill their own queue.

Hmmm.... I wonder. Could you set it so once past a certain point you widen tonnage requirements for high or low Elo scores? The idea being that if you're good enough tonnage is less significant. I'd so much rather see statistical outliers kept in their own ranges rather than skewing results in the middle.
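
The "match in band, then widen" idea floated above can be sketched as a simple filter (the pool, names and band widths are made up purely for illustration, not PGI's matchmaker):

```python
def find_opponents(player_elo, pool, need=11, widths=(50, 100, 200, 400)):
    """Search progressively wider Elo bands around a player until a
    full match can be seated. `pool` is a list of (name, elo) pairs;
    the band widths are hypothetical numbers."""
    for w in widths:
        band = [p for p in pool if abs(p[1] - player_elo) <= w]
        if len(band) >= need:
            return w, sorted(band, key=lambda p: abs(p[1] - player_elo))[:need]
    return None, []                 # pool too thin; keep waiting

# Hypothetical pool: 20 pilots spaced 40 Elo apart, from 1300 to 2060.
pool = [("pilot%d" % i, 1300 + 40 * i) for i in range(20)]
width, team = find_opponents(1500, pool)
print(width, len(team))  # -> 200 11
```

The trade-off the thread keeps describing falls straight out of this: a thin pool forces the search into the wider bands, while the fat middle band fills at the tightest width.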

#137 Roadbeer

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • Elite Founder
  • 8,160 posts
  • LocationWazan, Zion Cluster

Posted 11 November 2013 - 08:06 PM

http://mwomercs.com/...ost__p__2912595

Do with that data what you will.

#138 Victor Morson

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • Elite Founder
  • 6,370 posts
  • LocationAnder's Moon

Posted 11 November 2013 - 10:29 PM

View PostMischiefSC, on 11 November 2013 - 06:10 PM, said:

Your behavior influences the win/loss of your team by ~8.333%. Not 0.0001%. This includes the mech you drive and how you drive it. Everything. Everything together that you do fits into that 8.333%.


Your math is already off, again, on random functions. Because you are 8.333% of your team does not mean you influence 8.333% of the outcome. Your base logic is wrong and it's clouding everything you are doing after that. Not even accounting for skill variance, let's look at tonnage - are you saying a Highlander impacts the round as much as a Locust? Are you saying a Frankenmech impacts the round as much as an optimized one?

There are many rounds where you have 0% impact. For example, a disconnected mech can still be on the winning team, as can one that is not pulling its weight. I'd say that since the cap speed got raised, instances of 100% factors are very rare. That means that 8.333% number is, effectively, complete garbage.

Your core numbers are broken before they begin, man. That is why everything you say to prove your point afterwards doesn't work. It's built on a broken foundation.

Edited by Victor Morson, 11 November 2013 - 10:31 PM.


#139 Ghogiel

    Member

  • PipPipPipPipPipPipPipPipPipPip
  • CS 2021 Gold Champ
  • CS 2021 Gold Champ
  • 6,852 posts

Posted 11 November 2013 - 10:37 PM

View PostGrits N Gravy, on 11 November 2013 - 04:32 PM, said:

Ever wonder about the math that derived your score that was so high it can't find a match of 23 other players? Unless you're in the top 100 players there really shouldn't be a problem finding matches.

Problem is, no one has any idea what my Elo is, how many players are online at off-peak times with similar Elo ratings, or even the maximum threshold from which the game can pick players.

#140 Grits N Gravy

    Member

  • PipPipPipPipPipPip
  • 287 posts

Posted 11 November 2013 - 11:02 PM

View PostMischiefSC, on 11 November 2013 - 06:33 PM, said:


The K-value is ~5 points per match though. Also we used to play with very tight weight balancing and no Elo - it was a disaster for anyone who wasn't a good player or in a 4man. I know that when I ran 4mans in the old system my win rate was a little north of 80%. That wasn't uncommon.

Tighter criteria leading to longer queue times isn't a false dichotomy - the more restrictions on finding a match, the longer it takes, unless the pool is big enough. Right now the pool isn't big enough.

The perception of wide shifts in player ability is also incorrect. The key is having enough people in the tapered ends to fill out their matches without having to pull people from the middle.

I've seen quoted at 50 here, where did you get 5 from?
http://mwomercs.com/...ost__p__1626065

Wait times have little to do with the total size of the player pool. Wait times as a function of matchmaking criteria are largely a function of the distribution of scores. With a wider distribution of scores you end up with either lots of variance in your matchmaking criteria and short queues, or tighter criteria with longer queues. The driving force is the distribution of scores, not total pool size.

It's not that players' ability is evenly distributed; it's that their scores are. There are a lot of players both over- and under-ranked as a function of bad Elo math. This is evident in the graph I posted earlier. Look at how the density of the distribution changes past 50 games. The trend is away from convergence at the mean. The ranking system is thinning the pools of players at various scores. Under- and over-ranking robs you of the ability to set tight matches with small queue times.

If the system worked well we would expect a much denser cluster of players around the mean and, in turn, a smaller standard deviation. This also solves issues for bad and good players - those with scores beyond ±1.5 standard deviations from the mean. As these groups become more densely populated they are easier to match. Right now there are too many people with scores beyond ±1.5 standard deviations from the mean. In a standard normal distribution that would represent roughly 13% of the total population of scores; in MWO's Elo it is much higher.

The key isn't population size, it's the proper distribution of scores - which you'll never accomplish with wide variance in an Elo system. As more players get added, scores drift further away from the mean. Elo drift has been an issue even in chess, where the system is arguably much better suited.
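
For reference, the exact share of a standard normal population beyond ±1.5σ is easy to check (it comes out near 13%):

```python
import math

def fraction_outside(sigmas):
    """Share of a standard normal population beyond +/- `sigmas`."""
    return 1.0 - math.erf(sigmas / math.sqrt(2.0))

print(round(fraction_outside(1.5), 3))  # -> 0.134
```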




