My understanding of overfitting is this: you have a training set and you fit a model to it, but the model is tuned so closely to that data that it only works well on that specific dataset. If you apply the model to other datasets (scenarios), the results are bad.

Data are not perfect. In most cases, the training set contains noise, which should be filtered out rather than learned by the model.
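To make the idea concrete, here is a minimal sketch using NumPy. Everything in it is an assumption for illustration only: the data are synthetic (a made-up straight line plus random noise), and the names true_relation and mse are hypothetical helpers, not part of any case study. It fits a simple model (degree 1) and a very flexible one (degree 9) to the same ten noisy points and prints the error of each on the training points and on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_relation(x):
    """Hypothetical underlying relation: a plain straight line."""
    return 2.0 * x + 1.0


# Ten noisy training points and fifty noisy points the model never saw.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_relation(x_train) + rng.normal(0.0, 0.3, size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = true_relation(x_test) + rng.normal(0.0, 0.3, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


simple = np.polyfit(x_train, y_train, deg=1)    # a straight line
flexible = np.polyfit(x_train, y_train, deg=9)  # passes through every training point

for name, coeffs in (("degree 1", simple), ("degree 9", flexible)):
    print(name,
          "| train MSE:", mse(coeffs, x_train, y_train),
          "| test MSE:", mse(coeffs, x_test, y_test))
```

The degree-9 polynomial has enough freedom to pass through all ten training points, so its training error is essentially zero: it has learned the noise too. Comparing the two test errors printed above is the usual way to spot this kind of overfitting.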

I have also written this post:

The Machine Learning Case Study – How to Predict Weight over Height/Gender using Linear Regression? 

Based on many samples of weight/height pairs (height in cm, weight in kg), the fitted models are:

Male Weight = -101.24 + 1.061 * Height

Female Weight = -110.20 + 1.062 * Height

I am 174 cm tall, so with the coefficients above the male model predicts about 83.4 kg (-101.24 + 1.061 * 174). I actually weigh 80.0 kg, so according to this model I am fit. This is soooo much more reliable than BMI.
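The arithmetic can be checked with a few lines of Python. This is only a sketch: the coefficients are the rounded ones quoted above, and the names COEFFICIENTS and predicted_weight_kg are made up for this example.

```python
# Rounded (intercept, slope) pairs quoted in the post; height in cm, weight in kg.
COEFFICIENTS = {
    "male": (-101.24, 1.061),
    "female": (-110.20, 1.062),
}


def predicted_weight_kg(height_cm, sex):
    """Predicted weight from the linear model: intercept + slope * height."""
    intercept, slope = COEFFICIENTS[sex]
    return intercept + slope * height_cm


print(round(predicted_weight_kg(174, "male"), 1))  # 83.4
```

With these rounded coefficients a height of 174 cm gives roughly 83.4 kg for the male model.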

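Big data is hot these days. With big data, it seems you can get rich without even doing much. Generally, once you have the data, you can run learning algorithms on it to obtain models, and then use those models to make predictions.

But your data (the training set) may well contain special cases, also called noise. We need to filter these out or ignore them during learning; otherwise the resulting model will be overfitted. An overfitted model fits the current training set extremely well but does not suit other scenarios.

Overfitting (image from the Internet)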

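Recommended reading on learning from data (in English): The Machine Learning Case Study – How to Predict Weight over Height/Gender using Linear Regression?

That article learns from a large number of male/female weight-versus-height samples and obtains two models:

Male Weight = -101.24 + 1.061 * Height

Female Weight = -110.20 + 1.062 * Height

I am 174 cm tall, so my weight should be about 83.4 kg. I actually weigh 80.0 kg, so I'm not fat… This is far more reliable than BMI. 😂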

