博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
integer encoding vs 1-hot (py)
阅读量:5075 次
发布时间:2019-06-12

本文共 1543 字,大约阅读时间需要 5 分钟。

https://github.com/szilard/benchm-ml/issues/1

 

 commented 

Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.

I would expect faster, lower memory but decrease in AUC (or same in some cases).

When the categories are ordered, it makes more sense indeed to handle them as numerical variables. I dont have a strong argument as to why it may be also better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoding splits are often very unbalanced, while integer encoded splits may be less unbalanced.

 

Thanks . I read somewhere a paper that AFAIR suggested to sort the (non-ordered) categoricals in order of their frequency in the data and encode them as integers as such. Any idea what that paper might be?

 

 commented 

Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variables) and linear.

See section 3.6.3.2 of my thesis if you dont have the CART book.

 

One-hot encoding could be helpful when the number of categories are small( in level of 10 to 100). In such case one-hot encoding can discover interesting interactions like (gender=male) AND (job = teacher).

While ordering them makes it harder to be discovered(need two split on job). However, indeed there is not a unified way handling categorical features in trees, and usually what tree was really good at was ordered continuous features anyway..

 
 
 

转载于:https://www.cnblogs.com/xinping-study/p/7085221.html

你可能感兴趣的文章
Jmeter入门实例
查看>>
亲近用户—回归本质
查看>>
中文脏话识别的解决方案
查看>>
CSS之不常用但重要的样式总结
查看>>
Python编译错误总结
查看>>
URL编码与解码
查看>>
日常开发时遇到的一些坑(三)
查看>>
Eclipse 安装SVN插件
查看>>
深度学习
查看>>
TCP粘包问题及解决方案
查看>>
构建之法阅读笔记02
查看>>
添加按钮
查看>>
移动端页面开发适配 rem布局原理
查看>>
Ajax中文乱码问题解决方法(服务器端用servlet)
查看>>
会计电算化常考题目一
查看>>
阿里云服务器CentOS6.9安装Mysql
查看>>
剑指offer系列6:数值的整数次方
查看>>
js 过滤敏感词
查看>>
poj2752 Seek the Name, Seek the Fame
查看>>
软件开发和软件测试,我该如何选择?(蜗牛学院)
查看>>