apachecn / interview

Interview = résumé guide + algorithm problems + standard interview Q&A (八股文) + source-code analysis

Home Page: https://interview.apachecn.org

License: Other

Python 31.57% Jupyter Notebook 58.18% Shell 0.30% Java 0.64% HTML 0.55% Dockerfile 0.01% CSS 4.25% JavaScript 4.50%

interview kaggle leetcode machine-learning python

interview's Introduction

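Interview: a knowledge base for IT-industry interview preparation

A programmer's hands are a magician's hands: they turn dry, dull code into rich and colorful software. (from《疯狂的程序员》)

Read online

URL: https://interview.apachecn.org

License

CC BY-NC-SA 4.0

Sponsor us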


interview's Issues

The Gitee mirror in China returns 404

https://apachecn.gitee.io/interview now returns 404. Absurd; "far ahead" indeed.

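[Collection] Interview terminology

Business modeling:
Data modeling: uses data-modeling techniques to analyze data objects and so understand what the data actually means.

Statistical analysis:
Comparative analysis:
Cluster analysis: the process of grouping similar objects together so that each group of similar objects forms a cluster. Its purpose is to analyze the differences and similarities among data points.
Regression analysis:
Discriminant analysis:
Correlation analysis:
Correlation coefficient: a measure of the degree of linear correlation between variables (the Pearson correlation coefficient is the most commonly used).

Anomaly detection: searching a dataset for items that do not match an expected pattern or behavior.

Data sampling:
Data augmentation:

Feature selection:

Feature engineering:
Data cleaning: the process of re-examining and validating data in order to remove duplicates, correct existing errors, and ensure data consistency.
Dimensionality reduction: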
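
As a small illustration of the "correlation coefficient" entry above, here is a minimal sketch of the Pearson coefficient computed straight from its definition. The data (study hours and test scores) and the helper name `pearson_r` are made up for the example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]

r = pearson_r(hours, score)
print(round(r, 3))  # close to 1, i.e. a strong positive linear relationship
```

`np.corrcoef(hours, score)[0, 1]` should give the same value; the explicit version is only there to show what is being computed.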

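Self-recommendation: https://github.com/aQuaYi/LeetCode-in-Go#leetcode-%E7%9A%84-go-%E8%A7%A3%E7%AD%94

A collection of my personal LeetCode solutions.
I keep it updated, and it is the most complete one on GitHub.

I hope it can be added to the "Recommended LeetCode sites" list.

Tmall Global is the leading cross-border platform for consumption upgrade in **, and the main force behind the Alibaba economy's commitment to import USD 200 billion of goods over five years. In 2019 the Tmall Global technology department merged with Kaola to form Alibaba's import technology department (大进口技术部), the core technology department of Alibaba's internationalization strategy. It is dedicated to technical breakthroughs and innovation for the import business, helping consumers in ** "buy from all over the world" and step into a future trillion-scale market. If you want to know more, contact me directly and I will refer you within the team. There is plenty of headcount (HC), so don't miss it.
Email: [email protected]
WeChat: isHunterZhang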

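[Career advancement] Recommended management books

Economics and business titles:

金字塔原理 (The Pyramid Principle; a classic McKinsey training text for 40 years)
市场研究实务 (an onboarding assessment title at research firms)
品牌知行 (part of a brand trilogy)
决战大数据 (by the "father of Alibaba big data")

Management thinking:

The Five Dysfunctions of a Team, by Patrick Lencioni (团队协作的五大障碍)
The 4 Disciplines of Execution, by Chris McChesney, Sean Covey, and Jim Huling (高效能人士的执行4原则)
Nonviolent Communication, by Marshall Rosenberg (非暴力沟通)
The Checklist Manifesto, by Atul Gawande (清单革命)
The Ideal Team Player, by Patrick Lencioni (理想的团队成员)
Learned Optimism, by Martin Seligman (活出最乐观的自己)
Built to Last, by Jim Collins (基业长青)
The Fifth Discipline (第五项修炼)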

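AdaBoost: making effective use of different classification algorithms

http://blog.csdn.net/guyuealian/article/details/70995333
I think this could be used once we have finished the basic algorithms, as a way to combine them. It should work better than simply taking, for each sample, whichever prediction appears most often across the algorithms as the final result.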
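
To make the contrast with plain majority voting concrete, here is a minimal sketch that weights each classifier's vote by the AdaBoost formula alpha = 0.5 * ln((1 - e) / e). It only borrows the voting weight. Real AdaBoost also trains its weak learners one after another on re-weighted samples. The predictions and error rates below are invented for illustration.

```python
import math
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def adaboost_style_weight(error_rate):
    """AdaBoost voting weight: alpha = 0.5 * ln((1 - e) / e)."""
    eps = 1e-10  # keep the logarithm finite when e is 0 or 1
    e = min(max(error_rate, eps), 1 - eps)
    return 0.5 * math.log((1 - e) / e)

def weighted_vote(predictions, error_rates):
    """Combine +1/-1 predictions, giving more say to more accurate classifiers."""
    score = sum(adaboost_style_weight(e) * p
                for p, e in zip(predictions, error_rates))
    return 1 if score >= 0 else -1

# Hypothetical: three classifiers vote on one sample (labels are +1 / -1)
preds = [1, -1, -1]
errors = [0.05, 0.40, 0.45]  # assumed validation error rate of each classifier

print(majority_vote(preds))          # -1: two weak classifiers outvote the strong one
print(weighted_vote(preds, errors))  # 1: the accurate classifier dominates
```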

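svm-python3.6.py: the saveResult function writes an extra blank line after each row

Pass newline='' when calling open:

import csv

def saveResult(result, csvName):
    # newline='' stops the csv module from emitting an extra blank line per row (seen on Windows)
    with open(csvName, 'w', newline='') as myFile:
        myWriter = csv.writer(myFile)
        myWriter.writerow(["ImageId", "Label"])
        index = 0
        for r in result:
            index += 1
            myWriter.writerow([index, int(r)])
    print('Saved successfully...')  # prediction results saved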

Summary of big-data interview experiences

https://github.com/WadeStack/BigDataIE

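Call for competition activities & organizers

If you have ideas and the enthusiasm to join a competition (or to reproduce an existing one) but have nobody to team up with, join us and become a competition organizer! Launch your activity, recruit teammates, learn from each other, and aim for bigger wins!

Please comment on this ISSUE with "nickname + QQ + competition name", for example: "飞龙+562826179+kaggle Leaf Classification".

Organizer	QQ	Competition	Notes
张一极	2533524298	Exploring a model with 100% accuracy on handwritten digits
呆呆	728634974		Anything related to DS or KDD; personal level is Kaggle top 20; CV is fine on small datasets, since my compute is limited
Roman	570515024		Big data
1266	1097828409	Santander Customer Transaction Prediction
Datawhale	-	Building a text sentiment classification model

Competition platforms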

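Question about the house price model tutorial

What is the rmsle_cv function in readme.md under kaggle/competitions/getting-started/house-price/? I typed the code in as shown and it raised an error. Many of the functions used there are also never imported, for example: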

from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
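
The readme's own definition of rmsle_cv is not shown in this thread, so the following is an assumption rather than the repository's code. In many public house-price notebooks a helper of this name computes the cross-validated RMSE of a model on the log-transformed sale price. The sketch below follows that convention on synthetic data; the signature, the fold count, and the data are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

def rmsle_cv(model, X, y_log, n_folds=5):
    """Cross-validated RMSE on a log-transformed target (hypothetical signature)."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    # scikit-learn reports negated MSE so that "greater is better"; flip the sign back
    mse = -cross_val_score(model, X, y_log,
                           scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(mse)

# Synthetic stand-in for the house-price features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
price = np.exp(X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 12
               + rng.normal(scale=0.1, size=200))
y_log = np.log1p(price)  # RMSE on log1p(price) is the RMSLE of the price itself

lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
scores = rmsle_cv(lasso, X, y_log)
print(scores.mean(), scores.std())
```

Passing X and y explicitly, instead of reading global `train` and `y_train` variables, avoids the NameError you get when such a helper is pasted in without the surrounding notebook state.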

Columns and DataType Not Explicitly Set on line 345 of titanic-python3.6.py

Hello!

I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem	If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, it may silently continue the next step even though the input is unexpected, which may cause errors later. The same applies to other data importing scenarios.
Solution	It is recommended to set the columns and DataType explicitly in data processing.
Impact	Readability

Example:

### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]

### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})
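
A runnable variant of the two fragments above, as a minimal sketch: it reads an in-memory CSV so that no file is needed. The column names follow the Titanic dataset and the dtypes are chosen for illustration only; they are not taken from titanic-python3.6.py.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (illustrative rows)
csv_text = (
    "PassengerId,Survived,Age,Ticket\n"
    "1,0,22,A/5 21171\n"
    "2,1,38,PC 17599\n"
)

columns = ["PassengerId", "Survived", "Age"]  # explicit column selection
dtypes = {"PassengerId": "int64", "Survived": "int64", "Age": "float64"}

# usecols fixes the downstream schema; dtype makes unexpected input fail early
df = pd.read_csv(io.StringIO(csv_text), usecols=columns, dtype=dtypes)
print(df.dtypes)
```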

You can find the code related to this smell in this link:

Interview/src/py3.x/kaggle/getting-started/titanic/titanic-python3.6.py

Lines 335 to 355 in 4b25be7

    
    pca_tr_data = do_FeatureEngineering(train_data)
    pca_te_data = do_FeatureEngineering(test_data)

    # 3. Model training / model ensembling (classification: lr, rf, adaboost, xgboost, lightgbm)
    model = trainModel(pca_tr_data, train_label)
    model.fit(pca_tr_data, train_label)
    labels = model.predict(pca_te_data)

    # 4. Export the results
    print(type(pids), type(labels.tolist()))
    result = pd.DataFrame({
        'PassengerId': pids,
        'Survived': [int(i) for i in labels.tolist()]
    })
    result.to_csv('Result_titanic.csv', index=False)

    # End time
    end_time = datetime.datetime.now()
    times = (end_time - sta_time).seconds
    print("\n运行时间: %ss == %sm == %sh\n\n" % (times, times/60, times/60/60))

I also found instances of this smell in other files, such as:

File: https://github.com/apachecn/Interview/blob/master/src/py3.x/kaggle/getting-started/digit-recognizer/cnn_pytorch-python3.6.py#L24-L34 Line: 29
File: https://github.com/apachecn/Interview/blob/master/src/py3.x/kaggle/getting-started/digit-recognizer/cnn_pytorch-python3.6.py#L34-L44 Line: 39
File: https://github.com/apachecn/Interview/blob/master/src/py3.x/kaggle/getting-started/digit-recognizer/knn-python3.6.py#L19-L29 Line: 24
File: https://github.com/apachecn/Interview/blob/master/src/py3.x/kaggle/getting-started/digit-recognizer/knn-python3.6.py#L20-L30 Line: 25
File: https://github.com/apachecn/Interview/blob/master/src/py3.x/kaggle/getting-started/digit-recognizer/rf-python3.6.py#L23-L33 Line: 28

I hope this information is helpful!

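Could a decision tree be used to classify digit images?

I'm not confident about it. My feeling is that it won't work, but maybe it turns out surprisingly well?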
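
One way to find out is simply to try it. The sketch below fits a single decision tree on scikit-learn's bundled 8x8 digits dataset, which is an assumption here: it is not the Kaggle digit-recognizer data, so the number it prints says nothing definite about that competition.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1,797 grayscale digit images, each flattened to 64 pixel features
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A single unpruned tree that treats every pixel as an independent feature
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"decision tree test accuracy: {accuracy:.3f}")
```

A tree does run on pixel features, but it ignores the spatial structure of the image. Tree ensembles such as random forests, or a CNN, are the usual next step.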

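How the original AdaBoost algorithm improves classification performance

I still don't really understand it. Could someone explain it in simple terms?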
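
The core idea fits in one round of the algorithm: samples the current weak classifier gets wrong receive a larger weight, so the next weak classifier is pushed to focus on them. Here is a minimal sketch of that single weight update. The five labels and predictions are invented, and the helper name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, updated and normalized sample weights).

    Labels are +1 / -1. Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified samples
    error = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Vote of this weak classifier: larger when its error is smaller
    alpha = 0.5 * math.log((1.0 - error) / error)
    # y * h is +1 when correct (weight shrinks) and -1 when wrong (weight grows)
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # the weak classifier misses only the last sample
weights = [1 / len(y_true)] * len(y_true)  # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 4))
print([round(w, 3) for w in new_weights])  # the missed sample now carries half the weight
```

Repeating this round with a fresh weak classifier each time, then taking the alpha-weighted vote of all of them, gives the final boosted classifier.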

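[Collection] Data analysis models

Data analysis theory

Statistical analysis methodologies include:

descriptive statistics, hypothesis testing, correlation analysis, analysis of variance, regression analysis, cluster analysis, discriminant analysis, principal component and factor analysis, time series analysis, decision trees

Summary of 10 classic statistical methods

Data analysis models

Tencent satisfaction evaluation model

Data analysis projects

Hands-on data analysis project: analyzing user purchasing behavior

Intelligent risk control primer: a complete overview of risk-control decision engines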

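Project layout plan

Large-company interviews: round 1 breadth of knowledge, round 2 technical depth, round 3 project experience, round 4 career planning
Small-company interviews: round 1 skills and projects, round 2 career planning

Breadth of knowledge: basic algorithm interview questions, plus breadth across your skills
Technical depth: building on that breadth, the interviewer picks a few specific skills and digs into their underlying principles and optimization
Project experience: mainly checks, through your past projects, whether you can meet the requirements of the position being filled
Career planning: mostly small talk; after all, both sides are just feeling each other out

To be completed later

How to write a good résumé
How to submit résumés
How to practice problems
How to build up project experience
Interview takeaways

Additions welcome ...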

Multi-label (multi_label) classification

Time series problems (univariate + multivariate)

How to develop LSTM models for time series forecasting
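
Before any LSTM can be trained, a univariate series has to be reframed as supervised samples: a window of past values as input and the next value as target. Here is a minimal sketch of that preparation step only. The helper name `split_sequence` and the toy series are illustrative, and no model is built here.

```python
import numpy as np

def split_sequence(series, n_steps):
    """Turn a 1-D series into (window of n_steps values, next value) pairs."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for start in range(len(series) - n_steps):
        X.append(series[start:start + n_steps])  # input window
        y.append(series[start + n_steps])        # value to predict
    return np.array(X), np.array(y)

series = [10, 20, 30, 40, 50, 60]
X, y = split_sequence(series, n_steps=3)

# Recurrent layers usually expect input shaped [samples, timesteps, features]
X = X.reshape((X.shape[0], X.shape[1], 1))
print(X.shape, y)  # (3, 3, 1) [40. 50. 60.]
```

For the multivariate case, the last axis holds one entry per variable instead of a single feature.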

[Collection] Real-time streaming

Integration project: flume + kafka + spark streaming + hdfs

[Collection] Data collection

Data augmentation: prevents overfitting and improves the model's ability to generalize
- https://zhuanlan.zhihu.com/p/63182132
- https://zhuanlan.zhihu.com/p/102640267
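
A minimal sketch of image-style augmentation, assuming grayscale images stored as arrays with values in [0, 1]. The two transforms (random horizontal flip and Gaussian noise) are chosen for illustration and are not taken from the linked articles. Flips are also not label-preserving for every task, digits being one example.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly perturbed copy of a 2-D image with values in [0, 1]."""
    # Flip left-right with probability 0.5
    flipped = image[:, ::-1] if rng.random() < 0.5 else image
    # Add small Gaussian pixel noise, then clip back into the valid range
    noisy = flipped + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # stand-in for one training image

# Each call yields a slightly different variant of the same image
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)
```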

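[Collection] Feature engineering

Data processing

Subtracting the dataset's mean from every sample removes the component they all share and highlights individual differences.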
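
A minimal sketch of this mean subtraction (centering) with NumPy; the small matrix is invented, with one row per sample and one column per feature.

```python
import numpy as np

# Three samples, two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 240.0]])

feature_mean = X.mean(axis=0)  # per-feature mean across samples
X_centered = X - feature_mean  # broadcasting subtracts it from every row

print(feature_mean)
print(X_centered)
```

After centering, every feature has mean zero, so what remains is each sample's deviation from the average.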

Single-machine feature engineering with sklearn

https://www.cnblogs.com/jasonfreak/p/5448385.html

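Feature engineering series

Feature engineering series: principles and implementation of feature selection (part 1)
https://www.cnblogs.com/purple5252/p/11205500.html
Feature engineering series: principles and implementation of feature selection (part 2)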
https://www.cnblogs.com/purple5252/p/11211083.html

Confusion matrix and how to use the confusion_matrix function

https://blog.csdn.net/u011734144/article/details/80277225
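
A minimal sketch of scikit-learn's `confusion_matrix` on invented three-class labels. Rows are the true classes and columns the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# cm[i, j] counts samples whose true class is i and predicted class is j
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
# [[1 1 0]
#  [0 2 1]
#  [0 0 1]]
```

The diagonal holds the correct predictions. For example, `cm[1, 2] == 1` means one sample of class 1 was predicted as class 2.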

sklearn grid search: finding the best parameters

https://github.com/apachecn/ml-mastery-zh/blob/master/docs/xgboost/tune-number-size-decision-trees-xgboost-python.md
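
A minimal sketch of a grid search over the number and depth of trees using `GridSearchCV`. To stay within scikit-learn it swaps in `GradientBoostingClassifier` for XGBoost, and the dataset (iris) and grid values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# Every combination of these values is cross-validated
param_grid = {
    "n_estimators": [25, 50],  # number of trees
    "max_depth": [1, 2, 3],    # size of each tree
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is the model refit on all the data with the winning parameters, and `search.cv_results_` holds the score of every combination.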

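Algorithm practice: recommending LeetCode Cookbook

I spent all of last year working through problems and collected my solution to every one of them here: https://github.com/halfrost/LeetCode-Go/
The solutions have been compiled into this book, LeetCode Cookbook.

If the maintainers think the quality is good enough, please add my link under the algorithm practice section.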

https://github.com/apachecn/gainlo-interview-guide-zh

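(2) 1point3acres (一亩三分地) has a system design board where many people post English-language resources; these could be translated.

https://www.1point3acres.com/bbs/forum-323-1.html

(3) HighScalability.com is an authoritative site, but I don't know where to start.

Moved to the new issue: #273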

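BAT - collection of technical interview questions

Add system design interview answers to the Interview project

I previously translated a booklet that could be merged in:
- https://github.com/apachecn/gainlo-interview-guide-zh
1point3acres (一亩三分地) has a system design board where many people post English-language resources; these could be translated
- https://www.1point3acres.com/bbs/forum-323-1.html
HighScalability.com is an authoritative site, but I don't know where to start.

Recommended links: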

kaggle-api

I think we could cover the Kaggle API; many operations done on the web page can be turned into commands.
https://github.com/Kaggle/kaggle-api
