Special Edition Data Science and Machine Learning Interview Questions Solved in Python and Spark with Deep Learning and Reinforcement Learning Bonus Questions
Introduction 17
1. What are the most important machine learning techniques? 17
Solution 17
2. Why is it important to have a robust set of metrics for machine learning? 18
Solution 18
Code 19
Feature Engineering and the ETL Process (Extraction, Transformation, Loading) 20
3. Why are feature extraction and engineering so important in machine learning? 20
Solution 20
4. Can you provide an example of feature extraction? 22
Solution 22
Code 22
5. What is the mean, the variance, and the covariance? 23
Solution 23
Code 23
6. What are percentiles and quartiles? 24
Solution 24
Code 24
7. Why are vectors and norms used in machine learning? 24
Solution 24
Code 25
8. What is a convenient tool for performing data statistics? 25
Solution 25
Code 25
9. What is a convenient way to visualize data statistics? 26
Solution 26
Code 26
10. How to compute covariance and correlation matrices with pandas? 27
Solution 27
Code 27
11. What is a TFxIDF? 28
Solution 28
Code 29
12. What is "feature hashing"? And why is it useful for Big Data? 29
Solution 29
13. What is "continuous feature binning"? 30
Solution 30
14. What is an LP normalization? 30
Solution 30
Code 30
15. What is a Chi Square Selection? 31
Solution 31
16. What is mutual information and how can it be used for feature selection? 31
Solution 31
17. How to deal with categorical features? And what is one-hot-encoding? 32
Solution 32
Code 32
18. Can you transform an XML file into Python Pandas? 33
Solution 33
Code 34
19. Can you read HTML into Python Pandas? 35
Solution 35
Code 35
20. Can you read JSON into Python Pandas? 35
Solution 35
Code 35
21. Can you draw a function from Python? 36
Solution 36
Code 36
22. What is a Gaussian? 37
Solution 37
Code 37
23. What is a Standard Scaling? 38
Solution 38
Code 38
24. Why are statistical distributions important? 39
Solution 39
Code 41
25. Can you compare your data with some distribution? What is a qq-plot? 41
Solution 41
Code 41
26. Can you provide an example of connection to the Twitter API? 42
Solution 42
Code 42
27. Can you provide an example of connection to the LinkedIn API? 44
Solution 44
Code 44
28. Can you provide an example of connection to the Facebook API? 44
Solution 44
Code 45
29. What is Parquet? 45
Solution 45
Code 45
Machine learning basics 46
30. What is a Bias - Variance tradeoff? 46
Solution 46
31. What is a training set, a validation set, a test set and a gold set in supervised and unsupervised learning? 47
Solution 47
32. What is a cross-validation and what is an overfitting? 48
Solution 48
Code 49
Code 49
33. Why is Grid Search important? 50
Solution 50
Code 50
Spark and Python 52
34. What is an IPython notebook? 52
Solution 52
Code 52
35. What are NumPy, SciPy and Spark essential datatypes? 53
Solution 53
Code 54
36. Can you provide an example for Map and Reduce in Spark? (Let's compute the Mean Square Error) 54
Solution 54
Code 55
37. Can you provide examples for other computations in Spark? 56
Solution 56
Code 60
38. How does Python interact with Spark? 60
Solution 60
39. What is Spark support for Machine Learning? 61
Solution 61
40. How does Spark work in a parallel environment? 61
Solution 61
Code 61
41. What are the new Spark DataFrame and the Spark Pipeline? And how can we use the new ML library for Grid Search? 62
Solution 62
Code 63
Linear Models and Regression 66
42. What is a loss function, what are linear models, and what do we mean by regularization parameters in machine learning? 66
Solution 66
43. What is an odds ratio? 68
Solution 68
44. What is a sigmoid function and what is a logistic function? 69
Solution 69
Code 70
45. What is a Linear Least Squares Regression? 70
Solution 70
Code 71
46. What are Lasso, Ridge, and ElasticNet regularizations? 72
Solution 72
47. What is a Logistic Regression? 72
Solution 72
Code 73
48. What is a stepwise regression? 74
Solution 74
49. What is an isotonic regression? 75
Solution 75
Code 75
50. How to include nonlinear information into linear models? 76
Solution 76
51. What are generalized linear models and what is an R Formula? 77
Solution 77
Code 77
52. What is LARS? 78
Solution 78
53. What is GLMNET? 79
Solution 79
Optimization techniques 81
54. What is a gradient descent? 81
Solution 81
55. What is a stochastic gradient descent? 82
Solution 82
Code 83
56. What is momentum? 83
Solution 83
57. What is Conjugate Gradient? 84
Solution 84
58. What are Adagrad, RMSProp, Adam, and L-BFGS? 85
Classification 86
59. What is a Naïve Bayes classifier? 86
Solution 86
60. What is a Bernoulli and a Multivariate Naïve Bayes? 88
Solution 88
Code 89
61. What is a Gaussian Naïve Bayes? 90
Solution 90
62. What is another way to use Naïve Bayes with continuous data? 90
Solution 90
63. What is the Nearest Neighbor classification? 90
Solution 90
Code 92
64. What are Support Vector Machines (SVM)? 92
Solution 92
Code 94
65. What are SVM Kernel tricks? 95
Solution 95
67. What is SVM with soft margins? 96
Solution 96
66. Can you provide an example for Text Classification with Spark? 96
Solution 96
Code 97
Clustering 98
67. What is K-Means Clustering? 98
Solution 98
Code 99
68. What is the DBSCAN clustering algorithm? 99
Solution 99
Code 100
69. What is a Streaming K-Means? 101
Solution 101
Code 101
70. What is Canopy Clustering? 102
Solution 102
71. What is Bisecting K-Means? 103
Solution 103
72. What is the Expectation Maximization Clustering algorithm? 103
Solution 103
73. What is a Gaussian Mixture? 105
Solution 105
Code 105
Boosting and Ensembles 107
74. What are Ensembles? 107
Solution 107
75. What is an AdaBoost classification algorithm? 107
Solution 107
Decision Trees, Gradient Boosted Trees and Random Forests 109
76. What are Decision Trees? 109
Solution 109
Code 111
77. What is a Gradient Boosted Tree? 112
Solution 112
78. What is a Gradient Boosted Trees Regressor? 112
Solution 112
Code 112
79. Gradient Boosted Trees Classification 114
Solution 114
Code 114
80. What is a Random Forest? 115
Solution 115
Code 115
Recommendations 117
81. What is a recommender system? 117
Solution 117
82. What is a collaborative filtering ALS algorithm? 118
Solution 118
Code 119
Dimensional Reduction 121
83. What is the PCA Dimensional reduction technique? 121
Solution 121
Code 122
84. What is the SVD Dimensional reduction technique? 123
Solution 123
Code 123
85. What is a Latent Semantic Analysis (LSA)? 124
Solution 124
86. What is the Latent Dirichlet Allocation topic model? 124
Solution 124
Code 125
Associative Rules 127
87. What is Associative Rule Learning? 127
Solution 127
88. What is FP-growth? 129
Solution 129
Code 129
Graph Mining 130
89. Can you represent a graph in Python? 130
Solution 130
Code 130
90. How to use the GraphX Library? 130
Solution 130
91. What is PageRank? And how to compute it with GraphX? 131
Solution 131
Code 132
Code 132
92. What is a Power Iteration Clustering? 134
Solution 134
Code 134
Neural Networks 135
93. What is a Perceptron? 135
Solution 135
94. What is an ANN (Artificial Neural Network)? 136
Solution 136
95. What are the activation functions? 137
Solution 137
96. How many types of Neural Networks are known? 138
97. How to train a Neural Network? 139
Solution 139
98. What are the possible applications of ANNs? 139
Solution