| field | value | date |
|---|---|---|
| author | leshe4ka46 <alex9102naid1@ya.ru> | 2025-12-09 10:10:48 +0300 |
| committer | leshe4ka46 <alex9102naid1@ya.ru> | 2025-12-09 10:10:48 +0300 |
| commit | f60863aecfdfb2a7a35c9d2d4233142ca17c9152 (patch) | |
| tree | 6d9746dc7b99f40ffa9a35592f8638ee5b8a4ee9 | |
| parent | d8df84af00cfe3038fc696417fd97b33eca0146e (diff) | |
rapids ok
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | Fundamentals_of_Accelerated_Data_Science/25.11.25.md | 33 |
| -rw-r--r-- | Fundamentals_of_Accelerated_Data_Science/3-05_knn.md | 19 |
| -rw-r--r-- | Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.ipynb | 7 |
| -rw-r--r-- | Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.md | 4 |
| -rw-r--r-- | Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.png | bin (0 -> 101671 bytes) |
5 files changed, 63 insertions, 0 deletions
diff --git a/Fundamentals_of_Accelerated_Data_Science/25.11.25.md b/Fundamentals_of_Accelerated_Data_Science/25.11.25.md
new file mode 100644
index 0000000..8de5afc
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/25.11.25.md
@@ -0,0 +1,33 @@
+3.04 how the coefficients are adjusted; the error function
+
+3.06 entropy, Gini (splitting criteria)
+what subsample = 0.8 helps with
+F score for a variable
+ROC curve
+
+
+https://mlu-explain.github.io/logistic-regression/
+A common way to estimate the coefficients is gradient descent. The goal is to minimize the Log-Loss cost function over all samples. The method starts from initial parameter values and updates them incrementally, moving them in the direction that decreases the loss. At each iteration the parameters are updated by the gradient scaled by the step size (also known as the learning rate). The gradient is the vector giving the direction and rate of the fastest increase of a function, and it can be computed from the partial derivatives. The parameters are moved in the direction opposite to the gradient, by the step size, in an attempt to find the values that minimize the Log-Loss.
+
+Because the gradient tells us where the function is increasing, going in the opposite direction leads us to its minimum. Repeatedly updating the model's coefficients in this way eventually reaches the minimum of the error function and yields a sigmoid curve that fits the data well.
+
+https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
+
+
+https://medium.com/data-science/optimization-loss-function-under-the-hood-part-ii-d20a239cde11
+
+
+
+
+
+
+https://xgboost.readthedocs.io/en/stable/parameter.html
+Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting. Subsampling occurs once per boosting iteration.
+
+
+ROC (Receiver Operating Characteristic)
+https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
+
+
+
+https://www.learndatasci.com/glossary/gini-impurity/
\ No newline at end of file
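Editor's note: the gradient-descent description quoted above maps directly to a few lines of NumPy. The sketch below is illustrative only and not part of the committed notes; names such as `fit_logistic_gd`, `lr`, and `n_iter` are made up for the example.

```python
# Minimal sketch: fitting logistic regression coefficients by batch
# gradient descent on the Log-Loss, as described in the notes above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    """Minimize the Log-Loss over all samples with batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # initial parameter values
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)             # current predicted probabilities
        # Gradient of the Log-Loss with respect to w and b
        grad_w = X.T @ (p - y) / n_samples
        grad_b = np.mean(p - y)
        # Step *against* the gradient, scaled by the learning rate
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny usage example on synthetic 1-D data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(float)
w, b = fit_logistic_gd(X, y)
print("coefficient:", w, "intercept:", b)
```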
diff --git a/Fundamentals_of_Accelerated_Data_Science/3-05_knn.md b/Fundamentals_of_Accelerated_Data_Science/3-05_knn.md
new file mode 100644
index 0000000..5fdf591
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/3-05_knn.md
@@ -0,0 +1,19 @@
+lazy learner
+trees for nearest-neighbor search (KD, Quad, R, Ball)
+
+- k-dimensional tree
+Recursively splits the data along the coordinate axes.
+At each node, pick a dimension and split at the median; each node becomes a hyperrectangle in d-dimensional space.
+
+- Quad-Tree
+Each node is a square region. If too many points fall into a region, subdivide it into 4 equal quadrants.
+
+- rectangle tree (R-Tree)
+Each node stores several children, each with a Minimum Bounding Rectangle covering many points or objects. Subtrees group spatially close objects. Insertions try to minimize rectangle enlargement.
+
+- Ball-Tree
+Each node contains:
+  - a center point
+  - a radius enclosing all points in that node
+Two children represent two balls that split the data.
+
diff --git a/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.ipynb b/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.ipynb
index d3548f2..604e881 100644
--- a/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.ipynb
+++ b/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.ipynb
@@ -746,6 +746,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "https://ru.wikipedia.org/wiki/%D0%94%D0%B2%D0%BE%D0%B8%D1%87%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BB%D0%B0%D1%81%D1%81%D0%B8%D1%84%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F#/media/%D0%A4%D0%B0%D0%B9%D0%BB:Binary-classification-labeled.svg"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "## (Optional) Comparison: CPU-Only XGBoost ##\n",
     "Below we provide code for training and inferring from a CPU-only XGBoost model using the same model parameters, other than switching the histogram tree method from GPU to CPU."
    ]
diff --git a/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.md b/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.md
new file mode 100644
index 0000000..d5694e3
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.md
@@ -0,0 +1,4 @@
+https://neerc.ifmo.ru/wiki/index.php?title=XGBoost
+
+
+entropy, Gini
\ No newline at end of file
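Editor's note: 3-05_knn.md above lists the tree structures used to accelerate nearest-neighbor search. As a hedged illustration, scikit-learn ships `KDTree` and `BallTree` implementations of two of them; the snippet below is a generic usage sketch, not code from the course notebooks.

```python
# Exact k-nearest-neighbor queries with a KD-tree and a Ball-tree.
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(42)
X = rng.random((1000, 3))          # 1000 points in 3-D space

kd = KDTree(X, leaf_size=40)       # axis-aligned hyperrectangles
ball = BallTree(X, leaf_size=40)   # nested balls (center + radius)

query = X[:1]                      # neighbors of the first point
dist_kd, idx_kd = kd.query(query, k=5)
dist_ball, idx_ball = ball.query(query, k=5)

# Both trees perform exact search, so they return the same neighbors;
# they differ only in how they partition space and hence in performance.
assert (idx_kd == idx_ball).all()
print(idx_kd, dist_kd)
```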
diff --git a/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.png b/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.png
new file mode 100644
index 0000000..fdbebde
--- /dev/null
+++ b/Fundamentals_of_Accelerated_Data_Science/3-06_xgboost.png
Binary files differ
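Editor's note: the notes in 25.11.25.md and 3-06_xgboost.md touch on Gini/entropy splitting criteria, the `subsample` parameter, and ROC curves. The sketch below ties them together; the dataset and hyperparameter values are assumptions made for illustration, not values from the course notebooks.

```python
# Gini impurity and entropy as splitting criteria, plus an XGBoost
# model trained with subsample=0.8 and evaluated with ROC AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y_node = np.array([0, 0, 0, 1, 1])
print(gini_impurity(y_node), entropy(y_node))  # 0.48, ~0.971

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# subsample=0.8: each boosting round sees a random 80% of the training
# rows, which (per the XGBoost docs) helps prevent overfitting.
model = XGBClassifier(n_estimators=200, subsample=0.8, random_state=0)
model.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])  # area under ROC
print("ROC AUC:", auc)
```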