{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biclustering: KDDCup99 (Real World Dataset)\n", "\n", "The **Spectral Biclustering** example from the Biclustering section was chosen: \n", "https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html\n", "\n", "Dataset: **KDDCup99**, network connection records for intrusion detection. \n", "https://scikit-learn.org/stable/datasets/real_world.html\n", "\n", "**Goal:** \n", "Apply the `SpectralBiclustering` algorithm to cluster the rows (connections) and columns (features) of the data matrix simultaneously.\n", "\n", "**What is biclustering?** \n", "Ordinary clustering groups only rows. Biclustering looks for submatrices in which **certain rows** behave similarly across **certain columns**." ], "id": "23c3dc812931bdfa" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1. Importing libraries" ], "id": "a20f25a25aea2e12" }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2026-05-07T19:02:58.780202200Z", "start_time": "2026-05-07T19:02:57.144112300Z" } }, "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import fetch_kddcup99\n", "from sklearn.cluster import SpectralBiclustering\n", "from sklearn.preprocessing import StandardScaler" ], "id": "a1f10f59879ab7f1", "outputs": [], "execution_count": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2. 
Loading the data" ], "id": "1ba57c35a85d1e0c" }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2026-05-07T19:03:01.976299100Z", "start_time": "2026-05-07T19:02:58.815383Z" } }, "source": [ "kdd = fetch_kddcup99(percent10=True, as_frame=True)\n", "df = kdd.frame\n", "\n", "print(\"Dataset shape:\", df.shape)\n", "df.head(3)" ], "id": "bb6b7489f6f6a02", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset shape: (494021, 42)\n" ] }, { "data": { "text/plain": [ " duration protocol_type service flag src_bytes dst_bytes land \\\n", "0 0 b'tcp' b'http' b'SF' 181 5450 0 \n", "1 0 b'tcp' b'http' b'SF' 239 486 0 \n", "2 0 b'tcp' b'http' b'SF' 235 1337 0 \n", "\n", " wrong_fragment urgent hot ... dst_host_srv_count dst_host_same_srv_rate \\\n", "0 0 0 0 ... 9 1.0 \n", "1 0 0 0 ... 19 1.0 \n", "2 0 0 0 ... 29 1.0 \n", "\n", " dst_host_diff_srv_rate dst_host_same_src_port_rate \\\n", "0 0.0 0.11 \n", "1 0.0 0.05 \n", "2 0.0 0.03 \n", "\n", " dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "\n", " dst_host_rerror_rate dst_host_srv_rerror_rate labels \n", "0 0.0 0.0 b'normal.' \n", "1 0.0 0.0 b'normal.' \n", "2 0.0 0.0 b'normal.' \n", "\n", "[3 rows x 42 columns]" ], "text/html": [ "
|  | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | b'tcp' | b'http' | b'SF' | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
| 1 | 0 | b'tcp' | b'http' | b'SF' | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
| 2 | 0 | b'tcp' | b'http' | b'SF' | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
3 rows × 42 columns
\n", "