{ "cells": [ { "cell_type": "markdown", "id": "4cad8cf9-52c5-4a78-ac16-39833303e1dc", "metadata": {}, "source": [ "# **Artificial Neural Networks: First Steps**" ] }, { "cell_type": "markdown", "id": "977abacb-0595-48ac-b874-ff6bf995fe06", "metadata": {}, "source": [ "# A Basic Neural Network" ] }, { "cell_type": "code", "execution_count": 1, "id": "afa2b869-a183-4d03-8907-b9cbf8531379", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       1.00      1.00      1.00        11\n", "           1       1.00      0.75      0.86         8\n", "           2       0.85      1.00      0.92        11\n", "\n", "    accuracy                           0.93        30\n", "   macro avg       0.95      0.92      0.92        30\n", "weighted avg       0.94      0.93      0.93        30\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\Практика. 2 курс\\Task 4\\venv\\Lib\\site-packages\\sklearn\\neural_network\\_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (500) reached and the optimization hasn't converged yet.\n", "  warnings.warn(\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.metrics import classification_report\n", "\n", "# Load the data and split it into train and test sets\n", "X, y = load_iris(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "# MLP model (multilayer perceptron)\n", "clf = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=500)\n", "clf.fit(X_train, y_train)\n", "\n", "# Classification report (precision, recall, F1)\n", "print(classification_report(y_test, clf.predict(X_test)))" ] }, { "cell_type": "code", "execution_count": 4, "id": "b683837e-c854-4b6a-b48b-3db80b514ecf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       1.00      1.00      1.00         8\n", "           1       0.00      0.00      0.00        12\n", "           2       0.43      0.90      0.58        10\n", "\n", "    accuracy                           0.57        30\n", "   macro 
avg       0.48      0.63      0.53        30\n", "weighted avg       0.41      0.57      0.46        30\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\Практика. 2 курс\\Task 4\\venv\\Lib\\site-packages\\sklearn\\neural_network\\_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (100) reached and the optimization hasn't converged yet.\n", "  warnings.warn(\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.metrics import classification_report\n", "\n", "# Load the data and split it into train and test sets\n", "X, y = load_iris(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "# MLP model (multilayer perceptron)\n", "clf = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=100)\n", "clf.fit(X_train, y_train)\n", "\n", "# Classification report (precision, recall, F1)\n", "print(classification_report(y_test, clf.predict(X_test)))" ] }, { "cell_type": "code", "execution_count": 5, "id": "6a3624c4-87ce-4c7e-b0a5-9b3fae3230aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       1.00      1.00      1.00        13\n", "           1       1.00      1.00      1.00        11\n", "           2       1.00      1.00      1.00         6\n", "\n", "    accuracy                           1.00        30\n", "   macro avg       1.00      1.00      1.00        30\n", "weighted avg       1.00      1.00      1.00        30\n", "\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.metrics import classification_report\n", "\n", "# Load the data and split it into train and test sets\n", "X, y = load_iris(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "# MLP model (multilayer perceptron)\n", "clf = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=2500)\n", "clf.fit(X_train, y_train)\n", "\n", "# 
Classification report (precision, recall, F1)\n", "print(classification_report(y_test, clf.predict(X_test)))" ] }, { "cell_type": "markdown", "id": "707bf955-5dbd-49e5-8650-4aeedf59531a", "metadata": {}, "source": [ "# Independent Assignment" ] }, { "cell_type": "markdown", "id": "3df34b09-a006-4b05-9d6d-9cae917c8535", "metadata": {}, "source": [ "# Biclustering documents with the Spectral Co-clustering algorithm\n", "\n", "## Objective\n", "This work demonstrates the Spectral Co-clustering algorithm for jointly clustering documents and words (biclustering) on the 20 newsgroups dataset.\n", "\n", "Biclustering finds subsets of words that frequently co-occur within subsets of documents, which is useful for topic modeling and text analysis.\n", "\n", "The algorithm is compared with MiniBatchKMeans using the V-measure metric." ] }, { "cell_type": "markdown", "id": "eddfbff7-533d-4a11-89e8-60e4f520dcd4", "metadata": {}, "source": [ "## Importing the required libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "0c5dd022-f4d1-4668-87ee-bcc4d4a6b3ae", "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "from time import time\n", "import numpy as np\n", "from sklearn.cluster import MiniBatchKMeans, SpectralCoclustering\n", "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics.cluster import v_measure_score\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from sklearn.datasets import make_biclusters\n", "from sklearn.metrics import consensus_score" ] }, { "cell_type": "markdown", "id": "751a8517-5965-4995-bb3c-762c50a89bb2", "metadata": {}, "source": [ "## 1. 
Working with the built-in dataset (20 newsgroups)" ] }, { "cell_type": "markdown", "id": "f853c375-b8da-48ef-a046-908b014c2446", "metadata": {}, "source": [ "### 1.1 Loading and preparing the data\n", "We use the built-in 20 newsgroups dataset, excluding the 'comp.os.ms-windows.misc' category because it contains many posts that consist only of data." ] }, { "cell_type": "code", "execution_count": 19, "id": "4bdd4f2d-5e13-40cd-a52e-4b39b468cec8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vectorizing...\n", "Coclustering...\n", "Done in 1.53s. V-measure: 0.4415\n", "MiniBatchKMeans...\n", "Done in 2.06s. V-measure: 0.3015\n", "\n", "Best biclusters:\n", "----------------\n", "bicluster 0 : 8 documents, 6 words\n", "categories : 100% talk.politics.mideast\n", "words : cosmo, angmar, alfalfa, alphalpha, proline, benson\n", "\n", "bicluster 1 : 1948 documents, 4325 words\n", "categories : 23% talk.politics.guns, 18% talk.politics.misc, 17% sci.med\n", "words : gun, guns, geb, banks, gordon, clinton, pitt, cdt, surrender, veal\n", "\n", "bicluster 2 : 1259 documents, 3534 words\n", "categories : 27% soc.religion.christian, 25% talk.politics.mideast, 25% alt.atheism\n", "words : god, jesus, christians, kent, sin, objective, belief, christ, faith, moral\n", "\n", "bicluster 3 : 775 documents, 1623 words\n", "categories : 30% comp.windows.x, 25% comp.sys.ibm.pc.hardware, 20% comp.graphics\n", "words : scsi, nada, ide, vga, esdi, isa, kth, s3, vlb, bmug\n", "\n", "bicluster 4 : 2180 documents, 2802 words\n", "categories : 18% comp.sys.mac.hardware, 16% sci.electronics, 16% comp.sys.ibm.pc.hardware\n", "words : voltage, shipping, circuit, receiver, processing, scope, mpce, analog, kolstad, umass\n", "\n" ] } ], "source": [ "# Authors: The scikit-learn developers\n", "# SPDX-License-Identifier: BSD-3-Clause\n", "from collections import Counter\n", "from time import time\n", "\n", "import numpy as np\n", "\n", "from sklearn.cluster import 
MiniBatchKMeans, SpectralCoclustering\n", "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics.cluster import v_measure_score\n", "\n", "\n", "def number_normalizer(tokens):\n", "    \"\"\"Map all numeric tokens to a placeholder.\n", "\n", "    For many applications, tokens that begin with a number are not directly\n", "    useful, but the fact that such a token exists can be relevant. By applying\n", "    this form of dimensionality reduction, some methods may perform better.\n", "    \"\"\"\n", "    return (\"#NUMBER\" if token[0].isdigit() else token for token in tokens)\n", "\n", "\n", "class NumberNormalizingVectorizer(TfidfVectorizer):\n", "    def build_tokenizer(self):\n", "        tokenize = super().build_tokenizer()\n", "        return lambda doc: list(number_normalizer(tokenize(doc)))\n", "\n", "\n", "# exclude 'comp.os.ms-windows.misc'\n", "categories = [\n", "    \"alt.atheism\",\n", "    \"comp.graphics\",\n", "    \"comp.sys.ibm.pc.hardware\",\n", "    \"comp.sys.mac.hardware\",\n", "    \"comp.windows.x\",\n", "    \"misc.forsale\",\n", "    \"rec.autos\",\n", "    \"rec.motorcycles\",\n", "    \"rec.sport.baseball\",\n", "    \"rec.sport.hockey\",\n", "    \"sci.crypt\",\n", "    \"sci.electronics\",\n", "    \"sci.med\",\n", "    \"sci.space\",\n", "    \"soc.religion.christian\",\n", "    \"talk.politics.guns\",\n", "    \"talk.politics.mideast\",\n", "    \"talk.politics.misc\",\n", "    \"talk.religion.misc\",\n", "]\n", "newsgroups = fetch_20newsgroups(categories=categories)\n", "y_true = newsgroups.target\n", "\n", "vectorizer = NumberNormalizingVectorizer(stop_words=\"english\", min_df=5)\n", "cocluster = SpectralCoclustering(\n", "    n_clusters=len(categories), svd_method=\"arpack\", random_state=0\n", ")\n", "kmeans = MiniBatchKMeans(\n", "    n_clusters=len(categories), batch_size=20000, random_state=0, n_init=3\n", ")\n", "\n", "print(\"Vectorizing...\")\n", "X = vectorizer.fit_transform(newsgroups.data)\n", "\n", "print(\"Coclustering...\")\n", 
"start_time = time()\n", "cocluster.fit(X)\n", "y_cocluster = cocluster.row_labels_\n", "print(\n", "    f\"Done in {time() - start_time:.2f}s. V-measure: \\\n", "{v_measure_score(y_cocluster, y_true):.4f}\"\n", ")\n", "\n", "\n", "print(\"MiniBatchKMeans...\")\n", "start_time = time()\n", "y_kmeans = kmeans.fit_predict(X)\n", "print(\n", "    f\"Done in {time() - start_time:.2f}s. V-measure: \\\n", "{v_measure_score(y_kmeans, y_true):.4f}\"\n", ")\n", "\n", "\n", "feature_names = vectorizer.get_feature_names_out()\n", "document_names = list(newsgroups.target_names[i] for i in newsgroups.target)\n", "\n", "\n", "def bicluster_ncut(i):\n", "    rows, cols = cocluster.get_indices(i)\n", "    if not (np.any(rows) and np.any(cols)):\n", "        import sys\n", "\n", "        return sys.float_info.max\n", "    row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]\n", "    col_complement = np.nonzero(np.logical_not(cocluster.columns_[i]))[0]\n", "    # Note: the following is identical to X[rows[:, np.newaxis],\n", "    # cols].sum() but much faster in scipy <= 0.16\n", "    weight = X[rows][:, cols].sum()\n", "    cut = X[row_complement][:, cols].sum() + X[rows][:, col_complement].sum()\n", "    return cut / weight\n", "\n", "\n", "bicluster_ncuts = list(bicluster_ncut(i) for i in range(len(newsgroups.target_names)))\n", "best_idx = np.argsort(bicluster_ncuts)[:5]\n", "\n", "print()\n", "print(\"Best biclusters:\")\n", "print(\"----------------\")\n", "for idx, cluster in enumerate(best_idx):\n", "    n_rows, n_cols = cocluster.get_shape(cluster)\n", "    cluster_docs, cluster_words = cocluster.get_indices(cluster)\n", "    if not len(cluster_docs) or not len(cluster_words):\n", "        continue\n", "\n", "    # categories\n", "    counter = Counter(document_names[doc] for doc in cluster_docs)\n", "\n", "    cat_string = \", \".join(\n", "        f\"{(c / n_rows * 100):.0f}% {name}\" for name, c in counter.most_common(3)\n", "    )\n", "\n", "    # words\n", "    out_of_cluster_docs = cocluster.row_labels_ != cluster\n", "    out_of_cluster_docs 
= np.where(out_of_cluster_docs)[0]\n", "    word_col = X[:, cluster_words]\n", "    word_scores = np.array(\n", "        word_col[cluster_docs, :].sum(axis=0)\n", "        - word_col[out_of_cluster_docs, :].sum(axis=0)\n", "    )\n", "    word_scores = word_scores.ravel()\n", "    important_words = list(\n", "        feature_names[cluster_words[i]] for i in word_scores.argsort()[:-11:-1]\n", "    )\n", "\n", "    print(f\"bicluster {idx} : {n_rows} documents, {n_cols} words\")\n", "    print(f\"categories : {cat_string}\")\n", "    print(f\"words : {', '.join(important_words)}\\n\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "927d506e-dd51-4616-888f-4c114a44b106", "metadata": {}, "outputs": [], "source": [ "# Exclude 'comp.os.ms-windows.misc'\n", "categories = [\n", "    \"alt.atheism\",\n", "    \"comp.graphics\",\n", "    \"comp.sys.ibm.pc.hardware\",\n", "    \"comp.sys.mac.hardware\",\n", "    \"comp.windows.x\",\n", "    \"misc.forsale\",\n", "    \"rec.autos\",\n", "    \"rec.motorcycles\",\n", "    \"rec.sport.baseball\",\n", "    \"rec.sport.hockey\",\n", "    \"sci.crypt\",\n", "    \"sci.electronics\",\n", "    \"sci.med\",\n", "    \"sci.space\",\n", "    \"soc.religion.christian\",\n", "    \"talk.politics.guns\",\n", "    \"talk.politics.mideast\",\n", "    \"talk.politics.misc\",\n", "    \"talk.religion.misc\",\n", "]\n", "\n", "# Load the data\n", "newsgroups = fetch_20newsgroups(categories=categories)\n", "y_true = newsgroups.target" ] }, { "cell_type": "markdown", "id": "11c70dd8-8a44-48e8-89c6-c13f2bef8987", "metadata": {}, "source": [ "### 1.2 Data preprocessing\n", "We create a custom vectorizer that normalizes numbers in the text: every token that starts with a digit is replaced with the #NUMBER placeholder." 
] }, { "cell_type": "code", "execution_count": 3, "id": "157bc5ce-66e6-4c66-ad2f-8b25a40af6b5", "metadata": {}, "outputs": [], "source": [ "def number_normalizer(tokens):\n", "    \"\"\"Replace every token that starts with a digit with the #NUMBER placeholder.\"\"\"\n", "    return (\"#NUMBER\" if token[0].isdigit() else token for token in tokens)\n", "\n", "class NumberNormalizingVectorizer(TfidfVectorizer):\n", "    def build_tokenizer(self):\n", "        tokenize = super().build_tokenizer()\n", "        return lambda doc: list(number_normalizer(tokenize(doc)))\n", "\n", "# Build the vectorizer: drop English stop words and keep terms that appear in at least 5 documents\n", "vectorizer = NumberNormalizingVectorizer(stop_words=\"english\", min_df=5)\n", "X = vectorizer.fit_transform(newsgroups.data)" ] }, { "cell_type": "markdown", "id": "067ce0e9-cc59-427d-9d59-079092991301", "metadata": {}, "source": [ "### 1.3 Training the models" ] }, { "cell_type": "code", "execution_count": 4, "id": "1eef0f0c-8956-4b2e-b1c7-4191e78bfed7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coclustering...\n", "Done in 5.62s. V-measure: 0.4415\n", "MiniBatchKMeans...\n", "Done in 1.40s. V-measure: 0.3015\n" ] } ], "source": [ "# Initialize the models\n", "cocluster = SpectralCoclustering(\n", "    n_clusters=len(categories), svd_method=\"arpack\", random_state=0\n", ")\n", "kmeans = MiniBatchKMeans(\n", "    n_clusters=len(categories), batch_size=20000, random_state=0, n_init=3\n", ")\n", "\n", "# Fit Spectral Co-clustering\n", "print(\"Coclustering...\")\n", "start_time = time()\n", "cocluster.fit(X)\n", "y_cocluster = cocluster.row_labels_\n", "print(f\"Done in {time() - start_time:.2f}s. V-measure: {v_measure_score(y_cocluster, y_true):.4f}\")\n", "\n", "# Fit MiniBatchKMeans\n", "print(\"MiniBatchKMeans...\")\n", "start_time = time()\n", "y_kmeans = kmeans.fit_predict(X)\n", "print(f\"Done in {time() - start_time:.2f}s. 
V-measure: {v_measure_score(y_kmeans, y_true):.4f}\")" ] }, { "cell_type": "markdown", "id": "ffe70635-8ac9-4038-a7b7-6fba38141441", "metadata": {}, "source": [ "### 1.4 Analyzing the results\n", "We print information about the best biclusters, ranked by normalized cut (lower is better)." ] }, { "cell_type": "code", "execution_count": 11, "id": "8c166b37-814c-492b-a1e0-a4a367c88a7b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Best biclusters:\n", "----------------\n", "bicluster 0 : 8 documents, 6 words\n", "categories : 100% talk.politics.mideast\n", "words : cosmo, angmar, alfalfa, alphalpha, proline, benson\n", "\n", "bicluster 1 : 1948 documents, 4325 words\n", "categories : 23% talk.politics.guns, 18% talk.politics.misc, 17% sci.med\n", "words : gun, guns, geb, banks, gordon, clinton, pitt, cdt, surrender, veal\n", "\n", "bicluster 2 : 1259 documents, 3534 words\n", "categories : 27% soc.religion.christian, 25% talk.politics.mideast, 25% alt.atheism\n", "words : god, jesus, christians, kent, sin, objective, belief, christ, faith, moral\n", "\n", "bicluster 3 : 775 documents, 1623 words\n", "categories : 30% comp.windows.x, 25% comp.sys.ibm.pc.hardware, 20% comp.graphics\n", "words : scsi, nada, ide, vga, esdi, isa, kth, s3, vlb, bmug\n", "\n", "bicluster 4 : 2180 documents, 2802 words\n", "categories : 18% comp.sys.mac.hardware, 16% sci.electronics, 16% comp.sys.ibm.pc.hardware\n", "words : voltage, shipping, circuit, receiver, processing, scope, mpce, analog, kolstad, umass\n", "\n" ] } ], "source": [ "feature_names = vectorizer.get_feature_names_out()\n", "document_names = list(newsgroups.target_names[i] for i in newsgroups.target)\n", "\n", "def bicluster_ncut(i):\n", "    \"\"\"Compute the normalized cut of bicluster i.\"\"\"\n", "    rows, cols = cocluster.get_indices(i)\n", "    if not (np.any(rows) and np.any(cols)):\n", "        import sys\n", "        return sys.float_info.max\n", "    row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]\n", "    col_complement = 
np.nonzero(np.logical_not(cocluster.columns_[i]))[0]\n", "    weight = X[rows][:, cols].sum()\n", "    cut = X[row_complement][:, cols].sum() + X[rows][:, col_complement].sum()\n", "    return cut / weight\n", "\n", "# Pick the 5 biclusters with the lowest normalized cut\n", "bicluster_ncuts = [bicluster_ncut(i) for i in range(len(newsgroups.target_names))]\n", "best_idx = np.argsort(bicluster_ncuts)[:5]\n", "\n", "print(\"\\nBest biclusters:\")\n", "print(\"----------------\")\n", "for idx, cluster in enumerate(best_idx):\n", "    n_rows, n_cols = cocluster.get_shape(cluster)\n", "    cluster_docs, cluster_words = cocluster.get_indices(cluster)\n", "    if not len(cluster_docs) or not len(cluster_words):\n", "        continue\n", "\n", "    # Category breakdown\n", "    counter = Counter(document_names[doc] for doc in cluster_docs)\n", "    cat_string = \", \".join(\n", "        f\"{(c / n_rows * 100):.0f}% {name}\" for name, c in counter.most_common(3)\n", "    )\n", "\n", "    # Most characteristic words\n", "    out_of_cluster_docs = cocluster.row_labels_ != cluster\n", "    out_of_cluster_docs = np.where(out_of_cluster_docs)[0]\n", "    word_col = X[:, cluster_words]\n", "    word_scores = np.array(\n", "        word_col[cluster_docs, :].sum(axis=0) - word_col[out_of_cluster_docs, :].sum(axis=0)\n", "    )\n", "    word_scores = word_scores.ravel()\n", "    important_words = list(\n", "        feature_names[cluster_words[i]] for i in word_scores.argsort()[:-11:-1]\n", "    )\n", "\n", "    print(f\"bicluster {idx} : {n_rows} documents, {n_cols} words\")\n", "    print(f\"categories : {cat_string}\")\n", "    print(f\"words : {', '.join(important_words)}\\n\")\n" ] }, { "cell_type": "markdown", "id": "60c0f67c-d371-4adc-9c0d-d1c5adc4b75f", "metadata": {}, "source": [ "### 1.5 Visualizing the results\n", "We compare the clustering results by V-measure." ] }, { "cell_type": "code", "execution_count": 5, "id": "3d19b351-a57b-4de0-ae2a-b71cfcf4cf15", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Compare V-measure\n", "plt.figure(figsize=(8, 5))\n", "plt.bar(['Spectral Co-clustering', 'MiniBatchKMeans'],\n", "        [v_measure_score(y_cocluster, y_true), v_measure_score(y_kmeans, y_true)])\n", "plt.title('Clustering methods compared by V-measure')\n", "plt.ylabel('V-measure')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "b764e10c-184e-4fe2-be3b-41b684e93c4c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ddb23f5d-791f-49cb-a044-e0a68153a0d5", "metadata": {}, "source": [ "## 2. Working with an external dataset\n", "\n", "### 2.1 Loading and preparing the data\n" ] }, { "cell_type": "code", "execution_count": 28, "id": "d3aca8eb-11d9-4236-9726-1be6ba5ae44d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr style=\"text-align: right;\"><th></th><th>Patient_ID</th><th>Age</th><th>Gender</th><th>Smoking_Status</th><th>Years_Smoking</th><th>Cigarettes_Per_Day</th><th>Secondhand_Smoke_Exposure</th><th>Occupation_Exposure</th><th>Air_Pollution_Level</th><th>Family_History</th><th>...</th><th>Diet_Quality</th><th>Region</th><th>Income_Level</th><th>Education_Level</th><th>Access_to_Healthcare</th><th>Screening_Frequency</th><th>Chronic_Lung_Disease</th><th>Lung_Cancer_Stage</th><th>Diagnosis_Year</th><th>Survival_Status</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>P100000</td><td>76</td><td>Female</td><td>Never</td><td>6</td><td>37</td><td>Low</td><td>Diesel Fumes</td><td>Low</td><td>Yes</td><td>...</td><td>Poor</td><td>West</td><td>Middle</td><td>Tertiary</td><td>Good</td><td>Regularly</td><td>No</td><td>Stage II</td><td>2008</td><td>Alive</td></tr>\n", "<tr><th>1</th><td>P100001</td><td>39</td><td>Male</td><td>Never</td><td>30</td><td>39</td><td>Low</td><td>Silica</td><td>Low</td><td>Yes</td><td>...</td><td>Average</td><td>North</td><td>Middle</td><td>Primary</td><td>Poor</td><td>Occasionally</td><td>Yes</td><td>NaN</td><td>2002</td><td>Alive</td></tr>\n", "<tr><th>2</th><td>P100002</td><td>85</td><td>Male</td><td>Former</td><td>47</td><td>14</td><td>High</td><td>Asbestos</td><td>Low</td><td>Yes</td><td>...</td><td>Good</td><td>South</td><td>High</td><td>Tertiary</td><td>Average</td><td>Regularly</td><td>No</td><td>Stage II</td><td>2007</td><td>Deceased</td></tr>\n", "<tr><th>3</th><td>P100003</td><td>45</td><td>Female</td><td>Current</td><td>45</td><td>32</td><td>Medium</td><td>Silica</td><td>High</td><td>No</td><td>...</td><td>Good</td><td>West</td><td>Low</td><td>Secondary</td><td>Average</td><td>Never</td><td>Yes</td><td>NaN</td><td>2011</td><td>Alive</td></tr>\n", "<tr><th>4</th><td>P100004</td><td>48</td><td>Female</td><td>Never</td><td>46</td><td>26</td><td>Medium</td><td>Silica</td><td>Low</td><td>No</td><td>...</td><td>Good</td><td>North</td><td>Low</td><td>Tertiary</td><td>Average</td><td>Regularly</td><td>No</td><td>NaN</td><td>2016</td><td>Alive</td></tr>\n", "</tbody>\n", "</table>\n", "<p>5 rows × 24 columns</p>\n", "</div>
" ], "text/plain": [ "   Patient_ID  Age  Gender  Smoking_Status  Years_Smoking  Cigarettes_Per_Day  \\\n", "0     P100000   76  Female           Never              6                  37   \n", "1     P100001   39    Male           Never             30                  39   \n", "2     P100002   85    Male          Former             47                  14   \n", "3     P100003   45  Female         Current             45                  32   \n", "4     P100004   48  Female           Never             46                  26   \n", "\n", "   Secondhand_Smoke_Exposure  Occupation_Exposure  Air_Pollution_Level  \\\n", "0                        Low         Diesel Fumes                  Low   \n", "1                        Low               Silica                  Low   \n", "2                       High             Asbestos                  Low   \n", "3                     Medium               Silica                 High   \n", "4                     Medium               Silica                  Low   \n", "\n", "   Family_History  ...  Diet_Quality  Region  Income_Level  Education_Level  \\\n", "0             Yes  ...          Poor    West        Middle         Tertiary   \n", "1             Yes  ...       Average   North        Middle          Primary   \n", "2             Yes  ...          Good   South          High         Tertiary   \n", "3              No  ...          Good    West           Low        Secondary   \n", "4              No  ...          Good   North           Low         Tertiary   \n", "\n", "   Access_to_Healthcare  Screening_Frequency  Chronic_Lung_Disease  \\\n", "0                  Good            Regularly                    No   \n", "1                  Poor         Occasionally                   Yes   \n", "2               Average            Regularly                    No   \n", "3               Average                Never                   Yes   \n", "4               Average            Regularly                    No   \n", "\n", "   Lung_Cancer_Stage  Diagnosis_Year  Survival_Status  \n", "0           Stage II            2008            Alive  \n", "1                NaN            2002            Alive  \n", "2           Stage II            2007         Deceased  \n", "3                NaN            2011            Alive  \n", "4                NaN            2016            Alive  \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "from sklearn.cluster import SpectralCoclustering\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "df = pd.read_csv(\"Lung_Cancer_Trends_Realistic.csv\")\n", "\n", "# Inspect the structure of the data\n", "df.head()" ] }, { "cell_type": "markdown", "id": "fbaa5791-49c3-483b-8a05-fd9e9355137c", "metadata": {}, "source": [ "### 2.2 Cleaning the data" ] }, { "cell_type": "code", "execution_count": 24, "id": "bf3c8140-269c-4a4d-8de6-96e954e3e9c5", "metadata": {}, "outputs": [], "source": [ "# Drop rows with missing values\n", 
"df_clean = df.dropna()\n", "\n", "# Separate categorical and numeric features\n", "# Note: Patient_ID is an object column, so it is one-hot encoded as well\n", "# (see the Patient_ID_* features printed in section 2.6)\n", "categorical_cols = df_clean.select_dtypes(include=[\"object\", \"category\"]).columns\n", "numeric_cols = df_clean.select_dtypes(include=[\"int64\", \"float64\"]).columns\n", "\n", "# One-hot encode the categorical features\n", "encoder = OneHotEncoder(sparse_output=False)\n", "encoded_cat = encoder.fit_transform(df_clean[categorical_cols])\n", "\n", "# Standardize the numeric features\n", "scaler = StandardScaler()\n", "scaled_num = scaler.fit_transform(df_clean[numeric_cols])\n", "\n", "# Concatenate into one feature matrix\n", "X = np.hstack([scaled_num, encoded_cat])" ] }, { "cell_type": "markdown", "id": "29b63bd0-181b-4c1f-a4c6-0b107eab164d", "metadata": {}, "source": [ "### 2.3 Training the model" ] }, { "cell_type": "code", "execution_count": 18, "id": "1de28a36-b569-4b7e-9d7e-216f58936961", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SpectralCoclustering(n_clusters=5, random_state=42)
" ], "text/plain": [ "SpectralCoclustering(n_clusters=5, random_state=42)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_clusters = 5  # number of biclusters; can be changed\n", "model = SpectralCoclustering(n_clusters=n_clusters, random_state=42)\n", "model.fit(X)" ] }, { "cell_type": "markdown", "id": "0dcc3335-3a64-448c-8c7a-7144974f52ca", "metadata": {}, "source": [ "### 2.4 Assigning clusters" ] }, { "cell_type": "code", "execution_count": 25, "id": "60a41ec0-ddfb-4a3f-818e-06f7a5dcda0f", "metadata": {}, "outputs": [], "source": [ "# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning\n", "df_clean = df_clean.assign(Cluster=model.row_labels_)" ] }, { "cell_type": "markdown", "id": "72e8393e-f745-4b5e-9b6d-a5a31b317e1e", "metadata": {}, "source": [ "### 2.5 Visualizing the results" ] }, { "cell_type": "code", "execution_count": 26, "id": "696afb4c-e7a9-4182-8fef-4d97f03a49f6", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fit_data = X[np.argsort(model.row_labels_)]\n", "plt.figure(figsize=(12, 6))\n", "sns.heatmap(fit_data, cmap=\"viridis\", cbar=True)\n", "plt.title(\"Spectral co-clustered representation of the data\")\n", "plt.xlabel(\"Features\")\n", "plt.ylabel(\"Objects (reordered by cluster)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "01e38c78-7641-4390-88c6-6a9a69d3ec46", "metadata": {}, "source": [ "### 2.6 Printing cluster information" ] }, { "cell_type": "code", "execution_count": 27, "id": "a378a466-7641-4d5f-bc12-810a754165ce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Bicluster 0 : 204 objects, 343 features\n", "Top features: Patient_ID_P100000, Patient_ID_P100002, Patient_ID_P100006, Patient_ID_P100013, Patient_ID_P100035, Patient_ID_P100056, Patient_ID_P100066, Patient_ID_P100079, Patient_ID_P100081, Patient_ID_P100091\n", "\n", "Bicluster 1 : 0 objects, 1 features\n", "Top features: BMI\n", "\n", "Bicluster 2 : 97 objects, 10 features\n", "Top features: Years_Smoking, Cigarettes_Per_Day, Gender_Female, Occupation_Exposure_Silica, Air_Pollution_Level_High, Family_History_No, Physical_Activity_Level_Low, Region_West, Access_to_Healthcare_Good, Lung_Cancer_Stage_Stage I\n", "\n", "Bicluster 3 : 0 objects, 1 features\n", "Top features: Diagnosis_Year\n", "\n", "Bicluster 4 : 0 objects, 1 features\n", "Top features: Age\n" ] } ], "source": [ "feature_names = list(numeric_cols) + list(encoder.get_feature_names_out(categorical_cols))\n", "\n", "for i in range(n_clusters):\n", "    row_idx = np.where(model.row_labels_ == i)[0]\n", "    col_idx = np.where(model.column_labels_ == i)[0]\n", "\n", "    # Print the bicluster size\n", "    print(f\"\\nBicluster {i} : {len(row_idx)} objects, {len(col_idx)} features\")\n", "\n", "    # First 10 features assigned to this bicluster (in column order, not ranked)\n", "    top_features = [feature_names[j] for j in col_idx[:10]]\n", "    print(\"Top 
features:\", \", \".join(top_features))\n", "\n", "    # Category distribution (only if the data has a 'Category' column)\n", "    if 'Category' in df_clean.columns:\n", "        cluster_labels = df_clean.loc[df_clean[\"Cluster\"] == i, \"Category\"]\n", "        print(f\"Category distribution for Bicluster {i}:\")\n", "        print(cluster_labels.value_counts(normalize=True).head())" ] }, { "cell_type": "markdown", "id": "cbf8dd70-eddc-456d-8a2d-97482264c234", "metadata": {}, "source": [ "## Interpreting the results\n", "\n", "1. **On the built-in dataset (20 newsgroups):**\n", "   - Spectral Co-clustering scored higher (V-measure = 0.4415) than MiniBatchKMeans (V-measure = 0.3015)\n", "   - The algorithm found meaningful biclusters that group documents by topic together with their characteristic words\n", "\n", "2. **On the external dataset:**\n", "   - Spectral Co-clustering grouped patients and features that tend to occur together. For example, bicluster 2 (97 patients) combines Years_Smoking and Cigarettes_Per_Day with Physical_Activity_Level_Low, so it may correspond to people who smoke heavily and are also physically inactive. In this run, however, biclusters 1, 3, and 4 contain no patients, and the first features listed for bicluster 0 are one-hot Patient_ID columns, so the grouping should be read with caution.\n", "\n", "   - The biclusters help reveal relationships between particular risk factors and groups of observations.\n", "\n", "   - Interpreting the results nevertheless requires medical context: the clusters should be checked against expert knowledge of risk factors (for example, the effect of smoking, alcohol, physical activity, and other variables on health).\n", "\n", "3. 
**Conclusions**\n", "   - Spectral Co-clustering is effective for jointly clustering the rows and columns of a data matrix\n", "   - It is especially useful for text data, where documents and words can be clustered simultaneously\n", "   - It requires parameter tuning (number of clusters, SVD method) to work well" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 5 }