{
"cells": [
{
"cell_type": "markdown",
"id": "82f706d6-6764-446d-8ea8-135dae405123",
"metadata": {},
"source": [
"Selected section: Working with text documents - Clustering text documents using k-means\n",
"Goal: split the texts into clusters with the k-means algorithm and try to interpret the resulting clusters as topic groups."
]
},
{
"cell_type": "markdown",
"id": "8923f537-23b0-4cf8-9ca3-8c4f135f5fdd",
"metadata": {},
"source": [
"1) Data preparation"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cbb8432b-80d7-4ed8-9593-2f23d83202ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3387 documents - 4 categories\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"categories = [\n",
"    \"alt.atheism\",\n",
"    \"talk.religion.misc\",\n",
"    \"comp.graphics\",\n",
"    \"sci.space\",\n",
"]\n",
"\n",
"dataset = fetch_20newsgroups(\n",
"    remove=(\"headers\", \"footers\", \"quotes\"),\n",
"    subset=\"all\",\n",
"    categories=categories,\n",
"    shuffle=True,\n",
"    random_state=42,\n",
")\n",
"\n",
"labels = dataset.target\n",
"unique_labels, category_sizes = np.unique(labels, return_counts=True)\n",
"true_k = unique_labels.shape[0]\n",
"\n",
"print(f\"{len(dataset.data)} documents - {true_k} categories\")"
]
},
{
"cell_type": "markdown",
"id": "79a8546c-40c1-4504-a81a-d51a5a19b754",
"metadata": {},
"source": [
"Define the function fit_and_evaluate, which fits a clustering model n_runs times with different random_state values and reports the mean and standard deviation of the training time and of each clustering metric"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "524f99e4-bd49-48af-b7b4-9a4507da429a",
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"from time import time\n",
"\n",
"from sklearn import metrics\n",
"\n",
"evaluations = []\n",
"evaluations_std = []\n",
"\n",
"\n",
"def fit_and_evaluate(km, X, name=None, n_runs=5):\n",
"    name = km.__class__.__name__ if name is None else name\n",
"\n",
"    train_times = []\n",
"    scores = defaultdict(list)\n",
"    # Refit with a different seed on each run to measure run-to-run variability.\n",
"    for seed in range(n_runs):\n",
"        km.set_params(random_state=seed)\n",
"        t0 = time()\n",
"        km.fit(X)\n",
"        train_times.append(time() - t0)\n",
"        # These metrics compare the clusters with the global ground-truth `labels`.\n",
"        scores[\"Homogeneity\"].append(metrics.homogeneity_score(labels, km.labels_))\n",
"        scores[\"Completeness\"].append(metrics.completeness_score(labels, km.labels_))\n",
"        scores[\"V-measure\"].append(metrics.v_measure_score(labels, km.labels_))\n",
"        scores[\"Adjusted Rand-Index\"].append(\n",
"            metrics.adjusted_rand_score(labels, km.labels_)\n",
"        )\n",
"        # The silhouette coefficient needs no ground truth; it is estimated on a sample.\n",
"        scores[\"Silhouette Coefficient\"].append(\n",
"            metrics.silhouette_score(X, km.labels_, sample_size=2000)\n",
"        )\n",
"    train_times = np.asarray(train_times)\n",
"\n",
"    print(f\"clustering done in {train_times.mean():.2f} ± {train_times.std():.2f} s \")\n",
"    evaluation = {\n",
"        \"estimator\": name,\n",
"        \"train_time\": train_times.mean(),\n",
"    }\n",
"    evaluation_std = {\n",
"        \"estimator\": name,\n",
"        \"train_time\": train_times.std(),\n",
"    }\n",
"    # Aggregate each metric over the runs and store it for the final plot.\n",
"    for score_name, score_values in scores.items():\n",
"        mean_score, std_score = np.mean(score_values), np.std(score_values)\n",
"        print(f\"{score_name}: {mean_score:.3f} ± {std_score:.3f}\")\n",
"        evaluation[score_name] = mean_score\n",
"        evaluation_std[score_name] = std_score\n",
"    evaluations.append(evaluation)\n",
"    evaluations_std.append(evaluation_std)"
]
},
{
"cell_type": "markdown",
"id": "773461fc-777e-43db-8d43-06b5f8804382",
"metadata": {},
"source": [
"Convert the text documents into TF-IDF vectors"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2ff44d45-04cf-4a22-affc-3db5fedd1d54",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vectorization done in 0.226 s\n",
"n_samples: 3387, n_features: 7929\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(\n",
"    max_df=0.5,\n",
"    min_df=5,\n",
"    stop_words=\"english\",\n",
")\n",
"t0 = time()\n",
"X_tfidf = vectorizer.fit_transform(dataset.data)\n",
"\n",
"print(f\"vectorization done in {time() - t0:.3f} s\")\n",
"print(f\"n_samples: {X_tfidf.shape[0]}, n_features: {X_tfidf.shape[1]}\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d4cc19f0-6078-41c6-89c6-5d5599dd13fa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.007\n"
]
}
],
"source": [
"print(f\"{X_tfidf.nnz / np.prod(X_tfidf.shape):.3f}\")"
]
},
{
"cell_type": "markdown",
"id": "8bed771f-9358-47b3-bea1-45410a2446e0",
"metadata": {},
"source": [
"Apply KMeans to the TF-IDF vectors with a single initialization (n_init=1) per seed and compare the resulting cluster sizes across seeds"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4a864bc1-5798-4a17-882f-03f7039dd183",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of elements assigned to each cluster: [ 481 675 1785 446]\n",
"Number of elements assigned to each cluster: [1689 638 480 580]\n",
"Number of elements assigned to each cluster: [ 1 1 1 3384]\n",
"Number of elements assigned to each cluster: [1887 311 332 857]\n",
"Number of elements assigned to each cluster: [ 291 673 1771 652]\n",
"\n",
"True number of documents in each category according to the class labels: [799 973 987 628]\n"
]
}
],
"source": [
"from sklearn.cluster import KMeans\n",
"\n",
"for seed in range(5):\n",
"    kmeans = KMeans(\n",
"        n_clusters=true_k,\n",
"        max_iter=100,\n",
"        n_init=1,\n",
"        random_state=seed,\n",
"    ).fit(X_tfidf)\n",
"    cluster_ids, cluster_sizes = np.unique(kmeans.labels_, return_counts=True)\n",
"    print(f\"Number of elements assigned to each cluster: {cluster_sizes}\")\n",
"print()\n",
"print(\n",
"    \"True number of documents in each category according to the class labels: \"\n",
"    f\"{category_sizes}\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "245a938e-d2bd-4360-8856-3fbedd732cf8",
"metadata": {},
"source": [
"Improve the stability and quality of the KMeans clustering by running several initializations per fit (n_init=5) and keeping the best one"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "52b60356-f983-4f17-a3df-a523ab1cc2ad",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"clustering done in 0.06 ± 0.01 s \n",
"Homogeneity: 0.349 ± 0.010\n",
"Completeness: 0.398 ± 0.009\n",
"V-measure: 0.372 ± 0.009\n",
"Adjusted Rand-Index: 0.203 ± 0.017\n",
"Silhouette Coefficient: 0.007 ± 0.000\n"
]
}
],
"source": [
"kmeans = KMeans(\n",
"    n_clusters=true_k,\n",
"    max_iter=100,\n",
"    n_init=5,\n",
")\n",
"\n",
"fit_and_evaluate(kmeans, X_tfidf, name=\"KMeans\\non tf-idf vectors\")"
]
},
{
"cell_type": "markdown",
"id": "236ced7d-857b-4e79-9be5-8b3f593bf0ca",
"metadata": {},
"source": [
"Apply LSA (TruncatedSVD followed by normalization) to reduce the dimensionality of the TF-IDF vectors"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c5d526ab-338d-4c65-bc88-98478684f9fa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LSA done in 0.235 s\n",
"Explained variance of the SVD step: 18.4%\n"
]
}
],
"source": [
"from sklearn.decomposition import TruncatedSVD\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import Normalizer\n",
"\n",
"lsa = make_pipeline(TruncatedSVD(n_components=100), Normalizer(copy=False))\n",
"t0 = time()\n",
"X_lsa = lsa.fit_transform(X_tfidf)\n",
"explained_variance = lsa[0].explained_variance_ratio_.sum()\n",
"\n",
"print(f\"LSA done in {time() - t0:.3f} s\")\n",
"print(f\"Explained variance of the SVD step: {explained_variance * 100:.1f}%\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ba85a0c3-ece2-4fa1-9e49-1617ed981670",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"clustering done in 0.01 ± 0.00 s \n",
"Homogeneity: 0.397 ± 0.015\n",
"Completeness: 0.423 ± 0.006\n",
"V-measure: 0.410 ± 0.010\n",
"Adjusted Rand-Index: 0.310 ± 0.024\n",
"Silhouette Coefficient: 0.030 ± 0.001\n"
]
}
],
"source": [
"kmeans = KMeans(\n",
"    n_clusters=true_k,\n",
"    max_iter=100,\n",
"    n_init=1,\n",
")\n",
"\n",
"fit_and_evaluate(kmeans, X_lsa, name=\"KMeans\\nwith LSA on tf-idf vectors\")"
]
},
{
"cell_type": "markdown",
"id": "c6927800-8b49-4cd8-96a5-6d3200dcdb31",
"metadata": {},
"source": [
"Fit MiniBatchKMeans, a faster variant of KMeans, on the reduced matrix"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f4b0d1fa-ee9b-43cd-bfa0-30fb2662e043",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"clustering done in 0.04 ± 0.01 s \n",
"Homogeneity: 0.387 ± 0.030\n",
"Completeness: 0.396 ± 0.018\n",
"V-measure: 0.391 ± 0.024\n",
"Adjusted Rand-Index: 0.342 ± 0.025\n",
"Silhouette Coefficient: 0.027 ± 0.004\n"
]
}
],
"source": [
"from sklearn.cluster import MiniBatchKMeans\n",
"\n",
"minibatch_kmeans = MiniBatchKMeans(\n",
"    n_clusters=true_k,\n",
"    n_init=1,\n",
"    init_size=1000,\n",
"    batch_size=1000,\n",
")\n",
"\n",
"fit_and_evaluate(\n",
"    minibatch_kmeans,\n",
"    X_lsa,\n",
"    name=\"MiniBatchKMeans\\nwith LSA on tf-idf vectors\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9076622b-b546-415d-972d-b2d22b56cc8f",
"metadata": {},
"source": [
"Find the terms that best characterize each cluster"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "61db3d27-40d7-4fe8-85bd-dec814ba64ff",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster 0: space nasa shuttle launch program station sci like think just \n",
"Cluster 1: just think like don time know ve new people good \n",
"Cluster 2: thanks graphics image file program files know help looking format \n",
"Cluster 3: god people don jesus think bible say believe religion christian \n"
]
}
],
"source": [
"original_space_centroids = lsa[0].inverse_transform(kmeans.cluster_centers_)\n",
"order_centroids = original_space_centroids.argsort()[:, ::-1]\n",
"terms = vectorizer.get_feature_names_out()\n",
"\n",
"for i in range(true_k):\n",
"    print(f\"Cluster {i}: \", end=\"\")\n",
"    for ind in order_centroids[i, :10]:\n",
"        print(f\"{terms[ind]} \", end=\"\")\n",
"    print()"
]
},
{
"cell_type": "markdown",
"id": "2fc42b36-3787-41dd-bdbc-c436d1d082bd",
"metadata": {},
"source": [
"Instead of TfidfVectorizer, use HashingVectorizer + TfidfTransformer + LSA"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e8a16f09-e356-40a9-ae83-123665ddc190",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vectorization done in 0.897 s\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer\n",
"\n",
"lsa_vectorizer = make_pipeline(\n",
"    HashingVectorizer(stop_words=\"english\", n_features=50_000),\n",
"    TfidfTransformer(),\n",
"    TruncatedSVD(n_components=100, random_state=0),\n",
"    Normalizer(copy=False),\n",
")\n",
"\n",
"t0 = time()\n",
"X_hashed_lsa = lsa_vectorizer.fit_transform(dataset.data)\n",
"print(f\"vectorization done in {time() - t0:.3f} s\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ba416f7d-3d95-46c7-92a3-1da5b035efb7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"clustering done in 0.01 ± 0.00 s \n",
"Homogeneity: 0.389 ± 0.014\n",
"Completeness: 0.430 ± 0.026\n",
"V-measure: 0.408 ± 0.019\n",
"Adjusted Rand-Index: 0.328 ± 0.017\n",
"Silhouette Coefficient: 0.029 ± 0.002\n"
]
}
],
"source": [
"fit_and_evaluate(kmeans, X_hashed_lsa, name=\"KMeans\\nwith LSA on hashed vectors\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "964ffa4c-5836-4af2-ae78-9828505e4a83",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"clustering done in 0.04 ± 0.01 s \n",
"Homogeneity: 0.346 ± 0.057\n",
"Completeness: 0.367 ± 0.061\n",
"V-measure: 0.356 ± 0.058\n",
"Adjusted Rand-Index: 0.307 ± 0.055\n",
"Silhouette Coefficient: 0.028 ± 0.003\n"
]
}
],
"source": [
"fit_and_evaluate(\n",
"    minibatch_kmeans,\n",
"    X_hashed_lsa,\n",
"    name=\"MiniBatchKMeans\\nwith LSA on hashed vectors\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e4cf9ffe-6eee-4168-90bd-243895468664",
"metadata": {},
"source": [
"Build a table and a plot of the clustering results"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "042bce7b-b173-4074-8883-680ebc58ec4a",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1600x600 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"\n",
"fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(16, 6), sharey=True)\n",
"\n",
"df = pd.DataFrame(evaluations[::-1]).set_index(\"estimator\")\n",
"df_std = pd.DataFrame(evaluations_std[::-1]).set_index(\"estimator\")\n",
"\n",
"df.drop(\n",
"    [\"train_time\"],\n",
"    axis=\"columns\",\n",
").plot.barh(ax=ax0, xerr=df_std)\n",
"ax0.set_xlabel(\"Clustering scores\")\n",
"ax0.set_ylabel(\"\")\n",
"\n",
"df[\"train_time\"].plot.barh(ax=ax1, xerr=df_std[\"train_time\"])\n",
"ax1.set_xlabel(\"Clustering time (s)\")\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"id": "2454c41c-7745-41af-8543-eea3b3b78a3a",
"metadata": {},
"source": [
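"The clusters recover the true categories only partially (V-measure between 0.36 and 0.41). LSA improves on KMeans over the raw TF-IDF vectors (V-measure 0.372, Adjusted Rand-Index 0.203). KMeans with LSA on hashed vectors is among the best configurations (V-measure 0.408 ± 0.019, Adjusted Rand-Index 0.328 ± 0.017), on par with KMeans with LSA on TF-IDF vectors (0.410 ± 0.010 and 0.310 ± 0.024); the differences between them are comparable to the run-to-run standard deviation"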
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}