{ "cells": [ { "cell_type": "code", "execution_count": 2, "id": "3a40ada8-e64a-4c93-904c-b2a5a5c8ca70", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 10\n", " 1 1.00 0.75 0.86 12\n", " 2 0.73 1.00 0.84 8\n", "\n", " accuracy 0.90 30\n", " macro avg 0.91 0.92 0.90 30\n", "weighted avg 0.93 0.90 0.90 30\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\4_week\\venv\\Lib\\site-packages\\sklearn\\neural_network\\_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (500) reached and the optimization hasn't converged yet.\n", " warnings.warn(\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.metrics import classification_report\n", "\n", "# Load the data and split it into train and test sets\n", "X, y = load_iris(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "# MLP model: a multilayer perceptron with one hidden layer of 10 units\n", "clf = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=500)\n", "clf.fit(X_train, y_train)\n", "\n", "# Classification report (precision, recall, F1) on the test set\n", "print(classification_report(y_test, clf.predict(X_test)))" ] }, { "cell_type": "markdown", "id": "daeaf6e9-855d-4f61-a062-2bc4aebc8ea2", "metadata": {}, "source": [ "The goal is to compare different ways of vectorizing text data, using a subset of posts from the 20 Newsgroups dataset. The methods are evaluated by throughput (MB/s) and by the number of unique features they produce." ] }, { "cell_type": "markdown", "id": "befe0cc5-5c62-4a41-9f42-8aecb38c1d95", "metadata": {}, "source": [ "1. Loading the data. 
We load a subset of newsgroup posts from the selected categories." ] }, { "cell_type": "code", "execution_count": 12, "id": "5a6aa580-4e49-428c-9310-df95b84d7aea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading 20 newsgroups training data\n", "3803 documents - 6.245MB\n" ] } ], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "categories = [\n", " \"alt.atheism\",\n", " \"comp.graphics\",\n", " \"comp.sys.ibm.pc.hardware\",\n", " \"misc.forsale\",\n", " \"rec.autos\",\n", " \"sci.space\",\n", " \"talk.religion.misc\",\n", "]\n", "\n", "print(\"Loading 20 newsgroups training data\")\n", "raw_data, _ = fetch_20newsgroups(subset=\"train\", categories=categories, return_X_y=True)\n", "data_size_mb = sum(len(s.encode(\"utf-8\")) for s in raw_data) / 1e6\n", "print(f\"{len(raw_data)} documents - {data_size_mb:.3f}MB\")\n" ] }, { "cell_type": "markdown", "id": "83f728a3-b22a-41e8-b51f-b8e7a5d55c95", "metadata": {}, "source": [ "2. Preprocessing: tokenization and word frequencies. We define simple functions that split a text into tokens and count how often each token occurs." ] }, { "cell_type": "code", "execution_count": 13, "id": "e8466abd-87bc-4952-bd68-fdb4b4220bc5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(int,\n", " {'that': 1,\n", " 'is': 2,\n", " 'one': 2,\n", " 'example': 1,\n", " 'but': 1,\n", " 'this': 1,\n", " 'another': 1})" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "\n", "def tokenize(doc):\n", " \"\"\"Extract tokens from doc.\n", "\n", " This uses a simple regex that matches word characters to break strings\n", " into tokens. 
For a more principled approach, see CountVectorizer or\n", " TfidfVectorizer.\n", " \"\"\"\n", " return (tok.lower() for tok in re.findall(r\"\\w+\", doc))\n", "\n", "\n", "list(tokenize(\"This is a simple example, isn't it?\"))\n", "from collections import defaultdict\n", "\n", "\n", "def token_freqs(doc):\n", " \"\"\"Extract a dict mapping tokens from doc to their occurrences.\"\"\"\n", "\n", " freq = defaultdict(int)\n", " for tok in tokenize(doc):\n", " freq[tok] += 1\n", " return freq\n", "\n", "\n", "token_freqs(\"That is one example, but this is another one\")" ] }, { "cell_type": "markdown", "id": "99797511-8c47-4b5f-aa06-c074c88ee277", "metadata": {}, "source": [ "3. Vectorization with DictVectorizer. It turns each token-frequency dict into a row of a sparse numeric matrix, learning the token-to-column vocabulary during fitting." ] }, { "cell_type": "code", "execution_count": 14, "id": "6901e1e7-fa59-459d-8df4-a0a6aaeac7c3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done in 1.750 s at 3.6 MB/s\n", "Found 47928 unique terms\n" ] } ], "source": [ "from time import time\n", "\n", "from sklearn.feature_extraction import DictVectorizer\n", "\n", "dict_count_vectorizers = defaultdict(list)\n", "\n", "t0 = time()\n", "vectorizer = DictVectorizer()\n", "vectorizer.fit_transform(token_freqs(d) for d in raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(\n", " vectorizer.__class__.__name__ + \"\\non freq dicts\"\n", ")\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n", "print(f\"Found {len(vectorizer.get_feature_names_out())} unique terms\")\n" ] }, { "cell_type": "markdown", "id": "37e02d00-0cb1-4bb9-a9ba-03978e578808", "metadata": {}, "source": [ "4. Vectorization with FeatureHasher. 
The method applies the hashing trick: a hash function maps each token directly to a column index, so no vocabulary has to be stored. Distinct tokens can collide in the same column, so the number of non-zero columns is a lower bound on the number of unique tokens." ] }, { "cell_type": "code", "execution_count": 15, "id": "33b89a91-1f29-4edd-9aa0-997e00393ec7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done in 0.947 s at 6.6 MB/s\n", "Found 43873 unique tokens\n", "done in 1.066 s at 5.9 MB/s\n", "Found 47668 unique tokens\n", "done in 0.909 s at 6.9 MB/s\n", "Found 43873 unique tokens\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "\n", "\n", "def n_nonzero_columns(X):\n", " \"\"\"Number of columns with at least one non-zero value in a CSR matrix.\n", "\n", " This is useful to count the number of features columns that are effectively\n", " active when using the FeatureHasher.\n", " \"\"\"\n", " return len(np.unique(X.nonzero()[1]))\n", "from sklearn.feature_extraction import FeatureHasher\n", "\n", "t0 = time()\n", "hasher = FeatureHasher(n_features=2**18)\n", "X = hasher.transform(token_freqs(d) for d in raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(\n", " hasher.__class__.__name__ + \"\\non freq dicts\"\n", ")\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n", "print(f\"Found {n_nonzero_columns(X)} unique tokens\")\n", "t0 = time()\n", "hasher = FeatureHasher(n_features=2**22)\n", "X = hasher.transform(token_freqs(d) for d in raw_data)\n", "duration = time() - t0\n", "\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n", "print(f\"Found {n_nonzero_columns(X)} unique tokens\")\n", "t0 = time()\n", "hasher = FeatureHasher(n_features=2**18, input_type=\"string\")\n", "X = hasher.transform(tokenize(d) for d in raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(\n", " hasher.__class__.__name__ + \"\\non raw tokens\"\n", ")\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n", "print(f\"Found {n_nonzero_columns(X)} unique tokens\")\n", "import matplotlib.pyplot as plt\n", "\n", "fig, ax = plt.subplots(figsize=(12, 6))\n", "\n", "y_pos = np.arange(len(dict_count_vectorizers[\"vectorizer\"]))\n", "ax.barh(y_pos, dict_count_vectorizers[\"speed\"], align=\"center\")\n", "ax.set_yticks(y_pos)\n", 
"ax.set_yticklabels(dict_count_vectorizers[\"vectorizer\"])\n", "ax.invert_yaxis()\n", "_ = ax.set_xlabel(\"speed (MB/s)\")" ] }, { "cell_type": "markdown", "id": "83e7b545-4dc9-45c2-b5f4-31c2133d7d55", "metadata": {}, "source": [ "5. Сравнение с CountVectorizer. Метод представляет из себя токенизацию и частоты слов" ] }, { "cell_type": "code", "execution_count": 16, "id": "4cc0ea45-8784-400b-881a-312394c0335e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done in 1.135 s at 5.5 MB/s\n", "Found 47885 unique terms\n", "done in 0.868 s at 7.2 MB/s\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "t0 = time()\n", "vectorizer = CountVectorizer()\n", "vectorizer.fit_transform(raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n", "print(f\"Found {len(vectorizer.get_feature_names_out())} unique terms\")\n", "from sklearn.feature_extraction.text import HashingVectorizer\n", "\n", "t0 = time()\n", "vectorizer = HashingVectorizer(n_features=2**18)\n", "vectorizer.fit_transform(raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")" ] }, { "cell_type": "markdown", "id": "1bd2ed6d-508d-42e1-a8e4-6344b6ff493e", "metadata": {}, "source": [ "6. HashingVectorizer. 
It combines the tokenization of CountVectorizer with the hashing of FeatureHasher." ] }, { "cell_type": "code", "execution_count": 17, "id": "e8ed6734-a8fe-41ee-b3a3-cbb06e0e1752", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done in 1.030 s at 6.1 MB/s\n" ] } ], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n", "\n", "t0 = time()\n", "vectorizer = HashingVectorizer(n_features=2**18)\n", "vectorizer.fit_transform(raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n" ] }, { "cell_type": "markdown", "id": "8e3bb2ef-af88-488d-850a-154951863d53", "metadata": {}, "source": [ "7. TF-IDF Vectorizer. It reweights the token counts by how informative each token is across the document collection." ] }, { "cell_type": "code", "execution_count": 18, "id": "756d27ac-f26f-421b-82ef-a6a60dbf90af", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done in 1.334 s at 4.7 MB/s\n", "Found 47885 unique terms\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "t0 = time()\n", "vectorizer = TfidfVectorizer()\n", "vectorizer.fit_transform(raw_data)\n", "duration = time() - t0\n", "dict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\n", "dict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\n", "print(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\n", "print(f\"Found {len(vectorizer.get_feature_names_out())} unique terms\")" ] }, { "cell_type": "markdown", "id": "ecd038f0-c024-4a0b-a7fc-bcf648eed2f6", "metadata": {}, "source": [ "8. Visualization. 
We compare the throughput (MB/s) of the different approaches." ] }, { "cell_type": "code", "execution_count": 19, "id": "391b055b-fa67-4d43-903a-e7ff512b69c2", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(12, 6))\n", "\n", "y_pos = np.arange(len(dict_count_vectorizers[\"vectorizer\"]))\n", "ax.barh(y_pos, dict_count_vectorizers[\"speed\"], align=\"center\")\n", "ax.set_yticks(y_pos)\n", "ax.set_yticklabels(dict_count_vectorizers[\"vectorizer\"])\n", "ax.invert_yaxis()\n", "_ = ax.set_xlabel(\"speed (MB/s)\")" ] }, { "cell_type": "markdown", "id": "ef9b977b-196d-4de7-8c29-0004a5c2e31e", "metadata": {}, "source": [ "Задание успешно выполнено. Метод HashingVectorizer оказался самым быстрым. Он особенно быстрый на больших объемах данных и не требует хранения словаря, т.е. каждый токен сразу преобразуется в индекс по хеш-функции" ] }, { "cell_type": "code", "execution_count": null, "id": "8f0e8c91-1de2-4839-aedb-618ffdc3f838", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 5 }