{ "cells": [ { "cell_type": "markdown", "id": "793785ea-f1d5-433a-9f01-44b378a5c3df", "metadata": {}, "source": [ "1. Loading and combining the text features" ] }, { "cell_type": "code", "execution_count": 23, "id": "75eb3861-598c-4313-91cb-f9d4f09e0dc4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load the CSV\n", "df = pd.read_csv(\"Dataset_Malawi_National_Football_Team_Matches.csv\")\n", "\n", "# Join the categorical text columns into a single text column\n", "df[\"text\"] = df[[\"Opponent\", \"Result\", \"Venue\", \"Competition\"]].fillna(\"\").agg(\" \".join, axis=1)\n", "texts = df[\"text\"].tolist()\n" ] }, { "cell_type": "markdown", "id": "77b3eb9f-f116-476f-b24c-9d0c61b9d660", "metadata": {}, "source": [ "2. Preparing the tokenization functions" ] }, { "cell_type": "code", "execution_count": 24, "id": "b815fe8c-0362-4bb4-8d7f-5763de7192b0", "metadata": {}, "outputs": [], "source": [ "import re\n", "from collections import defaultdict\n", "\n", "def tokenize(doc):\n", "    # Lazily yield the lowercased word tokens of a document\n", "    return (tok.lower() for tok in re.findall(r\"\\w+\", doc))\n", "\n", "def token_freqs(doc):\n", "    # Map each token to its number of occurrences in the document\n", "    freq = defaultdict(int)\n", "    for tok in tokenize(doc):\n", "        freq[tok] += 1\n", "    return freq\n" ] }, { "cell_type": "markdown", "id": "69867160-c219-4bff-b268-679a8abe2498", "metadata": {}, "source": [ "3. 
Comparing vectorization methods" ] }, { "cell_type": "code", "execution_count": 25, "id": "527d49e0-5e45-4e6d-bdd5-d743b69e56f1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DictVectorizer: (73, 70) — 0.00s\n", "FeatureHasher: (73, 4096) — 0.00s\n", "CountVectorizer: (73, 69) — 0.00s\n", "HashingVectorizer: (73, 4096) — 0.00s\n", "TfidfVectorizer: (73, 69) — 0.00s\n" ] } ], "source": [ "from sklearn.feature_extraction import DictVectorizer, FeatureHasher\n", "from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer\n", "import numpy as np\n", "# perf_counter is a monotonic, high-resolution clock suited to short timings\n", "from time import perf_counter\n", "\n", "def n_nonzero_columns(X):\n", "    return len(np.unique(X.nonzero()[1]))\n", "\n", "data_size_mb = sum(len(s.encode(\"utf-8\")) for s in texts) / 1e6\n", "vectorizer_stats = defaultdict(list)\n", "\n", "# DictVectorizer\n", "t0 = perf_counter()\n", "dv = DictVectorizer()\n", "X_dv = dv.fit_transform(token_freqs(d) for d in texts)\n", "duration = perf_counter() - t0\n", "vectorizer_stats[\"vectorizer\"].append(\"DictVectorizer\")\n", "vectorizer_stats[\"speed\"].append(data_size_mb / duration)\n", "print(f\"DictVectorizer: {X_dv.shape} — {duration:.2f}s\")\n", "\n", "# FeatureHasher\n", "t0 = perf_counter()\n", "fh = FeatureHasher(n_features=2**12)\n", "X_fh = fh.transform(token_freqs(d) for d in texts)\n", "duration = perf_counter() - t0\n", "vectorizer_stats[\"vectorizer\"].append(\"FeatureHasher\")\n", "vectorizer_stats[\"speed\"].append(data_size_mb / duration)\n", "print(f\"FeatureHasher: {X_fh.shape} — {duration:.2f}s\")\n", "\n", "# CountVectorizer\n", "t0 = perf_counter()\n", "cv = CountVectorizer()\n", "X_cv = cv.fit_transform(texts)\n", "duration = perf_counter() - t0\n", "vectorizer_stats[\"vectorizer\"].append(\"CountVectorizer\")\n", "vectorizer_stats[\"speed\"].append(data_size_mb / duration)\n", "print(f\"CountVectorizer: {X_cv.shape} — {duration:.2f}s\")\n", "\n", "# HashingVectorizer\n", "t0 = perf_counter()\n", "hv = HashingVectorizer(n_features=2**12)\n", 
"X_hv = hv.fit_transform(texts)\n", "duration = perf_counter() - t0\n", "vectorizer_stats[\"vectorizer\"].append(\"HashingVectorizer\")\n", "vectorizer_stats[\"speed\"].append(data_size_mb / duration)\n", "print(f\"HashingVectorizer: {X_hv.shape} — {duration:.2f}s\")\n", "\n", "# TfidfVectorizer\n", "t0 = perf_counter()\n", "tv = TfidfVectorizer()\n", "X_tv = tv.fit_transform(texts)\n", "duration = perf_counter() - t0\n", "vectorizer_stats[\"vectorizer\"].append(\"TfidfVectorizer\")\n", "vectorizer_stats[\"speed\"].append(data_size_mb / duration)\n", "print(f\"TfidfVectorizer: {X_tv.shape} — {duration:.2f}s\")\n" ] }, { "cell_type": "markdown", "id": "503cdc6e-e68e-4618-97e1-1535cfc36788", "metadata": {}, "source": [ "4. Visualizing the comparison" ] }, { "cell_type": "code", "execution_count": 26, "id": "bdfd45e3-03fa-47f5-b9e4-4d286e50e6dd", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, ax = plt.subplots(figsize=(10, 6))\n", "y_pos = np.arange(len(vectorizer_stats[\"vectorizer\"]))\n", "ax.barh(y_pos, vectorizer_stats[\"speed\"], align=\"center\")\n", "ax.set_yticks(y_pos)\n", "ax.set_yticklabels(vectorizer_stats[\"vectorizer\"])\n", "ax.invert_yaxis()\n", "ax.set_xlabel(\"Speed (MB/s)\")\n", "ax.set_title(\"Text vectorization speed comparison\")\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "cf1deab5-fbe4-4bb9-af74-6fbca8958636", "metadata": {}, "source": [ "The task is complete. FeatureHasher turned out to be the fastest. The likely reason is that it builds no vocabulary: it hashes each key of the prepared token dictionaries straight to a column index, so the data is processed in a single pass with nothing to fit or store. Note that the timed region still includes tokenization, because the token_freqs generator is consumed inside transform. With only 73 short documents every run rounds to 0.00 s, so the differences are small and sensitive to timing noise." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 5 }