{
"cells": [
{
"cell_type": "markdown",
"id": "d44b76a1-962a-4a09-91d5-74f1fbefb88f",
"metadata": {},
"source": [
"Goal:\n",
"Compare the accuracy and run time of several scikit-learn classification algorithms on two kinds of data:\n",
"\n",
"Text dataset: fetch_rcv1(), the RCV1 corpus of Reuters newswire stories, where each document carries multiple topic labels.\n",
"\n",
"Image dataset: Fashion-MNIST from OpenML, images of clothing represented as numeric features.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dec9f4c6-3bc5-4dd1-b11e-85fe059751ce",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"              precision    recall  f1-score   support\n",
"\n",
"           0       1.00      1.00      1.00         9\n",
"           1       1.00      1.00      1.00        10\n",
"           2       1.00      1.00      1.00        11\n",
"\n",
"    accuracy                           1.00        30\n",
"   macro avg       1.00      1.00      1.00        30\n",
"weighted avg       1.00      1.00      1.00        30\n",
"\n"
]
}
],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neural_network import MLPClassifier\n",
"from sklearn.metrics import classification_report\n",
"\n",
"# Load the data and split it into training and test sets\n",
"X, y = load_iris(return_X_y=True)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n",
"\n",
"# MLP: a multilayer perceptron with one hidden layer of 10 units\n",
"clf = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=2500)\n",
"clf.fit(X_train, y_train)\n",
"\n",
"# Per-class precision, recall and F1 on the test set\n",
"print(classification_report(y_test, clf.predict(X_test)))"
]
},
{
"cell_type": "markdown",
"id": "b8718927-08f3-4d35-b163-d31bc7a8ce7d",
"metadata": {},
"source": [
"Importing the required libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "619b3507-5dc9-4581-b5b7-a2dc21e2656a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import fetch_rcv1\n",
"from sklearn.decomposition import TruncatedSVD\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.svm import SVC\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "d112584d-4642-4eea-9d47-c1310a7be009",
"metadata": {},
"source": [
"2. TEXT DATASET: fetch_rcv1()"
]
},
{
"cell_type": "markdown",
"id": "aecc83be-209a-4f5c-aaf1-32ea571f0820",
"metadata": {},
"source": [
"2.1 Loading the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "39d5b664-ee7d-444d-b964-fd3b90d8a396",
"metadata": {},
"outputs": [],
"source": [
"rcv1 = fetch_rcv1()\n",
"# Binary target: membership in the category at column index 33 (chosen as an example)\n",
"X, y = rcv1.data, rcv1.target[:, 33].toarray().ravel()\n"
]
},
{
"cell_type": "markdown",
"id": "ea44ff0d-c2ef-47af-b589-55beb44d5787",
"metadata": {},
"source": [
"2.2 Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc2711a3-dbb6-48ab-8e4a-06b280675f23",
"metadata": {},
"outputs": [],
"source": [
"# Reduce the dimensionality to speed up computation\n",
"svd = TruncatedSVD(n_components=100, random_state=42)\n",
"X_reduced = svd.fit_transform(X)\n",
"\n",
"# Split into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3, random_state=42)\n"
]
},
{
"cell_type": "markdown",
"id": "0d23f0b7-37e1-4644-ae01-7ee8e1674658",
"metadata": {},
"source": [
"3. Training and comparing the classifiers"
]
},
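{
"cell_type": "code",
"execution_count": null,
"id": "f1baafd9-b282-4c2f-ad6f-8c2473bd29e7",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"from sklearn.ensemble import AdaBoostClassifier\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"classifiers = {\n",
"    \"Logistic Regression\": LogisticRegression(max_iter=1000),\n",
"    \"Random Forest\": RandomForestClassifier(),\n",
"    \"Linear SVM\": LinearSVC(),\n",
"    \"KNN\": KNeighborsClassifier(),\n",
"    # GaussianNB replaces MultinomialNB, which rejects the negative values\n",
"    # produced by TruncatedSVD and StandardScaler\n",
"    \"Naive Bayes\": GaussianNB(),\n",
"    \"AdaBoost\": AdaBoostClassifier()\n",
"}\n",
"\n",
"results = {}\n",
"\n",
"for name, clf in classifiers.items():\n",
"    start = time.time()\n",
"    try:\n",
"        clf.fit(X_train, y_train)\n",
"        y_pred = clf.predict(X_test)\n",
"        acc = accuracy_score(y_test, y_pred)\n",
"        duration = time.time() - start\n",
"        results[name] = (acc, duration)\n",
"    except Exception as e:\n",
"        # Keep both columns numeric so the results can be sorted and plotted\n",
"        print(f\"{name} failed: {e}\")\n",
"        results[name] = (float(\"nan\"), float(\"nan\"))\n"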
]
},
{
"cell_type": "markdown",
"id": "b283e3ee-df57-4018-873b-bdea8a75d71a",
"metadata": {},
"source": [
"3.1 Visualizing the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6100a228-b6a8-40c9-bded-32dff19c433a",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(results).T\n",
"df.columns = ['Accuracy', 'Time (s)']\n",
"df.sort_values('Accuracy', ascending=False, inplace=True)\n",
"\n",
"df.plot(kind='bar', figsize=(10, 6), legend=True, title=\"Classifier comparison (RCV1)\")\n",
"plt.ylabel(\"Accuracy / Time\")\n",
"plt.grid()\n",
"plt.xticks(rotation=45)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"id": "3614904f-d3d1-40e5-9ad6-f308117c2a7f",
"metadata": {},
"source": [
"4. Interpretation of the results\n",
"The best algorithms were Logistic Regression and LinearSVC, which combine good accuracy with fast training on text data.\n",
"\n",
"Naive Bayes is especially fast, but its accuracy is limited.\n",
"\n",
"Using SVD made it feasible to process the sparse RCV1 matrix."
]
},
{
"cell_type": "markdown",
"id": "b4c25059-3ee5-49a2-8b1a-3d6c5af8ff98",
"metadata": {},
"source": [
"5. IMAGE DATASET: Fashion-MNIST"
]
},
{
"cell_type": "markdown",
"id": "1a1abebd-43f3-46e1-8bf2-a996d986f8b7",
"metadata": {},
"source": [
"5.1 Loading from OpenML"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "895c0577-7e18-42fb-9dae-c89977b6f7c8",
"metadata": {},
"outputs": [],
"source": [
"import openml\n",
"\n",
"dataset = openml.datasets.get_dataset(40996) # Fashion-MNIST\n",
"X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)\n",
"\n",
"X = X.astype('float32')\n",
"y = y.astype('int')\n"
]
},
{
"cell_type": "markdown",
"id": "17edbf86-7b62-4161-90c9-7e2e0272b99c",
"metadata": {},
"source": [
"5.2 Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27e1ef0a-f61b-4bab-b5ac-b86e7c9c86d8",
"metadata": {},
"outputs": [],
"source": [
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)\n"
]
},
{
"cell_type": "markdown",
"id": "fd791fde-6950-444f-912f-c2633a6d4e9a",
"metadata": {},
"source": [
"5.3 Training and comparing the models"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "446d1cf4-bf93-4262-8986-9083c449dacf",
"metadata": {},
"outputs": [],
"source": [
"results_real = {}\n",
"\n",
"for name, clf in classifiers.items():\n",
"    start = time.time()\n",
"    try:\n",
"        clf.fit(X_train, y_train)\n",
"        y_pred = clf.predict(X_test)\n",
"        acc = accuracy_score(y_test, y_pred)\n",
"        duration = time.time() - start\n",
"        results_real[name] = (acc, duration)\n",
"    except Exception as e:\n",
"        # Keep both columns numeric so the results can be sorted and plotted\n",
"        print(f\"{name} failed: {e}\")\n",
"        results_real[name] = (float(\"nan\"), float(\"nan\"))\n"
]
},
{
"cell_type": "markdown",
"id": "d4c125cd-1118-46a2-b0d1-1f1dc1e7161b",
"metadata": {},
"source": [
"5.4 Visualizing the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11aa22f3-c21f-4dee-a090-fadbc2bdec71",
"metadata": {},
"outputs": [],
"source": [
"df_real = pd.DataFrame(results_real).T\n",
"df_real.columns = ['Accuracy', 'Time (s)']\n",
"df_real.sort_values('Accuracy', ascending=False, inplace=True)\n",
"\n",
"df_real.plot(kind='bar', figsize=(10, 6), legend=True, title=\"Classifier comparison (Fashion-MNIST)\")\n",
"plt.ylabel(\"Accuracy / Time\")\n",
"plt.grid()\n",
"plt.xticks(rotation=45)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"id": "5b0544af-af7b-40e7-a4e2-6d1fc31bf7d6",
"metadata": {},
"source": [
"6. Interpretation of the results"
]
},
{
"cell_type": "markdown",
"id": "c8be62a9-149c-4f27-902d-b3dc87e99089",
"metadata": {},
"source": [
"Random Forest and Logistic Regression gave the best results on the images.\n",
"\n",
"KNN was slow on a dataset of this size, but it may become practical after feature selection.\n",
"\n",
"LinearSVC runs fast, but its regularization may need tuning to improve accuracy."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}