{ "cells": [ { "cell_type": "markdown", "id": "d44b76a1-962a-4a09-91d5-74f1fbefb88f", "metadata": {}, "source": [ "Goal:\n", "Compare the quality of several classification algorithms (SVM, KNN, Decision Tree, Random Forest, Logistic Regression):\n", "\n", "On a synthetic dataset generated with make_blobs.\n", "\n", "On a real dataset. The task names SMS Spam Collection from openml.org; section 5 of this notebook loads the local file 1000_ml_jobs_us.csv." ] }, { "cell_type": "code", "execution_count": 7, "id": "dec9f4c6-3bc5-4dd1-b11e-85fe059751ce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 9\n", " 1 1.00 1.00 1.00 10\n", " 2 1.00 1.00 1.00 11\n", "\n", " accuracy 1.00 30\n", " macro avg 1.00 1.00 1.00 30\n", "weighted avg 1.00 1.00 1.00 30\n", "\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.metrics import classification_report\n", "\n", "# Load and split the data\n", "X, y = load_iris(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", "\n", "# MLP model: a multilayer perceptron with one hidden layer of 10 units\n", "clf = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=2500)\n", "clf.fit(X_train, y_train)\n", "\n", "# Report precision, recall, and F1 per class\n", "print(classification_report(y_test, clf.predict(X_test)))" ] }, { "cell_type": "markdown", "id": "b8718927-08f3-4d35-b163-d31bc7a8ce7d", "metadata": {}, "source": [ "Import the required libraries" ] }, { "cell_type": "code", "execution_count": 30, "id": "619b3507-5dc9-4581-b5b7-a2dc21e2656a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_blobs\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "from 
sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import accuracy_score, classification_report\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.svm import LinearSVC\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import pandas as pd\n", "\n" ] }, { "cell_type": "markdown", "id": "d112584d-4642-4eea-9d47-c1310a7be009", "metadata": {}, "source": [ "2. Synthetic dataset: make_blobs" ] }, { "cell_type": "markdown", "id": "aecc83be-209a-4f5c-aaf1-32ea571f0820", "metadata": {}, "source": [ "2.1 Creating and visualizing the data" ] }, { "cell_type": "code", "execution_count": 8, "id": "39d5b664-ee7d-444d-b964-fd3b90d8a396", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate 3 Gaussian clusters (make_blobs defaults to 100 samples)\n", "X, y = make_blobs(centers=3, cluster_std=0.5, random_state=0)\n", "\n", "# Visualization\n", "plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')\n", "plt.title(\"Three normally-distributed clusters\")\n", "plt.xlabel(\"Feature 1\")\n", "plt.ylabel(\"Feature 2\")\n", "plt.grid(True)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "ea44ff0d-c2ef-47af-b589-55beb44d5787", "metadata": {}, "source": [ "2.2 Train/test split" ] }, { "cell_type": "code", "execution_count": 9, "id": "fc2711a3-dbb6-48ab-8e4a-06b280675f23", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n" ] }, { "cell_type": "markdown", "id": "0d23f0b7-37e1-4644-ae01-7ee8e1674658", "metadata": {}, "source": [ "3. Training the models" ] }, { "cell_type": "code", "execution_count": 10, "id": "f1baafd9-b282-4c2f-ad6f-8c2473bd29e7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Logistic Regression:\n", "Accuracy: 1.000\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 14\n", " 1 1.00 1.00 1.00 10\n", " 2 1.00 1.00 1.00 6\n", "\n", " accuracy 1.00 30\n", " macro avg 1.00 1.00 1.00 30\n", "weighted avg 1.00 1.00 1.00 30\n", "\n", "\n", "SVM:\n", "Accuracy: 1.000\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 14\n", " 1 1.00 1.00 1.00 10\n", " 2 1.00 1.00 1.00 6\n", "\n", " accuracy 1.00 30\n", " macro avg 1.00 1.00 1.00 30\n", "weighted avg 1.00 1.00 1.00 30\n", "\n", "\n", "KNN:\n", "Accuracy: 1.000\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 14\n", " 1 1.00 1.00 1.00 10\n", " 2 1.00 1.00 1.00 6\n", "\n", " accuracy 1.00 30\n", " macro avg 1.00 1.00 1.00 30\n", "weighted avg 1.00 1.00 1.00 30\n", "\n", "\n", "Decision Tree:\n", "Accuracy: 0.967\n", " precision recall f1-score support\n", "\n", " 0 1.00 0.93 0.96 
14\n", " 1 1.00 1.00 1.00 10\n", " 2 0.86 1.00 0.92 6\n", "\n", " accuracy 0.97 30\n", " macro avg 0.95 0.98 0.96 30\n", "weighted avg 0.97 0.97 0.97 30\n", "\n", "\n", "Random Forest:\n", "Accuracy: 0.967\n", " precision recall f1-score support\n", "\n", " 0 1.00 0.93 0.96 14\n", " 1 1.00 1.00 1.00 10\n", " 2 0.86 1.00 0.92 6\n", "\n", " accuracy 0.97 30\n", " macro avg 0.95 0.98 0.96 30\n", "weighted avg 0.97 0.97 0.97 30\n", "\n" ] } ], "source": [ "models = {\n", "    \"Logistic Regression\": LogisticRegression(max_iter=1000),\n", "    \"SVM\": SVC(),\n", "    \"KNN\": KNeighborsClassifier(),\n", "    \"Decision Tree\": DecisionTreeClassifier(),\n", "    \"Random Forest\": RandomForestClassifier()\n", "}\n", "\n", "# Standardize the features, then fit and evaluate each model on the same split\n", "for name, model in models.items():\n", "    clf = make_pipeline(StandardScaler(), model)\n", "    clf.fit(X_train, y_train)\n", "    y_pred = clf.predict(X_test)\n", "    print(f\"\\n{name}:\\nAccuracy: {accuracy_score(y_test, y_pred):.3f}\")\n", "    print(classification_report(y_test, y_pred))\n" ] }, { "cell_type": "markdown", "id": "3614904f-d3d1-40e5-9ad6-f308117c2a7f", "metadata": {}, "source": [ "4. Interpretation (synthetic data)" ] }, { "cell_type": "markdown", "id": "fc315f20-878e-4629-8154-5977fd88427f", "metadata": {}, "source": [ "### Interpretation of results\n", "\n", "- The clusters are well separated, so every model reaches high accuracy (at least 0.967 on the 30 test samples).\n", "- KNN, Logistic Regression, and SVM classify the test set perfectly (accuracy 1.000).\n", "- Decision Tree and Random Forest each misclassify one test sample (accuracy 0.967); tree-based models can overfit even on simple data.\n" ] }, { "cell_type": "markdown", "id": "b4c25059-3ee5-49a2-8b1a-3d6c5af8ff98", "metadata": {}, "source": [ "5. 
Real dataset: 1000_ml_jobs_us.csv (job descriptions labelled with seniority level)" ] }, { "cell_type": "code", "execution_count": 14, "id": "895c0577-7e18-42fb-9dae-c89977b6f7c8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Уникальные уровни должностей: ['Internship' 'Mid-Senior level' 'Entry level' 'Not Applicable'\n", " 'Associate' 'Executive' 'Director']\n" ] } ], "source": [ "# Load the file\n", "df = pd.read_csv(\"1000_ml_jobs_us.csv\")\n", "\n", "# Look at the first rows\n", "df.head()\n", "\n", "# Keep only rows where both job_description_text and seniority_level are present\n", "df = df[['job_description_text', 'seniority_level']].dropna()\n", "\n", "# Show the unique seniority levels\n", "print(\"Уникальные уровни должностей:\", df['seniority_level'].unique())\n" ] }, { "cell_type": "markdown", "id": "17edbf86-7b62-4161-90c9-7e2e0272b99c", "metadata": {}, "source": [ "Preprocessing the texts and labels" ] }, { "cell_type": "code", "execution_count": 20, "id": "27e1ef0a-f61b-4bab-b5ac-b86e7c9c86d8", "metadata": {}, "outputs": [], "source": [ "# Target variable\n", "y = df['seniority_level']\n", "\n", "# Encode the seniority levels as integers\n", "le = LabelEncoder()\n", "y_encoded = le.fit_transform(y)\n", "\n", "# Vectorize the texts with TF-IDF\n", "# Note: fitting the vectorizer on all texts before the split lets test-set statistics leak into training\n", "vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')\n", "X = vectorizer.fit_transform(df['job_description_text'])\n", "\n", "# Split into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)\n" ] }, { "cell_type": "markdown", "id": "cc7af45c-059c-4433-b919-a0ea78c39b12", "metadata": {}, "source": [ "Training and comparing the models" ] }, { "cell_type": "code", "execution_count": 29, "id": "c620989d-5832-42fe-956f-74b7cd996d30", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Результаты классификации (1000_ml_jobs_us.csv):\n", "Logistic Regression: Accuracy = 0.57\n", "Multinomial NB: Accuracy = 
0.52\n", "Random Forest: Accuracy = 0.60\n", "Linear SVM: Accuracy = 0.60\n" ] } ], "source": [ "# New set of classifiers, all of which accept sparse TF-IDF input\n", "text_classifiers = {\n", "    \"Logistic Regression\": LogisticRegression(max_iter=1000),\n", "    \"Multinomial NB\": MultinomialNB(),\n", "    \"Random Forest\": RandomForestClassifier(),\n", "    \"Linear SVM\": LinearSVC()\n", "}\n", "\n", "# Train each model and print its test accuracy\n", "print(\"Результаты классификации (1000_ml_jobs_us.csv):\")\n", "for name, clf in text_classifiers.items():\n", "    clf.fit(X_train, y_train)\n", "    y_pred = clf.predict(X_test)\n", "    acc = accuracy_score(y_test, y_pred)\n", "    print(f\"{name}: Accuracy = {acc:.2f}\")\n" ] }, { "cell_type": "markdown", "id": "88c57a6b-5cd9-41ee-a2d6-e14ab99f6b3c", "metadata": {}, "source": [ "Interpreting the results" ] }, { "cell_type": "code", "execution_count": 35, "id": "66677a13-6d22-4cd7-8895-85c151ff93ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Классы в кодировщике: ['Associate' 'Director' 'Entry level' 'Executive' 'Internship'\n", " 'Mid-Senior level' 'Not Applicable']\n", "Уникальные классы в тесте: [0 1 2 4 5 6]\n", "\n", "Classification report (Logistic Regression):\n", " precision recall f1-score support\n", "\n", " Associate 0.00 0.00 0.00 14\n", " Director 0.00 0.00 0.00 2\n", " Entry level 0.47 0.58 0.52 86\n", " Internship 1.00 0.35 0.51 26\n", "Mid-Senior level 0.52 0.71 0.60 110\n", " Not Applicable 0.97 0.54 0.70 59\n", "\n", " accuracy 0.57 297\n", " macro avg 0.49 0.36 0.39 297\n", " weighted avg 0.61 0.57 0.56 297\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "D:\\учеба\\2 курс\\семестр4\\praktika 4\\venv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. 
Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "D:\\учеба\\2 курс\\семестр4\\praktika 4\\venv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", "D:\\учеба\\2 курс\\семестр4\\praktika 4\\venv\\Lib\\site-packages\\sklearn\\metrics\\_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n" ] } ], "source": [ "# Detailed per-class report for Logistic Regression\n", "# (in the comparison above, Random Forest and Linear SVM scored higher: 0.60 vs 0.57)\n", "best_model = LogisticRegression(max_iter=1000)\n", "best_model.fit(X_train, y_train)\n", "y_pred = best_model.predict(X_test)\n", "\n", "# Reuse the LabelEncoder already fitted in the preprocessing step,\n", "# i.e. the same instance that encoded y\n", "print(\"\\nКлассы в кодировщике:\", le.classes_)\n", "\n", "# Collect the classes that occur in the test labels or in the predictions\n", "unique_labels = np.unique(np.concatenate([y_test, y_pred]))\n", "print(\"Уникальные классы в тесте:\", unique_labels)\n", "\n", "# Build the report only for the classes that are present\n", "print(\"\\nClassification report (Logistic Regression):\")\n", "print(classification_report(\n", "    y_test,\n", "    y_pred,\n", "    labels=unique_labels,\n", "    target_names=le.classes_[unique_labels]  # keep only the matching class names\n", "))" ] }, { "cell_type": "markdown", "id": "d8906a44-1e2f-4392-ae4a-2b3ee1538df8", "metadata": {}, "source": [ "6. 
Interpretation of results\n", "Random Forest and Linear SVM achieved the highest accuracy on the job-description texts (0.60), ahead of Logistic Regression (0.57) and Multinomial NB (0.52).\n", "\n", "The per-class report for Logistic Regression shows that the two smallest classes in the test set (Associate, 14 samples; Director, 2 samples) are never predicted, so the macro-averaged F1 (0.39) is well below the accuracy.\n", "\n", "KNN was not run on this dataset; on high-dimensional sparse TF-IDF features it is typically slow at prediction time, though it may become practical after feature selection.\n", "\n", "LinearSVC trains quickly, but may need its regularization strength (`C`) tuned to improve accuracy.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "85385781-29ae-4fe1-ac42-8c64a75b6af3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 5 }