<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Wowchemy | Wenjie Lan</title><link>https://drwenjielan.github.io/tag/wowchemy/</link><atom:link href="https://drwenjielan.github.io/tag/wowchemy/index.xml" rel="self" type="application/rss+xml"/><description>Wowchemy</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 01 Jan 2023 00:00:00 +0000</lastBuildDate><image><url>https://drwenjielan.github.io/media/icon_hu7729264130191091259.png</url><title>Wowchemy</title><link>https://drwenjielan.github.io/tag/wowchemy/</link></image><item><title>Customer Credit Risk Prediction and Identification</title><link>https://drwenjielan.github.io/project/2023fintech/</link><pubDate>Sun, 01 Jan 2023 00:00:00 +0000</pubDate><guid>https://drwenjielan.github.io/project/2023fintech/</guid><description>&lt;p>&lt;strong>Competition:&lt;/strong> Third Sichuan University Financial Technology Modeling Competition
&lt;strong>Award:&lt;/strong> First Place
&lt;strong>Awarded by:&lt;/strong> The Education Department of Sichuan Province&lt;/p>
&lt;p>This project, presented for the &lt;strong>Third Sichuan University Financial Technology Modeling Competition&lt;/strong>, focuses on &lt;strong>customer credit risk prediction and identification&lt;/strong>. It emphasizes constructing a stable, high-performing binary classification model for credit risk management based on financial data.&lt;/p>
&lt;hr>
&lt;h2 id="key-highlights">&lt;strong>Key Highlights&lt;/strong>&lt;/h2>
&lt;h3 id="data-analysis-and-preparation">&lt;strong>Data Analysis and Preparation&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Data Overview&lt;/strong>:
&lt;ul>
&lt;li>The combined dataset includes &lt;strong>24,983 samples and 205 features&lt;/strong>. After preprocessing, 124 features remain, of which 33 are textual and roughly a third are date-related.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Cleaning&lt;/strong>:
&lt;ul>
&lt;li>Missing values:
&lt;ul>
&lt;li>Median imputation for continuous variables.&lt;/li>
&lt;li>Mode imputation for discrete variables.&lt;/li>
&lt;li>Filling with &lt;code>-99&lt;/code> for features with &amp;gt;95% missing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Encoding methods:
&lt;ul>
&lt;li>Count encoding for categorical features with fewer than 10 distinct values.&lt;/li>
&lt;li>WOE (weight-of-evidence) binning for categorical features with more than 10 distinct values.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Time features are extracted as hour/minute components or as intervals measured from the current day.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Feature Selection and Engineering&lt;/strong>:
&lt;ul>
&lt;li>Importance-ranked features selected via XGBoost.&lt;/li>
&lt;li>&lt;strong>Featuretools&lt;/strong> used to generate new feature combinations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
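&lt;p>The imputation and encoding rules above can be sketched as follows. This is a minimal illustration rather than the competition code: pandas is assumed, the column names in the test data are hypothetical, and the WOE-binning branch is only stubbed out.&lt;/p>

```python
import pandas as pd
import numpy as np

def clean_and_encode(df):
    """Sketch of the cleaning/encoding pipeline described above."""
    df = df.copy()
    for col in df.columns:
        miss_rate = df[col].isna().mean()
        if miss_rate > 0.95:
            # Near-empty columns: fill with the sentinel -99.
            df[col] = df[col].fillna(-99)
        elif pd.api.types.is_numeric_dtype(df[col]):
            # Continuous variables: median imputation.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Discrete variables: mode imputation.
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() in range(10):
            # Low cardinality (fewer than 10 levels): count encoding.
            df[col] = df[col].map(df[col].value_counts())
        else:
            # High cardinality: WOE binning would go here (e.g. via a
            # scorecard library); dropped in this sketch for brevity.
            df = df.drop(columns=[col])
    return df
```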
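&lt;p>Importance-ranked feature selection can be sketched like this. The project uses XGBoost; a scikit-learn ensemble and synthetic data are substituted here (the competition data is not public) so the sketch runs without extra dependencies.&lt;/p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the competition dataset is not public.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stand-in for the XGBoost model used in the project.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance and keep the top 10.
order = np.argsort(model.feature_importances_)[::-1]
top_features = order[:10]
X_selected = X[:, top_features]
```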
&lt;hr>
&lt;h3 id="model-building">&lt;strong>Model Building&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Architecture&lt;/strong>:
&lt;ul>
&lt;li>Three-layered stacking framework:
&lt;ul>
&lt;li>&lt;strong>Layer 1&lt;/strong>: Base models include CatBoost, LightGBM, XGBoost, and Random Forest.&lt;/li>
&lt;li>&lt;strong>Layer 2&lt;/strong>: Outputs from base models serve as inputs for four distinct sub-models.&lt;/li>
&lt;li>&lt;strong>Layer 3&lt;/strong>: Final predictions are generated through normalized weighted voting.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Model Optimization&lt;/strong>:
&lt;ul>
&lt;li>5-fold cross-validation and grid search are applied to optimize the hyperparameters of the base models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
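&lt;p>The three-layer architecture can be sketched with out-of-fold stacking. This is an assumption-laden illustration: scikit-learn models stand in for CatBoost, LightGBM, and XGBoost, the data is synthetic, and only two models per layer are shown.&lt;/p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Layer 1: base models (stand-ins for CatBoost/LightGBM/XGBoost/RF).
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Out-of-fold (5-fold) probabilities become the layer-2 feature matrix.
layer1 = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Layer 2: sub-models fitted on the stacked base-model outputs.
meta_models = [LogisticRegression(), GradientBoostingClassifier(random_state=0)]
layer2 = np.column_stack([
    cross_val_predict(m, layer1, y, cv=5, method="predict_proba")[:, 1]
    for m in meta_models
])

# Layer 3: normalized weighted voting over the sub-model outputs.
weights = np.array([0.5, 0.5])
final_score = layer2 @ (weights / weights.sum())
```

&lt;p>In the project each base model's hyperparameters are tuned beforehand with grid search over the same 5 folds (e.g. via &lt;code>GridSearchCV&lt;/code>); that step is omitted here for brevity.&lt;/p>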
&lt;hr>
&lt;h3 id="evaluation">&lt;strong>Evaluation&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Model performance is assessed using &lt;strong>AUC (Area Under the ROC Curve)&lt;/strong>:
&lt;ul>
&lt;li>Individual Models:
&lt;ul>
&lt;li>CatBoost: &lt;strong>0.8486&lt;/strong>&lt;/li>
&lt;li>LightGBM: &lt;strong>0.8476&lt;/strong>&lt;/li>
&lt;li>XGBoost: &lt;strong>0.8464&lt;/strong>&lt;/li>
&lt;li>Random Forest: &lt;strong>0.8423&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Stacking and Voting:
&lt;ul>
&lt;li>Stacking 1: &lt;strong>0.8523&lt;/strong>&lt;/li>
&lt;li>Stacking 2: &lt;strong>0.8530&lt;/strong>&lt;/li>
&lt;li>Voting: &lt;strong>0.8594&lt;/strong> (Best Performance)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
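&lt;p>One simple way to realize the normalized weighted vote is to weight each sub-model's probabilities by its validation AUC; this is an assumption for illustration, as the report does not publish the exact weights, and the prediction arrays below are hypothetical.&lt;/p>

```python
import numpy as np

# Validation AUCs reported above for the two stacking sub-models.
aucs = {"stacking_1": 0.8523, "stacking_2": 0.8530}

# Hypothetical predicted probabilities from each model for 4 customers.
preds = {
    "stacking_1": np.array([0.10, 0.80, 0.35, 0.60]),
    "stacking_2": np.array([0.15, 0.75, 0.40, 0.55]),
}

# Normalize the AUCs into weights that sum to 1, then blend.
total = sum(aucs.values())
weights = {name: auc / total for name, auc in aucs.items()}
blended = sum(weights[name] * preds[name] for name in preds)
```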
&lt;hr>
&lt;h3 id="credit-rating-system">&lt;strong>Credit Rating System&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Structure&lt;/strong>:
&lt;ul>
&lt;li>Customers are classified into &lt;strong>9 levels&lt;/strong> based on predicted risk, with clear distribution and distinguishable credit tiers.&lt;/li>
&lt;li>9.77% of users belong to the top levels (8 and above), reflecting the model&amp;rsquo;s discriminatory power.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Implementation&lt;/strong>:
&lt;ul>
&lt;li>Integrated with a web-based system using &lt;strong>Docker&lt;/strong> and &lt;strong>Vue.js&lt;/strong> for front-end services.&lt;/li>
&lt;li>Compared with FICO-style models, the system partitions customers into four zones:
&lt;ul>
&lt;li>Quality customers.&lt;/li>
&lt;li>Value exploration.&lt;/li>
&lt;li>Overestimation.&lt;/li>
&lt;li>Risk elimination.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
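&lt;p>Mapping predicted risk scores to a 9-level rating can be sketched with quantile binning. The report's actual cutoffs are not published, so equal-frequency bins on simulated scores are used purely for illustration; pandas is assumed.&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=1000)  # hypothetical predicted risk scores

# Bin customers into 9 credit levels (1 = worst, 9 = best) using
# equal-frequency (quantile) cutoffs as a stand-in for the real ones.
levels = pd.qcut(scores, q=9, labels=list(range(1, 10))).astype(int)

share_top = (levels >= 8).mean()  # share of customers in levels 8-9
```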
&lt;hr>
&lt;h2 id="conclusions-and-suggestions">&lt;strong>Conclusions and Suggestions&lt;/strong>&lt;/h2>
&lt;ul>
&lt;li>The project demonstrates strong modeling capabilities through effective stacking and feature engineering.&lt;/li>
&lt;li>The authors suggest refining the model for real-world applications and exploring the scalability of the approach for diverse datasets.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>This project presents a robust credit risk modeling framework with promising performance, practical implications, and room for further enhancement in financial technology applications.&lt;/p></description></item></channel></rss>