| The initial project file can be downloaded from [This Link](https://codern.org/qbox/download/8kPk3LDPT9/initial.zip).|
| :--: |
HonkHonk is a successful ride-hailing company focused on New York City. This company has spent nearly a decade researching and improving the accuracy and logic of its trip pricing system. The result of these efforts is a dynamic pricing function with 3,500 lines of code.
The code for this function is highly confidential and sensitive. For this reason, only one copy of it was stored on a private server held by the CEO. The CEO kept this server under his desk to ensure everyone's peace of mind.
One day, while the CEO was drinking tea, an unfortunate incident occurred: the CEO's cup of tea spilled, burning the server and wiping out the data within it.
Now, to prevent the company from going bankrupt, the CEO has decided to build a new system to replace the previous pricing function. Therefore, he urgently needs your help to overcome this challenge.
Only the following sentence regarding how the previous pricing function worked is available:
> This function provided dynamic pricing. Trip prices were calculated based on temporal factors, location, weather, and more.
After much effort, the CEO has assembled a highly valuable dataset of HonkHonk trips in New York City over several consecutive months in 2016, each priced by the original pricing function, and has provided it to you. To evaluate the quality of your proposed system, the CEO has held back a portion of this dataset as a test set.
<details class="yellow">
<summary>**Dataset**</summary>
| Column Name | Description |
| :----------------------: | :-----------------------------: |
| `id` | A unique identifier for each trip. |
| `pickup_datetime` | Date and time the trip started. |
| `dropoff_datetime` | Date and time the trip ended. |
| `passenger_count` | Number of passengers in the vehicle. |
| `pickup_longitude` | Longitude of the trip origin location. |
| `pickup_latitude` | Latitude of the trip origin location. |
| `dropoff_longitude` | Longitude of the trip destination location. |
| `dropoff_latitude` | Latitude of the trip destination location. |
| `store_and_fwd_flag` | Indicates whether the trip information was stored in the vehicle's memory before being sent to the server. (Y/N) |
| `trip_duration` | Total trip duration in seconds. |
| `total_price` | **(Target Variable)** The final and total trip price in dollars (only available in `train.csv`). |
</details>
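As one possible starting point, the timestamp and coordinate columns in the table above can be turned into simple temporal and distance features. The sketch below is only illustrative: the feature names (`pickup_hour`, `pickup_weekday`, `haversine_km`) and the helper `add_basic_features` are hypothetical, and nothing here is known to match the original pricing function.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def add_basic_features(df):
    """Return a copy of df with hour, weekday, and great-circle distance columns."""
    out = df.copy()

    # Temporal features derived from the pickup timestamp.
    pickup = pd.to_datetime(out["pickup_datetime"])
    out["pickup_hour"] = pickup.dt.hour
    out["pickup_weekday"] = pickup.dt.dayofweek  # Monday = 0

    # Haversine (great-circle) distance between pickup and dropoff points.
    lat1 = np.radians(out["pickup_latitude"])
    lat2 = np.radians(out["dropoff_latitude"])
    dlat = lat2 - lat1
    dlon = np.radians(out["dropoff_longitude"] - out["pickup_longitude"])
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    out["haversine_km"] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return out


# Made-up sample row, for illustration only.
sample = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55"],
    "pickup_longitude": [-73.98],
    "pickup_latitude": [40.76],
    "dropoff_longitude": [-73.96],
    "dropoff_latitude": [40.77],
})
features = add_basic_features(sample)
```

Weather and other auxiliary data mentioned in the problem statement would have to be joined separately, for example on the pickup date.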
Your mission is to design a dynamic pricing system for HonkHonk using this dataset, along with programming techniques, artificial intelligence, machine learning, and by gathering necessary auxiliary data.
**Are you ready for this mission?!**
## Evaluation Metric
This scoring system compares your model’s error (RMSE) against the **Standard Deviation** of the true values (`std(Y_true)`).
The standard deviation measures the inherent variability of the true trip prices.
Therefore, a successful model must not only be accurate but also have an error that is small relative to these natural variations.
\[
\text{Score} = 100 \times \exp\left(-\frac{\text{RMSE}}{\text{std}(Y_{\text{true}})}\right)
\]
A score of 100 indicates a perfectly accurate prediction (zero error).
This formula works exponentially — meaning that models whose errors are much smaller than the natural fluctuations in the data receive high scores, while scores drop rapidly as the error increases.
<details class="red">
<summary>
**Attention**
</summary>
Throughout the competition, the score you see is only the result of evaluating your model on 30 percent of the test data. After the competition ends, your **final score** will be calculated on the remaining 70 percent. This is done to prevent overfitting and maintain the generality of the model, ensuring that models that have overfitted will drop in the final scoring.
</details>
## Submission Method
To solve this problem, open the notebook included in the initial project file and follow the steps it describes. Finally, after executing the answer-generating cell (the last cell of the notebook), submit the generated `result.zip` file.
<details class="red">
<summary>
**Important Warning**
</summary>
Note that you must save your changes to the notebook with `Ctrl+S` before executing the answer-generating cell; otherwise, your **score** will be set to **zero** at the end of the competition.
Also, if you use Colab to run this notebook file, download the latest version of your notebook and include it in the submitted file before sending the `result.zip` file.
</details>
The initial project file can be downloaded from [This Link](https://codern.org/qbox/download/V9umthkBlp/notebook.zip).
***
*AeroGen Dynamics* is one of the largest wind farm operators in the region. The heart of every wind turbine is a complex and expensive gearbox assembly called the **"G-78 Planetary Gearbox Assembly."** Sudden failure of this component can lead to complete turbine shutdown for weeks, repair costs reaching hundreds of thousands of dollars, and damage to other turbine components.
Until now, the company has utilized a fixed-schedule Preventive Maintenance strategy, which often results in the premature replacement of healthy parts and unnecessary costs. AeroGen Dynamics now intends to move toward **Predictive Maintenance** using data collected from SCADA monitoring systems.
You have access to anonymized operational data from the company's fleet of turbines. This data includes time-series readings from various sensors (such as temperature, vibration, oil pressure, etc.), as well as the technical specifications of each turbine. Your goal is to build a machine learning model that can analyze a turbine's historical data and classify its **operational risk level** into one of the following five categories:
+ **Class 0 (Low Risk):** The turbine is in a healthy operational state.
+ **Class 1 (Early Warning):** Initial signals of wear are observed. Requires increased monitoring.
+ **Class 2 (Medium Risk):** Wear has reached a significant stage. Scheduling an inspection in the near future is recommended.
+ **Class 3 (High Risk):** Serious signs of failure are observed. Immediate inspection is required.
+ **Class 4 (Critical Risk):** Failure is imminent. The turbine must be immediately taken offline.
Your model will help the company optimize repairs, maximize the lifespan of components, and prevent catastrophic downtime by accurately predicting risk.

----------
## **Data Set Description**
The dataset provided is divided into three main sections: **Train, Validation, and Test**. Each section includes different data files, explained below.
**Key Points Regarding Operational Data:**
1. **Data Anonymization:** To protect trade secrets, the exact names and functions of sensors and features have been **anonymized**. You will encounter numerical and alphabetical identifiers instead of physical names. This means you must extract patterns directly from the data without prior domain knowledge.
2. **Histogram Data Format:** Part of the sensor data is presented in the form of a **histogram** instead of a single scalar value. Columns with the same numerical prefix (e.g., `166_0`, `166_1`, `166_2`,...) all belong to **a single sensor** and collectively form a histogram. Each column (`166_0`, `166_1`,...) represents a "**bin**" or range of values for that sensor. This structure records the distribution of a sensor's behavior over a short time interval rather than a momentary value, offering much richer information about operational fluctuations and patterns.
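One way to work with the histogram columns is to group them by sensor prefix and convert each histogram to proportions. The sketch below assumes that histogram columns are named exactly `<digits>_<digits>` and that bin values are non-negative counts; neither is confirmed by the description, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

import pandas as pd

# Assumed naming pattern for histogram columns, e.g. "166_0".
HIST_COLUMN = re.compile(r"^(\d+)_(\d+)$")


def group_histogram_columns(columns):
    """Map each sensor prefix to its bin columns, ordered by bin index."""
    groups = defaultdict(list)
    for col in columns:
        match = HIST_COLUMN.match(str(col))
        if match:
            groups[match.group(1)].append((int(match.group(2)), col))
    return {sensor: [c for _, c in sorted(bins)] for sensor, bins in groups.items()}


def normalize_histograms(df, groups):
    """Divide each bin by its row total so every histogram sums to 1."""
    out = df.copy()
    for cols in groups.values():
        totals = out[cols].sum(axis=1)
        # Rows whose histogram is all zeros become NaN instead of dividing by zero.
        out[cols] = out[cols].div(totals.where(totals != 0), axis=0)
    return out


# Made-up readings, for illustration only.
readings = pd.DataFrame({
    "vehicle_id": [1, 1],
    "time_step": [0.0, 1.0],
    "166_0": [1.0, 0.0],
    "166_2": [2.0, 0.0],
    "166_1": [1.0, 0.0],
})
groups = group_histogram_columns(readings.columns)
normalized = normalize_histograms(readings, groups)
```

Normalizing separates the shape of a sensor's distribution from the total count, which may or may not be the right choice for a given sensor.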
Your target variable, the **risk classes 0 through 4**, is defined based on the time interval between the latest sensor reading and the actual time of gearbox failure. This interval is calculated based on an "Operational Time Step":
+ **Class 0:** Reading is **more than 48** time steps before failure.
+ **Class 1:** Reading is **48 to 24** time steps before failure.
+ **Class 2:** Reading is **24 to 12** time steps before failure.
+ **Class 3:** Reading is **12 to 6** time steps before failure.
+ **Class 4:** Reading is **6 to 0** time steps before failure.
In the test set, you must predict one class label for each turbine. To train your model, you must be able to generate these labels for the training data. The file `train_time_to_event.csv` is the key to this task. This file tells you the total operation duration (`length_of_study_time_step`) and whether the turbine failed during this period (`in_study_repair`). For turbines that failed, `length_of_study_time_step` is the exact moment of failure. By comparing the `time_step` of each sensor reading in `train_operational_data.csv` with this failure moment, you can calculate the "Remaining Time to Failure" for **each row** and assign the corresponding class label. Turbines that never failed are always in Class 0 (Low Risk).
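The labeling procedure described above can be sketched with pandas as follows. Several details are assumptions rather than facts from the statement: that both training files share a `vehicle_id` key column, that `in_study_repair == 1` marks a failed turbine, and that a boundary value (for example, exactly 24 steps remaining) belongs to the higher-risk class. The function name `label_readings` is hypothetical.

```python
import numpy as np
import pandas as pd


def label_readings(operational, time_to_event):
    """Attach remaining time to failure and a risk class to every sensor reading."""
    merged = operational.merge(
        time_to_event[["vehicle_id", "length_of_study_time_step", "in_study_repair"]],
        on="vehicle_id",
        how="left",
    )
    remaining = merged["length_of_study_time_step"] - merged["time_step"]

    # Assumed boundary handling: the first matching condition wins, so
    # 6 -> class 4, 12 -> class 3, 24 -> class 2, 48 -> class 1.
    conditions = [remaining <= 6, remaining <= 12, remaining <= 24, remaining <= 48]
    risk = np.select(conditions, [4, 3, 2, 1], default=0)

    # Turbines that never failed are always Class 0.
    failed = merged["in_study_repair"] == 1  # assumption: 1 means failure
    merged["remaining_time"] = remaining
    merged["class_label"] = np.where(failed, risk, 0)
    return merged


# Made-up example: turbine 1 fails at step 100, turbine 2 never fails.
operational = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 1, 1, 2],
    "time_step": [10, 60, 80, 90, 97, 49],
})
time_to_event = pd.DataFrame({
    "vehicle_id": [1, 2],
    "length_of_study_time_step": [100, 50],
    "in_study_repair": [1, 0],
})
labeled = label_readings(operational, time_to_event)
```

Because most readings of most turbines are far from failure, the resulting labels are likely to be heavily skewed toward Class 0, which is worth keeping in mind when training.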
<details class="blue">
<summary>
**File Structure**
</summary>
1. **Train Data Set:**
+ **`train_operational_data.csv`**: This is the main and largest file, containing the **complete** history of sensor readings over time for each turbine.
+ **`train_specifications.csv`**: This file contains the **static**, categorical features of each turbine, describing its technical specifications. Simply put, it lists the components each turbine is built with. Each turbine has 7 main components, and this file shows the type of each one.
+ **`train_time_to_event.csv`**: This file provides the final information for each turbine: the total observed operational lifetime and whether it experienced failure during this period. This file is used to construct the target variable in the training set.
2. **Validation Data Set:**
+ **`validation_operational_data.csv`**: Unlike the training set, this file contains an **incomplete** history of operational data. For each turbine, the data is cut off at a random point in time to simulate a real-world prediction scenario.
+ **`validation_specifications.csv`**: Technical specifications for the turbines in the validation set.
+ **`validation_labels.csv`**: This file contains the true class label (0 to 4) for the **latest available reading** of each turbine in the validation set. You will use this file to evaluate and tune your model.
3. **Test Data Set:**
+ **`test_operational_data.csv`**: Similar to the validation set, this file also contains an **incomplete** history of operational data for a new set of turbines.
+ **`test_specifications.csv`**: Technical specifications for the turbines in the test set.
+ **Your Final Output**: You must provide one output file with a single final prediction for `class_label` for **each turbine** in this set. Your final performance will be judged based on these predictions.
</details>
----------
## **Problem Evaluation**
To evaluate your model, we will use the following "Cost and Reward Matrix." A raw score is assigned to each row according to its actual and predicted class, and the Final Score is then obtained from the Raw Score using the formula below.
| Actual Class | Predicted 0 (Healthy) | Predicted 1 (Warning) | Predicted 2 (Medium) | Predicted 3 (High) | Predicted 4 (Critical) |
|:-------------|:----------------------|:----------------------|:---------------------|:-------------------|:-----------------------|
| **0 (Healthy)** | **+2.5** | -2 | -4 | -8 | -12 |
| **1 (Warning)** | -15 | **+20** | -3 | -6 | -10 |
| **2 (Medium)** | -30 | -15 | **+40** | -5 | -8 |
| **3 (High)** | -50 | -30 | -15 | **+80** | -5 |
| **4 (Critical)** | -80 | -50 | -30 | -15 | **+150** |
**Final Score Calculation Formula:**
\[
\text{Final Score} = 100 \times \frac{\max\left(0, \text{Raw Score}\right)}{\text{Maximum Possible Score}}
\]
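For local validation, the matrix and formula can be reproduced roughly as follows. This sketch makes two assumptions that the statement does not spell out: the Raw Score is the sum of the per-row matrix values, and the Maximum Possible Score is the Raw Score obtained when every row is predicted correctly. The function name `final_score` is hypothetical.

```python
import numpy as np

# Rows: actual class, columns: predicted class (values from the matrix above).
REWARD = np.array([
    [2.5, -2, -4, -8, -12],
    [-15, 20, -3, -6, -10],
    [-30, -15, 40, -5, -8],
    [-50, -30, -15, 80, -5],
    [-80, -50, -30, -15, 150],
], dtype=float)


def final_score(y_true, y_pred):
    """Final Score = 100 * max(0, raw) / max_possible."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Assumption: the Raw Score is the sum of the per-row matrix entries.
    raw = REWARD[y_true, y_pred].sum()
    # Assumption: the maximum is reached when every prediction is correct.
    max_possible = REWARD[y_true, y_true].sum()
    return 100.0 * max(0.0, raw) / max_possible


actual = [0, 1, 2, 3, 4]
all_correct = final_score(actual, actual)           # every reward collected
all_healthy = final_score(actual, [0, 0, 0, 0, 0])  # raw score is negative
```

The matrix is asymmetric: missing a critical turbine (actual 4, predicted 0) costs 80 points, while a false critical alarm on a healthy turbine costs 12, so plain accuracy is a poor proxy for this metric.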
----------
## **Answer Format**
Based on the **`test_operational_data.csv`** file, you must predict the latest status (that is, the risk class) of every turbine in the test dataset. Turbines are identified by the `vehicle_id` column.
Your output must be a `submission.csv` file specifying the latest status of each turbine. This means there should be only one row per `vehicle_id` in the `submission.csv` file.
+ The columns must include `vehicle_id` and `class_label`. The final file must be sorted in ascending order of `vehicle_id`.
| *vehicle_id* | *class_label* |
|:------------:|:-------------:|
| 1 | ? |
| 6 | ? |
| ... | ... |
| 33638 | ? |
> Finally, **zip** the `submission.csv` along with the corresponding notebook and submit.
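
If a model produces one prediction per sensor reading, the submission table can be derived by keeping the prediction for each turbine's latest reading. The sketch below assumes the test operational data has a `time_step` column like the training data; the helper `build_submission` and the input frame `row_predictions` are hypothetical.

```python
import pandas as pd


def build_submission(row_predictions):
    """Keep the prediction at the latest time_step for each vehicle_id."""
    latest = (
        row_predictions
        .sort_values(["vehicle_id", "time_step"])
        .drop_duplicates("vehicle_id", keep="last")
    )
    # One row per vehicle_id, sorted in ascending order as required.
    return (
        latest[["vehicle_id", "class_label"]]
        .sort_values("vehicle_id")
        .reset_index(drop=True)
    )


# Made-up per-reading predictions, for illustration only.
row_predictions = pd.DataFrame({
    "vehicle_id": [6, 1, 1, 6],
    "time_step": [1, 5, 2, 4],
    "class_label": [0, 3, 1, 2],
})
submission = build_submission(row_predictions)
```

The result can then be written with `submission.to_csv("submission.csv", index=False)` before zipping it with the notebook.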
<details class="red">
<summary>
**Attention**
</summary>
The score you see during the competition is only the result of your model's evaluation on **30% of the test data**. After the competition ends, your **final score** will be calculated on the remaining 70%. This is done to discourage overfitting and reward generalization: models that are overfit to the visible 30% will drop in the final scoring.
</details>
Help for the electricity crisis