The initial project file can be downloaded from [This Link](https://codern.org/qbox/download/V9umthkBlp/notebook.zip).

***

*AeroGen Dynamics* is one of the largest wind farm operators in the region. The heart of every wind turbine is a complex and expensive gearbox assembly called the **"G-78 Planetary Gearbox Assembly."** Sudden failure of this component can lead to complete turbine shutdown for weeks, repair costs reaching hundreds of thousands of dollars, and damage to other turbine components.

Until now, the company has used a fixed-schedule Preventive Maintenance strategy, which often results in the premature replacement of healthy parts and unnecessary costs. AeroGen Dynamics now intends to move toward **Predictive Maintenance** using data collected from SCADA monitoring systems.

You have access to anonymized operational data from the company's fleet of turbines. This data includes time-series readings from various sensors (such as temperature, vibration, and oil pressure), as well as the technical specifications of each turbine. Your goal is to build a machine learning model that can analyze a turbine's historical data and classify its **operational risk level** into one of the following five categories:

+ **Class 0 (Low Risk):** The turbine is in a healthy operational state.
+ **Class 1 (Early Warning):** Initial signals of wear are observed. Requires increased monitoring.
+ **Class 2 (Medium Risk):** Wear has reached a significant stage. Scheduling an inspection in the near future is recommended.
+ **Class 3 (High Risk):** Serious signs of failure are observed. Immediate inspection is required.
+ **Class 4 (Critical Risk):** Failure is imminent. The turbine must be taken offline immediately.

By accurately predicting risk, your model will help the company optimize repairs, maximize the lifespan of components, and prevent catastrophic downtime.
![Helping the Power Crisis](https://quera.org/qbox/download/imI6ecSa3j/Gemini_Generated_Image_d5kmbmd5kmbmd5km.png)

----------

## **Data Set Description**

The dataset provided is divided into three main sections: **Train, Validation, and Test**. Each section includes different data files, explained below.

**Key Points Regarding Operational Data:**

1. **Data Anonymization:** To protect trade secrets, the exact names and functions of sensors and features have been **anonymized**. You will encounter numerical and alphabetical identifiers instead of physical names. This means you must extract patterns directly from the data without prior domain knowledge.
2. **Histogram Data Format:** Part of the sensor data is presented in the form of a **histogram** instead of a single scalar value. Columns with the same numerical prefix (e.g., `166_0`, `166_1`, `166_2`,...) all belong to **a single sensor** and collectively form a histogram. Each column (`166_0`, `166_1`,...) represents a "**bin**", or range of values, for that sensor. This structure records the distribution of a sensor's behavior over a short time interval rather than a momentary value, offering much richer information about operational fluctuations and patterns.

Your target variable, the **risk classes 0 through 4**, is defined by the time interval between the latest sensor reading and the actual time of gearbox failure. This interval is measured in "Operational Time Steps":

+ **Class 0:** Reading is **more than 48** time steps before failure.
+ **Class 1:** Reading is **48 to 24** time steps before failure.
+ **Class 2:** Reading is **24 to 12** time steps before failure.
+ **Class 3:** Reading is **12 to 6** time steps before failure.
+ **Class 4:** Reading is **6 to 0** time steps before failure.

In the test set, you must predict one class label for each turbine. To train your model, you must be able to generate these labels for the training data.
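
The class thresholds above can be sketched as a small helper that maps a remaining-time value to a risk class. This is only an illustration: the function name `risk_class` is hypothetical, and because the stated ranges share their endpoints, the sketch assumes each upper bound is inclusive (for example, exactly 24 remaining time steps is treated as Class 2, and exactly 48 as Class 1). A different boundary convention may be the intended one.

```python
import numpy as np

def risk_class(remaining_steps):
    """Map remaining time steps before failure to a risk class (0 to 4).

    Assumption (not stated in the task): upper bounds are inclusive, i.e.
    (48, inf) -> 0, (24, 48] -> 1, (12, 24] -> 2, (6, 12] -> 3, [0, 6] -> 4.
    """
    r = np.asarray(remaining_steps, dtype=float)
    # With right=True, digitize returns i such that edges[i-1] < r <= edges[i],
    # so values <= 6 give 0 and values > 48 give 4.
    idx = np.digitize(r, bins=[6, 12, 24, 48], right=True)
    # Larger remaining time means lower risk, so invert the bin index.
    return 4 - idx

labels = risk_class([100, 48, 30, 24, 12, 7, 6, 0])
```

The helper works on a whole array at once, so it can be applied to a column of remaining-time values in a single call.
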

The file `train_time_to_event.csv` is the key to this task. It gives the total operation duration of each turbine (`length_of_study_time_step`) and whether the turbine failed during this period (`in_study_repair`). For turbines that failed, `length_of_study_time_step` is the exact moment of failure. By comparing the `time_step` of each sensor reading in `train_operational_data.csv` with this failure moment, you can calculate the "Remaining Time to Failure" for **each row** and assign the corresponding class label. Turbines that never failed are always in Class 0 (Low Risk).

<details class="blue">
<summary> **File Structure** </summary>

1. **Train Data Set:**
    + **`train_operational_data.csv`**: This is the main and largest file, containing the **complete** history of sensor readings over time for each turbine.
    + **`train_specifications.csv`**: This file contains the **static**, categorical features of each turbine, describing its technical specifications. In other words, it lists the components each turbine is built with. Each turbine has 7 main components, and this file shows the type of each one.
    + **`train_time_to_event.csv`**: This file provides the final information for each turbine: the total observed operational lifetime and whether it experienced failure during this period. This file is used to construct the target variable in the training set.
2. **Validation Data Set:**
    + **`validation_operational_data.csv`**: Unlike the training set, this file contains an **incomplete** history of operational data. For each turbine, the data is cut off at a random point in time to simulate a real-world prediction scenario.
    + **`validation_specifications.csv`**: Technical specifications for the turbines in the validation set.
    + **`validation_labels.csv`**: This file contains the true class label (0 to 4) for the **latest available reading** of each turbine in the validation set. You will use this file to evaluate and tune your model.
3. **Test Data Set:**
    + **`test_operational_data.csv`**: Similar to the validation set, this file contains an **incomplete** history of operational data for a new set of turbines.
    + **`test_specifications.csv`**: Technical specifications for the turbines in the test set.
    + **Your Final Output**: You must provide one output file with a single final prediction of `class_label` for **each turbine** in this set. Your final performance will be judged based on these predictions.

</details>

----------

## **Problem Evaluation**

Your model is evaluated with the following "Cost and Reward Matrix." Each row (prediction) receives a raw score from the matrix according to its actual and predicted class, and the Final Score is then obtained from the formula below.

| Actual Class | Predicted 0 (Healthy) | Predicted 1 (Warning) | Predicted 2 (Medium) | Predicted 3 (High) | Predicted 4 (Critical) |
|:-------------|:----------------------|:----------------------|:---------------------|:-------------------|:-----------------------|
| **0 (Healthy)** | **+2.5** | -2 | -4 | -8 | -12 |
| **1 (Warning)** | -15 | **+20** | -3 | -6 | -10 |
| **2 (Medium)** | -30 | -15 | **+40** | -5 | -8 |
| **3 (High)** | -50 | -30 | -15 | **+80** | -5 |
| **4 (Critical)** | -80 | -50 | -30 | -15 | **+150** |

**Final Score Calculation Formula:**

\[ \text{Final Score} = 100 \times \frac{\max\!\left(0, \text{Raw Score}\right)}{\text{Maximum Possible Score}} \]

----------

## **Answer Format**

Based on the **`test_operational_data.csv`** file, you must predict the latest status (that is, the risk class) for every *vehicle_id* present in the test dataset. Here, *vehicle_id* is the identifier of a turbine.

Your output must be a `submission.csv` file specifying the latest status of each turbine, with exactly one row per *vehicle_id*.

+ The columns must include `vehicle_id` and `class_label`. The final file must be sorted in ascending order of `vehicle_id`.

| *vehicle_id* | *class_label* |
|:------------:|:-------------:|
| 1 | ? |
| 6 | ? |
| ... | ... |
| 33638 | ? |

> Finally, **zip** the `submission.csv` along with the corresponding notebook and submit.

<details class="red">
<summary> **Attention** </summary>

The score you see during the competition is only the result of your model's evaluation on **30% of the test data**. After the competition ends, your **final score** will be calculated on the remaining 70%. This is done to discourage overfitting and reward generalization: models that are overfit to the visible portion will drop in the final scoring.

</details>
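
For local validation, the cost and reward matrix from the Problem Evaluation section can be turned into a scoring function. The following is a minimal sketch under two assumptions that the statement does not spell out: the Raw Score is taken to be the sum of the per-row matrix entries, and the Maximum Possible Score is taken to be the score of a perfect prediction on the same labels. The names `REWARD` and `final_score` are hypothetical.

```python
import numpy as np

# Rows are actual classes 0..4, columns are predicted classes 0..4.
REWARD = np.array([
    [  2.5,  -2,  -4,  -8, -12],
    [-15,    20,  -3,  -6, -10],
    [-30,   -15,  40,  -5,  -8],
    [-50,   -30, -15,  80,  -5],
    [-80,   -50, -30, -15, 150],
])

def final_score(y_true, y_pred):
    """Final score in [0, 100] for non-empty label arrays.

    Assumptions: Raw Score = sum of matrix entries over all rows;
    Maximum Possible Score = score obtained when every row is predicted correctly.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    raw = REWARD[y_true, y_pred].sum()   # look up (actual, predicted) per row
    best = REWARD[y_true, y_true].sum()  # diagonal entries for the true labels
    return 100.0 * max(0.0, raw) / best

perfect = final_score([0, 1, 4], [0, 1, 4])
```

Because missing a Class 4 turbine costs far more than a false alarm on a healthy one, a scorer like this can be used on `validation_labels.csv` to tune decision thresholds rather than relying on plain accuracy.
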

Help for the electricity crisis