The task is to predict which age group a bank client belongs to, based on information about the client's spending. Training data (train) for building features and fitting models, and test data (test) for evaluating them, are provided. The data is specially prepared and anonymized, so models can be trained on it without compromising the security of real customer data. The solution to the problem is the set of model predictions on the test data.
To solve the problem, participants were provided with information about transactions of bank clients, amounting to about 27,000,000 records.
Each entry describes one banking transaction. For each of the ≈20,000 test IDs, participants were required to use a trained model to predict which age group the client would fall into.
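Because the data contains one row per transaction while a prediction is needed per client, a common first step is to aggregate the transactions into per-client features. The following is only a minimal sketch with pandas; the column names `client_id`, `category` and `amount` are hypothetical placeholders, not the actual schema of the competition files.

```python
import pandas as pd

# Toy transaction log; the column names are hypothetical placeholders.
transactions = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "category": ["food", "food", "fuel", "food", "travel"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# One row per client: number of transactions, total and mean spend.
features = transactions.groupby("client_id")["amount"].agg(
    n_transactions="count", total_spent="sum", mean_spent="mean"
)

# Spend per category as extra columns (0 where a client has no such transactions).
by_category = transactions.pivot_table(
    index="client_id", columns="category", values="amount",
    aggfunc="sum", fill_value=0,
)
features = features.join(by_category)
```

The resulting table has one row per client and can be fed to any classifier; which aggregates are actually useful depends on the real columns in the data.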
Two data sets were prepared: a training set (train) and a test set (test).
For each example in the test set, participants had to predict the age group to which the client belongs. A CSV file with predictions was submitted to the system for verification; it had to contain two columns:
The task is multi-class classification with 4 classes, labeled 0 to 3. The quality of a solution is measured by accuracy: the proportion of test examples for which the age group is predicted correctly.
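As an illustration of the metric, the sketch below computes accuracy for a handful of made-up labels. The helper name `accuracy` and the toy values are assumptions for the example and are not taken from the competition's checking system.

```python
def accuracy(y_true, y_pred):
    """Share of examples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for the four age groups (0..3); the values are made up.
y_true = [0, 1, 2, 3, 1, 0, 2, 3]
y_pred = [0, 1, 2, 0, 1, 3, 2, 3]
score = accuracy(y_true, y_pred)  # 6 of 8 predictions match, so 0.75
```

Accuracy treats every mistake the same way: predicting group 0 for a client in group 3 costs as much as predicting group 2.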
The most convenient tool for this problem is the Python programming language, since it has a large number of libraries for data analysis: NumPy, pandas, scikit-learn and others. The Jupyter interactive environment is used as the development tool.
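To show how these libraries fit together, here is a minimal sketch of training and validating a 4-class classifier with NumPy and scikit-learn. It is not the organizers' solution: the features and labels are synthetic random data, and the choice of `RandomForestClassifier` is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-client features and age-group labels (0..3).
X = rng.normal(size=(400, 5))
y = rng.integers(0, 4, size=400)

# Hold out a quarter of the labeled clients to estimate accuracy locally.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

valid_pred = model.predict(X_valid)
valid_accuracy = accuracy_score(y_valid, valid_pred)
```

Since the labels here are random, the validation accuracy carries no meaning; with real per-client features the same steps give a local estimate of the competition metric before submitting.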
Participants also had access to a basic example solution from the organizers in the form of a Jupyter notebook.