A publishing house needs to classify its corpus of 400,000 articles according to an industry-specific list of article types. This metadata is missing from the database, but data partners request it and it has the potential to optimize internal workflows.
Manually classifying all articles is not feasible, but a well-known discovery service has already indexed half of the publisher's article corpus. The service has also assigned article types, and the data is openly available through an API. Once this data has been retrieved, processed, and validated, article types can be assigned immediately to 200,000 of the 400,000 articles.
This labeled data is then used to train a support vector machine that assigns article types to the rest of the corpus. Such machine learning models may be mathematically complex to develop, but they are easy to apply and are available as open source software. However, care is needed in preparing the data and validating the model's results.
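The training and validation step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes scikit-learn, TF-IDF features over the article text, and a linear SVM, and the toy texts, the three article types, and the helper name `validate_then_fit` are all hypothetical. It holds out part of the labeled data to check the model, then refits on all labeled articles before predicting types for unlabeled ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the labeled half of the corpus.
labeled_texts = [
    "methods and results of a randomized experiment on soil samples",
    "we describe the methods and report results of the survey",
    "results obtained with new statistical methods are discussed",
    "study methods, measured results, and conclusions",
    "review of the new book by a well known author",
    "this book shows the author at her best",
    "the author of this book writes for general readers",
    "a short book in which the author revisits old themes",
    "editorial introducing this issue of the journal",
    "in this issue the editorial board welcomes new members",
    "editorial note on the special issue",
    "this editorial comments on the current issue and its themes",
]
labeled_types = (
    ["research-article"] * 4 + ["book-review"] * 4 + ["editorial"] * 4
)


def validate_then_fit(texts, types, test_size=0.25, seed=0):
    """Evaluate on a stratified hold-out split, then refit on all data."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, types, test_size=test_size, stratify=types, random_state=seed
    )
    # Hold-out model: trained only on the training split.
    holdout_model = make_pipeline(TfidfVectorizer(), LinearSVC())
    holdout_model.fit(x_train, y_train)
    report = classification_report(
        y_test, holdout_model.predict(x_test), zero_division=0
    )
    # Final model: trained on every labeled article.
    final_model = make_pipeline(TfidfVectorizer(), LinearSVC())
    final_model.fit(texts, types)
    return final_model, report


model, report = validate_then_fit(labeled_texts, labeled_types)
print(report)

# Hypothetical stand-in for the unlabeled half of the corpus.
unlabeled_texts = [
    "the author discusses her latest book",
    "methods and results of a field experiment",
]
predicted_types = model.predict(unlabeled_texts)
for text, article_type in zip(unlabeled_texts, predicted_types):
    print(f"{article_type}: {text}")
```

On a real corpus, the per-class precision and recall in the report would indicate which article types the model confuses, and therefore where the prepared data or the predictions need manual review.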
After completion of the project, all 400,000 articles have a proper article type in the publishing house's database. Besides the potential for optimizing internal workflows, the project was a step towards machine learning grounded in a future-oriented data culture.