Building Python-Based Topologies for Massive Processing of Social Media Data in Real Time

In this paper we propose a streaming approach for processing large volumes of data in real time. Catenae is a library for easily building and executing Python topologies (e.g., a web crawler or a classifier). Topologies are designed to be deployed in Docker containers, so horizontal scaling, granular resource assignment, and isolation are easy to achieve. Furthermore, each micromodule can have its own dependencies (including the Python version), and the user can limit resources such as CPU or memory per instance. We describe the implementation of a use case composed of two topologies: (1) a crawler for tracking users in social media and (2) an early risk detector of depression. We also explain how Catenae topologies can be connected to non-Python systems.

keywords: Social Media, Text Mining, Depression, Stream Processing, Real-Time Processing, Docker, Python