Blue Yonder uses Python throughout its technology stack. We are well aware of how much we profit from open source, and it has always been our policy to give back to the community. We have open-sourced libraries such as turbodbc, sqlalchemy_exasol, Mesos Threshold Oversubscription, and tsfresh, and Blue Yonder employees such as Uwe Korn (Apache Arrow, Apache Parquet) and Stephan Erb (Apache Aurora) are key members of large open source projects. The next milestone of our open source commitment is supporting PyCon.DE 2017 and PyData Karlsruhe as a community sponsor. Over the last nine months, Blue Yonder employees Sebastian Neubauer and Peter Hoffmann have been co-organizers on the PyCon.DE team. After visiting the venue, the ZKM (Center for Art and Media) in Karlsruhe, during the opening of the Open Codes exhibition last Thursday, we are sure that this year's conference will be a unique one. If you want to learn more about Blue Yonder and its open source activities, join the talks by our engineers or contact us directly at the conference:
Connecting PyData to other Big Data Landscapes using Arrow and Parquet
Wednesday 25.10.2017, 14:50

Uwe Korn is a Data Scientist at Blue Yonder. His expertise is in building scalable architectures for machine learning services. Through his work on efficient data interchange, he became a core committer to the Apache Parquet and Apache Arrow projects.

Abstract: Python has a vast amount of libraries and tools in its machine learning and data analysis ecosystem, clearly competing with R for leadership in this space. Meanwhile, the world that sprang out of the Hadoop ecosystem has established itself in data engineering and also provides tools for distributed machine learning. As these stacks run in different environments and are mostly developed by distinct groups of people, using them together has been a pain. While Apache Parquet has already proven itself as the gold standard for exchanging DataFrames serialized to files, Apache Arrow has recently gained traction as the in-memory format for DataFrame exchange between different ecosystems. This talk will outline how Apache Parquet files can be used in Python and how they are structured to provide efficient DataFrame exchange. In addition to small code samples, this includes an explanation of some interesting details of the file format. The talk will also present the idea behind Apache Arrow, taking Apache Spark (2.3) as an example to showcase how performance increases once DataFrames can be shared efficiently between Python and JVM processes.
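To make the Parquet part concrete, here is a minimal sketch of how a pandas DataFrame can be written to and read from a Parquet file with pyarrow; the file name and column contents are illustrative, not taken from the talk:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small DataFrame and convert it to an Arrow table.
df = pd.DataFrame({"store": ["a", "b"], "sales": [102, 231]})
table = pa.Table.from_pandas(df)

# Serialize the table to a Parquet file and read it back.
pq.write_table(table, "sales.parquet")  # illustrative file name
restored = pq.read_table("sales.parquet").to_pandas()
print(restored)
```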
Turbodbc: Turbocharged database access for data scientists
Thursday 26.10.2017, 14:50

Michael König is a senior software engineer at Blue Yonder. He holds a PhD in physics, practices test-driven development, and digs Clean Code in C++ and Python.

Abstract: Python's database API 2.0 is well suited for transactional database workflows, but not so much for column-heavy data science. This talk explains how the ODBC-based turbodbc database module extends this API with first-class, efficient support for familiar NumPy and Apache Arrow data structures. It introduces the open source Python database module turbodbc, which uses standard ODBC drivers to connect with virtually any database and is a viable (and often faster) alternative to "native" Python drivers. Briefly recounting the painful story of how data scientists previously used our analytics database, I explain why turbodbc was created and what distinguishes it from other ODBC modules. Sketching the flow of data from databases via drivers and Python modules to consumable Python objects, I motivate a few extensions to the standard database API 2.0 that turbodbc has made. These extensions heavily use NumPy arrays and Apache Arrow tables to provide data scientists with familiar and efficient binary data structures they can work on further. I conclude my talk with benchmark results for a few databases.
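As a taste of the API extensions mentioned above, here is a hedged sketch of fetching a result set as NumPy arrays with turbodbc; the DSN, table, and column names are placeholders:

```python
from turbodbc import connect

# Assumes an ODBC data source named "analytics_db" is configured.
connection = connect(dsn="analytics_db")
cursor = connection.cursor()
cursor.execute("SELECT store, sales FROM daily_sales")  # placeholder query

# fetchallnumpy() is one of turbodbc's extensions to the database API 2.0:
# it returns the result set column-wise as NumPy masked arrays, avoiding
# the row-by-row overhead of the standard fetchall().
columns = cursor.fetchallnumpy()
print(columns["store"], columns["sales"])
```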
No Compromise: Use Ansible properly or stick to your scripts
Thursday 26.10.2017, 14:50

Bjoern Meier is a software engineer at Blue Yonder, or, more accurately, a DevOps engineer. He develops and operates, among other things, the services for external data interfaces, preprocessing, and data storage that enable the data scientists to run their prediction models.

Abstract: Ansible should help you orchestrate your systems, automate deployments, and set up well-defined infrastructures. But if you want to make something work quickly in Ansible, the chances are high that you fall back to shell/command tasks, the mother of all evil. Those tasks usually prevent you from running dry runs that would show the upcoming changes, and they keep Ansible from shining. So we went blindly into every deployment and hoped for the best. But we wanted to see what would change; we wanted to make ansible --check work again. In this talk I will show you what we did wrong and what we changed to get there.
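To illustrate the point about check mode, here is a minimal sketch contrasting the two styles of Ansible task; the package name is illustrative and not taken from the talk:

```yaml
# Fallback style: opaque to `ansible --check`, always reports "changed".
- name: install nginx (shell fallback)
  shell: apt-get install -y nginx

# Module style: idempotent, so dry runs can report what would change.
- name: install nginx (proper module)
  apt:
    name: nginx
    state: present
```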
Observing your applications with Sentry and Prometheus
Friday 27.10.2017, 11:20

Patrick Mühlbauer is a Software Engineer on Blue Yonder's Platform Team. He likes DevOps and enjoys instrumenting code to collect metrics and create nice and shiny Grafana dashboards.

Abstract: If you have services running in production, something will fail sooner or later. We cannot avoid this completely, but we can prepare for it. In this talk we will look at how Sentry and Prometheus can help to get better insight into our systems and quickly track down the cause of failure. When your services start to behave in a strange way, for example due to bugs introduced in a newly deployed release, you want to get informed about it as fast as possible, preferably by your own monitoring and not by one of your customers. Sentry is a real-time error tracking system that can notify you when exceptions occur in your application. Additionally, it provides lots of context so that crashes can be reproduced and fixed very quickly. Prometheus is a systems and service monitoring system that collects metrics from all kinds of targets. The collected metrics can help to get insight into what's actually going on in your services.
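As a rough illustration of combining the two tools, here is a minimal sketch using the raven client (the Sentry client library for Python at the time) and prometheus_client; the DSN, port, and metric name are assumptions, not from the talk:

```python
from prometheus_client import Counter, start_http_server
from raven import Client

sentry = Client("https://<key>@sentry.example.com/1")  # placeholder DSN
FAILURES = Counter("request_failures_total", "Number of failed requests")

start_http_server(8000)  # Prometheus can now scrape :8000/metrics

def handle_request():
    try:
        ...  # application logic would go here
    except Exception:
        FAILURES.inc()             # count the failure for Prometheus
        sentry.captureException()  # send traceback and context to Sentry
        raise
```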