Dr Kai Wang Data Scientist & Growth Hacker

Top 10 Most Popular Python Libraries for Machine Learning in 2017 (Repost)



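The rapid development of AI has pushed machine learning to its peak. Today we take the opportunity to review the most popular machine learning (ML) libraries of 2017, in the hope that you will find here your “weapon of choice” for the months ahead.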

1. Pipenv

We couldn’t make this list without reserving the top spot for a tool that was only released early this year but has the power to affect the workflow of every Python developer, all the more so now that it has become the officially recommended tool on Python.org for managing dependencies!

Pipenv, originally started as a weekend project by the awesome Kenneth Reitz, aims to bring ideas from other package managers (such as npm or yarn) into the Python world. Forget about installing virtualenv and virtualenvwrapper, managing requirements.txt files, and ensuring reproducibility with regard to the versions of your dependencies’ own dependencies (read here for more info about this). With Pipenv, you specify all your dependencies in a Pipfile, which is normally built by using commands for adding, removing, or updating dependencies. The tool can generate a Pipfile.lock file, which makes your builds deterministic and helps you avoid those difficult-to-catch bugs caused by some obscure dependency that you didn’t even think you needed.

Of course, Pipenv comes with many other perks and has great documentation, so make sure to check it out and start using it for all your Python projects, as we do at Tryolabs :)

2. PyTorch

If there is a library whose popularity has boomed this year, especially in the Deep Learning (DL) community, it’s PyTorch, the DL framework introduced by Facebook.

PyTorch builds on and improves the (once?) popular Torch framework, especially since it is based on Python rather than Lua. Given how people have been switching to Python for doing data science in the last couple of years, this is an important step forward to make DL more accessible.

Most notably, PyTorch has become one of the go-to frameworks for many researchers because of its implementation of the novel Dynamic Computational Graph paradigm. When writing code with other frameworks like TensorFlow, CNTK or MXNet, you must first define something called a computational graph. This graph specifies all the operations that your code will run; the framework later compiles and potentially optimizes them so that they run even faster, and in parallel on a GPU. This paradigm is called Static Computational Graph, and it is great because you can leverage all sorts of optimizations, and the graph, once built, can potentially run on different devices (since execution is separate from building).

However, in many tasks such as Natural Language Processing, the amount of “work” to do is often variable: you can resize images to a fixed resolution before feeding them to an algorithm, but you cannot do the same with sentences, which come in variable lengths. This is where PyTorch and dynamic graphs shine. Because you can use standard Python control-flow statements in your code, the graph is defined as it is executed, which gives you a freedom that is essential for several tasks.
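To make the define-by-run idea a bit more concrete, here is a toy sketch in plain Python. It is not PyTorch code and not how PyTorch is implemented; the Value class and the encode function are hypothetical names invented for this illustration. The point is only that the graph is recorded while an ordinary Python loop runs, so a longer input simply produces a longer graph.

```python
class Value:
    """A scalar that records the operations applied to it as they run."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents      # Values this one was computed from
        self._grad_fns = grad_fns    # maps the output gradient to each parent's share

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        """Propagate gradients through the graph recorded during the forward pass."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # parents are appended before the node itself

        visit(self)
        self.grad = 1.0
        # Reverse order: every node is processed before the nodes it was built from.
        for node in reversed(order):
            for parent, grad_fn in zip(node._parents, node._grad_fns):
                parent.grad += grad_fn(node.grad)


def encode(tokens, weight):
    """Fold a variable-length sequence with an ordinary Python loop."""
    state = Value(0.0)
    for token in tokens:  # the number of iterations depends on the input
        state = state * weight + Value(token)
    return state
```

Calling encode([1.0, 2.0, 3.0], w) and encode([1.0, 2.0], w) builds two graphs of different sizes without any special handling, and calling backward() on either result fills in w.grad for that particular graph.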

Of course, PyTorch also computes gradients for you (as you would expect from any modern DL framework), is very fast, and extensible, so why not give it a try?

3. Caffe2

It might sound crazy, but Facebook also released another great DL framework this year.

The original Caffe framework has been widely used for years and is known for its unparalleled performance and battle-tested codebase. However, recent trends in DL made the framework stagnate in some directions. Caffe2 is an attempt to bring Caffe to the modern world.

It supports distributed training, deployment (even on mobile platforms), and the newest CPUs and CUDA-capable hardware. While PyTorch may be better for research, Caffe2 is suitable for large-scale deployments, as seen at Facebook.

Also, check out the recent ONNX effort. You can build and train your models in PyTorch, while using Caffe2 for deployment! Isn’t that great?

4. Pendulum

Last year, Arrow, a library that aims to make your life easier while working with datetimes in Python, made the list. This year, it is the turn of Pendulum.

One of Pendulum’s strengths is that it is a drop-in replacement for Python’s standard datetime class, so you can easily integrate it with your existing code and use its features only when you actually need them. The authors have taken special care to ensure that timezones are handled correctly, making every instance timezone-aware and UTC by default. You also get an extended timedelta that makes datetime arithmetic easier.

Unlike other existing libraries, it strives to have an API with predictable behavior, so you know what to expect. If you are doing any non-trivial work involving datetimes, this will make you happier! Check out the docs for more.
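To see why timezone-aware instances by default are worth having, here is a small sketch that uses only the standard library (it deliberately does not call Pendulum’s API). It shows the pitfall with plain datetime objects: they are naive unless you attach a timezone yourself, and naive and aware values cannot be ordered against each other. The helper name can_compare is made up for this example.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2017, 12, 1, 12, 0)                       # no tzinfo attached
aware = datetime(2017, 12, 1, 12, 0, tzinfo=timezone.utc)  # explicitly UTC


def can_compare(a, b):
    """Return True if the two datetimes can be ordered without an error."""
    try:
        a < b
        return True
    except TypeError:
        # Raised when one value is naive and the other is timezone-aware.
        return False


# An aware datetime can be converted to another offset without changing the instant.
local = aware.astimezone(timezone(timedelta(hours=-3)))
```

Here can_compare(naive, aware) is False, while local still compares equal to aware even though its wall-clock hour differs. When every instance is aware from the start, this class of mix-up does not come up.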

5. Dash

You are doing data science, for which you use the excellent available tools in the Python ecosystem like Pandas and scikit-learn. You use Jupyter Notebooks for your workflow, which is great for you and your colleagues. But how do you share the work with people who do not know how to use those tools? How do you build an interface so people can easily play around with the data, visualizing it in the process? It used to be the case that you needed a dedicated frontend team, knowledgeable in JavaScript, for building these GUIs. Not anymore.

Dash, announced this year, is an open source library for building web applications, especially those that make good use of data visualization, in pure Python. It is built on top of Flask, Plotly.js and React, and provides abstractions that free you from having to learn those frameworks and let you become productive quickly. The apps are rendered in the browser and are responsive, so they are usable on mobile devices.

If you would like to know more about what is possible with Dash, the Gallery is a great place for some eye-candy.

6. PyFlux

There are many libraries in Python for doing data science and ML, but that is not the case when your data points are metrics that evolve over time (such as stock prices or measurements obtained from instruments).

PyFlux is an open source library in Python built specifically for working with time series. The study of time series is a subfield of statistics and econometrics, and its goals can be describing how time series behave (in terms of latent components or features of interest), and also predicting how they will behave in the future.

PyFlux allows for a probabilistic approach to time series modeling, and has implementations for several modern time series models like GARCH. Neat stuff.
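To give a feel for what a model like GARCH computes, here is a minimal sketch of the standard GARCH(1,1) conditional variance recursion, variance[t] = omega + alpha * residual[t-1]^2 + beta * variance[t-1], in plain Python. It does not use PyFlux’s API, and it only evaluates the recursion for given parameters rather than estimating them; the function name garch11_variance is made up for illustration.

```python
def garch11_variance(residuals, omega, alpha, beta, initial_variance):
    """Return the GARCH(1,1) conditional variance for each time step.

    variance[t] = omega + alpha * residuals[t - 1] ** 2 + beta * variance[t - 1]
    """
    variances = [initial_variance]
    for residual in residuals[:-1]:
        previous = variances[-1]
        variances.append(omega + alpha * residual ** 2 + beta * previous)
    return variances
```

For example, with the arbitrary values omega=0.1, alpha=0.5 and beta=0.25, a large residual at one step raises the variance predicted for the next step, which is the volatility clustering this family of models is designed to capture.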

7. Fire

It is often the case that you need to make a Command Line Interface (CLI) for your project. Beyond the traditional argparse, Python has some great tools like click or docopt. Fire, announced by Google this year, has a different take on solving this same problem.

Fire is an open source library that can automatically generate a CLI for any Python project. The key here is automatically: you almost don’t need to write any code or docstrings to build your CLI! To do the job, you only need to call a Fire method and pass it whatever you want turned into a CLI: a function, an object, a class, a dictionary, or even nothing at all (which will turn your entire code into a CLI).

Make sure to read the guide so you understand how it works with examples. Keep it on your radar, because this library can definitely save you a lot of time in the future.
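With Fire itself, the whole thing is a single call such as fire.Fire(greet). To show roughly what “automatically” means, here is a sketch of the underlying idea using only the standard library: inspect the function signature and derive the command line arguments from it. This is not Fire’s implementation, the helper names build_cli and run_cli are invented for the example, and it only copes with parameters whose defaults are simple values such as int, float or str.

```python
import argparse
import inspect


def build_cli(func):
    """Build an argparse parser whose arguments mirror the parameters of func."""
    parser = argparse.ArgumentParser(description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            parser.add_argument(name)  # no default: required positional argument
        else:
            # Has a default: optional flag, converted to the default's type.
            parser.add_argument(f"--{name}", default=param.default,
                                type=type(param.default))
    return parser


def run_cli(func, argv):
    """Parse argv and call func with the parsed values."""
    args = build_cli(func).parse_args(argv)
    return func(**vars(args))


def greet(name, times=1):
    """Return a greeting repeated a number of times."""
    return " ".join([f"Hello {name}!"] * times)
```

In a script you would typically pass sys.argv[1:] as argv, so that running it with the arguments World --times 2 calls greet("World", times=2).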

8. imbalanced-learn

In an ideal world, we would have perfectly balanced datasets and we would all train models and be happy. Unfortunately, the real world is not like that, and certain tasks involve very imbalanced data. For example, when predicting fraud in credit card transactions, you would expect that the vast majority of the transactions (99.9% or more?) are actually legit. Training ML algorithms naively will lead to dismal performance, so extra care is needed when working with these types of datasets.

Fortunately, this is a studied research problem and a variety of techniques exist. Imbalanced-learn is a Python package which offers implementations of some of those techniques, to make your life much easier. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects. Useful!
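One of the simplest of those techniques is random over-sampling: duplicate examples of the minority class until the classes are the same size. The sketch below shows the idea in plain Python. It is not imbalanced-learn’s code or API, and the function name random_oversample is made up for this example; in practice you would reach for the library’s own samplers.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels
```

Note that re-sampling like this should only be applied to the training split; the evaluation data should keep the real-world class proportions.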

9. FlashText

When you need to search for some text and replace it with something else, as is standard in most data-cleaning work, you usually turn to regular expressions. They will get the job done, but sometimes the number of terms you need to search for is in the thousands, and then regular expressions can become painfully slow.

FlashText is a better alternative just for this purpose. In the author’s initial benchmark, it improved the runtime of the entire operation by a huge margin: from 5 days to 15 minutes. The beauty of FlashText is that the runtime is the same no matter how many search terms you have, in contrast with regular expressions, whose runtime increases almost linearly with the number of terms.
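The reason the number of search terms stops mattering is easy to sketch: store the terms in a trie and walk the text once, following the trie as you go. The toy version below works on whitespace-separated words and ignores punctuation; it is not FlashText’s actual code or API, and the names build_trie and replace_keywords are invented for this illustration.

```python
_END = object()  # sentinel key marking the end of a complete keyword


def build_trie(replacements):
    """Build a word-level trie from a {keyword phrase: replacement} mapping."""
    trie = {}
    for phrase, replacement in replacements.items():
        node = trie
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[_END] = replacement
    return trie


def replace_keywords(text, trie):
    """Replace every keyword in a single left-to-right pass over the words."""
    words = text.split()
    result, i = [], 0
    while i < len(words):
        node, j, match = trie, i, None
        # Walk the trie as far as the text allows, remembering the longest match.
        while j < len(words) and words[j].lower() in node:
            node = node[words[j].lower()]
            j += 1
            if _END in node:
                match = (node[_END], j)
        if match:
            result.append(match[0])
            i = match[1]  # skip past the matched phrase
        else:
            result.append(words[i])
            i += 1
    return " ".join(result)
```

Each word costs a handful of dictionary lookups, bounded by the length of the longest keyword, so adding thousands more keywords makes the trie bigger but does not add work per word of input.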

FlashText is a testimony to the importance of the design of algorithms and data structures, showing that, even for simple problems, better algorithms can easily outdo even the fastest CPUs running naive implementations.

10. Luminoth

Disclaimer: this library was built by Tryolabs’ R&D area.

Images are everywhere nowadays, and understanding their content can be critical for several applications. Thankfully, image processing techniques have advanced a lot, fueled by the advancements in DL.

Luminoth is an open source Python toolkit for computer vision, built using TensorFlow and Sonnet. Currently, it supports object detection out of the box, in the form of models called Faster R-CNN and SSD.

But Luminoth is not only an implementation of some particular models. It is built to be modular and extensible, so customizing the existing pieces or extending it with new models to tackle different problems should be straightforward, with as much code reuse as possible. It provides tools for easily doing the engineering work that is needed when building DL models: converting your data (in this case, images or videos) to an adequate format for feeding your data pipeline (TensorFlow’s tfrecords), doing data augmentation, running the training on one or multiple GPUs (distributed training will be a must when working with large datasets), running evaluation metrics, easily visualizing stuff in TensorBoard and deploying your trained model with a simple API or browser interface, so people can play around with it.

Moreover, Luminoth has straightforward integration with Google Cloud’s ML Engine, so even if you don’t own a powerful GPU, you can train in the cloud with a single command, just as you do on your own local machine.

If you are interested in learning more about Luminoth and the features of its latest version, you can read this blog post and watch the video of our talk at ODSC.

Bonus: watch out for these

PyVips

You may have never heard of the libvips library. In that case, you should know that it’s an image processing library, like Pillow or ImageMagick, and that it supports a wide range of formats. However, compared to other libraries, libvips is faster and uses less memory. For example, some benchmarks show it to be about 3x faster than ImageMagick while using about 15x less memory. You can read more about why libvips is nice here.

PyVips is a recently released Python binding for libvips, which is compatible with Python 2.7-3.6 (and even PyPy), easy to install with pip and drop-in compatible with the old binding, so if you are using that, you don’t have to modify your code.

If you are doing some sort of image processing in your app, this is definitely something to keep an eye on.

Requestium

Disclaimer: this library was built by Tryolabs.

Sometimes, you need to automate some actions on the web. Be it when scraping sites, doing application testing, or filling out web forms to perform actions on sites that do not expose an API, automation is always necessary. Python has the excellent Requests library, which allows you to perform some of this work, but unfortunately (or not?) many sites make heavy client-side use of JavaScript. This means that the HTML code that Requests fetches, in which you could be trying to find a form to fill for your automation task, may not even have the form itself! Instead, it will be something like an empty div that will be populated in the browser by a modern frontend library such as React or Vue.

One way to solve this is to reverse-engineer the requests that the JavaScript code makes, which will mean many hours of debugging and fiddling around with (probably) uglified JS code. No thanks. Another option is to turn to libraries like Selenium, which allow you to programmatically interact with a web browser and run the JavaScript code. With this, the problems are no more, but it is still slower than using plain Requests, which adds very little overhead.

Wouldn’t it be cool if there was a library that let you start out with Requests and seamlessly switch to Selenium, only adding the overhead of a web browser when you actually need it? Meet Requestium, which acts as a drop-in replacement for Requests and does just that. It also integrates Parsel, so writing all those selectors for finding the elements in the page is much cleaner than it would otherwise be, and has helpers around common operations like clicking elements and making sure stuff is actually rendered in the DOM. Another time saver for your web automation projects!

skorch

You like the awesome API of scikit-learn, but need to do work using PyTorch? Worry not, skorch is a wrapper which will give PyTorch an interface like sklearn. If you are familiar with those libraries, the syntax should be straightforward and easy to understand. With skorch, you will get some code abstracted away, so you can focus more on the things that really matter, like doing your data science.

Conclusion

What an exciting year! If you know of a library that deserves to be on this list, make sure you mention it in the comments below. There are so many good developments that it’s hard to keep up. As usual, thanks to everybody in the community for such great work!

Finally, don’t forget to subscribe to our newsletter so that you don’t miss out on future editions of this post or our ML-related content.

Original article title: “top 10 python libraries of 2017”. Author: Alan Descoins. Blog: https://tryolabs.com/blog/authors/alan-descoins/

