The Kuaishou livestream analytics team is hiring a business data analysis intern; first- and second-year master's students and fourth-year undergraduates are preferred.
In addition, this position is also open via referral for experienced hires; interested candidates are welcome to apply or reach out with questions!
Working hours (hard requirement): full-time internship, on site 5 working days per week, for a continuous period of at least 6 months; outstanding performers may be offered a full-time position.
Location: Kuaishou headquarters, Xi'erqi.
Defining a product's core metrics is extremely important: a well-chosen core metric can effectively point the product in the right direction.
1. Traffic source: where visits to the site come from, for example users arriving from Zhihu or from Weibo. Mainly used to analyze the effectiveness of each promotion channel.
2. PV: PV (page view) is the number of page views or hits, i.e. how many times pages are loaded; every page load counts as one PV.
3. UV: UV (unique visitor) is the number of distinct visitors. Within a single day, a visitor with a given independent IP is counted only on their first visit to the site; further visits on the same day are not counted again. The PV/UV ratio reflects product stickiness to some extent: a higher ratio usually means higher stickiness.
4. IP count: the number of distinct IP addresses from which the site was visited within one day. The IP count may differ from UV (it can be larger, smaller, or equal).
5. DAU / MAU: daily active users (DAU) and monthly active users (MAU) reflect how active, and how sticky, the users of a site or app are.
6. Next-day / next-month retention: the share of users who come back the next day or the next month, reflecting the retention rate of the site or app.
7. User retention rate: the ratio of users who still qualify as valid users within a given period to the number of users actually acquired; also simply called user retention.
8. Conversion rate / churn rate: conversion rate measures the proportion of users who move from one step of a flow to the next. Churn rate is an equally important metric: churn rate = churned users / total users.
9. Bounce rate: the percentage of visits in which the user landed on the site and left after viewing only one page, out of all visits. The higher the bounce rate, the less engaging the site.
10. Exit rate: for a specific page, the percentage of visits that left the site from that page, out of all visits to that page. Bounce rate applies only to landing pages (the first page of a visit), whereas exit rate applies to any page from which a visit ends.
11. Time spent: how much time users spend in the product each day. For games or social products, longer usage generally means users like the product more; conversely, shorter usage time usually indicates weaker stickiness.
12. ARPU: Average Revenue Per User over a given period, ARPU = total revenue / number of users (see the short sketch after this list).
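Most of these metrics reduce to simple ratios over raw counts. Here is a minimal Python sketch, with made-up numbers purely for illustration:

```python
# Hypothetical daily counts, purely for illustration.
page_views = 120_000      # PV: total page loads
unique_visitors = 30_000  # UV: distinct visitors
churned_users = 1_500     # users lost during the period
total_users = 50_000      # user base for the period
total_revenue = 25_000.0  # revenue for the period

pv_uv_ratio = page_views / unique_visitors  # rough stickiness signal
churn_rate = churned_users / total_users    # churn rate = churned / total users
arpu = total_revenue / total_users          # ARPU = revenue / users

print(f"PV/UV: {pv_uv_ratio:.1f}  churn: {churn_rate:.1%}  ARPU: {arpu:.2f}")
```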
The rapid development of AI has taken machine learning to new heights. Here we take the opportunity to review the most popular machine learning (ML) and Python libraries of 2017, in the hope that you will find your go-to tools for the months ahead among them.
We couldn't make this list without reserving the top spot for a tool that was only released early this year, but has the power to affect the workflow of every Python developer, especially now that it has become the officially recommended tool on Python.org for managing dependencies!
Pipenv, originally started as a weekend project by the awesome Kenneth Reitz, aims to bring ideas from other package managers (such as npm or yarn) into the Python world. Forget about installing virtualenv, virtualenvwrapper, managing requirements.txt files and ensuring reproducibility with regards to versions of dependencies of the dependencies (read here for more info about this). With Pipenv, you specify all your dependencies in a Pipfile — which is normally built by using commands for adding, removing, or updating dependencies. The tool can generate a Pipfile.lock file, enabling your builds to be deterministic, helping you avoid those difficult to catch bugs because of some obscure dependency that you didn’t even think you needed.
Of course, Pipenv comes with many other perks and has great documentation, so make sure to check it out and start using it for all your Python projects, as we do at Tryolabs :)
If there is a library whose popularity has boomed this year, especially in the Deep Learning (DL) community, it’s PyTorch, the DL framework introduced by Facebook this year.
PyTorch builds on and improves the (once?) popular Torch framework, especially since it’s Python based — in contrast with Lua. Given how people have been switching to Python for doing data science in the last couple of years, this is an important step forward to make DL more accessible.
Most notably, PyTorch has become one of the go-to frameworks for many researchers because of its implementation of the novel Dynamic Computational Graph paradigm. When writing code with other frameworks like TensorFlow, CNTK or MXNet, one must first define something called a computational graph. This graph specifies all the operations that the code will run, and is later compiled and potentially optimized by the framework so that it can run even faster, in parallel on a GPU. This paradigm is called the Static Computational Graph, and it is great because you can leverage all sorts of optimizations and the graph, once built, can potentially run on different devices (since execution is separate from building). However, in many tasks such as Natural Language Processing the amount of "work" to do is often variable: you can resize images to a fixed resolution before feeding them to an algorithm, but you cannot do the same with sentences, which come in variable lengths. This is where PyTorch and dynamic graphs shine: since you can use standard Python control flow in your code, the graph is defined as it executes, giving you a lot of freedom that is essential for several tasks.
Of course, PyTorch also computes gradients for you (as you would expect from any modern DL framework), is very fast, and extensible, so why not give it a try?
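To make the dynamic-graph idea concrete, here is a minimal sketch using the 2017-era Variable API; the tensor sizes and the random loop length are made up for illustration:

```python
import random

import torch
from torch.autograd import Variable  # not needed in later PyTorch versions

x = Variable(torch.randn(5, 3), requires_grad=True)
W = Variable(torch.randn(3, 3), requires_grad=True)

# Ordinary Python control flow defines the graph as the code runs:
# the number of matrix multiplications can differ on every forward pass.
h = x
for _ in range(random.randint(1, 3)):
    h = h.mm(W).clamp(min=0)

loss = h.sum()
loss.backward()        # gradients flow through whatever graph was actually built
print(W.grad.size())
```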
It might sound crazy, but Facebook also released another great DL framework this year.
The original Caffe framework has been widely used for years, and is known for unparalleled performance and a battle-tested codebase. However, recent trends in DL made the framework stagnate in some directions. Caffe2 is the attempt to bring Caffe into the modern world.
It supports distributed training, deployment (even on mobile platforms), the newest CPUs and CUDA-capable hardware. While PyTorch may be better suited for research, Caffe2 is suitable for large-scale deployments, as seen at Facebook.
Also, check out the recent ONNX effort. You can build and train your models in PyTorch, while using Caffe2 for deployment! Isn’t that great?
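As a rough illustration of that workflow, here is a hedged sketch of exporting a PyTorch model to ONNX so that a Caffe2 backend can pick it up; the toy model and file name are made up:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# A toy model standing in for whatever you trained in PyTorch.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
dummy_input = Variable(torch.randn(1, 10))  # example input that traces the graph

# Export to the ONNX interchange format; Caffe2 (among other backends)
# can then load "model.onnx" for serving.
torch.onnx.export(model, dummy_input, "model.onnx", export_params=True)
```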
Last year, Arrow, a library that aims to make your life easier while working with datetimes in Python, made the list. This year, it is the turn of Pendulum.
One of Pendulum's strengths is that it is a drop-in replacement for Python's standard datetime class, so you can easily integrate it with your existing code and leverage its functionality only when you actually need it. The authors have taken special care to ensure timezones are handled correctly, making every instance timezone-aware and UTC by default. You will also get an extended timedelta to make datetime arithmetic easier.
Unlike other existing libraries, it strives to have an API with predictable behavior, so you know what to expect. If you are doing any non-trivial work involving datetimes, this will make you happier! Check out the docs for more.
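A minimal sketch of what that looks like in practice (the timezone names are just examples):

```python
import pendulum

now = pendulum.now("Europe/Paris")  # timezone-aware out of the box
later = now.add(days=2, hours=3)    # readable datetime arithmetic

print(later.in_timezone("UTC"))     # painless timezone conversion
print(later.diff_for_humans())      # e.g. "in 2 days"
```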
You are doing data science, for which you use the excellent available tools in the Python ecosystem like Pandas and scikit-learn. You use Jupyter Notebooks for your workflow, which is great for you and your colleagues. But how do you share the work with people who do not know how to use those tools? How do you build an interface so people can easily play around with the data, visualizing it in the process? It used to be the case that you needed a dedicated frontend team, knowledgeable in Javascript, for building these GUIs. Not anymore.
Dash, announced this year, is an open source library for building web applications, especially those that make good use of data visualization, in pure Python. It is built on top of Flask, Plotly.js and React, and provides abstractions that free you from having to learn those frameworks and let you become productive quickly. The apps are rendered in the browser and are responsive, so they are usable on mobile devices.
If you would like to know more about what is possible with Dash, the Gallery is a great place for some eye-candy.
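To give a feel for the API, here is a minimal, hypothetical Dash app using the 2017-era package layout; the layout contents are made up:

```python
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash(__name__)

# The whole UI is declared in Python; Dash renders it with React in the browser.
app.layout = html.Div([
    html.H1("Hello Dash"),
    dcc.Graph(
        id="example-chart",
        figure={"data": [{"x": [1, 2, 3], "y": [4, 1, 2], "type": "bar"}]},
    ),
])

if __name__ == "__main__":
    app.run_server(debug=True)
```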
There are many libraries in Python for doing data science and ML, but few are built specifically for data points that are metrics evolving over time (such as stock prices or measurements obtained from instruments).
PyFlux is an open source library in Python built specifically for working with time series. The study of time series is a subfield of statistics and econometrics, and its goals include describing how time series behave (in terms of latent components or features of interest) and predicting how they will behave in the future.
PyFlux allows for a probabilistic approach to time series modeling, and has implementations for several modern time series models like GARCH. Neat stuff.
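A hedged sketch of what fitting one of those models might look like; the synthetic returns series is made up for illustration, and the exact API may differ between PyFlux versions:

```python
import numpy as np
import pandas as pd
import pyflux as pf

# Synthetic daily returns standing in for a real financial time series.
returns = pd.Series(np.random.randn(500) * 0.01, name="returns")

model = pf.GARCH(data=returns, p=1, q=1)  # GARCH(1,1) volatility model
result = model.fit()                      # maximum likelihood estimation by default
result.summary()                          # parameter estimates and standard errors
```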
It is often the case that you need to make a Command Line Interface (CLI) for your project. Beyond the traditional argparse, Python has some great tools like click or docopt. Fire, announced by Google this year, has a different take on solving this same problem.
Fire is an open source library that can automatically generate a CLI for any Python project. The key here is automatically: you almost don’t need to write any code or docstrings to build your CLI! To do the job, you only need to call a Fire method and pass it whatever you want turned into a CLI: a function, an object, a class, a dictionary, or even pass no arguments at all (which will turn your entire code into a CLI).
Make sure to read the guide so you understand how it works with examples. Keep it on your radar, because this library can definitely save you a lot of time in the future.
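A minimal sketch; the greet function and its flags are made up for illustration:

```python
import fire

def greet(name="world", shout=False):
    """Return a greeting for `name`."""
    message = f"Hello, {name}!"
    return message.upper() if shout else message

if __name__ == "__main__":
    fire.Fire(greet)  # exposes e.g.: python greet.py --name=Ada --shout
```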
In an ideal world, we would have perfectly balanced datasets and we would all train models and be happy. Unfortunately, the real world is not like that, and certain tasks come with heavily imbalanced data. For example, when predicting fraud in credit card transactions, you would expect that the vast majority of the transactions (+99.9%?) are actually legit. Training ML algorithms naively will lead to dismal performance, so extra care is needed when working with these types of datasets.
Fortunately, this is a studied research problem and a variety of techniques exist. Imbalanced-learn is a Python package which offers implementations of some of those techniques, to make your life much easier. It is compatible with scikit-learn and is part of scikit-learn-contrib projects. Useful!
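A hedged sketch of one of those techniques, oversampling the minority class with SMOTE (newer imbalanced-learn versions use fit_resample, older ones fit_sample); the dataset is synthetic:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, heavily imbalanced binary classification data.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```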
When you need to search for some text and replace it with something else, as is standard in most data-cleaning work, you usually turn to regular expressions. They will get the job done, but sometimes the number of terms you need to search for is in the thousands, and then regular expressions can become painfully slow to use.
FlashText is a better alternative just for this purpose. In the author’s initial benchmark, it improved the runtime of the entire operation by a huge margin: from 5 days to 15 minutes. The beauty of FlashText is that the runtime is the same no matter how many search terms you have, in contrast with regexp in which the runtime will increase almost linearly with the number of terms.
FlashText is a testimony to the importance of the design of algorithms and data structures, showing that, even for simple problems, better algorithms can easily outdo even the fastest CPUs running naive implementations.
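A minimal sketch of the FlashText API; the keywords and sentence are just examples:

```python
from flashtext import KeywordProcessor

processor = KeywordProcessor()
processor.add_keyword("Big Apple", "New York")  # map "Big Apple" -> "New York"
processor.add_keyword("NCR", "Delhi")

text = "I love the Big Apple and the NCR region."
print(processor.replace_keywords(text))  # "I love the New York and the Delhi region."
print(processor.extract_keywords(text))  # ["New York", "Delhi"]
```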
Disclaimer: this library was built by Tryolabs’ R&D area.
Images are everywhere nowadays, and understanding their content can be critical for several applications. Thankfully, image processing techniques have advanced a lot, fueled by the advancements in DL.
Luminoth is an open source Python toolkit for computer vision, built using TensorFlow and Sonnet. Currently, it supports object detection out of the box, in the form of the Faster R-CNN and SSD models.
But Luminoth is not only an implementation of some particular models. It is built to be modular and extensible, so customizing the existing pieces or extending it with new models to tackle different problems should be straightforward, with as much code reuse as possible. It provides tools for easily doing the engineering work that is needed when building DL models: converting your data (in this case, images or videos) to an adequate format for your data pipeline (TensorFlow's tfrecords), doing data augmentation, running the training on one or multiple GPUs (distributed training will be a must when working with large datasets), running evaluation metrics, easily visualizing things in TensorBoard, and deploying your trained model with a simple API or browser interface so people can play around with it.
Moreover, Luminoth has straightforward integration with Google Cloud's ML Engine, so even if you don't own a powerful GPU, you can train in the cloud with a single command, just as you would on your own local machine.
If you are interested in learning more about Luminoth and the features of its latest version, you can read this blog post and watch the video of our talk at ODSC.
Bonus: watch out for these
You may have never heard of the libvips library. In that case, you should know that it is an image processing library, like Pillow or ImageMagick, and supports a wide range of formats. However, compared to other libraries, libvips is faster and uses less memory. For example, some benchmarks show it to be about 3x faster than ImageMagick while using roughly 15x less memory. You can read more about why libvips is nice here.
PyVips is a recently released Python binding for libvips, which is compatible with Python 2.7-3.6 (and even PyPy), easy to install with pip and drop-in compatible with the old binding, so if you are using that, you don’t have to modify your code.
If you are doing any sort of image processing in your app, this is definitely something to keep an eye on.
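A minimal sketch of the binding in action; the file names are placeholders:

```python
import pyvips

# Stream the image sequentially to keep memory usage low.
image = pyvips.Image.new_from_file("input.jpg", access="sequential")
thumb = image.resize(0.25)            # shrink to 25% in each dimension
thumb.write_to_file("thumbnail.jpg")
```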
Disclaimer: this library was built by Tryolabs.
Sometimes, you need to automate some actions on the web. Be it when scraping sites, doing application testing, or filling out web forms to perform actions on sites that do not expose an API, automation is always necessary. Python has the excellent Requests library, which allows you to perform some of this work, but unfortunately (or not?) many sites make heavy client-side use of Javascript. This means that the HTML code that Requests fetches, in which you could be trying to find a form to fill for your automation task, may not even contain the form itself! Instead, it will be something like an empty div that gets generated in the browser by a modern frontend library such as React or Vue.
One way to solve this is to reverse-engineer the requests that the Javascript code makes, which will mean many hours of debugging and fiddling around with (probably) uglified JS code. No thanks. Another option is to turn to libraries like Selenium, which let you programmatically interact with a web browser and run the Javascript code. With this, the problem goes away, but it is still much slower than using plain Requests, which adds very little overhead.
Wouldn’t it be cool if there was a library that let you start out with Requests and seamlessly switch to Selenium, only adding the overhead of a web browser when actually needing it? Meet Requestium, which acts as a drop-in replacement for Requests and does just that. It also integrates Parsel, so writing all those selectors for finding the elements in the page is much cleaner than it would otherwise be, and has helpers around common operations like clicking elements and making sure stuff is actually rendered in the DOM. Another time saver for your web automation projects!
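A hedged sketch of that workflow, based on Requestium's documented usage; the URLs, selectors and chromedriver path are placeholders:

```python
from requestium import Session

s = Session(webdriver_path="./chromedriver", browser="chrome", default_timeout=15)

# Fast, plain Requests-style call, with Parsel selectors on the response.
response = s.get("http://example.com")
title = response.xpath("//title/text()").extract_first()

# Only when JavaScript rendering is really needed, hand over to the browser.
s.transfer_session_cookies_to_driver()
s.driver.get("http://example.com/js-heavy-page")
s.driver.ensure_element_by_xpath("//button[@id='load-more']").click()
```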
You like the awesome API of scikit-learn, but need to do work using PyTorch? Worry not, skorch is a wrapper which will give PyTorch an interface like sklearn. If you are familiar with those libraries, the syntax should be straightforward and easy to understand. With skorch, you will get some code abstracted away, so you can focus more on the things that really matter, like doing your data science.
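A hedged sketch of what that looks like, loosely following skorch's own examples; the module and data below are made up:

```python
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

class SimpleNet(nn.Module):
    """A tiny classifier standing in for your real PyTorch module."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(20, 10)
        self.out = nn.Linear(10, 2)

    def forward(self, X):
        X = F.relu(self.dense(X))
        return F.softmax(self.out(X), dim=-1)

# Synthetic data; skorch expects float32 features and int64 labels here.
X = np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=1000).astype(np.int64)

net = NeuralNetClassifier(SimpleNet, max_epochs=5, lr=0.1)
net.fit(X, y)              # sklearn-style fit...
print(net.predict(X[:5]))  # ...and predict
```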
What an exciting year! If you know of a library that deserves to be on this list, make sure you mention it in the comments below. There are so many good developments that it’s hard to keep up. As usual, thanks to everybody in the community for such great work!
Finally, don't forget to subscribe to our newsletter so that you don't miss out on future editions of this post or our ML-related content.
I recently came across a comment piece in Nature, "Why it is not a 'failure' to leave academia", which discusses exactly the topic of changing careers after finishing a PhD and offers several very pertinent pieces of advice to doctoral students who are considering such a move.
I hold an engineering PhD myself. After graduating and returning to China, I eventually made my way, through several twists and turns, into data science work in the internet industry, and there was no shortage of hardship and setbacks along the road. Indeed, many people have told me that, having earned a PhD from a well-known university, it would be a pity not to do research or work in academia. What I want to say to them is this: as long as you can recognize your true interests in time, and can now and in the future do work you genuinely love, it is worth it no matter how high the cost or how hard the process. Besides, the academic training from my three-plus years of doctoral study and my experience abroad are among the most valuable assets of my life, and not wasted at all.
To quote the original article:
A PhD is highly valuable
This leaves us with one last aspect of the culture of failure and its effect on doctoral students and postdocs: the widespread misconception that a PhD is useful training only for academic research. Or, in other words, if you leave academia, your mum will think that you’ve wasted your time doing a PhD. You might even have wondered about that yourself.
We know that most PhD graduates eventually go on to other careers, but have they all wasted their time? Absolutely not. The skills you are acquiring (or have acquired) during a PhD are highly sought by employers beyond academic science. You are incredibly resilient, hard-working and motivated. You make decisions based on evidence, you can interpret data, you can communicate complex concepts clearly, you are an effective team player and you can prioritize tasks. And you have a degree to prove all of this.
You have every reason to be positive about your job prospects.
Personally, I won’t regret having done my PhD, regardless of my future career.
Problem description: during data analysis, requirement changes and similar situations often mean you have to add columns to a Hive table and then re-run scheduled jobs to backfill historical data (insert overwrite). After doing so, however, the newly added columns all show up as null when connecting from Tableau.
Fix (patch the Hive metastore directly). Step 1: in the metastore's SDS table, find the column-descriptor ID (CD_ID; all columns of a table share one CD_ID value) that was newly assigned after the columns were added, e.g.: SELECT * FROM SDS WHERE LOCATION LIKE '%table_name%'. Step 2: the SDS table shows both the newly assigned CD_ID and the old CD_ID still referenced by the existing partitions; update the old value to the new one, e.g.: UPDATE SDS SET CD_ID = <new CD_ID found above> WHERE CD_ID = <old CD_ID>.
Alternatively: drop the existing partitions and backfill the historical data again, or drop the table, recreate it, and then backfill.