The surge of data captured by today’s organizations requires data science tools to fully understand the information. We asked data scientists what tools they’re using.
Data and analytics provide the fuel for digital transformation and disruption. And the only way enterprises can make that fuel high-octane is if they arm their teams of statisticians, math gurus and business analytics experts with the right data science tools to squeeze insight out of the ever-growing pools of corporate data.
Whether they’re for straight statistical analysis, machine learning modeling or visualizations, a strong set of data science tools is essential for developing a data-driven business culture.
We recently caught up with a number of experienced data scientists across a range of industries to ask which tools they use the most. Here are the top five picks that came up over and over again.
1. Python

Not so much a distinct piece of software as a programmatic means of creating custom algorithms, Python is the go-to for many data scientists. In a recent KDnuggets analytics/data science software poll of 2,052 users, the language was cited as the top tool by 65.6% of respondents.
“We use Python both for data science and back end, which provides us with rapid development and machine learning model deployment,” said Alexander Osipenko, lead data scientist at Cindicator Inc. “It’s also of great importance for us to ensure the security of implemented tools.”
Katie Malone, who started out as a particle physicist before moving on to co-lead the data science research team at Civis Analytics Inc., said Python was her data science tool of choice as a physicist, and she has kept using it in the business world. For her, one of the big draws is the strong open source ecosystem surrounding Python, which has given her access to a wide variety of data science libraries for solving specific analytical problems.
“It’s just got a really, really vibrant community of open source folks who are using Python to solve interesting data science problems,” she said.
Leslie De Jesus, innovation director and lead data scientist at Wovenware, agreed. She depends on Python libraries quite a bit.
“[We use] Python libraries, including Scrapy, for web scraping and being able to extract data from the internet and upload it into a data frame for analysis,” De Jesus said. “And [we use] the Pandas and NumPy Python libraries for data analysis and matrix manipulation. They both help to create faster code, and NumPy allows for complex broadcasting functions.”
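De Jesus's point about broadcasting is worth making concrete. As a minimal sketch (the product and price data here are invented for illustration), NumPy broadcasting lets a single operation apply across an entire array, or between arrays of different shapes, without explicit loops, and pandas columnar operations build on the same machinery:

```python
import numpy as np
import pandas as pd

# Hypothetical scraped price data loaded into a data frame
df = pd.DataFrame({
    "product": ["widget", "gadget", "doohickey"],
    "price_usd": [19.99, 34.50, 7.25],
})

# Broadcasting a scalar: one exchange rate applies to the whole
# column, with no explicit loop
eur_rate = 0.92
df["price_eur"] = df["price_usd"] * eur_rate

# Broadcasting between shapes: a (3, 1) column of prices against a
# (4,) row of discount rates yields a full (3, 4) grid of sale prices
discounts = np.array([0.0, 0.10, 0.25, 0.50])
grid = df["price_usd"].to_numpy().reshape(-1, 1) * (1 - discounts)
print(grid.shape)  # (3, 4)
```

The same grid computed with nested loops would be slower and more verbose; broadcasting pushes the iteration into NumPy's compiled internals.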
Niranjan Krishnan, head of data science and innovation at Tiger Analytics, explained that the use cases for Python are pretty multifaceted.
“We have successfully deployed Python data science models for optimizing direct-to-customer marketing campaigns and life insurance underwriting and improving real-time bidding for online advertising,” Krishnan said.
The drawback, obviously, is that Python is code-based and requires a high level of programming and analytical skills to use.
“However, Knime and Alteryx are excellent menu-driven, low-code alternatives that can be used by citizen data scientists and business analysts, as well,” he said.
2. R

In a similar vein to Python, R is another programming language that many data science professionals depend on, though it is a little simpler and more purpose-built for data science. It ranked third in the KDnuggets poll, with 48.5% of the respondents listing it as one of the leading data science tools.
Malone from Civis Analytics said R has very sophisticated capabilities for machine learning and statistics, and it’s another frequent pick by those on her team in addition to Python.
“It depends on the context. We’re polyglots here, so we like them both,” she said. “R comes a little bit more from the kind of statistics and quantitative social sciences side.”
According to Jon Krohn, chief data scientist at Untapt Inc., R is his go-to tool for data exploration.
“I can quickly see summary stats like mean, median and quartiles; quickly create different graphs; and create test data sets, which can be easily shared and exported to CSV format,” he said.
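For readers who stay on the Python side of the fence, Krohn's exploratory workflow has a close pandas analogue. This is a sketch with made-up data, not his actual code:

```python
import pandas as pd

# Invented sample data standing in for a real data set
df = pd.DataFrame({
    "age": [23, 31, 37, 44, 52, 61],
    "income": [28000, 41000, 55000, 62000, 71000, 90000],
})

# One call yields mean, quartiles and min/max, much like R's summary()
stats = df.describe()
print(stats.loc["50%", "age"])  # the median age

# Export a shareable test data set, as with R's write.csv()
df.to_csv("test_data.csv", index=False)
```

As in R, the point is speed of iteration: a handful of lines gets you from raw data to summary statistics and a shareable CSV.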
3. Jupyter Notebook
For the sake of data visualization and data communication, many data science teams include Jupyter Notebook on their list of data science tools.
“Jupyter Notebook supports R and Python with great library support for data access and visualizations,” said Sofus Macskássy, vice president of data science at HackerRank. “This tool also enables teams to easily export workbooks for presentations and is becoming a standard in the data science field.”
Jupyter's support for the most popular data science libraries is a perk for Michael Golub, senior vice president of digital and analytics services at Anexinet. Golub explained that Jupyter is his team's favorite collaborative development environment.
“Jupyter Notebook is our go-to for collaborative data science project work and is also very useful when engaging in endeavors that require education,” Golub said.
In addition, Untapt’s Krohn said Jupyter Notebook is a great tool to prototype models interactively.
“At Untapt, we use Jupyter Notebooks to write prototype code, but also for printing out tables of data, summary metrics and charts,” he said.
4. Tableau

For crossing the chasm between hard data science teams and more business-focused analytics folks, Tableau Software can provide a good bridge.
“It is a fantastic tool for data scientists and noobs working on data science,” said Pooja Pandey, senior executive for SEO at Entersoft Security. “[It’s a] quick dashboarding tool to visualize insights and analytical data with a very short learning curve.”
The speed at which Tableau’s visualization and reporting functions can provide insights to a range of users has drawn praise.
“People spend much more time doing their thinking job than producing serial reports,” she said.
5. Keras

According to Wei Lin, chief data scientist for the CTO office at Hitachi Vantara, his most-used data science tools are Python, R and Keras. He uses Python and R for all the reasons mentioned above, and Keras for its deep learning capabilities.
“Keras is an open source neural network library written in Python to enable fast experimentation with deep neural networks, and [it] is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit or Theano,” Lin said.
Keras’ sweet spot is in high-dimensional pattern matching, he said.
“For example, image and natural language processing and supporting well-established deep learning analytic models, including convolutional neural networks and [long] short-term memory [networks],” he said.
According to Cindicator’s Osipenko, the big draw of Keras is that it’s a huge time-saver.
“The main criterion for adding a new tool is how much it can make your life as a data scientist easier. [An] example of this is Keras, an open-source, high-level wrapper that can dramatically speed up the process of developing neural networks,” he said. “Anyone who has written neural networks on TensorFlow will understand what I’m talking about. And even though Keras is not perfect, it can change the development process and make your code much more readable for other developers.”
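To make Osipenko's comparison concrete, here is a minimal sketch of a small feed-forward classifier in Keras. The layer sizes and the three-class output are arbitrary choices for illustration; the equivalent model written directly against low-level TensorFlow ops would take considerably more code:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small feed-forward classifier; sizes are arbitrary for illustration
model = keras.Sequential([
    keras.Input(shape=(20,)),            # 20 input features
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),  # three output classes
])

# A single compile call wires up the loss, optimizer and metrics
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()
```

From here, training is one `model.fit(x, y)` call. That brevity, and the readability Osipenko mentions, is the trade Keras offers in exchange for less fine-grained control than raw TensorFlow.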