pydata-ldn2014-writeup
14 - Panel Discussion - Shouldn't companies be doing more data science?.ipynb
In [1]: %autosave 10
Autosaving every 10 seconds
What top-line value does data science give to companies?
Medical analysis using decision trees.
Helping engineering companies rent/reuse/extend lifetime of physical goods.
e.g. GE instrumenting engines to predict failure.
In fact GE bought 10% of Pivotal to help them extend this.
It doesn't take a large percentage increase in savings to make a massive difference.
It's hard to put an upfront percentage on savings; US companies are more willing than UK companies to try out data science without one.
Sometimes you need to push companies to have well-defined measures of success to prove that projects provide better analyses.
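One way to make a "well-defined measure of success" concrete is to agree on a metric and a minimum improvement before the project starts, then compare the new analysis against the existing approach on held-out data. The sketch below is only an illustration: the function names, the accuracy metric, the 0.05 threshold, and all of the data are hypothetical and were not discussed by the panel.

```python
def accuracy(predictions, actual):
    """Fraction of predictions that match the recorded outcomes."""
    if len(predictions) != len(actual):
        raise ValueError("predictions and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labelled example")
    hits = sum(p == a for p, a in zip(predictions, actual))
    return hits / len(actual)


def meets_success_measure(baseline_preds, project_preds, actual, min_gain=0.05):
    """Return (passed, gain): passed is True if the project beats the
    existing approach by at least min_gain in accuracy."""
    gain = accuracy(project_preds, actual) - accuracy(baseline_preds, actual)
    return gain >= min_gain, gain


# Hypothetical held-out outcomes (1 = engine failed, 0 = engine was fine).
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical predictions from the customer's current rule of thumb.
baseline_preds = [0, 0, 0, 1, 0, 0, 0, 1, 1, 0]
# Hypothetical predictions from the new data science project.
project_preds = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]

passed, gain = meets_success_measure(baseline_preds, project_preds, actual)
print(f"accuracy gain: {gain:.2f}, success measure met: {passed}")
```

Accuracy is used here only because it is easy to explain; in practice the agreed measure might be cost saved, failures caught, or something else the business already tracks.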
What is most underappreciated by clients about data science?
Pitching how accurate your results are and making sure clients know it does or doesn't meet their needs.
Best practices for software engineering (documentation, automated tests) often are ignored.
If you want to make impact you need buy-in from high level business side.
Sometimes in large companies, departments are resistant without high-level buy-in.
Scoping - allowing for an initial exploration stage with iterative feedback.
Customers often ignore caveats associated with results.
Customers are often deeply wedded to domain knowledge or heuristics.
If you show up with a fancy data-driven classifier customers will go "but what is this? what did you do?"
How is data gathered? Even before cleaning or feature engineering.
Crafting surveys to gather good data is important.
Customers see a number and ignore the prerequisite expertise required to get there.
Finance doesn't need motivation to be data driven; in fact, it couldn't work without data.
Really the big problem is poor tooling.
Excel and VBA are too widespread. No arguments there.
Maybe more controversially, MATLAB and R are also too widespread.
Visualisation and its role
It's important to start on day 1 with any picture, e.g. from matplotlib, so you can poke at it with the customer from the outset.
Or at least agree that it is the type of picture you want to get to.
Rule of thumb - customers are always impressed with a histogram of their own data.
They've never even bothered looking at it.
Simplicity is powerful.
Sketching the data pipeline (source, processing, output) is often useful.
Try to make it not just a set of numbers on a spreadsheet.
Give it form, allow customers to visualise operations.
They don't trust the raw output from a classifier. It's not auditable without expertise.
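The histogram rule of thumb above can be sketched in a few lines. This is a minimal illustration, not something shown at the panel: the data is synthetic, and the helper names `bin_counts` and `save_histogram` are hypothetical. The binning uses only the standard library so a text picture is available anywhere; `save_histogram` assumes matplotlib is installed and uses its standard `Axes.hist` call.

```python
import random


def bin_counts(values, n_bins=10):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def save_histogram(values, path, n_bins=10):
    """Save a histogram image to path; needs matplotlib, imported lazily."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=n_bins)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical stand-in for a customer's data: 1,000 synthetic measurements.
random.seed(0)
values = [random.gauss(50, 10) for _ in range(1000)]

# A text histogram is enough for a first conversation.
for i, count in enumerate(bin_counts(values)):
    print(f"bin {i:2d} | {'#' * (count // 10)}")
```

Calling `save_histogram(values, "histogram.png")` would write the same picture as an image to show the customer.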
Customers want magic APIs or methods. What would you wish for?
Clients often don't even have data. Their magic is "get me data".
Scraping etc.
80% of time is cleaning up data.
Even if there were a magic API, it would rest on dangerous assumptions!
Want a Dyson of data.
Blackbox classification solvers, even if they work, don't give enough confidence or auditable results.
Data munging, data cleaning
Suspect there's some automation of data cleaning and munging that is missing.
Not fully automatable but surely some work can be done.
But is there always too much domain knowledge required to automate data munging?
e.g. sentiment analysis of tweets. Not automatable because too much domain knowledge required.
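As a rough illustration of the part of data cleaning that can be automated, the sketch below trims whitespace, maps common missing-value markers to `None`, and parses numbers. Everything in it is an assumption made for the example: the marker list, the function names, and the sample rows are hypothetical, and the panel did not propose any specific tool.

```python
# Assumed spellings of "missing"; a real dataset will have its own.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "-"}


def clean_cell(raw):
    """Normalise one raw cell: trim whitespace, unify missing markers, parse numbers."""
    text = raw.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    try:
        # Assumes commas are thousands separators, e.g. "1,200" -> 1200.0
        return float(text.replace(",", ""))
    except ValueError:
        return text  # leave anything non-numeric for a human to interpret


def clean_rows(rows):
    """Apply clean_cell to every cell of every row."""
    return [[clean_cell(cell) for cell in row] for row in rows]


# Hypothetical messy export.
raw_rows = [
    [" Engine A ", "1,200", "N/A"],
    ["engine b", " 950 ", "ok"],
]
print(clean_rows(raw_rows))
```

Even this small sketch bakes in domain knowledge: treating the comma as a thousands separator would silently turn a European-style "1,5" into 15.0, which is the kind of assumption that limits how far the automation can go.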
What tool needs more exposure and education?
RStudio, although R-only, is very intuitive.
RShiny is great for quick, interactive, publishable results.
It would be great if an IPython Notebook were exportable to something like what RShiny produces.
Want to be able to hide code in an IPython Notebook.
Tools are inevitably domain- and organisation-specific, so integration work is necessary.
Software needs to be better engineered.
All software is subordinate to business requirements.
Questions
Part of Speech tagging?
TextBlob is a combination of nltk and pattern. Not yet at version 1, but good for playing.
word2vec is fast, heavily optimised C code, usable from Python.
Education vs Experience?
"Science to Data Science" - finishing school for numerate PhDs to become data scientists.
Do you miss academia and its freedom?
The only output in academia is papers, and it doesn't matter whether anyone reads them.
Now want to have an impact.
Yes, industry sacrifices freedom, but if you fight your corner you can get some back.
You can still write papers, but on your own initiative.
Academia's freedom can be illusory.
There still is structure. Some professor hates an idea so you will not be able to pursue it.
No software craftsmanship.