
Ask HN: Data Science portfolio project ideas - koots
Hello,

I'm looking for a couple of medium-size project ideas that are not just following tutorials.
======
natalyarostova
My two cents as someone who interviews tons of data scientists is that most
portfolio projects are way too easy, and amount to getting generally clean
data, then just calling some API from sklearn or tensorflow.

I'd like to see either more non-trivial software/coding skills in getting the
data and setting up a good data infrastructure or more depth on an innovative
science solution.

~~~
koots
This is exactly why I'm asking. So, since you are apparently senior in the
field, do you have any concrete recommendations?

~~~
natalyarostova
Sure. Let me start by saying that the data science interview/competency game
isn't common knowledge. That is to say, even within our own company, where we
are supposed to have standard guidelines, different organizations have
different perspectives. So I'm sure other data scientists may disagree with
me.

I have some trouble just giving you some full/rich idea, since there is a
whole world of possibility. However, I can share some heuristics with you that
you may find useful.

The first is, do you have any domain knowledge that would lend itself to a
data science project? This would be one step in differentiating your idea, and
allowing it to build off of existing ideas you have, as opposed to an off-the-
shelf classifier project from a data science project site. This could be
anything from biomedical data, to sports data, to market data etc. This will
let you highlight your ability to dive deep and apply data science tools to a
specific problem. Even if I'm interviewing someone who worked with medical
data before, their ability to do data research building off domain knowledge
is a strong signal that they will be able to do it again in a new domain.

The second is: can you get a semi-novel dataset? Even if it's just writing a
fully fledged Python script to scrape some APIs or (maybe) web pages, you want
something that shows that you hunted down data and wrangled it, as opposed to
downloading data_science_project.csv.

Once you get your data, try to think of a properly engineered way to store it.
A csv on your laptop isn't always bad, but familiarity with AWS/Azure APIs and
storing your data in the cloud in a 'nicer' format (e.g. Parquet), or if
necessary in a database, is a plus.

In your code, can you have a lightweight API to retrieve your data? Again, I'd
be looking for something that tells me you can get, store, and retrieve data
in line with best practices, so if you're hired and there is
messiness/challenges with data, you can manage it yourself rather than needing
an engineer to do all the work for you, and your job only starting when you
have a csv on your local machine.
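
To make the "get, store, retrieve" idea concrete, here is a minimal sketch using only Python's built-in sqlite3 module. The table name, columns, and function names (`connect`, `store`, `load`) are all made up for illustration; a real project might use Parquet files or a cloud database instead.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical schema: one row per scraped observation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS observations (
    source TEXT NOT NULL,
    day    TEXT NOT NULL,
    value  REAL NOT NULL,
    PRIMARY KEY (source, day)
)
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str, float]]) -> None:
    """Insert or overwrite rows, so re-running a scrape does not duplicate data."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO observations VALUES (?, ?, ?)", rows
        )


def load(conn: sqlite3.Connection, source: str, start: str, end: str) -> List[Tuple[str, float]]:
    """Return (day, value) pairs for one source between two ISO dates, inclusive."""
    cur = conn.execute(
        "SELECT day, value FROM observations "
        "WHERE source = ? AND day BETWEEN ? AND ? ORDER BY day",
        (source, start, end),
    )
    return cur.fetchall()
```

The analysis code then only ever calls `load(...)`, so swapping the storage backend later does not touch the modelling side.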

Once you have all this, can you thoughtfully try out some different
methodologies? As well as interesting exploratory data analysis? This part is
harder to give concrete recommendations on, but I'd like to see something that
considers the problem space, the data type, and chooses the right algorithm.
Then for the algorithm you chose, I'd like you to have a medium-depth
understanding of how it works under the hood. The bad case is you just get
some data, throw it at xgboost or a nnet, and say "well I read the API docs
and sorta know how they work."

(as a side note, try not to over-complicate the problem. Always do a simple
model as well as the exciting model you want to try, because exciting models
usually are hard to manage in production)
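
One way to keep the simple model in the picture is to score every model through the same held-out metric, starting with a trivial baseline. The sketch below is an illustration in plain Python, not a recommendation of these particular models; the function names are made up for the example.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Trivial baseline: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple model: one-variable least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a slope")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mae(predict, xs, ys):
    """Mean absolute error of a fitted predictor on held-out data."""
    return mean(abs(predict(x) - y) for x, y in zip(xs, ys))
```

Any fancier model can be dropped in as another `fit_*` function and compared with the same `mae` call; if it does not clearly beat the line, that is worth knowing.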

Lastly, put it on your github, and really highlight it on your resume or in
the interview. I often gloss over portfolio project bullet points on a resume,
but I'll always check a github if it exists.

Even if the project is half-baked or not as exciting as you want, having
concrete github code I can read is worth so so so much more than any coding
question I could ever ask.

Finally, my recommendation is for a data scientist generalist type. I do know
some data scientists who are extremely valuable, more valuable than I am, who
can't do any of that stuff. Usually they just work in a jupyter notebook using
data handed to them. In their case it tends to be because they are so
talented/trained in, say, deep learning, that they are most valuable to the
team when someone else does everything else for them, while they tweak hyper-parameters.

~~~
koots
Thank you very much. I really like your advice. I would like to build on my
domain knowledge. I'm a PhD student in cs focusing on algorithms.

------
minimaxir
My personal data science blog
([https://minimaxir.com/](https://minimaxir.com/)) was designed over the last
few years specifically as a portfolio piece to get a data science job by
creating advanced data visualization/analysis projects w/ code, and I was
eventually successful. (it has done well on HN when it pops up too)

~~~
logram
Just a question: why do you use the third person when describing yourself?
I've seen it in resumes and some formal settings, but never on a personal
homepage. Also I find it a bit odd.

------
fundamental
Consider making tutorials for publications that you're interested in, but are
nontrivial to read through and understand. Building a tutorial will give
yourself a deeper understanding of a problem, help you communicate that
understanding, and benefit the larger community.

~~~
koots
Thank you. Do you think this will be appreciated for a job interview? I think
it will take lots of time and effort for that to happen.

~~~
fundamental
Yes. Part of the job interview will be selling your knowledge through past
work. Spending the time to tutorialize content will make it easy and stress
free to talk about it to a general audience. Being able to communicate your
point clearly is fairly essential IMO.

As for whether they'll look at it themselves, I'm much more skeptical. The same
issue is present (checking at the resume stage or at the first filter stage)
with other portfolio pieces though.

------
avebear
I’ve had fun and built interesting projects based on harvesting tweets. As
some other comments suggest, data collection is an important and hard skill
that most tutorial projects ignore. If you can show 0) you came up with an
interesting question, 1) had the idea to get this data, 2) harvested the data
successfully, 3) formatted and cleaned it, and 4) ran appropriate algorithms
to look for the answer to your question, you have everything you need from a
portfolio project (even if the data doesn’t support your original
hypothesis!).

------
pella
OpenStreetMap (GIS, OpenData, Humanitarian, Visualisation)

You can import and analyse the OpenStreetMap data, and create some nice QA
reports for the community.

Arxiv:
[https://arxiv.org/search/?query=openstreetmap&searchtype=all...](https://arxiv.org/search/?query=openstreetmap&searchtype=all&source=header)

~~~
koots
I really find this interesting.

------
TBF-RnD
Perhaps give me a hand researching text input. I'm starting to gather a rather
large source of ideas to be implemented. Fun work that really allows you to think
outside of the box. It spans all the way from UI design down to system calls into
the OS via C, so there are a lot of areas to cover. Let me know by commenting if
you are interested.

~~~
koots
Please elaborate. What are you researching about text input?

~~~
TBF-RnD
This is the reasoning behind the project. In short I'd say that with VR around
the corner and touchscreens everywhere but no practical way of doing work as
such on smartphones or tablets, the way we look at text input should be
reconsidered. International input and accommodating disabled people are
other angles. tbf-rnd.life/blog/2019/05/21/hello-world

I'm working on compiling my findings into a book.
[http://tbf-rnd.life/the-book-of-lesser-known-input-methods/](http://tbf-rnd.life/the-book-of-lesser-known-input-methods/)

Along with this I am creating a framework that can be used to:

\- demonstrate

\- benchmark

\- provide common OS support

\- provide to the different models.

As such, say, the Dasher model can be compared more or less scientifically to,
say, the plain old keyboard method or, why not, chorded input.

So there are plenty of interesting concepts to dig into here, from UI design
in JavaScript to text analysis (a lot of word2vec and other fun stuff) and,
why not, 3D interfaces.

I do this as a portfolio of sorts since it demonstrates such a wide range of
knowledge.

Furthermore, as we are dealing with a keyboard, something that is
_always_ in use, it's really important to create a well-polished, fast method, so
it's not for the faint-hearted.

~~~
koots
I have actually done some work with doc2vec. What is in these text data? What
do you hope to extract?

------
tompazourek
I heard that a good Kaggle profile is a data science equivalent to a good
GitHub profile.

I haven’t tried it myself, and it looks more like smaller projects, but
someone might find it interesting.

[https://www.kaggle.com/](https://www.kaggle.com/)

------
usgroup
What's your background? And what sort of job are you looking to get from it?

If you want to be an actual scientist then do something that's actually
scientific: elaborate an experiment design, collect your own data, analyse it
and draw conclusions from it.

For example, what’s the relationship between crime in San Francisco and
Starbucks locations? How’s the relationship conditioned on the weather? Does
the size of the parking lot adjacent to Starbucks meaningfully affect crime
independent of location?

I’m a little biased but there are too many script kiddies. “Scientists” that
copy/paste scripts and “analyse” by calling APIs, and don’t know how anything
works. Data science à la Kaggle.

~~~
koots
I'm a PhD student in CS focusing on algorithms. I want to work as a software
engineer/data scientist.

------
Jugurtha
You want your portfolio to communicate that you are fluent in _data_ ,
_resourceful_ , and think broadly on what data _is_. More generally, you want
to communicate that you are valuable to the employer.

What is valuable is often rare. Some skills are common or are just the
baseline.

Peculiarities in people are less commoditized, and when these peculiarities
intersect with the activity domain of an organization, they become valuable.
When these peculiarities are deep enough and span across a broad range and the
intersection of that range and the organization's interests is quite large,
they become extremely valuable.

These peculiarities are often a result of lifestyle, interests, musing, and
wandering. They are often acquired over the years in the person's free time and
are not taught in class.

This reads like something new-agey like the saying that goes "Instead of
trying to paint a perfect picture, become perfect and just paint"

Now for more practical and less "general" speak... I'll have to bring personal
anecdotes which, by definition, are about my specific experience. The pronoun
"I" will be used too often for a regular post as a consequence. This serves as
an example of what I mean by the above.

The first project I was involved with when I joined my current company as an
Engineer was related to heart data. It was convenient that I had worked on
heart data before, read a lot of medical papers on the question, worked on
anomaly detection, was familiar with PhysioNet data and format but also had
worked on _local_ hospital data filled with chest-hair-sweat-and-motion noise
and went through the challenges it represented. I could give pointers to good
resources on the question to the team, knew health professionals and faculty I
was still in contact with, and personal friends who are medical doctors and
surgeons I could get insights from (thinking broadly about "data" not just as
in digital format and CSV, but network, friends, domain experts, insights
gleaned socializing).

Another project the company did was telecom subscriber churn prediction. I
was invited to a brainstorming with the team discussing data and interesting
features. One of them is standard of living and financial situation. I
insisted on getting USSD data from the telecom company in addition to CRM data
and surveys. When I was asked what it would tell us, I asked colleagues how
frequently they checked their phone balances as employees (with a source of
revenue) vs. how often they did as students. They all got the point: as
students, it wasn't obvious that you even had enough airtime to make a call or
send a text, so you sent a USSD request (free of charge) to see how much
airtime you had left (thinking about data from "human moves" perspective and
not forgetting the experience of being a broke student for feature
engineering). It helped the project that I had gone through some books on GSM
and CDMA networks (out of curiosity) and was more fluent in the data the telco
sent and their jargon. I could help the team with that, recommending reading
sources curated over a long time, insights from personal acquaintances in
different roles in the telecom domain (engineering, sales, marketing, etc.).

Another project the company did was a reservoir characterization project for
oil and gas. It happened that I had interned for the biggest oil services
company in that exact position, read several books on reservoir
characterization. I also had exposure to the hardware, the process, the
different players and their incentives and went to actual reservoir
characterization jobs (it paid to know about oil based muds, boreholes,
deviation, cuttings, etc.). It helped by sharing context with the team,
knowing what to look for, who to ask and what, where to get data, what domain
name was that. I also had friends working in that domain in different
geographic locations with different companies.

Another project I was in involved sound. My training was in EE so I had more
training in signal processing than the team and also had courses on acoustics.
I was able to help with pure signal processing and acoustics, resources to
bring someone up to speed, explanations, etc. I had interest and knowledge in
the source that was producing the sound. It helped in meetings with the client
because the sound source was very _peculiar_. The client was impressed because
they felt I knew more than an outsider should, given regulations and the
nature of the source. I was able to handle it safely and use it very
accurately to their surprise and to my employer's because I had never talked
about it. I also had access to people with _much more domain expertise than
the client organization_ giving extremely valuable insights on real world
condition and more interesting and frequent access to more diverse data
sources. When we had to build custom hardware and mics, it helped that I was
comfortable with a soldering iron, too.

When we did a project for a retail organization, it helped that I already was
primed because I had gone through their site, read their page source, knew
they were using schema.org ontology, knew how their site was structured,
already parsed their sitemap, built a scraper for that site and did all that
_before joining my employer_. Plus I had the code.

Another project was in banking, where I also had some experience because I had
gotten interested in earlier years in how banks work, wrote some code for parsing
transaction data, and also had friends in different banks and financial
institutions explaining things (again, data of another nature and from other
sources).

Another project was related to data from Programmable Logic Controllers, and
it helped that I had read a bunch on the question, tinkered with Siemens PLCs,
etc. (It also helped with one of our new hires, a student working on a
project relating to communication protocols for PLCs: he found out during the
interview that there was someone in the company who was also familiar with the
topic, giving pointers and adding value to his work. It helped make him work here.)

Other anecdotes involve visiting sites in Russian that were not translatable
(images instead of text content), and being unfazed and able to sort of get
around because I had tried learning Russian earlier. It wasn't much, but it
saved time and just the spirit of "whatever it takes" can be contagious. This
was a startup and just the boost in morale or _anything_ that removes or tames
obstacles helps.

Serendipity at its finest.

And last but not least, and at the risk of being tacky: being able to
communicate with people in writing, face to face, and on the phone is
enormously helpful. Having a certain "lifestyle", for lack of a better word,
that kept that sword sharp, helped a lot. Having been in sales as a
college student didn't hurt either.

The underlying message is: I think you can build a portfolio based on your
interests and I think it helps to cultivate your interests. I think it's nice
to be able to work on a Kaggle dataset with clean data in CSV format and
nicely labeled images, but it helps to think about data in more ways and keep
in mind that it's important to get things done and help others get them done,
in any way you can. Data is much more than CSV files and annotated images. The
questions to ask are:

\- How often do you think you get that kind of data (clean, ready, nicely
formatted, with client being responsive and supporting you)?

\- In which ways can you bring more value to your employer by helping get
things done, often drawing on your previous experience, work, and code in a
domain of interest?

\- How can you act as a lever for other team members?

\- How can you act as a bridge between stakeholders and do impedance matching
to increase effectiveness of the whole _system_?

\- How do you feel about "business" (basic econ, ops management,
marketing, accounting, etc.)? It helps transduce features/bug
fixes/refactoring to business terms stakeholders understand.

\- How can you move obstacles, even ones as small as a boulder?

Some things I have found useful:

\- Maintain a network of interesting and smart people in different domains
(physicians and physicists, chemists, poets, painters, musicians, engineers,
teachers, bankers)

\- Reading a lot about a lot.

\- Implementing stuff. Getting HTTP 429 and knowing what to do about it.
Experimenting. Documenting.

\- Sharing.

\- Helping others be better at what they do, do it better and more profitably.
Connecting people and wanting them to succeed.
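
As one possible reading of "getting HTTP 429 and knowing what to do about it": back off and retry, honoring the server's `Retry-After` header when it is given in seconds. This is a sketch using only the standard library; the function name and its parameters are made up for illustration, and the `opener` and `sleep` arguments exist only so the behavior can be tested without a network. It does not handle the date form of `Retry-After`.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_tries=5, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_tries):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only rate limiting is retried; give up on the last attempt.
            if err.code != 429 or attempt == max_tries - 1:
                raise
            headers = err.headers
            retry_after = headers.get("Retry-After") if headers is not None else None
            if retry_after is not None and retry_after.isdigit():
                delay = float(retry_after)          # server told us how long to wait
            else:
                delay = base_delay * 2 ** attempt   # 1s, 2s, 4s, ...
            sleep(delay)
    raise RuntimeError("max_tries must be at least 1")
```

A scraper would call `fetch_with_backoff(url)` in place of a bare `urlopen(url).read()`; adding random jitter to the delay is a common refinement when several workers hit the same API.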

Now, if I see that a candidate can _hustle_ , I'd be _very_ interested. I can
count on one finger such a candidate, and the kid was snatched faster than I
could get to him (and was snatched by an acquaintance working at a top
institution with a sorry-not-sorry)

~~~
methusala8
+1. This post was really useful and detailed. Thanks.

------
gajju3588
If you are interested in NLP, Entity detection/Classification in news articles
could be an interesting place to start.

Training Data : Wikipedia

~~~
koots
Thank you for your concrete suggestions.

------
edoceo
I've got some semi-clean data that needs crunching, part of a soon-to-be GPL
project. It's actually pretty plain, but it can give you stuff to blog about and
post on your GitHub. My handle at gmail.

