Before I became a data scientist, I was an academic researcher in the field of cognitive psychology, teaching as a lecturer (the equivalent of an assistant professor) at the University of Leicester. My research was in vision science, the study of how the brain processes visual information. The work involved a fair amount of research design and statistics, which I also taught to undergraduates, as well as the statistical modeling of visual attention. The models were based on Bayesian statistics, and I tested them with a technique known as Monte Carlo simulation.
In retrospect, this academic background gave me the foundation, though not all the necessary skills, to become a data scientist. Those foundations included statistics, modeling, experimental design, and coding, but also the visualization and presentation skills honed while teaching and giving academic talks. While the models I used are different from those used in data science and machine learning, there were concepts I could apply later, such as iterative fitting, feature selection, error metrics, and hyperparameter tuning. Some of the math I used in academia also applied later, in particular linear algebra, calculus, and mathematical notation. Finally, I had a long (if sporadic) history of coding in order to run my experiments and model simulations. The language was obscure (IDL, similar to MATLAB), but it gave me enough background to pick up Python and SQL later.
After eight years in the UK, I decided to leave academia and return to the US, specifically to California, where I grew up and where my parents still resided. After some background research into data science, I determined that I could make the transition. I started by taking Coursera courses in Python, R, and data science/machine learning. In particular, I highly recommend Andrew Ng’s machine learning course on Coursera (Ng is a cofounder of Coursera and one of the ‘fathers’ of AI/ML). It is challenging, especially if you are not familiar with linear algebra, but it gives a thorough grounding in the mathematical foundations of data science modeling.
After that, I began looking into data science fellowships and found startup.ml (now fellowship.ai). After being accepted, I worked on three main projects during my four months there.
In my first project, we developed a gradient-boosted tree model (XGBoost) to predict car battery life for a third-party client. The work included preparing and presenting preliminary versions of the model to the client, then delivering and supporting the finished version. It is worth noting that working with actual paying clients differs from other data science fellowships or bootcamps, which typically involve classes, coursework, and a final capstone project. This structure is obviously much more akin to the actual demands of a working data scientist.
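For the curious, here is a minimal sketch of this kind of regression model. The telemetry features and data are invented stand-ins for illustration; the client’s actual data and features are not shown.

```python
# A minimal sketch of a gradient-boosted regression model like the one
# described above. The features and data are hypothetical stand-ins.
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.uniform(0, 5, n),      # hypothetical: daily discharge cycles
    rng.uniform(-10, 40, n),   # hypothetical: operating temperature (C)
    rng.uniform(0, 10, n),     # hypothetical: battery age (years)
])
# Simulated battery life (years), with noise
y = 8.0 - 0.5 * X[:, 0] - 0.05 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("Test MAE (years):", mean_absolute_error(y_test, model.predict(X_test)))
```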
In the second project, we built a model to predict accidents in factory surveillance videos. We based this model on an existing video prediction model called PredNet (Lotter, Kreiman, & Cox, 2016), a convolutional LSTM network. The basic premise was that factory accidents are anomalous and unpredictable, and therefore a video prediction model would show higher prediction error during an accident. After building the model, I tested the validity of this premise with factory accident videos from YouTube. My analysis suggested ‘proof of concept’: the model’s prediction errors were indeed higher during the accidents than before them.
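PredNet itself is a substantial network, but the anomaly-scoring logic on top of it is simple. Here is a minimal sketch, assuming a trained next-frame predictor (`predict_next_frame` below is a stand-in for such a model, not an implementation of it):

```python
# A minimal sketch of prediction-error anomaly scoring. Any trained
# next-frame video model (such as PredNet) could play the role of
# `predict_next_frame`, which is assumed here rather than implemented.
import numpy as np

def anomaly_scores(frames, predict_next_frame):
    """Mean squared prediction error for each frame transition."""
    scores = []
    for t in range(len(frames) - 1):
        predicted = predict_next_frame(frames[t])
        scores.append(np.mean((predicted - frames[t + 1]) ** 2))
    return np.array(scores)

def flag_anomalies(scores, n_sigma=3.0):
    """Flag transitions whose error exceeds mean + n_sigma * std."""
    threshold = scores.mean() + n_sigma * scores.std()
    return np.where(scores > threshold)[0]
```

An accident should then show up as a run of flagged frames, while ordinary, predictable footage stays below the threshold.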
The third project with the fellowship explored several aspects of cybersecurity. In one strand, we looked at the use of pastebins to transmit fraudulent data, such as stolen credit card numbers. Pastebins are anonymous websites housing only text-based data. Fraudulent information could be posted to a pastebin and the URL sent to the receiving party, after which the information could be removed from the site.
Even though the URL is technically public, only the sender and receiver would know the URL containing the information. The analogy would be spies passing information at a highly public location, such as under a park bench: only the spies would know to look there. In another cybersecurity project, we explored the use of honeypots to protect servers against malicious attacks. A ‘honeypot’ is a simulated server meant to divert attacks away from the actual server. How well the honeypot simulates the server often, but not always, determines its efficacy. Finally, we explored adversarial reinforcement learning, in which two agents, a malicious attacker and a defending server, learn optimal strategies in competition with each other.
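To give a flavor of that setup, here is a toy sketch of two agents co-adapting via independent Q-learning. The game, payoffs, and parameters are all invented for illustration: the attacker picks a server to attack, the defender picks where to place its honeypot, and the defender wins when the attacker walks into it.

```python
# A toy sketch of adversarial reinforcement learning on an invented
# zero-sum game: the attacker and defender each learn action values
# with independent Q-learning, adapting to the other's strategy.
import numpy as np

rng = np.random.default_rng(0)
n_servers, episodes, alpha, epsilon = 4, 20000, 0.05, 0.1
q_attacker = np.zeros(n_servers)  # attacker's value of attacking each server
q_defender = np.zeros(n_servers)  # defender's value of trapping each server

def choose(q):
    # Epsilon-greedy: mostly exploit the best-valued action, sometimes explore
    return rng.integers(n_servers) if rng.random() < epsilon else int(np.argmax(q))

for _ in range(episodes):
    attack, trap = choose(q_attacker), choose(q_defender)
    # Zero-sum payoff: the attacker loses if it hits the honeypot
    r = -1.0 if attack == trap else 1.0
    q_attacker[attack] += alpha * (r - q_attacker[attack])
    q_defender[trap] += alpha * (-r - q_defender[trap])
```

Because the game is zero-sum, neither agent can settle on a single best server; each agent’s improvement forces the other to adapt, which is the competitive dynamic described above.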
After finishing the fellowship, I landed a data science position at Gwynnie Bee, now CaaStle, where I have been for the past three years. Gwynnie Bee/CaaStle is a midsized startup that operates a rental subscription service for clothing. For a monthly fee, subscribers can rent and exchange garments chosen on our websites, with free shipping, laundering, and an unlimited number of exchanges. Gwynnie Bee is our owned and operated service, and we also provide a platform (Clothing As A Service, or CaaS) for retailers to operate rental services. Client retailers include Ann Taylor, Rebecca Taylor, Express, and Ralph Lauren.
While at CaaStle, I have worked on a diverse range of projects within the company.
I have been involved in segmenting our members into clusters (personas) to better serve their preferences in choosing garments. This included designing a style quiz for newly acquired members (to address the cold-start problem), performing the cluster analysis itself, and analyzing how members’ garment choices correlated across garment features (such as type of clothing) over their first month of membership.
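Here is a minimal sketch of the clustering step, assuming members are represented by numeric preference features (the features here are hypothetical stand-ins for quiz answers and early choice behavior):

```python
# A minimal persona-clustering sketch with hypothetical member features
# (e.g., style-quiz answers, first-month garment-choice rates).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
members = rng.random((500, 6))  # 500 members, 6 invented preference features

X = StandardScaler().fit_transform(members)   # put features on one scale
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
personas = kmeans.labels_                     # persona assignment per member
centers = kmeans.cluster_centers_             # interpretable as persona profiles
```

In practice, most of the work lies in choosing the number of clusters and interpreting the centers as recognizable personas.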
I have also been involved in the analysis of garment images to extract garment features, such as color and pattern. For this work I was happy to receive a patent, and it was doubly rewarding to work on a topic so closely related to my previous academic research.
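To give a flavor of color extraction, one common generic approach (illustrated below, and not the patented method itself) clusters an image’s pixels and reads off the dominant cluster centers:

```python
# A minimal sketch of dominant-color extraction by clustering pixels.
# This is a generic technique for illustration, not the patented method.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def dominant_colors(image_path, k=3):
    """Return the k dominant RGB colors, largest cluster first."""
    pixels = np.asarray(Image.open(image_path).convert("RGB")).reshape(-1, 3)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(kmeans.labels_, minlength=k)
    return kmeans.cluster_centers_[np.argsort(counts)[::-1]].astype(int)
```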
Lastly, I’ll mention the work I’ve done recently with our distribution centers. After laundering, the garments are inspected to ensure that they are up to standard before being shipped. Because these are rental garments, they must be inspected for wear and tear, and thus much more rigorously than standard retail apparel. I participated in revamping the training program for the inspectors, as well as in analyzing and modeling the different types of garment features that might lead to inspection failure.
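That failure modeling can be framed as binary classification. Here is a minimal sketch with invented features and simulated labels; the actual features and data are not shown:

```python
# A minimal sketch of inspection-failure modeling as binary classification.
# Features and labels are simulated stand-ins for the real data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.poisson(5, n),       # hypothetical: number of previous wears
    rng.random(n),           # hypothetical: fabric delicacy score
    rng.integers(0, 2, n),   # hypothetical: has embellishments
])
logits = -3.0 + 0.2 * X[:, 0] + 1.5 * X[:, 1] + 0.8 * X[:, 2]
y = rng.random(n) < 1 / (1 + np.exp(-logits))  # simulated pass/fail labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```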
During my time at Gwynnie Bee/CaaStle, some of the work has been more aligned with machine learning, such as decision trees, cluster analysis, and convolutional neural networks, but a substantial component has been in the realm of data analytics and insight. As such, both my previous experience as a researcher and my more recent training in data science/machine learning have shaped my development. I have also learned that visualization and presentation skills are crucial, as I have often had to present to teams that do not necessarily have technical backgrounds (for example, the distribution center and the merchandising team responsible for buying the garments).
I am sure that data science/machine learning positions can be more specialized, especially at larger companies. However, I suspect that my experience, with a broad and diverse set of challenges across data science, machine learning, analytics, and insight, is not uncommon. Data science is itself an extremely broad and diverse discipline.