Reuben Pereira of CARPROOF on the world of Data Science and just doing what excites you
Reuben Pereira is the Head of Data Science at CARPROOF, a unit of IHS Markit (Nasdaq: INFO), which is Canada’s definitive source of automotive information, delivering vehicle history, appraisal and valuation. Reuben is a seasoned Data Scientist with a background in Biostatistics. He is a passionate Data Science Evangelist who, in his spare time, participates in Hackathons across Toronto and teaches at the digital skills school BrainStation in Toronto.
Looking back five or ten years, where did you think you’d be by this point in your career? How did you get here?
This may come as a surprise, but I started my undergraduate degree in International Relations. I quickly learned that I did not enjoy the essay-heavy coursework, but loved the quantitative statistics and economics courses. This made me realize that I should play to my strengths and switch to a more numbers-heavy discipline, so I moved into the statistics specialist program. Making that choice eventually led me to pursue a Master’s in Biostatistics, a branch of applied statistics that focuses on health-related analysis. The program had a lengthy practicum which allowed me to get real-world experience early on. I worked on the research and implementation of spatial predictive models for accurately estimating the risk and spatial distribution of a variety of infectious diseases. After graduating, I joined a company called Real Matters, where I worked to build housing valuation algorithms based on vast amounts of appraisal data. I joined as a Data Analyst, but was promoted to Data Scientist. That opportunity led to my current role at CARPROOF.
Tell me about something you are really excited about when it comes to Data Science right now.
The democratization of Data Science is very exciting. For example, in the past if you wanted to build an image recognition system you typically needed a team of PhDs and developers, but now most of the cloud providers like Amazon, Google and Microsoft provide automated Machine Learning services. All you have to do is provide the data, and with minimal configuration, you can train and deploy an accurate image classifier. These services reduce the barrier to entry and will lead to wider adoption of the technology, which in turn multiplies its network effect. It is a very exciting time to be in the data world.
What is the most important thing that you’d like someone to know about Data Science?
At the end of the day, the data is what’s most important when it comes to Data Science. Which algorithm you use matters less if your data isn’t high quality and complete. For any organization that is looking to start building Data Science capability, the first question they should be asking is, “Do we have the data to do what we want to do?” Having quality data is key for building predictive models or conducting statistical inference, and a really good Data Scientist is someone who can identify potential limitations and make adjustments as necessary.
The challenge also becomes a matter of first defining the problem you are trying to solve or the insights you are trying to attain. This is not an easy thing to do and, most often, is where organizations that are starting to develop data science capacity get things wrong. Data Scientists come from a variety of different backgrounds, each having a unique set of strengths, so selecting the individual with the appropriate skills for your tasks is important. For example, if you determine that your product could really benefit from a feature that automatically classifies uploaded images, then selecting an individual experienced in developing and deploying image classification models would be most appropriate.
If someone told you they want to become a Data Scientist, what would you tell them?
To be successful in Data Science, you eventually need to develop a strong foundation in calculus, linear algebra, probability and statistics. You also need to be able to write high-quality code. I recommend getting started with Python as it is the most commonly used language for Data Science. Having knowledge and experience in SQL and distributed computing using tools like Spark is also important.
Aspiring Data Scientists often get overwhelmed by the vast number of Machine Learning algorithms and the mathematics that they have to grasp. My suggestion is to go in depth on a single approach, learn it inside and out and then expand your knowledge. For example, when getting started with Machine Learning algorithms, I recommend starting off with Decision Trees. Understand the mathematics behind the tree’s construction, how to evaluate its accuracy, and how to implement the model for classification and regression. Having an in-depth understanding of how Decision Trees work will make it significantly easier to understand some of the more complex models like Random Forests or Gradient Boosted Trees. It is easy to get overloaded by the number of models out there, but if you learn one really well, it makes learning other models simpler.
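As one illustration of the mathematics behind a tree’s construction, here is a minimal sketch in plain Python of how a single split might be chosen using Gini impurity, one common splitting criterion. It is not taken from the interview: the function names `gini` and `best_split` and the toy mileage data are hypothetical, made up purely for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold t for the rule 'value <= t' that gives the
    lowest weighted Gini impurity across the two resulting groups.

    Returns (threshold, weighted_impurity); threshold is None when no
    split improves on the unsplit data.
    """
    n = len(values)
    best = (None, gini(labels))
    # Every distinct value except the largest is a candidate threshold.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: mileage (thousands of km) and a made-up price class.
mileage = [20, 35, 50, 80, 120, 150]
price_class = ["high", "high", "high", "low", "low", "low"]

threshold, impurity = best_split(mileage, price_class)
print(threshold, impurity)
```

A full Decision Tree repeats this search over every feature and then recurses on each resulting group, so working through one split by hand is a reasonable first step before moving to a library implementation.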
When it comes to hiring, I would say that the projects that you’ve worked on are far more valuable than amassing certifications. It is not just about math; it is about connecting the math to the business.
What are some of the most important business lessons that you’ve had to learn the hard way? How have they made you better?
The biggest lesson that I’ve learned is around measuring twice and cutting once. What I mean by that is being very clear about the problem you are looking to solve before you go about trying to solve it or implement a solution. When it comes to Data Science, you often see companies trying to hire a bunch of Data Scientists (who are very expensive) without first understanding their infrastructure and determining how the Data Science discipline will fit into their organizational workflow. What sometimes ends up happening is that there is a disconnect between the Data Science organization and the real business need. This ends up wasting a lot of time and resources and frustrates everyone involved. I’ve also seen organizations try to set up this capability and then get stuck because the current capability of their people prevents them from doing so. Getting this kind of thing going is not easy, which is something many organizations take for granted.
Have you had any mentors who have really made an impact on helping you get to where you are now?
I’ve been thankful to work with some very smart people at both Real Matters and now at CARPROOF, who have helped me grow and learn. I have a strong quantitative background but my mentors have helped me get a broader perspective on my work. I came to learn quickly that context matters more than anything because everything in Data Science should be thought about from a business perspective before it is thought about from a modelling perspective. As a Data Scientist, being able to identify and develop data science solutions that have meaningful business outcomes is what makes you truly valuable.
What is the one piece of advice you would give to someone in the workforce today to become successful and find work that is fulfilling?
Do whatever you get excited about. In everything that I’ve done, I’ve always tried to find and tackle unique and interesting challenges that got me really excited. I believe that people learn most effectively when they are motivated by interest and genuine curiosity. My advice would be to seek out work and projects that excite you and not to worry about whether or not you know how to tackle them right away.