Applied ML End to End Journey — Step 2, Part 1

Farye "AK" Nwede
2 min readFeb 10, 2022

Read Step 1, Framing your Problem.

Photo by Luke Chesser on Unsplash

Moving forward with the desire to create a machine learning model, end to end, it’s time for me to find data. This step is an important one. It’ll probably take some of the most time during the project. Several reasons factor into this. First, you must find a data source. For this project, this data source had several parameters:

  1. The project must be low-cost. Ideally free. Low-cost is difficult because of MLS fees for a real estate project.
  2. The data should have some level of utility. The current question is: how large should my dataset be? What’s good sample size for your data? One record is not enough. Are 10 rows enough? How about 100?

As you can see, the answer to the above questions appears to be amorphous. That’s the case. I have to analyze the data consistently. After gathering more records, reviewing the data, and seeing what it shows me is pivotal.

Using the Problem Statement as a Starting Point

Selecting the right real estate property to purchase is time-consuming and inconsistent based on my personal preferences.

Based on my problem statement, it’s clear that I need real estate data. However, how should I frame the data? With a focus on personal preferences, the data points will be subjective. They will be items that are important to me.

The features that matter to me are the following:

  1. Number of bedrooms and bathrooms
  2. Distance to family
  3. Overvalued properties
  4. A reasonable mortgage

Those were the initial features that I decided to track. You’ll have to define the features you want to gather for your project. For some projects, this will be more defined. For others, you may have some work to do in this category. My advice here is to let the data reveal itself to you. As I gathered data and put my Google Sheets formula skills to work, I realized that more features mattered. These features included a low-HOA, house type, and the age of the house. I ended up deciding to add a score column as well. This will help to identify the properties that best fit my preferences. This isn’t a perfect approach. Nor am I a data scientist. There’s a lot for me to learn here. However, the data gathering step has helped reinforce my methods so far.

More to come.

--

--

Farye "AK" Nwede

Engineering manager. DevOps. 1/2 of the Ubiquitous Methods podcast. Tinkerer, thinker and learner. Former party animal reformed to become the next Bruce Wayne.