Unique identifier: 3RF (Random Forest)
This is a “level 3” roadmap at the technology/capability level (see Fig. 8-5 of Chapter 8), where “level 1” would indicate a market-level roadmap and “level 2” would indicate a product/service-level technology roadmap:

Machine learning - a type of data analysis that automates analytical model building - is a vast topic that includes many different methods. Random Forest is one such method. To scope our group's project, we chose Random Forest as the technology of focus for our roadmap.
The basic high-level structure of Random Forest machine learning algorithms is depicted in the figure below:
Random Forest is a classification method/technique based on decision trees. Classification problems are a big part of machine learning because many practical questions come down to determining which class or group an observation belongs to. Many classification algorithms are used in data analysis, such as logistic regression, support vector machines, the Bayes classifier, and decision trees. Random forest sits near the top of this classifier hierarchy in terms of predictive performance.
Unlike the traditional decision tree classifier, a random forest classifier grows many decision trees rather than a single one. A traditional decision tree repeatedly chooses one optimal split to decide whether a property is true or false. A random forest contains many such trees operating as an ensemble, which allows many different binary splits of the data and therefore more complex classification rules. Each tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction: majority wins. A large population of relatively uncorrelated models operating as a committee will outperform any of its individual constituent models. For a random forest to perform well: (1) there must be a real signal in the features used to build the model, so that the trees do better than random guessing, and (2) the predictions (and therefore the errors) of the individual trees must have low correlation with each other.
Two methods can be used to ensure that each individual tree is not too correlated with the behavior of any other trees in the model:
1) Bagging: each tree is trained on a different bootstrap sample of the training data.
2) Feature randomness: at each split, a tree considers only a random subset of the features rather than all of them.
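A minimal sketch of where these two decorrelation mechanisms appear in practice, assuming scikit-learn with an illustrative synthetic dataset and illustrative parameter values (none of which come from the roadmap itself):

```python
# Hedged sketch: the two decorrelation mechanisms as scikit-learn parameters.
# Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # many trees vote as an ensemble; majority wins
    bootstrap=True,       # method 1 (bagging): each tree sees a different bootstrap sample
    max_features="sqrt",  # method 2: each split considers a random subset of features
    random_state=0,
)
forest.fit(X_train, y_train)
print("held-out accuracy:", forest.score(X_test, y_test))
```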
The random forest machine learning algorithm is a disruptive technological innovation (a technology that significantly shifts the competition to a new regime where it provides a large improvement on a different FOM than the mainstream product or service) because it is:
As a result, compared with the traditional approach of relying on a single operational model, random forest offers extraordinary performance advantages on FOMs such as accuracy and efficiency because of the large number of models it can generate in a relatively short period of time. One interesting case of random forest increasing accuracy and efficiency is in the field of investment. In the past, investment decisions relied on specific models built by analysts, and the reliance on single models inevitably left blind spots. Today, random forest (and machine learning more broadly) enables many models to be generated quickly, creating a more robust decision-making solution that builds a "forest" of highly diverse "trees". This has significantly changed the industry in terms of efficiency and accuracy. For example, in a research paper by Khaidem et al. (2016), the ROC curve shows high accuracy when random forest is used to model Apple's stock performance. The paper also showed continued improvement in accuracy as machine learning was applied. As a result, random forest and machine learning have significantly changed the way the investment sector operates.

The 3RF tree that we can extract from the DSM above shows that Random Forest (3RF) is part of a larger data analysis service initiative on Machine Learning (ML), and Machine Learning is in turn part of a larger market initiative (here we use housing price prediction in the real estate industry as an example). Random forest requires the following key enabling technologies at the subsystem level: Bagging (4BAG), Stacking (4STK), and Boosting (4BST). These three are the most common ensemble approaches, and they are the technologies and resources at level 4.
We provide an Object-Process Diagram (OPD) of the 3RF roadmap in the figure below. This diagram captures the main object of the roadmap (random forest), its various instances with different areas of focus, its decomposition into subsystems (data, decision trees, votes), its characterization by Figures of Merit (FOMs), as well as the main processes (defining, predicting).
An Object-Process-Language (OPL) description of the roadmap scope is auto-generated and given below. It reflects the same content as the previous figure, but in a formal natural language.
The table below shows a list of Figures of Merit (FOM) by which the random forest models can be assessed:
Figure 5. Table of Random Forest FOM
The first three are used to assess the accuracy of the model:
Here are the variables used in the R^2, RMSE, and MAE equations, explained (variables are written in LaTeX format):
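In their standard textbook form (assumed here rather than taken from the roadmap's own figure), where $y_i$ is the observed value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the observed values, and $n$ the number of observations:

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|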
Among all three, the RMSE is a FOM commonly used to measure the performance of predictive models. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to larger errors in comparison to other performance measures such as R^2 and MAE. To measure prediction, the RMSE should be calculated with out-of-sample data that was not used for model training. From a mathematical standpoint, RMSE can range from zero to positive infinity. A very high RMSE indicates that the model is very poor at out-of-sample predictions. While an RMSE of zero is theoretically possible, this would indicate perfect prediction and is extremely unlikely in real-world situations.
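As a hedged illustration of this out-of-sample evaluation, the following sketch (assuming scikit-learn and a synthetic dataset, not the roadmap's data) computes RMSE, R^2, and MAE on a held-out test set:

```python
# Sketch: out-of-sample RMSE, R^2, and MAE. Data and model are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)                      # predictions on data not used for training
rmse = np.sqrt(mean_squared_error(y_test, pred))  # out-of-sample RMSE
print(rmse, r2_score(y_test, pred), mean_absolute_error(y_test, pred))
```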
Computational efficiency is a property of an algorithm/model that relates to the amount of computational resources used by the algorithm. Algorithms must be analyzed to determine their resource usage, and the efficiency of an algorithm can be measured with respect to different resources (e.g., time, space, energy). Higher efficiency is achieved by minimizing resource usage. Different resources such as time and space complexity cannot be compared directly, so determining which algorithm is "more efficient" often depends on which measure of efficiency is considered most important.
Interpretability is a subjective metric used to evaluate how an analytical model works and how its results were produced. Since computers do not explain the results of the algorithms they run, and especially since model building is automated in machine learning, explaining machine learning models is a major problem that data scientists face: these models behave like a "black box". The automated nature of these algorithms makes it hard for humans to know exactly what processes the computer is performing. High interpretability means that the model is relatively explainable, which usually results from greater human involvement in the development and testing of the model. Models with lower interpretability are harder to rely on, since the automated internal processes used to develop the model are not clear to human users, who may then be less confident in the accuracy of the results.
Simplicity is a metric used to evaluate how simple or complex a model is. Here it is measured as the reciprocal of the product of the number of variables and the number of models. Complex problems do not always require complex solutions, and simple regressions can produce reasonably accurate results. However, there is certainly a tradeoff between simplicity and accuracy: a model could be very simple but unable to capture the nuances of the data, while a model could be very complex, with many variables that help represent the data, but suffer in terms of efficiency.
Due to the nature of this technology, it has been challenging to quantify the growth of the FOMs over time, because performance depends not only on the technology itself (the algorithms) but also on the dataset and on the parameters selected for modeling, such as the number of trees. We therefore used the same dataset to run three models with optimized parameters, and the chart below shows the RMSE of the three models (Linear Regression, CART, and Random Forest) plotted against the years in which the model types were first developed.
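A minimal sketch of how such a comparison could be reproduced, assuming scikit-learn and using the public California housing data as a stand-in for the roadmap's own real estate dataset (the actual dataset and tuned parameters are not included here):

```python
# Sketch: same dataset, three model families, compare out-of-sample RMSE.
# fetch_california_housing downloads the data on first use; it stands in for
# the roadmap's real-estate dataset, which is an assumption of this sketch.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "CART": DecisionTreeRegressor(random_state=0),        # scikit-learn trees are CART-based
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.3f}")
```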
Random forest, the most recent of all three model types being compared, has the lowest RMSE value, which shows that the average magnitude of the model's error is the lowest of all three models.
Random forest is still being rapidly developed, with tremendous efforts from research institutions around the world. There are also many efforts to implement and apply random forest algorithms in major industries such as finance, investment, services, tech, and energy.
In this section and the following sections, we ground the discussion in the real estate industry. Assume we have a company that uses random forest to predict housing prices (as Zillow and similar companies do). A dataset is used for the quantification discussion, which includes:
The table below shows the strategic drivers of our hypothetical company and alignment of the 3RF technology roadmap with it.
The first strategic driver indicates that the company aims to improve the accuracy of the model, and the 3RF roadmap is likewise optimizing the algorithm and parameters to fine-tune the model; this driver is therefore aligned with the 3RF roadmap. The second driver indicates that the company is looking for simplicity, and this is not aligned with the current 3RF roadmap. This discrepancy needs immediate attention and must be addressed.
The figure below shows a summary of other real estate companies that also use data analytics to predict housing prices.
Figure 8. Competitors in the Marketplace
The table below shows a comparison between different companies, including how their models are set up.
Zillow-like major companies, which heavily utilize data analytics in their price prediction models, tend to have a very high R^2. To achieve this accuracy, they include as many variables and as many observations as possible, which helps "train" their models to be very smart. In doing so, they sacrifice simplicity because of the many variables and the large number of models built for machine learning purposes. In contrast, some local "mom and pop" shops use very simple models such as linear regression or CART, for which accuracy is sacrificed. Our hypothetical company is trying to use random forest to deliver good accuracy while still maintaining a certain level of simplicity.
(In this chart, simplicity is defined as the reciprocal of the product of the number of variables and the number of models. For example, our hypothetical company uses 73 variables and 5 models for prediction, so its simplicity is 1/(73*5) = 0.0027. The more variables or models a prediction relies on, the lower the simplicity, which is aligned with our intuition.)
The Pareto front (for this specific dataset) is shown as the blue line. It is clear that local company A (using CART) is not on the Pareto front; instead, it is dominated. Local company B, using linear regression, is just as simple but more accurate. The Zillow-like major company and our hypothetical company are both much better in terms of accuracy.
For the Yr. 2020 target, the green line defines the expected Pareto front at that time. Compared with current performance, the future Pareto front is expected to improve accuracy (R^2), simplicity, or both. The company's strategy will determine the direction and extent of the improvement.
An example morphological matrix is developed for the 3RF roadmap in this specific case. The purpose of such a model is to understand all the design vectors, explore the design tradespace, and establish the active constraints in the system.
Due to the special nature of this technology, there are no defined equations for the FOMs. The FOM values are impacted by the model parameters, including ntree, nodesize, and mtry (without clearly defined mathematical or regression equations), and by the dataset, including all variables and observations.
Explanation of parameters:
1) ntree: number of CART trees in the forest
2) mtry: number of variables examined at each split of the trees
3) nodesize: minimum number of observations in each terminal node of the trees
As a result, for any given dataset, the two FOMs can be expressed in this way:
R2 ∼ f1 (ntree, nodesize, mtry)
RMSE ∼ f2 (ntree, nodesize, mtry)
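A sketch of what evaluating f1 and f2 at a single parameter setting could look like, assuming scikit-learn; the mapping of the R-style parameter names to scikit-learn arguments (ntree to n_estimators, nodesize to min_samples_leaf, mtry to max_features) is an assumption of this sketch, not part of the roadmap:

```python
# Sketch of f1/f2: R^2 and RMSE for one (ntree, nodesize, mtry) setting on a fixed dataset.
# Parameter-name mapping to scikit-learn is an assumed equivalence:
#   ntree -> n_estimators, nodesize -> min_samples_leaf, mtry -> max_features
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_fom(X, y, ntree, nodesize, mtry, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestRegressor(
        n_estimators=ntree,
        min_samples_leaf=nodesize,
        max_features=mtry,
        random_state=seed,
    ).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return r2_score(y_te, pred), np.sqrt(mean_squared_error(y_te, pred))
```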
One dataset is used to normalize any potential difference caused by the data source. This dataset is from the real estate industry and has 2821 observations (houses' information), 73 independent variables such as lot area and garage size, and 1 dependent variable (sale price). The goal of the model is to predict the sale price based on these 73 independent variables.
Notional trends: each dot on the plots below represents the output of the corresponding FOM (R^2 or RMSE) at the given value of the parameter while holding the other parameters fixed.
These two charts below show the impact of mtry on the 2 FOMs:
These two charts below show the impact of ntree on the 2 FOMs:
These two charts below show the impact of nodesize on the 2 FOMs:
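The notional trend points in these charts could be generated with a one-at-a-time sweep such as the sketch below, which reuses evaluate_fom() from the earlier sketch; the baseline values and the swept grid are illustrative assumptions:

```python
# Sketch: vary one parameter at a time while fixing the others at a baseline.
# Reuses evaluate_fom() from the earlier sketch; all values are illustrative.
baseline = {"ntree": 500, "nodesize": 5, "mtry": 8}

def sweep(X, y, name, values):
    points = []
    for v in values:
        params = dict(baseline, **{name: v})   # change one parameter, keep the rest
        r2, rmse = evaluate_fom(X, y, **params)
        points.append((v, r2, rmse))            # one dot per value on the trend chart
    return points

# e.g. mtry_points = sweep(X, y, "mtry", [2, 4, 8, 16, 32, 64, 73])
```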

Because there is no governing equation, no partial derivatives can be taken to illustrate the sensitivities. From the previous notional trends, it is also clear that the finite differences change, and even their direction (positive or negative) changes. As a result, a baseline model with a set of inputs was defined; by varying the values of the inputs across different models, tornado charts were generated to represent the finite differences from the baseline caused by these changes. Under the given conditions, both ntree and mtry are very important to performance, and both FOMs are most sensitive to mtry.
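A hedged sketch of that finite-difference computation, again reusing evaluate_fom() from the earlier sketch; the baseline and the low/high values swept for each parameter are illustrative assumptions:

```python
# Sketch: finite differences of RMSE around a baseline, ordered for a tornado chart.
# Baseline and low/high values are illustrative assumptions.
baseline = {"ntree": 500, "nodesize": 5, "mtry": 8}
low_high = {"ntree": (100, 1000), "nodesize": (1, 25), "mtry": (4, 40)}

def tornado_rmse(X, y):
    _, rmse_base = evaluate_fom(X, y, **baseline)
    bars = {}
    for name, (lo, hi) in low_high.items():
        _, rmse_lo = evaluate_fom(X, y, **dict(baseline, **{name: lo}))
        _, rmse_hi = evaluate_fom(X, y, **dict(baseline, **{name: hi}))
        bars[name] = (rmse_lo - rmse_base, rmse_hi - rmse_base)
    # widest total swing first, matching how bars are stacked in a tornado chart
    return dict(sorted(bars.items(), key=lambda kv: -abs(kv[1][1] - kv[1][0])))
```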
Figure 16. Tornado Chart of RMSE
Based on the preliminary sensitivity results and background knowledge of random forest, the R&D focus should be on mtry optimization, because 1) the FOMs are very sensitive to mtry, and it makes a big difference in the model's result: values lower or higher than the optimum both deteriorate the result; and 2) in random forest, mtry is a key factor determining two critical aspects, the diversity of the models (which tends to favor a smaller mtry) and the performance of the individual models (which tends to favor a larger mtry). Accordingly, optimizing mtry is critical. It is also worth pointing out that mtry optimization depends on ntree and nodesize as well. As a result, the approach should be to find an algorithm that optimizes all three parameters while focusing on mtry.
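One simple way to implement such a joint optimization with emphasis on mtry is a cross-validated grid search, sketched below under the same assumed scikit-learn parameter mapping; the grid values are illustrative and this is not necessarily the roadmap's actual R&D method:

```python
# Sketch: cross-validated grid search over all three parameters, densest on mtry.
# Grid values are illustrative; max_features up to 73 assumes the 73-variable dataset.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features": [2, 4, 8, 16, 24, 32, 48, 64, 73],  # mtry: finest grid
    "n_estimators": [200, 500, 1000],                    # ntree
    "min_samples_leaf": [1, 5, 10],                      # nodesize
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
# search.fit(X, y); search.best_params_ then gives the tuned setting
```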
The figure below contains a sample NPV analysis underlying the 3RF roadmap. It shows the initial non-recurring cost of the research projects during the first 2 years of development, before revenue generation starts in Yr. 3. The ramp-up period is around 6 years, until revenue stabilizes at around $8MM/year. The total estimated project life is 17 years.
Based on our outlook, revenue will stabilize around $8MM; however, there is significant uncertainty about market demand as well as the organization's ability to keep up with technology development year by year. As a result, we introduced an oscillation factor so that revenue bounces between $7MM and $9MM. The underlying assumption is that if revenue falls below $8MM in one year, more effort will be put in place the next year to raise revenue to $9MM. Meanwhile, the maintenance and operations cost will oscillate with revenue, because greater effort to improve revenue will likely require higher operations and maintenance cost.
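A minimal sketch of this cash-flow logic is given below; the discount rate, the development-cost level, and the operations-cost fraction are assumptions introduced only for illustration (the text above specifies only the 2-year development period, the roughly 6-year ramp, the $7MM to $9MM oscillation around $8MM, and the 17-year life):

```python
# Sketch of the cash-flow / NPV logic described above. The discount rate, the
# non-recurring cost level, and the O&M fraction are illustrative assumptions.
def npv(discount_rate=0.10, life=17):
    cash_flows = []
    for year in range(1, life + 1):
        if year <= 2:
            revenue = 0.0
            cost = 2.0                       # assumed non-recurring R&D cost, $MM/yr
        else:
            ramp = min((year - 2) / 6, 1.0)  # ~6-year ramp to the $8MM/yr plateau
            swing = 1.0 if ramp >= 1.0 else 0.0
            revenue = 8.0 * ramp + (swing if year % 2 == 0 else -swing)  # $7MM-$9MM oscillation
            cost = 0.4 * revenue             # O&M cost assumed to scale with revenue
        cash_flows.append(revenue - cost)
    return sum(cf / (1 + discount_rate) ** t for t, cf in enumerate(cash_flows, start=1))

print(f"NPV ~ ${npv():.1f}MM")
```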
Figure 17.2 Cash Flow Representation
To plan and prioritize R&D or R&T projects, we use the company's strategic drivers as overarching principles, and the technical and financial models as guiding principles, to develop the project categories and roadmaps.
Five projects will be the focus in the near future:
A project Gantt chart was developed to represent the sequence and resource requirements of the different projects. The projects were also paced to align with the R&D budget outlook for each year:
Part 1: Key Patents for Random Forest
Part 2: Publications on the Development of Random Forest:
Part 3: Key References on Data Analytics in Real Estate Industry:
Part 4: Extended Reading- Interesting Random Forest applications in many other fields:
Our target is to develop a real estate digital platform that integrates the traditional real estate industry with state-of-the-art data analytics. To achieve the target of an accurate method for predicting the optimal housing price for a broader customer base, we will invest in two R&D categories. The first is an algorithm optimization project to improve the accuracy of the current machine learning model to an R^2 of 0.98 on the sample data. The second is a user interface improvement project to make our methodology simpler to use in order to attract users. These R&D projects will enable us to reach our goal in Yr. 2022.
Figure 19. Arrow Chart for Model Development