Modeling Principles
Almost all, if not all, applied science is based on the idea of a model. There are problems in the real world, and we would like to create a theory to solve such problems.
One of the pillars of quantitative modeling, perhaps its main foundation, is effectiveness. But how can a model be effective? By demonstrating that it is simple, reproducible, and applicable in other areas of knowledge.
A second pillar of modeling is the ability to make decisions and choices. Such choices are, and should be, made at every stage of specifying a model.
Another pillar is testing. A model cannot exist without its due testing process. These tests are experiments, where choices must be made and tested to ensure the effectiveness of the model.
A model has a simple objective: to be a simplified representation of the real world. Built on sampling, randomness, and results such as the central limit theorem, a model is nothing more than a specific, local, unitary idea that stands for the general, the total.
The sample will never be equal to the population, and a unit will never be the total, but a model finds a pattern that it assumes to be central and measures how far the units are from this pattern. Formally, we call these the “effect” and the “residual”.
The “effect” is what we seek in order to solve the real problem in a simplified way. The “residual” is what corroborates the effect: it is the acid test of the model, and it shows whether the model is too simple or too complex.
It is necessary to move between the simple and the complex until the effect is centered. If the effect is central and the residual is coherent and parsimonious, then the model has fulfilled its main objective: to represent a complex reality in a simple way.
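As a minimal sketch of this decomposition, the simplest possible model takes the sample mean as the central pattern (the effect) and the deviation of each unit from it as the residual. The numbers below are hypothetical and serve only as an illustration. Real models usually have a richer effect than a single mean.

```python
import numpy as np

# Hypothetical sample: ten observed units (illustrative numbers only).
sample = np.array([9.8, 10.4, 10.1, 9.6, 10.3, 9.9, 10.2, 10.0, 9.7, 10.5])

# The simplest possible model: one central pattern shared by every unit.
effect = sample.mean()

# The residual is how far each unit sits from that pattern.
residuals = sample - effect

# Every observation is recovered as effect + residual.
reconstructed = effect + residuals

print(f"effect (central pattern): {effect:.3f}")
print(f"residual spread (std):    {residuals.std(ddof=1):.3f}")
print(f"sum of residuals:         {residuals.sum():.2e}")
```

The residuals of a mean-only model sum to zero by construction, so what matters is their spread and shape: large or clearly patterned residuals suggest the model is too simple.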
And here is how this process works:
The problems observed in the real world are very complex, and solving them outright would mean finding an exact solution.
But if the problem is complex, the solution must be just as complex, and that is why it is necessary to simplify. We can solve some problems with approximate solutions, and this is feasible. However, an approximate solution requires an approximate problem.
In this sense, we observe the real problem and, under explicit conditions, create a simplified version of it. We call this version the approximate problem: it is close to the real one, with only enough simplification to allow us to study, estimate, test and theorize.
From the approximate (simplified) problem it is possible to obtain an exact solution. It is therefore reasonable to take the exact solution of the approximate problem as the approximate solution of the real problem.
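The idea can be sketched numerically. In the example below, the "real problem" is a hypothetical nonlinear relationship (the function `np.log1p` is an arbitrary stand-in, not something taken from the text), the approximate problem assumes a straight line, and the exact solution of that approximate problem comes from solving the least-squares normal equations.

```python
import numpy as np

# Stand-in for the "real problem": a nonlinear relationship (hypothetical choice).
x = np.linspace(0.0, 1.0, 101)
y_real = np.log1p(x)

# Approximate problem: pretend the relationship is a straight line, y = a + b*x.
X = np.column_stack([np.ones_like(x), x])

# Exact solution of the approximate problem: solve the normal equations.
a, b = np.linalg.solve(X.T @ X, X.T @ y_real)

# Read back as an approximate solution of the real problem.
y_approx = a + b * x
max_gap = np.max(np.abs(y_real - y_approx))

print(f"a = {a:.4f}, b = {b:.4f}, largest gap = {max_gap:.4f}")
```

The line is solved exactly, yet it only approximates the curve. The largest gap measures how much reality was given up in the simplification.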
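And so the fundamental process of modeling occurs: the real problem is simplified into an approximate problem, the approximate problem is solved exactly, and that exact solution is read back as an approximate solution of the real problem.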
This very simple process shows that the model is nothing more than a simplification of the real problem, as I mentioned before. Using the model, it is possible to analyze the results even without having the solution to the real problem. The analysis of the model serves to understand the nuances of the results and to draw conclusions about the simplification that was created. Sometimes we will not get the solution we are looking for, but we can still reach conclusions, even if simplified, about the real problem.
The analysis and conclusions of the model allow us to generate inference, that is, deductions based on the reasoning produced by the analysis. From the inference we can explain the real problem, even if in a simplified way, and make estimates and predictions.
The final step is the decision or choice that we make based on that understanding and those estimates. From then on, we monitor and verify whether the model is sufficient to reach an approximate solution to the real problem. In other words, we verify whether the choices we made when applying the model’s results are meeting the expectations set by the model’s estimates.
If our model is not sufficient, that is, if the estimates are not observed in the real world, we need to recalibrate the model or even specify a new one. This means that we failed in one of the steps outlined above.
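A minimal sketch of this verification step follows. The function name `needs_recalibration`, the use of mean absolute error, the tolerance and all the numbers are assumptions made for illustration. The text does not prescribe a specific monitoring rule.

```python
import numpy as np

def needs_recalibration(estimates, observed, tolerance):
    """Return (flag, error): flag is True when the mean absolute gap
    between model estimates and real-world outcomes exceeds `tolerance`."""
    estimates = np.asarray(estimates, dtype=float)
    observed = np.asarray(observed, dtype=float)
    error = float(np.mean(np.abs(estimates - observed)))
    return error > tolerance, error

# Hypothetical monitoring window: what the model estimated vs. what happened.
estimates = [100.0, 102.0, 105.0, 107.0]
observed_ok = [101.0, 101.0, 106.0, 108.0]
observed_drift = [110.0, 113.0, 118.0, 121.0]

print(needs_recalibration(estimates, observed_ok, tolerance=2.0))     # prints (False, 1.0)
print(needs_recalibration(estimates, observed_drift, tolerance=2.0))  # prints (True, 12.0)
```

In the second case the estimates are no longer observed in the real world, which is the signal to recalibrate the model or specify a new one.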
We usually make most of our mistakes when simplifying the problem, in the initial specification of the model. This step seems trivial, and in a sense it is, but it requires a lot of creativity, because modeling is an art. If we simplify too much, the model loses its sensitivity to reality. If we simplify too little, the model can become so complex that it cannot be distinguished from the real problem, and it fails because an exact solution is too difficult to obtain.
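The trade-off can be illustrated with a small experiment, offered here as a sketch under stated assumptions: the "reality" is a hypothetical sine curve with noise, and polynomial degree stands in for model complexity. Neither choice comes from the text.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def reality(x):
    # Hypothetical "real problem", unknown to the modeler in practice.
    return np.sin(2 * np.pi * x)

# Noisy observations used to fit, and a separate set used only to check.
x_fit = np.linspace(0.0, 1.0, 30)
y_fit = reality(x_fit) + rng.normal(0.0, 0.2, x_fit.size)
x_check = np.linspace(0.01, 0.99, 200)
y_check = reality(x_check) + rng.normal(0.0, 0.2, x_check.size)

def rmse(model, x, y):
    # Root mean squared error of the model's predictions against y.
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

results = {}
for degree in (0, 3, 12):
    model = Polynomial.fit(x_fit, y_fit, degree)
    results[degree] = (rmse(model, x_fit, y_fit), rmse(model, x_check, y_check))

for degree, (fit_err, check_err) in results.items():
    print(f"degree {degree:2d}: fit RMSE {fit_err:.3f}, check RMSE {check_err:.3f}")
```

The error on the fitting data can only go down as the degree grows, so it cannot be used alone to judge the model. The error on the separate check set is the one to watch: a degree-0 model is insensitive to reality, while a very high degree tends to start chasing the noise.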
The second point where we fail the most is in decision-making and verification, because this process can be expensive. Monitoring is a cost that is often never proposed or included in a modeling project. The choice of how to apply the model can also be made without a due sense of proportion. In general, a poorly conducted decision-making process generates the inconsistency that leads the model to failure.
The person who makes the decisions is often not the person who collected the information, nor the person who specified and executed the model, much less the person who performed the prior analysis and reached the conclusions and inferences.
The modeling process has no fixed duration, but it has steps that need to be respected, as if they were a natural rule. So far, these rules have held and worked very well. For example, we have many powerful models that are sometimes defined theoretically and, when applied, perform their role accurately.
There is no defined time, but time can have a cost: the longer the process takes, the higher the cost. This holds only if time has a monetary value (such as working hours, the researcher’s salary or the project budget).
So, even if it seems trivial, do not ignore this knowledge. A model is not specified in minutes, and executing a famous or currently fashionable algorithm is not enough. It is necessary to abstract: to think about the complexity of the real problem and about the conditions under which it can be simplified.
A very interesting idea for simplifying a real problem and finding an approximate solution is ceteris paribus (“all other things being equal”), used in economics. The general idea is to assume, in a subjective and abstract way, an implicit equilibrium condition: every phenomenon left out of the model is held constant, so a change in any of them has no effect. This isolates the chosen phenomena, so that we can estimate the effect of one on the other, draw conclusions, make inferences and make decisions.
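In a linear regression, this reading shows up directly in the coefficients: each one is the change in the prediction when its variable moves by one unit and the others are held fixed. The sketch below uses hypothetical simulated data with three variables. The coefficients, the noise level and the helper `predict` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Hypothetical data: three explanatory variables and one outcome.
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 + 1.5 * x1 - 0.7 * x2 + 0.3 * x3 + rng.normal(0.0, 0.1, n)

# Linear regression with an intercept, fitted by least squares.
X = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(v1, v2, v3):
    # Prediction of the fitted linear model at one point.
    return coef[0] + coef[1] * v1 + coef[2] * v2 + coef[3] * v3

# Ceteris paribus: move x1 by one unit, hold x2 and x3 where they are.
baseline = predict(0.5, -1.0, 2.0)
shifted = predict(1.5, -1.0, 2.0)
effect_of_x1 = shifted - baseline

print(f"estimated coefficient on x1: {coef[1]:.3f}")
print(f"change in prediction:        {effect_of_x1:.3f}")
```

The two printed values coincide because the model is linear. Whether the real world also keeps x2 and x3 fixed when x1 moves is exactly the assumption that ceteris paribus asks us to accept.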
From now on, you know how to start your modeling process. It is not just visiting your data lake, selecting a multitude of variables, running a convolutional neural network and hoping that your model is as accurate as your accuracy test says it is. Sometimes a linear regression with 3 variables is enough, and choosing it would be following the first step: simplification.