1 Comment

The Pareto distribution (power law distribution). I think this is likely to be an important distribution for data science, because so many distributions related to social media and web usage end up following power laws. It also turns up in many other areas, from word frequencies (Zipf's law) to population density to the distribution of wealth (land ownership, as in the 80/20 rule). It shows up very frequently in network science too, for example in the degree distributions of various kinds of graphs.

That said, it isn't without controversy. It has been shown that power law and log-normal distributions are very hard to tease apart (Clauset et al., 2009), and that the ubiquity of power laws in the social and natural sciences may be overblown. There was once a certain amount of hope that the frequency of power laws in nature could be tied to phenomena from complexity science such as self-organized criticality. It is now considered less likely that a single mechanism connects the different occurrences of power laws; rather, different generative mechanisms seem to be involved in different cases. And, as mentioned, it has turned out to be much less clear that these apparent power laws are not actually log-normal.

Despite all that, there are some great packages, like poweRlaw for R, which let you fit power laws with the current gold standard in statistical methodology. For the reasons given above, I think that perhaps the Pareto distribution (or power law, whichever name you prefer) should be included in place of the log-normal distribution.
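For anyone curious what the core of such a fit looks like, here is a minimal Python sketch. It is not the poweRlaw package, only an illustration of the maximum-likelihood estimate of the exponent for a continuous power law, as I understand it from Clauset et al. (2009). The function names `sample_pareto` and `fit_alpha` are my own, the data are synthetic, and the lower cutoff `xmin` is assumed to be known. The full methodology also estimates `xmin` and compares the fit against alternatives such as the log-normal, which this sketch does not attempt.

```python
import math
import random


def sample_pareto(n, alpha, xmin, rng):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= xmin.

    Uses inverse-transform sampling: P(X > x) = (x / xmin)^(-(alpha - 1)).
    """
    # 1 - rng.random() lies in (0, 1], so we never raise zero to a negative power.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(xs, xmin):
    """Maximum-likelihood estimate of alpha for the tail x >= xmin.

    alpha_hat = 1 + n / sum(ln(x_i / xmin)), with approximate standard
    error (alpha_hat - 1) / sqrt(n). xmin is assumed known here.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha_hat = 1.0 + n / log_sum
    std_err = (alpha_hat - 1.0) / math.sqrt(n)
    return alpha_hat, std_err


# Synthetic example: the true exponent of 2.5 is an arbitrary illustrative choice.
rng = random.Random(0)
data = sample_pareto(50_000, alpha=2.5, xmin=1.0, rng=rng)
alpha_hat, std_err = fit_alpha(data, xmin=1.0)
print(f"estimated alpha = {alpha_hat:.3f} +/- {std_err:.3f}")
```

On real data the hard part is exactly what the controversy is about: a good-looking estimate of the exponent says nothing on its own about whether a log-normal would fit equally well.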
