list of papers for stochastic ranking process>
front page>

Stochastic ranking process approach to the Amazon online bookstore ranks

Since 2008/05/03, added 2009/08/28 (statistical fit and implication on Amazon.co.jp, and move-to-front rules), added 2018/02/21,2019/11/07 (research results around dependent stochastic variables for position dependent intensities)
Stochastic ranking process is a simple mathematical model which we invented originally to explain a time development of Amazon sales rank for any book found in a web page at Amazon.co.jp, a Japanese counter part of an online bookstore Amazon.com. Amazon bookstore is of interest partly because it is known as one of the pioneering examples of a long tail online retail business.

At the time I am writing this paragraph (2008A.D.), the keywords "RANK" and "WEB" seem to mislead sophisticated readers to the "Google Page Rank". This page is on a mathematical model of the Amazon Sales Rank, and mathematically has very little to do with the Google Page Rank. The two are mathematically VERY different, and I am interested on the perhaps less popular subject, in other words, on a subject in the long tail.

The following is a summary of what is planned to be explained in this page.

  1. Amazon.co.jp ranking number behaves rather wildly.
  2. The behavior, however, can be explained by a simple mathematical model, which we call the stochastic ranking process.
  3. In spite of the observed instability of the number and of the simplicity of the model in the theory, the ranking number conveys quantitative information of the sales record of online retail stores, especially its long tail structure. (Hence, in particular, a simple ranking number based on the stochastic ranking process could be used as a simple and convenient method for disclosing sales status, in particular, of a long tail online business.)

(added 2018/02, 2019/11)

  1. Here is a separate web page for major mathematical advance after I wrote the contents below in this page. It focuses on additional stochastic dependence among particles (processes) arising from position depndence of intensity functions.


a page for a book from amazon.co.jp

Ranking number of a book at Amazon.co.jp website.

When one visits Amazon.co.jp and look for a book, one would face a web page which looks typically like the right figure. (The original web page is rather large, so only a part of the page is shown in the right figure.)

The page is more or less similar to a corresponding page at Amazon.com (or a corresponding page in any other countries), except that the page is written in Japanese language. Similarly to the case in Amazon.com, there is a line in this page saying "Amazon.co.jp ranking: 384,057th among the books". The number is updated every hour, as is the case in Amazon.com.
time evolution of ranking at amazon.co.jp
The figure above with 3 lines shows the change of the line with the ranking number. The first line says "Amazon.co.jp ranking: 373,406th among the books", the next line in the figure is the same line in the web page an hour later, with ranking number 373,977, and the third line in the figure is the situation 2 hours later with 374,693. In this example, the number increases about 600 to 700 each hour. This ranking number (counting to a few hundred thousand) is what I am going to deal with here. (The number is named "Ranking" in the Amazon.co.jp pages, while in the Amazon.com pages the corresponding number is named "Sales Rank". May be "ranking" is an adjective in English, but is a noun in "Japanese English". I will use the term "ranking" to keep it clear that I am dealing with the Amazon.co.jp ranking number.)

Focusing attention on such a number perhaps does not interest many people, but there are more than one person deeply interested in Amazon ranking or Sales Ranks. M. Rosenthal accumulates a large data of Amazon.com sales ranks and summarizes in his web page a schematic relation between the rank numbers and the sales rates of books at Amazon.com. There are also research papers on econometrical studies based on time evolutions of sales ranks. J.Chevalier and A.Goolsbee, Measuring prices and price competition online: Amazon.com and BarnesandNoble.com, Quantitative Marketing and Economics 1 (2) (2003) 203-222, reports a method of estimating the sales rate of a book title from the book's sales ranks. Their method of relating sales rates to sales ranks is to buy, say three books, and wait for 3 weeks, then the rank number at that point should correspond to the sales rate of a book per week. Mathematically, this method assumes that the ranking number increases linearly in time and that the rate is proportional to the number of books ordered.

Let us see what is happening in Amazon.co.jp ranking number.


Time evolution of ranking number.

With a closer look into the time evolution of the ranking number at Amazon.co.jp, we immediately see that for Amazon.co.jp, the ranking number is not linear in time, nor is proportional to sales rate. In fact, it behaves very wildly.

Choose a book which sells one per a few weeks to a few month, and stick to the time evolution of its Amazon.co.jp ranking number for an year. If you have never done so, you might then be amazed to find a rather wild behavior as in the following schematic figure.
time evolution of amazon.co.jp book ranking - schematic drawing (The figure is a schematic drawing and oversimplifies the actual time evolution. However, it is based on actual observations, and the numbers are not exaggeration.)

The vertical axis represents the ranking number (the higher up in the graph, the less popular). The ranking number 384,057 given as an example at the beginning of this web page, is about one third from top in this figure.

The main question is: " What could be a mathematical definition of a ranking number, such that would reproduce such a curve?"

It may be worthwhile to take a bit of abstract mathematical approach to consider this realistic question, because I have heard that even among the professional publishers not much is known about how Amazon bookstore is doing. In another econometrical research paper also using data from ranking numbers (again with linearity assumption), E.Brynjolfsson, Y.Hu, M.D.Smith, Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers, Management Science 49 (2003) 1580-1596, it is written 'Internet retailers are extremely hesitant about releasing specific sales data'.


Stochastic ranking process.

There is a saying in Japan "Better than hearing 100 times, see once". (Original meaning of course is, experience yourself, rather than trying to collect information of other people's experiences.) Here is a sample movie of the motion of stochastic ranking process:
movie (500KB, 30seconds, wmv file)

For interested readers, click here for a simple explanation of the stochastic ranking process, a simple mathematical model explaining the wild and random time evolution of the Amazon.co.jp ranking numbers. A full mathematical definitions are in the original paper linked in the References at the end of this web page.

To summarize,

  1. The ranking number jumps to 1 when the book in question is ordered for sales. The sales occurs randomly, because people buy books on their own will and schedule, which differ among different people.
  2. In the period when the book is not sold, the ranking number increases solely because other books are sold and jump to ranking 1; the book in question is then squeezed towards the tail side in the ranking system. Since the increase in the number occurs only when a book at the tail side is sold, the rate of increase in the number is high near the top position and slow near the tail position.

The stochastic ranking process does reproduce observed wild behaviors of the Amazon.co.jp ranking numbers. In fact, the schematic curve drawn in the figure above was calculated by the stochastic ranking process!


Would a number with wild behavior give serious information on sales and popularity? - Yes it does!

The movie or the definition of the model shows that each particle in the model (a book, in the case of online bookstores) randomly jumps to the top ranking and then increases the ranking number (moves gradually towards the less popular tail side), a behavior in common with the figure above with a wild curve. One might then doubt, if such is the case, then the Ranking number may not convey any serious information about the corresponding book's popularity (= sales rate = contribution to the bookstore = the Rank).

The answer to this question is, the time evolution of the number does convey a rich information.

A brief explanation would be as follows. A great hit title sells constantly, so even if the ranking number starts to increase, it quickly jumps back to top, so the number is constantly near the top. The average maximum increase in the number reflects (in a non-linear and rather complex, but in a calculable manner) the sales rate of the book.

On the other hand, for a very unpopular book (i.e., for a book in the far "long tail"), the mean interval between sales is longer than the observation time, and would not sell in the worst case, while one is watching at the book. This means that for a book title in the long tail, a standard short-time observation is too brief to find the book's true popularity, hence any reasonable index for popularity of sales rate should fluctuate.

With these thinking from the view point of the stochastic ranking process, the definition of the model works as a collect index of the popularity and sales rate of a book.


Infinite particle limit and 1+1 dimensional mixed incompressible liquid driven by evaporation

A mathematical model may not be practically useful if the model itself is too hard to solve. Fortunately, the stochastic ranking model is deterministic and solvable in the infinite particle limit, hence for a system with large particles (such as Amazon.co.jp with over 2 million books), the predictions of the model are not too difficult to find out.

Again, better than hearing 100 times, see once. Here is another sample movie of the motion of stochastic ranking process, this time 4 times large in file size and 3 time longer in the movie length:
movie (0.2MB, 7 seconds, wmv file)

Allow me a little bit about mathematics. Not only an existence of an infinite particle limit, but also the time developement in the limit can be solved. A key observation is that limit quantities satisfy a system of partial differential equations for 1+1 dimensional mixed incompressible liquid driven by evaporation. Please refer to the original paper linked in the References at the end of this web page, for precise mathematical results.


Study in Amazon ranking - Stochastic ranking process approach, by T.HATTORI

Amazon.co.jp is not a long-tail business! --- A statistical consequence of combined Amazon.co.jp data and stochastic ranking process

A mathematical analysis combined with the actual time evolution data of Amazon.co.jp rankings suggests an interesting reality about long tail structures of the online bookstore.

Note that such a simple, rather abstract mathematical, model as stochastic ranking process fits rather well with actual data observed at Amazon.co.jp web. The figure below shows a result of observation of a series of time developement of Amazon.co.jp ranking of a book (77 black circles), together with a result of a best statistical fit to the observed data of a formula obtained as a mathematical consequence of stochastic ranking process (solid curve). The horizontal and vertical axis respectively denotes time in units of hour and the ranking. We see that the black circles and the solid curve is rather close, indicating a good chance that a simple model as stochastic ranking process works as a good model of time developement of actual Amazon.co.jp rankings.

Observed data of Amazon.co.jp ranking and a statistical fit of a theoretical formula from stochastic ranking model.

Please refer to the original paper linked in the References at the end of this web page, for details of statistical fits. (I again skipped a lot in this paragraph, to avoid mathematical details...) A consequence of statistical fit clearly shows that, the Amazon.co.jp, in spite of Amazon bookstore's reputation as a pioneering company of web based long tail business model, is not a long tail, in that the vast amount of rarely sold books such as books on mathematics counts little in the total sales of the bookstore, and that the bookstore's sales is essentially supported by a very small titles of best-sellers, as in any standard bookstores.


Existing studies: Move-to-front rules, LRU (Least-recently-used) caching

References to previous studies must be noted. The mathematical model, which so far has been named stochastic ranking process, has in fact been noticed since mid 20th century:
M.L.Tsetlin, Finite automata and models of simple forms of behaviour, Russian Math. Surv. 18(4) (1963) 1-27.
This pioneering work seems unfortunatelly remained unnoticed for about ten years, a reason perhaps is that the model has been given as one of simple examples of automata covered by the general study of the paper. Meanwhile, the model seems `re-discovered" (studied without reference to Tsetlin) by
J.McCabe, On serial files with relocatable records, Oper. Res. 13 (1965) 609-618,
W.J.Hendricks, The stationary distribution of an interesting Markov chains, J. Appl. Probab. 9 (1972) 231-233.

The model (being a finite irreducible Markov process) has a unique stationary distribution, which turns out to have a rather simple form, which these early papers points out. Tsetlin obviously had a heap of books piled on his desk in his study, from which he takes out a book on demand, and return it to the top. He makes an interesting comment suggesting that, since the distribution of the books converges eventually to the stationary distribution (and the well used books would have large possibility of staying near the top of the heap), it is well-ordered and that he doesn't want a well-motivated house-keeper disturbing his order!

The model seems to have become well-known after the work by Hendricks, and a first peak of published research papers appeared with a name move-to-front rules (see the right figure, which gives a number of papers I found and kept with the keyword move-to-front together with their relevant references, published within each bin of 4 years). The collection is far from complete, nor systematic, but the figure does give an idea that, after the 2 papers in the 1960s, there is a first small peak in the 1970s, and a second and large peak starting from 1980s which, from the observed tendency, may end by 2010.

A schematic figure of the amount of published studies related to the move-to-front rules.

A main streem of the research in the move-to-fron rules is its application to theoretical studies of least-recently-used (LRU) caching in the research field of data structure in computer sciences. It is usually the case that there is a hierarchy in the data storage system, with small amount of expensive but quick memory devices, named cache memory, reserved for frequently called data. In using cache memory, one needs an algorithm to decide quickly and costlessly which pieces of data would be more frequently used, or equivalently, which data should be dropped off from the cache each time a new candidate data appears. One of the simplest (quickest and of low cost) algorithm named as LRU caching decides that each called data as likely to be called soon again, and places at the top position in the cache, and the least recently used data will be dropped to make room for the newly cached data. With a bit of thought, we see that this algorithm is equivalent to move-to-front rules (or stochastic ranking process). In this application, it is important to know the cache hit or cache fault probabilities, which evaluate search costs.

As we see, our original motivation was very remote from fashonable and important streem of studies, but a basic mathematical model turned out to be common. After going through these references, we found out that the main consequences of our study are new, in spite of this half a century long history of extensive researches related to the model. Our main results are, as introduced in this page so far,

  1. existence of inifinite particle limit for the joint jump rate and spatial position random distribution,
  2. a non-linear partial differential equation for the limit of the joint distribution (hydrodynamic limit) and its unique global solution,
  3. application to web ranking, statistical fit, and analysis of long tails.

We also note that, though our motivation had nothing to do with LRU caching, our results on the hydrodynamic limit can be successfully applied to this long existing problem. For details please see the original papers in the reference list at the end of this page, together with further list of references of extensive studies on move-to-front rules and application to LRU caching. Concerning the references, I heartly thank Dr. Nobuaki Sugimine for bringing the keyword `move-to-front" to my attention.


A summary of progress around 2009-2010 (480KBpdf file)

(added 2018/02) Here is a separate web page for major mathematical advance after I wrote the contents above in this page. It focuses on additional stochastic dependence among particles (processes) arising from position depndence of intensity functions.

The page is more for enthusiast on mathematical analysis technique (than, eg., for practical interest on how you can read business success in a specific company from ranking data :-).


References. If your browser allows Japanese characters, here is a page for introduction containing in addition a list of seminar and meeting records.
list of papers for stochastic ranking process>
front page> inserted by FC2 system