The dRuby Book

10.3 Crawling Interval and Synchronization with Indexer

One thing I wanted to demonstrate with this sample application is that each program handles its own task at whatever pace suits it.

The crawler starts its work periodically. It doesn’t worry about the status of the indexer; it just finds updates and writes them to Drip. The same applies to the indexer, which doesn’t worry about the status of the crawler. It reads the documents stored in Drip in batches and updates the index. Once all the documents are processed, the indexer sleeps until a new document is written.
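The write-then-pull rhythm described here can be tried out without a running Drip server. The following is only a minimal sketch: TinyLog, read_after, and the sample documents are hypothetical names made up for illustration. It is an in-memory stand-in for the idea, not Drip’s API and not the code of the sample application.

```ruby
# Hypothetical in-memory stand-in for Drip: an append-only log in which
# every reader keeps its own cursor. This is NOT Drip's API.
class TinyLog
  def initialize
    @entries = []                       # [key, value] pairs; keys are 1, 2, 3, ...
    @lock = Mutex.new
    @arrived = ConditionVariable.new
  end

  # Append a value and wake any sleeping reader. Returns the new key.
  def write(value)
    @lock.synchronize do
      key = @entries.size + 1
      @entries << [key, value]
      @arrived.broadcast
      key
    end
  end

  # Return up to `limit` entries newer than `cursor`, sleeping until at
  # least one such entry exists.
  def read_after(cursor, limit)
    @lock.synchronize do
      @arrived.wait(@lock) while @entries.size <= cursor
      @entries[cursor, limit]
    end
  end
end

log = TinyLog.new
index = Hash.new { |hash, word| hash[word] = [] }   # word => document names

# Indexer: pulls batches at its own pace and sleeps inside read_after when idle.
indexer = Thread.new do
  cursor = 0
  while cursor < 3                      # the demo stops after three documents
    log.read_after(cursor, 2).each do |key, (name, text)|
      text.split.each { |word| index[word] << name }
      cursor = key
    end
  end
end

# Crawler: writes what it finds and never checks on the indexer.
[["a.rb", "drip queue"], ["b.rb", "drip index"], ["c.rb", "queue"]].each do |doc|
  log.write(doc)
end

indexer.join
p index["drip"]    # => ["a.rb", "b.rb"]
```

Neither side calls the other. The crawler only appends, and the indexer only asks for whatever lies beyond its own cursor, so either one can be slowed down or restarted without the other noticing.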

images/dripsearch.png

Figure 44. Crawler and indexer working independently through MyDrip

If you draw a diagram of the data flow, it looks like Figure 44, Crawler and indexer working independently through MyDrip. The data starts at the crawler, is stored in Drip, and is then taken out by the indexer for indexing. However, the indexer has no direct dependency on the crawler. As a comparison, consider building this search system with the Observer pattern. The crawler would notify the indexer of each update, and all indexing would happen inside the chain of calls from the crawler into the indexer’s callback methods. Because a notification is an ordinary method call that returns only after the callbacks finish, crawling could proceed no faster than indexing.
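To make the Observer comparison concrete, here is a small sketch of that alternative design. PushCrawler, SlowIndexer, and the 0.05-second indexing cost are all hypothetical, chosen only for illustration; the Observer is hand-rolled so the example has no dependencies, and none of it comes from the sample application.

```ruby
# Hypothetical names throughout: a hand-rolled Observer for illustration only.
class PushCrawler
  def initialize
    @observers = []
  end

  def add_observer(observer)
    @observers << observer
  end

  # Notifying is an ordinary method call, so found returns only after
  # every observer's update method has finished.
  def found(name)
    @observers.each { |observer| observer.update(name) }
  end
end

class SlowIndexer
  attr_reader :indexed

  def initialize(cost)
    @cost = cost        # seconds spent indexing one document (assumed value)
    @indexed = []
  end

  # Callback invoked by the crawler for each document it finds.
  def update(name)
    sleep @cost
    @indexed << name
  end
end

crawler = PushCrawler.new
indexer = SlowIndexer.new(0.05)
crawler.add_observer(indexer)

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
%w[a.rb b.rb c.rb].each { |name| crawler.found(name) }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

# The crawl loop cannot finish before the three callbacks have run.
puts format("crawled %d documents in %.2f seconds", indexer.indexed.size, elapsed)
```

The crawl loop takes at least as long as three indexing callbacks, because each call to found blocks inside update. The crawler runs at the indexer’s speed, which is exactly the coupling that the Drip-based design avoids.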

With Drip, a listener doesn’t receive notifications passively. It actively fetches the update information when it’s convenient for the listener itself. This is similar to how the Actor model works (see Rinda::rinda_eval and the Actor Model). The indexer takes out the next task only when all of its existing tasks are complete and it’s ready for the next one. This is the opposite of how dRuby works, because a dRuby server receives RMI calls in subthreads regardless of whether the server is busy.

Enough detail. The important point is that the crawler can work without waiting for the indexer, and the indexer can work regardless of how often the crawler runs. In a loose sense, Drip acts as messaging middleware.