Advanced Database Systems “GGLE PROJECT”


Part 2




Felix Hageloh  ID: 0425257

Roberto Valenti ID: 0493198


Fragmented Schema Design


For our distributed system we decided to use horizontal fragmentation, as we thought it was best suited for this kind of application. Since most queries in a search engine involve scanning through a vast number of urls, we gain the most speed if we divide this big set into smaller chunks and scan them in parallel. Since our urls are stored in three different columns (domain, path and page), one could argue that we could also split vertically and do column scans in parallel. However, each fragment would still have to scan over the total number of entries, plus we would have more overhead shipping information such as which domain belongs to which entry. Also, if we were to build a serious system with many fragments, vertical fragmentation would limit us to as many fragments as we have columns, whereas ideally we should have many more than that. Horizontal fragmentation therefore seems to be the best solution.

 

Our fragmentation scheme is fairly simple: all entries are split in a round-robin fashion over the two fragments, so all entries with an even id land on one fragment and all odd ones on the other. All links are placed on the same fragment as the entry they originate from. Another (maybe better) derived fragmentation would be to place the links together with the entry that they link to, but that would leave the question of where to put the links that don't have a matching entry.
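As a small illustration (in Java, which we also use for our client code), the placement rule can be written as a pure function of the entry id; the class and method names here are our own invention, not part of the actual system:

```java
// Sketch of our placement rule: entries go to a fragment by id parity,
// and each link is co-located with the entry it originates from.
public class Placement {
    // Fragment 0 holds the even entry ids, fragment 1 the odd ones.
    public static int fragmentOfEntry(int entryId) {
        return entryId % 2;
    }

    // Derived fragmentation: a link lives where its source entry lives.
    public static int fragmentOfLink(int sourceEntryId, int targetEntryId) {
        return fragmentOfEntry(sourceEntryId);
    }
}
```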

 

Our tables thus have exactly the same structure as in the last assignment and can be found in the file schema.sql. To avoid confusion we also used two differently named dbfarms (ggle_1 and ggl_2) and originally also had two differently named schemas (again ggl1 and ggl2) on each corresponding db. However, the performance history doesn't seem to work outside the sys schema, so we placed our tables there instead.

 

Fragmented Query Design


For each query, we give the original query and then the design for executing it on our distributed system. The implementation of the queries can be found in the directory ./Queries.


Query 1:

 

 

SELECT EntryId, URL

      FROM Entry

      WHERE url = "http://www.php.net/index.php";

 

This is a very simple one, because we can just run the same query on both fragments in parallel and combine the result in any manner, since no sorting or grouping is required.
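A sketch of this fan-out in Java: the two suppliers stand in for the actual JDBC calls against the two fragments (the class name and the use of `Supplier` are illustrative, not our actual client code):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Supplier;

// Run the same selection on both fragments in parallel and concatenate
// the rows; any order is acceptable since no sorting or grouping is needed.
public class ParallelSelect {
    public static List<String> runOnBothFragments(
            Supplier<List<String>> frag1, Supplier<List<String>> frag2) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<List<String>> r1 = pool.submit(frag1::get);
            Future<List<String>> r2 = pool.submit(frag2::get);
            List<String> combined = new ArrayList<>(r1.get());
            combined.addAll(r2.get());
            return combined;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```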

 

 

Query 2:

 

SELECT URL

      FROM Entry

      WHERE LOWER(Domain) LIKE "%math%"  OR LOWER(Path) LIKE "%math%"

            OR LOWER(Page) LIKE "%math%";

 

Same as query 1.

 

Query 3:


SELECT COUNT(*) FROM Entry UNION SELECT COUNT(*) FROM Link;

 

Similar to queries 1 and 2, but here we need to sum the per-fragment counts once they are returned. This is done inside Java. (Note that the UNION in the original query would collapse the two counts into one row if they happened to be equal; UNION ALL avoids this.)
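The summing step can be sketched as follows, assuming the client reads the Entry count and the Link count separately from each fragment (the row layout is our own simplification):

```java
import java.util.List;

// Each fragment returns its partial counts; the client adds them up.
// partial[0] = COUNT(*) of Entry on that fragment, partial[1] = COUNT(*) of Link.
public class CountMerge {
    public static long[] sumCounts(List<long[]> partials) {
        long[] totals = new long[] {0, 0};
        for (long[] p : partials) {
            totals[0] += p[0];  // total number of entries
            totals[1] += p[1];  // total number of links
        }
        return totals;
    }
}
```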

 

Query 4:

 

SELECT E.Domain, COUNT(L.Entry) AS Number_of_links

      FROM Entry E, Link L

      WHERE L.Entry = E.EntryId

      GROUP BY E.Domain

      HAVING E.Domain = "http://www.php.net";

 

This one gets more complicated. It is a join, but we don’t need to ship any data between the two fragments, because of our derived fragmentation. However, we have to worry about the grouping. Again we can do this inside Java. When the results arrive from both fragments, we put all results from one fragment into a hash table using the domain as key. Then for each result of the second fragment we have to check if the domain already exists in the hash table. If so, we add the counted links to the existing entry; if not we simply add the row to the hash.
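The hash-table merge described above can be sketched in Java, with rows simplified to (domain, count) pairs:

```java
import java.util.*;

// Merge partial GROUP BY results from both fragments in a hash table
// keyed by domain: counts for a domain seen on both sides are added.
public class GroupMerge {
    public static Map<String, Long> merge(List<Map.Entry<String, Long>> frag1,
                                          List<Map.Entry<String, Long>> frag2) {
        Map<String, Long> groups = new HashMap<>();
        for (Map.Entry<String, Long> row : frag1)
            groups.merge(row.getKey(), row.getValue(), Long::sum);
        for (Map.Entry<String, Long> row : frag2)
            groups.merge(row.getKey(), row.getValue(), Long::sum);
        return groups;
    }
}
```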

 

Query 5:

 

SELECT U.URL, COUNT(LinksTo) As num_linked_from

      FROM Entry U, Link L

      WHERE U.EntryId = L.LinksTo AND (LOWER(U.Domain) LIKE "%math%" 

            OR LOWER(U.Path) LIKE "%math%"

            OR LOWER(U.Page) LIKE "%math%")

      GROUP BY U.URL

      ORDER BY num_linked_from DESC

      LIMIT 10;

 

For this query we need to ship data for the first time. First we need to ship the missing links to each fragment to compute the aggregate. However, we only need to ship the LinksTo column, and since we know that all even entry ids are on one fragment and all odd ones on the other, we only need to ship the rows with an even value to one fragment and those with an odd value to the other. So the first queries look like this:

 

Frag1: SELECT LinksTo FROM Link WHERE MOD(LinksTo, 2) = 1;

Frag2: SELECT LinksTo FROM Link WHERE MOD(LinksTo, 2) = 0;
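Equivalently, once the client has read the LinksTo values from one fragment, the shipping destination of each value follows from its parity; a sketch (fragment numbering here is illustrative):

```java
import java.util.*;

// Split LinksTo values by parity so each value is shipped only to the
// fragment that stores the target entry (even ids on fragment 0, odd on 1).
public class ShipLinks {
    public static Map<Integer, List<Integer>> partitionByParity(List<Integer> linksTo) {
        Map<Integer, List<Integer>> byFragment = new HashMap<>();
        byFragment.put(0, new ArrayList<>());
        byFragment.put(1, new ArrayList<>());
        for (int target : linksTo)
            byFragment.get(target % 2).add(target);
        return byFragment;
    }
}
```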

 

Then we insert those links into temporary tables like this:

 

Frag1:

CREATE TABLE Tmp1( LinksTo INT );

 

For each value in result from frag2

      INSERT INTO Tmp1 VALUES( value );

End

 

And the same of course on the other fragment. Then we can get a partial result by issuing the following query on both fragments:

 

SELECT U.URL, COUNT(LinksTo) AS num_linked_from
      FROM Entry U, (SELECT LinksTo FROM Link
                     UNION ALL
                     SELECT LinksTo FROM Tmp1) AS L
      WHERE U.EntryId = L.LinksTo
            AND (LOWER(U.Domain) LIKE '%math%' OR LOWER(U.Path) LIKE '%math%'
                 OR LOWER(U.Page) LIKE '%math%')
      GROUP BY U.URL
      ORDER BY num_linked_from DESC;

 

 

Then, to combine the results from both queries, we have to maintain the correct ordering. Notice that each partial result is sorted in itself, so we can produce the output with the following simple merge algorithm:

 

Advance both cursors to their first row
WHILE any of the two results still has a current row

      Treat an exhausted result as 0 (or any value smaller than any possible count)

      IF result1 is bigger than result2 THEN
            Return the current row of result1 and advance its cursor
      ELSE IF result2 is bigger than result1 THEN
            Return the current row of result2 and advance its cursor
      ELSE (they are equal)
            Return both rows and advance both cursors
      END
END

Note that only the cursor whose row was returned may be advanced; otherwise rows would be skipped.

 

Since we also have a LIMIT clause, we should introduce a counter and stop returning results once the limit is reached.

Of course this algorithm becomes more complex if the ORDER BY clause includes several columns.
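For the single-column case, the ordered merge with LIMIT can be sketched in Java (rows simplified to (url, count) pairs, both inputs sorted by count descending):

```java
import java.util.*;

// Two-cursor merge of two descending-sorted partial results,
// stopping as soon as the LIMIT is reached. Only the cursor whose
// row is emitted advances, so no row is skipped.
public class OrderedMerge {
    public static List<Map.Entry<String, Long>> mergeDesc(
            List<Map.Entry<String, Long>> a,
            List<Map.Entry<String, Long>> b, int limit) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        int i = 0, j = 0;
        while (out.size() < limit && (i < a.size() || j < b.size())) {
            // An exhausted side counts as -infinity, i.e. it always loses.
            long ca = i < a.size() ? a.get(i).getValue() : Long.MIN_VALUE;
            long cb = j < b.size() ? b.get(j).getValue() : Long.MIN_VALUE;
            if (ca >= cb) out.add(a.get(i++));
            else out.add(b.get(j++));
        }
        return out;
    }
}
```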

 




Query 6:

 

SELECT COUNT(*)
      FROM ( SELECT DISTINCT URL
                        FROM Entry
                        WHERE Domain LIKE "http:%"
                  UNION
                  SELECT DISTINCT URL
                        FROM Link
                        WHERE Domain LIKE "http:%") AS Urls;

 

For this query we need to ship all urls with protocol http to one fragment. So first we run the query

 

QueryI:

SELECT URL
      FROM ( SELECT DISTINCT URL
                        FROM Entry
                        WHERE Domain LIKE "http:%"
                  UNION
                  SELECT DISTINCT URL
                        FROM Link
                        WHERE Domain LIKE "http:%") AS Url_T;

 

on one fragment. Meanwhile we create a view on the other fragment like:

 

CREATE VIEW Urls AS QueryI

 

On the same fragment we also create a tmp table and insert all results from the other fragment. Then the final result can be obtained by:

 

SELECT COUNT(*)

FROM (SELECT Url FROM Urls

UNION

SELECT Url FROM Tmp1) AS U;


Query 7:


Our proposed solution for the non-distributed db was

 

SELECT E.EntryId, COUNT( DISTINCT P.EntryId )

      FROM Entry E, Entry P, Link L

      WHERE L.Entry = P.EntryId AND L.Domain <> E.Domain

            AND P.DOC + '4 MONTHS' < CURRENT_DATE

      GROUP BY P.EntryId, E.EntryId;



didn't help either. Both seem to be correct though, as we tested them by limiting the number of entries to 10 (WHERE L.Entry = P.EntryId AND L.Domain <> E.Domain AND E.EntryId < 11), which produced results that seem reasonable. The numbers are quite large, but always below the total number of entries. However, our date check doesn't work. DATE + '4 MONTHS' returns the same date (a bug?), and we couldn't find any other way to calculate the date difference in Monet. CURRENT_DATE - DOC returns an integer with the number of days, so we couldn't use EXTRACT(MONTH ...). We could divide by 30, which would give an approximate result, but that's not very nice.

 

Since we couldn't get the query to work for all entries, nor get the correct date difference, we didn't implement it for the distributed setting. Since we would have to ship all entries to one fragment, it would be a great speedup to select only the entryIds of entries older than 4 months and ship just those.

 

So, to implement this query in a distributed setting we would have to ship all entries that are older than 4 months, plus the domain of all links that refer to an even (or odd, depending on which side we ship to) entry, and then calculate the join there. This would certainly take a huge amount of time, considering how slow the query already is on one fragment.

 

How could we extend our system to become a more general query processor?

 

In our system we have so far hard-coded the execution of our 7 distributed queries. To design a system that can process any query in our distributed environment, we could use the following approach: first parse the query's SQL statement, and then use a set of stored rules to execute it. The rules would have to look something like this:

 

- The query contains no join, grouping, aggregate or ordering: just send the same query to all fragments and concatenate the results.

- The query contains a GROUP BY clause: send the same query to all fragments and then combine the results using a grouping algorithm, as in our implementation of query 4 (using a hash table).

- The query contains an aggregate function: send the same query to all fragments, which will each return a sub-result of the aggregate, and then apply the same function to the results from all fragments.

- The query contains an ORDER BY clause: send the same query to all fragments and merge the results using a sorting algorithm, as proposed in our implementation of query 5.

- The query contains a join: the general approach is to ship the information that is missing on each fragment from the fragment where it is stored. This should be limited to only the columns that are required for the join. Moreover, we can use rewrite rules from relational algebra, useful ones being that a join can be expressed as several semi-joins, and that a join followed by a selection is the same as a selection followed by a join. We can also use knowledge of our schema to make the query more efficient. Of course, for joins that define our derived fragmentation, we don't have to do anything. Also, we can use the fact that in our schema all even EntryIds are in one fragment and all odd ones in the other. See query 5 for more details.

 

 

Queries that combine several of these features have to use a combination of the rules. Of course this leads to a quite complicated system, and in order to maximize speed we would also have to apply some heuristics or optimization plans.
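As a very rough sketch of the rule lookup, the executor could first classify a statement by the features it contains and then pick the matching combination strategy. A real system would inspect the parse tree; the keyword matching below is purely illustrative:

```java
import java.util.*;

// Classify an SQL statement by the distribution-relevant features it
// contains. An empty result means: just concatenate the fragment results.
public class RuleDispatch {
    public static Set<String> features(String sql) {
        String s = sql.toUpperCase();
        Set<String> f = new LinkedHashSet<>();
        if (s.contains("GROUP BY")) f.add("GROUPING");
        if (s.contains("ORDER BY")) f.add("ORDERING");
        if (s.contains("COUNT(") || s.contains("SUM(")) f.add("AGGREGATE");
        // A comma in the FROM clause or an explicit JOIN keyword signals a join.
        int from = s.indexOf("FROM");
        String fromPart = from >= 0 ? s.substring(from) : "";
        if (s.contains(" JOIN ") || fromPart.contains(",")) f.add("JOIN");
        return f;
    }
}
```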



Performance

 

All experiments were run on an 850 MHz Pentium III with 512 MB of memory, running Windows XP.

 

RESPONSE TIME IN MILLISECONDS

             MonetDB   Distr. MonetDB *
Query 1           20               27
Query 2          204              184
Query 3           65               70
Query 4          210              227
Query 5          208            16791
Query 6        12036            43530

RESPONSE TIME PER TRANSACTION OVER 20 RUNS

             MonetDB   Distr. MonetDB *
Run 1          12748            61749
Run 2          12729            61579
Run 3          12728            60747
Run 4          12808            61348
Run 5          12759            61389
Run 6          12758            63070
Run 7          12788            62761
Run 8          12739            62209
Run 9          12728            64242
Run 10         12748            62300
Run 11         12799            62019
Run 12         12758            58745
Run 13         12748            59345
Run 14         12759            59445
Run 15         12728            59606
Run 16         12708            59826
Run 17         12739            59536
Run 18         12868            60096
Run 19         12748            60227
Run 20         12709            61198

* Two instances of MonetDB running on the same machine



Our small benchmark of the normal ggle db vs. the distributed one shows that for most queries, and for loading the db, the two dbs perform almost the same. The distributed db is usually a bit slower, which is due to the overhead of the distributed processing. However, we have to highlight that our distributed dbs were actually running on the same machine; we should definitely see a performance increase if the dbs resided on different computers, given a reasonably fast network.

However, two queries (5 and 6) give disastrously poor performance on the distributed db (that's why they don't appear in the first graph), which is due to the fact that we have to ship data between the dbs. So a big improvement of our solution is required here. One option might be to transfer the data directly between the two dbs instead of first from one db to the client software and from there to the other db.

Another option might be to break a join down into a series of select statements. For example, to find the number of links that refer to each entry, we can first retrieve all entries from both fragments, then for each entry query both fragments for how many links refer to that entry, and combine the results.
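This series-of-selects idea can be sketched as follows; the Function arguments stand in for the per-fragment count queries (names and types are our own simplification):

```java
import java.util.*;
import java.util.function.Function;

// For each entry id, ask both fragments how many links point to it
// and add the two partial counts; no rows have to be shipped.
public class JoinAsSelects {
    public static Map<Integer, Long> linkCounts(
            List<Integer> entryIds,
            Function<Integer, Long> countOnFrag1,
            Function<Integer, Long> countOnFrag2) {
        Map<Integer, Long> counts = new LinkedHashMap<>();
        for (int id : entryIds)
            counts.put(id, countOnFrag1.apply(id) + countOnFrag2.apply(id));
        return counts;
    }
}
```

Note that this trades data shipping for one round trip per entry, so it only pays off when the number of entries involved is small.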

Another wise choice would certainly be to precompute and store common joins, for example the number of links referring to each entry (also over several recursive levels). This would increase the time to load the db quite a bit, but would increase query speed tremendously, which is what counts for a search engine. A better-designed derived fragmentation might also give performance increases.

 

Note: We were not able to get the performance history of the individual MIL operations. Querying the history table just gave the SQL statements of our queries, with all relevant entries being 0.