Advanced Database Systems "GGLE PROJECT" Part 2

Felix Hageloh, ID: 0425257
Roberto Valenti, ID: 0493198

Fragmented Schema Design

For our distributed system we decided on horizontal fragmentation, as we thought it best suited for this kind of application. Since most queries in a search engine involve scanning through a vast number of URLs, we gain the most speed if we divide this big set into smaller chunks and scan them in parallel. Since our URLs are stored in three different columns (domain, path and page), one could argue that we could also split vertically and do column scans in parallel. However, each fragment would still have to scan over the total number of entries, plus we would have more overhead shipping information, like which domain belongs to which entry. Also, if we were to build a serious system with many fragments, vertical fragmentation would limit us to as many fragments as we have columns, but ideally we should have many more than that. So horizontal fragmentation seems to be the best solution.
Our fragmentation scheme is fairly simple: all entries are split in a round-robin fashion over the two fragments, so all entries with an even id land on one fragment and all odd ones on the other. All links are placed on the same fragment as the entry they originate from. Another (maybe better) derived fragmentation would be to place the links together with the entry that they link to, but that would leave the question of where to put the links that don't have a matching entry.
Our tables thus have exactly the same structure as for the last assignment and can be found in the file schema.sql. To avoid confusion we also used two differently named dbfarms (ggle_1 and ggl_2) and originally also had two differently named schemas (again ggl1 and ggl2) on each corresponding db. However, the performance history doesn't seem to work outside the sys schema, so we placed our tables there instead.
Fragmented Query Design

For each query, we give the original query and then the design for executing it on our distributed system. The implementation of the queries can be found in the directory ./Queries

Query 1:
SELECT EntryId, URL FROM Entry WHERE URL = 'http://www.php.net/index.php';
This is a very simple one, because we can just run the same query on both fragments in parallel and combine the result in any manner, since no sorting or grouping is required.
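The combination step for this query can be sketched in Java as follows (the in-memory fragment stubs and class name are illustrative; the real client would send the SQL over a connection to each db):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the client-side logic for query 1: run the same SELECT on
// both fragments in parallel and simply concatenate the partial results.
// Fragment access is stubbed with in-memory lists here; the real client
// would issue the query to each MonetDB instance instead.
public class ParallelQuery {

    // Stand-in for executing the query on one fragment.
    static List<String> queryFragment(List<String> fragmentRows) {
        return new ArrayList<>(fragmentRows);
    }

    // Run the query on both fragments concurrently, then concatenate.
    static List<String> runOnBothFragments(List<String> frag1, List<String> frag2) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<List<String>> f1 = pool.submit(() -> queryFragment(frag1));
            Future<List<String>> f2 = pool.submit(() -> queryFragment(frag2));
            List<String> result = new ArrayList<>(f1.get());
            result.addAll(f2.get()); // no sorting or grouping needed
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> evenFrag = Arrays.asList("url2", "url4");
        List<String> oddFrag = Arrays.asList("url1", "url3");
        System.out.println(runOnBothFragments(evenFrag, oddFrag));
    }
}
```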
Query 2:
SELECT URL FROM Entry WHERE LOWER(Domain) LIKE '%math%' OR LOWER(Path) LIKE '%math%' OR LOWER(Page) LIKE '%math%';
Same as query 1.
Query 3:

SELECT COUNT(*) FROM Entry UNION SELECT COUNT(*) FROM Link;
Similar to queries 1 and 2, but here we need to sum the results for each row once they are returned. This is done inside Java.
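The summation could look like the following sketch (it assumes the partial counts arrive from each fragment in a known order, entries first and links second, which is an assumption rather than something the UNION guarantees):

```java
// Sketch for query 3: each fragment returns its partial counts (assumed
// here to arrive in a known order: entries first, then links); the
// client sums them position-wise to get the totals.
public class CombineCounts {

    // Sum two arrays of partial counts element by element.
    static long[] combine(long[] frag1Counts, long[] frag2Counts) {
        long[] totals = new long[frag1Counts.length];
        for (int i = 0; i < totals.length; i++)
            totals[i] = frag1Counts[i] + frag2Counts[i];
        return totals;
    }

    public static void main(String[] args) {
        long[] totals = combine(new long[]{10, 25}, new long[]{11, 24});
        System.out.println(totals[0] + " entries, " + totals[1] + " links");
    }
}
```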
Query 4:
SELECT E.Domain, COUNT(L.Entry) AS Number_of_links FROM Entry E, Link L WHERE L.Entry = E.EntryId GROUP BY E.Domain HAVING E.Domain = 'http://www.php.net';
This one gets more complicated. It is a join, but we don’t need to ship any data between the two fragments, because of our derived fragmentation. However, we have to worry about the grouping. Again we can do this inside Java. When the results arrive from both fragments, we put all results from one fragment into a hash table using the domain as key. Then for each result of the second fragment we have to check if the domain already exists in the hash table. If so, we add the counted links to the existing entry; if not we simply add the row to the hash.
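The hash-table combination step described above can be sketched like this (the class name and the use of plain maps for the partial results are illustrative):

```java
import java.util.*;

// Sketch for query 4: merge two partial GROUP BY results
// (domain -> link count) coming from the fragments, using a hash
// table keyed on the domain.
public class MergeGroups {

    static Map<String, Long> merge(Map<String, Long> part1, Map<String, Long> part2) {
        // Start with all groups of the first partial result.
        Map<String, Long> combined = new HashMap<>(part1);
        // Add each group of the second partial result; if the domain
        // already exists, sum the counted links.
        for (Map.Entry<String, Long> e : part2.entrySet())
            combined.merge(e.getKey(), e.getValue(), Long::sum);
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Long> p1 = Map.of("http://www.php.net", 7L);
        Map<String, Long> p2 = Map.of("http://www.php.net", 5L, "http://www.uva.nl", 2L);
        System.out.println(merge(p1, p2));
    }
}
```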
Query 5:
SELECT U.URL, COUNT(LinksTo) AS num_linked_from FROM Entry U, Link L WHERE U.EntryId = L.LinksTo AND (LOWER(U.Domain) LIKE '%math%' OR LOWER(U.Path) LIKE '%math%' OR LOWER(U.Page) LIKE '%math%') GROUP BY U.URL ORDER BY num_linked_from DESC LIMIT 10;
For this query we need to ship data for the first time. First we need to ship the missing links to each fragment to compute the aggregate. However, we only need to ship the LinksTo column and since we know that all even entry ids are on one fragment and all the odd ones on the other, we only need to ship the rows with an even value to one, and the ones with an odd value to the other fragment. So the first queries look like this:
Frag1: SELECT LinksTo FROM Link WHERE MOD(LinksTo, 2) = 1;
Frag2: SELECT LinksTo FROM Link WHERE MOD(LinksTo, 2) = 0;
Then we insert those links into temporary tables like this:
Frag1: CREATE TABLE Tmp1( LinksTo INT );

FOR each value in the result from Frag2
    INSERT INTO Tmp1 VALUES( value );
END
And of course the same on the other fragment. Then we can get a partial result by issuing the following query on both fragments:
SELECT U.URL, COUNT(LinksTo) AS num_linked_from FROM Entry U, (SELECT LinksTo FROM Link UNION ALL SELECT LinksTo FROM Tmp1 ) AS L WHERE U.EntryId = L.LinksTo AND (LOWER(U.Domain) LIKE '%math%' OR LOWER(U.Path) LIKE '%math%' OR LOWER(U.Page) LIKE '%math%') GROUP BY U.URL ORDER BY num_linked_from DESC;
Then, to combine the results from both queries, we have to maintain the correct ordering. Notice that each partial result is sorted in itself, so we can produce the output with the following simple merge algorithm:
WHILE any of the two results still has rows
    Fetch the current row of each result
    IF a result has no rows left THEN
        Treat its value as 0 (or any value smaller than any possible result)
    END
    IF result1 is bigger than result2 THEN
        Output result1 and advance cursor 1
    ELSE IF result2 is bigger than result1 THEN
        Output result2 and advance cursor 2
    ELSE (they are equal)
        Output result1 and result2 and advance both cursors
    END
END
Since we also have a LIMIT clause, we should also introduce a counter and stop returning results when the limit is reached. Of course this algorithm becomes more complex if the ORDER BY clause includes several columns.
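A minimal Java sketch of this merge, simplified to bare counts (real rows would carry the URL alongside the count, and ties here emit the first result's row first, which matches returning both):

```java
import java.util.*;

// Sketch for query 5: merge two partial results that are each already
// sorted descending on the aggregate, stopping once the LIMIT is reached.
public class MergeSorted {

    static List<Integer> mergeDesc(List<Integer> r1, List<Integer> r2, int limit) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (out.size() < limit && (i < r1.size() || j < r2.size())) {
            // A missing value counts as smaller than any real one.
            int v1 = i < r1.size() ? r1.get(i) : Integer.MIN_VALUE;
            int v2 = j < r2.size() ? r2.get(j) : Integer.MIN_VALUE;
            if (v1 >= v2) { out.add(v1); i++; }  // emit from result 1
            else          { out.add(v2); j++; }  // emit from result 2
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mergeDesc(Arrays.asList(9, 4, 1), Arrays.asList(7, 6), 4));
    }
}
```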
Query 6:
SELECT COUNT(*) FROM ( SELECT DISTINCT URL FROM Entry WHERE Domain LIKE 'http:%' UNION SELECT DISTINCT URL FROM Link WHERE Domain LIKE 'http:%') AS Urls;
For this query we need to ship all urls with protocol http to one fragment. So first we run the query
QueryI: SELECT URL FROM ( SELECT DISTINCT URL FROM Entry WHERE Domain LIKE 'http:%' UNION SELECT DISTINCT URL FROM Link WHERE Domain LIKE 'http:%') AS Url_T;
on one fragment. Meanwhile we create a view on the other fragment like:
CREATE VIEW Urls AS QueryI
On the same fragment we also create a tmp table and insert all results from the other fragment. Then the final result can be obtained by:
SELECT COUNT(*) FROM (SELECT Url FROM Urls UNION SELECT Url FROM Tmp1) AS U;

Query 7:

Our proposed solution for the non-distributed db was
SELECT E.EntryId, COUNT( DISTINCT P.EntryId ) FROM Entry E, Entry P, Link L WHERE L.Entry = P.EntryId AND L.Domain <> E.Domain AND P.DOC + '4 MONTHS' < CURRENT_DATE GROUP BY P.EntryId, E.EntryId;

A rewritten version of the query didn't help either. Both seem to be correct though, as we tested them by limiting the number of entries to 10 (WHERE L.Entry = P.EntryId AND L.Domain <> E.Domain AND E.EntryId < 11), which produced results that seem reasonable. The numbers are quite large, but always below the total number of entries. However, our date check doesn't work: DATE + '4 MONTHS' returns the same date (a bug?), and we couldn't find any other way to calculate the date difference in Monet. CURRENT_DATE - DOC returns an integer with the number of days, so we couldn't use EXTRACT(MONTH ...). We could divide by 30, which would give an approximate result, but that's not very nice.
Since we couldn’t get the query work for all entries, nor get the correct date difference we didn’t implement it for the distributed setting. Since we would have to ship all Entries to one fragment, it would be a great speedup to select the entryIds from entries older than 4 months and ship only those.
So, to implement this query in a distributed setting we would have to ship all entries that are older than 4 months, plus the domain of all links that refer to an even (or odd, depending on which side we ship to) entry, and then calculate the join there. This would certainly take a huge amount of time, considering how slow the query already is on one fragment.
How could we extend our system to become a more general query processor?
In our system so far, we hard-coded the execution of our 7 distributed queries. To design a system that processes any query in our distributed environment, we could use the following approach: first parse the query's SQL statement, and then use some stored rules to execute it. The rules would have to look something like this:
- The query contains no join, grouping, aggregate or ordering: just send the same query to all fragments and concatenate the results.
- The query contains a GROUP BY clause: send the same query to all fragments and then combine the results using a grouping algorithm, as for example in our implementation of query 4 (using a hash table).
- The query contains an aggregate function: send the same query to all fragments, which will each return a sub-result of the aggregate, and then apply the same function to the results from all fragments.
- The query contains an ORDER BY clause: send the same query to all fragments and order the results using a sorting algorithm as proposed in our implementation of query 5.
- The query contains a join: the general approach is to ship the information that's missing on each fragment from the fragment where it is stored. This should be limited to only the columns that are required in the join. Moreover, we can use rewrite rules from relational algebra, useful ones being that a join can be expressed by several semi-joins and that a join followed by a selection is the same as a selection followed by a join. Also we can use knowledge of our schema to make the query more efficient. Of course, for joins that define our derived fragmentation, we don't have to do anything. Also we can use the fact that in our schema all even EntryIds are in one fragment and all odd ones in the other. See Query 5 for more details.
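The rule lookup could be sketched as follows. This is a deliberately naive version that scans the raw SQL text for keywords (a real system would inspect the parsed query tree), and it checks for joins first since those are the only rules that require shipping; all names here are illustrative:

```java
// Sketch of the rule lookup: classify a query by (naively) scanning
// the SQL text for keywords. Joins are checked first because they
// need data shipping; the remaining rules fall through in order.
public class RuleDispatcher {

    enum Strategy { SHIP_AND_JOIN, HASH_GROUP, MERGE_SORTED, RE_AGGREGATE, CONCAT }

    static Strategy pick(String sql) {
        String s = sql.toUpperCase();
        int from = s.indexOf("FROM");
        // A comma after FROM, or an explicit JOIN keyword, signals a join.
        boolean join = s.contains(" JOIN ") || (from >= 0 && s.indexOf(',', from) >= 0);
        if (join) return Strategy.SHIP_AND_JOIN;
        if (s.contains("GROUP BY")) return Strategy.HASH_GROUP;
        if (s.contains("ORDER BY")) return Strategy.MERGE_SORTED;
        if (s.contains("COUNT(") || s.contains("SUM(")) return Strategy.RE_AGGREGATE;
        return Strategy.CONCAT;
    }

    public static void main(String[] args) {
        System.out.println(pick("SELECT URL FROM Entry WHERE Domain LIKE '%math%'"));
    }
}
```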
Queries that have a combination of those features have to use a combination of the rules supplied. Of course this gives rise to a quite complicated system. Also, in order to maximize speed, we would have to apply some heuristics or optimization plans.

Performance
All experiments were run on an 850 MHz Pentium III machine with 512 MB of memory. OS: Windows XP.
Our small benchmark of the normal ggle db vs. the distributed one shows that for most queries, and for loading the db, the two dbs perform almost the same. The distributed db is usually a bit slower, which is due to the overhead of the distributed programming. However, we have to highlight here that our distributed dbs were actually running on the same machine. We should definitely see a performance increase if the dbs resided on different computers, given a reasonably fast network.

However, two queries (5 and 6) give disastrously poor performance on the distributed db (that's why they don't appear in the first graph), which is due to the fact that we have to ship data between the dbs. So a big improvement of our solution is required here. One option might be to transfer the data directly between the two dbs, and not first from one db to the client software and from there to the other db. Another option might be to break down a join into a series of select statements. For example, to find the number of links that refer to one entry, we can first retrieve all entries from both fragments, then for each entry query both fragments on how many links refer to that entry, and combine the results. Another wise choice would certainly be to store common joins, as for example the number of links referring to one entry (also on several recursive levels). This would increase the time to load the db quite a bit, but increase the query speed tremendously, which is what counts for a search engine. Also, a better designed derived fragmentation might give performance increases.
Note: We were not able to get the performance history of the individual MIL operations. Querying the history table just gave the SQL statements of our queries, with all relevant entries being 0.