The growth of technology and automation coupled with exponential amounts of data have led to the importance and omnipresence of databases which, simple put, are organized collections of data. Considering a naive approach, one can theoretically keep all the data in one large table, however that increases the access time in searching for a record, security issues if the master table is destroyed, redundant storage of information and other issues. So tables are decomposed into multiple smaller tables.
For retrieving information from multiple tables, we need to extract selected data from different records, using operations called join(inner join, outer join and most importantly natural join). Consider 2 table schemas employee(employee_name, street, city)with n rows and works(employee_name, branch_name, salary) with m rows. A cartesian product of these 2 tables creates a table with n*m rows. A natural join selects from this n*m rows all rows with same values for employee_name. To avoid loss of information(some tuples in employee have no corresponding tuples in works) we use left outer join or right outer join.
A join or a nested query is better subject to conditions:
- Suppose our 2 tables are stored on a local system. Performing a join or a nested query will make little difference. Now let tables be stored across a distributed databases. For a nested query, we only extract the relevant information from each table, located on different computers, then merge the tuples obtained to obtain the result. For a join, we would be required to fetch the whole table from each site and create a large table from which the filtering will occur, hence more time will be required. So for distributed databases, nested queries are better.
- RDBMS optimizer is concerned with performance related to the subquery or join written by the programmer. Joins are universally understood hence no optimization issues can arise. If portability across multiple platforms is called for, avoid subqueries as it may run into bugs(SQL server more adept with joins as its usually used with Microsoft’s graphical query editors that use joins).
- Implementation specific: Suppose we have queries where a few of the nested queries are constant. In MySQL, every constant subquery would be evaluated as many times as encountered, there being no cache facility. This is an obvious problem if the constant subquery involves large tuples. Subqueries return a set of data. Joins return a dataset which is necessarily indexed. Working on indexed data is faster so if the dataset returned by subqueries is large, joins are a better idea.
- Subqueries may take longer to execute than joins depending on how the database optimizer treats them(maybe converted to joins). Subqueries are easier to read, understand and evaluate than cryptic joins. They allow a bottom-up approach, isolating and completing each task sequentially.
Refer – Join operation Vs nested query