Deeshen Shah | Projects | Personal Website

Biomedical Data Warehouse/OLAP System: September- October 2013

Implemented a clinical and genomic data warehouse based on the KAD-STAR schema design using SQL. The warehouse was designed to meet three requirements: 1) support regular and statistical OLAP operations; 2) remain robust to potential future changes; and 3) support knowledge discovery.

Clustering Algorithms: October- November 2013

Implemented three clustering algorithms to find clusters of genes that exhibit similar expression profiles: K-means, Hierarchical Agglomerative clustering with Single Link (Min), and density-based (DBSCAN algorithm). We compared these three methods and discussed their pros and cons. For each of the above tasks, we validated our clustering results using the following methods:

1) Chose an external index (Jaccard Coefficient) and compared the clustering results from the different clustering algorithms.

2) Chose an internal index (Correlation) and compared the clustering results.
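As an illustration of the external index, the pair-counting form of the Jaccard coefficient can be sketched in Java as below. This is a minimal sketch, not the project's code: the class and method names are hypothetical, and returning 1.0 when no pair of points is grouped together in either labeling is an assumed convention.

```java
final class JaccardIndex {
    /**
     * Pair-counting Jaccard coefficient between two labelings of the same points.
     * labelsA[i] and labelsB[i] are the cluster ids of point i in each labeling
     * (for example, a clustering result and the ground truth).
     */
    static double jaccard(int[] labelsA, int[] labelsB) {
        if (labelsA.length != labelsB.length) {
            throw new IllegalArgumentException("labelings must cover the same points");
        }
        long together = 0; // pairs grouped together in both labelings
        long disagree = 0; // pairs grouped together in exactly one labeling
        for (int i = 0; i < labelsA.length; i++) {
            for (int j = i + 1; j < labelsA.length; j++) {
                boolean sameA = labelsA[i] == labelsA[j];
                boolean sameB = labelsB[i] == labelsB[j];
                if (sameA && sameB) {
                    together++;
                } else if (sameA || sameB) {
                    disagree++;
                }
            }
        }
        long denom = together + disagree;
        // Assumed convention: no co-clustered pair in either labeling counts as full agreement.
        return denom == 0 ? 1.0 : (double) together / denom;
    }
}
```

Cluster ids only need to be consistent within one labeling, so two labelings that define the same partition under different ids score 1.0.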

Also, we visualized the data sets and clustering results by Principal Component Analysis (PCA).

We improved the running time of the K-means algorithm by implementing a MapReduce version of K-means, which we ran on a single-node Hadoop cluster installed on our machine.
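The way K-means splits into a map phase and a reduce phase can be sketched in plain Java as below. This is only an illustration under stated assumptions: it does not use the Hadoop API, the class and method names are hypothetical, and keeping an empty cluster's centroid unchanged is an assumed choice.

```java
final class KMeansStep {
    /** "Map" phase for one point: index of the nearest centroid (squared Euclidean distance). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** One full iteration: assign every point, then "reduce" each cluster to its mean. */
    static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int dims = centroids[0].length;
        double[][] sums = new double[k][dims];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centroids); // key emitted by the map phase
            counts[c]++;
            for (int j = 0; j < dims; j++) {
                sums[c][j] += p[j];
            }
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                next[c] = centroids[c].clone(); // assumed: keep the centroid of an empty cluster
            } else {
                next[c] = new double[dims];
                for (int j = 0; j < dims; j++) {
                    next[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return next;
    }
}
```

In a MapReduce job the assignment step would run in mappers keyed by centroid index and the averaging step in reducers, with the driver repeating iterations until the centroids stop moving.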

Technologies Used: Java, Hadoop, Cloudera

Microarray Data Analysis: November- December 2013

In the past few years, microarray technology has become one of the foremost tools in biological research. The emergence of this technology has empowered researchers in functional genomics to monitor gene expression profiles of thousands of genes (perhaps even an entire genome) at a time. However, mining microarray data also presents great challenges to Bioinformatics research. This project will acquaint you with several basic approaches to analyzing microarray data from the beginning to end. You will apply the techniques introduced in class to real-world microarray data sets and learn how to discover useful knowledge from the data sets. This project will also help you understand the challenges in microarray data analysis and motivate you to develop novel approaches to addressing those challenges.

Information Retrieval

Wikipedia Indexer: August- September 2013

We built a Wikipedia indexer with the following goals: Parse fairly involved Wikipedia markup using a SAX parser. Tokenize the data by applying tokenization rules on Dates, Numbers, Special Characters, Accents, Punctuation, Apostrophe, Hyphen, Capitalization, Whitespace, and Stemming using Porter's Algorithm. Index a decent-sized subset of the Wikipedia corpus. Create multiple indexes on the page data as well as metadata. Provide an index introspection mechanism that can later be built upon to support queries.

Technologies Used: Java, JUnit

IR Models, Query Processing and Evaluation: September- October 2013

Evaluated the performance of different ranking algorithms: the Vector Space Model and BM25. As part of this project, we built our own query-processing module in an attempt to boost performance. We also used Lucene to perform quantitative evaluation based on the TREC (Text Retrieval Conference) data sets.
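For reference, the per-term BM25 score can be sketched in Java as below. This is a hedged illustration, not the project's scoring code: the class name is hypothetical, the IDF uses the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)), and the values k1 = 1.2 and b = 0.75 are common defaults assumed here.

```java
final class Bm25 {
    static final double DEFAULT_K1 = 1.2;  // assumed common default
    static final double DEFAULT_B = 0.75;  // assumed common default

    /** Inverse document frequency; numDocs is the collection size, docFreq the number of documents containing the term. */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * BM25 contribution of one query term to one document.
     * tf is the term frequency in the document; docLen and avgDocLen are lengths in tokens.
     */
    static double termScore(double tf, double docLen, double avgDocLen,
                            long numDocs, long docFreq, double k1, double b) {
        double lengthNorm = 1.0 - b + b * docLen / avgDocLen; // penalizes longer documents
        return idf(numDocs, docFreq) * tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }
}
```

A document's score for a query is the sum of `termScore` over the query terms. Unlike raw term-frequency weighting, the contribution saturates as tf grows and shrinks for documents longer than average.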

Technologies Used: Java, Lucene

Search systems using Apache Solr: October- November 2013

The aim of this project was to suggest travel destinations to users based on their preferences by building a search system that takes input from the user and returns matching results. We limited our attention to the English Wikivoyage for this purpose. Each page has similar subsections describing activities to do, climate statistics, getting there, etc.

We indexed at least 5000 different pages, allowed users to specify multiple parameters (at least as ANDed conditions), and provided a filter facility that lets users apply or remove preferences or automatically discover new facets within their results to drill down on.

Technologies Used: Java, Solr, AJAX, PHP, SQL, JQuery

Wide Area Distributed File Systems

DLS: Design And Implementation Of A Cloud-Hosted Directory Listing Service For Lightweight Clients:

January 2013-April 2013

Designed a Cloud-hosted Directory Listing Service that prefetches and caches remote directory metadata in the Cloud to minimize response time for thin clients (such as smartphones and Web clients) and to enable efficient directory traversal before a remote third-party data transfer request is issued.

Technology used: StorkCloud, Java, GridFTP, REST API, XML

View project report

Distributed Systems

Simple Messenger on Android: January 2013

Goal: To design a simple messenger app on Android. The goal of this app is to enable two Android devices to send messages to each other.

Purpose: To understand some of the basic mechanisms necessary to write networked apps on Android

Technology used: Android, SQLite

Totally And Causally Ordered Group Messenger With A Local Persistent Key-Value Table: March 2013

Implemented a group messenger app that provides totally and causally ordered messages, with a local persistent key-value table (for storing the messages), using a modified sequencer algorithm.
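The total-ordering half of a sequencer scheme can be sketched in Java as below: each receiver holds back messages until every lower sequence number has been delivered. This is a minimal sketch with hypothetical names; it assumes a sequencer has already stamped each message with a sequence number starting at 0, and it does not show the causal-ordering part or the project's modifications.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HoldBackQueue {
    private final Map<Integer, String> pending = new HashMap<>();
    private int nextToDeliver = 0; // assumed: sequence numbers start at 0

    /**
     * Records a message stamped by the sequencer and returns the messages that
     * can now be delivered, in sequence-number order (possibly none).
     */
    List<String> onReceive(int sequenceNumber, String message) {
        List<String> delivered = new ArrayList<>();
        if (sequenceNumber < nextToDeliver) {
            return delivered; // duplicate of an already delivered message
        }
        pending.put(sequenceNumber, message);
        while (pending.containsKey(nextToDeliver)) {
            delivered.add(pending.remove(nextToDeliver));
            nextToDeliver++;
        }
        return delivered;
    }
}
```

Because every receiver delivers strictly by sequence number, all group members see the same order even when the network reorders messages.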

Technology used: Android, SQLite

Simple Key-Value Storage: April 2013

Designed a simple distributed key-value storage based on Chord and developed an Android app to implement ID space partitioning/re-partitioning, Ring-based routing, and Node joins.
Technology used: Android, SQLite
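
The ID-space partitioning idea behind Chord can be sketched in Java as below: a key belongs to the first node whose ID is at or after the key's ID, wrapping around the ring. This is a simplified sketch with hypothetical names. It assumes small integer IDs and a global view of the ring, whereas Chord itself hashes keys and nodes into a large ID space and routes between nodes that know only part of the ring.

```java
import java.util.Map;
import java.util.TreeMap;

final class HashRing {
    // Node ID -> node address, kept sorted by ID so the ring order is explicit.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();

    /** A node join: the new node takes over the keys between its predecessor and itself. */
    void join(int nodeId, String address) {
        nodes.put(nodeId, address);
    }

    /** Returns the address of the successor node responsible for keyId. */
    String ownerOf(int keyId) {
        if (nodes.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Integer, String> successor = nodes.ceilingEntry(keyId);
        // Past the highest node ID, wrap around to the lowest one.
        return (successor != null ? successor : nodes.firstEntry()).getValue();
    }
}
```

Re-partitioning falls out of the lookup rule: after a join, only the keys between the new node and its predecessor change owner.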

Replicated Key-Value Storage: April 2013

Designed a simplified version of Dynamo (distributed key-value storage) based on an eventual consistency mechanism that handles 1) Partitioning, 2) Replication, and 3) Failure handling.

Technology used: Android, SQLite

Advanced Database Systems

Interpreting Relational Algebra And Translating SQL To Relational Algebra: January- February 2013

Goal:

1) Evaluated a relational algebra expression stored in the form of a tree. Implemented an algorithm that evaluates the expression (containing Joins, Select, Projection, Aggregate Functions, etc.) step by step, from the leaf nodes to the root node.

2) Parsed the SQL expression, generated the corresponding relational algebra expression, and evaluated it using step 1.
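Bottom-up evaluation of a relational algebra tree can be sketched in Java as below. This is an illustrative sketch, not the project's interpreter: all names are hypothetical, rows are modeled as column-name-to-value maps, and only scan, selection, and projection are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class RelAlg {
    /** A node of the expression tree; eval() evaluates its children first, then itself. */
    interface Node {
        List<Map<String, Object>> eval();
    }

    /** Leaf node: returns the rows of a base table. */
    static Node scan(List<Map<String, Object>> rows) {
        return () -> rows;
    }

    /** Selection: keeps the child's rows that satisfy the condition. */
    static Node select(Node child, Predicate<Map<String, Object>> condition) {
        return () -> {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : child.eval()) {
                if (condition.test(row)) {
                    out.add(row);
                }
            }
            return out;
        };
    }

    /** Projection: keeps only the named columns of each child row. */
    static Node project(Node child, String... columns) {
        return () -> {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : child.eval()) {
                Map<String, Object> projected = new LinkedHashMap<>();
                for (String column : columns) {
                    projected.put(column, row.get(column));
                }
                out.add(projected);
            }
            return out;
        };
    }
}
```

A query such as SELECT id FROM people WHERE age > 25 would correspond to `project(select(scan(people), ...), "id")`, and calling `eval()` on the root evaluates the scan first, then the selection, then the projection.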

Building Static Hash And ISAM Tree Index: February- March 2013

Constructed and validated a dense static hash index and a dense static ISAM tree index. The design involved a fixed number of (pre-allocated) sequential pages to store the hash table buckets, and a set of additional overflow pages. Each data page consisted of a sequence of (unordered) data tuples, and a pointer to the first overflow page (i.e., forming a linked list). The DatumBuffer class provided functionality for encoding data tuples into Byte-Buffers.
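
The bucket-plus-overflow layout described above can be sketched in memory in Java as below. This is a simplified sketch with hypothetical names: it stores integer keys in arrays rather than encoded tuples in on-disk pages, and it does not use the DatumBuffer class.

```java
final class StaticHashIndex {
    /** One fixed-capacity page of unordered keys with a pointer to its overflow page. */
    static final class Page {
        final int[] keys;
        int count;
        Page overflow;

        Page(int capacity) {
            keys = new int[capacity];
        }
    }

    private final Page[] buckets; // fixed number of pre-allocated primary pages
    private final int pageCapacity;

    StaticHashIndex(int numBuckets, int pageCapacity) {
        this.pageCapacity = pageCapacity;
        this.buckets = new Page[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            buckets[i] = new Page(pageCapacity);
        }
    }

    void insert(int key) {
        Page page = buckets[Math.floorMod(key, buckets.length)];
        while (page.count == pageCapacity) { // page full: follow or extend the overflow chain
            if (page.overflow == null) {
                page.overflow = new Page(pageCapacity);
            }
            page = page.overflow;
        }
        page.keys[page.count++] = key;
    }

    boolean contains(int key) {
        for (Page p = buckets[Math.floorMod(key, buckets.length)]; p != null; p = p.overflow) {
            for (int i = 0; i < p.count; i++) {
                if (p.keys[i] == key) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Number of pages (primary plus overflow) in one bucket's chain. */
    int chainLength(int bucket) {
        int length = 0;
        for (Page p = buckets[bucket]; p != null; p = p.overflow) {
            length++;
        }
        return length;
    }
}
```

Because the number of primary pages never changes, skewed or growing data shows up as longer overflow chains, and lookups in those buckets degrade toward a linear scan.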

Optimized Query Execution: April- May 2013

1) Implemented a query processor to run TPC-H queries (http://www.tpc.org/tpch/ )

2) Implemented Hybrid Hash and a Relational Algebra Query Optimizer

3) Implemented Index Scan and Index-Nested Loop, and integrated them into the Optimizer

Tech Innovation and Management

PRINT2GO: January- May 2013

Created a business plan for Print2Go, a mobile application that will allow users to print documents from almost any device, to any printer, anywhere. Users will sign up for one of our accounts, download the app onto their mobile devices, and then register their printers with our server so that they can access them from anywhere. An application as comprehensive as Print2Go does not currently exist in the market, and our app will allow users to be more productive and to have access to the resources they need, whenever they need them.

Business Plan

FAD Template

Presentation

Machine Learning

Regression On Page Relevancy: August 2012

Trained a regression model on the LETOR 4.0 dataset (LETOR is a package of benchmark datasets of query-URL pairs released by Microsoft Research Asia for research on Learning to Rank). Using Gaussian and sigmoid basis functions, the model predicted page relevancy labels for new queries with a root-mean-square error (Erms) of 0.52.
Technology used: MATLAB
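
The building blocks named above (Gaussian and sigmoid basis functions, a linear model over them, and the Erms metric) can be sketched as below. The project itself used MATLAB; this Java version is only an illustration with hypothetical names, and the centers, widths, and weights are assumed to come from a separate training step that is not shown.

```java
final class BasisRegression {
    /** Gaussian basis: exp(-||x - mu||^2 / (2 s^2)), equal to 1 at the center mu. */
    static double gaussianBasis(double[] x, double[] mu, double s) {
        double squaredDistance = 0.0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - mu[j];
            squaredDistance += diff * diff;
        }
        return Math.exp(-squaredDistance / (2.0 * s * s));
    }

    /** Sigmoid basis on a single feature: 1 / (1 + exp(-(x - mu) / s)). */
    static double sigmoidBasis(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** Linear model in the basis outputs: y = w . phi. */
    static double predict(double[] weights, double[] phi) {
        double y = 0.0;
        for (int j = 0; j < weights.length; j++) {
            y += weights[j] * phi[j];
        }
        return y;
    }

    /** Root-mean-square error between predictions and target labels. */
    static double erms(double[] predicted, double[] target) {
        double sumSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - target[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / predicted.length);
    }
}
```

Erms is expressed in the same units as the relevance labels, which makes values comparable across data sets of different sizes.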

Classification Of Handwritten Digits: October 2012

Achieved multi-class classification using Logistic Regression and Neural Networks.
Handwritten digits (10 digits, 0-9, with binary GSC features) were classified with an accuracy of 97.52%.
Technology used: MATLAB

Probabilistic Graphical Models: Model Evaluation, Inference And Sampling: December 2012

Using log-loss, evaluated Bayesian networks (BNs) on a dataset of 15500 samples of the handwritten word 'and' with cursive and hand-print characteristics, and showed that the BN for a given graphical model performs better than the BN with independent variables.
Technology used: MATLAB

URL: segroup38.freevar.com

Software Engineering

Community Information Website (CIW): August 2012 - December 2012

In a group of 7, we investigated and developed a well-compiled report on the topic of CIW. In the CIW scenario, a local news organization would like to offer people the capability to use Google Maps to find their house and click on it to get a list of all available information on the surrounding area: waste dumps nearby, houses for sale and their prices, upcoming activities in the area, flood information, planned construction, traffic patterns, etc.

Computer Security

Vulnerability Analysis And Mitigation Using Webgoat: August 2012

The primary goal of this project was to understand web application security. The tasks undertaken:

1. Implemented and understood all the Web security related topics: Vulnerabilities, Application security, Web application security using WebGoat and WebScarab.

2. Implemented various types of Injection flaws.

File System Integrity Checkers: September 2012

In a group of 6, we studied and implemented the following file integrity checkers:
1. Tripwire
2. Samhain
3. AIDE
4. Integrit
5. Nabou
6. Osiris
- Designed our own File System Integrity Checker

E-Mail Forensics With Dkim: October 2012

The tasks undertaken in this project:

1. Investigated e-mail headers to identify "potential" E-mail Spoofing.

2. Researched various core DKIM algorithms (SHA1 + rsaprivatekey1024 signing, SHA256 + rsaprivatekey1024 signing, SHA1 + rsaprivatekey2048 signing, and SHA256 + rsaprivatekey2048 signing) and identified the combination that provides the best performance, using email size as the criterion.
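
The core operation being compared, an RSA signature over a hash of the message, can be sketched with the standard `java.security` API as below. This is a hedged sketch, not the project's measurement harness: the class and method names are hypothetical, and DKIM-specific steps such as header and body canonicalization and DNS key lookup are not shown.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

final class SigningSketch {
    /** Generates an RSA key pair of the given size, e.g. 1024 or 2048 bits. */
    static KeyPair newRsaKeyPair(int bits) {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(bits);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Signs data with a JCA algorithm name such as "SHA1withRSA" or "SHA256withRSA". */
    static byte[] sign(PrivateKey key, String algorithm, byte[] data) {
        try {
            Signature signer = Signature.getInstance(algorithm);
            signer.initSign(key);
            signer.update(data);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Verifies a signature produced by sign() with the matching public key. */
    static boolean verify(PublicKey key, String algorithm, byte[] data, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance(algorithm);
            verifier.initVerify(key);
            verifier.update(data);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Timing `sign` for each hash and key-size combination over messages of increasing size (for example with `System.nanoTime()`) is one way to reproduce this kind of comparison.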

Undergraduate Projects

BOOKMYFOOD: August 2010 - January 2011

Book My Food is a web application that facilitates online ordering of food from various hotels in Mumbai. The project aimed to provide the easiest, fastest and least expensive online food ordering platform, enabling customers to order food online from the best local restaurants and get free home delivery. It also offers convenience to end users, both individual and corporate, by letting everyone see entire menus online for numerous restaurants serving delightful cuisines across Mumbai.

Technologies : JSP, Servlets, MS Access, JS, CSS, jQuery.