Code Search

Tooling is an important part of development, and using the proper tools can increase productivity and simplify the developer’s life considerably. Confirmit’s internal Code Search tool is one of those tools.

Code search in general is an extremely powerful feature. You can use it in different scenarios, such as searching for a specific class by name or finding where your colleagues use a method or code snippet. It can even replace your IDE when you need to analyze someone else’s code.

Code Search started a few years ago when we decided to move to a microservices architecture. Our beautiful monolith repository was split into a bunch of small repositories, and in those days it was a real problem to find where the code had been moved to. So I created a small project called WhereIsMyCode. The initial idea was merely to find a repository by class name, but it expanded into something like ReSharper’s “go to anything” that works across different Git repositories.

C# symbol search

As most of Confirmit’s code is in C#, the .NET code analyzer was the first one to be built. It is based on Roslyn, which is not only a compiler but also a great .NET syntax tree analyzer. Abstract syntax trees (ASTs) represent the lexical and syntactic structure of source code, so it’s easy to extract and store structured symbols for later search.
Roslyn AST representation
The first version was created in about 2 days: a .NET code indexer (as a .NET app), MongoDB, a backend in Go, and an Angular-based frontend.
The features are:
  • it’s fast: under 50 ms for an average search request, thanks to MongoDB’s fast index scan and the fast Go driver mgo.
  • additional info is provided for each entry found: the repository name and the Visual Studio projects and solutions where the .cs files are included or linked.
  • a link is provided to the file on the internal Bitbucket web server, where you can run Blame, check the Git history, etc.
  • instant search (using Twitter’s Typeahead control).
Codesearch typeahead autocomplete

Later it became clear that Roslyn wouldn’t work in some cases (such as for Xamarin Android Shared projects), so a custom analyzer was introduced. In addition, repositories were given weights based on popularity, which determine the default ordering of matches.
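To illustrate the weight-based ordering, here is a minimal sketch in Go. The Match type, the weights map and the repository names are hypothetical and exist only for this example; how the real tool stores and computes popularity is not described here.

```go
package main

import (
	"fmt"
	"sort"
)

// Match is a hypothetical search hit: a symbol found in some repository.
type Match struct {
	Symbol string
	Repo   string
}

// rankByRepoWeight sorts matches so that hits from heavier (more popular)
// repositories come first. Repositories missing from the map get weight 0.
// sort.SliceStable keeps the original order for equal weights.
func rankByRepoWeight(matches []Match, weights map[string]int) {
	sort.SliceStable(matches, func(i, j int) bool {
		return weights[matches[i].Repo] > weights[matches[j].Repo]
	})
}

func main() {
	// Assumed weights; the real values would come from popularity data.
	weights := map[string]int{"core": 90, "reporting": 40}
	matches := []Match{
		{Symbol: "SurveyService", Repo: "sandbox"},
		{Symbol: "SurveyService", Repo: "reporting"},
		{Symbol: "SurveyService", Repo: "core"},
	}
	rankByRepoWeight(matches, weights)
	for _, m := range matches {
		fmt.Println(m.Repo, m.Symbol)
	}
}
```

A stable sort is used so that any secondary ordering already present in the result list survives among repositories of equal weight.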

Fulltext search

Even when we had a ‘monolith’ I used grep|sed a lot for searching code on my disk. This was not only for C# code, but also for js/css/xml and other text files. So the next step was to create an “online grep” across all Git repositories. I’m a big fan of regular expressions, so regex support was one of the important requirements.

The common way to implement fulltext search is to use Elasticsearch, but it is boring and has lots of limitations: it deals with tokens and delimiters, requires a lot of space, and has no regexps. GitHub code search is a good example of an Elasticsearch implementation; it is fast, but limited.

So a custom binary index is the answer.

Luckily I wasn’t the only one interested in this topic. Russ Cox (from Google) has a great paper, “Regular Expression Matching with a Trigram Index”, where he describes the Google code search tool, which is based on an inverted n-gram index with regex support. My indexer is based on that paper, with small adjustments. Obviously it’s written in Go, and it’s fast.

The workflow is as follows:

  1. Look up file paths in the n-gram index to narrow the scope.
  2. Run the regexp on each entry from step #1 (it’s not the regexp from the standard library, but a slightly modified one that gets all matches in a file, with context, in one go).
  3. Collect context and additional metadata for each entry from step #2.

There were some challenges. For example, how does one index only text files and ignore binaries? One solution could be to store a whitelist of file extensions that are to be scanned, but this is not robust enough and requires manual handling of extensions, which is not good. The Go standard library, however, has a useful function, http.DetectContentType(buf), which returns the MIME type based on the input byte buffer. Then it’s as easy as checking whether the result matches “text/*”.
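A minimal sketch of that check might look like this. The isText helper is a name made up for the example; http.DetectContentType itself is the standard library function mentioned above, and it looks at no more than the first 512 bytes, so reading only the head of each file is enough.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isText reports whether the leading bytes of a file look like text.
// DetectContentType returns values such as "text/plain; charset=utf-8"
// or "image/png", so a prefix check on "text/" is sufficient.
func isText(head []byte) bool {
	return strings.HasPrefix(http.DetectContentType(head), "text/")
}

func main() {
	source := []byte("using System;\nclass Foo {}")
	png := []byte("\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")

	fmt.Println(isText(source)) // C# source is sniffed as text
	fmt.Println(isText(png))    // the PNG signature is sniffed as image/png
}
```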

An n-gram index lookup gives you the file path and position of a match, but that’s not enough for visualization. The developer needs to see the ‘context’ of the match, so Code Search also fetches 2 lines before and after it. This is fairly simple if Elasticsearch is used, but a custom solution has to be fast and memory-efficient on its own. Memory-mapped files are used to boost performance. Go’s channels and goroutines make it easy to collect search results concurrently. Each search request creates one goroutine per repository, and each match inside a repository creates a goroutine (in a pool) that fetches the file content (5 lines per match). Go is the perfect choice for such tasks.
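The idea can be sketched as below. This is a simplified illustration under several assumptions: the Repo and Result types are hypothetical, repositories are plain in-memory maps rather than memory-mapped files, a substring search stands in for the regexp, only the first match per file is reported, and there is no per-match goroutine pool.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// contextLines returns the line containing byte offset pos plus up to
// `around` lines before and after it.
func contextLines(content string, pos, around int) []string {
	lines := strings.Split(content, "\n")
	hit := len(lines) - 1
	offset := 0
	for i, l := range lines {
		offset += len(l) + 1 // +1 for the newline
		if pos < offset {
			hit = i
			break
		}
	}
	from, to := hit-around, hit+around+1
	if from < 0 {
		from = 0
	}
	if to > len(lines) {
		to = len(lines)
	}
	return lines[from:to]
}

// Repo is a hypothetical in-memory repository: path -> file content.
type Repo struct {
	Name  string
	Files map[string]string
}

// Result is one match together with its surrounding lines.
type Result struct {
	Repo, Path string
	Context    []string
}

// searchAll starts one goroutine per repository and collects the
// results over a channel. The order of results is not deterministic.
func searchAll(repos []Repo, needle string) []Result {
	results := make(chan Result)
	var wg sync.WaitGroup
	for _, r := range repos {
		wg.Add(1)
		go func(r Repo) {
			defer wg.Done()
			for path, content := range r.Files {
				if pos := strings.Index(content, needle); pos >= 0 {
					results <- Result{Repo: r.Name, Path: path, Context: contextLines(content, pos, 2)}
				}
			}
		}(r)
	}
	// Close the channel once every repository goroutine has finished.
	go func() {
		wg.Wait()
		close(results)
	}()
	var out []Result
	for res := range results {
		out = append(out, res)
	}
	return out
}

func main() {
	repos := []Repo{
		{Name: "core", Files: map[string]string{"a.cs": "l1\nl2\nl3\nTODO here\nl5\nl6\nl7"}},
		{Name: "web", Files: map[string]string{"b.js": "nothing to see"}},
	}
	for _, r := range searchAll(repos, "TODO") {
		fmt.Println(r.Repo, r.Path, r.Context)
	}
}
```

Closing the channel from a separate goroutine after wg.Wait() lets the collecting loop finish without knowing in advance how many matches there will be.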

As a result, fulltext regex search works, with additional regex filters for path and repository. This is what it looks like:

Codesearch fulltext search results

Now

At this time the search index contains 220 repositories (not all of Confirmit’s repositories, just a subset), which occupy 11.8 GB on disk. The index size is 214 MB; it is hosted on a single small VM, and it works very well.

Codesearch diagram

Code Search is a powerful tool for searching and learning new code, solving issues and finding repos you might be interested in.

Author: Andrey Gruzdov

Senior Software Engineer at Confirmit
