A little bit of an update...
Welcome back to my blog. This post was supposed to be done two weeks ago, but I have to admit that I postponed it because I didn't have much of an idea of what to write.
However, now I have a certain topic I would like to talk about.
The topic is related to Telescope, the new dependency-discovery service, and the search service.
Making a well-integrated feature
Back then, Telescope's primary reason for existing was to replace planet. It has slowly grown into a more full-featured feed aggregator, including videos, posts, text search, and other features.
An interesting feature idea that has arisen is searching for posts by GitHub-related information. For example, you could search for all posts that mention the Microsoft organization.
At the same time, the development of the dependency-discovery service was taking place. After implementing a really basic feature, it was time to expand it to the original idea, which was linking GitHub repositories to the dependencies that Telescope uses, for the purpose of listing open GitHub issues.
If you can see where this is going, my current plan is to make those two features blend together.
Why not separate?
This is a valid question. Why not build the features in isolation? After all, you have to coordinate with another service to make this possible, which will make development much slower.
The reason I want to integrate in the first place is that making big changes to an existing codebase is an excruciatingly long and painful task, riddled with possible bugs and mistakes.
Imagine that we develop the features separately. This means that I would implement the GitHub issue search for the dependency-discovery service in a way that covers my use case only, while the search service would implement the GitHub data indexing in another way, for the sake of its own features and use cases. You can start to see the trend: we develop software so that it covers our immediate needs and use cases, and planning ahead is less common than you'd think.
Any good abstraction that you might have used as a developer is an abstraction that has been constantly challenged by the minds of several other developers, and that has been redesigned to allow extensibility for current and future use cases.
In this case, then, we implement the GitHub-related features with our different approaches. Now, a few months later, the future maintainers want to integrate these two features, with the possibility of adding a third one into the mix. How would they know how to integrate them? That's the thing: they won't! The future maintainers would need to figure out how the old code works before they can find a way to integrate the features, because "rewriting it" would be discouraged (although it might be a sensible choice with the proper planning).
So, instead of putting this burden on the future generation, I would prefer to do the "hard" work now, and let them improve upon this system.
Figuring out a common set of requirements
When you start to think about these GitHub-related features, there is a common denominator: the GitHub information itself.
The main plan we have is to cache/store some GitHub data on our end, so that we don't have to depend on the GitHub API to get the information we care about. After all, the GitHub API is rate-limited, and since Telescope is used by several students at a time, we can reach this limit in no time.
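To make the idea concrete, here is a minimal sketch of a cache-first lookup. Everything in it is an assumption for illustration: the `GitHubCache` class, the `fetch` callback standing in for a real GitHub API call, and the plain dictionary standing in for whatever permanent store we end up choosing. It is not how Telescope does it today.

```python
from typing import Callable, Dict


class GitHubCache:
    """Cache-first lookup: only hit the rate-limited API on a cache miss."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        # `fetch` is a hypothetical function that calls the GitHub API.
        self._fetch = fetch
        # A dict stands in for the real permanent store in this sketch.
        self._store: Dict[str, dict] = {}
        # Counts how many times we actually had to call the API.
        self.api_calls = 0

    def get_repo(self, full_name: str) -> dict:
        cached = self._store.get(full_name)
        if cached is not None:
            return cached  # served locally, no API request spent

        self.api_calls += 1
        data = self._fetch(full_name)
        self._store[full_name] = data
        return data
```

With this shape, asking for the same repository twice costs only one API request, which is the whole point of keeping the data on our side.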
The next question is where to store it. Telescope currently uses three technologies that can be used for storage: Redis, PostgreSQL (through Supabase), and Elasticsearch.
Out of the three, Redis is the worst option, as it is not meant for permanent storage. The idea is to store the data from GitHub permanently on our side. If we save it in Redis, it will eventually be lost and will have to be restored by asking the GitHub API, defeating our main objective of staying under the rate limit. Also, Redis is an in-memory database, so we cannot keep storing more and more data in it and expect it not to outgrow the available memory and eventually crash.
Elasticsearch sits in a weird in-between spot. It is not a bad option, but it is not a good one either. Elasticsearch does store the data permanently, so if Elasticsearch were to restart, all the data would still be there. The main problem, however, is the way that Elasticsearch stores the data. You see, Elasticsearch specializes in indexing documents to power search engines. This means that the main data structures that Elasticsearch uses are catered to a specific use case, and so are less flexible for other use cases.
So, enter PostgreSQL! This might be the best option of all three, as it ticks all the boxes that we care about:
- designed with permanent storage in mind,
- general relational model that allows for flexibility when interpreting and analyzing the data,
- well-known, well-documented, and well-supported technology.
Great, we have a place to store the data, so how should we structure the common set of information that our features share? Well, since we are using a relational database, we have to think of the entities we are going to store and the relationships between these entities.
For starters, we already have several entities:
- GitHub Users (these include organizations, too!),
- GitHub Repositories,
- GitHub Pull Requests & Issues (GitHub treats these the same way, with some minor differences).
And the relationships between the entities are somewhat like:
- A user can have zero or more repositories.
- A repository can have zero or more issues (which include Pull Requests).
As you can see, the database design is not overly complicated, which is the main idea. We don't want an overly complicated design, since we are actually going to use it later!
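The entities and relationships above could be sketched as three tables. This is only a rough draft under assumptions: the table and column names are made up for illustration, and I am using Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL so that the sketch runs without a server. The real schema may well look different.

```python
import sqlite3

# Hypothetical table and column names; the real design is still open.
SCHEMA = """
CREATE TABLE github_users (
    id    INTEGER PRIMARY KEY,
    login TEXT NOT NULL UNIQUE,
    -- organizations are stored as users, too
    type  TEXT NOT NULL CHECK (type IN ('user', 'organization'))
);

-- A user can have zero or more repositories.
CREATE TABLE github_repositories (
    id       INTEGER PRIMARY KEY,
    owner_id INTEGER NOT NULL REFERENCES github_users(id),
    name     TEXT NOT NULL,
    UNIQUE (owner_id, name)
);

-- A repository can have zero or more issues (pull requests included).
CREATE TABLE github_issues (
    id              INTEGER PRIMARY KEY,
    repo_id         INTEGER NOT NULL REFERENCES github_repositories(id),
    number          INTEGER NOT NULL,
    title           TEXT NOT NULL,
    state           TEXT NOT NULL CHECK (state IN ('open', 'closed')),
    is_pull_request INTEGER NOT NULL DEFAULT 0,
    UNIQUE (repo_id, number)
);
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Keeping issues and pull requests in one table with an `is_pull_request` flag mirrors the way GitHub treats them as nearly the same thing, and the foreign keys encode the two "zero or more" relationships directly.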
How to obtain the GitHub data
This is an interesting question. The main way to collect all of this GitHub data is to analyse the posts that Telescope has aggregated, looking for links to GitHub users, pull requests, repositories, and issues.
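As a rough idea of what that analysis could look like, here is a small sketch that scans a post's text for GitHub links and classifies each one. The regular expression, the `GitHubLink` type, and the `extract_github_links` function are all my own assumptions for illustration. The sketch is also deliberately naive: it would misread reserved paths such as `github.com/features` as users, so a real implementation would need more care.

```python
import re
from typing import List, NamedTuple, Optional

# Matches github.com/<owner>, optionally followed by /<repo>,
# optionally followed by /issues/<n> or /pull/<n>.
GITHUB_LINK = re.compile(
    r"https?://github\.com/"
    r"(?P<owner>[A-Za-z0-9-]+)"
    r"(?:/(?P<repo>[A-Za-z0-9._-]+)"
    r"(?:/(?P<kind>issues|pull)/(?P<number>\d+))?)?"
)


class GitHubLink(NamedTuple):
    kind: str  # "user", "repository", "issue", or "pull_request"
    owner: str
    repo: Optional[str]
    number: Optional[int]


def extract_github_links(text: str) -> List[GitHubLink]:
    """Find GitHub links in `text` and classify each one."""
    links = []
    for match in GITHUB_LINK.finditer(text):
        owner = match.group("owner")
        repo = match.group("repo")
        number = match.group("number")
        if repo is not None:
            # Drop a sentence-ending period glued to the URL.
            repo = repo.rstrip(".")

        if not repo:
            links.append(GitHubLink("user", owner, None, None))
        elif number is None:
            links.append(GitHubLink("repository", owner, repo, None))
        else:
            kind = "pull_request" if match.group("kind") == "pull" else "issue"
            links.append(GitHubLink(kind, owner, repo, int(number)))
    return links
```

Each classified link maps onto one of the entities from the previous section, so the output of a function like this could feed the database directly.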
Another part that would have to collect GitHub data, although on a smaller and more focused scale, is the dependency-discovery service. The idea is that the dependency-discovery service wants to provide open issues that belong to the repositories that correspond to the dependencies that Telescope has registered.
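In code, that lookup might be shaped something like the sketch below. The `Issue` type, the `open_issues_for_dependencies` function, and the mapping from dependency name to repository are hypothetical, and leaving pull requests out by default is my own assumption rather than a decision that has been made.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Issue:
    repo: str  # "owner/name" of the repository the issue belongs to
    number: int
    title: str
    state: str  # "open" or "closed"
    is_pull_request: bool = False


def open_issues_for_dependencies(
    dep_to_repo: Dict[str, str],
    issues: Iterable[Issue],
    include_pull_requests: bool = False,
) -> Dict[str, List[Issue]]:
    """Group the open issues by the dependency whose repository owns them."""
    # Index the open issues by repository first.
    by_repo: Dict[str, List[Issue]] = {}
    for issue in issues:
        if issue.state != "open":
            continue
        if issue.is_pull_request and not include_pull_requests:
            continue
        by_repo.setdefault(issue.repo, []).append(issue)

    # Every registered dependency gets an entry, even if it has no open issues.
    return {dep: by_repo.get(repo, []) for dep, repo in dep_to_repo.items()}
```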
Although this part has to be developed further, I think there is a nice starting point with all of this.