A new way to search
2025 01 01
The AI hype is here. 'Superintelligence' is apparently just around the corner, if you believe some labs. Agents are trending. Countless businesses are pitching flashy AI products to consumers, many of them shipping code wrapped around ChatGPT. Generative AI has had a good run so far, but I'm less interested in what it generates than in how it can change search. There should be a platform where you can dump files into a black box and, no matter their modality, filter for the exact text passage, image, or video segment you're after.
Enter Dimension.
You have been warned: this app is far from complete. You will run into plenty of issues and unfinished pages. Right now the only data collected when you create an account is your email address and password. Go ahead and explore the UI. Features will come out on a rolling basis as infrastructure needs are solved. The alternative is to release the app from stealth mode only when it's complete, and I'll be honest here: it's highly unlikely that'll be anytime soon.
This post summarizes what's in store for the 'search engine' (a pretentious name for now), as well as the technical tradeoffs made along the way. Building something like this normally takes a team of specialized roles, so it helps to document how each part of the system fits together.
1. Machine learning models
There will be no LLMs trained in-house here. It's the job of VC-funded machine learning labs to burn through thousands of GPUs, employ (and perhaps traumatize) thousands of human data annotators, and cut through regulatory hurdles to train on a massive corpus that in some cases infringes on copyright. Instead, I will make do with the most cost-effective models available.
Again, the focus is not on agentic chatbots that replace customer service representatives or realistic voice assistants for lonely people. I'm more interested in how vector embeddings can be used to search for semantically meaningful information. In other words, prompt engineering is a task for another day, though it will eventually become relevant.
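To make that concrete, here's a minimal sketch of the core idea. The embed() function is a toy stand-in for whatever embedding model ends up being used; the whole approach rests on nearby vectors meaning semantically similar content.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: hash words into a fixed-size bag-of-words vector.
    # A real pipeline would call an embedding model here instead.
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    return v

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def search(query: str, corpus: dict[str, np.ndarray], k: int = 5) -> list[str]:
    """Return the k corpus entries whose embeddings are closest to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda key: cosine_similarity(q, corpus[key]), reverse=True)
    return ranked[:k]
```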
Research on the optimal text segmentation algorithm is ongoing. For images and videos, the audio and visuals will be processed in separate pipelines. We already have good multi-modal models for image embeddings, and Meta's Segment Anything Model (SAM) will play a role in both image and video segmentation, wrapped in an algorithm that can search for objects within images and videos fairly well.
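Since that research isn't settled, here is just one candidate segmentation strategy for text: fixed-size windows with overlap, so that sentences straddling a boundary still appear intact in some chunk. The window and overlap sizes below are illustrative, not tuned values.

```python
def chunk_text(text: str, window: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows; each chunk is one unit of work."""
    words = text.split()
    if not words:
        return []
    step = window - overlap
    return [
        " ".join(words[i:i + window])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```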
2. Compute
Inference is pricey. The first priority is to drive compute costs down, but there are also tradeoffs to make between speed and accuracy. Models can run on cloud or internal GPUs, but there are many other options for letting a managed service handle inference on its machines and serve the results over the network via a REST API. The cost analysis here is complex, because each service has a different pricing model. Another tradeoff: an API service abstracts away the complexity of orchestrating the compute, which makes life easier for the higher-level engineers, but the available models are limited and cannot be easily customized. This may be the most prohibitive factor in how far this project can go.
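For a sense of what the managed-service route looks like in practice, here's a sketch using OpenAI's embedding endpoint, which stands in for any hosted inference API; the model name and pricing vary by vendor.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_remote(texts: list[str]) -> list[list[float]]:
    """One network round trip returns embeddings for a whole batch,
    which amortizes per-request latency and cost."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]
```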
3. Distributed parallel processing
Processing files sequentially as they are uploaded will take a very long time and cause traffic congestion once more than a few arrive at once. Each segment of text, image, or video should be treated as an atomic unit of work and processed in parallel using the Ray framework, as in the sketch below. As the workload grows, this will also need to be distributed across multiple machines. For that, Kubernetes is the answer, though it would require an entire team of K8s experts (AI or human) to set up and maintain. That's a solution for another time, if this ever becomes the bottleneck.
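A minimal sketch of that fan-out, assuming a hypothetical process_segment stage that embeds one segment:

```python
import ray

ray.init()  # single machine for now; a cluster address would go here later

@ray.remote
def process_segment(segment: bytes) -> list[float]:
    # Hypothetical per-segment stage: decode the payload and embed it.
    # A real implementation would run the model here; this stub returns
    # a dummy vector so the sketch is end-to-end runnable.
    return [float(len(segment))]

def process_upload(segments: list[bytes]) -> list[list[float]]:
    # Submit every segment at once; Ray schedules the tasks across
    # available cores (and, later, across machines in a cluster).
    futures = [process_segment.remote(s) for s in segments]
    return ray.get(futures)
```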
4. Database
The market is oversaturated with options here. There are many serverless database-as-a-service offerings alongside the traditional long-lived virtual machines, and all of them incur cloud costs that trade off against the convenience of the service. Even geo-distributed databases exist, but costs go up with the number of replicas across regions. For now, since very few people are expected to even discover this app, it's a good idea to go with serverless. As a rule of thumb, serverless is cost-effective when you first start out, but long-lived VMs are more cost-effective in the long run as you scale up. We can worry about migrating the data later.
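To make that rule of thumb concrete, here's a back-of-envelope break-even calculation. The prices are made up for illustration; plug in real vendor numbers to get a real answer.

```python
VM_MONTHLY = 50.0             # hypothetical flat cost of a small long-lived VM
SERVERLESS_PER_MILLION = 1.0  # hypothetical cost per million requests

def cheaper_option(requests_per_month: float) -> str:
    serverless = requests_per_month / 1e6 * SERVERLESS_PER_MILLION
    return "serverless" if serverless < VM_MONTHLY else "vm"

# With these made-up prices the break-even point is 50M requests/month:
# below it the pay-per-use bill stays under the flat VM cost.
print(cheaper_option(1e6))  # "serverless"
print(cheaper_option(1e8))  # "vm"
```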
At the risk of overestimating future traffic to the search engine, I've made the ability to scale on demand one of my first design decisions. NoSQL databases used to be the go-to for horizontal scaling, but with distributed SQL databases like CockroachDB, the market seems to be shifting back to SQL. I'm well aware that about 99% of projects never reach mega-scale, but I also like to future-proof my code.
There are also many vendors offering vector databases, some directly integrated into SQL or NoSQL databases, some specialized for storing embeddings only. SQL extensions seem to be closing the speed gap with the specialized vector databases, and they are unparalleled in flexibility. An added benefit of going with SQL is the developer experience: vector embeddings and the rest of the data live in the same database.
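As a sketch of that developer experience, here's what storing and querying embeddings next to relational data looks like with pgvector, one such SQL extension. The schema and connection string are illustrative.

```python
import psycopg

with psycopg.connect("dbname=dimension") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS segments (
            id bigserial PRIMARY KEY,
            file_id bigint,         -- ordinary relational data...
            body text,
            embedding vector(1536)  -- ...and the embedding, side by side
        )
    """)
    # Nearest-neighbor search: <-> is pgvector's L2 distance operator.
    rows = conn.execute(
        "SELECT id, body FROM segments ORDER BY embedding <-> %s::vector LIMIT 5",
        (str([0.0] * 1536),),
    ).fetchall()
```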
5. User interface
You'd expect this to be the simplest part of the project, but the current state of frontend development is a mess of frameworks and libraries. I don't mean to discount their quality or say that they're hard to learn, only that new ones keep coming out claiming to be the best, one-upping the previous one on developer experience and performance. React seems to lead the industry, and full-stack React frameworks like Next.js and Remix are growing in popularity. I started out with Next.js, but I had issues with it that could fill another blog post. Now I'm using Remix, and the issue there is that it's not as mature as Next.js; for example, there is no official middleware support, which is a dealbreaker for many developers. Still, I'm sticking with Remix. The migration from Next.js wasn't too hard, but a project like this needs to spend less time on the frontend, and I'm not about to do another one.
So, what's the feature roadmap?
Nothing is set in stone yet. Setting up the infrastructure has been the most difficult part of the project so far, and it has helped me understand why it's easier for large-cap companies or VC-funded startups to build products on top of LLMs. I'm not planning on doing this full-time, so I'll continue to look for ways to make this more cost-effective.
Disclosure
I have not done the due diligence to fact-check some of this information or provide citations where appropriate. A lot of it is my own speculation, based on public information I read a while ago and knowledge accumulated over time. I'm happy to receive feedback.