How can we make an impact with our AI research?



Omar Khattab, a member of Stanford University's Natural Language Processing group and developer of the AI frameworks DSPy and ColBERT, summarises how to conduct research that makes a difference in today's crowded AI field.

blog/2024.09.impact.md at main · okhat/blog · GitHub
https://github.com/okhat/blog/blob/main/2024.09.impact.md

◆1: Invest in projects, not papers
While students tend to focus on publishing their first paper, Khattab argues that long-term accomplishment and growth come from a body of research as a whole, not from the number of papers written. Rather than treating each piece of work as an isolated paper, you need to focus on the vision you are pursuing and on what you want your research to change. Khattab wrote that you should focus on 'problems that are much larger than a single paper and have not yet been fully solved.'

One way to achieve this, Khattab argues, is to structure part of your research around a coherent artifact that you maintain in open source: a model, system, framework, benchmark, and so on. This strategy only works if you find problems with the right characteristics to have an impact, but it also helps ensure that each new piece of research is coherent and useful.



◆2: Choose timely problems with large headroom and fan-out
Many papers are exploratory one-offs. Khattab argues that you need to find a problem with a clear direction that can grow into a larger project, and he lists three useful criteria for doing so:

1: The problem is timely
When it comes to AI, Khattab recommended looking for problem areas that will become 'hot' within the next two to three years but are not yet mainstream.

2: The problem you are addressing has 'large fan-out,' meaning it has the potential to affect many downstream problems.
Problems with large fan-out are essentially problems whose results are likely to benefit or interest enough people. Researchers, like everyone else, care about what helps them achieve their own goals, so a good problem is one whose solution helps others build things or reach their research or production goals. This filter can be applied to work on theoretical foundations, systems infrastructure, new benchmarks, new models, and many other things, Khattab said.

3: The problem leaves a lot of room for improvement ('headroom')
Telling people that you can make a system 1.5 times faster or 5% more efficient won't get much attention. Khattab says you need to find a problem big enough that there is a realistic hope of making things, say, 20 times faster or 30% more efficient, even if that takes at least a few years of work. This doesn't mean you should hold off on publishing until the major result arrives: Khattab wrote that by publishing intermediate results frequently, you can keep writing papers while working on a big problem.



◆3: Think two steps ahead and iterate quickly
When you identify a problem to tackle, Khattab says it's important to resist the urge to go for the quick and easy solution and instead think two steps ahead.

Take Khattab's ColBERT, for example. The obvious way to build an efficient retriever using BERT was to encode each document into a single vector. As of late 2019 there was only limited research doing this, and the most-cited work in this category did not release its first preprint until April 2020. Given this background, one might think that the right research to do in 2019 was to build a good single-vector model with BERT. However, thinking two steps ahead leads to a different question: 'sooner or later everyone will come up with the idea of building a single-vector model, but the approach to single-vector models may fundamentally get stuck somewhere,' says Khattab. In fact, Khattab writes that this question inspired ColBERT.
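The contrast between a single-vector retriever and a multi-vector alternative can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy vectors, the mean pooling, and every function name are hypothetical, and the 'sum of per-token maxima' rule is the late-interaction scoring generally associated with ColBERT, not code taken from Khattab's post or from the ColBERT repository.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Collapse per-token vectors into one vector (an assumed pooling choice)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def single_vector_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Single-vector scoring: one pooled embedding per query and per document."""
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Multi-vector scoring: each query token keeps its best-matching
    document token, and those per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical toy embeddings: two query tokens, three document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

print(single_vector_score(query, document))     # pooled vectors lose token-level matches
print(late_interaction_score(query, document))  # every query token finds an exact match
```

In this toy case the pooled representation blurs the two exact token matches together, while the multi-vector score keeps them separate, which is one way a single-vector approach might 'get stuck.'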

And by identifying a version of the problem that allows you to quickly iterate and get feedback (such as latency and validation scores), you can significantly increase the chances of solving a hard problem, Khattab said.

◆4: Get your work out there and spread your ideas
Once you've identified a good problem and iterated until you found something interesting and are ready to produce an insightful paper, Khattab argued that your goal is not merely to produce that paper but to get your research out there.

For typical research, the first step is to publish your paper as a preprint on arXiv, then announce the paper's publication on the Internet. When doing this, Khattab said you should start with a specific, substantive, and easy-to-understand argument. He argued that the goal is not to tell people that you have published a paper, but to communicate your main arguments in a direct, vulnerable, yet compelling way, in the form of a concrete statement that people can agree or disagree with.

More importantly, this whole process doesn't end with the first 'publication.' Now that you're invested in a project, not just a paper, your ideas and your scientific communication continue year-round, far beyond the publication of any one paper. For example, it's not uncommon for graduate students to post about their research on social media, only for the first post to get less attention than they hoped. Students usually take this as confirmation of their fear of posting about their research, and as a sign that they should move on to the next paper. But this is 'not the right decision,' says Khattab.

Khattab says that from a lot of personal experience, second-hand experience, and observation, he has found that persistence in this area is extremely helpful. With some exceptions, spreading good ideas requires telling people what is important multiple times in different contexts, evolving your thinking and the way you communicate it, and persevering until the community can absorb the ideas over time, or until the field has reached a stage of development where the ideas are easier to evaluate.



◆5: Channel the excitement you've built up into open source
You can have a bigger impact by releasing, contributing to, and growing an open source artifact that leads to downstream applications related to your idea. This isn't easy -- it takes more than uploading code files with a README to GitHub -- but a good repository can be more of a 'home' for your project than the individual papers you publish.

Good open source research requires two almost independent characteristics. One is good research: novel, timely, and well-scoped. The other is clear downstream utility and low friction. People will repeatedly avoid your open source software for the 'wrong' reasons. For example, your research may be objectively 'cutting edge,' but 9 times out of 10, people will prefer an alternative with less friction. Conversely, they may use your tools for reasons that seem beside the point to you as a graduate student, for example without ever taking full advantage of your most innovative components. These patterns are something to understand and build on, says Khattab.

Khattab listed seven milestones for expanding open source research:

・Make the release usable
There's no point releasing code that no one can run. Your release should be usable by other researchers who want to use your work as a baseline, that is, people working in the same field who want to reproduce and cite your release. These people tend to be more patient than other kinds of users. Even so, how easy it is to tweak your code will make a dramatic difference in your academic impact, Khattab wrote.

・Make the release useful
Your release should be useful not just to people in your narrow field, but also to users who actually want to use your project to build something else. This milestone is rarely achieved naturally in AI research. You need to take the time to think about the problems people are trying to solve (research, production, etc.) for which your deliverables could be useful. Getting this right affects many things, from your project design to the APIs and documentation you expose.

・Make the release approachable
It's important to recognize that a release can be technically usable and useful and still not be approachable enough for people to learn about or try, Khattab said.

・Establish why the obvious alternatives fail, and persevere
Most people can't understand why they should adopt a solution to a problem that they can't yet clearly observe, so part of your job is to build up that case over time. You need to gather evidence and clearly communicate why the obvious alternatives fail, Khattab says.

・Understand the categories of users and make use of them
When he started developing ColBERT and DSPy, Khattab said that he initially had researchers and experienced machine learning engineers in mind as users. Over time, he let go of that assumption and came to understand that he could reach a much larger audience, but that those users needed something different. First and foremost, he had to stop blocking, whether directly or indirectly, different categories of potential users.

・Turn interest into a growing community
The true success of any open source software effort lies in the existence and growth of a community that thrives independently of your own efforts. A good community should generally be organic, Khattab wrote, but you should make an active effort to help build it, such as by welcoming contributions and discussion and by looking for opportunities to turn interest into contributions or into discussion on some kind of forum (such as Discord or GitHub).

・Turn interest into active, collaborative, modular downstream projects
In early-stage open source software projects, not all elements of the vision have been worked out. Well-designed projects often have multiple modular parts, allowing new team members to not only advance the project but also own significant parts of it, thereby initiating research collaborations that can significantly improve the project as a whole while bringing ideas to fruition faster and with greater success.

◆6: Continue to invest in the project through new papers
Khattab said it is important to publish multiple related papers over the course of one project, rather than a single paper per project. In fact, about 10 papers have been published on ColBERT, which Khattab developed, and he has published separate papers on improving training methods, reducing memory footprint, speeding up search infrastructure, improving domain adaptation, and improving alignment with downstream natural language processing tasks.

in Note, Posted by logu_ii