To scale agentic AI, Notion dismantled its technology stack and started over

Many organizations may be reluctant to tear down their technology stack and start from scratch. Not Notion. For version 3.0 of its productivity software (released in September), the company didn’t hesitate to rebuild from the ground up; its team decided that doing so was necessary to support agentic AI at enterprise scale. While traditional AI-powered workflows follow explicit, step-by-step instructions based on few-shot prompting, AI agents powered by advanced reasoning models can reason over tool definitions, identify and understand the tools at their disposal, and plan their next steps. “Instead of trying to retrofit what we were building, we wanted to leverage the strengths of reasoning models,” Sarah Sachs, Notion’s head of AI modeling, told VentureBeat. “We rebuilt a new architecture because the workflow is different for agents.”

Rearchitecting so models can work autonomously

Notion, adopted by 94% of the Forbes AI 50 companies, has 100 million users in total, and its customers include OpenAI, Cursor, Figma, Ramp, and Vercel. In the rapidly evolving AI landscape, the company identified a need to move beyond simpler task-based workflows to goal-oriented reasoning systems that allow agents to autonomously select, coordinate, and execute tools across connected environments.

Reasoning models very quickly became “much better” at learning to use tools and following chain-of-thought (CoT) instructions, Sachs noted. This allows them to be more “autonomous” and make multiple decisions within a single agent’s workflow. “We’ve rebuilt our AI system to match that,” she said.

From an engineering perspective, this meant replacing rigid, hard-coded flows with a unified orchestration model, Sachs explained. That core model is supported by modular sub-agents that search Notion and the web, query and add to databases, and edit content. Each agent uses tools contextually; for example, it can decide whether to search Notion itself or another platform such as Slack. The model performs successive searches until it finds the relevant information. It can then, for example, turn feedback into proposals, draft follow-up messages, track tasks, and monitor and update knowledge bases.

In Notion 2.0, the team focused on making the AI perform specific tasks, which required them to “think holistically” about how to prompt the model, Sachs noted. With version 3.0, however, users can assign tasks to agents, and agents can actually take action and perform multiple tasks simultaneously. “We rearchitected it around the agent’s own selection of tools, rather than few-shot prompting that explicitly drives how you go through all these different scenarios,” Sachs explained. The goal is to make everything interact with the AI, so that “anything you can do, your Notion agent can do.”
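The article does not include Notion’s implementation, but the pattern described above (a core reasoning model that plans, autonomously picks among modular sub-agent tools, and loops until the goal is met) might look roughly like the Python sketch below. Every name here, including the Tool registry, run_agent, and toy_planner, is a hypothetical illustration rather than Notion’s code.

```python
# Hypothetical sketch of goal-oriented orchestration: a core reasoning model
# plans, then autonomously picks tools (sub-agents) and loops until the goal
# is met. Names and structure are illustrative assumptions, not Notion's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str          # the model reasons over tool definitions like this
    run: Callable[[str], str]

TOOLS = {
    "search_notion": Tool("search_notion", "Search the connected Notion workspace",
                          lambda q: f"notion results for {q!r}"),
    "search_web":    Tool("search_web", "Search the public web",
                          lambda q: f"web results for {q!r}"),
    "update_db":     Tool("update_db", "Query or add rows to a Notion database",
                          lambda q: f"updated database with {q!r}"),
    "edit_page":     Tool("edit_page", "Edit page content",
                          lambda q: f"edited page: {q!r}"),
}

def run_agent(goal: str, plan_step: Callable[[str, list[str]], dict],
              max_steps: int = 10) -> list[str]:
    """plan_step stands in for the reasoning model: given the goal and the
    transcript so far, it returns {"tool": ..., "input": ...} or {"done": True}."""
    transcript: list[str] = []
    for _ in range(max_steps):
        decision = plan_step(goal, transcript)
        if decision.get("done"):
            break
        tool = TOOLS[decision["tool"]]
        observation = tool.run(decision["input"])  # execute the chosen sub-agent
        transcript.append(f"{tool.name}: {observation}")
    return transcript

# Toy planner: search Notion first, fall back to the web, then stop.
def toy_planner(goal: str, transcript: list[str]) -> dict:
    if not transcript:
        return {"tool": "search_notion", "input": goal}
    if len(transcript) == 1:
        return {"tool": "search_web", "input": goal}
    return {"done": True}

if __name__ == "__main__":
    print(run_agent("summarize Q3 feedback into a proposal", toy_planner))
```

The key design choice in this kind of loop is that the planner returns structured tool decisions instead of following a fixed prompt sequence, so the same loop can handle searches, database updates, and edits without a hand-written path for each scenario.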

Splitting evaluation in two to isolate hallucinations

Notion’s philosophy of “better, faster, cheaper” drives a continuous iteration cycle that balances latency and accuracy, including tuned embedding vectors and search optimizations. Sachs’ team uses a rigorous evaluation framework that combines deterministic testing, prompt optimization, human-annotated data, and LLM-as-judge, model-based scoring to identify inconsistencies and inaccuracies. “By dividing the evaluation into two parts, we are able to pinpoint the source of problems, and this helps us isolate hallucinations,” Sachs explained. Keeping the architecture itself simpler also makes it easier to adapt as models and technologies evolve. “We optimize response time and parallel thinking as much as possible, which leads to much better accuracy,” Sachs noted. The models are grounded in data from the web and the connected Notion workspace. Ultimately, Sachs reported, the investment in rebuilding the architecture has already paid off for Notion in capability and a faster pace of change. “We are absolutely open to rebuilding it again when the next breakthrough occurs, if we have to,” she added.
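The article does not spell out how the evaluation is divided, but a common way to realize “two parts that pinpoint the source of problems” is to score retrieval and generation separately, mixing deterministic checks with model-based judging. The sketch below is a minimal, hypothetical illustration of that split; judge_faithfulness is a stand-in for an LLM-as-judge call, and none of the names come from Notion.

```python
# Minimal sketch of a two-part evaluation: score retrieval and generation
# separately so a failure can be traced to the right stage. The judge function
# is a placeholder for an LLM-as-judge call; all names are illustrative.
from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    expected_doc_ids: set[str]   # human-annotated gold documents
    retrieved_doc_ids: set[str]  # what the retriever actually returned
    answer: str                  # model output to grade

def retrieval_recall(case: EvalCase) -> float:
    """Deterministic check: did the retriever surface the gold documents?"""
    if not case.expected_doc_ids:
        return 1.0
    return len(case.expected_doc_ids & case.retrieved_doc_ids) / len(case.expected_doc_ids)

def judge_faithfulness(answer: str, retrieved_doc_ids: set[str]) -> float:
    """Placeholder for an LLM-as-judge call scoring whether the answer is
    supported by the retrieved context (1.0 = fully grounded, 0.0 = hallucinated)."""
    return 1.0 if retrieved_doc_ids else 0.0  # trivial stub for the sketch

def diagnose(case: EvalCase) -> str:
    if retrieval_recall(case) < 0.5:
        return "retrieval problem: the right context never reached the model"
    if judge_faithfulness(case.answer, case.retrieved_doc_ids) < 0.5:
        return "generation problem: model hallucinated despite good context"
    return "pass"

if __name__ == "__main__":
    case = EvalCase("Q3 revenue?", {"doc-17"}, {"doc-17", "doc-42"}, "Revenue was $12M")
    print(diagnose(case))
```

With this split, a low retrieval score points at search quality, while a low faithfulness score despite good retrieval points at the generation step, which is the kind of localization Sachs describes for isolating hallucinations.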

Understanding contextual latency

When building and tuning models, it is important to understand that latency is contextual: the AI should deliver the most relevant information, even if that sometimes comes at the expense of speed. “You’d be surprised at the different things customers are willing to wait for and not wait for,” Sachs said. It’s an interesting experiment: how long can a response take before people abandon the model? For a quick lookup, users aren’t as patient; they want near-instant answers. “If you ask, ‘What’s two plus two,’ you don’t want to wait for your agent to search everywhere in Slack and Jira,” Sachs noted. But the more time a reasoning agent is given, the more comprehensive it becomes. Notion’s agent can, for example, perform 20 minutes of independent work across hundreds of web pages, files, and other materials. In those cases, Sachs explained, users are more willing to wait; they let the model execute in the background while they attend to other tasks. “It’s a product question,” Sachs said. “How do we set user expectations through the user interface? How do we manage user expectations about response time?”
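One way to act on that distinction, sketched below as an assumption rather than Notion’s actual design, is to route requests by how much autonomous work they appear to need: simple questions get an immediate answer, while deep research runs as a background task the user can return to. The router heuristic and all function names are hypothetical.

```python
# Hypothetical sketch of contextual latency: simple queries get an immediate
# answer, while deep research runs as a background task the user checks later.
# Heuristics and names are illustrative assumptions, not Notion's code.
import asyncio

def needs_deep_research(query: str) -> bool:
    """Stand-in for a router (heuristic or model-based) estimating how much
    autonomous, multi-source work the request requires."""
    markers = ("research", "across", "compare", "summarize all")
    return any(m in query.lower() for m in markers)

async def quick_answer(query: str) -> str:
    return f"[instant] answer to {query!r}"      # e.g. "what's two plus two"

async def long_running_agent(query: str) -> str:
    await asyncio.sleep(0.2)                     # placeholder for ~20 min of multi-source work
    return f"[report] findings for {query!r}"

async def main() -> None:
    fast = "what's two plus two"
    slow = "research customer feedback across hundreds of pages"

    if not needs_deep_research(fast):
        print(await quick_answer(fast))          # user expects a near-instant reply

    if needs_deep_research(slow):
        job = asyncio.create_task(long_running_agent(slow))
        print("[ui] agent is working in the background; keep editing while it runs")
        print(await job)                         # collected later, when the user checks back

if __name__ == "__main__":
    asyncio.run(main())
```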

Notion is its own biggest user

Notion understands the importance of using its own product; in fact, its employees are among its biggest users. Teams run an active testing environment that generates training and evaluation data, as well as a “really active” user feedback loop, Sachs explained. Users aren’t shy about saying what they think should be improved or what features they would like to see. Sachs emphasized that when a user rejects an interaction, they explicitly give permission for a human annotator to analyze it in as anonymized a form as possible. “We use our own tool as a company all day, every day, so we get really fast feedback loops,” Sachs said. “We’re really dogfooding our own product.”

However, Sachs pointed out that because it is their own product they are building, the team recognizes it may have blind spots when it comes to quality and functionality. To balance this, Notion relies on “very AI-expert” design partners who get early access to new capabilities and provide important feedback. This is just as important as internal prototyping, Sachs emphasized. “We’re all about experimenting out in the open, and I think you get richer feedback,” Sachs said. “Because at the end of the day, if we only look at how Notion uses Notion, we’re not really providing the best experience for our customers.”

Equally important, ongoing internal testing lets teams evaluate progress and ensure that models do not regress (that is, that accuracy and performance do not deteriorate over time). “It keeps everything you do honest,” Sachs explained. “You know that your latency is within limits.”

Many companies make the mistake of focusing too heavily on backward-looking logging, which makes it difficult for them to understand how or where to improve, Sachs noted. Notion instead treats evaluations as a “litmus test” for development: forward-looking measures of progress, observability, and resistance to regression. “I think the big mistake a lot of companies make is confusing the two,” Sachs said. “We use it for both purposes, and we think about them really differently.”
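In practice, the forward-looking half of that distinction is often enforced as a regression gate: before a change ships, a candidate build’s eval scores are compared against the current baseline. The sketch below is a hypothetical illustration of that idea; the thresholds and field names are assumptions, not Notion’s actual process.

```python
# Minimal sketch of using evals as a regression gate rather than as logging:
# compare a candidate build's scores against the shipped baseline and fail the
# check if quality or latency regresses beyond a tolerance. All numbers and
# names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EvalReport:
    accuracy: float          # fraction of eval cases judged correct
    p95_latency_ms: float    # 95th-percentile response time on the eval set

def passes_gate(baseline: EvalReport, candidate: EvalReport,
                max_accuracy_drop: float = 0.01,
                max_latency_increase_ms: float = 250.0) -> bool:
    """Forward-looking check: is the candidate at least as good as what's shipped?"""
    if candidate.accuracy < baseline.accuracy - max_accuracy_drop:
        return False                  # quality regressed
    if candidate.p95_latency_ms > baseline.p95_latency_ms + max_latency_increase_ms:
        return False                  # "latency is within limits" no longer holds
    return True

if __name__ == "__main__":
    baseline = EvalReport(accuracy=0.91, p95_latency_ms=1800)
    candidate = EvalReport(accuracy=0.92, p95_latency_ms=1900)
    print("ship it" if passes_gate(baseline, candidate) else "block: regression detected")
```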

Takeaways from Notion’s journey

For enterprises, Notion can serve as a blueprint for how to operate agentic AI responsibly and dynamically in a connected, permissioned enterprise workspace. Sachs’ takeaways for other tech leaders:

  • Don’t be afraid to rebuild when core capabilities change; Notion completely re-engineered its architecture to align with reasoning models.

  • Treat response time as contextual: optimize for each use case, not globally.

  • Ground all outputs in trustworthy, structured enterprise data to ensure accuracy and trust. “Be willing to make the tough decisions. Be willing to sit on top of the frontier, so to speak, as you develop to build the best product you can offer your customers,” she advised.


