Anthropic’s new model excels at reasoning and planning, and has the Pokémon skills to prove it




When Claude 3.7 Sonnet played the game, it ran into some challenges: it spent “dozens of hours” stuck in one city and had trouble identifying non-player characters, which badly hampered its progress in the game. With Claude 4 Opus, Hershey noticed an improvement in Claude’s long-term memory and planning ability when he watched it work to acquire particular skills. The immediate feedback shows a new level of coherence, which means the model is better able to stay on track.

“This is one of my favorite ways to get to know a model. Like, this is how I understand what its strengths are, and what its weaknesses are,” says Hershey. “It’s my way of coming to grips with this new model that we are about to launch, and how to work with it.”

Everyone wants an agent

How do we understand the decisions an AI makes when approaching complex tasks, and nudge it in the right direction?

The answer to that question is integral to the progress of the industry’s AI agents: AI that can tackle complex tasks with relative independence. In Pokémon, it is important that the model does not lose context or “forget” the task at hand. The same applies to AI agents asked to automate a workflow, even one that takes hundreds of hours.

“As a task goes from being a five-minute task to a 30-minute task, you can see the model’s ability to stay coherent, to remember all the things it needs to accomplish [the task] successfully over time,” says Hershey.

Anthropic, like many other AI labs, hopes to create powerful agents to sell as a consumer product. Anthropic’s “top objective” this year is “doing hours of work for you.”

“This model is now delivering on that: we saw one of our early-access customers have the model go off for seven hours and do a big refactor,” said Krieger, referring to the process of restructuring a large amount of code, often to make it more efficient and organized.

This is the future that companies such as Google and OpenAI are working toward. Earlier this week, Google released Mariner, an AI agent built into Chrome that can do tasks like buying groceries (for $249.99 per month). OpenAI recently released a coding agent, and a few months ago it launched Operator, an agent that can browse the web on the user’s behalf.

Compared with its competitors, Anthropic is often seen as the more cautious mover, going fast on research but slower on deployment. And with powerful AI, that is likely a positive: a lot could go wrong with an agent that has access to sensitive information such as a user’s inbox or bank logins. In Thursday’s blog post, Anthropic says, “We’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks.” The company also says that both Claude 4 Opus and Claude Sonnet 4 are 65 percent less likely to engage in this behavior, known as reward hacking, than previous models, at least on coding tasks.



