New research #15

Open
irux opened this issue Jul 24, 2024 · 5 comments

Comments

irux commented Jul 24, 2024

Hey! Is there anything new you guys are working on? More data? I love this because I think MultiOn is actually doing decent work on these kinds of tasks. I really think this is the future of agents.

Do you have other papers or more up-to-date information on what is currently happening on this topic?

Do you know of anything else that uses a computer vision approach or a multimodal model?

Any new research on the DRM models?

Sorry for so many questions but I find this fascinating!

xhluca (Contributor) commented Aug 14, 2024

> Hey! Is there anything new you guys are working on? More data? I love this because I think MultiOn is actually doing decent work on these kinds of tasks. I really think this is the future of agents.

We are currently working on improving the quality of the data representation, which could be optimized much further! After that, collecting more data is on our radar. Combining datasets is also interesting (for example, Mind2Web and AITW would be interesting datasets to add).

> Do you have other papers or more up-to-date information on what is currently happening on this topic?

Right now we have WebLINX (https://arxiv.org/abs/2402.05930), but more papers will take a while! However, feel free to keep an eye on the release notes and discussions on the weblinx repo as well as here.

> Do you know of anything else that uses a computer vision approach or a multimodal model?

We have a few experiments with multimodal and image-to-text models. Pix2Act is interesting since it's very small but performs somewhat well on the WebLINX evals.

> Any new research on the DRM models?

I'm not sure what DRM models are. Can you expand?

irux (Author) commented Aug 14, 2024

Hey @xhluca! Thanks for your reply!

Sorry btw, it was a typo. I was referring to the Dense Markup Ranking (DMR) models, the ones you mention in the paper here: https://arxiv.org/pdf/2402.05930

If you have a Discord or Telegram group, or some other way to get more involved, I would love to be part of it. I love the topic and I think it has huge potential :)

xhluca (Contributor) commented Aug 14, 2024

Yes, we are interested in building better DMR variants! We are still looking into different ways we can approach the candidate selection problem.

Regarding Discord, I think it's a great idea to create one! I will look into it and discuss with collaborators!

irux (Author) commented Sep 29, 2024

Hey @xhluca! Any news on this? Are you looking into the multimodal Llama 3.2 for this? If I can help somehow, just let me know!

xhluca (Contributor) commented Sep 30, 2024

Hey! We are all actively working on improving WebLINX. Llama 3.2 is definitely on our radar, but we are waiting to streamline our new eval pipeline and augment the training data before proceeding.

That said, if you are working on Llama 3.2 and would like to contribute a PR that adds the vision capability, I'd be happy to review the results & merge!
