
  • Completely depends on your laptop hardware, but generally:

    • TabbyAPI (exllamav2/exllamav3)
    • ik_llama.cpp, and its openai server
    • kobold.cpp (or kobold.cpp rocm, or croco.cpp, depends)
    • An MLX host with one of the new distillation quantizations
    • Text-gen-web-ui (slow, but supports a lot of samplers and some exotic quantizations)
    • SGLang (extremely fast for parallel calls, if that's what you want)
    • Aphrodite Engine (lots of samplers, and fast at the expense of some VRAM usage).

    I use text-gen-web-ui at the moment only because TabbyAPI is a little broken with exllamav3 (which is utterly awesome for Qwen3), otherwise I’d almost always stick to TabbyAPI.

    Tell me (vaguely) what your system has, and I can be more specific.
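
    Most of the servers above (TabbyAPI, ik_llama.cpp's server, kobold.cpp, SGLang, Aphrodite) expose an OpenAI-compatible chat endpoint, so a client can stay backend-agnostic. A minimal sketch, assuming a local server on port 5000 (the host, port, and model name are assumptions; check your server's docs):

    ```python
    import json
    import urllib.request

    # Assumed OpenAI-compatible endpoint; the actual port differs per server
    # (e.g. kobold.cpp, TabbyAPI, and SGLang each pick their own defaults).
    API_URL = "http://localhost:5000/v1/chat/completions"

    def build_payload(prompt, model="local-model", temperature=0.7, max_tokens=256):
        """Build an OpenAI-style chat completion request body."""
        return {
            "model": model,  # many local servers ignore or loosely match this
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
        }

    def ask(prompt):
        """POST the prompt and return the first completion's text."""
        req = urllib.request.Request(
            API_URL,
            data=json.dumps(build_payload(prompt)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        print(ask("Hello!"))
    ```

    Because the request/response shape is the same everywhere, swapping backends is usually just changing `API_URL`.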

  • IMO the AMD 7900 GPU series is in a sweet spot right now (at least among the used listings I can see).

    It’s not really in demand for AI: very fast, but not “4090+” fast, so the price hasn’t shot through the roof. New, but not too new. Plenty of VRAM to give it longevity. Going forward they’ll probably get scarcer and get swept up in tariff pricing.

    So… Maybe get one of those. Or a 7800.

    Others mentioned a 5700XD, which I’m a huge fan of, except prices for it seem sky high, way more than a 5800.