bump macos to m1 #1725

t-vi · 2024-09-13T09:01:01Z

some discussion on #1724

This seems to run into segfaults. Obviously, if anyone with a MacBook or similar could take over sorting things out, that would be great.

Andrei-Aksionov · 2024-09-13T09:23:30Z

Tried to run API tests locally on M3 chip and had no issues.
Unfortunately, I have no time to debug it.
@rasbt all hope is on you.

P.S. Here is an interesting project: https://github.com/mxschmitt/action-tmate
Haven't tried it, but in theory it should allow you to connect to a runner and check what's what.

t-vi · 2024-09-13T12:38:40Z

Commenting out the LLM imports makes the segfaults go away, but that isn't a realistic option.
Also, tons of failing tests.

rasbt · 2024-09-16T17:41:52Z

It's a bit of a weird machine. The memory issues aside, I am also seeing

FAILED tests/test_convert_lit_checkpoint.py::test_against_original_gemma_2[device0-dtype0-gemma-2-27b] - AssertionError: Tensor-likes are not close!
Mismatched elements: 305 / 5120000 (0.0%)
Greatest absolute difference: 1.7881393432617188e-05 at index (0, 13, 73469) (up to 1e-05 allowed)

It works perfectly fine locally on both my Macs on M1 and M3. I am also using macOS 14.

t-vi · 2024-09-16T17:47:50Z

Thank you for looking into this. I'd not worry much about that. This can happen from different RNG seeds, test order, etc. It's probably OK to just increase the tolerance to 3e-5 or so.
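To illustrate why a tolerance of 3e-5 would be enough here, below is a minimal sketch of the closeness rule that `torch.testing.assert_close` applies, written in plain Python so it runs without PyTorch. The `expected` value is a made-up placeholder (an assumption); only the absolute difference is taken from the failure log. The `rtol=1.3e-6` and `atol=1e-5` values are the float32 defaults of `assert_close`.

```python
def within_tolerance(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical reference value; only the difference below comes from the CI log.
expected = 0.5
actual = expected + 1.7881393432617188e-05

# float32 defaults: the reported difference exceeds the allowed 1e-5.
fails_with_default = not within_tolerance(actual, expected, rtol=1.3e-6, atol=1e-5)

# Raising the absolute tolerance to 3e-5 lets the same difference pass.
passes_with_3e5 = within_tolerance(actual, expected, rtol=1.3e-6, atol=3e-5)
```

In a real test this would correspond to passing explicit `rtol` and `atol` arguments to `torch.testing.assert_close`; which values the repository ends up using is not stated in this thread.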

rasbt · 2024-09-23T20:25:01Z

I was able to isolate it. This is the test that segfaults on the CI:

def test_llm_load_random_init(tmp_path):
    download_from_hub(repo_id="EleutherAI/pythia-14m", tokenizer_only=True, checkpoint_dir=tmp_path)

    torch.manual_seed(123)
    llm = LLM.load(
        model="pythia-14m",
        init="random",
        tokenizer_dir=Path(tmp_path / "EleutherAI/pythia-14m"),
    )

Works fine locally though ...
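One way to hunt for a crashing test like this is to run each candidate in its own child process and check whether a signal killed it. The sketch below is not what this PR does; it is a generic, POSIX-only illustration, and `crash_signal` is a hypothetical helper name. The demo child deliberately sends itself `SIGSEGV` so the example needs nothing beyond the standard library.

```python
import signal
import subprocess
import sys
from typing import Optional, Sequence


def crash_signal(cmd: Sequence[str]) -> Optional[str]:
    """Run cmd in a child process; return the signal name if a signal killed it."""
    proc = subprocess.run(cmd, capture_output=True)
    if proc.returncode < 0:  # POSIX: a negative return code means -<signal number>
        return signal.Signals(-proc.returncode).name
    return None


# Demo children: one kills itself with SIGSEGV, the other exits normally.
segv_child = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
ok_child = [sys.executable, "-c", "pass"]

segv_result = crash_signal(segv_child)
ok_result = crash_signal(ok_child)
```

For a real test run, the command could be something like `[sys.executable, "-X", "faulthandler", "-m", "pytest", "tests/test_api.py::test_llm_load_random_init"]`, assuming the test lives in `tests/test_api.py`. The `-X faulthandler` flag makes Python dump a traceback when it receives a fatal signal.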

rasbt · 2024-09-23T21:10:38Z

To narrow it down further: it only happens with the default settings, not when distribute=None. This means it's something related to the code here:

litgpt/litgpt/api.py

Lines 215 to 239 in a686b40

    if distribute == "auto":
        if torch.cuda.is_available():
            accelerator = "cuda"
        elif torch.backends.mps.is_available():
            accelerator = "mps"
        else:
            accelerator = "cpu"

        fabric = L.Fabric(
            accelerator=accelerator,
            devices=1,
            precision=get_default_supported_precision(training=False),
        )

        with fabric.init_module(empty_init=False):
            model = GPT(config)
        model.eval()
        preprocessor = Preprocessor(tokenizer, device=fabric.device)

        if checkpoint_dir is not None:
            checkpoint_path = checkpoint_dir / "lit_model.pth"
            check_file_size_on_cpu_and_warn(checkpoint_path, fabric.device)
            load_checkpoint(fabric, model, checkpoint_path)

        model = fabric.setup_module(model)

Perhaps the MPS support in the CI has some issues. (Since it works fine locally.)

rasbt · 2024-09-23T21:42:14Z

@t-vi @Andrei-Aksionov

Ah, so it does seem to be MPS related. I.e., changing

     if torch.cuda.is_available(): 
         accelerator = "cuda" 
     elif torch.backends.mps.is_available(): 
         accelerator = "mps" 
     else: 
         accelerator = "cpu"

to

     if torch.cuda.is_available(): 
         accelerator = "cuda" 
     else: 
         accelerator = "cpu"

will fix those tests on the CI. My guess is that it's something particular about the CI machine because it works fine locally on 2 of my Macs (and also on Andrei's Mac). Maybe outdated drivers.

So let's just skip MPS-related tests on that machine.
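As a rough illustration of the fallback order discussed above, here is a minimal sketch that keeps the MPS branch but makes it optional. The function `choose_accelerator` and its `allow_mps` flag are hypothetical and not part of LitGPT. The availability checks are passed in as plain booleans so the snippet runs without PyTorch.

```python
def choose_accelerator(cuda_available: bool, mps_available: bool, allow_mps: bool = True) -> str:
    """Pick cuda, then mps (only if allowed), then cpu."""
    if cuda_available:
        return "cuda"
    if mps_available and allow_mps:
        return "mps"
    return "cpu"


# A Mac without CUDA: MPS is chosen by default, CPU when MPS is switched off.
default_choice = choose_accelerator(cuda_available=False, mps_available=True)
ci_choice = choose_accelerator(cuda_available=False, mps_available=True, allow_mps=False)
```

In real code the two booleans would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, the same calls used in the snippet quoted from `api.py`.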

rasbt · 2024-09-24T14:28:10Z

Ok, I left MPS disabled for the macos runner since it seems to have issues. Could be Fabric-related, driver-related or LitGPT-related (although it works fine locally). Let's merge this for now and revisit in a few weeks or months when the macos-15 machines are more readily available in workflows. Maybe their drivers are just old.
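How MPS was disabled on the runner is not shown in this thread. One possible approach, sketched here under that caveat, is to detect the macOS CI environment and use the result as a skip condition. The helper name `on_macos_ci` is hypothetical. The sketch relies on GitHub Actions setting the `CI` environment variable to `true`.

```python
import os
import sys
from typing import Mapping


def on_macos_ci(platform: str = sys.platform, env: Mapping[str, str] = os.environ) -> bool:
    """True when running on macOS inside a CI job (CI=true in the environment)."""
    return platform == "darwin" and env.get("CI", "").lower() == "true"


# Explicit arguments make the behaviour easy to check without a Mac.
mac_ci = on_macos_ci("darwin", {"CI": "true"})
mac_local = on_macos_ci("darwin", {})
linux_ci = on_macos_ci("linux", {"CI": "true"})
```

With pytest, a test could then be decorated with `@pytest.mark.skipif(on_macos_ci(), reason="MPS issues on the macOS CI runner")`, which keeps the test active on local Macs where it passes.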

bump macos to m1

0bd56c6

t-vi requested review from awaelchli, rasbt and lantiga as code owners September 13, 2024 09:01

t-vi added 2 commits September 13, 2024 11:07

try skip

08ba84d

add sys

334bd8b

experimentally run tests separately

671ba25

t-vi marked this pull request as draft September 13, 2024 09:30

t-vi added 8 commits September 13, 2024 11:32

try to find segfaulting test

4ef2617

sprinkle skip

367b41c

more sprinkle

87b3043

skip some imports

299f95c

skip all

452e343

drop external for loop again

abe05ed

add back two import

4ba4848

more commenting out modules

aee1ad3

rasbt mentioned this pull request Sep 13, 2024

Simplify MPS support #1726

Merged

rasbt added 3 commits September 13, 2024 12:55

Merge branch 'main' into tom/mac-runners

3aea569

test sth

80e8548

skip out-of-memory issues on macos CI

6851008

rasbt added 6 commits September 16, 2024 12:48

update

c9afc8b

update

998a0fd

update

1c68608

update

9dc677f

add back api tests

562cb19

truncate test_api.py

1e226e5

rasbt added 3 commits September 23, 2024 15:18

Update test_api.py

2b5bbc8

Update test_api.py

bd6073a

Update test_api.py

5ae3dbf

rasbt added 4 commits September 23, 2024 15:33

Update test_api.py

baebff2

Update test_api.py

47ef4fb

Update cpu-tests.yml

e8a70a1

Update test_api.py

3448e70

rasbt removed request for lantiga and awaelchli September 23, 2024 20:44

rasbt added 3 commits September 23, 2024 15:52

Update test_api.py

9190764

Update test_api.py

467cc72

Update test_api.py

a110fc3

test only on cpu

5945c71

rasbt added 10 commits September 23, 2024 16:43

update

fc9a20f

update

8b97873

add tests back

5895a20

disable mps in CI

c92e564

mock litgpt

8861e46

add test matrix back

e92889f

udpdates

8c1bfea

updates

4dc3862

upgrade to macos 15

270a0f5

revert

624607d

rasbt marked this pull request as ready for review September 24, 2024 14:28

rasbt merged commit 6fc1f06 into main Sep 24, 2024
8 of 9 checks passed

rasbt deleted the tom/mac-runners branch September 24, 2024 14:28
