Hi Patrice,
There is no mcelog ... the RAM is indeed ECC, and it was on the motherboard
manufacturer's list of recommended hardware when I bought the first board.
However, that motherboard was faulty, so I replaced it, and a few months
later I have the same problems again. This RAM model is no longer on the
list of supported RAM. That's a pity, since I have 128 GB of it.
I have a few leads regarding: Error Structure Type: micro-architectural error
https://community.amd.com/t5/epyc-discussions/epyc-7302-crashes-hardware-errors-micro-architectural-error-msr/m-p/427759/highlight/true
And in this post, they suggest disabling C-states:
https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server
Advanced -> NB Configuration -> IOMMU (change to Enabled)
Advanced -> PCIe/PCI/PnP Configuration -> SR-IOV Support (change to Enabled)
I'll start by updating my BIOS.
Thanks for the info
Jérôme
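
The raw status words in the `mce:` lines quoted below can be read by hand. This is a minimal sketch, assuming the architectural x86 MCi_STATUS layout for bits 57 to 63 and assuming the TIME field is wall-clock seconds since the Unix epoch; the AMD-specific bits and the low error-code field are deliberately left out, and the names `STATUS_FLAGS`, `decode_status` and `mce_time` are made up for illustration.

```python
from datetime import datetime, timezone

# Architectural MCi_STATUS flag bits (bit position -> name).
# Assumption: only the vendor-neutral bits 57..63 are decoded here.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register content is valid
    58: "ADDRV",  # ADDR register content is valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flag bits set in a 64-bit status word."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


def mce_time(epoch_seconds: int) -> str:
    """Convert the TIME field (assumed Unix epoch seconds) to ISO 8601 UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Status words copied from the "Machine Check: 0 Bank N:" lines of the log.
for bank, raw in (("0", "bc002800000c0135"), ("27", "d82010000004080b")):
    print(f"bank {bank}:", decode_status(int(raw, 16)))

print("logged at:", mce_time(1717694421))
```

Under those assumptions, the bank 0 word has UC and ADDRV set, which matches the ADDR value printed next to it, and the bank 27 word has OVER set without ADDRV, which matches both the "Overflow: true" line in the BERT record and the absence of an ADDR field on that line.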
On Thu, 6 Jun 2024 19:46:34 +0200
Patrice Karatchentzeff <patrice.karatchentzeff@???> wrote:
> Hi Jérôme,
>
> Do you have access to /var/log/mcelog?
>
> Otherwise, according to this:
> https://forum.manjaro.org/t/mce-hardware-error-cpu-0-machine-check-0-bank-5/137519
>
> it still looks like an ECC RAM problem... Can you test
> with other, non-ECC modules?
>
> On Thu, 6 Jun 2024 at 19:32, Jérôme Kieffer
> <jerome.kieffer@???> wrote:
> >
> > Hello,
> >
> > My machine crashes (i.e. a hard reboot), and on the next boot I get these messages:
> >
> > [ 1.078033] ERST: Error Record Serialization Table (ERST) support is initialized.
> > [ 1.216216] BERT: Error records from previous boot:
> > [ 1.216219] [Hardware Error]: event severity: recoverable
> > [ 1.216222] [Hardware Error]: Error 0, type: recoverable
> > [ 1.216225] [Hardware Error]: fru_text: ProcessorError
> > [ 1.216228] [Hardware Error]: section_type: IA32/X64 processor error
> > [ 1.216230] [Hardware Error]: Local APIC_ID: 0x0
> > [ 1.216233] [Hardware Error]: CPUID Info:
> > [ 1.216237] [Hardware Error]: 00000000: 00830f10 00000000 00100800 00000000
> > [ 1.216241] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
> > [ 1.216245] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
> > [ 1.216248] [Hardware Error]: Error Information Structure 0:
> > [ 1.216251] [Hardware Error]: Error Structure Type: cache error
> > [ 1.216253] [Hardware Error]: Check Information: 0x000000001c4d0077
> > [ 1.216256] [Hardware Error]: Transaction Type: 1, Data Access
> > [ 1.216258] [Hardware Error]: Operation: 3, data read
> > [ 1.216261] [Hardware Error]: Level: 1
> > [ 1.216263] [Hardware Error]: Uncorrected: true
> > [ 1.216266] [Hardware Error]: Precise IP: true
> > [ 1.216268] [Hardware Error]: Restartable IP: true
> > [ 1.216270] [Hardware Error]: Instruction Pointer: 0x00000000a92109be
> > [ 1.216273] [Hardware Error]: Context Information Structure 0:
> > [ 1.216275] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
> > [ 1.216277] [Hardware Error]: Register Array Size: 0x0050
> > [ 1.216280] [Hardware Error]: MSR Address: 0xc0002001
> > [ 1.216289] [Hardware Error]: Error 1, type: recoverable
> > [ 1.216291] [Hardware Error]: fru_text: ProcessorError
> > [ 1.216294] [Hardware Error]: section_type: IA32/X64 processor error
> > [ 1.216296] [Hardware Error]: Local APIC_ID: 0x0
> > [ 1.216298] [Hardware Error]: CPUID Info:
> > [ 1.216301] [Hardware Error]: 00000000: 00830f10 00000000 00100800 00000000
> > [ 1.216305] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
> > [ 1.216308] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
> > [ 1.216311] [Hardware Error]: Error Information Structure 0:
> > [ 1.216313] [Hardware Error]: Error Structure Type: micro-architectural error
> > [ 1.216315] [Hardware Error]: Check Information: 0x0000000000850021
> > [ 1.216318] [Hardware Error]: Error Type: 5, Internal Unclassified
> > [ 1.216321] [Hardware Error]: Overflow: true
> > [ 1.216323] [Hardware Error]: Context Information Structure 0:
> > [ 1.216325] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
> > [ 1.216327] [Hardware Error]: Register Array Size: 0x0050
> > [ 1.216329] [Hardware Error]: MSR Address: 0xc00021b1
> > [ 1.216334] [Hardware Error]: Error 2, type: recoverable
> > [ 1.216336] [Hardware Error]: fru_text: ProcessorError
> > [ 1.216338] [Hardware Error]: section_type: IA32/X64 processor error
> > [ 1.216340] [Hardware Error]: Local APIC_ID: 0x30
> > [ 1.216342] [Hardware Error]: CPUID Info:
> > [ 1.216345] [Hardware Error]: 00000000: 00830f10 00000000 30100800 00000000
> > [ 1.216349] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
> > [ 1.216352] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
> > [ 1.216355] [Hardware Error]: Error Information Structure 0:
> > [ 1.216357] [Hardware Error]: Error Structure Type: cache error
> > [ 1.216359] [Hardware Error]: Check Information: 0x000000001c4d0077
> > [ 1.216362] [Hardware Error]: Transaction Type: 1, Data Access
> > [ 1.216364] [Hardware Error]: Operation: 3, data read
> > [ 1.216366] [Hardware Error]: Level: 1
> > [ 1.216368] [Hardware Error]: Uncorrected: true
> > [ 1.216370] [Hardware Error]: Precise IP: true
> > [ 1.216372] [Hardware Error]: Restartable IP: true
> > [ 1.216375] [Hardware Error]: Instruction Pointer: 0x0000000000000000
> > [ 1.216377] [Hardware Error]: Context Information Structure 0:
> > [ 1.216379] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
> > [ 1.216381] [Hardware Error]: Register Array Size: 0x0050
> > [ 1.216383] [Hardware Error]: MSR Address: 0xc0002001
> > [ 1.216420] mce: [Hardware Error]: Machine check events logged
> > [ 1.216422] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: bc002800000c0135
> > [ 1.216527] mce: [Hardware Error]: TSC 0 ADDR 1000000fed80280 MISC d01c0dff00000000 PPIN 2b497ef4dd64076 IPID b000000000
> > [ 1.216635] mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1717694421 SOCKET 0 APIC 0 microcode 830107a
> > [ 1.216730] mce: [Hardware Error]: Machine check events logged
> > [ 1.216732] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: d82010000004080b
> > [ 1.216821] mce: [Hardware Error]: TSC 0 MISC d01c0dff00000000 PPIN 2b497ef4dd64076 SYND 5b000000 IPID 1002e00000000
> > [ 1.216923] mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1717694421 SOCKET 0 APIC 0 microcode 830107a
> > [ 1.216925] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 0: bc002800000c0135
> > [ 1.216927] mce: [Hardware Error]: TSC 0 ADDR 1000000f5f00480 MISC d01c0dff00000000 PPIN 2b497ef4dd64076 IPID b000000000
> > [ 1.217210] mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1717694421 SOCKET 0 APIC 30 microcode 830107a
> >
> > I recently replaced the motherboard because of crashes like this,
> > and the RAM has also been tested... Could it be the processor?
> >
> > Any information is welcome.
> > --
> > Jérôme Kieffer, desperate
> >
>
>
> --
> |\ _,,,---,,_ Patrice KARATCHENTZEFF
> ZZZzz /,`.-'`' -. ;-;;,_ mailto:patrice.karatchentzeff@gmail.com
> |,4- ) )-,_. ,\ ( `'-'
> '---''(_/--' `-'\_)
>
--
Jérôme Kieffer